I updated the big data IPC example to benchmark the data being sent, similar to how MessageQBench is set up. However, the throughput numbers are very low no matter the buffer size: from 16KB buffers up to 4MB buffers, throughput ranged from 11,000 KB/s to 15,000 KB/s, where we were expecting hundreds of MB/s.
Posted the code here: https://github.com/jcormier/big-data-ipc-example/commits/benchmark
I started digging into where the majority of the time is spent and was able to localize it to the loop on the DSP which validates the ARM's count pattern.
So for the 2MB messages, it consistently takes the DSP 161ms (~12.4MB/s) to read the buffer from RAM:

    /* Check values to see expected results */
    for (j = 0; j < bigDataLocalDesc.size/sizeof(uint32_t); j++) {
        if (bigDataLocalPtr[j] != (msg->id + j)) {
            errorCount++;
        }
    }
but only 9ms (~222MB/s) to fill it:

    /* Fill new data */
    for (j = 0; j < bigDataLocalDesc.size/sizeof(uint32_t); j++)
        bigDataLocalPtr[j] = msg->id + 10 + j;
I ran a couple of tests:
- Disabled cache with Cache_disable(Cache_Type_ALL);
With this change, the check loop stayed at the same 161ms but the fill loop now took >300ms. This would seem to indicate that somehow the cache isn't affecting reads of this data; I would have expected both loops to become slower.
- Malloc'd a 256KB buffer on the DSP heap and used that pointer instead of the one passed via IPC
Note: I wasn't able to figure out how to get a larger malloc to succeed, so I had to adjust all my numbers for the now much smaller buffer, but the issue is still visible.
ARM CMEM ptr (256KB) took ~21ms to verify buffer and ~1ms to write
DSP Malloc ptr (256KB) took ~1ms to verify buffer and ~1ms to write
Note: I'm using the timestamps in the DSP tracebuffer to time the DSP code.
And with such a small buffer I'd need better timing resolution to calculate exact rates, but regardless, the large timing difference between the ARM CMEM pointer and the DSP malloc pointer shows that something is wrong.
The debugging commits are in this branch: https://github.com/jcormier/big-data-ipc-example/commits/benchmark_debugging
Can TI run my benchmark code and verify they are seeing this same slow throughput?
Any feedback on what may be misconfigured on the DSP side would be helpful.