Other Parts Discussed in Thread: SYSBIOS
Dear TI team,
we're currently looking into an issue regarding the R5F's performance.
Our application uses a test routine that runs in a tight loop, fetches a timestamp using Timestamp_get32(), and calculates the difference between the current and the previous timestamp. This loop usually executes in ~30-40 cycles, or ~160 cycles with some additional instrumentation. Sometimes the loop takes much longer, on the order of 10,000-13,000 cycles. We've tracked this down to the execution of the TI-RTOS system tick, i.e. every time the loop takes considerably longer, it was interrupted by a timer interrupt. Subsequent occurrences of the timer interrupt impact the loop a lot less, around 1000-2000 cycles.
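For reference, the measurement loop described above can be sketched roughly as follows. This is a minimal host-runnable sketch, not our actual test code: `measure_loop`, `loop_stats` and `fake_timestamp` are hypothetical names, and `fake_timestamp` is only a stand-in that simulates ~35-cycle iterations with an occasional 12,000-cycle spike. On the target, the timestamp source would wrap Timestamp_get32() instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timestamp source; on the target this would wrap Timestamp_get32(). */
typedef uint32_t (*timestamp_fn)(void);

typedef struct {
    uint32_t min_delta;  /* shortest observed iteration, in timestamp ticks */
    uint32_t max_delta;  /* longest observed iteration, in timestamp ticks */
    uint32_t outliers;   /* iterations whose delta exceeded the threshold */
} loop_stats;

/* Tight loop: read a timestamp, diff it against the previous one, keep statistics. */
static loop_stats measure_loop(timestamp_fn now, size_t iterations, uint32_t threshold)
{
    loop_stats s = { UINT32_MAX, 0u, 0u };
    uint32_t prev = now();

    for (size_t i = 0; i < iterations; ++i) {
        uint32_t cur = now();
        uint32_t delta = cur - prev;  /* unsigned subtraction survives counter wrap-around */

        if (delta < s.min_delta) s.min_delta = delta;
        if (delta > s.max_delta) s.max_delta = delta;
        if (delta > threshold)   s.outliers++;
        prev = cur;
    }
    return s;
}

/* Host-side stand-in (assumption, for illustration only): advances by 35 ticks
 * per call and by 12000 ticks on every 50th call to mimic an interrupted iteration. */
static uint32_t fake_timestamp(void)
{
    static uint32_t ticks = 0xFFFFF000u;  /* start close to wrap-around on purpose */
    static uint32_t calls = 0u;

    ticks += (++calls % 50u == 0u) ? 12000u : 35u;
    return ticks;
}
```

Calling `measure_loop(fake_timestamp, 100, 1000)` separates the normal iterations from the interrupted ones; the threshold of 1000 ticks is an arbitrary choice for the example.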
Using the core trace we've been able to see that in case of a long delay the core executed considerably more instructions (~460 vs. ~80), but that doesn't explain the increase in the number of cycles (~6 times as many instructions take ~60 times as long).
Using the performance monitoring unit we've been able to see that the core stalls for ~8000 of ~10000 cycles because the instruction buffer can't deliver an instruction (event 0x40), and that it experienced ~70-90 I-cache misses.
It therefore seems that the huge delay introduced by the timer interrupt is due to the code not being available in the I-cache, and that fetching the code from memory takes a long time.
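As a rough sanity check of that conclusion, the numbers above can be turned into cycles per instruction and stall cycles per I-cache miss. The helper names below are hypothetical, and the inputs are only the approximate figures quoted in this post (~10,000 cycles for ~460 instructions, ~160 cycles for ~80 instructions, ~8000 stall cycles, ~70-90 misses).

```c
#include <assert.h>

/* Average cycles spent per executed instruction. */
static double cycles_per_instruction(double cycles, double instructions)
{
    assert(instructions > 0.0);
    return cycles / instructions;
}

/* Average fetch-stall cycles attributable to one I-cache miss,
 * assuming (simplification) that all stalls are caused by misses. */
static double stall_cycles_per_miss(double stall_cycles, double icache_misses)
{
    assert(icache_misses > 0.0);
    return stall_cycles / icache_misses;
}

/* Approximate figures from the post:
 *   cycles_per_instruction(160.0, 80.0)     -> about 2 for an undisturbed iteration
 *   cycles_per_instruction(10000.0, 460.0)  -> about 22 for an interrupted iteration
 *   stall_cycles_per_miss(8000.0, 90.0)     -> about 89
 *   stall_cycles_per_miss(8000.0, 70.0)     -> about 114
 */
```

Under these assumptions each miss costs on the order of 90-115 cycles, which is why a few dozen misses are enough to account for most of the delay.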
Our application runs from DDR memory, which of course explains some of the delay, but I also tried putting the SYSBIOS code (that includes the timer interrupt handler and everything called from there) into MSMC SRAM, and the performance improved only slightly (13,000 cycles -> 10,000 cycles).
We've also written some test applications that test the performance of accessing various memories from the R5F core. Our code is able to read/write ~1200-1300 MB/s from/to TCMA, it can write 600 MB/s to MCU SRAM and read ~300 MB/s from it, but it is only able to write ~160-200 MB/s to MSMC SRAM and read 90-100 MB/s from it. For DDR memory the numbers are even worse, with 150-180 MB/s writing and 65-70 MB/s reading.
My tests show that it takes on average 23 cycles to read a 32-bit word from DDR memory and ~17 cycles from MSMC memory. Since the memory is cached I'm guessing that the latency is actually ~160 cycles for DDR and ~130 cycles for MSMC SRAM, because the R5F is fetching 8 consecutive words that I'm also reading sequentially. The first access is going to stall, while the remaining 7 should execute with zero delay. These latencies appear to be rather high.
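The relation between the measured bandwidth and these per-word figures can be sketched as below. Two things are assumptions here and not stated above: an R5F clock of 400 MHz, and 1 MB = 10^6 bytes. The function names are hypothetical.

```c
#include <assert.h>

/* Average core cycles per word for a given sustained read bandwidth. */
static double cycles_per_word(double clock_hz, double bytes_per_second, double word_bytes)
{
    assert(bytes_per_second > 0.0);
    return clock_hz * word_bytes / bytes_per_second;
}

/* If only the first word of a cache line stalls and the rest are free, the
 * whole per-line cost lands on that first access. This is an upper bound:
 * the per-word average also contains the loop's own instruction cycles. */
static double line_fill_estimate(double avg_cycles_per_word, unsigned words_per_line)
{
    return avg_cycles_per_word * (double)words_per_line;
}

/* With the assumed 400 MHz clock and 32-bit words:
 *   cycles_per_word(400e6, 70e6, 4.0)  -> about 23 (DDR read, ~70 MB/s)
 *   cycles_per_word(400e6, 95e6, 4.0)  -> about 17 (MSMC read, ~95 MB/s)
 *   line_fill_estimate(23.0, 8)        -> 184 (DDR, upper bound)
 *   line_fill_estimate(17.0, 8)        -> 136 (MSMC, upper bound)
 */
```

The upper bounds of 184 and 136 cycles are in the same range as the ~160 and ~130 cycle guesses once some loop overhead is subtracted.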
The memories are all mapped as cacheable normal memory. Clearing the FWT bit that is left by the SBL has only a small impact (worse for DDR memory, better for MSMC, but only by 10-40 MB/s). Mapping the memory as uncacheable makes the DDR memory significantly worse (~3.5x less performance when writing).
Seeing that DDR memory is only slightly worse than the MSMC SRAM, I'm guessing that there's a bottleneck (or rather high latency?) when accessing MAIN domain memories from the R5F.
- Does TI have any numbers regarding the R5F performance when running from or accessing memories in the MAIN domain? The "AM65xx System Performance" document (SPRACI6, November 2018) contains numbers similar to what I'm interested in (e.g. 6.1.1: System Access Latency, 6.1.2 Instruction Cache Bandwidth), but only for the A53 cores and not for the R5F cores.
- Are there any means to optimize access to MAIN domain memories in the AM65xx? In this case the A53 cores are not going to be used.
Best Regards,
Dominic