AM6548: R5F instruction fetch performance

Dominic Rath

Part Number: AM6548
Other Parts Discussed in Thread: SYSBIOS

Dear TI team,

we're currently looking into an issue regarding the R5F's performance.

Our application uses a test routine that runs in a tight loop that fetches a timestamp using Timestamp_get32() and calculates the differences between the current timestamp and the previous timestamp. This loop usually executes in ~30-40 cycles, with some additional instrumentation ~160 cycles. Sometimes the loop takes much longer, on the order of 10,000-13,000 cycles. We've tracked this down to the execution of the TI-RTOS system tick, i.e. everytime the loop takes considerably longer it was interrupted by a timer interrupt. Subsequent occurences of the timer interrupt impact the loop a lot less, around 1000-2000 cycles.

Using the core trace we've been able to see that in case of a long delay the core executed considerably more instructions (~460 vs. ~80), but that doesn't explain the increase in the number of cycles (~6 times as many instructions take ~60 times as long).

Using the performance monitoring unit we've been able to see that the core stalls for ~8000 of ~10000 cycles because the instruction buffer can't deliver an instruction (event 0x40), and that it experienced ~70-90 I-cache misses.

It therefore seems that the huge delay introduced by the timer interrupt is due to the code not being available in the I-cache, and that fetching the code from memory takes a long time.

Our application runs from DDR memory, which of course explains some of the delay, but I also tried putting the SYSBIOS code (that includes the timer interrupt handler and everything called from there) into MSMC SRAM, and the performance improved only slightly (13,000 cycles -> 10,000 cycles).

We've also written some test applications that test the performance of accessing various memories from the R5F core. Our code is able to read/write ~1200-1300 MB/s from/to TCMA, it can write 600 MB/s to MCU SRAM and read ~300 MB/s from it, but is is only able to write ~160-200 MB/s to MSMC SRAM and reads 90-100 MB/s from it. For DDR memory the numbers are even worse with 150MB/s-180MB/s writing and 65MB/s-70MB/s reading.

My tests show that it takes on average 23 cycles to read a 32-bit word from DDR memory and ~17 cycles from MSMC memory. Since the memory is cached I'm guessing that the latency is actually ~160 cycles for DDR and ~130 cycles for MSMC SRAM, because the R5F is fetching 8 consecutive words that I'm also reading sequentially. The first access is going to stall, while the remaining 7 are should execute with zero delay. These latencies appear to be rather high.

The memories are all mapped as cacheable normal memory. Clearing the FWT bit that is left by the SBL has only a small impact (worse for DDR memory, better for MSMC, but only by 10-40 MB/s). Mapping the memory as uncacheable makes the DDR memory significantly worse (~3.5x less performance when writing).

Seeing that DDR memory is only slightly worse than the MSMC SRAM I'm guessing that there's a bootleneck (or rather high latency?) when accessing MAIN memories from the R5F.

Does TI have any numbers regarding the R5F performance when running from or accessing memories in the MAIN domain? The "AM65xx System Performance" document (SPRACI6–November2018) contains number similar to what I'm interested in (e.g. 6.1.1: System Access Latency, 6.1.2 Instruction Cache Bandwidth), but only for the A53 cores and not for the R5F cores.
Are there any means to optimize access to MAIN domain memories in the AM65xx? In this case the A53 cores are not going to be used.

Best Regards,

Dominic

over 5 years ago

0 Armin Mueller over 5 years ago

Prodigy 181 points

Are there any news to that Topic?

0 jian35385 over 5 years ago

TI__Mastermind 23125 points

Dominic,

Sorry for the delayed response, and thanks for the detailed descriptions of your experiments. So my understanding is:

If we assume the 70-90 I-cache misses to MSMC SRAM per RTOS Tick interrupt, and each cache misses are on separate cache lines, then we would expect to see 90*130=11700 MCU cycles of extra latency. This is somewhat aligned with the data you saw 10000-13000 cycles. So the root cause is the long distance access from R5 to MSMC, that your benchmarked as 17 cycles per word. I will verify the accuracy of this number, but there is very little we can do, as it is determined by the device interconnect topology.

I am not familiar with the timer TI-RTOS interrupt ISR, but causing 90 cache line misses seems to be quite unusual. Also will it be possible to place the tick ISR in the TCM, or the OC-MSRAM in the MCU domain? i will also ping our RTOS guys on this question.

Regards

Jian

0 Dominic Rath over 5 years ago in reply to jian35385

Mastermind 6985 points

Hello Jian,

thanks for getting back to me on this issue.

jian35385 said:
If we assume the 70-90 I-cache misses to MSMC SRAM per RTOS Tick interrupt, and each cache misses are on separate cache lines, then we would expect to see 90*130=11700 MCU cycles of extra latency. This is somewhat aligned with the data you saw 10000-13000 cycles. So the root cause is the long distance access from R5 to MSMC,

Agreed.

jian35385 said:
that your benchmarked as 17 cycles per word. I will verify the accuracy of this number, but there is very little we can do, as it is determined by the device interconnect topology.

I rather think the latency is closer to 160 cycles for DDR and 130 cycles for MSMC SRAM, since in that test I was reading consecutive words of memory, and the /average/ latency for that was 17-23 cycles per word. Only the first access is affected by the latency, and the remaining 7 words can be used instantly. The actual latency for fetching a line from DDR or MSMC would therefore be in the range of 325-400 ns.

A latency in the range of 360-380 ns for a single 4 byte access is also what I'm seeing for another test that I've run since then using strongly ordered DDR memory.

This appears to be way too high. If the latency is actually that high this puts a serious limit on what you can use the R5f for.

jian35385 said:
I am not familiar with the timer TI-RTOS interrupt ISR, but causing 90 cache line misses seems to be quite unusual. Also will it be possible to place the tick ISR in the TCM, or the OC-MSRAM in the MCU domain? i will also ping our RTOS guys on this question.

Please note that this was on the first execution of the timer interrupt, so it appears to be plausible that ALL of the code had to be fetched. Since the timer ISR amounted to ~400 instructions I'd expect 50 cache misses in the best case (all perfectly aligned, no unused instructions fetched), and 70-90 seems like a reasonable result. With a dummy application that doesn't really do anything but idle, subsequent timer interrupts were a lot better, but our application has a much larger working set, and causes the interrupt handler to be evicted from cache.

I'm seeing several issues with putting the ISR code in a TCM or MCU SRAM:

Our application is going to be too large to be all put into TCM or MCU SRAM. A (debug) build is already closer to 1 MB.
The TI-RTOS code isn't as straight forward as I would like it to be, e.g. the timer interrupt is executing 13 functions.
There's not just the timer ISR to be considered, but of course also the interrupts that are going to drive our application.

If there really is such a high latency for accessing MAIN domain memories from the R5f then I agree, putting our critical code, including all of the ISRs, into SRAM is going to be our only chance, but since the number appears way too high I'm still hoping that there's something wrong with our setup.

Regards,

Dominic

0 jian35385 over 5 years ago in reply to Dominic Rath

TI__Mastermind 23125 points

Dominic,

130 cycles for a cache line load from MSMC would be way much than expected.

My team alerted seeing some very long latency when the accessing region is marked as Shareable in R5 MPU registers. Could you check the MSMC region is marked as Non-Shared in the MPU Region Access Control Register? We also have a GEL script to dump MPU settings. let me know if you need.

Once we get through the latency, we may further explore TCM and OC-MSRAM usage between TI-RTOS and your application code. Based on our discussion, We may put some partitions of TI-RTOS to different memory.

regards

Jian

0 Dominic Rath over 5 years ago in reply to jian35385

Mastermind 6985 points

Hello Jian,

jian35385 said:
130 cycles for a cache line load from MSMC would be way much than expected.

That's what I was hoping. Do you have an indication what kind of latency would be expected? From SPRACI6 I know that the A53 should see a latency of ~40ns for accesses to MSMC SRAM, but there's no data for the R5f.

jian35385 said:
My team alerted seeing some very long latency when the accessing region is marked as Shareable in R5 MPU registers. Could you check the MSMC region is marked as Non-Shared in the MPU Region Access Control Register? We also have a GEL script to dump MPU settings. let me know if you need.

I've checked my MPU settings. The cacheable regions are marked as TEX = b001, C = 1, B = 1, S = 0, and should therefore be treated as outer and inner write-back, write-allocate cacheable non-shareable memory. I've verified the the MPU access registers and these are set to 0x10b for the cacheable regions which matches the settings from the MPU file.

To make sure we're using valid numbers I've refined my tests and added tests using non-cacheable and strongly ordered MSMC regions. I now get the following results:

For cacheable DDR it takes ~1.4 million cycles for 256 KB -> 21 cycles per 4 byte. Since a cache line is 8 words I'm assuming that the line fetch took ~168 cycles.

For non-cacheable DDR it takes ~9.6 million cycles for 256 KB -> ~146 cycles per 4 byte.

For strongly ordered DDR memory it takes almost 10 million cycles for 256 KB -> 152 cycles per 4 byte.

For cacheable MSMC it takes 986 thousand cycles for 256 KB -> 15 cycles per 4 byte. Since a cache line is 8 words I'm assuming that the line fetch took ~120 cycles.

For non-cacheable MSMC it takes ~6.8 million cycles for 256 KB -> ~104 cycles per 4 byte.

For strongly ordered MSMC memory it takes almost ~7 million cycles for 256 KB -> 106 cycles per 4 byte.

I'm now accessing 256 KB of memory in a loop that performs 8 sequential accesses using a 32-bit pointer. The loop gets compiled to 8 LDR instructions, a SUBS and a BNE. The loop overhead should therefore be negligible. This change resulted in slightly better numbers (e.g. 130 cycles for a line fetch -> 120 cycles for a line fetch from MSMC).

The difference between MSMC and DDR memory seems justifiable, since addressing DDR memory obviously takes extra time, and this access pattern allows for little or no optimization.

My calculations for the line fetch latency (15 cycles per 4 byte -> 120 cycles per line) ignore the actual loads, so the actual line fetch latency is probably about 8 cycles lower than the above numbers, and that roughly matches the results for non-cacheable (104 cycles) or strongly ordered loads (108 cycles).

A latency of > 100 cycles / 250ns for a read from MSMC SRAM still seems excessively high.

Unfortunately the documentation on the various interconnects is rather sparse. There is some information in SPRACI6, but I'm uncertain which way my R5f accesses travel to the MSMC SRAM or DDR memory.

jian35385 said:
Once we get through the latency, we may further explore TCM and OC-MSRAM usage between TI-RTOS and your application code. Based on our discussion, We may put some partitions of TI-RTOS to different memory.

I think that's an optimization that we'll have to consider once we understand what kind of performance we can achieve running from DDR and MSMC memory.

Regards,

Dominic

0 jian35385 over 5 years ago in reply to Dominic Rath

TI__Mastermind 23125 points

Dominic,

I got hold of our internal benchmark, and in discussion with the engineer on the test conditions. Will need to update you tomorrow.

In the mean time if you have a comparison to the MCU SRAM region, that would be a good reference.

regards

Jian

0 Dominic Rath over 5 years ago in reply to jian35385

Mastermind 6985 points

Hello Jian,

we've got a holiday today, so I can't make any new measurements.

I believe access time to cached MCU SRAM was about 5.x cycles per 4 byte, so the latency should be around 30-40 cycles. This is also not as good as I would have hoped. Unfortunately it is consistent with the "R5 to OSPI via interconnect" latency listed in SPRACI6 as 92.5ns (37 cycles). The MCU SRAM and the FSS seem to be on the same interconnect. Do you know if the 92.5ns are true for all slaves on the SCRM_64b_PCLK2_R5 interconnect?

Isn't that going to limit code execution even from MCU SRAM? ~30-40 cycles per 8 instructions would mean 0.2-0.25 instructions per cycle?

Regards,

Dominic

0 jian35385 over 5 years ago in reply to Dominic Rath

TI__Mastermind 23125 points

Dominic,

I reviewed our internal benchmark data, but seeing significant discrepancy between your measurements and ours, on a test similar to:

"For cacheable MSMC it takes 986 thousand cycles for 256 KB -> 15 cycles per 4 byte. Since a cache line is 8 words I'm assuming that the line fetch took ~120 cycles."

The only difference is that his code is using 512bytes block read on a cold cache, and used GTC as timer to benchmark.

So would like to see your measured data on MCU SRAM once you get to chance to test. also as a sanity check, did you verify the accuracy of the PMU configuration against a known routine of known number of execution cycles?

Also I noticed simply multiplying 15 cycles by cache line size to estimate loading time of cache line, may not be accurate. so it will be desirable to see yous MCU SRAM latency numbers as the following:

total cycles for loading 256KB, when MCU RAM is configured as cacheble
total cycles for loading 256KB, when MCU RAM is configured as non-cacheable
Cycles of single 4B word loading, out of cold cache each time. Not sure if PMU overhead can be sufficiently extracted.
cycles of single byte loading, out of cold cache each time

I am expecting measures from 3 and 4 above to be inline with 3, assuming CBASS interconnect can load cache lines fast.

regards

Jian

0 Dominic Rath over 5 years ago in reply to jian35385

Mastermind 6985 points

Hello Jian,

jian35385 said:
I reviewed our internal benchmark data, but seeing significant discrepancy between your measurements and ours, on a test similar to:

Is there any chance you could share some numbers?

jian35385 said:
The only difference is that his code is using 512bytes block read on a cold cache, and used GTC as timer to benchmark.

I'm using the Timestamp_get32() function that reads out the R5's PMCCNTR register. As far as I understand it the GTC can't possibly be a better time source?

jian35385 said:
also as a sanity check, did you verify the accuracy of the PMU configuration against a known routine of known number of execution cycles?

I'm running two different tests. The one that originally caused us to look into the memory performance is a loop that just calls Timestamp_get32(). Sometimes the delta was huge, and we tracked that down to the R5 having to fetch the timer ISR (and code called from there). That code used the PMU to track down the reason for the delays. The counters for instructions executed matched a "core trace" very well, and the numbers for things like cycles stalled because of instruction fetches matched the overall increase of the loop reasonably well.

The other test merely uses the PMCCNTR register from the PMU to measure the time taken to read a 256 KB array from some memory. I'm using reads from TCMA to sanity-check these results. From TCMA it takes ~4650 cycles to read 16 KB. This means it took ~1.13 cycles per 4 byte. The overhead of ~0.13 cycles can be explained by the loop, so the TCMA was indeed able to deliver values with 0 cycle latency.

jian35385 said:
So would like to see your measured data on MCU SRAM once you get to chance to test.

Reading from cached MCU SRAM takes ~337 thousand cycles for 256 KB -> 5.14 cycles per 4 byte. Since a cache line is 8 words I'm assuming that the line fetch took ~41 cycles (more likely ~33).

Reading from uncacheable MCU SRAM takes 2.25 million cylces for 256 KB -> 34 cycles per 4 byte

These numbers would be close enough to 92.5 ns (37 cycles) latency to OSPI mentioned in SPRACI6. Can you comment on my assumption that every slave on SCRM_64b_PCLK2_R5 is going to be affected by a similar latency?

jian35385 said:
Also I noticed simply multiplying 15 cycles by cache line size to estimate loading time of cache line, may not be accurate.

I guess you're refering to my numbers for cacheable MSMC which took 986k cycles, which means it took ~15 cycles per 4 byte _on average_. Since the processor is going to stall on the first cache miss, but is then going to deliver the data critical word first and non-blocking it should be able to process the remaining 7 words with 0 additional delay. I agree that this is not fully accurate, which is what I was refering to with this statement:

Dominic Rath said:
My calculations for the line fetch latency (15 cycles per 4 byte -> 120 cycles per line) ignore the actual loads, so the actual line fetch latency is probably about 8 cycles lower than the above numbers, and that roughly matches the results for non-cacheable (104 cycles) or strongly ordered loads (108 cycles).

jian35385 said:

total cycles for loading 256KB, when MCU RAM is configured as cacheble

total cycles for loading 256KB, when MCU RAM is configured as non-cacheable

See above.

jian35385 said:

Cycles of single 4B word loading, out of cold cache each time. Not sure if PMU overhead can be sufficiently extracted.

cycles of single byte loading, out of cold cache each time

I don't have these tests ready, and I'm not sure if the numbers are going to be meaningful, because as you already said for a single load the cost of measuring time is going to become a relevant factor. Also I'm not sure what additional insight that would provide. Unfortunately CCS can't do cycle accurate traces, or at least not with XDS100/200 probes and only the ETBs.

This leaves us with ~34 cycles for accesing MCU SRAM, ~104 cycles for accessing MSMC SRAM and ~146 cycles for accessing DDR RAM. The question remains whether these are expected values, which is what I was hoping you (TI) would be able to provide.

Regards,

Dominic

0 jian35385 over 5 years ago in reply to Dominic Rath

TI__Mastermind 23125 points

Dominic,

Agree with your summary:

>>This leaves us with ~34 cycles for accesing MCU SRAM, ~104 cycles for accessing MSMC SRAM and ~146 cycles for accessing DDR RAM.

I will send you confirmation later today.

Regards

Jian

0 jian35385 over 5 years ago in reply to jian35385

TI__Mastermind 23125 points

Dominic,

yes we can confirm these latency numbers.

>>This leaves us with ~34 cycles for accesing MCU SRAM, ~104 cycles for accessing MSMC SRAM and ~146 cycles for accessing DDR RAM.

Also to your question:

>>Can you comment on my assumption that every slave on SCRM_64b_PCLK2_R5 is going to be affected by a similar latency?

[Jian] yes. only differences will be access lateness of peripheral controllers itself.

Let me know how if you need further information on relocating some RTOS buffers.

Jian

0 Dominic Rath over 5 years ago in reply to jian35385

Mastermind 6985 points

Hello Jian,

thanks for confirming my measurements.

It would be great if TI could document these numbers somewhere, for example in an updated SPRACI6. It took us a lot of effort to reproduce benchmarks that TI apparently already had, and it might be nice saving others this hassle.

Could you maybe explain why for a while you were thinking that the latency was "way much than expected"?

I've moved all of the code required for the timer interrupt to MCU SRAM (verified addresses via core trace). The results are of course much better than what we've seen before using DDR memory, but the MCU SRAM latency still badly limits the R5F performance. Coming back to my original test case (loop reading cycles counter, calculating delta) I see 3000+ cycles caused by an extra ~400 instructions that caused ~69 instruction cache misses which stalled the core for ~2400 cycles. The instruction fetches alone cause ~6us of of delay by the timer interrupt. The numbers are in line with my previous measurements: 69 cache misses * 34 cycles latency = 2346 cycles.

We might have to move parts of the code to TCM, but since the TCM is even smaller that most likely means moving individual functions, not whole objects like I'm doing it now.

Regards,

Dominic

Processors

Processors forum

AM6548: R5F instruction fetch performance