AM2434: Understanding AM2434 performance

Part Number: AM2434
Other Parts Discussed in Thread: TMDS243EVM, AM2612, LP-AM261, AM6442, SYSCONFIG, MATHLIB

Hi support team,

I am evaluating both the AM2612 and the AM2434. With the same benchmark code on both, placing code, data and stack in main memory (OCSRAM or MSRAM), with cache enabled, the MPU correctly configured, and the clocks set as in the example demos (R5F @ 500 MHz on the AM2612 and 800 MHz on the AM2434 on a TMDS243EVM board), I am seeing similar results on both targets, while I expected better performance from the AM2434, which has a higher R5F clock speed.

I am trying to find the differences between the two setups: same code, same compiler, same compiler options (-O2), same MPU configuration, different R5 clocks. The only remaining factor I can think of is the MSRAM/OCSRAM clock frequency.

I found in the AM2612 reference manual that the OCSRAM is clocked at 200MHz, but I cannot find any information about the MSRAM in the AM2434 (as referred to in this post: https://e2e.ti.com/support/microcontrollers/arm-based-microcontrollers-group/arm-based-microcontrollers/f/arm-based-microcontrollers-forum/1326676/mcu-plus-sdk-am243x-sram-clock?tisearch=e2e-sitesearch&keymatch=am2434%2525252520sram%2525252520clock).


I am just trying to figure out what leads to this difference, so if you could at least give me the MSRAM frequency in AM2434, that would be great. In my project, the AM24x_GP_EVM.gel script is used to initialize the CPU clocks for the board (with DDR).

Best regards

 

  • Hello Gael,

    Can you please post screenshots of the benchmark results? Thanks!

    Regards,

    Stan

  • Hi Stanislav,

    I don't have a screenshot with my measurements, but I can give the average execution time from my program on both targets:

    - LP-AM261 (single R5F core @ 500 MHz, -O2, code/data/stack in OCRAM; OCRAM bus width and speed: 64 bits, 200 MHz): 945.6 us

    - TMDS243EVM (single R5F core @ 800 MHz, -O2, code/data in MSRAM, stack in TCMA; MSRAM bus width and speed: ?): 1132.9 us

    Can you share some information about the MSRAM on the AM2434, at least clock speed and bus width (64 or 32 bits)?

    Thanks,

    Gael

  • Gael,

    I will need to do some more checks, but it looks like the AM243x MSRAM is running at 250 MHz with a 64-bit bus.

    What is the benchmark you are running?

    Thanks,

    Stan

  • Stanislav,

    If it runs at 250 MHz, then I need to dig deeper to understand whether the SDK example SW I started from (the ddr ecc example from the drivers folder) correctly configures all the clocks.

    In the TRM, I did not find more information beyond the MAIN_PLL0-HSDIV0 clock output being set at 500 MHz, and it is not indicated anywhere which divider is applied to clock the MSRAM banks. If you could confirm the 250 MHz, that would be great, thanks.

    The benchmark code I am running is a private application from my company.

    Best regards,

    Gael

  • Hello Gael,

    Would you please open the Sysconfig / Clock Tree Tool for the AM6442 (it should also be valid for the AM243x) and check the clock diagram?

    I will follow-up shortly with more details. 

    Thanks

    Best Regards

    Anastas Yordanov

  • Hello Gael,

    Here is an extract of the diagram from the online Sysconfig/Clock Tree Tool (https://dev.ti.com/sysconfig) for the AM6442:

    The divider_219_2 is a SoC level integrated fixed (non-programmable) "/2" divider. Such clock frequency division to functional clock VCLK input applies to all 8x256KB instances: MSRAM_256K0 - MSRAM_256K7.

    So I confirm: MAIN_SYSCLK0_freq / 2 = 250 MHz is the MSRAM interface clock.

    Looking forward to your feedback !

    Thanks

    Best Regards

     Anastas Yordanov

  • Hi Anastas,

    Thanks for your inputs.

    Using the online SysConfig tool and its sub-tools, I checked the internal configuration of the AM24x and compared the different dividers and PLL HSDIV configurations, to try to understand how I can measure the same performance on the LP-AM261 at 500 MHz and on the TMDS243EVM at 800 MHz when all my code runs from internal RAM.

    I checked the R5F registers on the AM2434 against the configuration the SysConfig tool provides for the AM6442 (which is similar to the AM2434), especially PLL0_HSDIV0 for the MSRAM and PLL14_HSDIV0 for the R5F core clock (800 MHz). I checked again that the MPU region for the whole MSRAM is configured as "Cached", I checked in the System Control Register that the I-cache and D-cache are enabled, and I still don't understand why I get these results.

    Do you have a lead on what I could check that I may have missed?

    Do you have any software of your own that has been run from internal memory on both the AM261 and the AM2434, so that I can compare performance? I checked the benchmarks in the online SDK documentation, but I could not find identical SW with comparable measurements.

    I also checked the DDR configuration (for the variant of my benchmark that runs code only from DDR), and I verified that the GEL script is correctly configured. I checked PLL12_HSDIV0, which is set at 400 MHz, and on this point I have a question: the GEL script clearly states that the DDR of the TMDS243EVM should be clocked at 800 MHz, but the PLL frequency is set to 400 MHz. I could not find anything in the reference manual about doubling the DDR clock. I would have expected that an 800 MHz DDR clock requires the PLL12 HSDIV0 output to be set at 800 MHz.

    Could you explain this point?

    Best regards

    Gael

  • Hi,

    Just wanted to chime in. I recall that if a benchmark's code is small enough to fit in the cache, it will measure the pure CPU/cache performance. This still doesn't explain the results, though.

    Best,

    Stan

  • Hi,

    Actually, the code of my benchmark application is small enough to fit in the internal RAM of the AM2612 and AM2434 (not in the cache). But as you say, I am focusing only on this scenario (not involving the DDR of the AM2434) just to be sure the pure performance is better on the AM2434. Since I have a 200 MHz MSRAM clock on the AM2612 side with a 500 MHz R5, and a 250 MHz MSRAM clock on the AM2434 with an 800 MHz R5, I was expecting better performance on the AM2434, but I don't observe that. That is why I am interested in whether you have any piece of software that has been executed on both processors with comparable measurements.

    I also added cache enable/disable calls in my code just to check the MPU attributes, and I confirm the caches are enabled and the MPU is configured correctly for the MSRAM to be cached.

    By the way, in case it is of interest: on both the AM2612 and the AM2434, when code is executed from MSRAM/OCRAM, the application's average execution time is about 40% better with cache enabled.

    Best regards,

    Gael

  • Hi Gael,

    You can refer to benchmarking code in MCU+ SDK here:

    https://dev.ti.com/tirex/explore/node?isTheia=false&node=A__AD2nw6Uu4txAz2eqZdShBg__com.ti.MCU_PLUS_SDK_AM243X__AROnekL__LATEST

    And here for AM261x:

    https://dev.ti.com/tirex/explore/node?isTheia=false&node=A__AD2nw6Uu4txAz2eqZdShBg__MCU-PLUS-SDK-AM261X__ZKtBu0R__LATEST

    Please note that the results are normalized per MHz.

    Also I got some feedback from one of our experts:

    Basically, if the R5 is operating out of its own TCM, you would expect near single-cycle access and optimal performance. If it is going off to main OCSRAM or to the other R5's TCM, then there will be some access latency, which could influence the result.

    Thanks,

    Stan

  • Hi Stanislav,

    I looked into the benchmarks in detail, and there are indeed CoreMark and Dhrystone benchmarks that could be compared on both targets.

    In both benchmarks, code and bss are placed in MSRAM/OCSRAM and data is placed in TCM, which is comparable to my own application benchmark setup.

    On the AM2434, the SDK page shows the following numbers, but I think the "usertime in sec" value is wrong, since it is computed in the code as the "USER cycle count" divided by cpuClockRate, which is 800 MHz. So with 144641654 cycles it should be 0.1808 seconds (instead of 0.723208), right?

    I also took the Dhrystone example as is, without modifying anything, from the SDK "motor_control_sdk_am243x_11_00_00_06", and ran it on the TMDS243EVM board. Here are the results:

    I followed the EVM setup page so I'm sure the SBL null is flashed, and the BOOT pins are set as per the SDK EVM setup page.

    Does this make you think of anything that could be wrong either in my setup, in the clocks configuration, the SDK...?

    Regards,

    Gael

  • Hi Gael,

    I will need some more time to get back to you.

    Thanks,

    Stan 

  • Hi Stan,

    Ok, no problem.

    Gael

  • Hi Gael,

    I will need an hour or two

  • Hi Stan, no problem.

    Thanks

    Gael

  • Hi Stan,

    I realized that by default CCS THEIA uses "Debug" instead of "Release" build configuration...

    I've set the Dhrystone example to the Release configuration and I got these results:

    I don't know exactly what the differences between the Debug and Release configurations are (debug logging functions, etc.), but this result is close to what is published on the SDK website (except for "usertime in secs" and "Microseconds for one run").

    I also changed the build configuration from "Debug" to "Release" in my own benchmark application project, and I still get the same results (the code being measured doesn't call any SDK function that might behave differently in the Debug configuration). So I still don't understand the similar performance on both targets. I also checked the datasheets: the AM2612 has smaller I-cache and D-cache (16 KB each) than the AM2434 (32 KB each), which should give the AM2434 an additional advantage on top of the core clock speed.

    Regards,

    Gael

  • Hi Gael, 

    Sorry for the late reply, I was out of office for 2 days.

    So with 144641654 it should be 0.1808 seconds (instead of 0.723208), right?

    I was wondering about 0.723208 too. I opened the original Dhrystone benchmark code on GitHub and found that this value is derived by dividing by the number of runs, which is 2000 by default. 0.723208 matches the cycle count divided exactly by 2, without accounting for the zeros in the formula.

    So I still don't understand the close performance on both targets

    Sorry if you already said this, but do you mean you are getting similar results in your own benchmarking code only, or in both yours and Dhrystone?

    Regards,

    Stan

  • Stan, Anil,
    I think that Gael is referring to his own proprietary benchmarks (Dhrystone normalizes the results to DMIPS/MHz, so you might not see much difference there).

    For his benchmarks on one R5F all code, data and stack are placed in OCRAM or MSRAM. No external memory used.
    I assume same src code and compilation options are being used.
    AM243x performance should be significantly higher than AM261x because:

    - AM243x: 32 kB cache, 800 MHz CPU clock, 250 MHz clocked RAM
    - AM261x: 16 kB cache, 500 MHz CPU clock, 200 MHz clocked RAM


    Maybe you could agree to re-run some of the MCU+ SDK 11.01.xx benchmarks on both platforms to compare results?
    For example the Mathlib benchmarks seem to use the same code base:
    https://dev.ti.com/tirex/explore/content/mcu_plus_sdk_am243x_11_01_00_19/docs/api_guide_am243x/EXAMPLES_MATHLIB_BENCHMARK.html
    https://dev.ti.com/tirex/explore/content/mcu_plus_sdk_am261x_11_01_00_19/docs/api_guide_am261x/EXAMPLES_MATHLIB_BENCHMARK.html 

    Thanks in advance,
    Anthony

  • Hello Anthony,

    I can update the status by tomorrow .

    Regards,

    Anil. 

  • Hi. (I was out of office too)

    Sorry if you already said this, but do you mean you are getting similar results in your own benchmarking code only, or in both yours and Dhrystone?

    With my benchmarking code I get a similar average execution time on both targets, but I expected it to be lower on the AM2434 than on the AM261. With Dhrystone, I ran it only on the AM2434, and I confirmed it gives the same DMIPS/MHz as on the AM261, which means the Dhrystone run takes less absolute time on the AM2434 than on the AM261 because it runs at 800 MHz instead of 500 MHz.

    About the benchmark rerun: I would focus on benchmark code that is bigger than the Dhrystone code. In my own application benchmark, the code is about 1 MB, and that is the only difference I can see between the Dhrystone benchmark and my application benchmark. If the code is small enough, it may fit in the cache, and the measured Dhrystone performance then only reflects the specific case where code runs mainly from cache. I am curious how one of your larger applications would affect the measured performance.

    Regards,

    Gael

  • Hi Gael,

    When performing benchmarks, it is critical to use the Release build configuration rather than Debug build.

    This can significantly affect your measured performance.

    Debug Build Characteristics:
    - Compiler optimization disabled (typically -O0 flag)
    - Debug symbols included, increasing code size
    - Additional runtime checks and assertions enabled
    - Larger code footprint leads to more cache misses

    Release Build Characteristics:
    - Compiler optimization enabled
    - Smaller and faster generated code
    - Dead code elimination reduces code size
    - Better cache utilization due to compact code

    Memory Latencies:

    The 63.75 ns latency figure is measured specifically for AM64x/AM243x MSRAM from our application note SPRACV1B. We do not have equivalent published latency measurements for AM261x OCSRAM in the benchmark document. 

    https://www.ti.com/lit/an/spracv1b/spracv1b.pdf?ts=1773221759391&ref_url=https%253A%252F%252Fwww.google.com%252F

    AM243x: 51 cycles memory latency (63.75 ns at 800 MHz)
    AM261x: 24 cycles memory latency (48 ns at 500 MHz); this figure comes from our internal team.

    This shows the AM261x has lower memory latency in both cycles and absolute time.

    DHRYSTONE ANALYSIS:

    Your measurements show: 
    - AM243x: 1.96 MIPS/MHz at 800 MHz

    https://software-dl.ti.com/mcu-plus-sdk/esd/AM243X/11_02_00_24/exports/docs/api_guide_am243x/EXAMPLES_DHRYSTONE.html


    - AM261x: 1.94 MIPS/MHz at 500 MHz

    https://software-dl.ti.com/mcu-plus-sdk/esd/AM261X/latest/exports/docs/api_guide_am261x/EXAMPLES_DHRYSTONE.html

    The normalized performance (DMIPS/MHz) is nearly identical at approximately 1.95 MIPS/MHz.

    In absolute terms:
    - AM243x: 1.96 × 800 = 1568 DMIPS
    - AM261x: 1.94 × 500 = 970 DMIPS

    AM243x completes Dhrystone 1.6x faster in absolute time, matching the clock ratio (800/500 = 1.6x).

    Dhrystone is specifically designed to fit entirely within L1 cache. The R5F has 32 KB I-Cache and 32 KB D-Cache. Dhrystone code plus data is small enough (only a few KB) to run entirely from cache after the first iteration. It effectively measures CPU plus cache performance only, not memory subsystem performance.

    Your application with 1 MB code size is the critical difference. 1 MB code cannot fit in the 32 KB I-cache. Frequent cache misses force instruction fetches from MSRAM/OCSRAM.

    The memory penalty comparison:
    - AM243x: 51 cycles wasted per cache miss
    - AM261x: 24 cycles wasted per cache miss

    AM243x wastes more than twice the cycles per memory access (51/24 = 2.125x). This additional memory penalty nearly cancels out the 1.6x CPU clock advantage, resulting in similar overall execution time for memory-bound applications.

    SUGGESTED TESTS TO VALIDATE

    Test 1 - Measure Cache Miss Rate:
    Use the R5F Performance Monitoring Unit (PMU) to measure cache misses. Enable counters using PMCNTENSET. Use Event 0x01 for I-cache refill (instruction cache miss) and Event 0x03 for D-cache refill (data cache miss). Compare cache miss counts between Dhrystone and your 1 MB application.

    Test 2 - Run a Subset from TCM:
    If possible, identify the hottest 64 KB of your code and place it in TCM using linker placement or section attributes. Then measure if performance improves significantly. This would confirm that memory access latency is the bottleneck.

    Regards,

    Anil.