AM2634: Benchmarking F28388D vs AM2634 for FOC

Part Number: AM2634
Other Parts Discussed in Thread: TMS320F28388D, AM263P4, C2000WARE

Hello,

I would like to compare the performance of TMS320F28388D and Sitara AM2634 for motor control applications.
The goal is to understand which device is better suited to:

  • Execute a Field-Oriented Control algorithm reliably in real time.

  • Still leave enough CPU headroom to run an additional complex algorithm in double precision.

I am looking for hints and best practices on how to set up such a benchmark.

  • Which parameters are most relevant to measure?

  • What kind of methodology would you recommend?

  • Are there any TI reference materials, existing benchmarks, or example projects that could help?

Any guidance from the community or TI experts would be very helpful.

Thanks in advance!

  • I have results comparing the two devices when running an AC Induction motor complete signal chain benchmark (with the FOC as its core). This tests not only CPU performance but also peripheral response and interrupt latency. It runs on one of the cores of each device.

    Note that the results are for the AM263P, which contains a TMU. The AM263 does not, so on this benchmark it would be much slower than the AM263P (by at least a couple hundred cycles), since the TMU is used for operations in the Park transform, the inverse Park transform, and several times in the flux estimator.
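To illustrate why a trigonometric accelerator matters here: both the Park and inverse Park transforms need the sine and cosine of the rotor angle on every control-loop iteration. The sketch below is plain Python, not the benchmark's C code. The function names are hypothetical, and it assumes the common convention in which the d-axis aligns with the alpha-axis at theta = 0.

```python
import math

def park(alpha, beta, theta):
    """Rotate stationary-frame (alpha, beta) into the rotating (d, q) frame."""
    s, c = math.sin(theta), math.cos(theta)  # the trig calls a TMU would accelerate
    d = alpha * c + beta * s
    q = -alpha * s + beta * c
    return d, q

def inverse_park(d, q, theta):
    """Rotate (d, q) back into the stationary (alpha, beta) frame."""
    s, c = math.sin(theta), math.cos(theta)
    alpha = d * c - q * s
    beta = d * s + q * c
    return alpha, beta

# Round trip at an arbitrary rotor angle (30 degrees, chosen for illustration).
theta = math.pi / 6
d, q = park(1.0, 0.0, theta)
alpha, beta = inverse_park(d, q, theta)
print(d, q, alpha, beta)
```

The round trip should return the original (alpha, beta) pair to within floating-point error, which is a quick sanity check when porting the transforms between devices.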

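    | MCU                       | AM263P                                                  | F2837x                               |
    |---------------------------|---------------------------------------------------------|--------------------------------------|
    | CPU                       | Cortex-R5F                                              | C28                                  |
    | CPU type                  | 8-stage pipeline, limited dual-issue, branch prediction | 8-stage pipeline, limited dual-issue |
    | CPU freq. (MHz)           | 400                                                     | 200                                  |
    | Accelerator               | TMU                                                     | TMU                                  |
    | Cycles                    | 705                                                     | 527                                  |
    | Perf. ratio (cycles)      | 1                                                       | 1.33                                 |
    | eMHz/Core (effective MHz) | 400                                                     | 267                                  |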
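The last two rows of the table can be reproduced from the cycle counts and clock frequencies. The sketch below assumes that the performance ratio is the AM263P cycle count divided by the device's cycle count, and that effective MHz is the clock frequency multiplied by that ratio. This interpretation is inferred from the numbers, not stated in the thread, and the helper names are hypothetical.

```python
# Figures taken from the table above.
AM263P = {"freq_mhz": 400, "cycles": 705}  # reference device (ratio = 1)
F2837X = {"freq_mhz": 200, "cycles": 527}

def perf_ratio(reference_cycles, device_cycles):
    """Per-cycle efficiency relative to the reference (assumed definition)."""
    return reference_cycles / device_cycles

def effective_mhz(freq_mhz, ratio):
    """Clock frequency scaled by per-cycle efficiency (assumed definition)."""
    return freq_mhz * ratio

ratio_c28 = perf_ratio(AM263P["cycles"], F2837X["cycles"])
emhz_c28 = effective_mhz(F2837X["freq_mhz"], ratio_c28)
emhz_r5f = effective_mhz(AM263P["freq_mhz"], 1.0)

print(f"C28 perf ratio: {ratio_c28:.4f}")
print(f"C28 effective MHz: {emhz_c28:.1f}")
print(f"R5F effective MHz: {emhz_r5f:.1f}")
```

Under these assumptions the computed values land slightly above the table's 1.33 and 267, which suggests the table entries were truncated rather than rounded.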

    Thanks,

    Sira

  • Hi Sira,

    Thanks a lot for sharing your results; they are very close to what I am looking for, and they have raised a few more questions in my mind. I would really appreciate it if you could help me with the following:

    1. I see you used an AM263P, which includes a TMU. I will definitely consider it for my benchmark, thanks! In your opinion, is the AM263P4 the best choice for motor control and real-time applications within the Sitara Arm-based microcontrollers? From the documentation, it looks like the AM263P4 is an improved version of the AM2634. Does it make sense to say that AM263P4 > AM2634 and therefore run benchmarks only on the P version?

    2. Are the F2837x results sufficient to compare the AM263P with the F28388? At the end of the day, we are mainly evaluating the C28x processor.

    3. How was the benchmark organized? Was the same code used across devices, and if so, how was it adapted?

    4. Considering the CPU frequencies, results show that the R5F was about 33% faster than the C28x. Did you also take into account the use of the FPU?

    5. I was thinking of using the CLA on the F28388D to run the FOC, offloading the C28x for other double-precision calculations. Would such a combination still be better than using the R5F for both the FOC ISR and the double-precision tasks? On the Sitara, is there a way to replicate an architecture similar to the C28x+CLA pair?

    6. Is the code you used for the benchmark shareable? It would be extremely helpful for me.

    7. How did you collect these results — specifically, how did you measure the clock cycles? Did you design a repeatable benchmark? Did you use a scheduler or simple interrupts to schedule the tasks?

    Thanks again for your time and support!

    Best regards,

    Pasquale

  • Hi Pasquale,

    Replies in-line below

    1. I see you used an AM263P, which includes a TMU. I will definitely consider it for my benchmark, thanks! In your opinion, is the AM263P4 the best choice for motor control and real-time applications within the Sitara Arm-based microcontrollers? From the documentation, it looks like the AM263P4 is an improved version of the AM2634. Does it make sense to say that AM263P4 > AM2634 and therefore run benchmarks only on the P version?

      The P version has a TMU, a hardware resolver, and a few other improvements. It definitely appears to be better for motor control. http://www.ti.com/lit/spradb3
    2. Are the F2837x results sufficient to compare the AM263P with the F28388? At the end of the day, we are mainly evaluating the C28x processor.

      Correct. I don't have results specifically for the F2838x, but they should be the same, at least when running from RAM. From Flash, there could be differences in wait states and so on that affect performance. My assumption is that real-time code like this will be run from RAM.
    3. How was the benchmark organized? Was the same code used across devices, and if so, how was it adapted?

      Yes, the same C code was developed and run on each device, with minor device-specific updates.
    4. Considering the CPU frequencies, results show that the R5F was about 33% faster than the C28x. Did you also take into account the use of the FPU?

      Yes, the FPU was used in both cases. Counting cycles alone, the C28x is about 33% faster (527 vs. 705 cycles). Once the CPU frequency is factored in (400 MHz vs. 200 MHz), the AM263P is about 50% faster.
    5. I was thinking of using the CLA on the F28388D to run the FOC, offloading the C28x for other double-precision calculations. Would such a combination still be better than using the R5F for both the FOC ISR and the double-precision tasks? On the Sitara, is there a way to replicate an architecture similar to the C28x+CLA pair?

      Having the CLA to offload control tasks certainly helps boost performance, since you effectively have an additional core compared to the AM263P. But then again, it depends on which specific part numbers are being compared. On the C2000 side, you might have 2 C28x cores and 2 CLAs, and on the AM263x you may have 1, 2, or even 4 R5F cores. On the Sitara side, there isn't a corresponding co-processor like the one we have on the C28 side. On this specific benchmark, the C28 is about 60% faster than the CLA; refer to http://www.ti.com/lit/spracw5
    6. Is the code you used for the benchmark shareable? It would be extremely helpful for me.

      Yes, it's in C2000Ware at examples\demos\benchmark\aci_motor_benchmark
    7. How did you collect these results — specifically, how did you measure the clock cycles? Did you design a repeatable benchmark? Did you use a scheduler or simple interrupts to schedule the tasks?

      The app note spracw5 above describes the benchmarking methodology.
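The cycle counts from the benchmark can also be turned into execution time and CPU load, which speaks to the headroom question. The sketch below uses the 705 and 527 cycle figures and the 400 MHz and 200 MHz clocks from the results above. The 20 kHz control-loop rate is a purely hypothetical assumption (no loop rate is given in the thread), and the load figure covers only this benchmark's signal chain, not a full application.

```python
def iteration_time_us(cycles, freq_mhz):
    """Cycles divided by MHz gives microseconds per iteration."""
    return cycles / freq_mhz

t_r5f = iteration_time_us(705, 400)  # AM263P, Cortex-R5F
t_c28 = iteration_time_us(527, 200)  # F2837x, C28

# Per-cycle advantage of the C28 versus wall-clock advantage of the R5F.
cycle_advantage_c28 = 705 / 527
time_advantage_r5f = t_c28 / t_r5f

# Hypothetical control-loop rate, chosen only to illustrate the headroom math.
LOOP_RATE_HZ = 20_000
period_us = 1e6 / LOOP_RATE_HZ
load_r5f = t_r5f / period_us
load_c28 = t_c28 / period_us

print(f"R5F: {t_r5f:.4f} us per iteration, load {load_r5f:.2%}")
print(f"C28: {t_c28:.4f} us per iteration, load {load_c28:.2%}")
print(f"C28 cycle advantage: {cycle_advantage_c28:.2f}x")
print(f"R5F time advantage:  {time_advantage_r5f:.2f}x")
```

Whatever fraction of the loop period is not consumed by the control loop is what remains for the double-precision task, so the same arithmetic can be repeated with measured cycle counts and the real loop rate of the target application.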
  • Hi Sira,

    Thanks a lot for all your answers!

    I started from this quote from http://www.ti.com/lit/spracw5: “Many software benchmarks only focus on the processing aspect typically expressed in million instructions per second (MIPS), without full regard for the interaction between peripherals, CPU, and co-processors. Such benchmarks do not provide a full view of the real-time performance capabilities of a system.” With that in mind, I defined five candidate configurations to compare for a motor-control use case (FOC real-time loop + a separate, computationally heavy algorithm that ideally runs in double precision):

    1. TMS320F28388D: C28 standalone (FPU + TMU) running both tasks

    2. TMS320F28388D: C28 + CLA (C28 = trajectory; CLA = FOC)

    3. Sitara AM263P4: single R5F running both tasks

    4. Sitara AM263P4: R5FSS0_0 = trajectory; R5FSS0_1 = FOC (same cluster)

    5. Sitara AM263P4: R5FSS0_0 = trajectory; R5FSS1_0 = FOC (different clusters)

    Questions: 

    1. From an architecture and real-time motor-control perspective, which of these configurations would you consider the best choice and why?
    2. Does it make sense to run a formal benchmark across all five configurations, or is there a configuration that’s clearly preferable (making the full benchmark unnecessary)?

    I tried to answer them on my own, and here is a summary of what I found:

    Your feedback would be extremely appreciated! 

    Best,

    Pasquale

  • Hi Pasquale,

    Your analysis table with Pros and Cons seems fairly thorough. I don't think I have a whole lot to add to it, other than to suggest that on the Sitara you can even run the FOC code from TCM, so you get zero-wait-state execution.

    If your decision rests solely on performance, then in this case I think the Sitara wins, due to its 400 MHz clock frequency. I would try option 4 and see if everything fits in the TCM, so I don't have to deal with cache and so on.

    Thanks,

    Sira