TMS570LC4357: Fastest sin/cos code functions

Part Number: TMS570LC4357
Other Parts Discussed in Thread: TMS570LS3137, HALCOGEN, TMS570LS20206, RM46L852

I am trying to improve the performance of my motor-control application. I have the VFPU enabled as described in my previous thread, "How to enable the FPU". I am linking against the "ti_math_Cortex_R4_bspf.lib" library and calling the "arm_sin_f32" function, and this single sine calculation takes 149 cycles. Is that expected behavior?

We have a number of 28377 and 28388-based projects, where the sine/cosine take 3 or 4 cycles. I also believe my TMS570LS3137 application took ~30 cycles to perform a sine, so I am hoping I am linking to the wrong library, calling the wrong fast function, or missing a required define.

Also note that this application is built with HALCoGen (which I could provide). I have "CCS" and "FPU_PRESENT" defined in my project; CCS was required to compile, but FPU_PRESENT made no difference to the 149-cycle time.

Thanks,

Jim

  • Hi Jim,

    Please check the benchmark in Docs folder of CMSIS: C:\ti\Hercules\Cortex-R4 CMSIS DSP Library\1.0.0\Docs

    Cortex-R4 DSP Software Benchmarks.pdf

  • Thanks for the quick reply. There are 2 numbers in that data for sine: the fast-math sine takes 2747 cycles and the "controller" function sine takes 54. Do you know how I determine whether I am running fast-math or controller sine?

    That data is for a TMS570LS20206, and the datasheet states it is 1.66 DMIPS/MHz running at 160 MHz. This TMS570LC4357 has the same 1.66 DMIPS/MHz, but I am running the CPU at 300 MHz, so shouldn't my sine time be about 1/2 of the TMS570LS20206's because I am running the CPU almost twice as fast? If I can figure out how to call the controller sine, I should be at about 27 cycles (currently 149).

    Thanks,

    Jim

  • Sorry, I mixed up time and cycles in the last email -- that TMS570LS20206 runs a sine in 54 cycles using the controller function. My current time is 149 cycles, but it should be 54? How do I verify I am calling the "controller" sine function?

    Thanks,

    Jim

  • I found this TI app note for a Sitara, which also uses the ARM R5F. It runs the sine in 34 cycles, but it uses an R5F library. Will there be an R5F library update for this 4357 that might improve performance?

    Thanks

    Jim

     https://www.ti.com/lit/an/spracv1/spracv1.pdf?ts=1629415493134&ref_url=https%253A%252F%252Fwww.google.com%252F

  • Do you know how I determine whether I am running fast-math or controller sine?

    The "controller" function is arm_sin_cos_f32, which returns both the sine and the cosine for a theta input (theta is in degrees).

    The "fast math" functions, by contrast, are arm_sin_f32 and arm_cos_f32, which take inputs in radians.
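
    Because the two families use different angle units, switching from the fast math functions to the controller function needs a unit conversion. A minimal sketch of that conversion follows. The names rad_to_deg, sin_cos_deg_fn and sin_cos_from_rad are hypothetical helpers for illustration and are not part of CMSIS DSP; the function pointer type only mirrors the (theta, sine output, cosine output) shape of arm_sin_cos_f32, with plain float standing in for float32_t.

```c
#include <assert.h>
#include <stddef.h>

/* Degrees per radian (180/pi), written out so no math-library constant is needed. */
#define DEG_PER_RAD 57.29577951308232f

/* Assumed shape of the controller function: theta in degrees, two outputs.
 * On target, arm_sin_cos_f32 would be passed here (hypothetical usage). */
typedef void (*sin_cos_deg_fn)(float theta_deg, float *sin_out, float *cos_out);

/* Convert an angle in radians to degrees. */
static inline float rad_to_deg(float radians)
{
    return radians * DEG_PER_RAD;
}

/* Hypothetical wrapper: accepts radians, converts, then calls the
 * degree-based sine/cosine function supplied by the caller. */
static inline void sin_cos_from_rad(sin_cos_deg_fn fn, float theta_rad,
                                    float *sin_out, float *cos_out)
{
    assert(fn != NULL && sin_out != NULL && cos_out != NULL);
    fn(rad_to_deg(theta_rad), sin_out, cos_out);
}
```

    With this, rad_to_deg(1.2345f) gives roughly 70.7316 degrees, so an angle that was passed to arm_sin_f32 in radians can be reused with the controller function. Note that the multiply adds a few cycles of its own, so keeping the angle in degrees throughout the control loop may be preferable.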

  • We have a number of 28377 and 28388-based projects, and the sine/cosine take 3 or 4 cycles.

    Those C2000 devices have a Trigonometric Math Unit (TMU) which contains dedicated SINPUF32 and COSPUF32 instructions which are documented as taking 4p cycles.

    By contrast, I can't see any trigonometric hardware instructions for the VFPv3-D16 in a Cortex-R5F, so the functions have to be implemented in software.

  • My current time is 149 cycles, but it should be 54?

    The cycle count is a combination of the instruction execution time and the memory access time. The memory access time depends upon:

    a. The number of wait states, which can be different for FLASH and SRAM.

    b. For the Cortex-R5F based TMS570LC4357, whether the instruction and data caches are enabled.

    The Cortex-R4 DSP Software Benchmarks.pdf for the Cortex-R4F based TMS570LS20206 (i.e. no caches) doesn't state whether the timings were measured with the code under test in FLASH or SRAM.

    I created some timing tests for the following functions:

    • An "empty" test just to measure the overhead of reading the cycle counter
    • sinf and cosf from the TI compiler run-time standard library
    • arm_sin_f32 and arm_cos_f32 from the CMSIS DSP Fast Math functions
    • arm_sin_cos_f32 from the CMSIS DSP Controller functions
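
    The measurement pattern behind these tests can be sketched as follows. This is an illustration and not the code from the attached projects: cycle_reader_fn is a hypothetical callback that would wrap the device's free-running cycle counter on target, and measure_overhead, measure_corrected and corrected_cycles are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback returning a free-running cycle count. */
typedef uint32_t (*cycle_reader_fn)(void);
/* The code under test, wrapped in a function with no arguments. */
typedef void (*timed_fn)(void);

/* Subtract the counter-read overhead from a raw measurement, clamping at zero. */
static inline uint32_t corrected_cycles(uint32_t raw, uint32_t overhead)
{
    return (raw > overhead) ? (raw - overhead) : 0u;
}

/* The "empty" test: two back-to-back reads with nothing in between. */
static inline uint32_t measure_overhead(cycle_reader_fn read_cycles)
{
    assert(read_cycles != NULL);
    uint32_t start = read_cycles();
    uint32_t stop = read_cycles();
    return stop - start; /* unsigned subtraction tolerates a single wrap */
}

/* Time one call of fn and remove the previously measured overhead. */
static inline uint32_t measure_corrected(cycle_reader_fn read_cycles,
                                         timed_fn fn, uint32_t overhead)
{
    assert(read_cycles != NULL && fn != NULL);
    uint32_t start = read_cycles();
    fn();
    uint32_t stop = read_cycles();
    return corrected_cycles(stop - start, overhead);
}
```

    For example, a raw reading of 100 cycles with an overhead of 24 cycles corresponds to corrected_cycles(100u, 24u), which is 76.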

    The same test functions were run on:

    1. A Cortex-R4F based RM46L852 (i.e. no caches). GCLK and HCLK were both set to 220 MHz.
    2. A Cortex-R5F based TMS570LC4357. The instruction and data caches were enabled. GCLK (CPU clock) was 300 MHz. HCLK (peripheral bus clock) was 150 MHz.

    Each set of tests was run 3 times in succession. Where caches were used, this shows whether there is any variation between the first run, where the code had to be loaded into cache, and the subsequent runs, where the code could still be in the cache.

    The project for each device had two build configurations:

    • Debug_flash : All of the code was in flash
    • Debug_sram : The code being timed and its lookup tables were placed in SRAM, where access is faster than FLASH because SRAM has fewer wait states at the highest HCLK frequency

    The results from the RM46L852 test with the Debug_flash build configuration:

    Starting tests
    Cycle count overhead = 24
    sinf(1.234500)=0.943983, took 536 cycles
    cosf(1.234500)=0.329993, took 572 cycles
    arm_sin_f32(1.234500)=0.943983, took 118 cycles
    arm_cos_f32(1.234500)=0.329993, took 114 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 100 cycles
    Cycle count overhead = 24
    sinf(1.234500)=0.943983, took 531 cycles
    cosf(1.234500)=0.329993, took 572 cycles
    arm_sin_f32(1.234500)=0.943983, took 118 cycles
    arm_cos_f32(1.234500)=0.329993, took 114 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 100 cycles
    Cycle count overhead = 24
    sinf(1.234500)=0.943983, took 531 cycles
    cosf(1.234500)=0.329993, took 572 cycles
    arm_sin_f32(1.234500)=0.943983, took 118 cycles
    arm_cos_f32(1.234500)=0.329993, took 114 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 100 cycles

    The results from the RM46L852 test with the Debug_sram build configuration:

    Starting tests
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 444 cycles
    cosf(1.234500)=0.329993, took 490 cycles
    arm_sin_f32(1.234500)=0.943983, took 85 cycles
    arm_cos_f32(1.234500)=0.329993, took 85 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 54 cycles
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 437 cycles
    cosf(1.234500)=0.329993, took 490 cycles
    arm_sin_f32(1.234500)=0.943983, took 85 cycles
    arm_cos_f32(1.234500)=0.329993, took 85 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 54 cycles
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 437 cycles
    cosf(1.234500)=0.329993, took 490 cycles
    arm_sin_f32(1.234500)=0.943983, took 85 cycles
    arm_cos_f32(1.234500)=0.329993, took 85 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 54 cycles

    The results from the TMS570LC4357 with the Debug_flash configuration:

    Starting tests
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 687 cycles
    cosf(1.234500)=0.329993, took 777 cycles
    arm_sin_f32(1.234500)=0.943983, took 256 cycles
    arm_cos_f32(1.234500)=0.329993, took 181 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 176 cycles
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 465 cycles
    cosf(1.234500)=0.329993, took 514 cycles
    arm_sin_f32(1.234500)=0.943983, took 80 cycles
    arm_cos_f32(1.234500)=0.329993, took 80 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 50 cycles
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 445 cycles
    cosf(1.234500)=0.329993, took 493 cycles
    arm_sin_f32(1.234500)=0.943983, took 80 cycles
    arm_cos_f32(1.234500)=0.329993, took 80 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 50 cycles

    The results from the TMS570LC4357 with the Debug_sram configuration:

    Starting tests
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 561 cycles
    cosf(1.234500)=0.329993, took 606 cycles
    arm_sin_f32(1.234500)=0.943983, took 176 cycles
    arm_cos_f32(1.234500)=0.329993, took 170 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 100 cycles
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 450 cycles
    cosf(1.234500)=0.329993, took 507 cycles
    arm_sin_f32(1.234500)=0.943983, took 79 cycles
    arm_cos_f32(1.234500)=0.329993, took 80 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 57 cycles
    Cycle count overhead = 9
    sinf(1.234500)=0.943983, took 441 cycles
    cosf(1.234500)=0.329993, took 495 cycles
    arm_sin_f32(1.234500)=0.943983, took 79 cycles
    arm_cos_f32(1.234500)=0.329993, took 80 cycles
    arm_sin_cos_f32(70.731636)=0.943955,0.329983, took 50 cycles

    The following table shows the "corrected" number of cycles for the arm_sin_cos_f32() Controller function for the different combinations. The "correction" means subtracting the "cycle count overhead" from the arm_sin_cos_f32() results:

    Device        Build configuration  First iteration  Subsequent iterations
    RM46L852      Debug_flash          76 cycles        76 cycles
    TMS570LC4357  Debug_flash          167 cycles       41 cycles
    RM46L852      Debug_sram           45 cycles        45 cycles
    TMS570LC4357  Debug_sram           91 cycles        48 or 41 cycles

    On the RM46L852 all 3 test iterations took the same number of cycles. On the TMS570LC4357 the first iteration is the one where the code had to be loaded into cache; in the subsequent iterations the code could already be in cache.
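
    Since the two devices ran at different CPU clocks (GCLK of 220 MHz and 300 MHz), comparing execution time rather than cycle counts needs one more step. A small sketch of that arithmetic, assuming the cycle counter increments at the CPU clock (cycles_to_ns is a hypothetical helper name):

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a given clock in MHz.
 * Assumption: the counter ticks once per cycle of that clock. */
static inline double cycles_to_ns(unsigned cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)cycles * 1000.0 / clock_mhz;
}
```

    Under that assumption, 45 cycles at 220 MHz is about 204.5 ns, while 41 cycles at 300 MHz is about 136.7 ns.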

    This shows that on the TMS570LC4357, an iteration that has to load the code/data into cache takes more cycles than the same test on the RM46L852, which doesn't have a cache.

    Therefore, any timing measurements for a TMS570LC4357 need to consider if the code/data is cached or not.
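
    One way to make that distinction explicit is to report the first (cold) measurement separately from the later (warm) ones. The sketch below is a host-side illustration with hypothetical names (cache_timing_t, summarize_runs); it only post-processes an array of already corrected cycle counts and does not touch the cache itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of repeated timings of the same function. */
typedef struct {
    uint32_t cold;     /* first run: code/data may need loading into cache */
    uint32_t warm_min; /* fastest of the subsequent runs */
    uint32_t warm_max; /* slowest of the subsequent runs */
} cache_timing_t;

/* Split a series of cycle counts into the first run and the range of the rest.
 * Requires at least two measurements. */
static inline cache_timing_t summarize_runs(const uint32_t *cycles, size_t count)
{
    assert(cycles != NULL && count >= 2u);
    cache_timing_t t = { cycles[0], cycles[1], cycles[1] };
    for (size_t i = 2u; i < count; ++i) {
        if (cycles[i] < t.warm_min) { t.warm_min = cycles[i]; }
        if (cycles[i] > t.warm_max) { t.warm_max = cycles[i]; }
    }
    return t;
}
```

    Fed with the corrected TMS570LC4357 Debug_sram values {91, 48, 41}, this reports a cold run of 91 cycles and warm runs between 41 and 48 cycles. Which figure matters depends on whether the function is likely to stay in cache between calls in the real application.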

    The projects used are attached; they were built with CCS 10.4, HALCoGen 04.07.01 and TI ARM compiler v20.2.5.

    RM46L852_sin_cos.zip

    TMS570LC4357_sin_cos.zip

  • Thank you Chester; this should help; let me try to get my code showing this speed.

    Jim