TMS320F28075: Timing of Multiplication in C

Other Parts Discussed in Thread: TMS320F28075, CONTROLSUITE

Hi all,

I am using a TMS320F28075 running at a 120 MHz clock.

I am executing this single line of code on the ADC results:

ia = (AdcaResultRegs.ADCRESULT0) * 1.3;

Here ia is a float variable declared globally.

When I measure the execution time of this code by toggling a GPIO around it:

       GpioDataRegs.GPASET.bit.GPIO31 =1 ;//to profile code timing

       ia = (AdcaResultRegs.ADCRESULT0) * 1.3;

       GpioDataRegs.GPACLEAR.bit.GPIO31 =1 ;//to profile code timing

      

 

It takes 1 µs.

I have also enabled TMU support for the device (see the compiler flags below):

-v28 -ml -mt --cla_support=cla1 --float_support=fpu32 --tmu_support=tmu0 --fp_mode=relaxed --include_path="C:/ti/ccsv6/tools/compiler/ti-cgt-c2000_16.9.0.LTS/include" --include_path="D:/projects/background study software/freq variable/hw_files" --advice:performance=all -g --diag_warning=225 --diag_wrap=off --display_error_number

Here is the disassembly of the above code:

0821be:   E801FD30    MOVIZ        R0, #0x3fa6

0821c0:   761F002C    MOVW         DP, #0x2c

0821c2:   E2C40100    UI16TOF32    R1H, @0x0

0821c4:   E80B3330    MOVXI        R0H, #0x6666

0821c6:   E7000040    MPYF32       R0H, R0H, R1H

0821c8:   761F0242    MOVW         DP, #0x242

0821ca:   E2030020    MOV32        @0x20, R0H
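For reference, the two immediate loads in the listing appear to build the constant 1.3 in R0H: MOVIZ supplies the upper 16 bits (0x3FA6, with the lower half zeroed) and MOVXI fills in the lower 16 bits (0x6666), giving 0x3FA66666, which is the IEEE-754 single-precision pattern for 1.3. Below is a minimal host-side sketch in C to check that. The helper names combine_halves and bits_to_float are made up for illustration, and the code assumes a 32-bit IEEE-754 float.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: join the two 16-bit immediates the way the
 * MOVIZ (upper half) / MOVXI (lower half) pair fills R0H. */
uint32_t combine_halves(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
 * Assumes float is 32-bit IEEE-754 single precision. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);   /* copy the raw bits, no conversion */
    return f;
}
```

Calling bits_to_float(combine_halves(0x3FA6, 0x6666)) returns the same value as the literal 1.3f, so the multiply itself is a plain single-precision MPYF32 by 1.3.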

I don’t understand why it takes this long.

Multiplication is supposed to be a single-cycle instruction.

Can you please help me with this issue?

 

 

  • Keyur,

    You're basically saying that it is taking 120 clock cycles to execute the code (1 uS at 120 MHz is 120 cycles).  OK, so something is off here.  The instructions are basically single cycle as you said.  We need to just figure out what is going on.

    Are you sure you are running at 120 MHz?  Did you test this using XCLKOUT pin?

    Are you running from flash or RAM?  If flash, do you have the flash wait-states set?

    Try this: get rid of your multiply code, and place a bunch of NOPs in between the GPIO set and clear, like this:

           GpioDataRegs.GPASET.bit.GPIO31 =1 ;//to profile code timing

           asm(" RPT #198 || NOP");     // 200 cycles

           asm(" RPT #198 || NOP");     // 200 cycles

           asm(" RPT #198 || NOP");     // 200 cycles

           asm(" RPT #198 || NOP");     // 200 cycles

           asm(" RPT #198 || NOP");     // 200 cycles

           GpioDataRegs.GPACLEAR.bit.GPIO31 =1 ;//to profile code timing

    This should take roughly 1000 cycles to execute (plus a few more for the GPIO setting and clearing).  It will eliminate flash setting problems because the RPT will only have to fetch the NOP once.  If it is still too slow, that means your clocks are not 120 MHz.  If it is correct timing, then I'd look at the flash (and probably there is something else too as I don't think the flash wait-states alone would account for everything).

    Regards,

    David

  • Hi David

    Thanks for the response.

    1)      I checked the clock frequency on the XCLKOUT pin. It is 120 MHz.

    2)      I am running my code from flash. I have not copied it to RAM at power-on.

    3)      I ran your code and it takes 8.4 µs, which is about 1000 clock cycles at 120 MHz.

    4)      I tried running the code from RAM and the problem is solved! It runs around 10 times faster. The multiplication, which earlier took 1 µs, now takes only 95 ns. I knew that moving code to RAM makes execution faster, but I did not know it could make it 10 times faster.

    Regards,

    Keyur
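The cycle counts quoted so far all follow from the 120 MHz clock, at roughly 8.33 ns per cycle. A small host-side sketch in C of that arithmetic is below. The function names and the SYSCLK_HZ macro are invented here for illustration, and the 120 MHz value is taken from the posts above.

```c
/* SYSCLK frequency reported in the thread (120 MHz). */
#define SYSCLK_HZ 120000000.0

/* Hypothetical helper: convert a measured pulse width in nanoseconds
 * to the equivalent number of CPU clock cycles. */
double ns_to_cycles(double ns)
{
    return ns * 1e-9 * SYSCLK_HZ;
}

/* Hypothetical helper: convert a cycle count to nanoseconds. */
double cycles_to_ns(double cycles)
{
    return cycles / SYSCLK_HZ * 1e9;
}
```

With these, ns_to_cycles(1000.0) gives 120 cycles for the original 1 µs measurement, ns_to_cycles(8400.0) gives 1008 cycles for the RPT/NOP test, and ns_to_cycles(95.0) gives about 11.4 cycles for the run from RAM. In the other direction, cycles_to_ns(1000.0) is about 8333 ns, which matches the 8.4 µs reading once a few cycles for the GPIO writes are added.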

  • Hi David,

    I want to ask another question regarding copying the code to RAM.

    I have read this https://e2e.ti.com/blogs_/b/toolsinsider/archive/2015/12/11/helpful-tips-executing-code-from-ram and I am using it to move functions to RAM.

    I am using the linker command file (2807x_FLASH_CLA_lnk_cpu1.cmd) that is provided in controlSUITE at

    C:\ti\controlSUITE\device_support\F2807x\v210\F2807x_common\cmd

    In this linker file, both the CLA and the CPU RAM functions are placed in RAMLS4_LS5. But if we want to run CLA code from this RAM, we have to share the memory between the CLA and the CPU using

    MemCfgRegs.LSxMSEL.bit.MSEL_LS4 = 1;  

    This will result in an access arbitration problem, as per the technical reference manual (SPRUHM9B), section 2.11.1.5.

    So is it recommended to do this?
    Should I use the same RAM for the CLA and CPU programs?

    Or should I use separate RAM blocks for the two?

    Right now I am using different RAM blocks.

    Thanks

    Regards,

    Keyur

  • Keyur,

    On the original issue of running your multiply code, I suspect you do not have the flash wait-states configured properly.  Running from RAM vs. flash does not make that much of a difference.  At 120 MHz, you would likely not notice much difference at all given the flash pre-fetch buffer.

    On your new question of sharing LS RAM between the CPU and the CLA, if you intend to put CLA code in the memory then you cannot access that memory from the CPU.  The sharing is for data access from CPU and CLA.  See spruhm9b section 2.11.1.2: "CPU access to all memory blocks, which are programmed as CLA program memory, are blocked."  My suggestion would be to keep CPU and CLA code in separate blocks.

    Regards,

    David

  • Keyur:
    (From the Piccolo Multi-day workshop)
    FlashRegs.FOPT.bit.ENPIPE = 1;
    This is used to "enable the flash pipeline." For some reason, it is defaulted to 0. If it is not turned on, the flash will not run at a faster speed.

    Hopefully, this helps.
  • Hi David and Todd,

    Here is the flash configuration I am using.
    It is mostly taken from the device-specific demo code.

        Flash0CtrlRegs.FPAC1.bit.PMPPWR = 0x1;
        Flash0CtrlRegs.FBFALLBACK.bit.BNKPWR0 = 0x3;
    
        //
        // Disable Cache and prefetch mechanism before changing wait states
        //
        Flash0CtrlRegs.FRD_INTF_CTRL.bit.DATA_CACHE_EN = 0;
        Flash0CtrlRegs.FRD_INTF_CTRL.bit.PREFETCH_EN = 0;
    
        //
        // Set waitstates according to frequency
        //
        //      *CAUTION*
        // Minimum waitstates required for the flash operating at a given CPU rate
        // must be characterized by TI. Refer to the datasheet for the latest
        // information.
        //
        Flash0CtrlRegs.FRDCNTL.bit.RWAIT = 0x2;
     
        //
        // Enable Cache and prefetch mechanism to improve performance of code
        // executed from Flash.
        //
        Flash0CtrlRegs.FRD_INTF_CTRL.bit.DATA_CACHE_EN = 1;
        Flash0CtrlRegs.FRD_INTF_CTRL.bit.PREFETCH_EN = 1;
    
        //
        // At reset, ECC is enabled. If application software has disabled it,
        // this write enables it again.
        //
        Flash0EccRegs.ECC_ENABLE.bit.ENABLE = 0xA;
    

    I could not find anything like

     FlashRegs.FOPT.bit.ENPIPE = 1;

    in the datasheet or TRM of 28075.

  • Keyur,

    Todd was just remembering a bit incorrectly.  ENPIPE bit is on the older C2000 devices.  On F2807x, it was replaced with PREFETCH_EN bit.  So, you have the flash config code correct.

    Dumb question, but you did call this code, say function InitFlash(), in your code, right?

    Here's another test to run.  Run this code from FLASH:

       GpioDataRegs.GPASET.bit.GPIO31 =1 ;//to profile code timing
       asm(" NOP");   // 1
       asm(" NOP");   // 2
       asm(" NOP");   // 3
       asm(" NOP");   // 4
       asm(" NOP");   // 5
       asm(" NOP");   // 6
       asm(" NOP");   // 7
       asm(" NOP");   // 8
       asm(" NOP");   // 9
       asm(" NOP");   // 10
       asm(" NOP");   // 11
       asm(" NOP");   // 12
       asm(" NOP");   // 13
       asm(" NOP");   // 14
       asm(" NOP");   // 15
       asm(" NOP");   // 16
       asm(" NOP");   // 17
       asm(" NOP");   // 18
       asm(" NOP");   // 19
       asm(" NOP");   // 20
       GpioDataRegs.GPACLEAR.bit.GPIO31 =1 ;//to profile code timing

    At 120 MHz, this code should take roughly 0.2 uS to execute (rough, because we don't know what code the compiler will generate for the GPIO stuff).  What I'm doing here is inlining the NOPs rather than using RPT || NOP as before.  This will force fetching of each NOP from flash.

    - David

  • Thanks for correcting - sorry 'bout that. Why was the name changed on flash pipeline enable?
  • Hi Todd,

    Todd Anderson78572 said:
    Thanks for correcting - sorry 'bout that. Why was the name changed on flash pipeline enable?

    I suspect no deliberate reason.  The spec writer probably just used a different name.
    Regards,
    David
  • David:
    I think InitFlash( ) is probably the same - calling that function is key. I have actually seen customer code that did not have that call...
    Hi David and Todd,
    You two really are the “Gurus” of the community.
    InitFlash() was indeed the problem.
    Initially, when I measured the time from flash, this function was not being called.
    But when I moved the code to RAM, I also added the call. So when you asked me about the flash configuration, I gave you all the register details for the flash wait states, without noticing that the original flash measurement had been made without them.

    I consider this issue solved now.

    Sorry for the silly issue.

    Thanks for the help, both of you.
    Regards,
    Keyur
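To summarize the fix discussed above: the flash wait-state and prefetch setup only takes effect if InitFlash() is actually called before the code being timed. Below is a minimal sketch of that ordering in C. The InitFlash() body here is only a stand-in stub so that the sketch compiles off-target, and system_init() and flash_is_configured() are hypothetical names. On the device, the real InitFlash() from the device support files would be used instead, and as far as I know it is expected to run from RAM, so the RAM functions have to be copied first.

```c
#include <stdbool.h>

/* Stand-in flag for "flash wait states and prefetch have been set up". */
static bool flash_configured = false;

/* Stub standing in for the device support InitFlash(); on target the
 * real function programs RWAIT, PREFETCH_EN and DATA_CACHE_EN. */
void InitFlash(void)
{
    flash_configured = true;
}

/* Hypothetical startup routine: configure flash before any
 * time-critical code executes from it. */
void system_init(void)
{
    /* On target: copy RAM functions from flash to RAM here first. */
    InitFlash();
}

/* Hypothetical check, e.g. to guard a profiling run. */
bool flash_is_configured(void)
{
    return flash_configured;
}
```

Checking flash_is_configured() before a profiling run is one simple way to avoid comparing a RAM measurement made with the flash setup against a flash measurement made without it.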