
Cycle Count different in Simulation mode Vs Actual target - C6455

I tried to measure how long a function takes to execute in the simulator and on the actual target. In the simulator it took 2.5 milliseconds, whereas on the actual target it took 3.4 milliseconds. In the simulator, L2 memory was used (no caching). On the target, HWIs were disabled during function execution and L1 caching was enabled. I still cannot understand why the target run was slower by 900 microseconds.

  • Hi Sureshbabu,

    When you say simulator mode, is it through CCS?

    Could you describe how you measured the time taken? Did you use the profile clock in CCS?

    Did you use a timer or a GPIO to measure the time taken on the target?

  • Hi Shankari,

    Simulator mode: TI DSP/BIOS clock. I took a snapshot of the clock before and after the function and computed the difference. Clock cycles: 2105987
    Target: TSCL/TSCH (64-bit free-running counter). Similarly, I took a snapshot before and after the function. Clock cycles: 2792287

    Thanks,
    Suresh
  • What if you use TSCL/TSCH in the simulator? Do you get consistent results?

  • I tried TSCH/TSCL in the simulator, and the result is of the same order as before, when I was using the BIOS clock function. The results are repeatable on both the simulator and the target, and I consistently see a huge clock cycle difference between the two.
  • Which simulator are you using? I assume you have CCSv5, since the simulators are not supported in CCSv6. I ask because we typically had two versions of each simulator. One, which we called the functional simulator, gives correct results but is not necessarily cycle accurate; it runs faster and leads to lower simulation times. The other, which we call cycle approximate, works harder to be cycle accurate but takes longer to run.
  • I used the CCSv4 "C64x+ CPU Cycle Accurate Simulator" and got these results.
  • Hmm, there are several things I can think of. Given it's CCSv4, maybe that older version of the simulator is not as accurate as what might be in CCSv5. I can't say for sure what simulator updates there might be with CCSv5 though. You could install CCSv5 to get the latest simulator and try that.

    Usually the most significant difference between actual hardware and the simulator is the memory architecture. The more layers of memory buses and caches you have, the harder it is to model them accurately without taking forever to do so. For example, running your code out of DDR with L1 and L2 cache enabled won't always give perfect results. However, I wouldn't expect it to be as far off as you are reporting. Besides, you said you're running out of L2, which should give pretty good results.

    Another thing that can interfere is interrupts, but you mentioned that you disable global interrupts during your function measurement.

    One question I have is, are you running the SAME code load on both the hardware and the simulator?

  • My intent is to provide some additional details on this issue, since Suresh and I have been working on it for the past month. The analysis started from the question of how many instances of a specific protocol we would be able to fit within a given time frame. When we worked on this module (the most MIPS-intensive portion of the demodulator) back in 2012, we estimated its execution time at ~2.5ms. At that time, we did not specifically check that the timing on the actual target was the same. The performance requirement was ~5ms; the module ran within that time frame, so it was never an issue.

    The DSP is clocked at 831.6MHz, but we do not use any of the internal timers to trigger the HWIs. There is an independent clock that runs at 55.44MHz in an FPGA. That was how we converted the cycle counts that Suresh provided to ~2.5ms, and that compared with the FPGA ticks we were capturing. Now that we have tighter constraints on DSP performance (even though we will move to a 66X-based processor clocked at 1GHz), we wanted to benchmark our performance on a 64X+ because (a) that was what we had at our disposal and (b) the performance estimate on the 64X+ gives us high confidence in what is achievable on the 66X.
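
    As a rough illustration of the conversion described above, the sketch below turns a cycle count into milliseconds and into FPGA ticks, using the two clock rates quoted in this post. The function and macro names are hypothetical, chosen only for this example; they are not taken from the project's code.

    ```c
    #include <stdint.h>

    /* Clock rates quoted in the thread. */
    #define DSP_CLOCK_HZ  831600000.0   /* 831.6 MHz DSP core clock      */
    #define FPGA_CLOCK_HZ  55440000.0   /* 55.44 MHz FPGA reference clock */

    /* Convert a CPU cycle count to milliseconds at the given clock rate. */
    double cycles_to_ms(uint64_t cycles, double clock_hz)
    {
        return (double)cycles * 1000.0 / clock_hz;
    }

    /* Convert DSP cycles to FPGA ticks.
     * 831.6 MHz / 55.44 MHz = 15 DSP cycles per FPGA tick. */
    double cycles_to_fpga_ticks(uint64_t cycles)
    {
        return (double)cycles * FPGA_CLOCK_HZ / DSP_CLOCK_HZ;
    }
    ```

    With the counts reported earlier in the thread, cycles_to_ms(2105987, DSP_CLOCK_HZ) comes to roughly 2.53 ms and cycles_to_ms(2792287, DSP_CLOCK_HZ) to roughly 3.36 ms, which matches the 2.5 ms and 3.4 ms figures quoted in the original question.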

    Now for the investigation itself:

    Initially, I had made a comment to Suresh stating that it was around 2.5ms, based on my memory from the previous platform. Since he had the Base Station to test with, he went ahead and tested it on the target directly, and we noticed that it was coming in at around 3.7ms (based on the FPGA clock, which we have independently verified). We subtracted about 200us (based on information we had for all the higher-priority tasks/SWIs and HWIs) and estimated that the time was 3.5ms. We subsequently disabled HWIs when we entered the function (our watchdog is far enough out that we wouldn't hit it before getting the info we needed) and it came out to ~3.5ms. We suspected that the DSP might not be clocked at 831.6MHz, since this clock is a multiple of 55.44. This was the reason why Suresh took a snapshot of TSCL, and we verified that this was not the case and the DSP was indeed being clocked at 831.6MHz (we did other verifications as well, including independently looking at the 55.44 clock and corroborating with a wait timer we have that depends on TSCL).

    Every way we have checked, the time on the hardware is higher by 25%, and that is really the concern. If it had been off by hundreds of cycles, we wouldn't have been concerned. It was, as stated, 900us off. I am not sure if this is a function of CCSv4, especially given how far out it is.

    Again, we would be more than happy to share some code and configurations under NDA (I will need to double-check that we are allowed to, but my prior experience has been that it would be okay), but that would need to occur outside this forum.

  • I have a couple comments.

    I suggest you always use the C6x core time-stamp counter, TSCL/TSCH, for both simulator and hardware. Of all methods of measuring performance, the time-stamp counter will provide the best accuracy and it's consistent between HW and simulator.
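
    One way to keep the measurement method identical on both setups is to wrap the counter reads in a single helper and subtract the cost of the reads themselves. The sketch below is a minimal, host-runnable illustration of that idea. All names are hypothetical, and the counter is a fake stand-in; on a C64x+ target the cycle source would instead read TSCL first and then TSCH and combine the two halves.

    ```c
    #include <stdint.h>

    /* Hypothetical hook returning the current cycle count. On a C64x+ target
     * it would read TSCL, then TSCH, and combine them into a 64-bit value. */
    typedef uint64_t (*cycle_source_fn)(void);

    /* Host-side stand-in for the counter: every read costs 3 "cycles". */
    static uint64_t fake_cycles;
    uint64_t fake_now(void) { fake_cycles += 3; return fake_cycles; }

    /* Stand-in workload that consumes exactly 1000 "cycles". */
    void fake_work(void) { fake_cycles += 1000; }

    /* Cost of two back-to-back counter reads, i.e. the measurement overhead. */
    uint64_t timestamp_overhead(cycle_source_fn now)
    {
        uint64_t t0 = now();
        uint64_t t1 = now();
        return t1 - t0;
    }

    /* Time one call of fn and subtract the counter-read overhead. */
    uint64_t measure_cycles(void (*fn)(void), cycle_source_fn now)
    {
        uint64_t overhead = timestamp_overhead(now);
        uint64_t t0 = now();
        fn();
        uint64_t t1 = now();
        uint64_t dt = t1 - t0;
        return (dt > overhead) ? (dt - overhead) : 0;
    }
    ```

    Calling measure_cycles(fake_work, fake_now) reports the 1000 cycles of the stand-in workload with the read overhead removed. Using the same helper in the simulator and on hardware removes the measurement method itself as a source of difference.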

    I get the feeling we're not comparing apples to apples between HW and simulator. You mention subtracting other tasks and SWIs; how accurate is this? I come back to the question I raised in my last post: are you running the same code executable on both simulator and HW? It would seem there are other system factors coming into play in your hardware setup.


    Do you guys have a C6455 EVM that you can do similar measurements on? I assume you have some left over from when your project began before you had your own hardware. If you have some code you could share with me (offline)  I'd be willing to run it on both the CCS simulator and on my C6455 EVM and do the comparison. Ideally this would be a simple CCS project that I can rebuild and the code should not have any special dependencies on hardware.

    If you ultimately plan to use the C66x DSP core, then I suggest installing CCSv5 to get the latest simulators, including support for C66x devices. Then, when you acquire a C66x DSP-based EVM (if you haven't already done so), you can do similar comparisons. I'm pretty confident you'll get better performance results with the C66x. I'm not sure how much time I'd spend on the C64x.

  • I would like to summarize what I have mentioned earlier, since it may make future discussion easier:

    Problem:
    The execution time of a module differs by 686300 cycles between the simulator and the target.

    Environment:
    Simulator mode: Used TSCH/TSCL. Took a snapshot of the clock before and after the function and computed the difference. Clock cycles: 2105987 (I got the same result with the BIOS clock as well).
    Target: Used TSCL/TSCH (64-bit free-running counter). Similarly, took a snapshot before and after the function. Clock cycles: 2792287.

    In the simulator, L2 memory was used (no caching). On the actual target, HWIs were disabled during function execution and L1 caching was enabled.

    I think it is a fair assumption that I am trying to compare apples to apples here.
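
    For reference, the size of the gap depends on which run is taken as the baseline. The short sketch below, with hypothetical names, works that out from the two counts in the summary above.

    ```c
    #include <stdint.h>

    /* Cycle counts reported in the summary. */
    #define SIM_CYCLES    2105987ULL
    #define TARGET_CYCLES 2792287ULL

    /* Express a cycle delta as a percentage of a chosen reference count. */
    double percent_of(uint64_t delta, uint64_t reference)
    {
        return 100.0 * (double)delta / (double)reference;
    }

    /* Absolute gap between the target and simulator runs, in cycles. */
    uint64_t cycle_gap(void)
    {
        return TARGET_CYCLES - SIM_CYCLES;
    }
    ```

    The gap is 686300 cycles, which is about 24.6% of the target count (the "25%" mentioned earlier in the thread) or about 32.6% of the simulator count.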

  • Are the cache settings identical between HW and simulator? L1D, L1P, and L2? Do you also disable HWI during function execution in simulator mode?
  • FYI, I moved this to the CCS forum so we can get inputs from that team too.
  • I’ve been wondering about this 25% cycle difference between the simulator and HW so I decided to go ahead and create a test application and run it on both a C6455 CCS simulator and on a C6455 DSK hardware platform. I used CCS v5.5.0. I ran the same exact .out file on both, same cache settings for L1D, L1P, and L2, all set for maximum cache size. Global interrupts disabled. All code and data is linked into L2 SRAM. I added in a few DSPLIB calls (FIR filter and matrix multiply) to add some realistic computational loading, also threw in a couple QDMA background DMA operations, and finally, performed a few L1D cache writeback and L1D cache invalidate operations into the mix.

     

    I took the measurements using TSCL and TSCH. The test function is run once and measured, then run again repeatedly in a loop to get an average. The first run should have a cache penalty, but after that everything should run out of cache for the average loop. The data collected does indeed show a difference between the first call of the test function and the looped average. The test harness app (main.c) is included below after the results data. I ran the test five times on each setup for consistency. I used printf() calls to show the results in the CCS console window.

     

    Based on the data I collected, the two setups are within 1% of each other. So either your code is touching on something my test app isn't, or there are other factors coming into play, and I suspect it's differences between the code you're running on the simulator and the code you're running on hardware. If we can narrow down those differences and eliminate them, I suspect you will get much closer results.

    I ran my test code on CCSv5 and the simulator that comes with it. You are using the simulator that comes with CCSv4, so there could be differences. I'm going to attach my CCS project in a follow-up post so that you can run my .out file (as-is, without even rebuilding it, as there could be slight compiler differences) on your simulator and see what results you get.

     

    Used the same GEL file in both environments:

    ..\..\emulation\boards\dsk6455\gel\DSK6455.gel

     

    RESULTS:

    C6455 Simulator
    Build date/time May 18 2016 14:52:01:   dt_first 1992690,   dt_average 1792008
    Build date/time May 18 2016 14:52:01:   dt_first 1992690,   dt_average 1792008
    Build date/time May 18 2016 14:52:01:   dt_first 1992690,   dt_average 1792008
    Build date/time May 18 2016 14:52:01:   dt_first 1992690,   dt_average 1792008

    C6455 DSK
    Build date/time May 18 2016 14:52:01:   dt_first 2010395,   dt_average 1807525
    Build date/time May 18 2016 14:52:01:   dt_first 2010397,   dt_average 1807525
    Build date/time May 18 2016 14:52:01:   dt_first 2010399,   dt_average 1807525
    Build date/time May 18 2016 14:52:01:   dt_first 2010395,   dt_average 1807525

    1807525 - 1792008 = 15517 cycle difference
    15517 / 1807525 * 100% = 0.86% error
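
    A side note on reading dt_average: in the main.c listing in this post, the averaging loop appears to run nine times (dt_count = 1 to 9) while the total is divided by the exit value of dt_count, which is 10. Assuming that reading is right, the printed averages are 9/10 of the mean per-call time. The sketch below, with hypothetical helper names, rescales the averages and recomputes the simulator-versus-hardware difference.

    ```c
    #include <stdint.h>

    /* Rescale an average computed as total / divisor when the total actually
     * holds `iterations` samples. */
    double rescale_average(uint64_t reported_avg, unsigned divisor, unsigned iterations)
    {
        return (double)reported_avg * (double)divisor / (double)iterations;
    }

    /* Difference between a and b as a percentage of b. */
    double percent_diff(double a, double b)
    {
        return 100.0 * (b - a) / b;
    }
    ```

    Under that assumption, rescale_average(1792008, 10, 9) gives about 1991120 cycles for the simulator and rescale_average(1807525, 10, 9) gives about 2008361 cycles for the DSK, both within roughly 2000 cycles of the corresponding dt_first values. Because both averages are scaled by the same factor, percent_diff still comes out at about 0.86%, so the comparison between the two setups is unaffected.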

     

     

     

    main.c

     

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <c6x.h>

    #include "DSP_fir_gen.h"
    #include "DSP_mat_mul_cplx.h"

    #define EDMA_BASE (0x02A00000) // C6455

    // 32-bit memory-mapped register access (parenthesized so it is safe in expressions)
    #define REG32(addr) (*(volatile uint32_t *)(addr))

    // input, output, and coefficient buffers for the DSPLIB calls
    short x_array[5000];
    short y_array[5000];
    short h_array[5000];
    short r_array[5000];

    // source and destination buffers for the QDMA transfers
    volatile short qdma_src_array[5000];
    volatile short qdma_dst_array[5000];

    void doQdma(void);
    uint64_t doFunctionUnderTest(void);

     

    int main(void) {
        volatile uint32_t junk;
        volatile uint64_t dt_first, dt_count, dt_total, dt_average;

        // disable global interrupts, enable DCC, enable PCC
        CSR = 0x00000000;

        // setup L2 cache
        REG32(0x01840000 /* L2CFG */) = 0x00000007; // set maximum L2 cache size
        junk = REG32(0x01840000 /* L2CFG */); // read it back, ensures completion

        // setup L1P cache
        REG32(0x01840020 /* L1PCFG */) = 0x00000007; // set maximum L1P cache size
        junk = REG32(0x01840020 /* L1PCFG */); // read it back, ensures completion

        // setup L1D cache
        REG32(0x01840040 /* L1DCFG */) = 0x00000007; // set maximum L1D cache size
        junk = REG32(0x01840040 /* L1DCFG */); // read it back, ensures completion

        // start the timestamp counter (any write to TSCL starts it)
        TSCL = 0x00000000;

        // call the function the first time
        dt_first = doFunctionUnderTest();

        // call the function repeatedly in a loop and take an average time
        // (the body runs 9 times, dt_count = 1..9, and dt_count is 10 on exit,
        // so dt_average is 9/10 of the mean per-call time)
        dt_total = 0;
        for (dt_count = 1; dt_count < 10; dt_count++) {
            dt_total += doFunctionUnderTest();
        }
        dt_average = dt_total / dt_count;

        // print the results
        printf("Build date/time %s %s:   ", __DATE__, __TIME__);
        printf("dt_first %llu,   ", (unsigned long long)dt_first);
        printf("dt_average %llu", (unsigned long long)dt_average);
        printf("\n");
        fflush(stdout);

        return 0;
    }

     

     

    uint64_t doFunctionUnderTest(void) {
        volatile uint32_t tscl, tsch;
        volatile uint64_t t0, t1, dt;

        // grab initial timestamp (read TSCL first, then TSCH)
        tscl = TSCL;
        tsch = TSCH;
        t0 = _itoll(tsch, tscl);

        // writeback L1D QDMA_SRC buffer (word count is in 32-bit words)
        REG32(0x01844040 /* L1DWBAR */) = (uint32_t)qdma_src_array;
        REG32(0x01844044 /* L1DWWC */) = sizeof(qdma_src_array) / 4;
        while (REG32(0x01844044 /* L1DWWC */)) { asm(" NOP"); }

        // kick off a QDMA transfer to add stimulus
        doQdma();

        // call some heavy lifting DSPLIB functions
        DSP_fir_gen(x_array, h_array, r_array, 100, 100);
        DSP_mat_mul_cplx(x_array, 30, 30, y_array, 30, r_array, 4);

        // invalidate L1D QDMA_DST buffer
        REG32(0x01844048 /* L1DIBAR */) = (uint32_t)qdma_dst_array;
        REG32(0x0184404C /* L1DIWC */) = sizeof(qdma_dst_array) / 4;
        while (REG32(0x0184404C /* L1DIWC */)) { asm(" NOP"); }

        // writeback L1D QDMA_SRC buffer
        REG32(0x01844040 /* L1DWBAR */) = (uint32_t)qdma_src_array;
        REG32(0x01844044 /* L1DWWC */) = sizeof(qdma_src_array) / 4;
        while (REG32(0x01844044 /* L1DWWC */)) { asm(" NOP"); }

        // kick off a QDMA transfer to add stimulus
        doQdma();

        // call some heavy lifting DSPLIB functions
        DSP_fir_gen(y_array, h_array, r_array, 100, 100);
        DSP_mat_mul_cplx(x_array, 30, 30, y_array, 30, r_array, 4);

        // invalidate L1D QDMA_DST buffer
        REG32(0x01844048 /* L1DIBAR */) = (uint32_t)qdma_dst_array;
        REG32(0x0184404C /* L1DIWC */) = sizeof(qdma_dst_array) / 4;
        while (REG32(0x0184404C /* L1DIWC */)) { asm(" NOP"); }

        // take final timestamp
        tscl = TSCL;
        tsch = TSCH;
        t1 = _itoll(tsch, tscl);

        // compute the delta time in cycles
        dt = t1 - t0;

        return dt;
    }

     

    void doQdma(void) {
        // setup then kick off a QDMA transfer using QDMA channel 0 and PaRAM set #0
        REG32(EDMA_BASE+0x1088 /* QEECR           */) = 0x00000001; // disable the QDMA event
        REG32(EDMA_BASE+0x0200 /* QCHMAP0         */) = 0x00000000; // map QDMA channel to PaRAM set #0 and set trigger word to 0 (OPT)

        // setup the PaRAM, static A-sync 1D transfer
        REG32(EDMA_BASE+0x4000 /* OPT             */) = 0x00000008;
        REG32(EDMA_BASE+0x4004 /* SRC             */) = (uint32_t)qdma_src_array;
        REG32(EDMA_BASE+0x4008 /* BCNT:ACNT       */) = 0x00010400; // BCNT = 1, ACNT = 0x400 (1024 bytes)
        REG32(EDMA_BASE+0x400C /* DST             */) = (uint32_t)qdma_dst_array;
        REG32(EDMA_BASE+0x4010 /* DSTBIDX:SRCBIDX */) = 0x00000000;
        REG32(EDMA_BASE+0x4014 /* BCNTRLD:LINK    */) = 0x00000000;
        REG32(EDMA_BASE+0x4018 /* DSTCIDX:SRCCIDX */) = 0x00000000;
        REG32(EDMA_BASE+0x401C /* RSVD:CCNT       */) = 0x00000001;

        REG32(EDMA_BASE+0x108C /* QEESR           */) = 0x00000001; // enable the QDMA event
        REG32(EDMA_BASE+0x4000 /* OPT             */) = 0x00000008; // write the trigger word, kick off the QDMA
    }
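
    To make the PaRAM count word in doQdma() easier to read, the sketch below unpacks a BCNT:ACNT value into its two 16-bit fields and computes the total number of bytes the parameter set describes. The helper names are hypothetical and exist only for this illustration.

    ```c
    #include <stdint.h>

    /* Low half-word of the PaRAM word at offset 0x8: ACNT, bytes per array. */
    uint32_t param_acnt(uint32_t bcnt_acnt)
    {
        return bcnt_acnt & 0xFFFFu;
    }

    /* High half-word of the same PaRAM word: BCNT, arrays per frame. */
    uint32_t param_bcnt(uint32_t bcnt_acnt)
    {
        return (bcnt_acnt >> 16) & 0xFFFFu;
    }

    /* Total bytes described by the parameter set: ACNT * BCNT * CCNT. */
    uint32_t param_total_bytes(uint32_t bcnt_acnt, uint32_t ccnt)
    {
        return param_acnt(bcnt_acnt) * param_bcnt(bcnt_acnt) * ccnt;
    }
    ```

    For the value written in doQdma(), 0x00010400 decodes to ACNT = 1024 and BCNT = 1, and with CCNT = 1 each QDMA moves 1024 bytes, which is only the first part of the 5000-element short buffers declared at the top of main.c.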