TMS320C6748: CPU speed

Part Number: TMS320C6748

Hello,

I tried to measure the CPU speed by running the code below and measuring the time it takes to reach the breakpoint (the while(1) at the end).

#include <c6x.h>   /* TSCL/TSCH time stamp counter registers (TI compiler) */

void main(void)
{
    volatile unsigned int dlyCount, counter = 0;
    unsigned long long ullTimeStart, ullTimeDiff, ullTimeEnd;

    TSCL = 0;                                   /* writing TSCL starts the counter */
    ullTimeStart  = TSCL;                       /* read low half first, then high  */
    ullTimeStart += (unsigned long long)TSCH << 32;

    while (1)
    {
        counter++;

        for (dlyCount = 0; dlyCount < 100000; dlyCount++)
            ;

        if (counter > 10000)
            break;

        ullTimeEnd  = TSCL;
        ullTimeEnd += (unsigned long long)TSCH << 32;
        ullTimeDiff = ullTimeEnd - ullTimeStart;
    }

    while (1);                                  /* breakpoint */
}

The time it takes to reach the breakpoint is 64 seconds, and ullTimeDiff = 19,000,620,000.

I run this code on a C6748 LCDK board and use the LCDK GEL file, which configures the CPU clock to 300 MHz.

When I look at the assembly code, the for loop appears to be executed in a single instruction. This would mean that 100000*10000 = 1,000,000,000 iterations should take 1,000,000,000 * 3.3 ns = 3.3 s; instead it takes 64 s, about 20 times longer.

The ullTimeDiff seems to be correct.

What can be the reason for this delay?

  • Hi Ilan,

    I've forwarded this to the software experts. Their feedback should be posted here.

    BR
    Tsvetolin Shulev
  • You are computing the cycle count incorrectly. Run the code from DSP L2 RAM and put the ullTimeStart read inside the while loop, just above the for loop.

    If you run code from shared RAM or DDR memory, then you need to configure the L1 and L2 caches. For simple benchmarking like this, we recommend moving all code and data into DSP L2 memory so that cache latency or external memory access latency is not part of the core benchmark.
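    For example, a minimal sketch of that arrangement (a hedged illustration, not TI reference code: readTsc and benchmarkLoop are placeholder names, c6x.h provides TSCL/TSCH with the TI compiler, and placing the function and data in DSPL2RAM is done in the linker .cmd file), timing only the for loop itself:

    #include <c6x.h>   /* TSCL/TSCH control registers (TI compiler) */

    volatile unsigned int dlyCount;

    /* Read the 64-bit time stamp counter: low half first, then high half. */
    static unsigned long long readTsc(void)
    {
        unsigned long long lo = TSCL;
        unsigned long long hi = TSCH;
        return (hi << 32) | lo;
    }

    void benchmarkLoop(void)
    {
        unsigned long long start, cycles;

        TSCL  = 0;                     /* writing TSCL starts the counter   */
        start = readTsc();             /* timestamp taken just before loop  */

        for (dlyCount = 0; dlyCount < 100000; dlyCount++)
            ;

        cycles = readTsc() - start;    /* cycles spent in the loop only     */
        (void)cycles;                  /* inspect in the debugger           */
    }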

    For reference, here are the core benchmarks for this device and how they were captured:
    www.ti.com/.../core-benchmarks.page

    Regards,
    Rahul
  • Ilan,

    There is a Wiki article with a sample program to help you measure the DSP clock speed by comparing it with a wall clock. The article is "What is my DSP clock speed", and it includes a zip file that you can download and import. Even though it was tested on CCSv4 and CCSv5, it should easily work on CCSv6 and CCSv7. If you have any trouble importing it, just use the unzipped folder and pull out the source files, including the .cmd. You might need to make changes for your processor, but it should work with any C64x+ or later DSP.
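    The gist of that measurement (a rough sketch only; the article's sample code differs in detail) is to let the time stamp counter free-run while you time a known interval against a wall clock, then divide:

    #include <c6x.h>

    void main(void)
    {
        unsigned long long start;

        TSCL   = 0;                               /* start the free-running counter */
        start  = TSCL;
        start |= (unsigned long long)TSCH << 32;

        /* Set a breakpoint on the while below: when it hits, start a stopwatch,
           resume, then halt manually after roughly 10 s of wall-clock time and
           read TSCL/TSCH in the debugger.
           CPU frequency ~= (end - start) / elapsed wall-clock seconds.          */
        while (1)
            ;
    }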

    Someone wrote a better one that uses wall-clock APIs to get it figured out, but I am not sure where that is.

    Regards,
    RandyP

  • Hi

    The code is located in L2 RAM and gives the same results. The article "What is my DSP clock speed" uses TSCL to measure the time, but in my test the value I get from TSCL is correct, which means the CPU is running at 300 MHz as expected. Still, this simple loop code takes 20 times longer than expected. What can be the reason for this?

    I use the timestamp to measure the time for the 1,000,000*1,000 loop (ullTimeStart is before the loop and ullTimeEnd is after the loop), and it gives the correct value:

    19,019,033,046 cycles / 64 s is roughly 300 MHz, i.e. ~3.3 ns per cycle. But for a loop of 1,000,000,000 iterations I would expect 3.3 ns * 1,000,000,000 = 3.3 s, not 64 s.

    I made another test without defining dlyCount and counter as volatile, and I got a 30 s delay.

    void main(void)
    {
        volatile unsigned int dlyCount, counter = 0;
        unsigned long long ullTimeStart, ullTimeDiff, ullTimeEnd;

        TSCL = 0;
        ullTimeStart  = TSCL;
        ullTimeStart += (unsigned long long)TSCH << 32;

        while (1)
        {
            counter++;

            for (dlyCount = 0; dlyCount < 1000000; dlyCount++)
                ;

            if (counter > 1000)
                break;
        }

        ullTimeEnd  = TSCL;
        ullTimeEnd += (unsigned long long)TSCH << 32;
        ullTimeDiff = ullTimeEnd - ullTimeStart;

        while (1);   // breakpoint
    }

    name origin length used unused attr fill
    ---------------------- -------- --------- -------- -------- ---- --------
    DSPL2ROM 00700000 00100000 00000000 00100000 RWIX
    DSPL2RAM 00800000 00040000 0000106c 0003ef94 RWIX
    DSPL1PRAM 00e00000 00008000 00000000 00008000 RWIX
    DSPL1DRAM 00f00000 00008000 00000000 00008000 RWIX
    SHDSPL2ROM 11700000 00100000 00000000 00100000 RWIX
    SHDSPL2RAM 11800000 00040000 00000000 00040000 RWIX
    SHDSPL1PRAM 11e00000 00008000 00000000 00008000 RWIX
    SHDSPL1DRAM 11f00000 00008000 00000000 00008000 RWIX
    EMIFACS0 40000000 20000000 00000000 20000000 RWIX
    EMIFACS2 60000000 02000000 00000000 02000000 RWIX
    EMIFACS3 62000000 02000000 00000000 02000000 RWIX
    EMIFACS4 64000000 02000000 00000000 02000000 RWIX
    EMIFACS5 66000000 02000000 00000000 02000000 RWIX
    SHRAM 80000000 00020000 00000000 00020000 RWIX
    DDR2 c0000000 20000000 00000000 20000000 RWIX

  • Ilan,

    I mistakenly assumed from your post's title and first line that you were trying to determine the CPU speed, i.e. how fast the DSP is running.

    Now, I understand you are concerned with how fast the DSP executes your benchmarking code. That is a different issue, and I apologize for my confusion.

    In your original post, you said

    Ilan R said:
    when I look at the assembly code, the for loop appears to be executed in a single instruction

    This is not possible, so there is also a misunderstanding about what the DSP is doing.

    L1P and L1D caches should be enabled by default. Please confirm that these are enabled. Because your code is a small loop with few data variables, everything should be in L1P and L1D cache after the first pass; any memory latencies will be avoided after that first pass.
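    One way to check and set this from code, as an alternative to the GEL file, is sketched below. The L1PCFG/L1DCFG addresses are the C64x+ megamodule cache configuration registers (0x01840020 and 0x01840040 per the C64x+ Megamodule Reference Guide); please verify them against your device documentation before relying on this.

    /* C64x+ L1 cache configuration registers (verify the addresses). */
    #define L1PCFG (*(volatile unsigned int *)0x01840020)
    #define L1DCFG (*(volatile unsigned int *)0x01840040)

    void enableL1Caches(void)
    {
        L1PCFG = 7;        /* 7 = maximum size, all of L1P as cache (32 KB) */
        L1DCFG = 7;        /* 7 = maximum size, all of L1D as cache (32 KB) */

        /* Read back so the mode change completes before continuing. */
        (void)L1PCFG;
        (void)L1DCFG;
    }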

    Please find the compiler option to generate a listing file and attach that file to this thread. This will include compiler settings and the assembly code that is used. In that file, you and we will be able to count the actual instructions used in the for-loop. I assume the total will add up to around 20. The absolute minimum cycle count for any loop is 6, since a B branch instruction takes 6 cycles even if it is branching just to itself in a super-tight loop.

    The better question to ask is: why are you doing this? You know the DSP clock rate is 300 MHz, so you must be benchmarking your simple test code for some reason?

    Regards,
    RandyP

  • Hi Randy,

    I attached the main.c and main.lst files.

    The GEL file (LCDK GEL) does not seem to configure L1PCFG or L1DCFG. I checked the values of these registers in the debugger and they were 7 and 3. I changed L1DCFG to 7 (all cache).

    In the debugger, the loop appeared to execute on the line LDW.D2T2 *B15[1],B4, where B4 is incremented; that is why I assumed it takes a single cycle, but this instruction probably takes more cycles.

    As I mentioned, when I removed the volatile the time was 30 s (even though main.lst did not change).

    I started doing this test because I need to run the UART at a speed of ~3 Mbps, where I need to receive data and send data every 100 us. I tried to use the DMA to do this and I measured ~60 us to receive the data and transmit it back, which means the DSP would spend most of its time sending and receiving data (without even processing it or doing other tasks). So, to check that my measurements are OK, I started checking the CPU clock and the code execution time.

    Currently, for the UART, I still need to decide whether I will use the EDMA or whether it will be faster to write the message directly to the UART FIFO and use the UART interrupt (the message size is less than the UART FIFO size).
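    For reference, the direct-FIFO/interrupt option could look roughly like the sketch below. The UART2 base address (0x01D0D000) and the 16550-style register offsets are taken from the C6748 TRM but should be double-checked for the UART instance actually used; uartSendFrame is just a placeholder name.

    /* 16550-style UART registers (offsets per the C6748 TRM; verify). */
    #define UART2_BASE  0x01D0D000u          /* assumed UART2 instance      */
    #define UART_THR    (*(volatile unsigned int *)(UART2_BASE + 0x00))
    #define UART_LSR    (*(volatile unsigned int *)(UART2_BASE + 0x14))
    #define LSR_THRE    (1u << 5)            /* TX holding reg / FIFO empty */

    /* Push one short frame into the TX FIFO in a single burst. The FIFO is
       16 bytes deep, so once THRE is set a 14-byte message fits without
       any further polling.                                                 */
    void uartSendFrame(const unsigned char *buf, int len)
    {
        int i;
        while ((UART_LSR & LSR_THRE) == 0)
            ;                                /* wait for an empty TX FIFO   */
        for (i = 0; i < len; i++)
            UART_THR = buf[i];               /* back-to-back FIFO writes    */
    }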

    Best Regards

    Ilan

  • Ilan,

    Can you please try attaching the files again? They are not coming up when I try to view them.

    LDW.D2T2 *B15[1],B4 only reads a single value from the stack. B15 == SP. That is not the full loop.

    Regards,
    RandyP
  • Hi Randy

    You are right. After single-stepping through the assembly, the for loop takes 19 cycles in the volatile example and 9 without volatile.

    Volatile loop:

    008003a8:   BC4D                LDW.D2T2      *B15[1],B4          ; 1 cycle  (load dlyCount from stack)
    008003aa:   6C6E                NOP           4                   ; 4 cycles (LDW delay slots)
    008003ac:   2641                ADD.L2        B4,1,B4             ; 1 cycle
    008003ae:   BC45                STW.D2T2      B4,*B15[1]          ; 1 cycle  (volatile forces the store)
    008003b0:   BC4D                LDW.D2T2      *B15[1],B4          ; 1 cycle  (volatile forces a reload)
    008003b2:   6C6E                NOP           4                   ; 4 cycles (LDW delay slots)
    008003b4:   00148BFA            CMPLTU.L2     B4,B5,B0            ; 1 cycle
    008003b8:   2004A120     [ B0]  BNOP.S1       $C$L2 (PC+8 = 0x008003a8),5   ; 6 cycles (branch + 5 delay slots)
    008003bc:   E3A00000            .fphead       n, l, W, BU, nobr, nosat, 0011101   ; fetch-packet header, not executed
                                                                      ; total: ~19 cycles per iteration

    Non-volatile loop:

    008003a4:   2641                ADD.L2        B4,1,B4             ; 1 cycle
    008003a6:   BC45                STW.D2T2      B4,*B15[1]          ; 1 cycle
    008003a8:   00148BFA            CMPLTU.L2     B4,B5,B0            ; 1 cycle
    008003ac:   2002A120     [ B0]  BNOP.S1       $C$L2 (PC+4 = 0x008003a4),5   ; 6 cycles (branch + 5 delay slots)
                                                                      ; total: ~9 cycles per iteration

    This explains the time delay for the for loop: at 300 MHz, 19 cycles * 1,000,000,000 iterations is about 63 s, and 9 cycles gives about 30 s.

    Regarding my question about the UART:

    I have to receive 14 bytes and send 14 bytes using a UART running at a baud rate of ~3 Mbps (I can change this baud rate if I must). Data is received and sent every ~100 us. Will it be better (less CPU time) to send and receive the data directly to/from the UART FIFO register (using interrupts), or to use the EDMA? And does the ~60 us delay from the time the data was received to the time the data was sent (measured by scope on the UART lines) make sense? That test was done by configuring the EDMA to receive the data from the UART and send it back to the UART (almost the same delay was measured for 16 bytes and for 88 bytes).

    Best Regards

    Ilan 

  • Ilan,

    Baud rates are usually written in bits per second (bps), not bytes per second (BPS). I assume you meant 3 Mbps for your baud rate, but if not, it could make a big difference.

    Even 3Mbaud is a high data rate, but the device can handle up to 12 Mbaud.

    Can you clarify, please, the timing of the 14 Rx bytes and the 14 Tx bytes? Do they overlap in some way? What is the 100us: for each byte, each set of 14 bytes, etc.? And the 60us? I obviously have not located my calculator to figure that out. I will let you do the heavy work.

    You will certainly be better off using the EDMA for data movement instead of the DSP. And if you can use the UART's FIFOs, that will help, too, but your clarification of the transfer timing is needed to understand how you are using it.

    Regards,
    RandyP
  • Hi Randy

    I am working with the UART configured for 4,687,500 bits per second.

    My program receives 14 bytes every ~100 us, checks the header, and sends back 14 bytes.
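    As a quick sanity check on those numbers (a hedged calculation: it assumes the UART input clock is PLL0 SYSCLK2 = 150 MHz at a 300 MHz CPU clock and 16x oversampling, which should be verified for this board):

    #include <stdio.h>

    int main(void)
    {
        unsigned int uartClk = 150000000u;               /* assumed module clock  */
        unsigned int baud    = 4687500u;
        unsigned int divisor = uartClk / (16u * baud);   /* = 2 -> DLL=2, DLH=0   */
        double       frameUs = 14.0 * 10.0 / baud * 1e6; /* 14 bytes x 10 bits    */

        printf("divisor = %u, 14-byte frame = %.1f us\n", divisor, frameUs);
        /* One frame takes ~29.9 us, so receiving a request and transmitting a
           reply (~60 us of line time) still fits inside the ~100 us period.    */
        return 0;
    }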

    I made some new tests and measured around ~5 us delay between the time the data was received and the time it was sent back. The time was measured with an oscilloscope from the last received bit to the first transmitted bit.

    with FIFO trigger level 14 and the UART interrupt - 5 us

    with FIFO trigger level 1 and the EDMA interrupt - 5.9 us

    with FIFO trigger level 14 and the EDMA interrupt - 7.2 us

    The receive interrupt (in both the EDMA and UART tests) came 1.6 us after the last received bit.

    So as you can see, the differences are not that large and depend mostly on the code implementation (I work with a 300 MHz clock, code in L2, and the L1P/L1D caches enabled).

    So I will probably choose to work with the EDMA; this gives me more flexibility on the data size, less or more than 14 (without the EDMA, the UART FIFO limits the data size to the 14/8/4 FIFO trigger levels).
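    A rough illustration of that flexibility (a hedged sketch: the struct mirrors the EDMA3 PaRAM entry layout described in the TRM, OPT programming and the UART event/channel numbers are omitted since they must be looked up for this device, and setupUartRxParam is a placeholder name). One A-synchronized PaRAM set handles any message length just by changing BCNT:

    /* EDMA3 PaRAM entry layout (see the EDMA3 chapter of the TRM). */
    typedef struct {
        unsigned int opt;
        unsigned int srcAddr;
        unsigned int aCntBCnt;        /* ACNT in bits 15:0, BCNT in bits 31:16 */
        unsigned int destAddr;
        unsigned int srcDestBIdx;     /* SRCBIDX in 15:0, DSTBIDX in 31:16     */
        unsigned int linkBCntReload;
        unsigned int srcDestCIdx;
        unsigned int cCnt;
    } EdmaParam;

    /* Receive setup: one byte per UART RX event (ACNT = 1), msgLen bytes per
       message (BCNT = msgLen), destination advancing one byte per transfer.
       msgLen is not tied to the 1/4/8/14 FIFO trigger levels.               */
    void setupUartRxParam(volatile EdmaParam *p, unsigned int rbrAddr,
                          unsigned char *dstBuf, unsigned short msgLen)
    {
        p->srcAddr        = rbrAddr;                 /* UART RBR (fixed)        */
        p->destAddr       = (unsigned int)dstBuf;
        p->aCntBCnt       = ((unsigned int)msgLen << 16) | 1u;
        p->srcDestBIdx    = (1u << 16) | 0u;         /* SRC fixed, DST += 1     */
        p->srcDestCIdx    = 0u;
        p->cCnt           = 1u;
        p->linkBCntReload = 0xFFFFu;                 /* no link                 */
        /* p->opt (TCC, interrupt enable, etc.) is left to the full driver.    */
    }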

    Thanks

    Ilan