AM5718: OpenCL profiling support/tools for C66x DSP

Part Number: AM5718

Hi,

I am trying to figure out a good way to profile my program running on the DSP (in the OpenCL runtime). Normally, when developing on an ARM or x86 CPU, I would, for example, #include <sys/time.h> and then use struct timeval and gettimeofday to set a timestamp and get delta times from that timestamp throughout my code, in order to measure how long different sections of code take to run (with more or less microsecond accuracy). Is there something similar available when programming for the DSP?
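
For reference, the host-side pattern described above can be sketched roughly as follows. This is a minimal illustration assuming a POSIX system; the names stamp_start and elapsed_us are hypothetical and not part of any TI or OpenCL API.

```c
#include <stddef.h>
#include <sys/time.h>

/* Reference timestamp, set by stamp_start(). */
static struct timeval g_start;

/* Record the reference point for later delta measurements. */
void stamp_start(void)
{
    gettimeofday(&g_start, NULL);
}

/* Microseconds elapsed since stamp_start() was last called. */
long long elapsed_us(void)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (long long)(now.tv_sec - g_start.tv_sec) * 1000000LL
         + (long long)(now.tv_usec - g_start.tv_usec);
}
```

Calling stamp_start() once and then elapsed_us() after each section of interest gives the running delta in microseconds.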

I have looked in various locations for include files that might expose a high resolution timer functionality:

ti/ccsv6/tools/compiler/ti-cgt-c6000_8.1.0/include
tisdk/filesystem/tisdk-rootfs-image-am57xx-evm/usr/share/ti/opencl
tisdk/filesystem/tisdk-rootfs-image-am57xx-evm/usr/share/ti/cgt-c6x/include

etc., but I have not found anything similar to the sys/time.h API.

I began to investigate DSP simulators here: processors.wiki.ti.com/index.php/Category:Simulation, but then ran into this information:

"CCSv6 does NOT have any simulators. Texas Instruments is moving away from providing simulators and instead is focusing on providing low cost development boards." (per processors.wiki.ti.com/index.php/List_of_Simulator)

So it looks like the simulator path is a non-optimal one. My research thus far indicates that CCS has some code profiling features for the DSP, but I would like to do runtime profiling outside of a debugging environment. I want to profile my code without JTAG or other debugging support, in as real-world an environment as possible (read: minimally intrusive), in order to capture the full dynamics of the OpenCL dispatch, compute device processing, and IPC overhead.

Thanks,

Weston

  • Hi,

    The DSP experts have been notified. They will respond here.
  • Hi Weston

    My understanding is that you are trying to benchmark DSP execution from the OpenCL environment. OpenCL has a good tool for this.

    Look at the TI Design www.ti.com/.../TIDEP0046 and see how it measures execution time using the Event object (I called the Event instances ev, ev2, and so on). The TI Design is in C++, but there is an equivalent in C. The getProfilingInfo function of the Event object gives the user a lot of information (when the task was sent to the DSP, how long the DSP execution took, and so on).
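
As a rough illustration of what the event timestamps provide, the sketch below turns the four nanosecond counters an OpenCL event reports into per-phase durations. In the C API these values come from clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE; those calls appear only in comments here because they need a live command queue. The struct and function names are hypothetical, and this is not the TI Design's code.

```c
#include <stdint.h>

/* The four timestamps (nanoseconds) an OpenCL event can report. On a real
 * queue each field would be filled by a call such as
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
 *                           sizeof(cl_ulong), &start, NULL);
 * using CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END in turn. */
typedef struct {
    uint64_t queued;   /* command enqueued by the host      */
    uint64_t submit;   /* command submitted to the device   */
    uint64_t start;    /* device began executing            */
    uint64_t end;      /* device finished executing         */
} event_times_ns;

typedef struct {
    uint64_t wait_in_queue;  /* queued -> submit */
    uint64_t dispatch;       /* submit -> start  */
    uint64_t execute;        /* start  -> end    */
    uint64_t total;          /* queued -> end    */
} event_breakdown_ns;

/* Split one event's lifetime into its phases. */
event_breakdown_ns breakdown(event_times_ns t)
{
    event_breakdown_ns b;
    b.wait_in_queue = t.submit - t.queued;
    b.dispatch      = t.start  - t.submit;
    b.execute       = t.end    - t.start;
    b.total         = t.end    - t.queued;
    return b;
}
```

The execute field corresponds to the time spent on the device, while dispatch captures the overhead between submission and the start of execution.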

    Look at the TI OpenCL documentation for more information.

    Did I answer your question? If so, please close the thread.

    Regards

    Ran

  • Hi Ran,

    Thanks for your response. I am using OpenCL events to benchmark the overall execution time for the program, similar to the Monte-Carlo simulation example. So I have coarse timing for the kernel execution itself, as well as fine-grained timing for the IPC read/write buffers. But I am also executing a large amount of C library code outside of the OpenCL kernel (using the TI extension "Calling Standard C Code From OpenCL C Code": downloads.ti.com/mctools/esd/docs/opencl/extensions/standard-c-code.html). It is this library code that I would like to profile.

    Thanks,

    Weston

  • Two comments

    1. You can still dispatch the DSP library functions from OpenCL and measure the time using OpenCL.
    2. Each library function has a unit test. You can use CCS to measure the cycles. Hint: compile the test code with full symbolic debug and use the release build of the library function when measuring the time.

    By the way, what function are you interested in?
  • Hi Ran,

    The library I am calling into is a custom one. The task that the DSP is performing is an image processing operation. I enqueue an image into a CL write buffer and a number of parameters that control the image processing into another write buffer, then perform the image processing task on the DSP (using library code), then return the processing results via a read buffer. I'd like to carry along execution timestamps for the various steps in the image processing sequence in the returned read buffer as well.

    Weston

  • Hi Weston

    Here is my suggestion. You can use the TSCL and TSCH registers on the DSP core. Read these two values at the start of execution and store them in global variables, then read them again just before the end of execution and store them in (different) global variables. Then print the values and calculate the number of cycles.

    Here is some pseudocode that does something like that, using printf instead of storing the data in global variables:

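    #include <stdio.h>
    #include <c6x.h>   // declares the TSCL and TSCH control registers

    void executeTheSignalProcessingCode(void); // placeholder for the code under test

    // Read TSCL first: the read latches the upper 32 bits into TSCH.
    static unsigned long long readTimestamp(void)
    {
        unsigned int lo = TSCL;
        unsigned int hi = TSCH;
        return ((unsigned long long)hi << 32) | lo;
    }

    void profileSignalProcessing(void)
    {
        unsigned long long t1, t2, overhead, elapsed;

        TSCL = 0; // this is a must: any write to TSCL starts the counter

        t1 = readTimestamp();
        t2 = readTimestamp();
        overhead = t2 - t1; // this is to determine the overhead
        printf("overhead is %llu cycles\n", overhead);

        t1 = readTimestamp();

        executeTheSignalProcessingCode();

        t2 = readTimestamp();
        elapsed = (t2 - t1) - overhead;
        printf("t1 %llu t2 %llu\n", t1, t2);
        // Dividing by 1e9 gives seconds only for a 1 GHz core clock;
        // substitute the actual DSP clock rate.
        printf("time passed (in seconds) %e\n", (double)elapsed / 1000000000.0);
    }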
    Now instead of printf store the values and read them later
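
One way to store the values instead of printing them is a small fixed-size log that could live inside the results buffer returned to the host. The sketch below is only an illustration: stamp_log, stamp, stage_cycles and MAX_STAMPS are hypothetical names, and the non-DSP branch is a fake counter included solely so the logic can be exercised off-target.

```c
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>                     /* declares TSCL and TSCH */
static uint64_t read_cycles(void)
{
    uint32_t lo = TSCL;              /* reading TSCL latches TSCH */
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Host stand-in (assumption): a fake counter advancing 100 per read. */
static uint64_t fake_cycles;
static uint64_t read_cycles(void) { return fake_cycles += 100; }
#endif

#define MAX_STAMPS 16                /* hypothetical capacity */

/* Fixed-layout log that can be copied back to the host as plain data. */
typedef struct {
    uint32_t count;                  /* number of valid entries */
    uint64_t cycles[MAX_STAMPS];     /* raw cycle counter values */
} stamp_log;

/* Append the current cycle count; silently drop stamps once full. */
void stamp(stamp_log *log)
{
    if (log->count < MAX_STAMPS)
        log->cycles[log->count++] = read_cycles();
}

/* Cycles spent between stamp i and stamp i + 1 (caller checks the index). */
uint64_t stage_cycles(const stamp_log *log, uint32_t i)
{
    return log->cycles[i + 1] - log->cycles[i];
}
```

Calling stamp() between processing steps and reading the log on the host afterwards avoids the cost of printf inside the timed region.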

    Ran
  • Ran,

    Thank you very much for your help. I should be close to a solution with the above pseudocode. Using the TMS320C6000 Optimizing Compiler v8.1.x User's Guide (section 7.5.3, The __cregister Keyword) and the TMS320C66x DSP CPU and Instruction Set Reference Guide (section 2.9.13, Time Stamp Counter Registers (TSCL and TSCH)), I have implemented the following:

    #include "c6x.h"
    
    // Local t0 for start time of processing.
    unsigned long long int _startTime = 0;
    
    // Control register access provided by __cregister keyword, see: c6x.h.
    extern __cregister volatile unsigned int TSCL;
    extern __cregister volatile unsigned int TSCH;
    
    void enable_cpu_clock_counter(void) {
        TSCL = 0;
    }
    
    unsigned long long int get_timestamp(void)
    {
        // Read TSCL first: the read latches the upper 32 bits into TSCH.
        unsigned int lo = TSCL;
        unsigned int hi = TSCH;
        return ((unsigned long long int)hi << 32) | lo;
    }
    
    void set_timestamp(unsigned long long int timestamp)
    {
        _startTime = timestamp;
    }
    
    unsigned int uptime(void)
    {
        return (unsigned int)(get_timestamp() - _startTime);
    }

    Which generates the following assembly code for enable_cpu_clock_counter:

    ;******************************************************************************
    ;* FUNCTION NAME: enable_cpu_clock_counter()                                  *
    ;*                                                                            *
    ;*   Regs Modified     : B4                                                   *
    ;*   Regs Used         : B3,B4                                                *
    ;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
    ;******************************************************************************
    _Z24enable_cpu_clock_counterv:
    ;** --------------------------------------------------------------------------*
    ;          EXCLUSIVE CPU CYCLES: 8
               ZERO    .L2     B4                ; [B_L66] |25| 
               MVC     .S2     B4,TSCL           ; [B_Sb66] |25| 
               RETNOP          B3,5              ; [] |26| 
               ; BRANCH OCCURS {B3}              ; [] |26| 
    	.sect	".text"
    	.clink
    	.global	_Z13set_timestampy

    This looks like a correct way to start the cpu clock counter per the documentation. Instead of converting clock ticks to an actual time value, I am simply returning the delta clock ticks. 
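
If a time value is wanted later, the delta ticks can be converted on the host once the DSP clock rate is known. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the clock frequency is passed in as a parameter rather than taken from any TI API.

```c
#include <stdint.h>

/* Convert a cycle count to microseconds for a given core clock (Hz). */
double cycles_to_us(uint64_t cycles, double clock_hz)
{
    return (double)cycles * 1e6 / clock_hz;
}

/* Seconds until a 32-bit cycle delta wraps at the given core clock (Hz). */
double wrap_seconds_32bit(double clock_hz)
{
    return 4294967296.0 / clock_hz;   /* 2^32 cycles */
}
```

Because uptime() returns a 32-bit value, long measurements can wrap: with an assumed 750 MHz clock, for example, wrap_seconds_32bit(750e6) is a little under 6 seconds. Returning the full 64-bit delta avoids this.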

    This appears to work.

    Thanks,

    Weston