TDA4VM: Curious difference in performance between C66x and C7x

Part Number: TDA4VM

Hello,

I would like to make performance comparisons between the C66x and C7x cores.
For that I use TSCL for the first one and TSC for the second one.


For my first test, I wanted to compare the number of cycles on the two cores for a simple loop that executes the asm("NOP") instruction 100 times.


Both compilers are configured the same way, with no optimizations. I measure 4807 cycles on the C7x and 1546 cycles on the C66x. I can't explain this difference; I would have expected it to be minor, or at least to go the other way.

Is this normal? If so, why?

If not, what do you think is the cause?

Thanks,

Clément

  • Because of this compiler setting ...

    user6476994 said:
    no optimizations

    ... it is not a valid comparison.  No one expects unoptimized code to perform well.  A comparison of two bad implementations of a loop doesn't mean anything.

    Instead build with at least --opt_level=2.  

    One other suggestion ... Be sure to run the code from on-chip memory.  Otherwise, you can end up counting a lot of cycles due to wait states, cache misses, etc.

    Thanks and regards,

    -George

  • Thank you for your answer.

    Even with the compiler optimizations, the problem is more or less the same.
    Both programs are loaded in L2, configured with 0 cache.

    While performing further tests, I noticed that a function that prints the contents of a matrix to the console with printf calls is significantly faster on the C7x core. I measured the number of cycles on both cores and got about 800 million for the C7x versus 100,000 for the C66x, even though the display is much faster on the C7x. So I wonder whether I am measuring the cycles correctly. I use these two functions for the C7x:

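    static unsigned long t_start = 0;

    /* Record the time-stamp counter at the start of the measured region. */
    void Start_Profiler(){
       t_start = __TSC;
    }

    /* Read the counter again and print the cycles elapsed since Start_Profiler(). */
    void Stop_profiler(){
       unsigned long t_stop;
       unsigned long t_elapsed;
       t_stop = __TSC;
       t_elapsed = t_stop - t_start;
       printf("Cycles = %lu\n", t_elapsed);
    }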

    and the same thing with TSCL for C66x.

    Should this work correctly?
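
    One way to sanity-check a start/stop profiler like this is to measure the cost of two back-to-back counter reads and subtract it from every result. The following is only a sketch under assumptions: profiler, profiler_init, profiler_start, profiler_stop and fake_counter are hypothetical names, and fake_counter is a host-side stand-in that advances 3 ticks per read. On the target it would be replaced by a function that returns __TSC (C7x) or TSCL (C66x). Note also that on the C66x the counter is normally started by writing any value to TSCL once before the first read.

    ```c
    #include <stdint.h>

    /* Any function that returns a monotonically increasing tick count. */
    typedef uint64_t (*cycle_reader)(void);

    typedef struct {
        cycle_reader read;     /* counter source (hypothetical abstraction) */
        uint64_t     overhead; /* cost of two back-to-back reads            */
        uint64_t     start;    /* tick count captured by profiler_start()   */
    } profiler;

    /* Host-side stand-in for a hardware counter: advances 3 ticks per read.
     * Assumption for illustration only; not part of any TI API. */
    static uint64_t fake_ticks = 0;
    uint64_t fake_counter(void)
    {
        fake_ticks += 3;
        return fake_ticks;
    }

    /* Calibrate: two consecutive reads give the fixed measurement overhead. */
    void profiler_init(profiler *p, cycle_reader read)
    {
        uint64_t t0, t1;
        p->read = read;
        t0 = read();
        t1 = read();
        p->overhead = t1 - t0;
        p->start = 0;
    }

    /* Capture the counter at the start of the measured region. */
    void profiler_start(profiler *p)
    {
        p->start = p->read();
    }

    /* Return the elapsed ticks with the calibrated overhead removed. */
    uint64_t profiler_stop(const profiler *p)
    {
        uint64_t elapsed = p->read() - p->start;
        return (elapsed > p->overhead) ? (elapsed - p->overhead) : 0;
    }
    ```

    Returning the value instead of printing it inside the stop function keeps the printf call itself out of any later measurement.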

    Thanks and regards,

    Clément

  • Hi,

    Your profiling method seems to be OK. Can you please check whether the compiler is able to software pipeline the loop? You can check that in the generated assembly file, which is kept when you build with the compile-time option -k or --keep_asm.

    Also, can you please try a simple MAC (multiply-accumulate) operation (a += x[i]*y[i]) instead of just NOP, with --opt_level=3?
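
    A minimal sketch of such a MAC test could look like the code below. The names mac, time_mac and read_cycles are hypothetical, and read_cycles uses the standard clock() function only so that the sketch also builds on a host; on the target it would be assumed to return __TSC or TSCL instead. The restrict qualifiers are an assumption about the buffers not overlapping, which generally gives the compiler more freedom to software pipeline the loop.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    /* Hypothetical portable stand-in for the cycle counter.
     * On the target, this would return __TSC (C7x) or TSCL (C66x). */
    static uint64_t read_cycles(void)
    {
        return (uint64_t)clock();
    }

    /* Multiply-accumulate kernel: acc += x[i] * y[i] for i in [0, n). */
    int32_t mac(const int16_t *restrict x, const int16_t *restrict y, size_t n)
    {
        int32_t acc = 0;
        size_t i;
        for (i = 0; i < n; i++) {
            acc += (int32_t)x[i] * y[i];
        }
        return acc;
    }

    /* Time one call to mac(); the sum is stored through result so the
     * compiler cannot discard the loop as dead code. */
    uint64_t time_mac(const int16_t *x, const int16_t *y, size_t n,
                      int32_t *result)
    {
        uint64_t start = read_cycles();
        *result = mac(x, y, n);
        return read_cycles() - start;
    }
    ```

    Calling time_mac with the same arrays and the same n on both cores, and printing the returned count only after the measurement, gives two numbers that are easier to compare than a loop of NOPs.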

    Regards

    Deepak Poddar