Hello,
I would like to make performance comparisons between the C66x and C7x cores.
To do so, I read TSCL on the C66x and TSC on the C7x.
For my first test, I wanted to compare the cycle counts on the two cores for a simple loop that executes the asm("NOP") instruction 100 times.
Both compilers are configured the same way, with no optimizations. I measure 4807 cycles on the C7x and 1546 cycles on the C66x. I can't explain this difference; I would have expected it to be minor, or at least to go the other way.
Is this normal? If so, why?
If not, what do you think is the cause?
Thanks,
Clément
Because of this compiler setting ...
user6476994 said: no optimizations
... it is not a valid comparison. No one expects unoptimized code to perform well. Comparing two bad implementations of a loop doesn't mean anything.
Instead build with at least --opt_level=2.
One other suggestion ... Be sure to run the code from on-chip memory. Otherwise, you can end up counting a lot of cycles due to wait states, cache misses, etc.
Thanks and regards,
-George
Thank you for your answer.
Even with the compiler optimizations, the problem is more or less the same.
Both programs are loaded into L2, which is configured with 0 cache.
While running further tests, I noticed that a function that prints a matrix result to the console with printf calls is significantly faster on the C7x core. I measured the number of cycles on both cores and got about 800 million for the C7x versus 100,000 for the C66x, even though the display is much faster on the C7x. So I wonder whether I am measuring the cycles correctly. I use these two functions for the C7x:
static unsigned long t_start = 0;

void Start_Profiler()
{
    t_start = __TSC;
}

void Stop_profiler()
{
    unsigned long t_stop;
    unsigned long t_overhead;

    t_stop = __TSC;
    t_overhead = t_stop - t_start;
    printf("Cycles = %lu\n", t_overhead);
}
and the same thing with TSCL for C66x.
Should this work correctly?
Thanks and regards,
Clément
Hi,
Your profiling method seems OK. Can you please check whether the compiler is able to software pipeline the loop? You can check that in the generated assembly file, which is kept when you compile with the -k or --keep_asm option.
Also, can you please try a simple MAC operation (a+=x[i]*y[i]) instead of just NOP, with --opt_level=3?
Regards
Deepak Poddar