TDA4VM: Curious difference in performance between C66x and C7x

Clément Fau

Part Number: TDA4VM

Hello,

I would like to make performance comparisons between the C66x and C7x cores.
For that, I read TSCL on the C66x and TSC on the C7x.

As a first test, I compared the cycle counts on the two cores for a simple loop that executes the asm("NOP") instruction 100 times.

Both compilers are configured the same way, with no optimizations. I measure 4807 cycles on the C7x and 1546 cycles on the C66x. I can't explain this difference: I would have expected it to be minor, or at least to go the other way.
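The experiment described above could be sketched roughly as follows. This is a minimal sketch, not the original test code: the macros COUNTER_START, COUNTER_READ and ONE_NOP and both function names are hypothetical, the macros `_TMS320C6X` and `__C7000__` are assumed to be predefined by the respective TI compilers, and the `clock()` fallback is there only so the sketch also builds on a host. Two points are worth verifying on target: to my understanding the C66x counter does not run until something writes to TSCL, and the TI assembler treats text starting in column 1 as a label, hence the leading space in `asm(" NOP")`.

```c
#include <stdint.h>

#if defined(__C7000__)                 /* assumed C7000 compiler macro */
#include <c7x.h>
#define COUNTER_START()  ((void)0)
#define COUNTER_READ()   ((uint64_t)__TSC)
#define ONE_NOP()        asm(" NOP")   /* leading space: not a label */
#elif defined(_TMS320C6X)              /* assumed C6000 compiler macro */
#include <c6x.h>
#define COUNTER_START()  (TSCL = 0)    /* a write to TSCL starts the counter */
#define COUNTER_READ()   ((uint64_t)TSCL)   /* low 32 bits only */
#define ONE_NOP()        asm(" NOP")
#else                                  /* host fallback: clock ticks, not cycles */
#include <time.h>
#define COUNTER_START()  ((void)0)
#define COUNTER_READ()   ((uint64_t)clock())
#define ONE_NOP()        ((void)0)
#endif

/* Cost of two back-to-back counter reads with nothing in between. */
uint64_t measure_read_overhead(void)
{
    uint64_t t0, t1;
    COUNTER_START();
    t0 = COUNTER_READ();
    t1 = COUNTER_READ();
    return t1 - t0;
}

/* Time `count` iterations of a loop whose body is a single NOP. */
uint64_t time_nop_loop(int count)
{
    volatile int i;   /* volatile keeps an optimizing build from deleting the loop */
    uint64_t t0, t1;
    COUNTER_START();
    t0 = COUNTER_READ();
    for (i = 0; i < count; i++) {
        ONE_NOP();
    }
    t1 = COUNTER_READ();
    return t1 - t0;
}
```

Subtracting measure_read_overhead() from time_nop_loop(100) leaves the cost of the loop itself, which includes the loop-control instructions and the accesses to the loop counter as well as the NOPs.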

Is this normal? If so, why?

If not, what do you think is the cause?

Thanks,

Clément

over 4 years ago

0 George Mock over 4 years ago

TI Guru 243870 points

Because of this compiler setting ...

user6476994 said:
no optimizations

... it is not a valid comparison. No one expects unoptimized code to perform well. A comparison of two bad implementations of a loop doesn't mean anything.

Instead build with at least --opt_level=2.

One other suggestion ... Be sure to run the code from on-chip memory. Otherwise, you can end up counting a lot of cycles due to wait states, cache misses, etc.

Thanks and regards,

-George

0 Clément Fau over 4 years ago in reply to George Mock

Prodigy 135 points

Thank you for your answer.

Even with compiler optimizations enabled, the problem is more or less the same.
Both programs are loaded into L2, which is configured with 0 cache.

While running further tests, I noticed that a function that prints the result of a matrix operation to the console with printf is significantly faster on the C7x core. I measured the number of cycles on both cores and got about 800 million for the C7x versus 100,000 for the C66x, even though the display is much faster on the C7x. So I wonder whether I am measuring the cycles correctly. I use these two functions for the C7x:

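#include <stdio.h>
#include <c7x.h>   /* TI C7000 header, assumed here to declare __TSC */

static unsigned long t_start = 0;

/* Record the cycle counter at the start of the measured region. */
void Start_Profiler(){
   t_start = __TSC;
}

/* Read the counter again and print the elapsed cycles. */
void Stop_profiler(){
   unsigned long t_stop;
   unsigned long t_elapsed;
   t_stop = __TSC;   /* sampled before printf, so the print is not counted */
   t_elapsed = t_stop - t_start;
   printf("Cycles = %lu\n", t_elapsed);
}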

and the same thing with TSCL for C66x.
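One thing that may be worth ruling out when the same pattern is used with TSCL (offered as an assumption to check, not as a diagnosis of the numbers above): TSCL is only the low 32 bits of the C66x time stamp counter, so a region longer than 2^32 cycles wraps it. The sketch below shows a wrap-tolerant 32-bit difference and a 64-bit read built from TSCL and TSCH. The function names are hypothetical, and the `_TMS320C6X` guard is assumed to be the macro the C6000 compiler predefines.

```c
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter samples.
   Unsigned subtraction is modulo 2^32, so one wrap between the two
   samples is handled; more than one wrap cannot be detected. */
static uint32_t elapsed32(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Join the low and high halves of a 64-bit counter value. */
static uint64_t join64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(_TMS320C6X)   /* assumed C6000 compiler macro */
#include <c6x.h>
/* Full 64-bit time stamp on the C66x. */
static uint64_t read_tsc64(void)
{
    uint32_t lo = TSCL;   /* read TSCL first: that read snapshots TSCH */
    uint32_t hi = TSCH;
    return join64(lo, hi);
}
#endif
```

For example, elapsed32(0xFFFFFFF0, 0x00000010) returns 0x20 even though the counter wrapped between the two samples.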

Should this work correctly?

Thanks and regards,

Clément

0 Deepak Poddar over 4 years ago in reply to Clément Fau

TI Expert 4725 points

Hi,

Your profiling method seems OK. Can you please check whether the compiler is able to software pipeline the loop? You can check that in the generated assembly file, which you can get by building with the -k or --keep_asm option.

Also, can you please try a simple MAC operation (a += x[i] * y[i]) instead of just NOP, built with --opt_level=3?
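A kernel of the kind suggested here could look like the sketch below. The function name, the 16-bit element type and the `restrict` qualifiers are assumptions chosen for illustration; the thread does not specify them. The result is returned so that an optimizing build cannot discard the loop, and it should be used (for example printed) after the timed region ends.

```c
#include <stdint.h>

/* Multiply-accumulate over two arrays: a += x[i] * y[i].
   restrict tells the compiler that x and y do not overlap, which
   generally makes a loop like this easier to schedule. */
int32_t mac_i16(const int16_t *restrict x, const int16_t *restrict y, int n)
{
    int32_t a = 0;
    int i;
    for (i = 0; i < n; i++) {
        a += (int32_t)x[i] * y[i];
    }
    return a;
}
```

Calling mac_i16 between the two counter reads, with the same arrays and the same n on both cores, gives a workload where the two compilers have real arithmetic to schedule.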

Regards

Deepak Poddar
