c6678 FFT cycle problem

si cheng

Other Parts Discussed in Thread: MATHLIB

hello:

I test the complex FFT demo in the mcsdk on the EVM, I found that ,for example,1K point FFT cycles are N, and 2K point FFT cycles are 2*N , and 4K point FFT cycles are 4*N, and when the points are 8K, the cycles are about 20*N cycles, about Five times of the 4K points, not Double time; and then 16K points FFT cycles are 40*N cycle。The

difference is from the 4K points to 8K points, I think it maybe have relationship with L1D cache，because the 4K points complex FFT data are just 32KB, the same size with L1D cache, I don't know wether my guess is right， can anyone help me?

Thanks!

over 12 years ago

0 Chad Courtney over 12 years ago

TI__Mastermind 30825 points

Do you have the actual cycle count numbers?

When you start thrashing the cache you will start paying a penalty but this sounds a bit high. Where is the main data stored; L2?

Best Regards,

Chad

0 si cheng over 12 years ago in reply to Chad Courtney

Expert 2685 points

I am sorry that I did't give the exact cycle， now here it is:

1K complex points cycles :7781

2K complex points cycles :18471

4K complex points cycles :48926

8K complex points cycles :212245

16K complex points cycles :436764

32K complex points cycles :3026556

4k TO 8K and 16K TO 32K is obvious different，data is stored in MSM RAM. I have test that no matter the data is in L2 or MSM RAM, the result is almost the same,because defaut ,the MSM RAM is a share L2.

I have wrote other function,for example ，Complex sequence multiplied by a complex sequence and Complex sequence add/sub a complex sequence,and so on, with Linear assembly，I find that some are similar to the above FFT test when the complex points from 4K to 8K, and some are OK.

So I do not know what the matter is！ I hope this problem can be solved.

Thanks!

0 si cheng over 12 years ago in reply to si cheng

Expert 2685 points

hi，

I am anxious to solve this problem,So if anyone know why,Please tell me . Thanks.

Best regard

0 Chad Courtney over 12 years ago in reply to si cheng

TI__Mastermind 30825 points

It does sound like a possibility of thrashing the cache may be causing the issue. I'm going to ask a colleague to comment on this.

Best Regards,
Chad

0 Cesar over 12 years ago in reply to Chad Courtney

TI__Guru*** 149614 points

Hi,

It is expected to see performance degradation when the data is larger than the cache size. Depending on the algorithm you will see more or less degradation because the data is accessed in different patterns.

Sometimes it is possible to design the algorithm for specific cache sizes. The FFT included in the mathlib is a generic radix 4 and radix 2 implementation and will have degradation when the size increases.

Thanks

cesar

0 si cheng over 12 years ago in reply to Cesar

Expert 2685 points

Hi,cesar

You say "performance degradation when the data is larger than the cache size ". Obviously， the FFT is it. But there are some cases of non-compliance with this rule. for example:

X[i],Y[i],Z[i] are all float complex ,this is a complex array added by another complex array. I wtote it with linear assembly:

#define N 32*1024

for (i = 0; i < N;i++)

{

Z[ i ] = X[ i] + Y[ i ];

}

test the cycle is:

1K point cycle: 1917

2K point cycle: 3711

4k point cycle : 20074

8K point cycle: 40065

16K point cycle: 80031

32K point cycle :159957

1 above cycle, it is from 2K points to 4K points obvious differrent , not 4K TO 8K, why？

another case,wrote with linear assembly too:

#define N 32*1024

char X[N ];

float Y[N];

for( i = 0; i < N; i++)

{

Y[i] = (float)X[i];
}

test cycle:

N = 1K cycle: 1352

N = 2K cycle:2605

N = 4K cycle : 5166

N = 8K cycle:10287

N = 16K cycle :20592

N = 32K cycle:41130

2 this case when the data is larger than the cache size, it don't have performance degradation. Why？

3 I don't know what will lead to the cache's performance degradation ?

4 And if performance degradation, how much will it degradate?

5 When I design my project, what should be considered？

I need to know these,because these will influence design.

So I hope you can give a more detailed explanation.

Best Regards.

0 si cheng over 12 years ago in reply to si cheng

Expert 2685 points

Can anyone give me a more detailed explanation?

0 Cesar over 12 years ago in reply to si cheng

TI__Guru*** 149614 points

Hi,

Different algorithms will have different performance based on memory access. In order to understand the memory access you need to review the asm code generated by the tools. Understanding the linear assembly will not provide sufficient information to understand the memory accesses.

1) To understand why the cycle count increases much more when the data increases from 2K to 4K than 4K to 8K I suggest you step through this code in CCS and in a Memory Browser analyze the content of the cache.

2) In order to understand why there is less performance degradation for this algorithm you will need to analyze the asm code and step through the code in CCS. You will probably see that there are less cache misses than for the previous algorithm.

3)_4) The performance of the algorithm is related to the number of cache misses. Most of the time it is difficult to compute the performance degradation because one needs to know exactly how many cache misses will happen. The performance degradation is estimated by running the code and benchmarking it.

5) When you design your project you should run the code and benchmark the performance. You should first invalidate the cache and then run your the size of the FFT that will be needed in your project.

Thanks

cesar

Processors

Processors forum

c6678 FFT cycle problem