This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

c6678 FFT cycle problem

Other Parts Discussed in Thread: MATHLIB

hello:

I test the complex FFT demo in the mcsdk on the EVM, I found that ,for example,1K point FFT cycles are N, and  2K point FFT cycles are 2*N , and 4K point FFT cycles are 4*N, and when the points are  8K, the cycles are about 20*N cycles, about Five times of the 4K points, not Double time; and then 16K points FFT cycles are  40*N cycle。The 

difference is from   the 4K points  to  8K points, I think it maybe have relationship with L1D cache,because  the 4K points complex FFT data are just 32KB, the same size with L1D cache, I don't know wether my guess is right, can anyone help me?

Thanks!

  • Do you have the actual cycle count numbers?

    When you start thrashing the cache you will start paying a penalty but this sounds a bit high.  Where is the main data stored; L2?

    Best Regards,

    Chad

  • I am sorry that I did't give the exact cycle, now  here it is:

    1K complex points cycles :7781

    2K complex  points cycles :18471

    4K complex points cycles :48926

    8K complex points cycles :212245

    16K complex points cycles :436764

    32K complex points cycles :3026556


    4k TO 8K  and 16K TO 32K is obvious different,data is stored in MSM RAM. I have test that no matter the data is in L2 or MSM RAM, the result is almost the same,because defaut ,the MSM RAM is a share L2.

    I have wrote other function,for example ,Complex sequence multiplied by a complex sequence and Complex sequence add/sub a complex sequence,and so on,  with Linear assembly,I find that some are similar to the above FFT test when the complex points from 4K to 8K, and some are OK.

    So I do not know what the matter is! I hope this problem can be solved. 

    Thanks!

  • hi,

    I am anxious to solve this problem,So if anyone know why,Please tell me . Thanks.

    Best regard

  • It does sound like a possibility of thrashing the cache may be causing the issue.  I'm going to ask a colleague to comment on this.

    Best Regards,
    Chad

  • Hi,

    It is expected to see performance degradation when the data is larger than the cache size. Depending on the algorithm you will see more or less degradation because the data is accessed in different patterns.

    Sometimes it is possible to design the algorithm for specific cache sizes. The FFT included in the mathlib is a generic radix 4 and radix 2 implementation and will have degradation when the size increases.

    Thanks

    cesar

     

  • Hi,cesar

     You say  "performance degradation when the data is larger than the cache size ". Obviously, the FFT is it. But there are some cases of non-compliance with this rule. for example:

    X[i],Y[i],Z[i] are all float complex ,this is a complex array added by another complex array. I wtote it with linear assembly:

    #define N   32*1024

    for (i = 0; i < N;i++)

    {

      Z[ i ] =  X[ i] + Y[ i ];

    }

    test the cycle is:

    1K point  cycle:  1917

    2K point cycle:  3711

    4k point cycle : 20074

    8K point cycle: 40065

    16K point cycle: 80031

    32K point cycle :159957

    1 above cycle, it is from 2K points to 4K points obvious differrent , not 4K TO 8K, why?

    another case,wrote with linear assembly too:

    #define N   32*1024

    char X[N ];

    float Y[N];

    for( i = 0; i < N; i++)

    {

       Y[i] = (float)X[i];

    test cycle:

    N = 1K  cycle: 1352

    N = 2K  cycle:2605

    N = 4K cycle : 5166

    N = 8K cycle:10287

    N = 16K cycle :20592

    N = 32K cycle:41130

    2 this case when the data is larger than the cache size, it don't have  performance degradation. Why?

    3 I don't know what will lead to the cache's performance degradation ?

    4  And if performance degradation, how much  will it degradate?  

    5  When  I  design my project, what should be considered?


    I need to know these,because  these will  influence design.

    So I hope you can give a more detailed explanation.

    Best Regards.

  • Can anyone give me a  more detailed explanation?

  • Hi,

    Different algorithms will have different performance based on memory access. In order to understand the memory access you need to review the asm code generated by the tools. Understanding the linear assembly will not provide sufficient information to understand the memory accesses.

    1) To understand why the cycle count increases much more when the data increases from 2K to 4K than 4K to 8K I suggest you step through this code in CCS and in a Memory Browser analyze the content of the cache.

    2) In order to understand why there is less performance degradation for this algorithm you will need to analyze the asm code and step through the code in CCS. You will probably see that there are less cache misses than for the previous algorithm.

    3)_4) The performance of the algorithm is related to the number of cache misses. Most of the time it is difficult to compute the performance degradation because one needs to know exactly how many cache misses will happen. The performance degradation is estimated by running the code and benchmarking it.

    5) When you design your project you should run the code and benchmark the performance. You should first invalidate the cache and then run your the size of the FFT that will be needed in your project.

     

    Thanks

    cesar