c674x vs c672x: are my results reasonable?

Other Parts Discussed in Thread: OMAP-L138

I just finished a clock-cycle performance evaluation of the c674x (a c6748 running on the Logic EVM) and wanted to see if my results are expected. I have been using the c672x processor family in several products, but I had an application that requires more external memory than is available in that family. My intent was to compare the c674x to the c672x on the basis of clock cycles, to get a general feel for what kind of hit the cache would give my application in exchange for the added memory. I focused on 2 routines: a 1024-point complex FFT (hand-optimized assembly version) and a complex fir routine that operates on 2 large arrays, each 10k words (40 kbytes) in length. The cplx fir is written in standard C, optimized for speed at the highest level by the compiler. First I ran them both with the data in L2 RAM, then with the data in external DDR memory, and then compared them with the c672x processor. I used the emulator with breakpoints, viewing the clock cycles reported for the routines. This method has proved accurate for me in the past when looking at actual cycle consumption. The results are as follows:

C674x: 1024 cplx fft (L2 data) 14000 clks, (ddr data) 23500 clks.  c672x: 11500 clks

C674x: cplx fir (L2 data) 43000-63000 clks, (ddr data) 43000-63000 clks.  c672x: 21600 clks

I am not sure why there is variance between runs of the cplx fir on the c674x; is that much variance to be expected from run to run, or could I change some initial cache condition to get more consistent performance? The bottom line is a 2x to 3x slowdown when running out of external DDR memory, and a 2x slowdown when running out of L2 memory for the fir. I assume this is because the data arrays were too big for the cache? The fft only saw a slight hit in the L2 memory run and a 2x hit when running out of external memory. Can anyone tell me if these results are to be expected? I had the L2 cache enabled when running out of external memory, with the MAR bit set.
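One way to control the initial cache condition between runs is to warm the cache by touching the arrays once before the timed call, so the first measured pass does not pay all of the compulsory misses. A minimal portable sketch; the 64-byte stride is an assumption matching the documented C674x L1D line size, so check the device TRM:

```c
#include <stddef.h>

/* Touch one word per cache line of a buffer before the timed run.
 * Returns a checksum so the compiler cannot optimize the loop away.
 * The 64-byte stride is an assumed L1D line size. */
float warm_cache(const float *buf, size_t n_words)
{
    float sum = 0.0f;
    size_t stride = 64 / sizeof(float);   /* one touch per 64-byte line */
    for (size_t i = 0; i < n_words; i += stride)
        sum += buf[i];
    return sum;
}
```

Calling this on both input arrays immediately before the benchmarked routine should make the first timed iteration look like the later ones.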

  • Hi Dana,

    C674x should have at least the same performance as the C67x+. Are you using the correct mv option?

    Please see:

    http://tiexpressdsp.com/index.php/-mv_option_to_use_with_the_C674x

Thanks for the response and suggestion, I confirmed that I am using the correct -mv6740 build option.

Isn't it true, though, that the best performance I could expect out of the c6748 using the floating-point instructions (running at 300 MHz) would be equivalent to the c6727 (also at 300 MHz) if it used the same instructions, and that this performance would only be matched if the c6748 were running out of the L1P and L1D caches? As soon as I move into L2 or external memory the hope of equivalent performance is gone, since all of the memory on the c6727 is single-cycle execution memory. So back to my original question: are the results that I posted (a 2x or 3x slowdown due to cache) expected, or am I missing something?

  • DanaT said:
    Isn’t it true though the best performance that I could expect out of the c6748 using the floating point instructions (running at 300 MHz) would be performance equivalent to the c6727 (also at 300 MHz) if it used the same instructions, and that this performance would only be matched if the c6748 were running out of the L1p and L1d caches. 

Correct, the performance would be the same (or better) only in terms of number of instructions.

    DanaT said:
    As soon as I move into L2 or external memory the hope of equivalent performance is gone, since all of the memory on the c6727 is single cycle execution memory?  So back to my original question are the results that I posted (a 2x or 3x slow down due to cache) expected or am I missing something?

     If you can attach your code and optimization options I can take a look for you.

     

  • Thanks for taking a look. 

I have attached the file TestRoutines.zip with the source in the src folder and the project and .cmd file in the build folder. The TestRoutines.c file contains the complex multiply (fir type) routine. It has the file-specific optimizations set to optimize for max speed (-mf5) and function level (-o2). The command file places the data in external memory (ddr). Main() is located in Test_CplxMult.c, which also initializes the cache. Feel free to email me with any questions.

    TestRoutines.zip
  • Hi Dana,

Looking back at your 1st post, I cannot understand how those numbers apply to the test you sent me.

    Can you give me the numbers that you are getting for mpy_WX:

1) with C674x in internal memory

2) with C6727 in internal memory

3) with C674x in external memory

4) with C6727 in external memory

Also, please see attached a modified version of mpy_WX that does not use as much indexing (indexing degrades the performance).

    Also, please choose "no debug" for the build options for the file that contains mpy_WX.

Sorry about the confusion. In my original post, what I labeled as cplx fir is the routine mpy_WX. The routine is a time-domain fir implemented in the frequency domain (mpy_WX). So just for that routine (mpy_WX), with the c6748 in internal memory, the best I had seen was 43000 clks.

    For the c6727 in internal memory I see 21600 clks.

    From external memory with the c6748 I also saw 43000 clks.

Running the c6727 from external memory is not reasonable, which is why I was looking into the c674x. But pretty consistently the c6748 runs at least 2x slower for both the mpy_WX and the cplx 1024 pt fft. I have seen some consistency issues from run to run with the c674x that I am not sure about. Sometimes it seems to take considerably longer to execute the mpy_WX routine. The cplx FFT always seems very consistent.
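For reference, mpy_WX is essentially a frequency-domain complex multiply over two long arrays. The actual source is in the attached zip; a generic C sketch of that kind of kernel, assuming interleaved re/im data, looks something like this:

```c
/* Generic frequency-domain complex multiply, y = w * x, with data stored
 * as interleaved re/im pairs.  This is only a sketch of the kind of
 * kernel discussed in the thread, not the actual mpy_WX source. */
void cplx_mpy(const float *w, const float *x, float *y, int n_cplx)
{
    for (int i = 0; i < n_cplx; i++) {
        float wr = w[2*i], wi = w[2*i + 1];
        float xr = x[2*i], xi = x[2*i + 1];
        y[2*i]     = wr * xr - wi * xi;   /* real part */
        y[2*i + 1] = wr * xi + wi * xr;   /* imaginary part */
    }
}
```

With 10k-word (40 kbyte) inputs, each pass streams well past the 32K L1D, which is consistent with the cache hit being discussed here.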

Thanks for your other suggestions. The "no debug" option did help a little, although the modified version of mpy_WX without indexing didn't seem to change things much.

  • Hi Dana,

I got 23000 to 27000 cycles with the code/data all in internal memory, with no debug, -o3, and my "non-indexed" version.

One thing that I corrected: if you configure part of L2 as cache, you have to reduce its size in the .cmd file, so for 128K of cache, LENGTH = 0x00020000; for 64K of cache, LENGTH = 0x00030000.
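For example, with 128K of L2 configured as cache, the MEMORY entry could look like the fragment below. The 0x00800000 origin is the usual C6748 L2 RAM base address; this is an assumption here, so check it against your device's memory map:

```
MEMORY
{
    /* 256K of L2, minus the 128K configured as cache */
    L2RAM:  o = 0x00800000  l = 0x00020000
}
```

If the cmd file still claims the full 256K, the linker can place data in the region the cache controller now owns, which corrupts results unpredictably.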

    There seems to be something really weird with this code, it is giving me inconsistent results. I'm looking into it.

     

Has anyone looked at (or can you post?) the SW pipelined loop asm output from the C compiler?

    That will give you the theoretical best case on the C674x CPU (I often like running these on a CPU simulator with test vectors).

    THEN any effects of L1/L2 cache, SPLOOP, etc. will derate from the ideal, but give you a real world number. 

    And you can tweak these on HW to get as close to ideal as you can.
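To help the compiler find a good software-pipelined schedule in the first place, restrict-qualified pointers and a trip-count pragma are usually worth trying. A minimal sketch; MUST_ITERATE is TI-compiler-specific and is ignored (with a warning) by other compilers:

```c
/* restrict promises the compiler the pointers do not alias, and
 * MUST_ITERATE gives a minimum trip count (8) and a divisibility
 * factor (8), both of which help software pipelining.  The kernel
 * itself is just an illustrative y = a + k*b loop. */
void scale_add(float *restrict y, const float *restrict a,
               const float *restrict b, float k, int n)
{
#pragma MUST_ITERATE(8, , 8)
    for (int i = 0; i < n; i++)
        y[i] = a[i] + k * b[i];
}
```

Without restrict, the compiler has to assume y may overlap a or b and serializes the loads and stores, which shows up directly in the ii the scheduler reports.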

  • Hi Dana,

    Did you send me your complete test case?

Because it seems that in the test case you sent me there are some problems, like xIdx in the Dat structure not being initialized, etc.

    Joe, I can send you the asm file when I get the test case clarified...

  • Mariana,

Nice catch on the xIdx not being initialized. I had carved this test code out from another application that used it, and I was not initializing xIdx in the test case. I have attached a new Test_CplxMult.c that initializes xIdx. That explains some of the inconsistent behavior that I was seeing.

    Test_CplxMult.zip
  • Hi Dana, Joe,

    Here are my numbers:

internal memory:
  customer code: 36423 cycles
  my code (less indexing): 34261 cycles

external memory with cache config:
  customer code: 38900 cycles
  my code (less indexing): 36738 cycles

Attached is the corrected version of my code (found a bug), and the equivalent asm code.

It seems like the GEL file that comes with the OMAP-L138 does not configure DDR unless you use the menu, so please make sure to run it in CCS:

    So at least now the numbers are consistent. Joe, can you take a look at the C code and see if there is a way of making it more efficient? Please also take a look at the ASM code.

     

    results.zip
  • >>Joe, can you take a look at the C code and see if there is a way of making it more efficient? Please also take a look at the ASM code.

At a quick glance at the ASM (and without fully understanding the algorithm), your inner loop does not look bad considering the number of loads into registers that need to be done.

    >> ii = 4  Schedule found with 6 iterations in parallel

The MPYSPs are pretty well parallelized, but the ADDSP/SUBSP scheduling might be improved.

    Can you take the conditional:

    >>if((--Idx)<0) Idx=MTail-1; 

    outside of the inner loop somehow?
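One common way to hoist that wrap-around test is to split the circular walk into two linear segments, each branch-free. A sketch with a hypothetical ring_sum kernel standing in for the real inner loop; it assumes count <= MTail:

```c
/* Sum `count` elements of a ring buffer of length mtail, walking
 * backwards from start_idx, without the per-iteration wrap test
 * if((--Idx)<0) Idx = MTail-1;  -- instead the walk is split into
 * the segment before the wrap and the segment after it.
 * Assumes count <= mtail.  Illustrative only. */
float ring_sum(const float *buf, int mtail, int start_idx, int count)
{
    float sum = 0.0f;
    int first = start_idx + 1;              /* elements before wrapping */
    if (first > count) first = count;

    for (int i = 0; i < first; i++)         /* segment 1: no wrap test */
        sum += buf[start_idx - i];
    for (int i = 0; i < count - first; i++) /* segment 2: after the wrap */
        sum += buf[mtail - 1 - i];
    return sum;
}
```

Each segment is now a plain counted loop with no conditional in the body, which gives the software pipeliner much more to work with.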

    More importantly, is there some way to avoid the outer loop?  Or even make them into 2 loops?

    It looks like the outer loop is very inefficient

    >>ii = 22 Schedule found with 1 iterations in parallel

And the C compiler was having trouble scheduling it.

    A nice place to start for such optimization techniques is:

    http://focus.ti.com/general/docs/litabsmultiplefilelist.tsp?literatureNumber=spra666

The c672x uses a different memory system than the C674x family.

On the c672x, the CPU can access data from internal RAM in one cycle. On the c674x, the CPU needs at least 3 cycles to access data on an L1 cache miss or from L2 SRAM. The c672x does not use a data cache, while the c674x uses a two-level L1/L2 cache.
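A toy first-order model shows why large streaming arrays magnify that difference. All numbers here are illustrative assumptions, not measured C674x figures: a 64-byte L1D line (16 single-precision words) and a flat per-miss stall on top of 1 cycle per access:

```c
/* Back-of-envelope cost of streaming `words` 32-bit words through a
 * cache with 64-byte lines: one compulsory miss per line, each costing
 * an assumed flat `miss_penalty` stall on top of 1 cycle per access.
 * Purely illustrative -- real C674x miss costs depend on the source
 * memory and outstanding-miss overlap. */
long est_stream_cycles(long words, long miss_penalty)
{
    long words_per_line = 16;              /* 64B line / 4B word */
    long misses = words / words_per_line;  /* compulsory misses only */
    return words + misses * miss_penalty;  /* 1 cycle/access + stalls */
}
```

With a 10k-word array and a hypothetical 10-cycle penalty, the stalls alone add over 60% to the flat-memory cycle count, which is the right order of magnitude for the roughly 2x gaps reported in this thread.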