I just finished a clock-cycle performance evaluation of the C674x (a C6748 running on the Logic EVM) and wanted to check whether my results are what I should expect. I have been using the C672x processor family in several products, but I have an application that requires more external memory than is available in that family. My intent was to compare the C674x to the C672x on the basis of clock cycles, to get a general feel for what kind of hit the cache would give my application in exchange for the added memory.

I focused on two routines: a 1024-point complex FFT (hand-optimized assembly version) and a complex FIR routine that operates on two large arrays, each 10k words (40 kbytes) in length. The complex FIR is written in standard C and compiled at the highest speed optimization level. First I ran both routines with the data in L2 RAM, then with the data in external DDR memory, and compared each against the C672x processor. I measured with the emulator, using breakpoints and reading the clock cycles reported for each routine. This method has proved accurate for me in the past when looking at actual cycle consumption. The results are as follows:
C674x: 1024-point cplx FFT: (L2 data) 14000 clks, (DDR data) 23500 clks. C672x: 11500 clks
C674x: cplx FIR: (L2 data) 43000-63000 clks, (DDR data) 43000-63000 clks. C672x: 21600 clks
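To make the comparison easier to read, here is a small sketch that turns the cycle counts above into C674x/C672x slowdown ratios. It is plain host-side arithmetic on the numbers I reported, not something that runs on the DSP; the names `C672X`, `C674X` and `slowdown` are just illustrative.

```python
# Cycle counts copied from the results above (emulator breakpoint readings).
# Each C674x entry is a (lowest, highest) pair to capture run-to-run variance.
C672X = {"fft": 11500, "fir": 21600}
C674X = {
    "fft": {"L2": (14000, 14000), "DDR": (23500, 23500)},
    "fir": {"L2": (43000, 63000), "DDR": (43000, 63000)},
}


def slowdown(routine, memory):
    """Return the (best, worst) C674x/C672x cycle ratio for one routine."""
    low, high = C674X[routine][memory]
    baseline = C672X[routine]
    return low / baseline, high / baseline


if __name__ == "__main__":
    for routine in ("fft", "fir"):
        for memory in ("L2", "DDR"):
            best, worst = slowdown(routine, memory)
            print(f"{routine} with data in {memory}: {best:.2f}x to {worst:.2f}x")
```

Note that the FIR range is the same for L2 and DDR data, because I saw the same 43000-63000 spread in both configurations.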
I am not sure why the cplx FIR timing varies so much between runs on the C674x. Is that much variance to be expected from run to run, or could I change some initial cache condition to get more consistent performance? The bottom line is a 2x or 3x slowdown when running out of external DDR memory, and a 2x slowdown when running out of L2 memory for the FIR. I assume this is because the data arrays were too big for the cache? The FFT only saw a slight hit in the L2 memory run and a 2x hit when running out of external memory. Can anyone tell me if these results are to be expected? When running out of external memory, I had the L2 cache enabled and the MAR bit set.
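As a rough sanity check on my "arrays too big for the cache" guess, the sketch below compares the FIR working set against the cache sizes. The array sizes are from my test; the 32 KB L1D and 256 KB L2 figures are my assumptions for the C6748, and the 256 KB value is only an upper bound since I have not stated how much of L2 was actually configured as cache.

```python
WORD_BYTES = 4          # 10k words = 40 kbytes, so 4 bytes per word
ARRAY_WORDS = 10_000    # length of each FIR array, from the test above
NUM_ARRAYS = 2          # the FIR operates on two such arrays

# Assumed cache sizes (not measured): L1D fully configured as cache, and
# the largest L2 cache setting; the real L2 cache size depends on the setup.
L1D_CACHE_BYTES = 32 * 1024
L2_CACHE_MAX_BYTES = 256 * 1024


def working_set_bytes(words=ARRAY_WORDS, arrays=NUM_ARRAYS):
    """Total bytes touched by the FIR data arrays."""
    return words * arrays * WORD_BYTES


def fits(cache_bytes):
    """True if the whole FIR working set fits in a cache of this size."""
    return working_set_bytes() <= cache_bytes


if __name__ == "__main__":
    print("working set:", working_set_bytes(), "bytes")
    print("fits in L1D:", fits(L1D_CACHE_BYTES))
    print("fits in max L2 cache:", fits(L2_CACHE_MAX_BYTES))
```

Under those assumptions the two arrays together are larger than L1D even when the data lives in L2 RAM, which would be consistent with the FIR taking a hit in both configurations, but this does not by itself explain the run-to-run variance.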