Hi,
I am currently working with the C6678 and exploring its processing performance. According to the user guide, each core running at 1.25 GHz supports up to 20 GFLOPS. However, looking at the core architecture, I am not sure I understand how one can actually exploit these 20 GFLOPS.

As far as I understand it (and the reason I am posting is that I am not sure, and would like the opinion of a TI expert), each core has two register files, and each register file is connected to memory through a 64-bit bus. That means only two single-precision load/store operations can take place per cycle in each register file. Consequently, each register file can perform only one single-precision multiplication per cycle if its two 32-bit operands have to be loaded from L1 in the previous cycle. With two register files, a core can therefore perform only two single-precision multiplications per cycle when operating on new operands loaded from L1 in the previous cycle.

In conclusion, in a realistic scenario each core running at 1.25 GHz would have a capacity of 2.5 GFLOPS. So 100% utilization (i.e. 20 GFLOPS per core) is not practical unless the calculations are done on the same data over and over.
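To make the arithmetic above explicit, here is a minimal sketch in C of my estimate. The function names are made up for illustration, and the model only encodes my assumptions from above (two register files, one 64-bit path each, two fresh 32-bit operands per multiplication), so it may well not reflect how the core really behaves.

```c
#include <stdio.h>

/* Rate in GFLOPS: clock in GHz times floating-point operations per cycle. */
double gflops(double clock_ghz, double flops_per_cycle)
{
    return clock_ghz * flops_per_cycle;
}

/* Operations per cycle implied by a quoted peak rate at a given clock. */
double implied_flops_per_cycle(double peak_gflops, double clock_ghz)
{
    return peak_gflops / clock_ghz;
}

/*
 * Multiplications per cycle if every operand must be freshly loaded.
 * Assumption (mine, not from the documentation): each register file can
 * load bus_bits per cycle, and each multiplication consumes
 * operands_per_mult new operands of operand_bits each.
 */
int load_bound_mults_per_cycle(int register_files, int bus_bits,
                               int operand_bits, int operands_per_mult)
{
    int operands_per_file = bus_bits / operand_bits;   /* 64 / 32 = 2 */
    return register_files * (operands_per_file / operands_per_mult);
}

/* Prints the quoted peak next to the load-bound estimate. */
void print_estimate(double clock_ghz, double quoted_peak_gflops)
{
    int mults = load_bound_mults_per_cycle(2, 64, 32, 2);
    printf("implied peak ops per cycle: %.1f\n",
           implied_flops_per_cycle(quoted_peak_gflops, clock_ghz));
    printf("load-bound multiplications per cycle: %d\n", mults);
    printf("load-bound estimate: %.2f GFLOPS\n",
           gflops(clock_ghz, (double)mults));
}
```

Calling `print_estimate(1.25, 20.0)` from a `main` function shows the gap I am asking about: the quoted peak implies 16 operations per cycle, while this load-bound model allows only 2.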
Is this right, or am I missing something?
Thank you very much in advance,
Mike