Hello,
My question may be a little naive, and I am probably missing a point, but still:
I have been profiling the vector dot product using two benchmark sources: DSPF_sp_dotprod() from the C67x DSP Library and DSP_dotprod() (the intrinsic C implementation) from the C64x DSP Library. Profiling each one with data arrays of size nx = 256, 512, 1024, 2048 and 4096 samples in separate runs, I observe a cycleCPU count of around (1.5 * nx) on the C67x generic device and around (3 * nx) on the C64x generic device for all of these array sizes.
Now, the documented benchmark formula is (nx/4 + 15) cycles for the C64x and (nx/2 + 25) cycles for the C67x. How should I interpret the cycleCPU counts I get from profiling against these theoretical algorithm benchmarks? Clearly, the profiled cycleCPU count includes instruction execution cycles as well as cross-path stalls and memory bank conflict stalls. Is there any way to exclude those stalls while I profile the code?
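To make the gap concrete, here is a small C sketch that tabulates the documented formulas against my measurements. The observed values are only my rough fits (3 * nx and 1.5 * nx), not exact profiler readings, and the function names are my own, not part of either DSPLIB.

```c
#include <stdio.h>

/* Documented benchmark formulas, as quoted above. */
static unsigned c64x_formula_cycles(unsigned nx) { return nx / 4 + 15; }
static unsigned c67x_formula_cycles(unsigned nx) { return nx / 2 + 25; }

/* Rough fits to my cycleCPU readings (approximations, not exact counts). */
static double c64x_observed_cycles(unsigned nx) { return 3.0 * nx; }
static double c67x_observed_cycles(unsigned nx) { return 1.5 * nx; }

/* How many times larger the observed count is than the formula. */
static double overhead_ratio(double observed, unsigned formula)
{
    return observed / (double)formula;
}

/* Print formula, observed count and ratio for each array size I profiled. */
void print_gap_table(void)
{
    static const unsigned sizes[] = { 256, 512, 1024, 2048, 4096 };
    size_t i;

    printf("%6s %10s %10s %7s %10s %10s %7s\n", "nx",
           "C64x form", "C64x obs", "ratio",
           "C67x form", "C67x obs", "ratio");
    for (i = 0; i < sizeof sizes / sizeof sizes[0]; ++i) {
        unsigned nx = sizes[i];
        unsigned f64 = c64x_formula_cycles(nx);
        unsigned f67 = c67x_formula_cycles(nx);
        double o64 = c64x_observed_cycles(nx);
        double o67 = c67x_observed_cycles(nx);

        printf("%6u %10u %10.0f %7.2f %10u %10.0f %7.2f\n", nx,
               f64, o64, overhead_ratio(o64, f64),
               f67, o67, overhead_ratio(o67, f67));
    }
}
```

Calling print_gap_table() from main() prints the table. As nx grows, the constant terms fade and the ratios approach 12 for the C64x and 3 for the C67x, so the mismatch is a roughly constant factor per sample rather than a fixed startup overhead.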
Is there a relationship, grounded in the underlying DSP architecture, between the algorithm benchmarks specified by TI and the actual cycleCPU count observed when profiling the code with CCSv4? Lastly, how can I tell which device is faster in these two cases, given that each simulation was run on a generic device?
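On the "which is faster" part, my understanding is that a cycle count only becomes a time once it is divided by a clock rate, which a generic simulator device does not pin down. The sketch below shows the conversion I have in mind. The clock rates passed in are hypothetical placeholders that I would have to replace with the figures for a real part, and the cycle counts are again my rough fits.

```c
#include <stdio.h>

/* Execution time in microseconds: cycles divided by clock rate in MHz. */
static double cycles_to_us(double cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/*
 * Compare the two kernels for one array size.
 * c64x_mhz and c67x_mhz are caller-supplied, hypothetical clock rates;
 * the generic simulator devices do not define them.
 */
void print_time_comparison(unsigned nx, double c64x_mhz, double c67x_mhz)
{
    double c64x_cycles = 3.0 * nx;   /* rough fit to my C64x readings */
    double c67x_cycles = 1.5 * nx;   /* rough fit to my C67x readings */

    printf("nx = %u\n", nx);
    printf("  C64x: %.0f cycles at %.0f MHz -> %.3f us\n",
           c64x_cycles, c64x_mhz, cycles_to_us(c64x_cycles, c64x_mhz));
    printf("  C67x: %.0f cycles at %.0f MHz -> %.3f us\n",
           c67x_cycles, c67x_mhz, cycles_to_us(c67x_cycles, c67x_mhz));
}
```

For example, print_time_comparison(1024, 600.0, 300.0) uses made-up rates of 600 MHz and 300 MHz, and with those particular numbers both kernels come out at the same time. That is exactly why I am unsure how to compare the two devices from cycle counts alone.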
Thanks!