Tool/software: TI C/C++ Compiler
Hello,
I am trying to write an EVE VCOP kernel for an algorithm that involves many arithmetic operations. One option is to split the operations across multiple for-loops within a single kernel function.
My question is: will there be any performance difference between the two approaches below?
Assume the algorithm fits in a single for-loop and satisfies all VCOP constraints.
Approach 1: Algorithm implemented in the VCOP kernel using a single for-loop.
Approach 2: Algorithm implemented in the VCOP kernel using multiple for-loops, with the operations split between them.
I understand that Approach 2 requires intermediate buffers, which increases memory usage. How does this affect performance?
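
To make the two approaches concrete, here is a minimal sketch in plain C (not actual VCOP kernel-C syntax). The function names, the arithmetic, and the `tmp` buffer are hypothetical placeholders chosen only for illustration; the real kernel has many more operations.

```c
#include <stddef.h>
#include <stdint.h>

/* Approach 1 (hypothetical example): all arithmetic in a single loop.
 * The intermediate sum lives in a local variable, so no extra buffer is needed. */
void kernel_fused(const int16_t *in_a, const int16_t *in_b,
                  int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t sum = (int32_t)in_a[i] + in_b[i];
        out[i] = (int16_t)(sum * 3 - in_b[i]);
    }
}

/* Approach 2 (hypothetical example): the same arithmetic split across two loops.
 * The first loop stores its result to the intermediate buffer tmp,
 * and the second loop loads it back to finish the computation. */
void kernel_split(const int16_t *in_a, const int16_t *in_b,
                  int32_t *tmp, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        tmp[i] = (int32_t)in_a[i] + in_b[i];      /* extra store per element */
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(tmp[i] * 3 - in_b[i]); /* extra load per element */
    }
}
```

Both functions produce the same output. The difference in this sketch is that the split version adds one store and one load of `tmp` per element and re-reads `in_b` in the second loop, which is the extra memory traffic I am asking about.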