Tool/software: Linux
I optimized the same algorithm separately for the A15 (NEON plus multithreading) and for the C66 (DMA of the data into L2 plus instruction-level optimization). The A15 version is clearly faster. Is this normal? I expected the C66, as a DSP, to be the core better suited to speeding up algorithms.