I measured how long a function takes to execute in simulator mode and on the actual target. In simulator mode it took 2.5 milliseconds, whereas on the actual target it took 3.4 milliseconds. In simulator mode, L2 memory was used with no caching. On the actual target, I disabled HWIs during the function execution and L1 caching was enabled. I still cannot understand why the target run was slower by 900 microseconds.
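To put the gap in perspective, here is a minimal sketch of the arithmetic behind the two measurements quoted above. The function names are hypothetical helpers for illustration only, not part of any DSP toolchain, and the timings are passed in as whole microseconds (2.5 ms = 2500 us, 3.4 ms = 3400 us).

```c
#include <stdio.h>

/* Hypothetical helper: absolute gap between the two runs, in microseconds. */
long timing_gap_us(long simulator_us, long target_us)
{
    return target_us - simulator_us;
}

/* Hypothetical helper: how much slower the target run is than the
 * simulator run, as a percentage of the simulator time. */
double slowdown_percent(long simulator_us, long target_us)
{
    return 100.0 * (double)(target_us - simulator_us) / (double)simulator_us;
}

/* Hypothetical helper: print both figures for a pair of measurements. */
void print_timing_report(long simulator_us, long target_us)
{
    printf("gap: %ld us, slowdown: %.1f%%\n",
           timing_gap_us(simulator_us, target_us),
           slowdown_percent(simulator_us, target_us));
}
```

Calling `print_timing_report(2500, 3400)` gives a gap of 900 us, which is a 36% slowdown relative to the simulator figure.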