Eugen
I ran my experiments on a WinCE 7 compact environment (since that is my current development setup). I will go back to WinCE 6 and try to reproduce the issue you are seeing, and I will get back to you once I find something.
-Madhvi
Madhvi,
I am downloading Windows Embedded Compact 7 now and will try doing an EVM build without and with your recommended changes above. I will let you know of my findings.
Hi Madhvi,
This issue, where accessing the file system takes an excessive amount of time, has been traced to the attempt to improve data cache flushing implemented in both the 6.X and A8 BSPs. The implementation is inadequate and causes poor performance for RAM-based file I/O. The fix is to fall back to Microsoft's default Windows CE ARM cache flush routines (OALFlushDCache and OALFlushDCacheLines).
Thank you,
Using Microsoft's routines to flush the data cache is not the right workaround, since these routines only take care of the L1 cache. The routines in the BSP take more time because they take care of both the L1 and L2 caches. Using only the Microsoft routines could therefore lead to cache coherency issues. We will not be recommending these changes for our BSPs.
Now for the original problem: Microsoft acknowledges that the excessive data flushing is a known issue of the Object Store (the RAM-based file system). It needs to flush the cache constantly to prevent data loss on an accidental reset or power failure (assuming the RAM still retains its data).
As per Microsoft's suggestion, the possible workaround is to use RAMDISK or RAMFMD mounted as root, if the Object Store is not a hard requirement. The idea is to replace the Object Store with RAMDISK/RAMFMD and mount it as root (i.e. set SYSGEN_FSROMONLY=1). Unless you exclude the Object Store, you will still suffer from the excessive data cache flushing issue. The underlying cause is that caches on modern CPUs keep getting larger, so the flush-all penalty is amplified.
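To put a rough number on that last point, here is a minimal sketch, assuming a flush-all has to walk every cache line by set and way whether or not the line is dirty. The cache and line sizes below are hypothetical values picked for illustration, not figures from any datasheet, and `lines_per_full_flush` is just an illustrative helper name.

```python
def lines_per_full_flush(cache_bytes, line_bytes):
    """Number of cache lines a flush-all must visit, dirty or not."""
    return cache_bytes // line_bytes


# Hypothetical sizes, for illustration only.
LINE_BYTES = 64
L1_ONLY_BYTES = 16 * 1024                  # small L1 data cache, no L2
L1_PLUS_L2_BYTES = 16 * 1024 + 256 * 1024  # same L1 plus a 256 KB L2

if __name__ == "__main__":
    small = lines_per_full_flush(L1_ONLY_BYTES, LINE_BYTES)
    large = lines_per_full_flush(L1_PLUS_L2_BYTES, LINE_BYTES)
    print("lines walked per flush-all, L1 only:", small)
    print("lines walked per flush-all, L1 + L2:", large)
    print("cost ratio:", large / small)
```

Under these assumptions the cost of each flush-all grows linearly with total cache size, so the same number of flush calls gets proportionally more expensive once an L2 is included.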
If that is the case, we could try shutting off the L2 in our BSP and see what the performance tradeoff might be. Is it possible to set up the L2 cache as write-through, so that no modified data is stored there, and then invalidate the whole L2 instead of flushing it line by line? This might be an option. It would at least give speed improvements when reading data.
Have you duplicated the RAM copy performance we are seeing?
I do understand the need for cache coherency on the L2, but interestingly we have not run into stability problems yet. I would expect to have problems running applications if the L2 is not being flushed properly.
Richard Hendricks wrote: "Madhvi, if that is the case, we could try shutting off the L2 in our BSP and see what the performance tradeoff might be."
You could try, but I am not sure what the side effects would be.
Richard Hendricks wrote: "Have you duplicated the RAM copy performance we are seeing?"
Yes
Richard Hendricks wrote: "I do understand the need for cache coherency on the L2, but interestingly we have not run into stability problems yet. I would expect to have problems running applications if the L2 is not being flushed properly."
Well, it depends on what applications you are running or testing. Again, it's not something we would recommend as a workaround.
I am wondering, do you know the details of where CE is trying to do that cache flushing? I am curious, because it seems like the amount of cache flushing would have to be horrendous to cause such a performance degradation. Are they flushing after every 4 bytes or something? After all, the flush routine should be able to run through very quickly, since after the first pass much of the L2 should be invalid and not result in a write to the system RAM, even if they are attempting to flush the full cache. Hmmm. If they are doing a full cache flush instead of just flushing specific memory areas, then that could be a problem, I guess, since the source area for the read/write would also be invalidated.
Sorry, I don't have the details, but you could post the question to the MSDN forums for more information.
The only information I can offer is that "flush all D-cache" is called by the kernel 25000+ times on my setup while writing a 32 MB file with 256 KB buffers (my storage memory is around 88 MB).
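For a sense of scale, the figures quoted above can be turned into a flush rate. This is only arithmetic on the numbers in this post; 25000 is treated as a lower bound because the count was reported as "25000+", and `flush_stats` is an illustrative helper name.

```python
# Figures quoted in the post above.
FILE_SIZE = 32 * 1024 * 1024   # 32 MB file
BUFFER_SIZE = 256 * 1024       # 256 KB per write call
FLUSH_CALLS = 25_000           # "25000+" flush-all calls, so a lower bound


def flush_stats(file_size, buffer_size, flush_calls):
    """Relate the number of flush-all calls to the amount of data written."""
    writes = file_size // buffer_size
    return {
        "writes": writes,
        "flushes_per_write": flush_calls / writes,
        "bytes_per_flush": file_size / flush_calls,
    }


if __name__ == "__main__":
    for name, value in flush_stats(FILE_SIZE, BUFFER_SIZE, FLUSH_CALLS).items():
        print(f"{name}: {value:.1f}")
```

That works out to 128 write calls, at least about 195 flush-all calls per 256 KB write, or at most about 1.3 KB of file data per flush-all. So it is not a flush every 4 bytes, but it is far more often than once per write call.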
So in this case (file I/O) a 500 MHz ARM11 beats an ARM A8 at 800 MHz?
You cannot compare two processors or products by processor speed alone. The issue here is not that the hardware is incapable, but that the software is not optimized for the newer processors.
I couldn't agree more... that's why benchmarks are so helpful. This one certainly points to an area that needs optimization.
The NAND read performance is poor in WinCE. Is it related to the L2 cache issue?
Below is an entry from the CE 6.0 Monthly Update for February 2011. Can this resolve the issue?
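For anyone who wants to compare boards on equal terms, the measurement itself is simple. Below is a minimal sketch of a write-throughput test using the 32 MB file and 256 KB buffer sizes mentioned earlier in the thread. It is written in Python so it can be tried on a host machine; on a CE target the same loop would have to be written against the native file API instead. `measure_write_throughput` is a hypothetical helper name, not part of any BSP.

```python
import os
import tempfile
import time


def measure_write_throughput(path, total_bytes, buffer_bytes):
    """Write total_bytes to path in buffer_bytes chunks.

    Returns (bytes_written, elapsed_seconds).
    """
    buf = bytes(buffer_bytes)  # zero-filled source buffer
    written = 0
    start = time.perf_counter()
    # buffering=0 keeps Python from adding its own buffer layer.
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            chunk = buf[: min(buffer_bytes, total_bytes - written)]
            written += f.write(chunk)
    elapsed = time.perf_counter() - start
    return written, elapsed


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        n, secs = measure_write_throughput(path, 32 * 1024 * 1024, 256 * 1024)
        print(f"wrote {n} bytes at {n / (1024 * 1024) / secs:.1f} MB/s")
    finally:
        os.remove(path)
```

Running the same file size and buffer size on each platform, against the same kind of storage, gives numbers that can be compared directly instead of relying on clock speed.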
110211_KB982563 - This update implements L2 cacheable page table support for ARM processor.
Simply applying the monthly update did not help.
I'm seeing similarly slow file copies on OMAP4430 with WE7.
Is there any suggested fix to improve file copying performance to match i.MX31?