
TI OMAP35X WinCE BSP 6.15 - Why is OMAP35X RAM Access much slower than RAM Access on IMX31?

Using OMAP35X WinCE 6.15 BSP, access to RAM is extremely slow.

The test case simply copies a 10MB file from Temp to "Application Data". Both directories are mapped to RAM, as the image uses a RAM-based Object Store. Compression is disabled. It takes about 8 seconds for the transfer to complete. The same time, or even longer, has been seen on the Mistral board. Running the unified A8 BSP on the Mistral board also takes a large amount of time.

In contrast, on an IMX31 processor running Windows CE 6.0 R3, the same transfer completes in under a second.

 

  • This should not be the case. Do you have data from some standard benchmark such as BMQ or QBench on the two platforms? Is anything else running on the system when you run these tests? Do you have DVFS enabled? What OPP are you running the processor at? There are a lot of factors that can contribute to the (observed) differences - therefore it is important to have a standard way to compare.

    thanks

    Atul

  • The OMAP35X BSP has been minimized and only includes GPIO, Ethernet, Serial Debug Only, PMIC Root Driver and RTC. The performance is still as bad as with the full BSP. Also, OEMIdle() has been simplified to a simple return. Running the perfalyzer on the OMAP35X does not show any major activity; it is in IDLE_STATE for about 90% of the time, similar to the IMX31.
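    For reference, a minimal sketch of what "simplified to a simple return" means here (an illustration, not the actual BSP source):

    #include <windows.h>

    // OEMIdle() normally puts the CPU into a low-power wait until the next
    // interrupt; for this experiment it returns immediately, so the idle path
    // cannot be the source of the slowdown.
    void OEMIdle(DWORD dwIdleParam)
    {
        UNREFERENCED_PARAMETER(dwIdleParam);
        return;
    }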

  • Can you try to run the BMQ test (from here: http://www.benchmarkhq.ru/english.html?/be_ppc.html) on the two platforms? We have our published numbers for it here:

    http://processors.wiki.ti.com/index.php/WinCE_Comparative_Benchmarks

    This way we can see if something unique is going on in your setup.

     

    Atul

     

  • Atul,

    BMQ numbers are:

    ===============================

    OMAP35X

     

    ===============================

    2006-01-01 00:02:04

    -------------------------------

    CPU Type : StrongARM

    OS Ver.  : CE 6.00 (Build 0)

    Platform : 

     

     Inte.  2695

     Float   737

     Draw    796

     Window  284

     Memory 3523

     -----------

     Total  1607

     

     

    ===============================

    IMX.31

    ===============================


    2006-01-01 12:04:07

    -------------------------------

    CPU Type : StrongARM

    OS Ver.  : CE 6.00 (Build 0)

    Platform : MX31 Platform

     

     Inte.  1259

     Float   583

     Draw    341

     Window  358

     Memory 1345

     -----------

     Total   777

  • Your numbers seem to be comparable to what we see. However, these BMQ results show the OMAP is 2.5 times faster than the IMX as far as the memory benchmark is concerned - so I am confused about your original claim.

     

    Atul

  • Eugen,

    Is file_cache enabled in both the BSPs? Since you are doing a file-based operation, I am wondering if that plays a role.

    Aparna

  • I meant cache_manager in the catalog. Please ignore the file_cache term in my previous post.

    Aparna

  • Aparna,

    do you mean adding SYSGEN_CACHEFILT to the BSP? If so, yes, I have enabled this item in the catalog and it made no difference. This entry adds file caching to the FSD, via common.reg.

    What is also true is that the IMX31 does not have it enabled.

    And yes, using telnet on both systems (OMAP35X and IMX31) shows that the IMX31 copies the file almost instantaneously, in under a second, whereas the OMAP35X takes 8 seconds for the same 10 MB file.

    Unfortunately I can't simply send you an IMX31 kit for demonstration, but it does behave much better, at the FS level.

    Eugen
  • It would be good if you run your comparison on the two systems with similar/comparable settings. Here is one discussion where enabling file system caching can have an adverse effect:

    http://www.eggheadcafe.com/software/aspnet/32470191/readwrite-performance-in-ce60-with-file-cache-manager.aspx

    We have also observed this while running some CETK tests.

    I am assuming you are using memory mapped files, so file system caching should not come into play, but you never know. Just to have a fair comparison, please make sure all such system settings are the same.

     

    Atul

  • Atul,

    I am working with Eugen on this issue as well. Have you tried copying a large file into \TEMP on Windows CE 6.00? We originally started this by trying to copy files from a USB memory stick into \TEMP, and seeing a 32 MB file take 26.5 seconds. This has been duplicated on Adeneo's and BSquare's BSPs for the OMAP35x. During our analysis, Eugen focused on the filesystem aspect, and I attempted to focus on the low-level USB aspect. It is very strange: nominally it takes ~10 ms for 64 KB of data to be returned by the USB subsystem, but then the upper levels of the filesystem take 50+ ms to send the next request. Also, at the lowest level those 64 KB groups of data seem to have random instances where additional delay is added, ranging from 5-15 ms, cutting the performance at that level in half.

     

    Based on what we see with QBench, it does not detect this performance degradation, perhaps due to the small transfer sizes, but the degradation is very clear when transferring a large file. QBench reports a ~55 MB/sec transfer rate, slightly faster than the MX31.

     

    One other interesting effect we noticed was that copying a 32 MB file from USB to \TEMP takes 26.5 seconds, and a 32 MB file from an SD card (class 6) to \TEMP also takes 26.5 seconds. BUT, copying a 32 MB file from USB to SD takes only 18 seconds. A similar test on the MX31 is ~9 seconds, and on a PC is ~4 seconds.
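    (If my arithmetic is right, 32 MB in 26.5 seconds is only about 1.2 MB/s, and 32 MB in 18 seconds is about 1.8 MB/s, versus roughly 3.5 MB/s on the MX31 and 8 MB/s on the PC - far below the ~55 MB/sec that QBench reports.)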

     

    One other, possibly related symptom is that deleting large files from \TEMP takes a very long time.  Deleting a 32 MB file from \TEMP can take more than 6 seconds, for example.

     

     

  • Hello Richard,

    The original thread started with an issue related to RAM access, which we now know is not the case. With that out of the way: filesystem performance is a characteristic of the WinCE core OS and has nothing to do with the BSP. The only knobs we can control are the ones provided by Microsoft. Therefore, is it possible for you to run these experiments with the filesystem cache disabled (because you have it disabled on the IMX) and see what numbers you get?

     

    thanks

    Atul

  • Hi Atul,

    As mentioned previously, enabling or disabling the File Cache Manager did not make any difference. The time it took - both ways - to copy the file from Temp to Application Data was the same in both cases.

    Thanks,

    Eugen

  • Atul,

      Can you test a file copy with a 32 MB file?

  • Ok we will run some experiments and get back to you

    thanks

    Atul

  • Thanks Atul, we will let you know if we find anything on our end.

  • Atul,

    please note that we are using a RAM-based Object Store, which will be relevant while conducting the measurements.

    Thanks,

    Eugen

  • Have you tried using a RAMDISK instead of the RAM object store? The RAM object store is very obscure and we don't really know what's going on in there, especially regarding file caching. RAMDISK is made of pure memory allocation and you can be sure that disabling file caching will be effective. Reuse/adapt code \WINCE600\PUBLIC\COMMON\OAK\DRIVERS\BLOCK\RAMDISK in your BSP and see if you see more relevant results.

  • Atul,

    yes, last week I did mount the RAMDISK in WinCE 6.0. I used SYSGEN_RAMDISK=1 as a project env variable. RAMDISK.dll was brought in, and its registry entries created. However, it did not make any difference. It used the default 1 MB ramdisk size.

    There is another option, to load the ramdisk dynamically and resize it. I did not try that, as I could not see any difference with the first option. I will give it a try.

  • Mounting the RAMDISK dynamically, and also allocating different sizes for the RAMDISK, yielded no performance improvements.

    It seems that there is a systemic problem, because regardless of Object Store optimizations (file compression disabled, storage manager paging disabled, usage of RAMDISK, to name a few), there are no improvements.

    (In addition, our other reference platform, IMX31, built on the very same machine with WinCE 6.0 R3, does not use RAMDISK, and yet it copies the file in under a second, whereas on the OMAP it takes 8 seconds.)

  • Adeneo Embedded support team said:

    Have you tried using a RAMDISK instead of the RAM object store? The RAM object store is very obscure and we don't really know what's going on in there, especially regarding file caching. RAMDISK is made of pure memory allocation and you can be sure that disabling file caching will be effective. Reuse/adapt code \WINCE600\PUBLIC\COMMON\OAK\DRIVERS\BLOCK\RAMDISK in your BSP and see if you see more relevant results.

    Do you have performance numbers for copying a 32 MB file into RAMDISK and deleting it?  Please try on your system and get back with us.

  • GetTickCount() was used to measure the time it takes the CopyFile() function to copy a 16 MB file from \Temp to \Application Data.

    (Note I had to use a 16 MB file, since there is not sufficient RAM to hold two 32 MB files.)
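    For reference, a minimal sketch of this instrumentation (the file name and the RETAILMSG reporting are assumptions, not the exact test code):

    #include <windows.h>

    void TimeCopy(void)
    {
        DWORD start, elapsed;

        start = GetTickCount();
        // Copy the 16 MB test file from \Temp to \Application Data;
        // FALSE allows an existing destination file to be overwritten.
        if (!CopyFile(TEXT("\\Temp\\test16mb.bin"),
                      TEXT("\\Application Data\\test16mb.bin"), FALSE))
        {
            RETAILMSG(1, (TEXT("CopyFile failed, error %u\r\n"), GetLastError()));
            return;
        }
        elapsed = GetTickCount() - start;
        RETAILMSG(1, (TEXT("CopyFile took %u ms\r\n"), elapsed));
    }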

    16 MB CopyFile() Results:

    **NO RAMDISK**

      • First Copy:   9756 ms
      • Second Copy: 13216 ms

    **RAMDISK, 3 MB RAM Disk**

      • First Copy:  15210 ms
      • Second Copy: 18604 ms

     

     

  • Eugen

    Even though the cache filter (file cache manager) is enabled in the catalog, it is not used by the RAM file system (object store) by default. Only FATFS (i.e. USB, SD) uses it by default. To enable the cache filter for the RAM filesystem, please add the following registry setting to your platform.reg.

    [HKEY_LOCAL_MACHINE\System\StorageManager\AutoLoad\ObjectStore\Filters\CacheFilt]
        "Dll"="cachefilt.dll"
        "Order"=dword:68000000
        "LockIOBuffers"=dword:1

    In our experiments, we found that writing a 32MB file to \temp dir took ~13 secs before. With cache filter enabled, it takes ~0.475 secs. Please try this change and let us know if you see similar improvements.
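    For reference, a minimal sketch of this kind of write test (the file name, 256 KB buffer size and reporting are assumptions, not the actual test code):

    #include <windows.h>

    #define TEST_SIZE  (32 * 1024 * 1024)   // 32 MB total
    #define CHUNK_SIZE (256 * 1024)         // written in 256 KB chunks

    void TimeObjectStoreWrite(void)
    {
        static BYTE buf[CHUNK_SIZE];        // dummy data to write
        DWORD i, written, start, elapsed;
        HANDLE hFile;

        hFile = CreateFile(TEXT("\\Temp\\write32mb.bin"), GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return;

        start = GetTickCount();
        for (i = 0; i < TEST_SIZE / CHUNK_SIZE; i++)
            WriteFile(hFile, buf, CHUNK_SIZE, &written, NULL);
        elapsed = GetTickCount() - start;
        CloseHandle(hFile);

        RETAILMSG(1, (TEXT("Wrote %u bytes in %u ms\r\n"), TEST_SIZE, elapsed));
    }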

     

    -Madhvi

  • Madhvi,

    I did try using the [HKEY_LOCAL_MACHINE\System\StorageManager\AutoLoad\ObjectStore\Filters\CacheFilt] key while trying various routes to enable the file cache manager. No improvements.

    I am wondering about your numbers: are you sure you are using a RAM-based Registry and not the Hive-Based Registry? The reason I am asking is that the Mistral board does not have a whole lot of RAM, and by default it is set to use the Hive-Based Registry (see \WINCE600\OSDesigns\EVM_3530\EVM_3530\EVM_3530.pbxml, for instance).

    If the answer is that you do use the RAM-Based Registry, please send me a copy of the platform.reg, platform.bib, config.bib and your platform's .pbxml, as well as its catalog file .pbcxml.

     

    Thanks,

    Eugen

  • How much RAM is available on your setup after the kernel is up? Are you running these experiments on the TI OMAP35x EVM? If yes, what rev are you using?

    We have a Hive-based registry, but it is still stored in the Object Store (RAM filesystem). Also, how would that affect file read/write operations to the filesystem (and not the registry), unless you are running really low on RAM? Have you tried the Hive-based registry? Does it give you better performance while reading/writing to the RAM filesystem?

    -Madhvi

  • The experiments are executed on a LOGIC SOM-LV, which uses the OMAP35X.

    The system uses 128 MB of physical RAM. The "RAM" section in config.bib is assigned 78 MB. The default FSRAMPERCENT is set to 0x40404040. When the kernel is up, there are 18224 KB assigned towards Storage Memory, and 54684 KB towards Program Memory.
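    (If I read FSRAMPERCENT correctly, each byte gives the number of 4 KB blocks per 1 MB of RAM reserved for the file system, so 0x40 = 64 x 4 KB = 256 KB per MB, i.e. about 25% of the 78 MB RAM section, or roughly 19 MB - which matches the ~18 MB of Storage Memory reported above.)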

    Hive image behaves the same.

     

  • Eugen

    My experiments were on the Mistral board with 256 MB of total RAM. After boot-up I had ~88 MB of Storage memory and ~88 MB of Program memory. So in your experiments:

    1. how much is the used memory for both Storage and Program when the kernel is up (before you do any experiments)?

    2. What is the file size that you are copying to temp dir?

    3. Do you have the exact same RAM usage/allocation for IMX31 setup?

     

    I experimented with the program/storage memory on my setup - I found that if I reduce the storage memory to 18MB (like yours), then copying a 32MB file obviously takes more time (since the kernel tries to grab some memory for storage from program memory during the copy operation). If I reduce the amount of program memory too, then the time to copy increases. So that proves my earlier point: if you don't have enough RAM to support the experiments that you are trying to do, then the system will behave slowly.

     

    -Madhvi

  • Madhvi

    1. I have switched to a SOM that uses 256MB of RAM as well. I've configured config.bib to use an FSRAMPERCENT of 0x80808080: after booting, the system has 100MB of Storage Memory and 100MB of Program Memory.

    2. What is the file size that you are copying to the temp dir? 16MB exactly.

    3. The IMX.31 has 128MB of physical RAM, out of which 70MB is dedicated as "RAM" in config.bib.

     

    To recap: all suggested options did not show any improvements on our OMAP35X platform.

     

    Since you are seeing good performance on the Mistral board, please send us the following to be able to replicate it, and then identify a solution based on it:

    1. The name of the BSP you are running: bspsource_omapwince_06_15_00 or BSP_WINCE_ARM_A8_01_00_00. It is not clear which one you are using. Please note we are using bspsource_omapwince_06_15_00.

    2. For #1 above, all files that you had customized to see the improved performance: 

    • platform.reg
    • config.bib
    • project's batch file (.bat)
    • platform.bib (if any changes)
    • anything else that was changed

    Eugen

     

     

  • Madhvi,

    I have run Adeneo's A8 BSP on the Mistral board, with your suggested changes (adding the [HKEY_LOCAL_MACHINE\System\StorageManager\AutoLoad\ObjectStore\Filters\CacheFilt] key/subkeys to platform.reg).

    It DOES NOT yield the results you are seeing.

    These are the numbers seen, for a 16MB file:

    • without the entry above: 13.181 seconds
    • with the entry above: 17.133 seconds

    What are we missing on our end?

    Thanks,

    Eugen

  • Eugen

    I ran my experiments in a Windows Embedded Compact 7 environment (since that is my current development setup). I will go back to WinCE 6 and try to reproduce the issue you are seeing. I will get back to you once I find something.

    -Madhvi

  • Madhvi,

      I am downloading Windows Embedded Compact 7 now and will try doing an EVM build without and with your recommended changes above.   I will let you know of my findings.

  • Hi Madhvi,

    This issue - accessing the file system taking an excessive amount of time - has been identified as being related to an attempt to improve data cache flushing, implemented by both the 6.X and A8 BSPs. The implementation is not adequate, causing poor RAM performance for file I/O.

    The fix consists of defaulting to Microsoft's stock Windows CE ARM cache flush routines (OALFlushDCache and OALFlushDCacheLines).
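    A minimal sketch of what that change amounts to, assuming the standard OAL cache helpers (an illustration only, not the actual BSP source):

    #include <windows.h>
    #include <oal.h>

    void OEMCacheRangeFlush(VOID *pAddress, DWORD dwLength, DWORD dwFlags)
    {
        if (dwFlags & (CACHE_SYNC_WRITEBACK | CACHE_SYNC_DISCARD))
        {
            // Use the stock Microsoft ARM helpers: flush the whole data cache,
            // or only the lines covering the requested range.
            if ((pAddress == NULL) && (dwLength == 0))
                OALFlushDCache();
            else
                OALFlushDCacheLines(pAddress, dwLength);
        }

        if (dwFlags & CACHE_SYNC_INSTRUCTIONS)
            OALFlushICache();
    }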

    Thank you,

    Eugen

  • Eugen

    Using the Microsoft routines to flush the data cache is not the right workaround, since those routines only take care of the L1 cache. The routines in the BSP take more time because they take care of both the L1 and L2 caches. Using only the Microsoft routines could therefore lead to cache coherency issues. We will not be recommending these changes for our BSPs.

    Now for the original problem: Microsoft acknowledges that the excessive data flushing is a known issue of the Object Store (RAM-based filesystem); it needs to flush the cache constantly to prevent data loss in case of an accidental reset/power failure (assuming RAM still maintains its data).

    Per Microsoft's suggestion, the possible workaround is to use RAMDISK or RAMFMD mounted as root, if the Object Store is not a hard requirement. The idea of using RAMDISK/RAMFMD is to replace the Object Store and mount it as root (i.e. set SYSGEN_FSROMONLY=1). Unless you exclude the Object Store, you will still suffer from the excessive data cache flushing issue. That is because the cache size on modern CPUs keeps getting larger, so the flush-all penalty is amplified.

    -Madhvi

  • Madhvi,

    If that is the case, we could try shutting off the L2 in our BSP and see what the performance tradeoff might be. Is it possible to set up the L2 cache as write-through, so no modified data is stored there, and instead of flushing the L2 line by line, invalidate the whole L2? This might be an option. It would at least give speed improvements when reading data.

     

      Have you duplicated the RAM copy performance we are seeing?

     

    I do understand the need for cache coherency on the L2, but interestingly we have not run into stability problems yet.  I would expect to have problems running applications if the L2 is not being flushed properly.

  • Richard Hendricks said:

    Madhvi,

      If that is the case, we could try shutting off the L2 in our BSP and see what the performance tradeoff might be.

    You could try - I am not sure what the side-effects would be.

    Richard Hendricks said:

      Have you duplicated the RAM copy performance we are seeing?

    Yes

    Richard Hendricks said:

    I do understand the need for cache coherency on the L2, but interestingly we have not run into stability problems yet.  I would expect to have problems running applications if the L2 is not being flushed properly.

    Well, it depends on what applications you are running/testing. Again, it's not something we would recommend as a workaround.

     

  • Madhvi,

    I am wondering, do you know the details of where CE is trying to do that cache flushing? I am curious, because it seems like the amount of cache flushing would have to be horrendous to cause such a performance degradation. Are they flushing after every 4 bytes or something? After all, the flush routine should be able to run through very quickly, since after the first pass much of the L2 should be invalid and not result in a write to system RAM, even if they are attempting to flush the full cache. Hmmm. If they are doing a full cache flush instead of just flushing specific memory areas, then that could be a problem, I guess, since the source area for the read/write would also be invalidated.

  • Sorry, I don't have the details - but you could post the question to the MSDN forums for more information.

    -Madhvi

  • The only information I can help you with is that the "flush all D-cache" routine is called by the kernel around 25,000+ times on my setup while writing a 32 MB file with 256 KB buffers (and my Storage memory is around 88 MB).
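    (For rough scale, assuming 256 KB write buffers: 32 MB / 256 KB = 128 write calls, so ~25,000 full D-cache flushes works out to roughly 200 flush-all operations per 256 KB written - which would go a long way toward explaining the copy times reported earlier in this thread.)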

    -Madhvi

  • So in this case (file IO) a 500 MHz ARM11 beats an ARM Cortex-A8 @ 800 MHz???

     

  • You cannot compare two processors/products just by the processor speed. The issue here is not that the h/w is incapable, but that the software is not optimized for the newer processors.

    -Madhvi

  • I couldn't agree more ... that's why benchmarks are so helpful. It sure points to an area that needs optimization.

  • The NAND read performance is poor in WinCE. Is it related to the L2 cache issue?

    Below is the update from the CE 6.0 Monthly Update, Feb 2011; can this resolve the issue?

    110211_KB982563 - This update implements L2 cacheable page table support for ARM processor.

  • Simply applying the monthly update did not help.

  • I'm seeing a similarly slow file copy on OMAP4430 with WE7.

    Is there any suggested fix to improve file copying performance to match i.MX31?

  • I am not sure if it is the same, but when I compare these two platforms, the i.MX233 (I don't know how it is with the i.MX31) has a DCP (Memory Copy, Crypto, and Color-Space Converter) coprocessor; if I remember correctly, the i.MX233 has an optimized memcpy() function that uses this coprocessor.