RM48L952: memory access performance

Part Number: RM48L952
Other Parts Discussed in Thread: HALCOGEN

Team,

I am posting on behalf of my customer as this is urgent:

In our project we are struggling severely with the computing performance of the RM48L952. After inspection, it seems that the bottlenecks are in the EMIF accesses to the external SDRAM and in accesses to the internal code flash. What should we pay attention to in order to realize the full potential performance of these interfaces?

Additional info/questions received in the meantime:

After inspection it seems that the bottleneck is the EMIF access to the external SDRAM. According to our tests, reading from the SDRAM is about 10 times slower than from internal RAM and writing is about 2 times slower (test code attached).

SDRAM: ISSI IS42S16800

Processor: RM48L952, silicon rev D

Open questions:

  1. Is it possible to run the SDRAM at a higher frequency than 55 MHz?
  2. According to the user documentation, the “normal memory type” mode (NORMAL_OINC_SHARED) speeds up the interface. Is it possible to configure silicon rev D to this mode for the SDRAM?
  3. The memory configuration is attached. Is there something that could be done to improve the performance?

 Best regards,

Frank

  • Hello Frank,

    1. The maximum EMIF clock for the synchronous memory is 55 MHz (minimum cycle time = 18 ns).
    2. Yes, the synchronous memory (SDRAM) region can be configured as the normal memory type. (A configuration sketch follows below.)
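
    A minimal sketch of what programming one MPU region for the SDRAM window as Normal memory could look like on the Cortex-R4F. The region number, base address and size are assumptions (they must match the real memory map and must not collide with regions HalCoGen or SafeRTOS already program), the attribute/size encodings follow the ARM Cortex-R4 TRM, and the GCC-style inline assembly would need to be adapted for the IAR/TI compilers:

    /* Hypothetical sketch: configure one Cortex-R4F MPU region so that the EMIF
       SDRAM window is treated as Normal, outer/inner write-back write-allocate,
       non-shared memory. Run in a privileged mode, before the MPU/OS is enabled. */
    #include <stdint.h>

    #define SDRAM_REGION_NUM   6u                    /* assumed free MPU region          */
    #define SDRAM_BASE_ADDR    0x80000000u           /* assumed EMIF SDRAM window        */
    #define SDRAM_REGION_SIZE  ((0x19u << 1) | 1u)   /* 2^(25+1) = 64 MB, region enabled */
    #define SDRAM_ACCESS_CTRL  0x030Bu               /* AP=full access, TEX=001, C=1, B=1,
                                                        S=0: Normal OIWBWA, non-shared   */

    static inline void mpu_sdram_region_normal(void)
    {
        __asm__ volatile("mcr p15, 0, %0, c6, c2, 0" :: "r"(SDRAM_REGION_NUM));   /* RGNR          */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 0" :: "r"(SDRAM_BASE_ADDR));    /* region base   */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 4" :: "r"(SDRAM_ACCESS_CTRL));  /* access ctrl   */
        __asm__ volatile("mcr p15, 0, %0, c6, c1, 2" :: "r"(SDRAM_REGION_SIZE));  /* size + enable */
        __asm__ volatile("dsb");
        __asm__ volatile("isb");
    }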
  • Hello,

    We have now experimented a bit with the new memory settings, and reads are still slow (at least in some cases)...

    Q1: Does it matter which variant of the NORMAL memory type is configured when the RM48 does not have a cache? My guess after looking through the ARM documentation for the R4 (I understood only part of what I read) is that it does not matter...
    Background for the question:
    SafeRTOS does not list all the options (it looks like all the SHARED options are missing from portmpu.h, values 6, 7 and 0xC to be exact), but it still lists many more options than HalCoGen. The HalCoGen GUI does not allow choosing the values that SafeRTOS uses (to be exact, the GUI does not show any options that contain "other attributes").

    Brief testing does not show any differences between 2, 3, 8 and 0xB (which are selectable via the SafeRTOS header). The SHARED values can of course also be set manually (though presumably there is some logical reason why they are missing from the SafeRTOS headers); 0xC also looks to work like any other value.

    For its own internal MPU regions (which protect kernel data) in RAM, SafeRTOS uses 0x0B (MPU_NORMAL_OIWBWA_NONSHARED); that value is in an OS header which is not allowed to be changed.


    Q2: Is the DMA completely independent of these MPU-related memory settings? Can any memory type, SHARED or NON_SHARED, be set even when the DMA accesses the same memory locations simultaneously with the CPU?
    Background for the question:
    Changing the memory region from MPU_STRONGLYORDERED_SHAREABLE to NORMAL accelerated the write speed significantly, but the DMA speed stays the same. We have understood that the DMA is a separate bus master with its own "rules and functionality", so the observed effect is most likely valid/expected?

    With DMA it takes roughly 300 us to move 12000 bytes of data from internal to external memory regardless of the configuration, whereas the same move with memcpy() takes ~150 us with NORMAL and ~380 us with STRONGLY_ORDERED.


    Q3: Why is the read speed from the SDRAM not only slow but extremely slow?
    Background for the question:

    We have simple tests: the access instructions are repeated with a macro and the PMU is used to count clock cycles (with compensation), and the results are double-checked with a microsecond counter. We also verified from the disassembly that the instructions are replicated correctly and that no loops exist (a sketch of such a measurement harness is shown after the numbers below):
    a) read from internal/external
    b) write to internal/external
    c) move from int to ext / ext to int
    d) move inside int / ext

    a&b) r: 1000 clks | w: 2492 clks | r_ext: 68025 clks | w_ext: 1000 clks
    c) moveto_ext: 2000 clks | moveto_int: 68026 clks
    d) moveinside_int: 3492 clks | moveinside_ext: 68026 clks

    Microsecond-timer double checks (there is easily ±1 us of uncertainty since the tick can change at any moment, so a difference of about 220 clks is acceptable; also the us time is not compensated and the timestamps are taken before and after the PMU start and stop):
    a&b) Took: 5 us ==> 1100 clks
    Took: 12 us ==> 2640 clks
    Took: 310 us ==> 68200 clks
    Took: 6 us ==> 1320 clks

    c) Took: 10 us ==> 2200 clks
    Took: 309 us ==> 67980 clks

    d) Took: 17 us ==> 3740 clks
    Took: 310 us ==> 68200 clks

    As can be seen, the microsecond-timer cross-checks match the PMU measurements very well.
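
    For reference, a minimal sketch of the kind of measurement harness described above; the DO10/DO1000 macros and pmu_* helpers are illustrative names, not the actual attached test code, and the CP15 accesses use GCC-style inline assembly (the IAR/TI compilers have their own intrinsics). It must run in a privileged mode:

    #include <stdint.h>

    #define DO10(x)    x x x x x x x x x x
    #define DO1000(x)  DO10(DO10(DO10(x)))           /* 1000 copies, no loop/branches */

    static inline void pmu_start(void)
    {
        uint32_t v = (1u << 0) | (1u << 2);          /* PMCR: enable PMU, reset cycle counter */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v));
        v = (1u << 31);                              /* PMCNTENSET: enable PMCCNTR            */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(v));
    }

    static inline uint32_t pmu_cycles(void)
    {
        uint32_t c;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(c));   /* read PMCCNTR */
        return c;
    }

    uint32_t measure_ext_read(volatile uint32_t *pu32Addr)
    {
        pmu_start();
        DO1000( (void)*pu32Addr; );                  /* 1000 back-to-back LDRs from SDRAM   */
        return pmu_cycles();                         /* subtract the measured start/stop
                                                        compensation from this value        */
    }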


    Q4: Why do read operations from the SDRAM look to be highly dependent on what has been done before?
    Background for the question:
    We changed the write test to contain a read after each write, resulting in an STR, LDR, STR, LDR pattern instead of pure STR, STR, STR:
    DO1000( *pu32Addr = u32Value;);
    ->
    DO1000( *pu32Addr = u32Value; (void)*pu32Addr;);

    One would expect that 'w_ext' would now take an eternity like 'r_ext' does, but it does NOT:
    r: 1003 clks | w: 8000 clks | r_ext: 68024 clks | w_ext: 6000 clks

    Then we also modified the read test to contain a write, and suddenly we do not have any performance problems at all:
    DO1000( (void)*pu32Addr; )
    ->
    DO1000( *pu32Addr = 0; (void)*pu32Addr; )
    r: 7000 clks | w: 7000 clks | r_ext: 6000 clks | w_ext: 6000 clks


    Q5: The PMU measures CPU clks, so how can the CPU write in 1 clk to the EMIF, whose clock is 1/4 of the CPU clock (and similarly read, as in Q4)? We are just wondering whether there are mistakes in the test setup, even though we have triple-checked everything and verified that the writes really reach the SDRAM. Does the pipeline somehow explain this, together with the store buffer, and what other "hidden secrets" might there be?

    === end of questions ===

    As can be seen from Q3 and Q4, it is quite difficult to say what impact the read slowness noticed in Q3 (and absent in Q4) has on the application. It looks like an individual read has a chance of being either fast or very slow, and which one is quite random... Based on testing, it looks like making the read speed consistently "fast" could have quite a big effect, since just changing the memory type from STRONGLY_ORDERED to NORMAL, which accelerated writes, already reduced the application CPU load by ~5 percentage points (the average was ~88%, now it is ~83% with the same external stimulus to the device).

    Just for information, with STRONGLY_ORDERED the original test results were (the write was also long):
    r: 1000 clks | w: 2492 clks | r_ext: 68026 clks | w_ext: 56023 clks
    moveto_ext: 56024 clks | moveto_int: 68028 clks
    moveinside_int: 3492 clks | moveinside_ext: 124022 clks

    We also noticed that memcpy() (in IAR) uses STM and LDM instructions, while our own for loop of course uses STR and LDR. However, when optimizing the for loop a bit (with DO10, i.e. dividing the number of rounds by 10 and moving 10 uint32 values per round), compared to memcpy which moves 4 registers (== 4 uint32) per LDM/STM, the int->ext performance was practically the same, but for ext->int the memcpy LDM approach looked superior, being ~2x faster than single LDRs (12000 bytes moved; see the sketch after the numbers below).

    for loop with DO10:
    ext->int: copy took: 928 us
    int->ext: copy took: 152 us
    real memcpy() function, which looks to use LDM and STM (4 regs):
    ext->int: memcpy took: 499 us
    int->ext: memcpy took: 152 us
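
    A hedged sketch of the idea: copy in wider chunks so the compiler is free to emit LDRD/STRD or LDM/STM instead of one LDR/STR per 32-bit word, which is what appears to make the library memcpy() about 2x faster in the ext->int direction. The function name and chunking are illustrative, and both pointers must be 8-byte aligned for the 64-bit accesses:

    #include <stdint.h>

    void copy64(uint64_t *dst, const uint64_t *src, uint32_t bytes)
    {
        uint32_t n = bytes / sizeof(uint64_t);
        while (n >= 4u)                 /* 32 bytes per iteration                 */
        {
            dst[0] = src[0];            /* the compiler may fuse these into       */
            dst[1] = src[1];            /* LDM/STM or LDRD/STRD pairs, giving     */
            dst[2] = src[2];            /* longer back-to-back EMIF accesses      */
            dst[3] = src[3];
            dst += 4; src += 4; n -= 4u;
        }
        while (n > 0u)                  /* remaining 8-byte chunks                */
        {
            *dst++ = *src++;
            n--;
        }
    }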

    Attached is the test code in which the original testing logic can be seen and reproduced (it is run from a SafeRTOS task because the OS configures/enables the MPU at OS start); we have not tried it on the evaluation board with CCS. The us-time measurements can be removed, as they were only there to verify that the PMU works and measures correctly. When taking the compensation measurement I first used all four counters (started and stopped one at a time, of course) and checked whether each gives the same compensation; they did not, and even between compilations a single counter (EVT1) looked to vary by ±2 clks, maybe for some pipeline-related reason. I also noticed that a measurement over a single instruction could give a "negative" value when subtracting the compensation (taken just before the actual measurement) from the result. I got the feeling that the achievable accuracy is not one clk (maybe some pipeline- and store-buffer-draining instructions would be needed before measuring), which is why the tests use 1000 instructions, to minimize the effect of the ±2 clk measurement error.

    Regards,
    Jarkko

    6064.testcode_1000.c

  • Hi Jarkko,

    I have added my comments inline below, prefixed with ">>".

    Regards,

    Sunil

    -------------------------------------------------------------------------------------------

    We have now experimented a bit with the new memory settings, and reads are still slow (at least in some cases)...

    Q1: Does it matter which variant of the NORMAL memory type is configured when the RM48 does not have a cache? My guess after looking through the ARM documentation for the R4 (I understood only part of what I read) is that it does not matter...

    >> You are right. The only thing that makes a difference in the accesses is whether the memory is configured as a "normal", "device" or "strongly-ordered" type.


    Q2: Is the DMA completely independent of these MPU-related memory settings? Can any memory type, SHARED or NON_SHARED, be set even when the DMA accesses the same memory locations simultaneously with the CPU?

    Background for the question:

    Changing the memory region from MPU_STRONGLYORDERED_SHAREABLE to NORMAL accelerated the write speed significantly, but the DMA speed stays the same. We have understood that the DMA is a separate bus master with its own "rules and functionality", so the observed effect is most likely valid/expected?

    >> The "shared" or "non-shared" type does not affect concurrent accesses between CPU and DMA. This setting is to manage multiple CPUs accessing shared memory regions. The DMA has its own MPU in order to prevent spurious writes to memory regions based on addresses.

    With DMA it takes roughly 300 us to move 12000 bytes of data from internal to external memory regardless of the configuration, whereas the same move with memcpy() takes ~150 us with NORMAL and ~380 us with STRONGLY_ORDERED.

    >> The memory type configuration mostly affects the write speeds for the CPU. For the DMA accesses, try using the largest element size (64-bit) so that you can use the packing/unpacking feature of the DMA.
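
    >> For illustration, a sketch of such a transfer under the assumption that the HalCoGen-generated DMA driver (sys_dma.h) is in use; the channel, trigger, addresses and counts are placeholders, and the source/destination should be 8-byte aligned. Check your generated headers (and whether PORTASGN needs to be set on your device) before using anything like this:

    #include "sys_dma.h"

    void dma_copy_int_to_ext(uint32 src, uint32 dst, uint32 bytes)
    {
        g_dmaCTRL ctrl = {0};                 /* DMA control packet               */

        ctrl.SADD      = src;                 /* source: internal RAM             */
        ctrl.DADD      = dst;                 /* destination: EMIF SDRAM          */
        ctrl.FRCNT     = 1u;                  /* one frame                        */
        ctrl.ELCNT     = bytes / 8u;          /* element count, 64-bit elements   */
        ctrl.RDSIZE    = ACCESS_64_BIT;       /* widest read element size         */
        ctrl.WRSIZE    = ACCESS_64_BIT;       /* widest write element size        */
        ctrl.TTYPE     = FRAME_TRANSFER;
        ctrl.ADDMODERD = ADDR_INC1;
        ctrl.ADDMODEWR = ADDR_INC1;
        ctrl.AUTOINIT  = AUTOINIT_OFF;

        dmaEnable();
        dmaSetCtrlPacket(DMA_CH0, ctrl);
        dmaSetChEnable(DMA_CH0, DMA_SW);      /* software-triggered transfer      */
    }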

    Q3: Why is the read speed from the SDRAM not only slow but extremely slow?

    >> The external memory is "far" from the CPU in terms of its location in the interconnect, so each read access takes 20+ cycles to complete. In addition, the SDRAM clock is limited to 55 MHz. These factors together define the SDRAM read performance. The typical use case would be to read data from the external memory at start-up / power-down or on infrequent occasions during the application.



    Q4: Why do read operations from the SDRAM look to be highly dependent on what has been done before?

    >> I will have to check on this and get back to you. Do you read back from the location that you just wrote to before proceeding with the next write?



    Q5: The PMU measures CPU clks, so how can the CPU write in 1 clk to the EMIF, whose clock is 1/4 of the CPU clock (and similarly read, as in Q4)? We are just wondering whether there are mistakes in the test setup, even though we have triple-checked everything and verified that the writes really reach the SDRAM. Does the pipeline somehow explain this, together with the store buffer, and what other "hidden secrets" might there be?

    >> Yes, writes to the internal store buffer take a single CPU cycle. This is only possible with the external memory configured as "normal" type. The CPU is then free to move on to the next instruction, while the write to the external memory is "managed" by the interconnect elements.
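
    >> As an illustration (not taken from the attached test code): if you want the cycle count to include the time for the buffered writes to actually reach the SDRAM, a DSB (data synchronization barrier) can be issued before reading the counter; it stalls the CPU until all outstanding writes have completed. DO1000/pmu_start/pmu_cycles are the illustrative helpers from the sketch earlier in the thread, and the inline assembly is GCC-style:

    static inline void drain_write_buffer(void)
    {
        __asm__ volatile("dsb" ::: "memory");    /* wait for buffered writes to complete     */
    }

    uint32_t measure_ext_write_drained(volatile uint32_t *pu32Addr, uint32_t u32Value)
    {
        pmu_start();
        DO1000( *pu32Addr = u32Value; );
        drain_write_buffer();                    /* ensure the last STR really reached SDRAM */
        return pmu_cycles();
    }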


  • Thanks for the answers! I am guessing that we no longer have to focus on the memory types. Practically only the reading part still has issues (and see also, at the bottom of this post, that writes sometimes now look to take a lot more time than they usually do).

    Sunil Oak said:

    >> I will have to check on this and get back to you. Do you read back from the location that you just wrote to before proceeding with the next write?

    Yes, we read from the same location that was just written (the variables/pointers are declared volatile, so a new fetch should always occur even though 0 was just written). Also note that the 1000 original read operations were back to back and all of them were performed on the same address.

    Note: the order looks to be important. With the write (STR) first, the result is this (same as in the previous post, repeated here for easier comparison):
    DO1000( *pu32Addr = 0; (void)*pu32Addr; )
    r: 7000 clks | w: 7000 clks | r_ext: 6000 clks | w_ext: 6000 clks

    And if we put the write last, so that the 1st LDR has no preceding STR and the remaining 999 LDRs each have an STR in front of them, we get an extra 88 CPU clks, which should practically describe the execution time of that first LDR instruction...
    DO1000( (void)*pu32Addr; *pu32Addr = 0;  )
    r: 6995 clks | w: 7000 clks | r_ext: 6088 clks | w_ext: 6000 clks

    Sunil Oak said:

    >> The external memory is "far" from the CPU in terms of its location in the interconnect, so each read access takes 20+ cycles to complete. In addition, the SDRAM clock is limited to 55 MHz. These factors together define the SDRAM read performance.

    Q6: I assume that the 20+ cycles are CPU cycles; why does it then look to take 88 CPU clks to perform one LDR from the SDRAM, as the experiment above shows (the original test log said 68000 CPU clks for 1000 reads, which is also 68 clks/read)?

    Those ~20+ CPU clks would be OK and understandable, but as can be seen above, it looks to take 88 or 68 CPU cycles... According to the TRM (SPNU503C, March 2018), the EMIF cycles are RAS and then CAS; we have a CAS latency of 2, so that is 4 EMIF clks altogether, and reading 32 bits of data requires 2 EMIF clks on top of those 4, for a total of 6 EMIF clks, which is 24 CPU clks. According to the same TRM, writing should be 3 EMIF clks faster (there is no CAS latency and the data can be given together with CAS), i.e. 12 CPU clks for a single 32-bit write.

    That is what I meant by an "extremely" slow read: it looks to take ~20 EMIF clks to read something, and then sometimes it does not take that long...

    PS. We found one minor error in the original test code: u32MoveToRam() was missing the "++" on the destination in DO1000( *pu32DestAddr = *pu32SourceAddr++; ). Fixing this adds some delay to the "moveto_ext" and "moveinside_ext" results, since the destination address now changes. In the assembly it is still one STR instruction, just with an offset added as a new argument (it was STR R0, [R5] and is now STR R0, [R5, #0x...]; the LDR is/was the same LDR R0, [R4, #0x..] the whole time).
    Here are the "fixed version" results, with the functionality as originally intended:
    c: moveto_ext: 12900 clks | moveto_int: 68024 clks
    d: moveinside_int: 3492 clks | moveinside_ext: 106891 clks

    moveto_ext now takes 6x longer when the destination address steps forward instead of being fixed (it was 2000 clks).
    The moveinside_ext result roughly doubled (it was 68026, the same as the pure read delay). I cannot quite understand how the destination always being the beginning of the array could speed up the process, since the read address was still advanced properly the whole time. In both cases the read never hits the same address as the write, as it did in the pure write/read tests of Q4...

    Sunil Oak said:

    >> Yes, writes to the internal store buffer take a single CPU cycle. This is only possible with the external memory configured as "normal" type. The CPU is then free to move on to the next instruction, while the write to the external memory is "managed" by the interconnect elements.

    Can the store buffer and interconnect really handle 1000 writes in a row? An EMIF write should take at least 12 CPU clks, so by the end there should be ~900 items queued after 1000 have been issued, with only ~100 of them processed. Maybe it helps / makes a difference that both the write address and the data to write are fixed?

    Now that the copy is fixed and the destination address also changes, it looks to take 12.9 CPU cycles/write (moveto_ext: 12900 clks). That falls quite well within the expected EMIF cycles and would prove that the store buffer and interconnect really do have an impact in certain situations, which makes testing/evaluation much harder.

    But now the move inside the SDRAM also takes an 'eternity' (moveinside_ext: 106891 clks; it was 68026 clks before), so it increased by about 40000 clks, i.e. 40 clks/transfer. And as the int->ext move above shows, a write should only take ~12 clks, but in this case it takes ~40 clks... One would think that the total time should have increased by only ~12000...

    Regards,
    Jarkko

  • We have tweaked the SDRAM timings a bit (this looks to affect 0 to 10-12 CPU clks, i.e. up to ~3 EMIF clks, depending on what is being done) and also removed the automatic refresh (this looks to help by 4 CPU clks/operation == 1 EMIF clk).

    The automatic refresh needs to be removed manually, since HalCoGen enables it (due to emif#5 in the errata?) but does not restore it in emif_SDRAMInit(), which is empty.
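
    For reference, a sketch of where such manual adjustments could live, assuming the HalCoGen EMIF register overlay (emifREG with SDCR/SDRCR/SDTIMR/SDSRETR fields, named as in the TRM) and the empty emif_SDRAMInit() hook mentioned above. The values are placeholders, not validated IS42S16800 settings, and the SR bit position should be verified against the TRM:

    #include "emif.h"

    /* Placeholders: derive these from the IS42S16800 datasheet and the
       SDTIMR/SDSRETR/SDRCR field layouts in the TRM. */
    #define SDRAM_SDTIMR_VALUE   0x00000000u
    #define SDRAM_SDSRETR_VALUE  0x00000000u
    #define SDRAM_SDRCR_VALUE    0x00000000u

    void emif_SDRAMInit(void)
    {
        emifREG->SDTIMR  = SDRAM_SDTIMR_VALUE;    /* tRFC/tRP/tRCD/tWR/tRAS/tRC/tRRD         */
        emifREG->SDSRETR = SDRAM_SDSRETR_VALUE;   /* self-refresh exit timing (tXSR)         */
        emifREG->SDRCR   = SDRAM_SDRCR_VALUE;     /* refresh period in EMIF_CLK cycles       */
        emifREG->SDCR   &= ~(1uL << 31);          /* clear SR (self-refresh), assumed bit 31 */
    }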

    With the timings tweaked to really match the SDRAM specs:

    r: 1003 clks | w: 2492 clks | r_ext: 56033 clks | w_ext: 1000 clks // r_ext: 12000 faster, w_ext: store buffer

    moveto_ext: 12871 clks | moveto_int: 57985 clks // read 10000 faster

    moveinside_int: 3492 clks | moveinside_ext: 88923 clks // 12000 faster


    Then, with the SDCR SR bit also cleared after the sys_startup() call, on top of the timing tweaks:

    r: 1000 clks | w: 2492 clks | r_ext: 52100 clks | w_ext: 1000 clks // r_ext: another 4000, w_ext: store buffer

    moveto_ext: 12834 clks | moveto_int: 54109 clks // move to SDRAM (write): "no effect" (~100), move from SDRAM (read): 4000

    moveinside_int: 3492 clks | moveinside_ext: 80693 clks // read & write = 2*4000

    The original (with the normal-mode change) without any other changes was:

    r: 1000 clks | w: 2493 clks | r_ext: 68023 clks | w_ext: 1000 clks // read: 16000 improvement from this one

    moveto_ext: 12900 clks | moveto_int: 68025 clks // read: 14000 improvement from this one

    moveinside_int: 3492 clks | moveinside_ext: 106891 clks // 20000 improvement from this one


    So "memcpy" takes now following times (not focusing anymore to single repeated write or read since store buffer etc)
    - inside memory takes now 80 CPU clk => 20 EMIF clk // read&write
    - to memory 12,8 CPU clk => 3 EMIF clk // write
    - from memory 54 CPU clk => 13,5 EMIF clk // read

    Is that cap of 20-(13,5+3) = 3,5 EMIF clk explainable somehow, why read,write,read,write... pattern takes more time than  read,read,read...+ write write write..?
    Also read itself looks to take much longer than write what comes to the expected based on the TRM (24CPU clk:s) compared to writes which match to TRM very well

    Regards,
    Jarkko

  • Hi Jarkko,

    I am out of office this week and will get back to working on this next week. I hope to be able to simulate and explain the cycle differences. There may still be some variability depending on the interconnect activity from other bus masters which won't be simulated accurately, but the actual access time numbers will be fairly accurate.

    Regards,

    Sunil

  • Hi,

    I am on holiday for the next 4 weeks, but someone else from our side may/should monitor the progress (we are in the holiday season too). Hopefully they will also insert a couple of pictures that illustrate the problem properly/clearly (I do not have them myself), but we have consistently seen on the bus analyzer that a single read itself goes OK and the burst is terminated as it should be, BUT after every read there are 9 EMIF clks of "NOPs", and only after those "NOPs" does the next read start... Basically the "NOPs" could/would explain the slowness, but we cannot figure out why they are there in the first place...

    Regards,
    Jarkko

  • Hello Sunil,

    These images are from the same SDRAM read cycle; I have annotated the commands on one of them. As you can see, there are quite a lot of NOP commands that we have not got rid of yet.

    Best regards,

    Leevi Hokkanen

  • Hello Sunil,

     

    I have continued the investigation for a while now and have come to the point where I can make little to no progress anymore. So I will try to explain what I have found so far and also the code behind it. The actual code is attached as a file, and I will explain step by step what I expect to happen and what actually happens. After that, all I can do is wait for an explanation of what is going on and whether it is possible to achieve the behaviour we are after.

     

    I have tested the External Memory Interface performance using the Hercules RM48x Development Kit. The idea was to have no excess code, so that it is easier to see what is going on. I used HALCoGen (the HAL Code Generator) to configure all the necessary registers to get the CPU up and running. Then I configured the following EMIF registers manually so that I am completely sure which timing values are used:

    - SDRAM Configuration Register (SDCR)
    - SDRAM Refresh Control Register (SDRCR)
    - SDRAM Timing Register (SDTIMR)
    - SDRAM Self-Refresh Exit Timing Register (SDSRETR)

     

    Then I started to use the EMIF straight away. I did not follow the SDRAM initialization sequence, but I did not find any mention of an effect on EMIF-side performance if it is skipped, and the purpose of this test was to figure out the EMIF performance, not so much how reliable the SDRAM is.

     

    Then I made two simple loops (a sketch is shown below). The first one wrote ten 32-bit values (from 0 to 9) to the SDRAM, incrementing the address by one between writes. The second loop read back the same data that was written in the first loop.
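
    A sketch of those two loops (the attached 2311.Source.c is the authoritative version); the SDRAM base address of 0x80000000 is an assumption taken from the EMIF SDRAM window in the device memory map and should be checked:

    #include <stdint.h>

    #define SDRAM_BASE  ((volatile uint32_t *)0x80000000u)   /* assumed EMIF SDRAM window */

    void sdram_write_read_test(void)
    {
        volatile uint32_t readback;
        uint32_t i;

        for (i = 0u; i < 10u; i++)       /* first loop: ten incrementing writes */
        {
            SDRAM_BASE[i] = i;
        }

        for (i = 0u; i < 10u; i++)       /* second loop: read the data back     */
        {
            readback = SDRAM_BASE[i];
            (void)readback;
        }
    }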

    The results were different from what I expected. Since the EMIF uses an 8-column burst mode (if I read the datasheet correctly), I expected to see the following commands during the write loop:

    - WRITE
    - 7 NOPs
    - WRITE
    - NOP
    - BURST TERMINATE

    But instead I got the following commands, repeating 10 times:

    - WRITE
    - NOP
    - BURST TERMINATE
    - NOP

    So as can be seen, expectations and reality did not meet here. The EMIF did not write eight 16-bit values before the BURST TERMINATE; instead it terminated after every 32 bits (the data size used).

    The following oscilloscope images show what is happening between the MCU and the SDRAM. The first image contains all 10 write and 10 read cycles, and the subsequent images are zoomed into different parts of that same record.

    10 write and 10 read cycles between cursors

    10 write cycles between cursors, rest are 10 read cycles

    All 10 write cycles between cursors

    First write cycle zoomed between cursors

    All 10 read cycles between cursors

    First read cycle zoomed between cursors

    The results changed slightly when I used the same code but wrote 64-bit values instead of 32-bit values into the SDRAM. All 64 bits were written before the burst was terminated, and then the next value was written using a new WRITE command.

    The following oscilloscope images show what is happening between the MCU and the SDRAM. The first image contains all 10 write and 10 read cycles, and the subsequent images are zoomed into different parts of that same record.

    10 write and 10 read cycles between cursors

    10 write cycles between cursors, rest are 10 read cycles

    All 10 write cycles between cursors

    First write cycle zoomed between cursors

    All 10 read cycles between cursors

    First read cycle zoomed between cursors

    Based on these results (and many other measurements) I have a few ideas about what could be wrong, but I cannot verify them.

    I understood that the burst continues as long as there are “pending” requests from the CPU side. But it seems like the burst only continues for as long as the data type being written into the SDRAM. That seems like an odd requirement and would not make much sense, since the data-type size limit would be hit before an 8-column burst is ever reached.

    It could be that the CPU is so slow that it cannot generate enough requests to the EMIF, and that is why the EMIF terminates the writes and reads: it simply does not see any more requests from the CPU. But it seems like something is wrong if a 220 MHz core cannot keep up with a 55 MHz memory interface. And this is as simple as code gets; any practical application requires more complicated code.

    The CPU slowness could be caused by the CPU waiting for a response from the EMIF, and if that is the case then I want to know how to get rid of it. That would not make much sense either, because then the burst mode would be totally useless.

    The CPU could also be slow if it does not get the next instructions fast enough. I am not an expert when it comes to the internal operation of microcontrollers and CPUs, but the cache should take care of that, right?

     

    So as you can see, the more I keep looking into this issue, the more questions I have. Some professional help is needed.

    2311.Source.c

     

    Best regards,

    Leevi Hokkanen

    Edit 1: Added the missing source file.

    Edit 2: Changed the 32-bit write/read oscilloscope images to the correct ones (they were the same as the 64-bit write/read images).

    Edit 3: Changed the source code to use a 32-bit data type for the SDRAM read/write, as was used in testing.

  • Hi Leevi,

    I did not find the file attached with the code. Is the CPU using STM (store multiple) instructions to write to the external memory? Is the external memory defined as "normal" type using the MPU? This should be the best case for write operations.

    I am working with the design team to understand some of the dependencies on the interconnect configuration.

    Regards,
    Sunil
  • Hello Sunil,

    If only a uint64_t could hold the number of times I have forgotten to attach a file after saying I would...
    But now you can find that source code file at the end of my previous post.


    Best regards,
    Leevi Hokkanen

  • Hi Sunil,

    So far I have tried using the STM instruction, normal mode via the MPU, and the DMA (and a mix of all of these). The longest burst write is 4 EMIF CLK cycles, obtained with the STM instruction, which is still only half of what is expected. Any other write method terminates the burst based on the data type size (uint32_t -> burst terminate after 2 EMIF CLKs, uint64_t -> burst terminate after 4 EMIF CLKs).


    Best regards,
    Leevi Hokkanen