AM5K2E04: EMIF Performance

Part Number: AM5K2E04

Hi,

Can you review the topic below and comment on the six questions at the end of this post?

We are developing a new board with a Keystone2 processor. Several analog and digital input and output devices are attached to the processor's EMIF16. To meet our system performance requirements, read and write accesses to these devices must have very short latency, so each access must be as short as possible. Since the accesses on the EMIF are mostly random or read-modify-write, we cannot use DMA burst transfers. Additionally, some of our digital I/O devices need to use the wait signal.

 

In the past we made EMIF16 performance measurements on the K2EVM-HK development board with the XTCI6638K2KXAAW processor.

We observed the following EMIF performance values (16-bit read) with extended wait mode enabled:

 

 

Length of Chip Enable (CE) low pulse: ~26 ns

Distance between rising edge of CE and falling edge of next CE access: ~30 ns

=> Access duration = ~56 ns

Note: On the evaluation board we made measurements on the expansion header without a device on slave side.

The wait signal was hard-wired on the expansion header to be inactive. Clocking: SYSCLK1 = 1167.36 MHz, ARMCPUCLK = 1375 MHz.

We used Linux-3.8.4-g42865b7. The test application only generates accesses to the EMIF Interface in a loop.

 

Based on these results we decided to use the Keystone2 processor for the new board design. We chose the AM5K2E04XABDA4 because we only need the ARM cores; the DSPs are not necessary.

Now we are investigating the EMIF performance on our board, and we see that it is far below the results we achieved with the development board.

 

 

Length of CE low pulse: ~22 ns

Distance between rising edge of CE and falling edge of next CE access: ~111 ns

=> Access duration = ~133 ns

Note: On our board we made the measurements under the same conditions as previously on the evaluation board.

The slave device on the EMIF16 bus is an FPGA that acknowledges the access immediately. The wait signal is always inactive.

Clocking: SYSCLK1 = ARMCPUCLK = 1400 MHz.

We use RT-Linux. The test application only generates accesses to the EMIF interface in a loop.

 

We are wondering whether there are differences between the two processors, or whether the problem lies in our register settings.

Below are the EMIF register settings we used on both processors:

 

EMIF settings on the development board:

RCSR         = 0x40000205
AWCCR        = 0xf0040080 (WP1=1, WP0=1, CS5_WAIT=0, CS4_WAIT=0, CS3_WAIT=1, CS2_WAIT=0, MAX_EXT_WAIT=128)
A1CR         = 0x04422218
A2CR         = 0x40100081 (SS=0, EW=1, W_SETUP=0, W_STROBE=1, W_HOLD=0, R_SETUP=0, R_STROBE=1, R_HOLD=0, TA=0, ASIZE=1)
A3CR         = 0x00000000
A4CR         = 0x00000000
IRR          = 0x0000000C
IMR          = 0x00000000
IMSR         = 0x00000000
IMCR         = 0x00000000
NANDFCR      = 0x00000001
NANDFSR      = 0x00000003
PMCR         = 0xfefefefe
NFECCCS2     = 0x00000000
NFECCCS3     = 0x00000000
NFECCCS4     = 0x00000000
NFECCCS5     = 0x00000000
NANDF4BECCLR = 0x0000033f
NANDF4BECC1R = 0x00000000
NANDF4BECC2R = 0x00000000
NANDF4BECC3R = 0x00000000
NANDF4BECC4R = 0x00000000
NANDFEA1R    = 0x00000000
NANDFEA2R    = 0x00000000
NANDFEV1R    = 0x00000000
NANDFEV2R    = 0x00000000

 

EMIF settings on our own design:

RCSR         = 0x40000205
AWCCR        = 0xf0040080 (WP1=1, WP0=1, CS5_WAIT=0, CS4_WAIT=0, CS3_WAIT=1, CS2_WAIT=0, MAX_EXT_WAIT=128)
A1CR         = 0x7ffffffd
A2CR         = 0x40100081 (SS=0, EW=1, W_SETUP=0, W_STROBE=1, W_HOLD=0, R_SETUP=0, R_STROBE=1, R_HOLD=0, TA=0, ASIZE=1)
A3CR         = 0x7ffffffd
A4CR         = 0x7ffffffd
IRR          = 0x0000000d
IMR          = 0x00000000
IMSR         = 0x00000000
IMCR         = 0x00000000
NANDFCR      = 0x00000000
NANDFSR      = 0x00000000
PMCR         = 0xfefefefe
NFECCCS2     = 0x00000000
NFECCCS3     = 0x00000000
NFECCCS4     = 0x00000000
NFECCCS5     = 0x00000000
NANDF4BECCLR = 0x00000000
NANDF4BECC1R = 0x00000000
NANDF4BECC2R = 0x00000000
NANDF4BECC3R = 0x00000000
NANDF4BECC4R = 0x00000000
NANDFEA1R    = 0x00000000
NANDFEA2R    = 0x00000000
NANDFEV1R    = 0x00000000
NANDFEV2R    = 0x00000000

 

Can you please help us reach the EMIF performance of the development board on our own design?

Here are our detailed questions:

  1. As described above, the gap between CE pulses is much longer on our new design than on the development board (about 110 ns vs. 30 ns). Do you have an explanation for this behavior? Can you help us improve the values on our new design?
  2. We observed different behavior among the four CE signals of the EMIF. CE0, CE1 and CE2 behave identically, but on CE3 the write access duration (access start to next access start) is about 20 ns shorter than on the other CE signals (79 ns vs. 97 ns). Can you confirm this behavior?
  3. Since we are using the “wait enabled mode”, we also investigated the “wait disabled mode”. On the development board the access duration decreases from 56 ns to 46 ns after disabling the wait mode. On our new design the access duration decreases from 140 ns to 77 ns after disabling the wait mode. Compared to the development board, the wait mode obviously has a huge influence on the access duration on our new design. Do you have an explanation for this difference? How do we reach the values from the development board?
  4. We observed another unusual behavior in the “wait disabled mode”. If we increase the strobe length in the A2CR register, the CE pulse length increases accordingly, but the gap between the CE pulses increases as well. We expected only the pulse length to be affected by the strobe setting in the A2CR register. Can you explain this behavior?
  5. In the “wait enabled mode” we observed a further unusual behavior. According to the datasheet, the wait signal must be inactive for at least 2 cycles to complete an access. We observed that neither 2, 3 nor 4 inactive wait cycles were enough to complete a read access. In our case we need to hold the wait signal inactive until the read access is completed (until the rising edge of OE). Only then can we activate the wait signal again for the next read access, and we additionally have to add 3 hold cycles in the A2CR register to make sure that the next access recognizes the wait signal at all. In our opinion this behavior does not match the documentation. Are we using the wait signal correctly? Do you have any explanation for this behavior, or can you tell us how to avoid it?
  6. One last question regarding the PLL: According to Figure 4-1 on page 14 of the datasheet, the quad-core ARM A15 version of the Keystone2 (the one we are using: AM5K2E04XABDA4) should have a Main PLL and an ARM PLL. But in the further description an ARM PLL is not mentioned any more. It seems that the quad core has only a Main PLL, like the dual core (see Figure 4-2 on page 15). Is that correct?

  • Hi Frederik,

    1) The timing of the access to the external device is controlled by the EMIF16 interface IP. The timing between accesses depends on delays in accessing that IP and on the software running on the device. Accesses to the EMIF16 were not prioritized for streaming data into the device, so we didn't optimize that path. You mentioned that you measured the performance with different boards and devices using different software builds. There are too many variables to point to any one thing that is affecting the delays observed.

    2) Access to the CEs should be the same, but CE3 does have one different tie-off inside the part. CE3 has the ability to generate byte write enable signals for wider parts. This feature wasn't found to be useful, so we didn't document it, but the different tie-off may affect the start or end of the state machine, causing a different delay time.

    3) The external wait signal is an external asynchronous input. The internal state machine must wait for the signal to be latched before it can detect it. The system clock on the development board is operating at a lower frequency than on your board. The higher frequency means that the wait will be detected more quickly on your board. This does have an impact on how quickly the access is completed but doesn't affect the delay between accesses.

    4) I have no explanation for this behavior. Did you observe this was consistent for every setting of the strobe length? 

    5) I will have to do some research on this question. All the waveforms I have seen captured using the wait signal have the wait going low to end the cycle and staying low until the cycle has ended. I agree that this doesn't appear to match what is in the data manual. The internal documentation that I have matches the timing in the data manual. Can you provide a scope capture of the access with the wait signal? How were you generating the timing for the wait signal?

    6) That looks like a carryover from earlier documents. The K2E doesn't have a separate ARM PLL. The main PLL is used for both the ARMs and the system.

    Regards, Bill

  • Hi Bill,

    1) It seems we have found out why the EMIF accesses on the development board were faster than on our own board. Our software colleagues investigated the old U-Boot of the development board and found the following code:

    K2_AEMIF_PERF_DEGRADE_ERRATA_FIX
    {
            u32 tmp;

            /* Read-modify-write: set bit 31 of the register at offset 8 */
            tmp = __raw_readl(CONFIG_AEMIF_CNTRL_BASE + 8);
            tmp |= 0x80000000;
            __raw_writel(tmp, CONFIG_AEMIF_CNTRL_BASE + 8);
    }

    After adding this code to the U-Boot of our own board, the EMIF accesses seem to be as fast as on the development board. It seems to be an undocumented feature. We found the following case in the TI E2E forum: http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/550371 . This case relates to a different processor (C6678), but seems to describe the solution to our problem. The code above puts the EMIF16 into an asynchronous mode with lower latency. But why is that mode not documented? Are there any problems with this mode on the Keystone2? We would like to use this mode; do you have any concerns?

     

    2) Question is solved, thanks.

     

    3) Maybe this question will be resolved with the asynchronous mode; we will check that.

     

    4) Yes, it is consistent for every setting of the strobe length.

     

    5) Ok, we will do a scope capture and come back to you afterwards.

     

    6) Question is solved, thanks.

     

    Best Regards,

    Stefan

  • Hi Stefan,
    The EMIF16 IP is a generic block used in many devices and includes a DDR memory controller. The K2E only uses the async memory interface portion of that IP. By default, there is a timer associated with the DDR portion for refresh. Setting that bit disables the timer, which prevents the internal state machines from stalling the async memory interfaces to perform the unnecessary refresh. I do recall hearing about this on the C6678, but I didn't realize it had been added to U-Boot.

    I will close this thread for now but please post if you gather any scope captures associated with the wait signal. I will post if I hear from the design team on my questions.
    Regards, Bill
  • Hi Bill,

    We are a little unsure about using the asynchronous mode. Why is it not documented in the official manual? Should we expect any problems with this mode? We would like to use it in our application. Is that OK, or do you have any concerns?

    Best regards, Stefan

  • Hi, Stefan,

    It is documented in C6678 Errata Usage Note 25. K2E uses C6678 DSPs.
    In MCSDK 3.1.4.7 (Kernel 3.10), the u-boot git log shows the change for this errata.

    commit 703322d188a4efb98137c6a26b3abc1e3e10daec
    Author: Murali Karicheri <m-karicheri2@ti.com>
    Date: Thu Jun 13 11:20:40 2013 -0400

    keystone2: implement aemif errata performance degradation fix

    Following errata for keystone2 devices exist for emif16 and this patch
    implements the work around suggested in the errata. The errata is
    described with title "Performance degradation for asynchronous accesses
    caused by an unused feature enabled in EMIF16".

    Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>

    Rex
  • Hi Stefan,
    There are no issues with the async mode. The IP will operate as documented with the one modification you found in U-Boot. I agree that the wait mode could use some additional documentation, and I will look into that once I hear back from design, but operation without wait mode works as documented. Note that we don't specify any limits on the delays between accesses, since that depends on factors outside the IP.
    Regards, Bill