NAND flash performance on c6748

ColinL

Other Parts Discussed in Thread: OMAP-L138, OMAPL138

Hi,

I'm using the OMAP-L138 dev board from Logic PD with a Numonyx NAND02G 256 MB NAND flash chip (equivalent to the one the board ships with).

Does anyone have numbers on the read/write speeds they are getting with a C6748 (or similar)? At the moment I am getting ~2.7 MB/s read and write speeds (read is slightly slower) using DSP/BIOS with YAFFS as the file system (I wrote the driver myself). At first I was only getting ~255 KB/s, but bumping the EMIF bus from 25 MHz (the default in the GEL file) up to 100 MHz gave much better performance, and tuning the read/write strobe and hold timings improved it further. I am still testing, but at the moment I have:

write strobe = 7

write hold = 6

write setup = 0

read strobe = 3

read hold = 2

read setup = 0

turnaround = 2

I have tried with both select strobe mode and normal mode but it doesn't seem to make much of a difference.

The board is designed with an 8-bit bus. Would I expect approximately double the speed if I were to use a 16-bit configuration?

The write speed seems alright, but I was expecting around 5 MB/s read speed.

Using a logic analyser, the actual frequency on the data bus was still only around 25 MHz, but I can't bump the EMIF bus up further as 100 MHz is the limit.

over 15 years ago

0 Mukul Bhatnagar over 15 years ago

TI__Guru* 83935 points

I will ping the NAND driver/file system experts to look at this thread.

FYI, you could use the NAND performance benchmarks provided for OMAPL138, which is a pin-compatible ARM+DSP offering in the same family as the C6748. The data is for the 8-bit NAND on the LogicPD EVM kit:

http://processors.wiki.ti.com/index.php/DaVinci_PSP_03.20.00.11_Device_Driver_Features_and_Performance_Guide#Performance_Benchmarks_2

I would've thought looking at http://elinux.org/File_Systems that YAFFS should've been slightly better compared to the data given for JFFS2 (there could be other differences also from ARM vs DSP, DSP/BIOS vs Linux side performance considerations?)

Regards

Mukul

0 RandyP over 15 years ago

TI__Guru* 84110 points

Since your "write setup = 0", you are stating the values you programmed into the EMIFA register rather than the actual timing values, each of which is always at least 1 cycle. So adding all three write parameters gives 13, plus one extra cycle for each of the three fields gives a total of 16 cycles for each byte written.

At 100MHz, this means the theoretical max transfer rate is 100/16 = 6.25MB/s. So your 2.7MB/s measurement looks pretty reasonable considering the overhead associated with writing to a NAND Flash.

For reads, your theoretical max is quite a bit higher. 8 cycles per read (0 + 3 + 2 programmed, plus the same extra 3) is a max rate of 100/8 = 12.5MB/s. The loss of nearly 80% efficiency is probably due to the wait signal and the OS overhead, but that depends on what you are using to calculate your 2.7MB/s numbers.
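The arithmetic above can be reproduced with a short sketch. The function names are made up for illustration, and the model only covers setup, strobe and hold; it ignores turnaround, the wait signal and all software overhead, so it gives an upper bound, not a prediction.

```python
def cycles_per_access(setup, strobe, hold):
    """EMIF clock cycles for one asynchronous access.

    Each programmed field counts as (value + 1) cycles, which is where
    the "extra 3" in the calculation above comes from.
    """
    return (setup + 1) + (strobe + 1) + (hold + 1)


def max_rate_mb_s(emif_mhz, setup, strobe, hold, bytes_per_access=1):
    """Upper bound on the transfer rate in MB/s for the given timings."""
    return emif_mhz * bytes_per_access / cycles_per_access(setup, strobe, hold)


# Register values from the original post, EMIF clock at 100 MHz.
print("write:", max_rate_mb_s(100, setup=0, strobe=7, hold=6), "MB/s")
print("read: ", max_rate_mb_s(100, setup=0, strobe=3, hold=2), "MB/s")
```

With the posted settings this gives 16 cycles per written byte and 8 cycles per read byte, matching the 6.25 MB/s and 12.5 MB/s figures.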

ColinL said:
The board is designed with an 8-bit bus. Would I expect approximately double the speed if I were to use a 16-bit configuration?

Yes. But only for the transfer time. The overhead and wait delays would probably stay the same.
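To put a rough number on that, here is a small estimate. It assumes (and this is only an assumption) that everything beyond the theoretical bus-transfer time is fixed overhead that a wider bus does not shrink. The 2.7 MB/s and 12.5 MB/s inputs are the figures quoted in this thread; the function name is hypothetical.

```python
PAGE_BYTES = 2048  # block size used for the estimate; it cancels out of the result


def rate_after_widening(measured_mb_s, bus_max_mb_s, width_factor=2):
    """Estimated MB/s if only the bus-transfer share of the time shrinks."""
    total_us = PAGE_BYTES / measured_mb_s     # 1 MB/s equals 1 byte per microsecond
    transfer_us = PAGE_BYTES / bus_max_mb_s   # time spent purely on bus cycles
    overhead_us = total_us - transfer_us      # assumed not to change with bus width
    new_total_us = overhead_us + transfer_us / width_factor
    return PAGE_BYTES / new_total_us


print(round(rate_after_widening(2.7, 12.5), 2), "MB/s")
```

Under that assumption a 16-bit bus would move the measured read rate from 2.7 MB/s to only about 3.0 MB/s, because most of the time is not spent on the bus cycles themselves.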

ColinL said:
Using a logic analyser, the actual frequency on the data bus was still only around 25 MHz, but I can't bump the EMIF bus up further as 100 MHz is the limit.

Are you looking at EMA_CLK or control or address or data lines? If you are looking at EMA_CLK and it is not 100MHz, then you did not change it to 100MHz. If you are looking at any other signals, they will always be slower than EMA_CLK.

0 ColinL over 15 years ago

Prodigy 130 points

The 25 MHz was measured on the read enable line.

I've gotten the read timings down to 2 for strobe and 1 for hold; any lower and I get errors (as one would expect).

I'm also using cache read mode on the NAND, which has been slightly faster than normal read mode, but not significantly.

0 ColinL over 15 years ago in reply to ColinL

Prodigy 130 points

did some further testing.

the main read loop to fetch the 2k page from the data bus takes ~530us.

however, i did a test reading the NANDFCR register 2k times and that took ~370us, and that doesn't depend on read strobing so it should be able to operate much quicker than that.

reading 2k from a random address in ddr to a local array takes ~30us, which is fast as expected.

reading 2k from the revision id MPU register to a local array takes ~200us, which seems pretty slow.

Considering how long it takes just to read various memory locations, I'm thinking that maybe there is something else that needs configuring that is not specifically related to the EMIF. The standard GEL file originally had the EMIF bus at only 25 MHz, so it's obviously not optimised, and maybe there is something else in there that I need to change.
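Converting the totals above into per-access times makes the comparison easier. This sketch assumes "2k" means 2048 accesses and, for the NAND case, one byte per access; the names are made up for illustration.

```python
ACCESSES = 2048  # "2k" in the post, assumed to mean 2048 accesses

# Total times in microseconds, as reported above.
measured_us = {
    "NAND data port": 530,
    "NANDFCR register": 370,
    "DDR (random address)": 30,
    "MPU revision ID register": 200,
}


def per_access_ns(total_us, accesses=ACCESSES):
    """Average time per access in nanoseconds."""
    return total_us * 1000 / accesses


def rate_mb_s(total_us, nbytes=ACCESSES):
    """Throughput in MB/s, assuming one byte per access."""
    return nbytes / total_us  # bytes per microsecond equals MB/s


for name, us in measured_us.items():
    print(f"{name}: {per_access_ns(us):.0f} ns per access")

print(f"NAND page read: {rate_mb_s(measured_us['NAND data port']):.2f} MB/s")
```

The NAND loop works out to roughly 259 ns per byte (about 3.9 MB/s for the raw page read), while even the register reads that involve no external strobing cost on the order of 100 to 180 ns each.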

0 Paul51033 over 15 years ago in reply to RandyP

Prodigy 220 points

I notice that we are trying to read the data one word at a time. Is it possible that the SCR is causing our latency problems, as in this post:

http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/112/p/11649/45453.aspx#45453

Probing the NAND flash read enable line shows that a complete read cycle takes about 40-60 ns, but there is a delay of approximately 220 ns before the next read occurs. This makes me think that there is a lot of latency in the path (possibly due to the pipelining of the SCR or similar).
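As a rough cross-check, these scope numbers can be scaled up to a full page. The sketch assumes the 220 ns gap follows every read cycle and that the page is 2048 one-byte reads; the function name is hypothetical.

```python
def page_read_us(cycle_ns, gap_ns, nbytes=2048):
    """Time in microseconds to read nbytes, one byte per (cycle + gap)."""
    return nbytes * (cycle_ns + gap_ns) / 1000


low = page_read_us(40, 220)
high = page_read_us(60, 220)
print(f"estimated page read: {low:.0f} to {high:.0f} us")
print(f"share of time spent in the gap: {220 / 280:.0%} to {220 / 260:.0%}")
```

That gives roughly 532 to 573 us per page, in line with the ~530 us measured for the main read loop, with around 80% of the time spent waiting between reads.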

As far as I can see all the wait cycles are set to zero, but I can only find waits related to extending the read period, not the delay between reads.

If so, how do we configure the bus for a burst of data from the same address (remembering that the memory is a NAND flash)?

I suspect that the EDMA would configure for burst mode, but I am reluctant to configure this at the moment, and your metrics show the CPU utilization at 100%, which leads me to believe that there must be another way of configuring for burst mode.

Thanks,

Paul

0 Brad Griffis over 15 years ago in reply to ColinL

TI__Guru*** 125430 points

ColinL said:
reading 2k from a random address in ddr to a local array takes ~30us, which is fast as expected.

I assume cache is turned on for the DDR. Hence when you start reading from the array the L2 cache controller issues a request through the interconnect for 128 bytes of data which efficiently uses the bursting capabilities of both the DDR and the Switched Central Resource.

ColinL said:
reading 2k from the revision id MPU register to a local array takes ~200us, which seems pretty slow.

I assume this memory is not cacheable. Is that right? That being the case, when you read from this memory range the CPU requests only the single byte of data (vs the 128 bytes you were getting with the cache on). The CPU stalls while the read is completing (as it would from the DDR as well). The big difference here is that when doing reads with the cache enabled you would have a long CPU delay followed by a bunch of consecutive bytes that are "cache hits", so you would not have delays for subsequent words in the cache line. In this case with the cache disabled you will incur this large penalty for each and every read!!!
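The difference can be illustrated by counting how many requests actually have to travel out through the interconnect for a 2 KB read. This is a counting sketch only, with a made-up function name; it says nothing about how long each stall lasts.

```python
LINE_BYTES = 128  # request size quoted above for cached reads


def stalls(nbytes, bytes_per_request):
    """Number of stalled requests needed to fetch nbytes (ceiling division)."""
    return -(-nbytes // bytes_per_request)


cached = stalls(2048, LINE_BYTES)  # one miss per 128-byte request
uncached = stalls(2048, 1)         # one stalled request per byte

print(cached, uncached, uncached // cached)
```

For 2048 bytes that is 16 long stalls with caching versus 2048 without, a factor of 128 in the number of round trips.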

If you need to do a large block read from this memory location I recommend using EDMA. The EDMA will be able to read a big block of data without any gaps.

Brad

0 Brad Griffis over 15 years ago in reply to Brad Griffis

TI__Guru*** 125430 points

Whoops -- I didn't read that second quote carefully! I was gearing my answer toward the NAND reads, not the MPU register... FYI, that's a different issue as that will be on a different bus ("configuration bus") which is not as heavily optimized as the data bus.

0 RandyP over 15 years ago in reply to Paul51033

TI__Guru* 84110 points

Paul said:
Probing the NAND flash read enable line shows that a complete read cycle takes about 40-60 ns, but there is a delay of approximately 220 ns before the next read occurs.

In your initial posting, you said the Read timing values (setup, strobe, hold) are set to 0, 3, 2, resp. This means that at EMIF CLK = 100MHz, the EMA_OEn signal will pulse low for 4 cycles = 40ns, but EMA_CE[N]n will be low for 80ns for the entire read cycle. Depending on where you are measuring the 220ns, some of this is additional time in the EMIF prior to any use of the data inside the DSP. Where are you measuring the 220ns? Between which edges of which signal(s)?
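The signal timings above, and how much of the observed gap the EMIF itself could account for, can be sketched as follows. The interpretation of the 220 ns as "EMA_OEn rising edge to the next EMA_OEn falling edge" is purely an assumption for illustration, since the question of where it was measured is still open; turnaround cycles are ignored.

```python
def read_phases_ns(emif_mhz, setup, strobe, hold):
    """Duration in ns of each read phase; each field counts as (value + 1) cycles."""
    cycle_ns = 1000 / emif_mhz
    return {
        "setup": (setup + 1) * cycle_ns,
        "strobe": (strobe + 1) * cycle_ns,
        "hold": (hold + 1) * cycle_ns,
    }


phases = read_phases_ns(100, setup=0, strobe=3, hold=2)
oe_low_ns = phases["strobe"]                       # EMA_OEn low time
ce_low_ns = sum(phases.values())                   # whole read cycle
oe_high_min_ns = phases["hold"] + phases["setup"]  # OEn high time inside back-to-back cycles

# Hypothetical reading of the scope: 220 ns from OEn rising to the next OEn falling.
unexplained_ns = 220 - oe_high_min_ns

print(oe_low_ns, ce_low_ns, oe_high_min_ns, unexplained_ns)
```

Under that reading, about 40 ns of the gap is hold plus setup time inside the EMIF cycles, leaving roughly 180 ns per access that comes from elsewhere in the path.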

Single reads from the EMIF do take a long time, with peripheral delays, device delays, and internal delays even within the CPU Megamodule. Depending on which address lines you choose for CLE and ALE, you may be able to get better performance from the DSP by reading wider than a byte at a time.
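A simple model suggests why wider reads might help. It assumes, without confirmation from this thread, that the per-access latency is paid once per CPU read regardless of width and that the EMIF turns a wider read into back-to-back byte cycles on the 8-bit bus. The 220 ns and 80 ns defaults are the figures quoted in this thread; the function name is hypothetical.

```python
def page_time_us(access_bytes, gap_ns=220, byte_cycle_ns=80, page_bytes=2048):
    """Estimated page read time when the CPU reads access_bytes at a time."""
    n_accesses = page_bytes // access_bytes
    # One latency gap per CPU access, one bus cycle per byte on the 8-bit bus.
    return (n_accesses * gap_ns + page_bytes * byte_cycle_ns) / 1000


for width in (1, 2, 4):
    print(f"{width}-byte reads: {page_time_us(width):.1f} us per 2 KB page")
```

In this model, going from byte reads to 4-byte reads cuts the page time from about 614 us to about 276 us. Whether wide reads are usable at all depends on how CLE and ALE are wired, as noted above.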

And as Brad has explained, the EDMA is by far the best way to move data from external to internal in all cases other than short accesses. For NAND block reads, EDMA is the right way to go.
