
TDA4VM: QSPI NOR flash performance

Part Number: TDA4VM

Dear experts,

The sf test command we used is shown below:

=> sf test 0 0x10000
SPI flash test:
0 erase: 206 ticks, 310 KiB/s 2.480 Mbps
1 check: 17 ticks, 3764 KiB/s 30.112 Mbps
2 write: 123 ticks, 520 KiB/s 4.160 Mbps
3 read: 18 ticks, 3555 KiB/s 28.440 Mbps
Test passed
0 erase: 206 ticks, 310 KiB/s 2.480 Mbps
1 check: 17 ticks, 3764 KiB/s 30.112 Mbps
2 write: 123 ticks, 520 KiB/s 4.160 Mbps
3 read: 18 ticks, 3555 KiB/s 28.440 Mbps
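
(For reference, the test region is 0x10000 bytes = 64 KiB, and the ticks appear to be milliseconds, so the read figure works out to 64 KiB / 18 ms ≈ 3555 KiB/s, i.e. only about 3.5 MB/s.)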

From the data above, we can see the NOR flash has poor read and write performance. Digging into the details of the code, we learned that the driver uses a plain CPU copy (memcpy_toio()) when writing data to flash, so the write performance seems to be within the normal range.

int cadence_qspi_apb_write_execute(struct cadence_spi_platdata *plat,
				   const struct spi_mem_op *op)
{
	u32 to = op->addr.val;
	const void *buf = op->data.buf.out;
	size_t len = op->data.nbytes;

	/*
	 * Some flashes like the Cypress Semper flash expect a dummy 4-byte
	 * address (all 0s) with the read status register command in DTR mode.
	 * But this controller does not support sending dummy address bytes to
	 * the flash when it is polling the write completion register in DTR
	 * mode. So, we can not use direct mode when in DTR mode for writing
	 * data.
	 */
	if (!plat->dtr && plat->use_dac_mode && (to + len < plat->ahbsize)) {
		memcpy_toio(plat->ahbbase + to, buf, len);
		if (!cadence_qspi_wait_idle(plat->regbase))
			return -EIO;
		return 0;
	}

	return cadence_qspi_apb_indirect_write_execute(plat, len, buf);
}

But the read routine only uses the DMA engine (dma_memcpy()) for DTR reads; otherwise it falls back to memcpy_fromio() to copy data from flash, so the read performance is very poor!

static int
cadence_qspi_apb_direct_read_execute(struct cadence_spi_platdata *plat,
				     const struct spi_mem_op *op)
{
	......
	if (!cadence_qspi_apb_use_phy(plat, op)) {
		if (!op->data.dtr || dma_memcpy(buf, plat->ahbbase + from, len) < 0)
			memcpy_fromio(buf, plat->ahbbase + from, len);

		if (!cadence_qspi_wait_idle(plat->regbase))
			return -EIO;
		return 0;
	}
	
	......
}

The conditional check `if (!op->data.dtr || dma_memcpy(buf, plat->ahbbase + from, len) < 0)` is weird. The correct check would look something like this: `if (!op->data.dtr && dma_memcpy(buf, plat->ahbbase + from, len) < 0)` (a sketch of the patched read path is shown after the results below). The test results with this change are as follows:

=> sf test 0 0x10000
SPI flash test:
0 erase: 163 ticks, 392 KiB/s 3.136 Mbps
1 check: 5 ticks, 12800 KiB/s 102.400 Mbps
2 write: 110 ticks, 581 KiB/s 4.648 Mbps
3 read: 5 ticks, 12800 KiB/s 102.400 Mbps
Test passed
0 erase: 163 ticks, 392 KiB/s 3.136 Mbps
1 check: 5 ticks, 12800 KiB/s 102.400 Mbps
2 write: 110 ticks, 581 KiB/s 4.648 Mbps
3 read: 5 ticks, 12800 KiB/s 102.400 Mbps
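
For clarity, here is a minimal sketch of how the direct-read path would look with the suggested `&&` check applied (only the changed branch is shown; the variable and function names are taken from the snippet above and the rest of the function is assumed unchanged):

	if (!cadence_qspi_apb_use_phy(plat, op)) {
		/*
		 * With '&&' instead of '||', a non-DTR read now tries the DMA
		 * engine first and only falls back to a CPU copy from the
		 * memory-mapped window if dma_memcpy() fails.
		 */
		if (!op->data.dtr && dma_memcpy(buf, plat->ahbbase + from, len) < 0)
			memcpy_fromio(buf, plat->ahbbase + from, len);

		if (!cadence_qspi_wait_idle(plat->regbase))
			return -EIO;
		return 0;
	}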

THANKS

XA

  • Hi Alex,

    Can you please provide the SDK version for me to reproduce this issue with performance?

    - Keerthy

  • Hi Keerthy

    SDK 7.1

    AX

  • Hi,

    I just reproduced the performance numbers on the 8.0 SDK itself. Once I change the if check that you suggested, I see the read
    performance improve as well.

    Default SDK 8.0:

    sf probe 1:0
    Can't get reset: -2
    Software reset enable failed: -524
    SF: Detected mt25qu512a with page size 256 Bytes, erase size 64 KiB, total 64 MiB

    => time sf read 0x80000000 0x0 0x1000000
    device 0 offset 0x0, size 0x1000000
    SF: 16777216 bytes @ 0x0 Read: OK

    time: 4.410 seconds

    After the if check change:

    sf probe 1:0
    cadence_spi spi@47050000: Can't get reset: -2
    jedec_spi_nor flash@0: Software reset enable failed: -524
    k3-navss-ringacc ringacc@2b800000: Ring Accelerator probed rings:286, gp-rings[96,20] sci-dev-id:235
    k3-navss-ringacc ringacc@2b800000: dma-ring-reset-quirk: disabled
    SF: Detected mt25qu512a with page size 256 Bytes, erase size 64 KiB, total 64 MiB
    => time sf read 0x80000000 0x0 0x1000000
    device 0 offset 0x0, size 0x1000000
    SF: 16777216 bytes @ 0x0 Read: OK

    time: 1.014 seconds

    So performance improves roughly 4x (4.410 s down to 1.014 s for a 16 MiB read). The opinion from our SPI experts is that this might be beneficial for larger sizes only.

    This is the commit that added the change: https://git.ti.com/cgit/ti-u-boot/ti-u-boot/commit/?id=246ca5eae017e855ada6b372661d3bf9a192e8e0

    Reasoning: if we are not in DTR mode, there is not much advantage in using DMA, therefore DMA is not used in this case. This condition mostly happens when parsing the SFDP table during enumeration (at slower speeds). Unfortunately, the SPI NOR core uses non-DMA'able buffers during SFDP reads (i.e. on-stack structs as read buffers). So this fix also helps avoid DMA to buffers allocated on the stack.
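
    To illustrate that last point, here is a paraphrased sketch of the typical SFDP read pattern in the SPI NOR core (names follow the usual pattern and are not copied verbatim from the SDK sources): the read buffer is a struct on the caller's stack, which is not a safe DMA target, so a plain CPU copy is the right choice there.

    	struct sfdp_header header;	/* on-stack read buffer, not DMA'able */
    	int err;

    	/* Read the SFDP header at offset 0 into the stack buffer */
    	err = spi_nor_read_sfdp(nor, 0, sizeof(header), &header);
    	if (err < 0)
    		return err;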

    Regards,
    Keerthy