
TDA4VM: QSPI NOR flash performance

Part Number: TDA4VM

Dear experts,

The sf test command we used is shown below:

=> sf test 0 0x10000
SPI flash test:
0 erase: 206 ticks, 310 KiB/s 2.480 Mbps
1 check: 17 ticks, 3764 KiB/s 30.112 Mbps
2 write: 123 ticks, 520 KiB/s 4.160 Mbps
3 read: 18 ticks, 3555 KiB/s 28.440 Mbps
Test passed
0 erase: 206 ticks, 310 KiB/s 2.480 Mbps
1 check: 17 ticks, 3764 KiB/s 30.112 Mbps
2 write: 123 ticks, 520 KiB/s 4.160 Mbps
3 read: 18 ticks, 3555 KiB/s 28.440 Mbps
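
(For reference, the test region is 0x10000 bytes = 64 KiB, and the ticks appear to be milliseconds, so the read figure works out to 64 KiB / 18 ms ≈ 3555 KiB/s, i.e. only about 3.5 MB/s.)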

From the data above, we can see the NOR flash has poor read and write performance. Digging into the details of the code, we learned that the driver uses a plain CPU copy (memcpy_toio()) when writing data to flash, so the write performance seems to be within the normal range.

int cadence_qspi_apb_write_execute(struct cadence_spi_platdata *plat,
				   const struct spi_mem_op *op)
{
	u32 to = op->addr.val;
	const void *buf = op->data.buf.out;
	size_t len = op->data.nbytes;

	/*
	 * Some flashes like the Cypress Semper flash expect a dummy 4-byte
	 * address (all 0s) with the read status register command in DTR mode.
	 * But this controller does not support sending dummy address bytes to
	 * the flash when it is polling the write completion register in DTR
	 * mode. So, we can not use direct mode when in DTR mode for writing
	 * data.
	 */
	if (!plat->dtr && plat->use_dac_mode && (to + len < plat->ahbsize)) {
		memcpy_toio(plat->ahbbase + to, buf, len);
		if (!cadence_qspi_wait_idle(plat->regbase))
			return -EIO;
		return 0;
	}

	return cadence_qspi_apb_indirect_write_execute(plat, len, buf);
}

But the read routine only uses the DMA engine (dma_memcpy()) for DTR reads; otherwise it falls back to memcpy_fromio() to copy data from flash, so the read performance is very poor!

static int
cadence_qspi_apb_direct_read_execute(struct cadence_spi_platdata *plat,
				     const struct spi_mem_op *op)
{
	......
	if (!cadence_qspi_apb_use_phy(plat, op)) {
		if (!op->data.dtr || dma_memcpy(buf, plat->ahbbase + from, len) < 0)
			memcpy_fromio(buf, plat->ahbbase + from, len);

		if (!cadence_qspi_wait_idle(plat->regbase))
			return -EIO;
		return 0;
	}
	
	......
}

The conditional check `if (!op->data.dtr || dma_memcpy(buf, plat->ahbbase + from, len) < 0)` is weird. The correct check would look something like this: `if (!op->data.dtr && dma_memcpy(buf, plat->ahbbase + from, len) < 0)` (a sketch of the patched read path is shown after the results below). The test results with this change are as follows:

=> sf test 0 0x10000
SPI flash test:
0 erase: 163 ticks, 392 KiB/s 3.136 Mbps
1 check: 5 ticks, 12800 KiB/s 102.400 Mbps
2 write: 110 ticks, 581 KiB/s 4.648 Mbps
3 read: 5 ticks, 12800 KiB/s 102.400 Mbps
Test passed
0 erase: 163 ticks, 392 KiB/s 3.136 Mbps
1 check: 5 ticks, 12800 KiB/s 102.400 Mbps
2 write: 110 ticks, 581 KiB/s 4.648 Mbps
3 read: 5 ticks, 12800 KiB/s 102.400 Mbps
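
For clarity, here is a minimal sketch of how the direct-read path would look with the suggested `&&` check applied (only the changed branch is shown; the variable and function names are taken from the snippet above and the rest of the function is assumed unchanged):

	if (!cadence_qspi_apb_use_phy(plat, op)) {
		/*
		 * With '&&' instead of '||', a non-DTR read now tries the DMA
		 * engine first and only falls back to a CPU copy from the
		 * memory-mapped window if dma_memcpy() fails.
		 */
		if (!op->data.dtr && dma_memcpy(buf, plat->ahbbase + from, len) < 0)
			memcpy_fromio(buf, plat->ahbbase + from, len);

		if (!cadence_qspi_wait_idle(plat->regbase))
			return -EIO;
		return 0;
	}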

THANKS

XA

  • Hi Alex,

    Can you please provide the SDK version for me to reproduce this issue with performance?

    - Keerthy

  • Hi Keerthy

    SDK 7.1

    AX

  • Hi,

    I just reproduced the performance numbers on the 8.0 SDK itself. Once I change the if check that you suggested, I see the read
    performance improve as well.

    Default SDK 8.0:

    sf probe 1:0
    Can't get reset: -2
    Software reset enable failed: -524
    SF: Detected mt25qu512a with page size 256 Bytes, erase size 64 KiB, total 64 MiB

    => time sf read 0x80000000 0x0 0x1000000
    device 0 offset 0x0, size 0x1000000
    SF: 16777216 bytes @ 0x0 Read: OK

    time: 4.410 seconds

    After the if check change:

    sf probe 1:0
    cadence_spi spi@47050000: Can't get reset: -2
    jedec_spi_nor flash@0: Software reset enable failed: -524
    k3-navss-ringacc ringacc@2b800000: Ring Accelerator probed rings:286, gp-rings[96,20] sci-dev-id:235
    k3-navss-ringacc ringacc@2b800000: dma-ring-reset-quirk: disabled
    SF: Detected mt25qu512a with page size 256 Bytes, erase size 64 KiB, total 64 MiB
    => time sf read 0x80000000 0x0 0x1000000
    device 0 offset 0x0, size 0x1000000
    SF: 16777216 bytes @ 0x0 Read: OK

    time: 1.014 seconds

    So performance improves roughly 4x (4.410 s down to 1.014 s for a 16 MiB read). The opinion from our SPI experts is that this might be beneficial for larger sizes only.

    This is the commit that added the change: https://git.ti.com/cgit/ti-u-boot/ti-u-boot/commit/?id=246ca5eae017e855ada6b372661d3bf9a192e8e0

    Reasoning: if we are not in DTR mode, there is not much advantage in using DMA, therefore DMA is not used in this case. This condition mostly happens when parsing the SFDP table during enumeration (at slower speeds). Unfortunately, the SPI NOR core uses non-DMA'able buffers during SFDP reads (i.e. on-stack structs as read buffers). So this fix also helps avoid DMA to buffers allocated on the stack.
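
    To illustrate that last point, here is a paraphrased sketch of the typical SFDP read pattern in the SPI NOR core (names follow the usual pattern and are not copied verbatim from the SDK sources): the read buffer is a struct on the caller's stack, which is not a safe DMA target, so a plain CPU copy is the right choice there.

    	struct sfdp_header header;	/* on-stack read buffer, not DMA'able */
    	int err;

    	/* Read the SFDP header at offset 0 into the stack buffer */
    	err = spi_nor_read_sfdp(nor, 0, sizeof(header), &header);
    	if (err < 0)
    		return err;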

    Regards,
    Keerthy