AM6421: OSPI frame is split?

Part Number: AM6421
Other Parts Discussed in Thread: SYSCONFIG

In direct mode, the clock is 10 MHz.

By calling the interface as shown in the figure above, the waveform is as follows.

A frame is split into three frames.

Is this behavior expected?

  • Hi Vaibhav,

    Thank you very much for your efforts.

    Can you use the following configuration to conduct the test again?

    1. Clock: 100 MHz, Clock Divider: 2 (or Clock: 200 MHz, Clock Divider: 4)

    2. Read 512 bytes

    3. Test the performance of OSPI_readIndirect (using a method similar to the following)

            uint32_t cycleCountBefore, cycleCountAfter, cpuCycles;
     
            CycleCounterP_reset();
     
            cycleCountBefore = CycleCounterP_getCount32();
     
            OSPI_readIndirect(xx,xx,xx,xx);
     
            cycleCountAfter = CycleCounterP_getCount32();
     
            cpuCycles = cycleCountAfter - cycleCountBefore;
            
            DebugP_log("CPU cycles:%u\n", cpuCycles);
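
    One detail of this method is worth spelling out: the 32-bit cycle counter can wrap between the two samples. A minimal sketch of the subtraction, assuming a free-running 32-bit counter; `elapsed_cycles` is a hypothetical helper name, not an SDK function:

```c
#include <stdint.h>

/* Elapsed ticks of a free-running 32-bit counter.
 * Unsigned subtraction is performed modulo 2^32, so a single wrap of the
 * counter between the two samples needs no special case. */
static uint32_t elapsed_cycles(uint32_t before, uint32_t after)
{
    return after - before;
}
```

    For example, `elapsed_cycles(0xFFFFFFF0U, 0x10U)` yields 0x20, the true number of ticks across the wrap.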

    Looking forward to your reply.

    BR

    Ryan

  • Hi Ryan,

    The supported modes for the TAP mode are as follows:

    Regarding "Read 512 bytes": this should be okay to read.

    Regarding the OSPI_readIndirect performance test: you can go ahead with this.

    Please let me know if you run into any issues.

    Regards,

    Vaibhav

  • Hi Vaibhav,

    Could you share the test results from your side for the above configuration?

    BR

    Ryan

  • Hi Ryan,

    I did not do the time profiling, as it was not part of the initial ask after our call yesterday. I am going to do it right now and send it to you in a while.

    Regards,

    Vaibhav

  • Hi Vaibhav,

    Thank you very much for your efforts.

    Next week, I will modify the SDK based on the changes you proposed and conduct a time analysis test.

    I will notify you immediately after the test is completed.

    BR

    Ryan

  • Hi Ryan,

    Please go ahead and compute the time as defined in the attached API definition:

    int32_t OSPI_readIndirect(OSPI_Handle handle, OSPI_Transaction *trans)
    {
        int32_t status = SystemP_SUCCESS;
        const OSPI_Attrs *attrs = ((OSPI_Config *)handle)->attrs;
        OSPI_Object *obj = ((OSPI_Config *)handle)->object;
        const CSL_ospi_flash_cfgRegs *pReg = (const CSL_ospi_flash_cfgRegs *)(attrs->baseAddr);
        uint8_t *pDst;
        uint32_t addrOffset;
        uint32_t remainingSize;
        uint32_t readFlag = 0U;
        uint32_t sramLevel = 0, readBytes = 0;
        uint32_t dacState;
        uint32_t cycleCountBefore, cycleCountAfter, cpuCycles;
    
        addrOffset = trans->addrOffset;
        pDst = (uint8_t *) trans->buf;
    
        /* Disable DAC Mode */
        dacState = obj->isDacEnable;
        if(dacState == TRUE)
        {
            OSPI_disableDacMode(handle);
        }
    
        /* Config the Indirect Read Transfer Start Address Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_START_REG, addrOffset);
    
        /* Set the Indirect Read Transfer Number of Bytes Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_NUM_BYTES_REG, trans->count);

        /* Set the Indirect Read Transfer Watermark Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_WATERMARK_REG,
                     CSL_OSPI_SRAM_WARERMARK_RD_LVL);
    
    
        CycleCounterP_reset();
        cycleCountBefore = CycleCounterP_getCount32();
    
        /* Start the indirect read transfer */
        CSL_REG32_FINS(&pReg->INDIRECT_READ_XFER_CTRL_REG,
                       OSPI_FLASH_CFG_INDIRECT_READ_XFER_CTRL_REG_START_FLD,
                       1);
    
        if(OSPI_TRANSFER_MODE_POLLING == obj->transferMode)
        {
            remainingSize = trans->count;
    
            while(remainingSize > 0U)
            {
                if(OSPI_waitReadSRAMLevel(pReg, &sramLevel) != 0)
                {
                    /* SRAM FIFO has no data, failure */
                    readFlag = 1U;
                    status = SystemP_FAILURE;
                    trans->status = OSPI_TRANSFER_FAILED;
                    break;
                }
    
                readBytes = sramLevel * CSL_OSPI_FIFO_WIDTH;
                readBytes = (readBytes > remainingSize) ? remainingSize : readBytes;
    
                /* Read data from FIFO */
                OSPI_readFifoData(attrs->dataBaseAddr, pDst, readBytes);
    
                pDst += readBytes;
                remainingSize -= readBytes;
            }
            /* Wait for completion of INDAC Read */
            if(readFlag == 0U && OSPI_waitIndReadComplete(pReg) != 0)
            {
                readFlag = 1U;
                status = SystemP_FAILURE;
                trans->status = OSPI_TRANSFER_FAILED;
            }
    
        }
    
        cycleCountAfter = CycleCounterP_getCount32();
        /* Unsigned 32-bit subtraction is modulo 2^32, so this also covers a single counter wrap */
        cpuCycles = cycleCountAfter - cycleCountBefore;
        
        DebugP_log("CPU cycles for INDAC transfer of 512 bytes is: %u \r\n", cpuCycles);    
    
        /* Return to DAC mode if it was initially in enabled state */
        if(dacState == TRUE)
        {
            OSPI_enableDacMode(handle);
        }
    
        return status;
    }

    My settings in SysConfig for OSPI are as follows: 166 MHz, Clock Divider: 4, Mode: 8D-8D-8D

    The value at my end turns out to be:

    "CPU cycles for INDAC transfer of 512 bytes is: 22362"

    How do we convert these cycles to a meaningful number?

    Since the R5 core is running at 800 MHz, the time in microseconds is 22362 * (1 / 800) ~= 28 microseconds.
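
    The conversion above can be sketched as a small helper. This is only an illustration: `cycles_to_usec` is a hypothetical name, and the 800 MHz core clock is the figure quoted in this post.

```c
#include <stdint.h>

/* Convert a CPU cycle count to whole microseconds, rounded to nearest.
 * cpu_mhz is the core clock in MHz, i.e. cycles per microsecond. */
static uint32_t cycles_to_usec(uint32_t cycles, uint32_t cpu_mhz)
{
    if (cpu_mhz == 0U)
    {
        return 0U;   /* avoid division by zero on a bad clock value */
    }
    /* Widen to 64 bits so the rounding term cannot overflow. */
    return (uint32_t)(((uint64_t)cycles + (cpu_mhz / 2U)) / cpu_mhz);
}
```

    With the values above, 22362 cycles at 800 MHz is 27.95 microseconds, which `cycles_to_usec(22362U, 800U)` rounds to 28.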

    Regards,
    Vaibhav

  • Hi Ryan,

    I will reattach the waveform for 512 bytes after discussing internally. Thanks for your patience.

    Regards,
    Vaibhav

  • Hi Ryan, 

    As you requested, here are the test results for an INDAC read of 512 bytes:

    1. DDR: 8D-8D-8D, With Clock: 200 MHz and Clock Divider: 4, is: "CPU cycles for INDAC transfer of 512 bytes is: 22217"
    2. DDR: 8D-8D-8D, With Clock: 166 MHz and Clock Divider: 4, is: "CPU cycles for INDAC transfer of 512 bytes is: 22488"
    3. DDR: 8D-8D-8D, With Clock: 133 MHz and Clock Divider: 4, is: "CPU cycles for INDAC transfer of 512 bytes is: 22454"

    Regards,

    Vaibhav

  • Hi Vaibhav, 

    I have a question. Why is the time consumption almost the same across the different clock settings?

    BR

    Ryan

  • Hi Ryan,

    The 512-byte transactions occur on the OSPI controller, while the CPU cycles are counted by the PMU. Given the context overhead, I don't think we can get a significantly different cycle count. I hope we can focus on the packet length and overlap issue. Thanks.

                        50 MHz     41.5 MHz       33.3 MHz

    512 B, 8D-8D-8D     5.12 us    6.168675 us    7.68 us
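
    These figures follow from the fact that in 8D mode eight data lines toggle on both clock edges, so two bytes move per clock. A minimal sketch of that arithmetic, covering only the data phase (command, address, and dummy cycles are not included); `octal_ddr_data_ns` is a hypothetical helper name:

```c
#include <stdint.h>

/* Data-phase duration in nanoseconds for an octal DDR (8D) transfer.
 * Two bytes are transferred per serial clock; sclk_khz is the flash
 * clock after the divider, in kHz (for example 50000 for 200 MHz / 4). */
static uint64_t octal_ddr_data_ns(uint32_t num_bytes, uint32_t sclk_khz)
{
    if (sclk_khz == 0U)
    {
        return 0U;   /* avoid division by zero on a bad clock value */
    }
    /* Round an odd byte count up to a whole clock. */
    uint64_t clocks = ((uint64_t)num_bytes + 1U) / 2U;
    /* clocks / (kHz * 1e3) seconds, expressed in ns: clocks * 1e6 / kHz */
    return (clocks * 1000000ULL) / sclk_khz;
}
```

    For 512 bytes this gives 5120 ns at 50 MHz, 6168 ns at 41.5 MHz, and 7687 ns at 33.3 MHz, in line with the table.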

    Linjun

  • Hi Ryan,

    I am pretty sure that for 128 bytes of INDAC read, you see the operation happening under 1 CS on your setup.

    Please tell me what you see for a 512-byte INDAC read.

    I am pretty sure there is no overlapping at all.

    But what about the chip select? Under how many chip selects do you see the 512-byte transfer?

    Regards,
    Vaibhav

  • Hi Ryan, 

    Please note my Flash side configurations when I read 128 bytes of data. The same configuration follows when I go ahead and read 512 bytes of data.

    Regards,

    Vaibhav

  • Hi Vaibhav, 

    Thank you for your efforts.

    For now, the issue of packet length and overlap can be resolved through the INDAC read.

    However, from a performance perspective, this mode falls short of meeting the requirements.

    In addition to the time required for transmission, the context overhead is excessively long.

    Is there a way to reduce the context overhead?

    We hope to keep the context overhead within 2 microseconds.

    BR

    Ryan

  • Hi Ryan,

    Thanks for your patience.

    Regarding "For now, the issue of packet length and overlap can be resolved through the INDAC read":

    I am glad that you are not seeing overlapping when using INDAC mode.

    I have checked the waveform, and for 512 bytes of INDAC transfer the time taken is 14.736 microseconds.

    The waveform is a good place to measure, as you can track the timing by placing two markers, one at the start and the other at the end of the transfer.

    Moreover, this time includes the command (0xEE, 0xEE) + address + data (read back). I am running at 166 MHz with clock divider 4.

    Regards,

    Vaibhav

  • Ryan, it would be helpful if you measured the time based on the waveform you see. In my view, that would be more precise. Please let me know how many microseconds you see.

  • Hi Vaibhav,

    While measuring time based on the waveform is a good approach, we are more concerned with the time difference before and after calling the C function interface. This time is critical to our business. Our tests have shown that the time measured through the C function interface is significantly greater than that measured using the waveform.

    In direct mode, the additional time will be minimal. However, if the extra time is unavoidable in the indirect mode, we may have to abandon it.

    BR

    Ryan

  • Regarding "However, if the extra time is unavoidable in the indirect mode, we may have to abandon it":

    When we do an indirect transfer, we set some registers inside the API OSPI_readIndirect(), for example:

        dacState = obj->isDacEnable;
        if(dacState == TRUE)
        {
            OSPI_disableDacMode(handle);
        }

        /* Config the Indirect Read Transfer Start Address Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_START_REG, addrOffset);

        /* Set the Indirect Read Transfer Number of Bytes Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_NUM_BYTES_REG, trans->count);

        /* Set the Indirect Read Transfer Watermark Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_WATERMARK_REG,
                     CSL_OSPI_SRAM_WARERMARK_RD_LVL);

    After this, we start the indirect transfer.

    Allow me some time to discuss this internally and get back to you.

    Regards,
    Vaibhav

  • Hi, Vaibhav,

    Thank you for your efforts.

    I look forward to your reply.

    BR

    Ryan

  • Hi Ryan,

    Since the SDK does not perform INDAC reads for either NOR or NAND flashes, the MCU PLUS SDK documentation does not include a performance measurement for this mode.

    Apart from this, I am looking at the possibility of DMA DAC reads. Note that the lower limit for DMA copies is 1024 bytes, which means that the transfer size must be greater than 1024 bytes for DMA to be used.

    So we can have this macro changed as follows:

    In file, ospi_v0.c change the macro to: 

    #define OSPI_DMA_COPY_LOWER_LIMIT     (256U) // 1024
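
    To illustrate what this limit controls, here is a minimal sketch of the decision described above. It is not the SDK's code: `would_use_dma` is a hypothetical helper, and the strict "greater than" comparison is an assumption taken from the wording of this post.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: would a read of `count` bytes be handed to DMA,
 * given that DMA is enabled and the copy size exceeds the lower limit? */
static bool would_use_dma(bool dma_enabled, uint32_t count, uint32_t lower_limit)
{
    return dma_enabled && (count > lower_limit);
}
```

    Under that assumption, a 512-byte read stays on the CPU copy path with a limit of 1024 and moves to DMA once the limit is lowered to 256.
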

    And then build the library with the following commands:

    Could you tell me the waveform you see with these changes, and the CPU cycles you get?

    NOTE: This is just another possibility that I want to explore.

    I will do the same measurement on my end and let you know the numbers for a DAC DMA transfer of 512 bytes vs. an INDAC read of 512 bytes (already known).

    Also, let me know if you see the overlapping with the DMA transfer as well.

    Moreover, to enable DMA, go to the OSPI section in the application's SysConfig and check the Enable DMA option. Rebuild the application after you have changed the macro and rebuilt the libraries with the two commands mentioned above.

    Looking forward to your response.

    Regards,
    Vaibhav

  • Moreover, I know your driver is currently configured to use readIndirect, as per the last changes I proposed.

    To make sure that the driver uses OSPI_readDirect, follow the instructions here: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1450567/am6421-ospi-frame-is-split/5683156#5683156

    But make sure to write OSPI_readDirect instead of OSPI_readIndirect in the file flash_nor_ospi.c, and rebuild the libraries as mentioned in the step-by-step instructions.

  • Hello Ryan,

    This thread is about "OSPI frames are split", and it is clear that indirect access mode addresses that issue. If you have questions about bus real-time or bus efficiency requirements, please submit a new topic. Thanks.

    Linjun

  • Hi,

    Not sure if there is another thread created specifically to discuss the timing for INDAC.

    However, I will provide updates here. I have done measurements for INDAC and DAC with different byte counts and clock frequencies.

    After reviewing these values with other experts, I will post the numbers here.

    NOTE:

    An effort was made to increase the performance, by placing the OSPI_readFifoData API in TCMB0 instead of MSRAM, but this also did not increase the throughput of the API OSPI_readIndirect. 

    It is also observed that most of the time is taken while we read the data from the FIFO, more specifically, OSPI_readFifoData().

    Another observation is that the time measured on the captured waveform is much shorter. For 512 bytes of INDAC read, I see 6.9 microseconds on the wire/waveform, whereas OSPI_readIndirect takes 26-29 microseconds, of which 90 percent is consumed by OSPI_readFifoData().

    Best Regards,

    Vaibhav

  • Hi Vaibhav

    Can we reduce the time consumption of OSPI_readFifoData?

    Regards

    Zekun

  • Hi Zekun,

    These are the current numbers seen on the TI EVM AM64x-SK.

    Efforts made to improve efficiency (none of these improved the efficiency of the OSPI_readIndirect API):

    1. Placing the destination buffer in TCMA or TCMB.
    2. Placing OSPI_readFifoData in TCMA/TCMB.
    3. Caching the region 0x60000000.
    4. Changing the SRAM partition configuration register from its default value of 63 to 127. This made sure that the 512 bytes were transferred under a single chip select, but again did not improve the OSPI_readIndirect time. Note that the waveform shows 6.9 microseconds for 512 bytes, whereas OSPI_readIndirect takes roughly 22+ microseconds, because OSPI_readFifoData accounts for 90 percent of that time.

    Regards,

    Vaibhav

  • Hi Ryan,

    Lucas and I have been running some experiments with a few tweaks to the API OSPI_readIndirect().

    Currently some improvement has been seen.

    Earlier, the time taken for 512 bytes of transfer with 166 MHz/4 was 29-30 microseconds which gave a bandwidth of roughly 17.65 MBps.

    With the modifications, we see latency of just 14-15 microseconds(ran over 2000 test cycles by reading different 512 bytes of data every cycle), so now the improved bandwidth is seen to be 34 MBps.
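
    The bandwidth figures follow directly from bytes divided by microseconds, since one byte per microsecond equals 1 MBps. A minimal sketch of that arithmetic in integer form; `throughput_kbytes_per_s` is a hypothetical helper name, and kB here means 1000 bytes:

```c
#include <stdint.h>

/* Throughput in kB/s (1 kB = 1000 bytes) for num_bytes moved in `usec`
 * microseconds. Bytes per microsecond equals MB/s, so scaling by 1000
 * keeps three decimal places without floating point. */
static uint32_t throughput_kbytes_per_s(uint32_t num_bytes, uint32_t usec)
{
    if (usec == 0U)
    {
        return 0U;   /* no meaningful rate for a zero-length interval */
    }
    return (uint32_t)(((uint64_t)num_bytes * 1000U) / usec);
}
```

    For 512 bytes, 29 microseconds gives 17655 kB/s (about 17.65 MBps) and 15 microseconds gives 34133 kB/s (about 34 MBps).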

    It will take me some time to write up a formal list of the modifications needed; I will share the changes once I have cleaned up the code a bit.

    Follow-up question:

    In the meantime, can you tell me whether you will always read exactly 512 bytes of data, or whether you will sometimes read fewer than 512 bytes?

    Regards,

    Vaibhav

  • Hi Vaibhav,

    Thank you for the efforts made by you and Lucas. The results achieved are truly exciting!

    Regarding your query, our read operations generally involve less than 512 bytes.

    We look forward to the modification list and code sharing you will organize in the future.

    If there are any other questions, please feel free to communicate with us.

    BR

    Ryan

  • Hi Ryan,

    I am going to share the changes now.

    Replace the existing definitions of the following APIs with the ones below:

    Please include:

    #include <kernel/dpl/ClockP.h>

    static int32_t Flash_norOspiRead(Flash_Config *config, uint32_t offset, uint8_t *buf, uint32_t len)
    {
        int32_t status = SystemP_SUCCESS;
        Flash_NorOspiObject *obj = (Flash_NorOspiObject *)(config->object);
        Flash_Attrs *attrs = config->attrs;
        uint32_t startTime = 0, endTime = 0;
    
        if(obj->phyEnable)
        {
            OSPI_enablePhy(obj->ospiHandle);
        }
    
        /* Validate address input */
        if ((offset + len) > (attrs->flashSize))
        {
            status = SystemP_FAILURE;
        }
        if (status == SystemP_SUCCESS)
        {
            OSPI_Transaction transaction;
    
            OSPI_Transaction_init(&transaction);
            transaction.addrOffset = offset;
            transaction.buf = (void *)buf;
            transaction.count = len;
            uint32_t* source = (uint32_t*) 0x60100000;
            CacheP_inv(source, 512, CacheP_TYPE_ALLD);
            startTime = ClockP_getTimeUsec();
            status = OSPI_readIndirect(obj->ospiHandle, &transaction);
            endTime = ClockP_getTimeUsec();
            DebugP_log("Time in microseconds for %u bytes indac read is %u \r\n", len, (endTime - startTime));
        }
    
        if(obj->phyEnable)
        {
            OSPI_disablePhy(obj->ospiHandle);
        }
    
        return status;
    }

    int32_t OSPI_readIndirect(OSPI_Handle handle, OSPI_Transaction *trans)
    {
        int32_t status = SystemP_SUCCESS;
        const OSPI_Attrs *attrs = ((OSPI_Config *)handle)->attrs;
        const CSL_ospi_flash_cfgRegs *pReg = (const CSL_ospi_flash_cfgRegs *)(attrs->baseAddr);
        uint8_t *pDst;
        uint32_t addrOffset;
        uint32_t sramLevel = 0; 
    
        uint32_t* source = (uint32_t*) 0x60100000;
    
        addrOffset = trans->addrOffset;
        pDst = (uint8_t *) trans->buf;
        uint32_t* rxBuffer = (uint32_t*) pDst;
    
        /* Disable DAC Mode */
        CSL_REG32_FINS(&pReg->CONFIG_REG,
                       OSPI_FLASH_CFG_CONFIG_REG_ENB_DIR_ACC_CTLR_FLD,
                       0U);
    
        /* Config the Indirect Read Transfer Start Address Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_START_REG, addrOffset);
    
        CSL_REG32_WR(&pReg->IND_AHB_ADDR_TRIGGER_REG, 0x100000);
    
        // set INDIRECT_TRIGGER_ADDR_RANGE_REG to 7 instead of 4
        CSL_REG32_WR(&pReg->INDIRECT_TRIGGER_ADDR_RANGE_REG, 7);    
    
        /* Set the transaction count */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_NUM_BYTES_REG, trans->count);
    
        // set partition config reg value to 127 instead of the default 63
        CSL_REG32_WR(&pReg->SRAM_PARTITION_CFG_REG, 127);

        /* Set the Indirect Read Transfer Watermark Register */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_WATERMARK_REG, 32);
    
        /* Start the indirect read transfer */
        CSL_REG32_FINS(&pReg->INDIRECT_READ_XFER_CTRL_REG,
                       OSPI_FLASH_CFG_INDIRECT_READ_XFER_CTRL_REG_START_FLD,
                       1);
    
        // wait until the SRAM holds all 128 words (512 bytes) of the indirect read
        do{
            OSPI_waitReadSRAMLevel(pReg, &sramLevel);
        }while(sramLevel != 128);
    
        for(int i = 0; i < 128; i += 8)
        {
            rxBuffer[i + 0] = source[i + 0];
            rxBuffer[i + 1] = source[i + 1];
            rxBuffer[i + 2] = source[i + 2];
            rxBuffer[i + 3] = source[i + 3];
            rxBuffer[i + 4] = source[i + 4];
            rxBuffer[i + 5] = source[i + 5];
            rxBuffer[i + 6] = source[i + 6];
            rxBuffer[i + 7] = source[i + 7];
        }
    
        // restore IND_AHB_ADDR_TRIGGER_REG to 0
        CSL_REG32_WR(&pReg->IND_AHB_ADDR_TRIGGER_REG, 0x00);

        // reset INDIRECT_TRIGGER_ADDR_RANGE_REG to 4 instead of 7
        CSL_REG32_WR(&pReg->INDIRECT_TRIGGER_ADDR_RANGE_REG, 4);

        // restore the partition config reg to its default of 63
        CSL_REG32_WR(&pReg->SRAM_PARTITION_CFG_REG, 63);

        /* Set the Indirect Read Transfer Watermark Register back to default 16 */
        CSL_REG32_WR(&pReg->INDIRECT_READ_XFER_WATERMARK_REG, 16);
    
        return status;
    }

    Also make sure to include the following settings in your application's sysconfig:

    Note in the last screenshot, the .data.gOspiRxBuf is the Receive buffer which I initialize at application level globally, like this:

    Also, my settings are 166 MHz / 4, and PHY mode is not enabled since we do an INDAC read.

    Please also note that the implementation inside OSPI_readIndirect caters specifically to 512-byte reads. It can be modified to work generically and read any number of bytes. First, let me know whether you see the same performance as I do when reading exactly 512 bytes on your setup.
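
    As a rough idea of what a generic version of the copy loop could look like, here is a minimal sketch that works on plain memory. It is not the SDK implementation: `copy_from_word_window` and `words_for_bytes` are hypothetical helpers, and the sketch assumes the source window must be read with 32-bit accesses only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 32-bit words needed to hold len bytes (rounded up). */
static uint32_t words_for_bytes(uint32_t len)
{
    return (len / 4U) + (((len % 4U) != 0U) ? 1U : 0U);
}

/* Copy len bytes from a 32-bit-wide source window into dst.
 * Whole words are read with 32-bit accesses; a trailing 1..3 bytes are
 * taken from one final word read, so src is never accessed bytewise. */
static void copy_from_word_window(uint8_t *dst, const volatile uint32_t *src, size_t len)
{
    size_t words = len / 4U;
    size_t tail = len % 4U;

    for (size_t i = 0; i < words; i++)
    {
        uint32_t w = src[i];
        memcpy(dst + (4U * i), &w, sizeof w);   /* dst may be unaligned */
    }
    if (tail != 0U)
    {
        uint32_t w = src[words];
        memcpy(dst + (4U * words), &w, tail);
    }
}
```

    In such a generic variant, the fixed SRAM level of 128 words would presumably become `words_for_bytes(len)`, for read sizes that fit in the read partition.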

    Regards,

    Vaibhav

  • As per today's discussion in the weekly meeting, I am marking this as resolved.

    Regards

    Ashwani

  • Hi Ryan,

    I am assuming this has been talked about in the weekly meeting and your goals have been met in throughput performance enhancement.

    Ashwani has marked the thread closed as per his inputs from the call I guess.

    Please let me know if there is anything additional you would need to know.

    Regards,

    Vaibhav