AM5728: DSP software execution time

Part Number: AM5728

Hello,

We have an AM5728-based system running a code base ported from a previous design that used the TI 6482. The C66x core on the AM5728 runs under SYS/BIOS at 750 MHz. The C64x core on the 6482 runs under DSP/BIOS at 1 GHz. In both systems we get data from an FPGA. We are seeing that the 5728 implementation is 2.5 to 3.0 times slower than the 6482. I am wondering where to start looking for the cause.

More information: The AM5728 has DDR3 that we basically lifted from the EVM design. The DSP has as much L1/L2 cache enabled as I can enable. The DSP is connected to the FPGA via PCIe at 5G x 1 lane. The data goes from the FPGA to the DSP via a PCIe write issued by the FPGA. The data is written into L2SRAM memory, part of the 32k left over after cache; the DSP then gets an interrupt and copies the data to DDR3 memory for later processing. This was done to avoid having to cache invalidate the DDR3 memory each time the DSP wants to access the data from the FPGA. As mentioned above, the code that then processes the data from DDR3 is 2.5 to 3.0 times slower on the AM5728 DSP. I would guess that data is arriving over PCIe into the L2SRAM for about 50% of our cycle time of ~70 usec per interrupt.

I am wondering if the large volume of data arriving in the L2SRAM is preventing the cache from working effectively, or simply blocking the DSP core from accessing memory effectively. In other words, is the PCIe transfer of data into L2SRAM killing our processing?
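To put a rough ceiling on how much data that could be, here is a back-of-the-envelope sketch in C. It assumes a PCIe Gen2 x1 link (5 GT/s with 8b/10b encoding, so about 500 bytes per usec before packet overhead) and that the link is busy for about 50% of the 70 usec period. The function names are illustrative only, and the real payload rate will be lower once TLP headers and flow control are counted.

```c
#include <stdint.h>

/* Raw ceiling of a PCIe link in bytes per microsecond.
 * Assumes 8b/10b encoding (Gen1/Gen2): 10 bits on the wire per byte.
 * A rate in MT/s is numerically the number of wire bits per usec per lane. */
uint32_t pcie_bytes_per_usec(uint32_t mega_transfers_per_sec, uint32_t lanes)
{
    return (mega_transfers_per_sec * lanes) / 10u;
}

/* Upper bound on the bytes that can arrive while the link is busy. */
uint32_t pcie_max_bytes(uint32_t mega_transfers_per_sec, uint32_t lanes,
                        uint32_t busy_usec)
{
    return pcie_bytes_per_usec(mega_transfers_per_sec, lanes) * busy_usec;
}
```

Calling `pcie_max_bytes(5000, 1, 35)` gives an upper bound of 17,500 bytes per interrupt period, which is roughly half of the 32k of L2SRAM mentioned above.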

Any suggestions or insight into this would be much appreciated.

Thanks,

Chris

  • The RTOS team have been notified. They will respond here.
  • Hi,

    I thought of several things to check:
    1) DSP speed (750MHz vs 1000MHz)
    2) DDR speed and L1D, L1P, L2 cache setting in both cases
    3) C6482 is PCI interface, not PCIE, correct?
    4) the C6482 is C64x+ core and AM5728 is C66x core, do you use the same compiler option? Can you try -mv6600 for C66x?
    5) When data is written from the FPGA into the DSP, is there an EDMA on the FPGA side so that the write uses EDMA, or is it just a CPU write?
    6) Is there a way you can split the processing time into several steps:
    a. the time from FPGA starting write to the point data landed at DSP side;
    b. interrupt latency
    c. time of copy from L2 to DDR
    d. Processor time in DDR

    Then we can have a better idea of what slows the processing down. And why not just write from the FPGA directly into DDR?
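A minimal sketch of how the per-stage split in item 6 could be collected is shown below. It is not taken from any TI library: the `stage` enum, `stage_stats` struct and function names are hypothetical, and the timestamps are assumed to come from a free-running 32-bit cycle counter read at the start and end of each stage.

```c
#include <stdint.h>

/* Stages a-d from the list above (hypothetical names). */
enum stage {
    STAGE_FPGA_WRITE,
    STAGE_IRQ_LATENCY,
    STAGE_COPY_L2_TO_DDR,
    STAGE_PROCESSING,
    STAGE_COUNT
};

struct stage_stats {
    uint32_t count;  /* number of samples recorded */
    uint64_t total;  /* sum of all durations, in cycles */
    uint32_t min;    /* shortest duration seen */
    uint32_t max;    /* longest duration seen */
};

void stage_stats_init(struct stage_stats s[STAGE_COUNT])
{
    for (int i = 0; i < STAGE_COUNT; i++) {
        s[i].count = 0;
        s[i].total = 0;
        s[i].min = UINT32_MAX;
        s[i].max = 0;
    }
}

/* Record one sample. Unsigned subtraction keeps the duration correct
 * even if the 32-bit counter wrapped between start and end. */
void stage_record(struct stage_stats *s, uint32_t start, uint32_t end)
{
    uint32_t d = end - start;
    s->count++;
    s->total += d;
    if (d < s->min) s->min = d;
    if (d > s->max) s->max = d;
}

uint32_t stage_mean(const struct stage_stats *s)
{
    return s->count ? (uint32_t)(s->total / s->count) : 0u;
}
```

Keeping min and max next to the mean makes it easier to see whether a stage is uniformly slow or only occasionally slow, which is what contention would tend to look like.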

    Regards, Eric
  • Thanks for the reply, Eric. In the past day I discovered a major contributor to the issue. The underlying code was set up to use the VCP on the 6482 when possible and, if the VCP was not present (as with the C66x on the 5728), to fall back to a software Viterbi. Obviously, the software Viterbi is much slower. However, when comparing cycles to cycles between the 5728 and the 6482, we still see some variations that we can't explain. We can explain some of it, because it appears the HWI processing in SYS/BIOS takes 5-8 usec. That is a big hit when we get an interrupt every 70 usec. However, even with that taken into account, we still see variations in cycle times, and I think they are related to my use of the L2SRAM as the target memory for the PCIe writes. If the PCIe is writing a block of data to the L2SRAM while that memory is also serving as the main cache for the DSP, it seems that could cause contention and impact speed in a way that is not observable. Can you comment on that? The reason I used the L2SRAM was to avoid having to do cache invalidates of the PCIe data. But I might have to reconsider that approach.
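As a rough illustration of how much of each period the interrupt overhead consumes, the arithmetic can be sketched in C. The function names are hypothetical; the inputs are the 750 MHz clock, the 70 usec period and the 5-8 usec HWI figure quoted above.

```c
#include <stdint.h>

/* Cycles available in a time span: MHz is cycles per microsecond. */
uint32_t cycles_in(uint32_t cpu_mhz, uint32_t usec)
{
    return cpu_mhz * usec;
}

/* Share of the period lost to overhead, in tenths of a percent
 * (integer division, so the result is rounded down). */
uint32_t overhead_permille(uint32_t overhead_usec, uint32_t period_usec)
{
    return (overhead_usec * 1000u) / period_usec;
}
```

With these numbers a 70 usec period is 52,500 cycles, and 5-8 usec of HWI handling costs 3,750 to 6,000 of them, or about 7% to 11% of every period.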

    I will answer your questions since that also lead to other ideas.

    1) DSP speed (750MHz vs 1000MHz)

    Yes, the DSP on the 5728 is running at 750 MHz; this is confirmed. The DSP on the 6482 was at 1000 MHz. But that is only a 25% lower clock, which does not account for a 2.5x to 3.0x slowdown.
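To make the gap concrete, the sketch below divides the observed slowdown by the clock ratio. It assumes, purely for illustration, that both cores would need the same number of cycles for the same work, which this thread shows is not strictly true (for example the VCP versus software Viterbi difference). The function name and the fixed-point scaling by 1000 are choices made for this example.

```c
#include <stdint.h>

/* Slowdown left over after removing the clock difference.
 * All ratios are scaled by 1000 (2500 means 2.5x). */
uint32_t residual_slowdown_x1000(uint32_t observed_x1000,
                                 uint32_t old_mhz, uint32_t new_mhz)
{
    return (uint32_t)(((uint64_t)observed_x1000 * new_mhz) / old_mhz);
}
```

For an observed 2.5x to 3.0x slowdown with 1000 MHz and 750 MHz clocks, this leaves a factor of about 1.875x to 2.25x that the clock alone does not explain.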


    2) DDR speed and L1D, L1P, L2 cache setting in both cases

    The DSP on the 6482 was running from internal memory. Due to the much more limited internal memory on the AM5728, this is not possible, so it runs from external DDR3 at 533 MHz. L1D, L1P and L2 cache are maxed out and enabled on the 5728.

    3) C6482 is PCI interface, not PCIE, correct?

    On the 6482 it was Fast EMIF and the data was moved by the DMA of the 6482. On the 5728, the data is moved via direct memory writes from the FPGA via PCIe. In other words, from the 5728 DSP viewpoint, the data just shows up in memory, L2SRAM, then is moved to DDR.

    4) the C6482 is C64x+ core and AM5728 is C66x core, do you use the same compiler option? Can you try -mv6600 for C66x?

    Good point here. I did not even realize I had mv64+ on the settings for the 5728 DSP. I changed to -mv6600 and did not see any noticeable reduction in cycle times.

    5) When data is written from the FPGA into the DSP, is there an EDMA on the FPGA side so that the write uses EDMA, or is it just a CPU write?

    The writes are a DMA-like process. We use the NWL PCIe core on an Altera chip. The writes come in blocks of 32 words (32 bits each).
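For reference, a block of 32 words of 32 bits is 128 bytes. If the writes were ever redirected to cached DDR3, any invalidate would have to cover whole cache lines. The sketch below rounds a buffer out to line boundaries. The 128-byte line size is an assumption based on the C66x L2 cache line and should be checked against the device documentation; the helper names are hypothetical and no real cache API is called here.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 32u
#define BYTES_PER_WORD  4u
#define CACHE_LINE      128u  /* assumed L2 line size, verify for the device */

/* Size of one FPGA write block in bytes. */
size_t block_bytes(void)
{
    return WORDS_PER_BLOCK * BYTES_PER_WORD;
}

/* Round an address down to the start of its cache line. */
uintptr_t line_align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Round an address up to the next cache line boundary. */
uintptr_t line_align_up(uintptr_t addr)
{
    return (addr + CACHE_LINE - 1u) & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Bytes that must be invalidated so that [addr, addr + len) is covered
 * by whole cache lines. */
size_t invalidate_span(uintptr_t addr, size_t len)
{
    return (size_t)(line_align_up(addr + len) - line_align_down(addr));
}
```

Under that assumption, one line-aligned block maps to exactly one cache line, while a block that starts mid-line touches two lines and doubles the span to invalidate.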

    6) Is there a way you can split the processing time into several steps:

    a. the time from FPGA starting write to the point data landed at DSP side;

    The DSP is not aware of this time, as it is a DMA process that happens in the background. The data is written into L2SRAM, then the DSP gets an MSI interrupt. That is when the 70 usec period starts on the DSP. The DSP must complete all its tasks before the next 70 usec MSI interrupt.

    b. interrupt latency

    This has been measured and reported in another thread to be between 5 and 9 usec. This was observed with the Execution Task tool. See this thread. e2e.ti.com/.../604259

    c. time of copy from L2 to DDR

    I will need to instrument for this.

    d. Processor time in DDR

    This is very hard to measure since the time spent doing the actual work, executing processing code and data from DDR, is constantly interrupted by PCIe data interrupts. I mentioned above that we need to service PCIe interrupts at 70 usec intervals. However, once we have collected the required data, we need to process it within 1 msec. That processing spans many PCIe interrupts. So, measuring just that processing time is not really possible.
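One possible way to approximate the net processing time, offered only as a sketch, is to accumulate the cycles spent inside the interrupt handler and subtract them from the elapsed time of the processing window. The names below are hypothetical. The timestamps are assumed to come from a free-running 32-bit cycle counter (on a C66x this could be the TSCL register, which is an assumption about the target setup). Dispatcher overhead outside the handler and cache effects caused by the interrupt are not captured.

```c
#include <stdint.h>

/* Running total of cycles spent in the ISR. On target this would be
 * updated from the interrupt handler and should be declared volatile. */
struct isr_account {
    uint32_t isr_cycles;
};

/* Snapshot taken when a processing window starts. */
struct window {
    uint32_t start_ts;
    uint32_t isr_at_start;
};

/* Call at ISR exit with the timestamps read at ISR entry and exit. */
void isr_account_add(struct isr_account *a, uint32_t entry_ts, uint32_t exit_ts)
{
    a->isr_cycles += exit_ts - entry_ts;
}

void window_begin(struct window *w, const struct isr_account *a, uint32_t now)
{
    w->start_ts = now;
    w->isr_at_start = a->isr_cycles;
}

/* Elapsed cycles minus the cycles spent in the ISR during the window.
 * Unsigned arithmetic keeps the result correct across one counter wrap. */
uint32_t window_net_cycles(const struct window *w, const struct isr_account *a,
                           uint32_t now)
{
    uint32_t elapsed = now - w->start_ts;
    uint32_t in_isr = a->isr_cycles - w->isr_at_start;
    return elapsed - in_isr;
}
```

For example, a window that spans 2,000 cycles and is interrupted twice, for 300 and 250 cycles, reports 1,450 net cycles.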

    Thanks for your further consideration of this.

    Chris