Self-written QMSS snippet works in simulator but fails on EVM6678



Hi,

For some days now I've been trying to get a QMSS code snippet working in which one core simply sends a message to another core through hardware queues. However, the descriptor just got queued in the TX queue and nothing else happened, even when sending messages with just a single core involved.
Today I discovered that the code works as expected on the C6678 simulator, where "messages" are passed correctly to the listening RX queue, which in turn fetches free descriptors from its rxFreeQueue.

Most likely it's a fault on my side; however, I still wonder why it works with the simulator.
It would be really great if somebody could take a look at the demo project: http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/639/0804.CPPIExecutor.7z

Thanks in advance, Clemens

PS: In general, QMSS has been troublesome - there is a lot of high-level documentation available (webinars, the Multicore Navigator User's Guide) but only a few, quite poorly documented, samples. Documentation at an intermediate level would be great; even better documentation of the samples themselves would help a lot.

  • I see several problems with this project:

    • Descriptors are in MSMC SRAM with L1D cache enabled.
      • Must align and pad descriptors to 64 bytes.
      • Must do Osal_qmssBeginMemAccess() after pop (to invalidate cache).
      • Must do Osal_qmssEndMemAccess() before push (to write back cache); see the pop/push sketch after this list.
      • Must do the same for linked buffers.
    • No buffers are linked to host descriptors. See ti/drv/qmss/InfrastructureMode/infrastructure_mode.c and the init sketch after this list.
      • Cppi_setData(), Cppi_setOriginalBufInfo(), and Cppi_setPacketLen() need to be done once per init of each descriptor.
      • setPacketLen() needs to be done for each packet if packets are of variable length.
      • You can legally (as far as the hardware is concerned) link one buffer (say, a source buffer) to more than one descriptor.
      • You cannot push the same descriptor multiple times; it can be on at most one queue at a time.
    • The reason infrastructure_mode.c has no cache operations is that its buffers and descriptors are in L2, not MSMC.
      • If you use MSMC, you need to align the base address and pad the length to 64, and do NOT set the L2_CACHE define for Osal.c.
      • If you use DDR, you need to align the base address and pad the length to 128, and set the L2_CACHE define for Osal.c.
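
    A minimal sketch of the two access paths with the cache operations in place (the queue handles and DESC_SIZE are placeholders; Osal_qmssBeginMemAccess()/Osal_qmssEndMemAccess() are the hooks your osal.c has to implement):

    ```c
    #include <stdint.h>
    #include <ti/drv/qmss/qmss_drv.h>

    /* OSAL cache hooks implemented by the application's osal.c */
    extern void Osal_qmssBeginMemAccess (void *ptr, uint32_t size);
    extern void Osal_qmssEndMemAccess (void *ptr, uint32_t size);

    #define DESC_SIZE 64  /* descriptor aligned and padded to a full cache line */

    /* Pop path: invalidate before the CPU reads what the QM/PKTDMA wrote. */
    static void *popDesc (Qmss_QueueHnd rxQ)
    {
        void *desc = (void *) QMSS_DESC_PTR (Qmss_queuePop (rxQ));
        if (desc != NULL)
            Osal_qmssBeginMemAccess (desc, DESC_SIZE);
        /* Any linked buffer needs the same invalidate before its payload is read. */
        return desc;
    }

    /* Push path: write back before the QM/PKTDMA reads what the CPU wrote. */
    static void pushDesc (Qmss_QueueHnd txQ, void *desc)
    {
        Osal_qmssEndMemAccess (desc, DESC_SIZE);
        Qmss_queuePush (txQ, desc, DESC_SIZE);
    }
    ```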
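
    And a sketch of the one-time host-descriptor init along the lines of infrastructure_mode.c (NUM_DESC, BUF_SIZE, and the buffer pool are placeholders; the cache operations from the previous sketch are omitted for brevity):

    ```c
    #include <stdint.h>
    #include <ti/drv/qmss/qmss_drv.h>
    #include <ti/drv/cppi/cppi_drv.h>

    #define NUM_DESC 16
    #define BUF_SIZE 256  /* a multiple of 64 so MSMC buffers stay cache-line padded */

    #pragma DATA_ALIGN (dataBuffers, 64)
    static uint8_t dataBuffers[NUM_DESC][BUF_SIZE];

    /* Once per descriptor at init time: link a buffer, record the original
     * buffer info, set the packet length, and return it to the free queue.
     * Only Cppi_setPacketLen() must be repeated per packet if lengths vary. */
    static void linkBuffers (Qmss_QueueHnd freeQ)
    {
        uint32_t i;
        for (i = 0; i < NUM_DESC; i++)
        {
            Cppi_Desc *desc = (Cppi_Desc *) QMSS_DESC_PTR (Qmss_queuePop (freeQ));
            Cppi_setData (Cppi_DescType_HOST, desc, dataBuffers[i], BUF_SIZE);
            Cppi_setOriginalBufInfo (Cppi_DescType_HOST, desc, dataBuffers[i], BUF_SIZE);
            Cppi_setPacketLen (Cppi_DescType_HOST, desc, BUF_SIZE);
            Qmss_queuePushDesc (freeQ, desc);
        }
    }
    ```
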
  • Hi John,

    Thanks for your reply, it is highly appreciated.

    1. The first code line of the example disables the L1D cache completely with Cache_disable(Cache_Type_L1D), so MSMC accesses should be uncached. I guess this should eliminate most of the cache concerns you mentioned. That's also the reason why the descriptor memory region is only aligned to 16 bytes.
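
    For reference, a minimal sketch of that first call (the surrounding main() is just a stand-in for the demo's startup code):

    ```c
    #include <ti/sysbios/hal/Cache.h>

    int main (void)
    {
        /* First statement of the demo: turn L1D off completely so that
         * all data accesses, including MSMC SRAM, are uncached. */
        Cache_disable (Cache_Type_L1D);
        /* ... QMSS/CPPI setup and the rest of the example ... */
        return 0;
    }
    ```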

    2. I intentionally haven't linked any buffers to the host descriptors yet, so data/originalBufInfo as well as packetLength are all 0 for now (set by memset).
    I'll give the sample a try tomorrow with host buffers attached to see whether it makes a difference.

    Thank you so far, Clemens

    PS: Would it make sense to forward the attached sample project to the simulator team? In my understanding the simulator should behave like the hardware, so this could be considered a simulator bug.

  • Sorry about the cache; it's a tree I like to bark up, and I'm usually right.

    The Navigator user guide (http://www.ti.com/lit/sprugr9) requires that buffers be linked before descriptors are fed to an RX DMA. See section 2.3.4.1.

    I tried changing Cppi_DescType_HOST to Cppi_DescType_MONOLITHIC, and your example seems to work as intended, making it likely that the DMA is hanging on the descriptor's buffer pointer being NULL.
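
    For reference, the send path with monolithic descriptors looks roughly like this (a sketch only; the helper name, queue handles, and the 12-byte data offset for a plain monolithic header are my placeholders, and cache operations are omitted since your demo disables L1D):

    ```c
    #include <stdint.h>
    #include <string.h>
    #include <ti/drv/qmss/qmss_drv.h>
    #include <ti/drv/cppi/cppi_drv.h>

    #define MONO_DATA_OFFSET 12  /* payload starts after the 12-byte monolithic header */

    /* Pop a free monolithic descriptor, copy the payload into it (len must
     * fit in the descriptor size configured at init), and push it to TX. */
    static void sendMessage (Qmss_QueueHnd freeQ, Qmss_QueueHnd txQ,
                             const void *msg, uint32_t len)
    {
        Cppi_Desc *desc = (Cppi_Desc *) QMSS_DESC_PTR (Qmss_queuePop (freeQ));
        if (desc == NULL)
            return;  /* free queue starved */
        Cppi_setDataOffset (Cppi_DescType_MONOLITHIC, desc, MONO_DATA_OFFSET);
        Cppi_setPacketLen (Cppi_DescType_MONOLITHIC, desc, len);
        memcpy ((uint8_t *) desc + MONO_DATA_OFFSET, msg, len);
        Qmss_queuePushDesc (txQ, desc);
    }
    ```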

  • I also removed the memset() because it clobbers the init done by initDescriptor().

  • Hi John,

    Thanks a lot for your assistance - I can confirm that after switching to monolithic descriptors the code works as intended.
    Currently I see a ping-pong latency of about 900 cycles; I am curious to see how that will change with separate descriptors in L2 with the L1D cache enabled.

    It would be great to have the simulator fixed so it behaves like the actual hardware.

    Thanks, Clemens