
Possible bug in PKTDMA/CPPI



Hello TI folks,

I think I have found a possible bug in how the PKTDMA is working.

Here is a summary of what I am trying to do. I have 80-byte packets (not including the Eth/IP/UDP headers) coming into an EVM via Ethernet and the PA, and stored in DDR3. I then copy them into the core's L2 memory. An object instance takes that input and packages it into a buffer (which is in L2), which is then passed on to a second object at a given time interval. This second object copies the data from the buffer passed between the objects into its own local buffer. This local buffer is originally in DDR3, but the problem still happens if it is in L2. When the second object is ready to send out the packet, it pops a descriptor off the free Tx queue and pushes the descriptor to the switch to send it out through Ethernet. I repeat this process for two other channels, so in total I have 3 channels operating simultaneously sending packets out in a purely loopback mode; the data portion of the incoming packets, suitably delayed through the system, should match the data portion of the outgoing packets. What I find is that when I have just a single channel running, I do not see any data corruption. However, as I go to 3 channels, I see some corruption.

To investigate further, I produced canned messages such that each channel has an incoming 80-byte packet that is different from the other channels; for example, channel 1 may use char values 0xff, 0xfc, 0xf8, etc., while channel 2 may only use 0xfe, 0xfb, 0xf7, and so on. Then I fed the canned messages as input for each of the channels. I compared the buffer passed between the objects for channel 1 against the expected input of channel 1, repeated this for the other channels, and set up a breakpoint to be triggered if there was any difference. I found none. I then repeated this step for the local buffer being used by the object that will be sending the packets and found again that the data in the buffer, before popping a descriptor and pushing to send out, matched the channel's expected input. I then looked at the output and observed that occasionally I was seeing data from other channels in the expected channel data, which would point to some cross-coupling happening in the PKTDMA engine.

I then set up the queue to push the packets back to core 0 using the infrastructure PKTDMA, so that they would be received by the same core and I could compare the actual received packet with the expected packet. I made use of the psinfo words to send information about the descriptor and buffer used in the Tx process, so that when I received the packet I could check whether the Tx buffers or descriptors had been corrupted, and they do not seem to be. This seems to point to some internal software error within the PKTDMA drivers that is causing this coupling, or worse still, some issue in silicon in the PKTDMA hardware blocks.

I have tried looking at the errata for the chip as well as the PA LLD and CPPI LLD release notes in particular, and I do not see anything glaring. I am using CPPI LLD 1.0.1.5, PA LLD c6678_1_0_0_19, and mcsdk 2_00_07_19. The EVM is using rev 1.0 of the silicon, I believe.

This is a critical issue for us to resolve, and I am looking forward to TI's assistance in this regard.

Thanks, Aamir

  • I forgot to add that after I received the packet back from the infrastructure PKTDMA, I set up breakpoints to trigger if the comparison fails to match, and that is what happened, proving conclusively that the data received was different from the data supposedly sent.

    Thanks, Aamir

  • This seems like a cache coherence issue. Remember the C66x does not have a coherent cache, so the software must invalidate and write back the cache as needed.

    It looks like you tried to prove this by moving all of your buffers and descriptors to the L2 of the core that will access the buffer (remember that remote access to another core's L2 via the 0x1#800000 address is not coherent). I also assume that you removed your cache coherence operations on your buffers and descriptors when you moved to L2. If you didn't, you could still have a coherence problem. Whenever you do such cache operations, the buffers and descriptors *must* be aligned and padded to a multiple of 128 bytes (or 64 bytes if the buffers/descriptors are in MSMC or otherwise only cacheable in L1). Thus if your packet buffers are 80 bytes, you must round up their allocation to 128 bytes (the same holds for descriptors).
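
    To make the align/pad point concrete, here is a minimal sketch using the SYS/BIOS Cache module; the buffer and function names are only illustrative, so adapt them to whatever wrappers your project already has:

        #include <stdint.h>
        #include <xdc/std.h>
        #include <ti/sysbios/hal/Cache.h>

        /* Pad the 80-byte payload up to a full 128-byte line so a cache
         * operation on this buffer can never touch a neighbouring object. */
        #define CACHE_LINE_SZ  128
        #define PKT_BUF_SZ     ((80 + CACHE_LINE_SZ - 1) & ~(CACHE_LINE_SZ - 1))

        #pragma DATA_ALIGN (pktBuf, CACHE_LINE_SZ)   /* start on a line boundary  */
        static uint8_t pktBuf[PKT_BUF_SZ];           /* ...and occupy whole lines */

        void txBufferReady (void)
        {
            /* CPU filled the buffer, DMA reads it next: write back (inv optional) */
            Cache_wbInv ((Ptr) pktBuf, PKT_BUF_SZ, Cache_Type_ALL, TRUE);
        }

        void rxBufferArrived (void)
        {
            /* DMA filled the buffer, CPU reads it next: invalidate first */
            Cache_inv ((Ptr) pktBuf, PKT_BUF_SZ, Cache_Type_ALL, TRUE);
        }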

    Note: CPPI 1.0.1.5 has a known cache bug which is fixed in 1.0.2.x. This only affects its internal structures, not the actual descriptors and buffers. Also make sure that Osal_cppiMalloc is using a HeapMemMP if you call CPPI APIs from more than one core. The BIOS MCSDK from http://software-dl.ti.com/sdoemb/sdoemb_public_sw/bios_mcsdk/latest/index_FDS.html has the new version of CPPI.
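
    For reference, a sketch of what I mean for the OSAL hook, assuming the prototypes used by the PDK example OSAL and a HeapMemMP instance that your application creates or opens on each core (how that heap is set up is application specific, so check your own OSAL file before copying this):

        #include <xdc/std.h>
        #include <ti/ipc/HeapMemMP.h>

        /* Shared-region heap created/opened by every core that calls CPPI APIs */
        extern HeapMemMP_Handle cppiSharedHeap;

        /* CPPI's internal objects must come from memory visible to all cores */
        Ptr Osal_cppiMalloc (UInt32 num_bytes)
        {
            /* 128-byte alignment keeps CPPI's own cache operations line-safe */
            return HeapMemMP_alloc (cppiSharedHeap, num_bytes, 128);
        }

        Void Osal_cppiFree (Ptr ptr, UInt32 size)
        {
            HeapMemMP_free (cppiSharedHeap, ptr, size);
        }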

    Assuming these (missing cache coherence operations, unaligned buffers/descriptors, CPPI 1.0.1.5) are not the problem, do you have an example or test case showing this problem?

  • John,

    Can you explain what the cache bug fix is and how it manifests itself? Can I take the fix without making changes to my PDK or MCSDK version, or do I have to upgrade those too?

    Note: I am not accessing any other core's L2.

    John, sorry, I will just rewrite the steps so that you can see any gotchas in alignment, coherency, etc. I am not using L2 as cache.

    My descriptors used by the CPPI are 64 bytes long and are aligned on 64-byte boundaries; they only have an align(16) directive before the allocation, but they end up 64-byte aligned, so I did not change that. They are stored in DDR3. The incoming packets to the DSP are also stored in buffers in DDR3 that are 64-byte aligned and 256 bytes in length. I have the invalidate of both the descriptors and the buffers in the function that handles the receiving of the packets.
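
    In code, that receive step looks roughly like this (the queue handle name is a placeholder; SYS_CACHE_INV and SIZE_HOST_DESC are the same macros shown a little further down):

        #include <stdint.h>
        #include <ti/drv/cppi/cppi_desc.h>
        #include <ti/drv/qmss/qmss_drv.h>

        static void receiveOnePacket (Qmss_QueueHnd rxQueHnd)
        {
            Cppi_HostDesc *pCppiDesc;
            uint8_t       *rxBuf;
            uint32_t       rxLen;

            /* Pop the descriptor the PA/PKTDMA filled and invalidate our cached copy */
            pCppiDesc = (Cppi_HostDesc *) QMSS_DESC_PTR (Qmss_queuePop (rxQueHnd));
            SYS_CACHE_INV (pCppiDesc, SIZE_HOST_DESC, CACHE_FENCE_WAIT);

            /* Invalidate the 64-byte aligned, 256-byte DDR3 buffer before the CPU reads it */
            Cppi_getData (Cppi_DescType_HOST, (Cppi_Desc *) pCppiDesc, &rxBuf, &rxLen);
            SYS_CACHE_INV (rxBuf, rxLen, CACHE_FENCE_WAIT);

            /* ...the memcpy into the L2 ping-pong buffer happens after this */
        }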

    It is after this step that I do the memcpy from DDR3 to a ping-pong buffer in my core's L2 memory. Do the ping-pong buffers also need to be 64-byte aligned? Actually, the ping-pong buffer has two portions, a header portion for an internal proprietary header and the data portion; the header was aligned to 64 bytes and was 12 bytes in length, but the data portion (the packet I received over Ethernet) was not aligned to 64 bytes. Once I aligned the data portion, I no longer seem to hit the breakpoints in the packet match comparison with the expected packets after I loop them back to the core again using the infrastructure DMA. I am not quite sure why that seems to resolve the issue but maybe you can help bring some more clarity? In any case, when I go back to my original file in place of the canned packet, I still see differences between the channels when there should be none.
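
    For reference, after the change the ping-pong slot layout looks roughly like this (the names are made up for illustration):

        #include <stdint.h>

        #define L1D_LINE_SZ  64

        /* The 12-byte proprietary header is padded out to a full 64-byte line so
         * the packet data starts on its own line instead of sharing one with it. */
        typedef struct PingPongSlot
        {
            uint8_t header[12];                  /* internal proprietary header    */
            uint8_t pad[L1D_LINE_SZ - 12];       /* pad header portion to one line */
            uint8_t data[256];                   /* packet received over Ethernet  */
        } PingPongSlot;

        #pragma DATA_ALIGN (pingPong, L1D_LINE_SZ)
        static PingPongSlot pingPong[2];         /* ping and pong                  */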

    From the ping-pong buffer the data is again memcpy'ed to what I will call transfer buffers by an input C++ object. These buffers are in my core's L2 and are 92 or 172 bytes in size. I have hundreds of these allocated from a heap and then broken down into 92- or 172-byte buffers, so I essentially manage my own run-time heap. They are not aligned. Do I need to have invalidate and writebacks? I thought not as L2 is coherent with L1D, is it not? The output C++ object takes the data in the transfer buffer and copies it to a buffer which was at first in DDR3 and part of the object, but I then moved it to L2, as you rightly point out, to see if it was a caching issue, since I am not doing the cache flushing before I call the function to send the packet out. However, in the function that sends the packet out, I grab a descriptor from the Tx free queue and invalidate the descriptor and the buffer pointer within the descriptor with the calls:

        SYS_CACHE_INV (pCppiDesc, SIZE_HOST_DESC, CACHE_FENCE_WAIT);
        SYS_CACHE_INV ((Ptr)pCppiDesc->buffPtr, pCppiDesc->buffLen, CACHE_FENCE_WAIT);

    I then overwrite the buffPtr with the new buffer pointer, which was the buffer within my C++ object (in DDR3) but is now a buffer in L2. I made this new buffer 64-byte aligned and still saw the issue until I made the correction of aligning the data portion in the ping-pong buffer.

    As for an example showing this, I have our proprietary code doing that, and it will be hard to extricate it and recreate the situation in more general terms.

    Thanks, Aamir

     

  • John,

    I had made a mistake in my code and so was incorrect in saying that I saw no errors. After I fixed the alignment of the data portion, I still see errors. I shall pen a post explaining that tomorrow morning.

    Aamir

  • Aamir Husain said:
    Can you explain what the cache bug fix is and how it manifests itself? Can I take the fix without making changes to my PDK or MCSDK version, or do I have to upgrade those too?

    The cppi_listlib, which is used to track Rx/Tx channels and flows, was not handling the cache properly if CPPI calls were made from more than one core. This would generally cause the refcnt to be wrong, because it would allocate the same object on two cores instead of having the second core reference the object created on the first core. There was also an align-and-pad error, which could cause CPPI to whack whatever is next to cppiObject in the .cppi section (including something that has nothing to do with CPPI). Thus it would act as if the refcnt became 0 when it was still nonzero, and it would program the hardware twice. You can see the details by looking at CPPI's release notes in <pdk>\docs\ReleaseNotes_CPPI_LLD.pdf.

    While we don't recommend picking and choosing components, I suspect you can replace the packages/ti/drv/cppi from your current PDK with the one from the latest PDK, while leaving the other modules alone.

    Aamir Husain said:
    Do the ping-pong buffers also need to be 64-byte aligned?

    If these buffers are used exclusively by a single CPU and no other masters (other CPUs, PKTDMA, EDMA, etc.), then they do not need to be aligned, nor do they need wb(inv) after CPU writes and inv before CPU reads. It isn't 100% clear to me whether you memcpy() out of the ping-pong buffer, or whether the infrastructure DMA does this.

    If these buffers are filled by the CPU with memcpy() and then read by the infrastructure DMA, then they must be aligned, and you must do a wb(inv) with a FENCE_WAIT before you push the data to the DMA. Otherwise, the data written by memcpy() could sit in the cache while the DMA reads the old data sitting in DDR.

    When I say wb(inv), I mean that the "inv" part is optional but recommended, and the wb portion is required.
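
    In code terms, the ordering I mean is roughly the following; this is a sketch only, the queue handle, buffer names and the 64-byte SIZE_HOST_DESC are placeholders matching what you described, and I am using the BIOS Cache calls directly rather than your SYS_CACHE_* wrappers:

        #include <stdint.h>
        #include <string.h>
        #include <xdc/std.h>
        #include <ti/sysbios/hal/Cache.h>
        #include <ti/drv/cppi/cppi_desc.h>
        #include <ti/drv/qmss/qmss_drv.h>

        #define SIZE_HOST_DESC  64   /* 64-byte host descriptors, as you described */

        static void sendOnePacket (Qmss_QueueHnd  txQueHnd,
                                   Cppi_HostDesc *pDesc,
                                   uint8_t       *txBuf,    /* aligned + padded buffer */
                                   const uint8_t *srcData,
                                   uint32_t       pktLen)
        {
            memcpy (txBuf, srcData, pktLen);     /* CPU writes the buffer */

            /* Writeback ("inv" part optional), with wait, *before* the push;
             * otherwise the DMA may read stale data straight out of DDR.    */
            Cache_wbInv ((Ptr) txBuf, pktLen, Cache_Type_ALL, TRUE);

            /* Point the descriptor at the buffer (use the global 0x1#8xxxxx alias
             * if txBuf lives in L2), then write the descriptor back as well.     */
            Cppi_setData (Cppi_DescType_HOST, (Cppi_Desc *) pDesc, txBuf, pktLen);
            Cppi_setPacketLen (Cppi_DescType_HOST, (Cppi_Desc *) pDesc, pktLen);
            Cache_wbInv ((Ptr) pDesc, SIZE_HOST_DESC, Cache_Type_ALL, TRUE);

            /* Only now hand the descriptor to the hardware */
            Qmss_queuePushDescSize (txQueHnd, (Ptr) pDesc, SIZE_HOST_DESC);
        }

    The important part is simply that both writebacks complete (wait = TRUE) before the push, so the hardware never races the cache.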

    Aamir Husain said:
    I am not quite sure why that seems to resolve the issue but maybe you can help bring some more clarity?

    A little explanation. The problem is called "false sharing" in the CS literature. False sharing on automatically coherent systems reduces performance (but is correct). On systems which require manual coherence, false sharing results in incorrect operation. For example, let's say you have a 40-byte buffer at address 0x00800020. This means that 32 bytes of the buffer sit on the line associated with 0x00800000, and 8 bytes of the buffer sit on the line associated with 0x00800040. Here are some things that can go wrong (a small helper that sidesteps these by rounding to whole lines is sketched after the list):

    1. You invalidate 0x00800020, length 40, trying to read the buffer with the CPU:
        1. This unintentionally writes between 0x00800000 and 0x0080001f (what happened to be there during dynamic execution?).
        2. This may unintentionally kill everything between 0x00800048 and 0x0080007F (depending on whether the length is increased to 104 or left at 40).
        3. This may not invalidate the last 8 bytes of the buffer (if the length was not rounded up to 104).
    2. You write back 0x00800020, length 40, after writing the buffer with the CPU:
        1. If something else (DMA, another CPU) wrote something between 0x00800000 and 0x0080001F (such as a DMA for another packet just before this one), it will be whacked by the cache writeback, losing the other write.
        2. Similar issues as above if the length isn't increased to 104.
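
    If it helps, a small helper along these lines (a sketch, using the BIOS Cache call) keeps the rounding arithmetic in one place by rounding the start down and the end up to whole lines. It is only safe if the buffer itself was allocated aligned and padded, so nothing else lives in the lines it pulls in:

        #include <stdint.h>
        #include <xdc/std.h>
        #include <ti/sysbios/hal/Cache.h>

        #define L1D_LINE_SZ  64u   /* use 128 for data that is also cached in L2 */

        static inline void cacheInvLines (void *addr, uint32_t len)
        {
            uint32_t start = (uint32_t) addr & ~(L1D_LINE_SZ - 1);      /* round down */
            uint32_t end   = ((uint32_t) addr + len + L1D_LINE_SZ - 1)
                                               & ~(L1D_LINE_SZ - 1);    /* round up   */

            Cache_inv ((Ptr) start, end - start, Cache_Type_ALL, TRUE);
        }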

    One more gotcha: the L1D is a read-allocate, write-back cache. This means that a cache line is only loaded on a read. Thus if you do a memcpy() and the write side of the buffer is not in the cache, it won't enter the cache. However, if it's left over from a previous packet, the write will go into the cache. Thus when you single-step the first time, it's likely you won't see the problem, but you'll see it running at full speed.

    Aamir Husain said:
    Do I need to have invalidate and writebacks? I thought not as L2 is coherent with L1D, is it not?

    The L2 is coherent with the L1D, so align, pad, inv, and wb are not needed.


  • John,

    From your description of the cache bug fix, this would not seem to be my problem, as I am only working on a single core for now, but the alignment issue may cause some problems. I went back over different MCSDK version release notes to find the issue that you are talking about, but the release notes really have no detailed explanation; your explanation was quite helpful. I am just a little nervous about making changes to the MCSDK (bug fixes notwithstanding). Also, you mentioned earlier that one should change Osal_cppiMalloc to make use of HeapMemMP if calling from multiple cores. My code is based on the multicore example that is provided with the PDK, and that makes use of HeapMem, not HeapMemMP, so am I to assume that later versions of the examples have been fixed to correctly use HeapMemMP? Otherwise the multicore example may not work.

    My ping-pong buffers are not used by any other master, only the core, and they are in L2 for that particular core. Currently I have data memcpy'ed to them from DDR3, though I intend later to use EDMA/QDMA to transfer the packets. I will then have to take care to ensure that, prior to the EDMA transfer, I invalidate the 64-byte aligned destination buffer in L2 so that if it is in cache it gets removed, then perform the EDMA copy from DDR3 to L2, and then the CPU can cache it when it accesses it. Is that correct? Actually, I should be able to invalidate the destination buffer after the EDMA transfer too, as long as I have not tried reading the destination buffer with the CPU before, or does it matter?

    The ping-pong buffers are memcpy'ed to transfer buffers which are also in L2 and not used by any master other than the core itself. The transfer buffers are memcpy'ed to an output buffer, and it is from this buffer that the PKTDMA reads to send the packets out. Since L2 is coherent with L1D unless it is used by another master, is that also true for buffers in external DDR3, i.e., are they coherent if not used by another master? I know in my case I intend to use another master so it does not apply, but just for general information's sake.

    Thanks for your help. I figured out my problem: I was invalidating my output buffer in L2 and losing the updates the CPU had made in cache. Even though I was doing a wb/inv before sending the packet out, the inv had already lost the update, so the wb obviously could not write back the dirty line, as it had been invalidated before. What I could not figure out was why only parts of the updated packet were lost on the invalidate of the cache line while others were not, when I single-stepped through the code. Let me explain. The output buffer consists of a 12-byte proprietary header + 14-byte Ethernet header + 20-byte IP header + 8-byte UDP header + 12-byte RTP header + 80 or 160 bytes of packet data. The buffer is 64-byte aligned. In the previous send I had sent a 160-byte data packet, for a total IP packet length of 200, i.e. 0xc8, and a UDP length of 180 bytes, i.e. 0xb4. Both the IP length and the UDP length are in the first 64 bytes from the start of the alignment, and so should be in the same cache line. Now in this frame I am sending an 80-byte data packet, so I compute the UDP and IP lengths and put them in the packet, i.e. 0x78 and 0x64. After the incorrect invalidate, I would expect to lose both the updated IP and UDP header lengths, but I end up losing the updated UDP length while the IP length remains the updated one in memory.
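
    For reference, the field offsets from the 64-byte aligned start of my output buffer work out roughly as follows (assuming no VLAN tag and a plain 20-byte IP header), which is why I expected both length fields to sit in the same first cache line:

        /* Offsets from the 64-byte aligned start of the output buffer */
        #define OFF_PROP_HDR   0    /* 12-byte proprietary header        */
        #define OFF_ETH_HDR   12    /* 14-byte Ethernet header           */
        #define OFF_IP_HDR    26    /* 20-byte IP header                 */
        #define OFF_IP_LEN    28    /*   IP total-length field (2 bytes) */
        #define OFF_UDP_HDR   46    /*  8-byte UDP header                */
        #define OFF_UDP_LEN   50    /*   UDP length field (2 bytes)      */
        #define OFF_RTP_HDR   54    /* 12-byte RTP header                */
        #define OFF_PAYLOAD   66    /* 80 or 160 bytes of data           */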

    Thanks for the false sharing examples. Why is it that when you do the invalidate before the read at address 0x00800020 with length 40 bytes, the last 8 bytes are not invalidated? After all, one is asking to invalidate 40 bytes, so two cache lines should be invalidated. Additionally, in item 1 of the invalidate portion, why is there an unintentional write between 0x00800000 and 0x0080001f? Should it not be that earlier writes between 0x00800000 and 0x0080001f would be lost, as that line is removed from the cache before the changes can be written back to the lower-level memory?

    Thanks for the advice about the read-allocate gotchas.

    Thanks, Aamir