Cached data appears although caches are invalidated

Markus Grunwald

Expert 2060 points

Other Parts Discussed in Thread: SYSBIOS, OMAP-L138

Hello,

I'm currently seeing behavior on my DM814x that I really don't understand. In short: "old" data that was in a buffer before an EDMA transfer into that buffer reappears and destroys current data. This happens only with release builds and only if caching is enabled.

This is what I do:

// Initialisation of buffer (U_FIFO_BUFFER_SIZE==4096) :
g_au32FPGAData = (uint32_t*) Memory_alloc( NULL, sizeof( uint32_t ) * U_FIFO_BUFFER_SIZE,
                                           EDMA3_CACHE_LINE_SIZE_IN_BYTES,
                                           rEH.GetErrorBlockPtr());
rEH.SysRaise( CPTErrorHandler::DEC_MALLOC_FAILED, 0, "Memory_alloc failed", 0);

// Usage of the buffer:

// Only for debugging: fill the buffer with easy to recognize data.
// Make sure the 0xBD are written to the buffer and caches are invalidated
memset( g_au32FPGAData, 0xBD, U_FIFO_BUFFER_SIZE * sizeof(uint32_t) );
Cache_wbInvAll();

// get a big part of the fifo in one go
edmaTransfer( (UInt32*) (PCIMEM_BASE + UI_FIFODATA),
              (UInt32*) &g_au32FPGAData[0],
              u32ToRead );

The data in the buffer consists of short message frames. All frames have to start with a magic start-of-frame value. Usually right after the first transfer (about 3000 uint32_ts / floats) the check for the start of frame of a message fails. When I look at the data in the memory browser, I see this:

Hmm, I hope one can read this...

So somewhere, usually in about the same place, 0xBD appears instead of the valid frame data. Everything else in the buffer is correct. The large run of 0xBD at the end of the buffer simply has not been overwritten.

I recognized this behavior only after I started to transfer data from my fifo in "large" (~4K floats) chunks instead of small ones (~ one message per transfer). When I disable caching, do a debug build, or have a look at the buffer in CCS's Memory Browser /before/ the frames are evaluated, I don't get the error.

Just for completeness I'll attach the code for edma setup and the edmaTransfer.

I really hope you can help...

TIA

Markus

files.zip

over 11 years ago

0 Matthijs van Duin over 11 years ago

Mastermind 8040 points

The first thought that comes to mind would be: could someone from TI please confirm whether or not the DSP megamodule revision used on the DM814x is one that fixes all of the many cache-coherency bugs that apparently used to plague C64x+/C674x GEM in general?

0 Matthijs van Duin over 11 years ago

Mastermind 8040 points

Markus Grunwald said:

Hmm, I hope one can read this...

No worry: clicking on the image will show it at its original size. However, it would be helpful to resize the window to make the number of columns some power of 2. Having 16 columns (64 bytes/line) would make every line of the memory view exactly one L1D cache line.

After the transfer, what sort of access pattern is used on the data? Is it read more or less linearly? Is it also modified?

Other things you can try to help narrow down the issue: what if only L1D or only L2 cache is enabled? What if you partially configure L2 as memory instead of cache, and use that as buffer instead of main memory? (i.e. use EDMA to transfer directly into L2).

I have some other suspicions, I'll try to do some tests later today since any potential issues with EDMA and DSP-based processing interest me.

BTW, side note: your "wait for transfer to finish" loop looks a bit dubious since the amount of time it takes will depend on compiler settings. It would be better to use the CPU's cycle counter (TSCL, defined in <c6x.h>). It needs to be started once at boot by writing any value into it (I don't know if SYS/BIOS already does this).
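A minimal sketch of such a cycle-counter-based wait, written so that it also runs on a host: the counter and the completion check are passed in as function pointers. On the DSP, the `now` callback would simply return TSCL. The names `wait_with_timeout`, `now_fn`, `done_fn`, and the two host-side stand-ins are hypothetical and not part of SYS/BIOS or the EDMA3 driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*now_fn)(void);    /* returns a free-running cycle count */
typedef bool (*done_fn)(void *ctx);  /* returns true once the transfer is done */

/*
 * Poll done() until it reports completion or until timeout_cycles have
 * elapsed. The unsigned subtraction stays correct across one wraparound
 * of a 32-bit counter.
 */
static bool wait_with_timeout(now_fn now, done_fn done, void *ctx,
                              uint32_t timeout_cycles)
{
    assert(now != NULL && done != NULL);
    const uint32_t start = now();

    while (!done(ctx)) {
        if ((uint32_t)(now() - start) >= timeout_cycles) {
            return false;  /* timed out: the buffer must not be touched yet */
        }
    }
    return true;
}

/* Host-side stand-ins so the sketch can be exercised without hardware. */
static uint32_t fake_cycles;

static uint32_t fake_now(void)
{
    fake_cycles += 10u;  /* pretend every poll costs 10 cycles */
    return fake_cycles;
}

static bool done_after(void *ctx)
{
    uint32_t *polls_left = ctx;  /* reports "done" once this reaches zero */
    if (*polls_left == 0u) {
        return true;
    }
    --*polls_left;
    return false;
}
```

The elapsed time no longer depends on how the compiler optimizes the loop body, only on the counter.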

0 Badri Narayanan over 11 years ago in reply to Matthijs van Duin

TI__Guru 59700 points

Apart from the excellent points suggested by Matthijs van Duin, please check the following:

1. I see you are aligning the memory allocation to the 128-byte cache line size. Assert that the start address is aligned to 128 bytes and that the length is a multiple of 128.

2. Don't do Cache_wbInvAll. Always do writeback-invalidate by address range. Doing wbInvAll is a very bad idea, and if you have similar code executing from other threads it would result in randomly writing cache lines to memory.

3. Try doing wbInv in a loop over smaller chunks. There was an issue with invalidating more than 64K at once, but that issue is not present in SYS/BIOS. Anyhow, if it works when invalidating in smaller chunks, that would give some clue.

4. Writes from the C674x are posted writes. There is a small possibility that wbInv returned before the content got updated in DDR and that the late write then overwrote the EDMA contents. Try reading back the last bytes at the end of the Cache_wbInv block address range. Such a hack is actually not required; this is again only for debugging purposes.

5. Confirm, by deselecting L1D and L2 in the CCS memory view, that the contents in DDR are really 0xBDBD and not the EDMAed contents.
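Points 1, 3, and 4 above can be sketched roughly as follows. This is host-runnable C with the cache operation passed in as a callback; on the target the callback would wrap a range operation such as Cache_inv or Cache_wbInv. All names here (`is_line_aligned`, `cache_op_chunked`, `readback_last_word`, `count_op`) are hypothetical helpers, and the 128-byte line size is taken from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 128u  /* L2 line size assumed from the thread */

/* Stand-in for a range cache operation such as Cache_inv(ptr, bytes, ...). */
typedef void (*cache_op_fn)(void *start, size_t bytes, void *ctx);

/* Point 1: true if both the start address and the length are line multiples. */
static bool is_line_aligned(const void *start, size_t bytes)
{
    return ((uintptr_t)start % CACHE_LINE_BYTES) == 0u &&
           (bytes % CACHE_LINE_BYTES) == 0u;
}

/* Point 3: apply op to [start, start + bytes) in pieces of at most chunk bytes. */
static void cache_op_chunked(void *start, size_t bytes, size_t chunk,
                             cache_op_fn op, void *ctx)
{
    unsigned char *p = start;

    assert(op != NULL);
    assert(chunk != 0u && chunk % CACHE_LINE_BYTES == 0u);
    while (bytes > 0u) {
        size_t n = (bytes < chunk) ? bytes : chunk;
        op(p, n, ctx);
        p += n;
        bytes -= n;
    }
}

/* Point 4: read the last 32-bit word of the block through a volatile
 * pointer so the compiler cannot drop the load. Debug aid only. */
static uint32_t readback_last_word(const volatile uint32_t *block, size_t bytes)
{
    assert(bytes >= sizeof(uint32_t) && bytes % sizeof(uint32_t) == 0u);
    return block[bytes / sizeof(uint32_t) - 1u];
}

/* Host-side stand-in that only counts calls and bytes. */
struct op_log {
    size_t calls;
    size_t bytes;
};

static void count_op(void *start, size_t bytes, void *ctx)
{
    struct op_log *log = ctx;
    (void)start;
    log->calls++;
    log->bytes += bytes;
}
```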

0 Matthijs van Duin over 11 years ago in reply to Badri Narayanan

Mastermind 8040 points

Badri Narayanan said:
There is a small possibility that wbInv returned before the content got updated in DDR and overwrote the EDMA contents.

Actually, when using wbInvAll you also flush your code from cache, so the subsequent instructions need to be fetched again from external RAM which effectively provides an "OCP barrier" similar to the readback you're suggesting.

In general, however, you make a good point here, and I'm not so sure that "such a hack is actually not required" when using range-ops for cache writeback. This situation is even warned about in the TRM (7.2.6.3). The problem is somewhat mitigated in this case by the fact that the DSP and EDMA TC0 connect to the same switch within the L3 interconnect and use the same DMM port, so their traffic to RAM will follow the same path and should stay in order (unless L3 initiator pressure is used to prioritize TC0 over the DSP), and EMIFs won't reorder requests to the same address (the same 2 KB block even), although the TRM is unfortunately completely silent on what kind of reordering can be done by the ROBINs in DMM.

Another issue is that since EDMA configuration takes a completely different route (through the config port), it is theoretically even possible for EDMA to be started before the last cache-writeback write has even reached the L3 interconnect, depending on how much buffering is done in the async bridge and how congested the L3F is at that moment, though this scenario seems unlikely to me.

As I said, wbInvAll already implicitly provides an OCP barrier whenever your code is located in external RAM, so this seems to preclude non-causal ordering problems from being to blame in this particular case, though maybe it's somehow still an issue between the end of the EDMA transfer and subsequent reading from the DSP (though it shouldn't be, since EDMA uses non-posted writes of course).

At the same time, the only cache coherency errata I've seen w.r.t. the C674x GEM (e.g. in the OMAP-L138) relate to DMA requests to DSP L2 memory, so even if they are present in the GEM on the DM814x (its "revision" field is zero, so if that field means anything then those errata would be present) they still couldn't be causing the issue Markus Grunwald is seeing here.

So although I'm still inclined to blame either cache issues or non-causal ordering of memory requests, I'm still not sure how exactly since none of the most obvious potential problems seem to apply here.

More things that can be tried to help narrow down the problem:

If either L3 initiator pressure or EMIF request priority (using PEG in DMM) is used to prioritize EDMA TC0 requests over DSP MDMA requests or vice versa, try setting them to equal priority.
Putting the buffer in some on-chip SRAM instead of external RAM, such as HDVICP SL2 if it's not in use for other purposes.

0 Markus Grunwald over 11 years ago in reply to Badri Narayanan

Expert 2060 points

Hello Matthijs and Badri,

sorry for my late answer, we had a few free days in Germany.

Thanks to your suggestions, I came a big step further, but there's still a problem. I'll try to answer your questions first:

After the transfer, what sort of access pattern is used on the data? Is it read more or less linearly? Is it also modified?

The data is processed linearly. I have a pointer to a header structure, see if it's valid (SOF=0xFFFF1234, you see it in the dump) look at the header, decide where to put the following payload, read the payload and advance the pointer to the header to the next frame. This is repeated for the whole buffer. Data of the buffer is not modified.
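The linear walk described above might look roughly like the sketch below. The frame layout is an assumption made purely for illustration: every frame is taken to have the same length in 32-bit words and to start with the SOF marker, and the payload handling is left out. `first_bad_frame` is a hypothetical name, and the buffer is read as plain `uint32_t` words so that no float-to-integer reinterpretation (such as `_ftoi`) is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk fixed-size frames in buf and return the index of the first frame
 * whose first word is not the SOF marker. If every complete frame is
 * valid, the number of complete frames is returned instead.
 */
static size_t first_bad_frame(const uint32_t *buf, size_t n_words,
                              size_t frame_words, uint32_t sof)
{
    size_t frame = 0u;

    assert(buf != NULL && frame_words != 0u);
    for (size_t pos = 0u; pos + frame_words <= n_words; pos += frame_words) {
        if (buf[pos] != sof) {
            return frame;  /* stale or corrupt data where a header should be */
        }
        frame++;
    }
    return frame;
}
```

With 276-byte messages, `frame_words` would be 69.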

what if only L1D or only L2 cache is enabled?

I have not tried this, yet.

I see you are aligning the memory allocation to the 128-byte cache line size. Assert that the start address is aligned to 128 bytes and that the length is a multiple of 128.

I checked this and both conditions are true.

Don't do Cache_wbInvAll. Always do writeback-invalidate by address range. Doing wbInvAll is a very bad idea, and if you have similar code executing from other threads it would result in randomly writing cache lines to memory.

Try doing wbInv in a loop over smaller chunks. There was an issue with invalidating more than 64K at once, but that issue is not present in SYS/BIOS. Anyhow, if it works when invalidating in smaller chunks, that would give some clue.

Cache_wbInvAll was only for debugging purposes. Please have a look at edmaTransfer in the attached zip file: there I'm using Cache_wbInv / Cache_inv, but for the whole big block. I corrected this, but please see my code at the end of this post.

Confirm by deselecting L1D and L2 in CCS memory view that contents in DDR are really 0xBDBD and not the EDMAed contents.

It's the other way round: When L1D and L2 are selected, I see 0xBD like in the screenshot. The pointer to my header data in CCS's "Variables" view shows 0xBD, too. This way, I'm able to detect the error, because the header should start with 0xFFFF1234 but it points to 0xBD.

If I deselect L1D and L2, CCS memory dump shows me the correct data that has been read by DMA into DDR ram. So the trouble is that data seems to be read from the cache, not from DDR.

So I tried this:

// *****
if( _ftoi( psNextMessageHeader->fSOF ) != FPGA_FIFO_SOF && (float*) psNextMessageHeader < pfEnd )
{
    Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE, Cache_Type_ALL, TRUE);  // <=== Breakpoint here
    System_abort("Header doesn't start with SOF");
}

I set a breakpoint at Cache_inv. This way, I can see when the header doesn't start with SOF but with 0xBD. The buffer dump looks like the screenshot in the first post.

Now I step over Cache_inv and tadaaa, the dump looks correct and even psNextMessageHeader->fSOF in the CCS variables view shows the correct SOF. So if I put Cache_inv before the 'if' ( // ***** in the code above) all should be fine and the bug fixed, right? And I'd even avoid invalidating big chunks of memory like you mentioned above (U32_FPGA_MESSAGE_SIZE is 276 bytes)

It doesn't work :( If I add the Cache_inv before the 'if', nothing changes. It looks like it is ignored:

Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE, Cache_Type_ALL, TRUE);
if( _ftoi( psNextMessageHeader->fSOF ) != FPGA_FIFO_SOF &&
    (float*) psNextMessageHeader < pfEnd )
{
    Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE, Cache_Type_ALL, TRUE);
    System_abort("Header doesn't start with SOF");
}

Sorry for the long post. Can you help me with the additional/new information?

Many thanks,

Markus

0 Matthijs van Duin over 11 years ago in reply to Markus Grunwald

Mastermind 8040 points

The plot thickens...

Markus Grunwald said:
Can you help me with the additional/new information?

I still intend to do some tests myself, but I've been occupied with other things and haven't had the time yet. Meanwhile there are still various suggestions mentioned in earlier posts in this thread of things to try to help locate where/when exactly this issue arises. Especially my last one might be worth a try: putting the buffer in on-chip RAM instead of external RAM would help figure out whether the issue somehow relates to EMIF/DMM or whether it is purely due to the DSP cache and/or interconnect. If it is available, the HDVICP SL2 ram (256 KB @ 0x59000000) would probably be the best choice since it is relatively large and afaik connects to the same switch within the interconnect as DMM does.

0 Matthijs van Duin over 11 years ago in reply to Badri Narayanan

Mastermind 8040 points

BTW, Badri Narayanan,

Regarding my inquiry about which (if any) of the cache problems the C64x+/C674x GEM has (or at least, had) might still apply to the GEM instance in the DM814x: it would be most helpful if you could forward that question to someone who can make an authoritative statement on that, if you haven't already. Hopefully the answer is "none" (and in general I'd hope that any applicable known errata of the GEM instance would be copied into the DM814x errata document) but I think it would be a good idea to verify this, especially since the revision field of the megamodule is zero and cache issues seem to be present in most of the DSPs (including the newer C6600 series).

0 Matthijs van Duin over 11 years ago in reply to Markus Grunwald

Mastermind 8040 points

You mentioned this only happens in release builds... maybe it's just because of tighter timing of memory requests, but I'm also starting to wonder what exactly the optimizer is doing to your code. I'd presume that SYS-BIOS somehow makes sure a Cache_inv call is regarded as a compiler barrier and absolutely never moves any memory access across such a call, but maybe I'm presuming too much there...

Can you try setting the --aliased_variables option of the compiler? (and making sure --no_bad_aliases is not set)

Also, if program-level optimization is enabled, try disabling it.

(if it were ARM code I'd maybe also suggest disassembling the code to inspect it, but on C6000 DSPs that's not a very attractive option)

0 Badri Narayanan over 11 years ago in reply to Matthijs van Duin

TI__Guru 59700 points

Is your SOF 0xFFFF4321 or 0xFFFF1234? In the CCS memory window I see only the 0xFFFF4321 pattern. The address where 0xFFFF4321 occurs is not 128-byte aligned, and neither is the length of 276 a multiple of 128 bytes. From the memory window it looks like one additional cache line is not invalidated. Do you have prefetch enabled (PFX)? Can you try invalidating 3 additional cache lines? Also, if you are never writing to the destination buffer from the CPU, never use Cache_wbInv. Always use Cache_inv.

Matthijs, I will check about the GEM errata on 814x. I don't know yet when I will get a response from the design team.

0 Markus Grunwald over 11 years ago in reply to Matthijs van Duin

Expert 2060 points

Hello Matthijs,

I tried the compiler switches that you mentioned (especially --aliased_variables looked promising), but it didn't help. We were suspecting that the optimizer reorders something, too, so we had a look at the ASM output. I'm no C6747 ASM expert, but I think the code is not reordered. At least not in a way that could cause this behaviour:

1008              Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE, Cache_Type_ALL, TRUE);
9a02beec:   00872C10            B.S1          ti_sysbios_hal_Cache_inv__E (PC+276832 = 0x9a06f840)
9a02bef0:   02B2100A            EXTU.S2       B12,16,16,B5
9a02bef4:   8506                MV.L1         A10,A4
9a02bef6:   D2C6                MV.L1X        B5,A6
9a02bef8:   01890162            ADDKPC.S2     $C$RL40 (PC+36 = 0x9a02bf04),B3,0
9a02befc:   E4200000            .fphead       n, l, W, BU, nobr, nosat, 0100001
9a02bf00:   8507                MV.L2         B10,B4
9a02bf02:   2313     ||         MVK.S2        1,B6
1009              if( _ftoi( psNextMessageHeader->fSOF ) != FPGA_FIFO_SOF &&
9a02bf04:   01A80334            LDNW.D1T1     *+A10[0],A3
9a02bf08:   00314BF8            CMPLTU.L1     A10,A12,A0
9a02bf0c:   0626                MVK.L1        0,A4
9a02bf0e:   2A66         [ A0]  MVK.L1        1,A4
9a02bf10:   00000000            NOP           
9a02bf14:   01AC6A78            CMPEQ.L1      A3,A11,A3
9a02bf18:   F9E6                XOR.L1        A3,1,A3
9a02bf1a:   8588                AND.L1        A4,A3,A0
9a02bf1c:   E9200001            .fphead       n, l, W, BU, nobr, nosat, 1001001
9a02bf20:   D023A120     [!A0]  BNOP.S1       0x9A02BF46 (PC+70 = 0x9a02bf46),5
1012                  Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE, Cache_Type_ALL, TRUE);
9a02bf24:   00872410            B.S1          ti_sysbios_hal_Cache_inv__E (PC+276768 = 0x9a06f840)

The load and compare instructions at 0x9a02bf04 ff are obviously after the branch to ti_sysbios_hal_Cache_inv__E ...

Now I'll at last try to write to internal ram, as you suggested - after I figured out how ;)

0 Matthijs van Duin over 11 years ago in reply to Badri Narayanan

Mastermind 8040 points

Badri Narayanan said:
Do you have prefetch enabled (PFX)?

Prefetch? on a c674x? I thought that was introduced with c66x? (I find no mentions of "prefetch" or "pfx" in the c674x megamodule reference)

Good catch on the misalignment of the individual messages, that indeed means the start address should be rounded down and end address rounded up to the nearest cache line boundary.
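A small sketch of that rounding, assuming a power-of-two line size (128 bytes for L2 in this discussion). `round_to_lines` and `struct line_range` are hypothetical names; the result is what one would pass to a range operation if the software had to do the quantization itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct line_range {
    uintptr_t start;  /* start address rounded down to a line boundary */
    size_t    bytes;  /* length so that the end is rounded up to a boundary */
};

/* Expand [addr, addr + bytes) outward to whole cache lines. */
static struct line_range round_to_lines(uintptr_t addr, size_t bytes, size_t line)
{
    struct line_range r;

    assert(line != 0u && (line & (line - 1u)) == 0u);  /* power of two */
    const uintptr_t mask = ~(uintptr_t)(line - 1u);
    const uintptr_t end_up = (addr + bytes + (line - 1u)) & mask;

    r.start = addr & mask;
    r.bytes = (size_t)(end_up - r.start);
    return r;
}
```

For a 276-byte message that starts 0x114 bytes into a line-aligned buffer, this yields a 384-byte range starting at offset 0x100, that is, three full 128-byte lines.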

Badri Narayanan said:
Matthijs, I will check about the GEM errata on 814x. I don't know yet when I will get a response from the design team.

Thank you. I understand these things will take time.

0 Markus Grunwald over 11 years ago in reply to Badri Narayanan

Expert 2060 points

Hello Badri,

Sorry, I messed that up. SOF is 0xFFFF4321.

Your observation that neither the SOF locations nor the message length are multiples of 128 is correct, but I think this doesn't matter: the EDMA transfer doesn't use a message's address or length; it writes to g_au32FPGAData with length u32ToRead (see the code in my first post). I calculate u32ToRead like this:

const UInt32 EDMA3_CACHE_LINE_SIZE_IN_FLOATS = EDMA3_CACHE_LINE_SIZE_IN_BYTES / sizeof(float);

const UInt32 u32ToRead    = u32FifoLevel /
      EDMA3_CACHE_LINE_SIZE_IN_FLOATS * EDMA3_CACHE_LINE_SIZE_IN_FLOATS;

edmaTransfer works in units of floats, that's why I do the conversion.

Or did I miss something there?
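As a worked example of that calculation, here is the same integer arithmetic in isolation, assuming 4-byte floats and the 128-byte line size used above (so 32 floats per line). `round_down_to_line_floats` is a hypothetical name for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES  128u
#define LINE_FLOATS (LINE_BYTES / (uint32_t)sizeof(float))  /* 32 floats */

static_assert(sizeof(float) == 4u, "sketch assumes 4-byte floats");

/* Round a FIFO level (in floats) down to a whole number of cache lines. */
static uint32_t round_down_to_line_floats(uint32_t fifo_level)
{
    return fifo_level / LINE_FLOATS * LINE_FLOATS;
}
```

A FIFO level of 3000 floats gives a transfer of 2976 floats (93 lines), leaving 24 floats in the FIFO for the next round; a level below 32 gives a zero-length transfer.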

Do you have prefetch enabled (PFX)?

How can I check this?

Can you try invalidating 3 additional cache lines?

I tried this, made no difference:

        Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE + 3* EDMA3_CACHE_LINE_SIZE_IN_BYTES, Cache_Type_ALL, TRUE);
        if( _ftoi( psNextMessageHeader->fSOF ) != FPGA_FIFO_SOF &&
            (float*) psNextMessageHeader < pfEnd )
        {
            Cache_inv((Ptr) psNextMessageHeader, U32_FPGA_MESSAGE_SIZE + 3* EDMA3_CACHE_LINE_SIZE_IN_BYTES, Cache_Type_ALL, TRUE);
            System_abort("...");
        }

0 Markus Grunwald over 11 years ago in reply to Matthijs van Duin

Expert 2060 points

Matthijs van Duin said:
Good catch on the misalignment of the individual messages, that indeed means the start address should be rounded down and end address rounded up to the nearest cache line boundary.

Could you elaborate more on this? Because I don't understand how the alignment of the individual messages influences it. In the end, I could configure my FPGA to output only 0xAA55AA55 (without any message structure) and the result would be the same, wouldn't it?

Or does the address-parameter of Cache_inv() have to be aligned somehow? The docs don't mention any restriction...

Documentation said:
Cache_inv()
Invalidate the range of memory within the specified starting address and byte count. The range of addresses operated on gets quantized to whole cache lines in each cache. All lines in range are invalidated for all the 'type' caches

0 Matthijs van Duin over 11 years ago in reply to Markus Grunwald

Mastermind 8040 points

Markus Grunwald said:
We were suspecting that the optimizer reorders something, too, so we had a look at the ASM output. I'm no C6747 ASM expert, but I think the code is not reordered.

It appears you're using --c_src_interlist ? This option actually partially counteracts the optimizer by preventing it from reordering instructions across C statement boundaries. I personally use --symdebug:dwarf --optimize_with_debug --debug_software_pipeline --src_interlist (the lack of c_ is important) to get the maximum amount of debug info without (as far as I can tell) affecting the optimizer. It also generates an assembly file with various explanatory comments from the optimizer.

Markus Grunwald said:
The load and compare instructions at 0x9a02bf04 ff are obviously after the branch to ti_sysbios_hal_Cache_inv__E

They are after the branch, but not obviously so. The remaining assembly instructions between the branch opcode and the "1009" source listing line are all executed before the branch. The ADDKPC for example sets up the return address for the call, and the remaining ops put the arguments to the call in the appropriate registers.

See here for the pipeline diagram up to the moment of entry into Cache_inv. Vertical axis is code as fetched from memory, horizontal axis is time in cycles. Fetch-packet boundaries indicated by solid purple lines, execute-packet boundaries by dashed blue lines. All instructions in this fragment are single-cycle (excluding fetch/decode stages).

0 Markus Grunwald over 11 years ago in reply to Matthijs van Duin

Expert 2060 points

Matthijs van Duin said:
Especially my last one might be worth a try: putting the buffer in on-chip RAM instead of external RAM would help figure out whether the issue somehow relates to EMIF/DMM or whether it is purely due to the DSP cache and/or interconnect. If it is available, the HDVICP SL2 ram (256 KB @ 0x59000000) would probably be the best choice

I tried this the quick and dirty way:

    g_au32FPGAData = (uint32_t*) 0x59000000;
    Cache_setMar( g_au32FPGAData, sizeof( uint32_t ) * U_FIFO_BUFFER_SIZE, Cache_Mar_ENABLE );

Results are the same :(

Matthijs van Duin said:

Other things you can try to help narrow down the issue: what if only L1D or only L2 cache is enabled?

I tried this:

Cache_disable(Cache_Type_ALLD);
Cache_enable(Cache_Type_L1D);

and this:

Cache_disable(Cache_Type_ALLD);
Cache_enable(Cache_Type_L2);

Strangely enough, this didn't change anything ... ?

0 Matthijs van Duin over 11 years ago in reply to Markus Grunwald

Mastermind 8040 points

Markus Grunwald said:

g_au32FPGAData = (uint32_t*) 0x59000000;
    Cache_setMar( g_au32FPGAData, sizeof( uint32_t ) * U_FIFO_BUFFER_SIZE, Cache_Mar_ENABLE );

Results are the same :(

That's relatively good news: it means that the DMM and EMIF are not somehow involved, which significantly reduces the search space for the problem.

I may have time later today to do some testing, and also inspect a bit what SYS-BIOS is doing in those calls. Which version are you using?

0 Badri Narayanan over 11 years ago in reply to Markus Grunwald

TI__Guru 59700 points

Below is the code used by Cache_inv:

Void Cache_block(Ptr blockPtr, SizeT byteCnt, Bool wait,
    volatile UInt32 *barReg)
{
    volatile UInt32 *bar;
    volatile UInt32 *wc;
    Int wordCnt;
    UInt mask;
    UInt32 alignAddr;

    /*
     *  Get the base address and word count register.
     *  wc is one word after bar on c64x+ cache.
     */
    bar = barReg;
    wc = bar + 1;

    /* word align the base address */
    alignAddr = ((UInt32)blockPtr & ~3);

    /* convert from byte to word since cache operation takes words */
    wordCnt = (byteCnt + 3 + ((UInt32)blockPtr - alignAddr)) >> 2;

    /* loop until word count is zero or less */
    while (wordCnt > 0) {

        /* critical section -- disable interrupts */
        mask = Hwi_disable();

        /* wait for any previous cache operation to complete */
        while (*L2WWC != 0) {
            /* open a window for interrupts */
            Hwi_restore(mask);

            /* disable interrupts */
            mask = Hwi_disable();
        }

        /* get the emif config for the address */
        Cache_module->emifAddr = getEmifCtrlAddr(alignAddr);

        /* set the word address and number of words to invalidate */
        *bar = alignAddr;
        *wc = (wordCnt > MAXWC) ? MAXWC : wordCnt;

        /* end of critical section -- restore interrupts */
        Hwi_restore(mask);

        /*
         * reduce word count by _BCACHE_MAXWC and
         * increase base address by BCACHE_MAXWC
         */
        wordCnt -= MAXWC;
        alignAddr += (MAXWC * sizeof(Int));
    }

    /* wait for cache operation to complete */
    if (wait) {
        Cache_wait();
    }
}

As you can see, it doesn't align the address to a cache line. It just aligns it to 32 bits.
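To make the arithmetic in the listing concrete, the sketch below repeats only the two lines that compute the word-aligned base address and the word count, using fixed-width types so it runs on a host. `cache_block_range` and `struct word_range` are hypothetical names; nothing here touches real cache registers.

```c
#include <assert.h>
#include <stdint.h>

struct word_range {
    uint32_t align_addr;  /* block address rounded down to a 4-byte boundary */
    int32_t  word_cnt;    /* number of 32-bit words covering the request */
};

/* Same computation as alignAddr / wordCnt in the listing above. */
static struct word_range cache_block_range(uint32_t block_addr, uint32_t byte_cnt)
{
    struct word_range r;

    assert(byte_cnt <= UINT32_MAX - 6u);  /* keep the sum from wrapping */
    r.align_addr = block_addr & ~3u;
    r.word_cnt = (int32_t)((byte_cnt + 3u + (block_addr - r.align_addr)) >> 2);
    return r;
}
```

For a 276-byte message at 0x80000116, this gives a base of 0x80000114 and 70 words (280 bytes): the request is widened to whole words, not to whole 128-byte lines.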

Regarding prefetch, I was wrong. As Matthijs mentioned, it is available only on C66x. I was looking at the wrong cache control file in SYS/BIOS.

Please share the portion of your code after the EDMA transfer where you are looping through the individual messages. It is not present in the initial code you shared.

I suggest you modify your code flow as below for test:

1. Cache_inv address range of EDMA dest address. Check address and length are 128 byte aligned.

2. Do the EDMA transfer. Remove all cache control code from the EDMA transfer function.

3. Do Cache_freeze so that you don't get anything into L2/L1D cache. Cache_setMode API can be used.

4. When looping through the messages, confirm from the CCS memory window that the contents are not in cache.

Also confirm that your functionality is still broken with the CCS memory window closed. I am not sure whether viewing the memory via the CPU view causes a cache fetch. I don't see that as a possibility, but it helps to confirm by doing the experiment with the CCS memory window closed.

Another test you can do: if it is always the same address that is not evicted from cache, you can put a hardware watchpoint in CCS on that address. If the CPU accesses that address, it should break on the read.

Also, as a test, disable interrupts after the EDMA transfer completes so that you are sure the code is not getting preempted.

0 Matthijs van Duin over 11 years ago in reply to Markus Grunwald

Mastermind 8040 points

Markus Grunwald said:

I set a breakpoint at Cache_inv. This way, I can see when the header doesn't start with SOF but with 0xBD. The buffer dump looks like the screenshot in the first post.

Now I step over Cache_inv and tadaaa, the dump looks correct and even psNextMessageHeader->fSOF in the CCS variables view shows the correct SOF. So if I put Cache_inv before the 'if' all should be fine and the bug fixed, right?

It doesn't work :( If I add the Cache_inv before the 'if', nothing changes.

I just had another thought: all behaviour described so far would make perfect sense if in fact the EDMA transfer hasn't finished yet! If for some reason waiting for EDMA completion (occasionally) returns prematurely, then once your code manages to catch up with the EDMA transfer it will read BDBDBDBD which then ends up in cache. If you set a breakpoint and then perform cache_inv or inspect RAM directly it will reveal the correct data, since EDMA has of course finished by then!

This happens only with release builds and only if caching is enabled.

... since otherwise the code runs too slow to catch up with the transfer.

I recognized this behavior only after I started to transfer data from my fifo in "large" (~4K floats) chunks instead of small ones (~ one message per transfer)

... since even if the CPU processes data faster than EDMA can fetch it from the FIFO, EDMA will have a head start so the buffer needs to be big enough to give the CPU a chance to catch up

Not sure what the easiest way is to check this theory, nor why your code would sometimes fail to wait for EDMA transfer, but it does seem to fit all symptoms observed so far.
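The theory can be illustrated with a toy model: one word of "DDR" behind a one-entry cache, with the CPU reading through the cache and the DMA writing straight to memory. This is only a host-side illustration under those simplifying assumptions; `struct toy_mem`, `cpu_read`, `dma_write`, and `cache_inv` are hypothetical names and do not model the real C674x cache hierarchy or EDMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_mem {
    uint32_t ddr;    /* what is really in memory (debugger view, caches off) */
    uint32_t line;   /* copy held by the cache */
    bool     valid;  /* true if the cache holds a copy */
};

/* CPU read: allocates the line on a miss, then always returns the copy. */
static uint32_t cpu_read(struct toy_mem *m)
{
    assert(m != NULL);
    if (!m->valid) {
        m->line = m->ddr;
        m->valid = true;
    }
    return m->line;
}

/* DMA write: goes to memory only; the cache is not updated. */
static void dma_write(struct toy_mem *m, uint32_t value)
{
    assert(m != NULL);
    m->ddr = value;
}

/* Invalidate: drop the cached copy so the next read goes to memory. */
static void cache_inv(struct toy_mem *m)
{
    assert(m != NULL);
    m->valid = false;
}
```

If the CPU reads the word after the invalidate but before the DMA write lands, the old 0xBD pattern is cached and keeps being returned, while memory itself already holds the new data. A later invalidate (for example at a breakpoint) makes the correct value appear, which matches what was observed.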

0 Markus Grunwald over 11 years ago in reply to Badri Narayanan

Expert 2060 points

Hello Badri,

thanks for your help: With a few modifications, I got my project back to life :) I think this was the catch:

Badri Narayanan said:
1. Cache_inv address range of EDMA dest address. Check address and length are 128 byte aligned.

But if this seems to be necessary for a working cache, it should be documented better. The documentation mentions no restrictions on address and byte count:

Docs said:
Invalidate the range of memory within the specified starting address and byte count. The range of addresses operated on gets quantized to whole cache lines in each cache.

0 Markus Grunwald over 11 years ago in reply to Matthijs van Duin

Expert 2060 points

Hello Matthijs,

Matthijs van Duin said:
all behaviour described so far would make perfect sense if in fact the EDMA transfer hasn't finished yet!

You are right and we had the same thought, too. In one of your first answers you already mentioned that 'your "wait for transfer to finish" loop looks a bit dubious'. I checked that immediately and the timeout-value was big enough so that the loop did not end too soon. Nevertheless, I immediately changed it to use TSC.

Sorry, I should have mentioned that.

Thanks for your help!

0 Matthijs van Duin over 11 years ago in reply to Markus Grunwald

Mastermind 8040 points

Markus Grunwald said:
The documentation mentions no restrictions on address and byte count

The C674x megamodule documentation in fact explicitly states "all cache lines that overlap the range specified are acted upon."

Also, per-message invalidation should not be necessary: a single Cache_wbInv between the memset() and starting the EDMA transfer should suffice (or once the memset is no longer needed for debugging and removed, just a Cache_inv before the EDMA transfer). In particular, your earlier versions with invalidate-all, while needlessly inefficient, should have worked also. So to be honest I have doubts whether you found "the catch" or merely perturbed a race condition temporarily out of the way...

Should the problem resurface, I can provide a small bit of code that enforces the handover of the buffer between EDMA and the DSP by configuring the appropriate L3 firewall to allow only EDMA to access the buffer during the transfer, and allow DSP (but not EDMA) access when the DSP thinks the transfer is done. That would ensure that if the DSP prematurely accesses the buffer it gets an error, and if it reclaims ownership before the transfer is complete, EDMA would get an error.

0 Markus Grunwald over 11 years ago in reply to Matthijs van Duin

Expert 2060 points

Thank you Matthijs,

if some similar trouble returns, I'll certainly come back to you!
