Cache considerations when performing a task switch?

Other Parts Discussed in Thread: SYSBIOS

I've been running the following experiment: using a C6a8168 EVM, with Linux running on the ARM and BIOS 6 running on the DSP, I have a program with two simultaneous threads on the ARM, each of which communicates with the DSP via Syslink. For each ARM thread there is a corresponding task of equal priority on the DSP. Consider the following situations:

1.  If thread 1 on the ARM runs followed by thread 2 on the ARM, implying that the corresponding DSP tasks run sequentially, everything works as expected: images are set up and processed in shared memory.

2.  If, however, thread 1 and thread 2 on the ARM run simultaneously, implying that sub-processes of DSP tasks 1 and 2 become intermixed, corruption is observed in the images in shared memory. I suspect the cause is non-coherence of the DSP cache. I have attempted to get this to work by protecting the DSP task activities with a GateMutex, but this is only partially successful; the corruption of shared memory persists. Is there a way to get this second case to work? The DSP sub-tasks do such things as DMA between L1D and DDR.

The Real-time Operating System v6 User's Guide says: "Entire context saved to the task stack" upon task preemption.  I assume that this includes such things as the PC, stack pointer, registers, and so forth, but what about the state of the cache?

I've been putting ping-pong buffers in L1D with DMA between L1D and DDR, and I'm wondering whether it might be better to put the ping-pong buffers in L2 and keep the L1D cache at full size.
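
For concreteness, the kind of placement I mean looks roughly like this; the section name, alignment, and buffer size are only illustrative, and the section would still have to be mapped to the on-chip L2 RAM segment in the .cfg file:

#define BUF_SIZE 4096                        /* illustrative transfer size */

/* Put the ping-pong pair in their own section and align each buffer to a
 * 128-byte cache line so cache maintenance on one buffer never touches the
 * other.  The ".pingpong" name is arbitrary; in the .cfg it would be placed
 * with something like
 *     Program.sectMap[".pingpong"] = "L2SRAM";
 * where the segment name depends on the platform definition. */
#pragma DATA_SECTION(pingBuf, ".pingpong")
#pragma DATA_ALIGN(pingBuf, 128)
unsigned char pingBuf[BUF_SIZE];

#pragma DATA_SECTION(pongBuf, ".pingpong")
#pragma DATA_ALIGN(pongBuf, 128)
unsigned char pongBuf[BUF_SIZE];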

Lee Holeva

 

  • Lee,

    Not fully understanding what resources your app uses and what each Task does, I will try to give you my best suggestions for what you could try.

    1.  Any shared resources between DSP Tasks should be gated by a GateMutex, as you have already done.  I assume there's no Hwi or Swi doing any work, right?  Otherwise, you might need to use a GateHwi or GateSwi.

    2.  Cache coherence for local memory (for example, between L1D and DDR) needs to be maintained by the app.  SYSBIOS provides Cache APIs.  Any shared memory used by IPC/SYSLINK should already be kept coherent, but that only covers memory shared between the DSP and the ARM.

    3.  If there are resources shared between the different processors, you might need a GateMP (Multi-Processor Gate) to prevent concurrent processor access to the same shared memory; a minimal sketch follows this list.

    4.  If you suspect the DSP cache is causing the problem, you might want to try disabling the cache by setting it up as all SRAM (both L1D and L2).  Of course, this means anything in DDR would be very slow, so that might change the behaviour of your program and might not be a valid test.
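
    For point 3, a minimal sketch of what the GateMP usage could look like; the gate name "imgGate" and the retry loop are just illustrative, and one side creates the gate while the other opens it by name:

    #include <xdc/std.h>
    #include <ti/ipc/GateMP.h>

    /* Creator side (either processor): create a named multi-processor gate. */
    GateMP_Handle createSharedGate(Void)
    {
        GateMP_Params params;

        GateMP_Params_init(&params);
        params.name = "imgGate";              /* placeholder name */
        return GateMP_create(&params);
    }

    /* Other side: open the gate by name, then bracket every access to the
     * shared memory with enter/leave. */
    Void touchSharedMemory(Void)
    {
        GateMP_Handle gate;
        IArg          key;
        Int           status;

        do {                                  /* wait until the creator side is up */
            status = GateMP_open("imgGate", &gate);
        } while (status == GateMP_E_NOTFOUND);

        key = GateMP_enter(gate);
        /* ... read/write the shared memory here ... */
        GateMP_leave(gate, key);
    }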

    Judah

  • judahvang said:
    If you suspect the DSP cache is causing the problem, you might want to try disabling the cache by setting it up as all SRAM (both L1D and L2).  Of course, this means anything in DDR would be very slow, so that might change the behaviour of your program and might not be a valid test.

    The shared memory data corruption is definitely due to a cache coherency problem.  I disabled caching for the shared region by:

    var Cache = xdc.useModule('ti.sysbios.family.c64p.Cache');

    Cache.MAR128_159 = 0x0000ffff; /* disables caching for 0x90000000->0x9fffffff */
    Cache.MAR160_191 = 0x00000000; /* 0xa0000000->0xbfffffff also non-cacheable */

    and the data corruption went away. I have been reading the C674 Cache User's Guide, but I'm still not sure how to set this up.  Consider the following:

    Ping-pong buffers in L2, with DMA between these buffers and DDR.  The DMA is protected with a GateMutex.  According to the example given in section 2.4.2.1, I shouldn't have to do anything for this case, yet I see corruption in the shared memory that goes away if caching is disabled.

    Another example is processing an image from DDR to DDR without DMA.  Again, with the cache disabled this works fine.  I have tried invalidating destination rows and writing back source rows before they are used, but to no effect.

    Update:

    I seem to be having a problem with the GateMutex.  Processing started with task1 entering the gate, followed by a switch to task2, and task2 entered the gate; it should have been blocked.

    Lee

  • Lee,

    You said the buffers are ping-pong, so does this mean the same buffers could potentially be used by either task on the DSP, correct?

    Could you explain more about how the buffers are passed between processors?  Are you using MessageQ to pass the buffer from ARM to DSP?  If you use MessageQ, when passing the buffer between DSP and ARM, MessageQ will do a cache writeback-invalidate on the message to keep it coherent.  If you are not using MessageQ, you need to do this yourself.

    Let's look at the DDR-to-DDR without DMA case, since that should be easier to explain.  When the DSP receives the buffer from the ARM, it should do a cache invalidate on the buffer, just in case the buffer happened to be in cache already.  Once it's done processing the buffer, it needs to do a cache writeback to get the data out to external memory.  If the ARM has a cache too, then it needs to do the same thing.
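
    In code it would look something like the sketch below; the message layout, queue handles, and names are hypothetical, and the point is the invalidate-before-read / writeback-after-write pairing around the DDR buffer:

    #include <xdc/std.h>
    #include <ti/ipc/MessageQ.h>
    #include <ti/sysbios/hal/Cache.h>

    /* Hypothetical message: MessageQ header plus a descriptor for an image
     * buffer that lives in shared DDR. */
    typedef struct {
        MessageQ_MsgHeader header;
        Ptr                imgBuf;
        UInt32             imgSize;
    } ImgMsg;

    Void processOneImage(MessageQ_Handle dspQueue, MessageQ_QueueId hostQueue)
    {
        ImgMsg *msg;

        if (MessageQ_get(dspQueue, (MessageQ_Msg *)&msg,
                         MessageQ_FOREVER) != MessageQ_S_SUCCESS) {
            return;
        }

        /* Invalidate first: discard any stale lines the DSP caches may hold
         * for this region so the reads really come from DDR. */
        Cache_inv(msg->imgBuf, msg->imgSize, Cache_Type_ALL, TRUE);

        /* ... process the image in place (DDR to DDR) ... */

        /* Write back: push the results out of L1D/L2 into DDR so the ARM
         * sees them.  MessageQ keeps the message header itself coherent. */
        Cache_wb(msg->imgBuf, msg->imgSize, Cache_Type_ALL, TRUE);

        /* Reply to the host. */
        MessageQ_put(hostQueue, (MessageQ_Msg)msg);
    }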

    Judah

  • judahvang said:

    Could you explain more about how the buffers are passed between processors?  Are you using MessageQ to pass the buffer from ARM to DSP?  If you use MessageQ, when passing the buffer between DSP and ARM, MessageQ will do a cache writeback-invalidate on the message to keep it coherent.  If you are not using MessageQ, you need to do this yourself.

    Yes, I am using MessageQ, but not to pass buffers.  The DMA buffers are for QDMA transfers between DDR and L2, just on the DSP.

    The problem appears to stem from GateMutex not blocking.  This is what I see:

    1.  task1 enters the Gate

    2.  MessageQ_get unblocks on task2 and there is a task switch

    3.  task2 enters the Gate <-- this should not have happened; instead it should have blocked waiting for task1 to leave (why is this?)

    4.  cache becomes corrupted

    Lee

     

  • Lee,

    If you are familiar with ROV, you can use that to help debug.  GateMutex is based on a semaphore, so in your scenario Task2 should block trying to enter the gate until Task1 leaves the gate; at that point, Task1 posts the semaphore to unblock Task2.  I don't have an explanation for why Task2 is not blocking there.
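
    For reference, the enter/leave pairing would look roughly like this minimal sketch (gate created once at startup; the key returned by enter has to be handed back to leave):

    #include <xdc/std.h>
    #include <ti/sysbios/gates/GateMutex.h>

    GateMutex_Handle gate;              /* created once, e.g. from main() */

    Void initGate(Void)
    {
        gate = GateMutex_create(NULL, NULL);
    }

    Void criticalWork(Void)
    {
        IArg key;

        /* Blocks here if another Task already holds the gate; the owning
         * Task can re-enter without blocking (enter/leave calls nest). */
        key = GateMutex_enter(gate);

        /* ... touch the shared buffers / program the DMA ... */

        GateMutex_leave(gate, key);
    }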

    Judah

  • judahvang said:
    GateMutex is based on a semaphore, so in your scenario Task2 should block trying to enter the gate until Task1 leaves the gate; at that point, Task1 posts the semaphore to unblock Task2.

    The data corruption is not due to the GateMutex; the GateMutex is working as expected.

    Lee

     

  • Lee Holeva said:

     Yes, I am using MessageQ, but not to pass buffers.  The DMA buffers are for QDMA transfers between DDR and L2, just on the DSP.

    The problem appears to stem from GateMutex not blocking.  This is what I see:

    1.  task1 enters the Gate

    2.  MessageQ_get unblocks on task2 and there is a task switch

    3.  task2 enters the Gate <-- this should not have happened; instead it should have blocked waiting for task1 to leave (why is this?)

    4.  cache becomes corrupted

    Lee

    Okay, I'm confused.  So are you saying task2 is correctly blocked waiting for Task1 in step 3?  You state that GateMutex is not blocking...which would be a problem.

  • judahvang said:

    Okay, I'm confused.  So are you saying task2 is correctly blocked waiting for Task1 in step 3?  You state that GateMutex is not blocking...which would be a problem.

    A lot of confusion.  This is what I presently know:

    1.  I had thought that GateMutex wasn't blocking, but it is.  I simply relocated the GateMutex_leave() to after the DSP's MessageQ_put() that sends the reply to the host, and confirmed with ROV that the Gate is indeed blocking as it should (the loop is sketched after this list).  It is not the problem.

    2.  I have been experimenting with combinations of Cache_inv() and Cache_wb() for the non-DMA functions, and I think I have that part figured out.  The functions that use DMA are reading/writing from/to ping-pong buffers in L2 memory, and according to the C674 Cache User's Guide I shouldn't have to worry about caching in this case.

    3.  The host has two threads.  If task1 runs followed by task2 (case 1), everything works as it should with caching turned on, and there is no corruption of shared memory.

    4.  If the host's two threads run simultaneously (case 2), the corruption of shared memory occurs, though if I turn off caching it goes away.

    5.  Using CCS, I have confirmed that the pointers to the shared images point to different data sets between the two cases, and hence to different output images.  The puzzle here is that between the two cases the DSP code is identical; all that differs is the ordering of the tasks.  Perhaps this is an issue with Syslink?
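
    For point 1, the worker loop now looks roughly like the sketch below; the names are placeholders for what the real code uses, and the point is just the ordering: the reply MessageQ_put() happens before GateMutex_leave().

    #include <xdc/std.h>
    #include <ti/ipc/MessageQ.h>
    #include <ti/sysbios/gates/GateMutex.h>

    extern GateMutex_Handle imgGate;    /* gate shared by task1 and task2 */

    Void workerIteration(MessageQ_Handle dspQueue, MessageQ_QueueId hostQueue)
    {
        MessageQ_Msg msg;
        IArg         key;

        if (MessageQ_get(dspQueue, &msg, MessageQ_FOREVER) == MessageQ_S_SUCCESS) {
            key = GateMutex_enter(imgGate);   /* serialize task1 and task2 */

            /* ... Cache_inv()/Cache_wb() as needed, the QDMA transfers, and
             *     the image processing itself go here ... */

            MessageQ_put(hostQueue, msg);     /* reply to the host first */
            GateMutex_leave(imgGate, key);    /* ...then release the gate */
        }
    }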

    Lee