Question about SRIO communication and descriptors in external memory

Other Parts Discussed in Thread: SYSBIOS

Hello,

I have an application running on TI C6678 and I am trying to add SRIO communication to this application.

On the C6678 EVM, I built and executed the SRIO projects provided by TI's Platform Development Kit:

  1. pdk_C6678_1_1_2_5\packages\ti\drv\srio\example\SRIOMulticoreLoopback
  2. pdk_C6678_1_1_2_5\packages\ti\drv\srio\test\tput_benchmarking
  3. pdk_C6678_1_1_2_5\packages\ti\drv\srio\test\Loopback. Details for this project: (i) the SRIO driver uses Application Managed Configuration, (ii) the SRIO driver operates in Polling Mode, and (iii) communication uses Raw, Type 11, Non-Blocking sockets.

All tests executed correctly.

 

Since the Loopback project is very close to my application's needs, I applied one change required by my application: I explicitly placed the buffer descriptors in DDR3 external memory (note that, according to the map file of the original project, the buffer descriptors are located in L2 memory by default).
With this change, the communication test fails.

Initially, I didn't suspect a cache coherency problem because:
- caching for external memory should be disabled by default
- neither the original CFG file nor the original C files seem to enable caching for external memory
However, reading MAR register 128, I found that caching was in fact enabled for external memory.

Although I don't understand why caching was enabled, I tried to deal with DDR3 cache coherency:
- Option 1: I added CSL cache consistency calls CACHE_wbL2() after SRIO send and CACHE_invL2() before SRIO receive. It didn't work.
- Option 2: I added CSL cache consistency calls CACHE_wbL2() after SRIO send and CACHE_wbInvL2() before SRIO receive (I assumed that the descriptors used by the receive completion queue might require a write back). It didn't work.
- Option 3: I modified the CFG file to explicitly disable caching for the first DDR3 block: Cache.setMarMeta(0x80000000, 0x1000000, 0). With this change, the communication test executes correctly.

Questions:
1. My application requires that (i) the buffer descriptors are in DDR3 and (ii) DDR3 caching is enabled. As a consequence, Option 3 won't work. Is there a way to solve this problem using only CSL cache coherency calls, similar to Options 1 and 2?
2. Somewhat related: why does the original CFG file enable DDR3 caching? I noticed that the generated file srio_test_pe66e.c contains the entry ti_sysbios_family_c66_Cache_marvalues__C[128] = 0x0D (caching enabled).

Attached is the modified CFG file: srio_test.cfg

Thanks,
Sergiu 

  • Hi Sergiu,

    1) Are you saying that the application requires cacheable buffers, or could you put the descriptors in a non-cached dedicated region and enable caching on the rest of the DDR3 memory? I'm not sure you will see any benefit from having the descriptors in cached DDR space, because you will have to use cache wb and inv commands to maintain coherency anyway. You would need to do this for the descriptors as well as for the data buffers themselves if the data is written by the CPU into DDR3 memory. So when you pop a descriptor and it returns the descriptor address, you will need to do an inv command prior to reading the descriptor and getting the buffer pointer. Then when you write the descriptor, you will do a wb operation prior to pushing it to the transmit queue. For the data buffer, if the CPU is writing the data, I believe you would only have to do the wb function. The one thing I'm not clear on is whether you would have to do both L2 and L1 wb/inv functions.
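
    A rough sketch of that pattern (illustrative only -- rx_pop_desc, tx_push_desc and DESC_SIZE are names I made up, so check the exact CSL/QMSS prototypes in your PDK):

    #include <ti/csl/csl_cacheAux.h>
    #include <ti/drv/qmss/qmss_drv.h>

    #define DESC_SIZE 128  /* assumed: descriptor sized/padded to an L2 cache line */

    /* Receive side: pop the descriptor, invalidate, then read its fields. */
    void *rx_pop_desc(Qmss_QueueHnd rxQueue)
    {
        void *desc = Qmss_queuePop(rxQueue);
        if (desc != NULL)
            CACHE_invL2(desc, DESC_SIZE, CACHE_FENCE_WAIT);
        return desc;
    }

    /* Transmit side: fill in the descriptor, write back, then push. */
    void tx_push_desc(Qmss_QueueHnd txQueue, void *desc)
    {
        CACHE_wbL2(desc, DESC_SIZE, CACHE_FENCE_WAIT);
        Qmss_queuePushDesc(txQueue, desc);
    }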

    2) Not sure. I'm guessing that, since the example uses L2 instead of DDR3, it is just enabling the cacheability of DDR3 for performance reasons, in case someone uses the example as a starting point for an application that uses DDR for program needs other than the queue manager.

    Regards,

    Travis

  • Hi Travis,

    Thanks for your reply.

    I think we can ignore Item 2 and discuss Item 1 only. Here are the clarifications for Item 1:


    Our application uses DDR3 with caching enabled and we do take care of cache coherency (WB & INV) as needed. However, as you say, it's possible to put the descriptors in a dedicated region where caching is disabled; the problem with this approach is that the region has to be big (16 Mbytes), since each MAR bit controls cacheability at 16 MB granularity.
     
    I understand the complexity of dealing with cached memory when both the core and peripherals access that memory and, indeed, the benefits are debatable.

    To evaluate my options, I was just prototyping with the TI SRIO Loopback project. Basically, the only way to get a successful run when the descriptors are in DDR3 is to disable caching; keeping DDR3 caching enabled and using cache coherency calls does not work for me.

    Also, I think I found a similar, older posting: http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/205526.aspx . Is this a known fact?

    As for L1 vs. L2 coherency routines, I think the L2 calls should be good enough (according to SPRUGY8, the C66x DSP Cache User Guide, sections 1.4 and 2.4.3). However, I tried both flavours and there is no difference in behaviour.

    Thanks,
    Sergiu

  • Hi,

    Check the descriptor alignment and gaps. Every descriptor should be aligned to a cache line (128 bytes for L2); otherwise a wb/inv operation could corrupt/affect more than one descriptor.

    If you allocate as in the examples, this is not possible, since the descriptor array uses a size of 48 bytes per descriptor; so even if the first descriptor is aligned, the second one will not be.

    You can try to fix this (see the sketch below), or remap the DDR to an alias logical address and define it as non-cacheable (by means of the MPAX registers). In that case you have to manage the logical-to-physical address conversion where needed.
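
    For the alignment fix, something like this (just a sketch -- the names are mine, not from the PDK, and the QMSS memory region would then have to be configured with the padded 128-byte descriptor size):

    #include <stdint.h>

    #define L2_CACHE_LINE  128
    #define NUM_DESC       32   /* example value */
    #define DESC_SIZE      48   /* descriptor size used by the example */

    /* Pad every descriptor to a full cache line, so that a wb/inv on one
     * descriptor can never touch its neighbour. */
    typedef struct {
        uint8_t desc[DESC_SIZE];
        uint8_t pad[L2_CACHE_LINE - DESC_SIZE];
    } AlignedDesc;

    #pragma DATA_ALIGN(descRegion, 128)
    AlignedDesc descRegion[NUM_DESC];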

  • Sergiu Stambolian said:
    Is this a known fact?  

    Are you referring to descriptors in MSM or L2?  That is not a requirement, they can be in DDR.

    Excellent points on the cache boundary alignment!  Please make sure you have modified the example to incorporate this requirement.


    Regards,

    Travis

  • Hi Alberto, Travis,

    Thanks for your responses.

    1. I tried 128-byte descriptors aligned on 128-byte boundaries. The behaviour is the same: it fails on the Rx operation.

    2. As for using non-cacheable DDR for the descriptors, I already tried this solution (please see my initial posting, Option 3) and it works just fine. The problem with this solution is the price tag: disabling caching for a 16 MB segment. It's true that managing cache coherency has a cost as well.

    3. Regarding my question about http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/205526.aspx : while I was writing this discussion, I found that posting, where Shivang T. had a similar problem (SRIO descriptors in cacheable MSMCSRAM) and the advice from Bhavin Kharadi (TI employee) was to use non-cacheable memory (MSMCSRAM or DDR). Bhavin's last posting also states that "I will check on the driver implementation about cacheablity and get back to you". That follow-up never happened, and I was wondering whether this is a known limitation for SRIO descriptors in cacheable memory.

    Thanks,

    Sergiu

  • Sergiu Stambolian said:

    2. As far as using non-cacheable DDR for the descriptors, I already tried this solution (please see my initial posting, Option 3) and it works just fine. The problem with this solution is the price tag: disabling caching for a 16 Mbytes segment. It's true that managing cache coherency has a cost as well.

    That's why I suggest creating an address alias and disabling the cache only on that alias. You can still use all your DDR as cacheable.

    For instance, assuming the simplest scenario where you have at most 1 GB of DDR: you can create an alias that remaps 0x80000000 to 0xC0000000 (via the MPAX registers) and disable the cache on the alias range. In all your code you use the normal addresses, but when you read/write descriptors you use the alias address.
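
    A rough sketch of the idea (register addresses and field layouts are from my reading of the C6678 data manual -- please verify them there; on SYS/BIOS the MAR part can also be done with Cache.setMarMeta, as in your Option 3):

    #include <stdint.h>

    /* XMC MPAX segment 2 (assumed free) and the MAR for 0xC0000000. */
    #define XMPAXL2 (*(volatile uint32_t *)0x08000010)
    #define XMPAXH2 (*(volatile uint32_t *)0x08000014)
    #define MAR192  (*(volatile uint32_t *)0x01848300)

    void setup_noncached_ddr_alias(void)
    {
        /* RADDR = DDR3 physical 0x8:0000:0000 >> 12 = 0x800000; 0xFF = full permissions. */
        XMPAXL2 = (0x800000u << 8) | 0xFFu;
        /* BADDR = 0xC0000000; SEGSZ = 0x1D -> 1 GB window. */
        XMPAXH2 = 0xC0000000u | 0x1Du;
        /* Keep the alias non-cacheable: PC bit (bit 0) = 0. For a 1 GB window
         * you must clear MAR192..MAR255 (16 MB per MAR); only the first one is
         * shown here. The XMC/MAR setup is per core, so run this on each core. */
        MAR192 = 0;
    }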

  • Sergiu Stambolian said:

    - Option 1: I added CSL cache consistency calls CACHE_wbL2() after SRIO send and CACHE_invL2() before SRIO receive. It didn't work.

    Just to understand: when you say you added the cache operations before/after, do you mean you modified the OSAL routines?

    That is, Osal_srioBeginDescriptorAccess() should be an invalidate (no wb), while Osal_srioEndDescriptorAccess() should be a wb (an invalidate is not required).

    Both the invalidate and the wb should be implemented as in Osal_srioEndMemAccess(), i.e. protected against interrupts and using CACHE_FENCE_WAIT.
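
    Something like this (a sketch modeled on the example's osal.c -- check the exact prototypes in your srio_osal.h):

    #include <xdc/std.h>
    #include <stdint.h>
    #include <ti/sysbios/hal/Hwi.h>
    #include <ti/csl/csl_cacheAux.h>
    #include <ti/drv/srio/srio_drv.h>

    /* Invalidate only: called before the driver/application reads a descriptor. */
    void Osal_srioBeginDescriptorAccess(Srio_DrvHandle drvHandle, void* ptr, uint32_t descSize)
    {
        UInt key = Hwi_disable();
        CACHE_invL2(ptr, descSize, CACHE_FENCE_WAIT);
        Hwi_restore(key);
    }

    /* Write back only: called after a descriptor has been written. */
    void Osal_srioEndDescriptorAccess(Srio_DrvHandle drvHandle, void* ptr, uint32_t descSize)
    {
        UInt key = Hwi_disable();
        CACHE_wbL2(ptr, descSize, CACHE_FENCE_WAIT);
        Hwi_restore(key);
    }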

  • Hi Alberto,

    Thanks for your answers yesterday.

    Here are my comments:
    - I will evaluate the number of SRIO descriptors (I believe our application will require a "large" number of descriptors) and their memory footprint. Should this number be "small", I will apply your suggestion regarding the memory alias.
    - I used the cache coherency operations as you described (CACHE_FENCE_WAIT + interrupt protection). The only difference is that I used operations on the L2 cache, while the OSAL file in the TI project uses operations on the L1 cache; as I mentioned before, this should be fine according to the DSP Cache User Guide, sections 1.4 and 2.4.3.

    Summing up:
    1. I have the SRIO project working using a non-cacheable DDR3 memory segment for the descriptors. If needed, I will apply your suggestion regarding the memory alias.
    2. It would be nice to have an answer from Travis regarding Item 3 in my previous posting.

    Cheers,
    Sergiu

  • Sergiu,


    I'm not aware of any requirement to use non-cached memory for the descriptors, but it is possible that the LLD does not manage the cache coherency calls internally, which could be problematic.  I've made inquiries internally at TI and will let you know.

    Regards,

    Travis

  • Hi Sergiu,

    I've modified the SRIO_LoopbackTestProject for C6678 to place the descriptors and buffers used for the type 11 message passing test into DDR3. I did this by modifying the srio_test.cfg file to place the systemHeap into DDR3. I replaced the line: 

    Program.sectMap["systemHeap"] = Program.platform.stackMemory;

    with

    Program.sectMap["systemHeap"] = "DDR3";

    I set a breakpoint after the buffers and descriptors are configured and confirmed they are in DDR3. I then took a look at MAR128 (memory location 0x01848200 for C6678) and saw it was set to 0xD as you found in the srio_test_pe66e.c file. This indicates that the memory range is cacheable.
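
    For reference, the same check can be done from code (a sketch; bit 0 of a MAR register is the PC "permit caching" bit):

    #include <stdio.h>
    #include <stdint.h>

    /* MAR128 controls 0x80000000-0x80FFFFFF; its address on C6678 is 0x01848200. */
    #define MAR128 (*(volatile uint32_t *)0x01848200)

    void check_mar128(void)
    {
        /* 0xD has bit 0 set, i.e. caching is enabled for that 16 MB range. */
        printf("MAR128 = 0x%X\n", MAR128);
    }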

    The program executes properly. Is this the use case you are trying to replicate? Let me know if there is something else I need to modify to recreate your desired use case.

    Thanks,

    Clinton

  • Hi Clinton,

    Thanks for investigating this matter.

    I don't think the CFG modification moves the SRIO descriptors to DDR3:

    1. Changing the mapping of the system heap to DDR3 only affects dynamically allocated data (the data buffers in this case).
    2. The SRIO descriptors (the aggregate variable host_region) are statically allocated data: they are not affected by the mapping of the system heap (see the sketch after this list for how they could actually be moved).
    3. I checked the MAP file before/after the CFG change and, indeed, host_region is always in L2 memory.
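
    Moving the statically allocated region would need something like this (a sketch; ".desc_ddr3" is a name I made up, and the array declaration is only approximate):

    /* In the C file: place and align the descriptor region explicitly. */
    #pragma DATA_ALIGN(host_region, 128)               /* cache-line alignment */
    #pragma DATA_SECTION(host_region, ".desc_ddr3")
    Uint8 host_region[NUM_HOST_DESC * SIZE_HOST_DESC];

    /* And in srio_test.cfg: map that section to DDR3.  */
    /*   Program.sectMap[".desc_ddr3"] = "DDR3";        */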

    I also built & executed the Loopback project (pdk_C6678_1_1_2_5\packages\ti\drv\srio\test\Loopback) with and without the CFG change:

    • without the CFG change: as expected, the execution is successful.
    • with the CFG change: the first iteration of the RAW test is OK, but the second iteration fails on data comparison (I suspect it fails due to a caching problem with the data buffers: they are stored in DDR3 in this case).

    As an aside: the initial Loopback project assumes that data is stored in L2. When moving data storage (heap, SRIO descriptors, etc.) to DDR3, additional changes are required: see the calls to l2_global_address() and Osal_local2Global(), which convert local L2 addresses into global addresses, as sketched below.
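
    The usual C66x conversion looks roughly like this (a sketch of what l2_global_address() does; verify against the example source):

    #include <stdint.h>
    #include <c6x.h>   /* DNUM: the id of the core executing the code */

    /* Local L2 addresses (e.g. 0x00800000) are only visible to the owning
     * core; other masters such as the SRIO/QMSS DMA need the global alias
     * 0x10800000 + coreNum * 0x01000000. */
    static inline uint32_t local_to_global(uint32_t addr)
    {
        if (addr < 0x01000000u)                       /* local L1/L2 range */
            return (0x10000000u | (DNUM << 24)) + addr;
        return addr;                                  /* already global */
    }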

    Regards,

    Sergiu