Unaligned volatile access

Other Parts Discussed in Thread: DM3730, OMAP-L138

Hi,

I need to perform an unaligned access to a volatile memory region (shared with DMA) from C++ code.

I tried using _mem4(), but it does not accept "volatile" pointers, since a "volatile xx *" cannot be implicitly converted to (void *). I could force the cast, but this (probably?) means losing the "volatile" behaviour.

What is the best (elegant/convenient/robust) solution to this?

(The point of keeping the volatile properties is to make sure the data is in RAM before kicking the QDMA.)

Thanks,

Raul.
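
One portable way to keep the volatile semantics on an unaligned access, without casting the qualifier away for _mem4(), is to assemble the word from byte accesses through a pointer to volatile. The following is only a sketch under stated assumptions: the helper names are hypothetical, little-endian byte order is assumed, and it uses four byte accesses in place of the single unaligned access that _mem4() is meant to provide.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value from a possibly
 * unaligned address. Each byte is read exactly once through a pointer to
 * volatile, so the compiler may not cache or drop the reads. */
static uint32_t load_u32_unaligned_volatile(const volatile void *addr)
{
    const volatile uint8_t *p = (const volatile uint8_t *)addr;
    uint32_t b0 = p[0];
    uint32_t b1 = p[1];
    uint32_t b2 = p[2];
    uint32_t b3 = p[3];
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
}

/* Hypothetical helper: write a 32-bit value as four little-endian bytes.
 * Each byte is stored exactly once, in ascending address order. */
static void store_u32_unaligned_volatile(volatile void *addr, uint32_t value)
{
    volatile uint8_t *p = (volatile uint8_t *)addr;
    p[0] = (uint8_t)(value & 0xFFu);
    p[1] = (uint8_t)((value >> 8) & 0xFFu);
    p[2] = (uint8_t)((value >> 16) & 0xFFu);
    p[3] = (uint8_t)((value >> 24) & 0xFFu);
}
```

Because the parameters are already pointers to volatile void, a "volatile xx *" argument converts implicitly and no cast is needed at the call site. Note that a DMA engine observing the buffer between the byte stores would see a partially updated word.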

  • Raul,

    This was moved from the C64x Single Core DSP Forum, but which device are you using and which version of the compiler?

    Please explain why you think casting is a bad choice. That seems like the best method, to me.

    "volatile" has no relationship to caching or controlling when the data write has completed or landed.

    I apologize, but it is not clear what the real issue is.

    Regards,
    RandyP

  • Hi Randy,

    The device is a DM3730. Codegen is 6.1.21.

    My concern is how to guarantee that data has been written to memory (rather than held in a register) before initiating a QDMA transfer. I am initiating the QDMA transfer via an IDMA transfer.

    So, I thought that if my pointer to the internal buffer is declared as volatile, and my access to the IDMA register is volatile, I could guarantee that by the time the IDMA starts processing, all the writes to the internal buffer would have been committed to IRAM (L2 RAM and/or L1D cache).

    This in itself is not a great solution, as I don't care about the order of the intermediate writes to the internal buffer, only that all writes to the internal buffer have occurred before the DMA starts. And making all accesses volatile will prevent optimization by the compiler. Is there a better solution for this? (So far, I couldn't find any way of making a memory barrier or similar.)

    (BTW, I am new to C64x+ programming)

    Regards,

    Raul
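
The ordering requirement described above can be illustrated with a small sketch. The C language orders accesses to volatile objects relative to one another, but not relative to ordinary accesses. Assuming each buffer element is written exactly once, the arithmetic can stay in ordinary locals (where the compiler is free to optimize) and only the final store goes through a pointer to volatile, so all buffer stores are emitted before the volatile register write that starts the transfer. The function name, the placeholder computation, and the use of an ordinary variable as the "kick" register are all illustrative assumptions. This covers compiler behaviour only, not cache or write-buffer effects in hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: fill one row, then start a transfer.
 * row      - buffer that the DMA will read (pointer to volatile)
 * kick_reg - stand-in for the register whose write starts the transfer */
static void fill_row_then_kick(volatile int32_t *row, size_t width,
                               volatile uint32_t *kick_reg,
                               uint32_t kick_value)
{
    for (size_t j = 0; j < width; j++) {
        /* Placeholder computation in an ordinary local variable. */
        int32_t value = (int32_t)(3 * j + 1);

        /* Exactly one volatile store per element. */
        row[j] = value;
    }

    /* Volatile store sequenced after all volatile stores above. */
    *kick_reg = kick_value;
}
```

The cost of volatile here is limited to one store per element in program order; intermediate values never touch the volatile object.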

  • Raul,

    In the Training section of TI.com, there is a training video set for the C6474. Although the C6474 is a different device, it has (three of) the same DSP core, so it may be helpful for you to review the modules that apply to the C64x+. In particular, the Memory & Cache Module and the EDMA3/QDMA/IDMA Module may help you understand some of the features and options available within the cache and EDMA3 portions of the architecture common to the two devices. You can find this complete video set here.

    The C64x+ Megamodule Reference Guide would also be a good document to review. Go to the TMS320DM3730 Product Folder page to find all of the documentation available for the device, and possibly find some more training opportunities.

    After you have reviewed those materials, you may have some new questions, and it may be appropriate to move this thread to the appropriate hardware forum now that the device has been identified and also the problem/question, which is how to deal with cache coherency.

    Regards,
    RandyP

  • Randy,

    I went through the modules I hadn't already been through. But I still didn't manage to find an answer to my core question: how do I guarantee that, when the DMA is kicked, the DSP has written the data to memory (regardless of which memory) rather than holding it in a register?

    Let me explain it with an example. The following code computes stuff that gets stored in a local buffer, and then DMA'd to external DDR.

    I've omitted many details to concentrate on my query.

    #pragma DATA_SECTION(".bss")
    #pragma DATA_ALIGN(8)
    int myBuffer[ MY_HEIGHT*MY_WIDTH ];

    void main() {
         mySetupQDMA();
         mySetupIDMA();

         for( int i = 0; i < MY_HEIGHT; i++ ){   // one iteration per row
              int *myRow = myBuffer + (i*MY_WIDTH);
              int *myPtr = myRow;

              for( int j = 0; j < MY_WIDTH; j++ ) {
                   *myPtr++ = <some operations>;
              }

              // Update source pointer for QDMA
              myLocalParamSet[ SRC_PTR_IDX ] = myRow;

              // Do IDMA to EDMA PaRAM
              while( (*TPCC_IPR & (1 << myTCC)) == 0 );  // make sure our last QDMA is done
              *TPCC_ICR = 1 << myTCC;
              while( (*IDMA0_STAT & 0x03) != 0 );        // make sure IDMA is idle

              *IDMA0_MASK   = myMask;
              *IDMA0_SOURCE = myLocalParamSet;
              *IDMA0_DEST   = myQDMA0ParamSet;
              *IDMA0_COUNT  = myParamSize;  // this kicks the IDMA, which in turn kicks the QDMA
         }
    }

    So, here are my questions, based on the above code.

    How do I guarantee that the assignment of "myRow" to myLocalParamSet[ SRC_PTR_IDX ] has been performed (rather than the value being held in a DSP register) before I write to IDMA0_COUNT?

    How do I guarantee that the accesses to the IDMA0 registers are executed in order (as required by the C64x+ spec on the IDMA)? Is this done by declaring the "IDMA0_xxx" registers as pointers to volatile?

    How do I guarantee that all the writes through "myPtr" have been performed (i.e. rather than the values being held in DSP registers) before the QDMA does the transfer?

    Regards,
    Raul 

  • Hi,

    All the variables directly used in the DMA setup, and the data buffers, must be volatile-qualified data or pointers to volatile (not "volatile pointers") and, as RandyP says, you have to take the cache into account. So, for example:

        volatile int myLocalParamSet[...];
        volatile int *IDMA0_MASK; .....
        ....
        CacheFlush(myRow, MY_WIDTH);
        CacheFlush(myLocalParamSet, ...);
        ....
        *IDMA0_MASK = .......;

    myRow is not required to be volatile.
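
The distinction between a pointer to volatile and a volatile pointer comes down to where the qualifier sits in the declaration. A minimal sketch follows. The variables fake_reg_a and fake_reg_b are hypothetical stand-ins for hardware registers so that the example can run on any host.

```c
#include <stdint.h>

/* Ordinary variables standing in for memory-mapped registers (assumption). */
static uint32_t fake_reg_a;
static uint32_t fake_reg_b;

/* Pointer to volatile: the qualifier applies to the pointed-to data.
 * Every access through *ptr_to_volatile must be emitted, in program order
 * relative to other volatile accesses. */
static volatile uint32_t *ptr_to_volatile = &fake_reg_a;

/* Volatile pointer: the qualifier applies to the pointer variable itself.
 * The pointer is re-read on each use, but the data it points to is
 * ordinary memory as far as the compiler is concerned. */
static uint32_t *volatile volatile_ptr = &fake_reg_b;

static void write_twice(void)
{
    /* Both stores are volatile accesses and must be emitted in this order. */
    *ptr_to_volatile = 1u;
    *ptr_to_volatile = 2u;

    /* These stores target non-volatile data, so the compiler is not
     * required to order them against other non-volatile accesses. */
    *volatile_ptr = 1u;
    *volatile_ptr = 2u;
}
```

For memory-mapped registers and DMA-visible buffers, the first form is the one that matters.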

  • Alberto,

    thanks for the reply.

    I forgot to mention that "myRow" lives in L1DSRAM. So there is no need for CacheFlush.

    Why do you claim that myRow does not need to be volatile? What guarantees that the value has been "committed" to RAM? AFAIK, the C (or rather C++ in my real code) standard only guarantees that those values will be "committed" to RAM by the time we finish the function.

    Raul.

  • Raul,

    The Optimizing Compiler User's Guide explains how it uses the volatile keyword. I do not believe this has anything to do with what you are asking for. But someone else in this Compiler Forum may be able to address that better than I can.

    The only potential problem I see here is that you may have cache coherency issues depending on where myBuffer (.bss) is located. If myBuffer is in L2, then coherency will not be a problem. If it is in other memory outside of the DSP Megamodule, then coherency will need to be addressed as explained in the training videos and the Megamodule Reference Guide.

    Have you had any indication that the ordering of the writes is not as expected? I am curious why you have this concern.

    Since you are repeatedly using the same QDMA channel to copy from a different source address to the same destination (and count, index, options), it would seem less complicated to use the power of the QDMA and only write the new SRC address to trigger the QDMA directly. Once the QDMA channel has been configured for SRC to be the trigger word and the other 7 registers are initialized to their repeated values, your code could reduce to

              // Wait for QDMA channel to complete
              while((*TPCC_IPR & (1 << myTCC))==0); //make sure our last QDMA is done.
              *TPCC_ICR = 1<<myTCC;

              // Update source pointer directly in QDMA's PARAM set
              myQDMA0ParamSet[ SRC_PTR_IDX ] =myRow;   // automatically triggers QDMA to run

    This would eliminate the use of IDMA0 which requires several writes plus the update to the local param set array. Could this work for you? It does not directly affect your questions on ordering or guaranteed landing, but it is an observation and suggestion.

    Regards,
    RandyP

     

  • Raul Benet said:

    Why do you claim that myRow does not need to be volatile ?

    It seems to me that myRow (the pointer, not the data pointed to) is used only to assign myPtr and myLocalParamSet, so it is myLocalParamSet that needs to be volatile (regardless of where myRow is, register or memory, its value will be written into the param set).

    It is also used as the start for myPtr, but in this case, since you work on an array, *myPtr will always be immediately written to memory (maybe in cache), and not kept in a register.

  • Randy,

    My sample was somewhat simplified from the original 1k lines of code. In fact, the QDMA dest pointer also increments, and needs to be re-written.

    Nevertheless, I also wondered if we were really gaining anything from the use of IDMA (the code wasn't written by me). But at this stage, I am wary of making any major changes without understanding my current problem.

    Re: why my concern ?

    I have a bit-accurate model of my DSP code running in parallel on the ARM, for test purposes.

    When I run it, I do get bit mismatches, but only very occasionally. The rate of occurrence ranges from 1% (in a case where I have a data pattern that makes the error occur more often, plus some "magic NOPs" that also make it happen more often) to 1 per million. The mismatches always occur towards the end of a row.

    In some cases, they all happen on the second-to-last 16-bit value (all rows are multiples of 32 bits). In other, much rarer cases, it is a burst of 32 bytes.

    I've been through the code trying to find what might be wrong, but haven't yet managed to pinpoint it. The error pattern changes depending on adding NOPs, inlining vs. not inlining certain functions, ...

    So, I am trying to check my understanding of how things work to see if I can find the problem. Note that I have also reviewed the errata for the DM37xx and found nothing that, in my opinion, would explain this behaviour. I did see some errata for the OMAP-L138 that could (I am not sure if they would) lead to something like this, but given that it is another device, I didn't do a thorough analysis. Is there any chance that an erratum published for the OMAP-L138 has not yet been added to the DM37xx errata? I ask because I guess they share a good part of the DSP design.

    Regards,

    Raul.

  • Raul,

    Raul Benet said:
    My sample was somewhat simplified from the original 1k lines of code. In fact, the QDMA dest pointer also increments, and needs to be re-written.

    May I assume you know how to add that to the example I supplied? The QDMA mechanism is intended for this type of operation.
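
For readers following along, the extension might look roughly like the sketch below: write the new destination first, then write the source last, since the source word is the one configured as the QDMA trigger. This is a hedged illustration, not the code from the thread. The function name is hypothetical, the word offsets (SRC at word 1, DST at word 3 of an 8-word PaRAM set) reflect the EDMA3 PaRAM layout as I understand it and should be checked against the device documentation, and an ordinary array stands in for the PaRAM set.

```c
#include <stdint.h>

/* Assumed word offsets within one 8-word PaRAM set. */
enum {
    PARAM_SRC_WORD = 1,  /* source address; assumed to be the trigger word */
    PARAM_DST_WORD = 3,  /* destination address */
    PARAM_WORDS    = 8
};

/* Hypothetical helper: update the addresses for one row and trigger the
 * QDMA. The remaining words are assumed to hold their repeated values. */
static void qdma_submit_row(volatile uint32_t *param_set,
                            uint32_t src_addr, uint32_t dst_addr)
{
    /* Non-trigger word first: this write does not start a transfer. */
    param_set[PARAM_DST_WORD] = dst_addr;

    /* Trigger word last: on real hardware this write starts the transfer,
     * so everything the transfer depends on must already be in place. */
    param_set[PARAM_SRC_WORD] = src_addr;
}
```

Both stores go through a pointer to volatile, so the compiler must emit them in the order written.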

    Raul Benet said:
    I am trying to check my understanding of how things work to see if I can find the problem.

    Do you have any reason to believe the row copies are not being executed correctly?

    Raul Benet said:
    I did see some errata for the OMAP-L138 that could (I am not sure if they would) lead to something like this, but given that it is another device, I didn't do a thorough analysis. Is there any chance that an erratum published for the OMAP-L138 has not yet been added to the DM37xx errata?

    Please give me some hints as to which errata you are concerned about. I seriously doubt any missing errata are pending that would affect your computations.

    When there are differences between the DSP version and the ARM version, how do you know which one is right?

    Have you compared the ARM version to the DSP's version prior to the QDMA copy operation? That might eliminate any issue with the QDMA copy process.

    Regards,
    RandyP

  • Randy,

    thanks for the prompt replies.

    RandyP said:

    May I assume you know how to add that to the example I supplied? The QDMA mechanism is intended for this type of operation.

    Yes, I do understand enough of the QDMA operation to do that.

    RandyP said:

    Do you have any reason to believe the row copies are not being executed correctly?

    Well, given that the software works most of the time when run thousands of times with the same pattern, I assume that the code itself is correct. The only source of randomness that I can think of is the DMA traffic, which may be influenced by activity on the ARM (which is not particularly high during the test, but is probably never truly zero).

    I don't know if the problem occurs during reading (DDR->internal RAM) or during writing (internal RAM->DDR), but from the data I've seen, "writing" looks like the most likely explanation. But at this stage, I can't rule anything out.

    RandyP said:

    Please give me some hints as to which errata you are concerned about. I seriously doubt any missing errata are pending that would affect your computations.

    It is erratum 2.1.17 of the OMAP-L138. I now realize that it probably doesn't apply, as my buffers are in L1SRAM; but when I first looked at it, I was under the impression my buffers were in L2SRAM.

    RandyP said:

    When there are differences between the DSP version and the ARM version, how do you know which one is right?

    Most of the time it is hard to tell (as our tests use random patterns), but when running with a pathological pattern, I do know which value is correct.

    RandyP said:

    Have you compared the ARM version to the DSP's version prior to the QDMA copy operation? That might eliminate any issue with the QDMA copy process.

    You mean remove the QDMA copy code, and write to DDR "manually"? I haven't done that; I will try. If it works, it will tell me nothing, as I have cases where all I need to make it work is a "NOP". But if it fails, it may indeed point to some other issue. Though, to be honest, I cannot see what other kind of "reasonable" issue could lead to this kind of random behaviour where it works 99.999% of the time.

    Regards,
    Raul

  • Raul,

    Raul Benet said:
    You mean remove the QDMA copy code, and write to DDR "manually" ?

    No, I mean compare the ARM results data with the DSP results data prior to the write to DDR, while still in the L2 sram location.

    Regards,
    RandyP
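
The comparison suggested above, checking the DSP row against the reference data while it is still in on-chip RAM, could be sketched as follows. The function name is hypothetical. Since the mismatches reported in this thread are on 16-bit values, the sketch compares 16-bit elements and returns the index of the first difference.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the index of the first element at which the
 * two buffers differ, or count if they are identical. */
static size_t first_mismatch_u16(const uint16_t *expected,
                                 const uint16_t *actual,
                                 size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (expected[i] != actual[i]) {
            return i;  /* first differing element */
        }
    }
    return count;      /* no mismatch found */
}
```

Running this on each row before the transfer is started, and again on the DDR copy afterwards, would show whether a mismatch is already present in the computed data or only appears after the copy.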