Optimization of code

Hi there,

I have two threads, which share data by thread A writing to a global variable and thread B reading from it. It works fine when I build "Debug" or "Release" with -o1. But thread B is not able to get the updated data if I build "Release" with the -o2 option. Any suggestion?

PS: the global variable is declared as volatile. I have also tried disabling the L1D and L2 caches, but it didn't help.

BR

C.J.

  • Hi Chunjian,

    Which memory endpoint (L2, MSMC or DDR3) is this global variable allocated in? Do you have synchronization between thread A and thread B, i.e. does thread B wait until thread A has finished writing the shared variable before reading it? Instead of disabling the L1D and L2 caches, try making the memory region that contains the shared variable non-cacheable by configuring the corresponding MAR (memory attribute register).

  • Hi Karthik,

    It is located in the L2SRAM.

    I did try setting the MAR to 0 for the memory range where the variable is located. It seems to work only if the variable is in DDR3 and I make the whole DDR3 non-cacheable. I am not sure I used it correctly. This is what I put in the code:

    This would work:

    Cache_setMar((xdc_Ptr *)0x80000000, 0x10000000, 0); 

    But this won't work:

    Cache_setMar((xdc_Ptr *)0x88000000, 0x02000000, 0);

    Do you have experience with this?

  • Hi Chunjian,

    I do not know how Cache_setMar() is implemented and I have not used it before. However, each CorePac has 256 MAR registers in total, and each MAR register sets the cacheability attributes of one 16 MB memory section. For more details, refer to the C66x CorePac user's guide. A related E2E post that might be helpful:

    http://e2e.ti.com/support/embedded/bios/f/355/t/177410.aspx
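The 16 MB granularity described above means the MAR index for an address is simply the address shifted right by 24 bits. The following is a minimal host-side sketch of that arithmetic, not TI code; the helper names `mar_index` and `mar_count` are hypothetical and only illustrate which registers the two `Cache_setMar()` calls quoted earlier would touch.

```c
#include <assert.h>
#include <stdint.h>

/* One MAR register per 16 MB block, as stated in the reply above. */
#define MAR_BLOCK_SHIFT 24u   /* log2(16 MB) */

/* Hypothetical helper: index of the MAR register covering `addr`. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> MAR_BLOCK_SHIFT);
}

/* Hypothetical helper: number of MAR registers spanned by [base, base + size). */
unsigned mar_count(uint32_t base, uint32_t size)
{
    uint32_t last;

    assert(size > 0u);
    last = base + (size - 1u);   /* address of the last byte in the range */
    assert(last >= base);        /* the range must not wrap past 4 GB */
    return mar_index(last) - mar_index(base) + 1u;
}
```

Under this arithmetic, the range starting at 0x80000000 with size 0x10000000 begins at MAR128 and spans 16 registers, while the range starting at 0x88000000 with size 0x02000000 begins at MAR136 and spans 2. An address in L2SRAM such as 0x00800000 maps to MAR0.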

  • Hi,

    It is not possible to change the cacheability of L2 and MSMC (the first 16 MAR registers, MAR0 to MAR15, are read-only). So far, I have never had a problem enabling the cache for all of DDR or for only part of it (within the constraints of the MAR granularity, which is 16 MB).
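The read-only point can be turned into a quick sanity check before calling `Cache_setMar()`. This is only a sketch: it assumes that MAR0 to MAR15 are the read-only registers (as I understand the C66x CorePac documentation) and that each MAR covers 16 MB. The function name `mar_range_is_writable` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAR_BLOCK_SHIFT    24u   /* one MAR per 16 MB block */
#define FIRST_WRITABLE_MAR 16u   /* assumption: MAR0..MAR15 are read-only */

/* Hypothetical helper: true if every MAR covering [base, base + size)
 * can be reprogrammed. MAR indices only grow across the range, so it is
 * enough to check the lowest block. */
bool mar_range_is_writable(uint32_t base, uint32_t size)
{
    assert(size > 0u);
    return (base >> MAR_BLOCK_SHIFT) >= FIRST_WRITABLE_MAR;
}
```

With these assumptions, a variable in L2SRAM (0x00800000) falls under MAR0, so changing its cacheability through the MAR is not possible, while the DDR3 ranges starting at 0x80000000 are configurable.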

    Anyway, multiple threads running on the same core should never have cache coherency problems: both threads access data through the same caches and memory interfaces. You should look somewhere else to solve your problem. Maybe you can extract a small code snippet that shows exactly how you declare and define the shared variable, and how you access it and synchronize the threads. Do you use a semaphore? Do the threads poll the shared variable continuously? Or do they simply sleep and then poll the variable?

  • Hi Alberto,

    As you said, it is likely something other than cacheability that causes this problem, since the problem exists only when I build Release with the -o2 option, and it is fine if built as Debug or as Release with -o1.

    Both threads are running on the same core. The global variable is declared as:

    extern volatile int32_t sample_counter;

    Thread B is a HWI routine, which processes the data coming from the converter and counts the number of samples. Thread A polls the variable sample_counter and waits until it reaches a certain number:

    while (sample_counter<number_of_samples){}

    I suppose cache coherence is not a problem, because the variable is in L2SRAM (0x00800000)?

    The volatile keyword should keep the compiler from doing the wrong thing, I suppose?
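For comparison, here is a minimal host-side analogue of the pattern described above: a counter incremented by an asynchronous handler and polled in a busy loop. It is a sketch for a POSIX system, not SYS/BIOS code. The timer signal handler `on_tick` stands in for the HWI, and the helper `wait_for_samples` is hypothetical. Because the counter is `volatile`, the compiler has to reload it on every pass through the loop, whatever the optimization level.

```c
#define _XOPEN_SOURCE 700   /* expose sigaction() and setitimer() in strict modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/time.h>

/* Written by the handler, read by the polling loop. */
volatile sig_atomic_t sample_counter = 0;

/* Stand-in for the HWI: counts one "sample" per timer tick. */
static void on_tick(int signo)
{
    (void)signo;
    sample_counter++;
}

/* Hypothetical helper: start a 1 ms periodic timer, busy-wait until the
 * handler has counted enough samples, then stop the timer.
 * Returns the final count, or -1 if the timer could not be set up. */
int wait_for_samples(int number_of_samples)
{
    struct sigaction sa;
    struct itimerval tick = { {0, 1000}, {0, 1000} };   /* interval, first expiry */
    struct itimerval off  = { {0, 0}, {0, 0} };

    assert(number_of_samples >= 0);
    sample_counter = 0;

    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) != 0) return -1;
    if (setitimer(ITIMER_REAL, &tick, NULL) != 0) return -1;

    while (sample_counter < number_of_samples) { }      /* same loop as above */

    setitimer(ITIMER_REAL, &off, NULL);
    return (int)sample_counter;
}
```

If a loop like this never exits on the target even though the counter is volatile, the more likely explanation is that the handler is not running at all, which can be checked with a breakpoint inside it.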

  • I have also tried putting the variable in L1SRAM (0x00F00000), with the same result: the -o1 option works, -o2 does not.

  • Hi,

    So you are using SYS/BIOS, and Thread B is not a SYS/BIOS thread but a simple routine that processes the sample and then returns. I suppose you have already verified that Thread B runs correctly and that sample_counter is incremented as expected. I also suppose you have already verified that number_of_samples holds the correct value.

    So far, I see no reason for that behaviour. Small loops can be critical, since the CPU cannot service interrupts while it is branching, but I suppose that is not the case here: CGT 7.2.4+ with "-o2" can generate a software-pipelined loop that reloads sample_counter from memory on every iteration and is interruptible.

    You should try setting a breakpoint in the middle of the loop (directly in the assembly), just before the load of sample_counter, and examine the context. You should find instructions like:

       [ B0]   LDW     .D2T2   *+DP(sample_counter),B5   ; load the value
               NOP             4  ; wait for the value to be ready in register B5
               CMPLT   .L2     B5,B6,B4          ; compare with your threshold   <--- put the breakpoint here and look at register value and sample_counter value

      .....

    In my example, B5 holds sample_counter and B6 holds number_of_samples (not reloaded on every iteration). Your code could be a bit different.




  • Hi,

    I forgot to say that I think it is better not to place the variable in L1, since it is used as cache by default (maybe you disabled it?). Since cache coherency should not be the problem, I suggest putting the variable in MSMC so it works regardless of your L1/L2 configuration.

  • Ok, I found something that surprised me. The thread B I mentioned above is a HWI. When I put a breakpoint at the beginning of it, the breakpoint is never reached. That means the thread is not running at all. It is beyond my imagination that the -O2 option can break my HWI while it works fine with the -O1 option?

  • Chunjian Li said:

    Ok, I found something that surprised me. The thread B I mentioned above is a HWI. When I put a breakpoint at the beginning of it, the breakpoint is never reached. That means the thread is not running at all. It is beyond my imagination that the -O2 option can break my HWI while it works fine with the -O1 option?

    This should not be the case. Try adding some NOPs to the busy loop of thread A. Just write:

      while (sample_counter<number_of_samples) { __asm("  nop\n  nop\n nop\n nop\n nop\n nop\n nop\n  nop"); }

    This way you lose some optimization of the loop, but you can better isolate the problem. If it works, the optimizer has generated code that cannot service the interrupt during the thread A loop; otherwise you have to look somewhere else.

    Which version of the compiler are you using?

  • I have now removed thread A, so that only thread B is running. The ISR still won't run under the -o2 option, while it runs fine with -o1.

    The HWI was set up statically in the cfg file.

    I also tried to create the HWI dynamically from another thread, using the following code, but it won't run even with no optimization.

    Hwi_Handle hwi0;
    Hwi_Params hwiParams;
    Error_Block eb;

    Error_init(&eb);
    Hwi_Params_init(&hwiParams);
    hwiParams.arg = 0;
    hwiParams.enableInt = 1;
    hwiParams.eventId = 84;
    hwiParams.priority = 8;
    hwi0 = Hwi_create(4, (Hwi_FuncPtr)isrConverter, &hwiParams, &eb);
    if (hwi0 == NULL) {
        System_abort("Hwi create failed");
    }
    
    
    Is this the correct way of creating the HWI?