Is this a coherency problem? (on the C6678 platform)

Hello,

I am doing some image processing on the C6678. Using only one core, I get a correct result (for a 1920x1080 image).

 

Then I tried to use 8 cores for the same image processing, but the result has some defects at the places marked in the following image.

 

I think it could be a cache coherency problem. I searched the forum and then used Cache_inv / Cache_wb to ensure coherence, but I still cannot solve the problem. My steps are as follows:

 

1. Write the image raw data to DDR, named pucInData.

2. For each core,

    Cache_inv(pucInData, nImgHeight*nImgWidth, Cache_Type_ALL, TRUE);  //Invalidate image data to make sure I see the right data in each core.

    //////////////////////////////////////////////

    //do the image processing

    Sobel(pucInDataPtr, pucDataOutPtr, nImgWidth, nProcessLength);    //pucInDataPtr = pucInData + an offset to the proper start position.

    //////////////////////////////////////////////

    Cache_wb(pucDataOutPtr, nProcessLength, Cache_Type_ALL, TRUE);  //write the data back to make sure I can get the right data in Core 0.

 

3. In Core 0, after every core has completed the operation,

    Cache_inv(pucDataOut, nImgWidth*nImgHeight, Cache_Type_ALL, TRUE);  //Invalidate the data to make sure I can see the right data.

 

pucDataOutPtr is the address of each core's processing result, which I aligned to 16 bytes (128 bits).

I print out some information:

Core 0 Process Length=258720 Index from      0 (   0,    0) to  3F2A0 (1440,  134) Address of pucDataOutPtr from 81001500 to 810407A0

Core 1 Process Length=258720 Index from  3F2A0 (1440,  134) to  7E540 ( 960,  269) Address of pucDataOutPtr from 810407A0 to 8107FA40

Core 2 Process Length=258720 Index from  7E540 ( 960,  269) to  BD7E0 ( 480,  404) Address of pucDataOutPtr from 8107FA40 to 810BECE0

Core 3 Process Length=258720 Index from  BD7E0 ( 480,  404) to  FCA80 (   0,  539) Address of pucDataOutPtr from 810BECE0 to 810FDF80

Core 4 Process Length=258720 Index from  FCA80 (   0,  539) to 13BD20 (1440,  673) Address of pucDataOutPtr from 810FDF80 to 8113D220

Core 5 Process Length=258720 Index from 13BD20 (1440,  673) to 17AFC0 ( 960,  808) Address of pucDataOutPtr from 8113D220 to 8117C4C0

Core 6 Process Length=258720 Index from 17AFC0 ( 960,  808) to 1BA260 ( 480,  943) Address of pucDataOutPtr from 8117C4C0 to 811BB760

Core 7 Process Length=258720 Index from 1BA260 ( 480,  943) to 1F9500 (   0, 1078) Address of pucDataOutPtr from 811BB760 to 811FAA00

Then I guessed that the required alignment might be 128 bytes, so I aligned pucDataOutPtr in each core to 128 bytes. But the result is still the same.

So, where is my mistake? How can I fix it? Thanks.
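As a side note on the 128-byte alignment attempt above, here is a minimal sketch of how the per-core output slices could be laid out so that every slice starts on a 128-byte boundary. It assumes a 128-byte cache line, and the helper names (`round_up_to_line`, `is_line_aligned`, `padded_slice_start`) are hypothetical, not part of any TI library. The numbers used in the usage note come from the log above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128-byte cache line, matching the alignment tried in the post. */
#define LINE_BYTES 128u

/* Round a byte count up to the next multiple of the line size. */
static inline size_t round_up_to_line(size_t n)
{
    return (n + LINE_BYTES - 1u) / LINE_BYTES * LINE_BYTES;
}

/* Non-zero if addr sits exactly on a line boundary. */
static inline int is_line_aligned(uintptr_t addr)
{
    return (addr % LINE_BYTES) == 0u;
}

/* Start address of a core's output slice when every slice is padded
 * to a whole number of lines (hypothetical layout, not the poster's). */
static inline uintptr_t padded_slice_start(uintptr_t base, size_t slice_len,
                                           unsigned core)
{
    return base + (uintptr_t)core * round_up_to_line(slice_len);
}
```

With the values from the log, Core 1's original start address 0x810407A0 is 16-byte aligned but not 128-byte aligned, while a slice length of 258720 rounds up to 258816, which would move Core 1's start to 0x81040800. Note that padding the slices this way leaves gaps in the output buffer, so the result would have to be compacted afterwards.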

  • Hi Jin-Yi,

    Try disabling the cache and the prefetch to see what causes your issue. Please check the MAR bits in Table 4-21, Memory Attribute Register Field Description (sprugw0b.pdf - TMS320C66x DSP CorePac User Guide).

    Thanks,

    HR

  • Hi HR,

    I don't know how to disable the cache and the prefetch directly, but I tried using setMarMeta in my .cfg file to disable them and ran some tests.

    I put the pucDataOutPtr from my previous post at address 0xB0000000 and ran four different tests.

    1. Cache.setMarMeta(0xB0000000, 0x01000000, Cache.PC|Cache.PFX );

        MAR176 (0xB0000000~0xB1000000) becomes 9. The result is wrong, as in my previous post. The processing takes 6.2 ms.

    2. Cache.setMarMeta(0xB0000000, 0x01000000, Cache.PC );

        MAR176 (0xB0000000~0xB1000000) becomes 1. The result is wrong, as in my previous post. The processing takes 6.2 ms.

    3. Cache.setMarMeta(0xB0000000, 0x01000000, Cache.PFX );

        MAR176 (0xB0000000~0xB1000000) becomes 8. The result is correct. The processing takes 17 ms.

    4. Cache.setMarMeta(0xB0000000, 0x01000000, 0x0 );

        MAR176 (0xB0000000~0xB1000000) becomes 0. The result is correct. The processing takes 17 ms.
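To make the four MAR176 values easier to read, here is a small sketch that decodes them. The bit positions are inferred only from the values reported in these tests (Cache.PC alone gives 1, Cache.PFX alone gives 8), and the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions inferred from the test results above: PC -> 1, PFX -> 8. */
#define MAR_PC  (1u << 0)   /* region is cacheable      */
#define MAR_PFX (1u << 3)   /* region is prefetchable   */

/* True if the MAR value marks the region as cacheable. */
static inline bool mar_cacheable(uint32_t mar)
{
    return (mar & MAR_PC) != 0u;
}

/* True if the MAR value marks the region as prefetchable. */
static inline bool mar_prefetchable(uint32_t mar)
{
    return (mar & MAR_PFX) != 0u;
}
```

Decoded this way, the two failing tests (MAR values 9 and 1) are exactly the ones with the cacheable bit set, and the two passing tests (8 and 0) have it cleared, regardless of the prefetch bit.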

    It seems that the result goes wrong whenever the cache is enabled. How can I solve this problem?

    Thanks,

    Jin-Yi

  • Hi Jin-Yi,

    The solution is to use wb & inv. I assume that, since you are using the same input image, you will get the issue at the same addresses each time, so you can stop at one of those addresses and see why the issue occurs. You can use the CCS memory browser to check where the data is (it has check boxes for L1, L2, ...).

    Thanks,

    HR

  • Hi Jin-Yi,

     

    When you do a write back, do you use a semaphore to protect the memory?  What if Core A is reading the memory while Core B is writing back, or what if Core C and Core D are writing at the same time and they share bytes along a cache line?  You should perhaps consider this type of problem and protect the memory access with semaphores.
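The "sharing bytes along a cache line" case can be illustrated with a toy model. This is a deliberately simplified sketch: it assumes each core holds a private copy of the line and that a write-back overwrites the whole line in memory. The 8-byte line size, the `Line` type and the function names are all hypothetical and chosen only for illustration, not taken from the C6678 hardware or any TI API.

```c
#include <string.h>

#define LINE 8  /* toy line size, for illustration only */

typedef struct { unsigned char bytes[LINE]; } Line;

/* A core loads a private copy of the line from memory. */
static inline Line cache_fill(const Line *mem)
{
    return *mem;
}

/* Assumed whole-line write-back: every byte of the line is overwritten. */
static inline void cache_writeback(Line *mem, const Line *cached)
{
    *mem = *cached;
}

/* Two cores update disjoint halves of the same line, then write back. */
static inline Line false_sharing_demo(void)
{
    Line mem;
    memset(mem.bytes, 0x00, LINE);

    Line a = cache_fill(&mem);                     /* core A's copy */
    Line b = cache_fill(&mem);                     /* core B's copy */

    memset(a.bytes, 0xAA, LINE / 2);               /* A writes the first half  */
    memset(b.bytes + LINE / 2, 0xBB, LINE / 2);    /* B writes the second half */

    cache_writeback(&mem, &a);
    cache_writeback(&mem, &b);  /* B's stale first half replaces A's data */
    return mem;
}
```

In this model the first half of the line ends up holding the old zeros instead of core A's 0xAA, even though the two cores never wrote the same byte. Under the same assumptions, keeping each core's output in its own set of lines avoids the problem.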

     

    Brandy

  • Jin-Yi,

    What sobel function are you using? Is it one of the functions from ImgLib? Also, how are you calculating nProcessLength?

  • Hi all,

    Thanks for your help. I found out where the problem is.

    I use the Sobel function from ImgLib with some modifications. But the problem is not in what I modified. The problem is that the output buffer is 128 bit aligned, but the Sobel function fills the output buffer starting from index 1, not from 0.

    Core                A        B        C        D        E        F        G        H
                    0        128      256      384      512      640      768      896      1024
    Output buffer   |--------|--------|--------|--------|--------|--------|--------|--------|
    Sobel index      |--------|--------|--------|--------|--------|--------|--------|--------|
                     1        129      257      385      513      641      769      897      1025

    There is a one-pixel overlap between each pair of successive cores (marked with a blue background in the original post), and this is the cause of the error. After I modified the Sobel function to fill the output from index 0, the problem was solved.
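The off-by-one described above can be checked with a few lines of C. This is only a sketch under the assumption that each core owns a slice of `len` bytes and that its kernel stores `len` results starting at a given first index within that slice. The function name `bytes_outside_slice` is hypothetical.

```c
#include <stddef.h>

/* Count how many bytes a core writes outside its own slice [0, len)
 * when it stores len results at indices first_index .. first_index+len-1
 * (indices are relative to the start of the core's slice). */
static inline size_t bytes_outside_slice(size_t len, size_t first_index)
{
    size_t outside = 0;
    for (size_t i = first_index; i < first_index + len; ++i) {
        if (i >= len) {
            ++outside;  /* this store lands in the next core's slice */
        }
    }
    return outside;
}
```

For the 128-element slices in the diagram, `bytes_outside_slice(128, 1)` gives 1 (the store at index 128 falls into the next core's region), while `bytes_outside_slice(128, 0)` gives 0, which matches the fix of filling the output from index 0.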

    Jin-Yi