Is this a coherency problem? (C6678 platform)

Jin-Yi Wu

Hello,

I am trying to do some image processing on the C6678. Using only one core, I get the correct result (for a 1920x1080 image).

Then I tried using all 8 cores for the same processing, but the result has some defects at the places marked in the following image.

I thought it could be a cache coherency problem, so I searched the forum and used Cache_inv / Cache_wb to maintain coherence. But I still cannot solve the problem. My steps are as follows:

1. Write the raw image data to DDR, in a buffer named pucInData.

2. On each core:

Cache_inv(pucInData, nImgHeight*nImgWidth, Cache_Type_ALL, TRUE); //Invalidate image data to make sure I see the right data in each core.

//////////////////////////////////////////////

//do the image processing

Sobel(pucInDataPtr, pucDataOutPtr, nImgWidth, nProcessLength); //pucInDataPtr = pucInData + an offset to the proper start position.

//////////////////////////////////////////////

Cache_wb(pucDataOutPtr, nProcessLength, Cache_Type_ALL, TRUE); //write the data back to make sure I can get the right data in Core 0.

3. In Core 0, after every core has completed its operation:

Cache_inv(pucDataOut, nImgWidth*nImgHeight, Cache_Type_ALL, TRUE); //Invalidate the data to make sure I can see the right data.

pucDataOutPtr is the address of each core's processing result, which I aligned to 16 bytes (128 bits).
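One detail worth keeping in mind here: block coherence operations on the C66x act on whole cache lines (128 bytes for L2, 64 bytes for L1D), so the range a Cache_inv or Cache_wb call really touches is the requested range rounded out to line boundaries. A 16-byte-aligned boundary such as 0x810407A0 sits in the middle of the 128-byte line that starts at 0x81040780. The sketch below is plain portable C that only does this address arithmetic; it does not call the SYS/BIOS API, and line_floor, line_ceil, cache_op_range and shares_cache_line are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Cache line sizes on the C66x CorePac. */
#define L1D_LINE_SIZE 64u
#define L2_LINE_SIZE  128u

/* Round an address down to the start of its cache line.
 * line_size must be a power of two. */
uint32_t line_floor(uint32_t addr, uint32_t line_size)
{
    assert(line_size != 0u && (line_size & (line_size - 1u)) == 0u);
    return addr & ~(line_size - 1u);
}

/* Round an address up to the next cache-line boundary. */
uint32_t line_ceil(uint32_t addr, uint32_t line_size)
{
    assert(line_size != 0u && (line_size & (line_size - 1u)) == 0u);
    return (addr + line_size - 1u) & ~(line_size - 1u);
}

/* Byte range [*lo, *hi) that a line-granular cache operation on
 * [addr, addr + len) actually affects. */
void cache_op_range(uint32_t addr, uint32_t len, uint32_t line_size,
                    uint32_t *lo, uint32_t *hi)
{
    *lo = line_floor(addr, line_size);
    *hi = line_ceil(addr + len, line_size);
}

/* Nonzero if the line-rounded ranges of two buffers overlap,
 * i.e. the buffers share at least one cache line. */
int shares_cache_line(uint32_t a, uint32_t a_len,
                      uint32_t b, uint32_t b_len, uint32_t line_size)
{
    uint32_t a_lo, a_hi, b_lo, b_hi;
    cache_op_range(a, a_len, line_size, &a_lo, &a_hi);
    cache_op_range(b, b_len, line_size, &b_lo, &b_hi);
    return a_lo < b_hi && b_lo < a_hi;
}
```

Applied to two back-to-back 258720-byte regions starting at 0x81001500 and 0x810407A0, shares_cache_line reports that they share the L2 line at 0x81040780, so a write-back from either core also covers bytes that belong to the other.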

I printed out some information:

Core 0 Process Length=258720 Index from 0 ( 0, 0) to 3F2A0 (1440, 134) Address of pucDataOutPtr from 81001500 to 810407A0

Core 1 Process Length=258720 Index from 3F2A0 (1440, 134) to 7E540 ( 960, 269) Address of pucDataOutPtr from 810407A0 to 8107FA40

Core 2 Process Length=258720 Index from 7E540 ( 960, 269) to BD7E0 ( 480, 404) Address of pucDataOutPtr from 8107FA40 to 810BECE0

Core 3 Process Length=258720 Index from BD7E0 ( 480, 404) to FCA80 ( 0, 539) Address of pucDataOutPtr from 810BECE0 to 810FDF80

Core 4 Process Length=258720 Index from FCA80 ( 0, 539) to 13BD20 (1440, 673) Address of pucDataOutPtr from 810FDF80 to 8113D220

Core 5 Process Length=258720 Index from 13BD20 (1440, 673) to 17AFC0 ( 960, 808) Address of pucDataOutPtr from 8113D220 to 8117C4C0

Core 6 Process Length=258720 Index from 17AFC0 ( 960, 808) to 1BA260 ( 480, 943) Address of pucDataOutPtr from 8117C4C0 to 811BB760

Core 7 Process Length=258720 Index from 1BA260 ( 480, 943) to 1F9500 ( 0, 1078) Address of pucDataOutPtr from 811BB760 to 811FAA00
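The printed values are consistent with splitting the interior rows of the image (every row except the first and last, on which a 3x3 Sobel window cannot be centered) into eight equal runs of pixels: 1920 * 1078 / 8 = 258720. The sketch below reproduces that arithmetic in portable C. Chunk and chunk_for_core are hypothetical names, and the two-row skip is inferred from the numbers rather than stated in the post.

```c
#include <assert.h>

/* Hypothetical description of one core's share of the image. */
typedef struct {
    long start;   /* first pixel index handled by this core */
    long length;  /* number of pixels (nProcessLength) */
    int  x, y;    /* column and row of the first pixel */
} Chunk;

/* Split the (height - 2) interior rows of a width x height image into
 * num_cores equal pixel runs. Assumes the pixel count divides evenly. */
Chunk chunk_for_core(int core, int num_cores, int width, int height)
{
    long total = (long)width * (height - 2);  /* 3x3 window: skip 2 rows */
    Chunk c;

    assert(num_cores > 0 && total % num_cores == 0);
    c.length = total / num_cores;
    c.start  = (long)core * c.length;
    c.x = (int)(c.start % width);  /* column of the first pixel */
    c.y = (int)(c.start / width);  /* row of the first pixel */
    return c;
}
```

For core 1 this gives start 0x3F2A0 at (1440, 134), and core 7 ends at 0x1F9500, matching the log.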

Then I guessed that the alignment might need to be 128 bytes, so I aligned pucDataOutPtr on each core to 128 bytes. But the result is still the same.
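Aligning every core's pucDataOutPtr to 128 bytes means the regions can no longer sit back to back, because 258720 = 2021 * 128 + 32 is not a multiple of 128. One common way to lay them out, sketched below as an assumption rather than anything from this thread, is to pad each core's output stride up to a whole number of cache lines. round_up and padded_offset are hypothetical helpers. Padding keeps each core's Cache_wb on its own lines, but it does not by itself fix a kernel that writes outside its own region.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to a multiple of align (align must be a power of two). */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Byte offset of a core's output region when every region is padded
 * to whole cache lines. If the buffer base is line-aligned, each
 * region then starts on its own line and no two cores share a line. */
size_t padded_offset(int core, size_t process_length, size_t line_size)
{
    return (size_t)core * round_up(process_length, line_size);
}
```

With a 128-byte line, each region grows from 258720 to 258816 bytes, at a cost of 96 bytes of padding per core.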

So, where is my mistake? How can I fix it? Thanks.

over 11 years ago

HRi, over 11 years ago


Hi Jin-Yi,

Try disabling the cache and prefetch to narrow down what causes your issue. Please check the MAR bits in Table 4-21, Memory Attribute Register Field Description (sprugw0b.pdf, TMS320C66x DSP CorePac User Guide).

Thanks,

Jin-Yi Wu, over 11 years ago, in reply to HRi


Hi HR,

I didn't know how to disable the cache and prefetch directly, so I used Cache.setMarMeta in my .cfg file to disable them and ran some tests.

I placed the output buffer (pucDataOutPtr in my previous post) at address 0xB0000000 and ran four different tests:

1. Cache.setMarMeta(0xB0000000, 0x01000000, Cache.PC|Cache.PFX );

MAR176 (0xB0000000~0xB1000000) becomes 9. The result is wrong, as in my previous post, and the processing takes 6.2 ms.

2. Cache.setMarMeta(0xB0000000, 0x01000000, Cache.PC );

MAR176 (0xB0000000~0xB1000000) becomes 1. The result is wrong, as in my previous post, and the processing takes 6.2 ms.

3. Cache.setMarMeta(0xB0000000, 0x01000000, Cache.PFX );

MAR176 (0xB0000000~0xB1000000) becomes 8. The result is correct, and the processing takes 17 ms.

4. Cache.setMarMeta(0xB0000000, 0x01000000, 0x0 );

MAR176 (0xB0000000~0xB1000000) becomes 0. The result is correct, and the processing takes 17 ms.
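The register values in these four tests follow from the MAR layout: each MAR register covers one 16 MB region, so its index is the top 8 bits of the address (0xB0000000 >> 24 = 176), and the value combines bit 0 (PC, "permit copies", which makes the region cacheable) with bit 3 (PFX, prefetchable). The constants below are defined locally for illustration to mirror the values reported above; they are not the SYS/BIOS Cache.PC and Cache.PFX definitions, and mar_index and mar_is_cacheable are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Local constants mirroring the MAR bit values seen in the tests. */
#define MAR_PC  0x1u  /* bit 0: permit copies (region is cacheable) */
#define MAR_PFX 0x8u  /* bit 3: region is prefetchable */

/* Each MAR register covers a 16 MB region: index = address >> 24. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* Nonzero if a MAR value makes its region cacheable. */
int mar_is_cacheable(uint32_t mar_value)
{
    return (mar_value & MAR_PC) != 0u;
}
```

Read this way, the two runs with PC set (9 and 1) gave the wrong result in 6.2 ms, while the two with PC clear (8 and 0) gave the correct result in 17 ms, so the defect follows the cacheable bit rather than the prefetch bit.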

It seems that the result goes wrong whenever the cache is enabled. How can I solve this problem?

Thanks,

Jin-Yi

HRi, over 11 years ago, in reply to Jin-Yi Wu


Hi Jin-Yi,

The solution is to use write-back and invalidate (wb & inv). Since you are using the same input image, I assume the errors appear at the same addresses every time, so you can stop at one of those addresses and see why the issue occurs. You can use the CCS Memory Browser to check where the data currently is (it has check boxes for L1, L2, ...).

Thanks,

Brandy Jabkiewicz, over 11 years ago, in reply to HRi


Hi Jin-Yi,

When you do a write-back, do you use a semaphore to protect the memory? What if Core A is reading the memory while Core B is writing it back? Or what if Core C and Core D are writing back at the same time, and their buffers share bytes within one cache line? You should consider this type of problem and perhaps protect the memory accesses with semaphores.
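The second scenario (two cores writing back buffers that share a cache line) can be modeled in a few lines. This is a simplified model, not the C66x cache: it assumes each core holds a private copy of one shared 128-byte line and that a write-back copies the whole line, so whichever core writes back last overwrites the other core's bytes with stale data. CachedLine, sim_inv and sim_wb are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define LINE 128  /* assumed line size, matching the C66x L2 line */

/* A core's private cached copy of one line of shared memory. */
typedef struct {
    unsigned char data[LINE];
} CachedLine;

/* Simulated invalidate: (re)load the line from shared memory. */
void sim_inv(CachedLine *c, const unsigned char *mem)
{
    memcpy(c->data, mem, LINE);
}

/* Simulated write-back under the whole-line assumption: every byte
 * of the cached copy is written, including bytes this core never
 * modified. */
void sim_wb(const CachedLine *c, unsigned char *mem)
{
    memcpy(mem, c->data, LINE);
}
```

If core A sets byte 10 and core B sets byte 20 of the same line and both write back, B's stale copy of byte 10 erases A's result. In this model, serializing the two write-backs would not help on its own, because B's copy is already stale; keeping each core's data on separate lines avoids the conflict.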

Brandy

Dharik Patel, over 11 years ago


Jin-Yi,

Which Sobel function are you using? Is it one of the functions from ImgLib? Also, how are you calculating nProcessLength?

Jin-Yi Wu, over 11 years ago, in reply to Dharik Patel


Hi all,

Thanks for your help. I found where the problem is.

I use the Sobel function from ImgLib with some modifications, but the problem is not in what I modified. The problem is that the output buffer is 128-bit aligned, but the Sobel function fills the output buffer starting from index 1, not from index 0.

Core                  A        B        C        D        E        F        G        H
                  0        128      256      384      512      640      768      896      1024
The output buffer |--------|--------|--------|--------|--------|--------|--------|--------|
Sobel index        |--------|--------|--------|--------|--------|--------|--------|--------|
                   1        129      257      385      513      641      769      897      1025

There is a one-pixel overlap (marked with a blue background) between every two successive cores, and this is the cause of the error. After I modified the Sobel function to fill the output from index 0, the problem was solved.
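The fix can be illustrated with a small portable model. This is not the ImgLib code: sobel_at and sobel_segment are hypothetical names, the kernel is a plain |Gx| + |Gy| 3x3 Sobel over flat pixel indices, and out_base stands in for the index at which a core starts writing its output. With out_base = 0, each core writes only its own len bytes; with out_base = 1, the last result of every core lands in the first byte of the next core's region, which is the one-pixel overlap described above.

```c
#include <assert.h>
#include <stdlib.h>

/* Gradient magnitude |Gx| + |Gy| of the 3x3 window whose top-left
 * pixel is in[p], clamped to 255. Like a flat-indexed kernel, it
 * ignores row wrap-around at the left and right image edges. */
static unsigned char sobel_at(const unsigned char *in, long p, int w)
{
    const unsigned char *r0 = in + p, *r1 = r0 + w, *r2 = r1 + w;
    int gx = (r0[2] + 2 * r1[2] + r2[2]) - (r0[0] + 2 * r1[0] + r2[0]);
    int gy = (r2[0] + 2 * r2[1] + r2[2]) - (r0[0] + 2 * r0[1] + r0[2]);
    int mag = abs(gx) + abs(gy);
    return (unsigned char)(mag > 255 ? 255 : mag);
}

/* Process window positions [start, start + len) and store result i at
 * out[out_base + i], where out is this core's output pointer.
 * out_base = 0 stays inside the core's len-byte region; out_base = 1
 * reproduces the off-by-one write into the next core's region.
 * The input needs 2 readable bytes past the last image row. */
void sobel_segment(const unsigned char *in, unsigned char *out,
                   int w, long start, long len, int out_base)
{
    assert(len >= 0 && (out_base == 0 || out_base == 1));
    for (long i = 0; i < len; i++)
        out[out_base + i] = sobel_at(in, start + i, w);
}
```

Splitting the work into chunks with out_base = 0 reproduces the single-pass result exactly, while out_base = 1 shifts every result one byte to the right, across each chunk boundary.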

Jin-Yi
