
C6678 L2 cache block coherence

hello:

I am using the C6678; my application runs on it, and recently I have had a problem with L2 cache block coherence. Core 0 provides the data that cores 1~7 process, and cores 1~7 return the results to core 0. My data is in DDR, the shared code for cores 1~7 is in MSMC RAM, core 0's code is in DDR, and private data (e.g. the stacks) is in LL2. When core 0 gets the results from the slave cores, it writes the data into file A, and I then compare file A from the C6678 with a correct result file B generated on an x86 PC.

If I set the L2 cache to 0 KB (using only L1), then when cores 1~7 finish processing one frame, I do the operations below:

the DSP result is the same as the PC result.

Then I set the L2 cache size to 64 KB, and when cores 1~7 finish processing one frame, I do the operations below:

the result from the DSP does not match the PC. If I do a global L2 operation rather than a block operation, the result matches the PC.

I also followed the advice from the silicon errata to do asm(" DINT"); before the cache operation, and afterwards do:

asm(" nop 8");
asm(" nop 8");
asm(" RINT");

but it does not seem to work either.

So I am not sure where my problem is! I need some help; I have been deeply plagued by this problem!

Thanks

Best Regards,

Si.

  • The response from experts to your post may be delayed due to the Christmas and New Year holidays.

    Kindly bear with us.

    Thanks.

  • Hello Si,

    When you do the wbInv, do you use a semaphore to protect the DDR data?  I am not sure I completely understand your setup, but if you let cores 1 to 7 write back the data at the same time, it could become corrupted depending on the timing of the reads and writes.  Another idea is to consider cache line alignment.  If you are using L2, the data needs to be aligned to 128 bytes and the size you are invalidating should be a multiple of 128.  I think for L1 it is 64 bytes.

    I find cache very sneaky.  If things are not aligned properly, then when you invalidate and write back, you could write back data that is incorrect, or overwrite data that another core is using, etc.  You need to be very careful with multicore: use semaphores, data alignment, and separation to ensure that each core does not corrupt another.

    Hope this helps until TI comes back from holiday!

    Brandy

     

  • hello Brandy,


    Thanks for your reply, and happy new year! Yes, my data is aligned to 128 bytes, but I am not
    sure whether cores 1 to 7 write back data to DDR at the same time; when they finish
    processing, they write their data back to DDR.


    I have a question: if my data is aligned to 128 bytes and the size invalidated is a multiple of
    128, do I still need to use a semaphore to protect the DDR data?

    If I use a semaphore to protect the DDR data, then after cores 1 to 7 finish processing their own
    data, I do the operation below:

    while(!CSL_semAcquireDirect(2));

    L2WBINV();

    CSL_semReleaseSemaphore(2);

    Is this right?

    Best Regards,

    Si

  • Hi Si,

    Happy new year!

    si cheng said:
    I have a question: if my data is aligned to 128 bytes and the size invalidated is a multiple of
    128, do I still need to use a semaphore to protect the DDR data?

    If you are aligned and the size to invalidate is a multiple of 128, then I think you are alright.  Do cores 1 - 7 ever use the same data?  Also, how are you aligning the data?  For example, if you have an array of structs and the struct itself is not also always aligned, you could have a problem.

    Maybe you need this:

    #pragma DATA_ALIGN(obsHdrs, 128)
    #pragma DATA_SECTION(obsHdrs, ".cfgHeaders")
    FrameDataStruct far obsHdrs[MAX_NUM_DOORBELLS];

    where the struct is also aligned:

    typedef struct {
    	unsigned int frameNum __attribute__ ((aligned(128)));
    	unsigned long long aqTime;
    	unsigned int totalObsCnt;
    	void * firstAddress;
    }FrameDataStruct;

    It uses more memory, because even if the size of the struct is small, it will occupy a minimum of 128 bytes.  But this is the only way to ensure that the invalidates and writebacks from different cores do not corrupt each other's data.

    si cheng said:

    while(!CSL_semAcquireDirect(2));

    L2WBINV();

    CSL_semReleaseSemaphore(2);

    Yes, this is how you would do it.  Please consider which semaphore you use because if it is used a lot (say by the TI libraries), you might get more delay than anticipated.  I usually grab one in the twenties because TI usually uses < 10.

    Lastly, if you are using flags to trigger when cores 1 to 7 start working, remember this errata:  http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/161625.aspx

    I had lots of problems with this (http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/253690.aspx) before I found the above post.  You need to invalidate the prefetch buffer when you do the wb and inv.

     Hope this helps!

    Brandy

  • Brandy,

    Cores 1~7 access the members of the structure, and I use the "__attribute__ ((aligned(128)))" attribute; different cores access different data in the structure, which is in DDR.

    You said "If you are aligned and the size to invalidate is a multiple of 128, then I think you are alright."  But when I use only the L1D cache, with the L2 cache disabled, it runs OK; with the L2 cache enabled, it is NOT OK.  I believe the data is aligned and the size to invalidate is a multiple of 128; otherwise it would not run OK with only L1D, but in fact it does.  The difference between L2 cache enabled and disabled is that with L2 cache disabled, the slave cores' cache operations do not use the DDR bus, while with L2 cache enabled, they do.  What do you think?

    Best Regards,

    Si

  • Brandy,

    In my project I disable the prefetch buffer and enable the MAR bit for L2 cacheability, using the operation below:

    CACHE_setMemRegionInfo(mar, pcx, 0);

    So when I do the L2 wb and inv, I do not need to invalidate the prefetch buffer.

    Best Regards,

    Si

  • Hi Si,

    I am not sure.   Maybe if I ask more questions we can come up with the problem.

    Where is the text (code) section - in DDR3 or L2?  Do the 7 cores execute the same code/executable?

    How was the memory for the structure allocated?  Are you sure it is not being used by something else in the linker command file?

    I suppose that when you allocate more cache, the cores could start storing their local variables in cache (assuming you have their heaps and stacks in DDR3), and when they have to make more room, they write back to DDR3 at any time.  Whether this case occurs depends on the answers to the questions above.

    I'll keep thinking too... cache is so frustrating in my opinion!

    Brandy

  • Brandy,

    thank you very much for your reply.

    Core 0's sections such as heap, text, .far:NDK_PACKETMEM, .const, platform_lib, and .far:NDK_OBJMEM are all in DDR, and its other sections, including stack and far, are in LL2.  Core 0 runs the NDK, receives data from the network, puts the data in its heap, and tells the slave cores the position of the data.

    Cores 1~7's text code is in MSM; they execute the same code, and their other sections (e.g. stack, data, and so on) are all in LL2.  The slave cores have no heap section; they just get the data from core 0 to process, and they put the results at the position core 0 tells them.  Core 0 communicates with the slave cores through a pointer to a structure.

    So my project has two text images: one for core 0 in DDR, the other for the slave cores in MSM.

    The structure is allocated from core 0's heap, which is in DDR.

    Core0 L2 cache size is 64KB. 

    Cache is frustrating me too!

     

    Best Regards,

    Si

  • Ok.  That all seems clean to me.  Now, I have more questions about this:

    si cheng said:
    they put the results at the position core 0 tells them.  Core 0 communicates with the slave cores through a pointer to a structure.

    Where is the data that is shared between the slave cores and core 0?  Is it in MSM?

    Is there some way that, when these flags are getting set for the slave cores, there is some data corruption?  For example, core 0 is trying to start core 1, then core 2.  Core 0 sets the pointer for core 1 and does an inv/wb to MSM so that core 1 can see the update.  Then core 0 sets the pointer for core 2 and does an inv/wb to MSM so that core 2 can see the update.  However, if these two pointers are not individually 128-byte aligned, the second inv/wb messes up core 1's flag.  Or maybe when core 1 tries to tell core 0 that it is done, it accidentally clears core 2's flag as well, because when core 1's pointer was set to active, core 2's pointer was still zero and that is what was stored in the cache line.

    Do you see what I am getting at?  Can you think of a case where the pointers could get corrupted because you did not invalidate on Core 0 before changing the value?  Or because you did not invalidate on the slave cores before writing the flag etc?  Or because the slave cores have to dump their cache to make room for more active data?

    I am sorry to have no precise advice.  I am trying to think who would be best to tag from TI to give some advice too, maybe RandyP or John Dowdal.  Do you know how to tag people in these posts?

    Brandy