This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

66AK2H12: memcpy() and cache_writeback integration

Part Number: 66AK2H12


Hello all,

For example, I want to copy a block of memory using memcpy() from MSMC to DDR or vice versa. L2 (256KB) and L1 (32KB) are enabled as caches. When I transfer, say, 512KB, the L2 cache holds some values that have not been written back. When I do a Cache_wb it operates on the whole L2 cache and decreases performance; I want only the dirty data to be written back. The cache_wb function that takes a pointer argument does not work as expected, since for such a large transfer the pointed-to address is no longer resident in the cache.

What is the proper way of transferring blocks of memory considering cache?

  • Hi,

    You can use:

    void cache_writeback (void *ptr, int size)
    {
    #ifdef _TMS320C6X
        uint32_t key;

        /* Disable interrupts so the writeback sequence is not interrupted */
        key = _disable_interrupts();

        /* Write back L1D first, then L2, waiting for completion */
        CACHE_wbL1d (ptr, size, CACHE_FENCE_WAIT);
        CACHE_wbL2 (ptr, size, CACHE_FENCE_WAIT);

        /* Re-enable interrupts */
        _restore_interrupts(key);
    #endif
    }

    Here the pointer can be a DDR or MSMC address, and for the size you can use 32768 (32KB).

    If you want higher performance, please consider using EDMA for transfer.

    Regards, Eric 

  • What happens if I provide a pointer that is not in the cache? How does the function behave?

  • Hi,

    Your pointer points to an address in MSMC, DDR or L2, correct?

    For MSMC: 

    MSMC SRAM can serve as a Shared Level 2 or Level 3 memory:
    • Shared Level 2 memory: the MSMC memory is cacheable by the L1D and L1P caches; L2 will not cache requests to MSMC SRAM.
    • Shared Level 3 memory: the MSMC memory is not directly cacheable at the L2, but is cacheable in L1D and L1P. However, if it is remapped to an external address using the address-extension capabilities in the C66x CorePac MPAX, the MSMC memory can be cached as a shared L3 memory in both the L1 and L2 caches. To achieve this, caching must be enabled in the MAR registers (using the MAR.PC bit) for the remapped region.

    For DDR3: it is cached into L2.

    So the cache function above, void cache_writeback (void *ptr, int size), operating on L1D and L2, serves this purpose.

    Regards, Eric

  • Yes it is correct.

    So I only need cache_writeback when I do L2 to DDR transfers.

    For DDR to L2, MSMC to L2 and L2 to MSMC transfers I don't need to call cache_writeback.

    My aim is to measure the bandwidth changes when the MDMA priorities of the 8 cores are changed concurrently. I want to write a program for this, but I am not sure how to structure it to get the best results.

    Thank you.

  • Hi,

    I thought you asked the same question earlier, https://e2e.ti.com/support/processors/f/791/t/865597. What was unresolved?

    Regards, Eric

  • Actually, those parts are clear. In addition to that question, there is a huge throughput difference between L2 to DDR transfers and DDR to L2 transfers: L2 to DDR achieves nearly half the throughput that DDR to L2 can get. Do you have any comment on that?

    As a first thought, I think this is because of the cache policies. In the DDR to L2 case, only the DDR values are cached in L1D and the L2 cache. But in the L2_ram to DDR case, L2_ram is cached in L1D and the DDR values are cached in the L2 cache. So for L2_ram to DDR I also have to do a writeback to measure the throughput correctly.

    To summarize, I am thinking of a new project in which I can truly stress the DDR concurrently from all the DSP cores. Also, testing the DDR with sample code like "*ddr_ptr = constant_value; ddr_ptr++" may give a better understanding of the priority assignment of the different cores.

    Thank you.

  • Hi,

    We don't have CPU benchmarking numbers between L2 and DDR3. To stress the DDR you have to use 3-4 EDMA channels in parallel, not the CPU.

    But from the thread, your goal is to use 8 CPUs with different bus priorities for the test. I am not sure how your code is written (memcpy()? pointer operations?), where the measurement starts and ends, and whether there are any cache operations in between, so I can't comment. You need to provide the code (e.g. a CCS project) if further study is needed.

    Regards, Eric

  • I will provide a pseudo code/flow for simplicity:

    - All cores set their priority with CSL_XMC_setMDMAPriority(num);

    - A master core sets a bit in shared memory;

    - Other cores poll this bit and loop until it is set;

    - Cores start their timers;

    - Cores make concurrent memcpy() from various locations (DDR to L2, L2 to DDR etc.) for a specified payload;

    - Cores call cache_writeback if necessary;

    - Cores stop the timer and measure the throughput.

  • Hi,

    Thanks for the explanation! I have other priorities and it will take time for me to write such test cases. Please upload a CCS project showing the issue so I can try to debug it.

    Regards, Eric