How to configure L2 for both cache and data?

Marko Moberg

Other Parts Discussed in Thread: SYSBIOS, 66AK2H12

I am using a Keystone II device with a setup that currently uses only L1D and L1P as cache and L2 as SRAM. This setup is fully functional and IPC with MessageQ is working fine between ARM and DSP cores.

I am now trying to enable cache on L2 as well, but changing the platform configuration (L2 cache from 0k to something else) seems to break everything. The software compiles and links correctly, but Ipc_start() fails with -1 and /tmp/LAD/lad.txt contains the following messages:

Initializing LAD...
opening FIFO: /tmp/LAD/LADCMDS
Retrieving command...

LAD_CONNECT:
client FIFO name = /tmp/LAD/2253
client PID = 2253
assigned client handle = 0
FIFO /tmp/LAD/2253 created
FIFO /tmp/LAD/2253 opened for writing
sent response
DONE
Retrieving command...
LAD_MULTIPROC_GETCONFIG: calling MultiProc_getConfig()...
MultiProc_getConfig() - 9 procs
Proc 0 - "HOST"
Proc 1 - "CORE0"
Proc 2 - "CORE1"
Proc 3 - "CORE2"
Proc 4 - "CORE3"
Proc 5 - "CORE4"
Proc 6 - "CORE5"
Proc 7 - "CORE6"
Proc 8 - "CORE7"
status = 0
DONE
Sending response...
Retrieving command...
LAD_NAMESERVER_SETUP: calling NameServer_setup()...
NameServer_setup: entered, refCount=0
NameServer_setup: created send socket: 5
NameServer_setup: connect failed: 22, Invalid argument
closing send socket: 5
NameServer_setup: created recv socket: 5
NameServer_setup: created send socket: 6
NameServer_setup: connect failed: 22, Invalid argument
closing send socket: 6
NameServer_setup: created recv socket: 6
NameServer_setup: created send socket: 7
NameServer_setup: connect failed: 22, Invalid argument
closing send socket: 7
NameServer_setup: created recv socket: 7
NameServer_setup: created send socket: 8
NameServer_setup: connect failed: 22, Invalid argument

...

The DSP cache user guide (http://www.ti.com/lit/ug/sprugy8/sprugy8.pdf) says:

Note—Do not define memory that is to be used or boots up as cache under the
MEMORY directive. This memory is not valid for the linker to place code or
data in. If L1D SRAM and/or L1P SRAM is to be used, it must first be made
available by reducing the cache size. Data or code must be linked into L2 SRAM
or external memory and then copied to L1 at run-time.

Does this also mean that if I have configured a part of L2 as cache, e.g. 128kB, I cannot link code or data sections into L2? If so, that might be the reason for the failure, since I have mapped msgq_heap to L2SRAM in the sysbios app.cfg file.

regards,

Marko

over 10 years ago

0 Chad Courtney over 10 years ago

Marko,

The reason for the Note you quoted from the DSP Cache UG is that L1D and L1P are configured on power-up to be all cache, so they must be reconfigured before any of that memory is available as SRAM.

As for the L2 space, as long as it's not being used, it can be reconfigured as cache. You can configure L2 to be 32K, 64K, 128K, 256K or 512K (all) bytes of cache. For 32K - 256K, the beginning 512K - X bytes, where X is the configured cache size, remain SRAM (i.e. SRAM starts at 0x0080 0000 and the cache takes the last X bytes). So as long as you're not using the last X bytes of the L2 space, you can use them as cache.
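The SRAM/cache split described above comes down to a subtraction. The following is a minimal sketch, assuming the cache is taken from the end of the local L2 range and that the total L2 size is passed in (512K in the general description above, while the 66AK2H12 map file posted later in this thread shows 0x100000 bytes). The helper names are hypothetical and not part of any TI API.

```c
#include <assert.h>
#include <stdint.h>

#define L2_BASE 0x00800000u  /* local L2 base address quoted in this thread */

/* Bytes of L2 that remain SRAM once cache_bytes are configured as cache. */
static uint32_t l2_sram_len(uint32_t total_bytes, uint32_t cache_bytes)
{
    assert(cache_bytes <= total_bytes);
    return total_bytes - cache_bytes;
}

/* First address of the cache region, i.e. one past the last SRAM byte.
 * Assumes the cache occupies the end of the L2 address range. */
static uint32_t l2_cache_start(uint32_t total_bytes, uint32_t cache_bytes)
{
    return L2_BASE + l2_sram_len(total_bytes, cache_bytes);
}
```

For a 0x100000-byte L2 with 128K of cache, l2_sram_len() gives 0xe0000, which matches the L2SRAM length in the linker command file posted later in this thread.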

That said, if you're making cross-core accesses to other cores' L2 space, then you need to maintain coherence manually with cache operations.

We'll need more details about what issues you're observing and how you have the device configured. Please also indicate the device number you're using.

Best Regards,
Chad

0 Marko Moberg over 10 years ago in reply to Chad Courtney

Hi Chad,

Here is some additional info for you. The device I am using is 66AK2H12 (4 ARM cores, 8 DSP cores). By default the platform configuration defines 0k of L2 cache, 32kB of L1P cache and 32kB of L1D cache. I created a new platform configuration file (using Tools - RTSC Tools - Platform) where I defined 128kB of L2 cache while keeping all the other settings untouched. The outcome of this change was that Ipc_start() fails with the above-mentioned messages.

The linker seems to be handling L2 SRAM properly and I cannot see anything overlapping with 128kB cache region. MessageQ vring buffers are using DDR3 and cache is disabled for the corresponding area. Everything worked perfectly without L2 cache enabled but now I just don't have a clue what is going wrong here.

linker.cmd file looks like this:

MEMORY
{
L2SRAM (RWX) : org = 0x800000, len = 0xe0000
MSMCSRAM (RWX) : org = 0xc000000, len = 0x600000
DDR3 (RWX) : org = 0x80000000, len = 0x80000000
}

I can move program data from L2 to MSMC and everything works as long as I don't enable L2 cache.

With L2 cache enabled the DSP SW seems to be running ok. At least it is sitting in BIOS idle loop waiting for messages which never come. I would like to know what Ipc_start() function executed by ARM is actually checking? Is DSP involved at that point or is Ipc_start() just using ARM? Does Ipc_start() access anything from DSP L2? I would really like to understand the reason why Ipc_start() is failing.

regards,

Marko

0 Chad Courtney over 10 years ago in reply to Marko Moberg

Marko,

This seems to be a BIOS-related issue. I'm going to move this to the RTOS forum; they should be able to provide better support on the BIOS functions and why you may be experiencing these issues.

Best Regards,
Chad

0 Robert Tivy over 10 years ago in reply to Marko Moberg

Marko Moberg said:
With L2 cache enabled the DSP SW seems to be running ok. At least it is sitting in BIOS idle loop waiting for messages which never come

Have you checked for DSP LOG or System_printf() output? I suspect that the DSP has not completed the needed setup correctly and there may be a message telling you so. Since you say BIOS seems to be running well in the idle loop, any setup failure would have to be non-aborting, or else BIOS would vector to System_abort().

IPC for Keystone tries to use just L2 SRAM for code and data and only use DDR for vrings (as you pointed out), and it's been a tight squeeze. Perhaps the reduction in L2 SRAM has caused a BIOS heap to be reduced in size (if it is a heap that is defined with "use the rest of this memory as heap" type of logic), and now there's not enough for things such as dynamic Task creation.

Marko Moberg said:
I would like to know what Ipc_start() function executed by ARM is actually checking?

In the case of your LAD failure as reflected in your lad.txt LAD output, Ipc_start() is setting up the NameServer, and as part of NameServer_setup() a pair of AF_RPMSG sockets are created by the LAD daemon, one for sending to the DSP and one for receiving from the DSP. Your receive sockets are succeeding (which indicates that the socket "binds" are succeeding), but your send sockets are failing to connect to the DSP (the socket "connect" is failing). I don't know why one would fail and not the other, since both a "connect" and a "bind" need cooperation from the remote core.

Marko Moberg said:
Is DSP involved at that point or is Ipc_start() just using ARM?

As implied in my above paragraph, the DSP is involved in a successful Ipc_start().

Since the one variable here is a different L2 SRAM size (and the presence of an L2 cache, which I wouldn't expect to be a problem since the L1 caches have always been in play), I can only guess that the issue lies with the reduced L2 size, which would affect only dynamically-created things since statically-created things would fail the link if there wasn't enough L2 SRAM.

Regards,

- Rob

0 Marko Moberg over 10 years ago in reply to Robert Tivy

Hi Rob,

I should have plenty of "spare" space in L2 for at least 32kB cache. The current .map file looks like this:

name                    origin    length     used      unused    attr  fill
----------------------  --------  ---------  --------  --------  ----  --------
L2SRAM                  00800000  00100000   000e0b71  0001f48f  RW X
MSMCSRAM                0c000000  00600000   0048dec0  00172140  RW X
DDR3                    80000000  80000000   00000000  80000000  RW X

I seem to be getting the same Ipc_start() failure in the following two cases:

1) define 32k L2 cache in RTSC platform configuration (and 32k L1P and 32k L1D)

2) define 0k L2 cache in RTSC platform configuration (and 32k L1P and 32k L1D) but enable L2 cache in program code as follows:

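/* Cache_Size, Cache_setSize() and Cache_enable() come from the SYS/BIOS C66x cache module (ti/sysbios/family/c66/Cache.h). */
Cache_Size cache_size;

cache_size.l1pSize = Cache_L1Size_32K;  /* keep 32 KB of L1P cache */
cache_size.l1dSize = Cache_L1Size_32K;  /* keep 32 KB of L1D cache */
cache_size.l2Size = Cache_L2Size_32K;   /* configure 32 KB of L2 as cache */
Cache_setSize(&cache_size);
Cache_enable(Cache_Type_L2D);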

To me it seems that there is space available in L2SRAM but enabling cache makes IPC go crazy. Do I need to do something special on ARM side when I am enabling DSP L2 cache (for example regarding writeback/invalidate)?

The DSP debug trace shows the following:

# cat /debug/remoteproc/remoteproc0/trace0
3 Resource entries at 0x800000
Add heap handle:@008dc130 size: 524288 add:800118
Add heap handle:@00800158 size: 67108864 add:e0000000
DspFunction::registerFunction symbol: sumvoid function pointer:@008008e0
DspFunction::registerFunction symbol: sum function pointer:@00800950
Enter main() on Core 1
registering rpmsg-proto service on 61 with HOST
<< D L O A D >> DLIF_trace: dloaderTask IN
<< D L O A D >> DLIF_trace: dloaderTask handle @00803d88
<< D L O A D >> DLIF_trace: dloaderTask DLOAD_initialize
dspFunctionDispatcher() IN

The printouts from the version with the L2 cache problem stop here. The functional version continues with the following messages:

<< D L O A D >> DLIF_trace: loadBaseImageSymbols fd = 47
<< D L O A D >> DLIF_trace:
Reading dyn_module->dyntab add:@00804220 size:192

I assumed that enabling L2 cache would have been an easy task but apparently it is not. I don't have a clue what's causing the problem here. Any help would be greatly appreciated.

Regards,

Marko

0 Robert Tivy over 10 years ago in reply to Marko Moberg

Marko Moberg said:

name                    origin    length     used      unused    attr  fill
----------------------  --------  ---------  --------  --------  ----  --------
L2SRAM                  00800000  00100000   000e0b71  0001f48f  RW X

Your previous posts mention 128K L2 cache, in which case you wouldn't have enough L2SRAM left over to hold the 0x000e0b71 bytes that are "used" above (since 0x00100000 - 128K = 0x000e0000). But you mention 32K cache in this post, which should leave enough RAM.
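The arithmetic behind this check can be written out as a short sketch. It assumes the "length" and "used" columns of the map file quoted above and treats the L2 cache size as the only thing taken away from L2SRAM. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Spare L2SRAM bytes after reserving cache_bytes for L2 cache.
 * A negative result means the linked image no longer fits. */
static int64_t l2_headroom(uint32_t total_bytes, uint32_t cache_bytes,
                           uint32_t used_bytes)
{
    assert(cache_bytes <= total_bytes);
    return (int64_t)total_bytes - (int64_t)cache_bytes - (int64_t)used_bytes;
}
```

With total 0x00100000 and used 0x000e0b71, a 128K cache gives -0xb71 (2929 bytes short), while a 32K cache leaves 0x1748f bytes spare.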

Marko Moberg said:

To me it seems that there is space available in L2SRAM but enabling cache makes IPC go crazy. Do I need to do something special on ARM side when I am enabling DSP L2 cache (for example regarding writeback/invalidate)?

You shouldn't need to do anything with cache maintenance operations unless some "shared" memory is being handled, but you state that your DDR vring area is marked as non-cached (using the MAR registers, I presume).

This leads me to ask "why do you want to enable L2 cache?" I see you're using MSMC, does that get cached by L2?

Since you're using DDR for just the vrings, and since you have that area marked as non-cached, I don't see the benefit of having an L2 cache.

Marko Moberg said:

<< D L O A D >> DLIF_trace: dloaderTask IN
<< D L O A D >> DLIF_trace: dloaderTask handle @00803d88
<< D L O A D >> DLIF_trace: dloaderTask DLOAD_initialize

What's this DLOAD stuff on the DSP? What are you loading?

This seems suspicious from an IPC point of view.

Regards,

- Rob

0 Marko Moberg over 10 years ago in reply to Robert Tivy

In addition to the vring area we have a shared heap in DDR3. We use that memory area for transferring large data buffers between ARM and DSP. We are seeing very poor DSP performance when we run the same algorithm on DSP and ARM. So I was thinking of increasing the cache size from 32k L1 data cache to something larger on L2.

Originally MSMC was used just for dynamically loaded libraries but now I have also linked some static program code there to make room in L2. I don't think I have changed the MAR settings for MSMC so I am assuming that it is being cached. Then again, we have just program code there.

DLOAD comes from the dynamic loader which is based on TI's reference implementation of dynamic loader (http://processors.wiki.ti.com/index.php/C6000_Dynamic_Loader). There is a separate BIOS task which dynamically loads DSP libraries. Libraries are stored in the file system managed by ARM and they are accessed using some simple API calls from DSP.

Anyway, I still don't understand what is failing within Ipc_start() and how it can be related to L2 cache settings.

Marko

0 Robert Tivy over 10 years ago in reply to Marko Moberg

Marko Moberg said:

DLOAD comes from the dynamic loader which is based on TI's reference implementation of dynamic loader (http://processors.wiki.ti.com/index.php/C6000_Dynamic_Loader). There is a separate BIOS task which dynamically loads DSP libraries. Libraries are stored in the file system managed by ARM and they are accessed using some simple API calls from DSP.

My suspicion of the culprit is definitely with DLOADing (not the reference implementation itself, but with the usage of it). This suspicion is based on the fact that the failure case stops right in the middle of a sequence of DLOAD operations.

Are you invalidating L2 cache before performing the DLOAD loading? I assume you're invalidating L1P and/or L1D already.

Are you ensuring that the libraries aren't DLOADed to the L2 area that is now cache?

Marko Moberg said:

In addition to vring area we have a shared heap in DDR3. We use that memory area for transferring large data buffers between ARM and DSP. We are seeing a very poor DSP performance when we run the same algorithm on DSP and ARM. So I was thinking of increasing the cache size from 32k L1 data cache to something larger on L2.

I assume that you're reading/writing this shared data more than once on the DSP, since you won't get any bump in performance if you're just reading the data one time.

Regards,

- Rob

0 Marko Moberg over 10 years ago in reply to Robert Tivy

The original setting in app.cfg was Cache.setMarMeta(0xA0000000, 0x1FFFFFF, 0);

So this is setting the DDR3 area of 0xA0000000 -> 0xA2000000 as non-cacheable. This should cover the entirety of the vring structures and buffers (including alignment holes).

And since that didn't work I changed it to Cache.setMarMeta(0x90000000, 0x1FFFFFFF, 0);

So this is setting the DDR3 area of 0x90000000 -> 0xB0000000 as non-cacheable. Something in the range 0x90000000 -> 0xA0000000 or 0xA2000000 -> 0xB0000000 must be getting cached in the "bad" case and is no longer getting cached with this setting (in the "good" case).
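Each C66x MAR register controls cacheability for one 16 MB block of the address space, so the two Cache.setMarMeta() ranges above can be turned into MAR indices with a shift. The following is a minimal sketch, assuming the range covers base through base + size - 1. The helper names are hypothetical and not part of SYS/BIOS.

```c
#include <assert.h>
#include <stdint.h>

#define MAR_SHIFT 24u  /* each MAR covers 2^24 bytes = 16 MB */

/* Index of the MAR register that covers the first byte of the range. */
static unsigned mar_first(uint32_t base)
{
    return (unsigned)(base >> MAR_SHIFT);
}

/* Index of the MAR register that covers the last byte of the range. */
static unsigned mar_last(uint32_t base, uint32_t size)
{
    assert(size > 0u);
    uint64_t last_byte = (uint64_t)base + size - 1u;  /* 64-bit to avoid wrap */
    assert(last_byte <= 0xFFFFFFFFull);
    return (unsigned)(last_byte >> MAR_SHIFT);
}
```

With the values from this post, 0xA0000000 with size 0x1FFFFFF maps to MAR160 through MAR161, and 0x90000000 with size 0x1FFFFFFF maps to MAR144 through MAR175.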

CCS offers access to HW breakpoints on many devices, and I wonder if the DSPs on Keystone II have the capability of setting a HW breakpoint for any read or write to those areas? As far as I can tell, the DSP should not be accessing anything in those areas.

So the non-cached DDR3 region starts now earlier and extends a bit further. I am sure the range could be smaller but I didn't play with the values that much. Our MPAX is defined so that all the references to range 0x8000 0000-0xffff ffff are actually mapped to DDR3A starting from 0x08 0000 0000. We have a shared heap between ARM and DSP at 0xc000 0000 - 0xdfff ffff (physical cmem reservation) which should actually be reserved at 0x8 4000 0000 since ARM has LPAE enabled. The region from 0xe000 0000 to 0xffff ffff (0x08 6000 0000 - 0x08 7fff ffff) is for DSP heaps.

CMEM was modified to handle LPAE extended addresses. Are you specifying the phys_start/phys_end parameters of cmemk.ko with the 32-bit aliased address or the 36-bit real LPAE physical address?

Perhaps you can try insmod/modprobe'ing cmemk.ko with the address form that you're currently not doing (while reverting to the "bad" MAR settings).

In summary the DDR3A memory map is as follows:

0x8000 0000 - 0xbfff ffff ARM Linux
0xc000 0000 - 0xdfff ffff ARM+DSP Shared heap
0xe000 0000 - 0xffff ffff DSP heaps

And the same with "real" DDR3A physical addresses:

0x08 0000 0000 - 0x08 3fff ffff ARM Linux
0x08 4000 0000 - 0x08 5fff ffff ARM+DSP Shared heap
0x08 6000 0000 - 0x08 7fff ffff DSP heaps
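The relationship between the two tables above is a fixed offset. The following is a minimal sketch, assuming the MPAX mapping described in this post (32-bit range 0x8000 0000 - 0xffff ffff mapped onto DDR3A starting at 0x08 0000 0000). The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ALIAS_BASE 0x80000000u     /* start of the 32-bit aliased DDR3 window */
#define PHYS_BASE  0x800000000ull  /* start of DDR3A in the 36-bit address space */

/* Translate a 32-bit aliased DDR3 address to its 36-bit physical address. */
static uint64_t alias_to_phys(uint32_t alias)
{
    assert(alias >= ALIAS_BASE);
    return PHYS_BASE + (uint64_t)(alias - ALIAS_BASE);
}
```

For example, 0xc0000000 translates to 0x840000000 and 0xe0000000 to 0x860000000, which matches the boundaries of the shared heap and DSP heap regions listed above.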

I suppose it would not hurt to disable DSP cache for the entire region from 0x08 0000 0000 to 0x08 3fff ffff since that memory is used only by ARM apart from those vring message buffers.

Sounds like a decent idea.

I am assuming that MAR setting are only affecting DSP cache behavior, not ARM cache.

Yup, that's correct.

Regards,

- Rob

0 Marko Moberg over 10 years ago in reply to Marko Moberg

> Are you specifying the phys_start/phys_end parameters of cmemk.ko with the 32-bit aliased address or the 36-bit real LPAE physical address?

I am using 32-bit aliased addresses with cmemk.ko. i.e. phys_start=0xc000 0000, phys_end=0xdfff ff00.

Marko

0 Robert Tivy over 10 years ago in reply to Marko Moberg

Marko Moberg said:
I am using 32-bit aliased addresses with cmemk.ko. i.e. phys_start=0xc000 0000, phys_end=0xdfff ff00.

It would be interesting to see if using the 36-bit real phys addr helps:
% modprobe cmemk phys_start=0x840000000 phys_end=0x860000000 ...

This is more guidance for CMEM usage than it is a possible solution to your problem.

For the benefit of others reading this thread...

This thread got taken offline due to difficulties with posting to the Forum. We found a workaround for those posting difficulties, hence Marko posted my offline response for him to the Forum, and then responded to "my" post (which was really posted by Marko).

Marko found out that he could overcome the DSP application difficulties by extending the DDR3 area that's configured as non-cacheable from the DSP, using the DSP's MAR registers. Originally the MARs for 0xA0000000 -> 0xA2000000 were being configured as non-cacheable but the DSP application was crashing or not proceeding correctly. By extending the MAR region (somewhat arbitrarily) to 0x90000000 -> 0xB0000000 the application was able to proceed and LAD was able to start and connect to the DSP. Now we're trying to figure out *why* this extended non-cacheable area is making it work, since there shouldn't be any accesses to DDR3 from the DSP *except* for the vrings area, which lies completely inside the original non-cacheable area of 0xA0000000 -> 0xA2000000.

Regards,

- Rob
