Are shared program images a bad idea?

So, having a mostly-working single-core application, I wanted to start using the same application image across multiple cores on a C6678.  This raises the question of how to configure the memory map.  I am using the MAD tools to create a single Ethernet boot image.  Guided by the MAD documentation, I set this up as an execute-in-place shared program image.  I had days of grief with this - tearing hair out in frustration - and have concluded that it's a very bad idea.  More on that below.

As far as I can see, the options for an application running on more than one core are as follows:

  1. Keep all code and static data in L2SRAM, and use DDR and MSMCSRAM only for heaps.  Not very viable for larger applications.
  2. Use shared code and keep stack and non-shared data in L2SRAM.  Additional data can be allocated using MP-safe heaps.  This is what the Image Processing demo's slave cores do, and that demo is the most "grown up" multicore example provided in the MCSDK.  This is fine as long as your data fits in L2 and/or you're happy with the heap approach.  It has the nice feature that you can just load the image on all cores using the CCS ELF loader without doing anything else.
  3. Build a separate project for each core, relocating code and data to non-overlapping addresses.  This seems a lot of extra work and pain, though you can then use the MAD prelinker-bypass mode so the MAD step is simpler.  It also means you're addressing memory without remapping, which may be simpler to understand.
  4. Use shared code, and have private data arranged using XMC - each core sees its own data at the same virtual address, but different physical RAM.  This is the model the MAD tool examples configure for you (see the sketch just after this list).
  5. Use XMC to make private copies of code and private copies of static data for each core, at the same location.  Again, the MAD tools can arrange this for you.
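
To make the remapping behind options (4) and (5) concrete, here is a rough sketch of the kind of per-core XMC MPAX programming involved.  The segment number, logical address, sizes and 36-bit physical base are made-up values for illustration - they are not the addresses the MAD examples use, and the field layout should be checked against the C66x CorePac documentation for your silicon.

    #include <stdint.h>
    #include <c6x.h>                 /* DNUM control register: this core's number */

    /* XMC MPAX registers sit in each CorePac's local configuration space. */
    #define XMPAX_BASE   0x08000000u
    #define XMPAXL(n)    (*(volatile uint32_t *)(XMPAX_BASE + (n) * 8u))
    #define XMPAXH(n)    (*(volatile uint32_t *)(XMPAX_BASE + (n) * 8u + 4u))

    #define SEG          3u              /* an assumed-free MPAX segment                */
    #define LOGICAL_BASE 0x90000000u     /* address every core's image links against    */
    #define PHYS_BASE    0x820000000ull  /* illustrative 36-bit DDR3 base for core 0    */
    #define CORE_STRIDE  0x01000000ull   /* illustrative 16 MB of private DDR per core  */
    #define SEGSZ_16MB   0x17u           /* size code: segment size = 2^(code+1) bytes  */
    #define PERM_RWX     0x3Fu           /* SR/SW/SX/UR/UW/UX all enabled               */

    /* Point one MPAX segment at this core's private physical window, so the same
       logical address resolves to different physical DDR on every core.          */
    void setup_private_window(void)
    {
        uint64_t phys = PHYS_BASE + (uint64_t)DNUM * CORE_STRIDE;

        XMPAXH(SEG) = (LOGICAL_BASE & 0xFFFFF000u) | SEGSZ_16MB;    /* BADDR | SEGSZ */
        XMPAXL(SEG) = (uint32_t)(((phys >> 12) << 8) | PERM_RWX);   /* RADDR | PERM  */
    }

The same two register writes are what the MAD/NML loader effectively arranges for you; they can just as well be issued from early boot code or from a GEL file, which is the trick I describe further down.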

I was trying to use (4).  The problem I had with (4) and (5) is that you can't just load the app on every core in CCS "out of the box" because you have to configure the memory mapping, and in any case loading apps on multiple cores is a bit tedious.  So I tried loading the MAD image into RAM and jumping into it as described in the MCSDK docs.

This worked as far as it went.  Unfortunately my program didn't work, because of some bugs in my IPC setup (there turned out to be an interrupt conflict, for starters). This meant I wanted to step through a multi-core program.

Loading symbols "the MAD way" reliably crashed CCS 5.2. (I have a separate forum thread about this, which nobody has bitten at yet).

Nevertheless I managed to use DSS to automate the whole thing:  loading the BBLOB image, loading symbols on all cores, jumping into it, and breaking all cores at main().

That's when the weirdness started.  Despite being started by a script, runs would behave in non-repeatable ways.  CIO would sometimes work and sometimes not.  I have stepped through function calls and seen them do nothing, because there was a software breakpoint (SWBP) still lurking there.  (You can see it in the disassembly window.)  It took me a long time to blame the debugger, but others seem to have similar issues, e.g. in this thread:

http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/p/206790/802889.aspx#802889

Even when I set no breakpoints of my own, or limited myself to hardware breakpoints, I had oddities that may have been due to the presence of CIO breakpoints.

The other thing I found difficult was that, in the shared-memory model, a software breakpoint would logically break any core that hits it - which mostly, but not entirely, happened.  I think some of the problems I had were because each breakpoint actually belongs to a single target [core], but it isn't obvious in the default CCS view which core "owns" the breakpoint.  (Once I discovered the "group by debug context" option, things became a lot clearer.)

So in the end I decided I needed option (5), giving each core its own private virtualised copy of the code segment.  This works perfectly, without weirdness, and with the great bonus that you can set per-core breakpoints properly (again, I recommend grouping the breakpoint view as above or this is very confusing).

Another bonus is that all the symbols are exactly where they were compiled to be, so you don't need any of the symbol-relocation malarkey you have with the prelinker.

The downside is that, if you launch the program from a MAD image, it's difficult to break in main().  You have to load symbols for code that doesn't exist yet, set a hardware breakpoint on main(), jump into the MAD image and then probably reload symbols again to get the CIO breakpoints etc. in place once it breaks.  It was all getting to be quite hard work.

In the end, I thought of configuring the XMC mapping I want in GEL, so that I can then "just load" the app using the CCS ELF loader on each core.  This works brilliantly.  However, I don't remember seeing this idea mentioned anywhere, and I wouldn't have known how to do it if I hadn't spent a day stepping through the MAD NML loader trying to debug my MAD config in excruciating detail.  I'm a bit worried I'm hiking off-trail here ("bush-bashing" as we used to say in my youth in Australia).
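
For what it's worth, the mapping boils down to a couple of raw register writes, so it fits in a very small GEL function (GEL uses C-style syntax).  The function name, segment choice and values below are purely illustrative - they reuse the made-up addresses from the sketch above, not my actual memory map:

    /* Hypothetical GEL fragment.  0x0800_0018 / 0x0800_001C are XMPAXL3 / XMPAXH3 of
       whichever CorePac this debug context is attached to; the core number is passed
       in by a per-core wrapper function (or hard-coded).                              */
    setup_private_window(core)
    {
        *(unsigned int *)0x0800001C = 0x90000017;                       /* BADDR 0x9000_0000 | 16 MB segment */
        *(unsigned int *)0x08000018 = 0x8200003F + (core * 0x100000);   /* per-core RADDR | full permissions */
    }

Call something like this (e.g. from OnTargetConnect() or a hotmenu wrapper) before loading, and the ordinary CCS ELF load on each core then lands the private sections in per-core physical memory.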

What do other people do?  Are people using option (2) or (3), or something I haven't thought of?  Do Real Men not need to step through in a debugger?  Is my XMC trick so obvious that you're all rolling your eyes?

Regards,

Gordon

  • Gordon,

    A similar problem is solved by most of the high-level operating systems available on single-core and multicore systems: they implement the same concept using dynamic link libraries.  But the final conclusion is very simple, whatever approach you use to solve it: at most you can share the sections which are read-only (text, rodata, etc.), and you need individual copies of the r/w sections (data, bss, stack, etc.).
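
    For instance, with the C6000 compiler you can steer individual objects into named output sections, so the linker can place the shared read-only material once and the per-core read/write material in a region that resolves to private memory.  A rough illustration - the section names here are invented, not taken from any TI example:

        #include <stdint.h>

        /* Read-only: identical on every core, so a single shared copy is fine.    */
        #pragma DATA_SECTION(lookup_table, ".shared_const")
        const uint16_t lookup_table[256] = { 0 };   /* real contents omitted */

        /* Read/write: every core needs its own copy, so place it in a section the
           linker maps to L2SRAM or to a privately remapped DDR region.            */
        #pragma DATA_SECTION(core_state, ".private_data")
        volatile uint32_t core_state;

        /* .text and other read-only output sections can likewise stay in a single
           shared MSMC/DDR region defined in the linker command file.              */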

  • Renjith,

    I think you're missing my point.  I agree that shared code sections are indeed a simple and lovely idea.  Dynamic link libraries are another potentially nice tool, one TI also supports (but has little documentation about) for SYS/BIOS on this device.

    My point is that when I tried shared code sections, the CCS debugger became erratic and unworkable.  I am trying to understand if I was doing something wrong, or if this is an inherent issue in CCS - and if so, whether the advice should be to avoid shared code sections during application development on those grounds.

    A secondary point is that multicore deployment is still a weak area in TI's offering for this chip.  Within a core there are more mature and documented tools: the compiler and linker toolchain, SYS/BIOS libraries, and IBL bootloader.  These things are wilfully blind to the multi-core side of the device because of their portability.  Once your cores are booted there are IPC, OMP and other multi-core libraries that support various multi-core development strategies.

    Before you get there, the whole issue of memory mapping and how you bootstrap your single core image(s) on 8 cores is essentially covered by the MAD tools only, for which the documentation is very thin and the guidance almost nil.  MAD is ingenious but has poor error messages, duplicates information that is already unhelpfully split (in RTSC) between the platform and cfg files, and does not integrate well with the debugger. 

    Again, I'm wondering if I'm missing something . . .

    Gordon

  • Gordon,

    The problem is really interesting :)

    I've not worked on the Keystone architecture.  But can we check whether CCS instability is the only issue and whether shared code sections work fine?  Have you tested this code standalone, without CCS?

  • As far as I know, the MAD loader worked correctly with shared code segments, once properly configured. 

    Configuring it was a problem, as I didn't originally understand the relationship between the physical and virtual addresses in the examples, and ended up with a mapping that crashed the loader.  So I'd have preferred it if the MAD tools said "hey, you know, the in-place image really shouldn't overlap any of your other mappings..." - but I did step through quite a lot of it to debug that configuration and eventually sorted it.  I was also able to run a two-core test application successfully from Flash with it.

    So, the upshot is: I believe a shared-code application not run through the debugger would work, yes.

    My main application IPC code was not working because of my own bugs, which is why I wanted to use the debugger in the first place.

    Now that I have fixed the bugs and separated the code segments, I could in principle build with a shared program image and try again but I'm already behind schedule so will not be doing that.  We have a lot more DDR on our board than I actually need, so separate code images are not actually a problem in this application.

  • Gordon,

    I understood your point.  If there is that much memory available, I don't think all this circus is necessary; keeping separate images will be better.  Can you mark this post as answered, if you think so?

  • Well, I posted partly to see if anyone else using Keystone devices wanted to share their experiences, and partly to see if TI want to comment, so I don't think you've answered either of those yet.

    It's true that in hindsight there was no particular need for me to have a shared program image, but I would also say that the limited stock of MAD examples is strongly oriented towards explaining how you can have a nifty shared execute-in-place image, which amounts to a sort of de facto guidance.

  • Hi Gordon,

    I totally agree with your experience of the examples and the MAD tools.  The mechanism is much too complicated for us, because we have a lot of customers developing their applications using our hardware.

    We decided to create our own bootloader with some limitations:

    • We only support one ELF binary which runs on every core.
    • No pre-linking is required; we can use the compiled binary as it is.
    • The XMC memory mapping is used to place private memory sections to different physical memory locations for each core.
    • The configuration of the memory mapping is stored as exported symbols within the ELF image.  We only have to define the base address and the length for each region in the linker command file.
    • The bootloader reads these symbols, copies each private memory region to a different physical address and sets up the memory mapping for each core (a rough sketch of the idea follows this list).
    • A .gel script is used to emulate the behaviour of the bootloader within CCS.
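
    If it helps to picture the scheme, here is a very rough sketch - all of the symbol names, the section arrangement and the helper function are invented for illustration, not our actual convention:

        /* Hypothetical linker command file fragment defining the exported symbols:
         *
         *     _private_load_start = ...;   start of the load-time master copy
         *     _private_run_start  = ...;   logical run address every core links against
         *     _private_size       = ...;   length of the private region in bytes
         */

        #include <stdint.h>
        #include <string.h>

        /* Linker-exported symbols; only their addresses carry information. */
        extern uint8_t _private_load_start;
        extern uint8_t _private_run_start;
        extern uint8_t _private_size;

        void setup_private_window(void);   /* per-core XMC/MPAX remap, as sketched earlier in the thread */

        /* One possible arrangement: each core remaps its private window, then
           populates it from the master copy held in the shared image.          */
        void boot_private_sections(void)
        {
            setup_private_window();
            memcpy(&_private_run_start, &_private_load_start, (size_t)&_private_size);
        }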

    For debugging, it's also useful to place code sections into private memory, because of the software breakpoint issue which you mentioned.

    Ralf