
How to debug _c_int00 overwriting code in L1P?

Other Parts Discussed in Thread: TMS320C6678

This is for the C6678 with CCSv5.1 and compiler version v7.3.5 on Windows.

I am developing a system where individual processing functions are performed on each core of a 6678. The programs in each core communicate using IPC (1.24.2.27). The processing function was written and optimized by another team member to run in L1P/L1D/L2. The IPC code will run in DDR3 and calls the processing function in L1 for every frame.

During early development, I just loaded everything into DDR3 because my concern was with debugging the IPC functions rather than the overall system performance. The code worked correctly that way, just not very fast. Note that I am not using the cache at all because the processing function was written to carefully use L1 and L2.

Now I'm using a linker command file to put the desired processing functions in L1P. The problem is that the L1P data is loaded correctly by the emulator, but is overwritten before main() is executed. I verified this by setting the debug options to not run to main after loading the program, inspecting L1P values, running to main(), and inspecting L1P again.

If I immediately reload the core with the same program and repeat the test then the L1P values are not overwritten!

This feels like some kind of uninitialized variable problem, but I don't know where to start tracking it down.

Thanks,
Fred

  • Fred Brehm said:
    If I immediately reload the core with the same program and repeat the test then the L1P values are not overwritten!

    Then this cannot be a problem with the compiler tools.  If it were, the problem would happen every time.  Thus, we who watch this forum cannot help you.

    It seems more likely that some detail of device configuration is not handled correctly.  I'll move this thread to the C66x Multicore forum.

    Thanks and regards,

    -George

  • Fred,

You say you have the code loaded directly into L1P; do you have the cache turned off?  Also, does your linker command file use the 'global address' for L1P space so that the code is loaded to the individual core's L1P?

It may be better to load the code into the L2 space of the cores and leave L1P set to cache, unless everything will fit in L1P.  Possibly use a mix of the SL2 and L2 space.

    Are you doing these tests on an EVM or your own board?

    Best Regards,

    Chad

  • Fred, by default the L1P on the C6678 is all a cache at power-up.

    For the 32K bytes of code you want in L1P, I think you must do this:

Load the 32K bytes of code into (e.g.) DDR, but do not try to run that code yet.

    Disable the L1P (and make sure BIOS does not re-enable it in BIOS_start, if you are using BIOS)

    Copy the code from DDR to L1P.

Your linker command file probably needs to use the LOAD and RUN directives; see SPRU186V Section 7.5.4.

    Basically you will say something like this:

    .text: load = DDR, run = L1P

    That tells the linker to load it at DDR, but link it as if it would run at L1P. You have to manually copy the code from DDR to L1P after L1P cache is switched off.
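    A minimal C sketch of that copy step, with plain arrays standing in for the two regions. On the real device the addresses and length would come from linker-defined symbols, e.g. LOAD_START()/LOAD_SIZE()/RUN_START() operators on the SECTIONS entry (the symbol and function names below are mine, not from this thread):

    ```c
    #include <string.h>

    /* Stand-ins for the real regions. On the C6678 the load image would sit
     * in DDR3 (or SL2) and the run region would be L1P SRAM at 0x00E00000. */
    static unsigned char text_load[64];  /* section image at its load address */
    static unsigned char text_run[64];   /* section at its run address (L1P) */

    /* Copy a section from its load address to its run address. On hardware,
     * call this before the L1P-resident code is ever executed, and only
     * after L1P has been switched out of cache mode. */
    static void copy_to_run_address(void *run, const void *load, size_t size)
    {
        memcpy(run, load, size);
    }
    ```

    With the linker operators in place (e.g. `LOAD_START(_text_ld), LOAD_SIZE(_text_sz), RUN_START(_text_rn)`), the call would look like `copy_to_run_address(&_text_rn, &_text_ld, (size_t)&_text_sz);`.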

    Document SPRAA46A (Advanced Linker Techniques) may also be useful.

    Regards, Jon

  • Jon,

Good comments, though I wouldn't put it in DDR unless DDR was already configured.  Given the earlier hurdles, it may be best simply to suggest he put it in the Shared L2 space to simplify the things that need to be done.

    Best Regards,

    Chad

  • Hi Chad and Jon,

    Having the cache enabled by default at startup is a surprise to me. I expected that all caches would be off just like on every other processor I have used.

I'm using an RTSC platform to define the memory regions, and I have all caches turned off. The L1 and L2 addresses in the definition are the aliased addresses (e.g., 0xE00000), not the core-specific addresses (e.g., 0x10E00000). I don't know which of those counts as the "global" address. Here's one of the Platform.xdc definitions, edited to reduce the number of lines:

        config ti.platforms.generic.Platform.Instance CPU =
            ti.platforms.generic.Platform.create("CPU", {
                clockRate:      1000,
                catalogName:    "ti.catalog.c6000",
                deviceName:     "TMS320C6678",
            customMemoryMap: [
                ["L1PSRAM", {
                        name: "L1PSRAM",
                        base: 0x00e00000,
                        len: 0x00008000,
                        space: "code",
                        access: "RWX", }],
                ["L1DSRAM", {
                        name: "L1DSRAM",
                        base: 0x00f00000,
                        len: 0x00008000,
                        space: "data",
                        access: "RW", }],
                ["L2SRAM", {
                        name: "L2SRAM",
                        base: 0x00800000,
                        len: 0x00080000,
                        space: "code/data",
                        access: "RWX", }],
                ["DDR3", {
                        name: "DDR3",
                        base: 0x80400000,
                        len: 0x00400000,
                        space: "code/data",
                        access: "RWX", }],
                ["DDR3HEAP", {
                        name: "DDR3HEAP",
                        base: 0x82000000,
                        len: 0x1E000000,
                        space: "data",
                        access: "RW", }],
            ],
              l2Mode:"0k",
              l1PMode:"0k",
              l1DMode:"0k",});
    instance :
        override config string codeMemory  = "DDR3";
        override config string dataMemory  = "DDR3";
        override config string stackMemory = "L2SRAM";

The target board is an EVM6678L that I'm setting up with evmc6678l_xds100.ccxml and the evmc6678l.gel file straight from the MCSDK. The final target is a quad Advantech board where the initialization method is a little different: there's a separate program to load and run over PCIe.

Is there something quick and dirty that I can do with the EVM? Can I add a function in the .gel file to disable the caches? Would that have an adverse effect on something else?

The code is written and optimized to run in L1, with intermediate data in L2 copied from DDR3 using the EDMA. We are processing large video frames, so there's lots of data to be shuffled about. We tried putting the code in DDR3 with the L1P and L1D caches on, but it didn't run fast enough. We can't use L2 as cache because we need most of it for buffers.

    If I need to use the technique that Jon suggests, is there some sample code somewhere to show how to do it? I'm sure that I can figure out the starting address of the code to copy, but how do I figure out the length?

    I have not seen a description of events that happen at startup in any of the documentation, but maybe I have missed it. It would be good to know what all of the various pieces do (boot code, gel file, _c_int00, SYS/BIOS startup) and how they influence the state of various devices and especially the caches. For example, just when do the caches get turned on or off to match the state defined in the platform file? If I had found such a description then I may have been able to save a day and a half of debugging time with this problem.

    Thanks,
    Fred

Technically, the cache is off after PORz (power-on reset), but the BootROM code enables the L1P/L1D caches, since almost everyone uses them.  I'm not sure anyone has run these devices with L1P and L1D cache fully off; it's highly unlikely.  Also, if all the code fits in cache, then it should simply be left on: the code will get cached in and stay there with no penalties.

    You can do as you suggest and modify the GEL file such that you disable the cache on startup.  Then load the code.
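    For reference, a GEL function along these lines could do it (GEL uses C-like syntax; the register addresses are the C66x CorePac L1PCFG/L1DCFG locations as I recall them from the CorePac user guide, so verify them before relying on this):

    ```
    /* GEL sketch: set L1P and L1D cache mode to 0K (all SRAM).
     * Assumed addresses: L1PCFG = 0x01840020, L1DCFG = 0x01840040. */
    Disable_L1_Caches()
    {
        *(unsigned int *)0x01840020 = 0;  /* L1PCFG: mode = 0K cache */
        *(unsigned int *)0x01840040 = 0;  /* L1DCFG: mode = 0K cache */
    }
    ```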

    That said, I think it's a mistake to do so.

Why not simply use the MSMC (Shared L2) space for code?  You have 4 MB of Shared L2 at 0x0C000000 that you're not using, and it is much faster to access than DDR, so it will most likely give you better performance.  And you can partially enable the local L2s as cache; you don't have to have it all on.

    If you can't fit everything (or at least an extremely high percentage of the run time execution >> 95%) into L1P, then turning it off as cache is a big mistake.  If you can put everything into L1P, then the amount of space it consumes is negligible. 

    Can you give me a run down of how much space you need for Data and how much of Code.  You can generate a .map file with the linker (simple option to add) and it will tell you what everything is and where it is.

    Best Regards,

    Chad

  • We have 8 cores running different programs operating on video frames about 1 Mbyte each. The plan is to operate on about 20 independent streams simultaneously. Most of the cores need to operate on multiple sequential frames from each stream to perform their function. The frames and intermediate data must be stored in DDR3.

We use IPC and SYS/BIOS for communicating between the cores (maybe that was a mistake, but it's too late to redo that part). The algorithm code is carefully crafted to use L2 for intermediate data. The EDMA is used to copy data between DDR3 and L2. There's no way we can use L2 for cache, not even part of it.

The executable for every core is different, so the MSMC would need to be partitioned to handle the different processors. Part of MSMC is used by the network stack for buffers. The network will probably not be used on the quad board, but we need it on the EVM, so that will need a partition, too. I have built the executable for every processor using only DDR3. Adding up the sizes, I get over 4.5 MB; adding up only .text gives just over 2 MB, so I can probably fit things in MSMC if I'm really careful.

    I'm still not sure that turning on the L1P cache would do much. I tried that once and it didn't seem to help. I turned it on by setting the L1P mode to 32K in the platform file. Did this actually turn on the cache, or do I have to do something special?
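    For what it's worth, the cache sizes can apparently also be set explicitly in the .cfg file via the BIOS Cache module, which takes the platform file out of the equation (module path and config names below are assumed from the SYS/BIOS docs; verify against your BIOS version):

    ```
    /* .cfg sketch (assumed API): force L1P to 32K cache regardless of the
     * platform-file setting. Applied during BIOS_start(). */
    var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
    Cache.initSize.l1pSize = Cache.L1Size_32K;
    ```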

    Thanks,
    Fred

Based on this, turning off the L1P cache would be extremely detrimental to performance; you'd never get the caching advantage.  I agree that turning off local L2 cache and keeping data there is a good thing, but putting the program in MSMC L2 is probably the smart thing to do.  It should be no more difficult than having it in DDR from a partitioning perspective.

As mentioned earlier, the L1P and L1D caches are on after booting the device, so setting them on again will not change anything.  Please note that caching program stored in MSMC L2 is going to be much more efficient than caching it into L1P from DDR.

    Best Regards,

    Chad

  • Fred, regarding your question "I have not seen a description of events that happen at startup in any of the documentation .... .... what all of the various pieces do (boot code, gel file, _c_int00, SYS/BIOS startup) ", I recommend the following:

    SPRUGY5A (Bootloader User Guide), in particular Section 2.2 "Bootloader Initialization after POR", where it mentions about L1D and L1P being set to all-cache at power-up.

    c_int00 - See the actual code in .../bios_6_32_05_54/packages/ti/bios/support/boot.c. Note that c_int00 calls your own main(), then calls BIOS_start().

    SPRUEX3K (BIOS User's Guide) Section 6.6.1 explains that the cache parameters defined in the platform file are set during BIOS_start().

    I load over PCIe and so don't use a gel file, so I can't comment on that.

    Regards, Jon

    PS Chad, thanks, you are right, DDR can't be used until it is properly configured.