MCU-PLUS-SDK-AM243X: Understand LTS-compiler, CCS and Bootloader in combination.

Part Number: MCU-PLUS-SDK-AM243X


Hello,

Currently we are facing a bug that we need to debug in our bootloader and in our application. The bug only occurs in release builds.
We are using SDK 08.06 and the LTS 2.1.3 compiler. We build the SDK libraries as static libs (a debug and a release variant of each) before we link them against our bootloader and application.

We use sbl_ospi as a base, but we wrote our own complete bootloader in C++.

1.

We noticed that the tiarm compiler seems to produce data at address 0x00000000 even though it is not shown in the map file.

For example, all vectors are normally placed in the internal SRAM for the bootloader:



But somehow the same values also appear at 0x00000000:

Since we face some weird aborts, I suspected that something accesses address 0. I set a watchpoint there, and this happened:


The soc-firewall-open code tries to write a magic word there. Is this intended? This address is normally in TCMA; in our case we place executable code there and configure the MPU so that the region is not writable. Since this is not the case for the bootloader it may be OK, but simply writing something to address 0 seems really suspicious and unwanted.

The bootloader is, of course, compiled with the vectors for the SBL.

2.

If I connect CCS to the AM243 and single-step through the code, I cannot reproduce the abort. It only happens without a CCS connection, or when I let the application run freely at some point. It seems that a reference (C++ code) to valid memory somehow gets nulled in the middle of the process. The code then either aborts when it accesses the reference or runs into an undefined handler; which of the two happens appears to be random. But the reference stays correct while I am single-stepping!

Do you know what could produce such behaviour? Why does CCS have an impact on the executed code, and what exactly does it do? This also happens when no breakpoints or anything else are set, so there are no SW breakpoints that could modify the execution.

I also ensured the stack is big enough (set to 64 KB).

And a debug build works just fine, as it should! That's the most mysterious part here. We also noticed this abort in a release build of our application, not only in the bootloader, and it seems to occur at the same, or at least around the same, source-code location. Application and bootloader share the same flash driver (we wrote our own, since the SDK 08.06 flash drivers did not work with our ISSI flash). The driver has a layer that translates a sleep, e.g. when waiting for the flash to become ready, into the FreeRTOS taskDelay in one case and into ClockP_usleep in the other. We sleep instead of looping in a while loop because tasks that used the flash driver blocked lower-priority tasks during that time.
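The sleep layer described above can be sketched roughly as follows. This is only an illustration, not our actual driver: `flash_sleep_fn`, `flash_set_sleep`, `flash_wait_ready`, `fake_sleep` and `fake_ready` are hypothetical names. On the target, the installed hook would presumably wrap the FreeRTOS task delay in the application and ClockP_usleep in the bootloader; here host-side stand-ins are used so the sketch builds without the SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook type: sleeps (or yields) for roughly `usec` microseconds. */
typedef void (*flash_sleep_fn)(uint32_t usec);
/* Hypothetical hook type: returns non-zero once the flash reports ready. */
typedef int (*flash_ready_fn)(void);

static flash_sleep_fn g_flash_sleep = NULL;

/* Each image (application or bootloader) installs its own sleep backend. */
void flash_set_sleep(flash_sleep_fn fn)
{
    g_flash_sleep = fn;
}

/* Poll until ready, sleeping between polls instead of busy-looping.
 * Returns 0 when ready, -1 after max_polls unsuccessful polls. */
int flash_wait_ready(flash_ready_fn is_ready, uint32_t poll_usec,
                     uint32_t max_polls)
{
    assert(is_ready != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (is_ready()) {
            return 0;
        }
        if (g_flash_sleep != NULL) {
            g_flash_sleep(poll_usec);
        }
    }
    return -1;
}

/* Host-side stand-ins (assumptions, for illustration only). */
static uint32_t g_slept_usec = 0;
static int g_polls_until_ready = 3;

void fake_sleep(uint32_t usec)
{
    g_slept_usec += usec; /* record the requested delay instead of waiting */
}

int fake_ready(void)
{
    return g_polls_until_ready-- <= 0; /* ready on the fourth poll */
}
```

With such a layer the driver code itself stays identical in both images; only the hook passed to `flash_set_sleep` differs.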

We thought it might be related to optimizations. Then we noticed that, no matter which optimization level we used for our own code, the release build of the SDK drivers lib seemed to be the issue: we could link our release application (not the bootloader here) against the debug drivers lib and everything worked. We could use -Oz, -Os or -Og; as long as the drivers lib was built in release (with the options from the SDK makefiles), it crashed.

Unfortunately, this got way more complex: after a simple change to a few lines of code in our application, right before the calls into the drivers lib, it worked again with the release drivers lib. The relevant part was the call to OSPI_readCmd. All I did was change some of the read params to more suitable values. But I cannot see how this is linked to the abort; it doesn't make sense. Also, the abort inside the bootloader still exists, but at least we get our application running.

So our current state is:

working compilations:
- Bootloader and application in debug (-Og, -g) with drivers-lib in debug
- Application in release (-Oz) with drivers-lib in release

not working compilations:
- bootloader in release (-Oz) with drivers-lib in release

For better debugging we also enabled -g together with -Oz, but the behaviour is the same with or without -g.

Is there a list of suggestions or some guidance on how I can debug such a mysterious error? Maybe there are some tricks I don't know, because I think opening a debug session should only be the last option if I really cannot proceed here.

Best regards

Felix

PS: I put both topics here since it seems they could be related but I am not sure. If needed we can split them up.

  • Hi,

    The subject matter expert is currently out of office. Please wait until Monday, 11th Sept, for him to be back in the office.

    Best Regards,
    Aakash

  • Hey Aakash,

    All right. We also think that we found the issue.
    I will try to summarize our findings. Sorry for the edits: I updated this thread with my latest findings and changed some references to pointers to make debugging a bit easier.

    I dug deeper and noticed that something on the stack gets corrupted, which changes this particular address. This happens twice. The screenshot shows the first corruption; the second time, 1F becomes 10.


    And I think I found the place which produces this error. It is in OSPI_readCmd:

    The memcpy right here overwrites the value on the stack: before it the value is fine, after it the value is corrupted.
    The call to that function looks like this:

    And I think here we see the problem. I pass a uint8_t value as the receive buffer, but the read length is 2 bytes. So the copy writes into a memory region where it shouldn't.

    I will fix this but it seems this was our problem.
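The size mismatch described above can be sketched on a host as follows. This is a minimal illustration, not the SDK code: `read_params_t`, `fake_read_cmd` and `checked_read_cmd` are hypothetical stand-ins for a driver that copies the requested number of response bytes into the caller's buffer. The guard rejects a request whose length exceeds the destination object, which is the condition that went wrong here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a driver's read-command parameters. */
typedef struct {
    void    *rx_buf; /* destination supplied by the caller */
    uint32_t rx_len; /* number of bytes the driver will copy */
} read_params_t;

/* Stand-in for the driver: copies rx_len bytes without knowing how large
 * the caller's destination object really is. */
void fake_read_cmd(const read_params_t *p, const uint8_t *response)
{
    assert(p != NULL && p->rx_buf != NULL);
    memcpy(p->rx_buf, response, p->rx_len);
}

/* Guarded wrapper: refuses the call if the destination is smaller than the
 * requested length. Returns 0 on success, -1 on a size mismatch. */
int checked_read_cmd(void *dst, size_t dst_size, uint32_t rx_len,
                     const uint8_t *response)
{
    if (dst == NULL || rx_len > dst_size) {
        return -1; /* would write past the end of dst */
    }
    read_params_t p = { dst, rx_len };
    fake_read_cmd(&p, response);
    return 0;
}
```

Passing `sizeof` of the destination object next to the buffer pointer makes the mismatch visible at the call site: a single uint8_t with a read length of 2 is rejected, while a 2-byte array is accepted. Which neighbouring stack object an unchecked overflow hits depends on the stack layout the compiler chooses, which would be consistent with only some build configurations failing.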

  • Hey Aakash,

    After a lot of debugging I found the issue. It seems the OSPI driver of the SDK corrupts the call stack and thereby changes an address that we had previously put on the stack.
    I updated my previous post.