RTOS/TMS320C6472: independent core shutdown after 72 minutes without apparent reason

Part Number: TMS320C6472
Other Parts Discussed in Thread: SYSBIOS

Tool/software: TI-RTOS

hi,

We are using the TMS320C6472 multicore DSP and are seeing behavior we cannot explain.

Our boot sequence consists of a secondary bootloader that configures the EMAC and the local timers; afterwards it receives the actual image via the EMAC.

The image itself is built with sysbios 6.22, csl and our code.

The weird behavior is that 72 minutes after the image is loaded, the core shuts down (or at least the task stops running).

The secondary bootloader configures the local timers to be in dual unchained mode without the watchdog feature on.

Moreover, it seems to happen individually per core, e.g. if core 2 loads 10 minutes after core 1, then core 2 shuts down exactly 10 minutes after core 1 does.

When we print the "mode", "hi period" and "lo period" of the 6 local timers, we see the secondary bootloader configuration on the cores without the image loaded. On the core with the image loaded the period has changed, but the mode stays the same.

We suspect BIOS_start changes the local timers of the core, but the new mode isn't watchdog mode either.

Can you help us figure out the reason for this shutdown?

regards,

eli

  • Hi,

    Are you sure this is not a hardware problem? For example, some core power rail may be dropping. Have you checked this?

    Best Regards,
    Yordan
  • hi,

    only one core resets each time; the rest of the dsp and the emac stay on, so it doesn't look like a power problem.

    regards,
    eli
  • Just a guess, but if you have a mistake in the layout of the power rails, you could experience insufficient current powering your processor, causing some of the sw processes to hang. That was my point.

    Anyway, I've added the RTOS experts to elaborate on the software side, since the design is ok.

    Best Regards,
    Yordan

  • Eli,

    Look for stack overflow or running out of heap space. The most common cause of 'unexplained' crash is stack overflow, which causes very strange behavior. A slow memory leak could also fit, which could be on the stack or on the heap.

    What do you observe in CCS on the core that has shut down? What exactly does shutdown mean in this case?

    What events are occurring at the 72 minute point, either external or internal?

    Regards,
    RandyP
  • hi,

    we have a hook that checks for stack overflow on task switch, and it reports stack overflows with an assertion and an Ethernet message. we had a few of those while writing the image, and that's not what happens after 72 minutes. we ran different images that do significantly different jobs, but all of them reproduce the 72 minute "shutdown", so a stack overflow is unlikely. moreover, all of our memory is statically allocated, so a memory leak is unlikely as well, at least from the "user code". again, 72 minutes is very weird and points to some sort of timer more than to memory problems across different images.

    we don't have jtag connected to the chip, so we can't really say what happens; all our communication with the chip happens over Ethernet. after 72 minutes we see that the emac still works (different cores reload at different times, and the other cores keep running because they haven't hit the 72 minute mark yet), but said core simply stops sending Ethernet messages (we use a periodic message as a sw-watchdog).

    there aren't any external events happening after 72 minutes, and internally in the "user space" we have nothing unique to 72 minutes, so we are currently guessing that it's connected in some way to the sysbios timers.

    we don't understand why the bios changes our local timer configuration after BIOS_start(); we think it's connected to that.

    maybe it is also connected to the "Clock" module? can we disable it and still work?

    regards,
    eli
  • Eli,

    2**32 * 1us is about 4295 seconds, or roughly 72 minutes (71.6 to be more precise), so your 'shutdown' occurs at the point when a 32-bit timer would return to 0 when clocked at 1us. 1us is the default timer period used for the BIOS tick. Are you also clocking other timers at 1us independent of BIOS?
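
    As a quick check of that arithmetic, here is a minimal sketch in plain C. It is not tied to any BIOS or CSL API; the function names are hypothetical and the only assumption is a free-running 32-bit counter that increments at a fixed rate.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Whole seconds until a free-running 32-bit counter wraps back to 0,
     * given its increment rate in counts per second (1000000 for a 1 us count).
     * Hypothetical helper for illustration only. */
    static uint64_t wrap_seconds(uint64_t counts_per_second)
    {
        assert(counts_per_second > 0);
        return (UINT64_C(1) << 32) / counts_per_second;
    }

    /* The same interval in whole minutes (truncated, not rounded). */
    static uint64_t wrap_minutes(uint64_t counts_per_second)
    {
        return wrap_seconds(counts_per_second) / 60;
    }
    ```

    With a 1 MHz count rate, wrap_seconds(1000000) gives 4294 whole seconds and wrap_minutes(1000000) gives 71 whole minutes, which matches the observed interval of roughly 72 minutes.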

    BIOS uses one timer, probably defaulting to timer0. I do not remember how we recommended configuring BIOS for all 6 cores of the C6472, but I do not recall having multi-core support for a single instance of BIOS at the time (not sure how TI-RTOS of today does this, I am not from the BIOS team where those experts are - I am just offering advice in case it helps).

    You need to make sure that the local timer you are using is not the same as the one assigned to BIOS. Either can be changed; someone from the BIOS team will have to walk you through changing the one for BIOS, or you could change the local timer if that is easier. That is the only explanation I can imagine for BIOS changing your local timer configuration - you are using the same one for both.

    Disabling the Clock module will disable many BIOS features. Probably not the way to go.

    Do you have some point in your code where you read the BIOS or local timer and check for the current value being >= the previous value + an offset? That would fail at 72 minutes for a 1us timer since it rolls over to 0.
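
    A minimal sketch of the kind of comparison described above, in plain C. The function names are made up for this example and are not taken from the image in question or from any BIOS API; the sketch assumes a 32-bit counter value read from some timer.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Fragile: compares absolute counter values. Once prev + offset overflows,
     * or once now has wrapped past 0, the answer is wrong. */
    static bool expired_naive(uint32_t now, uint32_t prev, uint32_t offset)
    {
        return now >= (uint32_t)(prev + offset);
    }

    /* Wrap-safe: unsigned subtraction is modulo 2^32, so (now - prev) is the
     * true elapsed count as long as fewer than 2^32 counts have passed. */
    static bool expired_wrap_safe(uint32_t now, uint32_t prev, uint32_t offset)
    {
        return (uint32_t)(now - prev) >= offset;
    }

    /* Small self-check around the rollover point. */
    static void rollover_self_check(void)
    {
        /* 0x110 counts have elapsed across the wrap, the timeout is 0x80. */
        assert(expired_wrap_safe(0x10u, 0xFFFFFF00u, 0x80u));
        /* The naive form misses it, because now is numerically small again. */
        assert(!expired_naive(0x10u, 0xFFFFFF00u, 0x80u));
    }
    ```

    In the failing case the naive check would keep reporting "not expired" until the counter climbs all the way back up, which on a 1us count would look like a task that silently stops doing its periodic work.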

    With JTAG, you can determine if the core is frozen, running in invalid program space, stuck in a loop or such. It is our recommended method of observing and controlling the processor cores for debug. But your timer configuration being stepped on is a good hint of something that needs to be addressed, so keep the supply going.

    Regards,
    RandyP
  • RandyP,

    thank you very much for your thorough explanation. i read a bit further into the timer/clock configuration, and we do indeed need it for Semaphore_pend, which we use. furthermore, the secondary bootloader does indeed use the same timer, but once the image with the sysbios runs, the bootloader ceases to exist, so it is ok for the sysbios to use said local timer. we made an image that prints the period and the count, and we see that the count for the timer rolls over, probably once it hits the periodLo mark.

    The sysbios uses the timer in unchained (half timer) mode and does indeed use the lower half, which is 32 bits. but if the count rolls over each time it hits the "1us" mark, then the counter that overflows should be the clock counter, no?

    Regards,
    eli

  • Eli,

    There is probably (I am not the RTOS expert, but it should be in the documentation somewhere) a counter that keeps track of the 1us count, yes. And if that counter is 32 bits in length, then it will also wrap around to 0 when it passes 0xffffffff, and that would be the counter we are talking about affecting your primary problem.
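
    The distinction between the two counters can be sketched with a small model in plain C. This is a hypothetical illustration, not the SYS/BIOS implementation: the struct, its field names and the step function are all assumptions made for this example. The hardware count restarts at every period match, while a separate 32-bit tick counter keeps incrementing and eventually wraps.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical model of a periodic timer plus a software tick counter. */
    typedef struct {
        uint32_t count;   /* models the hardware count, restarts at period */
        uint32_t period;  /* counts per tick */
        uint32_t ticks;   /* software tick counter, 32 bits wide */
    } tick_model;

    /* Advance the model by one timer count. Returns 1 when a tick fired. */
    static int tick_model_step(tick_model *m)
    {
        assert(m->period > 0);
        if (++m->count < m->period) {
            return 0;
        }
        m->count = 0;   /* period match: the hardware count starts over */
        m->ticks++;     /* unsigned increment wraps from 0xFFFFFFFF to 0 */
        return 1;
    }
    ```

    Starting the model with ticks at 0xFFFFFFFE and stepping through two periods leaves ticks at 0, which is the wrap that any code comparing tick values has to tolerate.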

    Regards,
    RandyP