TM4C129XNCZAD: The task apps are stuck in the queue with TI-RTOS

ray yang

Part Number: TM4C129XNCZAD

Tool/software:

Hi experts,

I followed the FAQ at the link below to try to find out what is wrong with the program.

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/854484/faq-how-do-i-trace-backward-to-the-code-that-caused-an-exception-in-my-ti-rtos-app#Open_Registers_View

The current information is as follows:

PC: 0x000351c4
LR: 0x0002a959

TI RTOS :tirtos_tivac_2_16_01_14

We found that the program ended up in the queue, and we don't know how to find out why.

What information do we need to provide in order to troubleshoot the cause of the problem?

Many thanks,

Ray Yang

8 months ago

0 ray yang 8 months ago

Prodigy 150 points

Hi,

The following diagram shows how we create the mailbox and post to it.

Ray Yang

0 Charles Tsai 8 months ago in reply to ray yang

TI Guru 184366 points

Hi,

It seems you are pending on a mailbox because you call Mailbox_pend in line 2903. It waits forever because it is waiting on the semaphore.

Does your Task1MessageDat ever get called? If you put a breakpoint in Task1MessageDat, does it stop there?

Does Task1MessageDat have a lower priority than Task11Message? Can you change the priority and see if that makes a difference?

Does the problem occur immediately, or does the Mailbox stop working after some time?

0 ray yang 8 months ago in reply to Charles Tsai

Prodigy 150 points

Hi Charles,

1) We burned the compiled binary file into our devices, and then tested them for 3 months. 8 devices were tested, and 3 of them crashed.

2) We cannot set breakpoints to inspect the state once the MCU has crashed. Instead, the firmware writes the current CPU registers into uninitialized RAM. After the external WDT chip resets the MCU, the firmware dumps this information to the LCD. The PC and LR values were obtained this way.

3) The information at that time is as follows

PC: 0x000351c4

LR: 0x0002a959

Exception occurred in ThreadType_0

name: Task1 handle: 0x3

stack base: 0x20000fa8

BFAR: 0x209d0098

stack size: 0x1400

AFSR: 0x00000000

PSR: 0x21000000

UFSR: 0x0000

ICSR: 0x00416803

HFSR: 0x40000000

MMFSR: 0x00

DFSR: 0x00000000

BFSR: 0x82

MMAR: 0x209d0098

4) The relationship between Task1MessageDat and Task11Message is as follows

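Task11Message(LCD_WAKEUP);    /* (a) */

void Task11Message(MbxMsg1Type msg)
{
    Task1MessageDat(msg, 0, 0);    /* (b) */
}

/* (c) Builds a message and posts it to gMailbox1 without blocking. */
void Task1MessageDat(MbxMsg1Type msg, uint8_t dat1, uint32_t dat2)
{
    Mailbox1Struct curMessage;
    static uint8_t count = 0;    /* failed posts; wraps back to 0 at 100 */

    curMessage.msgId = msg;
    curMessage.dat1 = dat1;
    curMessage.dat2 = dat2;

    /* Mailbox_post() copies curMessage into the mailbox; with BIOS_NO_WAIT
     * it returns FALSE immediately if the mailbox is full. */
    if (!Mailbox_post(gMailbox1, &curMessage, BIOS_NO_WAIT))
    {
        count++;

        if (count == 100)
        {
            count = 0;
        }
    }
}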

5) Do you have a private forum? Some information is inconvenient to discuss in the public forum because of its business content.
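For readers who want to reproduce the register capture described in point 2), here is a minimal sketch of a crash record kept in RAM that survives a watchdog reset. It is not the code used in this project. The names `CrashRecord`, `CRASH_MAGIC`, and the `crash_record_*` functions are hypothetical, and the sketch assumes the record is placed in a memory region that the startup code does not initialize (how to do that depends on the linker command file and is not shown). The magic word and checksum are there so that random power-up RAM contents are not mistaken for a real record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEFA17u   /* hypothetical "record is valid" marker */

/* Hypothetical layout. On the target this object would live in RAM
 * that is NOT zeroed at startup, so it survives a watchdog reset. */
typedef struct {
    uint32_t magic;      /* CRASH_MAGIC when a record has been saved */
    uint32_t pc;
    uint32_t lr;
    uint32_t sp;
    uint32_t bfar;
    uint32_t cfsr;       /* MMFSR, BFSR and UFSR packed in one word */
    uint32_t hfsr;
    uint32_t checksum;   /* XOR of all words before this field */
} CrashRecord;

/* XOR of every word that precedes the checksum field. */
static uint32_t crash_checksum(const CrashRecord *rec)
{
    uint32_t words[offsetof(CrashRecord, checksum) / sizeof(uint32_t)];
    uint32_t sum = 0;

    memcpy(words, rec, sizeof words);
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++) {
        sum ^= words[i];
    }
    return sum;
}

/* Called from the fault handler: store the registers and seal the record. */
void crash_record_save(CrashRecord *rec, uint32_t pc, uint32_t lr, uint32_t sp,
                       uint32_t bfar, uint32_t cfsr, uint32_t hfsr)
{
    assert(rec != NULL);
    rec->magic = CRASH_MAGIC;
    rec->pc = pc;
    rec->lr = lr;
    rec->sp = sp;
    rec->bfar = bfar;
    rec->cfsr = cfsr;
    rec->hfsr = hfsr;
    rec->checksum = crash_checksum(rec);
}

/* Called after reset: true only if a sealed, uncorrupted record exists. */
bool crash_record_valid(const CrashRecord *rec)
{
    assert(rec != NULL);
    return rec->magic == CRASH_MAGIC && rec->checksum == crash_checksum(rec);
}

/* Called once the record has been shown, so it is not reported twice. */
void crash_record_clear(CrashRecord *rec)
{
    assert(rec != NULL);
    memset(rec, 0, sizeof *rec);
}
```

After a reset, the boot code would call `crash_record_valid()` first, print the fields to the LCD if it returns true, and then call `crash_record_clear()`.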

Many thanks,

Ray Yang

0 ray yang 8 months ago in reply to ray yang

Prodigy 150 points

The frequency of TM4C crashes is about once a month.

Ray Yang

0 Charles Tsai 8 months ago in reply to ray yang

TI Guru 184366 points

Hi Ray,

Ok. The problem is much deeper and harder than I originally thought. Earlier you only described that the mailbox does not work, which led me to think the problem occurred during software development. Now you are saying the problem only occurs after running the device for a month. In the future, please describe precisely how and when the problem occurs.

If your error log is correct, then this is a precise bus fault (BFSR = 0x82 means PRECISERR with BFARVALID set, and HFSR = 0x40000000 means it was escalated to a hard fault), raised when the processor was probably doing some type of read from 0x209d0098. The SRAM starts at 0x20000000, so the offset is 0x9d0098, which is about 10 MB. The SRAM size is only 256 kB. Somehow the processor is accessing this illegal address. The PC is equal to 0x000351c4. Can you trace the code to find out which CPU instruction is at or near this address?
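The reading of the dump above can be checked mechanically. The sketch below decodes the relevant bits using the bit positions defined by the ARMv7-M architecture for BFSR, HFSR, and ICSR; the helper names are made up for illustration, and the SRAM window (base 0x20000000, 256 kB) is taken from the figures quoted in this thread. It also shows that an odd LR value such as 0x0002a959 carries the Thumb bit, so the address to look up in the .map file or disassembly is the value with bit 0 cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* BFSR bits (ARMv7-M): BFSR is byte 1 of the CFSR register. */
#define BFSR_IBUSERR      (1u << 0)   /* instruction bus error      */
#define BFSR_PRECISERR    (1u << 1)   /* precise data bus error     */
#define BFSR_IMPRECISERR  (1u << 2)   /* imprecise data bus error   */
#define BFSR_UNSTKERR     (1u << 3)   /* fault on exception return  */
#define BFSR_STKERR       (1u << 4)   /* fault on exception entry   */
#define BFSR_BFARVALID    (1u << 7)   /* BFAR holds the fault addr  */

/* HFSR bits (ARMv7-M). */
#define HFSR_VECTTBL      (1u << 1)   /* fault on vector table read */
#define HFSR_FORCED       (1u << 30)  /* escalated from another fault */

/* SRAM window as quoted in this thread. */
#define SRAM_BASE  0x20000000u
#define SRAM_SIZE  (256u * 1024u)

static_assert(SRAM_BASE + SRAM_SIZE == 0x20040000u,
              "256 kB of SRAM should end at 0x20040000");

bool bus_fault_is_precise(uint8_t bfsr)
{
    return (bfsr & BFSR_PRECISERR) != 0u;
}

bool bfar_is_valid(uint8_t bfsr)
{
    return (bfsr & BFSR_BFARVALID) != 0u;
}

bool hard_fault_is_forced(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0u;
}

/* True if addr lies inside the on-chip SRAM window. */
bool addr_in_sram(uint32_t addr)
{
    return addr >= SRAM_BASE && (addr - SRAM_BASE) < SRAM_SIZE;
}

/* Bit 0 of a stacked LR is the Thumb state bit, not part of the address. */
uint32_t thumb_code_addr(uint32_t lr)
{
    return lr & ~1u;
}

/* Number of the exception that was active (3 is HardFault). */
uint32_t icsr_vectactive(uint32_t icsr)
{
    return icsr & 0x1FFu;
}
```

With the values from the dump, `bus_fault_is_precise(0x82)` and `bfar_is_valid(0x82)` are both true, `addr_in_sram(0x209d0098u)` is false, `thumb_code_addr(0x0002a959u)` gives 0x0002a958, and `icsr_vectactive(0x00416803u)` gives 3.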

What type of tests are you doing? Are you subjecting the device to a harsh environment stress test? All eight devices have been running the same firmware for three months. If it is a software issue, I would expect all eight of them to have the same issue, and I would also expect the problem to occur quickly, not after 3 months.


0 ray yang 8 months ago in reply to Charles Tsai

Prodigy 150 points

Hi Charles,

I apologize for not telling you the details of what happened first.

1) These are burn-in tests (equipment reliability tests), conducted indoors in a typical laboratory environment.

2) " The PC is equal to 0x000351c4. Can you trace the code to find out what is the CPU instruction close to this address?"

By tracing the code to figure out where the program goes wrong, do you mean looking up 0x000351c4?

If so, as I stated in my first post, the PC ends up in the mailbox code.

3) Please take a look at 1) and 2) in the following pictures, which are about stack and heap settings. Are these two sets of stack/heap settings the same thing, or completely separate?

       1) They are set in arm linker.
       2) They are set in .cfg.

The .map file produced after program compilation shows that the current stack size is 8192 bytes.

4) Is there enough task stack space for the 4 tasks in TI-RTOS?
As we can see from the picture below, the stack peak does not exceed the task stack space.

Does the task stack take up the stack size in 3) above?
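As background on what a "stack peak" figure usually means: a common technique is to fill each stack with a known byte before the task starts and later count how many bytes at the low end are still untouched. The sketch below is a host-side illustration of that idea only. The fill value 0xBE is an assumption (it is the value I believe TI-RTOS uses when it initializes task stacks), and `stack_fill` and `stack_peak` are hypothetical names, not TI-RTOS APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xBEu   /* assumed fill byte, see the note above */

/* Paint the whole stack with the fill byte before the task runs. */
void stack_fill(uint8_t *base, size_t size)
{
    assert(base != NULL);
    memset(base, STACK_FILL, size);
}

/* Peak usage in bytes. The stack grows downward on Cortex-M, so `base`
 * is the lowest address and the untouched region starts there. */
size_t stack_peak(const uint8_t *base, size_t size)
{
    size_t untouched = 0;

    assert(base != NULL);
    while (untouched < size && base[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

A watermark like this only shows the deepest point that was actually written. It says nothing about a stray pointer write from elsewhere in the program, so a healthy stack peak does not rule out memory corruption.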

Other information:

XDC tools version: 3.32.25

CCS version: 8.3.0.00009

Compiler version: TI v18.1.4 LTS

Many thanks,

Ray Yang

+1 Charles Tsai 8 months ago in reply to ray yang

TI Guru 184366 points

ray yang said:
3) Please take a look at 1) and 2) in the following pictures, which are about stack and heap settings. Are these two sets of stack/heap settings the same thing, or completely separate?

       1) They are set in arm linker.
       2) They are set in .cfg.

The .cfg setting will override the linker setting for the system stack.

ray yang said:
The .map file produced after program compilation shows that the current stack size is 8192 bytes.

8192 is the system stack.

ray yang said:
4) Is there enough task stack space for the 4 tasks in TI-RTOS?
As we can see from the picture below, the stack peak does not exceed the task stack space.

Does the task stack take up the stack size in 3) above?

You did not exceed the maximum stack allocated for each task.

Can you do an ABA swap test? You said 3 out of 8 will crash after 3 months. This means the other 5 boards do not crash. Is this a correct understanding?

Can you swap the MCU from one of the three suspected boards to one of the five good boards and vice versa? This will help isolate the problem.

Are all 8 boards running the same firmware? Can you confirm this? The reason I ask is that I would expect all 8 boards to behave the same if there is a software issue, especially after running for so long. If there is an issue with the software (e.g. the stack), then it should show deterministic behavior.

0 ray yang 8 months ago in reply to Charles Tsai

Prodigy 150 points

Hi Charles,

Thank you for your reply, regarding the ABA test, we will have to discuss with our hardware engineers to see if it is possible to perform it and continue to try to find out why this is happening.

Would you like to use the ABA test to identify whether the issue is in the TM4C or on the board?

As you can see in the picture below

For "Scan for errors", is the problem with the debug settings or the CCS utility software or the code itself?

"You said 3 out of 8 will crash after 3 months. This means the other 5 boards do not crash. Is this a correct understanding?" Yes

"Are all 8 boards running the same firmware?" Yes

"Can you confirm this?" Our devices are equipped with an LCD display module that displays the latest firmware version.

Ray Yang

0 Charles Tsai 8 months ago in reply to ray yang

TI Guru 184366 points

ray yang said:
Would you like to use the ABA test to identify whether the issue is in the TM4C or on the board?

Hi Ray,

That is correct. ABA swap test will isolate the problem to the board or to the MCU.

The A-B-A Swap Method is a simple cross check test, which can confirm the observed issue is not systemic.

A-B-A Swap Method
(1) Remove the suspected component (A) from the original failing board.
(2) Replace the suspected component (A) with a known good component (B) and check if the original board now works properly.
(3) Mount the suspected component (A) to a known good board and see if the same failure occurs on the good board.

Step 3 is important because it helps us to exclude any possibility that the issue is caused by a systemic issue or the interaction of multiple slightly bad components on a good board.

ray yang said:
As you can see in the picture below

For "Scan for errors", is the problem with the debug settings or the CCS utility software or the code itself?

I have never used "Scan for errors" before, so I'm not sure what information it provides. On a good device, what do you see when you perform Scan for Errors?

ray yang said:
"Are all 8 boards running the same firmware?" Yes

"Can you confirm this?" Our devices are equipped with an LCD display module that displays the latest firmware version.

Thanks for the confirmation. As I said, I would expect all eight boards to somehow show the same behavior if they are all running the same firmware after running for so long. I could envision some type of unknown memory leak issue if it is an issue with TI-RTOS. But it is not the case here. The 5 good devices are running ok after 3 months. If there is a memory leak, then these 5 should have run into the same situation.

0 ray yang 8 months ago in reply to Charles Tsai

Prodigy 150 points

An update:

After the last discussion, we updated the firmware on 10 devices and repeated the burn-in test. After about 10 days of operation, one of the 10 devices crashed. By dumping the CPU registers to the LCD, we obtained PC, LR, and SP. After tracing the code, we found that the program had stopped at the same line as in the code screenshot from my first post.

Regarding the ABA test, we would have to remove the BGA-packaged chip and swap it with one from a board that has not crashed. Our hardware engineers consider this difficult, so the test is on hold.

stack base: 0x20000fa8

BFAR: 0x20330098

stack size: 0x1400

AFSR: 0x00000000

PSR: 0x21000000

UFSR: 0x0000

ICSR: 0x00415803

HFSR: 0x40000000

MMFSR: 0x00

DFSR: 0x00000000

BFSR: 0x82

MMAR: 0x20330098

Ray Yang

0 Charles Tsai 8 months ago in reply to ray yang

TI Guru 184366 points

Hi Ray,

Sorry, I really don't know what is wrong here. I don't know why only one device fails but not the others when they are all running the same code and are subject to the same burn-in test. As you indicated, when a device fails, the code is stuck on the same line. Can you do one experiment? Can you try using a Queue instead of a Mailbox? I wonder whether you will see the same issue. Here are also some links on debugging TI-RTOS projects that may help.

0 ray yang 8 months ago in reply to Charles Tsai

Prodigy 150 points

Hi Charles,

Thank you for your reply, we asked a very tricky question that caused you a lot of trouble!

Is there a dedicated TI RTOS forum in the TI E2E forum?

Thank you.

Ray Yang

0 Charles Tsai 8 months ago in reply to ray yang

TI Guru 184366 points

Hi Ray,

Unfortunately, there is no dedicated TI-RTOS forum. I hope the various debugging links will help. I wonder if you can:

- Experiment with a simple program using the mailbox instead of your full-blown application firmware. Do you see any issue with the mailbox? I have never seen a report that the TI-RTOS mailbox has an issue. At the moment, I'm not fully convinced that it is a Mailbox issue. However, you are reporting that each time the device crashes, it is always in the mailbox function. This is why I suggest trying out a simple mailbox program.

- Experiment with Queue instead of Mailbox. Can you reproduce the same issue after x number of days?

0 ray yang 7 months ago in reply to Charles Tsai

Prodigy 150 points

Hi Charles,

1) We're going to try to change the mailbox to a queue.

2) I'm not questioning the mailbox, I'm just curious why the MCU hangs in the same program block every time.

3) I suspect that mailboxes are built on top of queues. When tracing the code, we see that the mailbox code calls Queue_get, which comes from the Queue module; what distinguishes Queue_get is that it operates atomically.

https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/bios/sysbios/6_76_04_02/exports/bios_6_76_04_02/docs/cdoc/ti/sysbios/knl/Queue.html#get

Many thanks,

Ray Yang

0 Charles Tsai 7 months ago in reply to ray yang

TI Guru 184366 points

ray yang said:
1) We're going to try to change the mailbox to a queue.

Hi Ray,

Thank you for trying the experiment.

ray yang said:
2) I'm not questioning the mailbox, I'm just curious why the MCU hangs in the same program block every time.

Sorry, I'm unable to explain why it only hangs on the mailbox function but not others and neither can I explain why only one out of 10 devices will fail after running for many days.

ray yang said:
3) I suspect that mailboxes are built on top of queues. When tracing the code, we see that the mailbox code calls Queue_get, which comes from the Queue module; what distinguishes Queue_get is that it operates atomically.

Thanks for tracing the code. What I know is that mailboxes are copy-based, meaning that the producer and the consumer must each allocate memory for the message, while queues link elements in place, so their length can grow and shrink at run time.
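To make the copy-based versus link-based distinction concrete, here is a small host-side model. It is not the TI-RTOS implementation and uses no TI-RTOS APIs: `ToyMailbox`, `ToyQueue`, and their functions are hypothetical, there is no locking or blocking, and the `Msg` fields only mirror the names used in the Mailbox1Struct shown earlier in the thread (the field types are assumed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field names follow the thread's Mailbox1Struct; types are assumed. */
typedef struct {
    uint32_t msgId;
    uint8_t  dat1;
    uint32_t dat2;
} Msg;

#define MBX_DEPTH 4u

/* Copy-based: the mailbox owns fixed storage for MBX_DEPTH messages. */
typedef struct {
    Msg    slots[MBX_DEPTH];
    size_t head;
    size_t count;
} ToyMailbox;

/* Copies *msg into the mailbox; returns false if it is full
 * (similar in spirit to a post that does not wait). */
bool toy_mbx_post(ToyMailbox *m, const Msg *msg)
{
    assert(m != NULL && msg != NULL);
    if (m->count == MBX_DEPTH) {
        return false;
    }
    memcpy(&m->slots[(m->head + m->count) % MBX_DEPTH], msg, sizeof *msg);
    m->count++;
    return true;
}

/* Copies the oldest message into *out; returns false if empty
 * (a real pend would block here instead). */
bool toy_mbx_pend(ToyMailbox *m, Msg *out)
{
    assert(m != NULL && out != NULL);
    if (m->count == 0u) {
        return false;
    }
    memcpy(out, &m->slots[m->head], sizeof *out);
    m->head = (m->head + 1u) % MBX_DEPTH;
    m->count--;
    return true;
}

/* Link-based: the queue stores no data, it only chains caller-owned nodes. */
typedef struct QNode {
    struct QNode *next;
    Msg           msg;
} QNode;

typedef struct {
    QNode *first;
    QNode *last;
} ToyQueue;

void toy_q_put(ToyQueue *q, QNode *n)
{
    assert(q != NULL && n != NULL);
    n->next = NULL;
    if (q->last != NULL) {
        q->last->next = n;
    } else {
        q->first = n;
    }
    q->last = n;
}

/* Returns the oldest node, or NULL if the queue is empty. */
QNode *toy_q_get(ToyQueue *q)
{
    assert(q != NULL);
    QNode *n = q->first;
    if (n != NULL) {
        q->first = n->next;
        if (q->first == NULL) {
            q->last = NULL;
        }
    }
    return n;
}
```

In this model, changing a message after `toy_mbx_post()` does not affect what the receiver gets, because the mailbox holds its own copy. With the queue, the receiver gets the very same node back, so the sender must keep that memory alive and unmodified until it is consumed.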
