EXC_exceptionHandler when opening RTA window

Franck

Other Parts Discussed in Thread: TMS320DM6437

Hello,

We are running a fairly large application on the DM6437 with BIOS 5.41.10.36. I have used the RTA Printf Logs window several times in the past to print debug messages. For an unknown reason, very recently, as soon as I open an RTA window, my application crashes with the following output:

71 841 960,1773,EXC_exceptionHandler: EFR=0x2,RTASystem,
71 842 094,1774, NRP=0x3230c,RTASystem,
71 842 203,1775, mode=supervisor,RTASystem,
71 842 808,1776,Internal exception: IERR=0x1,RTASystem,
71 843 389,1777, Instruction fetch exception,RTASystem,
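For readers decoding the two register values in this log, here is a minimal sketch in C. The bit positions are taken from my understanding of the C64x+ exception registers; only EFR bit 1 (internal exception) and IERR bit 0 (instruction fetch exception) are confirmed by the messages above, and the names `classify_efr` and `ierr_is_instruction_fetch` are hypothetical helpers, not BIOS APIs.

```c
#include <stdint.h>

/* Assumed C64x+ bit positions. Only EFR bit 1 and IERR bit 0 are
 * confirmed by the log in this thread; the rest are for context. */
#define EFR_SXF  ((uint32_t)1 << 0)   /* software exception  */
#define EFR_IXF  ((uint32_t)1 << 1)   /* internal exception  */
#define EFR_EXF  ((uint32_t)1 << 30)  /* external exception  */
#define EFR_NXF  ((uint32_t)1 << 31)  /* NMI exception       */
#define IERR_IFX ((uint32_t)1 << 0)   /* instruction fetch exception */

typedef enum {
    EXC_NONE,
    EXC_SOFTWARE,
    EXC_INTERNAL,
    EXC_EXTERNAL,
    EXC_NMI
} exc_kind_t;

/* Report the first matching exception class. Several flags could be
 * set at once; this sketch deliberately returns only one of them. */
exc_kind_t classify_efr(uint32_t efr)
{
    if (efr & EFR_IXF) return EXC_INTERNAL;
    if (efr & EFR_SXF) return EXC_SOFTWARE;
    if (efr & EFR_EXF) return EXC_EXTERNAL;
    if (efr & EFR_NXF) return EXC_NMI;
    return EXC_NONE;
}

/* Non-zero when IERR reports an instruction fetch exception. */
int ierr_is_instruction_fetch(uint32_t ierr)
{
    return (ierr & IERR_IFX) != 0;
}
```

With the values from the log, `classify_efr(0x2)` yields `EXC_INTERNAL` and `ierr_is_instruction_fetch(0x1)` is true, which matches the text the exception handler printed.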

I tried starting with a clean new workspace, with no success.

I don't know how to troubleshoot this issue. I am wondering whether a recent Code Composer update could have led to this problem...

Is there somebody that could give me hints about how to troubleshoot that issue?

Thanks

Franck

Code Composer 4.2.3

XDS510 USB Plus

Bios 5.41.10.36

XDCTools 3.20.08.88

over 14 years ago

0 Karl Wechsler over 14 years ago

TI__Mastermind 20805 points

Hi Franck --

My guess would be a stack overflow problem. If you've recently added additional ISRs or SWIs to your system, you might need to increase the size of the system stack. You can see the stack sizes and the current use of each stack using the Tools->ROV tool available from the Debug perspective.

The 'NRP' in the message above should point at/near the instruction that caused the problem. Is that a valid code address? Either something is overwriting the code or somehow you are branching to a bad place.

-Karl-

0 Franck over 14 years ago in reply to Karl Wechsler

Intellectual 685 points

Hi Karl,

Thanks for replying. I have not recently added any ISR or SWI. I also already checked all task and kernel stacks through the ROV and they are all well below their limit.

About your note about NRP, the address shown in the error message is not a valid code address.

Assuming this is actually caused by corrupted program code as you suggest, how could this be related to opening the Printf Logs window? When this window is not open, the whole application runs without any problem. Is there some extra code that runs when the window is open and is inactive when it is not? If you can suggest some other investigation steps, that would be great.

Thanks

Franck

0 Karl Wechsler over 14 years ago in reply to Franck

TI__Mastermind 20805 points

To rule out 4.2.3, can you try building one of the default BIOS examples (like stairstep) and see if it works on your board with the LOG windows? If this works, then we have to start questioning something specific to your app.

Thanks,
-Karl-

0 Franck over 14 years ago in reply to Karl Wechsler

Intellectual 685 points

Hi Karl,

Got something very interesting following your suggestion.

I built the stairstep project as you suggested, and yes everything works perfectly. Actually, I also could get the CPU load graph working, something that never happened in my application.

Comparing the stairstep project settings to mine, one of the differences was the size of the log buffer. Since my printf log only ever worked in stop mode in the past, I had set a very large buffer size as follows in order to get all entries from the ROV log window:

trace.bufLen = 16384;

I can't explain why, but it turns out that if I reduce that size to 64, which is the default size, the printf log and CPU load graph start working perfectly in my app. Most importantly, the application no longer crashes.

I tried to use a buf length of 16384 in the stairstep project and I get the same behavior as in my application: unexpected device reset occurs.

Do you see anything wrong in using a large buffer length for the LOG module? Actually, with my system, I determined that going higher than 256 for the log buffer makes my application go crazy. I would be curious to know if this observation is particular to my system and if so, what are the reasons for this to happen. Do you see the same behavior on your side?

Thanks

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

RTA sends requests down to the target to read the LOG buffers in chunks. The requests specify the address to read and the number of records to read at that address. On the target, in the idle loop, the RTA code receives these requests, copies the chunk of LOG data to a buffer to send to the host, and clears that portion of the LOG buffer. It seems like something is going wrong there--maybe the values in the data requests are screwed up. That would explain why your app is crashing when you open RTA.
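The chunked read described above can be sketched roughly as follows. This is not the BIOS RTA source: the function name `service_read_request`, the word-based units, and clearing with zeros are all assumptions made for illustration. The point it shows is that the requested count has to be limited both by what remains in the LOG buffer and by the size of the transfer buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical target-side handler: copy `count` words starting at
 * `offset` from the LOG buffer into the transfer buffer `out`, then
 * clear the copied portion of the LOG buffer. Returns the number of
 * words actually copied. */
size_t service_read_request(uint32_t *log, size_t log_words,
                            size_t offset, size_t count,
                            uint32_t *out, size_t out_words)
{
    if (offset >= log_words) {
        return 0;                      /* request starts outside the LOG */
    }
    if (count > log_words - offset) {
        count = log_words - offset;    /* do not read past the LOG buffer */
    }
    if (count > out_words) {
        count = out_words;             /* do not write past the transfer buffer */
    }
    memcpy(out, log + offset, count * sizeof *log);
    memset(log + offset, 0, count * sizeof *log);  /* clearing with 0 is assumed */
    return count;
}
```

If the values in a request are wrong and no such limits are applied, the copy writes beyond the destination and damages whatever the linker placed after it.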

However, I haven't been able to reproduce your issue so far. I'm using an evmDM6437 in CCSv4.2.0, with the same emulator and version of BIOS. I tried increasing the trace buffer to 1512, then LOG_system to 1024, then trace to 16382, and they all seemed to work fine.

How long do you have to run before the application crashes and spits out that exception? Does it happen pretty immediately?

Thanks,

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

Hi Chris,

Thanks for joining us.

I must apologize: the stairstep application crash that I thought was similar to my issue was in fact caused by something completely different. It was caused by my watchdog, which had been started by my usual application and was not kicked by the stairstep application. (I would have thought that loading a new program through the debugger would cancel the watchdog, but apparently it does not.)

After additional testing, I can't reproduce my issue in the stairstep application. Moreover, in my own application, the issue appears to be triggered more randomly than I initially thought, and it does not depend only on the log buffer size. I have been able to increase the buffer size above 256 without any crash by placing it in a different memory segment in DDR.

I will investigate further and gather information to figure out the cause of this issue, and I will let you know when I find something. In the meantime, if you have additional ideas about how to troubleshoot it, please let me know.

Thanks

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

It'd be interesting to know whether you see the issue immediately upon opening RTA, or if you have to wait a little while. If it seems to happen instantly, it may be easier to debug. I could point you to some of the relevant functions to step through and see if you can tell where things go wrong. The RTA target code gets called a lot, though, so it would be hard to debug unless the problem happens on the very first call.

Another suggestion would be to disable RTDX in your application, and just use stop mode (this should remove the RTA target code from the picture). You can add the following lines to your .tcf file to remove RTDX:

bios.HST.HOSTLINKTYPE = "NONE";

bios.RTDX.ENABLERTDX = 0;

This has some disadvantages, though; you'll only get data when you halt the target, and it doesn't support the STS views or the CPU load.

Thanks,

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

Hi Chris,

Your latest suggestion to disable RTDX works well: my app no longer crashes during execution. So that could be an interesting workaround if we can't explain the RTDX failure any further. Thanks for that.

As for the time it takes before I get the crash, it is almost instant. I can't tell whether it is the very first call to the RTDX function, but I can tell that it is well within a second. So I would be open to stepping through some key functions if you could explain them to me.

Thanks

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

Great!

Can you try placing a breakpoint on the symbols RTA_F_getlog and RTA_F_getsts, and see if both of these are reached, and how many times they're each reached before it crashes?

Thanks,

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

Hi Chris,

RTA_F_getlog() is reached first. Then RTA_F_getsts() is reached. Then hwi1 is reached (crash). What's next?

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

Franck -

Are you able to provide your build .out file for me to try? It might be easiest for me to step through the code and spot anything fishy.

To be clear, it only reaches each of those functions once before crashing?

Have you been able to step through the code for Hwi1 and see on what instruction it goes bad? Does it jump to a bad address or something?

Thanks,

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

Hi Chris,

Here is my .out file:

7065.53A0002_BetaDsp.zip

Since I am not using the EVMDM6437 eval board, I have put a long TSK_sleep of 20 seconds at the very beginning of my app to prevent hardware-related issues. The good news is that the issue still shows up.

Yes both functions are called exactly once before the device crashes.

I am not sure about your question about hwi1. I am not familiar with this aspect, but from my understanding, hwi1 is the result of the exception, so by then it is already too late to know what happened before. The only subsequent actions I can see are the exception handling and the error message display in the printf log. Am I right?

Please let me know if you need anything else from me to help you step through the code. I look forward to hearing news from your investigation.

Thanks

Franck

0 Franck over 14 years ago in reply to Franck

Intellectual 685 points

Hi Chris,

Any update about that issue? Were you able to reproduce it with the .out file I have provided you?

Thanks

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

I tried running the app, it appears to terminate after a few seconds (even without RTA open) with an assertion failure:

Assertion failed in file '../Src/main.c' at line 414.

Are you seeing the same?

Also, when you receive the exception, can you halt the target and open ROV (Tools -> ROV), and look at the KNL view to ensure that the system stack hasn't overflowed?

Thanks,

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

Hi Chris,

This is the assertion I get ONLY if the log printf window is opened, and it happens within 1 or 2 seconds after I hit the run button. In fact, this assertion is reached because of the following line I put in my TCF file:

bios.SYS.ABORTFXN = prog.extern("MAIN_MyAbort");

The assertion message you see is set inside MAIN_MyAbort(). I checked my stack in ROV and it peaks at only 372 out of 8192 available.

Do you get this assertion within 20 seconds? If not, that assertion would probably be a normal error that happens in the application because some hardware specific to our product is not present on the EVMDM6437 system. In that case, I would recommend activating the RTA log printf before that timeout to reproduce my problem.

If you get the assertion within 20 seconds and your RTA log printf window is not open, then it seems to behave differently than it does on my side. Have you tried putting a breakpoint on RTA_F_getlog and RTA_F_getsts to ensure that those functions are really not called before reaching this assertion?

Thanks,

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

I put a breakpoint on MAIN_MyAbort, RTA_F_getsts, and RTA_F_getlog. I didn't reach the RTA functions, and I reach MyAbort in ~2 seconds.

Chris

0 Karl Wechsler over 14 years ago in reply to Chris McCormick

TI__Mastermind 20805 points

Can you set a b/p at RTA_F_dispatch and see whether you reach it before the crash? The host sends a function address (RTA_F_getsts or RTA_F_getlog) across via this RTA_F_dispatch function. I suspect that the address is getting corrupted or is invalid and the code is jumping to a bad place. It almost seems like the code running on the target doesn't match the .out file that CCS thinks is running on the target. Is this possible? The .out file loaded by CCS needs to match the code on the target. Are you moving code around or booting in a strange way? Or using a symbol load of a .out file that doesn't match the code on the target?
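To illustrate why a corrupted function address is fatal in this kind of dispatch scheme, here is a small sketch of a dispatcher that checks the requested address against a table of known handlers before calling it. This is an assumption-laden illustration, not how RTA_F_dispatch is implemented: `checked_dispatch`, `demo_getlog`, and `demo_getsts` are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*rta_handler_t)(uint32_t arg);

/* Hypothetical stand-ins for the real target-side handlers. */
int demo_getlog(uint32_t arg) { return (int)arg + 1; }
int demo_getsts(uint32_t arg) { return (int)arg + 2; }

static const rta_handler_t known_handlers[] = { demo_getlog, demo_getsts };

/* Call `requested` only if it matches a known handler.
 * Returns 1 and stores the handler's result on success, 0 otherwise. */
int checked_dispatch(rta_handler_t requested, uint32_t arg, int *result)
{
    size_t i;
    size_t n = sizeof known_handlers / sizeof known_handlers[0];

    for (i = 0; i < n; i++) {
        if (requested == known_handlers[i]) {
            *result = requested(arg);
            return 1;
        }
    }
    return 0;   /* unknown address: refuse to branch */
}
```

A dispatcher that branches to whatever address it receives, without a check like this, will fetch instructions from an arbitrary location as soon as that address is corrupted, which is consistent with an instruction fetch exception.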

0 Franck over 14 years ago in reply to Karl Wechsler

Intellectual 685 points

Hi Karl,

I am loading the .out file through the debugger (XDS510 USB) in the usual way. I don't see how I could end up with mismatched code inside the DSP. But I am curious: what makes you think that?

The only unusual setup I have is that my application is based on BIOS 5.41.10.36 but is built through an RTSC project with the --tcf option to read the BIOS 5 TCF file. Do you see anything wrong in doing that? Here is part of my build output, which shows the versions of XDCTools and BIOS used:

C:\Texas Instruments\ccsv4\utils\gmake\gmake -k all
'Building file: ../PN_53A0002.cfg'
'Invoking: XDCtools'
"C:/Texas Instruments/xdctools_3_20_08_88/xs" --xdcpath="V:/Perforce/Projects/Common/Software/TMS320DM6437/Platforms/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/PSP/PSP_leddar/PSP_1_10_03/release_v1/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/PSP/PSP/PSP_1_10_03/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/FrameworkComponents/framework_components_2_26_00_01/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/FrameworkComponents/framework_components_2_26_00_01/fctools/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/Edma3Drv/edma3_lld/edma3_lld_01_10_00_01/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/NDK/ndk_1_94_1/packages;V:/Perforce/Projects/Common/Software/TMS320DM6437/c64plus_mpeg4enc/c64xplus_mpeg4enc_02_02_04_production/packages;C:/Texas Instruments/ccsv4/../bios_5_41_10_36/packages;C:/Texas Instruments/xdais_6_25_01_08/packages;" xdc.tools.configuro -o configPkg -t ti.targets.C64P -p PN70A0010_1 -r whole_program -c "C:/Texas Instruments/ccsv4/tools/compiler/C6000 Code Generation Tools 7.0.4" --tcf "../PN_53A0002.cfg"
making package.mak (because of package.bld) ...
generating interfaces for package configPkg (because package/package.xdc.inc is older than package.xdc) ...
configuring PN_53A0002.x64P from package/cfg/PN_53A0002_x64P.cfg ...
    will link with ti.sdo.fc.acpy3:lib/release/acpy3.a64P
    will link with ti.sdo.fc.memutils:lib/release/memutils.a64P
    will link with ti.sdo.fc.dskt2:lib/release/dskt2.a64P
    will link with ti.sdo.fc.dman3:lib/release/dman3Cfg.a64P
    will link with ti.sdo.fc.utils.gtinfra:lib/release/gt_bios.a64P
    will link with ti.sdo.utils.trace:lib/release/gt.a64P
    will link with ti.sdo.codecs.mpeg4enc:lib/mp4venc_ti.l64P
Inside H3A getLibs
    will link with ti.sdo.pspdrivers.drivers.h3a:lib/dm6437/Release/h3a_bios_drv.lib
Inside Resizer getLibs
    will link with ti.sdo.pspdrivers.drivers.resizer:lib/dm6437/Release/rsz_bios_drv.lib
Inside Previewer getLibs
    will link with ti.sdo.pspdrivers.drivers.previewer:lib/dm6437/Release/prev_bios_drv.lib
Inside VPFE getLibs
    will link with ti.sdo.pspdrivers.drivers.vpfe:lib/dm6437/Release/vpfe_bios_drv.lib
Inside pal_os getLibs
    will link with ti.sdo.pspdrivers.pal_os.bios:lib/dm6437/Release/palos_bios.lib
Inside EDMA3 Drv getLibs
Target Name: C64P
    will link with ti.sdo.edma3.drv:lib/Release/edma3_drv_bios.lib
Inside EDMA3 RM getLibs
    will link with ti.sdo.edma3.rm:lib/dm6437/Release/edma3_rm_bios.lib
cl64P package/cfg/PN_53A0002_x64P.c ...
asm64P package/cfg/PN_53A0002_x64Pcfg.s62 ...
cl64P package/cfg/PN_53A0002_x64Pcfg_c.c ...
'Finished building: ../PN_53A0002.cfg'

(...)

I have set a b/p at RTA_F_dispatch as you suggested. If I set it before running the application, this function is called several times before reaching RTA_F_getsts or RTA_F_getlog, so I decided to set the b/p only after RTA_F_getsts has been reached. Doing so, RTA_F_dispatch is reached only one time after RTA_F_getsts and then the app crashes. Stepping into RTA_F_dispatch, I could find that the crash happens inside tkat2 when it tries to branch to an invalid address at the end of the function.

From my understanding, the invalid address is loaded from a structure called "RTA_fromHost$pipe$rd". When I carefully watch that structure, I can see that it looks valid when RTA_F_getsts is reached. But when RTA_F_dispatch is reached, its content has been modified and looks corrupted. The corruption appears to come from another buffer located just before it in memory, called "RTA_toHost$buf". When RTA_F_getsts is reached, here is how the memory content around "RTA_toHost$buf" and "RTA_fromHost$pipe$rd" looks:

RTA_toHost$buf
00000082    83924040    8394BAD8    00000002    FFFFFFFF    FFFFFFFF    0025CF52    00000000
00000084    00000004    8394BB1C    00000003    FFFFFFFF    FFFFFFFF    0025D01E    00000000
00000086    00000004    8394BA58    00000000    FFFFFFFF    FFFFFFFF    0025D09B    00000000
00000088    00000004    00000000    0000000E    FFFFFFFF    FFFFFFFF    0025D0EF    00000000
0000008A    00000060    00000000    00000001    FFFFFFFF    FFFFFFFF    0025D253    00000000
0000008C    FFFFFFFF    8394BB1C    00000004    FFFFFFFF    FFFFFFFF    0025D27D    00000000
0000008E    00000002    8394BAF0    00000003    FFFFFFFF    FFFFFFFF    0025D316    00000000
00000090    FFFFFFFF    8394BAF0    00000004    FFFFFFFF    FFFFFFFF    0025D3DA    00000000
PIP_A_TABBEG, RTA_fromHost$pipe
00000004
RTA_fromHost$pipe$rd
00000000    839492F0    00000004    10F04520    8394B954    00000000    00000000    8394B948

Then, when RTA_F_dispatch is reached, here is the same memory location content:

RTA_toHost$buf
0000004B    00000000    00000000    00000000    00000000    80000000    00000000    00000000
80000000    00000000    00000000    80000000    0000004B    000168B9    00001D56    0000004B
000139A1    00001DAC    00000000    00000000    80000000    00000000    00000000    80000000
00000000    00000000    80000000    00000000    00000000    80000000    00000000    00000000
80000000    00000000    00000000    80000000    00000000    00000000    80000000    00000000
00000000    80000000    00000000    00000000    80000000    00000000    00000000    80000000
00000000    00000000    80000000    00000000    00000000    80000000    00000000    00000000
80000000    00000000    00000000    80000000    00000000    00000000    80000000    00000000
PIP_A_TABBEG, RTA_fromHost$pipe
00000000
RTA_fromHost$pipe$rd
80000000    00000000    00000000    86778777    0002E1A0    FD597996    FFFFFF16    8394B948

Nothing before RTA_toHost$buf is touched, so it looks like an erroneous write access to RTA_toHost$buf has been made. But I can't go further in my investigation since I don't know what part of the code usually accesses that buffer. Do you have an idea of what could corrupt that buffer? Is there a way to set a breakpoint on a memory location to know when some code tries to access that memory location?

Thanks

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

Franck - I think we're getting close!

It looks to me like RTA may be requesting too many STS records for how big the RTA_toHost$buf is. You have a total of 96 STS records in your app, which is a total of 288 bytes, but RTA_toHost$buf is only 256 bytes. I'm thinking maybe it's trying to copy over all 96 records at once, and so it's overflowing RTA_toHost$buf as you observed.

When you hit RTA_F_getsts, can you look at the value of the register B4? This should hold the number of STS records that are being requested...

Thanks,

Chris

0 Chris McCormick over 14 years ago in reply to Chris McCormick

TI__Expert 4095 points

Franck - I reviewed the RTA host code and I think this is definitely what's happening. My math was a little off earlier: you have 24 STS records, and the RTA transfer buffer only has room for 21 (each STS record takes 12 bytes to transmit). I confirmed with your app that when I reach RTA_F_getsts, the number of records requested is 24.

The RTA host code is not dividing the STS retrieval into multiple reads in this case, which it should be.
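As a sketch of the splitting that is missing, the following C helpers compute how many 12-byte STS records fit in a transfer buffer and how a larger request would be divided into reads. The function names are hypothetical and the sketch assumes the buffer holds nothing but records, which is what the figures in this thread (21 records in 256 bytes) suggest.

```c
#include <stddef.h>

#define STS_RECORD_BYTES 12u   /* per-record transmit size quoted above */

/* Whole records that fit in a transfer buffer of `buf_bytes` bytes. */
size_t records_per_read(size_t buf_bytes)
{
    return buf_bytes / STS_RECORD_BYTES;
}

/* Number of reads needed to fetch `total_records` records. */
size_t reads_needed(size_t total_records, size_t buf_bytes)
{
    size_t per = records_per_read(buf_bytes);
    if (per == 0) {
        return 0;   /* buffer too small to carry even one record */
    }
    return (total_records + per - 1) / per;
}

/* Number of records to request in read number `index` (0-based). */
size_t chunk_size(size_t total_records, size_t buf_bytes, size_t index)
{
    size_t per = records_per_read(buf_bytes);
    size_t start = index * per;
    size_t remaining;

    if (per == 0 || start >= total_records) {
        return 0;
    }
    remaining = total_records - start;
    return remaining < per ? remaining : per;
}
```

For the case in this thread, 24 records and a 256-byte buffer, this gives two reads of 21 and 3 records instead of a single read of 24.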

I'll get back to you soon with a suggested workaround or fix.

Thanks,

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

That is a great finding, and I am very happy that I can now explain that strange behavior!

I was curious about why using smaller LOG buffers did not cause the crash, so I took a look at what happened when I was using 64-byte LOG buffers instead of larger ones. In fact, it is only a matter of how the linker orders things in the output memory map. When using smaller LOG buffers, the linker does not place RTA_fromHost$pipe and RTA_fromHost$pipe$rd right after RTA_toHost$buf; it places something else there. It happens to be one of my LOG buffers (slowTrace), so my slowTrace LOG is probably being corrupted, but that does not result in a catastrophic failure...
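The effect of an oversized copy on whatever follows the buffer can be reproduced in a small, self-contained C model. The layout below is an assumption modeled on the memory dump earlier in the thread (a 64-word buffer followed by 9 words of neighboring data), and `unchecked_fill` is a hypothetical helper. The copy is kept inside one struct so the demonstration itself stays well defined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOHOST_WORDS   64u  /* 256-byte transfer buffer */
#define NEIGHBOR_WORDS 9u   /* whatever the linker placed right after it */

typedef struct {
    uint32_t to_host_buf[TOHOST_WORDS];
    uint32_t neighbor[NEIGHBOR_WORDS];
} region_t;

/* Write `bytes` bytes starting at the beginning of the region without
 * checking the size of to_host_buf, and return how many neighbor
 * words were changed by the write. */
size_t unchecked_fill(region_t *r, size_t bytes)
{
    uint32_t before[NEIGHBOR_WORDS];
    size_t i;
    size_t changed = 0;

    if (bytes > sizeof *r) {
        bytes = sizeof *r;          /* keep the demo inside the object */
    }
    memcpy(before, r->neighbor, sizeof before);
    memset((unsigned char *)r, 0xA5, bytes);

    for (i = 0; i < NEIGHBOR_WORDS; i++) {
        if (before[i] != r->neighbor[i]) {
            changed++;
        }
    }
    return changed;
}
```

Writing 24 records of 12 bytes (288 bytes) spills 32 bytes, or 8 words, past the 256-byte buffer. That agrees with the dump, where the word at RTA_fromHost$pipe and the first seven words of RTA_fromHost$pipe$rd changed. Whether those 8 words belong to a pipe descriptor or to a LOG buffer decides whether the result is a crash or silent data corruption.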

Unfortunately, I was recently thinking about using some statistics objects (STS) to do some function profiling. I imagine that doing so in RTA mode would just worsen the situation, but could you confirm whether this corruption can also happen in stop mode (through the ROV viewer)?

I look forward to getting your suggested workaround or fix.

Thanks again,

Franck

0 Chris McCormick over 14 years ago in reply to Franck

TI__Expert 4095 points

Franck - It looks like there is a simple workaround to this problem for your app.

You can increase the size of the RTA_toHost buffer by adding the following line to your .tcf script:

bios.HST.instance("RTA_toHost").frameSize = 128;

This value is given in 4-byte words. This doubles the size of the toHost buffer, and should be enough room for 42 STS records in your application (you currently have 24).
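The sizing arithmetic behind this workaround can be checked with a short C sketch. The helper names are hypothetical, and the sketch assumes 12 bytes per STS record with no other overhead in the frame, which matches the figures of 21 and 42 records quoted in this thread.

```c
#include <stddef.h>

#define WORD_BYTES       4u    /* frameSize is given in 4-byte words */
#define STS_RECORD_BYTES 12u   /* transmit size of one STS record */

/* Whole STS records that fit in a frame of `frame_words` words. */
size_t sts_capacity(size_t frame_words)
{
    return (frame_words * WORD_BYTES) / STS_RECORD_BYTES;
}

/* Smallest frameSize, in words, that holds `records` STS records. */
size_t min_frame_words(size_t records)
{
    size_t bytes = records * STS_RECORD_BYTES;
    return (bytes + WORD_BYTES - 1) / WORD_BYTES;
}
```

A frame of 64 words (256 bytes) holds 21 records, a frame of 128 words holds 42, and 24 records need at least 72 words. If more STS objects are added later, the same calculation shows whether frameSize has to grow again.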

A formal fix should be available in the next BIOS 5 release (5.41.11).

Thanks!

Chris

0 Franck over 14 years ago in reply to Chris McCormick

Intellectual 685 points

The workaround works great.

Thanks a lot for your support!

Franck

0 Franck over 13 years ago in reply to Chris McCormick

Intellectual 685 points

Hello Chris,

I know this case was closed a long time ago, but I just wanted to verify with you whether the new BIOS version supplied with CCSv5 (5.41.11.38) actually includes this fix. I could not see anything about it in the release notes, so I preferred to check with you before removing the workaround from my TCF script.

Thanks

Franck

0 Karl Wechsler over 13 years ago in reply to Franck

TI__Mastermind 20805 points

Franck --

Unfortunately, the fix for this was not picked up in 5.41.11.

The fix will be picked up for 5.41.12. The bug id is SDOCM00081109 which has been committed for 5.41.12.

Regards,
-Karl-
