Hi, this is my first post to e2e, so let me know if I've left out anything important.
I'm running DSP/BIOS 5.41 with DSPLink 1.65.00.02 on a custom board with an OMAP-L138. I have an MSGQ reader and an MSGQ writer on both the ARM and the DSP, for two-way communication. DSPLink is running in task mode, using the zero-copy transport. The ARM application is multi-threaded, and the DSP application has multiple tasks and takes interrupts from uPP, GPIO and SPI (which is using EDMA3).
The sequence MSGQ_alloc, MSGQ_put, then MSGQ_get, MSGQ_free seems to carry a significant performance overhead from MSGQ_alloc and MSGQ_free: MSGQ_alloc seems to take about as long as MSGQ_put, although I haven't measured MSGQ_free. (Let me know if this should not be the case.)
In an attempt to improve performance, I tried pre-allocating a circular buffer of messages for the ARM to send to the DSP. The DSP periodically sends a message back to the ARM to report how far through the circular buffer it has processed, and the ARM simply re-uses the messages. This improves performance, but it seemed to crash the DSP periodically (after about an hour in our normal application). To reproduce the crash faster, I then increased the rate of messages between the ARM and DSP, so I now have:
- The ARM sends to the DSP using a pre-allocated buffer, and then waits for a reply.
- The DSP waits to receive and then sends back to the ARM using a different pre-allocated buffer.
- This repeats, and it usually crashes within tens of seconds.
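To make the recycling rule I'm relying on concrete, here is a minimal host-side sketch in plain C. It contains no DSPLink calls, and every name in it (MsgRing, ring_acquire, ring_ack, RING_SLOTS) is hypothetical and not taken from my real code. The assumption it illustrates is that the sender may only re-use a slot once the receiver has reported it as processed.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u /* hypothetical pool size; a power of two so the
                         slot index stays consistent when counters wrap */

/* Sender-side bookkeeping for a pool of pre-allocated messages.
 * Both counters are free-running; their difference is the number of
 * slots the receiver may still be reading. */
typedef struct {
    uint32_t sent;  /* total messages handed to the receiver */
    uint32_t acked; /* total messages the receiver has reported done */
} MsgRing;

static void ring_init(MsgRing *r)
{
    r->sent = 0;
    r->acked = 0;
}

static uint32_t ring_in_flight(const MsgRing *r)
{
    return r->sent - r->acked;
}

/* Returns the index of the slot to re-use for the next send, or -1 if
 * every slot is still owned by the receiver. */
static int ring_acquire(MsgRing *r)
{
    if (ring_in_flight(r) >= RING_SLOTS) {
        return -1;
    }
    int slot = (int)(r->sent % RING_SLOTS);
    r->sent++;
    return slot;
}

/* Called when a progress report arrives. 'processed' is the receiver's
 * running total. Reports that go backwards or claim more than was sent
 * are rejected and leave the state unchanged. */
static bool ring_ack(MsgRing *r, uint32_t processed)
{
    if (processed - r->acked > ring_in_flight(r)) {
        return false;
    }
    r->acked = processed;
    return true;
}
```

If ring_acquire returns -1 the sender has to wait for a progress report instead of overwriting a message that may still be queued or in use on the other core.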
When I change the code to use MSGQ_alloc and MSGQ_free, it does not crash. This could be due to:
1. Just recycling the messages is not supported: please let me know if this is the case.
2. The increased throughput of using pre-allocation is tickling the DSPLink software in a particular way to cause it to fall over.
I have hooked my own function as the SYS_abort handler in the tcf file. It sends the system log and the SYS_abort message to the ARM in a pre-allocated message (note: not one from the pre-allocated circular buffer, just a one-use message). Reproducing this under the debugger in CCS has proved difficult: I've got it working a few times, but I have trouble repeating experiments, and I'm not sure what I should be looking at even when it does work. The crashes pretty much all give a 'Run-time exception, aborting' message, with a system log similar to the following (I've annotated in parentheses what the addresses correspond to in the .out file):
50681864 CLKINT ticks = 0x003fac08
50681865 PRDTICK ticks = 0x003fac08
50681866 SWIPOST handle = 0xcfcf3e70 (KNL_swi)
50681867 SWIBEGIN handle = 0xcfcf3e88 (KNL_swi$fxn)
50681868 SWIEND handle = 0xcfcf3e88 (KNL_swi$fxn)
50681869 SEMPOST handle = 0xcfb0bd30, sem->count = 0
50681870 SEMPOST handle = 0xcfcdb35c, sem->count = 0
50681871 SEMPOST handle = 0xcfb00380, sem->count = 0
50681872 SEMPOST handle = 0xcfb0bb18, sem->count = 0
50681873 SWIBEGIN handle = 0xcfcf3e88 (KNL_swi$fxn)
50681874 TSKBLOCKED handle = 0xcfcf3184 (task1)
50681875 TSKRUNNING handle = 0xcfcf2e84 (TSK_idle)
50681876 SWIEND handle = 0xcfcf3e88 (KNL_swi$fxn)
50681877 SEMPOST handle = 0xcfb00bac, sem->count = 0
50681878 SWIPOST handle = 0xcfcf3e70 (KNL_swi)
50681879 SWIBEGIN handle = 0xcfcf3e88 (KNL_swi$fxn)
50681880 TSKREADY handle = 0xcfb00bdc (near start of DDR$heap)
50681881 TSKRUNNING handle = 0xcfb00bdc (near start of DDR$heap)
50681882 SWIEND handle = 0xcfcf3e88 (KNL_swi$fxn)
50681883 SEMPOST handle = 0xcfb00380, sem->count = 0
50681884 SEMPOST handle = 0xcfb0ba34, sem->count = 0
50681885 SWIBEGIN handle = 0xcfcf3e88 (KNL_swi$fxn)
50681886 TSKREADY handle = 0xcfcf3124 (task2)
50681887 SWIEND handle = 0xcfcf3e88 (KNL_swi$fxn)
50681888 SEMPOST handle = 0xcfb00380, sem->count = 0
50681889 USRERR EXC_exceptionHandler: EFR=0x2
50681890 USRERR NRP=0xcfcdae40 (*** this location varies randomly: in this particular instance it is in data space, but it is never in code space and sometimes not even in mapped memory, which leads me to suspect something is overwriting this value or a stack problem)
50681891 USRERR mode=supervisor
50681892 USRERR Internal exception: IERR=0x12 (0x10 = Resource Conflict, 0x02 = Fetch Packet)
50681893 USRERR Fetch packet exception
50681894 USRERR Resource conflict exception
50681895 SEMPOST handle = 0xcfb00380, sem->count = 0
The IERR register value changes, but always seems to indicate a problem with executing the packet (or fetching it), which is consistent with the NRP register being messed up.
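For anyone reading the different IERR values, this is a small decoder sketch I would use on the ARM side. The bit assignments are my reading of the IERR description in spru732 (they agree with the two exception lines in the log above for 0x12), so please treat the table as an assumption to check against the manual. The names IerrBit, kIerrBits and ierr_decode are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t mask;
    const char *name;
} IerrBit;

/* Assumed IERR bit layout, bit 0 upwards. */
static const IerrBit kIerrBits[] = {
    {0x001u, "Instruction fetch"},
    {0x002u, "Fetch packet"},
    {0x004u, "Execute packet"},
    {0x008u, "Opcode"},
    {0x010u, "Resource conflict"},
    {0x020u, "Resource access"},
    {0x040u, "Privilege"},
    {0x080u, "Loop buffer"},
    {0x100u, "Missed stall"},
};

/* Writes a comma-separated list of the exception names set in 'ierr'
 * into 'out' and returns how many names were written in full. */
static int ierr_decode(uint32_t ierr, char *out, size_t out_len)
{
    int count = 0;
    size_t used = 0;

    if (out_len == 0) {
        return 0;
    }
    out[0] = '\0';

    for (size_t i = 0; i < sizeof kIerrBits / sizeof kIerrBits[0]; i++) {
        if ((ierr & kIerrBits[i].mask) == 0) {
            continue;
        }
        int n = snprintf(out + used, out_len - used, "%s%s",
                         count ? ", " : "", kIerrBits[i].name);
        if (n < 0 || (size_t)n >= out_len - used) {
            break; /* output buffer full: stop at the truncated entry */
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Passing 0x12 with a large enough buffer yields the two names "Fetch packet" and "Resource conflict".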
task2 waits in an infinite loop to receive messages, processes them, and sends them back to the ARM. I'm not creating any tasks on the heap, so I suspect the task at 0xcfb00bdc is the DSPLink task. I note that although task2 becomes ready, it doesn't seem to run. I see in spru732j.pdf (C64x documentation) that EFR=0x2 means 'Internal exception has been detected', but what does that mean in practice?
I have also checked the stacks and they don't seem to get very close to overflowing. I've hooked in TSK_checkstacks to the TSK switch function, and the system stack doesn't seem to get close to full (unless it is blowing up somehow).
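As a cross-check that doesn't depend on the kernel's own checking, a stack high-water mark can be measured by painting the stack with a known byte and later counting how much of the paint survives. The following is a generic sketch of that technique, not what TSK_checkstacks does internally. The fill value and the names stack_paint and stack_unused are hypothetical, and it assumes the stack grows downward from the high end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xBEu /* hypothetical fill byte */

/* Fill the whole stack region with the known byte before the task runs. */
static void stack_paint(uint8_t *base, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        base[i] = STACK_FILL;
    }
}

/* 'base' is the lowest address of the region. With a downward-growing
 * stack, untouched bytes sit at the low end, so count the fill bytes
 * upward from 'base' until the first byte that has been overwritten.
 * Returns the number of bytes that were never written. */
static size_t stack_unused(const uint8_t *base, size_t size)
{
    size_t n = 0;
    while (n < size && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result of 0 means the stack has been written all the way to its lowest address, and so may have overflowed. One limitation is that a stray write that jumps past the painted region without touching it will not show up in this count.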
Any ideas on how to debug this issue would be appreciated, as would any direction on whether the pre-allocated circular buffer approach should or should not work.