CC1352R: EasyLink_transmit() blocks

spanky

Part Number: CC1352R

I'm seeing a very random and rare issue. I don't really have any insight as to when and/or why it happens, but occasionally EasyLink_transmit() blocks.

I managed to capture a stack trace from ROV this time. Its in my receiverRadioTaskFunction task. I should note that I am using the DMM concurrent with the BLE stack, and I have that mostly working, but I am still squashing bugs. It took about 4.5 hours for this error to occur, so I really don't have any idea what could be causing it, or how to repeat it. I dont think I am running out of RAM, since there is plenty of space left on the heap.

I'm doing a seemingly innocuous call to EasyLink_transmit(), which usually returns pretty quickly (I haven't tried switching that call to EasyLink_transmitAsync(), but that is my only other idea at this point). I guess I am asking why EasyLink_transmit() would get stuck and how to prevent that... and if EasyLink_transmitAsync() would even fix it?

And, is this something possibly to do with the DMM? I do not think there was any activity on the BLE side when EasyLink_transmit() got stuck, but I'm not 100% on that. I don't think I've ever seen this happen during periods of activity on the BLE side either, but due to the time frames, I dont always have the debugger connected or handy.

over 6 years ago

0 Siri over 6 years ago

TI__Guru* 86091 points

Hi

I am not an expert on either BLE or DMM, but we have seen problems with the EasyLink_transmit blocking.

In that case the problem was that an EasyLink_TxPacket variable was allocated on the stack. It was by mistake not initialized, meaning any variables not set explicitly were undefined.
Since absTime was never set it ended up beeing some garbage value. If the garbage value absTime is a large value, the RF_pendCmd() call in Easylink_transmit() will block for an long time.

Can you please make sure that this is not what cases the problems you are seeing?

BR
Siri

0 spanky over 6 years ago in reply to Siri

Expert 2710 points

I am allocating the EasyLink_TxPacket structure on the stack as follows:

EasyLink_TxPacket txPacket =  { {0}, 0, 0, {0} };

I am also explicitly setting the absTime field to zero again, just to be sure.

Because this is an ack type operation, I am firing off 3 EasyLink_transmit()'s back to back. My understanding is that EasyLink_transmit() should block until complete, so there is no possibility of trying to TX over an existing operation. That doesn't mean that the BLE side isnt doing something... but I think the DMM should handle scheduling in that case?

I am pretty certain the BLE state is unconnected (just advertising), as this happens when I leave the system unattended overnight.

I do not have any explicit delay between my 3 tx's, but I am going to try that now.

I am wrapping that tx block in:

DMMPolicy_updateStackState(DMMPolicy_StackRole_WsnNode, DMMPOLICY_STACKSTATE_WSNNODE_TX);
...do 3 EasyLink_transmit()'s
DMMPolicy_updateStackState(DMMPolicy_StackRole_WsnNode, DMMPOLICY_STACKSTATE_WSNNODE_SLEEPING);

Where DMMPOLICY_STACKSTATE_WSNNODE_TX is defined with high priority and time critical on the subghz

I am reusing the txPacket for each repeated transmit. I just tried explicitly resetting the absTime to zero between each transmit, just in case EasyLink_transmit() modified it somehow. Same results.

0 spanky over 6 years ago in reply to spanky

Expert 2710 points

I tried adding a Task_sleep(1) after each transmit. In hindsight that was pretty dumb since it is wrapped in the DMMPolicy_updateStackState() calls, so who knows what is waking up and running between those calls. That did however seem to increase the frequency of the failure. About 2 hours to a failure now (every minute I send about 70 packets, which each get the 3 ack transmits). I still dont have any insight on what is causing the issue, but since adding Task_sleep(1), it would seem something in one of the other tasks (BLE?) is likely the issue.

Is there any way to further protect the EasyLink_transmit() call beyond the DMMPolicy_updateStackState() calls?

0 Severin Suveren over 6 years ago in reply to spanky

TI__Mastermind 23730 points

Hi spanky,

Just some sanity checks:

You have added the DMM RF driver API remapping in your EasyLink driver?

Also, could you read out the status field of the TX command when the issue occurs?

0 spanky over 6 years ago in reply to Severin Suveren

Expert 2710 points

I’m not sure I follow you on remapping in the easylink driver. I’ve added the DMM headers and set it up like the DMM example code. Do I need to modify the easylink files themselves? It does work for hours at a time with both the BLE and subghz, so I’m pretty sure I did something right.

As for the status field, are you referring to the return status of EasyLink_transmit()? Thats the problem, EasyLink_transmit() is not returning when this issue occurs.

0 Severin Suveren over 6 years ago in reply to spanky

TI__Mastermind 23730 points

I'm sorry, didn't know the EasyLink driver already supported this. In that case, have you set the USE_DMM define in the project's predefined symbols?

Regarding the status field, I mean the literal status field in the RF command struct. In other words, if your TX command is called myTxCmd, you can read the status field in the debugger by entering myTxCmd.status in the expressions view.

0 spanky over 6 years ago in reply to Severin Suveren

Expert 2710 points

USE_DMM is defined. Without it, the system doesn't function (I cant remember exactly how, as that was one of the first issues I had). Incidentally, a compiler warning in one of the DMM headers that USE_DMM is not set would be useful on future releases.

I'm not sure I completely follow you on the status. In EasyLink_transmit() there is an RF_Op* that is passed as the 2nd parameter to RF_scheduleCmd()/RF/postCmd(). That is EasyLink_cmdPropTx, which appears to have a status field. When I put "EasyLink_cmdPropTx.status" in the expression view, it has a value of 2. This is when EasyLink_transmit() is stuck in the same call stack as in the initial screenshot above.

Whatever I am doing seems to be making things worse (or maybe better for debugging). This last run only took 10 minutes for EasyLink_transmit() to block (down from 4+ hours). The only variable I can see is the placement of the remote device (that the failing one is trying to EasyLink_transmit() to). They are both on my desk roughly 6-12" apart. Is this maybe too close (i.e. is the transmission blowing out the receiver?) Again, this runs fine for big swaths of time (except for the one this morning)

0 spanky over 6 years ago in reply to spanky

Expert 2710 points

I tried just "EasyLink_cmdPropTx" in the expression view and there's a pNextOp that has a different status. I'm not sure if that helps any:

0 Severin Suveren over 6 years ago in reply to spanky

TI__Mastermind 23730 points

Which SDK version are you using?

From what I can see, the current version of EasyLink is using EasyLink_cmdPropTxAdv, not EasyLink_cmdPropTx.

0 spanky over 6 years ago in reply to Severin Suveren

Expert 2710 points

EASYLINK_API_VERSION "EasyLink-v2.40.02"

Line 155/156 in EasyLink.c:

static rfc_CMD_PROP_TX_t EasyLink_cmdPropTx;
static rfc_CMD_PROP_RX_ADV_t EasyLink_cmdPropRxAdv;

The latest development on my end was to try and wrap the transmit in a HwiP_disable/enable:

        uint32_t key = HwiP_disable();
        EasyLink_Status txstatus = EasyLink_transmit(&txPacket);
        HwiP_restore(key);

That seems to have made the problem go away... but due to the unpredictability of this issue, I cant say for sure, only that it hasn't happened for over 12 hours now, so that's a plus.

My thoughts are that I have an awful lot going on in terms of the DMM and some clocks that fire off to update the characteristics. I am also updating the advertisement bytes, and I do not fully understand what is going on under the hood in those operations (specifically when no BLE device is connected). Is it possible that one of those is indirectly stepping on the transmit and by disabling the Hwi, the transmit operation can run cleanly? (I also don't fully understand the ramifications of the HwiP_disable/enable... I was just trying anything I could think of)

Also, I think I'm going to have to switch to EasyLink_transmitCcaAsync() as my deployment environment is pretty noisy.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Hi Spanky,

I suspect what you see is due to the pending events used in the Easylink "EASYLINK_RF_EVENT_MASK" define. Seems like you end up in a situation where command might get changed by the DMM causing an "unexpected event" (or maybe no event?).

When putting the Hwi guard around the call, you are blocking out any other system calls which means you most likely would not see the issue, but it could create problems for the rest of the application where the timing is sensitive.

Could you try to expand the "EASYLINK_RF_EVENT_MASK" to all events and make a "trap" outside the "RF_pendCmd" inside the transmit API to any events not matching the original event mask? This could help us figuring out if the problem is you getting an unexpected event or if you are simply not getting any event.

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

I have since switched over to EasyLink_transmitCcaAsync() and that seems to have also resolved the issue (without Hwi disabling)

I could roll back my changes to try and recreate the issue, but at this point the EasyLink_transmitCcaAsync() is a much more robust solution for this application.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Hi Spanky,

Apart from CCA, the difference here seems to be you using an the non-blocking API instead of the blocking one, which could be the reason you are not seeing an issue. Assuming the CCA does not make a difference, I would however expect the same behavior, but with a missing callback instead of a blocking function. I recommend you make sure that you are not missing any callbacks.

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

Looks like you are correct. When using the synchronous transmit, my whole program blocked, but by switching to the async version, my program kept running along, just not correctly. I am still assessing what that means, but its almost certainly a missed callback from the transmit, as you suggested. Given that it is hours between failures, it's probably going to take a while to sort through this.

I rolled back some of my changes to return to the synchronous transmit as that failure mode is a bit easier to wrap my head around.

I've edited Easylink.c to expand the events passed to RF_pendCmd() to 0xFFFFFFFF, and then I am checking if the result of RF_pendCmd() is not in EASYLINK_RF_EVENT_MASK (the ones that it normally listens for). I cant figure out why CCS wont let me set a breakpoint on those new lines, but that's a different problem

Is there some reason RF_EventError was not included in EASYLINK_RF_EVENT_MASK? I was just looking at the event definitions and that one stood out as one that seems worth catching.

I'm having a rough time with CCS letting me catch that event because breakpoints dont work. I've added a System_printf in there, but the debugger crapped out somewhere along the way. I'm giving it another go with the printf and the Log_info macros in EasyLink_transmit(). I'll report back if/when something happens there.

Regarding the CCA, I was more interested in the Async part, but since my deployment environment is noisy, I wanted to make use of the clear channel assessment.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Hi Spanky,

Regarding CCS and breakpoints, I would assume this to be because of the optimization. You can choose to turn off optimization for only EasyLink.c by going into the properties of that file and turn down the optimization just as if it was the whole project. If you do this, there should be a small "key" icon on the file in the file symbol to show that it has custom options compared to the rest of the project.

As for why RF_EventError is not part of the mask, I can't say.

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

This is super frustrating. It happened once with my modifications, but the debugger crapped out. Then I disabled optimizations to fix that breakpoint issue (thanks... a more descriptive error message in CCS beyond the AET text would be helpful). Now its been running for close to 24 hours without an issue.

I'm headed back to the field to test in the noisy environment, so I switched back to the CCA/async code (which doesn't have the event flag / trap you suggested). I'm going to let it run over night and see if something happens. I'll do my field tests, then I'll switch back to the synchronous code and run it over the weekend and hope to catch the error.

My theory at this point is that the optimizations might have caused this particular issue? I don't have any evidence to back that up, but recently I saw a very weird (unrelated) and hard to describe issue on my companion device and that was solved by turning off optimizations (then I turned optimizations back on and that odd behavior inexplicably was still gone... I made no other changes to the code)

My observation here is that compiler optimizations do weird things. Are there any references on that general topic?

0 spanky over 6 years ago in reply to spanky

Expert 2710 points

Uuuhg, well I spoke too soon. I was just about to go to bed and I decided to try connecting and unconnected my BLE device a few times. It appears to have got stuck on a missed callback from EasyLink_receiveAsync(), or maybe in EasyLink_receiveAsync()? Its hard to tell from the stack trace in ROV. Every time this has happened it was on a transmit, so I don't know if this is a completely different issue.

The weird part, and maybe this is just because I happened to be here exactly during the error (every other time it happened while I was away) is that the BLE continued advertising and accepted connections for a few more seconds. But, the characteristics didnt respond back with any data. Then after a few seconds it stopped advertising.

Since it was on the recv this time, I had no useful code modifications to report.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

When you got stuck in this case, did you get an exception or the device simply "stopped in code"?

As for optimizations "solving the issue", this often means the actual problem in still there, but is masked as the code is not as aggressively optimizes.

My personal experience of these scenarios is that there is some kind of memory corruption going on, be that a stack that is overflown or an out-of-bounds write. In the non-optimized version the code is not packed as much, which means the small corruption is not directly noticed (but still there) as in the case where optimization might have packed it close to a vital control structure.

As you did now saw an issue in the receive case, I would look into how you handle the received data here, and if it is possible that you could get a "corrupt frame" which causes this issue?

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

In this case, since it was an async operation the other task kept running. Because I missed the async callback, that particular task blocked on the event loop (the callback would have fired off an event to unblock that). There were no exceptions that I could see.

The memory corruption sounds like the culprit and it’s entirely possible. The time scale (4-8 hours) makes it very difficult to figure that out beyond a code review. I’ve dug through my code, but it’s mostly likely a complex non-obvious series of operations.

Are there any tools that would help uncover this sort of corruption?

This time it happened on a recieve, but all the other times I’ve witnessed it (assuming this was the same issue) it was related to a transmit.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

There is no "super easy" way to uncover if this is the case, the easiest way is to simply find a a way to break just before what causes the issue and then single step over to see what goes wrong (which as you say, can be hard in situations where the error occurs in the time scale of hours).

I would still like to investigate the "all flags" check even in the case where the async API is used. Also have an extra check to make sure that you handle potential scheduling errors (in a situation where the transmit/receive call return an error).

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

Sorry for the delay. I did some field tests last week and came away with more questions than answers...

The CCA code didn't work as I hoped. It appears the environment is particularly noisy and my EasyLink_transmitCcaAsync() calls (in the reverse direction) were mostly busy. That's probably a side note to this main thread. Briefly though, there's a large AC motor in the vicinity and a bunch of 900MHz traffic. None of this is excessive, but its far more than my development environment.

When a packet did manage to transmit (from the remote device) to the board that is having the trouble (this thread), the board locked up. I'm not sure if this is the same issue since I didn't have a debugger. I left the field with very little insight. The board in question acks the packet with a transmit, so I'm not 100% sure where the failure is (on the rx or tx)

Over the weekend I left the system running again and it finally crapped out again. This time however, the stack trace from ROV is a bit different. On deeper inspection of the line numbers and functions listed in the Tasks->CallStacks of ROV, they don't appear to line up with actual lines of code. I have optimizations turned off.

If the ROV callstacks are to be roughly trusted, looking back the the codeline previous to the one listed in ROV, it does appear that it is stalled out in the EasyLink_transmitCcaAsync() call, or possibly on my semaphore waiting for the txcallback to complete (which never got called). It is strange however that my whole program stalls, even the clock I have set to toggle an LED to verify things are running.

My hope at this point is that the issue I was seeing regularly in the field is the same as the one I'm seeing after hours on the desk. My plan is to get CCS running on a laptop and revisit the field to try and trap that error with a bit more clarity. Hopefully that will happen some time this week.

0 spanky over 6 years ago in reply to spanky

Expert 2710 points

I made it back to the field and uncovered one glaring issue... but I dont think it is the one that is plaguing me in this thread. I migrated my code from an older 1350 project and dont have the TRNG initialization that is in the 1352 examples. Apparently the CCA code has some sort of dependency there. Here is a stack trace:

Incidentally, the other end on a 1310 is running fine with the CCA code without the TRNG initialization.

Can you shed some light on the dependency here? Is it possible that there is some rand() dependency that is surfacing after a certain number of calls? I'd be surprised if these issues are related, but its worth asking.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Hi Spanky,

CCA do use random numbers when calculating the back-off time, in these cases you should pass a "random number" function in the Easylink params on init (see the rfEasyLinkListenBeforeTalk example on this).

I thought that popped up (sorry for not thinking of this earlier) is that if you can trap the device in this "hanged" state again, you can have a look at the radio command and the RF queue and see if you can find any hints on what might be going on there.

To do this, I would suggest you add the RF driver source directly to the project, this will give us access to some additional symbols when debugging. Then start the debugger and add "RF_cmdQ" to your watch expressions along with the Easylink radio commands. You can find the Easylink command symbol name in Easylink.c, search for "EasyLink_cmd*". It might be a good idea to remove the "statics" just to make it easier for you to debug.

If you can get this up, you should be able to inspect the radio commands passed to the driver together with the status of the command. You should also be able to see if the command is still in the RF queue by inspecting the RF_cmdQ structure.

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

Forgive me if this is off topic, but there appears to be a difference in this between the 1310 and 1352. I am using the EasyLink_transmitCcaAsync() on the 1310 without initializing the TRNG or setting the pGrnFxn parameter and I am not seeing this issue (the crash at rand()). It seems that initializing the TRNG would be more of a security concern and rand() should(?) return a value even if it is more or less not truly random. Its easy enough to add this code, so I'll go ahead and do that... I'm just grasping at straws trying to explain the bigger issue. If rand() can crash, maybe its getting used somewhere else in the RF code. Its probably a completely unrelated issue, but worth asking.

I'm not sure how to go about adding the RF source directly. Is that just the RFCC26X2_multiMode.c file from the SDK?

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Hi Spanky,

In the case where pGrnFxn is not passed along the parameters, the default function will be used which is rand(). For your CC1352 device, did you base the EasyLink side on any particular example? I know the "Listen before talk" example uses the TRNG functionality.

Yes, copying/linking that file into the project would override the library version.

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

My code evolved out of the project_zero example from the 1350, but I integrated the DMM code and the asyncCCA stuff. In the last SDK release, I did a pretty thorough comparison as I though that update might have changed something in the example code.

I'm a bit perplexed why rand() would crash like that, but I don't think that is directly related to the main issue here.

I've been combing over my code to try and figure out what might be the issue. I have a potential bug in an un-handeled event (basically I'm posting an event once a second and not handling it through pend due to an event mask). I put that in another thread because I couldn't find an answer to what happens to unhandled events:

This probably isn't my issue, but after adding the event to the pend mask, it seems to be running for longer than it had previously (I'm up to about 14 hours now, where it used to only run 4ish hours before locking up). Every time I think I've found the issue I've been wrong, so I'm not super optimistic.

It has been running on the debugger for some time now and it did break out on this. I didn't have a breakpoint here, and I've never seen it stop here before. I was able to just hit resume and it seems to have continued on unaffected:

0 spanky over 6 years ago in reply to spanky

Expert 2710 points

Everytime I "think" I've solved the issue I've been wrong and spoke too soon.... but after resolving that event mask issue, its been running all weekend without issue, which is a massive improvement (aside from that random hiccup on the CPUwfi(), but all I had to do there was resume execution, so I'll chock that up to a random occurrence)

I think there's more to it than just the event mask, because the event I was trying to fire off every second still did what I wanted. I'm not sure how, but I haven't thoroughly investigated yet since I wanted to leave the setup running as long as possible to make sure the locking up went away.

I'll dig in this week to try and make sense of it and report back.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Interesting, looking forward to your report. As for the "WFI" place, it is strange that the debugger stopped you there, but it sound like a hiccup in the debugger (as you said) as this is where you expect to be when the device goes into idle/standby state.

0 spanky over 6 years ago in reply to IUsedToBeARealPerson (M-W)

Expert 2710 points

After digging into this, it was clearly a bug on my part (I suspected this all along). I still have no explanation for why things stalled out, but here is a bit of a description of my incompetence.

I added an event in ProjectZero_taskFxn(), just above the PZ_OAD_QUEUE_EVT block. I had 2 fundamental bugs here.

First, I forgot to add the event to the Event_pend() orflags... Which should have been obvious, except...
More dastardly, I defined the event as 11, (not bit 11 or Event_Id_11). Yes, I'm an idiot and know better. The problem was that since I defined my event as 11 (or Event_Id_00|Event_Id_01|Event_Id_03), I happened to capture the OAD_QUEUE_EVT (Event_Id_01). So when I triggered my event (via the mistaken 3 bits), my desired functionality was met, but it also quietly ran the OAD_QUEUE_EVT code. It seems that code was relatively benign if the status was success, so I didn't see it getting executed.

My best guess at this point is that my repeated calls to OAD_processQueue() might have been the culprit. Or maybe it was a timing issue (i.e. calling that at the wrong time)? Either way I had no business calling that code, so the bug was ultimately and entirely mine.

M-W, Siri, etc. Thank you for sticking with me on this.

0 IUsedToBeARealPerson (M-W) over 6 years ago in reply to spanky

TI__Guru 58425 points

Nice to hear you found the issue, tricky one :)

Good luck with the rest of the development!

Sub-1 GHz

Sub-1 GHz forum

CC1352R: EasyLink_transmit() blocks