This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Correct use of TMS570 DCAN / memory leak ?

Other Parts Discussed in Thread: TMS570LS20216, HALCOGEN, LAUNCHXL2-TMS57012

I'm using halcogen to setup the DCAN of a TMS570LS20216 and seeing memory leak/corruption after receiving/sending a number of CAN messages.

I have a whole bunch of other things setup and working fine, NHET, HTU, SPI, ADC, RTI. This can run for hours without any memory corruption.

For testing CAN I am sending a single CAN message from an external unit, interrupting and transmitting a message back. This is repeated by clicking a button every few seconds (so load is very low, single frame at a time). After 20 to 50 messages I get a memory corruption of generic variables in RAM. First 20 or so messages work fine.

My 'canMessageNotification()' is this simple test code below:

if((node == canREG1) && (messageBox == canMESSAGE_BOX2))
{
if(!canIsRxMessageArrived(node, messageBox))
{
return;
}
canGetData(canREG1, canMESSAGE_BOX2, data); /* copy to RAM */

data[0] = 'T';
data[1] = 'e';
data[2] = 's';
data[3] = 't';
data[4] = '1';
data[5] = '2';
data[6] = '3';
data[7] = '4';
canTransmit(canREG1, canMESSAGE_BOX1, data);
return;
}

I can't see anything in particular that I'm doing wrong. I've looked through the canTrasmit and canGetData source, and can't see any potential memory leaks.

Is it ok to call canTrasmit from within the interrupt handler?

Any ideas welcome?

  • Ian,

    I am not sure I'd call this a memory leak issue unless you are dynamically allocating buffers from the heap.
    Sounds more likely to be a stack overflow issue due to limited stack size.

    You should be able to configure the MPU to catch a stack overflow. or set a watchpoint to trap a memory access past the bottom of the stack, that would halt the CPU right at the instruction that overflows.

    That's my best guess without more information. I don't see any dynamic memory allocation in the code segment you pasted.
  • Thanks for the reply.
    Stack usage was my first thought, and I increased the stack. Although your suggestion for trapping an overflow is useful.

    After further debugging it seems the DCAN and ADC are interacting.
    When the memory is corrupted, it is the contents of the ADC2 group 1 data that gets copied multiple times over the whole RAM memory range. But without CAN frames, the ADC works. The buffers for the ADC data are statically linked.

    The ADC is handled with an interrupt on FIFO full and 'adcGetData' gets called. This is getting hammered pretty heavily.
    If I neuter the adcGetData by forcing if to only read a few bytes from the FIFO instead of the full contents. The CAN continues working.

    Another strange effect, ADC2Group1 data is getting stored with channel numbers, so the normal table in memory is e.g.
    1,val1,2,val2,5,val5,6,val6 repeating
    but the corrupted memory has each value repeated 6 or 7 times e.g.
    1,val1,1,val1,1,val1,1,val1,1,val1,1,val1,1,val1,2,val2,2,val2,2,val2,2,val2,2,val2,2,val2,3, val3 .... etc.
    It seems like a pointer to the adc data buffer is going beyond range, but the only time this is used in in adcGetData.

    It seems as if the CAN interrupt is disturbing the ADC interrupt, could the delay of handling the CAN interrupt be causing a re-trigger of the ADC interrupt before it has finished. The interrupt routine shouldn't need to be re-entrant, I wouldn't have thought the ADC interrupt could be retriggerable before it has exited?

    The ADC handled with interrupts was a temporary measure to check ADC readings where correct from the hardware attached.
    I was planning on changing the ADC transfers to DMA eventually, so I think I'll do that now rather than later!
  • Ian,

    Which stack did you increase?
  • I edited the 'sys_link.cmd' file

    define region STACK = mem:[from 0x08000000 size 0x00002000];

    But now that you've said that, I'm looking in 'sys_core.asm' to see what is initialized there ;-)

    Regards,
    Ian.
  • also - if your notifications are run in IRQ mode, (the routine where you copy the ADC and DCAN data ... ) then you need to increase the IRQ mode stack ;)
  • I've changed all of the stacks in Halcogen and re-generated the 'sys_core.asm' and checked it matches my 'sys_link.cmd' file. All ok.
    I've also filled the stack memory (for all stacks) with a pattern and run the program to see what gets overwritten. The stack use is minimal, a fraction of the stack sizes set.

    Thanks for the pointers.
    I'll do some reading on the DMA controller and change my ADC handling over to DMA to see if that removes the problem.

    Slightly off-topic, I notice Halcogen does not have a configuration tab for the DMA controller.
    Do you know if TI have any examples of using the mibADC with DMA?

    Regards,
    Ian.
  • Ian,

    So first, are you still having the problem with all the stack sizes increased? If so then I don't think moving to DMA is the right next step.

    2nd - there are some DMA based examples for HalCoGen in the examples folder. There isn't a GUI tab but there is a header file and a .c file to support the DMA. look for the example under examples\TMS570LS31x_21x\mibspiDma.c. It's MibSPI not ADC but it shows the DMA at least.

    3rd, just noticed that you're using the 20216. This is NRND (not recommended for new designs). Note that this does not by any means end of life - this part will be shipping for many years.

    But if you are starting a new design the LS3137 is recommended in it's place. The 3137 should actually cost less or at least the same and it has more memory. There are also reduced variants of the LS3137 available at even lower costs.

    If you want to get into this family then the processors.wiki.ti.com/.../LAUNCHXL2-TMS57012 would be an inexpensive starting point. That device has less flash but more RAM (1.25 MBytes of Flash, 192K RAM) and it's got the newer N2HET as well as the ePWM, eQEP, eCAP modules.

    The 1227 is software compatible w. the 3137 ... (we just don't have a $20 kit for the 3137... if you need the 3137's memory then you would want to look at the 3137 HDK).
  • With the stack increased I'm still having problems.
    If I remove/disable the ADC, the CAN massages work ok.
    I can't quite see the mechanism for the corruption, but the data that is getting copied over everything is from the ADC. The ADC is a huge load on the system. The interrupts will be triggering at tens of KHz at the moment.
    They do need switching over to DMA at some point.

    We're using the 20216-EP which is still shown as active on the website. The project was started about 9 months ago, I had noticed the non-EP version is now NRND.
    Does the 3137-EP carry the same level of qualifications?

    I have a LaunchXL2-TMS57012 that I was working with before we had our custom board available, so I'm familiar with that part and the dev board.

    Regards,
    Ian.
  • Hi Ian,

    Ian Guffick said:
    We're using the 20216-EP which is still shown as active on the website. The project was started about 9 months ago, I had noticed the non-EP version is now NRND.
    Does the 3137-EP carry the same level of qualifications?

    I am not sure but I would encourage you to ask (with specific qualification in question) on the High Reliability forum:  https://e2e.ti.com/support/applications/hirel/f/935#pi239797013=1

    The -EP extension products at TI actually come from the Hi-Rel team so they need to answer specifics on performance and datasheet questions.  The specs may be different than the non-EP in order to meet the more stringent environmental requirements and I don't know what's different.

    However, if you hadn't told me -EP then I wouldn't have known, and it wouldn't really make any difference to the question you are asking or the issue at hand I think.   So I would still encourage you to look at the 3137 class if you can - because there are more examples available for it and the HalCoGen drivers are more full featured. Pretty much all of the 31xx, 21xx, 11xx, 07xx, 04xx is at a level past the 20xxx, 10xxx series parts because the former are the latest generation and provide better features and usually lower cost. 

    You do not need to migrate of course - just saying that if you have the chance to do so I would recommend it because it'll be a bit easier to develop with.


    You *should* be able to drop a 3137 into the footprint/board of a 20216 with minimal changes too.   I think the biggest difference is the core voltage has to be dropped from 1.5 to 1.2V but probably (hopefully) a resistor change can accomplish this. 

    If you want to look at this and have questions I'd be happy to help. 


    OK all that is just because we don't happen to have an example for DMA on the 20216 and while you can probably use the DMA files from a 3137 project - it just brought up the issue that the software for the 3137 is a little further along towards completeness.


    Back to your issue though -  the ADC is not a bus master so it cannot write to memory and corrupt the memory of the DCAN.  And you don't have DMA integrated with it yet.   So there has to be a fairly significant software issue that would be worth understanding (IMO) before bringing the DMA into the picture.  It'll just get more complex when you add DMA but the original issue is likely to still be there unresolved.


    Maybe you can let us know where you are storing these data values from the ADC and DCAN.   Have you allocated space for them using Global variables to declare their buffers?   We should be able to see these buffers in the output of the linker  (.map) file - if you are statically allocating them.


    If they are local variables declared inside the ISR then they would by definition go out of scope when you exit the ISR and it would look like the one overwrites the other when in reality this is just because they'd be on the same stack...

  • Ok, the values for the adc are stored in a static global variable, in the map file they are:
    adc1_data 0x08002400 0x200 Data Gb sys_main.o [1]
    adc2_data 0x08002600 0xe0 Data Gb sys_main.o [1]
    They are only used in the call to 'adcGetData' inside the adcNotification handler.
    Not used anywhere else, only used to check contents using the debugger.

    For CAN, the message data is stored in a local variable within the canMessageNotification, it is used for the 'canGetData' to see if we received the correct message, then modified and used for the 'canTransmit', within the same function. Again not used elsewhere at all.

    Puzzled as to why either of these would cause the problems I'm seeing.

    Regards,
    Ian.
  • Ian,

    The globals look like arrays. Any chance your code isn't staying within the bounds of the array?
    For example, if you have an array:

    int a[10];

    and you then set:
    a[20] = value;

    You'd probably be corrupting some other memory...
  • Yes, the adc values are arrays:

    volatile adcData_t adc1_data[64];
    volatile adcData_t adc2_data[28];

    ADC1 has a FIFO set to 64 and used with this function
    adcGetData(adcREG1, adcGROUP1, &adc1_data[0]);

    ADC2 has a FIFO of 28 and used with this function
    adcGetData(adcREG2, adcGROUP1, &adc2_data[0]);

    Regards,
    Ian.
  • hmm. ok not sure what the functions do but at this level it looks ok.

    if you know the address that is corrupted, the easiest thing to do would be to use the CPU's watchpoint to halt on a memory write to that address.

    you configure the watchpoint in CCS as a type of breakpoint, (i.e. it starts as a breakpoint, but you change the type to watchpoint, and then specify the type of access to halt on).
  • I've slept on it and also did some more testing.

    If I disable ADC2, but leave ADC1 and CAN enabled, then it runs fine. With ADC2 enabled it shows the fault.

    After some messing around with data breakpoints in IAR, I've found the source of the problem, but don't understand why it's happening.

    The interrupt routine calls adcGetData, but is identical for both ADCs.

    void adcNotification(adcBASE_t *adc, uint32 group)
    {
    	if(adc == adcREG1)
    	{
    		adcGetData(adcREG1, adcGROUP1, &adc1_data[0]);
    		adcDebug_count1++;
    	}
    	else
    	{
    		adcGetData(adcREG2, adcGROUP1, &adc2_data[0]);
    		adcDebug_count2++;
    	}
    }

    adcGetData is the standard library routine provided by Halcogen

    uint32 adcGetData(adcBASE_t *adc, uint32 group, adcData_t * data)
    {
        uint32       i;
        uint32  buf;
        uint32  index = adc == adcREG1 ? 0U : 1U;
        uint32       count = (adc->GxINTCR[group] >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF));
        adcData_t *ptr = data;
    
            /** -  Get conversion data and channel/pin id */
            for (i = 0U; i < count; i++)
            {
              buf        = adc->GxBUF[group].BUF0;
              ptr->value = (uint16)(buf & 0xFFFU);
              ptr->id    = (uint32)((buf >> 16U) & 0x1FU);
              ptr++;
            }
        adc->GxINTFLG[group] = 9U;
        return count;
    }
    

    When the memory starts corrupting 'ptr' is outside the range of the array, because count is set to 4294967069 !!!

    Don't know why that is happening. The only interaction I can see between ADC and CAN is the order of interrupts and the priority.

    I believe the priority would be ADC1, CAN, ADC2. And ADC1 works fine, but ADC2 fails when the CAN interrupts?

    This would point to a stack problem.

  • I've double checked my stack when the data breakpoint hits at failure.
    Peak IRQ stack size is 0x88, with a stack size set to 0x500. Supervisor stack is only 0x18, others all zero!
    Also if I look into the stack memory, I can't see any values in any way similar to the wrong value in 'count'.
    So I think the wrong value in count must come from this line in adcGetData:
    uint32 count = (adc->GxINTCR[group] >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF));

    If there was an ADC overflow or some other problem, could this give the wrong result?
  • Hi Ian,

    Some great work here.

    I'm not yet at the same point of thinking the problem is in the line where count is assigned a value though but maybe just because I havent' seen everything you've seen. It looks like the line for count is pretty complicated and has room for error - if the peripheral registers return a value that wasn't expected by the programmer of this function.

    But first, some specific questions for when the corruption occurs:

    1) can you determine what was passed to adcGetData as parameter 'data'
    if you are triggering on the corruption then you may need to make a 'copy' of this value somewhere, like add a line
    adcData_t *ptr_copy = data;
    and hope this doesn't 'fix' the corruption... ;)

    2) can you also find out what the value for count is? count = (adc->GxINTCR[group] >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF));

    again you may need to do something like adding a line:
    uint32 count_copy = count;
    and hope the problem doesn't disappear.

    I think knowing whether the issue is in the value initially passed to this function as data or in the determination of 'count' from the ADC registers is going to be the key.

    if it's really in:

    count = (adc->GxINTCR[group] >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF));

    Then what may be useful is to break that complex expression up into some temporary variables, then compute count from the temporary variables so that we can see which component of that expression is giving an 'unexpected' result.
  • Ok, I've added some debug code into 'adcGetData', using if(count>64) to catch the buffer overflow before it's starting writing over everything. Also copied variables to globals so that I can see values on local variables that are not available whilst debugging.

    adc = 0xFFF7C200 (adcREG2)
    group = 1
    *ptr points to 0x8004300 which is the start address of my global adc2_data array
    count = 4294967069 (0xFFFFFF1D) ???????
    adc->GxINTCR[group] = 65535 (0xFFFF) ???????
    s_adcFiFoSize[index][group] = 28 (correct, my FIFO size)

    This would appear to me that GxINTCR[group] of 65536 is effectively -1.
    If this value counts from 1 to 0, but the CAN interrupt is being serviced before the adc interrupt is serviced, then another adc value being added to the FIFO would give the change from 0 to -1.
    So is this a simple overflow that is not being taken into consideration for the evaluation of count in adcGetData?
  • Looked at this again, and some more testing!

    I've split out the 'count=...' line, so instead of:
    uint32 count = (adc->GxINTCR[group] >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF));

    I have :
    if((adc->GxINTCR[group] >= 256U))
    count = s_adcFiFoSize[index][group];
    else
    count = s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF);

    Looking at my wrong value for count. At the first line 'if...', GxINTCR[group] must have evaluated as less than 256 to jump to the last line. But on the last line GxINTCR[group] must have evaluated as < 255.
    So between the first and last line an ADC reading has been added to the FIFO causing GxINTCR to read 0xFFFF instead of 0, giving a race on the evaluation because of the two volatile accesses of GxINTCR[].
    Therefore GxINTCR[group] should be read into a local variable and the same value used for both parts of the evaluation.

    If I change the code in adcGetData to remove :
    uint32 count = (adc->GxINTCR[group] >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(adc->GxINTCR[group] & 0xFF));

    and replace with these two lines :
    uint32 temp_GxINTCR = adc->GxINTCR[group];
    uint32 count = (temp_GxINTCR >= 256U) ? s_adcFiFoSize[index][group] : (s_adcFiFoSize[index][group] - (uint32)(temp_GxINTCR & 0xFF));

    Then the problem has gone, ADCs continue working fine whilst CAN frames are being received. And no memory corruption!
  • Wow that is some great work Ian - I wouldn't have spotted the double read of the volatile register as a culprit.
    I will file a ticket on this issue. Thank you for your detailed analysis.

    EDIT:  Ticket is SDOCM00121232

  • Thanks for the help in tracking this down. It's good to have someone question your assumptions!
    It makes you look at things that you may have otherwise overlooked.

    Regards,
    Ian.