Several questions regarding the eFuse controller

Other Parts Discussed in Thread: TMS570LS3137

Hi.

My questions are regarding the TMS570LS3137. I've several questions regarding the eFuse controller:

Question 1:
In spnu499b.pdf, page 1718, chapter "32.3.1 eFuse Controller Connections to ESM" is stated "...If an error occurs during the eFuse controller self test, then a group one channel 41 error and a group one channel 40 error are sent to the ESM. ...".
This question is a "double-check" question: Is it assured that in case of an error during the eFuse controller self test ESM group 1 channel 40 and channel 41 are ALWAYS set together? Or is it also possible that only ONE of both channels would be set?

Question 2:
This one is again regarding the eFuse self test. Two sub questions:
2.a: In spnu499b.pdf, page 1719, chapter "32.3.2.5 eFuse ECC Logic Self Test" is stated "Verify that bits 4 to 0 of the eFuse Error Status register at address 0xFFF8C03C are zero.".
    But on page 1724, chapter "32.4.3 EFC Error Status Register (EFCERRSTAT)", Description of bit field "Error Code" (bits 4:0), is stated that only 0x18 will be set for this bit field in case an eFuse self test error occurs.
    So the question is: In case an eFuse self test error occurs, is 0x18 really the only possible error code which will be stated in bit field EFCERRSTAT[4:0]? Or could also OTHER error codes be stated in this bit field cause of an eFuse self test error?
2.b: Whether the following question is relevant depends on the answer to question 2.a. Is it possible that the following bit field values could occur (I assume not, but I just want to be dead sure):
    *) The eFuse self test is triggered.
    *) An error was identified by the eFuse self test. After the self test finished, the following bit field values will occur:
      -) Bit field EFCERRSTAT[4:0] will have another error code than 0x18 (e.g. 1).
      -) Bit EFCPINS[14] is 0.
      -) ESM group1, channel 40 is not flagged.

Question 3:
This one is again regarding the eFuse self test: The eFuse self test is started by writing 0x0000200F to the EFCBOUND register. What happens after the self test has finished, i.e.:
3.a: Will bit EFCBOUND[13] be reset automatically to 0? Will bit field EFCBOUND[3:0] be reset automatically to 0?
3.b: In case EFCBOUND[13] and EFCBOUND[3:0] will not be reset automatically to 0, and the application SW also does not reset one of both bit fields to 0: Is then the eFuse self test start condition still fulfilled? I.e. will the eFuse self test then be triggered again and again and again ...?

Question 4:
This one is regarding an uncorrectable error that occurs during the loading of the eFuse values after reset: I read the following two discussions, but they didn't answer my question:
 * http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/209778.aspx
 * http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/209680.aspx
What exactly happens in case of an uncorrectable error? Cause of this error the ESM group 3, channel 1 is flagged. I see two possibilities:
Possibility 1: The TMS570 starts NOT with execution of the first instruction which is located at the reset vector. Rather the TMS570 jumps directly to the abort handler, and executes the instructions of the abort handler. And inside the abort handler, the application SW shall check if an ESM group 3, channel 1 was flagged. One sub-question:
    1.a: Which abort handler will be entered? The data or the prefetch abort?
Possibility 2: The TMS570 starts as usual with execution of the instructions starting at the reset vector. And the check for a flagged ESM group 3, channel 1 shall be done during the startup phase.

Question 5:
This one is regarding bit "Instruc Done" of register EFCERRSTAT, i.e. EFCERRSTAT[5] (see also spnu499b.pdf, page 1724): I read the following two discussions, but to me they contradict each other:
 * http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/221205.aspx, "Posted by Chuck Davenport on Dec 14 2012 14:34 PM", is stated: "... I have investigated the use of Bit 5 of the EFCERRSTAT Register and found that this bit is used to indicate the successful completion of a command during manufacturing setting of the eFuse bits. ..."
 * http://e2e.ti.com/support/microcontrollers/hercules/f/312/p/208946/739611.aspx, "Posted by Chuck Davenport on Aug 21 2012 09:57 AM", is stated: "Bit5 of the EFCERRSTAT register is used to indicate the status of a requested self test. ..."
So I've now two questions:
5.1: I assume that the posting "Posted by Chuck Davenport on Dec 14 2012 14:34 PM" is correct (and the posting "Posted by Chuck Davenport on Aug 21 2012 09:57 AM" is not correct)?
5.2: Is it at least assured that EFCERRSTAT[5] is 0 for the following time duration: From any reset until that point in time when the eFuse self test is triggered for the first time?

Thank you and regards

Oliver.

  • Hi Oliver,

    I'll research this one and get back to you soon.

  • Oliver,

    Regarding Q1 and Q2,  I think there is an underlying assumption in the documentation.   It's stated on page 1719 this way:

    "This test should only be performed once for every device PORRST cycle. Perform the self test by following
    these steps:"

    And then also on page 1718 this way:

    "A class 1 error of the eFuse controller means that there was a failure during the autoload sequence. The
    values read from the eFuses cannot be relied on. All device operation is suspect."

    Now, if you had the complete list of error codes which we don't publish,  you'd find that there's about 3 broad categories.

    1st are error codes related to 'autoload'.   But if you get one of these, you will get the group 3 ESM error and you'll be in the category of 'all device operation is suspect' per the description on page 1718.  So you wouldn't be moving on to further self tests; you will need to enter a safe state.

    The 2nd category is a bunch of codes related to programming the efuse, which is only a factory function.   You shouldn't get these because you shouldn't be trying to program.  There's also an internal diagnostic or two that are not documented.

    3rd category is the self test of the EDAC logic (0x18).

    I believe if you:

      a) do not have any autoload errors

      b) you only run the self test of EDAC one time after each PORRST

    Then you will begin the self test with the error code 0x00.  The only possible outcomes after this are 0x00 again (no error) or 0x18 (EDAC self test error).
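
A minimal sketch of how the three outcomes described above could be told apart in software. It assumes only what the thread states: the error code sits in bits 4:0 of EFCERRSTAT, 0x00 means no error, and 0x18 means an ECC (EDAC) self-test error. The helper name and the enum are hypothetical, not part of any TI library.

```c
#include <stdint.h>

/* Bits 4:0 of EFCERRSTAT hold the error code (per the TRM passage quoted in the thread). */
#define EFC_ERR_CODE_MASK      ((uint32_t)0x1FU)
#define EFC_ERR_CODE_NONE      ((uint32_t)0x00U)
#define EFC_ERR_CODE_SELFTEST  ((uint32_t)0x18U) /* ECC logic self-test error */

typedef enum {
    EFC_CODE_OK,            /* 0x00: no error recorded */
    EFC_CODE_SELFTEST_FAIL, /* 0x18: ECC logic self test failed */
    EFC_CODE_UNEXPECTED     /* any other code: autoload, programming, or undocumented */
} efc_code_class_t;

/* Hypothetical helper: classify a raw EFCERRSTAT value by its error code field. */
efc_code_class_t efc_classify_error_code(uint32_t efcerrstat)
{
    uint32_t code = efcerrstat & EFC_ERR_CODE_MASK;

    if (code == EFC_ERR_CODE_NONE) {
        return EFC_CODE_OK;
    }
    if (code == EFC_ERR_CODE_SELFTEST) {
        return EFC_CODE_SELFTEST_FAIL;
    }
    return EFC_CODE_UNEXPECTED;
}
```

Treating every code other than 0x00 and 0x18 as a separate "unexpected" class keeps the software conservative: such a value can then be routed to the same safe-state handling as a failure.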

    If you tried to run the EDAC self test again and the starting value in 0x3C were not 0x00 then I am not sure what the value would be.   From experimenting with this I can see that running a successful self-test after a failed self test does not clear the 0x18.  You have to write a value of 0x00 to the register 0x3C to clear the 0x18.  So I suspect this register is a sort of 'accumulator' of error bits without a 'clear' mechanism except through one of the register interfaces.     But our recommendation isn't to do this, just to run once after a power on reset.

    Regarding Q3 from experimenting I can tell you that the value 0x200F stays in the boundary register after the EFUSE EDAC controller self test is run.  You can clear it back to 0x0000 and then write 0x200F again and it appears to kick off the EDAC self test again.  But the only way to really see this is to mess up the cycle count or signature (cycle count is easier to put back!) and run it again so that the result of the next run is a failure.   The self test 'done' bit 15 of address offset 0x2C seems to be only cleared by another power on reset.   This may be because we are only recommending to run the test once per power on reset.
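
The re-trigger behaviour described above (0x200F stays in the boundary register, so it is cleared to 0x0000 and written again) could be sketched as follows. The `efc_regs_t` struct is a hypothetical stand-in for the memory-mapped registers, so the code can also run on a host; on the target, a pointer to the real register block from the TRM would be passed instead. The function name is also an assumption.

```c
#include <stdint.h>

/* Hypothetical view of the two registers used here. On real hardware these
 * are memory-mapped; take the addresses from the TRM register map. */
typedef struct {
    volatile uint32_t efcbound; /* boundary control register (EFCBOUND) */
    volatile uint32_t efcpins;  /* pins register (EFCPINS), offset 0x2C per the thread */
} efc_regs_t;

#define EFC_BOUND_ECC_SELFTEST  ((uint32_t)0x0000200FU) /* value that starts the ECC self test */
#define EFC_PINS_SELFTEST_DONE  ((uint32_t)1U << 15)    /* 'done' bit, only cleared by PORRST */

/* Start (or restart) the ECC logic self test.
 * Returns 1 if the sticky 'done' bit was already set before the trigger,
 * in which case it cannot be used to detect completion of this run. */
int efc_trigger_ecc_selftest(efc_regs_t *efc)
{
    int done_already_set = ((efc->efcpins & EFC_PINS_SELFTEST_DONE) != 0U) ? 1 : 0;

    efc->efcbound = 0x00000000U;            /* clear the stale trigger value */
    efc->efcbound = EFC_BOUND_ECC_SELFTEST; /* write 0x200F again to start the test */

    return done_already_set;
}
```

The return value makes the limitation explicit: after the first run in a PORRST cycle, a caller has no 'done' edge to wait for, which is consistent with the recommendation to run the test once per power-on reset.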

    Regarding Q4 I believe the answer is you don't know what's going to happen exactly.  The fuse bits in part are necessary for the memories on the device to work properly and if you don't know what the memory is going to do it's very hard to say what the CPU will do.   Except in this case the Group 3 ESM will trigger the ERROR pin and you can use that at the system level.

    From a practical matter, I suppose you have to write the error handler code to assume that you will be able to get some control of the part but I'd probably make sure it then goes to code that will put you in a safe state.  You really don't know for sure the CPU will execute this code if that severe an error occurs, but you can at least have it do something rational in case  the code does execute.

    Q5 - I do not see Bit 5 of the ERRSTAT being set after an EFUSE self test.  I don't think EFUSE EDAC self test falls into the same category of instruction as this bit refers to.  There are other undocumented registers where you can instruct the controller to do things like 'program' and I believe this is what that bit 5 would indicate is complete.

    The EDAC self test complete goes to bit 15 of the EFCPINS register.  But this seems to be set on the first completion and only reset by PORRST.  So it's good but only seems to be good once after each power on reset.   Which I think aligns with that recommendation to run the EDAC self test only after a power on reset.  
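
A conservative way to turn these status bits into a single verdict is sketched below. It assumes EFCPINS[15] is the self-test 'done' bit (as stated above), that EFCPINS[14] is the self-test error bit (as implied by question 2.b), and that any non-zero error code in EFCERRSTAT[4:0] counts as a failure. Bit 5 of EFCERRSTAT is deliberately ignored, since it does not appear to relate to the self test. All names are hypothetical.

```c
#include <stdint.h>

#define EFCPINS_SELFTEST_DONE   ((uint32_t)1U << 15) /* set on first completion, reset by PORRST */
#define EFCPINS_SELFTEST_ERROR  ((uint32_t)1U << 14) /* assumed self-test error bit */
#define EFCERRSTAT_CODE_MASK    ((uint32_t)0x1FU)    /* error code field, bits 4:0 */

typedef enum {
    EFC_ST_PENDING, /* 'done' bit not set yet */
    EFC_ST_PASS,    /* done, no error indication at all */
    EFC_ST_FAIL     /* done, and at least one error indication present */
} efc_selftest_verdict_t;

/* Hypothetical helper: combine raw EFCPINS and EFCERRSTAT values into one verdict.
 * Any single error indication is enough to report a failure. */
efc_selftest_verdict_t efc_selftest_verdict(uint32_t efcpins, uint32_t efcerrstat)
{
    if ((efcpins & EFCPINS_SELFTEST_DONE) == 0U) {
        return EFC_ST_PENDING;
    }
    if ((efcpins & EFCPINS_SELFTEST_ERROR) != 0U) {
        return EFC_ST_FAIL;
    }
    if ((efcerrstat & EFCERRSTAT_CODE_MASK) != 0U) {
        return EFC_ST_FAIL;
    }
    return EFC_ST_PASS;
}
```

Because the 'done' bit is only reset by PORRST, this verdict is only meaningful for the first self-test run after each power-on reset.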

    Will discuss these answers more w. colleagues and get back to you if there are any updates.   But for now, I wanted to give you a preliminary view of what I've found so far.

  • Anthony,

    thank you for your fast preliminary feedback.

    Regards

    Oliver.

  • Hi Anthony,

    any news regarding this topic?

    Regards

    Oliver.

  • Hi Oliver,

    Not yet, sorry.

     

  • Hi Anthony,

    this thread is very helpful and clarified quite a few questions for me.

    I have a related question: can the stuck at zero test be run multiple times in one PORRST cycle? My idea is that if this test fails, I can warm reset the CPU and re-run the EFUSE stuck at zero test. If it was a soft error, hopefully the stuck at zero test will pass the 2nd time. If it is a hard error, then the system will stay in an endless loop (reset - stuck at 0 test - failure - reset). This endless loop is acceptable for us. Is there any bit related to the stuck at zero test (like bit 15 of EFCPINS) which can only be cleared with PORRST?

  • Hi.

    Any news regarding the post from "Posted by on Nov 25 2013 18:00 PM"?

    Regards

    Oliver.

  • Hi.

    Any news regarding the post from "Posted by on Nov 25 2013 18:00 PM"? That's already about 3 weeks ago!

    Regards

    Oliver.

  • @ Libo:

    The eFUSE Autoload & other scan-chain diagnostics only run on a power on reset (PORRST\).   So you would need a PORRST\, not a warm reset, to try again after a fault occurs.   However, if one of these errors occurs you can't rely on any code operating, therefore this PORRST\ would need to be externally generated.

    EDIT:  I got my own wires crossed here.  The stuck at 0 test is testing that the signal from the EFUSE controller to the ERROR pin will trigger *if* the autoload diagnostics fails.  So while the autoload diagnostic happens after a power on reset, the 'stuck at 0 self test' is something that you trigger in your code.   Sorry for the confusion on my part.

    @ Oliver:

    After more discussion, the intent is to run the ECC diagnostic once after each power on reset, as the manual indicates.  What I got after discussing w. our safety expert is that the typical transportation application will have a limited time of operation between cycles - < 30min typically and < 6-7 hours max.   So it's usually sufficient to test after each power on reset.    On the other hand some industrial applications that run continually might need to synthetically force a power on reset so that this diagnostic is run at regular intervals.  I believe that's the intention in the safety manual as opposed to running just the ECC part of the diagnostic periodically at runtime.
    BTW - The eFUSE values are only *important* on a power on reset.  They are scanned out and then the eFUSE is not accessed again until the next power on reset.  There is supposed to be a spreadsheet that comes with the safety deliverables in which you can evaluate the effects of varying the period w. which this diagnostic is run on the various coverage metrics.   Further questions about this probably should go to the private safety forum, since I have limited understanding of how these spreadsheets work.     

  • Hi Anthony,

    thanks for the answer. The TRM explicitly declares that the ECC logic self test can only be run once for each PORRST\ cycle. But it doesn't state this constraint for stuck at 0 test. Please add this information also into the TRM.

    In our software, the test flow can be implemented to run into endless loop if stuck at 0 test occurs. But in such case, a DWWD reset (warm reset) may be triggered since it is not fed in time. According to your answer, I need to make sure that the stuck at zero test is not run again after a warm reset. During debug I see that EFCBOUND is not reset by a warm reset although the TRM shows the reset value is -0. Could you please confirm this register is not cleared by a warm reset?

    I am going to use this register after a reset to test if the stuck at 0 test was started once (if it is 0x003FC000, this means the test was started once and failed. If the test passed once, it should have  been changed to 0x0000200F to run ECC logic test).

     After PORRST, I see in CCS that EFCPINS is 0x000002E0. But according to TRM, it should be 0x00000000. Could you please double check this?

    Thank you!

    Libo

  • Hi Libo,

    I was getting the self test stuck at 0 confused with an actual stuck at type fault, and answered the above incorrectly.  I edited the above response. 

    I'll check into your questions.  Thank you for catching this. 

  • Hi Anthony,

    Now I am really confused about the stuck at 0 test.

    My question: May the stuck at 0 test be triggered multiple times for each PORRST\ cycle? Or more exactly, may a stuck at 0 test be started again after a warm reset?

    Thank  you!


    Libo

  • Anthony,

    for me all questions are now answered except my question 4 (see "Posted by Oliver Gründonner on Nov 21 2013 02:39 AM" above): You answered this question in "Posted by Anthony F. Seely on Nov 25 2013 18:00 PM", but to be honest I'm not 100% satisfied. Reason:
    a. I understand that we have to write error handler code to catch an Uncorrected Load Failure of the eFuse Controller, in case the TMS570 starts execution of the code (IMHO this is covered by spna106d.pdf, page 3, chapter 2, point 7). No questions regarding this point.
    b. I also understand that in case of an Uncorrected Load Failure the nERROR pin will be forced LOW. Also no questions regarding this point.
    c. What I don't understand: Would it be possible for a user of the TMS570 to argue in the following way:
        "In case an Uncorrected Load Failure occurs the safe state will be maintained, because of the following two reasons:
          1. In case the code execution is started, the failure will be caught by the error handler (as described above in point a). And the error handler enters the safe state.
          2. In case the code execution is not started at all, the safe state is also maintained, because the TMS570 is held in a kind of reset state. (Just as a note: I made this assumption because of http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/209680.aspx, "Posted by KGreb on Aug 23 2012 07:44 AM".)"
    d. Or is the argumentation given in point c not valid and/or not sufficient? I.e. does TI see the need to catch an Uncorrected Load Failure of the eFuse controller with an "external HW circuit" that monitors the nERROR pin directly after system startup?

    I hope you get my point. The intention of my questions is just to find out how to handle an Uncorrected Load Failure of the eFuse controller.

    Regards
    Oliver.

  • Hi Anthony,

    first happy new year!

    Any update here?

    Thanks!

    Libo

     

  • @Libo:

    Thanks and Happy new year as well -

    Let me try to summarize the purpose of the stuck at test.  You are not testing the EFUSEs.   What you are testing are the 'wires' between the EFUSE controller and the ESM to make sure that these connections are working.   This would be the error signals.  So you program the BOUNDARY register to take control of these wires overriding the actual error status.  Then you can force a fake error to make sure that the message gets to the ESM.

    So the BOUNDARY register is all about forcing the output of the EFUSE controller.  There's no status in there to read.  I believe you could run this test multiple times after power on reset, but I don't think it would make much sense to run it any more frequently than the EFUSE ECC logic test ... so even if you could, you probably wouldn't.

    Regarding warm versus power on reset, instead of deducing this by the default register value, I'd suggest checking the SYSESR register.  There is a PORST bit in this register that tells you if the last reset was a power on reset. You will need to test this bit, store the result in RAM somewhere, and then clear the SYSESR register because the flags there are not cleared by reset... they accumulate.  But it's a more reliable way to tell PORRST v.s. WARM reset.
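
The test, store, and clear sequence for SYSESR could look roughly like this. Two details are assumptions not stated in the thread and should be checked against the TRM: that PORST is bit 15 of SYSESR, and that the flags are cleared by writing 1 to them. The function name is hypothetical; the register is passed as a pointer so the sketch does not hard-code an address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: PORST is bit 15 of SYSESR. Verify against the TRM for your device. */
#define SYSESR_PORST ((uint32_t)1U << 15)

/* Reads the reset-source flags once, keeps a copy in RAM, clears them, and
 * reports whether the last reset was a power-on reset.
 * 'sysesr' points at the SYSESR register; 'saved_flags' receives the snapshot. */
bool reset_was_power_on(volatile uint32_t *sysesr, uint32_t *saved_flags)
{
    uint32_t flags = *sysesr; /* snapshot: the flags accumulate across resets */

    *saved_flags = flags;     /* keep a copy for later startup decisions */
    *sysesr = flags;          /* assumed write-1-to-clear: clears only the flags just read */

    return (flags & SYSESR_PORST) != 0U;
}
```

Calling this once, early in startup, and deciding from the saved copy whether to run the eFuse self tests avoids inferring the reset type from register default values.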

    Regarding the EFCPINS register - the TRM is wrong with respect to the bits marked 'reserved'.  Some of these bits are actually tieoffs to the module to select behaviors that we don't need to get into for the purpose of this discussion, and some of these are tied to '1'.  I checked and your value looks correct based on the tieoffs.

    Please treat the reserved bits as 'x' (you don't know their state) and so mask them off and don't test them.  This is the simplest approach for now.    We'll need to update this chapter though;  thanks for pointing this out.
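
Masking off the reserved bits before testing EFCPINS might be sketched as below. Only the two self-test bits discussed in this thread (bit 15 'done' and bit 14 'error') are included in the mask; bit 14 is taken from question 2.b, and any further documented bits would need to be added from the TRM. The names are hypothetical.

```c
#include <stdint.h>

#define EFCPINS_SELFTEST_DONE   ((uint32_t)1U << 15)
#define EFCPINS_SELFTEST_ERROR  ((uint32_t)1U << 14)

/* Only bits that software is meant to test; reserved bits are treated as 'x'. */
#define EFCPINS_TESTED_MASK     (EFCPINS_SELFTEST_DONE | EFCPINS_SELFTEST_ERROR)

/* Hypothetical helper: strip reserved (tie-off) bits from a raw EFCPINS value. */
uint32_t efcpins_tested_bits(uint32_t raw_efcpins)
{
    return raw_efcpins & EFCPINS_TESTED_MASK;
}
```

With this mask, the value 0x000002E0 observed after PORRST reduces to 0, which matches the state the TRM reset value implies for the tested bits.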

    BTW - the EFUSE registers seem to be tied to the power on reset as their reset.  Still I'd use the SYSESR to distinguish if I were you.

    @Oliver:

    If an autoload error occurs,  the nERROR pin will be driven.  I'm trying to get confirmation on whether or not the CPU actually is held in reset.   But if it's released from reset, the argument is that it's improbable it will execute incorrectly *and* clear the nERROR pin.   On the other hand because critical components are affected by EFUSE we can't really say the CPU will execute code correctly.  

  • @Oliver,

    The CPU is not held in reset due to an autoload error.  Reset will be released, but again you can't rely on the CPU to execute correctly in this state.

     

     

  • Hi Anthony,

    thanks for the answer. As I mentioned, I can repeat the stuck at 0 test after warm reset. Originally I would re-run the test to exclude a soft error on the error report wires. But I think whether this makes sense is more of a philosophical question. If a soft error occurs in the stuck at 0 test, it is no longer certain whether such a soft error also occurred during autoload. So from a very conservative perspective, the microcontroller should not be started.

    (FYI: The verify answer button disappears.)

    Regards,

    Libo

  • Hi Anthony.

    Also for you all the best for 2014.

    Thank you for your response, I understood it. So the argumentation in my post from "Posted by Oliver Gründonner on Dec 19 2013 08:39 AM", point c is NOT valid.

    So from my side the following question is still open: What does TI suggest to handle an Uncorrected Load Failure of the eFuse controller in a SAFETY CRITICAL ASIL D project?
    (Perhaps a possible answer is the one from me in "Posted by Oliver Gründonner on Dec 19 2013 08:39 AM", point d.)

    Regards
    Oliver.

  • Hi Oliver,

    Happy new year to you as well!  

    Yes, the eFUSE autoload fail triggers the nERROR and this must be used at the system level by an external component.

    I'd just add that this is for a real autoload failure on power on reset.   The stuck at test (version A) also allows you to activate this path to the nERROR pin but that's for testing the nERROR pin mechanism; so the external device should be smart enough to distinguish by the duration of nERROR assertion or some other mechanism the difference between a real nERROR and a 'stuck at 0 test' nERROR.   That's a system/software level problem though - we don't have a fixed proposal for this other than allowing the control of the nERROR pin.
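
One possible shape for the duration-based distinction mentioned above, as it might run on the external monitoring device. This is purely illustrative: the function, the enum, and the idea of a single configurable threshold are assumptions, and the actual pulse length produced by a stuck at 0 test has to be determined for the specific system.

```c
#include <stdint.h>

typedef enum {
    NERROR_NONE,       /* nERROR was not asserted */
    NERROR_TEST_PULSE, /* short assertion, consistent with a deliberate self test */
    NERROR_REAL_FAULT  /* assertion longer than any self test should produce */
} nerror_class_t;

/* Hypothetical classifier for the external monitor.
 * low_time_us:       measured time nERROR was held low, in microseconds.
 * max_test_pulse_us: longest assertion the system designer allows for a
 *                    self-test pulse (system-specific, not a TI-specified value). */
nerror_class_t classify_nerror(uint32_t low_time_us, uint32_t max_test_pulse_us)
{
    if (low_time_us == 0U) {
        return NERROR_NONE;
    }
    if (low_time_us <= max_test_pulse_us) {
        return NERROR_TEST_PULSE;
    }
    return NERROR_REAL_FAULT;
}
```

A real monitor would likely also check that a test pulse arrives only inside the expected startup window, so that a short glitch at runtime is not mistaken for a self test.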

    Best Regards,

    Anthony

  • Hi Libo,

    Yes, if the stuck at test fails, it's still considered a serious error (Class 2 - see section 32.3.2.2 of SPNU499B).

    The reason is that you don't know at the system level if there was an autoload error because the path to the nERROR pin is broken.  

    Best Regards,

    Anthony

  • Hi Anthony,


    yes, you are right if it is a hard error. My concern is more related to a soft error which occurs at some time point (for example at the first stuck at 0 test) and disappears later (for example at the 2nd stuck at 0 test). In my original plan, I would reset the CPU and re-run the stuck at 0 test if it fails. If it is a hard error, then the software flow will stay in an endless loop ( test --- failure --- warm reset ---- test again ---- failure --- reset ....). This is also safe for the application (the system can't be started because of the endless loop). But if it is a soft error and the test passes at the 2nd try, then I can continue with next steps. But in this case, there is no way to know if the soft error also occurred during auto load.

    Therefore, I agree with you:  Yes, the test can be run multiple times. But even in the case of soft error, it should not be re-run to make sure the system stays in a safe state ("from a very conservative perspective" with regard to soft error). 

    Thanks and regards,

    Libo

    as I had originally planned.

  • Hi Anthony,

    thank you for your fast feedback in "Posted by Anthony F. Seely on Jan 07 2014 09:40 AM".

    I understand that in case of an uncorrectable Load Failure of the eFuse controller, the only reliable statement which TI can give for sure is that the nERROR pin will be driven low. And as a consequence the only reliable solution would be to catch this at the system level by a TMS570-external component.

    But our projects, which use the TMS570, are already in such a phase, where such HW changes are very complicated to realize. Cause of this fact, and after internal discussions we decided NOT to use a TMS570-external component. Rather we decided to use a chain of reasoning, through which we argue that it is very likely that in case of an uncorrectable Load Failure (of the eFuse controller) the provided safety mechanisms will identify this fact, and so the safe state will be reached.

    Regards
    Oliver.

  • Thanks Oliver.

    Oliver,  Libo,  are there open items on this thread ?  I think we've gotten them and I'll close the thread in a few days unless you advise me otherwise.

    Thanks and Best Regards,

    Anthony

  • Anthony,

    no, from my side there are no open points.

    Thanks for the discussion and regards

    Oliver.

  • Hi Anthony,


    I don't have any further questions. Thanks for your help!

    Regards,

    Libo

  • Thanks Oliver, Libo,  especially for your patience w. this question.