C6678 Configuration Bus Issues

SPH

I’ve encountered a couple of issues with peripheral initialisation on the C6678. I have workarounds for both which seem reliable but I would really like to understand what the issues are. This will allow me to be confident both that the workarounds are robust and that we are unlikely to see any further such issues.

Issue 1

I currently use 8 of the maps within the sRIO RXU to direct messages onto a different queue for each core. Each core initialises its own map entry and, since the Rx channels have to be disabled and re-enabled around this programming, I use a semaphore to mutually exclude the cores, so that only one will be writing to the registers at any given time. After a fair amount of debugging, I found that the queue programming was not reliable: of the order of 1% of the time, writing the RXU_MAP_L, RXU_MAP_H and RXU_MAP_QID (in that order) for entry n would corrupt the RXU_MAP_L register for entry (n + 1). The corrupted value would always have the top 24-bits as 0, with the bottom 8-bits having a value not obviously related to the values written. My workaround, which appears reliable, is to read the RXU_MAP_L register for entry (n + 1) before programming entry n, then check whether entry (n + 1) has been corrupted and reinstate it if so.
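The workaround described above can be sketched roughly as follows. This is only an illustration: the `rxu_map_entry_t` layout (three consecutive 32-bit registers per entry), the `NUM_RXU_MAPS` count and the function name are assumptions made for the sketch, not taken from a TI header. The caller is assumed to already hold the semaphore and to have disabled the Rx channels.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_RXU_MAPS 64u /* assumption: number of RXU map entries */

/* Hypothetical layout of one RXU map entry: L, H and QID back to back. */
typedef struct {
    volatile uint32_t map_l;
    volatile uint32_t map_h;
    volatile uint32_t map_qid;
} rxu_map_entry_t;

/*
 * Program entry n in the order L, H, QID, then check RXU_MAP_L of
 * entry n+1 against the value it held beforehand.
 * Returns 1 if entry n+1 had changed and was reinstated, 0 otherwise.
 */
static int program_rxu_map(rxu_map_entry_t *maps, unsigned n,
                           uint32_t l, uint32_t h, uint32_t qid)
{
    assert(n + 1u < NUM_RXU_MAPS);

    /* Snapshot the neighbour before touching entry n. */
    uint32_t next_l_before = maps[n + 1u].map_l;

    maps[n].map_l = l;
    maps[n].map_h = h;
    maps[n].map_qid = qid;

    /* Re-read the neighbour and repair it if it no longer matches. */
    if (maps[n + 1u].map_l != next_l_before) {
        maps[n + 1u].map_l = next_l_before;
        return 1;
    }
    return 0;
}
```

Returning a flag rather than silently repairing makes it easy to count how often the corruption occurs, which is the "detect rather than correct" mode used later in this thread.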

Issue 2

During initialisation, I was getting a configuration bus error, with error “Write Error” and status “Success”. What does this mean? Is there a register which gives the faulting address, like there is for the other memory protection units? When this was occurring, I was able to narrow the cause down to the initialisation of one of the EDMA3 instances that is used by all 8 cores to transfer data over the PCIe interface. During initialisation – and after core 0 does the fundamental initialisation of the peripheral – all the cores were writing to the same set of a few registers (actually, this bit of code should be re-factored so that only one core writes these registers in the peripheral and the remaining cores only initialise their own state but what’s there works until I have the time to do this) so I tried, on a whim, mutually excluding this portion of the initialisation and I have not seen the problem again. Unless this has simply shifted the timing and that has caused the problem to become benign, this implies that there is a requirement to mutually exclude certain types of access to some registers on the configuration bus. I could not find this specified in the documentation on the device; please could you provide further guidance on this.

Note that I have been developing using TMX silicon, if that makes a difference, though Issue 1 has definitely been seen on TMS silicon as well.

Regards,

SPH

over 13 years ago

0 tscheck over 13 years ago

TI__Mastermind 23525 points

For issue #1, I'm guessing it is related to Advisory 15 in http://www.ti.com/lit/er/sprz334e/sprz334e.pdf. Can you confirm that you are disabling the various mapping registers via the tt=11 workaround, and you are not making changes to any SRIO registers in the 1B400h to 1B9FCh range before you read the bad values in the mapping registers?

Regards,

Travis

0 Steven Ji over 13 years ago

TI__Genius 12065 points

SPH,

For issue#2, do the 8 cores access the same EDMA global registers for the initialization simultaneously please? It may cause the resource conflict.

There are 8 shadow regions in EDMA and the user could allocate different channel/TCC resource to each region. Then each core could access each shadow region to setup the EDMA configuration without accessing the same global registers. It is designed for the multicore control and you could take a look at section 2.7 "EDMA3 Channel Controller Regions" and section 2.10 "Memory Protection" in the EDMA user guide for details.
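As a rough illustration of the partitioning idea, the sketch below computes a non-overlapping channel mask for each of the 8 regions. The even split, the 64-channel count and the function name are assumptions for the example. On hardware the mask would, as far as I understand it, be written to the region's access-enable registers (DRAE for channels 0-31, DRAEH for channels 32-63), which this sketch does not touch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DMA_CHANNELS    64u /* assumption: channels on this EDMA3 instance */
#define NUM_REGIONS         8u  /* one shadow region per core */
#define CHANNELS_PER_REGION (NUM_DMA_CHANNELS / NUM_REGIONS)

/* Bitmask of the DMA channels assigned to one shadow region (even split). */
static uint64_t region_channel_mask(unsigned region)
{
    assert(region < NUM_REGIONS);

    /* A block of CHANNELS_PER_REGION consecutive ones... */
    uint64_t block = (UINT64_C(1) << CHANNELS_PER_REGION) - 1u;

    /* ...shifted to the region's slice of the channel space. */
    return block << (region * CHANNELS_PER_REGION);
}
```

With this split, core n would own channels 8n to 8n+7 and no two cores share a channel.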

Hope it could help prevent the issue you have seen.

0 SPH over 13 years ago in reply to tscheck

Intellectual 635 points

Hi Travis,

Thanks for your reply. I am aware of that erratum and have confirmed that I am not writing to any of the MAC registers, either during initialisation or normal operation, so - by my reading of the erratum - I should not be experiencing this problem. Furthermore, the corruption only appears to occur (when it does occur - the probability is, as I mentioned, fairly low) when one of the cores is doing the following sequence:

Read RXU_MAP_L(n+1) - the value is correct
Write RXU_MAP_L(n)
Write RXU_MAP_H(n)
Write RXU_MAP_QID(n)
Read RXU_MAP_L(n+1) - occasionally the value is corrupted

Writing to these registers is mutually excluded so, while the other cores may be writing to other registers at the time, they are not writing to any other RXU_MAP registers. This leads me to conclude that the corruption is specifically related to this sequence, rather than some erroneous write elsewhere, possibly in conjunction with some specific - though seemingly unrelated - behaviour on another core.

Regards,

SPH

0 SPH over 13 years ago in reply to Steven Ji

Intellectual 635 points

Hi Steven,

Thanks for your reply. Yes, all the cores do access a few of the same global registers. As I have said, this is not ideal but I've ported our C64x+ driver and, at the moment, I do not wish to re-factor beyond the minimum necessary to get the C6678 going. I've read the sections that you mention in the User's Guide (and the EDMA3 section in the device manual) but don't see anything indicating the necessity to mutually exclude access to the registers in the global region - in fact the EDMA3 UG states (without caveat) "You can design the application software to use regions or to ignore them altogether"; is this potential resource conflict documented anywhere?

If this explains the configuration bus error, that would suggest that my workaround is robust. However, the obvious next question is: to which other registers in other peripherals does this same requirement exist?

Regards,

SPH

0 Steven Ji over 13 years ago in reply to SPH

TI__Genius 12065 points

SPH,

Let's take one step back. May I ask what configuration bus error you are seeing please?

Are we talking about the "CFG Bus Error Register" (ECFGERR, 0x01820408) in the CorePac please?

The "ERR" bit field shows 0x4 indicating "CFG write status error detected" and "STAT" shows "0" indicating "Success or unrecognized RID/WID", are they correct?

And what is the "XID" (transaction ID) showing please?

Do you see this error on every CorePac or only a couple CorePacs please? Will the error happen on random CorePac or always CorePac_x will show this kind of error please?

0 SPH over 13 years ago in reply to Steven Ji

Intellectual 635 points

Hi Steven,

I'm afraid that I will have to answer from memory as my debug information got deleted and I haven't got time to reproduce the problem today.

>> Are we talking about the "CFG Bus Error Register" (ECFGERR, 0x01820408) in the CorePac please?

Yes.

>> The "ERR" bit field shows 0x4 indicating "CFG write status error detected" and "STAT" shows "0" indicating "Success or unrecognized RID/WID", are they correct?

Yes, that is what I saw - though I only recall the document stating that STAT = 0 was Success.

>> And what is the "XID" (transaction ID) showing please?

I'm afraid that I don't recall: I will try to reproduce this next week and get this information for you.

>> Do you see this error on every CorePac or only a couple CorePacs please? Will the error happen on random CorePac or always CorePac_x will show this kind of error please?

I don't recall seeing a pattern but, when I get around to reproducing it, I will do some analysis on this.

Regards,

SPH.

0 tscheck over 13 years ago in reply to SPH

TI__Mastermind 23525 points

Hi SPH,

I'm kind of at a loss then... I should have asked if you were using 1.0 or 2.0 silicon too, since the aliasing issue was fixed in the 2.0 silicon. The corrupted value is not matching anything in the corresponding aliased MAC register, correct? That would be a simple thing to check too. Honestly, there is nothing different about these MMR registers than any others. There are no race conditions and no restrictions on writing/reading the values. Writing one mapping table entry has no effect on another. When you are reading/writing these registers, are you using variables for the values? Where are these values being stored? Is there a chance that they could be cached? Are you reading via CPU or CCS window?

Not sure what else to check, I'm pretty sure that this is not a HW issue, or it would have been reported. Is there any way you could duplicate on an EVM, where we could see it?

Regards,

Travis

0 SPH over 13 years ago in reply to tscheck

Intellectual 635 points

Travis,

We haven't taken delivery of any 2.0 silicon yet; this was all on TMX and TMS 1.0 silicon. I believe that we have some cards with 2.0 silicon arriving soon so I may be able to get hold of one to try, if we believe that it could be an aliasing issue.

I've removed the work around and added some additional debug when I detect that it occurs. In the first case that I've captured:

Before programming the MAP4 entry, 0x0290043c reads 0xff000000
I program the RXU_MAP4 entries
0x0290043c now reads 0x00000099
0x0291b43c reads 0x00000000
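For reference, the aliased address checked above can be derived as in the sketch below. The constant offset of 0x1B000 is inferred from the two addresses in this capture together with the 1B400h to 1B9FCh range quoted from the erratum; the base address and the names are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SRIO_CFG_BASE     0x02900000u /* assumption: SRIO configuration base */
#define SRIO_ALIAS_OFFSET 0x0001B000u /* inferred: 0x0291b43c - 0x0290043c */
#define SRIO_ALIAS_FIRST  0x0001B400u /* range quoted from the erratum */
#define SRIO_ALIAS_LAST   0x0001B9FCu

/* Address of the register that a mapping register may alias with. */
static uint32_t srio_alias_address(uint32_t map_reg_addr)
{
    uint32_t alias = map_reg_addr + SRIO_ALIAS_OFFSET;

    /* Only meaningful for registers whose alias lands in the quoted range. */
    assert(alias - SRIO_CFG_BASE >= SRIO_ALIAS_FIRST);
    assert(alias - SRIO_CFG_BASE <= SRIO_ALIAS_LAST);
    return alias;
}
```

In the capture above the alias reads 0x00000000 while the corrupted register reads 0x00000099, so the bad value does not appear to come from the aliased register.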

>> When you are reading/writing these registers, are you using variables for the values?

No, the values are calculated in place and the values that have been written by the code are always as expected

>> Are you reading via CPU or CCS window?

We don't use CCS (we are working in a heterogeneous multi-processor system so we have developed our own tools to allow efficient debugging across the entire system) so it is all read by the CPU.

>> Is there any way you could duplicate on an EVM, where we could see it?

Unfortunately we don't have an EVM.

Thanks for your continuing help,

SPH

0 SPH over 13 years ago in reply to SPH

Intellectual 635 points

Hi Steven,

I've had a chance to reproduce this so can give you some more information. Note that this was on TMS 1.0 silicon (I had implemented the workaround prior to getting anything other than TMX parts so this is the first time that I've seen the problem on TMS silicon). I've reproduced the problem 3 times:

Run 1: Cores 1 & 6 both got exceptions
   Core 1 ECFGERR was 0x80000D00
   Core 6 ECFGERR was 0x80000C00
Run 2: Core 2 got the exception
   ECFGERR was 0x80000D00
Run 3: Cores 2 & 5 both got exceptions
   Core 2 ECFGERR was 0x80000C00
   Core 5 ECFGERR was 0x80000E00

Please let me know if there is any other information I can get you,

SPH

0 Steven Ji over 13 years ago in reply to SPH

TI__Genius 12065 points

SPH,

Thanks for collecting the info of the error registers.

It indicates we have "CFG write status error" with "unrecognized RID/WID". The Transaction ID (such as 0xC/0xD/0xE shown in the register) is basically a counter and assigned by the CorePac to each transaction on-the-fly.
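A small decoder makes the reported values easier to read. The bit positions below (ERR in bits 31:29, XID in bits 11:8, STAT in bits 2:0) are an assumption chosen to be consistent with the values quoted in this thread; they should be checked against the CorePac User Guide before being relied on. The type and function names are made up for the sketch.

```c
#include <stdint.h>

/* Decoded view of the CFG Bus Error Register (ECFGERR). */
typedef struct {
    unsigned err;  /* error type, 0x4 = CFG write status error */
    unsigned xid;  /* transaction ID assigned by the CorePac */
    unsigned stat; /* status, 0 = success or unrecognized RID/WID */
} ecfgerr_fields_t;

/* Split a raw ECFGERR value into its fields (assumed layout, see above). */
static ecfgerr_fields_t decode_ecfgerr(uint32_t raw)
{
    ecfgerr_fields_t f;
    f.err  = (raw >> 29) & 0x7u;
    f.xid  = (raw >> 8) & 0xFu;
    f.stat = raw & 0x7u;
    return f;
}
```

Under this layout 0x80000D00 decodes to ERR = 0x4, XID = 0xD, STAT = 0, which matches the interpretation given here.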

The error normally happens when we have configuration write access to any reserved or invalid configuration register space. Unfortunately we do not have the fault address registers to capture which instruction causes the error.

But since you were able to narrow down the issue was due to multicore accessing EDMA registers, could you check if every CorePac is accessing the valid register address please?

If you think nothing wrong with the register accessing, could you share the example project which could re-produce the issue that we can analyze further please? Thanks a lot.

0 SPH over 13 years ago in reply to Steven Ji

Intellectual 635 points

Hi Steven,

I believe that all cores are accessing the correct registers (albeit more times than are strictly required) as, when the configuration bus error does not occur, the EDMA3 does what it has been programmed to do. It looks to me like there is a race hazard of some sort when the initialization is not mutually excluded.

I don't believe it is practical to share the project as the dependencies are system wide and, even if I was able to get all those to you, it would build an image for our specific hardware. However, if you can provide me a way to get it to you, I can let you have a copy of the EDMA3 driver and explain the usage that is causing the problem. I would hope that this allows you to reproduce the problem fairly easily.

Best regards,

SPH

0 SPH over 13 years ago in reply to SPH

Intellectual 635 points

Steven, Travis,

As another data point, I've managed to get my hands on a card with rev 2.0 silicon today and reran the tests: I still see the EDMA3 problem but, after ~600 cycles (with 2 6678s so 1200 boots), I have yet to see the sRIO issue.

Regards,

SPH

0 Steven Ji over 13 years ago in reply to SPH

TI__Genius 12065 points

SPH,

To further debug this, it will be good if you are able to simplify the test case and share only the EDMA part here which could re-produce the issue.

We just need the exact accessing sequence and specific EDMA registers accessing, which could trigger the issue you are observing.

Ideally the configuration port should be able to arbitrate the multicore accessing to the configuration resource. The CFG bus error will only happen when the address decoded as reserved space.

And there is EDMA3 LLD driver from TI which could be downloaded here:

http://software-dl.ti.com/dsps/dsps_public_sw/sdo_tii/psp/edma3_lld/index.html

And some introduction of the LLD driver

http://processors.wiki.ti.com/index.php/Programming_the_EDMA3_using_the_Low-Level_Driver_(LLD)#EDMA3_LLD_Download_.2F_Contributed_Examples

It might be good to compare with your own driver to see if any resource management needs to be taken care of.

Hope it helps.

0 SPH over 13 years ago in reply to Steven Ji

Intellectual 635 points

Hi Steven,

It is not that the test case is complex, it is that there is a lot of infrastructure required to load, run and retrieve the post-crash analysis information from the DSP in our system. Anyway, I've added in some additional debug and I now have some additional evidence. Firstly, I found that adding much debug (each debug point costs ~20ns) stops the problem from occurring; this possibly suggests that having each core performing back-to-back writes could be a prerequisite to seeing the problem. Secondly, I've been able to align the cores to an accuracy of <100ns and think that I have been able to narrow it down a bit. First, some background on how we bring the DSP up:

Core 0 does the global initialisation for all drivers plus its core-local initialisation
Note that since the EDMA3 driver is a port of the C64x+ version (and it is still used on the C64x+ DSPs in our system), some global initialisation is in functions that also perform local initialisation. This results in some registers being written by each core, and not just core 0. I haven't refactored since I've not found anything in the documentation that precludes this.
When it reaches the end of the initialisation function, core 0 releases the other cores (broadly at the same time) to perform their own core-local initialisation in parallel.
The only mutual exclusion performed before the EDMA3 driver initialisation is for a read-modify-write on a CINTC register, which is < 30ns so this does little to serialise the cores.
When the failure occurs, at least 2 cores (including the one that gets the exception) look to be writing to the same PaRAM entry at approximately the same time. Specifically, sets 128 and 129 are being written to (with the same constant values from each core); even more specifically, it is the Opt, SrcBIdxDstBIdx, LinkBCntRld, SrcCIdxDstCIdx and CCnt values that are being written to for each PaRAM set.
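The refactoring mentioned in the list above, where only core 0 writes the shared PaRAM sets and every core does just its own local setup, might look something like this host-side sketch. The `edma3_sim_t` structure stands in for the hardware and simply counts writes; the names and the PaRAM set count are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES      8u
#define NUM_PARAM_SETS 512u /* assumption: PaRAM sets on this instance */

/* Stand-in for the hardware: counts writes instead of performing them. */
typedef struct {
    unsigned param_writes[NUM_PARAM_SETS];
    unsigned local_inits[NUM_CORES];
} edma3_sim_t;

/* Hypothetical helper that programs one shared PaRAM set. */
static void write_shared_param_set(edma3_sim_t *hw, unsigned set)
{
    assert(set < NUM_PARAM_SETS);
    hw->param_writes[set]++;
}

/* Per-core driver init: shared sets are written by core 0 only. */
static void edma3_driver_init(edma3_sim_t *hw, unsigned core)
{
    assert(core < NUM_CORES);

    if (core == 0u) {
        write_shared_param_set(hw, 128u);
        write_shared_param_set(hw, 129u);
    }

    /* Core-local state only; no shared registers touched here. */
    hw->local_inits[core]++;
}
```

Because core 0 finishes its initialisation before releasing the other cores, the shared sets would already be programmed by the time the remaining cores run their local part.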

Does that give you enough to go on? I'm afraid that I can't justify to my PM the time taken to reverse engineer a TI driver to discover things that he feels should be documented in a manual and with no guarantee that we would end up with a complete list of the resource management issues that need to be addressed.

Best regards,

SPH

0 SPH over 13 years ago in reply to SPH

Intellectual 635 points

Travis,

Any update on the sRIO issue?

Thanks,

SPH

0 tscheck over 13 years ago in reply to SPH

TI__Mastermind 23525 points

Hi,

We tried basic tests here and could not duplicate your issue. We tried in multiple ways, first using the benchmarking example from the MCSDK and stepping through the code as the mapping table entries were programmed. Secondly, we wrote values directly into the mapping registers you mentioned above. We didn't see any corruption in either case. As I mentioned, I don't think there is a HW problem here. Is there any way you can strip out most of your code and provide just a snippet of code that shows the problem. If it is just SRIO code, we should be able to run that on our EVM and try to duplicate.

Regards,

Travis

0 SPH over 13 years ago in reply to tscheck

Intellectual 635 points

Hi Travis,

>> As I mentioned, I don't think there is a HW problem here.

Yes, I know that you said this before but, as I've shown with the rev 2.0 silicon, there is certainly a HW sensitivity. (I've currently booted rev 2.0 devices about 1500 times without failure and I've never booted a rev 1.0 device even as many as a hundred times without this failure before I put in the workaround.) Where there is a sensitivity, one has to consider the possibility of a cause...

Anyway, I've got to go now but I have managed to put a little bit of time towards this today and can at least tell you, in case it helps, that (in the one case that I've analysed) it is the write to RXU_MAP2_QID that corrupts RXU_MAP3_L. I'll look a bit closer on Monday and see what the other cores were doing around that time.

SPH

0 tscheck over 13 years ago in reply to SPH

TI__Mastermind 23525 points

SPH,

I want to be clear that we aren't abandoning the effort here, but so far we are not able to reproduce your issue in the tests we ran. You mentioned earlier in the thread that you didn't have 2.0 silicon access yet, but in your last post you referenced

SPH said:
as I've shown with the rev 2.0 silicon, there is certainly a HW sensitivity. (I've currently booted rev 2.0 devices about 1500 times without failure

. So this is a new data point. Are you saying that you haven't seen the failure in 1500 times using Rev2.0 silicon, with or without the workaround? Do you still see the failure without the workaround?

Also, your testing today referenced mapping entries 2 and 3, while your preceding post referenced mapping entries 4 and 5. We looked at all the entries during our tests and did not see an issue. And as you alluded to earlier, the errata doesn't seem to apply here at all. Let me know what you find on Monday and if you can provide the code that shows the issue.

Regards,

Travis

0 tscheck over 13 years ago in reply to tscheck

TI__Mastermind 23525 points

Just noticed your earlier post referencing 2.0 silicon. Not sure why I didn't see it earlier, but still want to clarify if the problem went away without any type of workaround or not?

Regards,

Travis

0 SPH over 13 years ago in reply to tscheck

Intellectual 635 points

Hi Travis,

>> want to clarify if the problem went away without any type of workaround or not?

When I'm working on this, I am running a version of my code that detects the problem rather than corrects it. Running this, I have only seen the problem on TMX and TMS rev 1.0 silicon; the problem appears to have gone away on the rev 2.0 without any type of workaround.

>> Also, your testing today referenced mapping entries 2 and 3, while your preceding post referenced mapping entries 4 and 5

Yes, I reproduced the problem with some additional debug code. Core n configures mapping n, so it depends which core exhibits the problem as to which entries are involved.

I've been looking a bit more closely at the failure that I captured on Friday, and what each core is doing at the time of the corruption, and the scenario doesn't look too complex. Core 0 is waiting at the rendezvous at the end of the initialisation function (as explained in the EDMA3 related post above). The other cores:

Core 1: Hooking a number of interrupts through the CINTC
Core 2: This one sees the corruption so is setting the RXU mapping entry
Core 3: Waiting on semaphore owned by core 2
Core 4: Waiting on semaphore owned by core 2
Core 5: Waiting on semaphore owned by core 2
Core 6: Waiting on semaphore owned by core 2
Core 7: Waiting on semaphore owned by core 2

To give a bit more information, I have a semaphore (#3) which I use for general mutual exclusion during initialisation. It is this semaphore that core 2 owns and cores 3-7 are waiting on with the following code:

   SemDirect = c66_sem_REG_PTR->aSemDirect[Semaphore];

   while(SemDirect != 1)
   {
      /* Semaphore is currently taken */
      ENABLE_INTERRUPTS(CsrSave);
      /* Allow a handful of nanosecs for any pending interrupt to get in */
      __asm("\t NOP 5");
      __asm("\t NOP 5");
      __asm("\t NOP 5");
      DISABLE_INTERRUPTS(CsrSave);

      /* Now try again for the sem */
      SemDirect = c66_sem_REG_PTR->aSemDirect[Semaphore];
   }

Core 0 is waiting in another loop:

      /* Wait for all other core sems to be taken */
      for(i = 1; i < NUM_CORES; i++)
      {
         while(c66_sem_REG_PTR->aSemQuery[16 + i] == 1)
         {
            /* Poll while free */
         }
      }

The only core doing anything significantly different (other than the one that "causes" the corruption) is core 1. It is interacting with the CINTC to set up various interrupt and exception sources. The CINTC is mutexed by a different semaphore (#5) and the basic pattern is:

Update local state
Acquire semaphore
read-modify-write the CINTC channel map entry to map the source event to the appropriate host interrupt
Release semaphore
Clear the source event status
Enable the source event
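The pattern in the list above could be sketched on a host as follows. Everything here is a stand-in: `cintc_sim_t` models the controller with plain variables, the semaphore is a simple flag, and the packing of four 8-bit channel fields per 32-bit map word is an assumption made for the sketch rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_EVENTS 64u /* assumption: events modelled in this sketch */

/* Plain-memory model of the CINTC and its guarding semaphore. */
typedef struct {
    uint32_t ch_map[NUM_EVENTS / 4u]; /* assumed: 4 byte-wide fields per word */
    uint64_t status;                  /* pending source events */
    uint64_t enabled;                 /* enabled source events */
    int      sem_taken;               /* stand-in for semaphore #5 */
} cintc_sim_t;

static void sem_acquire(cintc_sim_t *hw) { assert(!hw->sem_taken); hw->sem_taken = 1; }
static void sem_release(cintc_sim_t *hw) { assert(hw->sem_taken); hw->sem_taken = 0; }

/* Map one source event to a host interrupt, following the steps listed. */
static void hook_event(cintc_sim_t *hw, uint8_t *local_state,
                       unsigned event, uint8_t host_irq)
{
    assert(event < NUM_EVENTS);
    unsigned word  = event / 4u;
    unsigned shift = (event % 4u) * 8u;

    local_state[event] = host_irq;                 /* update local state */

    sem_acquire(hw);                               /* acquire semaphore */
    uint32_t v = hw->ch_map[word];                 /* read */
    v &= ~(UINT32_C(0xFF) << shift);               /* modify */
    v |= (uint32_t)host_irq << shift;
    hw->ch_map[word] = v;                          /* write */
    sem_release(hw);                               /* release semaphore */

    hw->status  &= ~(UINT64_C(1) << event);        /* clear event status */
    hw->enabled |=  (UINT64_C(1) << event);        /* enable the event */
}
```

Only the read-modify-write is inside the semaphore, which matches the description that the critical section is very short.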

If there is anything else that you would like me to look at, please let me know.

Best regards,

SPH

0 tscheck over 13 years ago in reply to SPH

TI__Mastermind 23525 points

That is good news that the issue doesn't show up for v2.0 silicon! I've sent some internal inquiries to see if there is something that was missed regarding the existing errata for older silicon, for example if there was any sequencing requirement for programming the mapping entries. I will let you know if anything comes to light. We are also happy to try your code if you can attach it.

Regards,

Travis

0 tscheck over 13 years ago in reply to tscheck

TI__Mastermind 23525 points

Ok, I had the designer do some digging and we could not find anything. She checked the RTL and confirmed that there is nothing different about these mapping registers versus all the other MMR in the peripheral. The aliasing issue in the errata is the only issue we are aware of. I know you said you don't use CCS and are using your own software tools to read the values via CPU, but if you have access to CCS can you try to open a memory window and verify the values that way by stepping through the code?

Regards,

Travis

0 SPH over 13 years ago in reply to tscheck

Intellectual 635 points

Hi Travis,

>> can you try to open a memory window and verify the values that way by stepping through the code?

TBH, I'm not sure that is practical. As I mentioned, this occurs ~1% of the time - on one of the 7 secondary cores - so I might need to single step through 700 times before I saw this. Also, due to the relative rarity of the problem, I suspect that it might be an interaction between the core writing the RXU mapping registers and one of the other cores making some ostensibly unrelated configuration bus accesses: hence, I'm doubtful about whether we would ever see the problem, even if I had the time to step through 700 times, as the other cores would already have completed their initialisation.

What I can do, maybe today but certainly this week, is to capture a few more instances and see what the other cores are doing at the same time, see if we can spot a pattern. Would that be useful?

Many thanks,

SPH

0 tscheck over 13 years ago in reply to SPH

TI__Mastermind 23525 points

SPH said:
this occurs ~1% of the time - on one of the 7 secondary cores - so I might need to single step through 700 times

Good point! And this may be why we aren't seeing anything either. We will look at anything you might provide, the more reliable we can produce it the better.

Thanks,

Travis

0 SPH over 13 years ago in reply to tscheck

Intellectual 635 points

tscheck said:
We will look at anything you might provide, the more reliable we can produce it the better.

Hi Travis,

I've captured 2 other failures:

Failure 1

Core 0 and 1 are at the rendezvous, cores 3, 5, 6, & 7 are waiting on the semaphore held (at the time of the corruption) by core 4.

Core 2 is programming the CINTC and core 4 sees the corruption to RXU_MAP5_L when RXU_MAP4_QID is written

Failure 2

Core 0 and 2 are at the rendezvous, cores 3, 4, 5 & 6 are waiting on the semaphore held (at the time of the corruption) by core 1.

Core 7 is programming the CINTC and core 1 sees the corruption to RXU_MAP2_L when RXU_MAP1_QID is written

So there appears to be a link between the CINTC programming (which, as I mentioned, also accesses the Semaphore2 peripheral) and the corruption of the RXU mapping register.

Please let me know if there is more information that I can try to get for you.

SPH

0 SPH over 13 years ago in reply to SPH

Intellectual 635 points

Steven, Travis,

Has there been any progress on either of these issues?

Cheers,

SPH

0 tscheck over 13 years ago in reply to SPH

TI__Mastermind 23525 points

I think we are going to need your code, or simplified version, to try to duplicate on the EVM. I'm not sure how to move forward beyond that. I've discussed the issue with a number of folks, and haven't come up with anything.

I don't like leaving things open-ended, but I'm assuming you will be moving forward with rev2.0 silicon correct? That is still not exhibiting any issues I assume.

Regards,

Travis
