Could not write to L1P SRAM



hi Sirs,

I am using a C6657 DSP. Core 1 cannot write data to L1P when L1P is configured as 32KB SRAM.

I have checked the memory protection registers (0x0184A6xx), which are all set to 0x0000FFFF by default.

Could you please explain why this happens and how to solve it?

Any help will be deeply appreciated.

Thank you in advance.

Bai

  • Hi Bai,

    Have you tried writing to L1P by using the CCS Memory Browser?

    Thanks,

    HR

  • What are you using to write to L1P?  For writes that originate outside the core (basically anything other than CPU, IDMA, or emulation accesses), you need to use the global address, because the master sits outside the CorePac and needs the global address to route the access to the correct memory location.  For example, if you were trying to write to 0x00E00000 of CorePac0 via EDMA, you would need to write to 0x10E00000 instead, which is its global address.
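
A minimal sketch of the local-to-global mapping described above. It assumes the C66x convention that CorePac n's local memory is aliased globally at 0x1n000000 plus the local address, which matches the 0x00E00000 to 0x10E00000 example. The helper name `local_to_global` and the range checks are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: convert a CorePac-local address (assumed range
 * 0x00800000..0x00FFFFFF, covering local L2, L1P and L1D) into the global
 * alias that masters outside the CorePac (EDMA, PCIe, ...) must use.
 * Assumption: global alias = 0x10000000 | (core << 24) | local address.
 * Returns 0 if the inputs are outside the assumed ranges. */
uint32_t local_to_global(uint32_t local_addr, unsigned core)
{
    /* Assumption: at most 8 CorePacs, so aliases stay in 0x10..0x17. */
    if (local_addr < 0x00800000u || local_addr > 0x00FFFFFFu || core > 7u) {
        return 0u;
    }
    return 0x10000000u | ((uint32_t)core << 24) | local_addr;
}
```

For example, `local_to_global(0x00E00000u, 1)` yields 0x11E00000, the start of Core 1's L1P as seen from outside the CorePac.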

    Best Regards,
    Chad 

  • hi Chad,

    Thank you for your prompt reply.

    I've tested the following points.

    1  Writing to the L1P of Core 1 (configured as 32KB SRAM) from the CPU using the global address: the value did not change when I checked it in the memory browser.

           *(unsigned int *)0x11E00000 = 3;

    2  Writing to the same address using the memory browser: the value changed, so the write succeeded.

    3  I've checked the memory protection registers. L1PMPPA16~L1PMPPA31 are all 0x0000FFFF.

    4  I am not using the memory lock function.

    I have no idea what causes this problem.

    Best Regards,

    Bai

  • hi HR,

    Thank you for your reply.

    I've tried writing to L1P using the CCS Memory Browser, and the write succeeded.

    But writing with the CPU failed.

    I have no idea why.

    Best Regards,

    Bai

  • hi Chad,

    By the way, what I actually want is to download program code to L1P via PCIe. But it seems that PCIe cannot write to L1P.

    Therefore, I download the program code to Shared Memory first, and then move it from Shared Memory to L1P using the CPU.

    But that did not work either.

    Best Regards,

    Bai

  • Hi Bai,

    Can you post a small project showing your issue?

    Thanks,

    HR

  • Bai,

    I was thinking about this a bit more, and I had forgotten that the CPU cannot write to L1P.  L1P holds only program code, and self-modifying code is not allowed, which is the only reason one would want the CPU to write to that space.  It's not common for people to use L1P in SRAM mode, so I don't see many questions on the subject.

    I'm not aware of any limitations (other than using Global Addressing and having the BAR set up correctly) for PCIe to write to L1P SRAM.  I am having a colleague investigate this for me and will get back to you on this point.

    Do you mind if I ask how big your code base is and why you want to use L1P as SRAM instead of cache?

    Best Regards,
    Chad 

  • Bai,

    It was confirmed and tested by a colleague that PCIe can write to L1P SRAM.  Note that during PCIe boot the L1P is set to cache, so if you want PCIe to write directly to it, you'll need to first configure it as SRAM and then perform the writes.  You'll also want to make sure the PCIe BAR is configured correctly for Global Addressing and that your linker command file is set up with the global addresses.

    Best Regards,
    Chad 

  • Chad,

    That's very kind of you to test this. Your reply is important to me.

    Now the main problem is how to configure the L1P of Core 1. I understand that there is only ONE L1PCFG register for both Core 0 and Core 1. I wonder how to distinguish which core sets L1PCFG, and whether Core 0 can set the L1PCFG register of Core 1?

    I'm writing motor control code on the C6657.  For data credibility, I use L1P and L1D as SRAM instead of cache. Therefore, I want to boot code into L1P SRAM via PCIe.

    Best Regards,

    Bai

  • Actually each core has its own L1PCFG register.  Thinking about this some more, each core will need to set its own L1P cache/SRAM configuration from within the core itself, as these registers are not globally accessible but internal to the CorePacs themselves.

    You would need to perform a two-level boot to do this over PCIe in order to get your main code directly into the L1P SRAM.  The first-level boot would load a small routine to be run on the CorePac when the initial PCIe boot sequence is complete.  This routine configures L1P as SRAM.  It then notifies your PCIe host when it has completed (this could be a memory location that the host polls over PCIe and to which the routine writes a specific value), and the host then loads the rest of the code over PCIe.
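
A rough sketch of what such a first-level routine could look like. The register address, the L1PMODE field layout, and the read-back after the write are assumptions taken from the usual C66x CorePac conventions and should be checked against the device documentation. The handshake value and all function names are hypothetical. The registers are passed in as pointers so the logic can also be tried off-target.

```c
#include <stdint.h>

/* Assumptions (verify against the C66x CorePac documentation):
 * - L1PCFG sits at local address 0x01840020 and is reachable only from
 *   the core that owns it.
 * - L1PMODE is bits 2:0; the value 0 means no cache (all of L1P is SRAM). */
#define L1PCFG_ADDR        0x01840020u
#define L1PMODE_MASK       0x7u
#define L1PMODE_ALL_SRAM   0x0u

/* Hypothetical handshake value that the PCIe host polls for. */
#define STAGE1_DONE_MAGIC  0xB00710ADu

/* Switch L1P to all-SRAM. The read-back is there so the CPU waits for
 * the mode change to take effect before continuing. */
void l1p_set_all_sram(volatile uint32_t *l1pcfg)
{
    *l1pcfg = (*l1pcfg & ~L1PMODE_MASK) | L1PMODE_ALL_SRAM;
    (void)*l1pcfg;
}

/* Signal completion by writing a known value to a location that the
 * host can read over PCIe (for example a word in shared memory). */
void notify_host(volatile uint32_t *flag)
{
    *flag = STAGE1_DONE_MAGIC;
}

/* First-level entry point; must run on the core whose L1P is being
 * reconfigured. */
void stage1_boot(volatile uint32_t *l1pcfg, volatile uint32_t *host_flag)
{
    l1p_set_all_sram(l1pcfg);
    notify_host(host_flag);
}
```

On the target, the call would be along the lines of `stage1_boot((volatile uint32_t *)L1PCFG_ADDR, flag)`, where `flag` points at a location agreed with the host.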

    I'm going to add a boot specialist to the discussion so he can give better details regarding the dual boot procedures.

    Regarding using L1P and L1D as SRAM instead of cache for data credibility: I'm not sure I fully understand why you feel this is more credible than using them as cache, especially L1P.  Only program code gets pulled in, and since it should never be modified, there is never a coherence issue.

    For L1D, the only thing I can think of is a cache coherence concern if you're sharing the data across cores.  When cache coherence is maintained properly, this is never an issue.

    That said, I'd at a minimum leave L1P as cache.

    Best Regards,

    Chad

  • Chad,

    Thank you so much.

    I have loaded the program directly into the L1P SRAM via PCIe boot by following the steps you suggested above.

    May I know if there is a better method for setting L1PCFG?

    I am designing a feedback control system for a motor. Therefore I want to make the execution time of the feedback routine as short as possible. At the same time, the execution time must be stable. In other words, it must be the same every time.

    There are two reasons why I do not want to use the cache.

    Firstly, I have no experience with cache, so I cannot be sure that the data and program of the feedback routine are always in the L1x cache. I expect the feedback routine to take longer when a cache miss occurs. In short, I do not know how to keep the data and program of the feedback routine in the L1x cache at all times.

    Secondly, I could maintain coherence myself if I put the data and program into SL2. But the write-back functions cost too much time in my feedback routine.

    Overall, I have to use L1P and L1D as SRAM instead of cache. I would like to hear your suggestion if you have a good idea for me.

    Thank you once again for your help.

    Best Regards,

    Bai

  • Bai,

    The only way to configure the L1P cache is for the core itself to write to the L1PCFG register.  As I mentioned, there's one for each CorePac, and each CorePac must configure its own.  There is no global addressing available for the CorePac registers, and therefore these registers must be accessed/modified from within the CorePac.  That means only the CorePac's CPU, or the IDMA (which is inside the CorePac and is set up by the CPU), can modify them.

    A couple of questions.

    1.) How large is your program when built?  Does it all fit in L1P?  Or are you just trying to get certain code to be there 100% of the time?

    If it all fits, leaving cache on will result in the code being left in cache permanently after the first time it's loaded.

    If it won't all fit, but the kernel you're worried about fits in, say, 16K, it's better to split the L1P into partial SRAM and partial cache.

    2.) Are you sharing data between cores?  Cache coherence within the CorePac is 100% maintained by the CorePac (i.e., coherence between L1D and L2 is handled automatically.  There's nothing to handle on the L1P side, as that is only code being cached and should never be modified.)
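
To make the partial SRAM / partial cache split from point 1 concrete, here is a small sketch that picks the largest L1P cache setting which still leaves a given amount of L1P as SRAM. The table of L1PMODE encodings (0, 4K, 8K, 16K, 32K of cache) is an assumption based on the usual C66x CorePac layout and should be verified for the device. The function name is hypothetical.

```c
#include <stdint.h>

#define L1P_TOTAL_BYTES (32u * 1024u)

/* Assumed L1PMODE encodings: index = mode value, entry = cache size. */
static const uint32_t l1p_cache_bytes[5] = {
    0u, 4096u, 8192u, 16384u, 32768u
};

/* Return the L1PMODE value giving the largest cache that still leaves at
 * least sram_bytes_needed of L1P usable as SRAM, or -1 if the request
 * does not fit in L1P at all. */
int l1p_mode_for_sram(uint32_t sram_bytes_needed)
{
    int mode;

    if (sram_bytes_needed > L1P_TOTAL_BYTES) {
        return -1;
    }
    for (mode = 4; mode >= 0; --mode) {
        if (L1P_TOTAL_BYTES - l1p_cache_bytes[mode] >= sram_bytes_needed) {
            return mode;
        }
    }
    return -1; /* not reached: mode 0 always satisfies the check above */
}
```

Under these assumptions a 16K kernel maps to mode 3 (16K SRAM plus 16K cache), while a 20K kernel would still leave room for mode 2 (24K SRAM plus 8K cache).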

    I understand your timing concerns, but I would say that it is important to a.) minimize the time of the feedback routine and b.) have it guaranteed to complete within a specific time.

    I would analyze what the specific timing requirements are and make sure they are guaranteed to be met.

  • Chad,

    Sorry for the late response and thank you for your suggestions.

    Q:       1.) How large is your program when built?  Does it all fit in L1P?  Or are you just trying to get certain code to be there 100% of the time.  

    A:       The complete code is larger than L1P. I just want certain code to be there 100% of the time, not all of it. But that code is larger than 16K, so I am afraid that I have to use all of L1P as SRAM.

    Q:       2.) Are you sharing data between cores? 

    A:        Yes, Core 1 shares the data in L1D and SL2 with Core 0.

               If I leave L1D as cache, I have to maintain coherence between L1D and SL2. I measured the CACHE functions (e.g. CACHE_wbInvL1D()), which cost almost 1 us. That is a serious waste of time for my system, because I want the feedback routine guaranteed to complete within a few microseconds.

    May I ask two more questions?

    1)  I am using L1D and L1P as 32K cache on another core. May I know how CACHE_wbInvL1D() works? From the documents, I understand that the cache controller writes back / invalidates data 128 bytes at a time. But how does it work if the data is shorter than 128 bytes?

    2) I have found that one peripheral register access costs about 60 cycles. Isn't that too much? For example, the uPP registers.

    Thank you in advance.

    Best Regards,

    Bai

               
     

  • 1.) It always operates on a cache line of data at a time.  If you writeback or invalidate any part of a cache line of data, the whole cache line will have the action performed on it.

    2.) 60 cycles sounds a bit high.  Are you doing a function call to get to the registers?  How was this measured?
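
The whole-line behavior from point 1.) can be illustrated with a small address calculation. This is a sketch only: the line size is passed in as a parameter, and the 64-byte value used in the note below is an assumed L1D line size that should be checked for the device. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Range of memory actually affected by a cache operation.
 * 'end' is exclusive. */
typedef struct {
    uint32_t start;
    uint32_t end;
} line_span;

/* Expand the byte range [addr, addr + len) to whole cache lines.
 * line_size must be a power of two. A zero-length request yields an
 * empty span. */
line_span cache_lines_touched(uint32_t addr, uint32_t len, uint32_t line_size)
{
    line_span s;
    uint32_t mask = line_size - 1u;

    s.start = addr & ~mask;
    if (len == 0u) {
        s.end = s.start;
    } else {
        s.end = (addr + len + mask) & ~mask;
    }
    return s;
}
```

With a 64-byte line, an 8-byte buffer at 0x00F00010 affects the full line 0x00F00000..0x00F0003F, and an 8-byte buffer at 0x00F0003C straddles two lines, so 128 bytes are affected. Any other data sharing those lines is written back or invalidated along with the buffer.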

  • Chad,

    1)  Thank you for providing the requested information. I think I need to have another look at my usage.

    2)  No, I am not doing any function calls. I accessed the registers via direct addressing, and measured the time by monitoring the TSCL register. 

    Best regards,

    Bai

     


  • Chad

    Sorry for giving the impression that my question had been resolved. 

    Would you mind helping me with question 2) that I mentioned above?

    Thank you in advance.

    Best Regards,

    Bai

  • Bai,

    What register specifically were you using for testing this?  I'm wondering if this has to wait on external data.  While configuration registers are slow to access, they shouldn't be that slow, and you do have the option of using IDMA to read/write configuration registers so that the CPU isn't stalled waiting on them.

    Best Regards,
    Chad 

  • Chad,

    I configured the uPP registers as below, and used the CPU's TSCL register to time each register read/write access.

    Actually, I didn't configure the IDMA at all, so I think the IDMA is in its default mode.

    typedef struct {
        volatile UINT32 ACT   : 1;
        volatile UINT32 PEND  : 1;
        volatile UINT32 r3_2  : 2;
        volatile UINT32 WM    : 4;
        volatile UINT32 r31_8 : 24;
    } UPS2_BITS;

    typedef union {
        volatile UINT32    ALL;
        volatile UPS2_BITS BITS;
    } UPS2_REG; // I,Q

    #define UPP_LoadDescriptor2DMAI_vo(DataPtr, LineCnt, BytesPerLine, LineIndex) \
        uppregs->UPID0 = (UINT32)( DataPtr ); \
        uppregs->UPID1.ALL = (UINT32)((LineCnt)<<16 | (BytesPerLine)); \
        uppregs->UPID2.ALL = (UINT32)( LineIndex );

    void upp_write()
    {
        /* clear DMA status for new transmit */
        uppregs->UPIER.ALL = 0x00001F1F; // 15 cycles

        /* check if DMA I is pending */
        while (uppregs->UPIS2.BITS.PEND) {} // 120 cycles

        /* send packet */ // 20 cycles
        UPP_LoadDescriptor2DMAI_vo((void *)tx_pstg, 1U, BUFFER_SIZE, BUFFER_SIZE);

        while (uppregs->UPIS2.BITS.ACT == 0) {}
    }

    I found that checking the PEND bit cost almost 120 cycles, and so did checking the ACT bit.

    Q: Does it cost more cycles to access a bit field than the whole 32-bit register?

  • I don't see how UPIS2 is defined in the code.

    That said, it appears that you're actually polling on a condition here.  I can't say for certain that an action wasn't still pending, which would explain the high cycle count.  
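
One way to separate the two effects is to time a single register read on its own and, separately, count how many times the polling loop actually spins. The sketch below is illustrative only: the cycle counter is passed in as a function (on the target it would return TSCL), `host_counter` is a stand-in for trying the code off-target, and all names are hypothetical.

```c
#include <stdint.h>

/* Cycle counter source. On the target this would return TSCL; it is a
 * parameter here so the sketch does not depend on target headers. */
typedef uint32_t (*cycle_counter_fn)(void);

typedef struct {
    uint32_t single_read_cycles; /* one register read plus counter overhead */
    uint32_t poll_cycles;        /* the whole polling loop */
    uint32_t poll_iterations;    /* extra reads made while the bit was set */
} poll_timing;

/* Time one read of *reg, then poll until (value & mask) is zero, and
 * report both costs separately. Unsigned subtraction keeps the result
 * correct across a 32-bit counter wrap. */
poll_timing time_poll(volatile const uint32_t *reg, uint32_t mask,
                      cycle_counter_fn now)
{
    poll_timing t = { 0u, 0u, 0u };
    uint32_t t0, t1, value;

    t0 = now();
    value = *reg;
    t1 = now();
    t.single_read_cycles = t1 - t0;

    t0 = now();
    while (value & mask) {
        value = *reg;
        t.poll_iterations++;
    }
    t1 = now();
    t.poll_cycles = t1 - t0;

    return t;
}

/* Host-side stand-in counter: advances by one per call. Only for trying
 * the sketch off-target. */
static uint32_t host_ticks;
uint32_t host_counter(void)
{
    return host_ticks++;
}
```

If `poll_iterations` comes back non-zero, the measured time includes waiting for the condition to clear, not just the cost of one register access.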

    Best Regards,

    Chad

  • Chad,

    Sorry for the late reply. 

    This problem has been solved. 

    Thank you for your support all the same.

    Best Regards,

    Bai