Multi-core programming in C6472 using shared memory

Other Parts Discussed in Thread: TMS320C6472

Hello,

I want to build a multicore application on the C6472 using the shared memory. I am using CCS v4. I have read the document SPRUEG5C. The arbitration logic seems quite complex to me, so I have some questions (I'm sorry if they appear silly).

1. Does using optimization level 3 in CCS v4 imply that shared memory access is always configured as prefetchable? In that case I think I cannot use the atomic access monitor at optimization level 3, because SPRUEG5C says "Atomic access should go only to non-prefetchable address spaces." Am I correct?

2. Other than configuring the prefetchable/non-prefetchable parts of the shared memory, the power-down issues and the fault indications, do I need to use the SMC memory-mapped registers manually from my code, or are they used by the arbitration logic hardware only?

3. Reading the document for SMC controller it seems to me that the user 'talks' to the atomic access monitor and the atomic access monitor controls the arbitration logic hardware. The user cannot directly control the arbitration logic hardware without using atomic access monitor. Am I right?

4. While programming, do I need to specify the per-bank SMC controller through which I am trying to access the shared memory, or does the address of the shared memory location itself tell the hardware which per-bank controller the request should go through?

5. If I access a shared memory location from C with a declaration like "#define VALUE (*((volatile unsigned int *) 0x00200000))" followed by "VALUE=0x1234", will the compiled code automatically pass the write (or read) request through the atomic access monitor (by using the LL, SL, CMTL instructions in assembly), or do I have to take some other step to ensure atomic access? If so, what are those steps?

6. Can anyone please send me an example project where shared memory is used by multiple cores preferably without using DSP-BIOS?

 

Regards,

AC.

 

 

  • I don't know if you are already aware of this, but here is the link to SMMQT: http://software-dl.ti.com/dsps/dsps_registered_sw/sdo_sb/targetcontent/MQT/index.html

    It does use DSP/BIOS, though, but the source code might answer many of your questions.

  • AC,

    Is there a particular reason that you would like to stay away from BIOS for your multi-core application?  If not, TI offers a product called 'IPC' that facilitates developing multicore BIOS applications on devices including C6472.  Low-level hardware operations like interacting with Atomic Access Monitors on C6472 are abstracted away by IPC modules such as GateMP, which is used for protection of shared resources including shared memory.  Other functionality that is offered includes inter-processor notifications, multicore heaps and data structures.

    Regarding your questions,  I'll have to get back to you regarding questions #1 & 2.

    Regarding question #3, yes, you are right.  The user only interacts with the atomic access monitor.  When using IPC, the user interacts with the GateMP module, which itself interacts with the AAMs at a lower level.

    Regarding question #4, the location of the shared memory itself determines the mapping to hardware.

    Regarding question #5, no--operations on volatile variables aren't automatically made atomic between multiple processors.  You would have to protect these operations using atomic access monitors (or GateMP if you are using IPC).

    Regarding question #6--IPC ships with a couple of multi-core applications that can be built for C6472.  However, these applications do use BIOS.

    Regards,

    Shreyas

  • Hello Shreyas,

    Do you have any particular reason to recommend IPC over SMMQT? SMMQT also uses the AAMs for intercore arbitration, and provides simplified API calls via the BIOS MSGQ module.

    I would be greatly interested to know about any differentiation between the two choices, as I have already some progress with SMMQT.

    Regards,

    Viswa.

  • Thanks a lot, Viswanath L and Shreyas Prasad, for your prompt and precise replies. I am going through the SMMQT examples to decipher them. As Shreyas has asked, the reasons I am interested in building the application without DSP BIOS are:

    1. I want to learn and observe what is going on at the register level. I know that sometimes sounds a bit impractical and time-inefficient. Previously I stumbled in the same way when firing an ISR on a single core without using DSP BIOS, but later I managed to do that by mixing some C and assembly code. I am also able to generate an interprocessor interrupt from one core and service it from another core without BIOS, using the IPCGR registers and the corresponding event numbers. So now I am targeting shared memory access.

    2. I want the whole code to be visible to me (as much as possible). So I am trying to avoid API-based abstractions as much as possible.

    3. I want to avoid any overhead in the application code. Although DSP BIOS is said to come with little overhead, if the application is manageable without BIOS then that is fine.

    Thanks once again for the replies. Regards,

    AC.

  • Viswanath, the main difference between IPC and SMMQT is that SMMQT works with BIOS 5.x and IPC works with BIOS 6.x.  Also, SMMQT has limited functionality and device support compared to IPC.  It does offer message passing via MSGQ and it ships with code to use Atomic Access Monitors on C6472.  IPC is also more portable between multiple devices since hardware details (i.e. atomic access monitors, hardware semaphores, etc) are abstracted away in top-level modules.

    AC, I understand your motivations to avoid BIOS and operate at the register level. FYI (regarding point #2), BIOS6 and IPC are both open source and ship with the source code as well.

  • Thanks Shreyas for the information.

    1. Can you please tell me the path where the open source libraries are there for BIOS or IPC?

    2. What is the concept of a memory bank? Does it exist inside the Shared Memory Controller only (it seems so from the SMC boundary indicated in figure 2 of SPRUEG5C)? Or is the whole shared L2 RAM (768KB) divided into 4 physically separate address spaces called banks in the case of the C6472 (it seems so from sect. 4.2 of SPRUEG5C, which says: "SMC divides SL2 RAMs address space into 4 physical pages.")? What is the meaning of "256 bits wide memory bank" mentioned in the SMC controller user guide (figure 2, SPRUEG5C)?

    3. If the memory banks are really 4 segments of the physical memory, then

        (i) What are the address boundaries? Are the banks (bank-0, bank-1, bank-2, bank-3) equally spaced within the total 0xBFFFF locations of the SL2 RAM of the C6472, or something else?

        (ii) Can four different cores read/write different SL2 RAM locations in/through four different banks at the same time (assuming there is no previous request pending)? Or will the arbitration logic come into the picture to resolve the conflict, arrange the requests sequentially and give the cores the impression that the reads/writes are done simultaneously? Assume the accesses are not made atomic, because they are accessing different locations.

     

    Regards,

    AC.

  • You can download SYS/BIOS at http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/sysbios/index.html and IPC at http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/ipc/index.html.  Note that you will also need XDCTools since both BIOS and IPC depend on this product.  XDCTools can be downloaded at http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/rtsc/index.html.

    All 3 products contain both built libraries and source code.

    I'm not familiar enough with the memory architecture of C6472 to answer your remaining questions.  I will forward this post to someone more knowledgeable about this topic.

    Regards,

    Shreyas

     

  • I forwarded your remaining questions and obtained the following details regarding the C6472 SMC:

    C6472 SMC controls 4 banks of memory.  Each bank is 256 bits wide and the banks are interleaved as follows:
    - bank 0: base address + 0:31, 128:159, ...
    - bank 1: base address + 32:63, 160:191, ...
    - bank 2: base address + 64:95, 192:223, ...
    - bank 3: base address + 96:127, 224:255, ...
     
    SMC allows 4 concurrent accesses to 4 different banks.  If there is a bank conflict, it will select one access and let the others wait.
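
    To make the interleaving above concrete, here is a minimal sketch in C. It assumes the offsets in the list are byte offsets, so that one 256-bit bank line is 32 bytes; the helper names sl2_bank_of and sl2_same_bank are made up for illustration and are not part of any TI library.

```c
#include <stdint.h>

/* Assumption: the offsets in the post are byte offsets, so one bank line
   is 32 bytes (256 bits) and four consecutive lines map to banks 0..3. */
#define SL2_BANK_LINE_BYTES 32u
#define SL2_NUM_BANKS       4u

/* Hypothetical helper: bank that serves the byte at `offset` from the SL2 base. */
static unsigned sl2_bank_of(uint32_t offset)
{
    return (unsigned)((offset / SL2_BANK_LINE_BYTES) % SL2_NUM_BANKS);
}

/* Hypothetical helper: nonzero if two offsets fall in the same bank,
   i.e. simultaneous accesses to them would be a bank conflict. */
static int sl2_same_bank(uint32_t offset_a, uint32_t offset_b)
{
    return sl2_bank_of(offset_a) == sl2_bank_of(offset_b);
}
```

    With this mapping, offsets 0 and 128 both land in bank 0 and would conflict, while offsets 0 and 32 land in banks 0 and 1 and could be served concurrently.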


    Regards,
    Shreyas
  • Hi AC.

    I have EXACTLY the same issues as you with the understanding of the SMC. I have been studying the SMC (SPRUEG5C) as well as the CSL user's guide with slow progress towards understanding how to use the SMC and ensure atomicity. There is clearly also a shortage in proper simple example projects that work 'out-of-the-box'.

    My reason for avoiding SYS/BIOS is that my multi-core application uses the SRIO peripheral in DirectIO mode, and this mode is not supported under SYS/BIOS. Only message-passing mode is supported under SYS/BIOS. There are MANY other perfectly legitimate and sensible reasons to avoid an OS. Generally you can get much better optimization and performance out of an application by programming it bare-board (i.e. NO SYS/BIOS), especially if that application involves the execution of repetitive single tasks, regardless of whether it is multi-core or not. As soon as your application becomes more multi-task oriented, it is strongly advised to move to an OS of some sort (like SYS/BIOS).

    I have a bare-board (i.e. NO SYS/BIOS) multi-core application in which all the cores try to access (i.e. read AND write) a common integer variable in SL2 RAM. This variable is used as a semaphore between the cores to arbitrate access to other resources, and simply has the value 1 for a 'busy' condition, and 0 for 'non-busy'. Its purpose is to indicate to the cores that certain resources are blocked from being accessed, because another core is busy working on them. However, if atomic access to this variable is not guaranteed, it could obviously happen that two cores simultaneously read the value as 0 (non-busy), then simultaneously set the variable to 1 (busy), and then simultaneously access the other resources, thus defeating the purpose of exclusive access to the resources. According to the specifications of the SL2 controller, it supports atomic access monitoring, but I have had no success in understanding how this is accomplished.
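
    The failure case described above can be replayed deterministically on a host PC. The sketch below is plain C with no C6472 specifics; core_t, core_read, core_claim and replay_race are hypothetical names, and the two 'cores' are just two structs stepped by hand in the problematic order.

```c
/* Host-side replay of the race; `flag` stands in for the SL2 busy flag. */
typedef struct {
    int saw_free;  /* 1 if this core read the flag as 0 (non-busy) */
    int owns;      /* 1 if this core believes it holds the resource */
} core_t;

static volatile int flag = 0;  /* 0 = non-busy, 1 = busy */

/* Step 1 of the non-atomic sequence: read the flag. */
static void core_read(core_t *c)
{
    c->saw_free = (flag == 0);
}

/* Step 2: act on the value read earlier, which may be stale by now. */
static void core_claim(core_t *c)
{
    if (c->saw_free) {
        flag = 1;
        c->owns = 1;
    }
}

/* Replays read(A), read(B), claim(A), claim(B) and returns how many
   cores end up believing they own the resource. */
static int replay_race(void)
{
    core_t a = {0, 0};
    core_t b = {0, 0};

    flag = 0;
    core_read(&a);
    core_read(&b);
    core_claim(&a);
    core_claim(&b);
    return a.owns + b.owns;
}
```

    replay_race() returns 2 here, meaning both cores believe they own the resource, which is exactly what an atomic read-modify-write has to prevent.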

     

    My question is the following:

    * Is the atomic access to SL2 RAM supposed to be transparent to the end-user, or does it have to be controlled manually?

    In my opinion it should be transparent.

     

    I have been successful in developing this code for the C6474 by using the on-chip SEMAPHORE module, but now I want to migrate the code to the C6472 for performance comparison purposes.

    Have you had any success in the mean-time? Do you have any advice on how this (seemingly simple) task mentioned above can be accomplished/guaranteed? Maybe a different approach? Your help would be greatly appreciated.

     

    Regards.

    Estian.

     

     

  • Hi,

    Atomic access to the SL2 RAM is supported via the instruction set of the device. You'll find an explanation and examples in the TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide (SPRU732), in the chapter "C64x+ CPU Atomic Operations".


    We offer 3 instructions that work together with shared L2 memory on the TMS320C6472:

    • LL — Load Linked Word from Memory

    • SL — Store Linked Word to Buffer

    • CMTL — Commit Store Linked Word to Memory Conditionally

    How does it work?

    The LL instruction reads a word from memory and prepares for a subsequent SL instruction. As a side effect of the read, a link valid flag is set to true and the address is monitored. If any other process stores to that address, the link valid flag is cleared. The link valid flag is also cleared if the SL instruction is executed with a different address. The SL instruction buffers a word to be stored to memory by the CMTL instruction; it does not commit the change. Finally, the CMTL instruction reads the value of the link valid flag. If the flag is true, the data buffered by the SL instruction is written to memory. If the commit fails, the update must be retried.
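
    The sequence described above can be modeled in plain C to see how the link valid flag behaves. This is only a software model that runs on a host, not code for the device: aam_model_t and the model_* functions are hypothetical names, a store from another core is simulated by calling model_foreign_store, and clearing the flag after a commit attempt is an assumption of the model rather than something stated above.

```c
#include <stdint.h>

/* Software model of one core's monitor state (hypothetical, not a TI API). */
typedef struct {
    volatile uint32_t *link_addr;  /* address being monitored */
    int link_valid;                /* the "link valid" flag */
    uint32_t buffered;             /* word buffered by SL */
} aam_model_t;

/* LL: read the word, set the flag, start monitoring the address. */
static uint32_t model_ll(aam_model_t *m, volatile uint32_t *addr)
{
    m->link_addr = addr;
    m->link_valid = 1;
    return *addr;
}

/* SL: buffer the word; a different address clears the flag. */
static void model_sl(aam_model_t *m, volatile uint32_t *addr, uint32_t value)
{
    if (addr != m->link_addr)
        m->link_valid = 0;
    m->buffered = value;
}

/* CMTL: write the buffered word only if the flag is still set.
   Returns 1 on success, 0 on failure. Clearing the flag afterwards
   is an assumption of this model. */
static int model_cmtl(aam_model_t *m)
{
    int ok = m->link_valid;
    if (ok)
        *m->link_addr = m->buffered;
    m->link_valid = 0;
    return ok;
}

/* Simulates another process storing to memory: clears the flag on a match. */
static void model_foreign_store(aam_model_t *m, volatile uint32_t *addr,
                                uint32_t value)
{
    *addr = value;
    if (addr == m->link_addr)
        m->link_valid = 0;
}

/* Retry loop: repeat LL / SL / CMTL until the commit succeeds. */
static void model_atomic_inc(aam_model_t *m, volatile uint32_t *addr)
{
    int committed;
    do {
        uint32_t v = model_ll(m, addr);
        model_sl(m, addr, v + 1u);
        committed = model_cmtl(m);
    } while (!committed);
}
```

    On the device, the retry loop in model_atomic_inc would correspond to an LL, SL, CMTL sequence that branches back when the commit fails.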

     

    I hope that helps.

     

    Kind regards,

    one and zero

  • Hi,

    As per my understanding (developed by reading and discussion in this thread), the user 'talks' to the atomic access monitor and the atomic access monitor controls the arbitration logic hardware. The user cannot directly control the arbitration logic hardware without using the atomic access monitor. In my application more than one core was trying to read a location simultaneously, but there was no write attempt, so I could bypass the requirement of atomicity. I used the shared memory location as a simple memory-mapped register and that worked. Unless told otherwise, the code generation tools will not generate LL, SL, CMTL instructions (in my case I did not want atomicity, so the SL2 access used simple load/store instructions). The post by Shreyas Prasad also helped me understand the interleaved memory structure, which is helpful for a VLIW architecture. For a large chunk of data (~2000) I measured the time required for different cores to write this chunk to different non-intersecting regions in SL2. The overall time is almost the same when 1, 2, 3 or 4 cores write simultaneously. For more cores the time increases. This is consistent with the 4-bank structure.

    But as per my understanding, echoing one and zero, if DSP BIOS is to be avoided then the only way to ensure atomicity is to use LL, SL, CMTL. The next question is whether to embed these assembly instructions inside the C code using 'asm'. I was advised in the forum not to do so, because doing that needs other registers as well, and I don't know whether the compiled C code is using those registers or not. That raises the question of pushing to and popping from the stack, or else knowing how the registers are handled by the code generation tools, and things get complicated. So I think writing functions using these assembly instructions and calling them from C code to maintain atomicity will be a better option. In that case there are also some restrictions on register usage (they can be found in the 'optimizing compiler' doc), but I think (not tried) that will be a simpler way.

    Regards,

    AC.

  • Hi AC,

    You're absolutely right. You shouldn't use asm() in your C code. What you can do is copy the examples into an .asm file. You can then call the assembly functions from C.

    For more info on how to mix C and assembly please have a look in TMS320C6000 Optimizing Compiler (SPRU187), Chapter 7.5
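
    On the C side, a busy flag like the one discussed earlier in this thread could then be built on a single assembly routine. The following is a sketch under stated assumptions: sl2_cas_word is a hypothetical name for a routine that would live in an .asm file and wrap LL, SL and CMTL; here it has a single-threaded C stand-in body so the example compiles on a host, and that stand-in is NOT atomic. The sem_* names are also made up for illustration.

```c
#include <stdint.h>

#define SEM_FREE 0u
#define SEM_BUSY 1u

/* Hypothetical primitive: change *addr from `expected` to `desired` and
   return 1, or return 0 if *addr did not hold `expected`. On the device
   this would be an assembly routine using LL, SL and CMTL. The body below
   is a NON-atomic host stand-in, only good for single-threaded testing. */
static int sl2_cas_word(volatile uint32_t *addr, uint32_t expected,
                        uint32_t desired)
{
    if (*addr != expected)
        return 0;
    *addr = desired;
    return 1;
}

/* Try once to take the semaphore; returns 1 if this caller now owns it. */
static int sem_try_acquire(volatile uint32_t *sem)
{
    return sl2_cas_word(sem, SEM_FREE, SEM_BUSY);
}

/* Spin until the semaphore is taken. */
static void sem_acquire(volatile uint32_t *sem)
{
    while (!sem_try_acquire(sem)) {
        /* busy-wait; a real application might back off here */
    }
}

/* Release is modeled as a plain store. Whether that is sufficient on
   hardware depends on cache and prefetch configuration, which this
   sketch does not address. */
static void sem_release(volatile uint32_t *sem)
{
    *sem = SEM_FREE;
}
```

    The point of this layout is that only sl2_cas_word has to be written in assembly; the rest of the locking logic stays in ordinary C.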

    Kind regards,

    one and zero

  • Hi One and Zero / AC,

    Thank you for your prompt replies and advice!

    Wow! I really can't believe that things can be so complicated for something so simple. Atomicity is certainly much simpler with a hardware semaphore like the C6474...

    Have the TI developers considered developing a CSL API that would make this process a little simpler, and perhaps include in a new release of the CSL?

     

    Ok, so before I spend a few weeks attempting to get this working the way you suggest, I would like to know:

    1. Is there perhaps another way of developing a semaphore that guarantees atomicity towards common resources for the C6472? I'll have you know that I have studied all the available documents on the TI website that address this, but to me, none of them seem as sound and reliable as a hardware solution.
    2. Which 'examples' are you referring to before, and where exactly can I get hold of them?

    Regards.

    Estian.

     

  • Hi.

     

    Ok, so based on the examples in the documentation you suggested, I have attempted a very simple approach and it seems I am stumbling at the very first hurdle. Here is the simple C code for my main routine:

     

    #include <stdio.h>

    extern asmfunc(void);

    void main(void)

    {

     asmfunc();

    }

     

    And here is the simple assembly function (.asm file) I created, and simply included into my CCS 4.0 project:

     

    .global _asmfunc

    _asmfunc:

    NOP 4

     

    I am getting the following compiler error:

     

    "../asmfunc.asm", ERROR!   at line 1: [E0002] Illegal mnemonic specified

    .global _asmfunc


     

    What am I doing wrong? I am also unsure as to which Build Options settings I have to fiddle with in CCS 4.0. There are so many parameters, it's making my head spin... :)

     

    Please help!

    Estian.

     

     

     

  • Hi Estian,

    please try like that:

         .global _asmfunc
    _asmfunc:
         NOP 4

     

    Kind regards,

    one and zero

  • Hi one and zero.

    I cannot see the difference between what you suggested and my previous post, but your code seems to work!

    I copied your code directly into my .asm file and recompiled.

    What is the difference? The spacing?

    Estian.

  • Hi Estian,

    Yes, it's the spacing. Only labels can start in the first column.

    Kind regards,

    one and zero

  • Hi.

    Ok, so I'm officially lost again.

    I have managed to replicate the simple examples of SPRU187Q (Ch. 7) with success.

    However, I am unable to successfully implement the shared accumulator SL2 atomic operations of SPRU732H, (Ch 9, sec. 9.3.2).

    Once again, here is my C code with the main routine:

     

    #include <stdio.h>

    extern asmfunc(int * );

    #pragma DATA_SECTION(gvar,".SL2"); // SL2 in Linker Command Starts at 0x10200000

    int gvar = 0;

    void main(void)

    {  

     asmfunc(&gvar);

    }

     

    And here is the .asm file:

     

    .global _asmfunc

    _asmfunc:

    LL *A8, A6 ;load linked (lock) and store in A6

    NOP 4

    ADD A6,1,A6 ;add one to A6 and store back into A6

    SL A6, *A8 ;new value to store back

    CMTL *A8, A1 ;commit the store

    NOP 4

    [!A1] B _asmfunc ;if commit failed, try again

     

    I'm getting the following compiler errors:

     

    "../asmfunc.asm", ERROR!   at line 4: [E0004] Memory operand must be B side register LL *A8, A6 ;load linked (lock) and store in A6

    "../asmfunc.asm", ERROR!   at line 7: [E0004] Memory operand must be B side register SL A6, *A8 ;new value to store back

    "../asmfunc.asm", ERROR!   at line 8: [E0004] Memory operand must be B side register CMTL *A8, A1 ;commit the store

     

    I naively attempted to change all the A registers to B registers, which seems to compile. However, the function simply hangs during execution, and according to Table 7-2 (SPRU187Q) these changes do not make sense in any case.

    Please help and explain!

    Estian.

  • Estian,

    You need to be very careful when using registers in assembly called from C.  When you call asmfunc and pass the address of gvar, the compiler and assembler have an agreement as to which register the address of gvar is going to be stored in.  You can't just switch all of the registers to B registers and hope for this to work, because the compiler is still going to store the gvar address in the same location.  From the looks of this code, the assembly routine expects the address of gvar to be in register A8.  Whether or not this is the correct register, I am not sure.  Check the Compiler Guide for the parameter passing conventions, or debug this on a simulator or hardware and look at the assembly instructions generated by the compiler in main just before calling asmfunc to see where the address of gvar is being stored.

    Regarding the errors that you are getting from the assembler, you also need to be familiar with the instructions that you are using.  See the C64x+ CPU Instruction Set User's Guide for details on the available instructions for the 6472 (I believe you referenced this document).  When you look at the LL instruction, it specifies that it needs to use the .D2 unit.  This unit gets its operand from the B register file.  So, one of the operands must be in a B register.  I _think_ it's the pointer value, but I'm not 100% sure.  Again, it never hurts to try this, compile it, and then step through and debug it to see if the registers get updated as expected.

    Regards,
    Dan