
clock frequency of 64x+ DSP in OMAP 3530

Other Parts Discussed in Thread: OMAP3530

Hi,

 

Can somebody please tell me what the default clock frequency of the C64x+ DSP is on the OMAP3530 Mistral EVM? Will it be 430 MHz or lower than that? How can I change the clock frequency of the DSP? Can I do it through some GEL files?

 

I am running code on the C64x+ and, judging by the performance, I feel it is running at a lower speed.

 

Please help.

 

Thanks & Regards,

Manoj


  • Manoj R said:

    Can somebody please tell me what the default clock frequency of the C64x+ DSP is on the OMAP3530 Mistral EVM? Will it be 430 MHz or lower than that? How can I change the clock frequency of the DSP? Can I do it through some GEL files?

    Are you running with the Linux LSP provided with the EVM?  If so, by default the OMAP35xx is configured for OPP3 which has the ARM running at 500MHz and would have the IVA DSP running at 360MHz.  However, the IVA DSP is not enabled in the u-boot or Linux kernel by default.  When you use the DVSDK, then the IVA DSP is turned on by a software module called DSPLink, which then would configure the IVA DSP for the 360MHz operation.  This is consistent with OPP3 operation.

    The IVA DSP is held in reset after a power-on reset, and the ARM core is required to enable it.  You can use GEL files to enable the DSP and configure the frequency, but it needs to match the same OPP (Operating Performance Point) as the ARM Cortex-A8.  The top-level GEL file typically used is omap3430_cortexA.gel (the OMAP3430 is a similar device to the OMAP3530).

  • Hi,

     

    Thanks for your reply.

     

    I am not working on Linux; I am working in CCS 3.3 on Windows. In this scenario, what will be the default clock frequencies of the ARM and DSP, and how can I change the DSP clock to its maximum frequency? (I assume it is 430 MHz.)

    Manoj

  • In the case of CCS, your ARM and DSP clock frequencies are determined by your GEL file, as Brandon suggested, or by your code configuring them. The default OMAP3 GEL file has a variety of clock settings that let you change the OPP, much as you would in Linux. Take a look under the GEL drop-down menu of CCS and try some of the settings; it should print out at the bottom of CCS (for the ARM) what the clock settings are (assuming you are using the default OMAP3 GEL file).

  • BrandonAzbell said:

    Are you running with the Linux LSP provided with the EVM?  If so, by default the OMAP35xx is configured for OPP3 which has the ARM running at 500MHz and would have the IVA DSP running at 360MHz.  However, the IVA DSP is not enabled in the u-boot or Linux kernel by default.  When you use the DVSDK, then the IVA DSP is turned on by a software module called DSPLink, which then would configure the IVA DSP for the 360MHz operation.  This is consistent with OPP3 operation.

    The IVA DSP is held in reset after a power-on reset, and the ARM core is required to enable it.  You can use GEL files to enable the DSP and configure the frequency, but it needs to match the same OPP (Operating Performance Point) as the ARM Cortex-A8.  The top-level GEL file typically used is omap3430_cortexA.gel (the OMAP3430 is a similar device to the OMAP3530).

    Hi Brandon,

    When you talked about the IVA DSP, were you referring to the C64x+ DSP core on the OMAP3530? I'm using the Linux LSP provided with the EVM. Is there a way to configure the frequency of the C64x+ DSP core?

    Thanks,

  • RobbySun said:
    When you talked about IVA DSP, were you referring to the C64+ DSP core on the OMAP3530?

    You are correct, IVA DSP is another way of referring to the C64x+.

    RobbySun said:
    I'm using the Linux LSP provided with the EVM. Is there a way to configure the frequency for the C64+ DSP core?

    I have never tested the C64x+ frequency when working with the cpufreq driver, as discussed in section 10.3.2 of user\OMAP35x_SDK_1.0.2\docs\OMAP35x\UserGuide_1_0_2.pdf; however, the document seems to claim that the IVA/C64x+ frequency is directly related to the ARM frequency. Based on the documentation, I am under the impression that the cpufreq driver will adjust both the frequency of the ARM and the DSP, as well as the core voltage.
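    For reference, the cpufreq driver is normally driven through the standard Linux sysfs interface from the ARM side; a rough sketch follows (the exact frequency table and the list of available governors depend on the LSP build, so treat the values below as placeholders):

```shell
# List the frequencies (in kHz) the governor may choose from
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

# Show the current governor and the current ARM frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq

# Pin the ARM (and, per the user guide, the DSP frequency tied to it)
# by switching to the userspace governor and setting a fixed speed
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 500000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
```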

  • I have a similar situation here:

    Running OMAP35x Mistral EVM with Linux on the ARM that loads and executes programs on the C64x+.

    The C64x+ code is developed using an external Linux tool set that includes the optimising cl6x compiler, DSPBIOS and LINK libraries etc.

    The problem I find is that a straightforward program runs many times (15+) slower on the C64x+ than it does if compiled and run on the ARM side! The program in question provides big-integer maths for calculating prime numbers.

    I have switched on full optimisation (-O3) when compiling for the C64x+.

    The above indicates that the C64x+ is running at 360 MHz and the ARM at 500 MHz. But surely, with its advanced architecture, the C64x+ should still fly?

    I am running the C64x+ program in main(), but when I run it as a DSPBIOS task (priority 15) created by main(), it runs even slower!

    Any help would be very much appreciated.

    Dave

  • David Hardwick said:
    The above indicates that the C64x+ is running at 360 MHz and the ARM at 500 MHz. But surely, with its advanced architecture, the C64x+ should still fly?

    The performance of the DSP relative to the ARM is going to depend on many factors; some code will run better on the ARM and some will run better on the DSP. In a very general sense, the more algorithmic-looking code (loops with heavy math) will run better on the DSP, and more arbitrary code (lots of branching, varied code) will run better on the ARM. In the case of the OMAP3, due to the efficient architecture of the ARM Cortex-A8 along with its significantly higher clock frequency than the DSP's, general code will probably run faster on the Cortex-A8 than on the C64x+ DSP, unless the code is very optimizable for the DSP, meaning it can use more functional units per clock cycle and exploit other DSP benefits such as SPLOOP for software pipelining, so that you leverage the DSP's strengths.

    To get this sort of performance on the DSP may require more than just the -O3 option: you may also have to use intrinsics and pragmas with careful loop construction to help the compiler optimize your algorithm loops, perhaps even some hand-written assembly. Also keep in mind that the C64x+ is meant to be a fixed-point DSP, so working with floats can have a big impact on performance; you mention you are doing integer math, so I suspect this is not one of the factors here, but it should be kept in mind. Essentially, though, the C64x+ architecture is advanced and capable of performing more calculations per second than the ARM (even with the clock frequency disparity), but you do not necessarily get that performance with just the compiler and the -O3 option.
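    As an illustration of the loop hints mentioned above, here is a minimal sketch for the TI cl6x compiler (the function, its fixed-point format, and the trip-count values are my own example, not from this thread; compilers other than cl6x simply ignore the pragma):

```c
/* Fixed-point scaling loop with hints for the C6000 compiler.
 * restrict promises the buffers do not alias, and MUST_ITERATE
 * tells cl6x the trip count is at least 8 and a multiple of 4,
 * which helps it software-pipeline the loop. */
void scale_q7(const short * restrict in, short * restrict out, int n)
{
    int i;
    #pragma MUST_ITERATE(8, , 4)
    for (i = 0; i < n; i++) {
        out[i] = (short)((in[i] * 123) >> 7);  /* multiply, then Q7 shift */
    }
}
```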

    However, before you get into all that software optimization by hand, the first and possibly most impactful thing you can do is optimize your memory map: it does not matter much whether your code is optimized if you have to stall constantly to read from external RAM. It is not uncommon to see such dramatic performance differences, and more, purely based on how code and data are placed in memory, and I suspect this may be the main cause of the slowdown in this case. If your application is not very large, it may be possible to fit the entire application into DSP internal memory; doing this can bring many orders of magnitude of performance improvement over everything being external and just relying on the cache. Even if your application is too large as a whole, if you can move in the code and data sections that are used most often and are critical in your processing loop, you can see dramatic performance improvements that way as well. That being said, to start out: how is your memory map for the DSP set up? If you have external code/data, do you have the caches enabled and set to their maximum sizes?
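    For reference, this is roughly what moving a hot function and its working buffer into on-chip memory looks like with the TI toolchain. The function, buffer, and section names below are my own placeholders; the pragmas are TI cl6x pragmas, and the linker command file decides where the named sections actually land:

```c
/* With the TI cl6x toolchain, CODE_SECTION / DATA_SECTION place a
 * symbol into a named output section; the linker command file can
 * then map those sections onto internal IRAM / L1DSRAM instead of
 * external DDR2. All names here are illustrative only. */
#pragma CODE_SECTION(hot_loop, ".text:fast")
#pragma DATA_SECTION(workbuf, ".workbuf")
int workbuf[256];

int hot_loop(int n)
{
    int i, acc = 0;
    for (i = 0; i < n; i++) {
        workbuf[i] = i * 2;   /* stays in on-chip RAM if mapped there */
        acc += workbuf[i];
    }
    return acc;
}

/* Corresponding linker command file (.cmd) fragment:
 *
 *   SECTIONS {
 *       .text:fast > IRAM
 *       .workbuf   > L1DSRAM
 *   }
 */
```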

  • Hi Bernie,

    As far as I know I am using all the default settings, which utilise a cache for accessing code and data on the C64x+:

    OMAP3530\dsplink-omap3530-base.tci contains:

    prog.module("GBL").ENABLEALLTRC        = false ;
    prog.module("GBL").PROCID              = parseInt (arguments [0]) ;

    prog.module("GBL").C64PLUSCONFIGURE    = true  ;
    prog.module("GBL").C64PLUSL2CFG        = "32k" ;
    prog.module("GBL").C64PLUSL1DCFG       = "32k" ;
    prog.module("GBL").C64PLUSL1PCFG       = "16k";
    prog.module("GBL").C64PLUSMAR128to159  = 0x00000080 ;

    The map file of the c64x+ test program contains:

    MEMORY CONFIGURATION

      name            origin    length    used      unused    attr  fill
      --------------  --------  --------  --------  --------  ----  ----
      IRAM            107f8000  00010000  00000000  00010000  RWIX
      CACHE_L2        10808000  00008000  00000000  00008000  RWIX
      CACHE_L1P       10e04000  00004000  00000000  00004000  RWIX
      L1DSRAM         10f04000  00004000  00000000  00004000  RWIX
      CACHE_L1D       10f10000  00008000  00000000  00008000  RWIX
      RESET_VECTOR    86000000  00000080  00000000  00000080  RWIX
      DDR2            86000080  01bfff80  01429ba2  007d63de  RWIX
      DSPLINKMEM      87c00000  00030000  00000000  00030000  RWIX
      POOLMEM         87c30000  000d0000  00000000  000d0000  RWIX

    I am compiling via:

    CC = $(TI_TOOLS)/cl6x -eo=.o -q -pdr -pdv -pden -ml3 -mv6400+ --disable:sploop $(INTFLAGS) -O3

     

    Looking into this further, I have found that running the following simple test on the C64x+...

             unsigned long long r;
             unsigned long d;
             unsigned long b;
             for(b = 0; b < 99999; b++)
             {
                for (d = 0; d < 99999; d++)
                {
                   r = d / 123;
                   r = r * 123;
                   r = r % 2;
                }
             }

    ...it runs fast (<1 sec).

    But...

             unsigned long long r;
             unsigned long long d;
             unsigned long long b;

             for(b = 0; b < 99999; b++)
             {
                for (d = 0; d < 99999; d++)
                {
                   r = d / 123;
                   r = r * 123;
                   r = r % 2;
                }
             }

    ...runs very slow (26 secs)!

    And...

             unsigned long long r;
             unsigned long long d;
             unsigned long long b;
             for (d = 0; d < 9999999999; d++)
             {
                   r = d / 123;
                   r = r * 123;
                   r = r % 2;
             }


    ...runs fast (<1 sec).

    Apologies if I'm missing something simple here, but I am very new to the system.

  • David Hardwick said:
    Looking into this further, I have found that running the following simple test on the C64x+...

             unsigned long long r;
             unsigned long d;
             unsigned long b;
             for(b = 0; b < 99999; b++)
             {
                for (d = 0; d < 99999; d++)
                {
                   r = d / 123;
                   r = r * 123;
                   r = r % 2;
                }
             }

    ...it runs fast (<1 sec).

    But...

             unsigned long long r;
             unsigned long long d;
             unsigned long long b;

             for(b = 0; b < 99999; b++)
             {
                for (d = 0; d < 99999; d++)
                {
                   r = d / 123;
                   r = r * 123;
                   r = r % 2;
                }
             }

    ...runs very slow (26 secs)!

    And...

             unsigned long long r;
             unsigned long long d;
             unsigned long long b;
             for (d = 0; d < 9999999999; d++)
             {
                   r = d / 123;
                   r = r * 123;
                   r = r % 2;
             }


    ...runs fast (<1 sec).

    Apologies if I'm missing something simple here, but I am very new to the system.

    I think the "fast" examples are not actually doing what you think they are. You are performing nearly 10 billion loop iterations, and when you think about it, there is no way a C64x+ DSP, at any clock speed the OMAP3530 is capable of running it at, could perform that many operations in less than one second. What is most likely happening is that the compiler is optimizing out the loop, since it can determine at compile time what the final value of r will be, or skipping the assignments to r altogether if it determines that r is never used. For some reason the compiler may not be detecting this opportunity in the slower version, although that version is probably still not doing what you expect either.

    In order to defeat this kind of optimization, you need to ensure that some input is not known at compile time and that the output is not discarded. A simple way to do this in C is to use the argc parameter of main to seed the calculation and to return the result. You can also use srand, or really the return value of any call to a library function that won't be inlined (even if the return value is just a success/error code).

    You should bear in mind that neither the Cortex-A8 nor the C64x+ is highly optimized for 64-bit integer operations (they are more 32-bit processors). Also, neither can do division in hardware, so the divide operation may be very slow; on the other hand, it is not unlikely that the compiler will convert the division into a multiply and a shift. If you want to measure the performance of division, you should not divide by a constant value (again, pick something whose value cannot be determined during compilation). The modulo operation will most likely be converted into an AND, and even the multiplication may be converted into shifts and subtracts, although that is less likely.


  • Thanks Gilead,

    That helps make sense of it all.