About float calculation in 6678

If I want to run code that uses float data on the 6678, is there any special property that needs to be set?

I measured the running time of a project with float calculations on the 6678 EVM and on a 6713 evaluation platform, and found that one core of the 6678 is only two times as fast as the 6713. This result doesn't fit the description of the 6678 in its manual, which says at least 4 times. I suspect I did something wrong; would somebody be kind enough to help me?

Thanks a lot,

May

  • May,

    Some of the floating point instructions on the C6678 are fairly specialized and while it can provide up to 4x the performance, not all floating point code is going to be able to achieve 4x the performance (very SIMD oriented processing on these specialized instructions.)  The 2x performance is reasonable for less specialized operation. 

    You may want to have a look at the C66x CPU and Instruction set training material and the Reference Guide.  The training material does some comparisons of C66x vs C67x floating point instructions and the CPU and Instruction Set Reference Guide lists out the specific instructions for each architecture (see the appendix for complete breakout by core.)

    Best Regards,

    Chad

  • Hi Chad,

    Thanks for your answer. A TI engineer once told me that at the same core clock frequency, the performance of the 6678 is at least 4 times that of the 6713. According to your answer, 2x performance is reasonable. But the core clock of the 6713 is 200 MHz and the core clock of the 6678 is 1 GHz, so why is the result still only 2x?

    Looking forward to your reply!

    May

  • May,

    Sorry for any confusion here.  I assumed you meant 2x the performance on a cycle-per-cycle basis (i.e. the C6678 running the same code in 1/2 the cycles that the C6713 takes for a largely floating-point, DSP-oriented kernel).  That would actually put it at about 10x the raw performance, but it could be anywhere from 5x to 20x depending on the specific floating-point kernel (1 to 1/4 of the cycle count).
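
As a rough check of that arithmetic, the raw speedup can be estimated as the clock ratio multiplied by the cycle-count ratio. The sketch below assumes a 200 MHz C6713 and a 1 GHz C6678; the function names are made up for illustration and are not part of any TI library.

```c
#include <stdio.h>

/* Hypothetical helper: estimated raw speedup of a new core over an old one.
 * clock_new_hz / clock_old_hz is the clock ratio; cycle_ratio says how many
 * times fewer cycles the new core needs for the same kernel. */
double raw_speedup(double clock_new_hz, double clock_old_hz, double cycle_ratio)
{
    return (clock_new_hz / clock_old_hz) * cycle_ratio;
}

/* Print the estimate for cycle ratios of 1x, 2x and 4x. */
void print_speedup_estimates(void)
{
    const double c6678_hz = 1.0e9;    /* assumed 1 GHz core clock   */
    const double c6713_hz = 200.0e6;  /* assumed 200 MHz core clock */
    const double cycle_ratios[] = { 1.0, 2.0, 4.0 };
    int i;

    for (i = 0; i < 3; i++) {
        printf("cycle ratio %.0fx -> raw speedup about %.0fx\n",
               cycle_ratios[i],
               raw_speedup(c6678_hz, c6713_hz, cycle_ratios[i]));
    }
}
```

Read the other way round, a measured 2x speedup in running time at a 5x clock ratio would correspond to a cycle ratio of 0.4, i.e. more cycles on the C6678 than on the C6713, which would suggest the code is not yet making good use of the core.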

    You can execute the C6713 kernel code natively on C6678 devices; it may require relinking because of differences in the memory map, but the raw kernels will run.

    You will not need to do anything special beyond giving the correct device type to generate and use the floating point instruction set of the C66xx family, but you may be able to get better performance by structuring the code.  There are some good overviews in the training material here to help you optimize the code performance.
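
To illustrate what "structuring the code" might look like, here is a minimal sketch of a single-precision dot product in plain C. It is only one possible approach, not taken from the training material: the function name `dot_sp` is hypothetical, and the `MUST_ITERATE` hint assumes the trip count is at least 4 and a multiple of 4. The pragma is guarded so that the same file also builds with a host compiler.

```c
/* Single-precision dot product written in plain C.
 * 'restrict' promises the compiler that x and y do not overlap, which
 * gives the optimizer more freedom to schedule loads and multiplies. */
float dot_sp(const float *restrict x, const float *restrict y, int n)
{
    float sum = 0.0f;
    int i;

#ifdef _TMS320C6X
    /* TI-specific hint; assumption: n >= 4 and n is a multiple of 4. */
    #pragma MUST_ITERATE(4, , 4)
#endif
    for (i = 0; i < n; i++) {
        sum += x[i] * y[i];   /* ordinary '*' and '+', no intrinsics */
    }
    return sum;
}
```

The loop uses the ordinary C operators; which instructions the compiler picks for them depends on the device type and optimization level selected for the build.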

    Best Regards,

    Chad

  • Thanks Chad,

    You said to "use the floating point instruction set of c66xx family". Does that mean "FMPYSP, FADDSP..."? Under the target configuration for the 6678, I tried to replace "a*b" with FMPYSP(a,b), where a and b are float data, but it couldn't be built. Then I tried to replace "a*b" with _fmpysp(a,b), and it still couldn't be built. I had included "dsplib.66a" in the project.

    Can you tell me how to "use the floating point instruction set of c66xx family"? Does it mean that I need to replace the calculation symbols "+,-,*" with special functions?

    Looking forward to your reply!

    May

  • Hi May,

    FMPYSP and similar mnemonics are assembly instructions. They take core registers (e.g. A16, B26) as operands, so you can't use them directly in your C code. There are two alternatives:

    1. Use embedded assembly code in your C code, like:

    asm(" FMPYSP A16,A17,A20");  // there must exist a space between the first quote and the instruction.

    But this way you have to operate on core registers yourself, so it's not recommended to write much code like that in your project. Usually it's used to set up configuration registers or to enable/disable interrupts with the DINT and RINT instructions.

    2. Use the intrinsic functions provided by TI:

    float c = _dmpysp(float a, float b);

    Have a look at Table 7-7 of spru187s (TMS320C6000 Optimizing Compiler v7.2 User Guide); it summarizes all the intrinsic functions, like _dmpysp, supported by C6600. You can invoke these functions directly in your C code. It's the best way.

    Hope it helps.

    Allen

  • Hi Allen,

    I've tried _dmpysp and _fmpysp, but neither could be built. Is there any library that needs to be included?

    Thanks,

    May

  • Hi Allen,

    I tested the following code:

    float aa = 1.1;

    float bb = 2.2;

    float cc = 0.0;

    cc = _dmpysp(aa,bb);

    The output of cc is wrong, while with cc = aa * bb the result is 2.42.

    Then I tried cc = _fmpysp(aa,bb), and the compiler said _fmpysp couldn't be recognized. I found this function in TI's training material.

    Looking forward to your reply,

    May

  • Hi,

    _dmpysp should build fine, and there is no intrinsic called _fmpysp.

    No need to include any library for this feature; it comes with the compiler.

    Allen

  • As for the result, it's right if you check the actual memory at the address of 'cc' instead of the 'Watch window'.

  • Hi Allen,

    Somebody had told me that something is wrong with the watch window, so I printed the result, but the printed value was wrong too.

  • Hi May,

    _dmpysp is a 2-way SIMD intrinsic for the floating-point MPY operation; it produces 2 single-precision results packed into a __float2_t value.

    So if you only want a single floating-point multiply, use _mpysp2dp and define cc as a double:

    double cc = _mpysp2dp(aa, bb);

    Then print the cc to check whether it's correct.
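
For reference, a minimal sketch of how the 2-way SIMD multiply might be used with packed pairs. It assumes the TI intrinsics _ftof2, _hif2 and _lof2 from c6x.h for packing and unpacking a __float2_t; the wrapper names (float_pair, pair_make, pair_hi, pair_lo, mul_pair, mul_one) are hypothetical, and the #else branch is only a host-side emulation so the example can be tried without the target.

```c
#ifdef _TMS320C6600
#include <c6x.h>                 /* TI header that declares __float2_t */

typedef __float2_t float_pair;   /* two packed single-precision values */

float_pair pair_make(float hi, float lo) { return _ftof2(hi, lo); }
float pair_hi(float_pair p) { return _hif2(p); }
float pair_lo(float_pair p) { return _lof2(p); }

/* One intrinsic call multiplies both lanes. */
float_pair mul_pair(float_pair a, float_pair b) { return _dmpysp(a, b); }

#else
/* Host fallback (assumption, for illustration only): emulate the pair. */
typedef struct { float hi; float lo; } float_pair;

float_pair pair_make(float hi, float lo)
{
    float_pair p;
    p.hi = hi;
    p.lo = lo;
    return p;
}
float pair_hi(float_pair p) { return p.hi; }
float pair_lo(float_pair p) { return p.lo; }

/* Multiply lane by lane, mirroring what the SIMD intrinsic computes. */
float_pair mul_pair(float_pair a, float_pair b)
{
    return pair_make(a.hi * b.hi, a.lo * b.lo);
}
#endif

/* For a single multiply, the plain C operator is enough. */
float mul_one(float a, float b) { return a * b; }
```

With these wrappers, mul_pair(pair_make(1.5f, 2.0f), pair_make(2.0f, 3.0f)) yields 3.0f in the high lane and 6.0f in the low lane. Passing two plain floats straight to _dmpysp, as in the test above, does not pack them into lanes first, which is presumably why cc came out wrong.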

     

    Allen