Fast RTS for DM642 - Are the computation times comparable (simulator vs. emulator)?

Dear All,

  I'm trying to speed up my program for the DM642 DSP. In particular, I'm using a DM642 Evaluation Module with CCS version 5.1.1.00031.


My program contains a port of the core of OpenCV 1.1 to calculate the optical flow and homography of two images. Pure C code can be quite heavy to compute, so I was looking for ways to optimize it.

I read a lot of information and opted to use the C62x/C64x Fast Run-Time Support (RTS) Library to speed up the operations.

My questions relate to the example contained in the library, in particular to the computation times I obtained once I enabled the clock in Code Composer Studio.

I configured the target as a simulator and ran the program in both debug and release mode (optimization level 3).

I compared the operations addsp_i, subsp_i, mpysp_i, divsp_i, recipsp_i with +, -, *, /, 1./x.
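
A comparison of this kind can be sketched as a small timing harness. The version below is only an illustration in portable C: `time_binop`, `native_add`, and `native_div` are hypothetical names, and it uses the standard `clock()` function rather than the CCS profile clock. On the target, a Fast RTS routine could be passed in place of a native wrapper, assuming it takes two floats and returns a float.

```c
#include <stddef.h>
#include <time.h>

/* Any two-operand single-precision operation under test. */
typedef float (*binop_fn)(float, float);

/* Hypothetical wrappers around the native C operators. */
float native_add(float a, float b) { return a + b; }
float native_div(float a, float b) { return a / b; }

/*
 * Apply `op` to every pair (x[i], y[i]), store the results in out[],
 * and return the elapsed CPU time in seconds as reported by clock().
 * Results are stored so the compiler cannot discard the loop.
 */
double time_binop(binop_fn op, const float *x, const float *y,
                  float *out, size_t n)
{
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i) {
        out[i] = op(x[i], y[i]);
    }
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_binop` once per operation with the same input arrays gives timings that are comparable with each other, which is all a relative comparison needs. Calling through a function pointer adds overhead and prevents inlining, so the absolute numbers from such a harness would not match inlined code.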

The computation times I got are as follows:

Debug Mode: +, -, *, /, 1./x

Pipelined addition time: 101.562500
Pipelined substraction time: 106.19
Pipelined multiplication time: 98.19
Pipelined division time: 328.25
Pipelined reciprocal time: 1394.88

Debug Mode: addsp_i, subsp_i, mpysp_i, divsp_i, recipsp_i

Pipelined addition time: 285.687500
Pipelined substraction time: 298.56
Pipelined multiplication time: 224.94
Pipelined division time: 646.56
Pipelined reciprocal time: 609.56

Release Mode: +, -, *, /, 1./x

Pipelined addition time: 78.250000
Pipelined substraction time: 82.88
Pipelined multiplication time: 74.75
Pipelined division time: 306.44
Pipelined reciprocal time: 1373.50

Release Mode: addsp_i, subsp_i, mpysp_i, divsp_i, recipsp_i

Pipelined addition time: 37.187500
Pipelined substraction time: 38.19
Pipelined multiplication time: 7.81
Pipelined division time: 59.13
Pipelined reciprocal time: 18.44

The release-mode results suggest using the Fast RTS library. However, I could not properly evaluate the performance with the emulator. I'm a novice, and I would like to ask for confirmation that Fast RTS on the DM642 should be as fast as the simulator shows.
Can you kindly confirm that Fast RTS will reduce the computation time for similar operations on the DM642?

Can you advise me on which library I should use to speed up fixed-point operations, or which documentation I should read? The amount of information on this topic is huge, and it is sometimes scattered (just my opinion, as a novice).

Thank you in advance for any help.

Regards,

Alessandro

  • Alessandro,

    You are obviously talented, knowledgeable, insightful, organized, and precise (1./x instead of 1/x). You are definitely more than a novice, and we are glad you are working with TI processors.

    Just for your information, there is a DM64x Forum which might be more appropriate for your questions in the future; in this case, for purely DSP core-related questions, you are asking about things that are exact overlaps between this C64x forum and that DM64x one. If your questions were more directly related to the video ports or other peripherals on the DM642, the DM64x forum would be the better choice. There is also a TI C/C++ Compiler forum for optimization questions and a Code Composer Forum for simulator questions. A lot of choices and opportunities, and not as confusing as I make it sound.

    Your questions are really asking whether the simulator is accurate and what optimization techniques we would recommend.

    There are various simulator names, and the people on the Code Composer Forum can recite the names and features. I always use the ones that say Device in the name and have the part number, but I do not see a CCSv5 device simulator for the DM642. Which simulator are you using? If it does not model the memory that you are using, then the cycle counts will probably not match with the EVM. A device simulator will generally be within 5% at worst, and usually within 1-2% for most algorithms.

    But for relative comparisons, your analysis above gives you all the right answers. Perhaps the pipelined multiplication will take a little more than 7.81 of whatever your units are, but the Release Configuration with the Fast RTS library will give you the fastest performance on the EVM, just as it did on the simulator.

    Since you are running these tests on a simulator and an EVM, you might be at an early stage of this program. If so, I would strongly recommend moving to a newer processor. If you require some video ports, then there are DaVinci parts that would work, one of the best matches being the DM8148 or one of its derivative parts. But that would depend on more of your system requirements. Just moving to the DM647 would get you some more performance with just about the same peripheral architecture.

    The DM647 gives you the C64x+ core. It is still a fixed-point processor, so it would need the Fast RTS library for better floating point performance.

    The DM8148, C6748, and some other processors have the C674x core. It has all the enhanced performance of the C64x+ fixed-point core plus native floating-point instructions; it is quite truly the best of both worlds, since we were able to get the high clock speeds of the fixed-point DSP and add very fast floating point, too.

    Way too much information for your questions, but those are my opinions on what might be helpful to you.

    Regards,
    RandyP
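
The relative comparison discussed in this reply can be made concrete by dividing each native release-mode time by the corresponding Fast RTS time. The sketch below does that in C using only the figures posted in the question; `speedup`, `min_speedup`, and the array names are illustrative, and the time units are whatever the original measurement used.

```c
#include <assert.h>
#include <stddef.h>

/* Release-mode timings copied from the question, in the poster's
   (unspecified) units. Order: add, subtract, multiply, divide, reciprocal. */
const double native_release[5]  = {78.25, 82.88, 74.75, 306.44, 1373.50};
const double fastrts_release[5] = {37.1875, 38.19, 7.81, 59.13, 18.44};

/* How many times faster `candidate` is than `baseline`. */
double speedup(double baseline, double candidate)
{
    assert(candidate > 0.0);
    return baseline / candidate;
}

/* Smallest speedup over all n operations, i.e. the worst-case gain. */
double min_speedup(const double *baseline, const double *candidate, size_t n)
{
    assert(n > 0);
    double worst = speedup(baseline[0], candidate[0]);
    for (size_t i = 1; i < n; ++i) {
        double s = speedup(baseline[i], candidate[i]);
        if (s < worst) {
            worst = s;
        }
    }
    return worst;
}
```

With the posted numbers, the smallest gain is for addition (roughly 2.1 times faster) and the largest is for the reciprocal (roughly 74 times faster). These ratios come from the simulator run, so they indicate the expected ordering on the EVM rather than exact cycle counts.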

     


  • Dear RandyP,

      Thank you so much for the kind words and for the helpful, valuable information.

    Thanks to your post, I have a lot of things I can study, search for, and take into consideration. That helps me a lot.

    Thank you again!

    Best regards,

    Alessandro