Fast Emulated Floating-Point??

Other Parts Discussed in Thread: SPRC122

Hi,

I'm wondering whether we can afford to do some floating-point calculations on our C6455 and have been looking at C_fastRTS.

I created my own example based on inline_usage.c and found that it would "software pipeline" the multiply loop but not the add loop, and wondered why(?).

But I was pretty impressed with it doing 16 floating-point multiplies in 140 cycles.

My general question is:

Is it possible to get this kind of performance doing matrix maths (probably matrix multiplies)? Or does it get too complicated for the software pipelining to work?

Does anyone know of any more complex examples of using C_fastRTS?

Thanks,

Matt

P.S.

Here's what I did:

#include <std.h>   /* DSP/BIOS base types (Void, etc.) */
#include <sts.h>   /* STS statistics objects */
#include <clk.h>   /* CLK_gethtime() */

/* addsp_i() and mpysp_i() are the inline single-precision add and
   multiply from C_fastRTS (see the library's inline_usage.c). */

#define N 16

/* STS objects created in the DSP/BIOS configuration */
extern STS_Obj inlineAddSts;
extern STS_Obj inlineMultiplySts;

Void inline_usage()
{
    float left[N]  = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f,
                       0.9f, 1.0f, 1.1f, 1.2f, 1.3f, 1.4f, 1.5f, 1.6f };

    float right[N] = { 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f,
                       1.0f, 1.1f, 1.2f, 1.3f, 1.4f, 1.5f, 1.6f, 1.7f };

    float add_res[N], mult_res[N];

    int i;

    /* time N emulated floating-point adds */
    STS_set(&inlineAddSts, CLK_gethtime());
    for (i = 0; i < N; i++) {
        add_res[i] = addsp_i(left[i], right[i]);
    }
    STS_delta(&inlineAddSts, CLK_gethtime());

    /* time N emulated floating-point multiplies */
    STS_set(&inlineMultiplySts, CLK_gethtime());
    for (i = 0; i < N; i++) {
        mult_res[i] = mpysp_i(left[i], right[i]);
    }
    STS_delta(&inlineMultiplySts, CLK_gethtime());
}
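
And here's a sketch of the kind of matrix multiply I have in mind for the question above. It's untested and just shows the shape of the inner loop; mat_mult() and the flat row-major layout are my own, not from the library examples:

/* Sketch only: the same idea as inline_usage() but for a matrix
   multiply, using the C_fastRTS inlines for the multiply-accumulate. */
void mat_mult(const float *a, const float *b, float *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float sum = 0.0f;
            for (k = 0; k < n; k++) {
                sum = addsp_i(sum, mpysp_i(a[i*n + k], b[k*n + j]));
            }
            c[i*n + j] = sum;
        }
    }
}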

  • Where does C_fastRTS come from? I am not familiar with it and did not find it in a search on TI.com. But I am glad you have found impressive performance numbers.

    In your compiler options, you can invoke the --src_interlist option (aka -s) to get the asm files generated with interleaved C source. There should also be software pipelining analysis included that may help you figure out why software pipelining was not used in both cases.
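
    For example, from the command line with the standalone compiler it would look something like this (adjust the silicon-version switch for your device; the C6455 is a C64x+):

        cl6x --silicon_version=6400+ -o3 --src_interlist inline_usage.c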

    Perhaps one gets inlined and the other does not?

  • Hi Randy,

    Thanks for taking an interest!

    RandyP said:
    Where does C_fastRTS come from? I am not familiar with it and did not find it in a search on TI.com. But I am glad you have found impressive performance numbers.

    C_fastRTS is part of the C62x/C64x Fast Run-Time Support (RTS) Library.

    The manual (c_fastRTS.pdf) gives performance figures for "inlined and pipelined" calls of:

        mpysp_i = 5.03 cycles

        addsp_i = 11.33 cycles

    At first I didn't understand how an emulated floating-point calculation could be done this quickly, but I now see what is meant by "inlined and pipelined". The key lies in there being a number of similar calculations that can go in a loop.

    In my example the add loop (not pipelined) took over 1000 cycles, i.e. 1000/16 = over 60 cycles per add, which I think fits my understanding of how many instructions are required for a normal (linear) floating-point calculation. The multiply loop (pipelined) took 141 cycles, i.e. 141/16 ≈ 9 cycles per multiply; not as good as the manual's figure, but I think pretty impressive and probably usable.

    But is it possible to write real/non-trivial code that achieves these figures?

    RandyP said:
    In your compiler options, you can invoke the --src_interlist option (aka -s) to get the asm files generated with interleaved C source.

    Er, yes, did that!

    RandyP said:
    There should also be software pipelining analysis included that may help you figure out why software pipelining was not used in both cases.

    For the add loop it says:

    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*      Disqualified loop: Loop contains control code
    ;*----------------------------------------------------------------------------*
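
    One thing I haven't tried yet, though I'm not sure it helps with a "control code" disqualification, which I believe usually points at branches inside the inlined add itself: giving the compiler trip-count and aliasing information. A hypothetical restructuring, not from my project:

        /* MUST_ITERATE tells the compiler the exact trip count and
           restrict rules out aliasing between the arrays, both of
           which generally help software pipelining. */
        void add_loop(const float * restrict left,
                      const float * restrict right,
                      float * restrict res)
        {
            int i;
            #pragma MUST_ITERATE(16, 16, 16)
            for (i = 0; i < 16; i++) {
                res[i] = addsp_i(left[i], right[i]);
            }
        }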

    RandyP said:
    Perhaps one gets inlined and the other does not?

    No, the addsp_i is inlined, just not "pipelined".

    While I would like to know why the add loop is disqualified, I haven't really looked into it myself yet.

    What I was really looking for was someone with practical experience of trying to do floating-point maths on the C6455 who could give an indication of its feasibility.

    We've got a C6455 because we've got some pixel bashing to do and needed a high-bandwidth interconnect (SRIO), but (as I'm sure you're aware) no application fits into a single pigeonhole, and I'm also being asked to do various calculations. Plan A was to do these calculations using fixed-point maths, but obviously fixed-point brings plenty of complications of its own.
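
    To give a flavour of what I mean by complications: even a single fixed-point multiply forces scaling and rounding decisions. A generic Q15 sketch, nothing to do with our actual code:

        /* Generic Q15 multiply: both operands are scaled by 2^15, the
           product is Q30, and it must be rounded and shifted back to
           Q15. The one overflow case (0x8000 * 0x8000) would need
           saturation, which is not handled here. */
        short q15_mult(short a, short b)
        {
            int prod = (int)a * (int)b;                /* Q30 intermediate */
            return (short)((prod + (1 << 14)) >> 15);  /* round to Q15 */
        }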

    Hence I guess the question is:

    What's easier, fixed-point maths or getting the compiler to pipeline floating-point maths?

    Thanks,

    Matt

  • MattB said:
    I created my own example based on inline_usage.c and found that it would "software pipeline" the multiply loop but not the add loop, and wondered why(?).

    I installed CGT V6.0.18 (which is the version mentioned in c_fastRTS.pdf) and it appeared to be software pipelining the add loop as expected (I looked in the asm). (I couldn't measure the performance because the STS window wouldn't update; I guess this has something to do with the CGT and DSP/BIOS versions. I'm using DSP/BIOS V5.41 in CCSv4.)

    I then went back to CGT V6.1.13 and selected --symdebug:none rather than --symdebug:dwarf. This appeared to do the trick and the add loop is now also pipelined.

    The results are:

    addsp_i = 203 cycles, which gives 203/16 ≈ 12.7 cycles per add

    mpysp_i = 123 cycles, which gives 123/16 ≈ 7.7 cycles per multiply

  • MattB said:
    I then went back to CGT V6.1.13 and selected --symdebug:none rather than --symdebug:dwarf. This appeared to do the trick and the add loop is now also pipelined.

    I was wrong; it hasn't done the trick!

    In CGT V6.1.13 with --symdebug:none the optimizer displays software pipeline information but finishes with:

       ii = 39 Did not find schedule

    Disqualified loop: Did not find schedule

    Whereas in V6.0.18 the optimizer displays the software pipeline information and finishes with:

    ii = 21 Schedule found with 2 iterations in parallel

    I'm guessing this means that V6.0.18 did much better than V6.1.13! (As I understand it, ii is the initiation interval, the number of cycles between the starts of successive loop iterations, so finding a schedule at ii = 21 is far better than failing to find one at all at ii = 39.)

    I've done my measurements (for V6.1.13) again: the add loop with --symdebug:dwarf is 600 cycles, and the add loop with --symdebug:none is 208 cycles. I'm surprised the symbolic debug info makes such a big difference, and I wonder whether my measurement technique (wrapping the loop in STS calls) is correct.
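
    As a cross-check on the STS numbers I could read the C64x+ free-running time-stamp counter directly. A sketch, assuming the TI compiler, which exposes TSCL through c6x.h:

        #include <c6x.h>      /* exposes the TSCL control register */

        unsigned int t0, t1, cycles;

        TSCL = 0;             /* any write starts the free-running counter */
        t0 = TSCL;
        /* ... loop under test ... */
        t1 = TSCL;
        cycles = t1 - t0;     /* includes a few cycles of read overhead */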

  • MattB,

    A few months ago, I was looking for something like this and could not find it with all my searches. Thanks for finding it for me.

    I want to get to the bottom of why the docs say one thing and you are measuring another in terms of performance.

    It would help me a lot if you could a) attach a zipped copy of your project to a reply to this post and b) tell me exactly where you found the cycle times that you mentioned. I searched in spru653 and did not find even a reference to addsp_i, so again my searches are failing me.

    The compiler version issue you ran into has been reported for many applications. Unfortunately, v6.1.x does fix several bugs, so it produces "better" code for applications that hit those bugs, while v6.0.x tends to produce smaller and faster code, which is "better" for other applications.

    RandyP

  • RandyP said:
    a) attach a zipped copy of your project to a reply to this post

    Done.

    Unfortunately I had changed the project since my last post, so what I've attached is a new project. Obviously I've tried to make all the options the same, but it's probably unsurprising that when I run it I get different timings! For this version the addSts average is about 300 cycles.

    RandyP said:
    b) tell me exactly where you found the cycle times that you mentioned. I searched in spru653 and did not find even a reference to addsp_i, so again my searches are failing me.

    I think what I did was download and install the C62/C64 FastRTS (http://focus.ti.com/docs/toolsw/folders/print/sprc122.html). This created the following folders (among others):

    fastrts62x64x\c6400\mthlib

    fastrts62x64x\c6400\C_fastRTS

    C_fastRTS contains the material I'm talking about, including c_fastRTS.pdf, which is where those performance figures come from. I did this a while ago, so it is possible I'm remembering incorrectly!

    RandyP said:
    The compiler version issue you ran into has been reported for many applications. Unfortunately, v6.1.x does fix several bugs, so it produces "better" code for some applications that get those bugs, while v6.0.x tends to produce smaller and faster code which is "better" for other applications.

    Okay, now that I know this I will consider comparing V6.0.23 to V6.1.13, although at the moment compiler performance is not holding us back; we've got plenty of basic work to do first!

    I can see there have been updates to V6.0.x since the release of V6.1.x, but I guess there are unlikely to be further releases of V6.0.x(?).

    Are there CCS and/or DSP/BIOS compatibility issues to consider if returning to V6.0.x? In other words, why didn't my STS window update when I built with V6.0.18 (CCSv4 and DSP/BIOS V5.41.02.14)?

    Thanks,

    Matt

     

    C_fastRTSExample.ZIP