Benchmarking on MATHLIB

Boll


I am benchmarking MATHLIB and my results are inconsistent with those stated in MATHLIB_c66x_TestReport.html.

Test Conditions:

- C6000 Code Generation Tools 7.3.1

- C6608 Device Cycle Approximate Simulator, Little Endian

- Import mathlib_c66x_3_0_2_0\packages\ti\mathlib\src\sinsp\c66\sindp_66_LE_ELF in CCS 5.03 for double precision evaluation.

In my first test, every cycle count measured with clock() was displayed as 0. I am not sure what is wrong; perhaps some initialization was not performed. I then used the TSC for profiling instead. The printed results are below.

[TMS320C66x_0] RTS: 217 cycles
[TMS320C66x_0] ASM: 180 cycles
[TMS320C66x_0] C: 317 cycles
[TMS320C66x_0] Inline: 297 cycles
[TMS320C66x_0] Vector: 306 cycles
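For readers who want to reproduce a TSC-based measurement like the one above, here is a minimal sketch. It is not the code from the MATHLIB test projects. The helpers cycle_counter_start, cycle_counter_read, average_ticks and demo_kernel are hypothetical names introduced for illustration, and demo_kernel is only a stand-in for the function under test. On the TI compiler the sketch assumes the TSCL/TSCH registers declared in c6x.h; on a host build it falls back to clock(), which counts CPU-time ticks rather than cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI header declaring the TSCL/TSCH control registers */
#endif

/* Hypothetical helper: start the counter (call once before measuring). */
static void cycle_counter_start(void)
{
#ifdef _TMS320C6X
    TSCL = 0;      /* any write to TSCL starts the free-running counter */
#endif
}

/* Hypothetical helper: read a non-decreasing tick count. */
static uint64_t cycle_counter_read(void)
{
#ifdef _TMS320C6X
    uint32_t lo = TSCL;   /* reading TSCL snapshots the upper half into TSCH */
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();   /* host stand-in: ticks, not CPU cycles */
#endif
}

/* Hypothetical stand-in for the routine being profiled (not MATHLIB code). */
static double demo_kernel(double x)
{
    return x * x + 1.0;
}

typedef double (*unary_fn)(double);

/* Average ticks per call of fn over n inputs, with the cost of the
   two counter reads subtracted from the total. */
static uint64_t average_ticks(unary_fn fn, const double *in,
                              volatile double *out, int n)
{
    uint64_t t0, t1, overhead, elapsed;
    int i;

    assert(n > 0);

    /* Measure the cost of two back-to-back counter reads. */
    t0 = cycle_counter_read();
    t1 = cycle_counter_read();
    overhead = t1 - t0;

    t0 = cycle_counter_read();
    for (i = 0; i < n; i++) {
        out[i] = fn(in[i]);   /* volatile store keeps the call from being removed */
    }
    t1 = cycle_counter_read();

    elapsed = t1 - t0;
    elapsed = (elapsed > overhead) ? (elapsed - overhead) : 0;
    return elapsed / (uint64_t)n;
}
```

Call cycle_counter_start() once, then average_ticks(demo_kernel, in, out, n) with the real routine in place of demo_kernel. Averaging over several inputs reduces the influence of cache warm-up on a single call.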

The result in MATHLIB_c66x_TestReport.html:

Problems

1. In my test results, "ASM" is the fastest; in MATHLIB_c66x_TestReport.html, however, "Inline" is the fastest.

2. "Vector" is very slow.

3. MATHLIB_c66x_TestReport.html states that "TCI6608 Device Functional Simulator, Little Endian" was used for testing. Why not use the Device Cycle Approximate Simulator since

I also ran the evaluation on the EVM and got similar results.

Can anyone help? Thanks.

Boll

over 12 years ago

Boll (Intellectual), over 12 years ago

When I set the optimization level to -o3, the test results are very close to those in MATHLIB_c66x_TestReport.html, as shown below.

[TMS320C66x_0] --------------------------------------------------------------------------------
[TMS320C66x_0] Cycle Profile: sinDP
[TMS320C66x_0] --------------------------------------------------------------------------------
[TMS320C66x_0] RTS: 203 cycles
[TMS320C66x_0] ASM: 173 cycles
[TMS320C66x_0] C: 122 cycles
[TMS320C66x_0] Inline: 94 cycles
[TMS320C66x_0] Vector: 20 cycles
[TMS320C66x_0] --------------------------------------------------------------------------------

However, -o3 enables software pipelining, which can affect the speed benchmark of each version of the sin function.

Boll

RandyP (TI Guru), over 12 years ago, in reply to Boll

Boll,

The benchmarks were done with a specific version of the compiler, so you might see small variation from one version to the next. There can also be some slight variation between the actual device and the simulator, usually in terms of the memory model and the location of the breakpoint. The biggest variation comes from the compiler settings, so if you were not using the best optimization, it makes sense that you would not get the best results.

Without software pipelining, the inline and vector versions are not going to perform well. They were written with the intention of using software pipelining.
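To illustrate why such code depends on software pipelining, here is a minimal sketch of a vector-style loop written so that an optimizing compiler can pipeline it. It is not MATHLIB source. The names demo_poly and demo_poly_v are hypothetical, the polynomial is an arbitrary stand-in for a real kernel, and the trip-count limits in the pragma (even count, at most 1024 elements) are assumptions made only for this example.

```c
#include <assert.h>

/* Hypothetical per-element kernel (not MATHLIB code): the polynomial
   x - x^3/6, written without branches so the loop body stays simple. */
static inline double demo_poly(double x)
{
    return x - (x * x * x) / 6.0;
}

/* Hypothetical vector wrapper. The restrict qualifiers promise that in and
   out do not overlap, so iterations can be overlapped by the compiler. */
void demo_poly_v(const double *restrict in, double *restrict out, int n)
{
    int i;

    /* Assumed contract for this sketch: even trip count between 2 and 1024. */
    assert(n >= 2 && n <= 1024 && n % 2 == 0);

#ifdef _TMS320C6X
    /* TI compiler hint: minimum, maximum and multiple of the trip count. */
#pragma MUST_ITERATE(2, 1024, 2)
#endif
    for (i = 0; i < n; i++) {
        out[i] = demo_poly(in[i]);
    }
}
```

With optimization disabled, a loop like this runs one iteration at a time and pays the full latency of every operation. With -o3 the compiler can overlap iterations, which is consistent with the Vector figure dropping from 306 to 20 cycles in the results posted above.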

If you want to upload your project to this thread, it might help others who want to see how you got the results that you did. It could be a big benefit to the community. I especially like that you got better numbers for the Vector version.

Regards,
RandyP

Boll (Intellectual), over 12 years ago, in reply to RandyP

Dear RandyP,

Thanks for your kind reply.

It is clear to me now.

As mentioned in my first post, I am testing the projects in the MATHLIB package provided by TI.

BTW, the default optimization level in those sample projects is set to none, which is a little misleading.

Boll

Boll (Intellectual), over 12 years ago, in reply to RandyP

Hi,

I have one more question.

For the single-precision versions of these math functions, the assembly version is faster than the C version.

That is reasonable, since C code has to be compiled to assembly code.

For the double-precision functions, however, why is the assembly version much slower than the C version?

RandyP (TI Guru), over 12 years ago, in reply to Boll

Boll,

This is a very interesting point. Your own testing confirmed that this can be the case: C code running faster than assembly.

The reason for this is probably in the history of the development of these routines and the progress we have made in the quality of the C compiler. The asm version might not take advantage of all of the architectural improvements in the device since the assembly code was originally written, but the C code can easily do this. If the asm version has not been rigorously updated with each architecture improvement, then that could explain at least part of this difference.

Since your testing confirms that the data was taken correctly, we can at least be grateful to have the table so users can choose the version that best fits their needs.

This will be considered for upcoming releases. Thank you for pointing it out.

Regards,
RandyP
