Part Number: TDA2SX
Other Parts Discussed in Thread: TDA2
Hello,
I would like to know where to find documentation on OpenCL for the C66, since I have not found anything really useful. I also have the following questions:
- Is using OpenCL recommended over using intrinsics directly?
- Should using OpenCL lead to the same or better performance as using intrinsics, or can it worsen performance?
- Where can I find some examples of how to use it for the C66?
- How do I properly load data into floatn data types, taking memory alignment into account?
I have tried to rewrite some code, but the outputs differ in some cases, which I suspect is due to memory alignment.
With the following test code:
void test_func(float * restrict in_ptr)
{
    __float2_t vec_a, vec_b;
    __float2_t vec_c = _ftof2(0.0f,0.0f);
    float test_in_1[] = {0.678513,0.75461321};
    float test_in_2[] = {847.3125,684351.13};
    float2 vec_af2, vec_bf2, vec_c2f;

    /* Case 1: operands come from local arrays */
    printf("\n\nTEST LOCAL ARRAY\n");

    /* Intrinsics path: unaligned 8-byte loads */
    vec_a = _mem8_f2(&test_in_1[0]);
    vec_b = _mem8_f2(&test_in_2[0]);
    vec_c = _dmpysp(vec_a,vec_b);
    vec_c = _daddsp(_dmpysp(vec_a,vec_b),vec_c);
    printf("Pointer Input: %f %f %f %f\n\n",test_in_1[0],test_in_1[1],test_in_2[0],test_in_2[1]);
    printf("Intrisics input: %f %f\n",_hif2(vec_a),_lof2(vec_a));
    printf("Intrinsics output: %f\n",_hif2(vec_c)+_lof2(vec_c));

    /* OpenCL vector type path: load through a float2 pointer cast */
    vec_af2 = *(float2*)(&test_in_1[0]);
    vec_bf2 = *(float2*)(&test_in_2[0]);
    vec_c2f = vec_af2 * vec_bf2;
    vec_c2f += vec_af2 * vec_bf2;
    printf("OpenCL input: %f %f\n",vec_af2.hi, vec_af2.lo);
    printf("OpenCL output: %f\n",vec_c2f.hi + vec_c2f.lo);

    /* Case 2: operands come from the caller's pointer */
    printf("\n\nTEST INPUT POINTER\n");

    /* Intrinsics path: unaligned 8-byte loads */
    vec_a = _mem8_f2(&in_ptr[0]);
    vec_b = _mem8_f2(&in_ptr[2]);
    vec_c = _dmpysp(vec_a,vec_b);
    vec_c = _daddsp(_dmpysp(vec_a,vec_b),vec_c);
    printf("Pointer Input: %f %f %f %f\n\n",in_ptr[0],in_ptr[1],in_ptr[2],in_ptr[3]);
    printf("Intrisics input: %f %f %f %f\n",_hif2(vec_a),_lof2(vec_a),_hif2(vec_b),_lof2(vec_b));
    printf("Intrinsics output: %f\n",_hif2(vec_c)+_lof2(vec_c));

    /* OpenCL vector type path: load through a float2 pointer cast */
    vec_af2 = *(float2*)(&in_ptr[0]);
    vec_bf2 = *(float2*)(&in_ptr[2]);
    vec_c2f = vec_af2 * vec_bf2;
    vec_c2f += vec_af2 * vec_bf2;
    printf("OpenCL input: %f %f %f %f\n",vec_af2.hi, vec_af2.lo,vec_bf2.hi, vec_bf2.lo);
    printf("OpenCL output: %f\n",vec_c2f.hi + vec_c2f.lo);
}
Here in_ptr points to some location within a const float in_data[] = {...}.
I get the following results:
[HOST] [DSP1 ] 97.967547 s: TEST LOCAL ARRAY
[HOST] [DSP1 ] 97.967577 s: Pointer Input: 0.678513 0.754613 847.312500 684351.125000
[HOST] [DSP1 ] 97.967608 s:
[HOST] [DSP1 ] 97.967638 s: Intrisics input: 0.754613 0.678513
[HOST] [DSP1 ] 97.967669 s: Intrinsics output: 1033990.625000
[HOST] [DSP1 ] 97.967699 s: OpenCL input: 0.754613 0.678513
[HOST] [DSP1 ] 97.967699 s: OpenCL output: 1033990.625000
[HOST] [DSP1 ] 97.967730 s:
[HOST] [DSP1 ] 97.967730 s:
[HOST] [DSP1 ] 97.967760 s: TEST INPUT POINTER
[HOST] [DSP1 ] 97.967791 s: Pointer Input: -0.000000 0.662687 0.932101 -0.000000
[HOST] [DSP1 ] 97.967821 s:
[HOST] [DSP1 ] 97.967821 s: Intrisics input: 0.662687 -0.000000 -0.000000 0.932101
[HOST] [DSP1 ] 97.967852 s: Intrinsics output: -0.000000
[HOST] [DSP1 ] 97.967913 s: OpenCL input: -0.000000 0.294224 0.932101 0.662687
[HOST] [DSP1 ] 97.967913 s: OpenCL output: 0.389957
As can be seen, the data is loaded correctly from the local arrays, but in TEST INPUT POINTER the OpenCL load is shifted by one float: 0.294224 is the float located just before &in_ptr[0], and vec_bf2 likewise starts at in_ptr[1] instead of in_ptr[2]. This leads to a wrong output.
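One possible explanation, stated as an assumption rather than confirmed compiler behaviour: the float2 dereference is compiled as an aligned 8-byte load, and such a load drops the low three bits of the address when in_ptr is only 4-byte aligned. The portable C sketch below models that. The helper names (is_aligned8, round_down8, load_pair) are hypothetical and not part of any TI header; they only show how to check a pointer before casting it and how to fall back to an alignment-agnostic copy.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when p sits on an 8-byte boundary. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hypothetical helper: the address an aligned 8-byte load would read from
 * if the low three address bits were ignored (assumed behaviour). */
static const float *round_down8(const float *p)
{
    return (const float *)((uintptr_t)p & ~(uintptr_t)7);
}

/* Alignment-agnostic load of two consecutive floats starting at p. */
static void load_pair(const float *p, float out[2])
{
    memcpy(out, p, 2 * sizeof(float));
}
```

If is_aligned8(&in_ptr[0]) is false, round_down8(&in_ptr[0]) lands one float earlier, which would match the 0.294224 value in the log above. On the C66 itself, the unaligned intrinsic _mem8_f2 already used in the test code plays the role of load_pair, which would explain why the intrinsics path prints the expected inputs.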