TDA2SX: OpenCL on C66

Part Number: TDA2SX
Other Parts Discussed in Thread: TDA2

Hello,

I would like to know where to find documentation on OpenCL for the C66, since I have not found anything really useful. I also have the following questions:

  • Is using OpenCL recommended over using intrinsics directly?
  • Should using OpenCL lead to the same or better performance than using intrinsics, or can it worsen performance?
  • Where can I find some examples of how to use it for the C66?
  • How do I properly load data into floatn data types while respecting memory alignment?

    I have tried to rewrite some code, but the outputs differ in some cases, which I suspect is due to memory alignment.

    With the following test code:

    void test_func(float *  restrict in_ptr)
    {
        __float2_t vec_a, vec_b;
        __float2_t vec_c = _ftof2(0.0f,0.0f);
        float test_in_1[] = {0.678513,0.75461321};
        float test_in_2[] ={847.3125,684351.13};
        float2 vec_af2, vec_bf2, vec_c2f;


        printf("\n\nTEST LOCAL ARRAY\n");
        vec_a = _mem8_f2(&test_in_1[0]);
        vec_b = _mem8_f2(&test_in_2[0]);
        vec_c = _dmpysp(vec_a,vec_b);
        vec_c = _daddsp(_dmpysp(vec_a,vec_b),vec_c);

        printf("Pointer Input: %f %f %f %f\n\n",test_in_1[0],test_in_1[1],test_in_2[0],test_in_2[1]);

        printf("Intrisics input: %f %f\n",_hif2(vec_a),_lof2(vec_a));
        printf("Intrinsics output: %f\n",_hif2(vec_c)+_lof2(vec_c));

        vec_af2 = *(float2*)(&test_in_1[0]);
        vec_bf2 = *(float2*)(&test_in_2[0]);
        vec_c2f =  vec_af2 * vec_bf2;
        vec_c2f +=  vec_af2 * vec_bf2;

        printf("OpenCL input: %f %f\n",vec_af2.hi, vec_af2.lo);
        printf("OpenCL output: %f\n",vec_c2f.hi + vec_c2f.lo);

            
        printf("\n\nTEST INPUT POINTER\n");
        vec_a = _mem8_f2(&in_ptr[0]);
        vec_b = _mem8_f2(&in_ptr[2]);
        vec_c = _dmpysp(vec_a,vec_b);
        vec_c = _daddsp(_dmpysp(vec_a,vec_b),vec_c);

        printf("Pointer Input: %f %f %f %f\n\n",in_ptr[0],in_ptr[1],in_ptr[2],in_ptr[3]);
        printf("Intrisics input: %f %f %f %f\n",_hif2(vec_a),_lof2(vec_a),_hif2(vec_b),_lof2(vec_b));
        printf("Intrinsics output: %f\n",_hif2(vec_c)+_lof2(vec_c));

        vec_af2 = *(float2*)(&in_ptr[0]);
        vec_bf2 = *(float2*)(&in_ptr[2]);
        vec_c2f =  vec_af2 * vec_bf2;
        vec_c2f +=  vec_af2 * vec_bf2;

        printf("OpenCL input: %f %f %f %f\n",vec_af2.hi, vec_af2.lo,vec_bf2.hi, vec_bf2.lo);
        printf("OpenCL output: %f\n",vec_c2f.hi + vec_c2f.lo);

    }


    Here in_ptr points to some location within a const float in_data[] = {...}

    I get the following results:
     [HOST] [DSP1  ]     97.967547 s: TEST LOCAL ARRAY
     [HOST] [DSP1  ]     97.967577 s: Pointer Input: 0.678513 0.754613 847.312500 684351.125000
     [HOST] [DSP1  ]     97.967608 s:  
     [HOST] [DSP1  ]     97.967638 s: Intrisics input: 0.754613 0.678513
     [HOST] [DSP1  ]     97.967669 s: Intrinsics output: 1033990.625000
     [HOST] [DSP1  ]     97.967699 s: OpenCL input: 0.754613 0.678513
     [HOST] [DSP1  ]     97.967699 s: OpenCL output: 1033990.625000
     [HOST] [DSP1  ]     97.967730 s:  
     [HOST] [DSP1  ]     97.967730 s:  
     [HOST] [DSP1  ]     97.967760 s: TEST INPUT POINTER
     [HOST] [DSP1  ]     97.967791 s: Pointer Input: -0.000000 0.662687 0.932101 -0.000000
     [HOST] [DSP1  ]     97.967821 s:  
     [HOST] [DSP1  ]     97.967821 s: Intrisics input: 0.662687 -0.000000 -0.000000 0.932101
     [HOST] [DSP1  ]     97.967852 s: Intrinsics output: -0.000000
     [HOST] [DSP1  ]     97.967913 s: OpenCL input: -0.000000 0.294224 0.932101 0.662687
     [HOST] [DSP1  ]     97.967913 s: OpenCL output: 0.389957

    With local arrays the data is loaded correctly by both the intrinsics and OpenCL. In TEST INPUT POINTER, however, the OpenCL load is shifted by one float: 0.294224 is the float located just before &in_ptr[0]. This leads to a wrong output.
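
The shifted values are consistent with one possible explanation: the float2 dereference is treated as an 8-byte-aligned load, and the low three address bits are dropped when in_ptr is only 4-byte aligned. The following is a minimal host-side sketch of that hypothesis in portable C, not TI code. The names pair_f32, is_aligned8, load_pair_any and load_pair_rounded_down are hypothetical, and the rounding-down behaviour is an assumption rather than something confirmed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a float2: two packed 32-bit floats. */
typedef struct { float lo; float hi; } pair_f32;

/* Returns 1 if p sits on an 8-byte boundary, 0 otherwise. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Alignment-safe load: copies 8 bytes starting at any float address. */
static pair_f32 load_pair_any(const float *p)
{
    pair_f32 v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Models an aligned 8-byte load that ignores the low 3 address bits.
   This is an ASSUMPTION about what the float2 dereference does when
   the pointer is not 8-byte aligned. */
static pair_f32 load_pair_rounded_down(const float *p)
{
    uintptr_t base;
    assert(p != NULL);
    base = (uintptr_t)p & ~(uintptr_t)7;
    return load_pair_any((const float *)base);
}
```

If in_ptr points one float past an 8-byte boundary, load_pair_rounded_down(in_ptr) returns the float before in_ptr[0] in its lo half, which matches the pattern in the log above, while load_pair_any(in_ptr) returns in_ptr[0] and in_ptr[1] as the _mem8_f2 path does. Checking is_aligned8(in_ptr) before using a vector dereference would be one way to confirm or rule out this hypothesis.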