[FAQ] TDA4AH-Q1: C7x Scalable Vector Programming

Part Number: TDA4AH-Q1

Hi Team,

Can you explain scalable vector programming with an example?

Regards,

Betsy Varughese

  • What is the purpose of scalable vector programming?

    The C7x DSP family comes in several C7000 variants, and the vector length differs between them. On some variants (e.g., C7100 and C7120), a vector can be up to 512 bits wide; on others (e.g., C7504 and C7524), a vector can be up to 256 bits wide. It is therefore helpful to write vector code in a vector-length-agnostic way: the programmer writes the C++ code for an algorithm once, and it compiles and runs on each C7000 variant without changes, using the maximum vector size available on that variant. The C7000 C++ compiler and C7000 host emulation support this through a feature called the Scalable Vector Programming Model.

    The scalable vector programming model consists of scalable vector types and associated C++ type traits.

    Scalable Vector Types:

    The scalable vector types, together with the associated C++ type traits, let the programmer write code that compiles and runs on all C7x variants. They can be used only in C++ code, not in C code. The size of a scalable vector type depends on the C7x variant being compiled for. For example, c7x::float_vec holds 16 float elements (512 bits) on C7100 and C7120, but only 8 float elements (256 bits) on C7504 and C7524.
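
    The relationship between vector width and element count can be sketched in portable C++. This is only a host-side model under stated assumptions: elements_per_vector and vectors_needed are hypothetical names, not part of c7x_scalable.h, and the sketch assumes a 4-byte float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model (not the c7x_scalable.h implementation):
// number of elements of type T that fit in a vector of vector_bits bits.
template <typename T>
constexpr std::size_t elements_per_vector(std::size_t vector_bits)
{
    return vector_bits / (8 * sizeof(T));
}

// 512-bit variants (C7100 and C7120 in the text above); assumes 4-byte float
static_assert(elements_per_vector<float>(512) == 16, "16 floats in 512 bits");
// 256-bit variants (C7504 and C7524 in the text above)
static_assert(elements_per_vector<float>(256) == 8, "8 floats in 256 bits");

// Hypothetical helper: how many vectors are needed to cover n elements.
template <typename T>
std::size_t vectors_needed(std::size_t n, std::size_t vector_bits)
{
    const std::size_t lanes = elements_per_vector<T>(vector_bits);
    assert(lanes > 0);
    return (n + lanes - 1) / lanes;  // round up
}
```

    With this model, a 64-element float array needs 4 vectors at 512 bits and 8 vectors at 256 bits, which is the kind of difference the scalable types are meant to hide from the source code.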

    To add scalability to a loop, including one that uses the streaming engine, follow these steps:

    1. Include c7x_scalable.h in the source code, i.e. #include <c7x_scalable.h>

       Note: These utilities are available in C++ code only, because their implementation relies on C++ language features.

    2. Based on the implementation logic, use the available APIs from c7x_scalable.h

    Sample code using C7x scalable vector programming concepts:

    Below is an example of a function that accepts two integer vectors, adds them element-wise, and returns an integer vector. The C7000 scalable vector types become available by including c7x_scalable.h in your source file.

    #include <c7x.h>
    #include <c7x_scalable.h>
    
    c7x::int_vec add_two_int_vectors(c7x::int_vec a, c7x::int_vec b)
    {
        return a + b;
    }

    Here, c7x::int_vec has 16 elements on the C7100 and C7120 variants, and 8 elements on the C7504 and C7524 variants.

    Sample code:

    1. Without using the streaming engine

       
    #include <c7x.h>
    
    #define ARRAY_SIZE (64)
    #define SIMD_WIDTH (64)
    
    void vadd_exec_c7x(int8_t *pInA, int8_t *pInB, int8_t *pOutC)
    {   
        for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += SIMD_WIDTH) {
     
            // Read a vector of 64 8-bit elements from input array A
            uchar64 vInA = *stov_ptr(uchar64, (int8_t *)(pInA + ctr));
            // Reinterpret it as 32 16-bit elements (ushort32)
            ushort32 InA = __as_ushort32(vInA);

            // Read a vector of 64 8-bit elements from input array B
            uchar64 vInB = *stov_ptr(uchar64, (int8_t *)(pInB + ctr));
            // Reinterpret it as 32 16-bit elements (ushort32)
            ushort32 InB = __as_ushort32(vInB);

            // Add 32 16-bit elements in parallel
            ushort32 vOutC = (InA + InB);
            // Reinterpret the result as 64 8-bit elements (uchar64)
            uchar64  OutC = __as_uchar64(vOutC);

            // Store 64 8-bit elements to output array C
            *stov_ptr(uchar64, (int8_t *)(pOutC + ctr)) = OutC;
        }
    
    }
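
    One detail of the sample above is worth spelling out: because the bytes are reinterpreted as 16-bit elements before the add, a carry out of the low byte propagates into the high byte of the same 16-bit lane. The portable sketch below models a single lane to show the difference from a per-byte add. BytePair, add_as_16bit_lane and add_per_byte are hypothetical names, and the little-endian packing (element 0 in the low byte) is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Two neighbouring 8-bit elements that share one 16-bit lane.
// Assumption: element 0 sits in the low byte (little-endian packing).
struct BytePair {
    std::uint8_t lo;
    std::uint8_t hi;
};

// Add the way a single 16-bit lane would: a carry out of the low byte
// flows into the high byte.
inline BytePair add_as_16bit_lane(BytePair a, BytePair b)
{
    const std::uint16_t la = static_cast<std::uint16_t>(a.lo | (a.hi << 8));
    const std::uint16_t lb = static_cast<std::uint16_t>(b.lo | (b.hi << 8));
    const std::uint16_t sum = static_cast<std::uint16_t>(la + lb);
    return { static_cast<std::uint8_t>(sum & 0xFF),
             static_cast<std::uint8_t>(sum >> 8) };
}

// Add each byte independently, wrapping modulo 256.
inline BytePair add_per_byte(BytePair a, BytePair b)
{
    return { static_cast<std::uint8_t>(a.lo + b.lo),
             static_cast<std::uint8_t>(a.hi + b.hi) };
}
```

    For inputs whose byte sums stay below 256, both functions give the same result. When a low byte overflows (for example 200 + 100), the 16-bit lane version increments the neighbouring high byte, while the per-byte version does not. If an independent 8-bit sum per element is what is wanted, adding the uchar vectors directly, without the reinterpretation, would likely be the simpler choice.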


    With scalable vector programming concepts:
       
    #include <c7x_scalable.h>
    
    using namespace  c7x;
    
    #define ARRAY_SIZE (64)
    
    void vadd_exec_c7x_scalable(int8_t *pInA, int8_t *pInB, int8_t *pOutC)
    {  
        // Full-width vector types for the variant being compiled for
        typedef typename c7x::make_full_vector<c7x::uchar_vec>::type  vec1;
        typedef typename c7x::make_full_vector<c7x::ushort_vec>::type vec2;
        // Number of 8-bit elements in one full vector
        // (64 on 512-bit variants, 32 on 256-bit variants)
        int32_t eleCount = c7x::element_count_of<vec1>::value;

        for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += eleCount) {
            // Read one full vector of 8-bit elements from input array A
            vec1 vInA = *stov_ptr(vec1, (int8_t *)(pInA + ctr));
            // Reinterpret it as a full vector of 16-bit elements. c7x::reinterpret
            // is used instead of __as_ushort32, which would fix the width at 512 bits
            vec2 InA = c7x::reinterpret<vec2>(vInA);

            // Read one full vector of 8-bit elements from input array B
            vec1 vInB = *stov_ptr(vec1, (int8_t *)(pInB + ctr));
            // Reinterpret it as a full vector of 16-bit elements
            vec2 InB = c7x::reinterpret<vec2>(vInB);

            // Add the 16-bit elements in parallel
            vec2 vOutC = (InA + InB);
            // Reinterpret the result as a full vector of 8-bit elements
            vec1 OutC = c7x::reinterpret<vec1>(vOutC);

            // Store one full vector of 8-bit elements to output array C
            *stov_ptr(vec1, (int8_t *)(pOutC + ctr)) = OutC;
        }
    }
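
    The effect of replacing the fixed SIMD_WIDTH with an element count derived from the vector type can be modelled on a host. In the sketch below, the template parameter Lanes stands in for c7x::element_count_of<vec1>::value. The function vadd_model is a hypothetical name, and for simplicity it uses a plain per-element add rather than the reinterpret-and-add sequence of the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host model of a vector-length-agnostic loop.
// Lanes plays the role of c7x::element_count_of<vec1>::value.
// Returns the number of "vector" iterations that were executed.
template <std::size_t Lanes>
std::size_t vadd_model(const std::uint8_t *a, const std::uint8_t *b,
                       std::uint8_t *out, std::size_t n)
{
    static_assert(Lanes > 0, "a vector needs at least one lane");
    // Like the sample, this model has no tail handling.
    assert(n % Lanes == 0);

    std::size_t iterations = 0;
    for (std::size_t ctr = 0; ctr < n; ctr += Lanes) {
        // One pass over the lanes models one vector operation.
        for (std::size_t lane = 0; lane < Lanes; ++lane) {
            out[ctr + lane] =
                static_cast<std::uint8_t>(a[ctr + lane] + b[ctr + lane]);
        }
        ++iterations;
    }
    return iterations;
}
```

    For a 64-element array, vadd_model<64> runs one iteration and vadd_model<32> runs two, with identical results. The loop body is the same in both cases; only the lane count supplied by the type changes.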


    2. Using Streaming Engine

       
    #include <c7x.h>
    
    #define ARRAY_SIZE (64)
    #define SIMD_WIDTH (64)
    
    void vadd_exec_c7x (int8_t *pInA, int8_t *pInB, int8_t *pOutC)
    {   
        // SE- SA config params
        __SE_TEMPLATE_v1 se0Params;
        __SA_TEMPLATE_v1 sa0Params;
    
        __SE_ELETYPE SE_ELETYPE;
        __SE_VECLEN  SE_VECLEN;
        __SA_VECLEN  SA_VECLEN;
    
        SE_VECLEN  = __SE_VECLEN_64ELEMS;
        SA_VECLEN  = __SA_VECLEN_64ELEMS;
        SE_ELETYPE = __SE_ELETYPE_8BIT;
    
        /**********************************************************************/
        /* Prepare the SE template used to fetch both inputs (SE0 and SE1)    */
        /**********************************************************************/
        se0Params = __gen_SE_TEMPLATE_v1();
        se0Params.ICNT0   = ARRAY_SIZE;
        se0Params.ELETYPE = SE_ELETYPE;
        se0Params.VECLEN  = SE_VECLEN;
        se0Params.DIMFMT  = __SE_DIMFMT_1D;
    
        /**********************************************************************/
        /* Prepare SA template to store output                                */
        /**********************************************************************/
        sa0Params = __gen_SA_TEMPLATE_v1();
        sa0Params.ICNT0  = ARRAY_SIZE;
        sa0Params.DIM1   = ARRAY_SIZE;
        sa0Params.VECLEN = SA_VECLEN;
        sa0Params.DIMFMT = __SA_DIMFMT_1D;
    
        // Open SE0 and SE1 with the same template, then open SA0
        __SE0_OPEN((int8_t *)pInA, se0Params);
        __SE1_OPEN((int8_t *)pInB, se0Params);
        __SA0_OPEN(sa0Params);

        for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += SIMD_WIDTH) {
            // Read a vector of 64 8-bit elements from input array A and advance SE0
            uchar64 vInA = __SE0ADV(uchar64);
            // Reinterpret it as 32 16-bit elements (ushort32)
            ushort32 InA = __as_ushort32(vInA);

            // Read a vector of 64 8-bit elements from input array B and advance SE1
            uchar64 vInB = __SE1ADV(uchar64);
            // Reinterpret it as 32 16-bit elements (ushort32)
            ushort32 InB = __as_ushort32(vInB);

            // Add 32 16-bit elements in parallel
            ushort32 vOutC = (InA + InB);
            // Reinterpret the result as 64 8-bit elements (uchar64)
            uchar64  OutC = __as_uchar64(vOutC);

            // Store 64 8-bit elements to output array C and advance SA0,
            // so that the next iteration stores to the next vector
            *__SA0ADV(uchar64, pOutC) = OutC;
        }

        __SE0_CLOSE();
        __SE1_CLOSE();
        __SA0_CLOSE();
    
    }
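
    The fixed-width version above hard-codes both the vector type (uchar64) and the vector-length setting (__SE_VECLEN_64ELEMS). A type trait can derive such a setting from the lane count instead. The sketch below models that idea in portable C++. ModelVecLen and model_veclen are hypothetical stand-ins for illustration only; they are not the enums from c7x.h or the traits from c7x_scalable.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a vector-length enum.
enum class ModelVecLen { Elems32, Elems64 };

// Hypothetical trait: maps a lane count to the matching enum value.
// The primary template is left undefined, so an unsupported lane count
// fails at compile time.
template <std::size_t Lanes>
struct model_veclen;

template <>
struct model_veclen<32> {
    static constexpr ModelVecLen value = ModelVecLen::Elems32;
};

template <>
struct model_veclen<64> {
    static constexpr ModelVecLen value = ModelVecLen::Elems64;
};

static_assert(model_veclen<64>::value == ModelVecLen::Elems64,
              "64 lanes map to the 64-element setting");
static_assert(model_veclen<32>::value == ModelVecLen::Elems32,
              "32 lanes map to the 32-element setting");
```

    Because the mapping is resolved at compile time, code that configures a template through such a trait needs no per-variant branches.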


    With scalable vector programming concepts:

    #include <c7x_scalable.h>
    
    using namespace  c7x;
    
    #define ARRAY_SIZE (64)
    #define SIMD_WIDTH (64)
    
    void vadd_exec_c7x_scalable (int8_t *pInA, int8_t *pInB, int8_t *pOutC)
    {   
        typedef typename c7x::make_full_vector<c7x::uchar_vec>::type  vec1;
        typedef typename c7x::make_full_vector<c7x::ushort_vec>::type vec2;
        int32_t eleCount = c7x::element_count_of<vec1>::value;
    
        // SE- SA config params
        __SE_TEMPLATE_v1 se0Params;
        __SA_TEMPLATE_v1 sa0Params;
    
        __SE_ELETYPE SE_ELETYPE;
        __SE_VECLEN  SE_VECLEN;
        __SA_VECLEN  SA_VECLEN;
    
        SE_VECLEN  = c7x::se_veclen<vec1>::value;
        SA_VECLEN  = c7x::sa_veclen<vec1>::value;
        SE_ELETYPE = c7x::se_eletype<vec1>::value;
    
        /**********************************************************************/
        /* Prepare the SE template used to fetch both inputs (SE0 and SE1)    */
        /**********************************************************************/
        se0Params = __gen_SE_TEMPLATE_v1();
        se0Params.ICNT0   = ARRAY_SIZE;
        se0Params.ELETYPE = SE_ELETYPE;
        se0Params.VECLEN  = SE_VECLEN;
        se0Params.DIMFMT  = __SE_DIMFMT_1D;
    
        /**********************************************************************/
        /* Prepare SA template to store output                                */
        /**********************************************************************/
        sa0Params = __gen_SA_TEMPLATE_v1();
        sa0Params.ICNT0  = ARRAY_SIZE;
        sa0Params.DIM1   = ARRAY_SIZE;
        sa0Params.VECLEN = SA_VECLEN;
        sa0Params.DIMFMT = __SA_DIMFMT_1D;
    
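        // Open SE0 and SE1 with the same template, then open SA0
        __SE0_OPEN((int8_t *)pInA, se0Params);
        __SE1_OPEN((int8_t *)pInB, se0Params);
        __SA0_OPEN(sa0Params);

        for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += eleCount) {
            // Read one full vector of 8-bit elements from input array A and advance SE0
            vec1 vInA = strm_eng<0, vec1>::get_adv();
            // Reinterpret it as a full vector of 16-bit elements. c7x::reinterpret
            // is used instead of __as_ushort32, which would fix the width at 512 bits
            vec2 InA = c7x::reinterpret<vec2>(vInA);

            // Read one full vector of 8-bit elements from input array B and advance SE1
            vec1 vInB = strm_eng<1, vec1>::get_adv();
            // Reinterpret it as a full vector of 16-bit elements
            vec2 InB = c7x::reinterpret<vec2>(vInB);

            // Add the 16-bit elements in parallel
            vec2 vOutC = (InA + InB);
            // Reinterpret the result as a full vector of 8-bit elements
            vec1 OutC = c7x::reinterpret<vec1>(vOutC);

            // Store one full vector to output array C: read the SA0 predicate,
            // advance SA0 to get the store address, then do a predicated store
            __vpred tmp = c7x::strm_agen<0, vec1>::get_vpred();
            vec1 *VB1 = c7x::strm_agen<0, vec1>::get_adv(pOutC);
            __vstore_pred(tmp, VB1, OutC);
        }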
    
        __SE0_CLOSE();
        __SE1_CLOSE();
        __SA0_CLOSE();
    }
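
    The predicated store at the end of the loop above matters when the element count is not a multiple of the vector length. The portable sketch below models the idea: lanes that fall beyond the remaining element count are masked off, so the final store does not write past the end of the output. copy_with_tail is a hypothetical name, and the predicate rule (lane index below the remaining count) is an assumption about how the behaviour can be modelled, not a description of the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host model of a loop that ends with a predicated store.
// Lanes is the number of elements per vector. Returns the number of
// elements actually written.
template <std::size_t Lanes>
std::size_t copy_with_tail(const std::uint8_t *in, std::uint8_t *out,
                           std::size_t n)
{
    static_assert(Lanes > 0, "a vector needs at least one lane");
    assert(in != nullptr && out != nullptr);

    std::size_t stored = 0;
    for (std::size_t ctr = 0; ctr < n; ctr += Lanes) {
        const std::size_t remaining = n - ctr;
        for (std::size_t lane = 0; lane < Lanes; ++lane) {
            // Model of the predicate: only lanes inside the array are on.
            const bool pred = lane < remaining;
            if (pred) {
                out[ctr + lane] = in[ctr + lane];
                ++stored;
            }
        }
    }
    return stored;
}
```

    With 10 elements and 8 lanes, the second iteration writes only 2 elements and leaves the rest of the output buffer untouched. In the sample above ARRAY_SIZE is a multiple of the vector length on both the 512-bit and 256-bit variants, so every lane is active there.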


    Regards,
    Betsy Varughese