Hi Team,
Can you explain scalable vector programming with an example?
Regards,
Betsy Varughese
What is the purpose of using scalable vector programming:
The C7000 family has several variants, and the vector length differs between them. On some variants (e.g., C7100 and C7120) a vector can be up to 512 bits wide, while on others (e.g., C7504 and C7524) a vector can be up to 256 bits wide. It is therefore helpful to write vector code in a vector-length-agnostic way, i.e., to write the C++ code for a particular algorithm once and have it compile and run on each C7000 variant without source changes, using the maximum vector size available on that variant. To support this, the C7000 C++ compiler and C7000 host emulation provide a feature called the Scalable Vector Programming Model.
The scalable vector programming model consists of scalable vector types and associated C++ type traits.
Scalable Vector Types:
The scalable vector types, along with the associated C++ type traits, allow the programmer to write code that compiles and runs seamlessly on all C7x variants. They can be used only in C++ code, not in C code. When a scalable vector type is used, the size of the type depends on the C7x variant being compiled for. For example, the c7x::float_vec type is 16 float elements (512 bits) on C7100 and C7120, but only 8 float elements (256 bits) on C7504 and C7524.
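To make the size relationship concrete, here is a small host-side sketch in plain C++. It is not TI code: `element_count_model` is a hypothetical helper, and its `VectorBits` parameter only stands in for the vector width of the variant being compiled for. It assumes a 32-bit `float`, which holds on typical hosts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helper, not part of c7x_scalable.h.
// VectorBits models the vector width (in bits) of the target variant.
// Assumes 8-bit bytes and a 32-bit float.
template <typename Elem, std::size_t VectorBits>
constexpr std::size_t element_count_model() {
    return VectorBits / (8 * sizeof(Elem));
}

// 512-bit variants (e.g., C7100, C7120)
static_assert(element_count_model<float, 512>() == 16, "16 floats per vector");
static_assert(element_count_model<std::uint8_t, 512>() == 64, "64 bytes per vector");

// 256-bit variants (e.g., C7504, C7524)
static_assert(element_count_model<float, 256>() == 8, "8 floats per vector");
static_assert(element_count_model<std::uint8_t, 256>() == 32, "32 bytes per vector");
```

On the real target this number is not computed by hand; the samples in this post read it from c7x::element_count_of<...>::value.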
To add scalability to a loop that uses the streaming engine, follow these steps:
1. Include the c7x_scalable.h header file in your source file.
Note: These utilities are available for use in C++ code only due to use of C++ language features in their implementation.
2. Based on the implementation logic, use the available APIs from c7x_scalable.h.
Sample code using C7x scalable vector programming concepts:
Below is an example of a function accepting two integer vectors, adding them element-wise, and returning an integer vector. C7000 scalable vector types can be accessed by including the c7x_scalable.h file in your source file.
#include <c7x.h>
#include <c7x_scalable.h>

c7x::int_vec add_two_int_vectors(c7x::int_vec a, c7x::int_vec b)
{
    return a + b;
}

Here, c7x::int_vec will be 16 elements on the C7100 and C7120 variants, and 8 elements on the C7504 and C7524 variants.
Sample codes:
1. Without using streaming engine
#include <c7x.h>

#define ARRAY_SIZE (64)
#define SIMD_WIDTH (64)

void vadd_exec_c7x(int8_t *pInA, int8_t *pInB, int8_t *pOutC)
{
    for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += SIMD_WIDTH)
    {
        // Read a vector of 64-8b elements from input Array A
        uchar64 vInA = *stov_ptr(uchar64, (int8_t *)(pInA + ctr));
        // Reinterprets it as ushort_vec
        ushort32 InA = __as_ushort32(vInA);

        // Read a vector of 64-8b elements from input Array B
        uchar64 vInB = *stov_ptr(uchar64, (int8_t *)(pInB + ctr));
        // Reinterprets it as ushort_vec
        ushort32 InB = __as_ushort32(vInB);

        // Added 32-16b elements in parallel
        ushort32 vOutC = (InA + InB);
        // Converts back to uchar_vec
        uchar64 OutC = __as_uchar64(vOutC);

        // Store 64-8b elements to output array C
        *stov_ptr(uchar64, (int8_t *)(pOutC + ctr)) = OutC;
    }
}

With scalable vector programming concepts:
#include <c7x.h>
#include <c7x_scalable.h>

using namespace c7x;

#define ARRAY_SIZE (64)

void vadd_exec_c7x_scalable(int8_t *pInA, int8_t *pInB, int8_t *pOutC)
{
    // Full-width vector types for the variant being compiled for
    typedef typename c7x::make_full_vector<c7x::uchar_vec>::type vec1;
    typedef typename c7x::make_full_vector<c7x::ushort_vec>::type vec2;

    // 64 on 512-bit variants, 32 on 256-bit variants
    int32_t eleCount = c7x::element_count_of<vec1>::value;

    for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += eleCount)
    {
        // Read a full vector of 8b elements from input Array A
        vec1 vInA = *stov_ptr(vec1, (int8_t *)(pInA + ctr));
        // Reinterpret as 16b elements; a fixed-width cast such as
        // __as_ushort32 would only compile on 512-bit variants
        vec2 InA = c7x::reinterpret<vec2>(vInA);

        // Read a full vector of 8b elements from input Array B
        vec1 vInB = *stov_ptr(vec1, (int8_t *)(pInB + ctr));
        // Reinterpret as 16b elements
        vec2 InB = c7x::reinterpret<vec2>(vInB);

        // Add the 16b elements in parallel
        vec2 vOutC = (InA + InB);
        // Reinterpret back as 8b elements
        vec1 OutC = c7x::reinterpret<vec1>(vOutC);

        // Store a full vector of 8b elements to output array C
        *stov_ptr(vec1, (int8_t *)(pOutC + ctr)) = OutC;
    }
}
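The point of the scalable version is that only the step size of the loop changes between variants, not the result. The sketch below models that on a host in plain C++. It is not TI code: `vadd_model` and `run_model` are hypothetical names, and the `VectorBytes` template parameter plays the role that c7x::element_count_of<vec1>::value plays on the target. As in the samples above, the array size is assumed to be a multiple of the vector length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side model, not TI code.
// Processes n bytes in chunks of VectorBytes, adding each chunk as 16-bit
// lanes (the same reinterpret, add, reinterpret-back pattern as above).
// Returns the number of loop iterations. n must be a multiple of VectorBytes.
template <std::size_t VectorBytes>
std::size_t vadd_model(const std::uint8_t* a, const std::uint8_t* b,
                       std::uint8_t* out, std::size_t n) {
    static_assert(VectorBytes % 2 == 0, "need whole 16-bit lanes");
    std::size_t iterations = 0;
    for (std::size_t ctr = 0; ctr < n; ctr += VectorBytes) {
        for (std::size_t lane = 0; lane < VectorBytes; lane += 2) {
            std::uint16_t x, y;
            std::memcpy(&x, a + ctr + lane, sizeof x);  // view two bytes as one 16-bit lane
            std::memcpy(&y, b + ctr + lane, sizeof y);
            const std::uint16_t sum = static_cast<std::uint16_t>(x + y);
            std::memcpy(out + ctr + lane, &sum, sizeof sum);
        }
        ++iterations;
    }
    return iterations;
}

struct ModelResult {
    std::size_t iterations;
    std::vector<std::uint8_t> out;
};

// Runs the model on a 64-byte array (ARRAY_SIZE in the samples).
// Inputs are chosen so that no byte sum exceeds 255.
template <std::size_t VectorBytes>
ModelResult run_model() {
    const std::size_t n = 64;
    std::vector<std::uint8_t> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<std::uint8_t>(i);
        b[i] = static_cast<std::uint8_t>(2 * i);
    }
    ModelResult r{0, std::vector<std::uint8_t>(n)};
    r.iterations = vadd_model<VectorBytes>(a.data(), b.data(), r.out.data(), n);
    return r;
}
```

With this model, run_model<64>() (a 512-bit vector) takes one iteration and run_model<32>() (a 256-bit vector) takes two, and both produce the same output array.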
2. Using Streaming Engine
#include <c7x.h>

#define ARRAY_SIZE (64)
#define SIMD_WIDTH (64)

void vadd_exec_c7x(int8_t *pInA, int8_t *pInB, int8_t *pOutC)
{
    // SE- SA config params
    __SE_TEMPLATE_v1 se0Params;
    __SA_TEMPLATE_v1 sa0Params;

    __SE_ELETYPE SE_ELETYPE;
    __SE_VECLEN  SE_VECLEN;
    __SA_VECLEN  SA_VECLEN;

    SE_VECLEN  = __SE_VECLEN_64ELEMS;
    SA_VECLEN  = __SA_VECLEN_64ELEMS;
    SE_ELETYPE = __SE_ELETYPE_8BIT;

    /**********************************************************************/
    /* Prepare the SE template used by SE0 and SE1 to fetch the inputs    */
    /**********************************************************************/
    se0Params = __gen_SE_TEMPLATE_v1();

    se0Params.ICNT0   = ARRAY_SIZE;
    se0Params.ELETYPE = SE_ELETYPE;
    se0Params.VECLEN  = SE_VECLEN;
    se0Params.DIMFMT  = __SE_DIMFMT_1D;

    /**********************************************************************/
    /* Prepare SA template to store output                                */
    /**********************************************************************/
    sa0Params = __gen_SA_TEMPLATE_v1();

    sa0Params.ICNT0  = ARRAY_SIZE;
    sa0Params.DIM1   = ARRAY_SIZE;
    sa0Params.VECLEN = SA_VECLEN;
    sa0Params.DIMFMT = __SA_DIMFMT_1D;

    __SE0_OPEN((int8_t *)pInA, se0Params);
    __SE1_OPEN((int8_t *)pInB, se0Params);
    __SA0_OPEN(sa0Params);

    for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += SIMD_WIDTH)
    {
        // Read a vector of 64-8b elements from input Array A
        uchar64 vInA = __SE0ADV(uchar64);
        // Reinterprets it as ushort_vec
        ushort32 InA = __as_ushort32(vInA);

        // Read a vector of 64-8b elements from input Array B
        uchar64 vInB = __SE1ADV(uchar64);
        // Reinterprets it as ushort_vec
        ushort32 InB = __as_ushort32(vInB);

        // Added 32-16b elements in parallel
        ushort32 vOutC = (InA + InB);
        // Converts back to uchar_vec
        uchar64 OutC = __as_uchar64(vOutC);

        // Store 64-8b elements to output array C
        *__SA0(uchar64, pOutC) = OutC;
    }

    __SE0_CLOSE();
    __SE1_CLOSE();
    __SA0_CLOSE();
}
With scalable vector programming concepts:
#include <c7x.h>
#include <c7x_scalable.h>

using namespace c7x;

#define ARRAY_SIZE (64)

void vadd_exec_c7x_scalable(int8_t *pInA, int8_t *pInB, int8_t *pOutC)
{
    // Full-width vector types for the variant being compiled for
    typedef typename c7x::make_full_vector<c7x::uchar_vec>::type vec1;
    typedef typename c7x::make_full_vector<c7x::ushort_vec>::type vec2;

    // 64 on 512-bit variants, 32 on 256-bit variants
    int32_t eleCount = c7x::element_count_of<vec1>::value;

    // SE- SA config params
    __SE_TEMPLATE_v1 se0Params;
    __SA_TEMPLATE_v1 sa0Params;

    __SE_ELETYPE SE_ELETYPE;
    __SE_VECLEN  SE_VECLEN;
    __SA_VECLEN  SA_VECLEN;

    // Derive the SE/SA settings from the vector type instead of hard-coding them
    SE_VECLEN  = c7x::se_veclen<vec1>::value;
    SA_VECLEN  = c7x::sa_veclen<vec1>::value;
    SE_ELETYPE = c7x::se_eletype<vec1>::value;

    /**********************************************************************/
    /* Prepare the SE template used by SE0 and SE1 to fetch the inputs    */
    /**********************************************************************/
    se0Params = __gen_SE_TEMPLATE_v1();

    se0Params.ICNT0   = ARRAY_SIZE;
    se0Params.ELETYPE = SE_ELETYPE;
    se0Params.VECLEN  = SE_VECLEN;
    se0Params.DIMFMT  = __SE_DIMFMT_1D;

    /**********************************************************************/
    /* Prepare SA template to store output                                */
    /**********************************************************************/
    sa0Params = __gen_SA_TEMPLATE_v1();

    sa0Params.ICNT0  = ARRAY_SIZE;
    sa0Params.DIM1   = ARRAY_SIZE;
    sa0Params.VECLEN = SA_VECLEN;
    sa0Params.DIMFMT = __SA_DIMFMT_1D;

    __SE0_OPEN((int8_t *)pInA, se0Params);
    __SE1_OPEN((int8_t *)pInB, se0Params);
    __SA0_OPEN(sa0Params);

    for(int32_t ctr = 0; ctr < ARRAY_SIZE; ctr += eleCount)
    {
        // Read a full vector of 8b elements from input Array A
        vec1 vInA = strm_eng<0, vec1>::get_adv();
        // Reinterpret as 16b elements; a fixed-width cast such as
        // __as_ushort32 would only compile on 512-bit variants
        vec2 InA = c7x::reinterpret<vec2>(vInA);

        // Read a full vector of 8b elements from input Array B
        vec1 vInB = strm_eng<1, vec1>::get_adv();
        // Reinterpret as 16b elements
        vec2 InB = c7x::reinterpret<vec2>(vInB);

        // Add the 16b elements in parallel
        vec2 vOutC = (InA + InB);
        // Reinterpret back as 8b elements
        vec1 OutC = c7x::reinterpret<vec1>(vOutC);

        // Store a full vector of 8b elements to output array C,
        // using the predicate generated by the streaming address generator
        __vpred tmp = c7x::strm_agen<0, vec1>::get_vpred();
        vec1 *VB1 = c7x::strm_agen<0, vec1>::get_adv(pOutC);
        __vstore_pred(tmp, VB1, OutC);
    }

    __SE0_CLOSE();
    __SE1_CLOSE();
    __SA0_CLOSE();
}
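The scalable streaming-engine version stores through a predicate taken from the address generator rather than with a plain vector store. In the sample, ARRAY_SIZE is a multiple of the vector length, so every lane is written on every iteration. The host-side sketch below illustrates, in plain C++, what a predicated store buys when that is not the case. It is not TI code: `tail_predicate_model`, `store_pred_model` and `store_40_bytes_with_32_lanes` are hypothetical names, and the predicate is modeled simply as "lane index is less than the number of elements still to be written".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model, not TI code.
// Builds a per-lane predicate: lane i is enabled only if i < remaining.
template <std::size_t Lanes>
std::vector<bool> tail_predicate_model(std::size_t remaining) {
    std::vector<bool> pred(Lanes);
    for (std::size_t i = 0; i < Lanes; ++i) {
        pred[i] = i < remaining;
    }
    return pred;
}

// Writes data[i] to dst[i] only where pred[i] is set.
// Returns the number of lanes actually written.
inline std::size_t store_pred_model(const std::vector<bool>& pred,
                                    std::uint8_t* dst,
                                    const std::uint8_t* data) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        if (pred[i]) {
            dst[i] = data[i];
            ++written;
        }
    }
    return written;
}

// Example: write 40 bytes using 32-lane stores (one full store, one 8-lane tail).
// Buffers are padded to 64 bytes so the masked-off lanes are visible as zeros.
inline std::vector<std::uint8_t> store_40_bytes_with_32_lanes() {
    const std::size_t n = 40;
    const std::size_t lanes = 32;
    std::vector<std::uint8_t> src(64, 7);
    std::vector<std::uint8_t> dst(64, 0);
    for (std::size_t ctr = 0; ctr < n; ctr += lanes) {
        const std::vector<bool> pred = tail_predicate_model<32>(n - ctr);
        store_pred_model(pred, dst.data() + ctr, src.data() + ctr);
    }
    return dst;
}
```

In this model the second store writes only 8 lanes, so bytes 40 to 63 of the destination are left untouched.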
Regards,
Betsy Varughese