Hello,
Does the TDA4VM SDK support OpenCL?
If not, what do you suggest as the most efficient way to perform image processing / matrix manipulation?
Thanks,
Nikolay.
Hello,
Currently, the SDK does not support OpenCL natively. There are ways of building the filesystem to include OpenCL support for the GPU, but this is still being tested.
There are dedicated accelerators on our device that can assist with various image-processing and matrix-multiplication needs. One option is the on-chip C7x DSPs.
Can you elaborate on what kind of image processing you want to do? That will help identify which component of the SoC would be best to use.
Regards,
Erick
Hello,
Thanks for the swift answer.
I think that using the accelerators would be a better solution for me.
The operations are:
1. Affine transformations: resize (bilinear), rotation, translation.
2. Morphological operations: erode, dilate.
If there are examples for these, I will gladly accept them.
Thanks,
Nikolay.
Nikolay,
I'm wondering if you meant OpenCV? It is available, and it would be the most straightforward way to do these transformations.
If you want to offload these tasks to other cores, that will require more custom implementations. For example, on the GPU you would need to develop an OpenGL/OpenCL/Vulkan implementation. For the other accelerators, we would need to see what form your image files are in before exploring those, as they usually don't have native Linux interfaces.
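For reference, erosion on a single-channel 8-bit image is just a sliding-window minimum over each pixel's neighborhood (dilation is the same with a maximum), so it is easy to prototype in plain C before committing to an accelerator. A minimal sketch, assuming a tightly packed row-major uint8 buffer (`erode3x3` is a hypothetical helper, not part of any SDK):

```c
#include <stdint.h>

/* 3x3 grayscale erosion: each output pixel becomes the minimum of its
 * 3x3 neighborhood. Border pixels are copied unchanged here to keep
 * the sketch short; a real implementation would pick a border policy. */
static void erode3x3(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                dst[y * w + x] = src[y * w + x];
                continue;
            }
            uint8_t m = 255;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    uint8_t v = src[(y + dy) * w + (x + dx)];
                    if (v < m)
                        m = v;
                }
            dst[y * w + x] = m;
        }
    }
}
```

Swapping the minimum for a maximum gives dilation with the same structure.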
Regards,
Erick
Hey,
Well, I want to get maximum performance; that's why I asked about OpenCL support.
The images I use are uint8 tensors.
How can I compile the kernel in order to use OpenCL? Is there any manual for that, or should I prepare for an adventure?
Is there any manual for using the dedicated hardware?
Thanks,
Nikolay.
Nikolay,
The GPU would probably be your best bet right now. I was able to compile and run an OpenCL application on my system. I can share the filesystem I am booting here:
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/tisdk_2D00_default_2D00_image_2D00_j721e_2D00_evm.tar.xz
It has the OpenCL header files and library built in.
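Before running anything bigger, a quick way to confirm the GPU is visible through that library is to enumerate the OpenCL platforms and devices. A minimal sketch (the `clinfo` utility does the same thing more thoroughly, if it happens to be on the filesystem):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;

    if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS ||
        num_platforms == 0) {
        fprintf(stderr, "No OpenCL platforms found\n");
        return 1;
    }

    for (cl_uint p = 0; p < num_platforms; p++) {
        char name[256] = {0};
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", p, name);

        cl_device_id devices[8];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           8, devices, &num_devices) != CL_SUCCESS)
            continue;

        for (cl_uint d = 0; d < num_devices; d++) {
            char dev_name[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(dev_name), dev_name, NULL);
            printf("  Device %u: %s\n", d, dev_name);
        }
    }
    return 0;
}
```

If the GPU does not show up here, the saxpy example below will fail at `clGetDeviceIDs` as well.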
These are my boot images:
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/3000.sysfw.itb
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/3000.tiboot3.bin
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/1452.tispl.bin
https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/5165.u_2D00_boot.img
You could boot your board with these in your boot partition, and the rootfs in the other partition.
I used this example that I found online:
```c
#include <stdio.h>
#include <stdlib.h>

#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

#define VECTOR_SIZE 1024

// OpenCL kernel which is run for every work-item created.
const char *saxpy_kernel =
    "__kernel                                    \n"
    "void saxpy_kernel(float alpha,              \n"
    "                  __global float *A,        \n"
    "                  __global float *B,        \n"
    "                  __global float *C)        \n"
    "{                                           \n"
    "    // Get the index of the work-item       \n"
    "    int index = get_global_id(0);           \n"
    "    C[index] = alpha * A[index] + B[index]; \n"
    "}                                           \n";

int main(void)
{
    int i;

    // Allocate space for vectors A, B and C
    float alpha = 2.0;
    float *A = (float *)malloc(sizeof(float) * VECTOR_SIZE);
    float *B = (float *)malloc(sizeof(float) * VECTOR_SIZE);
    float *C = (float *)malloc(sizeof(float) * VECTOR_SIZE);
    for (i = 0; i < VECTOR_SIZE; i++) {
        A[i] = i;
        B[i] = VECTOR_SIZE - i;
        C[i] = 0;
    }

    // Get platform and device information
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms;
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    platforms = (cl_platform_id *)malloc(sizeof(cl_platform_id) * num_platforms);
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);

    // Get the device list and choose the device you want to run on
    cl_device_id *device_list = NULL;
    cl_uint num_devices;
    clStatus = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
    device_list = (cl_device_id *)malloc(sizeof(cl_device_id) * num_devices);
    clStatus = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, num_devices, device_list, NULL);

    // Create one OpenCL context for each device in the platform
    cl_context context = clCreateContext(NULL, num_devices, device_list, NULL, NULL, &clStatus);

    // Create a command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device_list[0], 0, &clStatus);

    // Create memory buffers on the device for each vector
    cl_mem A_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,  VECTOR_SIZE * sizeof(float), NULL, &clStatus);
    cl_mem B_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,  VECTOR_SIZE * sizeof(float), NULL, &clStatus);
    cl_mem C_clmem = clCreateBuffer(context, CL_MEM_WRITE_ONLY, VECTOR_SIZE * sizeof(float), NULL, &clStatus);

    // Copy buffers A and B to the device
    clStatus = clEnqueueWriteBuffer(command_queue, A_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), A, 0, NULL, NULL);
    clStatus = clEnqueueWriteBuffer(command_queue, B_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), B, 0, NULL, NULL);

    // Create a program from the kernel source
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&saxpy_kernel, NULL, &clStatus);

    // Build the program
    clStatus = clBuildProgram(program, 1, device_list, NULL, NULL, NULL);

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "saxpy_kernel", &clStatus);

    // Set the arguments of the kernel
    clStatus = clSetKernelArg(kernel, 0, sizeof(float),  (void *)&alpha);
    clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&A_clmem);
    clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&B_clmem);
    clStatus = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&C_clmem);

    // Execute the OpenCL kernel
    size_t global_size = VECTOR_SIZE; // Process the entire vector
    size_t local_size = 64;           // Work-group size: 64 work-items per group
    clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

    // Read the device buffer C_clmem back into the host variable C
    clStatus = clEnqueueReadBuffer(command_queue, C_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), C, 0, NULL, NULL);

    // Clean up and wait for all the commands to complete
    clStatus = clFlush(command_queue);
    clStatus = clFinish(command_queue);

    // Display the result on the screen
    for (i = 0; i < VECTOR_SIZE; i++)
        printf("%f * %f + %f = %f\n", alpha, A[i], B[i], C[i]);

    // Finally, release all OpenCL-allocated objects and host buffers
    clStatus = clReleaseKernel(kernel);
    clStatus = clReleaseProgram(program);
    clStatus = clReleaseMemObject(A_clmem);
    clStatus = clReleaseMemObject(B_clmem);
    clStatus = clReleaseMemObject(C_clmem);
    clStatus = clReleaseCommandQueue(command_queue);
    clStatus = clReleaseContext(context);
    free(A);
    free(B);
    free(C);
    free(platforms);
    free(device_list);
    return 0;
}
```
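One thing the example omits is error checking; the most common failure point is `clBuildProgram`, which returns `CL_BUILD_PROGRAM_FAILURE` without saying why. The kernel compiler's log can be retrieved with `clGetProgramBuildInfo`. A sketch of a small helper for that (`print_build_log` is a hypothetical name, not part of the example above):

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the kernel compiler's log for `program` on `device`.
 * Call this when clBuildProgram() returns a non-success status. */
static void print_build_log(cl_program program, cl_device_id device)
{
    size_t log_size = 0;

    // First query the log size, then fetch the log itself
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);
    char *log = malloc(log_size + 1);
    if (!log)
        return;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);
    log[log_size] = '\0';
    fprintf(stderr, "OpenCL build log:\n%s\n", log);
    free(log);
}
```

In the example above it would be called right after the build step, e.g. `if (clStatus != CL_SUCCESS) print_build_log(program, device_list[0]);`.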
And my build command on the target was: gcc cl_example.c -lOpenCL (with the library after the source file, so the linker can resolve the OpenCL symbols).
ld complained that it could not find -lOpenCL, so I added a soft link to fix it: ln -s /usr/lib/libOpenCL.so.1 /usr/lib/libOpenCL.so
Let me know if this helps!
Thanks,
Erick