This thread has been locked.


AM5748: runtime issues with the conv1d OpenCL example

Part Number: AM5748

1. When the TI OpenCL conv1d example is kept running in a loop, Linux RAM usage keeps increasing.

QueuetoSubmit is sometimes as long as 60000 us.

Why?

QueuetoSubmit=6520 us, SubmittoStart=68163 us, StarttoEnd=353829 us, dspruntime_us=4115.766602 us,

2. I modified the example so that Linux RAM usage no longer keeps increasing,

but QueuetoSubmit still sometimes reaches 6000 us.

The code is:

main.cpp:

/******************************************************************************
* Copyright (c) 2017, Texas Instruments Incorporated - http://www.ti.com/
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of Texas Instruments Incorporated nor the
* names of its contributors may be used to endorse or promote products
* derived from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
* THE POSSIBILITY OF SUCH DAMAGE.
*****************************************************************************/
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <unistd.h>
#include <time.h>
#include <malloc.h>
#include <string.h>

#include "ocl_util.h"
#include "ti_kernels.dsp_h"

//#define TI_VERSION

struct mallinfo mi_1;

// input image is COLSxROWS, filtered output image is COLSxROWS
// while filtering is applied to each row in image
#define COLS 1920//1920/10 //1920 418
#define ROWS 1080//1080/10 //1080 8
#define COLORDEPTH 12

// 1-D size 5 Gaussian kernel, code is generic for non-symmetric kernel
#define FILTERSIZE 5
float FILTER[FILTERSIZE] = { 0.06136f, 0.24477f, 0.38774f, 0.24477f, 0.06136f };
#define EPISOLON 0.00001f

using namespace cl;
using namespace std;

static unsigned us_diff (struct timespec &t1, struct timespec &t2)
{ return (t2.tv_sec - t1.tv_sec) * 1e6 + (t2.tv_nsec - t1.tv_nsec) / 1e3; }

static int VerifyResults(CommandQueue &Q, Buffer &pOutput,
float *pGolden, int cols, int rows);

void input_init(float *pInput,int bufSize,int VALRANGE_);
void convfilte(float *pGolden,float *pInput,float *filtercoef,
int filtlen,int bufSize,int ROWS_, int COLS_ );

typedef struct timeout_
{
cl_ulong QueuetoSubmit_;
cl_ulong SubmittoStart_;
cl_ulong StarttoEnd_;
cl_ulong timetotal_us_;
cl_ulong timeout_us_;
cl_ulong timeout_us_max_;

}timeout_str;

void ocl_event_times_tst_init( timeout_str *timeout_obj,cl_ulong timeout_us);

void ocl_event_times_tst_1(const cl::Event &ev, const char* name, timeout_str *timeout_obj);
void log_dsp(const char *file_name,const char *kernelName,int *timeoutcnt, timeout_str timeout_obj,int timeus_linux,float dspruntime_us,struct mallinfo mi_1);
void dspruntime(float *dspruntime_us,CommandQueue &Q, Buffer &bufdspruntime_us);

void bufWrite(float *src,CommandQueue &Q, Buffer &bufDst,int bufsize);
void bufWrite_end(float *src,CommandQueue &Q, Buffer &bufDst,int bufsize);
void bufRead(float *dest,CommandQueue &Q, Buffer &bufSrc,int bufsize);

cl_ulong timeout_us=0;

#define RETURN(x) return x
int ModyFun(int argc ,char *argv[]);
int TI_tst(int argc, char *argv[]);
int main(int argc, char *argv[])
{
#ifdef TI_VERSION // *///////////////////////////////
return TI_tst( argc, argv);
#else // *///////////////////////////////
return ModyFun( argc ,argv);

#endif
}

static int VerifyResults(CommandQueue &Q, Buffer &bufOutput,
float *pGolden, int cols, int rows)
{
float *pOutput = (float *) Q.enqueueMapBuffer(bufOutput, CL_TRUE,
CL_MAP_READ | CL_MAP_WRITE,
0, cols*rows*sizeof(float));
int num_errors = 0;
for (int i = 0; i < cols * rows; i++)
{
if ((pOutput[i] - pGolden[i] > EPISOLON) ||
(pGolden[i] - pOutput[i] > EPISOLON))
if (num_errors++ < 10)
printf("Result diff at %d: expect %f, got %f\n", i, pGolden[i],
pOutput[i]);
}
if (num_errors != 0) printf("Total %d errors\n", num_errors);

// reset output buffer for next test
Event ev1;
memset(pOutput, 0, cols*rows*sizeof(float));
Q.enqueueUnmapMemObject(bufOutput, pOutput, NULL, &ev1);
ev1.wait();

return num_errors;
}
void input_init(float *pInput,int bufSize,int VALRANGE_)
{
for (int i = 0; i < bufSize; i++)
pInput[i] = (float) (rand() % VALRANGE_);
}
void convfilte(float *pGolden,float *pInput,float *filtercoef,int filtlen,int bufSize,int ROWS_, int COLS_ )
{
// Use the COLS_ parameter, not the global COLS macro, so the function
// works for any geometry (the two happen to match in this test).
for (int j = 0; j < ROWS_; j++)
for (int i = 0; i < COLS_; i++)
{
float left_2 = pInput[j * COLS_ + (i-2 < 0 ? 0 : i-2)];
float left_1 = pInput[j * COLS_ + (i-1 < 0 ? 0 : i-1)];
float self = pInput[j * COLS_ + i];
float right_1 = pInput[j * COLS_ + (i+1 >= COLS_ ? COLS_-1 : i+1)];
float right_2 = pInput[j * COLS_ + (i+2 >= COLS_ ? COLS_-1 : i+2)];

pGolden[j*COLS_ + i] = left_2 * filtercoef[0] + left_1 * filtercoef[1]
+ self * filtercoef[2]
+ right_1 * filtercoef[3] + right_2 * filtercoef[4];
}
}
void ocl_event_times_tst_1(const cl::Event &ev, const char* name, timeout_str *timeout_obj)
{
cl_ulong t_que, t_sub, t_strt, t_end;

ev.getProfilingInfo(CL_PROFILING_COMMAND_QUEUED, &t_que);
ev.getProfilingInfo(CL_PROFILING_COMMAND_SUBMIT, &t_sub);
ev.getProfilingInfo(CL_PROFILING_COMMAND_START, &t_strt);
ev.getProfilingInfo(CL_PROFILING_COMMAND_END, &t_end);

/*----------------------------------------------------------------------
* Normalize the time to microseconds
*--------------------------------------------------------------------*/
t_que /= 1000; t_sub /= 1000; t_strt /= 1000; t_end /= 1000;

cl_ulong QueuetoSubmit=t_sub-t_que;
cl_ulong SubmittoStart=t_strt-t_sub;
cl_ulong StarttoEnd=t_end-t_strt;
cl_ulong timeout_us=timeout_obj->timeout_us_;
cl_ulong timetotal_us=QueuetoSubmit+SubmittoStart+ StarttoEnd;

timeout_obj->QueuetoSubmit_= t_sub-t_que;
timeout_obj->SubmittoStart_= SubmittoStart;
timeout_obj->StarttoEnd_= StarttoEnd;
timeout_obj->timetotal_us_= timetotal_us;
if(timeout_obj->timetotal_us_ > timeout_obj->timeout_us_max_)
timeout_obj->timeout_us_max_=timeout_obj->timetotal_us_;


if (!name) name = "";
if(timetotal_us>=timeout_us)
{
printf("++++ocl_event_times_tst %s : ",name);
printf("Queue to Submit: %lld us\n",t_sub-t_que);
printf("++++ocl_event_times_tst %s : ",name);
printf("Submit to Start : %lld us\n",t_strt-t_sub);
printf("++++ocl_event_times_tst %s : ",name);
printf("Start to End : %lld us\n",t_end-t_strt);

// printf("++++ocl_event_times_tst %s : Start to End : %d us\n",name,t_end-t_strt);


// cout<< name << " : Queue to Submit: " << t_sub-t_que << " us" << endl;
// cout<< name << " : Submit to Start : " << t_strt-t_sub << " us" << endl;
// cout<< name << " : Start to End : " << t_end-t_strt << " us" << endl;
// cout<< endl;
}
}

void ocl_event_times_tst_init( timeout_str *timeout_obj,cl_ulong timeout_us)
{
timeout_obj->QueuetoSubmit_=0;
timeout_obj->SubmittoStart_=0;
timeout_obj->StarttoEnd_=0;
timeout_obj->timetotal_us_=0;
timeout_obj->timeout_us_=timeout_us;
timeout_obj->timeout_us_max_=0;
}
void log_dsp(const char *file_name,const char *kernelName,int *timeoutcnt, timeout_str timeout_obj,int timeus_linux,float dspruntime_us,struct mallinfo mi_1)
{
//char *file_name="file_dsplog_conv1d_tst.txt";
if(timeout_obj.timeout_us_max_>timeout_obj.timeout_us_)
{
FILE *file_dsplog;
// file_dsplog=fopen(file_name,"w"); // /home/root/dsp_test/
file_dsplog=fopen(file_name,"a+");

// fprintf(file_dsplog,"************************\n");
// fprintf(file_dsplog,"************************\n");
// fprintf(file_dsplog,"DSP running log:\n");

timeoutcnt[0]+=1;
fprintf(file_dsplog,"kernelName=%s,",kernelName);

fprintf(file_dsplog,"QueuetoSubmit=%lld us, ",timeout_obj.QueuetoSubmit_);
fprintf(file_dsplog,"SubmittoStart=%lld us, ",timeout_obj.SubmittoStart_);
fprintf(file_dsplog,"StarttoEnd=%lld us, ",timeout_obj.StarttoEnd_);
fprintf(file_dsplog,"dspruntime_us=%f us, ",dspruntime_us);

fprintf(file_dsplog,"max_time_out=%lld us, ",timeout_obj.timeout_us_max_);
fprintf(file_dsplog,"timeout_us=%lld us, ",timeout_obj.timeout_us_);
fprintf(file_dsplog,"timeus_linux=%d us, ",timeus_linux);

fprintf(file_dsplog," cnt_time_out=%d, ",timeoutcnt[0]);

//mi_1 = mallinfo();
fprintf(file_dsplog," mallinfo=%d, ",mi_1.uordblks);
// printf("\n mallinfo :%d \n",mi_1.uordblks); //print memory overflow
fprintf(file_dsplog,"\n");

fclose(file_dsplog);
}



}

void dspruntime(float *dspruntime_us,CommandQueue &Q, Buffer &bufdspruntime_us)
{
#if 0
float *pdspruntime_us = (float *) Q.enqueueMapBuffer(bufdspruntime_us, CL_TRUE,
CL_MAP_READ | CL_MAP_WRITE,
0, 1*sizeof(float));
*dspruntime_us=*pdspruntime_us;
// //printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo());

Event ev;
//memset(&dspruntime_us, 0, 1*sizeof(float));
Q.enqueueUnmapMemObject(bufdspruntime_us, pdspruntime_us, NULL, &ev);
ev.wait();
#else
bufRead(dspruntime_us,Q, bufdspruntime_us,1*sizeof(float));
#endif

}

void bufWrite(float *src,CommandQueue &Q, Buffer &bufDst,int bufsize)
{
// BUG: src is passed by value, so the mapped pointer assigned below never
// reaches the caller; the parameter would need to be float** (or float*&).
// This is why the "#else //error" init path below does not work.
src = (float *) Q.enqueueMapBuffer(bufDst, CL_TRUE,
CL_MAP_WRITE, 0, bufsize);
printf("++++bufWrite fun\n");
}
void bufWrite_end(float *src,CommandQueue &Q, Buffer &bufDst,int bufsize)
{
Q.enqueueUnmapMemObject(bufDst, src);
Q.finish();
printf("++++bufWrite_end fun\n");
}
void bufRead(float *dest,CommandQueue &Q, Buffer &bufSrc,int bufsize)
{
float *dest_t = (float *) Q.enqueueMapBuffer(bufSrc, CL_TRUE,
CL_MAP_READ | CL_MAP_WRITE,
0, bufsize);
for (int i = 0; i < bufsize/(int)sizeof(float); i++)
dest[i] = dest_t[i];

Event ev;
Q.enqueueUnmapMemObject(bufSrc, dest_t, NULL, &ev);
ev.wait();
// Do not call ev.~Wrapper() explicitly here: ev is destroyed again when it
// goes out of scope, so the underlying cl_event would be released twice.
Q.finish();
}

int ModyFun(int argc ,char *argv[])
{
int timeoutcnt=0;
struct timespec t0, t1;
int num_errors = 0;
int input_numcompunits = 0;
if (argc > 1) input_numcompunits = atoi(argv[1]); // valid: 1, 2, 4, 8

timeout_us=18000;
if(argc > 2) timeout_us = atoi(argv[2]);

printf("+++timeout_us=%lld us\n",timeout_us);
Context context (CL_DEVICE_TYPE_ACCELERATOR);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
CommandQueue Q(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
int NUMCOMPUNITS;
cl_ulong LOCALMEMSIZE;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS, &NUMCOMPUNITS);
devices[0].getInfo(CL_DEVICE_LOCAL_MEM_SIZE, &LOCALMEMSIZE);
printf("LOCALMEMSIZE=%llu\n",(unsigned long long)LOCALMEMSIZE);
printf("NUMCOMPUNITS=%d\n",NUMCOMPUNITS);
if (input_numcompunits > 0 && NUMCOMPUNITS > input_numcompunits)
NUMCOMPUNITS = input_numcompunits;

printf("NUMCOMPUNITS=%d\n",NUMCOMPUNITS);


Program::Binaries binary(1, make_pair(ti_kernels_dsp_bin,
sizeof(ti_kernels_dsp_bin)));
Program program = Program(context, devices, binary);
program.build(devices);

float dspruntime_us=0;
int bufSize = COLS * ROWS * sizeof(float);
Buffer bufInput( context, CL_MEM_READ_WRITE, bufSize); //CL_MEM_READ_ONLY
Buffer bufOutput(context, CL_MEM_READ_WRITE, bufSize); //CL_MEM_WRITE_ONLY
Buffer bufFilter(context, CL_MEM_READ_ONLY, FILTERSIZE * sizeof(float));
Buffer bufdspruntime_us(context, CL_MEM_READ_WRITE, 1 * sizeof(float));

int BLOCK_HEIGHT = LOCALMEMSIZE / (2 * 2 * COLS * sizeof(float));
printf("BLOCK_HEIGHT=%d\n",BLOCK_HEIGHT);
// Prepare testing data, compute golden output on host


float *pGolden = (float *) malloc(bufSize);

if (pGolden == NULL)
{
printf("Failed to allocate memory for golden results\n");
exit(0);
}
#if 1
const char *file_name="file_dsplog_conv1d_tst.txt";
FILE *file_dsplog;
// file_dsplog=fopen(file_name,"w"); // /home/root/dsp_test/
file_dsplog=fopen(file_name,"a+");

fprintf(file_dsplog,"************************\n");
fprintf(file_dsplog,"************************\n");
fprintf(file_dsplog,"DSP running log:\n");
fprintf(file_dsplog,"NUMCOMPUNITS=%d,",NUMCOMPUNITS);
fprintf(file_dsplog,"timeout_us=%lld us, ",timeout_us);
fprintf(file_dsplog,"COLS=%d ,ROWS=%d,\n",COLS,ROWS);

fclose(file_dsplog);
#endif
////////////////////// pInput assignment, host pGolden convolution reference: start //////////////////////////
float *pInput;
#if 1
pInput = (float *) Q.enqueueMapBuffer(bufInput, CL_TRUE,
CL_MAP_WRITE, 0, bufSize);
int VALRANGE = (1 << COLORDEPTH);
input_init(pInput, COLS * ROWS, VALRANGE);


srand(time(NULL));
clock_gettime(CLOCK_MONOTONIC, &t0);
convfilte(pGolden,pInput,FILTER, FILTERSIZE, COLS * ROWS, ROWS, COLS );

clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (host_compute): %d usecs\n", us_diff(t0, t1));
Q.enqueueUnmapMemObject(bufInput, pInput);
Q.finish();
#else //error
//float *pInput;
printf("++++bufWrite fun used\n");
bufWrite(pInput,Q, bufInput, bufSize);
printf("++++bufWrite fun used over\n");
int VALRANGE = (1 << COLORDEPTH);
input_init(pInput, COLS * ROWS, VALRANGE);


srand(time(NULL));
clock_gettime(CLOCK_MONOTONIC, &t0);
convfilte(pGolden,pInput,FILTER, FILTERSIZE, COLS * ROWS, ROWS, COLS );

clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (host_compute): %d usecs\n", us_diff(t0, t1));

bufWrite_end(pInput,Q, bufInput, bufSize);
#endif
// free(pInput);
/////////////////////// pInput assignment, host pGolden convolution reference: end ////////////////////////

Q.enqueueWriteBuffer(bufFilter, CL_TRUE, 0, FILTERSIZE * sizeof(float),
FILTER);

///////////////////////////////
Kernel kernel(program, "null");

Kernel k_base(program, "k_conv1d_5x1");
k_base.setArg(0, bufInput);
k_base.setArg(1, bufOutput);
k_base.setArg(2, bufFilter);
k_base.setArg(3, COLS);
k_base.setArg(4, ROWS);
k_base.setArg(5, COLS);
k_base.setArg(6, COLS);
k_base.setArg(7, bufdspruntime_us);

Kernel k_loop(program, "k_loop");
k_loop.setArg(0, bufInput);
k_loop.setArg(1, bufOutput);
k_loop.setArg(2, bufFilter);
k_loop.setArg(3, COLS);
k_loop.setArg(4, ROWS);
k_loop.setArg(5, COLS);
k_loop.setArg(6, COLS);
k_loop.setArg(7, bufdspruntime_us);

Kernel k_loop_simd(program, "k_loop_simd");
k_loop_simd.setArg(0, bufInput);
k_loop_simd.setArg(1, bufOutput);
k_loop_simd.setArg(2, bufFilter);
k_loop_simd.setArg(3, COLS);
k_loop_simd.setArg(4, ROWS);
k_loop_simd.setArg(5, COLS);
k_loop_simd.setArg(6, COLS);
k_loop_simd.setArg(7, bufdspruntime_us);


Kernel k_loop_db(program, "k_loop_db");
k_loop_db.setArg(0, bufInput);
k_loop_db.setArg(1, bufOutput);
k_loop_db.setArg(2, bufFilter);
k_loop_db.setArg(3, COLS);
k_loop_db.setArg(4, ROWS);
k_loop_db.setArg(5, COLS);
k_loop_db.setArg(6, COLS);
k_loop_db.setArg(7, BLOCK_HEIGHT);
k_loop_db.setArg(8, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_db.setArg(9, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_db.setArg(10, bufdspruntime_us);

Kernel k_loop_simd_db(program, "k_loop_simd_db");
k_loop_simd_db.setArg(0, bufInput);
k_loop_simd_db.setArg(1, bufOutput);
k_loop_simd_db.setArg(2, bufFilter);
k_loop_simd_db.setArg(3, COLS);
k_loop_simd_db.setArg(4, ROWS);
k_loop_simd_db.setArg(5, COLS);
k_loop_simd_db.setArg(6, COLS);
k_loop_simd_db.setArg(7, BLOCK_HEIGHT);
k_loop_simd_db.setArg(8, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_simd_db.setArg(9, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_simd_db.setArg(10, bufdspruntime_us);

Kernel k_loop_simd_db_extc(program, "k_loop_simd_db_extc");
k_loop_simd_db_extc.setArg(0, bufInput);
k_loop_simd_db_extc.setArg(1, bufOutput);
k_loop_simd_db_extc.setArg(2, bufFilter);
k_loop_simd_db_extc.setArg(3, COLS);
k_loop_simd_db_extc.setArg(4, ROWS);
k_loop_simd_db_extc.setArg(5, COLS);
k_loop_simd_db_extc.setArg(6, COLS);
k_loop_simd_db_extc.setArg(7, BLOCK_HEIGHT);
k_loop_simd_db_extc.setArg(8, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_simd_db_extc.setArg(9, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_simd_db_extc.setArg(10, bufdspruntime_us);


try

{
while(1)
{
// The OpenCL runtime will lazily load the device program upon the first
// enqueue of a kernel from the program, so the elapsed time overall from
// the first enqueue will be longer to account for the loading of the
// program. To remove program loading overhead from kernel performance,
// enqueue a null kernel before running other kernels.

KernelFunctor null = kernel.bind(Q, NDRange(1), NDRange(1));
clock_gettime(CLOCK_MONOTONIC, &t0);
null().wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (loading program): %d usecs\n", us_diff(t0, t1));


///////////////////////k_conv1d_5x1 start //////////////////
Event ev;
// straight-forward kernel (unoptimized, baseline)
timeout_str timeout_k_base;
ocl_event_times_tst_init( &timeout_k_base, timeout_us);

clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_base, NullRange, NDRange(COLS, ROWS),
NDRange(COLS, ROWS/NUMCOMPUNITS), NULL, &ev);
ev.wait();

ocl_event_times_tst_1(ev, "k_base " , &timeout_k_base);
clock_gettime(CLOCK_MONOTONIC, &t1);
dspruntime(&dspruntime_us,Q,bufdspruntime_us);
//printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo())
mi_1 = mallinfo();
log_dsp(file_name, "k_base ",&timeoutcnt, timeout_k_base,us_diff(t0, t1),dspruntime_us,mallinfo());
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (k_baseline): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);
if(us_diff(t0, t1)>timeout_us)
{
mi_1 = mallinfo();
printf("\n mallinfo :%d \n",mi_1.uordblks); //print memory overflow

}
ev.~Wrapper();
///////////////k_conv1d_5x1 end ////////////
///////////////k_loop end start////////////
Event ev1;
timeout_str timeout_k_loop;
ocl_event_times_tst_init( &timeout_k_loop, timeout_us);

clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop, NullRange, NDRange(ROWS),
NDRange(ROWS/NUMCOMPUNITS), NULL, &ev1);
ev1.wait();


ocl_event_times_tst_1(ev1, "k_loop " , &timeout_k_loop);
clock_gettime(CLOCK_MONOTONIC, &t1);
dspruntime(&dspruntime_us,Q,bufdspruntime_us);
//printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo());
log_dsp(file_name, "k_loop ",&timeoutcnt, timeout_k_loop,us_diff(t0, t1),dspruntime_us,mallinfo());
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (k_loop): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);
ev1.~Wrapper();
///////////////k_loop end end////////////
///////////////k_loop_simd start////////////
Event ev2;
timeout_str timeout_k_loop_simd;
ocl_event_times_tst_init( &timeout_k_loop_simd, timeout_us);

clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_simd, NullRange, NDRange(ROWS),
NDRange(ROWS/NUMCOMPUNITS), NULL, &ev2);
ev2.wait();


ocl_event_times_tst_1(ev2, "k_loop_simd " , &timeout_k_loop_simd);



clock_gettime(CLOCK_MONOTONIC, &t1);
dspruntime(&dspruntime_us,Q,bufdspruntime_us);
log_dsp(file_name, "k_loop_simd ",&timeoutcnt, timeout_k_loop_simd,us_diff(t0, t1),dspruntime_us,mallinfo());
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (k_loop_simd): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);
ev2.~Wrapper();
//printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo());
///////////////k_loop_simd end ////////////


///////////////pipeline, k_loop_db,k_loop_simd_db,k_loop_simd_db_extc end start////////////

// set the double buffer pipeline to at least 8 times
if (BLOCK_HEIGHT > (ROWS / NUMCOMPUNITS / 8))
BLOCK_HEIGHT = (ROWS / NUMCOMPUNITS / 8) + 1;

if (BLOCK_HEIGHT > 0)
{
Event ev3;
timeout_str timeout_k_loop_db;
ocl_event_times_tst_init( &timeout_k_loop_db, timeout_us);

clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_db, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev3);
ev3.wait();

ocl_event_times_tst_1(ev3, "k_loop_db " , &timeout_k_loop_db);


clock_gettime(CLOCK_MONOTONIC, &t1);
dspruntime(&dspruntime_us,Q,bufdspruntime_us);
//printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo());
log_dsp(file_name, "k_loop_db ",&timeoutcnt, timeout_k_loop_db,us_diff(t0, t1),dspruntime_us,mallinfo());
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (k_loop_db): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);
ev3.~Wrapper();
//////////////////////////////////////////
Event ev4;
timeout_str timeout_k_loop_simd_db;
ocl_event_times_tst_init( &timeout_k_loop_simd_db, timeout_us);
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_simd_db, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev4);
ev4.wait();

ocl_event_times_tst_1(ev4, "k_loop_simd_db " , &timeout_k_loop_simd_db);


clock_gettime(CLOCK_MONOTONIC, &t1);
dspruntime(&dspruntime_us,Q,bufdspruntime_us);
//printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo());
log_dsp(file_name, "k_loop_simd_db ",&timeoutcnt, timeout_k_loop_simd_db,us_diff(t0, t1),dspruntime_us,mallinfo());
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (k_loop_simd_db): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);
ev4.~Wrapper();
//////////////////////////////////////////
Event ev5;
timeout_str timeout_k_loop_simd_db_extc;
ocl_event_times_tst_init( &timeout_k_loop_simd_db_extc, timeout_us);
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_simd_db_extc, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev5);
ev5.wait();

ocl_event_times_tst_1(ev5, "k_loop_simd_db_extc " , &timeout_k_loop_simd_db_extc);


clock_gettime(CLOCK_MONOTONIC, &t1);
dspruntime(&dspruntime_us,Q,bufdspruntime_us);
//printf("++++++dspruntime_us=%f us \n",dspruntime_us,mallinfo());
log_dsp(file_name, "k_loop_simd_db_extc ",&timeoutcnt, timeout_k_loop_simd_db_extc,us_diff(t0, t1),dspruntime_us,mallinfo());
if(us_diff(t0, t1)>timeout_us)
printf("Elapsed (k_loop_simd_db_extc): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);

ev5.~Wrapper();


}
//ev.release();
//ev.Event;
//ev.~Wrapper();

}

}

catch (Error err)
{
cerr << "ERROR: " << err.what() << "(" << err.err() << ", "
<< ocl_decode_error(err.err()) << ")" << endl;
}

free(pGolden); // release the golden buffer before returning

if (num_errors != 0)
{
cout << "Failed with " << num_errors << " errors" << endl;
RETURN(-1);
} else
{
cout << "Pass!" << endl;
RETURN(0);
}

}
int TI_tst(int argc, char *argv[]){
struct timespec t0, t1;
int num_errors = 0;
int input_numcompunits = 0;
if (argc > 1) input_numcompunits = atoi(argv[1]); // valid: 1, 2, 4, 8


Context context (CL_DEVICE_TYPE_ACCELERATOR);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
CommandQueue Q(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
int NUMCOMPUNITS;
cl_ulong LOCALMEMSIZE;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS, &NUMCOMPUNITS);
devices[0].getInfo(CL_DEVICE_LOCAL_MEM_SIZE, &LOCALMEMSIZE);
if (input_numcompunits > 0 && NUMCOMPUNITS > input_numcompunits)
NUMCOMPUNITS = input_numcompunits;

printf("++NUMCOMPUNITS=%d,\n",NUMCOMPUNITS);
Program::Binaries binary(1, make_pair(ti_kernels_dsp_bin,
sizeof(ti_kernels_dsp_bin)));
Program program = Program(context, devices, binary);
program.build(devices);

// The OpenCL runtime will lazily load the device program upon the first
// enqueue of a kernel from the program, so the elapsed time overall from
// the first enqueue will be longer to account for the loading of the
// program. To remove program loading overhead from kernel performance,
// enqueue a null kernel before running other kernels.
Kernel kernel(program, "null");
KernelFunctor null = kernel.bind(Q, NDRange(1), NDRange(1));
clock_gettime(CLOCK_MONOTONIC, &t0);
null().wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (loading program): %d usecs\n", us_diff(t0, t1));

// Prepare testing data, compute golden output on host
int bufSize = COLS * ROWS * sizeof(float);
Buffer bufInput( context, CL_MEM_READ_ONLY, bufSize);
Buffer bufOutput(context, CL_MEM_WRITE_ONLY, bufSize);
Buffer bufFilter(context, CL_MEM_READ_ONLY, FILTERSIZE * sizeof(float));
#ifndef _TI_RTOS
float *pGolden = (float *) malloc(bufSize);
#else
float *pGolden = (float *) __malloc_ddr(bufSize);
#endif
if (pGolden == NULL)
{
printf("Failed to allocate memory for golden results\n");
exit(0);
}

float *pInput = (float *) Q.enqueueMapBuffer(bufInput, CL_TRUE,
CL_MAP_WRITE, 0, bufSize);
int VALRANGE = (1 << COLORDEPTH);
for (int i = 0; i < COLS * ROWS; i++)
pInput[i] = (float) (rand() % VALRANGE);
srand(time(NULL));
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int j = 0; j < ROWS; j++)
for (int i = 0; i < COLS; i++)
{
float left_2 = pInput[j * COLS + (i-2 < 0 ? 0 : i-2)];
float left_1 = pInput[j * COLS + (i-1 < 0 ? 0 : i-1)];
float self = pInput[j * COLS + i];
float right_1 = pInput[j * COLS + (i+1 >= COLS ? COLS-1 : i+1)];
float right_2 = pInput[j * COLS + (i+2 >= COLS ? COLS-1 : i+2)];

pGolden[j*COLS + i] = left_2 * FILTER[0] + left_1 * FILTER[1]
+ self * FILTER[2]
+ right_1 * FILTER[3] + right_2 * FILTER[4];
}
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (host_compute): %d usecs\n", us_diff(t0, t1));
Q.enqueueUnmapMemObject(bufInput, pInput);
Q.finish();
Q.enqueueWriteBuffer(bufFilter, CL_TRUE, 0, FILTERSIZE * sizeof(float),
FILTER);

try
{
while(1){
Event ev;
// straight-forward kernel (unoptimized, baseline)
Kernel k_base(program, "k_conv1d_5x1");
k_base.setArg(0, bufInput);
k_base.setArg(1, bufOutput);
k_base.setArg(2, bufFilter);
k_base.setArg(3, COLS);
k_base.setArg(4, ROWS);
k_base.setArg(5, COLS);
k_base.setArg(6, COLS);
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_base, NullRange, NDRange(COLS, ROWS),
NDRange(COLS, ROWS/NUMCOMPUNITS), NULL, &ev);
ev.wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (k_baseline): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);

Kernel k_loop(program, "k_loop");
k_loop.setArg(0, bufInput);
k_loop.setArg(1, bufOutput);
k_loop.setArg(2, bufFilter);
k_loop.setArg(3, COLS);
k_loop.setArg(4, ROWS);
k_loop.setArg(5, COLS);
k_loop.setArg(6, COLS);
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop, NullRange, NDRange(ROWS),
NDRange(ROWS/NUMCOMPUNITS), NULL, &ev);
ev.wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (k_loop): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);

Kernel k_loop_simd(program, "k_loop_simd");
k_loop_simd.setArg(0, bufInput);
k_loop_simd.setArg(1, bufOutput);
k_loop_simd.setArg(2, bufFilter);
k_loop_simd.setArg(3, COLS);
k_loop_simd.setArg(4, ROWS);
k_loop_simd.setArg(5, COLS);
k_loop_simd.setArg(6, COLS);
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_simd, NullRange, NDRange(ROWS),
NDRange(ROWS/NUMCOMPUNITS), NULL, &ev);
ev.wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (k_loop_simd): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);

int BLOCK_HEIGHT = LOCALMEMSIZE / (2 * 2 * COLS * sizeof(float));
// set the double buffer pipeline to at least 8 times
if (BLOCK_HEIGHT > (ROWS / NUMCOMPUNITS / 8))
BLOCK_HEIGHT = (ROWS / NUMCOMPUNITS / 8) + 1;
if (BLOCK_HEIGHT > 0)
{
Kernel k_loop_db(program, "k_loop_db");
k_loop_db.setArg(0, bufInput);
k_loop_db.setArg(1, bufOutput);
k_loop_db.setArg(2, bufFilter);
k_loop_db.setArg(3, COLS);
k_loop_db.setArg(4, ROWS);
k_loop_db.setArg(5, COLS);
k_loop_db.setArg(6, COLS);
k_loop_db.setArg(7, BLOCK_HEIGHT);
k_loop_db.setArg(8, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_db.setArg(9, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_db, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev);
ev.wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (k_loop_db): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);

Kernel k_loop_simd_db(program, "k_loop_simd_db");
k_loop_simd_db.setArg(0, bufInput);
k_loop_simd_db.setArg(1, bufOutput);
k_loop_simd_db.setArg(2, bufFilter);
k_loop_simd_db.setArg(3, COLS);
k_loop_simd_db.setArg(4, ROWS);
k_loop_simd_db.setArg(5, COLS);
k_loop_simd_db.setArg(6, COLS);
k_loop_simd_db.setArg(7, BLOCK_HEIGHT);
k_loop_simd_db.setArg(8, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_simd_db.setArg(9, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_simd_db, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev);
ev.wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (k_loop_simd_db): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);

Kernel k_loop_simd_db_extc(program, "k_loop_simd_db_extc");
k_loop_simd_db_extc.setArg(0, bufInput);
k_loop_simd_db_extc.setArg(1, bufOutput);
k_loop_simd_db_extc.setArg(2, bufFilter);
k_loop_simd_db_extc.setArg(3, COLS);
k_loop_simd_db_extc.setArg(4, ROWS);
k_loop_simd_db_extc.setArg(5, COLS);
k_loop_simd_db_extc.setArg(6, COLS);
k_loop_simd_db_extc.setArg(7, BLOCK_HEIGHT);
k_loop_simd_db_extc.setArg(8, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
k_loop_simd_db_extc.setArg(9, __local(BLOCK_HEIGHT*2*COLS*sizeof(float)));
clock_gettime(CLOCK_MONOTONIC, &t0);
Q.enqueueNDRangeKernel(k_loop_simd_db_extc, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev);
ev.wait();
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("Elapsed (k_loop_simd_db_extc): %d usecs\n", us_diff(t0, t1));
num_errors += VerifyResults(Q, bufOutput, pGolden, COLS, ROWS);
}

mi_1 = mallinfo();
printf("\n mallinfo :%d \n",mi_1.uordblks); //print memory overflow

}

#ifndef _TI_RTOS
free(pGolden);
#else
__free_ddr(pGolden);
#endif
}
catch (Error err)
{
cerr << "ERROR: " << err.what() << "(" << err.err() << ", "
<< ocl_decode_error(err.err()) << ")" << endl;
}

if (num_errors != 0)
{
cout << "Failed with " << num_errors << " errors" << endl;
RETURN(-1);
} else
{
cout << "Pass!" << endl;
RETURN(0);
}
}

/*
int bufSize = COLS * ROWS * sizeof(float);
Buffer bufInput( context, CL_MEM_READ_ONLY, bufSize);
Buffer bufOutput(context, CL_MEM_WRITE_ONLY, bufSize);
Buffer bufFilter(context, CL_MEM_READ_ONLY, FILTERSIZE * sizeof(float));

float *pInput = (float *) Q.enqueueMapBuffer(bufInput, CL_TRUE,
CL_MAP_WRITE, 0, bufSize);
for (int i = 0; i < COLS * ROWS; i++)
pInput[i] = (float) (rand() % VALRANGE);

Q.enqueueUnmapMemObject(bufInput, pInput);
Q.finish();

float *pOutput = (float *) Q.enqueueMapBuffer(bufOutput, CL_TRUE,
CL_MAP_READ | CL_MAP_WRITE,
0, cols*rows*sizeof(float));
Event ev;
memset(pOutput, 0, cols*rows*sizeof(float));
Q.enqueueUnmapMemObject(bufOutput, pOutput, NULL, &ev);
ev.wait();
*/

/*

Q.enqueueNDRangeKernel(k_base, NullRange,
NDRange(COLS, ROWS), NDRange(COLS, ROWS/NUMCOMPUNITS), NULL, &ev);
__kernel void k_conv1d_5x1(),

Q.enqueueNDRangeKernel(k_loop, NullRange, NDRange(ROWS),
NDRange(ROWS/NUMCOMPUNITS), NULL, &ev);
__kernel void k_loop(),

Q.enqueueNDRangeKernel(k_loop_simd, NullRange, NDRange(ROWS),
NDRange(ROWS/NUMCOMPUNITS), NULL, &ev);
__kernel void k_loop_simd(),

Q.enqueueNDRangeKernel(k_loop_db, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev);
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_db();

Q.enqueueNDRangeKernel(k_loop_simd_db, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev);
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db(),

Q.enqueueNDRangeKernel(k_loop_simd_db_extc, NullRange,
NDRange(NUMCOMPUNITS), NDRange(1), NULL, &ev);
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db_extc()
*/


/*

__kernel void null() { }


// 1D convolution applied to each row of an image, filter size 5x1
//k_base
__kernel void k_conv1d_5x1(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch
)
{
int col = get_global_id(0);
int row = get_global_id(1);

float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

// if symmetric filter, 2 multiplications can be optimized away
output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}

//k_loop
__kernel void k_loop(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch
)
{
int row = get_global_id(0);
int col;

// no boundary checks: col: [2, COLS-3]
for (col = 2; col < COLS-2; col++)
{
float left_2 = input[row * inPitch + col-2];
float left_1 = input[row * inPitch + col-1];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + col+1];
float right_2 = input[row * inPitch + col+2];
output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int boundaries[4] = { 0, 1, COLS-2, COLS-1 };
for (int i = 0; i < 4; i++)
{
col = boundaries[i];
float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}
}

//k_loop_simd
__kernel void k_loop_simd(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch
)
{
int row = get_global_id(0);
int col;

// _nassert(input % 8 == 0);
// _nassert(output % 8 == 0);
// _nassert(inPitch % 2 == 0);
// _nassert(outPitch % 2 == 0);

#if 0
// first version: SIMDize the computation
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_left_2 = *((float2 *) &input[row * inPitch + col-2]);
float2 v2_self = *((float2 *) &input[row * inPitch + col]);
float2 v2_right_2 = *((float2 *) &input[row * inPitch + col+2]);

float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &output[row * outPitch + col]) =
(v2_left_2 + v2_right_2) * filter[0]
+ (v2_left_1 + v2_right_1) * filter[1]
+ v2_self * filter[2];
}
#else
// second version: pipeline the memory loads
float2 v2_left_2 = *((float2 *) &input[row * inPitch + 2-2]);
float2 v2_self = *((float2 *) &input[row * inPitch + 2]);
float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_right_2 = *((float2 *) &input[row * inPitch + col+2]);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &output[row * outPitch + col]) =
v2_left_2 * filter[0] + v2_left_1 * filter[1]
+ v2_self * filter[2]
+ v2_right_1 * filter[3] + v2_right_2 * filter[4];

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}
#endif

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
for (int i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}
}

//k_loop_db
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_db(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
)
{
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int b_row, block_height, next_b_row, next_block_height;
local float *in_buf0 = lInput;
local float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
local float *out_buf0 = lOutput;
local float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;
event_t ev_in0, ev_in1, ev_out0, ev_out1;

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
ev_in0 = async_work_group_copy(in_buf0, &input[row_begin * inPitch],
block_height * inPitch, 0);

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// prefetch next block
if (! last_block)
ev_in1 = async_work_group_copy(in_buf1, input + next_b_row * inPitch,
next_block_height * inPitch, 0);
// wait for prefetched block to finish
wait_group_events(1, &ev_in0);
ev_in0 = ev_in1;

// for each row in the block: compute
for (int row = 0; row < block_height; row++)
{
int col;
// no boundary checks: col: [2, COLS-3]
for (col = 2; col < COLS-2; col++)
{
float left_2 = in_buf0[row * inPitch + col-2];
float left_1 = in_buf0[row * inPitch + col-1];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + col+1];
float right_2 = in_buf0[row * inPitch + col+2];
out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int boundaries[4] = { 0, 1, COLS-2, COLS-1 };
for (int i = 0; i < 4; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// store block output back to output image
ev_out1 = async_work_group_copy(output + b_row * outPitch, out_buf0,
block_height * outPitch, 0);
// wait for previous store to finish
if (! first_block) wait_group_events(1, &ev_out0);
ev_out0 = ev_out1;

// swap buffers for next block
if (! last_block)
{
local float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
wait_group_events(1, &ev_out0);
}

//k_loop_simd_db
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
)
{
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int b_row, block_height, next_b_row, next_block_height;
local float *in_buf0 = lInput;
local float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
local float *out_buf0 = lOutput;
local float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;
event_t ev_in0, ev_in1, ev_out0, ev_out1;

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
ev_in0 = async_work_group_copy(in_buf0, &input[row_begin * inPitch],
block_height * inPitch, 0);

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// prefetch next block
if (! last_block)
ev_in1 = async_work_group_copy(in_buf1, input + next_b_row * inPitch,
next_block_height * inPitch, 0);
// wait for prefetched block to finish
wait_group_events(1, &ev_in0);
ev_in0 = ev_in1;

// for each row in the block: compute
for (int row = 0; row < block_height; row++)
{
int col;
float2 v2_left_2 = *((float2 *) &in_buf0[row * inPitch + 2-2]);
float2 v2_self = *((float2 *) &in_buf0[row * inPitch + 2]);
float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_right_2 = *((float2 *) &in_buf0[row * inPitch + col+2]);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &out_buf0[row * outPitch + col]) =
v2_left_2 * filter[0]
+ v2_left_1 * filter[1]
+ v2_self * filter[2]
+ v2_right_1 * filter[3]
+ v2_right_2 * filter[4];

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
for (int i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// store block output back to output image
ev_out1 = async_work_group_copy(output + b_row * outPitch, out_buf0,
block_height * outPitch, 0);
// wait for previous store to finish
if (! first_block) wait_group_events(1, &ev_out0);
ev_out0 = ev_out1;

// swap buffers for next block
if (! last_block)
{
local float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
wait_group_events(1, &ev_out0);
}


extern void c_loop_simd_db_extc(int row_begin, int row_end,
__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
);

//k_loop_simd_db_extc
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db_extc(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
)
{
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

c_loop_simd_db_extc(row_begin, row_end,
input, output, filter,
COLS, ROWS, inPitch, outPitch,
BLOCK_HEIGHT, lInput, lOutput);
}


*/

kernel.cl:

/******************************************************************************
* Copyright (c) 2017, Texas Instruments Incorporated - http://www.ti.com/
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of Texas Instruments Incorporated nor the
* names of its contributors may be used to endorse or promote products
* derived from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
* THE POSSIBILITY OF SUCH DAMAGE.
*****************************************************************************/

//#define TI_VERSION

#define testdsptime
//#define testdsptime_1
#ifdef testdsptime
typedef struct dspRuntime_
{
uint64_t start_time_;
uint64_t end_time_;
uint64_t overhead_;
uint64_t cyclecount_;
}dspRuntime;
#endif

#ifndef TI_VERSION
__kernel void null() { }

// 1D convolution applied to each row of an image, filter size 5x1
//k_base
__kernel void k_conv1d_5x1(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
__global float *restrict dspruntime_us
)
{

#ifdef testdsptime
/*
dspRuntime dspRuntimeObj={0,0,0,0};
init_portion_tst(&dspRuntimeObj);
entry_hook_tst(&dspRuntimeObj);
*/
#endif
int col = get_global_id(0);
int row = get_global_id(1);

float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

// if symmetric filter, 2 multiplications can be optimized away
output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
#ifdef testdsptime
float time_us=0;
/*
uint64_t Cyclecount=0;
exit_hook_tst(&time_us,&Cyclecount,&dspRuntimeObj);
//if(col==0)
// printf("++kernel time_us=%llf us \n",time_us);
//printf("++kernel The code section took: %lld CPU cycles\n", Cyclecount);
*/
dspruntime_us[0]=time_us;
#endif
}

//k_loop
__kernel void k_loop(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
__global float *restrict dspruntime_us
)
{
#ifdef testdsptime
/*
dspRuntime dspRuntimeObj={0,0,0,0};
init_portion_tst(&dspRuntimeObj);
entry_hook_tst(&dspRuntimeObj);
*/
#endif
int row = get_global_id(0);
int col;

// no boundary checks: col: [2, COLS-3]
for (col = 2; col < COLS-2; col++)
{
float left_2 = input[row * inPitch + col-2];
float left_1 = input[row * inPitch + col-1];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + col+1];
float right_2 = input[row * inPitch + col+2];
output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int boundaries[4] = { 0, 1, COLS-2, COLS-1 };
for (int i = 0; i < 4; i++)
{
col = boundaries[i];
float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}
#ifdef testdsptime
float time_us=0;
/*
uint64_t Cyclecount=0;
exit_hook_tst(&time_us,&Cyclecount,&dspRuntimeObj);

//printf("++kernel time_us=%llf us \n",time_us);
//printf("++kernel The code section took: %lld CPU cycles\n", Cyclecount);
*/
dspruntime_us[0]=time_us;
#endif
}

//k_loop_simd
__kernel void k_loop_simd(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
__global float *restrict dspruntime_us
)
{
#ifdef testdsptime
/*
dspRuntime dspRuntimeObj={0,0,0,0};
init_portion_tst(&dspRuntimeObj);
entry_hook_tst(&dspRuntimeObj);
*/
#endif
int row = get_global_id(0);
int col;

// _nassert(input % 8 == 0);
// _nassert(output % 8 == 0);
// _nassert(inPitch % 2 == 0);
// _nassert(outPitch % 2 == 0);

#if 0
// first version: SIMDize the computation
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_left_2 = *((float2 *) &input[row * inPitch + col-2]);
float2 v2_self = *((float2 *) &input[row * inPitch + col]);
float2 v2_right_2 = *((float2 *) &input[row * inPitch + col+2]);

float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &output[row * outPitch + col]) =
(v2_left_2 + v2_right_2) * filter[0]
+ (v2_left_1 + v2_right_1) * filter[1]
+ v2_self * filter[2];
}
#else
// second version: pipeline the memory loads
float2 v2_left_2 = *((float2 *) &input[row * inPitch + 2-2]);
float2 v2_self = *((float2 *) &input[row * inPitch + 2]);
float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_right_2 = *((float2 *) &input[row * inPitch + col+2]);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &output[row * outPitch + col]) =
v2_left_2 * filter[0] + v2_left_1 * filter[1]
+ v2_self * filter[2]
+ v2_right_1 * filter[3] + v2_right_2 * filter[4];

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}
#endif

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
for (int i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}

#ifdef testdsptime
float time_us=0;
/*
uint64_t Cyclecount=0;
exit_hook_tst(&time_us,&Cyclecount,&dspRuntimeObj);

//printf("++kernel time_us=%llf us \n",time_us);
//printf("++kernel The code section took: %lld CPU cycles\n", Cyclecount);
*/
dspruntime_us[0]=time_us;
#endif
}

//k_loop_db
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_db(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput, // double buffer
__global float *restrict dspruntime_us
)
{
#ifdef testdsptime
dspRuntime dspRuntimeObj={0,0,0,0};
init_portion_tst(&dspRuntimeObj);
entry_hook_tst(&dspRuntimeObj);
#endif
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int b_row, block_height, next_b_row, next_block_height;
local float *in_buf0 = lInput;
local float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
local float *out_buf0 = lOutput;
local float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;
event_t ev_in0, ev_in1, ev_out0, ev_out1;

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
ev_in0 = async_work_group_copy(in_buf0, &input[row_begin * inPitch],
block_height * inPitch, 0);

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// prefetch next block
if (! last_block)
ev_in1 = async_work_group_copy(in_buf1, input + next_b_row * inPitch,
next_block_height * inPitch, 0);
// wait for prefetched block to finish
wait_group_events(1, &ev_in0);
ev_in0 = ev_in1;

// for each row in the block: compute
for (int row = 0; row < block_height; row++)
{
int col;
// no boundary checks: col: [2, COLS-3]
for (col = 2; col < COLS-2; col++)
{
float left_2 = in_buf0[row * inPitch + col-2];
float left_1 = in_buf0[row * inPitch + col-1];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + col+1];
float right_2 = in_buf0[row * inPitch + col+2];
out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int boundaries[4] = { 0, 1, COLS-2, COLS-1 };
for (int i = 0; i < 4; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// store block output back to output image
ev_out1 = async_work_group_copy(output + b_row * outPitch, out_buf0,
block_height * outPitch, 0);
// wait for previous store to finish
if (! first_block) wait_group_events(1, &ev_out0);
ev_out0 = ev_out1;

// swap buffers for next block
if (! last_block)
{
local float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
wait_group_events(1, &ev_out0);

#ifdef testdsptime
float time_us=0;
uint64_t Cyclecount=0;
exit_hook_tst(&time_us,&Cyclecount,&dspRuntimeObj);

//printf("++kernel time_us=%llf us \n",time_us);
//printf("++kernel The code section took: %lld CPU cycles\n", Cyclecount);
dspruntime_us[0]=time_us;
#endif
}

//k_loop_simd_db
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput, // double buffer
__global float *restrict dspruntime_us
)
{
#ifdef testdsptime
dspRuntime dspRuntimeObj={0,0,0,0};
init_portion_tst(&dspRuntimeObj);
entry_hook_tst(&dspRuntimeObj);
#endif
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int b_row, block_height, next_b_row, next_block_height;
local float *in_buf0 = lInput;
local float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
local float *out_buf0 = lOutput;
local float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;
event_t ev_in0, ev_in1, ev_out0, ev_out1;

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
ev_in0 = async_work_group_copy(in_buf0, &input[row_begin * inPitch],
block_height * inPitch, 0);

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// prefetch next block
if (! last_block)
ev_in1 = async_work_group_copy(in_buf1, input + next_b_row * inPitch,
next_block_height * inPitch, 0);
// wait for prefetched block to finish
wait_group_events(1, &ev_in0);
ev_in0 = ev_in1;

// for each row in the block: compute
for (int row = 0; row < block_height; row++)
{
int col;
float2 v2_left_2 = *((float2 *) &in_buf0[row * inPitch + 2-2]);
float2 v2_self = *((float2 *) &in_buf0[row * inPitch + 2]);
float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_right_2 = *((float2 *) &in_buf0[row * inPitch + col+2]);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &out_buf0[row * outPitch + col]) =
v2_left_2 * filter[0]
+ v2_left_1 * filter[1]
+ v2_self * filter[2]
+ v2_right_1 * filter[3]
+ v2_right_2 * filter[4];

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
for (int i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// store block output back to output image
ev_out1 = async_work_group_copy(output + b_row * outPitch, out_buf0,
block_height * outPitch, 0);
// wait for previous store to finish
if (! first_block) wait_group_events(1, &ev_out0);
ev_out0 = ev_out1;

// swap buffers for next block
if (! last_block)
{
local float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
wait_group_events(1, &ev_out0);

#ifdef testdsptime
float time_us=0;
uint64_t Cyclecount=0;
exit_hook_tst(&time_us,&Cyclecount,&dspRuntimeObj);

//printf("++kernel time_us=%llf us \n",time_us);
//printf("++kernel The code section took: %lld CPU cycles\n", Cyclecount);
dspruntime_us[0]=time_us;
#endif
}


extern void c_loop_simd_db_extc(int row_begin, int row_end,
__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
);

//k_loop_simd_db_extc
__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db_extc(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput, // double buffer
__global float *restrict dspruntime_us
)
{
#ifdef testdsptime
dspRuntime dspRuntimeObj={0,0,0,0};
init_portion_tst(&dspRuntimeObj);
entry_hook_tst(&dspRuntimeObj);
#endif
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

c_loop_simd_db_extc(row_begin, row_end,
input, output, filter,
COLS, ROWS, inPitch, outPitch,
BLOCK_HEIGHT, lInput, lOutput);
#ifdef testdsptime
float time_us=0;
uint64_t Cyclecount=0;
exit_hook_tst(&time_us,&Cyclecount,&dspRuntimeObj);

//printf("++kernel time_us=%llf us \n",time_us);
//printf("++kernel The code section took: %lld CPU cycles\n", Cyclecount);
dspruntime_us[0]=time_us;
#endif
}

#else

__kernel void null() { }


// 1D convolution applied to each row of an image, filter size 5x1
__kernel void k_conv1d_5x1(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch
)
{
int col = get_global_id(0);
int row = get_global_id(1);

float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

// if symmetric filter, 2 multiplications can be optimized away
output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}


__kernel void k_loop(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch
)
{
int row = get_global_id(0);
int col;

// no boundary checks: col: [2, COLS-3]
for (col = 2; col < COLS-2; col++)
{
float left_2 = input[row * inPitch + col-2];
float left_1 = input[row * inPitch + col-1];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + col+1];
float right_2 = input[row * inPitch + col+2];
output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int boundaries[4] = { 0, 1, COLS-2, COLS-1 };
for (int i = 0; i < 4; i++)
{
col = boundaries[i];
float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}
}


__kernel void k_loop_simd(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch
)
{
int row = get_global_id(0);
int col;

// _nassert(input % 8 == 0);
// _nassert(output % 8 == 0);
// _nassert(inPitch % 2 == 0);
// _nassert(outPitch % 2 == 0);

#if 0
// first version: SIMDize the computation
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_left_2 = *((float2 *) &input[row * inPitch + col-2]);
float2 v2_self = *((float2 *) &input[row * inPitch + col]);
float2 v2_right_2 = *((float2 *) &input[row * inPitch + col+2]);

float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &output[row * outPitch + col]) =
(v2_left_2 + v2_right_2) * filter[0]
+ (v2_left_1 + v2_right_1) * filter[1]
+ v2_self * filter[2];
}
#else
// second version: pipeline the memory loads
float2 v2_left_2 = *((float2 *) &input[row * inPitch + 2-2]);
float2 v2_self = *((float2 *) &input[row * inPitch + 2]);
float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_right_2 = *((float2 *) &input[row * inPitch + col+2]);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &output[row * outPitch + col]) =
v2_left_2 * filter[0] + v2_left_1 * filter[1]
+ v2_self * filter[2]
+ v2_right_1 * filter[3] + v2_right_2 * filter[4];

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}
#endif

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
for (int i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = input[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = input[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = input[row * inPitch + col];
float right_1 = input[row * inPitch + (col+1 >= COLS ? COLS-1 : col+1)];
float right_2 = input[row * inPitch + (col+2 >= COLS ? COLS-1 : col+2)];

output[row * outPitch + col] = left_2 * filter[0] + left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3] + right_2 * filter[4];
}
}


__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_db(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
)
{
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int b_row, block_height, next_b_row, next_block_height;
local float *in_buf0 = lInput;
local float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
local float *out_buf0 = lOutput;
local float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;
event_t ev_in0, ev_in1, ev_out0, ev_out1;

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
ev_in0 = async_work_group_copy(in_buf0, &input[row_begin * inPitch],
block_height * inPitch, 0);

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// prefetch next block
if (! last_block)
ev_in1 = async_work_group_copy(in_buf1, input + next_b_row * inPitch,
next_block_height * inPitch, 0);
// wait for prefetched block to finish
wait_group_events(1, &ev_in0);
ev_in0 = ev_in1;

// for each row in the block: compute
for (int row = 0; row < block_height; row++)
{
int col;
// no boundary checks: col: [2, COLS-3]
for (col = 2; col < COLS-2; col++)
{
float left_2 = in_buf0[row * inPitch + col-2];
float left_1 = in_buf0[row * inPitch + col-1];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + col+1];
float right_2 = in_buf0[row * inPitch + col+2];
out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int boundaries[4] = { 0, 1, COLS-2, COLS-1 };
for (int i = 0; i < 4; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// store block output back to output image
ev_out1 = async_work_group_copy(output + b_row * outPitch, out_buf0,
block_height * outPitch, 0);
// wait for previous store to finish
if (! first_block) wait_group_events(1, &ev_out0);
ev_out0 = ev_out1;

// swap buffers for next block
if (! last_block)
{
local float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
wait_group_events(1, &ev_out0);
}


__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
)
{
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int b_row, block_height, next_b_row, next_block_height;
local float *in_buf0 = lInput;
local float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
local float *out_buf0 = lOutput;
local float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;
event_t ev_in0, ev_in1, ev_out0, ev_out1;

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
ev_in0 = async_work_group_copy(in_buf0, &input[row_begin * inPitch],
block_height * inPitch, 0);

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// prefetch next block
if (! last_block)
ev_in1 = async_work_group_copy(in_buf1, input + next_b_row * inPitch,
next_block_height * inPitch, 0);
// wait for prefetched block to finish
wait_group_events(1, &ev_in0);
ev_in0 = ev_in1;

// for each row in the block: compute
for (int row = 0; row < block_height; row++)
{
int col;
float2 v2_left_2 = *((float2 *) &in_buf0[row * inPitch + 2-2]);
float2 v2_self = *((float2 *) &in_buf0[row * inPitch + 2]);
float2 v2_left_1 = (float2) (v2_left_2.s1, v2_self.s0);
for (col = 2; col < COLS-2-1; col+=2)
{
float2 v2_right_2 = *((float2 *) &in_buf0[row * inPitch + col+2]);
float2 v2_right_1 = (float2) (v2_self.s1, v2_right_2.s0);

* ((float2 *) &out_buf0[row * outPitch + col]) =
v2_left_2 * filter[0]
+ v2_left_1 * filter[1]
+ v2_self * filter[2]
+ v2_right_1 * filter[3]
+ v2_right_2 * filter[4];

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
for (int i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// store block output back to output image
ev_out1 = async_work_group_copy(output + b_row * outPitch, out_buf0,
block_height * outPitch, 0);
// wait for previous store to finish
if (! first_block) wait_group_events(1, &ev_out0);
ev_out0 = ev_out1;

// swap buffers for next block
if (! last_block)
{
local float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
wait_group_events(1, &ev_out0);
}


extern void c_loop_simd_db_extc(int row_begin, int row_end,
__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
);

__kernel __attribute__((reqd_work_group_size(1,1,1)))
void k_loop_simd_db_extc(__global float *restrict input,
__global float *restrict output,
__global float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
__local float *restrict lInput, // double buffer
__local float *restrict lOutput // double buffer
)
{
// get the rows that this workgroup needs to process
int gid = get_global_id(0);
int gsz = get_global_size(0);
int row_begin = gid * (ROWS / gsz);
int row_end = (gid + 1) * (ROWS / gsz); // exclusive
if (gid == gsz - 1) row_end = ROWS; // exclusive

c_loop_simd_db_extc(row_begin, row_end,
input, output, filter,
COLS, ROWS, inPitch, outPitch,
BLOCK_HEIGHT, lInput, lOutput);
}
#endif
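For sanity-checking the DSP results on the host, the 5-tap, clamped-boundary row convolution that all the kernel variants above compute can be written as a plain C++ reference. This is a minimal sketch; `conv1d_ref` is a made-up name, not part of the TI example:

```cpp
#include <cassert>

// Reference 5-tap 1D convolution per row, clamping column indices at the
// image borders exactly like the OpenCL kernels above.
static void conv1d_ref(const float *in, float *out, const float *filt,
                       int cols, int rows, int inPitch, int outPitch)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            float acc = 0.0f;
            for (int k = -2; k <= 2; k++) {
                int cc = c + k;
                if (cc < 0)     cc = 0;          // clamp left border
                if (cc >= cols) cc = cols - 1;   // clamp right border
                acc += in[r * inPitch + cc] * filt[k + 2];
            }
            out[r * outPitch + c] = acc;
        }
}
```

Comparing the device output against this reference (with a small epsilon for float rounding) is one way to localize mismatches between the optimized variants and the baseline.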

k_extc.c

/******************************************************************************
* Copyright (c) 2017, Texas Instruments Incorporated - http://www.ti.com/
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of Texas Instruments Incorporated nor the
* names of its contributors may be used to endorse or promote products
* derived from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
* THE POSSIBILITY OF SUCH DAMAGE.
*****************************************************************************/

#include <stdio.h>
#include <stdbool.h>
#include <c6x.h>
#include "dsp_c.h"
#include "dsp_edmamgr.h"

void c_loop_simd_db_extc(int row_begin, int row_end,
float *restrict input,
float *restrict output,
float *restrict filter,
int COLS,
int ROWS,
int inPitch,
int outPitch,
int BLOCK_HEIGHT,
float *restrict lInput, // double buffer
float *restrict lOutput // double buffer
)
{
// partition rows into chunks, prefetch next chunk, compute this chunk
bool first_block, last_block;
int row, b_row, block_height, next_b_row, next_block_height;
float *in_buf0 = lInput;
float *in_buf1 = lInput + BLOCK_HEIGHT * inPitch;
float *out_buf0 = lOutput;
float *out_buf1 = lOutput + BLOCK_HEIGHT * outPitch;

EdmaMgr_Handle chan_in, chan_out;
chan_in = __ocl_EdmaMgr_alloc_intrakernel(1);
chan_out = __ocl_EdmaMgr_alloc_intrakernel(1);
if (! chan_in || ! chan_out)
{
printf("Failed to allocate edma handle.\n");
return;
}

// fetch first block
block_height = (row_begin + BLOCK_HEIGHT >= row_end) ? row_end - row_begin
: BLOCK_HEIGHT;
EdmaMgr_copy2D1D(chan_in, &input[row_begin * inPitch], in_buf0,
COLS * sizeof(float), block_height,
inPitch *sizeof(float));

// for each block
for (b_row = row_begin; b_row < row_end; b_row += BLOCK_HEIGHT)
{
first_block = (b_row == row_begin);
last_block = (b_row + BLOCK_HEIGHT >= row_end);
block_height = (b_row + BLOCK_HEIGHT >= row_end) ? row_end - b_row
: BLOCK_HEIGHT;
next_b_row = b_row + block_height;
next_block_height = (next_b_row + BLOCK_HEIGHT > row_end)
? row_end - next_b_row : BLOCK_HEIGHT;

// wait for prefetched block to finish
EdmaMgr_wait(chan_in);
// prefetch next block
if (! last_block)
EdmaMgr_copy2D1D(chan_in, input + next_b_row * inPitch, in_buf1,
COLS * sizeof(float), next_block_height,
inPitch *sizeof(float));

// for each row in the block: compute
for (row = 0; row < block_height; row++)
{
int col;
__float2_t v2_left_2 = _mem8_f2((void *) &in_buf0[row * inPitch + 2-2]);
__float2_t v2_self = _mem8_f2((void *) &in_buf0[row * inPitch + 2]);
__float2_t v2_left_1 = _ftof2(_lof2(v2_self), _hif2(v2_left_2));
__float2_t vf0 = _ftof2(filter[0], filter[0]);
__float2_t vf1 = _ftof2(filter[1], filter[1]);
__float2_t vf2 = _ftof2(filter[2], filter[2]);
__float2_t vf3 = _ftof2(filter[3], filter[3]);
__float2_t vf4 = _ftof2(filter[4], filter[4]);
for (col = 2; col < COLS-2-1; col+=2)
{
__float2_t v2_right_2 = _mem8_f2((void *) &in_buf0[row*inPitch+col+2]);
__float2_t v2_right_1 = _ftof2(_lof2(v2_right_2), _hif2(v2_self));

_amem8_f2((void *) &out_buf0[row * outPitch + col]) =
_daddsp(
_daddsp(
_daddsp(
_daddsp(_dmpysp(v2_left_2, vf0),
_dmpysp(v2_left_1, vf1)),
_dmpysp(v2_self, vf2)),
_dmpysp(v2_right_1, vf3)),
_dmpysp(v2_right_2, vf4));

v2_left_2 = v2_self;
v2_left_1 = v2_right_1;
v2_self = v2_right_2;
}

// boundary conditions
// Alternatively, user can choose to pad the input data
int extra_col = (col == COLS-3 ? 1 : 0);
int boundaries[5] = { 0, 1, COLS-2, COLS-1, COLS-3 };
int i;
for (i = 0; i < 4 + extra_col; i++)
{
col = boundaries[i];
float left_2 = in_buf0[row * inPitch + (col-2 < 0 ? 0 : col-2)];
float left_1 = in_buf0[row * inPitch + (col-1 < 0 ? 0 : col-1)];
float self = in_buf0[row * inPitch + col];
float right_1 = in_buf0[row * inPitch + (col+1 >= COLS ? COLS-1:col+1)];
float right_2 = in_buf0[row * inPitch + (col+2 >= COLS ? COLS-1:col+2)];

out_buf0[row * outPitch + col] = left_2 * filter[0]
+ left_1 * filter[1]
+ self * filter[2]
+ right_1 * filter[3]
+ right_2 * filter[4];
}
}

// wait for previous store to finish
if (! first_block) EdmaMgr_wait(chan_out);
// store block output back to output image
EdmaMgr_copy1D2D(chan_out, out_buf0, output + b_row * outPitch,
COLS * sizeof(float), block_height,
outPitch * sizeof(float));

// swap buffers for next block
if (! last_block)
{
float *tmp = in_buf0; in_buf0 = in_buf1; in_buf1 = tmp;
tmp = out_buf0; out_buf0 = out_buf1; out_buf1 = tmp;
}
}

// wait for last block store to finish
EdmaMgr_wait(chan_out);

EdmaMgr_free(chan_in);
EdmaMgr_free(chan_out);
}
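The ping-pong structure of c_loop_simd_db_extc (fetch first block, prefetch next, compute current, swap) can be sketched on the host with memcpy standing in for the EdmaMgr transfers. This is a sequential sketch only, it does not overlap copy and compute the way EDMA does, and `process_blocks` is a made-up name:

```cpp
#include <cassert>
#include <cstring>

const int BLOCK = 4;  // stand-in for BLOCK_HEIGHT

// Double-buffered block processing: while `cur` is being computed, the
// next block is already (conceptually) in flight into `next`.
static void process_blocks(const float *in, float *out, int n)
{
    float buf0[BLOCK], buf1[BLOCK];
    float *cur = buf0, *next = buf1;
    int first = (n < BLOCK) ? n : BLOCK;
    std::memcpy(cur, in, first * sizeof(float));     // fetch first block
    for (int b = 0; b < n; b += BLOCK) {
        int h = (b + BLOCK > n) ? n - b : BLOCK;
        int nb = b + h;
        if (nb < n) {                                // prefetch next block
            int nh = (nb + BLOCK > n) ? n - nb : BLOCK;
            std::memcpy(next, in + nb, nh * sizeof(float));
        }
        for (int i = 0; i < h; i++)                  // compute this block
            out[b + i] = cur[i] * 2.0f;
        float *t = cur; cur = next; next = t;        // swap buffers
    }
}
```

The real kernel gets its speedup because the EdmaMgr copy of block N+1 runs concurrently with the compute of block N; the control flow, block-height edge cases, and buffer swap are the same as here.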

DSP_cycles.h

#ifndef DSP_CYCLES_H_
#define DSP_CYCLES_H_

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> // defines uint64_t
#include <c6x.h> // defines _itoll, TSCH, TSCL

#define DSP_CORECLOCK (750*1000000)
// In the variable declaration portion of the code:
uint64_t start_time, end_time, overhead, cyclecount;

typedef struct dspRuntime_
{
    uint64_t start_time_;
    uint64_t end_time_;
    uint64_t overhead_;
    uint64_t cyclecount_;
} dspRuntime;
// Code to be profiled
void test_fuctioncode();


// In the initialization portion of the code:
void init_portion();

void entry_hook();
void exit_hook(float *time_us,uint64_t *Cyclecount);

void init_portion_tst(dspRuntime *dspRuntimeObj);

void entry_hook_tst(dspRuntime *dspRuntimeObj);
void exit_hook_tst(float *time_us,uint64_t *Cyclecount,dspRuntime *dspRuntimeObj);

#endif

DSP_cycles.c


#include <stdio.h>
#include <stdint.h> // defines uint64_t
#include <c6x.h> // defines _itoll, TSCH, TSCL
#include "DSP_cycles.h"
// In the variable declaration portion of the code:
extern uint64_t start_time, end_time, overhead, cyclecount;

// Code to be profiled
void test_fuctioncode()
{
    start_time = _itoll(TSCH, TSCL);
    //function_or_code_here();
    end_time = _itoll(TSCH, TSCL);
    cyclecount = end_time - start_time - overhead;
    printf("The code section took: %llu CPU cycles\n",
           (unsigned long long)cyclecount);
}

// In the initialization portion of the code:
void init_portion()
{
    TSCL = 0; // writing TSCL starts the time stamp counter
    start_time = _itoll(TSCH, TSCL);
    end_time = _itoll(TSCH, TSCL);
    overhead = end_time - start_time; // overhead of the measurement itself
}

void entry_hook()
{
    start_time = _itoll(TSCH, TSCL);
}

void exit_hook(float *time_us, uint64_t *Cyclecount)
{
    end_time = _itoll(TSCH, TSCL);
    cyclecount = end_time - start_time - overhead;
    time_us[0] = (float)cyclecount / (DSP_CORECLOCK / 1000000.0);
    Cyclecount[0] = cyclecount;
}

// In the initialization portion of the code:
void init_portion_tst(dspRuntime *dspRuntimeObj)
{
    TSCL = 0; // writing TSCL starts the time stamp counter
    dspRuntimeObj->start_time_ = _itoll(TSCH, TSCL);
    dspRuntimeObj->end_time_ = _itoll(TSCH, TSCL);
    // overhead of the measurement itself
    dspRuntimeObj->overhead_ = dspRuntimeObj->end_time_ - dspRuntimeObj->start_time_;
}

void entry_hook_tst(dspRuntime *dspRuntimeObj)
{
    dspRuntimeObj->start_time_ = _itoll(TSCH, TSCL);
}

void exit_hook_tst(float *time_us, uint64_t *Cyclecount, dspRuntime *dspRuntimeObj)
{
    dspRuntimeObj->end_time_ = _itoll(TSCH, TSCL);
    dspRuntimeObj->cyclecount_ = dspRuntimeObj->end_time_ - dspRuntimeObj->start_time_
                               - dspRuntimeObj->overhead_;
    time_us[0] = (float)dspRuntimeObj->cyclecount_ / (DSP_CORECLOCK / 1000000.0);
    Cyclecount[0] = dspRuntimeObj->cyclecount_;
}
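The cycle-to-time conversion used by exit_hook/exit_hook_tst is simply the cycle count divided by the core clock expressed in MHz. As a standalone sketch (assuming the 750 MHz DSP_CORECLOCK defined above):

```cpp
#include <cassert>
#include <cstdint>

const double DSP_CORECLOCK = 750 * 1000000.0;  // 750 MHz, as in DSP_cycles.h

// Convert a TSC cycle delta into microseconds, mirroring exit_hook().
static float cycles_to_us(uint64_t cycles)
{
    return (float)(cycles / (DSP_CORECLOCK / 1000000.0));
}
```

So the dspruntime_us values in the log are just the measured cycle counts divided by 750.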

Makefile

EXE = conv1d_tst
CXXFLAGS = -O3
CL6XFLAGS = -o3 -mw --symdebug:none

include ../make.inc

$(EXE): main.o
@$(CXX) $(CXXFLAGS) main.o $(LDFLAGS) $(LIBS) -lrt -o $@

main.o: ti_kernels.dsp_h

ti_kernels.dsp_h: DSP_cycles.obj k_extc.obj

log

************************
************************
DSP running log:
NUMCOMPUNITS=2,timeout_us=20000 us, COLS=1920 ,ROWS=1080,
kernelName=k_base ,QueuetoSubmit=1 us, SubmittoStart=8389 us, StarttoEnd=18634 us, dspruntime_us=0.000000 us, max_time_out=27024 us, timeout_us=20000 us, timeus_linux=27225 us, cnt_time_out=1, mallinfo=230096,
kernelName=k_base ,QueuetoSubmit=1 us, SubmittoStart=21 us, StarttoEnd=20956 us, dspruntime_us=0.000000 us, max_time_out=20978 us, timeout_us=20000 us, timeus_linux=22053 us, cnt_time_out=2, mallinfo=231568,

  • Hi,

    The OpenCL examples on the AM57xx are provided for reference and are intended to be used as-is.

    From your description, it seems like you have been able to fix the issue.

    Note that we do not support bug fixes for the OpenCL examples included in the SDK.

    Regards

    Karthik

  • Reply 1:  to https://e2e.ti.com/support/processors-group/processors/f/processors-forum/992202/am5748-how-to-use-the-l2-sram-of-dsp-subsysterm-with-linux-and-opencl-developing-am5748  (which is locked now)

    Since you are already looking at the shipped "conv1d" example, this example shows how to query available OpenCL local memory (L2 scratch memory on DSP) and use "SetArg( , __local(size))" to allocate OpenCL local memory.

    If you want to shrink the available L2 cache and free up more OpenCL local memory, you can first run a kernel that contains "__cache_l2_64k()", then run your kernels, and at the end run another kernel that contains "__cache_l2_128k()" to reset the cache size to the default.  The "sgemm" example shows how to adjust and restore the L1D cache size for L1D scratch memory; you can do similar things for L2: https://git.ti.com/cgit/opencl/ti-opencl/tree/examples/sgemm/kernel.cl#n43 

    Builtin function declarations are on the EVM filesystem: /usr/share/ti/opencl/dsp_c.h  User guide is here: https://downloads.ti.com/mctools/esd/docs/opencl/memory/cache-operations.html 

    -Yuan

  • Reply 2: regarding the memory leakage on Event object.

    "conv1d" example was written to illustrate kernel performance optimizations.  Since there are only a few different versions, we reused the same Event object "ev".  By using the same "ev" repeatedly on EnqueueNDRangeKernel(..., &ev) call, the previous ev object was not properly released.  This is caused by the limitations of the "C++" bindings (release is a hidden method in base class).

    You already figured this out and switched to using a different Event object (ev, ev1, ev2, ...) for each occurrence.  However, calling "~Wrapper()" is not necessary: when each of these Event objects goes out of scope, its destructor is called automatically.

    Should you choose to re-use the same Event object, you can assign it an empty Event object to release it.  We have examples in the ooo/ooo_map/ooo_callback examples: https://git.ti.com/cgit/opencl/ti-opencl/tree/examples/ooo_callback/ooo_callback.cpp#n209    https://git.ti.com/cgit/opencl/ti-opencl/tree/examples/ooo_callback/ooo_callback.cpp#n271 
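    The "assign an empty object to release" pattern works because the C++ wrapper is reference counted: copy assignment releases whatever the left-hand side held before taking the new reference. A toy model of that mechanism (this is NOT the real cl::Event class, just a sketch):

```cpp
#include <cassert>

// Toy reference-counted wrapper, shaped like cl.hpp's Wrapper<T>.
// `refs` stands in for the OpenCL object's reference count.
struct Event {
    int *refs;
    Event() : refs(nullptr) {}                      // empty event, holds nothing
    explicit Event(int *r) : refs(r) { ++*refs; }   // acquire a reference
    Event(const Event &o) : refs(o.refs) { if (refs) ++*refs; }
    Event &operator=(const Event &o) {
        if (this != &o) {
            if (refs) --*refs;                      // release what we held
            refs = o.refs;
            if (refs) ++*refs;
        }
        return *this;
    }
    ~Event() { if (refs) --*refs; }                 // release on destruction
};
```

    With this shape, `ev = Event();` drops ev's reference to the previous event, which is exactly what safe reuse of one ev across many enqueues requires.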

    -Yuan

  • Reply 3: regarding the QueuetoSubmit time, 6 ms is probably not real.  Maybe double-check your time measurement code?  The OpenCL runtime is implemented using multiple threads.  Are there other threads/processes running in your application/system?

    If you look at the "conv1d" reported time on AM57 EVM,

    # ./conv1d
    Elapsed (loading program): 693 usecs
    Elapsed (host_compute): 30454 usecs
    Elapsed (k_baseline): 17085 usecs
    Elapsed (k_loop): 16918 usecs
    Elapsed (k_loop_simd): 10435 usecs
    Elapsed (k_loop_db): 6761 usecs
    Elapsed (k_loop_simd_db): 4972 usecs
    Elapsed (k_loop_simd_db_extc): 4656 usecs
    Pass!

    These times are measured on the Arm side and include everything from start to finish of a kernel invocation: queueing, round-trip communication, and DSP computation.  You can see that the last two versions finish within 5 ms.

    -Yuan

  • 1. I use the DSP L2 cache to compute on the DSP, but the result sometimes differs from the A15. Why?

    I use the async_work_group_copy function to copy data from MSMC to the L2 cache.

    The compute function on the DSP side is a DSP .obj file.

    2. How can I debug a DSP function .obj built with the C6000 compiler?

  • 3. How do I measure the DSP running time? Should I use the ocl_event_times function in ocl_util.h?

    4. I only copy 418*8 values from the A15 to the DSP and back, but the start-to-end time reported by ocl_event_times varies from 3 us to 12 ms. Why?

    Is it better to measure with the DSP cycle counter instead, i.e. the exit_hook_tst function in DSP_cycles.h?

  • The results should be the same.  Is this integer or floating-point computation?  For floating point, DSP results might differ slightly from the A15 due to rounding/accumulation-order differences.  For integers, the results should be identical.

    Maybe start from a version that computes from DDR and produces correct results first, then optimize it to use async_work_group_copy and the L2 cache.
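    The accumulation-order point can be reproduced on any IEEE-754 machine: single-precision addition is not associative, so two devices that sum the same values in a different order can legitimately disagree. A deterministic example:

```cpp
#include <cassert>

// Grouping the same three floats differently changes the result:
// -1e8f + 1.0f rounds back to -1e8f (the float spacing near 1e8 is 8),
// so the 1.0f is lost in one grouping and kept in the other.
static float grouped_left(float a, float b, float c)  { return (a + b) + c; }
static float grouped_right(float a, float b, float c) { return a + (b + c); }
```

    A vectorized or pipelined DSP loop reorders additions exactly like this, which is why a small per-element tolerance (rather than bit-exact comparison) is the usual way to validate float kernels.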

  • To debug, printf works in OpenCL kernels, and you can also debug with CCS via an attached emulator: https://downloads.ti.com/mctools/esd/docs/opencl/debug/index.html  (debugging with gdb is no longer supported)

  • ocl_event_times reports the time observed on the Arm side.  If you want pure DSP time, you can profile with the __clock64() built-in function to get actual cycle counts, then divide by the DSP frequency (e.g. 750 MHz) to get wall time.  https://downloads.ti.com/mctools/esd/docs/opencl/extensions/opencl-c-builtin-function-extensions.html

    I explained the timing differences in another thread: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/990943/am5748-opencl-examples-conflicts-with-sound-card/3706060

    The Linux kernel controls thread scheduling, and there are usually 3 threads in an OpenCL application on AM57.  You may have to increase the granularity of your kernel (e.g. to at least 2 ms of work per dispatch) to counter the scheduling effects. 
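    "Increasing granularity" here means making each kernel dispatch (and each work-item) do more work, e.g. a batch of rows instead of a single row, so that useful compute dominates scheduling noise. The batching arithmetic matches the partitioning the conv1d double-buffered kernels already use; `batch_range` is a made-up helper name:

```cpp
#include <cassert>

// [begin, end) row range for work-item `id` of `items`; the last
// work-item absorbs the remainder when rows % items != 0.
static void batch_range(int id, int items, int rows, int *begin, int *end)
{
    int per = rows / items;
    *begin = id * per;
    *end = (id == items - 1) ? rows : *begin + per;
}
```

    Launching a few work-items that each process such a range amortizes one dispatch's round-trip cost over many rows, instead of paying it per row.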

  • I used a floating-point add function.

  • How do I increase the granularity of my kernel? Could you give an example?

  • Hi Yuan,

    I have some questions about l2 cach and async_work_group_copy.

    In my test example, the result computed in DSP local memory differs from the A15 result.

    1. The results vary between kernel functions:

    VectorAdd is OK.

    VectorAdd_loc is correct when the buffers a_l, b_l, c_l are all of size 4*bufSize.

    VectorAdd_loc is wrong when the size of b_l differs from a_l and c_l (with a_l and c_l both 4*bufSize). Why?

    The results of VectorAdd_loc_1, VectorAdd_loc_2, VectorAdd_loc_3 and VectorAdd_loc_4 are wrong, and they do not run on the AM5748.

    Why do these situations occur?

    I need to process 418*10 float values in my real project, and in this audio application they must be processed completely within 6 ms, whether using a single or dual DSP core.

    The example code is as follows:

    main.cpp file:

    /******************************************************************************
    * Copyright (c) 2013-2014, Texas Instruments Incorporated - http://www.ti.com/
    * All rights reserved.
    *
    * Redistribution and use in source and binary forms, with or without
    * modification, are permitted provided that the following conditions are met:
    * * Redistributions of source code must retain the above copyright
    * notice, this list of conditions and the following disclaimer.
    * * Redistributions in binary form must reproduce the above copyright
    * notice, this list of conditions and the following disclaimer in the
    * documentation and/or other materials provided with the distribution.
    * * Neither the name of Texas Instruments Incorporated nor the
    * names of its contributors may be used to endorse or promote products
    * derived from this software without specific prior written permission.
    *
    * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
    * THE POSSIBILITY OF SUCH DAMAGE.
    *****************************************************************************/
    #define __CL_ENABLE_EXCEPTIONS
    #include <CL/cl.hpp>
    #include <iostream>
    #include <cstdlib>
    #include <cassert>
    #include "ocl_util.h"

    #ifdef _TI_RTOS
    #include "kernel.dsp_h"
    #include "../rtos_main.c"
    #endif

    using namespace cl;
    using namespace std;

    typedef cl_float cl_data_type;
    #define LOCMEM_USED
    #define RETURN(x) return x
    int main(int argc, char *argv[])
    {
    int NumElements = (418*9/3);//32*13*8/4;//32*13*8;//32*20*10;//418*10;//8*1024*1024; // 8 Mb

    int adjust=0;
    if((NumElements%32)!=0)
    {
    adjust=32-NumElements%32;
    NumElements+=adjust;
    }
    printf("++NumElements%%32=%d\n", NumElements % 32);

    int NumWorkGroups = 1;//256;
    int VectorElements = 1;//4;
    int NumVecElements = NumElements / VectorElements;
    int WorkGroupSize = NumVecElements / NumWorkGroups;

    int kernelType=0;
    char *kernelType_s;
    kernelType_s=argv[1];
    kernelType=atoi(kernelType_s);
    printf("++kernelType=%d\n",kernelType);



    int bufsize = sizeof(cl_data_type) * NumElements;

    cl_data_type *srcA = (cl_data_type *)__malloc_ddr(bufsize);
    cl_data_type *srcB = (cl_data_type *)__malloc_ddr(bufsize);
    cl_data_type *dst = (cl_data_type *)__malloc_ddr(bufsize);
    cl_data_type *Golden = (cl_data_type *)__malloc_ddr(bufsize);
    assert(srcA != NULL && srcB != NULL && dst != NULL && Golden != NULL);

    for (int i=0; i < NumElements; ++i)
    {
    srcA[i] = srcB[i] = i;//<<2;
    Golden[i] = 2.0*srcA[i];//srcB[i] + srcA[i];
    dst[i] = Golden[i];//0;
    }

    try
    {
    Context context(CL_DEVICE_TYPE_ACCELERATOR);
    std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

    int d = 0;
    std::string str;
    devices[d].getInfo(CL_DEVICE_NAME, &str);
    cout << "DEVICE: " << str << endl << endl;

    cout << "Offloading vector addition of " << NumElements;
    cout << " elements..." << endl << endl;
    Buffer bufA,bufB ,bufDst;
    switch(kernelType)
    {
    case 0:
    {
    bufA =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, srcA);
    bufB =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    bufsize, srcB);
    bufDst=Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, dst);
    }break;
    case 1:
    case 3:
    {
    bufA =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, srcA);
    bufB =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    bufsize, srcB);
    bufDst=Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, dst);
    }break;

    case 2:
    case 4:
    case 5:
    {
    bufA =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, srcA);

    bufDst=Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, dst);
    }break;
    default:
    {
    bufA =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, srcA);
    bufB =Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    bufsize, srcB);
    bufDst=Buffer(context,
    (cl_mem_flags) CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR,
    bufsize, dst);
    }break;

    }

    #if 0
    Program::Sources source(1, std::make_pair(kernelStr,strlen(kernelStr)));
    Program program = Program(context, source);

    program.build(devices);
    #else
    char *bin;
    // the kernel binary (.out) file name must match the .cl file name
    const char kernel_file[]="kernel_locMem_tst.out";
    int bin_length = ocl_read_binary( kernel_file, bin);//kernel_fileName kernel_file "kernel.out"
    Program::Binaries binary(1, std::make_pair(bin, bin_length));
    Program program = Program(context, devices, binary);
    program.build(devices);
    delete [] bin;
    #endif

    Kernel kernel;//(program, "VectorAdd");

    switch (kernelType)
    {
    case 0:
    {
        kernel = Kernel(program, "VectorAdd");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufB);
        kernel.setArg(2, bufDst);
        printf("+++ USE DDR,VectorAdd \n");
    } break;
    case 1:
    {
        kernel = Kernel(program, "VectorAdd_loc");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufB);
        kernel.setArg(2, bufDst);
        kernel.setArg(3, __local(sizeof(cl_data_type)*NumElements));
        kernel.setArg(4, __local(sizeof(cl_data_type)*NumElements));
        kernel.setArg(5, __local(sizeof(cl_data_type)*NumElements));
        printf("+++ USE DDR,VectorAdd_loc \n");
    } break;
    case 2:
    {
        kernel = Kernel(program, "VectorAdd_loc_1");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufDst);
        kernel.setArg(2, __local(sizeof(cl_data_type)*NumElements));
        kernel.setArg(3, __local(sizeof(cl_data_type)*NumElements));
        printf("+++ USE DDR,VectorAdd_loc_1 \n");
    } break;
    case 3:
    {
        kernel = Kernel(program, "VectorAdd_loc_2");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufB);
        kernel.setArg(2, bufDst);
        kernel.setArg(3, __local(sizeof(cl_data_type)*NumElements));
        kernel.setArg(4, __local(sizeof(cl_data_type)*NumElements));
        printf("+++ USE DDR,VectorAdd_loc_2 \n");
    } break;
    case 4:
    {
        kernel = Kernel(program, "VectorAdd_loc_3");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufDst);
        kernel.setArg(2, __local(sizeof(cl_data_type)*NumElements));
        printf("+++ USE DDR,VectorAdd_loc_3 \n");
    } break;
    case 5:
    {
        kernel = Kernel(program, "VectorAdd_loc_4");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufDst);
        kernel.setArg(2, __local(sizeof(cl_data_type)*NumElements));
        kernel.setArg(3, __local(sizeof(cl_data_type)*NumElements));
        printf("+++ USE DDR,VectorAdd_loc_4 \n"); /* was "VectorAdd_loc_3": copy-paste slip */
    } break;
    default:
    {
        kernel = Kernel(program, "VectorAdd");
        kernel.setArg(0, bufA);
        kernel.setArg(1, bufB);
        kernel.setArg(2, bufDst);
        printf("+++ USE DDR,VectorAdd \n");
    } break;
    }

    CommandQueue Q(context, devices[d], CL_QUEUE_PROFILING_ENABLE);
    int cnt = 0;
    int cnt_total = 500;

    //while(1)
    //for(int j=0;j<;j++)
    {
        Event ev1;
        Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(NumVecElements),
                               NDRange(WorkGroupSize), NULL, &ev1);
        ev1.wait();
        cnt++;
        if (cnt >= cnt_total)
        {
            ocl_event_times(ev1, "Kernel Exec");
            cnt = 0;
        }
        for (int i = 0; i < NumElements; ++i)
            if (Golden[i] != dst[i])
            {
                cout << "Failed at Element " << i << ": "
                     << Golden[i] << " != " << dst[i] << endl;
                // RETURN(-1);
            }
    }

    __free_ddr(srcA);
    __free_ddr(srcB);
    }
    catch (Error& err)
    {
        cerr << "ERROR: " << err.what() << "(" << err.err() << ", "
             << ocl_decode_error(err.err()) << ")" << endl;
    }

    for (int i = 0; i < NumElements; ++i)
        if (Golden[i] != dst[i])
        {
            cout << "Failed at Element " << i << ": "
                 << Golden[i] << " != " << dst[i] << endl;
            RETURN(-1);
        }

    __free_ddr(dst);
    __free_ddr(Golden);

    cout << "Success!" << endl;

    RETURN(0);
    }
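    For reference, the QueuetoSubmit / SubmittoStart / StarttoEnd values printed by ocl_event_times() are simply differences of the four per-event timestamps (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START, _END, reported in nanoseconds). A minimal sketch of that arithmetic; the timestamp values below are made up to match the numbers quoted in the question, while real code would read them via clGetEventProfilingInfo() or cl::Event::getProfilingInfo():

    ```cpp
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        // Four CL_PROFILING_COMMAND_* timestamps of one event, in nanoseconds.
        // These are illustrative values only, chosen to reproduce the numbers
        // quoted in the question; a real program queries them from the event.
        uint64_t queued = 0;
        uint64_t submit = queued + 6520ULL   * 1000;  // +6520 us
        uint64_t start  = submit + 68163ULL  * 1000;  // +68163 us
        uint64_t end    = start  + 353829ULL * 1000;  // +353829 us

        // Convert each delta from nanoseconds to microseconds.
        printf("QueuetoSubmit=%llu us, SubmittoStart=%llu us, StarttoEnd=%llu us\n",
               (unsigned long long)((submit - queued) / 1000),
               (unsigned long long)((start  - submit) / 1000),
               (unsigned long long)((end    - start)  / 1000));
        return 0;
    }
    ```

    Note that a large SubmittoStart means the command sat in the runtime after submission, typically waiting on the DSP mailbox/IPC path rather than on the kernel itself.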

    //////////////////

    kernel_locMem_tst.cl file:

    __kernel void null(void)
    {
    }

    //kernel __attribute__((reqd_work_group_size(1,1,1)))
    __kernel void VectorAdd(global float* a,
                            global float* b,
                            global float* c)
    {
    #if 0
        int id = get_global_id(0);
        c[id] = a[id] + b[id];
    #else
        int bufSize = get_local_size(0);

        add_op(a, b, c, bufSize);
    #endif
    }
    #if 1
    __kernel void VectorAdd_loc(__global float* a,
                                __global float* b,
                                __global float* c,
                                __local float *restrict a_l,
                                __local float *restrict b_l,
                                __local float *restrict c_l)
    {
        int id = get_global_id(0);
    #if 0
        int bufSize = get_local_size(0);

        add_op(a, b, c, bufSize);
    #else
        int bufSize = get_local_size(0);

        event_t ev_in0;
        ev_in0 = async_work_group_copy(a_l, a,
                                       sizeof(float)*bufSize, 0);
        // wait for the prefetched block to finish
        wait_group_events(1, &ev_in0);
    #if 1
        if (id == 0)  /* was "if(id=0)": an assignment that clobbered id and never printed */
        {
            printf("++kernel async_work_group_copy(a_l,a) \n");
        }
    #endif
    #if 1
        event_t ev_in1;
        ev_in1 = async_work_group_copy(b_l, b,
                                       sizeof(float)*bufSize, 0);
        // wait for the prefetched block to finish
        wait_group_events(1, &ev_in1);
    #endif
    #if 1
        if (id == 0)  /* was "if(id=0)": same assignment bug */
        {
            printf("++kernel async_work_group_copy(b_l,b) \n");
        }
    #endif
    #if 0
        if (id == 0)
        {
            for (int i = 0; i < 10; i++)
            {
                printf("++kernel: a_l[%d]=%f <> b_l[%d]=%f\n", i, a_l[i], i, b_l[i]);
            }
        }
    #endif
        //add_op(a_l,a_l,c_l,bufSize);
        add_op_1(a_l, c_l, bufSize);
    #if 0
        add_op_1(a_l, a_l, bufSize);
    #if 0
        if (id == 0)
        {
            for (int i = 0; i < 10; i++)
            {
                printf("++kernel: a_l[%d]=%f \n", i, a_l[i]);
            }
        }
    #endif

    #if 1
        if (id == 0)
        {
            for (int i = 0; i < bufSize; i++)
            {
                c_l[i] = a_l[i];
            }
        }
    #else
        memcpy(c_l, a_l, sizeof(float)*bufSize);
    #endif

    #endif

    #if 0
        if (id == 0)
        {
            for (int i = 0; i < 10; i++)
            {
                printf("++kernel : a_l[%d]=%f <> b_l[%d]=%f <> c_l[%d]=%f \n", i, a_l[i], i, b_l[i], i, c_l[i]);
            }
        }
    #endif
    #if 0
        if (id == 0)
        {
            for (int i = 0; i < bufSize; i++)
            {
                if (c_l[i] != c[i])
                {
                    printf("++kernel err: c[%d]=%f <> c_l[%d]=%f\n", i, c[i], i, c_l[i]);
                }
            }
        }
    #endif
        event_t ev_out0;
        ev_out0 = async_work_group_copy(c, c_l,
                                        sizeof(float)*bufSize, 0);
        // wait for the copy-out to finish
        wait_group_events(1, &ev_out0);
    #if 0
        if (id == 0)
        {
            for (int i = 0; i < 10; i++)
            {
                printf("++kernel : a_l[%d]=%f <> b_l[%d]=%f <> c_l[%d]=%f <>c[%d]=%f\n", i, a_l[i], i, b_l[i], i, c_l[i], i, c[i]);
            }
        }
    #endif
    #endif
    }
    #endif
    #if 1
    __kernel void VectorAdd_loc_1(__global float* a,
    __global float* c,
    __local float *restrict a_l,
    __local float *restrict c_l
    )
    {
    int id=get_global_id(0);
    #if 0
    int bufSize=get_local_size(0);

    add_op(a,a,c,bufSize);

    #else
    int bufSize=get_local_size(0);

    event_t ev_in0, ev_out0;

    ev_in0 = async_work_group_copy(a_l, &a[0],
    sizeof(float)*bufSize, 0);
    // wait for the prefetched block to finish
    wait_group_events(1, &ev_in0);

    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel: a_l[%d]=%f \n",i,a_l[i]);
    }
    }
    #endif
    add_op_1(a_l,c_l,bufSize);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f \n",i,a_l[i],i,c_l[i]);
    }
    }
    #endif
    #if 1
    if(id==0)
    {
    for(int i=0;i<bufSize;i++)
    {
    if(c_l[i]!=c[i])
    {
    printf("++kernel err: c[%d]=%f <> c_l[%d]=%f\n",i,c[i],i,c_l[i]);
    }
    }
    }
    #endif
    ev_out0 = async_work_group_copy(c, c_l,
    sizeof(float)*bufSize, 0);
    // wait for the copy-out to finish
    wait_group_events(1, &ev_out0);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f <>c[%d]=%f\n",i,a_l[i],i,c_l[i],i,c[i]);
    }
    }
    #endif
    #endif
    }

    #endif
    #if 1
    __kernel void VectorAdd_loc_2(__global float* a,
    __global float* b,
    __global float* c,
    __local float *restrict a_l,
    __local float *restrict c_l
    )
    {
    int id=get_global_id(0);
    #if 0
    int bufSize=get_local_size(0);

    add_op(a,a,c,bufSize);

    #else
    int bufSize=get_local_size(0);

    event_t ev_in0, ev_out0;

    ev_in0 = async_work_group_copy(a_l, a,
    sizeof(float)*bufSize, 0);
    // wait for the prefetched block to finish
    wait_group_events(1, &ev_in0);

    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel: a_l[%d]=%f \n",i,a_l[i]);
    }
    }
    #endif
    add_op_1(a_l,c_l,bufSize);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f \n",i,a_l[i],i,c_l[i]);
    }
    }
    #endif
    #if 1
    if(id==0)
    {
    for(int i=0;i<bufSize;i++)
    {
    if(c_l[i]!=c[i])
    {
    printf("++kernel err: c[%d]=%f <> c_l[%d]=%f\n",i,c[i],i,c_l[i]);
    }
    }
    }
    #endif
    ev_out0 = async_work_group_copy(c, c_l,
    sizeof(float)*bufSize, 0);
    // wait for the copy-out to finish
    wait_group_events(1, &ev_out0);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f <>c[%d]=%f\n",i,a_l[i],i,c_l[i],i,c[i]);
    }
    }
    #endif
    #endif
    }

    #endif

    #if 1
    __kernel void VectorAdd_loc_3(__global float* a,
    __global float* c,
    __local float *restrict c_l
    )
    {
    int id=get_global_id(0);
    #if 0
    int bufSize=get_local_size(0);

    add_op(a,a,c,bufSize);

    #else
    int bufSize=get_local_size(0);

    event_t ev_in0, ev_out0;

    ev_in0 = async_work_group_copy(c_l, a,
    sizeof(float)*bufSize, 0);
    // wait for the prefetched block to finish
    wait_group_events(1, &ev_in0);

    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel: a_l[%d]=%f \n",i,a_l[i]);
    }
    }
    #endif
    add_op_1(c_l,c_l,bufSize);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f \n",i,a_l[i],i,c_l[i]);
    }
    }
    #endif
    #if 1
    if(id==0)
    {
    c[0]=2;
    for(int i=0;i<bufSize;i++)
    {
    if(c_l[i]!=c[i])
    {
    printf("++kernel err: c[%d]=%f <> c_l[%d]=%f\n",i,c[i],i,c_l[i]);
    }
    }
    }
    #endif
    ev_out0 = async_work_group_copy(c, c_l,
    sizeof(float)*bufSize, 0);
    // wait for the copy-out to finish
    wait_group_events(1, &ev_out0);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f <>c[%d]=%f\n",i,a_l[i],i,c_l[i],i,c[i]);
    }
    }
    #endif
    #endif
    }

    #endif


    #if 1
    __kernel void VectorAdd_loc_4(__global float* a,
    __global float* c,
    __local float *restrict a_l,
    __local float *restrict c_l
    )
    {
    int id=get_global_id(0);
    #if 0
    int bufSize=get_local_size(0);

    add_op(a,a,c,bufSize);

    #else
    int bufSize=get_local_size(0);

    event_t ev_in0, ev_out0;

    ev_in0 = async_work_group_copy(a_l, a,
    sizeof(float)*bufSize, 0);
    // wait for the prefetched block to finish
    wait_group_events(1, &ev_in0);

    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel: a_l[%d]=%f \n",i,a_l[i]);
    }
    }
    #endif
    add_op_1(a_l,c_l,bufSize);
    #if 0
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f \n",i,a_l[i],i,c_l[i]);
    }
    }
    #endif
    #if 1
    if(id==0)
    {
    //c[0]=2;
    for(int i=0;i<bufSize;i++)
    {
    if(c_l[i]!=c[i])
    {
    printf("++kernel err: c[%d]=%f <> c_l[%d]=%f\n",i,c[i],i,c_l[i]);
    }
    }
    }
    #endif
    /* NOTE: the original code issued this identical copy-out twice
       (ev_out1 then ev_out0); once is sufficient. */
    ev_out0 = async_work_group_copy(c, c_l,
                                    sizeof(float)*bufSize, 0);
    // wait for the copy-out to finish
    wait_group_events(1, &ev_out0);
    #if 1
    if(id==0)
    {
    for(int i=0;i<10;i++)
    {
    printf("++kernel : a_l[%d]=%f <> c_l[%d]=%f <>c[%d]=%f\n",i,a_l[i],i,c_l[i],i,c[i]);
    }
    }
    #endif
    #endif
    }

    #endif

    ///////////////////////////

    dsp_fun.c file:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "dsp_fun.h"


    void add_op(float *a, float *b, float *c, int bufSize)
    {
        int i = 0;

        for (i = 0; i < bufSize; i++)
        {
            c[i] = a[i] + b[i];
        }

    #if 0
        for (i = 0; i < 10; i++)
        {
            printf("++kernel add_op: a[%d]=%f <> b[%d]=%f <> c[%d]=%f \n", i, a[i], i, b[i], i, c[i]);
        }
    #endif
    }

    void add_op_1(float *a, float *c, int bufSize)
    {
        int i = 0;

        for (i = 0; i < bufSize; i++)
        {
            c[i] = 2.0*a[i];
        }

    #if 0
        for (i = 0; i < 10; i++)
        {
            printf("++kernel add_op_1: a[%d]=%f <> c[%d]=%f \n", i, a[i], i, c[i]);
        }
    #endif
    }

    /////////////////////////////////////

    dsp_fun.h file :

    #ifndef DSP_FUN_H_
    #define DSP_FUN_H_
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    void add_op(float *a, float *b, float *c, int bufSize);
    void add_op_1(float *a, float *c, int bufSize);

    #endif

    ///////////////////////////

    Makefile:

    EXE = locMem_tst
    EXE_MD = locMem_tst_md
    CXXFLAGS = -O3

    CL6XFLAGS = -o3 -mw --symdebug:none

    include ../make.inc

    $(EXE): main.o
    @$(CXX) $(CXXFLAGS) main.o $(LDFLAGS) $(LIBS) -o $@

    #$(EXE_MD): main_md.o
    # @$(CXX) $(CXXFLAGS) main_md.o $(LDFLAGS) $(LIBS) -lrt -o $@
    main.o: kernel_locMem_tst.dsp_h
    kernel_locMem_tst.dsp_h: dsp_fun.obj
    #kernel_locMem_tst.out: dsp_fun.obj


    clean::
    @rm -fr $(EXE_MD)

    //////////////////////

  • To avoid confusion, we recommend using the "reqd_work_group_size(1,1,1)" attribute when using the "async_work_group_copy" functions.  A work-group size of (1,1,1) does not mean you can only process one array element in each work-group/work-item: you can map a work-group/work-item to a whole array or to a section of an array.  Please refer to the "conv1d" example to see how async_work_group_copy() works.
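    The one-work-item pattern described above (map a single work-group/work-item to a whole array, staged through local memory) can be sketched off-device. This is plain C++, not OpenCL C: std::memcpy stands in for async_work_group_copy(), and scale2, BLOCK, and process_whole_array are hypothetical names used only for illustration.

    ```cpp
    #include <cassert>
    #include <cstring>
    #include <vector>

    // Plain C++ illustration (NOT OpenCL C) of the one-work-item pattern:
    // a single "work-item" streams the whole array through a small local
    // scratch buffer, block by block.  memcpy stands in for
    // async_work_group_copy(); scale2 stands in for the per-block math.
    static const int BLOCK = 64;                   // size of the "local" buffer

    static void scale2(const float *in, float *out, int n)
    {
        for (int i = 0; i < n; ++i) out[i] = 2.0f * in[i];
    }

    void process_whole_array(const float *src, float *dst, int n)
    {
        float in_l[BLOCK], out_l[BLOCK];           // stand-ins for __local buffers
        for (int off = 0; off < n; off += BLOCK)
        {
            int len = (n - off < BLOCK) ? (n - off) : BLOCK;
            std::memcpy(in_l, src + off, len * sizeof(float));   // "copy in"
            scale2(in_l, out_l, len);                            // compute on local data
            std::memcpy(dst + off, out_l, len * sizeof(float));  // "copy out"
        }
    }

    int main()
    {
        std::vector<float> a(1000), c(1000);
        for (int i = 0; i < 1000; ++i) a[i] = (float)i;
        process_whole_array(a.data(), c.data(), 1000);
        for (int i = 0; i < 1000; ++i) assert(c[i] == 2.0f * i);
        return 0;
    }
    ```

    On the DSP the two copies would be the EDMA-backed async_work_group_copy calls, which is why a (1,1,1) work-group can still cover an arbitrarily large array.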

  • Hi Yuan,

    I have a question.

    If I develop the DSP core of the AM5748 with OpenMP, what is the difference in real-time behavior between OpenCL and OpenMP?

    How does the real-time performance of OpenMP compare with OpenCL?

  • The real-time issue (the huge variance of latencies across different runs) is more Linux-related, as I explained in the other thread.  You should try Processor SDK Linux RT and see whether it improves the predictability of latencies.

    Even if you develop the DSP code with OpenMP, the communication between the DSP and the Arm remains the same (it still goes through the OpenCL path).

  • If I use TI-RTOS to develop the DSP of the AM5748, can I avoid operating the drivers (SPI, McASP) myself?

    Can I write the IPC (communication between the A15 and the DSP) by myself? How do I write the IPC? Is there an example for IPC?

  • Yes, you can write IPC by yourself, here is the link to IPC in RTOS user guide: https://software-dl.ti.com/processor-sdk-rtos/esd/docs/06_03_00_106/rtos/index_Foundational_Components.html#ipc

    OpenCL also works in RTOS mode: http://downloads.ti.com/mctools/esd/docs/opencl/rtos/index.html

    As to whether you can use the driver (SPI, McASP) in TI-RTOS, I would recommend that you open a new thread on this topic.  Let's keep this thread related to OpenCL only.  Thanks!

  • What I mean is: the A15 is developed with Linux, the DSP core is developed with TI-RTOS, and the A15 and the DSP communicate via IPC. Is that feasible?

    The DSP would not operate the SPI or McASP; it would only handle audio data, processing it with audio algorithms. Is that feasible?

  • Yes, it is feasible.  That's basically how OpenCL works in Processor SDK Linux.  OpenCL host runtime on Arm (running Linux) communicates with OpenCL device runtime on DSP (running TI-RTOS) via MessageQ API from IPC.

    https://git.ti.com/cgit/opencl/ti-opencl/tree/host/src/core/dsp/tal/mbox_impl_msgq.cpp

    https://git.ti.com/cgit/opencl/ti-opencl/tree/monitor/src/monitor.c#n398

    There might be simpler examples:

    root@am57xx-evm:~# ls /usr/share/ti/ti-ipc-tree/examples/DRA7XX_linux_elf/ex02_messageq/
    dsp1 host ipu2 products.mak shared
    dsp2 ipu1 makefile readme.txt

    It will be interesting to see if you can get Linux timing variance under control with a single application/communication thread.
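    One way to quantify that variance from the host side, independently of OpenCL's event profiling, is to time the call site repeatedly with std::chrono and look at the min/max/mean spread. A minimal sketch; the inner busy loop is only a placeholder for the real enqueue + ev.wait():

    ```cpp
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        using Clock = std::chrono::steady_clock;
        std::vector<double> us;             // per-iteration latency, microseconds

        volatile double sink = 0.0;         // keeps the dummy work from being optimized out
        for (int i = 0; i < 500; ++i)
        {
            auto t0 = Clock::now();
            // Placeholder for the real work: enqueueNDRangeKernel(...) + ev.wait()
            for (int k = 0; k < 10000; ++k) sink = sink + k;
            auto t1 = Clock::now();
            us.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
        }

        double mn = *std::min_element(us.begin(), us.end());
        double mx = *std::max_element(us.begin(), us.end());
        double sum = 0; for (double v : us) sum += v;
        printf("min=%.1f us  max=%.1f us  mean=%.1f us  jitter=%.1f us\n",
               mn, mx, sum / us.size(), mx - mn);
        return 0;
    }
    ```

    Comparing the jitter with one communication thread against the multi-threaded case should show directly whether the single-thread setup tightens the latency distribution.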