Using combined In-Out-Buffer on EVE/VCOP

Tobias Schulze

Hello Everyone,

I'm trying to implement an EVE app that merges two 16-bit values into one 32-bit value. The goal is to overwrite the values in the input buffer with the newly calculated ones, so that I work on only one combined in/out buffer instead of two separate buffers. Is this possible on VCOP, or does a VCOP kernel always need a designated input and output buffer (IbufL and IbufH)?

If possible, can I "merge" IbufL and IbufH into one buffer, so that I process 32kB instead of 16kB of data without reloading the buffer (via EDMA)?

Regards,

Tobias

over 6 years ago

0 Yordan Kamenov over 6 years ago

TI__Mastermind 42515 points

Hi Tobias,

I have forwarded your question to EVE experts.

Regards,
Yordan

0 Anshu Jain over 6 years ago

TI__Guru 55130 points

Hi Tobias,

There are two parts to your question. One is the re-use of EVE's internal memories; the other is the DMA of internal buffers to/from external memory. I will respond to both:

Part 1: Re-use of internal memories of EVE

VCOP places no restriction on internal buffers: you can read from and write to the same buffer. But you should make sure that you do not read back from a location you have just written, because loads and stores happen independently and this could lead to a race condition. Another thing to consider is the performance impact: reading and writing the same buffer takes 2 cycles, instead of 1 if you read and write different buffers. Regarding your question about using the full 32KB of memory: again, VCOP itself places no restriction on the memories, but the problem comes with the DMA, which is explained in the second part.

Part 2: DMA of internal buffers to external memories.

If you use the full 32KB for input and re-use it for output, the problem is the DMA: you have to DMA 32KB of new data from external memory into internal memory, and at the same time DMA the already computed 32KB out of internal memory to external memory. This can be done if you make sure that you first DMA out all the computed data before writing the new data. Hence you need to take care of this ordering when programming the DMA.
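
The ordering described above can be sketched on the host as follows. This is only a model: `dma_copy_blocking`, `kernel_in_place` and `process_frame` are made-up names, `memcpy` stands in for an EDMA transfer that has been waited on, and the placeholder kernel simply increments every byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_BYTES 32768u   /* IBUFL + IBUFH treated as one in/out window */

/* Hypothetical stand-in for a blocking EDMA transfer: returns only when
 * the copy is complete. */
static void dma_copy_blocking(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Placeholder for the VCOP kernel; it just modifies the window in place. */
static void kernel_in_place(uint8_t *window, size_t bytes)
{
    for (size_t i = 0; i < bytes; i++) {
        window[i] = (uint8_t)(window[i] + 1u);
    }
}

/* Runs `blocks` blocks of external memory through the single window. */
static void process_frame(uint8_t *ext, size_t blocks, uint8_t *window)
{
    assert(ext != NULL && window != NULL);
    for (size_t b = 0; b < blocks; b++) {
        uint8_t *ext_block = ext + b * WINDOW_BYTES;
        dma_copy_blocking(window, ext_block, WINDOW_BYTES);  /* 1. DMA in  */
        kernel_in_place(window, WINDOW_BYTES);               /* 2. compute */
        dma_copy_blocking(ext_block, window, WINDOW_BYTES);  /* 3. DMA out */
        /* The next DMA-in starts only after step 3 has returned. */
    }
}
```

In this sketch nothing overlaps: each block is fetched, processed and written back before the next one is touched, which is what keeps the single window consistent.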

Regards,

Anshu

0 Tobias Schulze over 6 years ago in reply to Anshu Jain

Intellectual 900 points

Hey Anshu,

I used the non-BAM example "evelib_fir_filter_2d" as a reference and customized the DMA to work the way I wanted (taking care of the order of IN/OUT writes, as you mentioned). Now I use a combined buffer for input and output as described above. However, if I want to process more than 16kB of data with VCOP, the EVE simulator produces multiple error messages:

"multi_eve_subsystem.EVE_0.MEM_SWITCH: Invalid access to reserved space"

These occur only when I debug the app and step over the execution function of the kernel. If I run the whole app without breakpoints, it works fine. If I run the same app directly on an EVE (e.g. TDA3 EVM), the app gets stuck in a loop when it tries to execute the kernel a second time. On the first execution, all calculated data and memory writes are OK.

So is there anything else I have to edit when I want to change memory ranges of IBUFL/IBUFH?

Thanks,
Tobias

0 Anshu Jain over 6 years ago in reply to Tobias Schulze

TI__Guru 55130 points

Hi Tobias,

As you are using one of our examples, I assume you are using the alias memory view? In that case you have 16KB of IBUFH and 16KB of IBUFL (a total of 32KB for input and output). From the error, it looks like your kernel is going beyond these two buffers and accessing a restricted region. We will have to look at the kernel to comment further on this.

Regards,

Anshu

0 Tobias Schulze over 6 years ago in reply to Anshu Jain

Intellectual 900 points

Hey Anshu,

Yes, I am using the alias memory view. The kernel is meant to process two 10-bit input values (each stored in a 16-bit element) and merge them into one 20-bit value (stored in a 32-bit element). That means we have twice as many inputs as outputs. This is the equivalent C code:

	UINT16 *pFrameAdd16 = (UINT16*)ipcMsg->frameAddr;
	UINT32 *pFrameAdd32 = (UINT32*)ipcMsg->frameAddr;
	UINT32 numPixel = ipcMsg->vPixel * ipcMsg->hPixel;

	for( UINT32 i = 0; i < numPixel; i++ )
	{
		pFrameAdd32[i] = (UINT32)pFrameAdd16[i * 2] << 10 | (UINT32)pFrameAdd16[i * 2 + 1]  ;
	}

The kernel code I came up with looks like this:

#define ELEMSZ           sizeof(*in1_ptr)
#define VECTORSZ        (VCOP_SIMD_WIDTH*ELEMSZ)
#define SHIFT (10)

void buffer_merge
(
   __vptr_uint16  in1_ptr,
   __vptr_uint32  optr, 
   unsigned short width,
   unsigned short height
)
{
   __vector Vin1;
   __vector Vin2; 
   __vector Vout; 
   __vector Vshift;
   __vector Vin1s;

   Vshift = SHIFT;

   for (int I1 = 0; I1 < height; I1++)
   {
       for (int I2 = 0; I2 < width/VCOP_SIMD_WIDTH; I2++)
       {
           __agen Addr_out;
      
           Addr_out = I1*width*ELEMSZ*2 + I2*VECTORSZ*2;

           (Vin1,Vin2) = in1_ptr[Addr_out].deinterleave();
           Vin1s	= (Vin1 << Vshift);
           Vout     = Vin1s | Vin2;
           optr[Addr_out] = Vout;
       }
   }
}

0 Tobias Schulze over 6 years ago in reply to Tobias Schulze

Intellectual 900 points

In the simulator the error messages appear, but the overall output values are OK. As mentioned before, on actual hardware the app gets stuck in a loop.
The screenshot below shows the end of the combined IBUFL/IBUFH memory after VCOP has processed one 32kB block.
It writes to addresses 0x40050000 to 0x40057FFD.

Regards,
Tobias

0 Pramod Kumar Swami over 6 years ago in reply to Tobias Schulze

TI__Genius 12110 points

Hi Tobias,

What are the values of the 4 parameters below when you invoke the function?
__vptr_uint16 in1_ptr,
__vptr_uint32 optr,
unsigned short width,
unsigned short height

Thanks,
With Regards,
Pramod

0 Tobias Schulze over 6 years ago in reply to Pramod Kumar Swami

Intellectual 900 points

Hi Pramod,

__vptr_uint16 in1_ptr: pointer to VCOP buf, in this case (uint16_t*) 0x40050000

__vptr_uint32 optr: pointer to the same VCOP buffer, since I want to use the whole VCOP memory as one single in/out buffer, so here it is (uint32_t*) 0x40050000

unsigned short width: number of input pixels in a row, here 16384

unsigned short height: in this case 1 since I want to process subsequent data, not data in a 2D way

Regards,

Tobias

0 Anshu Jain over 6 years ago in reply to Tobias Schulze

TI__Guru 55130 points

Hi Tobias,
There are two problems with this kernel.
1. As your width ranges up to 16384, when icnt2 = 2047, addr_out = 2047*8*2 = 32752. As your output data is 32-bit, you will be accessing 32752 + 8 * 4 bytes = 32784, which is beyond 32KB. This is the reason the simulator is throwing that warning. And as this is the last entry, that is why your output might still look as expected.
2. The second problem is the one I mentioned earlier in this thread. Your kernel reads from and writes to the same location without ensuring that your load is not from the same location as the store. After your first iteration you will be reading 8 * sizeof(uint16_t) = 16 bytes but writing 8 * sizeof(uint32_t) = 32 bytes, so you have already overwritten your input.
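
The bound in point 1 can also be checked on the host before the kernel is launched. The sketch below follows the address expression of the kernel as posted (`Addr_out = I2*VECTORSZ*2`), assumes `VCOP_SIMD_WIDTH` is 8 and `height` is 1, and uses made-up helper names. The exact byte counts depend on those assumptions, but the conclusion is the same: a width of 16384 steps past a 32KB window, while 8192 stays inside it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIMD_WIDTH 8u   /* assumed value of VCOP_SIMD_WIDTH */

/* One past the last byte offset stored by the inner loop, for height == 1.
 * Per iteration the address advances by VECTORSZ * 2 bytes, where
 * VECTORSZ = SIMD_WIDTH * sizeof(uint16_t), and SIMD_WIDTH 32-bit values
 * are stored starting at that address. */
static size_t kernel_end_offset(size_t width)
{
    assert(width % SIMD_WIDTH == 0u);   /* sketch assumes whole vectors */
    size_t iterations  = width / SIMD_WIDTH;
    size_t stride      = SIMD_WIDTH * sizeof(uint16_t) * 2u;
    size_t store_bytes = SIMD_WIDTH * sizeof(uint32_t);
    if (iterations == 0u) {
        return 0u;
    }
    return (iterations - 1u) * stride + store_bytes;
}

/* True if every store of the kernel lands inside the buffer. */
static bool kernel_fits(size_t width, size_t buffer_bytes)
{
    return kernel_end_offset(width) <= buffer_bytes;
}
```

A call such as `kernel_fits(width, 32768)` in the host code, before the kernel is invoked, would flag an oversized width without needing the simulator.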

Regards,
Anshu

0 Anshu Jain over 6 years ago in reply to Anshu Jain

TI__Guru 55130 points

Tobias,
I just noticed that your input is via a de-interleave load, so my second point is not correct. You can ignore my second comment. But the first comment still holds, and it is what is causing the issue you are observing.
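
A scalar, host-side way to see why the in-place update is safe here: output i occupies exactly the four bytes that held inputs 2i and 2i+1, so a store never lands on data that a later iteration still has to load. This is not VCOP kernel code; the function name is illustrative, and `memcpy` is used only to sidestep alignment and aliasing concerns on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHIFT 10

/* Merges pairs of 16-bit values into 32-bit values in the same buffer.
 * Each output overwrites only the two inputs it was computed from. */
static void merge_pairs_in_place(void *buf, size_t num_outputs)
{
    unsigned char *bytes = buf;
    assert(buf != NULL || num_outputs == 0);
    for (size_t i = 0; i < num_outputs; i++) {
        uint16_t first, second;
        memcpy(&first,  bytes + 4 * i,     sizeof first);
        memcpy(&second, bytes + 4 * i + 2, sizeof second);
        uint32_t merged = ((uint32_t)first << SHIFT) | second;
        memcpy(bytes + 4 * i, &merged, sizeof merged);
    }
}
```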

Regards,
Anshu

0 Anshu Jain over 6 years ago in reply to Anshu Jain

TI__Guru 55130 points

Tobias,
A couple more points on your kernel:
1. You can directly use Vdst |= Vsrc1 << Vsrc2; // VSHFOR Vsrc1, Vsrc2, Vdst, Vdst instead of using two instructions to do the same.
2. This kernel is data bound, i.e. you are spending more on DMA than on the actual compute itself. In general, in such a scenario you should try to do more processing by adding more kernels, so as to utilize VCOP's compute power efficiently.
3. Currently your kernel is not utilizing VCOP efficiently, as it is only using one functional unit. It takes 2 cycles to generate 8 outputs per iteration. You can improve this by processing more data and placing your input and output buffers intelligently. I am pasting some modifications to your kernel to improve it (I have not compiled it, so it may have errors). With these modifications the kernel should take 2 cycles to generate 16 outputs per iteration, as opposed to 8 outputs. Here I have divided the input block into 2 parts, with the 1st part in IBUFL and the 2nd part in IBUFH (you can easily do this via DMA).

void buffer_merge
(
   __vptr_uint16 inPtrL,
   __vptr_uint16 inPtrH,
   __vptr_uint32 optrL,
   __vptr_uint32 optrH,
   unsigned short width, /* Should be max 8192 as width should be in terms of elements size */
   unsigned short height
)
{
   __vector Vin1, Vin2;
   __vector Vin3, Vin4;
   __vector Vshift;

   Vshift = SHIFT;

   for (int I1 = 0; I1 < height; I1++)
   {
       for (int I2 = 0; I2 < width/(VCOP_SIMD_WIDTH); I2++)
       {
           __agen Addr_out;

           Addr_out = I1*width*ELEMSZ*2 + I2*VECTORSZ*2;

           (Vin1,Vin2) = inPtrL[Addr_out].deinterleave();
           (Vin3,Vin4) = inPtrH[Addr_out].deinterleave();

           Vin2 |= Vin1 << Vshift;
           Vin4 |= Vin3 << Vshift;

           optrL[Addr_out] = Vin2;
           optrH[Addr_out] = Vin4;
       }
   }
}

Regards,
Anshu

0 Tobias Schulze over 6 years ago in reply to Anshu Jain

Intellectual 900 points

Hey Anshu,

thanks a lot for your tips on my kernel. You were right regarding the width I used with my initial kernel code. It was too large, so that on the last VCOP iteration the memory boundaries were exceeded. I fixed that, and now my application no longer gets stuck in a loop or throws errors.

I also tried your improvement to my kernel and with a few changes (e.g. changing VCOP_SIMD_WIDTH to 16) it worked nicely.

As mentioned, my application now always finishes without code-related errors. However, I encountered another problem: whenever I choose a VCOP buffer size of 16kB or more, at some point my output buffer contains wrong values. These values start at a different position every time I restart the app and aren't even close to the expected values.

The strange thing is that these errors/wrong values only appear when I start my app without any breakpoints (directly on EVE/TDA3x). If I run it with breakpoints and debug through it, all values are ok. In EVE simulator the output also seems fine (with and without breakpoints in the code). Are there any runtime problems I have to pay attention to? Or could this be an EDMA error?

The attached image shows the wrong values. Expected values should always be one higher than the previous one, like they are until [2098039].
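
A small host-side check along these lines can locate the first bad value automatically, assuming the counting test pattern described above (each output one higher than the previous). The function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first element that is not previous + 1,
 * or `count` if the whole buffer is a clean counting sequence. */
static size_t first_sequence_break(const uint32_t *data, size_t count)
{
    assert(data != NULL || count == 0);
    for (size_t i = 1; i < count; i++) {
        if (data[i] != data[i - 1] + 1u) {
            return i;
        }
    }
    return count;
}
```

Comparing the returned index against the block size used for the transfers may help to tell whether the corruption starts on a block boundary.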

Again, thanks a lot and Regards!

Tobias

0 Anshu Jain over 6 years ago in reply to Tobias Schulze

TI__Guru 55130 points

Hi Tobias,

This looks like your DMA is overwriting some portion of the output data. As mentioned earlier, in this case you may be writing out the output data at the same time as you are reading new data into the same memory location, which might be causing this issue. Have you made sure in your code that you finish writing out the output data before you read in the new data? Can you explicitly put a wait after each EDMA transfer and check whether you still see the issue?
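
Why the wait matters can be illustrated with a deliberately pessimistic host-side model. It is not the real EDMA API: `sim_dma_start` only records the request, and `sim_dma_wait` is the point at which the copy is guaranteed to have happened. If the source buffer is refilled between the two calls, the data that arrives at the destination is the new data, not the computed output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one asynchronous transfer channel. */
typedef struct {
    void       *dst;
    const void *src;
    size_t      bytes;
    int         pending;
} sim_dma_t;

/* Records a transfer request; no data is moved yet in this model. */
static void sim_dma_start(sim_dma_t *ch, void *dst, const void *src, size_t bytes)
{
    assert(!ch->pending);   /* one transfer in flight per channel */
    ch->dst = dst;
    ch->src = src;
    ch->bytes = bytes;
    ch->pending = 1;
}

/* Completes the pending transfer, if any. */
static void sim_dma_wait(sim_dma_t *ch)
{
    if (ch->pending) {
        memcpy(ch->dst, ch->src, ch->bytes);
        ch->pending = 0;
    }
}
```

On real hardware the transfer may complete anywhere between start and wait, which would be consistent with a failure that only shows up when running at full speed without breakpoints.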

Regards,

Anshu

0 Pramod Kumar Swami over 6 years ago in reply to Anshu Jain

TI__Genius 12110 points

Anshu, Tobias,

Since the first query as titled (Using combined In-Out-Buffer on EVE/VCOP) has been answered and resolved, I would suggest closing this thread and starting a new thread for different questions. This makes searching E2E easier and keeps the different discussions cleanly separated.

Thanks,

With Regards,

Pramod

0 Tobias Schulze over 6 years ago in reply to Anshu Jain

Intellectual 900 points

Hey Anshu,

I found a solution to my problem. Apparently some wrong PaRAMs were set for the EDMA (or it was intended to be used in a different way than I wanted). This led to a strange condition where the DMA copied an already processed block a second time to a random location. This error only occurred when I executed the program on hardware and without breakpoints. In the EVE simulator and while stepping through it, it worked fine. Now I have set the PaRAMs myself and it works without errors.

Regards
Tobias
