Optimized code for subtracting two arrays from each other

John Kofod

Hi,

In my project I need to subtract two huge 8-bit arrays (150 kByte grayscale images) from each other, so I'm looking for an optimized way to do this. The brute-force way, with for-loops and ordinary subtraction, of course takes forever, and I can't seem to find a more DSP-friendly way to do it.

Thanks, John

over 15 years ago

0 Archaeologist over 15 years ago

TI__Guru* 84285 points

What processor are you using?

0 John Kofod over 15 years ago in reply to Archaeologist

Prodigy 200 points

DM6437 EVM board with Code Composer Studio 3.3.

And got both the IMGLIB and VLIB libraries working.

0 John Kofod over 15 years ago in reply to John Kofod

Prodigy 200 points

no ideas???

0 Archaeologist over 15 years ago in reply to John Kofod

TI__Guru* 84285 points

I'm sure there are many pre-built libraries that do exactly what you want, and I was hoping someone who knows more about them would have answered by now. You are probably better off looking for one of those libraries.

If you want to write the code yourself, the compiler should be able to make a highly optimized loop by using the SUB4 instruction if you make sure to let the compiler know the arrays are aligned favorably, are big enough, and don't alias one another. For instance:

cl6x -mv6400+ -O2 char_array_diff.c

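/* Tell the compiler the pointer is 8-byte aligned so it can use wide loads and stores. */
#define PTR_IS_64BIT_ALIGNED(x) (_nassert(((unsigned)(x) & 0x7) == 0))

/* a[i] = x[i] - y[i]; restrict promises that the three arrays do not overlap. */
void char_array_diff(char * restrict x, char * restrict y, char * restrict a)
{
    int i;
    PTR_IS_64BIT_ALIGNED(x);
    PTR_IS_64BIT_ALIGNED(y);
    PTR_IS_64BIT_ALIGNED(a);
    for (i = 0; i < 400; i++)
        a[i] = x[i] - y[i];
}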

This gives a software pipelined loop that computes 16 output chars every 3 cycles, which I believe is the fastest possible for char vector difference on this architecture. You could also look at the _sub4() intrinsic, which would also result in the SUB4 instruction.

[Edit: make clear what the performance is --Archaeologist]
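As a rough illustration of the _sub4() route mentioned above, here is a minimal sketch that subtracts four packed bytes per step. The function name char_array_diff4, the SUB4 macro and the portable fallback sub4_portable are made up for this example; only _sub4() itself is the TI intrinsic, and the fallback exists just so the sketch can be compiled and checked on a PC. It is not tuned, and the compiler-generated loop above may well be faster.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for SUB4: four independent 8-bit
   subtractions packed in one 32-bit word, no borrow between bytes. */
uint32_t sub4_portable(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80808080u;           /* top bit of every byte */
    return ((a | H) - (b & ~H)) ^ ((a ^ ~b) & H);
}

#ifdef _TMS320C6X
#include <c6x.h>                              /* TI intrinsics header */
#define SUB4(a, b) ((uint32_t)_sub4((int)(a), (int)(b)))
#else
#define SUB4(a, b) sub4_portable((a), (b))
#endif

/* a[i] = x[i] - y[i] (modulo 256), four bytes at a time. */
void char_array_diff4(const unsigned char *restrict x,
                      const unsigned char *restrict y,
                      unsigned char *restrict a, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        uint32_t xv, yv, av;
        memcpy(&xv, x + i, 4);                /* load 4 packed pixels */
        memcpy(&yv, y + i, 4);
        av = SUB4(xv, yv);
        memcpy(a + i, &av, 4);                /* store 4 results */
    }
    for (; i < n; i++)                        /* leftover 0..3 bytes */
        a[i] = (unsigned char)(x[i] - y[i]);
}
```

The results wrap around modulo 256, the same as the plain char loop. If the images need a saturated or absolute difference instead, that is a different operation and this sketch does not cover it.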

0 John Kofod over 15 years ago in reply to Archaeologist

Prodigy 200 points

If there are optimized libraries for this, they are well hidden.

The suggestion "Archaeologist" wrote, wasn't that far from what I already had, and didn't speed up the process, damn. Though it was a good shot. Thank you.

I've tried the different compiler instructions, but it still takes around 100 ms to process those 150,000 subtractions.

0 Archaeologist over 15 years ago in reply to John Kofod

TI__Guru* 84285 points

The CPU can run the software pipelined loop in about 28k cycles, which is well under 1 ms. Your bottleneck must be memory access; if that's the case, no matter how tight you make the loop, you will not see speedup. Can you move some of these arrays to faster memory?
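For reference, the 28k figure follows from the loop rate quoted earlier (16 output chars every 3 cycles) applied to 150,000 bytes. A small sketch of that arithmetic is below. The helper names are invented for the example, and the 600 MHz clock is only an assumed value, since the thread does not state the actual CPU frequency. Pipeline prolog and epilog cycles are ignored.

```c
#include <stdint.h>

/* Cycles for a loop that produces bytes_per_iter outputs every
   cycles_per_iter cycles (rounding the iteration count up). */
uint32_t loop_cycles(uint32_t n_bytes, uint32_t bytes_per_iter,
                     uint32_t cycles_per_iter)
{
    uint32_t iters = (n_bytes + bytes_per_iter - 1u) / bytes_per_iter;
    return iters * cycles_per_iter;
}

/* Convert a cycle count to milliseconds at a given clock rate. */
double cycles_to_ms(uint32_t cycles, double clock_hz)
{
    return 1000.0 * (double)cycles / clock_hz;
}

/* loop_cycles(150000, 16, 3) gives 28125 cycles.
   cycles_to_ms(28125, 600e6) with the ASSUMED 600 MHz clock is far
   below 1 ms, so a measured 100 ms cannot be explained by the loop. */
```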

0 John Kofod over 15 years ago in reply to Archaeologist

Prodigy 200 points

I'm guessing you're right. And I'm trying to do that with slicing using EDMA. This must be the right approach?

At the moment, all the data and output arrays are placed in external DDR2 memory.

But the DM6437 doesn't have that much internal memory, and these are big arrays that need to be processed.

Thanks anyway...
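The slicing idea described above could look roughly like the sketch below: process the image in tiles small enough to sit in on-chip memory. Everything here is an assumption made for illustration. The name diff_tiled and the 4096-byte tile size are invented, and memcpy merely stands in for the EDMA transfers, whose setup is not shown. A real version would place the tile buffers in internal RAM via the linker command file and would likely use two sets of buffers (ping-pong) so the next transfer overlaps the current subtraction.

```c
#include <stddef.h>
#include <string.h>

#define TILE_BYTES 4096   /* ASSUMED tile size, pick to fit internal RAM */

/* On the target these buffers would be mapped to on-chip memory. */
static unsigned char tile_x[TILE_BYTES];
static unsigned char tile_y[TILE_BYTES];
static unsigned char tile_a[TILE_BYTES];

/* a[i] = x[i] - y[i], processed one tile at a time. */
void diff_tiled(const unsigned char *x, const unsigned char *y,
                unsigned char *a, size_t n)
{
    size_t off = 0;
    while (off < n) {
        size_t len = n - off;
        size_t i;
        if (len > TILE_BYTES)
            len = TILE_BYTES;

        memcpy(tile_x, x + off, len);   /* stand-in for EDMA: DDR2 -> on-chip */
        memcpy(tile_y, y + off, len);

        for (i = 0; i < len; i++)       /* compute on fast memory only */
            tile_a[i] = (unsigned char)(tile_x[i] - tile_y[i]);

        memcpy(a + off, tile_a, len);   /* stand-in for EDMA: on-chip -> DDR2 */
        off += len;
    }
}
```

With plain memcpy this is no faster than the direct loop. The benefit only appears once the copies are done by the DMA engine in the background.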

0 Brad Griffis over 15 years ago in reply to John Kofod

TI__Guru*** 125430 points

If your data is in external memory, it's critical to make sure that the data cache has been enabled and that the MAR bits are set correctly. More details here:

http://processors.wiki.ti.com/index.php?title=Enabling_64x%2B_Cache

I don't believe we have any libraries for this purpose. I've seen some 3rd party libraries (e.g. Kane Computing, etc.) that do vector math operations. Ours focus on things like convolutions, FIR filters, IIR filters, median filters, etc.

Best regards,

Brad

0 John Kofod over 15 years ago in reply to Brad Griffis

Prodigy 200 points

Thanks for the heads up.

It must be that the cache is not enabled for my external memory. I've read all the documents I could find about cache enabling and MAR bits, but they all refer to DSP/BIOS, none to CSL. It must be possible to set this from CSL?

0 Brad Griffis over 15 years ago in reply to John Kofod

TI__Guru*** 125430 points

From CSL? What CSL are you using? I don't believe we ever made a CSL for DM643x (though I think some register definitions exist inside the PSP). If you're not using BIOS then I would just write the registers directly.

For example, let's say you wish to make 64MB of DDR2 (base address 0x80000000) cacheable. You can do something like:

// definitions for relevant MARs
#define MAR128 *(volatile unsigned int*)0x01848200
#define MAR129 *(volatile unsigned int*)0x01848204
#define MAR130 *(volatile unsigned int*)0x01848208
#define MAR131 *(volatile unsigned int*)0x0184820C

// Configure each of the MARs to make the corresponding 16MB of memory cacheable
MAR128 = 1;
MAR129 = 1;
MAR130 = 1;
MAR131 = 1;

If you're using the "register layer CSL" that ships inside the PSP it would look like this:

#include <soc.h>
#include <cslr_cache.h>

CSL_CacheRegs *pCacheRegs = (CSL_CacheRegs*)CSL_CACHE_0_REGS;

pCacheRegs->MAR[128] = 1;
pCacheRegs->MAR[129] = 1;
pCacheRegs->MAR[130] = 1;
pCacheRegs->MAR[131] = 1;

Best regards,
Brad
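The MAR numbers used above follow from each MAR covering 16MB: the register index is the address divided by 16MB, and the register addresses in the #defines are 4 bytes apart starting from MAR128 at 0x01848200. A minimal sketch of that mapping is below. The helper names and the MAR_BASE constant are made up for this example, with MAR_BASE derived from the MAR128 address given above rather than taken from a header. make_cacheable() writes hardware registers, so it is only meaningful on the DSP itself.

```c
#include <stdint.h>

/* Derived from MAR128 at 0x01848200 (0x01848200 - 128 * 4). */
#define MAR_BASE 0x01848000u

/* Each MAR controls one 16MB region, so the index is addr / 16MB. */
uint32_t mar_index(uint32_t addr)
{
    return addr >> 24;
}

/* Address of the MAR register with the given index. */
uint32_t mar_reg_addr(uint32_t index)
{
    return MAR_BASE + 4u * index;
}

/* Mark every 16MB region touched by [base, base + size) as cacheable.
   Target-only: do not call this on a PC. */
void make_cacheable(uint32_t base, uint32_t size_bytes)
{
    uint32_t first = mar_index(base);
    uint32_t last  = mar_index(base + size_bytes - 1u);
    uint32_t i;
    for (i = first; i <= last; i++)
        *(volatile uint32_t *)(uintptr_t)mar_reg_addr(i) = 1u;
}
```

For the 64MB example, make_cacheable(0x80000000, 0x04000000) would touch MAR128 through MAR131, the same four registers written by hand above.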

0 John Kofod over 15 years ago in reply to Brad Griffis

Prodigy 200 points

That solved it...
