TDA4VM: C7x image processing algorithm, issue copying the image into the buffer.

Part Number: TDA4VM

Hi,

I have an image processing algorithm that takes one RGBIR raw image as input and generates a complete IR raw output.
The image resolution is 2592 (width) by 1944 (height). I have connected to the XDS110 on-board debugger on the J721E SoC and loaded the binary onto the C7x core.
I am using Code Composer Studio version 12.4 on Ubuntu 22.

While reading the image into the input buffer, I noticed that the copy is very slow.
I tried with 100 rows, and it took 170 seconds. I tried copying the 100 rows in different ways: pixel by pixel, 32 pixels per iteration, row by row, and all in a single call.
Whichever way I used, it took 170 seconds.

I have attached the piece of code for reference.

Can you please suggest an appropriate way to do the copy?
Is this issue really caused by fread()?

unsigned int i = 0;
/* Open the raw image in binary mode. */
FILE *fp = fopen("/home/thalamr/workspace_v_latest/rgbir_instrinsic/input1.raw", "rb");
const unsigned int img_width = 2592;
const unsigned int img_height = 1944;
const unsigned int out_stride = 2688;

/* Both buffers are sized with the padded row stride, not the image width. */
unsigned short* input = (unsigned short*) malloc(out_stride * img_height * sizeof(unsigned short));
unsigned short* outputBayer = (unsigned short*) malloc(out_stride * img_height * sizeof(unsigned short));

/* Read one row per call so that each row starts on a stride boundary. */
for (i = 0; i < img_height; i++)
{
    fread(&input[i * out_stride], sizeof(unsigned short), img_width, fp);
}

Thanks and Regards,
Srinivas Thalam

  • The bottleneck on performance is in the low level details of the implementation of fread.  Unfortunately, there is no solution.  To understand it better, please read the article Tips for using printf.  One relevant passage says of a key buffer used to exchange data with the host ...

    Note: This section is rather small, which means that large reads and writes may require many hits of the C$$IO$$ breakpoint to be performed. This buffer can be increased, but the limiting factor is the debugger needs to be able to resize its own internal buffer.

    Thanks and regards,

    -George
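
The numbers in the original post give a rough idea of what this bottleneck costs. The sketch below is only arithmetic on the figures reported there (100 rows of 2592 16-bit pixels in 170 seconds); the helper names are made up for illustration, and it assumes the transfer time scales linearly with the amount of data and that all 170 seconds were spent in C I/O.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes occupied by `rows` rows of 16-bit pixels, `width_px` pixels each. */
static uint64_t rows_to_bytes(uint32_t width_px, uint32_t rows)
{
    return (uint64_t)width_px * rows * sizeof(uint16_t);
}

/* Observed host-to-target transfer rate in bytes per second. */
static double cio_bytes_per_second(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds;
}

/* Projected time to read `total_rows` rows at the measured rate
 * (assumption: time grows linearly with the number of bytes). */
static double projected_seconds(uint32_t width_px, uint32_t total_rows,
                                double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return (double)rows_to_bytes(width_px, total_rows) / bytes_per_second;
}
```

With the reported values, `rows_to_bytes(2592, 100)` is 518,400 bytes, which works out to roughly 3 KB/s, and `projected_seconds(2592, 1944, rate)` comes to about 3,305 seconds, or around 55 minutes for the full frame through fread().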

  • Hi George,

    We have limited memory allocated for the program in the DSP lnk.cmd file.
    Since my .raw image is 10 MB, I need 50 MB of heap space.

    Here I am attaching the piece of code from the lnk.cmd file.

    -heap 0x70000 // 448 kB
    -stack 0x4000 // 16 kB
    --cinit_compression=off
    --args 0x1000
    --diag_suppress=10068 // "no matching section"

    MEMORY
    {
    MSMCSRAM_CINIT (RWX) : org = 0x70000000, len = 0x000100
    L2SRAM (RWX): org = 0x64800000, len = 0x080000 // for J7
    L1DSRAM (RWX): org = 0x64E00000, len = 0x004000 // 16kB, for J7
    MSMCSRAM (RWX): org = 0x70000100, len = 0x7fff00
    EXTMEM_STATIC (RWX): org = 0x80000000, len = 0x200000
    EXTMEM_DATACN (RWX): org = 0x80200000, len = 0x400000
    EXTMEM (RWX): org = 0x80600000, len = 0x400000
    EXTMEMPAGE (RWX): org = 0x80A00000, len = 0x200000
    }

    This is the lnk.cmd file I used; it is at this path:

    /ti-processor-sdk-rtos-j721e-evm-09_02_00_05/dsplib_09_02_00_04/cmake/linkers/lnk.cmd

    I increased the L2SRAM size to 10 MB and did not get any errors, but in CCS, when I use more than 448 KB, the cursor disappears while debugging.
    I also tried to change the L1DSRAM entry.

    L2SRAM (RWX): org = 0x64800000, len = 0x100000 // for J7
    L1DSRAM (RWX): org = 0x65000000, len = 0x004000 // 16kB, for J7

    Can you please help me increase the size?


    Thanks and Regards,
    Srinivas Thalam

  • Srinivas - do you need to use C I/O? If you will be using CCS for debug, why not just use the CCS memory load feature to load the raw binary image to memory?

  • I increased the L2SRAM size to 10mb

    A memory range like L2SRAM must match the memory present in the system.  I presume your system has a specification which documents all of the memory ranges.  If this memory range does not match that specification, it will not work.  

    Increasing heap and stack is very unlikely to improve performance.  The buffer I refer to in my last post has a fixed length that does not depend on the heap or stack.

    Thanks and regards,

    -George

  • Hi George Mock,

    I tried two different ways before touching the lnk.cmd file.

    1. using malloc().

    unsigned short* input = (unsigned short*) malloc(out_stride * img_height * sizeof(unsigned short));

    2. using __attribute__

    unsigned short input[2592 * 1944] __attribute__((section(".sysmem")));

    When I loaded the image (Tools->Load Memory) in CCS, I noticed that only around 440 KB was copied into the buffer (see the memory browser screenshot below).
    440 KB is what is allocated to the heap in the lnk.cmd file. When I checked the rest of the memory in the memory browser, it displayed "target failed to read 0x000".

    I need approximately 50 MB, since I have an input image, an output image, and two buffers, each 10 MB in size.

    Can you suggest a way to overcome this issue?

    Thanks and Regards,
    Srinivas Thalam
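
To put the sizes in this post side by side, here is a small sketch that only does arithmetic on values already quoted in the thread: the 2688-pixel stride and 1944 rows from the original code, the `-heap 0x70000` setting, and the largest range in the posted MEMORY block (MSMCSRAM, `len = 0x7fff00`). The names are illustrative, and the count of four buffers is taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define IMG_HEIGHT          1944ULL      /* rows, from the original post */
#define OUT_STRIDE          2688ULL      /* pixels per padded row, from the original post */
#define NUM_BUFFERS         4ULL         /* input, output, and two working buffers */
#define HEAP_BYTES          0x70000ULL   /* -heap value in the posted lnk.cmd */
#define LARGEST_RANGE_BYTES 0x7fff00ULL  /* MSMCSRAM length in the posted lnk.cmd */

/* Size of one 16-bit image buffer with padded rows. */
static uint64_t buffer_bytes(void)
{
    return OUT_STRIDE * IMG_HEIGHT * sizeof(uint16_t);
}

/* Total size of all buffers the algorithm needs. */
static uint64_t total_bytes(void)
{
    return NUM_BUFFERS * buffer_bytes();
}

/* Returns 1 if `needed` bytes fit into a region of `available` bytes. */
static int fits_in(uint64_t needed, uint64_t available)
{
    assert(available > 0);
    return needed <= available;
}
```

One buffer is 10,450,944 bytes and four are 41,803,776 bytes, while the heap is 458,752 bytes. By this arithmetic a single buffer fits neither in the heap nor in any one memory range listed in that lnk.cmd, which is consistent with only about 440 KB being readable.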

  • target failed to read 0x000.

    This looks like a known issue that existed with earlier versions of CCS. CCS 12.4 is fairly old. Could you try updating your version? CCS 20.3.0 is the latest, but it is quite a change since the IDE framework was updated. Ideally you would move to this version, since we no longer support CCS 12.x. However, you can try updating to CCS 12.8.1 if you do not want to move to the new IDE framework.

    I am using both versions 12.4 and 12.8.1, and I got the same error in both. I am able to use only 448 KB of L2SRAM (-heap 0x70000 // 448 kB). I also installed version 20.1.1 but could not get it working properly; it is quite different from version 12.

    Do we need to increase the memory? In both versions we are able to use only a limited amount of memory.

    Thanks & Regards,
    Srinivas Thalam

  • Srinivas,

    I think only 448KB of L2 RAM is available on C7x. 

    Regards,

    Brijesh

  • Hi Brijesh,

    Thank you for confirming.

    I used external memory and was able to process my image and generate the output in 1.6370 sec, which is unusually long.
    Which memory region do you suggest I load my 9.6 MB raw image into to reduce the time?
    How and where can I allocate (2592 * 1944 * 2) bytes of space for loading my image?

    Thanks & Regards,
    Srinivas Thalam

  • Hi,

    You would need to use DDR for such a large image. However, you could probably process the image in parts: copy a small part of the image into internal memory, process it, write the output back to DDR, and do the same for the rest of the image.

    Regards,

    Brijesh
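
The tile-by-tile approach described above can be sketched roughly as follows. This is a minimal host-runnable illustration, not the actual C7x implementation: `TILE_ROWS`, `process_tile`, and `process_image_in_tiles` are hypothetical names, the per-pixel operation is a placeholder for the real algorithm, and plain `memcpy` stands in for whatever transfer mechanism (for example DMA) would be used on the target. On the device, the `tile` array would also need to be placed in on-chip memory through the linker command file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_ROWS  32u    /* assumption: chosen so one tile fits in on-chip RAM */
#define MAX_STRIDE 2688u  /* row stride in pixels, from the original post */

/* Scratch tile: 32 * 2688 * 2 bytes = 172,032 bytes. */
static uint16_t tile[TILE_ROWS * MAX_STRIDE];

/* Placeholder for the real algorithm: halves every valid pixel in place. */
static void process_tile(uint16_t *buf, uint32_t rows, uint32_t width,
                         uint32_t stride)
{
    for (uint32_t r = 0; r < rows; r++)
        for (uint32_t c = 0; c < width; c++)
            buf[(size_t)r * stride + c] >>= 1;
}

/* Walks the image in blocks of TILE_ROWS rows: copy in, process, copy out. */
static void process_image_in_tiles(const uint16_t *src, uint16_t *dst,
                                   uint32_t width, uint32_t height,
                                   uint32_t stride)
{
    assert(width <= stride && stride <= MAX_STRIDE);

    for (uint32_t row = 0; row < height; row += TILE_ROWS) {
        /* The last tile may be shorter than TILE_ROWS. */
        uint32_t rows = (height - row < TILE_ROWS) ? (height - row) : TILE_ROWS;
        size_t bytes = (size_t)rows * stride * sizeof(uint16_t);

        memcpy(tile, &src[(size_t)row * stride], bytes);  /* large memory -> tile */
        process_tile(tile, rows, width, stride);          /* work on the tile */
        memcpy(&dst[(size_t)row * stride], tile, bytes);  /* tile -> large memory */
    }
}
```

For the image in this thread the call would be `process_image_in_tiles(input, outputBayer, 2592, 1944, 2688)`. An algorithm that needs neighbouring rows would have to copy a few overlapping rows per tile, which this sketch does not do.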