DM8127 DSP use case (tristream): algorithm porting problem


Hi, I'm developing on the DM8127 IPNC with IPNC_RDK 2.0.
My ultimate goal is to port an algorithm to the DSP core.
The default setting in multich_usecase_dsp.c is tristream: 1080p60 (H.264) + D1 30p (H.264) + 1080p5 (MJPEG).
My algorithm needs 1024x768 at 30 fps and uses only the Y (luma) data.
I found two OSD examples of this kind of use case. (The frame buffer of the 1080p60 stream is modified, so the 1080p H.264 stream and the 1080p MJPEG stream both show a logo in the top-left corner.)

[A]. "swosd" in "multich_tristream_fullFeature.c" (does not use the DSP core)

[B]. "sw_osd" using the DSP,
in the OsdLink_algProcessData(...) function (osdLink_alg.c)


So I modified the [B] code as shown below.

#if 0 // original code (the DSP core just memcpy's the Y/UV logo buffers)
for (i = 0; i < 64; i++)
{
    memcpy((void*)((UInt32)pFrame->addr[0][0] + 720 * i), &TILogo_Y_160_64[160 * i], 160);
}
for (i = 0; i < 32; i++)
{
    memcpy((void*)((UInt32)pFrame->addr[0][1] + 720 * i), &TILogo_UV_160_64[160 * i], 160);
}
#else
unsigned char* ucPtr; // Y (luma) plane only
ucPtr = (unsigned char*)pFrame->addr[0][0];

// Convert a 720x480 part of the 1920x1080 full frame to black or white.
for (i = 0; i < 720; i++)
{
    for (j = 0; j < 480; j++)
    {
        if (ucPtr[i*1920 + j] < 128)
            ucPtr[i*1920 + j] = 10;
        else
            ucPtr[i*1920 + j] = 245;
    }
}
#endif

This code is simple image processing on part of the full frame (binarization).
I expected such a simple algorithm to need little CPU (DSP) power, but with this change the frame rate drops from 60 fps to 17 fps.
I don't know why it causes so much delay.
Thank you.
  • In the first case, the CPU/DSP reads and writes 64*160 + 32*160 = 15360 bytes.

    In the second case, the CPU/DSP reads and writes 720*480 = 345600 bytes, and it also evaluates a condition for every byte.

    Check the DSP compiler settings. The second-case code can be optimized by using some intrinsics.

  • In reply to Venugopala:

    Try this  (I have not compiled it though):

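    unsigned char *ucPtr;   /* Y (luma) plane only */
    unsigned int inWord, bits, cbits;
    unsigned int greaterWord, lesserWord, outputWord;
    unsigned int *wordPtr;
    int i, j;

    /* Each constant packs the same byte value four times. */
    /* 127 is used so that "byte > 127" matches the original "byte >= 128". */
    unsigned int compareWord = (127u << 24) | (127u << 16) | (127u << 8) | 127u;
    unsigned int mask1 = (10u << 24) | (10u << 16) | (10u << 8) | 10u;
    unsigned int mask2 = (245u << 24) | (245u << 16) | (245u << 8) | 245u;

    ucPtr = (unsigned char *)pFrame->addr[0][0];

    /* Same region as the original loop: 720 lines, 480 bytes per line, */
    /* inside the 1920x1080 full frame. Four pixels are handled per word. */
    for (i = 0; i < 720; i++)
    {
        wordPtr = (unsigned int *)(ucPtr + i * 1920);
        #pragma MUST_ITERATE(120, 120, 4)
        for (j = 0; j < 480 / 4; j++)
        {
            inWord = _amem4(&wordPtr[j]);          /* aligned 4-byte load */
            bits = _cmpgtu4(inWord, compareWord);  /* one bit per byte: byte > 127 */
            cbits = bits ^ 0xF;                    /* complement: byte < 128 */

            greaterWord = _xpnd4(bits) & mask2;    /* 245 where byte >= 128 */
            lesserWord = _xpnd4(cbits) & mask1;    /* 10 where byte < 128 */

            outputWord = _add4(greaterWord, lesserWord);
            _amem4(&wordPtr[j]) = outputWord;      /* aligned 4-byte store */
        }
    }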
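For anyone who wants to check the packed logic on a PC before touching the DSP build, here is a minimal portable sketch of what the two key intrinsics are doing, as I understand them: `_cmpgtu4` compares four unsigned bytes at once and returns one bit per byte, and `_xpnd4` expands each of those bits into a full 0x00/0xFF byte mask. The names `cmpgtu4_ref`, `xpnd4_ref` and `threshold4_ref` are made up for illustration and are not TI APIs. The sketch compares against 127 so that a byte equal to 128 maps to 245, which matches the `< 128` test in the plain C loop.

```c
#include <stdint.h>

/* Portable stand-in for _cmpgtu4 (hypothetical helper, not a TI API):
 * bit k of the result is set when byte k of a is greater than byte k of b. */
unsigned cmpgtu4_ref(uint32_t a, uint32_t b)
{
    unsigned bits = 0;
    int k;
    for (k = 0; k < 4; k++) {
        uint32_t ab = (a >> (8 * k)) & 0xFFu;
        uint32_t bb = (b >> (8 * k)) & 0xFFu;
        if (ab > bb)
            bits |= 1u << k;
    }
    return bits;
}

/* Portable stand-in for _xpnd4 (hypothetical helper, not a TI API):
 * bit k of bits becomes 0xFF (set) or 0x00 (clear) in byte k of the result. */
uint32_t xpnd4_ref(unsigned bits)
{
    uint32_t out = 0;
    int k;
    for (k = 0; k < 4; k++) {
        if (bits & (1u << k))
            out |= (uint32_t)0xFFu << (8 * k);
    }
    return out;
}

/* Threshold four packed pixels: byte < 128 becomes 10, otherwise 245. */
uint32_t threshold4_ref(uint32_t in_word)
{
    const uint32_t cmp  = 0x7F7F7F7Fu;  /* "> 127" is the same test as ">= 128" */
    const uint32_t low  = 0x0A0A0A0Au;  /* 10 in every byte */
    const uint32_t high = 0xF5F5F5F5u;  /* 245 in every byte */
    unsigned bits = cmpgtu4_ref(in_word, cmp);

    return (xpnd4_ref(bits) & high) | (xpnd4_ref(bits ^ 0xFu) & low);
}
```

For example, the input word 0x007F80FF holds the bytes 0xFF, 0x80, 0x7F, 0x00 (lowest byte first), and `threshold4_ref` turns it into 0x0A0AF5F5.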

  • Daffiko,

    Can you measure the time the algorithm takes to process a frame?

    The slowdown could have several causes:

    1. Algorithm execution latency

    2. DDR read latency (not enough available bandwidth)

    Also, do you need a 60 fps frame rate in your system? I believe your algorithm use case needs only a 30 fps stream, so you may consider running the system capture at 30 fps. This will further improve your DDR bandwidth availability.

    Regards

    Rajat

  • In reply to Venugopala:

    Thank you, Venugopala.

    I tried your code, but a compile error occurred.

    I think your code is optimized with low-level DSP code and should speed up the original code.

    Unfortunately, I'm not familiar with low-level DSP code. (I will learn about DSP-side optimization methods.)

    I will proceed in this order:

    1. Modify the McFW link chain further (camera capture settings, link tree, ...; TriStream does not suit me).

    2. Try a DSP wrapper library? (c6ezaccel, c6ezrun... I don't know whether these stubs would help in McFW.)

    3. Once the DSP delay problem is resolved, port my algorithm (myAlg_lnk.c, myAlg_priv.h, myAlg_tsk.c, myAlg_alg.c ..... [McFW method]).

    4. Convert the C code to DSP-optimized code (planned).

    thank you!

  • In reply to Rajat Sagar:

    Thank you, Rajat Sagar.

    My McFW web setting: 1920x1080 H.264 30fps

    I measured the elapsed time with the Clock_getTicks() function

    in the OsdLink_algProcessData() function (mcfw/src_bios6/links_c6xdsp/swosd/osdLink_alg.c).

    -------------------------------------------------------------------------

    elapsedDSP_Start = Clock_getTicks();
    for (i = 0; i < 240; i++) {
        for (j = 0; j < 320; j++) {
            if (ucPtr[i*1920 + j] < 168)
                ucPtr[i*1920 + j] = 10;
            else
                ucPtr[i*1920 + j] = 245;
        }
    }
    elapsedDSP_End = Clock_getTicks();
    calcMyAlg = elapsedDSP_End - elapsedDSP_Start;
    -------------------------------------------------------------------------
    
    
    Each loop iteration has one "if" and one memory write.


    1) calcMyAlg result: 12 ms for 320x240
    2) calcMyAlg result: 24 ms for 320x480
    3) calcMyAlg result: 48 ms for 640x480


    The third setting causes the delay problem.
    I think the problem is that the algorithm takes more than 33 ms per frame (at 30 fps).
    If the elapsed-time test above is right, why is the time consumption so large?


    At 1024x768 resolution it would take about 122.88 ms, I think.
    Is the DSP slower for general C code?
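To make the arithmetic behind these numbers explicit, here is a small sketch in plain C. It assumes the measured time scales linearly with the pixel count, that one Clock_getTicks() tick is 1 ms (which is how the results above are read), and that the DSP runs at the 400 MHz shown in the boot log. The function names are only for illustration.

```c
/* Scale a measured processing time linearly with the pixel count
 * (assumption: cost per pixel stays constant). */
double scale_time_ms(double ref_ms, int ref_w, int ref_h, int w, int h)
{
    return ref_ms * ((double)w * h) / ((double)ref_w * ref_h);
}

/* Time available for one frame at a given frame rate. */
double frame_budget_ms(double fps)
{
    return 1000.0 / fps;
}

/* Average number of DSP clock cycles spent per pixel for a measured time. */
double cycles_per_pixel(double ms, double clock_hz, int w, int h)
{
    return (ms / 1000.0) * clock_hz / ((double)w * h);
}
```

With the measurements above, `scale_time_ms(48.0, 640, 480, 1024, 768)` gives 122.88 ms against a budget of about 33.3 ms at 30 fps, and `cycles_per_pixel(12.0, 400e6, 320, 240)` gives 62.5 cycles per pixel. That is far more than one compare and one store should need, which suggests the time is going into memory access and not into computation.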
    
    
    Thank you.
    
    
    
    
    [Reference] My boot log.
    
    
    U-Boot 2010.06 (Feb 29 2012 - 16:01:47) DM812x_IPNC_2.00.00

    TI8148-GP rev 2.1

    ARM clk: 600MHz
    DDR clk: 400MHz
    IVA clk: 450MHz
    ISS clk: 400MHz
    DSP clk: 400MHz
    DSS clk: 200MHz

    DRAM: 512 MiB
    NAND: HW ECC Hamming Code selected
    256 MiB
    Using default environment

    The 2nd stage U-Boot will now be auto-loaded
    Please do not interrupt the countdown till TI8148_EVM prompt if 2nd stage is already flashed
    Hit any key to stop autoboot: 0

    NAND read: device 0 offset 0x20000, size 0x40000
    262144 bytes read: OK
    ## Starting application at 0x81000000 ...


    U-Boot 2010.06 (Feb 29 2012 - 16:02:12) DM812x_IPNC_2.00.00

    TI8148-GP rev 2.1

    ARM clk: 600MHz
    DDR clk: 400MHz
    IVA clk: 450MHz
    ISS clk: 400MHz
    DSP clk: 400MHz
    DSS clk: 200MHz

    I2C: ready
    DRAM: 512 MiB
    NAND: HW ECC Hamming Code selected
    256 MiB
    MMC: OMAP SD/MMC: 0, ON-BOARD SDIO: 1

  • In reply to Daffiko:

    Hi Daffiko,

    I think ucPtr[i*1920 + j] points into the image frame, right? If so, the long processing time is probably because the frame buffer is not cached by the DSP, so individual pixel-level accesses are very costly.
    Can you change the algorithm so that it prefetches one or two lines of the image into a local buffer and then runs the for loop over the pixels in those lines? This should reduce the overall number of DDR reads and improve the processing time.
    The same holds true for the image buffer writes.
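A minimal sketch of the line-at-a-time idea, in plain C so it can be tried on a PC first. It assumes that `line_buf` ends up in memory the DSP can cache or in internal memory (that placement is not shown here and depends on the memory map), and `binarize_by_line` and `LINE_MAX_BYTES` are names made up for this example.

```c
#include <string.h>
#include <stdint.h>

#define LINE_MAX_BYTES 1920  /* widest line this sketch handles */

/* Binarize a w x h region of an 8-bit luma plane, one line at a time.
 * pitch is the distance in bytes between the starts of two lines. */
void binarize_by_line(uint8_t *frame, int pitch, int w, int h, uint8_t thresh)
{
    /* Assumption: this buffer is placed in cached or internal DSP memory. */
    static uint8_t line_buf[LINE_MAX_BYTES];
    int x, y;

    if (w <= 0 || w > LINE_MAX_BYTES)
        return;

    for (y = 0; y < h; y++) {
        uint8_t *row = frame + (size_t)y * pitch;

        memcpy(line_buf, row, (size_t)w);   /* one bulk read from the frame */
        for (x = 0; x < w; x++)             /* per-pixel work on the local copy */
            line_buf[x] = (line_buf[x] < thresh) ? 10 : 245;
        memcpy(row, line_buf, (size_t)w);   /* one bulk write back to the frame */
    }
}
```

The point of the structure is that the frame buffer is only touched by two bulk copies per line, and all per-pixel reads and writes go to the local buffer.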

    To validate the theory, you could pass a pointer to a dummy variable instead of the image buffer. Make sure the dummy variable is allocated from the local heap. This will give you the real processing time taken by the DSP.

    Also, you can get a quick gain by increasing the DSP clock from 400 MHz to, say, 800 MHz.

    Regards
    Rajat
  • In reply to Rajat Sagar:

    Hi Rajat Sagar,

    You are right. I measured two test cases in the OsdLink_algProcessData(..) function.


    unsigned char* ucPtr;
    ucPtr = (unsigned char*)pFrame->addr[0][0];

    // mcfw/src_bios6/links_c6xdsp/swosd/osdLogo.c (10240-byte unsigned char Y-plane logo)
    extern unsigned char TILogo_Y_160_64[ ];

    // Case 1: individual pixel-level accesses to the frame buffer
    for (i = 0; i < 240; i++) {
        for (j = 0; j < 10240; j++) {
            if (ucPtr[j] < 168)
                tmpAval++; // some tiny code
            else
                tmpBval++;
        }
    }
    ==> It takes 240 ms

    // Case 2: DSP-owned memory access (TILogo_Y_160_64[] is a static global variable)
    for (i = 0; i < 240; i++) {
        for (j = 0; j < 10240; j++) {
            if (TILogo_Y_160_64[j] < 168)
                tmpAval++;
            else
                tmpBval++;
        }
    }
    ==> It takes 6 ms

    That is 40x faster, so I designed it like this:

    1. memcpy from the image frame to local memory

    2. run the algorithm on the DSP-cached memory

    3. memcpy the processed data from local memory back to the image frame


    I tried this:

    unsigned char myProcImage[1920*1080]; // declared as a global variable in the DSP-side code
    Int32 OsdLink_algProcessData(OsdLink_Obj * pObj)
    {
        ...

        unsigned char* ucPtr;
        ucPtr = (unsigned char*)pFrame->addr[0][0];

        ...

        memcpy(myProcImage, ucPtr, 1920*1080);

        for (i = 0; i < 1920*1080; i++) {
            if (myProcImage[i] < 168)
                myProcImage[i] = 0;
            else
                myProcImage[i] = 255;
        }

        memcpy(ucPtr, myProcImage, 1920*1080);
    }

    The for loops are significantly faster than before,
    but two problems occurred.

    1) memcpy takes too much time.
    --> How can I cache the image frame (Shared Region?) in DSP (local) memory?

    2) The size of the global variable "myProcImage" is limited.
    --> If I change the myProcImage size to 1920*1080*2, it works fine.
    But if the size is 1920*1080*3, the build fails at the link step. The error log looks like this:
    [Error Log]

    ...
    ...

    /home/user27/DevNow/ipnc/ti_tools/cgt6x_7_3_2/bin/lnk6x --warn_sections -q --silicon_version=6740 -c --dynamic -o2 -x --zero_init=off --retain=_Ipc_ResetVector /home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/obj/ti814x-evm/c6xdsp/release/main_c6xdsp.oe674 /home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/obj/ti814x-evm/c6xdsp/release/MAIN_APP_c6xdsp_pe674.oe674 /home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/obj/ti814x-evm/c6xdsp/release/ipnc_rdk_configuro/linker_mod.cmd -o /home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/bin/ti814x-evm/ipnc_rdk_c6xdsp_release.xe674 -m /home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/bin/ti814x-evm/ipnc_rdk_c6xdsp_release.xe674.map -l/home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/mcfw/src_bios6/lib/ti814x-evm/c6xdsp/release/ipnc_rdk_bios6.ae674 -l/home/user27/DevNow/ipnc/ti_tools/cgt6x_7_3_2/lib/rts6740_elf.lib -l/home/user27/DevNow/ipnc/ti_tools/framework_components_3_21_02_32/packages/ti/sdo/fc/ecpy/lib/debug/ecpy.ae674
    "/home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/obj/ti814x-evm/c6xdsp/release/ipnc_rdk_configuro/linker_mod.cmd", line 392: error:
    run placement fails for object ".systemHeap", size 0x200007 (page 0).
    Available ranges:
    DDR3_DSP size: 0x800000 unused: 0xc4888 max hole: 0xc4690
    warning: entry-point symbol other than "_c_int00" specified:
    "ti_sysbios_family_c64p_Hwi0"
    error: errors encountered during linking;
    "/home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/bin/ti814x-evm/
    ipnc_rdk_c6xdsp_release.xe674" not built
    make[2]: *** [/home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/build/ipnc_rdk/bin/ti814x-evm/ipnc_rdk_c6xdsp_release.xe674] error 1
    make[2]: Leaving directory `/home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/mcfw/src_bios6/main_app'
    make[1]: *** [apps] error 2
    make[1]: Leaving directory `/home/user27/DevNow/ipnc/ipnc_rdk/ipnc_mcfw/mcfw/src_bios6'
    make: *** [mcfw_bios6] error 2
    user27@user27-ubuntu32:~/DevNow/ipnc/ipnc_rdk$



    I think this problem is caused by the memory map configuration (maybe the static memory allocation overflows).
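The numbers in the linker error can be checked with a little arithmetic. The sketch below takes the three hex values straight from the log and assumes that "unused" is the space left in DDR3_DSP after everything except .systemHeap was placed, and that the other sections keep the same size when the buffer shrinks. It ignores alignment padding, so it is only a rough budget check, and the function names are made up for this example.

```c
#include <stdint.h>

/* Values copied from the linker error log. */
#define DDR3_DSP_SIZE   0x800000u  /* size of the DDR3_DSP region (8 MiB) */
#define UNUSED_AT_FAIL  0xc4888u   /* "unused" reported when the link failed */
#define SYSTEM_HEAP     0x200007u  /* size of .systemHeap that could not be placed */

/* Bytes in one full 1920x1080 8-bit plane. */
uint32_t plane_bytes(void)
{
    return 1920u * 1080u;
}

/* Bytes taken by everything except the image buffer and .systemHeap,
 * inferred from the failing link, where the buffer held 3 planes. */
uint32_t other_sections_bytes(void)
{
    return DDR3_DSP_SIZE - UNUSED_AT_FAIL - 3u * plane_bytes();
}

/* Rough check: does a buffer of n_planes fit next to the heap and the rest? */
int buffer_fits(uint32_t n_planes)
{
    uint32_t need = n_planes * plane_bytes()
                  + other_sections_bytes()
                  + SYSTEM_HEAP;
    return need <= DDR3_DSP_SIZE;
}
```

Under these assumptions `buffer_fits(2)` is true and `buffer_fits(3)` is false, which matches what was observed: the 8 MiB DSP data region simply has no room left for the roughly 2 MiB system heap once a 3-plane buffer is placed in it.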

    I don't know the ARM-DSP memory sharing or transfer interface, so I would like to understand
    the memory access interface from the fundamentals.

    Please advise.
    Thank you for your answer!



  • In reply to Daffiko:

    (1) memcpy is very inefficient. Please use EDMA copy to copy the frame buffer.

    (2) Instead of copying the full frame, can you try copying a few lines at a time? You will need to adjust the for loops accordingly. This will actually be more efficient than copying the full frame and then processing it. Another optimization is to pipeline the EDMA copy and the process function, so that while EDMA is copying the next few lines, your algorithm processes the previously copied lines.
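The pipelining in (2) is usually done with two local buffers (ping-pong). Below is a rough sketch of the control flow only. `dma_start` and `dma_wait` are hypothetical placeholders, not the RDK's EDMA API: here they copy synchronously with memcpy so the sketch runs on a PC, and in real code they would submit an EDMA transfer and wait for its completion. `CHUNK_LINES` and `MAX_W` are also arbitrary values chosen for the example.

```c
#include <string.h>
#include <stdint.h>

#define CHUNK_LINES 4     /* lines fetched per transfer (arbitrary choice) */
#define MAX_W       1920  /* widest line this sketch handles */

/* Hypothetical placeholder for starting an asynchronous 2D copy.
 * Here it copies immediately with memcpy. */
void dma_start(uint8_t *dst, int dst_pitch,
               const uint8_t *src, int src_pitch, int w, int lines)
{
    int y;
    for (y = 0; y < lines; y++)
        memcpy(dst + (size_t)y * dst_pitch, src + (size_t)y * src_pitch, (size_t)w);
}

/* Hypothetical placeholder: real code would block until outstanding
 * transfers have completed. */
void dma_wait(void)
{
}

/* Binarize a w x h region using two local buffers in ping-pong fashion. */
void binarize_pingpong(uint8_t *frame, int pitch, int w, int h, uint8_t thresh)
{
    static uint8_t buf[2][CHUNK_LINES * MAX_W];  /* assumed to be DSP-local memory */
    int cur = 0, y, n, i;

    if (w <= 0 || w > MAX_W || h <= 0)
        return;

    n = (h < CHUNK_LINES) ? h : CHUNK_LINES;
    dma_start(buf[cur], w, frame, pitch, w, n);          /* fetch the first chunk */

    for (y = 0; y < h; y += CHUNK_LINES) {
        int next_y = y + CHUNK_LINES;

        n = (h - y < CHUNK_LINES) ? (h - y) : CHUNK_LINES;
        dma_wait();                                      /* chunk at line y is local now */

        if (next_y < h) {                                /* start fetching the next chunk */
            int nn = (h - next_y < CHUNK_LINES) ? (h - next_y) : CHUNK_LINES;
            dma_start(buf[cur ^ 1], w,
                      frame + (size_t)next_y * pitch, pitch, w, nn);
        }

        for (i = 0; i < n * w; i++)                      /* process the current chunk */
            buf[cur][i] = (buf[cur][i] < thresh) ? 10 : 245;

        dma_start(frame + (size_t)y * pitch, pitch,      /* write the result back */
                  buf[cur], w, w, n);
        dma_wait();
        cur ^= 1;                                        /* swap buffers */
    }
}
```

With a real asynchronous copy, the fetch of the next chunk would run while the inner loop processes the current one, which is where the gain comes from. The write-back here is still waited on immediately to keep the sketch simple.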

    Do you have the IPNC RDK codebase? EDMA copy is implemented in the RDK, so I think you may be able to use that as a reference.

    I hope the explanation is clear.

    Regards

    Rajat

  • In reply to Daffiko:

    If you need a larger DSP data region, you can change the two configuration files below:

    ipnc_rdk/ipnc_mcfw/mcfw/src_bios6/cfg/ti814x/config_dsp.bld

    AND

    ipnc_rdk/ipnc_mcfw/mcfw/src_bios6/cfg/ti814x/config_m3.bld

    I changed SR1_SIZE from 56M to 48M and DSP_DATA_SIZE from 8M to 16M, and it works well. It should solve your link error.

    Both files must be changed in the same way and at the same time! After that, at the top level (ipnc_rdk), run: make all

  • In reply to Rajat Sagar:

    Hi Rajat,

    I am trying to do something similar to what Daffiko is doing in this thread / question. You mention that Daffiko should be able to find an example of an EDMA copy implemented in the DM8127 IPNC RDK. I have attempted to find and understand an example that I can use with the DM8127 example DSP Usecase. But I have searched through the entire DM8127 IPNC RDK and I'm not finding any good example to use, modify, or learn from.

    I've started a separate question / thread showing what I've found in the DM8127 IPNC RDK and why I don't think they can be used with the example DSP Usecase (but I may be wrong).

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/182597.aspx#658353

    Please look at that question / thread and let me know what I'm not understanding. And what API and example EDMA transfer code I should be attempting to use.

    If anyone else has successfully gotten an EDMA memory-to-memory transfer to compile and run with the DM8127 example DSP Usecase, please let me know.

    Thanks!

    Allen