
H.264 encoder with barrier implementation on uncacheable memory

Hi,

I implemented a barrier that operates on uncached memory:

void uncachedOrderedBarrier(volatile int_fast8_t* barrierValue, int_fast8_t userIndex, int_fast8_t isLastUser)
{
    _nassert((int) barrierValue % 8 == 0);
    _nassert(userIndex >= 0);

    //Cache_inv((void*)0x9FFFFF00, 1, Cache_Type_L1D, FALSE);

    while (*barrierValue != userIndex); // Wait for user's turn to enter the barrier

    if (isLastUser) {
        *barrierValue = -userIndex; // Last user enables exit mode of the barrier
    } else {
        *barrierValue = userIndex + 1; // Give turn to next user to enter the barrier

        while (*barrierValue != -userIndex - 1);

        *barrierValue = -userIndex; // Give turn to next user to exit the barrier
    }
}

void MulticoreApi_swbarr(int32_t coreID, int32_t swbarr_id, uint32_t swbarr_cnt)
{
    uncachedOrderedBarrier((int_fast8_t*) &uncachedOrderedBarrierValues[swbarr_id], coreID, swbarr_cnt == coreID + 1);
}
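
For completeness, the barrier values live in a region whose MAR bit disables caching. A minimal sketch of the declaration is below; the section name, array size, and element type are assumptions specific to my project, not part of the codec.

#include <stdint.h>

#define NUM_SW_BARRIERS 4   /* illustrative count */

/* One 8-byte slot per software barrier, placed by the linker in a section
 * mapped to a cache-disabled region, so that the
 * _nassert((int) barrierValue % 8 == 0) in the barrier holds. */
#pragma DATA_SECTION(uncachedOrderedBarrierValues, ".uncachedShared")
#pragma DATA_ALIGN(uncachedOrderedBarrierValues, 8)
int64_t uncachedOrderedBarrierValues[NUM_SW_BARRIERS] = { 0 };

Each of the swbarr_cnt participating cores calls MulticoreApi_swbarr(coreID, swbarr_id, swbarr_cnt) with coreID running 0..swbarr_cnt-1, so the barrier value counts up through the user indices on entry and back down through their negatives on exit.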


The problem is that the H.264 HP encoder (01.00.02.00) produces artifacts (shaking video slices) with this barrier implementation.
I found that the artifacts go away only if I Cache_inv or Cache_wbInv some cache-enabled memory address inside the barrier (see the commented-out Cache_inv in the code above).

When I replace the Cache_inv with a dummy delay loop, with another cache operation such as Cache_wbInvAll, or with a Cache_inv on a cache-disabled memory address, the same artifacts come back; I tried several variants.
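
Roughly, the variants look like the sketch below; the loop count and the cache-disabled test address are illustrative, only the address in variant 1 is the one from my code.

#include <ti/sysbios/hal/Cache.h>

#define UNCACHED_TEST_ADDR 0xA0000000u  /* placeholder: any address whose MAR caching bit is cleared */

/* Variants tried at the top of uncachedOrderedBarrier(); only variant 1
 * removes the artifacts. */
static void barrierCacheExperiment(int variant)
{
    switch (variant) {
    case 1: /* works: invalidate one line at a cache-ENABLED address */
        Cache_inv((void*)0x9FFFFF00, 1, Cache_Type_L1D, FALSE);
        break;
    case 2: { /* does not help: dummy delay loop of comparable duration */
        volatile int i;
        for (i = 0; i < 1000; i++);
        break;
    }
    case 3: /* does not help: global writeback-invalidate */
        Cache_wbInvAll();
        break;
    case 4: /* does not help: invalidate at a cache-DISABLED address */
        Cache_inv((void*)UNCACHED_TEST_ADDR, 1, Cache_Type_L1D, FALSE);
        break;
    }
}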

I have spent a lot of time debugging this issue and do not see any problem in my barrier implementation. Can you explain this behaviour?

Regards,

Andrey Lisnevich

  • Hi Andrey,

    The codec calls the shmmap_sync APIs for cache coherence across cores. These APIs are implemented on the application side; please check whether anything there has changed in connection with this issue. If you have made any changes to the sync APIs, please share them.
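
    For reference, such an application-side sync for a shared cacheable buffer typically wraps the accesses with writeback/invalidate calls, roughly as in the sketch below. This is illustrative only, not the actual shmmap_sync code; the function names are placeholders.

    #include <ti/sysbios/hal/Cache.h>

    /* Writer core: push its cached copy of the shared region out to MSMC/DDR
     * after updating it, so the other cores can see the new contents. */
    static void sharedRegionWritten(void *addr, unsigned int size)
    {
        Cache_wb(addr, size, Cache_Type_ALLD, TRUE);
    }

    /* Reader core: discard its (possibly stale) cached copy of the shared
     * region before reading, so the next access fetches fresh data. */
    static void sharedRegionAboutToRead(void *addr, unsigned int size)
    {
        Cache_inv(addr, size, Cache_Type_ALLD, TRUE);
    }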

    Does it work with ANY cache-enabled memory address, or only with addresses related to or near an input/output buffer?

    Also, please share the encoded output that shows the artifact issue.

    Regards

    Sudheesh

  • Hi Sudheesh,

    It works with ANY cache-enabled address, near or far, MSMC or DDR3.
    I made no changes to the shared memory APIs.

    For me the issue is reproducible when running 4 cores at Full HD resolution.

    Regards,

    Andrey Lisnevich

  • Corrupted output sample attached.

    out0.ts.zip
  • Hi Andrey,

    Based on the output you shared, it looks like the input YUV data is corrupted.

    Only the slices from Cores 1, 2 and 3 are corrupted; the Core 0 slice looks fine.

    The master (Core 0) shares the input data pointers with the slave cores and enters the barrier. The slave cores pick up the updated pointers once they come out of the barrier, using the sync APIs. It looks like this sync failed.

    To confirm that this is the problem, you can disable the cache for all codec shared buffer requests.

    Can you make sure all input buffers and memtabs requested by the codec are aligned to 256 bytes? Please also share your sync API implementation. If the issue is reproducible with the standalone setup you prepared earlier, please share it with us.
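
    For example, a statically allocated buffer handed to the codec can be forced to 256-byte alignment as in the sketch below; the symbol and section names are placeholders, and for heap allocations passing align = 256 to Memory_alloc achieves the same.

    /* Sketch: keep codec-owned buffers 256-byte aligned so that they never
     * share a cache line with unrelated data. */
    #pragma DATA_ALIGN(inputYuvBuffer, 256)
    #pragma DATA_SECTION(inputYuvBuffer, ".ddrShared")
    unsigned char inputYuvBuffer[1920 * 1088 * 3 / 2];   /* one Full HD 4:2:0 frame */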

    Regards

    Rama 

  • Hi Rama,

    You can find a demo that reproduces the issue at the following URL:

    https://drive.google.com/file/d/0Byw88ezNrM71SmJPOWVwNDJBR0U/edit?usp=sharing


    Details about the demo:
    - It runs the H.264 decoder on cores 0-3.
    - The decoder reads the H.264 elementary input from address 0x90000000 (13 MiB) in an infinite loop. The demo input is attached (stream.264). Before starting the demo, load the file at the address above using the Load Memory tool.
    - The decoder sends frames to the H.264 encoder.
    - The H.264 encoder runs on cores 4-7.
    - The encoder writes its output at address 0x98000000.
    - After a reasonable number of frames has been encoded, you can pause the demo and save the generated H.264 elementary stream using the Save Memory tool.
    - The demo prints to the console how many 32-bit words of H.264 data have been generated:
         [C66xx_4] Writing 673 bytes total 824820 words
    - You can see artifacts in the generated output.
    - If you uncomment the following line in TranscodeComponent.c, the artifacts disappear:

    //Cache_inv((void*)0x0C000000, 1, Cache_Type_L1D, FALSE); // TODO: temporary fix - barrier should invalidate something

    I use the following components:
    - XDCtools 3.30.3.47
    - EDMALLD 2.11.11
    - FC 3.30.0.06
    - IPC 3.22.2.11
    - SYS/BIOS 6.40.2.27
    - H.264 decoder 01.01.04.00
    - H.264 encoder 01.00.02.00

    Feel free to ask any questions.

    Regards,

    Andrey Lisnevich

  • Hi Andrey,

    We were able to reproduce the issue with the test setup you provided.

    We will provide further updates as we examine the issue.

    Regards

    Sudheesh

  • Hi Andrey,

    We are still debugging the issue. There is no major update yet.

    Regards

    Sudheesh

  • Hi Sudheesh,

    Have you found the root cause of this strange issue? Is it in our integration code or in the TI codec?


    Regards,

    Andrey Lisnevich

  • Hi Andrey,

    We have not been able to debug this issue much further, as we have been tied up with other priorities.

    We will give you an update early next week.

    Thanks and regards

    Sudheesh

  • Hi Sudheesh, I was wondering if you have had a chance to check this issue.

    thanks for your help,

    Paula 

  • Hi Paula/Andrey,

    I am currently looking into this issue.

    In the first few process calls, the input buffer pointer is not getting updated on the slave cores, even though the master properly updates those pointers and shares them across the cores.

    It looks like the shared memory is also updated by the slave cores, thereby causing the cache issue described above.
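
    In other words, the suspected sequence is roughly the one in the sketch below. This is illustrative only; SharedCtx and the function names are placeholders, not the codec's real structures.

    #include <ti/sysbios/hal/Cache.h>

    typedef struct {
        unsigned char *inBufPtr;   /* input frame pointer shared with slave cores */
        /* ... other per-frame parameters ... */
    } SharedCtx;

    /* Master (Core 0): publish the new input pointer, then enter the barrier. */
    static void masterPublishFrame(SharedCtx *ctx, unsigned char *frame)
    {
        ctx->inBufPtr = frame;
        Cache_wb(ctx, sizeof(*ctx), Cache_Type_ALLD, TRUE);   /* push to MSMC/DDR */
    }

    /* Slave (Cores 1..3): after leaving the barrier, invalidate before reading.
     * If the slave has ALSO written into the same cache line, that line is
     * dirty in the slave's cache, and a later victim writeback can overwrite
     * the master's update in shared memory with stale data - which would match
     * the corruption seen only on the slave-core slices. */
    static unsigned char *slaveFetchFrame(SharedCtx *ctx)
    {
        Cache_inv(ctx, sizeof(*ctx), Cache_Type_ALLD, TRUE);
        return ctx->inBufPtr;
    }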

    Thanks and regards

    Sudheesh

  • Hi Andrey,


    The cache issue is solved with the attached library when I checked it with your application.

    Could you please try the attached library at your end (please change the file extension)?

    Regards

    Sudheesh

    8080.h264hpvenc_ti.txt

  • Thanks Sudheesh,

    It works now, and I can test which barrier implementation is best.

    Regards,

    Andrey Lisnevich