
Transcoding multiple streams on multiple cores slowdown

Hello,

I am creating a transcoding application that transcodes input MPEG-2 video streams into H.264 output. I am using the latest versions of the MPEG-2 and H.264 BP codecs for the C66x platform (target hardware is DSPC-8681).

I use PCIe mapped buffers to receive the input MPEG-2 stream from the host PC and to send the transcoded H.264 stream back to it.

When I run transcoding on a single core, it is able to transcode a 708x576 stream in real time. But when I run transcoding tasks on the other cores too, transcoding slows down and a single core is no longer able to handle 708x576 transcoding in real time. The cores do not interact (no IPC).

Do you know what causes the slowdown?

  • Hi Andriy,

    Are you configuring the DSPs to run at 1.25 GHz, or are they at 1 GHz? Can you please post the configuration for the encoder? What bitrate/profile are the encoders running at? Regarding your observation: when all the cores are running concurrently, there is contention at the DDR interface. DDR has the bandwidth to serve all cores when the requests are staggered, but when the requests from the cores are bursty, this contention arises and the cores have to wait for EDMA transfers to finish because they are competing with the other cores.

    Regards,

    Vivek

  • Hi Vivek,

    In our transcoder we do not use caching for now.

    We do the following without EDMA:

    - Copy MPEG2 frames from the shared memory region allocated by the PCIe driver (we don't know where it is located) to DDR3 memory.

    - Copy transcoded H.264 frames from DDR3 memory to the shared memory region allocated by the PCIe driver.

    All other memory operations are done within the H264BP codec using the configured DSKT2, RMAN, and EDMA3.

    If the bottleneck is in EDMA how can we overcome it?

    Configuration looks like:

    width = 704;
    height = 576;
    frameRate = 25;

    transcoderConfiguration.outputBitRate = 512000;
    transcoderConfiguration.intraFrameInterval = 30;
    transcoderConfiguration.maxMBsPerSlice = 8160;
    transcoderConfiguration.levelIdc = 40;
    transcoderConfiguration.forceIFrame = 0;
    transcoderConfiguration.qpIntra = 28;
    transcoderConfiguration.qpInter = 28;
    transcoderConfiguration.qpMax = 51;
    transcoderConfiguration.qpMin = 0;
    transcoderConfiguration.maxBytesPerSlice = 0;
    transcoderConfiguration.intra4x4EnableIdc = 1;
    transcoderConfiguration.constrainedIntraPredEnable = 0;
    transcoderConfiguration.picOrderCountType = 0;
    transcoderConfiguration.maxMVperMB = 4;
    transcoderConfiguration.lfDisableIdc = 0;
    transcoderConfiguration.quartPelDisable = 0;
    transcoderConfiguration.mvDataEnable = 0;
    transcoderConfiguration.airMbPeriod = 0;
    transcoderConfiguration.hierCodingEnable = 0;
    transcoderConfiguration.intraRefreshMethod = 0;
    transcoderConfiguration.Intra_QP_modulation = 1;
    transcoderConfiguration.rateControlPreset = 1;
    transcoderConfiguration.rcAlgo = 1;
    transcoderConfiguration.idrEnable = 1;

    params.profileIdc = 66; // Profile IDC (66=baseline, 77=main, 88=extended)
    params.levelIdc = (IH264VENC_Level) transcoderConfiguration->levelIdc; // Level IDC
    params.searchRange = 64; // Max search range
    params.rcAlgo = transcoderConfiguration->rcAlgo; // Algorithm to be used by rate control scheme. Valid values are 0 (DCES_TM5) and 1(PLR). It is useful only when rateControlPreset is IVIDEO_USER_DEFINED

    params.videncParams.encodingPreset = 3;
    params.videncParams.rateControlPreset = transcoderConfiguration->rateControlPreset; // Enable
    params.videncParams.inputChromaFormat = XDM_YUV_420P;
    params.videncParams.dataEndianness = XDM_BYTE;
    params.videncParams.maxInterFrameInterval = 0;
    params.videncParams.inputContentType = IVIDEO_PROGRESSIVE;
    params.videncParams.maxFrameRate = transcoderConfiguration->outputFrameRate * 1000;

    dynamicParams.qpIntra = transcoderConfiguration->qpIntra; // initial QP of I frames Range[-1,51]. -1 is for auto initialization
    dynamicParams.qpInter = transcoderConfiguration->qpInter; // initial QP of P frames Range[-1,51]. -1 is for  auto initialization.
    dynamicParams.qpMax = transcoderConfiguration->qpMax; // Max Quantization parameter
    dynamicParams.qpMin = transcoderConfiguration->qpMin; // Min Quantization parameter
    dynamicParams.maxMBsPerSlice = transcoderConfiguration->maxMBsPerSlice;
    dynamicParams.maxBytesPerSlice = transcoderConfiguration->maxBytesPerSlice; // Maximum number of bytes in a slice
    dynamicParams.sliceRefreshRowStartNumber = 0; // Start row number for intra slice
    dynamicParams.sliceRefreshRowNumber = 0; // Number of rows to be intra coded
    dynamicParams.filterOffsetA = 0; // alpha offset for loop filter [-12, 12] even number
    dynamicParams.filterOffsetB = 0; // beta offset for loop filter [-12, 12] even number
    dynamicParams.intra4x4EnableIdc = transcoderConfiguration->intra4x4EnableIdc; // H.264 Encoder Slice level Control for Intra4x4 Modes
    dynamicParams.pfNalUnitCallBack = NULL; // A function pointer
    dynamicParams.streamFormat = IH264_BYTE_STREAM; // IH264_BYTE_STREAM = 0,  IH264_NALU_STREAM = 1 (only regarded when pfNalUnitCallBack is non-zero)
    dynamicParams.log2MaxFNumMinus4 = 0; // Sets log2_max_frame_num_minus4 [0,12]
    dynamicParams.chromaQPIndexOffset = 0; // Valid value [-12,12] -> default 0, index into mapping table of luma to chroma QP
    dynamicParams.constrainedIntraPredEnable = transcoderConfiguration->constrainedIntraPredEnable; // Enable/Disable constraint Intra Pred
    dynamicParams.picOrderCountType = transcoderConfiguration->picOrderCountType; // Sets picture order cnt type Valid values -> 0 and 2, 2 is recommended for base profile
    dynamicParams.maxMVperMB = transcoderConfiguration->maxMVperMB; // Maximum MV per MB (Values of 1 & 4 are valid)
    dynamicParams.lfDisableIdc = transcoderConfiguration->lfDisableIdc; // Controls enable/disable loop filter
    dynamicParams.quartPelDisable = transcoderConfiguration->quartPelDisable; // Enable/Disable Quarter Pel=>1: Only Half Pel 0: Both Half & Quarter Pel
    dynamicParams.mvDataEnable = transcoderConfiguration->mvDataEnable; // Enable/Disable exposure of MV data
    dynamicParams.airMbPeriod = transcoderConfiguration->airMbPeriod; // Adaptive intra refresh period ( 0 means: no effect)
    dynamicParams.hierCodingEnable = transcoderConfiguration->hierCodingEnable; // Enable/Disable Hierarchical P frame encoding
    dynamicParams.intraRefreshMethod = transcoderConfiguration->intraRefreshMethod; // Mechanism to do intra Refresh
    dynamicParams.Intra_QP_modulation = transcoderConfiguration->Intra_QP_modulation; // Intra frame QP modulation 1 ON : 0 OFF
    dynamicParams.Max_delay = 3; // Rate control delay in steps of 1/30 sec
    dynamicParams.numSliceGroups = 0; // Number of Slice Groups Minus 1, 0 == no FMO, 1 == two slice groups, etc.(is <= for type 2 FMO)
    dynamicParams.sliceGroupMapType = 0; // 0:  Interleave, 2: Foreground with left-over, # 4: Raster Scan
    dynamicParams.sliceGroupChangeDirectionFlag = 0; // 0: raster scan (relevant to type 4 only # 1: reverse raster scan (relevant to type 4 only)
    dynamicParams.sliceGroupChangeRate = 0; // (relevant to type 4 only - refer standard for expln)
    dynamicParams.sliceGroupChangeCycle = 0; // (relevant to type 4 only - refer standard for expln)
    // dynamicParams.sliceGroupParams // Zeros
    dynamicParams.numSliceASO = 0; // (0 == ASO absent) (>0 => ASO present && Specifies the dimension of asoSliceOrder);
    // dynamicParams.asoSliceOrder // Zeros
    dynamicParams.top_slice_line = 0;
    dynamicParams.bottom_slice_line = 0;
    dynamicParams.idrEnable = transcoderConfiguration->idrEnable; // Flag to make all I-frames IDR

    dynamicParams.videncDynamicParams.targetFrameRate = transcoderConfiguration->outputFrameRate * 1000;
    dynamicParams.videncDynamicParams.refFrameRate = transcoderConfiguration->outputFrameRate * 1000;
    dynamicParams.videncDynamicParams.intraFrameInterval = transcoderConfiguration->intraFrameInterval;
    dynamicParams.videncDynamicParams.inputWidth = transcoderConfiguration->outputWidth;
    dynamicParams.videncDynamicParams.inputHeight = transcoderConfiguration->outputHeight;
    dynamicParams.videncDynamicParams.targetBitRate = transcoderConfiguration->outputBitRate;
    dynamicParams.videncDynamicParams.generateHeader = XDM_ENCODE_AU;
    dynamicParams.videncDynamicParams.captureWidth = 0;
    dynamicParams.videncDynamicParams.forceIFrame = transcoderConfiguration->forceIFrame;

    params.videncParams.maxHeight = dynamicParams.videncDynamicParams.inputHeight;
    params.videncParams.maxWidth = dynamicParams.videncDynamicParams.inputWidth;
    params.videncParams.maxFrameRate = dynamicParams.videncDynamicParams.targetFrameRate;
    params.videncParams.maxBitRate = 6000000;
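    For reference, the rate-control target above implies quite small compressed frames on average. A quick sketch of the arithmetic (the helper name is ours, not part of the codec API):

```c
/* Average compressed frame size implied by the rate-control target:
   512000 bps / 25 fps = 20480 bits = 2560 bytes per frame on average. */
static unsigned avg_frame_bytes(unsigned bitrate_bps, unsigned fps)
{
    return bitrate_bps / fps / 8u;
}
```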

    Regards,

    Andriy Lysnevych

  • Andriy,

    Thanks for posting the codec configuration. So, you are running the H264 BP encoder, 704x576, 25 fps @ 512 kbps. Can you please confirm whether you are running the DSP at 1.25 GHz or 1 GHz?

    As yours is a transcoder application, both the input (MPEG2 frames) and output (H264 frames) are very small (compared to the YUVs). I was not talking about the I/O of these compressed frames when I mentioned DDR contention. What I was referring to was the YUV data going between L2 and DDR during the transcode operation (YUV data generated by the MPEG2 decoder and H264 encoder). All the cores concurrently schedule these EDMA transfers at macroblock level and there will be contention at the EDMA3 TC, so the cores will get stalled when all 8 cores are executing concurrently. You do not have control to stagger these DDR transactions and pace them out. We have to budget for this overhead when computing the channel density.

    Can you please post the per-frame cycles of the encode and decode functions (a) when only a single core is running and (b) when all cores are running concurrently? I just want to check the multicore overhead you are observing. I believe that even with the overhead, you should be able to transcode mpeg2 --> h264 @ 25fps when you run the DSP at 1.25 GHz.

    Regards,

    Vivek

  • Hi Vivek,

    We use 1.25 GHz (DSPC-8681E). I use the following code to calculate cycles:

    long long decodeStartTime = _itoll(TSCH, TSCL);

    result = decoderFunctions->process(decoder, &inDecoderBufferDescriptor, outDecoderBufferDescriptor, &inDecoderArguments,
                    &outDecoderArguments);

    long long decodeEndTime = _itoll(TSCH, TSCL);

    printf("Decode ticks=%lld\n", decodeEndTime - decodeStartTime);

    long long encodeStartTime = _itoll(TSCH, TSCL);

    result = encoder1Functions->process(encoder1Handle, &inEncoder1BufferDescriptor,
            &outEncoder1BufferDescriptor, inEncoder1Arguments, outEncoder1Arguments);

    long long encodeEndTime = _itoll(TSCH, TSCL);

    printf("Encode ticks=%lld\n", encodeEndTime - encodeStartTime);
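    As an aside on the measurement itself: on C66x, reading TSCL latches TSCH, so the low half should be read before the high half. A portable sketch of the 64-bit combine that `_itoll(TSCH, TSCL)` performs (the helper name is ours; the register reads are left out so this runs on a host):

```c
#include <stdint.h>

/* Combines the two 32-bit timestamp halves into a 64-bit value, as
   _itoll(TSCH, TSCL) does on the DSP. On C66x, reading TSCL latches
   TSCH, so TSCL should be read before TSCH. */
static uint64_t combine64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```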

    Average cycles per encode when 4 cores are transcoding streams: 56356796

    Average cycles per encode when only 1 core is transcoding a stream: 35147230
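    The overhead implied by these two averages can be computed directly; a small sketch (the helper name is ours):

```c
/* Multicore overhead relative to the single-core baseline, in percent:
   56356796 / 35147230 - 1 = approx. 0.60, i.e. roughly +60%. */
static double overhead_percent(long long multi, long long single_core)
{
    return 100.0 * ((double)multi / (double)single_core - 1.0);
}
```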

    Detailed logs for 1, 2, 3, 4 cores attached.

    logs.zip
  • Hi Vivek,

    What do you think about this +60% overhead when encoding on 4 cores?

    Regards,

    Andriy Lysnevych

  • Hi Andriy,

    From some of our earlier profiling results on C6678, multicore degradation was about 15-20% when all cores run concurrently and burst data requests to DDR. What you are observing (60%) looks very high... Are you taking advantage of the multiple EDMA instances available (C6678 has 3)? If you look at MCSDK Video, we reserve instance 0 for PCIe transfers, instance 1 for cores 0, 1, 2, and 3, and instance 2 for cores 4, 5, 6, and 7.

    Regards,

    Vivek

  • Hi Vivek,

    Using 2 EDMA instances really helped. The overhead for 4 cores is about 40% now. I will use cache and try to decrease it further. I have a few questions:

    1) How many 704x576 streams can be transcoded on a single C66x DSP with 8 cores in theory? We use the H.264 BP encoder and MPEG2 decoder from TI.

    2) We use the mapped-buffers approach for PCIe transfers. Is EDMA used for transfers when the mapped-buffers approach is used?

    3) Can we use EDMA instance 0 for transcoding tasks rather than PCIe, and how much would that slow down PCIe transfers?

    We use the following configuration for pcie_drv initialization:

    #define MAPPED_BUFFER_SIZE           (0x00400000)

    pciedrv_open_config_t pciedrv_open_config = { 0 };

    pciedrv_open_config.dsp_outbound_reserved_mem_size = 0;
    pciedrv_open_config.start_dma_chan_num = 0;
    pciedrv_open_config.num_dma_channels = 0;
    pciedrv_open_config.start_param_set_num = 0;
    pciedrv_open_config.num_param_sets = 0;
    pciedrv_open_config.dsp_outbound_block_size = MAPPED_BUFFER_SIZE;

    pciedrv_open(&pciedrv_open_config);

    And then we create the required mapped buffers using the cmem driver.

    1) How many 704x576 streams can be transcoded on a single C66x DSP with 8 cores in theory? We use the H.264 BP encoder and MPEG2 decoder from TI.

    >> My expectation is that you should be able to transcode 1 stream per core.

    2) We use the mapped-buffers approach for PCIe transfers. Is EDMA used for transfers when the mapped-buffers approach is used?

    >> When you just map buffers, x86 memory is made visible to the DSP. So it depends on how the DSP actually reads from that buffer (EDMA vs. CPU). I think the decoder reads via CPU, and the encoder also writes to the output buffer via CPU. PCIe transactions driven by the CPU are very slow. If you can use EDMA to pre-load the buffer into DDR before the decoder reads it, and on the encoder side let the encoder write to L2/DDR and then use EDMA to get that to the host, that should significantly boost performance.

    Also, the DSP should continue transcoding frame 'N' while EDMA takes care of getting frame 'N+1' into DDR and transcoded frame 'N-1' from DDR to the host. These three things should happen concurrently.
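    The overlap described above amounts to a three-buffer rotation: with three buffers, frames N+1 (being fetched), N (being transcoded), and N-1 (being drained) always land in distinct buffers, so the stages never collide. A minimal sketch of the index arithmetic (the helper name is ours):

```c
/* With a 3-deep rotating buffer pool, a frame number maps to a buffer
   index modulo 3. Any three consecutive frame numbers hit three distinct
   buffers, so fetch (N+1), transcode (N), and drain (N-1) never overlap. */
static int buf_of(int frame)
{
    return ((frame % 3) + 3) % 3;   /* non-negative modulo, handles frame -1 */
}
```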

    3) Can we use EDMA instance 0 for transcoding tasks rather than PCIe, and how much would that slow down PCIe transfers?

    >> Yes, you can. EDMA instance 0 can be statically split, so some PaRAMs and channels can be used for host transfers and the rest for transcoding. Please ensure that there is no overlap in resource usage.
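    As a purely illustrative sketch of such a split (the channel and PaRAM counts below are assumptions, not values taken from MCSDK), the driver-open configuration could reserve a small block of instance-0 resources for the host path, leaving the remainder for the transcoders:

```c
/* Hypothetical split of EDMA3 instance 0: the first 8 DMA channels and
   16 PaRAM sets go to PCIe/host transfers via pciedrv; the transcoder's
   EDMA3/RMAN configuration must then exclude exactly these resources. */
pciedrv_open_config.start_dma_chan_num  = 0;
pciedrv_open_config.num_dma_channels    = 8;
pciedrv_open_config.start_param_set_num = 0;
pciedrv_open_config.num_param_sets      = 16;
```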

  • Hello Vivek,

    I modified the transcoder and put all the structures that are used by the encoder, plus the task stack, into L2SRAM: BufDescs, Params, Dynamic Params, In/Out arguments, Status, etc. The input and output buffers themselves are in DDR3 memory. In my configuration all L2 memory is used as RAM and L1P and L1D are caches, but I do not use the cache API to manually put data into cache.

    I use EDMA instance 1 for cores #0 and #2; EDMA instance 2 is used for cores #1 and #2.

    I run four transcoders on cores #0, #1, #2, and #3 of DSP #0, and one transcoder on core #0 of DSP #1, at the same time. The input data (an MPEG2 elementary stream) is the same for all transcoders.

    Results:

    read - reading from the input mapped buffer into DDR3

    write - writing from DDR3 into the output mapped buffer

    decode - process call of the decoder

    encode - process call of the encoder

    total - the whole transcode loop (read, write, decode, encode, and the light instructions between these calls)

    DSP #0 core #0

    read   total = 2550162225    average = 2219462    percent = 3.54
    decode total = 8646275491    average = 7525043    percent = 12.00
    encode total = 60623998279    average = 52854401    percent = 84.11
    write  total = 227546479    average = 198384    percent = 0.32
    total  total = 72073950506    average = 62727546

    DSP #0 core #1

    read   total = 2294828360    average = 2297125    percent = 3.49
    decode total = 8063856478    average = 8071928    percent = 12.25
    encode total = 55108287769    average = 55274110    percent = 83.75
    write  total = 299644311    average = 300545    percent = 0.46
    total  total = 65802155166    average = 65868023

    DSP #0 core #2

    read   total = 2271550664    average = 2273824    percent = 3.45
    decode total = 8066513355    average = 8074587    percent = 12.26
    encode total = 55135496870    average = 55301401    percent = 83.77
    write  total = 307853531    average = 308779    percent = 0.47
    total  total = 65819143917    average = 65885028

    DSP #0 core #3

    read   total = 2244562508    average = 2246809    percent = 3.41
    decode total = 8077625832    average = 8085711    percent = 12.27
    encode total = 55158939377    average = 55324914    percent = 83.79
    write  total = 319352796    average = 320313    percent = 0.49
    total  total = 65831831650    average = 65897729

    DSP #1 core #0

    read   total = 3141149063    average = 2167804    percent = 4.61
    decode total = 8774638849    average = 6055651    percent = 12.87
    encode total = 55968627939    average = 38679079    percent = 82.07
    write  total = 285463916    average = 197279    percent = 0.42
    total  total = 68192717816    average = 47061917

    Basically what I see is:

    1) Improving read or write will not help much.

    2) More than 80% of the time is spent in the encode call, so that is the place that should be optimized.

    3) I put all the input that I can into L2 memory, but the degradation stays at the same high level when using multiple cores (i.e. +40% compared to single-core execution).

    My questions are:

    1) Do you know possible reasons for this degradation?

    2) What else can I do to lower the degradation when using multiple cores?

    3) What is better: use L2 memory as a cache and let the input structures be cached, or use it as RAM and place all the input structures into it?

    4) The transcoder code, system heap, and other stuff are located in DDR3 memory (the transcoder on core #0 uses 0x80000000 - 0x81FFFFFF, core #1 uses 0x82000000 - 0x83FFFFFF, etc.). Can this be an issue? (I attached the .map file for review.)

    5) Can the encoder configuration (Params, Dynamic Params) be the reason for this degradation?

    6) I use this code to measure time inside a SYS/BIOS task: timeStamp = _itoll(TSCH, TSCL). Is it correct?

    My goal is to transcode 8 streams of 704x576 resolution on the 8 cores of a single DSP in real time, but I can't reach it because of this degradation (I can do it only on one core). I attached logs from all the cores, the SYS/BIOS configuration file (.cfg), and the memory map (.map) of the transcoder.

    Regards,

    Andrey Lisnevich

    profile.zip
  • Hi Andrey,

    On C6678, we once verified MPEG2 to H264BP transcoding, and we could do 1 transcode on each of the 8 cores without issues.

    For single core profiling, our numbers are:

    D1_H264enc_1p5M: 21M cycles per frame

    D1_MPEG2dec_4M: 3.4M cycles per frame

    The numbers above are around half of what you are getting:

    "decode total = 8774638849 average = 6055651 percent = 12.87
     encode total = 55968627939 average = 38679079 percent = 82.07"

    The single-core performance degradation in your application can be due to cache usage. In our application, we use 64K of L2 cache. Also, are you setting the DDR3 memory range as cacheable and prefetchable? This needs to be enabled by setting the MAR registers; details can be found in the TMS320C66x DSP CorePac User Guide. If DDR3 is not configured as cacheable and prefetchable, there can be a big penalty in cycle performance.
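    A sketch of the MAR arithmetic involved (register layout per the C66x CorePac User Guide; the helper name is ours): each MAR covers one 16 MB region, so DDR3 at 0x80000000 is MAR 128, and setting the PC and PFX bits makes a region cacheable and prefetchable.

```c
#include <stdint.h>

/* Each MAR register controls one 16 MB region: index = address >> 24, so
   DDR3 at 0x80000000 starts at MAR 128. On the DSP the MAR array lives at
   0x01848000; writing PC|PFX (= 0x9) to an entry marks that region
   cacheable and prefetchable (see the C66x CorePac User Guide). */
#define MAR_PC  (1u << 0)   /* permit caching */
#define MAR_PFX (1u << 3)   /* permit prefetching */

static unsigned mar_index(uint32_t addr)
{
    return addr >> 24;
}
```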

    Regarding multicore, one issue is the placement of the .far and .fardata sections: these sections must be placed in non-overlapping regions for different cores, e.g., LL2 or DDR3 dedicated to individual cores. Overlap here can cause misbehavior in multicore operation.

    In order to utilize LL2 more efficiently, please try to reduce the stack size for the transcode task, which is now 0x10000 (the actual peak value can be obtained from ROV). Also add "RMAN.maxAlgs = 3;" in the .cfg to reduce the size of the .far (or .fardata) section. Some sections (e.g., .vecs, .switch) can be moved to MSMC. The huge internalMemoryHeap section (0x64000) may also be reduced after checking the actual peak usage with ROV.

    Thanks,

    Hongmei

  • Hi Hongmei,

    All your recommendations are correct, and I implemented them in my code. But they did not speed up the transcode greatly. I managed to make encode/decode much faster only when I placed the code segment ".text" into MSMC SRAM (it was in DDR3 segments dedicated to each core before).

    Now encode takes about 23.0M cycles and decode about 4.0M cycles, which is much closer to your results. And now I can transcode one 704x576 stream on each core.

    Total degradation when using all 8 cores simultaneously is about 27% now: encode 29.0M cycles, decode 5.0M cycles. But sometimes it can grow up to 40% (I believe depending on the stream).

    Unfortunately, I am now experiencing another issue that I am going to discuss in a separate thread: http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/236954.aspx

    Thanks!