
Transcoding multiple streams on multiple cores slowdown

Hello,

I am creating a transcoding application that transcodes input MPEG-2 video streams into H.264 output. I am using the latest versions of the MPEG-2 and H.264 BP codecs for the C66x platform (target hardware is DSPC-8681).

I use PCIe mapped buffers to receive the input MPEG-2 stream from the host PC and to send the transcoded H.264 stream back to it.

When I run transcoding on a single core, it is able to transcode a 708x576 stream in real time. But when I run transcoding tasks on the other cores too, transcoding slows down and a single core is no longer able to handle 708x576 transcoding in real time. The cores do not interact (no IPC).

Do you know what causes the slowdown?

  • Hi Andriy,

    Are you configuring the DSPs to run at 1.25 GHz, or are they at 1 GHz? Can you please post the configuration for the encoder? What bitrate/profile are the encoders running at? Regarding your observation: when all the cores are running concurrently, there is contention at the DDR interface. DDR has the bandwidth to serve all cores when the requests are staggered, but when the requests from the cores are bursty, this contention arises and the cores have to wait for EDMA transfers to finish because they are competing with the other cores.

    Regards,

    Vivek

  • Hi Vivek,

    In our transcoder we do not use caching for now.

    We do the following without EDMA:

    - Copy MPEG2 frames from the shared memory region allocated by the PCIe driver (we don't know where it is located) to DDR3 memory.

    - Copy transcoded H.264 frames from DDR3 memory to the shared memory region allocated by the PCIe driver.

    All other memory operations are done within the H264BP codec using the configured DSKT2, RMAN, and EDMA3.

    If the bottleneck is in EDMA how can we overcome it?

    Configuration looks like:

    width = 704;
    height = 576;
    frameRate = 25;

    transcoderConfiguration.outputBitRate = 512000;
    transcoderConfiguration.intraFrameInterval = 30;
    transcoderConfiguration.maxMBsPerSlice = 8160;
    transcoderConfiguration.levelIdc = 40;
    transcoderConfiguration.forceIFrame = 0;
    transcoderConfiguration.qpIntra = 28;
    transcoderConfiguration.qpInter = 28;
    transcoderConfiguration.qpMax = 51;
    transcoderConfiguration.qpMin = 0;
    transcoderConfiguration.maxBytesPerSlice = 0;
    transcoderConfiguration.intra4x4EnableIdc = 1;
    transcoderConfiguration.constrainedIntraPredEnable = 0;
    transcoderConfiguration.picOrderCountType = 0;
    transcoderConfiguration.maxMVperMB = 4;
    transcoderConfiguration.lfDisableIdc = 0;
    transcoderConfiguration.quartPelDisable = 0;
    transcoderConfiguration.mvDataEnable = 0;
    transcoderConfiguration.airMbPeriod = 0;
    transcoderConfiguration.hierCodingEnable = 0;
    transcoderConfiguration.intraRefreshMethod = 0;
    transcoderConfiguration.Intra_QP_modulation = 1;
    transcoderConfiguration.rateControlPreset = 1;
    transcoderConfiguration.rcAlgo = 1;
    transcoderConfiguration.idrEnable = 1;

    params.profileIdc = 66; // Profile IDC (66=baseline, 77=main, 88=extended)
    params.levelIdc = (IH264VENC_Level) transcoderConfiguration->levelIdc; // Level IDC
    params.searchRange = 64; // Max search range
    params.rcAlgo = transcoderConfiguration->rcAlgo; // Algorithm to be used by rate control scheme. Valid values are 0 (DCES_TM5) and 1(PLR). It is useful only when rateControlPreset is IVIDEO_USER_DEFINED

    params.videncParams.encodingPreset = 3;
    params.videncParams.rateControlPreset = transcoderConfiguration->rateControlPreset; // Enable
    params.videncParams.inputChromaFormat = XDM_YUV_420P;
    params.videncParams.dataEndianness = XDM_BYTE;
    params.videncParams.maxInterFrameInterval = 0;
    params.videncParams.inputContentType = IVIDEO_PROGRESSIVE;
    params.videncParams.maxFrameRate = transcoderConfiguration->outputFrameRate * 1000;

    dynamicParams.qpIntra = transcoderConfiguration->qpIntra; // initial QP of I frames Range[-1,51]. -1 is for auto initialization
    dynamicParams.qpInter = transcoderConfiguration->qpInter; // initial QP of P frames Range[-1,51]. -1 is for  auto initialization.
    dynamicParams.qpMax = transcoderConfiguration->qpMax; // Max Quantization parameter
    dynamicParams.qpMin = transcoderConfiguration->qpMin; // Min Quantization parameter
    dynamicParams.maxMBsPerSlice = transcoderConfiguration->maxMBsPerSlice;
    dynamicParams.maxBytesPerSlice = transcoderConfiguration->maxBytesPerSlice; // Maximum number of bytes in a slice
    dynamicParams.sliceRefreshRowStartNumber = 0; // Start row number for intra slice
    dynamicParams.sliceRefreshRowNumber = 0; // Number of rows to be intra coded
    dynamicParams.filterOffsetA = 0; // alpha offset for loop filter [-12, 12] even number
    dynamicParams.filterOffsetB = 0; // beta offset for loop filter [-12, 12] even number
    dynamicParams.intra4x4EnableIdc = transcoderConfiguration->intra4x4EnableIdc; // H.264 Encoder Slice level Control for Intra4x4 Modes
    dynamicParams.pfNalUnitCallBack = NULL; // A function pointer
    dynamicParams.streamFormat = IH264_BYTE_STREAM; // IH264_BYTE_STREAM = 0,  IH264_NALU_STREAM = 1 (only regarded when pfNalUnitCallBack is non-zero)
    dynamicParams.log2MaxFNumMinus4 = 0; // Sets log2_max_frame_num_minus4 [0,12]
    dynamicParams.chromaQPIndexOffset = 0; // Valid value [-12,12] -> default 0, index into mapping table of luma to chroma QP
    dynamicParams.constrainedIntraPredEnable = transcoderConfiguration->constrainedIntraPredEnable; // Enable/Disable constraint Intra Pred
    dynamicParams.picOrderCountType = transcoderConfiguration->picOrderCountType; // Sets picture order cnt type Valid values -> 0 and 2, 2 is recommended for base profile
    dynamicParams.maxMVperMB = transcoderConfiguration->maxMVperMB; // Maximum MV per MB (Values of 1 & 4 are valid)
    dynamicParams.lfDisableIdc = transcoderConfiguration->lfDisableIdc; // Controls enable/disable loop filter
    dynamicParams.quartPelDisable = transcoderConfiguration->quartPelDisable; // Enable/Disable Quarter Pel=>1: Only Half Pel 0: Both Half & Quarter Pel
    dynamicParams.mvDataEnable = transcoderConfiguration->mvDataEnable; // Enable/Disable exposure of MV data
    dynamicParams.airMbPeriod = transcoderConfiguration->airMbPeriod; // Adaptive intra refresh period ( 0 means: no effect)
    dynamicParams.hierCodingEnable = transcoderConfiguration->hierCodingEnable; // Enable/Disable Hierarchical P frame encoding
    dynamicParams.intraRefreshMethod = transcoderConfiguration->intraRefreshMethod; // Mechanism to do intra Refresh
    dynamicParams.Intra_QP_modulation = transcoderConfiguration->Intra_QP_modulation; // Intra frame QP modulation 1 ON : 0 OFF
    dynamicParams.Max_delay = 3; // Rate control delay in steps of 1/30 sec
    dynamicParams.numSliceGroups = 0; // Number of Slice Groups Minus 1, 0 == no FMO, 1 == two slice groups, etc.(is <= for type 2 FMO)
    dynamicParams.sliceGroupMapType = 0; // 0:  Interleave, 2: Foreground with left-over, # 4: Raster Scan
    dynamicParams.sliceGroupChangeDirectionFlag = 0; // 0: raster scan (relevant to type 4 only # 1: reverse raster scan (relevant to type 4 only)
    dynamicParams.sliceGroupChangeRate = 0; // (relevant to type 4 only - refer standard for expln)
    dynamicParams.sliceGroupChangeCycle = 0; // (relevant to type 4 only - refer standard for expln)
    // dynamicParams.sliceGroupParams // Zeros
    dynamicParams.numSliceASO = 0; // (0 == ASO absent) (>0 => ASO present && Specifies the dimension of asoSliceOrder);
    // dynamicParams.asoSliceOrder // Zeros
    dynamicParams.top_slice_line = 0;
    dynamicParams.bottom_slice_line = 0;
    dynamicParams.idrEnable = transcoderConfiguration->idrEnable; // Flag to make all I-frames IDR

    dynamicParams.videncDynamicParams.targetFrameRate = transcoderConfiguration->outputFrameRate * 1000;
    dynamicParams.videncDynamicParams.refFrameRate = transcoderConfiguration->outputFrameRate * 1000;
    dynamicParams.videncDynamicParams.intraFrameInterval = transcoderConfiguration->intraFrameInterval;
    dynamicParams.videncDynamicParams.inputWidth = transcoderConfiguration->outputWidth;
    dynamicParams.videncDynamicParams.inputHeight = transcoderConfiguration->outputHeight;
    dynamicParams.videncDynamicParams.targetBitRate = transcoderConfiguration->outputBitRate;
    dynamicParams.videncDynamicParams.generateHeader = XDM_ENCODE_AU;
    dynamicParams.videncDynamicParams.captureWidth = 0;
    dynamicParams.videncDynamicParams.forceIFrame = transcoderConfiguration->forceIFrame;

    params.videncParams.maxHeight = dynamicParams.videncDynamicParams.inputHeight;
    params.videncParams.maxWidth = dynamicParams.videncDynamicParams.inputWidth;
    params.videncParams.maxFrameRate = dynamicParams.videncDynamicParams.targetFrameRate;
    params.videncParams.maxBitRate = 6000000;
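    For reference, the rate-control target above implies quite small compressed frames on average. A quick sketch of the arithmetic (the helper name is ours, not part of the codec API):

```c
/* Average compressed frame size implied by the rate-control target:
   512000 bps / 25 fps = 20480 bits = 2560 bytes per frame on average. */
static unsigned avg_frame_bytes(unsigned bitrate_bps, unsigned fps)
{
    return bitrate_bps / fps / 8u;
}
```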

    Regards,

    Andriy Lysnevych

  • Andriy,

    Thanks for posting the codec configuration. So, you are running the H264 BP encoder, 704x576, 25 fps @ 512 kbps. Can you please confirm whether you are running the DSP at 1.25 GHz or 1 GHz?

    As yours is a transcoder application, both the input (MPEG2 frames) and output (H264 frames) are very small (compared to the YUVs). I was not talking about the I/O of these compressed frames when I mentioned DDR contention. What I was referring to was the YUV data going between L2 and DDR during the transcode operation (YUV data generated by the MPEG2 decoder and H264 encoder). All the cores concurrently schedule these EDMA transfers at macroblock level and there will be contention at the EDMA3 TC, so the cores will get stalled when all 8 cores are executing concurrently. You do not have control to stagger these DDR transactions and pace them out. We have to budget for this overhead when computing the channel density.

    Can you please post the per-frame cycles of the encode and decode functions (a) when only a single core is running and (b) when all cores are running concurrently? I just want to check the multicore overhead you are observing. I believe that even with the overhead, you should be able to transcode mpeg2 --> h264 @ 25fps when you run the DSP at 1.25 GHz.

    Regards,

    Vivek

  • Hi Vivek,

    We use 1.25 GHz (DSPC-8681E). I use the following code to calculate cycles:

    long long decodeStartTime = _itoll(TSCH, TSCL);

    result = decoderFunctions->process(decoder, &inDecoderBufferDescriptor, outDecoderBufferDescriptor, &inDecoderArguments,
                    &outDecoderArguments);

    long long decodeEndTime = _itoll(TSCH, TSCL);

    printf("Decode ticks=%lld\n", decodeEndTime - decodeStartTime);

    long long encodeStartTime = _itoll(TSCH, TSCL);

    result = encoder1Functions->process(encoder1Handle, &inEncoder1BufferDescriptor,
            &outEncoder1BufferDescriptor, inEncoder1Arguments, outEncoder1Arguments);

    long long encodeEndTime = _itoll(TSCH, TSCL);

    printf("Encode ticks=%lld\n", encodeEndTime - encodeStartTime);
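    As an aside on the measurement itself: on C66x, reading TSCL latches TSCH, so the low half should be read before the high half. A portable sketch of the 64-bit combine that `_itoll(TSCH, TSCL)` performs (the helper name is ours; the register reads are left out so this runs on a host):

```c
#include <stdint.h>

/* Combines the two 32-bit timestamp halves into a 64-bit value, as
   _itoll(TSCH, TSCL) does on the DSP. On C66x, reading TSCL latches
   TSCH, so TSCL should be read before TSCH. */
static uint64_t combine64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```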

    Average cycles per encode when 4 cores are transcoding streams: 56356796

    Average cycles per encode when only 1 core is transcoding a stream: 35147230
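    The overhead implied by these two averages can be computed directly; a small sketch (the helper name is ours):

```c
/* Multicore overhead relative to the single-core baseline, in percent:
   56356796 / 35147230 - 1 = approx. 0.60, i.e. roughly +60%. */
static double overhead_percent(long long multi, long long single_core)
{
    return 100.0 * ((double)multi / (double)single_core - 1.0);
}
```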

    Detailed logs for 1, 2, 3, 4 cores attached.

    logs.zip
  • Hi Vivek,

    What do you think about this +60% overhead when encoding on 4 cores?

    Regards,

    Andriy Lysnevych

  • Hi Andriy,

    From some of our earlier profiling results on C6678, multicore degradation was about 15-20% when all cores run concurrently and burst data requests to DDR. What you are observing (60%) looks very high... Are you taking advantage of the multiple EDMA instances available (C6678 has 3)? If you look at MCSDK Video, we reserve instance 0 for PCIe transfers, instance 1 for cores 0, 1, 2, and 3, and instance 2 for cores 4, 5, 6, and 7.

    Regards,

    Vivek

  • Hi Vivek,

    Using 2 EDMA instances really helped. The overhead for 4 cores is about 40% now. I will use cache and try to decrease it further. I have a few questions:

    1) How many 704x576 streams can be transcoded on a single C66x DSP with 8 cores in theory? We use the H.264 BP encoder and MPEG2 decoder from TI.

    2) We use the mapped-buffers approach for PCIe transfers. Is EDMA used for transfers when the mapped-buffers approach is used?

    3) Can we use EDMA instance 0 for transcoding tasks rather than PCIe, and how much would that slow down PCIe transfers?

    We use the following configuration for pcie_drv initialization:

    #define MAPPED_BUFFER_SIZE           (0x00400000)

    pciedrv_open_config_t pciedrv_open_config = { 0 };

    pciedrv_open_config.dsp_outbound_reserved_mem_size = 0;
    pciedrv_open_config.start_dma_chan_num = 0;
    pciedrv_open_config.num_dma_channels = 0;
    pciedrv_open_config.start_param_set_num = 0;
    pciedrv_open_config.num_param_sets = 0;
    pciedrv_open_config.dsp_outbound_block_size = MAPPED_BUFFER_SIZE;

    pciedrv_open(&pciedrv_open_config);

    And then we create the required mapped buffers using the cmem driver.

    1) How many 704x576 streams can be transcoded on a single C66x DSP with 8 cores in theory? We use the H.264 BP encoder and MPEG2 decoder from TI.

    >> My expectation is that you should be able to transcode 1 stream per core.

    2) We use the mapped-buffers approach for PCIe transfers. Is EDMA used for transfers when the mapped-buffers approach is used?

    >> When you just map buffers, x86 memory is made visible to the DSP. So it depends on how the DSP actually reads from that buffer (EDMA vs. CPU). I think the decoder reads via CPU, and the encoder also writes to the output buffer via CPU. PCIe transactions driven by the CPU are very slow. If you can use EDMA to pre-load the buffer into DDR before the decoder reads it, and on the encoder side let the encoder write to L2/DDR and then use EDMA to get that to the host, that should significantly boost performance.

    Also, the DSP should continue transcoding frame 'N' while EDMA takes care of getting frame 'N+1' into DDR and transcoded frame 'N-1' from DDR to the host. These three things should happen concurrently.
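    The overlap described above amounts to a three-buffer rotation: with three buffers, frames N+1 (being fetched), N (being transcoded), and N-1 (being drained) always land in distinct buffers, so the stages never collide. A minimal sketch of the index arithmetic (the helper name is ours):

```c
/* With a 3-deep rotating buffer pool, a frame number maps to a buffer
   index modulo 3. Any three consecutive frame numbers hit three distinct
   buffers, so fetch (N+1), transcode (N), and drain (N-1) never overlap. */
static int buf_of(int frame)
{
    return ((frame % 3) + 3) % 3;   /* non-negative modulo, handles frame -1 */
}
```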

    3) Can we use EDMA instance 0 for transcoding tasks rather than PCIe, and how much would that slow down PCIe transfers?

    >> Yes, you can. EDMA instance 0 can be statically split, so some PaRAMs and channels can be used for host transfers and the rest for transcoding. Please ensure that there is no overlap in resource usage.
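    As a purely illustrative sketch of such a split (the channel and PaRAM counts below are assumptions, not values taken from MCSDK), the driver-open configuration could reserve a small block of instance-0 resources for the host path, leaving the remainder for the transcoders:

```c
/* Hypothetical split of EDMA3 instance 0: the first 8 DMA channels and
   16 PaRAM sets go to PCIe/host transfers via pciedrv; the transcoder's
   EDMA3/RMAN configuration must then exclude exactly these resources. */
pciedrv_open_config.start_dma_chan_num  = 0;
pciedrv_open_config.num_dma_channels    = 8;
pciedrv_open_config.start_param_set_num = 0;
pciedrv_open_config.num_param_sets      = 16;
```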

  • Hello Vivek,

    I modified the transcoder and put all the structures that are used by the encoder, plus the task stack, into L2SRAM: BufDescs, Params, Dynamic Params, In/Out arguments, Status, etc. The input and output buffers themselves are in DDR3 memory. In my configuration all L2 memory is used as RAM and L1P and L1D are caches, but I do not use the cache API to manually put data into cache.

    I use EDMA instance 1 for cores #0 and #2; EDMA instance 2 is used for cores #1 and #2.

    I run four transcoders on cores #0, #1, #2, and #3 of DSP #0, and one transcoder on core #0 of DSP #1, at the same time. The input data (an MPEG2 elementary stream) is the same for all transcoders.

    Results:

    read - reading from the input mapped buffer into DDR3

    write - writing from DDR3 into the output mapped buffer

    decode - process call of the decoder

    encode - process call of the encoder

    total - the whole transcode loop (read, write, decode, encode, and the light instructions between these calls)

    DSP #0 core #0

    read   total = 2550162225    average = 2219462    percent = 3.54
    decode total = 8646275491    average = 7525043    percent = 12.00
    encode total = 60623998279    average = 52854401    percent = 84.11
    write  total = 227546479    average = 198384    percent = 0.32
    total  total = 72073950506    average = 62727546

    DSP #0 core #1

    read   total = 2294828360    average = 2297125    percent = 3.49
    decode total = 8063856478    average = 8071928    percent = 12.25
    encode total = 55108287769    average = 55274110    percent = 83.75
    write  total = 299644311    average = 300545    percent = 0.46
    total  total = 65802155166    average = 65868023

    DSP #0 core #2

    read   total = 2271550664    average = 2273824    percent = 3.45
    decode total = 8066513355    average = 8074587    percent = 12.26
    encode total = 55135496870    average = 55301401    percent = 83.77
    write  total = 307853531    average = 308779    percent = 0.47
    total  total = 65819143917    average = 65885028

    DSP #0 core #3

    read   total = 2244562508    average = 2246809    percent = 3.41
    decode total = 8077625832    average = 8085711    percent = 12.27
    encode total = 55158939377    average = 55324914    percent = 83.79
    write  total = 319352796    average = 320313    percent = 0.49
    total  total = 65831831650    average = 65897729

    DSP #1 core #0

    read   total = 3141149063    average = 2167804    percent = 4.61
    decode total = 8774638849    average = 6055651    percent = 12.87
    encode total = 55968627939    average = 38679079    percent = 82.07
    write  total = 285463916    average = 197279    percent = 0.42
    total  total = 68192717816    average = 47061917

    Basically what I see is:

    1) Improving read or write will not help much.

    2) More than 80% of the time is spent in the encode call, so that is the place that should be optimized.

    3) I put all the input that I can into L2 memory, but the degradation stays at the same high level when using multiple cores (i.e. +40% compared to single-core execution).

    My questions are:

    1) Do you know possible reasons for this degradation?

    2) What else can I do to lower the degradation when using multiple cores?

    3) What is better: use L2 memory as a cache and let the input structures be cached, or use it as RAM and place all the input structures into it?

    4) The transcoder code, system heap, and other stuff are located in DDR3 memory (the transcoder on core #0 uses 0x80000000 - 0x81FFFFFF, core #1 uses 0x82000000 - 0x83FFFFFF, etc.). Can this be an issue? (I attached the .map file for review.)

    5) Can the encoder configuration (Params, Dynamic Params) be the reason for this degradation?

    6) I use this code to measure time inside a SYS/BIOS task: timeStamp = _itoll(TSCH, TSCL). Is it correct?

    My goal is to transcode 8 streams of 704x576 resolution on the 8 cores of a single DSP in real time, but I can't reach it because of this degradation (I can do it only on one core). I attached logs from all the cores, the SYS/BIOS configuration file (.cfg), and the memory map (.map) of the transcoder.

    Regards,

    Andrey Lisnevich

    profile.zip
  • Hi Andrey,

    On C6678, we once verified MPEG2 to H264BP transcoding, and we could do 1 transcode on each of the 8 cores without issues.

    For single core profiling, our numbers are:

    D1_H264enc_1p5M: 21M cycles per frame

    D1_MPEG2dec_4M: 3.4M cycles per frame

    The numbers above are around half of what you are getting:

    "decode total = 8774638849 average = 6055651 percent = 12.87
     encode total = 55968627939 average = 38679079 percent = 82.07"

    The single-core performance degradation in your application can be due to cache usage. In our application, we use 64K of L2 cache. Also, are you setting the DDR3 memory range as cacheable and prefetchable? This needs to be enabled by setting the MAR registers; details can be found in the TMS320C66x DSP CorePac User Guide. If DDR3 is not configured as cacheable and prefetchable, there can be a big penalty in cycle performance.
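    A sketch of the MAR arithmetic involved (register layout per the C66x CorePac User Guide; the helper name is ours): each MAR covers one 16 MB region, so DDR3 at 0x80000000 is MAR 128, and setting the PC and PFX bits makes a region cacheable and prefetchable.

```c
#include <stdint.h>

/* Each MAR register controls one 16 MB region: index = address >> 24, so
   DDR3 at 0x80000000 starts at MAR 128. On the DSP the MAR array lives at
   0x01848000; writing PC|PFX (= 0x9) to an entry marks that region
   cacheable and prefetchable (see the C66x CorePac User Guide). */
#define MAR_PC  (1u << 0)   /* permit caching */
#define MAR_PFX (1u << 3)   /* permit prefetching */

static unsigned mar_index(uint32_t addr)
{
    return addr >> 24;
}
```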

    Regarding multicore, one issue is the placement of the .far and .fardata sections: these sections must be placed in non-overlapping regions for different cores, e.g., LL2 or DDR3 dedicated to individual cores. Overlap here can cause misbehavior in multicore operation.

    In order to utilize LL2 more efficiently, please try to reduce the stack size for the transcode task, which is now 0x10000 (the actual peak value can be obtained from ROV). Also add "RMAN.maxAlgs = 3;" in the .cfg to reduce the size of the .far (or .fardata) section. Some sections (e.g., .vecs, .switch) can be moved to MSMC. The huge internalMemoryHeap section (0x64000) may also be reduced after checking the actual peak usage with ROV.

    Thanks,

    Hongmei

  • Hi Hongmei,

    All your recommendations are correct, and I implemented them in my code. But they did not speed up the transcode greatly. I managed to make encode/decode much faster only when I placed the code segment ".text" into MSMC SRAM (it was in DDR3 segments dedicated to each core before).

    Now encode takes about 23.0M cycles and decode about 4.0M cycles, which is much closer to your results. And now I can transcode one 704x576 stream on each core.

    Total degradation when using all 8 cores simultaneously is about 27% now: encode 29.0M cycles, decode 5.0M cycles. But sometimes it can grow up to 40% (I believe depending on the stream).

    Unfortunately, I am now experiencing another issue that I am going to discuss in a separate thread: http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/236954.aspx

    Thanks!