
Don't understand CE Multi-core overhead example

Hello,

A benchmarked example shown in CE Multi-core Overhead Analysis suggests that cache maintenance is a significant overhead in the multi-core architecture. I see that Step 4 takes about 21000 microseconds (~95.0%), but this step covers activating, processing, and deactivating the codec, which I understood to be a fast process. The other steps, including buffer invalidation and write-back, are only about 2% of the time each (and I assumed these would be the smallest contributors). Could anybody clarify this point?

Regards,
gaston

  • Hello Gaston:

        The benchmark example found that the cache overhead (sum of Steps 3 and 5) was about 3.8% of the total processing time.

        Steps 1 + 2 + 6 + 7: 150 microseconds (~0.7%)
        Step 3: 500 microseconds (~1.8%)
        Step 4: 21000 microseconds (~95.0%)
        Step 5: 450 microseconds (~2.0%)
        Step 4 includes the processing time for the algorithm itself, which in this case was a video decoder taking about 21 ms per frame. It also includes the Codec Engine activation and deactivation of the algorithm; that time is not given explicitly, but is assumed to be negligible compared to the video decode.
     
        So the total overhead (cache plus other) surrounding the execution of the video decode appears to be around 4.5% in this example.
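The percentage shares can be re-derived from the raw per-step times quoted above. A minimal sketch (the step labels are paraphrased from the thread; recomputing from the raw times gives numbers close to, though not exactly matching, the quoted per-step percentages, which appear to be rounded):

```python
# Per-step times in microseconds, as quoted in the benchmark breakdown above.
step_us = {
    "steps 1+2+6+7": 150,   # remaining framework overhead
    "step 3": 500,          # cache maintenance before processing
    "step 4": 21000,        # activate + decode one frame + deactivate
    "step 5": 450,          # cache maintenance after processing
}

total_us = sum(step_us.values())                      # 22100 us per frame
share = {k: 100.0 * v / total_us for k, v in step_us.items()}

cache_us = step_us["step 3"] + step_us["step 5"]      # 950 us of cache work
overhead_us = total_us - step_us["step 4"]            # 1100 us of total overhead

print(f"step 4 share: {share['step 4']:.1f}%")        # ~95.0%, dominated by the decode
print(f"cache share:  {100.0 * cache_us / total_us:.1f}%")
```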
    - Gil
  • This article explains a bit more about why cache is a concern, and the operations necessary for frameworks like Codec Engine:
         
    http://processors.wiki.ti.com/index.php/Cache_Management

    If you're the algorithm author, and you know some of these cache management overheads are unnecessary, the CE Overhead article you reference contains several techniques for optimizing/eliminating the cache management happening within CE.

    Chris

  • Gil,

    Thank you for giving more details. Still, I'm not sure this is a good example to show the cache overhead, because the overhead is much lower than the processing time for the algorithm. In this case I'd use another example (e.g. universal copy), which implements a faster algorithm (memcpy) relative to the Inv/Wb buffer management. What's your opinion?

    Regards,
    gaston

  • Gaston:

        Overhead time applies to anything other than the algorithm processing time, so it's best to use a real-world example, in this case a video decoder, to see realistic cache overhead percentages.

        Using universal copy in place of the *algorithm* would only change the cache overhead percentages, not the actual cache overhead time, assuming the I/O buffers are the same size and number. 

        It seems to me the video decoder example is better because it is a *real* algorithm, which nevertheless shows a non-negligible impact due to cache operations.
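The point that a faster algorithm only inflates the percentage, not the absolute cache cost, can be sketched numerically. In this sketch the 50-microsecond copy time is a made-up figure for a memcpy-style universal copy; the cache and framework times are the fixed values from the benchmark, since they depend on buffer size and count, not on the algorithm:

```python
CACHE_US = 500 + 450   # Inv + Wb times from the benchmark, fixed by the I/O buffers
OTHER_US = 150         # steps 1+2+6+7, the remaining framework overhead

def cache_share(algorithm_us):
    """Percentage of total frame time spent on cache maintenance."""
    total = CACHE_US + OTHER_US + algorithm_us
    return 100.0 * CACHE_US / total

# Real algorithm: video decoder at ~21 ms per frame.
print(f"video decoder:  {cache_share(21000):.1f}% cache overhead")  # ~4.3%
# Hypothetical fast copy at 50 us: the same 950 us of cache work dominates.
print(f"universal copy: {cache_share(50):.1f}% cache overhead")     # ~82.6%
```

Either way the cache operations cost 950 microseconds per frame; only the denominator changes.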

    Regards,

    - Gil