
Don't understand CE Multi-core overhead example

Hello,

A benchmarked example shown in CE Multi-core Overhead Analysis suggests that cache maintenance is a significant overhead in the multi-core architecture. I see that Step 4 takes about 21000 microseconds (~95.0%), but this step covers activating, processing, and deactivating the codec, which I understood to be a fast process. The other steps, including buffer invalidation and write-back, are only about 2% of the time each (and I assumed these would be the smallest contributors). Could anybody clarify this point?

Regards,
gaston

  • Hello Gaston:

        The benchmark example found that the cache overhead (sum of Steps 3 and 5) was about 3.8% of the total processing time.

        Steps 1 + 2 + 6 + 7: 150 microseconds (~0.7%)
        Step 3: 500 microseconds (~1.8%)
        Step 4: 21000 microseconds (~95.0%)
        Step 5: 450 microseconds (~2.0%)
        Step 4 includes the processing time for the algorithm itself, which in this case was a video decoder taking about 21 ms per frame. It also includes the Codec Engine activation and deactivation of the algorithm; that time is not given explicitly, but is assumed to be negligible compared to the video decode.
     
        So the total overhead (cache plus other) surrounding the execution of the video decode appears to be around 4.5% in this example.
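The percentage shares can be re-derived from the raw per-step times quoted above. A minimal sketch (the step labels are paraphrased from the thread; recomputing from the raw times gives numbers close to, though not exactly matching, the quoted per-step percentages, which appear to be rounded):

```python
# Per-step times in microseconds, as quoted in the benchmark breakdown above.
step_us = {
    "steps 1+2+6+7": 150,   # remaining framework overhead
    "step 3": 500,          # cache maintenance before processing
    "step 4": 21000,        # activate + decode one frame + deactivate
    "step 5": 450,          # cache maintenance after processing
}

total_us = sum(step_us.values())                      # 22100 us per frame
share = {k: 100.0 * v / total_us for k, v in step_us.items()}

cache_us = step_us["step 3"] + step_us["step 5"]      # 950 us of cache work
overhead_us = total_us - step_us["step 4"]            # 1100 us of total overhead

print(f"step 4 share: {share['step 4']:.1f}%")        # ~95.0%, dominated by the decode
print(f"cache share:  {100.0 * cache_us / total_us:.1f}%")
```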
    - Gil
  • This article explains a bit more about why cache is a concern, and the operations necessary for frameworks like Codec Engine:
         
    http://processors.wiki.ti.com/index.php/Cache_Management

    If you're the algorithm author, and you know some of these cache management overheads are unnecessary, the CE Overhead article you reference contains several techniques for optimizing/eliminating the cache management happening within CE.

    Chris

  • Gil,

    Thank you for giving more details. Still, I'm not sure this is a good example to show the cache overhead, because the overhead is much lower than the processing time for the algorithm. In this case I'd use another example (e.g. universal copy), which implements a faster algorithm (memcpy) relative to the Inv/Wb buffer management. What's your opinion?

    Regards,
    gaston

  • Gaston:

        Overhead time applies to anything other than the algorithm processing time, so it's best to use a real-world example, in this case a video decoder, to see realistic cache overhead percentages.

        Using universal copy in place of the *algorithm* would only change the cache overhead percentages, not the actual cache overhead time, assuming the I/O buffers are the same size and number. 

        It seems to me the video decoder example is better because it is a *real* algorithm, which nevertheless shows a non-negligible impact due to cache operations.
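The point that a faster algorithm only inflates the percentage, not the absolute cache cost, can be sketched numerically. In this sketch the 50-microsecond copy time is a made-up figure for a memcpy-style universal copy; the cache and framework times are the fixed values from the benchmark, since they depend on buffer size and count, not on the algorithm:

```python
CACHE_US = 500 + 450   # Inv + Wb times from the benchmark, fixed by the I/O buffers
OTHER_US = 150         # steps 1+2+6+7, the remaining framework overhead

def cache_share(algorithm_us):
    """Percentage of total frame time spent on cache maintenance."""
    total = CACHE_US + OTHER_US + algorithm_us
    return 100.0 * CACHE_US / total

# Real algorithm: video decoder at ~21 ms per frame.
print(f"video decoder:  {cache_share(21000):.1f}% cache overhead")  # ~4.3%
# Hypothetical fast copy at 50 us: the same 950 us of cache work dominates.
print(f"universal copy: {cache_share(50):.1f}% cache overhead")     # ~82.6%
```

Either way the cache operations cost 950 microseconds per frame; only the denominator changes.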

    Regards,

    - Gil