C6748 Benchmarking: L1P Cache

I am evaluating several candidate DSPs, and I noticed that the C6748 thrashes its L1P cache during some of the benchmarks.  On investigation, the Run-Time Support (RTS) library calls turned out to be causing L1P conflicts with the floating-point math benchmarks, since FP divide operations are handled by RTS functions.  Our target application will have to perform FP math throughout much of the code, so this is a critical issue.  What solutions exist?

Thanks,

Clayton Gilmore

Software Engineer

Rockwell Collins, Inc.

  • Clayton,

    Are you saying that your benchmark code is being evicted from the cache so that the floating point code from the RTS library can be brought in? How large is the benchmark code? Does it occupy the entire cache?

    Regards, Daniel

  • Daniel Allred said:

    Are you saying that your benchmark code is being evicted from the cache so that the floating point code from the RTS library can be brought in?

    Yes.  That's exactly right.

    Daniel Allred said:

    How large is the benchmark code? Does it occupy the entire cache?

    Yes, the total benchmark code is over 100KB, which is greater than the 32KB of available L1P cache.

    It is broken up into much smaller functions, so that no single benchmark ever uses up all 32KB at once.  And they run in sequence, so they don't compete with each other for cache.  This is representative of what the target application will look like, as well.

    The problem is that, because L1P is direct-mapped and because the benchmark code is several times the size of L1P, the RTS library will conflict with some portion of the benchmark code no matter where I place it in the memory map.

     

  • Clayton,

    I think my immediate recommendation would be to convert half of the L1P to SRAM and modify your linker command file to place these particular floating-point RTS functions in that L1P SRAM.  Unfortunately, this isn't a simple rebuild, since the only way to get that code into L1P is by using the C674x core's internal DMA (IDMA).  This means that your floating-point code will have one load address in L2 or some external memory and a different run address in the L1P, which again would have to be specified in the linker command file.  Then, in your code and before starting your benchmark app, you need to relocate that code from its load address to the run address using the IDMA (a rough sketch follows below).

    Obviously this cuts your L1P cache in half, but if your individual benchmark chunks still fit in that space, that shouldn't be an issue.
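
    In case it's useful, here is a rough, untested sketch of that relocation step.  The symbols fp_rts_load_start, fp_rts_run_start, and fp_rts_size are placeholders you would create with LOAD_START()/RUN_START()/SIZE() on the relocated output section in the linker command file, and the register addresses are from memory, so verify everything against the C674x Megamodule Reference Guide (SPRUFK5) before relying on it:

    #include <stdint.h>

    /* C674x megamodule registers (verify addresses against SPRUFK5). */
    #define L1PCFG        (*(volatile uint32_t *)0x01840020)
    #define IDMA1_STAT    (*(volatile uint32_t *)0x01820100)
    #define IDMA1_SOURCE  (*(volatile uint32_t *)0x01820104)
    #define IDMA1_DEST    (*(volatile uint32_t *)0x01820108)
    #define IDMA1_COUNT   (*(volatile uint32_t *)0x0182010C)

    /* Placeholder symbols created in the linker command file with
       LOAD_START(), RUN_START() and SIZE() on the floating-point RTS
       output section (load address in L2 RAM, run address in L1P SRAM). */
    extern uint32_t fp_rts_load_start;
    extern uint32_t fp_rts_run_start;
    extern uint32_t fp_rts_size;

    void relocate_fp_rts(void)
    {
        /* Give up half of L1P: mode 3 should leave 16KB as cache and the
           remaining 16KB as SRAM (assumption; check the L1PCFG description). */
        L1PCFG = 3;
        (void)L1PCFG;          /* read back so the mode change completes */

        /* IDMA only moves code between the megamodule's internal memories,
           so the load section must sit in L2 RAM.  Source, destination,
           and size must be word-aligned. */
        IDMA1_SOURCE = (uint32_t)&fp_rts_load_start;
        IDMA1_DEST   = (uint32_t)&fp_rts_run_start;

        /* Writing the byte count (rounded up to a word multiple) starts the
           transfer; priority/interrupt fields are left at zero here. */
        IDMA1_COUNT  = ((uint32_t)&fp_rts_size + 3u) & ~3u;

        /* Wait until the transfer is neither active nor pending. */
        while (IDMA1_STAT & 0x3)
            ;
    }

    Call this once at startup, before the first benchmark runs.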

    Regards, Daniel

     

    Daniel Allred said:

    I think my immediate recommendation would be to convert half of the L1P to SRAM and modify your linker command file to place these particular floating-point RTS functions in that L1P SRAM.

     

    Thanks for the suggestion.  That could work for the benchmark itself.  However, we'd like to avoid that method if at all possible, because we definitely won't want to cut our L1P in half for the target application.

    What about the cache layout tool?  Would it even be applicable to this situation?

    I'm not familiar with that tool, as it appears to be new in the 7.x compiler releases, but looking at the link you sent, it certainly seems to be intended for exactly the situation in front of you, so I would definitely recommend looking into it.  If things were linked so that the RTS floating-point routines always mapped to one part of the cache (say, the start of it) and the rest of your code was linked so that it could never map there, then my previous solution wouldn't be required.  But that sounds like it might be a pain to actually do.

    Clayton, if the benchmark code includes floating-point divides and you don't allow the RTS divide function to use the L1P cache, won't the performance of your benchmark be negatively impacted?
    That is, isn't the divide operation also contributing to your benchmark time? In that case, wouldn't you prefer the divide code to also get into the L1P cache?

    Please let me also provide some other information that may be of interest to you. As you know, TI provides optimized implementations of some of these math functions. See here: http://focus.ti.com/docs/toolsw/folders/print/sprc060.html
    The fastRTS library includes optimized implementations of many of these math functions that are otherwise invoked from the RTS (and are therefore slow).

    You may choose to include the fastRTS library in your project to realize the performance gain. It is simple (see the sketch after these two steps):
    * Add the library to your PJT.
    * Link the library at a higher link order than the RTS library, so its implementations are resolved first.
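
    To be clear, no source changes are needed: ordinary floating-point code keeps calling the standard operators and math functions, and the linker simply resolves them to the fastRTS implementations once the library sits ahead of the RTS library in the link order. A made-up example (which functions are actually covered depends on the fastRTS release):

    #include <math.h>
    #include <stdio.h>

    /* Plain C code: the divide below becomes a call to an RTS helper
       (there is no full-precision FP divide instruction on the C674x),
       and sqrt() resolves to whichever library the linker finds first,
       so placing fastRTS ahead of the RTS library speeds this up with
       no source changes. */
    double normalize(double num, double den)
    {
        return num / sqrt(den);
    }

    int main(void)
    {
        printf("%f\n", normalize(3.0, 4.0));   /* prints 1.500000 */
        return 0;
    }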

    Let me give some more information that may be of help. We will soon (in about two weeks) be releasing an update to the fastRTS library mentioned above. The update will include vector versions of the math functions, as well as inlinable implementations of the library functions. The attached benchmarking document shows the performance improvement that can be realized by inlining the various math operations. If this is of interest, we can provide you with an early drop of the library to use in your evaluation. If it is not urgent, please wait a bit and you should see the update posted on ti.com in October.

    Thanks,
    Gagan

    c67xfastRTS_Benchmarking.pdf
  • Thanks for the fastRTS info!  That chart is especially interesting.

    Gagan Maur said:

    Clayton, if the benchmark code includes floating-point divides and you don't allow the RTS divide function to use the L1P cache, won't the performance of your benchmark be negatively impacted?
    That is, isn't the divide operation also contributing to your benchmark time? In that case, wouldn't you prefer the divide code to also get into the L1P cache?

    Yes, you are correct.  I don't want to inhibit the RTS functions from using L1P cache.  I just want them to share the L1P nicely with the benchmark functions.  There's enough room in L1P for both of them.  In fact, most of the benchmarks share L1P with the RTS functions without a problem.

    But there's always one or two benchmarks that thrash L1P.  These are the ones that happened to be allocated a multiple of 32KB away from the RTS functions.  They're competing for the same cache lines as the RTS functions, because L1P is direct-mapped as opposed to set-associative.  So the ideal solution would be to allocate the RTS functions to locations in memory that don't conflict with the benchmark functions for the same L1P lines.
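
    To spell out the condition: a block of code at address A maps to offset (A mod 32KB) in the direct-mapped L1P, so two blocks conflict exactly when their address ranges overlap modulo 32KB. Here's a small sanity check I can run over a candidate placement (just a sketch using the 32KB figure from above; the addresses in main() are made-up examples):

    #include <stdint.h>
    #include <stdio.h>

    #define L1P_SIZE  (32u * 1024u)   /* 32KB direct-mapped L1P */

    /* Returns 1 if the code blocks [a, a+a_size) and [b, b+b_size) land on
       overlapping ranges of the direct-mapped L1P, i.e. they will evict
       each other whenever both are executed. */
    static int l1p_conflict(uint32_t a, uint32_t a_size,
                            uint32_t b, uint32_t b_size)
    {
        if (a_size >= L1P_SIZE || b_size >= L1P_SIZE)
            return 1;                    /* larger than the cache: always conflicts */

        uint32_t ao = a % L1P_SIZE;      /* offsets within the 32KB cache image */
        uint32_t bo = b % L1P_SIZE;

        /* Two arcs on a 32KB "circle" overlap iff either arc's start falls
           inside the other arc (unsigned wraparound keeps the mod correct). */
        return ((bo - ao) % L1P_SIZE) < a_size ||
               ((ao - bo) % L1P_SIZE) < b_size;
    }

    int main(void)
    {
        /* A benchmark placed exactly 64KB (a multiple of 32KB) past a 4KB RTS
           block conflicts with it; one placed 40KB past it does not. */
        printf("%d\n", l1p_conflict(0xC0000000u, 0x1000u, 0xC0010000u, 0x2000u)); /* 1 */
        printf("%d\n", l1p_conflict(0xC0000000u, 0x1000u, 0xC000A000u, 0x2000u)); /* 0 */
        return 0;
    }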

    The problem, however, is how to do that with >100KB of code.

    Thanks,

    Clayton Gilmore

    Software Engineer

    Rockwell Collins, Inc.

    Clayton, if you still need to tune your memory placements, let me suggest a simple solution that may work.
    First, note that you can control the placement of sections from any library like this:

    Your linker command file:

    MEMORY
    {
        /* Example only: choose an origin/length that fits your memory map.
           Here a 32KB region is carved out of DDR (0xC0000000 on the C6748). */
        YOUR_SPECIAL_MEMORY : origin = 0xC0000000, length = 0x00008000
    }

    SECTIONS
    {
        .yourSection:
        {
            "../../libs/fastrts67x.lib"(.text)
            "../../libs/YourLib.lib"(.text)
        } > YOUR_SPECIAL_MEMORY
    }

    As you can see, the above gives you a lot of control over how you place your code.
    * Define a contiguous 32KB chunk for each benchmark.
    * Keep the largest benchmark and the fastRTS .text section together in a single 32KB chunk.

    By doing this, you can hopefully avoid the conflicts.

    Regards,
    Gagan

    Gagan Maur said:

    Clayton, if you still need to tune your memory placements, let me suggest a simple solution that may work.

    Thank you.  I will implement this solution for now and will look into the cache layout tool as a possibility for the future.

    Also, I would like to take you up on getting an early drop of the new fastRTS library.  What do I need to do?

     

    Thanks again,

    Clayton Gilmore

    Software Engineer

    Rockwell Collins, Inc.

     

    Clayton, please contact your local TI support engineer. They will be able to release the SW to you ASAP.

    Regards,
    Gagan