DM6446 vs C674x performance comparison

Dear experts,

We have to use the DM6446 processor for one of our new projects, and I would like to get an idea of how it compares in performance to the C6748, since we are very familiar with that DSP. Our main concern is that the L2 cache on the DM6446 is significantly smaller than on the C6748.

C6748 key features:

- C674x architecture (combines the C64x+ and C67x+ instruction sets)

- Max clock speed: 456 MHz

- Up to 3648 MIPS

- Cache: 32KB L1P, 32KB L1D, 256KB L2

- RAM: mDDR/DDR2 (via dedicated mDDR/DDR2 controller)

DM6446 key features:

- C64x+ architecture

- Max clock speed: 594 MHz (for this particular platform)

- Up to 4752 MIPS

- Cache: 32KB L1P, 80KB L1D, 64KB L2

- RAM: mDDR/DDR2 (via EMIF)

In our current C6748-based product, we found that having 256KB of L2 memory (configured as 128KB internal RAM and 128KB cache) made a very significant difference in performance. How would performance compare on the DM6446?

Thanks in advance! 

  • Hello Reinier,

    What is the purpose of the L2 cache in your application? How much memory do you want to access from the L2 cache on the DM6446?

    There would definitely be a noticeable performance difference due to the difference in L2 memory. However, it depends on your application's L2 memory requirements.

    Regards,
    Senthil
  • Hi Senthil,

    Many thanks for your response.

    We have various signal processing kernels (FFTs, matrix multiplication, matrix inversion, etc.) executing, which make up most of the CPU load. The cache is mainly used to access the data to be processed as fast as possible. Furthermore, we also have a number of higher-level protocols executing, and data is frequently moved around in external memory, so cache also speeds up the general execution time of the system.

    On the C6748-based system we have 64MB of mDDR external memory, but on the DM6446-based system we will have 256MB of DDR2 (although we will not nearly be using all of the external memory).

    - More or less how drastic will the difference in performance be?
    - Will the higher clock rate on the DM6446 somewhat compensate for the lack of L2 cache?
    - What is the effect of the larger L1D cache (32KB vs 80KB)?
  • Hi Senthil,

    Any feedback available yet?

  • Hello Reinier,

    We do not have performance details for the DM6446 device. The overall performance does not depend solely on the cache memory; it also relates to the operating frequency and the peripherals used. The smaller L2 memory will make some difference, but we do not have a concrete figure to share with you.

    I would say the higher clock rate may not completely compensate for the smaller L2 cache, but it will improve performance compared to a lower frequency.

    Regards,
    Senthil
  • Hi Senthil,

    Yes, I am aware that performance does not depend on the cache alone, but in this particular case all the other parameters that affect performance are very similar on both platforms; the only significant difference is the cache sizes.

    Furthermore, I am assuming that the cache architecture of the DM6446 and the C6748 is very similar, since both have a C64x+ core. Maybe this discussion then collapses into how the cache performance of the architecture is affected by changing the L1/L2 cache sizes, assuming that all the other parameters of the processor remain constant.

    Can you then please refer me to someone who specializes in the C64x+ cache architecture?
  • Hi Senthil,

    Any news yet?

  • Hello Reinier,

    I forwarded this query to the right expert. He will get back to you soon.

    Thanks for your patience until then.

    Regards,
    Senthil
  • Thank you Senthil, much appreciated!
  • ReinierC said:
    - More or less how drastic will the difference in performance be?

    This is not possible to quantify in general terms.  It would need to be benchmarked for a very specific application, or at the very least we would need to consider some very specific details of a given application.

    ReinierC said:
    We have various signal processing kernels (FFTs, matrix multiplication, matrix inversion, etc.) executing, which make up most of the CPU load. The cache is mainly used to access the data to be processed as fast as possible.

    Can you provide a few more details:

    • Size of the data sets (element size and number of elements)
    • Number of threads, i.e. does all the processing occur in a single thread/task, or are there multiple concurrent processing threads?
    • General flow of data.  For example, is this conceptually an input data stream where the output from one processing block becomes the input to the next?
    • Is there any particular algorithm that currently takes up an inordinate share of the overall CPU load?

    ReinierC said:
    - Will the higher clock rate on the DM6446 somewhat compensate for the lack of L2 cache?

    It might help a bit, though as a general guideline I tend to view CPU performance and memory system performance independently, and there should be a "balance".

    ReinierC said:
    - What is the effect of the larger L1D cache (32KB vs 80KB)?

    The DM6446 has 80KB of "L1 memory".  Up to 32KB of this can be used as cache.  The remainder is used as SRAM (single cycle access).  The original use case that drove this architecture was H.264 encoding/decoding.  That particular algorithm required 64KB of dedicated single cycle access SRAM in order to meet real-time constraints.  So it was typical to use 16KB of L1D cache, plus 64KB of L1D SRAM in order to get enough performance to complete the video processing in real time.

    For most applications I would generally recommend a split of 32KB cache + 48KB SRAM.  This being the case, there is actually a clear advantage of the DM6446 compared to the C6748: you have an extra 48KB of L1D SRAM that can be used for anything you want.  So here is where the "application specific" part of things comes into play.  Is there some block of data that will fit into 48KB that you use REALLY FREQUENTLY?  I think whether you're able to use the extra L1D SRAM effectively will likely be the "tipping point" in terms of whether your performance gets better or worse.  If there's some data set that is a perfect fit to put here, you might actually find your performance gets much better!  Again, it's all very application specific.
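
    To make that concrete, here is a rough sketch (my own illustration, not from a TI example; the section name and buffer are made up) of how a hot data block could be pinned into that extra L1D SRAM with the TI compiler:

        /* Hypothetical example: place one heavily used table in the extra 48KB
         * of L1D SRAM.  ".l1dsram" is a made-up section name; your linker
         * command file (or BIOS platform) must map it onto the L1D SRAM
         * address range from the DM6446 datasheet, e.g.:
         *
         *     SECTIONS { .l1dsram > L1DSRAM }
         */
        #pragma DATA_SECTION(hotTable, ".l1dsram")
        #pragma DATA_ALIGN(hotTable, 8)
        short hotTable[16384];   /* 32KB table, single-cycle access once placed */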

  • Hi Brad,

    Many thanks for your detailed response.

    I am gradually realizing now just how difficult it is to quantify this and how specific it is to the application.

    Size of the data sets (element size and number of elements)

    Just to give you some context, our system is a really complex modem, with well over 1 million lines of code, and which algorithms execute can vary widely depending on the modem's configuration. I also have to correct myself here: the data to be processed from external RAM is actually not so much of an issue compared to the program code that has to be moved from external memory. The data blocks to be processed are generally not more than 4KB, and typically less than 1KB.

    Number of threads, i.e. does all the processing occur in a single thread/task, or are there multiple concurrent processing threads?

    We are running DSP/BIOS or SYS/BIOS and have over 30 tasks in the system. However, on a periodic basis only around 6 or 7 tasks execute and make up the bulk of the CPU load.

    General flow of data.  For example, is this conceptually an input data stream where the output from one processing block becomes the input to the next?

    Yes, audio samples come in and are processed in various stages, with the output of one processing block becoming the input to the next, i.e. I assume that a significant amount of the data stays local.

    Is there any particular algorithm that currently takes up an inordinate share of the overall CPU load?

    No, I would not say a particular algorithm takes up an inordinate share of the CPU load. It is usually more a case of a number of smaller kernels being executed repeatedly.

    The DM6446 has 80KB of "L1 memory".  Up to 32KB of this can be used as cache.  The remainder is used as SRAM (single cycle access). 

    Now this is interesting; this was not very clear in the datasheet. Only after you mentioned it did I find it tucked away in Table 2-1. This is quite a nice feature: it means that we can use the 48KB of L1D as internal RAM and the entire L2 as cache, definitely an advantage. However, in our case my initial feeling is that this will not have a drastic effect on performance, because our blocks of data are never nearly THAT big and nothing specific executes THAT frequently.

    When considering memory performance, would these two configurations be comparable:

    - 32KB L1P cache, 32KB L1D cache, 128KB L2 cache, 128 KB L2 internal memory, 16-bit DDR2 @ 150 MHz [C6748]

    vs

    - 32KB L1P cache, 32KB L1D cache, 48KB L1D internal memory, 64KB L2 cache, 32-bit DDR2 @ 164 MHz [DM6446]

    I hope this gives you a better overview of our system?

  • Based on the fact that you have over a million lines of code and multiple concurrent threads, it sounds like program cache is a key concern.  Here's a crazy thought that comes to mind...  Given your relatively small data requirements, do you think you could fit ALL data into 80KB of L1 SRAM?  If you could literally fit ALL data (structures, stacks, lookup tables, etc.) into 80KB of SRAM then you could configure the L1D cache to be 0KB and use all 80KB as dedicated L1D SRAM.  This is definitely "pie in the sky", but I thought I would throw it out there as a possibility (a rough configuration sketch follows the list below).  The benefits of this would be two-fold:

    1. All data would be accessible as single-cycle access.  Pretty awesome.
    2. Since L2 cache is unified (code and data), if you can fit ALL data into L1D SRAM then you would never run into scenarios of data evicting program code.  This could potentially help "stretch" the 64KB of L2 cache to work a bit better...
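
    For what it's worth, here is a minimal sketch of that cache setup, assuming DSP/BIOS 5's BCACHE API (double-check the exact struct and enum names against the BIOS API reference):

        #include <std.h>
        #include <bcache.h>

        /* Sketch: 0KB L1D cache so all 80KB of L1D becomes SRAM, full 32KB
         * L1P cache for code, and all 64KB of the DM6446's L2 as cache. */
        Void configAllL1dAsSram(Void)
        {
            BCACHE_Size size;

            size.l1psize = BCACHE_L1_32K;
            size.l1dsize = BCACHE_L1_0K;
            size.l2size  = BCACHE_L2_64K;

            BCACHE_setSize(&size);
        }

    The data itself would then be placed into the L1D SRAM address range through the linker, as in the earlier pragma sketch.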

    ReinierC said:

    - 32KB L1P cache, 32KB L1D cache, 128KB L2 cache, 128 KB L2 internal memory, 16-bit DDR2 @ 150 MHz [C6748]

    vs

    - 32KB L1P cache, 32KB L1D cache, 48KB L1D internal memory, 64KB L2 cache, 32-bit DDR2 @ 164 MHz [DM6446]

    I might suggest you try a test on the C6748; a rough configuration sketch follows below.  In particular, you should be able to implement the following configuration:

    - 32KB L1P cache, 32KB L1D cache, 64KB L2 cache, 48 KB L2 internal memory, 16-bit DDR2 @ 150 MHz [C6748]

    The intent of this test is to "bound" your performance.  The L1P cache, L1D cache, and L2 cache would be exactly the same as on the DM6446.  The 48KB of L1 SRAM on the DM6446 would be replaced by 48KB of L2 SRAM (worse performance) on your C6748.  If this test is successful (i.e. if you can still meet real-time in this configuration) then you can know without a doubt that the DM6446 is going to be just fine.  If it doesn't work, well, there are quite a few things that will be improved compared to this configuration:

    1. That 48 KB of L2 SRAM will become 48KB of L1 SRAM.
    2. The 16-bit DDR2 @150 MHz will become 32-bit DDR2 at 164 MHz.  (This will definitely help with respect to L2 cache misses.)
    3. The C674x core at 456 MHz will become a C64x+ core at 594 MHz.
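
    For reference, the bounding configuration could be applied at startup with the same kind of BCACHE call as in the earlier sketch (again assuming BIOS 5's BCACHE module):

        #include <std.h>
        #include <bcache.h>

        /* Sketch: shrink the C6748's L2 cache to 64KB to mimic the DM6446.
         * The rest of L2 remains SRAM; for the experiment, restrict the
         * application to using only 48KB of it. */
        Void configBoundingTest(Void)
        {
            BCACHE_Size size;

            size.l1psize = BCACHE_L1_32K;
            size.l1dsize = BCACHE_L1_32K;
            size.l2size  = BCACHE_L2_64K;

            BCACHE_setSize(&size);
        }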

    By the way, we've been talking about cache all this time and not the CPU core.  Are ALL of your algorithms fixed point?  If so, I don't think you'll see any difference at all moving "back" to the C64x+ DSP.  However, if you have any floating point code as part of your software, you'll see quite a performance hit there, since you'll lose the instructions for doing floating point in hardware.

  • Thanks for your feedback and ideas, much appreciated!

    Brad Griffis said:
    Here's a crazy thought that comes to mind...  Given your relatively small data requirements, do you think you could fit ALL data into 80KB of L1 SRAM? 

    Although all the data of a particular configuration/scenario might fit completely into 80KB, the problem is that we have many different types of configurations, so the combined effect is that the total data is way more than 80KB. To compensate for this, we have a memory manager and a heap in internal RAM, so that we can dynamically allocate memory for critical data.
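
    Roughly, the allocation pattern looks like this (a sketch only: MEM_alloc is the DSP/BIOS 5 heap API, and IRAM is the internal-RAM segment name from our BIOS configuration, so both are specific to our setup):

        #include <std.h>
        #include <mem.h>

        extern Int IRAM;   /* internal-RAM heap segment defined in the BIOS config */

        /* Allocate a critical working buffer from the internal-RAM heap. */
        Ptr allocCritical(Uns nbytes)
        {
            Ptr buf = MEM_alloc(IRAM, nbytes, 8);

            return (buf == MEM_ILLEGAL) ? NULL : buf;
        }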

    Out of curiosity, if L1D is completely used as internal RAM (i.e. 0KB L1D cache) and data has to be moved from external RAM to the CPU, it would then basically pass directly from L2 cache to the CPU just at a significantly slower rate, right? 

    Brad Griffis said:

    I might suggest you try a test on the C6748.  In particular, you should be able to implement the following configuration:

    - 32KB L1P cache, 32KB L1D cache, 64KB L2 cache, 48 KB L2 internal memory, 16-bit DDR2 @ 150 MHz [C6748]

    Yes, I actually had exactly this test in mind. Given our conversation, I am confident that we should meet the performance requirements.

    Brad Griffis said:
    The 16-bit DDR2 @150 MHz will become 32-bit DDR2 at 164 MHz.  (This will definitely help with respect to L2 cache misses.)

    How much faster will 32-bit DDR2 be compared to 16-bit DDR2? I am assuming it will definitely not be a factor of 2?

    Brad Griffis said:
    Are ALL of your algorithms fixed point?

    Yes, well, almost every algorithm, since we initially came from the C6418 (fixed-point only). I know floating-point emulation is slow.

  • ReinierC said:
    Out of curiosity, if L1D is completely used as internal RAM (i.e. 0KB L1D cache) and data has to be moved from external RAM to the CPU, it would then basically pass directly from L2 cache to the CPU just at a significantly slower rate, right? 

    Yes, the best case in this scenario would be if the data is already contained in the L2 cache, in which case you would be looking at 12-13 cycles for each access.  In the case of an L2 cache miss it would be much worse.

    ReinierC said:
    How much faster will 32-bit DDR2 be compared to 16-bit DDR2? I am assuming it will definitely not be a factor of 2?

    Throughput will double though latency will NOT be cut in half.  This is because timings related to things like activating a bank/row will remain the same regardless of the width.  So that initial activation time won't improve, however, the speed at which you're reading the data will increase.  The L2 cache operates on line sizes of 128 bytes, though those lines are broken into 4 32-byte segments.  When there is a cache miss the segment containing the data of interest is fetched first.  So ignoring the bank/row activation times, the actual throughput on the bus will double.  Transferring 32 bytes on a 16-bit interface corresponds to 8 full clock cycles (i.e. 53ns at 150 MHz).  For the DM6446 this will become 4 clock cycles at 164 MHz, i.e. 24ns.  CAS latency and tRCD are each in the neighborhood of 13-15ns, so I think this will be noticeable.
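
    Writing out the arithmetic for one 32-byte segment fill (DDR2 transfers two data words per clock):

        16-bit @ 150 MHz: 32 B / (2 B x 2 per cycle) = 8 cycles, 8 / 150 MHz = ~53 ns
        32-bit @ 164 MHz: 32 B / (4 B x 2 per cycle) = 4 cycles, 4 / 164 MHz = ~24 ns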