
Very slow data exchange with mDDR on C6748 SOM on Experimenter Kit



Hello,

I am running a C6748 SOM on an L138-EVM / Experimenter Kit from Logic PD. I have been porting over some code that has some C++ segments. At one point, for debugging purposes, some data is loaded into two arrays:

  for(t_int i = 0; i < m_idx_end; ++i)
  {
      m_p_begin_real[i] = sample;
      m_p_begin_imag[i] = 0.0;
  }

Both of these arrays are of type "float" and are placed in mDDR. m_idx_end is 4,194,304 (x2 arrays makes 8,388,608 floats). The system clock is set for 300 MHz, the mDDR is set for 132 MHz, there are no optimizations, and we are running a full symbolic debug build through an XDS510PP-plus. There are no power-saving settings, and this function runs on a "thread" that was dynamically allocated from main. It takes about 8 seconds for the load to occur in this debug mode, which represents a load rate of about 1M floats/sec.

I have gone to the other extreme and run with -o3 optimization and no debug, booted from the NOR flash (i.e. with no emulator involved), and tried numerous other optimization settings, and maybe only a couple of seconds were shaved off.

Is this normal?

I am hoping there may be some setup parameter I am missing that will make it "all better"!

Any help is appreciated.

Dan.

  • Dan,

    This sure sounds like awful performance, but there are a few things to examine. I will say that single CPU reads and writes to a burst-oriented bus like mDDR are often less efficient than accesses driven by a burst-oriented master such as the cache controllers or EDMA3. This example would be a perfect place to use a DMA channel to write these values. The ideal case would be to copy repeatedly from 128-float buffers in internal memory to the mDDR destination, along the lines of the sketch below.
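
    Purely to show the shape of that staging pattern (untested; BLOCK, the stage_* buffers, and the use of memcpy() as a stand-in for the actual EDMA3/QDMA transfer are all mine):

    #include <string.h>

    #define BLOCK 128                  /* floats per staged transfer */

    static float stage_real[BLOCK];    /* staging buffers - place these in */
    static float stage_imag[BLOCK];    /* internal RAM (DATA_SECTION pragma) */

    t_int i, j;

    /* Build the pattern once in fast on-chip memory. */
    for(j = 0; j < BLOCK; ++j)
    {
        stage_real[j] = sample;
        stage_imag[j] = 0.0f;
    }

    /* Push it out block by block; memcpy() here is only a placeholder for the
       EDMA3/QDMA submit that would free the CPU during the transfer. */
    for(i = 0; i + BLOCK <= m_idx_end; i += BLOCK)
    {
        memcpy(&m_p_begin_real[i], stage_real, sizeof(stage_real));
        memcpy(&m_p_begin_imag[i], stage_imag, sizeof(stage_imag));
    }

    /* Finish any tail that is not a multiple of BLOCK. */
    for(; i < m_idx_end; ++i)
    {
        m_p_begin_real[i] = sample;
        m_p_begin_imag[i] = 0.0f;
    }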

    Do you have cache enabled in both L1D and L2? And L1P, too?

    The following code tries to make better use of the cache by touching each array first through a volatile read, which allocates a cache line in L1D. The allocation is serviced as a burst read, and the subsequent writes leave the cache as burst write-backs. The net should be some improvement.

    Cache-line allocation:
    for(t_int i = 0; i < m_idx_end; i+=8)
    {
      volatile float t1 = m_p_begin_real[i];
      volatile float t2 = m_p_begin_imag[i];

      for(t_int j = 0; j < 8; ++j)
      {
          m_p_begin_real[i+j] = sample;
          m_p_begin_imag[i+j] = 0.0;
      }
    }

    Since your two huge arrays are going to be allocated at the same mDDR page boundaries, they may be thrashing the bank select lines, which is not very efficient. You might get much better performance just splitting the for loop into separate loops for each so they can use the mDDR device features more efficiently.

    Separate loops for bank efficiency:
    for(t_int i = 0; i < m_idx_end; i++)
    {
          m_p_begin_real[i] = sample;
    }
    for(t_int i = 0; i < m_idx_end; i++)
    {
          m_p_begin_imag[i] = 0.0;
    }

    A combination might do better still. You can also try intrinsics like _memd8 to write two values at a time to the same array.

    Separate loops with intrinsics:
    #include <c6x.h>

    /* _ftod() packs two floats into a register-pair double without
       converting them, so the stored bit patterns are the float values. */
    for(t_int i = 0; i < m_idx_end; i+=2)
    {
        _memd8(&m_p_begin_real[i]) = _ftod(sample, sample);
    }
    for(t_int i = 0; i < m_idx_end; i+=2)
    {
        _memd8(&m_p_begin_imag[i]) = _ftod(0.0f, 0.0f);
    }

    But in the end, your most efficient memory activity will be to use DMA or QDMA to get the data in and out of internal memory while doing other things with the DSP.

    Please let me know if any of my syntax is wrong (post the right way). And also please let me know how these improve your performance, or not.

    Regards,
    RandyP

  • Hi Randy,

    Thanks very much for your advice. I have not spent much time looking into the caching side of it. I have started reading SPRUFK5, and I expect this is where I need to start! The memory map indicates no usage of the L1 segments for anything, although since the cache is configured at run-time, I don't suppose I would necessarily see anything show up there. I have been looking for ways to "enable" the cache; I did not see anything in the DSP/BIOS configuration, so I assume we just use the register settings as specified in SPRUFK5. I tried to look for include files or libraries that would have some of the register definitions like:

    L1PCFG

    L1PCC

    etc., but have not found them. Could you suggest what includes or kits I should attach to make these registers syntactically available in C?
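
    (Worst case, I suppose I could map them by hand from the addresses in SPRUFK5, something like the sketch below - assuming I am reading the register addresses correctly - but I would rather use an official header if one exists.)

    /* Hand-mapped cache control registers; addresses taken from SPRUFK5. */
    #define L2CFG   (*(volatile unsigned int *)0x01840000)
    #define L1PCFG  (*(volatile unsigned int *)0x01840020)
    #define L1PCC   (*(volatile unsigned int *)0x01840024)
    #define L1DCFG  (*(volatile unsigned int *)0x01840040)
    #define L1DCC   (*(volatile unsigned int *)0x01840044)

    void enable_l1_caches(void)
    {
        L1PCFG = 4;        /* mode 4 = 32K L1P cache */
        L1DCFG = 4;        /* mode 4 = 32K L1D cache */
        (void)L1DCFG;      /* read back so the mode change completes */
    }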

    Thanks in advance.

    Dan.

  • Hi Randy,

    OK...I found under System => Global-Settings => properties in the TCF file for DSP/BIOS configuration a "64plus" TAB that allows settings for:

    L1PCFG, L1DCFG, and L2CFG. Both L1s were already set to 32k and L2 was set to 0k. When I tried to save the setting with L2 > 0k (in 32k increments), the configuration complained about a conflict (overlap) in IRAM (a.k.a. L2 RAM); however, according to the memory map there was plenty of space available, so I am at a loss.

    Since so much of this code is running in mDDR I'm guessing I'd better get that L2 cache running!

    Any thoughts would be great.

    Dan.

  • When you use the DSP/BIOS GUI to change the L2 Cache size from 0k to 32k, for example, what changes do you see in either the System->Memory list or in the tcf file contents? Do you see any memory sections that overlap after making these changes?

  • Hi Randy,

    OK...I have the L2 cache sorted out and I have gone through your suggested optimizations.

    Regarding the memory sections in the GUI: I failed to notice last night that another memory section was set up that did indeed conflict with "IRAM". I made space for a 64k-byte cache in L2.

    Regarding optimizations:

    I used the compile flags: -o3, -ms0, -mt, -pm, -mf5

    With this combination running with the XDS510PP-plus we got ~2.5 seconds for the original loop.

    With the cache-line allocation you suggested we got ~2 seconds.

    With the separate loops for bank efficiency we got ~3 seconds (yes, more).

    With the separate loops with intrinsics we got ~1.5 seconds.

    Since the separate loops made it worse than the original configuration, I tried using the intrinsics approach in a single loop:

      for(t_int i = 0; i < m_idx_end; i+=2)
      {
          _memd8(&m_p_begin_real[i]) = _ftod(sample, sample);
          _memd8(&m_p_begin_imag[i]) = _ftod(0.0f, 0.0f);
      }

    And that was about 1.2 seconds (the best so far). While it is good to see some improvement, if I look at the memory clock cycles vs. throughput I see:

    8,388,608(floats)/1.2(sec) = 6,990,506 floats/sec. If we use the 132MHz mDDR clock as a reference 132,000,000/6,990,506 = ~19 clocks per float. If we use the 300MHz DSP as a reference it is ~43 clocks per float.

    Beyond this simple debug test, the main program has a lot of large-ish (though not as large as the debug test) arrays in mDDR, with many individual elements to be multiplied, divided, and added (with other array elements) in a round-robin fashion inside a loop, so even if I move to the DMA approach, I don't know whether that will help in the long run.
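
    (For the sake of discussion, the loops look roughly like this deliberately simplified, made-up fragment, with every array living in mDDR:)

    /* Illustrative only - not the actual application code. */
    for(t_int i = 0; i < n; ++i)
    {
        out[i] = (a[i] * b[i] + c[i]) / d[i];
    }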

    What additional gains might be expected (with this debug case) using DMA?

    Thanks for all of your help.

    Dan.

  • Your performance still seems lower than I would expect.  How did you configure the MAR bits on the BIOS cache configuration page?  You might want to see this wiki page for reference:

    http://processors.wiki.ti.com/index.php?title=Enabling_64x%2B_Cache
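
    If you would rather do it in code than on the configuration page, the MAR registers can also be written directly at startup. A rough, untested sketch (MAR block base address per SPRUFK5; I am assuming 128MB of mDDR on your SOM, so adjust the loop count to match your board):

    /* Each MAR bit covers a 16MB window; the mDDR window at 0xC0000000
       corresponds to MAR192. Bit 0 (PC) permits caching of that window. */
    #define MAR ((volatile unsigned int *)0x01848000)

    void make_ddr_cacheable(void)
    {
        int i;
        for(i = 192; i < 200; ++i)    /* 8 x 16MB = 128MB of mDDR */
        {
            MAR[i] = 1;
        }
    }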

     

  • Hi Brad,

    You have just made me a very happy man! And on a Friday no less.

    Yes, it definitely helps when the cache controller knows which addresses it is allowed to cache (I had not set the MAR bits). The same mDDR storage cycle now takes ~0.2 seconds. We now have a reasonable hope of converting the main program into something useful; I was beginning to worry.

    Thank you Brad and thank you Randy!

    Have a great weekend.

    Dan.