Running ECU (VoLIB 2.0) on multiple channels in real time

Leonardo Trierveiler

Other Parts Discussed in Thread: TMS320C6657

Hi,

I'd like to run the ECU from VoLIB 2.0 in real time on multiple channels (70, to be exact). Right now I'm using the TMS320C6657, but we'll use the C6746 for our future custom boards.

For data reception, I'm using the McBSP + DMA. Running the data received in real time through ECU, I successfully canceled echo from one end of the telephone line (1 channel). Now I need to expand this.

I have some questions regarding the ECU:

Which modifications are necessary to run the ECU on multiple channels? In "ecusimfunc.c", line 806, I see "ecuSim->num_channel = 1; ". I changed this value but didn't see any difference. When debugging, I noticed that in "siu.c", line 129, "SIU_MAX_CHANNELS" is set to 1. Does this value need to be modified too?

When I ran the code in real time, I noticed that while the ECU was processing the data, I received 30 RX events from the McBSP (the frame sync is 8 kHz, i.e. 125 us). This means that the code is taking around 3.75 ms to run. This is far too long, and it will be impossible to run more than 2 channels at once (given that the ECU works with 10 ms frames). I'm using only the ".c" source files from the ECU project.
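The estimate above can be written out as a short calculation. This is only a sketch: the helper names are made up for illustration, and it assumes that each McBSP RX event corresponds to one 125 us frame-sync period and that nothing else runs during the 10 ms frame.

```c
#include <assert.h>

/* Frame-sync period of an 8 kHz TDM link, in microseconds. */
#define FSYNC_PERIOD_US 125u

/* Elapsed time in microseconds, estimated from the number of RX events
 * counted while one ECU call was running (hypothetical helper). */
static inline unsigned elapsed_us(unsigned rx_events)
{
    return rx_events * FSYNC_PERIOD_US;
}

/* Number of channels that fit in one frame if every channel costs
 * per_channel_us and nothing else runs (hypothetical helper). */
static inline unsigned channels_per_frame(unsigned frame_us,
                                          unsigned per_channel_us)
{
    assert(per_channel_us > 0u);
    return frame_us / per_channel_us;
}
```

With 30 events, `elapsed_us(30)` gives 3750 us, and `channels_per_frame(10000, 3750)` gives 2, which matches the figures quoted in the post.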

I assume that the CCS test project that comes with the ECU is already optimized. So, what do I need to do to reach better performance? Every buffer is allocated in the L2SRAM.

Thank you in advance,

Regards,

Leonardo Batista

over 12 years ago

bogdank over 12 years ago

TI__Intellectual 800 points

Hello Leonardo,

You have some great questions there. Before we get to the details we need to step back and look at a "few" important things:

The ECU component has the component code and the unit-test/simulation code.
You can use unit-test code to "simulate" how ECU would behave in real world. You can also use it to profile ECU component execution in order to get information on how many CPU cycles a single channel would use under various conditions. You can use it even further to start modifying it to better reflect your application or to test it under conditions that may not be supported in the unit-test code.
The unit-test code is designed to operate with a single channel of ECU component. It cannot be made to run multiple channels. To get to multiple channels, you have to design your own DSP framework for the application you are trying to implement. For example, you may need to design Voice Gateway framework where you would fit the ECU component. Not the unit-test code, just the component code. In other words, you will need to integrate the ECU component into the Framework you would create.
The Framework you create will determine how many channels you can squeeze out of the ECU component. There are some limitations and I will get to them next.
In order to understand what you can or cannot do with the ECU component or any other for that matter you need to profile it. You need to profile all relevant functions and obtain the CPU cycle counts for 10ms frames (if your voice processing is done on 10ms frames). Based on your clock speed you will be able to estimate "time" it takes to execute those functions or %-MIPS or MHz you would spend on running those functions.
Some functions would have more or less fixed execution time, some will depend on input and/or current state of the ECU algorithm.
You need to learn about Echo Canceling algorithms in general in order to get the best performance out of them.
You will find that there are Initialization/control functions and Real-time functions. Init/control functions are usually not much of a problem: they are called at effectively random times and only contribute to the peak load. Real-time functions are the ones you need to be aware of.
In your case that would be ecuSendIn() or similar. This function performs several operations:
a) Echo Removal
b) Adaptive Filter Calculations (including search filter update in case of long tail configuration)
c) Signal Energy measurements and Instrumentation
d) DT/NLP logic and NLP/CNG application
Soon you will find that c) and d) are more or less fixed time. a) has to be done all the time and depends on the configured filter length. b) does not have to be done all the time and also depends on the filter length.
If you profile execution time depending on the filter length you will be able to get a linear function of how cycles would depend on filter length for some of these functions.
Those cycles would be your raw information regarding how much you need to spend on most of the work inside ECU. The echo removal and filter updates are where the most of cycles are being spent.
Let's assume for the sake of argument (not real numbers, but just example illustrating the concept), that echo removal takes 4 MIPS and filter update takes 10 MIPS (for short-tail 32ms echo tail - no need for search filter). Your total MIPS would be about 14/ch. If you have 100 MIPS available for ECU channels that would result in 7ch density.
At this point you should notice that you will need to determine the budget for the ECU component processing within your Framework. The more you provide to the ECU, the more channels you will be able to process.
Now you may ask yourself how can I get more than 7ch for the above example? Our ECU component has all the hooks and configuration API to get you to accomplish more.
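The budget arithmetic in the example above can be sketched as follows. The function name is hypothetical, the 4/10/100 MIPS figures are the illustrative numbers from the post (not measured values), and the "update duty" parameter is an added assumption that models filter updates running only part of the time on average.

```c
#include <assert.h>

/* Channels that fit in budget_mips when echo removal always runs and the
 * filter update runs only update_duty_pct percent of the time on average.
 * Hypothetical helper; all inputs are whole MIPS. */
static inline unsigned channel_density(unsigned budget_mips,
                                       unsigned removal_mips,
                                       unsigned update_mips,
                                       unsigned update_duty_pct)
{
    /* Work in units of 1/100 MIPS to stay in integer arithmetic. */
    unsigned per_channel_x100 =
        removal_mips * 100u + update_mips * update_duty_pct;

    assert(update_duty_pct <= 100u);
    assert(per_channel_x100 > 0u);
    return (budget_mips * 100u) / per_channel_x100;
}
```

With updates always on, `channel_density(100, 4, 10, 100)` gives 7 channels. If updates were needed only 30% of the time (an arbitrary figure chosen for illustration), the same budget would give 14.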

Here are the methods that you will need to use and API that you will have to research (documentation and/or this forum):

First thing first: you have to properly profile cycles on your target platform for the real-time processing functions to determine what would really be your best channel density when running ECU functions (you are not supposed to measure unit-test code functions or your framework but only relevant ECU functions).
You should compare those numbers with those published by us for specific platform (if we failed to provide such numbers, please let us know and we will provide them).
If the density obtained using the "brute force approach" is OK, you are done. If not, keep reading...
ECU component has ability to automatically "slow down" the filter updates after certain depth of convergence has been achieved. This will buy you some cycles on average as you increase your density and start benefiting from stochastic nature of how multiple voice channels work and behave.
You have no control over when this slowdown occurs, but you can disable or enable it, if I am not mistaken.
ECU provides very reliable instrumentation regarding performance and environment in which it operates. For example, ERL, ERLE, observed signal levels, etc.
ECU provides APIs to freeze/unfreeze the updates.
Using those APIs you can design an Intelligent Algorithm that can "schedule" certain parts of ECU processing to be done or not done on certain channels, based on the need indicated by the reported performance and environmental conditions. That algorithm operates at a higher level within the Framework you create and may take into account other DSP algorithms and their MIPS consumption as well, e.g. Voice Codecs, etc.
That's where you need to understand what is good or bad for the EC's in general to perform one way or another. For example, low ERL means trouble and those channels receiving high levels of echo may need to get "higher priority" in doing filter updates. Large ERL+ERLE means probably good performance and you may reduce priority for those channels. (please do not confuse filter update priority with OS task priorities, these are different things...)
TI has developed a MIPS Agent that was used in our Voice Gateways. It uses instrumentation from the ECU channels to decide which channels get to do the updates at any given time and which do not. Given the stochastic nature of how calls arrive at the GW and when speech activity is present, you can benefit from it: in the example we used above it wouldn't be too hard to get 14ch to perform perfectly well instead of 7ch on the same system and the same MIPS budget, without any observable performance degradation.
You can design your own "MIPS Agent" that does a similar task, or you may wait for us to productize it and release it in a future release of VoLIB. Currently, we do not have a committed date for this release, but we are considering putting it on our roadmap for 2014.
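As a rough illustration of the scheduling idea described in this list, the sketch below picks which channels may run filter updates on a given tick. It is not TI's MIPS Agent and does not call the ECU API: the structure, field names and ranking rule (lowest ERL + ERLE first) are assumptions made for the example, and a real framework would fill the statistics from the ECU instrumentation and then apply the result through the freeze/unfreeze controls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-channel snapshot; in a real framework these values
 * would come from the ECU instrumentation. */
typedef struct {
    int erl_db;   /* echo return loss, dB */
    int erle_db;  /* echo return loss enhancement, dB */
    int in_call;  /* nonzero if the channel carries an active call */
} chan_stats_t;

/* Mark up to max_updates active channels with the lowest ERL + ERLE as
 * allowed to update (update_on[i] = 1); all others get 0.
 * Returns the number of channels selected. */
static inline size_t schedule_updates(const chan_stats_t *st, size_t n,
                                      size_t max_updates,
                                      unsigned char *update_on)
{
    size_t picked = 0, i;

    assert(st != NULL && update_on != NULL);
    for (i = 0; i < n; i++)
        update_on[i] = 0;

    while (picked < max_updates) {
        size_t best = n;  /* n means "no candidate found" */
        for (i = 0; i < n; i++) {
            if (!st[i].in_call || update_on[i])
                continue;
            if (best == n ||
                st[i].erl_db + st[i].erle_db <
                st[best].erl_db + st[best].erle_db)
                best = i;
        }
        if (best == n)
            break;        /* fewer candidates than update slots */
        update_on[best] = 1;
        picked++;
    }
    return picked;
}
```

The selection is a simple O(n * max_updates) scan, which keeps the sketch short; a production scheduler would likely also add hysteresis so that channels do not toggle on every tick.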

In summary:

First, focus attention on just the ECU functions, instead of the simulation code that provides support for showing how the ECU functions work. Profile the ECU functions and make sure you get numbers that are close to our advertised MIPS performance for the target you are using.
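A minimal profiling harness along these lines might look like the sketch below. All names are hypothetical. The cycle counter is passed in as a function pointer so the harness can be exercised on a host; on a C6000 target one would supply a function that returns the time-stamp counter (TSCL, exposed through c6x.h by the TI compiler) and a work function that wraps the single ECU call being measured. The counter read overhead is not subtracted here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*cycle_counter_fn)(void);  /* free-running cycle count */
typedef void (*work_fn)(void *ctx);          /* function under test */

/* Mean cycles per call of work(ctx) over n calls. Unsigned subtraction
 * keeps each delta correct across one wrap of the 32-bit counter. */
static inline uint32_t mean_cycles(work_fn work, void *ctx, unsigned n,
                                   cycle_counter_fn now)
{
    uint64_t total = 0;
    unsigned i;

    assert(work != 0 && now != 0 && n > 0u);
    for (i = 0; i < n; i++) {
        uint32_t t0 = now();
        work(ctx);
        total += (uint32_t)(now() - t0);
    }
    return (uint32_t)(total / n);
}

/* Host-side stand-ins, used only to exercise the harness off-target:
 * the fake work advances the fake counter by *ctx cycles. */
static uint32_t fake_now_value;
static inline uint32_t fake_now(void) { return fake_now_value; }
static inline void fake_work(void *ctx) { fake_now_value += *(uint32_t *)ctx; }
```

Measuring only the ECU call, rather than the surrounding buffer copies and companding, keeps the result comparable with published per-channel figures.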

Second, design a low-overhead Framework that can efficiently run multi-channel ECUs. The way you get data from the TDM interface will be important, as will staggering the channel signal streams in order to minimize end-to-end delay. You will need some simple OS kernel, or you can use our DSP BIOS, and you must assign "task" priorities carefully, etc. (that is the real-time design/programming aspect of it).

Third, consider learning more about how EC algorithms work in general, what is ERL, ERLE, etc., how it impacts the operation of EC's and how MIPS depends on various parameters like filter length, how well you think adaptive algorithm may converge for low vs. high ERL or low vs. high background noise or low vs. high speech levels, etc. That will provide you with insight into how to implement MIPS agent and intelligent scheduling for multi-channel framework.

Keep in mind that the more channels you try to squeeze in, the easier it will be to benefit from the stochastic nature of speech and call arrivals. A long-tail EC (>32ms tail) also adds the search filter to the consideration, which can also be controlled. In the end you may be scheduling "background updates" (of up to 32ms filter segment(s) which are used in echo removal) along with the "search filter updates" (used in finding out where to place filter segments for the adaptive filter). Those two operations take most of the MIPS and do not have to operate all the time on all channels; you would waste a huge number of cycles if you just let them run all the time. Once an EC channel achieves "steady state", the MIPS can drop significantly if you know how to recognize and control it. You just have to make sure you react quickly if the situation changes and reconvergence is needed or the echo path changes (which does not happen too often).

Finally, our EC can also detect "digital calls" or "4-wire calls" where adaptive filter does not need to be used and would not update during those calls. You just need to enable this. If you write your MIPS agent you could actually take this into account when calculating how many channels you want to do updates on during each "MIPS Agent Tick". There is absolutely no need to design your system for the "brute force approach" if using some of the configurability that ECU component provides for intelligent load control. Eventually, your system design will become memory bound and not MIPS bound. That is, you would run out of "internal memory" for critical buffers before you run out of MIPS. If you rely more on cache you just have to be careful to analyze your cache performance and buffer placements to make sure you are not hit too hard with cache penalties.

I hope this was helpful. Please let us know how you want to proceed and how we can be of further help.

Best Regards,
Bogdan

Leonardo Trierveiler over 12 years ago in reply to bogdank

Intellectual 385 points

Hi Bogdan,

Wow, I didn't expect such a complete answer, thank you very much for your attention!

I'm a rookie user of Texas' DSPs. I started using it 3 months ago and I'm learning slowly about it. It'll take me some time to process all of this information and study/implement them correctly.

Rookie question: the ECU component code is the one referenced in "ti/mas/ecu/ecu.h" and everything I need to make ECU work can be found in the External API, right?

I'll start there and build an alpha version of the ECU without worrying about the multichannel problem. Meanwhile, I'll study Echo Cancelers and their functionality before venturing into the optimization, since it seems that it'll be very complex and time consuming for me.

Again, thank you very much for your attention and your help.

Best Regards,

Leonardo Batista.

bogdank over 12 years ago in reply to Leonardo Trierveiler

TI__Intellectual 800 points

Yes. The external API is in ecu.h. The simulation code provides an example of how those functions may be used.

Regards,

Bogdan

Leonardo Trierveiler over 12 years ago in reply to bogdank

Intellectual 385 points

Hi Bogdank, I don't know if I should create another topic, but I'll use this one first.

So, I designed my own framework, a rather simple one, where I just configured the McBSP port, the EDMA and the ECU using the API.

First, I tested my code using only files from the system, and the echo was cancelled perfectly. Then I tried it in real time with 4 channels and it worked. I figured that I wouldn't need much effort from that point to reach the desired 70 channels. The documentation states that, for a 128ms filter length, the steady state takes 4.4 MIPS, so with a 450MHz processor I could get up to 102 channels (450 / 4.4), and for 64ms, 145 channels.

Using a brute force approach, I managed to get around 30 channels working without noticing problems in the echo cancellation on each channel. But I'm still far from what I should be able to get, even if it's not 100% optimized.

Here's the main loop I'm using (NC = number of channels, NS = number of samples (80)):

sinCmpBuf_ptr = (linSample *) sinCmpBuf;

for (i = 0; i < NC; i++)
{
    rinLinear_ptr = (tword *) &rb[80*(16+i)]; //rinBuffer pointer

    for (ii = 0; ii < NS; ii++)
    {
        teste = (tword) rb[80*i + ii];
        sinCmpBuf[ii] = muaTblAlaw[teste];
    }

    ecuSendIn (ecuInsts[i].ecuInst, (void *)sinCmpBuf_ptr, (void *)rinLinear_ptr, (void *)soutCmpBuf_ptr);

    muaTblAlawCmpr (NS, soutCmpBuf, (tint*) soutBuffer);

    for (ii = 0, j = 0; ii < NS; ii++, j+=2)
    {
        tb[80*i + ii] = soutBuffer[j];
        tb[80*(16+i) + ii] = rb[80*(16+i) + ii];
    }
}

I know it's not optimized, but I expected a way bigger channel density than just 30 channels. I'm still studying everything you suggested before.

The next step was to profile the cycles and see how many I was getting in the ecuSendIn for each configuration. Then I found out something strange. I used the profile method used in the unit test code (is it ok to use it?) This is the data I gathered:

SEGMENT SIZE = 32 ms, SEGMENT COUNT = 3, NUMBER OF SAMPLES = 80

FILTER LENGTH 128ms -> mean cycle count = 7851

FILTER LENGTH 64ms-> mean cycle count = 14983

FILTER LENGTH 32ms -> mean cycle count = 22743

Lowering the filter size gave me more cycles per call! Any idea why this occurs?

P.S.: All my buffers are allocated in the L2SRAM.

Regarding the unit-test code, there were some steps that I didn't fully understand, but I'll leave any questions for later, as they are not important right now.
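For reference, a per-call cycle count can be turned into a per-channel CPU load as sketched below. The helper name is hypothetical, and the conversion assumes that the counts above are CPU cycles and that ecuSendIn is called once per 10 ms frame (80 samples at 8 kHz).

```c
#include <assert.h>

/* CPU load of one channel in kHz (thousands of cycles per second), given
 * the cycles spent per frame and the frame length in milliseconds.
 * Cycles per millisecond is numerically the same as kHz. */
static inline unsigned long load_khz(unsigned long cycles_per_frame,
                                     unsigned frame_ms)
{
    assert(frame_ms > 0u);
    return cycles_per_frame / frame_ms;
}
```

Under those assumptions, 7851 cycles per frame is about 785 kHz (roughly 0.8 MHz) per channel, 14983 is about 1.5 MHz, and 22743 is about 2.3 MHz.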

Thank you.

Best Regards,

Leonardo Batista

Leonardo Trierveiler over 12 years ago in reply to Leonardo Trierveiler

Intellectual 385 points

Hi,

After more tests, I allocated my buffers in MSMCRAM and got a much better channel density: 67 channels.

I can't understand why putting the buffers in the shared memory resulted in better performance than the L2SRAM. Any ideas?

Best Regards,

Leonardo Batista

John Dowdal over 12 years ago in reply to Leonardo Trierveiler

TI__Intellectual 2180 points

What are your cache configurations, which are at 0x1840000, 0x1840020, and 0x1840040? Make sure L1P and L1D are on (at 0x20 (L1P) and 0x40 (L1D)). I expect to find "4" in 0x20 and 0x40. If the L1D and/or L1P are off, performance will be bad.

The MSMC (or really the XMC) supports a prefetch for program and data. This feature doesn't work on the L2SRAM because the L2 isn't hooked via the XMC. Prefetch could cause a speedup relative to L2SRAM (up to equivalent of L1SRAM) if the prefetch works on the EC data "perfectly". See http://www.ti.com/lit/sprugw0b section 7.5 for this prefetcher. You could disable it, in order to prove this is root cause of performance improvement. Also the cache registers are covered in this document as well, look for L1PCFG and L1DCFG.
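A small check of these registers could look like the sketch below. The addresses are the ones quoted above; the helper names are hypothetical, the L2CFG name for the first address and the position of the cache mode field in the low three bits of each register are assumptions that should be confirmed against the CorePac document referenced above. The register read itself only makes sense on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Register addresses quoted in the reply above. */
#define L2CFG_ADDR   0x01840000u
#define L1PCFG_ADDR  0x01840020u
#define L1DCFG_ADDR  0x01840040u

/* Assumption: the cache mode occupies the low three bits. */
#define CACHE_MODE_MASK 0x7u

/* Extract the cache mode field from a configuration register value. */
static inline unsigned cache_mode(uint32_t cfg_value)
{
    return (unsigned)(cfg_value & CACHE_MODE_MASK);
}

/* Nonzero if the mode field indicates that some cache is enabled. */
static inline int cache_enabled(uint32_t cfg_value)
{
    return cache_mode(cfg_value) != 0u;
}

/* Target only: read a 32-bit memory-mapped register. */
static inline uint32_t read_reg(uintptr_t addr)
{
    assert(addr != 0u);
    return *(volatile uint32_t *)addr;
}
```

On the target, `cache_mode(read_reg(L1DCFG_ADDR))` and `cache_mode(read_reg(L1PCFG_ADDR))` would be expected to return 4 if both L1 caches are configured as described in the reply.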
