This thread has been locked.

TMS320C6748: Low Speed Issue with LCDK circa 2013

Part Number: TMS320C6748

To put this in context: back in 2013 I bought a TI C6748 LCDK and an XDS100v2 JTAG emulator with a view to making a hi-res (8k) FIR filter for audio purposes. The algorithm I am using is based on an existing one I implemented and successfully used in my audio mastering software Har-Bal, which runs efficiently on a PC (even one circa 2005), so it is numerically efficient. Perhaps naively, I would have thought this LCDK should have more than enough grunt to implement a stereo 8k FIR filter in hardware, but when I started my development journey I quickly found that it was running a lot slower than I was expecting. After a lot of stuffing around and not much success I put it aside, as I had more important things to do, but now I am returning to it, hitting the same issue, and have spent days trying to understand what is wrong with my system.

To the problem: I have verified the clock speed of the LCDK to be 300 MHz based on the advice given here,

https://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/115/t/54812

so it appears the clock speed is not an issue. Internally my code uses the DSP lib functions,

DSPF_sp_cfftr2_dit()

DSPF_sp_icfftr2_dif()

using a radix-2 FFT of size 256. My algorithm produces the correct results, but painfully slowly; so much so that the processing is not possible in anything near real time. I therefore used the TSCL/TSCH approach above to measure the cycle count for calls to DSPF_sp_cfftr2_dit() and compared it to what your manual states it should approximately be, i.e.

    /* Writing TSCL starts the free-running 64-bit cycle counter */
    TSCL = 0;
    TSCH = 0;

    t_start = TSCL;
    t_start += (unsigned long long)TSCH << 32;

    /* Transform to frequency domain */
    DSPF_sp_cfftr2_dit(Hm[cn], w, BLOCK_LENGTH);

    t_stop = TSCL;
    t_stop += (unsigned long long)TSCH << 32;
    t_overhead = t_stop - t_start;

BLOCK_LENGTH is 256 in this case. Running to a breakpoint after the t_overhead assignment and inspecting the variables tells me that it has taken an astonishing 222688 cycles to calculate a 256-point FFT, whereas the manual suggests it should be on the order of 4138. That is nearly 54 times slower than it should be, and I haven't the faintest idea why. Can you possibly explain why this might be and what I could possibly be doing wrong? I am currently using Code Composer Studio 5.5.0.00077, which I installed back in 2013 when I first got the LCDK. Also note that there are no calls to any system functions controlling the LCDK processor or peripherals prior to this initialisation code, if that helps at all.

thanks in advance,


Paavo Jumppanen.

  • Hello!

    We know too little about the rest of your application, so here are just some general thoughts. DSPLIB functions operate best when data are close to the CPU, ideally in L2 memory. From your explanation it is unclear how your data are organised; placing them in L2 might be the first thing to check.

    There are some opinions that at sizes of 256 it is worth proceeding with a radix-4 algorithm rather than radix-2, though that is not the root issue in your case either way.

    Also, do you link against the provided DSP library binary, or do you compile your own?

  • Yes, radix-4 would be more efficient, but not by 50 times, so the issue is more severe than that. I don't compile the DSP lib myself but link to it; however, if I take the C-equivalent code of the algorithm and compile that, it is about 3.5 times slower than the DSPLIB version.

    The rest of the application is not particularly complex. It is just based on the loopback audio example provided with the LCDK, with some bug fixes. Having read a bit further, I think I realise that the built-in memory caching is off by default and needs to be initialised for use. I'm not using the BIOS, and I'm pretty sure the base loopback example doesn't initialise any caching, so that is what I will try next. As my application requires more memory than is available in L1 or L2, caching is a necessity, but if this is the cause it somewhat surprises me that the slowdown is so severe. Would that be because of pipeline stalls?

    Probably my other problem is unfamiliarity with the whole CCS development environment and development flow, as my main line of work is desktop software development in C/C++, where the lower-level details are handled by the operating system.

    thanks,

    Paavo.

  • Researching this problem further, I have discovered that the cache is indeed not enabled. After finding out-of-date documentation on how to enable the cache, and code examples that I couldn't compile for whatever reason, I figured out how to do it by studying the header files in the C6748 StarterWare SDK, notably the dspcache.h header file. I was able to enable the cache with this code:

    #include <c674x/dspcache.h>

    ...

    // Called from main()
    static void config_cache(void)
    {
      /* Enable 32K L1 program cache, 32K L1 data cache, and 256K of L2 as cache */
      CacheEnable(L1PCFG_L1PMODE_32K | L1DCFG_L1DMODE_32K | L2CFG_L2MODE_256K);
      /* Mark the 128MB DDR region at 0xC0000000 as cacheable via the MAR bits */
      CacheEnableMAR(0xC0000000, 0x08000000);
    }

    After doing so and re-running my test, the DSPF_sp_cfftr2_dit() cycle count went from 222688 to 7286 cycles, and this is with full debug code without software pipelining, so it should improve considerably with optimisation. That's a welcome and very substantial improvement, so I think I can finally implement the filter I've been wanting to!

    I'm posting this here for anyone else like me who is unfamiliar with the TI development tools and the CCS development environment. It is somewhat disappointing that the fact that caching defaults to off with CCS-generated code is not written in big letters somewhere obvious, so the inexperienced don't get caught up in days of dead ends wondering why this thing is so damned slow! It really sets a negative impression of an otherwise amazing piece of technology, so I'd encourage the people at TI documenting and putting together the SDKs and development tools to make this immediately obvious, and better still, to make caching enabled by default. Intriguingly, the hardware documentation says it is enabled by default with the maximum size configurations, but clearly the code generated by CCS must switch it off.

    regards,

    Paavo.

  • Hi Paavo,

    Good to know you have identified root cause. Still I have some thoughts about your situation.

    First, if you already link against the precompiled DSPLIB, then there is little hope of any further substantial improvement there: the DSPLIB code is already optimised close to its maximum performance. Your own routines may improve with optimisation, but that is unlikely for the library.

    Next, it is tempting to allocate the whole of L2 as cache in the hope it will handle things without human effort. An alternative scenario is using L2 as RAM and allocating ping-pong buffers there. You did not mention where your actual data are stored; I assume DDR memory. If so, consider the following scheme. Your FFT is a block process, so allocate two input buffers large enough to hold a whole FFT input frame, and two buffers for FFT output. Set up EDMA to pull input frames from DDR into the A and B buffers in alternating fashion. Once an EDMA transfer completes, fire the FFT routine, giving it the A input and A output buffers as data locations, and simultaneously set up an EDMA transfer to pull the next input data into the B input buffer. Once the FFT finishes, switch the input pull again and set up an output push from the L2 buffer to the output location in DDR. This way you'll always be working on L2 data, which should bring you close to the documented benchmark numbers; here you'll have no L2 misses at all. Keep in mind that we don't know which process will complete faster, EDMA or FFT, and their durations may also change at runtime, so one should establish some semaphoring scheme. Once again, this scenario assumes no part of L2 is used as cache.

    Another consideration is the allocation of your program in memory. Missing code in the L1P cache imposes a penalty as well. Your device has a rather small amount of L2 memory; still, I would split it and use one part for fast code placement and another for fast data, like the ping-pong buffers. Of course, one may allocate a third part as cache as well. My point is that carefully planning the data path may give far better performance than blindly allocating all of L2 to cache.

    Finally, I am always struck by the fact that people disregard the BIOS. A project often starts as a simple loop application, but very soon it comes to interrupts, then preemption and scheduling issues. In my experience, BIOS is worth every byte of its overhead.

  • Thanks for your detailed response, Victor! It's very helpful.

    I was aware of the use of EDMA for memory transfers, and also found that although my code worked, nothing would get through on the audio channels, presumably because the cached copy was stale and the cache had no idea that the EDMA from the audio CODEC had changed the data in DDR. Once I allocated the audio EDMA buffers to L2 directly and reduced the L2 cache, the audio got through, and it works fine up to a 1k FIR, but it currently hasn't got the performance to reach 8k.

    This is just a starting point, and I am aware that changing how memory is laid out can affect performance considerably, but it requires experimentation, as it isn't obvious which of a number of different options will perform better. That is what I'll be doing now, and your advice is a good starting point.

    Also, on the endorsement of the DSP BIOS: I guess the main reason I chose to avoid it was the learning curve, because it is yet another layer of stuff I would have to read up on and get on top of. As you suggest, it may well be the case that I will end up employing it out of necessity to provide other services. Once I've got the DSP side sorted, I'll need some means of communicating with it, either via USB or a TCP/IP socket, to "program" the impulse response of the filter. The idea is a hardware-based audio equaliser to compensate for room and monitor anomalies, with the response analysis and filter design implemented on a PC and then transferred to the hardware. The reason for that approach is that the PC gives you far more options for the UI, given the availability of extensive GUI toolkits that the DSP lacks, and it keeps the program size and memory usage of the DSP application to a minimum.

  • Just thought I would post a follow-up on my project.

    I spent a while trying to implement my filter using memory mapped to L2 in the linker command file, which worked fine for a 2k FIR but ran out of memory for anything larger (it comes down to the bunch of extra buffers needed to implement a 2k FIR with a 256-sample processing buffer and latency). I then tried using half of L2 for direct access and the rest as cache, which almost worked; I say almost because whatever I tried I seemed to end up with a memory aliasing problem where critical data was being overwritten by a completely different array write. I'm guessing I just don't know how to configure it properly, and I couldn't find any clear examples of using L2 as part cache plus part L2RAM, so I instead went full DDR2 through L1 and L2 caching. That solved the memory shortage, got rid of the aliasing issue, and left only the issue of ensuring synchronisation between DDR2 and cache for the EDMA transfer out to the DAC. Without any treatment, playback was noisy and distorted because what is in cache and what is in DDR2 were out of sync. Adding a call to CacheWB() after processing the rxBuffer with my filter code solved the problem, e.g.:

          /* run filter */
          process_FIR(txBufPtr[ProcessIdx], rxBufPtr[ProcessIdx]);

          /* need to force write back of cache to DDR2 before EDMA transfer */
          CacheWB((unsigned int)txBufPtr[ProcessIdx], AUDIO_BUF_SIZE);

    I got the speed I needed by changing the processing order to minimise the number of FFTs required, pre-calculating the filter frequency-domain components needed for the convolution, and re-structuring my frequency-domain data to be single-sided and multiplex-decoded (i.e. I'm using a complex FFT to calculate two real-sequence results at once), making all data access linear ascending and simplifying the inner-loop calculation to two complex multiply-and-accumulate operations. Doing so dramatically reduced the cycles used, so much so that I could easily drop my processing buffer to 64 samples, with a corresponding latency of 1.33 milliseconds at 48 kHz sampling (near enough real time), and with my full 8k filter taking about 152k cycles to execute, or approx. 507 microseconds, leaving me plenty of time to spare. Although the full L1/L2 cache configuration might not be optimal, it is easily fast enough for my purposes, which was my expectation going on the specs of the device.

    Thanks for the suggestions along this journey. Now I have to figure out how to write the code to manage and program my filter from an external device (via USB or Ethernet; not sure which will be easier). Also, I'm not clear on how, once I've finished development, I can flash the code into the device ROM and have it run on boot-up. Any suggestions or links would be appreciated, as I haven't yet found anything that clearly demonstrates that.

    regards,


    Paavo.

  • Hello!

    It's a great thing to post updates for the benefit of the community, as people often disappear once their problem is solved.

    I have to start with the disclaimer that we never use L2 as cache, but keep critical data and code there.

    Next is cache coherency. It may happen that a peripheral device has updated a memory location, but the cache controller knows nothing about it. This happens with PCI and PCIe when a remote device is a bus master making writes to our memory. Here cache invalidation helps, telling the cache controller to fetch a fresh copy of the data. It may also happen that written data has not reached its ultimate destination but is still sitting in cache. That is when a write-back helps, pushing the data out of the cache to its destination. Nevertheless, to my knowledge, EDMA maintains cache coherency; that is to say, once EDMA has performed a data move, the cache records will have valid information about it.

    Some planning of the data moves is needed. If DMA takes care of the ping-pong buffers, then there is no need to cache that data: DMA gets no benefit from caching, and the CPU is working on the L2 buffer either way. So in this scenario it should be planned carefully which data and/or code cannot be allocated to L2, so that they are better seen through the L2 cache.