CCS/TMS320C6748: Inputting data into array of pointers efficiently, mcasp starterware development

Part Number: TMS320C6748
Other Parts Discussed in Thread: OMAPL138

Tool/software: Code Composer Studio

Hi, 

I am developing some DSP audio effects with the help of the mcasp starterware. One issue I am facing is noise in my system. I think this is mainly due to inefficient code execution (please correct me if another significant factor could cause this).

/*
** Transmit buffers. If any new buffer is to be added, define it here and
** update the NUM_BUF.
*/
signed char txBuf0[AUDIO_BUF_SIZE];
signed char txBuf1[AUDIO_BUF_SIZE];

/* Array of transmit buffer pointers */
signed int txBufPtr[NUM_BUF] = { (signed int) txBuf0, (signed int) txBuf1 };

In streamlining my code I want to figure out how to point to and assign values to my transmission buffers. Currently I am checking the lastSentTxBuf through an 'if' statement and assigning values to the arrays txBuf0 and txBuf1 within.

for (all values in buffer)
{
    if (lastSentTxBuf == 0) { txBuf0[i] = Processed data; }
    else                    { txBuf1[i] = Processed data; }
}

Is there a way to avoid this, potentially by using the txBufPtr array directly? I have tried multiple times to solve this, but nothing seems to work. Please advise me on any solutions to this problem, along with ways to increase code efficiency and get the most processing power out of the C6748.

Thanks,

Calum

  • Hi,

    I've notified the RTOS team. Can you please share which RTOS SDK version you are using?

    Best Regards,
    Yordan
  • Hello!

    It's hard to tell whether the noise is related to data processing efficiency, but if you ask whether there is room to improve your loop, the answer is more certain. One big thing on C6000 is loop pipelining. Long story short, pipelining does not like branching (read: ifs and other conditionals) inside the loop. At a quick look, you could rewrite your code as

    if (lastSentTxBuf == 0)
    {
       for (one buf len )  txBuf0[i] = Processed data;
    }
    else
    {
       for (one buf len )  txBuf1[i] = Processed data;
    }

    This way the inner loop has no conditional and can be pipelined more efficiently.
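
The same idea can also answer the original question about using the pointer array directly: pick the destination pointer once, before the loop, and write through it. The following is only a minimal sketch under assumptions. `txBufTable` and `fill_tx_buffer` are hypothetical names, the table holds typed pointers rather than the integer-cast addresses in `txBufPtr`, and `AUDIO_BUF_SIZE` is set to a small value purely for illustration.

```c
#include <stddef.h>

#define NUM_BUF        2
#define AUDIO_BUF_SIZE 8   /* small illustrative value, not the StarterWare one */

static signed char txBuf0[AUDIO_BUF_SIZE];
static signed char txBuf1[AUDIO_BUF_SIZE];

/* Hypothetical typed pointer table; the thread's txBufPtr stores the same
 * addresses as integers, which would need a cast back to a pointer. */
static signed char *const txBufTable[NUM_BUF] = { txBuf0, txBuf1 };

/* Copy one buffer of processed data into the buffer selected by bufIdx.
 * The buffer choice is made once, so the loop body contains no branch. */
static void fill_tx_buffer(unsigned int bufIdx, const signed char *processed)
{
    signed char *dst = txBufTable[bufIdx];
    size_t i;

    for (i = 0; i < AUDIO_BUF_SIZE; i++) {
        dst[i] = processed[i];
    }
}
```

With the integer array from the original code, the equivalent selection would be a cast such as `(signed char *) txBufPtr[lastSentTxBuf]`, assuming the index is always within `NUM_BUF`.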

    However, there is a lot of speculation here. I suggest you find a document like "Optimizing loops on C6000 DSP" and learn to read the output assembly.

    Be warned, that is as useful as it is addictive. Once you start you'll never quit, and it's worth the effort :-)

  • Hi Yordan,

    Apologies for my delayed reply. I have 'processor_sdk_rtos_omapl138_4_01_00_06' downloaded; I think this is my RTOS SDK package, but I'm unsure whether I am actually using it in my project. I should have stated in my question that I am using the LCDK and starterware 'C6748_StarterWare_1_20_04_01', developing on top of the 'mcaspPlayBk' program.

    Calum
  • Hi rr,

    Apologies as well for my delayed response. I will certainly look into documentation of that nature and try to improve my system. I have implemented your suggestion to avoid branching and have noticed some improvement, but I am still not quite at my desired performance.

    I'm not sure I am correct in saying this, but I think one issue with my system is that I may not be using my processor and DMA as efficiently as possible. Because of the starterware example I am developing on, my incoming data buffers use 32 bits per sample, whilst the actual data is 2 bytes long (16 bits). I have to convert this into single-valued data and place it into a 'processing' array, process the data in that array, then output to my previously stated txBuf0/txBuf1 arrays. The conversion and transfer to and from the 'processing' array is done by the processor and not the DMA. I think this may be one of the reasons for my reduced performance, but again I'm not certain of this. I have seen other people taking advantage of the FIFO present on the mcasp to put both the left and right 16-bit samples into one 32-bit buffer slot, but I am unsure how to actually do this and hence have not implemented it.

    Should I be making use of the DMA to convert my incoming data into something that can be practically manipulated, leaving the DSP to simply carry out the arithmetic signal processing algorithms?
    Thanks in advance to anyone who can help me with this!

    Calum
  • Hello!
    Would you mind shedding more light on where your data is coming from, what the format of the incoming data is, what the format for processing is, and what the output format is? That might give some more ideas.
  • Hi,

    The data is coming through the mcasp and being transferred to the DDR2 memory by the DMA (I have tried moving it to L2 memory but have not seen a noticeable improvement).
    Each sample is stored as two 8-bit signed char values in little-endian order, followed by two zero-padded 8-bit values (e.g. 1101 0010, 0000 0100, 0000 0000, 0000 0000 = 1234 in decimal).
    I am transferring this to a circular array of type short, disregarding the two zero-padded chars. (Through the use of the DSP)
    I apply my processing to a single short 'processing' variable using the values in the circular array. (Through the use of the DSP)
    Then I convert the 'processing' variable to its two constituent char values and place them in my output array, in the same format as the input array (16-bit), then add two zero-padding char values. (Through the use of the DSP)
    Then I output my data from DDR2 (I have tried doing so from L2 as well, but with minimal improvement) to the mcasp, through the DMA.
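
The slot format described above can be written down as a pair of small helpers. This is a minimal sketch only, assuming each 4-byte slot holds the 16-bit sample in its first two bytes (low byte first) followed by two zero bytes; `unpack_sample` and `pack_sample` are hypothetical names, not StarterWare functions.

```c
#include <stdint.h>

/* Read one signed 16-bit sample from a 4-byte slot; the two padding
 * bytes at slot[2] and slot[3] are ignored. */
static int16_t unpack_sample(const uint8_t *slot)
{
    uint16_t raw = (uint16_t)(slot[0] | ((uint16_t)slot[1] << 8));

    /* Conversion of values above 32767 relies on the usual
     * two's-complement wrap-around behaviour of the compiler. */
    return (int16_t)raw;
}

/* Write one signed 16-bit sample into a 4-byte slot, low byte first,
 * and zero the two padding bytes. */
static void pack_sample(int16_t sample, uint8_t *slot)
{
    uint16_t raw = (uint16_t)sample;

    slot[0] = (uint8_t)(raw & 0xFFu);
    slot[1] = (uint8_t)(raw >> 8);
    slot[2] = 0;
    slot[3] = 0;
}
```

For the example given above, the bytes 0xD2, 0x04, 0x00, 0x00 unpack to 0x04D2, which is 1234 in decimal.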


    Thanks

  • Hello!

    First of all, please don't take my considerations as immediate recipes, but rather as directions to review.

    Point one is whether you can avoid transferring dummy data. If you can, that would change the game a lot. Assuming you can do nothing about that and have to skip the fillers and unpack the data manually, there is still room for improvement.

    I think EDMA can do data stream extraction and skip the dummies. However, you still need to expand 8-bit values to 16-bit ones, and as they are signed, EDMA can't do that; you have to use the processor.

    It's not certain that doing that job with the processor would degrade performance. If your loop is well pipelined, data access and unpacking might be just one step of the pipeline. One may try SIMD intrinsics to speed up packed data processing. However, here we step into the land of speculation.

    Do you use optimization? Is it at the -o3 level? Would you mind showing an excerpt of your processing loop?
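
As an illustration of unpacking being just one step of a single loop, here is a hedged sketch that reads each 32-bit slot, processes the 16-bit sample, and writes the repacked slot back without an intermediate array. Everything here is an assumption for illustration: `process_block` and `halve` are hypothetical names, the processing is a placeholder, each slot is assumed to carry the sample in its low 16 bits with the upper 16 bits zero, and the `MUST_ITERATE` pragma (TI compiler only) assumes the caller always passes an even, non-zero sample count.

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder processing step: halve the sample. */
static inline int16_t halve(int16_t s)
{
    return (int16_t)(s / 2);
}

/* Unpack, process, and repack n samples in one branch-free loop.
 * 'restrict' tells the compiler that in and out do not overlap. */
void process_block(const uint32_t *restrict in, uint32_t *restrict out, size_t n)
{
    size_t i;

    /* Trip-count hint for the TI compiler; assumes n >= 2 and n even. */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(2, , 2)
#endif
    for (i = 0; i < n; i++) {
        int16_t s = (int16_t)(uint16_t)(in[i] & 0xFFFFu); /* drop padding */
        int16_t y = halve(s);                             /* process      */
        out[i] = (uint32_t)(uint16_t)y;                   /* zero-pad     */
    }
}
```

Whether a loop like this actually software-pipelines on the C6748 can only be confirmed from the compiler's own output for the real code, so treat it as a starting point rather than a measured result.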

  • Calum,

    I am not from the TI-RTOS team, but I am jumping into the discussion in case I can help.

    There is a lot of good information in the C6000 Embedded Design Workshop, particularly about using the EDMA3 to service the peripherals and chain to a ping/pong buffer scheme. There are examples of this also in the TRM, but the workshop includes labs and solutions with the pictures. And there are videos for some of the modules that might be useful, especially EDMA3. It is tricky to find on TI.com; "training c6748" does not find it but "workshop c6748" does (no quotes).

    You should always use the EDMA3 to move data buffers. It is much better at it than the DSP. And you should always avoid moving data whenever possible. The workshop will help you visualize that.

    Starterware is simple code to help with simple tasks, and it is not actively supported by TI - only offered as-is. You may find some value in switching to the Processor SDK, which is fully supported and portable across all recent TI processors.

    What board are you using, and what ADC/DAC?

    Regards,
    RandyP