
Very slow IPC.

Other Parts Discussed in Thread: TMS320C6472

We are using the TMS320C6472 in ultrasonic test equipment. Until now we have been using only
a single core (CORE0) and the system has 12 DSP cards communicating using TCP/IP.

I have reorganized the project so that it now uses both CORE0 and CORE1 on each of the 12 DSP cards.
CORE0 runs the TCP/IP stack and communicates with the other DSPs. CORE1 runs ultrasonic data processing.
CORE0 and CORE1 communicate using IPC/MessageQ.
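The CORE0-to-CORE1 round trip can be sketched off-target. The snippet below is a POSIX-threads analogue of the MessageQ ping-pong pattern (illustrative only: mbox_put/mbox_get and ping_pong are made-up names standing in for MessageQ_put()/MessageQ_get(); the real code uses the TI IPC API over shared memory):

```c
#include <pthread.h>
#include <stdbool.h>

/* One-slot mailbox standing in for a MessageQ queue (illustrative only). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            full;
    int             msg;
} mailbox_t;

static void mbox_init(mailbox_t *m) {
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->cv, NULL);
    m->full = false;
}

static void mbox_put(mailbox_t *m, int msg) {  /* ~ MessageQ_put */
    pthread_mutex_lock(&m->lock);
    while (m->full) pthread_cond_wait(&m->cv, &m->lock);
    m->msg  = msg;
    m->full = true;
    pthread_cond_broadcast(&m->cv);
    pthread_mutex_unlock(&m->lock);
}

static int mbox_get(mailbox_t *m) {            /* ~ MessageQ_get */
    pthread_mutex_lock(&m->lock);
    while (!m->full) pthread_cond_wait(&m->cv, &m->lock);
    int msg = m->msg;
    m->full = false;
    pthread_cond_broadcast(&m->cv);
    pthread_mutex_unlock(&m->lock);
    return msg;
}

static mailbox_t to_core1, to_core0;

/* "CORE1": echo every message straight back. */
static void *echo_thread(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        mbox_put(&to_core0, mbox_get(&to_core1));
    return NULL;
}

/* "CORE0": send n messages and count completed round trips. */
int ping_pong(int n) {
    pthread_t t;
    int done = 0;
    mbox_init(&to_core1);
    mbox_init(&to_core0);
    pthread_create(&t, NULL, echo_thread, &n);
    for (int i = 0; i < n; i++) {
        mbox_put(&to_core1, i);
        if (mbox_get(&to_core0) == i) done++;
    }
    pthread_join(t, NULL);
    return done;
}
```

On the C6472 the same ping-pong runs between the two cores, with each side opening the other's queue by name.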

When I started the debugging process I discovered the following:

IPC/MessageQ messages can be transmitted and received by both CORE0 and CORE1, but the transfer is extremely slow. The round-trip time for a short message (less than 256 bytes) is 30 milliseconds or more. When the same message is sent from DSP0/CORE0 to DSP1/CORE0 using TCP/IP, the round-trip time is less than one millisecond!

DSP0/CORE0 =IPC/MessageQ=> DSP0/CORE1 =IPC/MessageQ=> DSP0/CORE0            30 milliseconds
DSP0/CORE0 =TCP/IP=> DSP1/CORE0 =TCP/IP=> DSP0/CORE0                         1 millisecond

Being a little perplexed I then built the IPC/MessageQ example from TI. I changed it so that CORE0 sends a
message to CORE1 which sends the message right back to CORE0. The cores are doing nothing else.
I get the same very long round trip time.

Can someone shed light on what I may be doing wrong? I cannot believe that what I am seeing is correct, but I am unable to find clues in the documentation.

I have added a zipped folder of the example project which contains a .cfg file with my SYS/BIOS settings.

For IPC the settings are the same in my own project.

Versions of tools/libraries:
CCS 5.1
bios_6_32_05_54
edma3_lld_02_10_03_04
ipc_1_23_05_40
mcsdk_1_00_00_08
ndk_2_20_06_35
pdk_c64x_1_00_00_06
uia_1_00_03_25
xdais_7_21_01_07
xdctools_3_22_04_46

IPCExample.zip
  • Stein,

    The one-way message latency should be on the order of 3-10 µs, depending upon which Notify driver is being used. 

    A first guess is that the timings for the shared memory region used for IPC are not properly configured.  Are you using DDR or L2 for IPC?  If DDR, are you using a GEL file or otherwise initializing timings properly?  

    If this is not it, can you please provide more details of your memory and application configuration? (I don’t see any zipped attachment to your post.)

    Scott

  • Here is the .zip file.

    As you can see from the example application's .cfg file, the only change was setting the CPU frequency to 250 MHz.

    The memory used is SL2.

  • Stein,

    Can you please try a couple of things? Do #1 first:

    1.  Remove the System_printf() calls from the loop where MessageQ_put/get is being done.  System_printf() calls actually halt the processor.

    To run more optimized code, do the following in your *.cfg file:

    2.  BIOS.libType = BIOS.LibType_Custom;
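    In the example's .cfg file (an XDC configuration script), change #2 is a one-line addition; the useModule line is shown only for context and already exists in the script:

```javascript
/* The BIOS module is already pulled in by the example's .cfg script. */
var BIOS = xdc.useModule('ti.sysbios.BIOS');

/* Build a custom, optimized SYS/BIOS library instead of the default
 * instrumented one (asserts and logging compiled out). */
BIOS.libType = BIOS.LibType_Custom;
```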

    Judah

  • Good catch. The System_printf statements were the culprit. I am of course aware of the latency of these functions, but I was expecting delays in milliseconds. It may be that when two (or more) cores compete for standard output over the JTAG connection, the latency grows to tens of milliseconds. Thanks a lot!

  • Stein,

    I'll give you a bit of an explanation as to why printf consumes so many cycles.

    Here's how all of the CIO functions work (printf, scanf, fprintf, etc.). When the application encounters one of these functions, it formats the data and stores it in memory. This part probably takes tens of milliseconds, as you say. The problem is what happens next.

    When you run an application from CCS, it runs freely, and CCS periodically, non-intrusively, polls each of the devices on the scan chain to see whether the processor has been halted (typically by a breakpoint). If it has been, CCS reads enough memory to update all of its windows and remains at the breakpoint. The polling period is not fixed, but I would suspect the polls happen at least 2-3 times per second (maybe more).

    For the CIO functions, once the data is formatted on chip, they branch to a special breakpoint that has been set at the CIO symbol. When CCS next polls and sees that the processor is halted at the CIO breakpoint, it reads the data up through the emulator, displays it in the CCS window, and then automatically runs the processor again. This is where all of these cycles come from. Assume the formatting takes 20 milliseconds, but right as the formatting completes, CCS is finishing one of its polls. The processor can then sit at that breakpoint until the next time CCS polls, which, if you assume 2 polls per second, might be something like 500 ms. So the delay is not really how long the printf code takes to execute; it is mostly the time the processor spends halted waiting for the asynchronous poll from CCS.

    Writing from more than one core does compound the problem. The fastest emulators only run TCK at 35-50 MHz. Each time CCS polls and more than one core has CIO data to push up, the emulator has to read that data in, which is a serial process clocked at the TCK rate. So, depending on how much data there is, there can be large delays here as well.

    And there could be worse cases. Suppose you have 8 cores and they are all calling printf. CCS polls the cores in order each time. Assume you have printf data on cores 1-7. When CCS polls, it checks core 0 and, finding nothing, moves on to pull the data from core 1, then core 2, and so on. What if, right after core 0 is polled, it gets some printf data? Now it not only has to wait for the asynchronous poll; the poll won't even occur while CCS is busy getting the data from cores 1-7. So there might be cases where seconds pass between the time data is written via printf and the time that data is actually pulled up to the host.
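    To put rough numbers on the single-core case: the helper below (a hypothetical name, not a TI API) models the worst-case stall for one CIO call as the on-chip formatting time plus one full CCS polling period spent halted at the CIO breakpoint:

```c
/* Worst-case stall for one CIO call, per the model in the discussion:
 * on-chip formatting time plus, in the worst case, one full CCS polling
 * period spent halted at the CIO breakpoint before the host notices.
 * Illustrative only; actual CCS polling is asynchronous and not fixed. */
double worst_case_cio_stall_ms(double format_ms, double polls_per_sec) {
    double poll_period_ms = 1000.0 / polls_per_sec;  /* time between CCS polls */
    return format_ms + poll_period_ms;
}
```

    With the numbers from the discussion (20 ms of formatting, 2 polls per second) this gives 520 ms, in line with the ~500 ms estimate.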

    Regards,

    Dan