
DSPLink performance on Linux PC using PCI

Hi,

We're currently evaluating DSPLink as a possible IPC solution for a new low latency audio streaming product. The test setup is as follows:

  • Pentium 4 (2.4GHz) running 2.6.31 kernel with RT and bigphysarea patches
  • DSP/BIOS Link version 1.64 (with minor changes to support 2.6.31 kernel)
  • Following options used to configure dsplink: --platform=LINUXPC --nodsp=1 --dspcfg_0=DM648PCI --dspos_0=DSPBIOS5XX --gppos=RHEL4 --comps=ponslrmc --dspdma=1
  • DM648 EVM from Lyrtech plugged into Pentium 4 PCI bus

I've slightly modified the MSGQ sample supplied with dsplink 1.64 by making the GPP the message initiator and adding a usleep() call in the for loop. This lets me vary the message send rate to match the audio frame rates we expect to handle in our final system.
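
For reference, the modified inner loop looks roughly like the sketch below. It is simplified (error handling and the unchanged setup code are omitted), and the pool ID, queue handles and message size are placeholders from my test code rather than the sample's real names:

    /* Sketch of the modified loop: the GPP now initiates every exchange
     * and paces it with usleep() to emulate the audio frame period.     */
    status = MSGQ_alloc (SAMPLE_POOL_ID, APP_MSG_SIZE, &msg) ;
    for (i = 0 ; (i < numMsgs) && DSP_SUCCEEDED (status) ; i++) {
        status = MSGQ_put (dspMsgq, msg) ;                    /* send to the DSP        */
        if (DSP_SUCCEEDED (status)) {
            status = MSGQ_get (gppMsgq, WAIT_FOREVER, &msg) ; /* wait for the echo back */
        }
        usleep (frameUs) ;   /* 1000 or 4000 us; omitted entirely for the "no usleep" case */
    }

The following are some observations: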

With no usleep() call I see a round-trip delay of about 480us. CPU loading is about 70%. See the output of top below (the modified msgq sample app is called atest):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2726 root     -96   0 11560 1096  924 S 38.4  0.1   0:08.87 atest
 2732 root      10 -10     0    0    0 S 29.8  0.0   0:06.82 DSPLINK_DPC_2
 2728 root      10 -10     0    0    0 R  4.6  0.0   0:01.09 DSPLINK_DPC_0
 2729 root     -51  -5     0    0    0 S  1.0  0.0   0:00.21 irq/9-DSPLINK

With a usleep() call setting the loop rate to 1ms (i.e. one MSGQ_put followed by one MSGQ_get every 1ms), CPU loading is about 32%.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2577 root     -96   0 11560 1104  928 S 16.9  0.1   0:05.15 atest
 2583 root      10 -10     0    0    0 S 12.9  0.0   0:03.89 DSPLINK_DPC_2
 2579 root      10 -10     0    0    0 R  2.0  0.0   0:00.65 DSPLINK_DPC_0
 2580 root     -51  -5     0    0    0 S  0.7  0.0   0:00.16 irq/9-DSPLINK

With a usleep() call setting the loop rate to 4ms (i.e. one MSGQ_put followed by one MSGQ_get every 4ms), CPU loading is about 9%.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2698 root     -96   0 11560 1100  928 S  5.0  0.1   0:02.65 atest
 2703 root      10 -10     0    0    0 S  3.3  0.0   0:01.91 DSPLINK_DPC_2
 2699 root      10 -10     0    0    0 S  0.7  0.0   0:00.32 DSPLINK_DPC_0
  186 root     -51  -5     0    0    0 S  0.3  0.0   0:00.84 irq/9-acpi
 2700 root     -51  -5     0    0    0 S  0.3  0.0   0:00.05 irq/9-DSPLINK

My question is: has anyone tested dsplink in a similar scenario, and do the latency and CPU loading figures look comparable? I was expecting to see lower latency than this and much lower CPU loading. Unfortunately, if the CPU loading figures can't be improved we won't be able to use dsplink, as our final system will use an Atom Z510 with lower performance.


Regards,

Grant


  • Grant,

    Thanks for the detailed info.

    I don't have any numbers for the scenarios that you have discussed.

    I have a few questions which will hopefully give us a clue about the loading and latency numbers.

    1) What numbers do you get for the unmodified sample with the --dspdma option?

    2) Can you turn off the --dspdma option and check the latency and loading numbers for both the modified and unmodified samples?

    We have run the message sample using the default option (which involves a memcpy), and it gives much lower latency: ~275 usec for a round trip.

    3) For the latency numbers below, the DSPLink DPC runs for every MSGQ_get call, which explains why DSPLINK_DPC_2 takes some CPU load. But why would the application itself take 38% CPU load?

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     2726 root     -96   0 11560 1096  924 S 38.4  0.1   0:08.87 atest
     2732 root      10 -10     0    0    0 S 29.8  0.0   0:06.82 DSPLINK_DPC_2
     2728 root      10 -10     0    0    0 R  4.6  0.0   0:01.09 DSPLINK_DPC_0

    4) Can you add print statements to mark the entry and exit points of the function DPC_Callback in the file $dsplink\gpp\src\osal\Linux\2.6.18\dpc.c?

    I want to check whether the DPC runs even if no interrupt comes from the DSP.
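
    Something along these lines is all that is needed (a minimal sketch; only the two printk() lines are new, and the signature shown is indicative only, so keep whatever your copy of dpc.c already declares):

      /* In $dsplink\gpp\src\osal\Linux\<kernel>\dpc.c */
      STATIC Void DPC_Callback (unsigned long arg)    /* indicative signature only */
      {
          printk (KERN_INFO "enter DPC_Callback\n") ;

          /* ... existing DPC_Callback body, unchanged ... */

          printk (KERN_INFO "exit DPC_Callback\n") ;
      }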

    Deepali

  • Hello Deepali,

    Thanks for taking the time to help with this issue. I have run some more tests with the settings you suggested; the results are as follows:

    1) With the unmodified sample and --dspdma=1, the round-trip latency is 453us and the CPU loading is about 77% (see the top output below):

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    20841 root      10 -10     0    0    0 S 41.0  0.0   0:08.02 DSPLINK_DPC_2
    20835 root      20   0 10536  564  476 S 33.7  0.1   0:06.47 messagegpp
    20837 root      10 -10     0    0    0 R  2.0  0.0   0:00.23 DSPLINK_DPC_0
    20838 root     -51  -5     0    0    0 S  0.3  0.0   0:00.07 irq/9-DSPLINK

    2a) With the unmodified sample and the --dspdma option turned off, the round-trip latency is 428us and the CPU loading is about 75% (see the top output below):

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    15184 root      10 -10     0    0    0 S 37.5  0.0   0:05.95 DSPLINK_DPC_2
    15179 root      20   0 10536  568  476 S 28.5  0.1   0:04.58 messagegpp
    15180 root      10 -10     0    0    0 R  6.0  0.0   0:00.96 DSPLINK_DPC_0
    15181 root     -51  -5     0    0    0 S  3.6  0.0   0:00.61 irq/9-DSPLINK

    2b) With the modified sample and the --dspdma option turned off, the round-trip latency is 667us and the CPU loading is about 87%:

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    26761 root      10 -10     0    0    0 S 56.8  0.0   0:09.45 DSPLINK_DPC_2
    26753 root     -96   0 11560 1104  924 S 23.5  0.1   0:04.04 atest
    26757 root      10 -10     0    0    0 R  5.0  0.0   0:00.76 DSPLINK_DPC_0
    26758 root     -51  -5     0    0    0 S  1.7  0.0   0:00.30 irq/9-DSPLINK

    I think the increased latency and loading in the modified sample with --dspdma turned off are due to the message size being 128 bytes larger in the modified sample.

    3) The CPU loading due to the atest and messagegpp applications is very similar, so I think something in either the MSGQ_put or the MSGQ_get call is taking up CPU cycles. I've tried to run oprofile on this system but unfortunately haven't had much success.

    4a) Below is the output of dmesg with printk statements at the entry and exit points of DPC_Callback:

    [ 1278.510653] pci 0000:03:00.0: enabling device (0010 -> 0012)
    [ 1278.510670] pci 0000:03:00.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
    [ 1278.513713] enter DPC_Callback
    [ 1278.515035] enter DPC_Callback
    [ 1278.521746] enter DPC_Callback
    [ 1308.220819] exit DPC_Callback
    [ 1308.223077] exit DPC_Callback
    [ 1308.223591] exit DPC_Callback
    [ 1308.224308] pci 0000:03:00.0: PCI INT A disabled

    Note that this is with the <number of transfers> parameter for messagegpp set to 65535. The number of callback entry/exit prints is the same even if I set <number of transfers> to 2 (only the timestamps change).

    4b) Below is the output of dmesg with --trace=1 and DPC trace turned on (not sure if you need this; if not, sorry for the clutter):

    [  189.988513] DSPLINK Module (1.64) created on Date: Mar  8 2010 Time: 13:12:04
    [  245.180041] Entered DPC_Disable ()
    [  245.180049] Leaving DPC_Disable ()
    [  245.180051] Entered DPC_Enable ()
    [  245.180053] Leaving DPC_Enable ()
    [  245.180398] Entered DPC_Disable ()
    [  245.180403] Leaving DPC_Disable ()
    [  245.180405] Entered DPC_Enable ()
    [  245.180407] Leaving DPC_Enable ()
    [  245.181475] Entered DPC_Disable ()
    [  245.181481] Leaving DPC_Disable ()
    [  245.181549] Entered DPC_Enable ()
    [  245.181551] Leaving DPC_Enable ()
    [  245.181778] Entered DPC_Disable ()
    [  245.181780] Leaving DPC_Disable ()
    [  245.181782] Entered DPC_Enable ()
    [  245.181784] Leaving DPC_Enable ()
    [  245.181851] pci 0000:03:00.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
    [  245.186955] Entered DPC_Create ()
    [  245.186957]  userDPCFn       [0xf8171df1]
    [  245.186958]  dpcArgs [0xf8190da4]
    [  245.186959]  dpcObj  [0xf842b010]
    [  245.187042] Leaving DPC_Create ()    status [0x8000]
    [  245.187051] Entered DPC_Callback ()
    [  245.187052]  arg     [0x0]
    [  245.188990] Entered DPC_Create ()
    [  245.188992]  userDPCFn       [0xf816eeaf]
    [  245.188993]  dpcArgs [0xf8190ad4]
    [  245.188994]  dpcObj  [0xf8190ae0]
    [  245.189680] Leaving DPC_Create ()    status [0x8000]
    [  245.190222] Entered DPC_Callback ()
    [  245.190224]  arg     [0x1]
    [  245.190760] Entered DPC_Disable ()
    [  245.190764] Leaving DPC_Disable ()
    [  245.190807] Entered DPC_Enable ()
    [  245.190809] Leaving DPC_Enable ()
    [  245.192566] Entered DPC_Disable ()
    [  245.192571] Leaving DPC_Disable ()
    [  245.192609] Entered DPC_Enable ()
    [  245.192611] Leaving DPC_Enable ()
    [  245.192963] Entered DPC_Disable ()
    [  245.192966] Leaving DPC_Disable ()
    [  245.193134] Entered DPC_Enable ()
    [  245.193137] Leaving DPC_Enable ()
    [  245.196806] Entered DPC_Disable ()
    [  245.196813] Leaving DPC_Disable ()
    [  245.196816] Entered DPC_Enable ()
    [  245.196819] Leaving DPC_Enable ()
    [  245.265976] Entered DPC_Disable ()
    [  245.265984] Leaving DPC_Disable ()
    [  245.265986] Entered DPC_Enable ()
    [  245.265988] Leaving DPC_Enable ()
    [  245.269207] Entered DPC_Disable ()
    [  245.269214] Leaving DPC_Disable ()
    [  245.269216] Entered DPC_Enable ()
    [  245.269218] Leaving DPC_Enable ()
    [  245.269269] Entered DPC_Create ()
    [  245.269271]  userDPCFn       [0xf817036b]
    [  245.269272]  dpcArgs [0xf8190d74]
    [  245.269273]  dpcObj  [0xf8190d84]
    [  245.269330] Leaving DPC_Create ()    status [0x8000]
    [  245.269580] Entered DPC_Schedule ()
    [  245.269582]  dpcObj  [0xf8190060]
    [  245.269586] Leaving DPC_Schedule ()
    [  245.269592] Entered DPC_Callback ()
    [  245.269594]  arg     [0x2]
    [  245.269926] Entered DPC_Schedule ()
    [  245.269927]  dpcObj  [0xf8190060]
    [  245.269930] Leaving DPC_Schedule ()
    [  245.269948] Entered DPC_Schedule ()
    [  245.269949]  dpcObj  [0xf8190090]
    [  245.269952] Leaving DPC_Schedule ()
    [  245.270571] Entered DPC_Disable ()
    [  245.270574] Leaving DPC_Disable ()
    [  245.270637] Entered DPC_Schedule ()
    [  245.270638]  dpcObj  [0xf8190060]
    [  245.270641] Leaving DPC_Schedule ()
    [  245.273479] Entered DPC_Enable ()
    [  245.273487] Leaving DPC_Enable ()
    [  245.273824] Entered DPC_Schedule ()
    [  245.273826]  dpcObj  [0xf8190090]
    [  245.273832] Leaving DPC_Schedule ()
    [  245.274616] Entered DPC_Schedule ()
    [  245.274617]  dpcObj  [0xf8190060]
    [  245.274622] Leaving DPC_Schedule ()
    [  245.274711] Entered DPC_Schedule ()
    [  245.274712]  dpcObj  [0xf8190090]
    [  245.274717] Leaving DPC_Schedule ()
    [  245.276069] Entered DPC_Schedule ()
    [  245.276071]  dpcObj  [0xf8190060]
    [  245.276076] Leaving DPC_Schedule ()
    [  245.276200] Entered DPC_Schedule ()
    [  245.276201]  dpcObj  [0xf8190090]
    [  245.276206] Leaving DPC_Schedule ()
    [  245.276997] Entered DPC_Disable ()
    [  245.277001] Leaving DPC_Disable ()
    [  245.277074] Entered DPC_Enable ()
    [  245.277077] Leaving DPC_Enable ()
    [  245.277113] Entered DPC_Cancel ()
    [  245.277114]  dpcObj  [0xf8190090]
    [  245.277116] Leaving DPC_Cancel ()    status [0x8000]
    [  245.277119] Entered DPC_Delete ()
    [  245.277120]  dpcObj  [0xf8190090]
    [  245.277127] Leaving DPC_Callback ()
    [  245.277157] Leaving DPC_Delete ()    status [0x8000]
    [  245.277160] Entered DPC_Disable ()
    [  245.277162] Leaving DPC_Disable ()
    [  245.277164] Entered DPC_Enable ()
    [  245.277166] Leaving DPC_Enable ()
    [  245.277503] Entered DPC_Disable ()
    [  245.277506] Leaving DPC_Disable ()
    [  245.277508] Entered DPC_Enable ()
    [  245.277510] Leaving DPC_Enable ()
    [  245.277896] Entered DPC_Disable ()
    [  245.277900] Leaving DPC_Disable ()
    [  245.277903] Entered DPC_Enable ()
    [  245.277905] Leaving DPC_Enable ()
    [  245.280841] Entered DPC_Disable ()
    [  245.280846] Leaving DPC_Disable ()
    [  245.280849] Entered DPC_Enable ()
    [  245.280851] Leaving DPC_Enable ()
    [  245.281441] Entered DPC_Disable ()
    [  245.281445] Leaving DPC_Disable ()
    [  245.281448] Entered DPC_Enable ()
    [  245.281450] Leaving DPC_Enable ()
    [  245.282834] Entered DPC_Disable ()
    [  245.282839] Leaving DPC_Disable ()
    [  245.282841] Entered DPC_Enable ()
    [  245.282843] Leaving DPC_Enable ()
    [  245.283593] Entered DPC_Disable ()
    [  245.283598] Leaving DPC_Disable ()
    [  245.283600] Entered DPC_Enable ()
    [  245.283602] Leaving DPC_Enable ()
    [  245.283637] Entered DPC_Cancel ()
    [  245.283638]  dpcObj  [0xf8190078]
    [  245.283640] Leaving DPC_Cancel ()    status [0x8000]
    [  245.283643] Entered DPC_Delete ()
    [  245.283644]  dpcObj  [0xf8190078]
    [  245.283653] Leaving DPC_Callback ()
    [  245.283670] Leaving DPC_Delete ()    status [0x8000]
    [  245.288078] Entered DPC_Cancel ()
    [  245.288080]  dpcObj  [0xf8190060]
    [  245.288085] Leaving DPC_Cancel ()    status [0x8000]
    [  245.288087] Entered DPC_Delete ()
    [  245.288089]  dpcObj  [0xf8190060]
    [  245.288097] Leaving DPC_Callback ()
    [  245.288122] Leaving DPC_Delete ()    status [0x8000]
    [  245.288806] pci 0000:03:00.0: PCI INT A disabled
    [  245.289235] Entered DPC_Disable ()
    [  245.289239] Leaving DPC_Disable ()
    [  245.289241] Entered DPC_Enable ()
    [  245.289243] Leaving DPC_Enable ()
    [  245.289660] Entered DPC_Disable ()
    [  245.289664] Leaving DPC_Disable ()
    [  245.289667] Entered DPC_Enable ()
    [  245.289669] Leaving DPC_Enable ()

    <number of transfers> is 2 in this case.

    Regards,

    Grant


  • Deepali,

    I have managed to do some profiling of dsplink using oprofile on the 2.6.21 kernel (with bigphysarea patches). I have profiled a debug build of dsplink with the --dspdma option both on and off. In both cases I've run the sample application messagegpp with <number of transfers> set to 65535.

    1) With --dspdma=1

    samples  %        image name               app name                 symbol name
    557      15.0378  dsplinkk.ko              dsplinkk                 DM648_halPciReadDMA
    519      14.0119  dsplinkk.ko              dsplinkk                 DM648_halPciWriteDMA
    464      12.5270  vmlinux                  vmlinux                  native_read_tsc
    451      12.1760  dsplinkk.ko              dsplinkk                 DMAPOOL_invalidate
    436      11.7711  dsplinkk.ko              dsplinkk                 DMAPOOL_writeback
    294       7.9374  dsplinkk.ko              dsplinkk                 LDRV_MPCS_enter
    196       5.2916  vmlinux                  vmlinux                  delay_tsc
    189       5.1026  dsplinkk.ko              dsplinkk                 IPS_ISR
    166       4.4816  dsplinkk.ko              dsplinkk                 LDRV_MPCS_leave
    80        2.1598  vmlinux                  vmlinux                  native_safe_halt
    52        1.4039  dsplinkk.ko              dsplinkk                 LDRV_MPLIST_isEmpty
    46        1.2419  libc-2.8.90.so           libc-2.8.90.so           (no symbols)

    2) With --dspdma=0

    samples  %        image name               app name                 symbol name
    1155     18.6381  dsplinkk.ko              dsplinkk                 MEM_Copy
    804      12.9740  dsplinkk.ko              dsplinkk                 DMAPOOL_writeback
    769      12.4092  dsplinkk.ko              dsplinkk                 DMAPOOL_invalidate
    718      11.5863  dsplinkk.ko              dsplinkk                 DM648_halPciWriteDMA
    637      10.2792  dsplinkk.ko              dsplinkk                 DM648_halPciReadDMA
    501       8.0846  dsplinkk.ko              dsplinkk                 LDRV_MPCS_enter
    363       5.8577  dsplinkk.ko              dsplinkk                 LDRV_MPCS_leave
    328       5.2929  dsplinkk.ko              dsplinkk                 IPS_ISR
    240       3.8728  vmlinux                  vmlinux                  native_safe_halt
    103       1.6621  dsplinkk.ko              dsplinkk                 DM648_halPciIntCtrl
    92        1.4846  libc-2.8.90.so           libc-2.8.90.so           (no symbols)
    69        1.1134  dsplinkk.ko              dsplinkk                 IPS_notify
    63        1.0166  dsplinkk.ko              dsplinkk                 LDRV_MPLIST_isEmpty

    It appears as though most of the CPU loading is due to the actual PCI transfer between system memory and DSP memory. It looks like even in the --dspdma=1 case the driver spins until the DMA transfer is complete. In the --dspdma=0 case, it is probably the slow PCI accesses during the memcpy that account for the high loading (possibly made worse by the presence of the PCI bridge on the DM648 EVM).

    Any ideas how the performance could be improved? I suppose one way would be to sleep during the DMA transfer, but I'm not sure how easy that would be to do. I guess the problem stems from the fact that dsplink was originally designed for a true shared-memory system, where transfers between system memory and DSP memory were not necessary? Perhaps a different solution would be to use the 4M shared memory for the POOL buffers (assuming only a small buffer requirement, which would be the case for our audio-only application). Any idea what would be involved in making such a change?
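
    Just to illustrate the "sleep during the DMA transfer" idea, I'm imagining something like the sketch below in the low-level transfer path. This is purely hypothetical: all the names are placeholders (not DSPLink symbols), and it assumes the DSP (or a bridge/FPGA) could raise an interrupt on the host when the transfer finishes, which is exactly the notification that doesn't exist today:

      #include <linux/completion.h>
      #include <linux/jiffies.h>
      #include <linux/errno.h>

      /* Placeholder names, not DSPLink symbols. */
      static DECLARE_COMPLETION (pciDmaDone) ;

      /* Called after the DMA has been kicked off, instead of spinning on
       * the transfer-complete status:                                     */
      static int waitForPciDma (unsigned int timeoutMs)
      {
          unsigned long left ;

          left = wait_for_completion_timeout (&pciDmaDone,
                                              msecs_to_jiffies (timeoutMs)) ;
          return (left != 0) ? 0 : -ETIMEDOUT ;
      }

      /* Called from the (hypothetical) DMA-done interrupt handler: */
      static void pciDmaDoneIsr (void)
      {
          complete (&pciDmaDone) ;
      }

    With something like this in place, the atest and messagegpp processes should be able to sleep between MSGQ_put and MSGQ_get instead of burning CPU.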

    Regards,

    Grant

  • Grant,

    The CPU loading is high because we have to spin until the DMA completes. There is no notification received on the host when the DMA operation completes.

    Is this your final system configuration? Our general advice to customers is to use the DSP DMA code as a reference to implement their own host-based DMA transfer within DSPLink.

    Deepali

  • Deepali,

    Apologies for the long delay in responding to this.

    Our final system is an Atom Z510 GPP and a DM648 DSP. I'm not quite sure what you mean by "host-based DMA". As far as I'm aware, DMA transfers on the PCI bus are performed by the target device using bus mastering, pretty much as it's done now. Or am I misunderstanding something?

    The problem as I see it is with the POOL_writeback and POOL_invalidate mechanism, i.e. making it the responsibility of the GPP to keep the GPP and DSP buffers in sync. Since it's ultimately the DSP that will perform the DMA, would it not be better to make it the responsibility of the DSP to keep its own buffers in sync with the GPP buffers? The GPP would then not have to wait for DMA completion and performance would be much better.

    Could you please give me some pointers as to what changes would be required to implement this, i.e. to achieve the following:

    1. On the GPP side, either remove POOL_writeback and POOL_invalidate or have them simply return success, as in the true shared-memory implementations (see the sketch after this list).
    2. On the DSP side, upon receiving a message notification, perform DMA to move the buffers between GPP and DSP memory.
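
    For item 1, I'm picturing something as trivial as the following on the GPP side. This is only a rough sketch: the function names are taken from the oprofile output above, and the real entry points and parameter lists in the dsplink source may well differ:

      /* Rough sketch for item 1: make the GPP-side pool sync operations no-ops,
       * as in a true shared-memory configuration, so the GPP never blocks on a
       * PCI/DMA transfer. Parameter lists are indicative only.                  */
      DSP_STATUS DMAPOOL_writeback (Uint32 poolId, Pvoid buf, Uint32 size)
      {
          (void) poolId ; (void) buf ; (void) size ;
          return DSP_SOK ;   /* the DSP keeps its copy coherent itself (item 2) */
      }

      DSP_STATUS DMAPOOL_invalidate (Uint32 poolId, Pvoid buf, Uint32 size)
      {
          (void) poolId ; (void) buf ; (void) size ;
          return DSP_SOK ;
      }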

    Regards,

    Grant

  • Grant,

    By 'host-based DMA', we mean modifying our reference implementation to use the DMA engine on your Atom Z510 GPP to do the DMA work. All you would need to do is change the lowest level of the DMA functionality to use the host DMA and (perhaps) use interrupt-based notification for the DMA. This way, your host-side CPU load would go down and you'd get the best performance.

    The reasons for the architecture as it currently is, i.e. host-side synchronization of the POOL and using DSP-side DMA, are:

    1. The PCI architecture of DSPLink supports one-to-many connections, i.e. one host to multiple slave DSPs in a star configuration. In this setup, if the DSP were made the master, i.e. the one responsible for keeping the POOL synchronized, the DSP might need to know about all the other slaves as well, and potentially have to initiate the synchronization mechanism for those slaves too. This was getting too complicated. The better option is to have the host be responsible for keeping the POOLs synchronized, since it is the only one that is aware of all the DSPs and can do the required work easily. Another reason for not using the DSP DMA in the final system on DM6437 is that there is a silicon erratum for PCI/DMA which significantly reduces the throughput; hence it is much better to use the host DMA.

    2. The second question that then comes up is: why are we using DSP DMA at all? We do not expect people's final configuration to actually be what we have provided, i.e. a PC connected to a DM6437 over PCI. We expect this to be just a pre-silicon development platform, with the final system using a different embedded GPP. Hence, our PCI implementation is simply a reference implementation. The intention was to demonstrate how to use the DMA and to make it easier for people to port. We also provide a memory-copy-based reference solution, but it is then harder for people to port that to use host DMA. We cannot use host DMA in our reference implementation since the PC's DMA is not available for our use. However, you can port the DSPLink s/w to use the host DMA on your GPP instead of the DSP DMA.

    Hence, most of our customers who use DSPLink for PCI-based usage with the DM6437 have actually gone the route we are suggesting, i.e. using GPP DMA instead of DSP DMA, while going with either the DSP DMA implementation or the memory-copy implementation for initial development (where speed is not necessarily an issue).

    Hope this helps ...

    Regards,
    Mugdha

  • Hello Mugdha and Deepali,

    First, I'd like to thank you both for the excellent quality of your replies - much appreciated!

    Unfortunately for us, the Intel Atom Z510 GPP we're using is basically an embedded PC, so the same DMA limitations apply as in your reference implementation (i.e. there's no host DMA controller available for generic PCI transfers). However, we still have a couple of options: i) write some code on the host and DSP sides to do host-initiated (but DSP-implemented) DMA with completion notification, or ii) implement a DMA engine on an FPGA we have attached to the PCI bus. Given that the DM647/8 have some PCI bus-master performance issues, we'll probably go with ii).

    Again, thanks for your help.

    Regards,

    Grant