Lower Performance of DM368 compared to DM365

We were using the DM365 for streaming/recording applications with DVSDK 2_10_01_18, for which CPU utilization was observed in the range of 60-70% (using the top command).
We have now migrated to the DM368 and are using the same DVSDK, with only clock configuration changes in the kernel.
The following is the code snippet for the kernel changes:

static struct plat_serial8250_port serial_platform_data[] = {
	{
		.membase  = (char *) IO_ADDRESS(DM365_UART1_BASE),
		.mapbase  = (unsigned long) DM365_UART1_BASE,
		.irq      = IRQ_UARTINT1,
		.flags    = UPF_BOOT_AUTOCONF | UPF_SKIP_TEST,
		.iotype   = UPIO_MEM,
		.regshift = 2,
		/* .uartclk = 121000000, */	/* for DM365 @ 297 MHz */
		.uartclk  = 170000000,		/* for DM368 @ 432 MHz */
	},
	{
		.membase  = (char *) IO_ADDRESS(DAVINCI_UART0_BASE),
		.mapbase  = (unsigned long) DAVINCI_UART0_BASE,
		.irq      = IRQ_UARTINT0,
		.flags    = UPF_BOOT_AUTOCONF | UPF_SKIP_TEST,
		.iotype   = UPIO_MEM,
		.regshift = 2,
		.uartclk  = 24000000,
	},
	{
		.flags = 0,
	},
};


We observed that, with the same application, the CPU utilization on the DM368 has increased by about 10% (to 70-80%).

Are any other changes required in the kernel/UBL/U-Boot with respect to DDR, etc.?

Regards

Haran

  • Haran,

    Did you flash the right UBL/U-Boot for the DM368? What clocks are configured for the ARM and DDR?

  • Renjith,

    I think I changed only the ARM clock and not the DDR clock. Can you give some insight into where I can change the DDR clock in the UBL, U-Boot, and kernel for the DM368?

     

  • Haran,

    Can you check the pre-built UBL and U-Boot available for the DM368 board? Otherwise, if you go through the UBL code, you'll easily figure out exactly where it can be changed.

  • Renjith,

    I have configured the ARM clock to 432 MHz and the DDR clock to 340 MHz, but I am still observing the same results. Can you provide some insight into what else I have to modify?

  • Haran,

    Can you do a memory write in U-Boot using the command "mw" and compare between the boards? Measure the time taken to finish the command on both boards and see whether there is any difference.

    Let's look at U-Boot first; if there is a difference there, then we have to dig into clocks or some configuration. If the DM368 is better, then we have to dig deep into the kernel to figure it out.

  • Renjith,

    I am actually facing difficulty in checking using the mw command, i.e. I am not able to figure out how to compare. Can you give some insight into this?

  • Haran,

    You can do a memory write for a big chunk of memory, like 32/64 MB, see how much time it takes using a simple stopwatch, and compare. If you need more exact figures, you have to modify the U-Boot code for the memory write command to call getticks() before and after and print the difference.

  • Renjith,

    I implemented a test app in the kernel that just wrote 2 MB of data to RAM, and observed the performance on the DM365 and DM368. On the DM365 the time taken was 4.27 s, and on the DM368 it was 3.12 s, so this clearly shows that the ARM and DDR are configured properly. But then why is there an increase in the CPU utilization of the DM368 compared to the DM365? Or is there any benchmark application with which I can specifically test the CPU load on the DM365 and DM368, so that I can conclude properly?

  • Haran,

    Can you try the benchmark applications LMbench and IOzone and compare the performance numbers?

  • Renjith,

    I have a basic question: is the performance of the DM368 better than the DM365? When I run some applications and check the CPU utilization using the top command, both the DM365 and DM368 show the same utilization. Can I rely on the top command for checking CPU utilization? And does CPU utilization give an actual measure of the performance of a processor? Can you give some insight on this? As far as I know, I have configured the DM368 properly.

  • Haran,

    Theoretically, top should report CPU utilization properly, but let me tell you from my experience: I've seen unreliable results from top, which made me think twice about its accuracy. top is basically helpless, as it only reads and interprets info from the /proc entries. How CPU utilization, or the percentage scheduled to the idle task, is calculated really depends on other factors. One such factor is the timer configuration: if the timer accuracy is wrong, say it gives 1200 ticks per second instead of 1000, then top will naturally show high CPU utilization, as each task appears to take more ticks to complete. Similarly, if we run the ARM at half the frequency, the ARM load will naturally be double the previous one.

    So, to understand it better, we need to spend more time analyzing the system and validate each and every parameter in detail. There have been cases where top reported more than 100% CPU utilization.

  • Renjith,

    This is the result of LMbench. We found a difference in these cases:

    1. Context switching (smaller is better): dm368 > dm365

    2. Local communication latencies, i.e. UDP, RPC/UDP, TCP, RPC/TCP, TCP conn (smaller is better): dm368 > dm365

    3. Local communication bandwidth for TCP (larger is better): dm368 < dm365

    I was unable to attach the file, so I will paste it below.

    ******************************************************************

                           DM365

                     L M B E N C H  3 . 0   S U M M A R Y
                     ------------------------------------
             (Alpha software, do not distribute)

    Basic system parameters
    ------------------------------------------------------------------------------
    Host                 OS Description              Mhz  tlb  cache  mem   scal
                                                         pages line   par   load
                                                               bytes  
    --------- ------------- ----------------------- ---- ----- ----- ------ ----
    10.60.2.2 Linux 2.6.18_     armv5tejl-linux-gnu  271     8    32 1.0000    1

    Processor, Processes - times in microseconds - smaller is better
    ------------------------------------------------------------------------------
    Host                 OS  Mhz null null      open slct sig  sig  fork exec sh  
                                 call  I/O stat clos TCP  inst hndl proc proc proc
    --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    10.60.2.2 Linux 2.6.18_  271 0.99 2.34 25.6 44.6 100. 4.17 12.4 3483 14.K 47.K

    Basic integer operations - times in nanoseconds - smaller is better
    -------------------------------------------------------------------
    Host                 OS  intgr intgr  intgr  intgr  intgr  
                              bit   add    mul    div    mod   
    --------- ------------- ------ ------ ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_ 3.8400 4.5700 1.3100  274.7  130.8

    Basic uint64 operations - times in nanoseconds - smaller is better
    ------------------------------------------------------------------
    Host                 OS int64  int64  int64  int64  int64  
                             bit    add    mul    div    mod   
    --------- ------------- ------ ------ ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_  7.730        3.4600 1612.3 1196.3

    Basic float operations - times in nanoseconds - smaller is better
    -----------------------------------------------------------------
    Host                 OS  float  float  float  float
                             add    mul    div    bogo
    --------- ------------- ------ ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_  122.6   97.3  451.8 1049.5

    Basic double operations - times in nanoseconds - smaller is better
    ------------------------------------------------------------------
    Host                 OS  double double double double
                             add    mul    div    bogo
    --------- ------------- ------  ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_  175.0  146.3 1961.7 2950.0

    Context switching - times in microseconds - smaller is better
    -------------------------------------------------------------------------
    Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                             ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
    --------- ------------- ------ ------ ------ ------ ------ ------- -------
    10.60.2.2 Linux 2.6.18_   96.9  114.7  103.3  123.7  119.3   129.1   122.2

    *Local* Communication latencies in microseconds - smaller is better
    ---------------------------------------------------------------------
    Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                            ctxsw       UNIX         UDP         TCP conn
    --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
    10.60.2.2 Linux 2.6.18_  96.9 221.0 398. 500.3 687.4 674.4 903.1 1828

    *Remote* Communication latencies in microseconds - smaller is better
    ---------------------------------------------------------------------
    Host                 OS   UDP  RPC/  TCP   RPC/ TCP
                                   UDP         TCP  conn
    --------- ------------- ----- ----- ----- ----- ----
    10.60.2.2 Linux 2.6.18_                             

    File & VM system latencies in microseconds - smaller is better
    -------------------------------------------------------------------------------
    Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                            Create Delete Create Delete Latency Fault  Fault  selct
    --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
    10.60.2.2 Linux 2.6.18_   79.6   47.3  341.9   94.6  7438.0 2.281    75.7  56.0

    *Local* Communication bandwidths in MB/s - bigger is better
    -----------------------------------------------------------------------------
    Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                                 UNIX      reread reread (libc) (hand) read write
    --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
    10.60.2.2 Linux 2.6.18_ 35.2 35.2 20.3   45.7  119.3   89.5   89.5 121. 234.7

    Memory latencies in nanoseconds - smaller is better
        (WARNING - may not be correct, check graphs)
    ------------------------------------------------------------------------------
    Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
    --------- -------------   ---   ----   ----    --------    --------    -------
    10.60.2.2 Linux 2.6.18_   271 7.8540  196.6       208.1       622.9    No L2 cache?

    ***************************************************************************************************************************

                                                       DM368

                     L M B E N C H  3 . 0   S U M M A R Y
                     ------------------------------------
             (Alpha software, do not distribute)

    Basic system parameters
    ------------------------------------------------------------------------------
    Host                 OS Description              Mhz  tlb  cache  mem   scal
                                                         pages line   par   load
                                                               bytes  
    --------- ------------- ----------------------- ---- ----- ----- ------ ----
    10.60.2.2 Linux 2.6.18_     armv5tejl-linux-gnu  394     8    32 1.0000    1

    Processor, Processes - times in microseconds - smaller is better
    ------------------------------------------------------------------------------
    Host                 OS  Mhz null null      open slct sig  sig  fork exec sh  
                                 call  I/O stat clos TCP  inst hndl proc proc proc
    --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
    10.60.2.2 Linux 2.6.18_  394 0.68 1.64 21.0 34.7 86.7 2.89 9.91 3132 12.K 43.K

    Basic integer operations - times in nanoseconds - smaller is better
    -------------------------------------------------------------------
    Host                 OS  intgr intgr  intgr  intgr  intgr  
                              bit   add    mul    div    mod   
    --------- ------------- ------ ------ ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_ 2.6400 3.1400 0.9000  188.4   89.9

    Basic uint64 operations - times in nanoseconds - smaller is better
    ------------------------------------------------------------------
    Host                 OS int64  int64  int64  int64  int64  
                             bit    add    mul    div    mod   
    --------- ------------- ------ ------ ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_  5.320        2.3800 1114.2  827.6

    Basic float operations - times in nanoseconds - smaller is better
    -----------------------------------------------------------------
    Host                 OS  float  float  float  float
                             add    mul    div    bogo
    --------- ------------- ------ ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_   84.7   66.9  310.6  719.9

    Basic double operations - times in nanoseconds - smaller is better
    ------------------------------------------------------------------
    Host                 OS  double double double double
                             add    mul    div    bogo
    --------- ------------- ------  ------ ------ ------ 
    10.60.2.2 Linux 2.6.18_  120.7  100.3 1346.0 2031.7

    Context switching - times in microseconds - smaller is better
    -------------------------------------------------------------------------
    Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                             ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
    --------- ------------- ------ ------ ------ ------ ------ ------- -------
    10.60.2.2 Linux 2.6.18_  100.5  115.6  106.6  132.0  116.4   134.5   121.8

    *Local* Communication latencies in microseconds - smaller is better
    ---------------------------------------------------------------------
    Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                            ctxsw       UNIX         UDP         TCP conn
    --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
    10.60.2.2 Linux 2.6.18_ 100.5 220.8 396. 513.2 705.6 693.5 927.9 1890

    *Remote* Communication latencies in microseconds - smaller is better
    ---------------------------------------------------------------------
    Host                 OS   UDP  RPC/  TCP   RPC/ TCP
                                   UDP         TCP  conn
    --------- ------------- ----- ----- ----- ----- ----
    10.60.2.2 Linux 2.6.18_                             

    File & VM system latencies in microseconds - smaller is better
    -------------------------------------------------------------------------------
    Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                            Create Delete Create Delete Latency Fault  Fault  selct
    --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
    10.60.2.2 Linux 2.6.18_   69.3   41.7  338.4   84.6  5235.0 0.857    53.2  39.5

    *Local* Communication bandwidths in MB/s - bigger is better
    -----------------------------------------------------------------------------
    Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                                 UNIX      reread reread (libc) (hand) read write
    --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
    10.60.2.2 Linux 2.6.18_ 34.5 35.7 8.77   49.2  140.6   90.9   75.8 142. 321.5

    Memory latencies in nanoseconds - smaller is better
        (WARNING - may not be correct, check graphs)
    ------------------------------------------------------------------------------
    Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
    --------- -------------   ---   ----   ----    --------    --------    -------
    10.60.2.2 Linux 2.6.18_   394 5.3930  194.4       206.4       633.4    No L2 cache?

    ****************************************************************************************************************************

    Could you give some insight on this? Should we change anything in the networking-related configuration for the DM368?

    Regards,

    Haran

  • Haran,

    Did you compare the kernel config in case of DM365 and DM368? 

  • Renjith,

    I compared the config files but there are no differences.

  • Haran,

    There is something seriously wrong. Can you compare the clocks of the other peripherals as well, especially the timer? It looks like the ARM performance is proper.

  • Renjith,

    Can you please tell me how to check these clocks?

  • Haran,

    You have to dump the PLLC1 and PLLC2 registers. You can use the devmem2 tool from Linux. The base addresses of PLLC1 and PLLC2 are mentioned in the ARM Subsystem document for the DM36x.

  • Renjith,

    As far as I have seen, all the clock values are proper. Below are the clock values defined in the UBL:

    ARM: 432 MHz
    DDR: 340 MHz
    MJCP: 340 MHz
    EDMA: 170 MHz
    VPSS: 340 MHz
    MMC/SD0: 170 MHz
    CLKOUT: 680 MHz
    VOICE RATE: 144 MHz
    VIDEO_HD: 72 MHz
    VIDEO_SD: 27 MHz

    Is there any other configuration that i am missing?

  • Haran,

    Did you compare all the clocks with DM365? Can you share both?

  • Renjith,

    The following is the comparison that you asked for:

                         DM368        DM365
    ARM:                 432 MHz      297 MHz
    DDR:                 340 MHz      243 MHz (but in the UBL it is set to 486 MHz, i.e. the divider factor is 1; if I try to set the same divider factor for the DM368, the board will not boot at all)
    MJCP:                340 MHz      243 MHz
    EDMA:                170 MHz      121 MHz
    VPSS:                340 MHz      243 MHz
    MMC/SD0:             170 MHz      121 MHz
    CLKOUT:              680 MHz      243 MHz
    VOICE RATE:          144 MHz      99 MHz
    VIDEO_HD:            72 MHz       74.25 MHz
    VIDEO_SD:            27 MHz       27 MHz

    These are the configurations set in the two UBLs.

    Regards

    Haran

  • Haran,

    All the clocks look fine. Since you dumped the clocks from the UBL, there is a chance that some driver in the kernel/U-Boot changes a clock later. So can you please check whether all the clock values are the same in the kernel as well? As suggested earlier, you can use the devmem2 tool to dump the register values from the kernel command line.

    Also, one more thing to check: where is your file system running from?

    Don't worry about the DDR clock; even if you set the PLL to 243, there is a multiplier that doubles the frequency before it is given to the RAM.

  • Renjith,

    I checked the corresponding values in the kernel using the devmem2 tool, and the values remain unchanged. I am using NFS as the file system.

  • Haran,

    Can you try using an SD or NAND file system?

  • Renjith,

    I tried CRAMFS, but it gives the same result as before; the performance is still low.

  • Haran,

    This issue needs more analysis. If you are really serious about it, then I can assign one of my engineers; you'll have to spare two boards for that.

  • Renjith,

    Sure, I want to sort out this issue as early as possible, and I am trying everything I can to put it in place.

  • Haran,

    If you can share two boards, do email me at renjith.thomas@pathpartnertech.com. 

  • Renjith,

    It is not possible for me to share the boards with you, since it's a client project. :( You can give me suggestions, and I will do exactly what you tell me.

  • Haran,

    No issues. One last thing from my side: can you check the kernel config variables CONFIG_HZ and CONFIG_NO_HZ? If CONFIG_NO_HZ is enabled, disable it. Also, try changing the default value of CONFIG_HZ to 1000.

  • Renjith,

    What should we observe if we make the above changes? Will the performance of the processor increase? Can I know how the above changes will influence performance?

  • Haran,

    This is not going to improve your CPU's raw performance. But the CPU load reported by the kernel as a percentage doesn't mean that your CPU is occupied 100% by real processing; some of the time it might be polling or waiting on something. When you change the timer interval, the kernel switches between tasks more often, thereby giving the CPU to other starving tasks as well. I've seen differences in the performance of major kernel subsystems, like the filesystem, when the timer interval is changed.

  • Thanks for the reply Renjith...

  • Renjith

    One more query. In top, we are actually observing that the CPU load for the DM368 is higher than for the DM365. Is our interpretation correct that the higher the CPU load for a particular process, the better the performance, since for a given amount of time the process is using the CPU more efficiently?
    We actually ran an application on both the DM365 and DM368 for about 5 minutes and calculated the CPU load using the "time" command. The observations are as follows:

                                          DM365       DM368

    System time (s)                       47.16       53.47

    User time (s)                        163.94      223.11

    Real time (s)                        302.82      301.77

    CPU load = (system + user) / real    69.71%      91.65%

    Frames encoded                         8865       12625

    Our interpretation is that the DM368 has encoded more frames in the same time and with the same configuration compared to the DM365, yet the user time and system time are higher on the DM368.

    So is DM368 performing better than DM365?

  • Haran,

    The CPU load is nothing but 100 minus the CPU idle percentage; basically, the CPU load is determined by the CPU idle time. The idle time is calculated based on timer interrupts, scheduler intervals, etc. If this is not accurate, then you will not be able to infer anything meaningful from the percentage values.

    The best way to evaluate the performance is to run a single-threaded application on both SoCs. Best is to try from the boot loader, when there is no load on the system. This will give exact performance numbers.

    When multiple cores and DMA are running, an ARM instruction might stall for longer than normal, which is nothing but wastage of CPU cycles. But this wastage can never be captured via CPU idle. The ARM doing a similar operation at two different times will take a different number of CPU cycles, and CPU idle will vary based on this. But in the true sense the ARM has executed the same number of instructions; because of external factors (stalling), it took longer in one of the cases.

    To evaluate the raw performance of the ARM, you have to run a simple integer-math operation from the bootloader and compare. This will have no dependency on DDR bandwidth if the I-cache is enabled.

    I hope I'm clear as much as possible.

  • Hi Renjith,

    Just to update you: I am now able to get the expected performance from our DM368 custom board. The problem causing the decreased performance was that I was configuring the UBL to provide a 1x DDR clock instead of a 2x clock, which had a major impact on the H.264 hardware encoder performance on the DM368 due to some synchronization issue. Now I am getting the expected performance on the DM368 custom board. Thanks for your support.

    Regards

    Haran

  • Haran,

    Glad that your issue is fixed. But earlier in this thread you had validated the ARM and DDR clocks, I believe?

  • Renjith,

    Of course I validated it, but I actually got some shocking results when I validated against the TI EVM.

    With the TI EVM with a 2x DDR clock, i.e. with PLLDIV7 of PLL1 set to 8000, I executed the sample application that wrote 2 MB of data to memory; it took 3.21 s. I also executed the standalone H.264 encoder application, which uses the hardware encoder of the DM368; it took approximately 11 ms to encode the data.

    When I set the DDR clock to 1x, i.e. with PLLDIV7 of PLL1 set to 8001, and executed the sample app that wrote 2 MB of data to memory, it still took 3.21 s. But when I executed the standalone H.264 encoder application, this time it took 14 ms (i.e. 3 ms more than expected).

    So I assumed that, even though on its own the DDR was performing on par with the requirements with a 1x clock, there was still some sync issue between the DDR and the hardware encoder, due to which the standalone encoder app took more time.

    Now that I provide a 2x DDR clock on my custom board, it performs as expected.

  • Haran,

    This is really interesting behavior. I feel you should put a scope on the DDR lines to understand exactly what clock is being driven on the bus.

    Anyway, it's a good find.