TDA4VM memcpy poor performance after memset zero

Part Number: TDA4VM
Other Parts Discussed in Thread: TDA4VH

Hi, experts:

On the TDA4VM platform, we found that memcpy performance drops after the buffers are zeroed with memset. Test code follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Despite the "Ms" in the name, this returns per-thread CPU time in microseconds. */
unsigned long long DEMO_TST_GetMs()
{
    struct timespec time = {0};

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &time);

    return (unsigned long long)time.tv_sec * 1000000ULL + time.tv_nsec / 1000;
}

int main()
{
    int i = 0, j = 0;
    unsigned long long lStart = 0;
    unsigned long long lEnd = 0;
    unsigned long long all = 0;

    void *pa = malloc(20 * 1024 * 1024);   /* destination of the memcpy */
    void *pb = malloc(20 * 1024 * 1024);   /* source of the memcpy */

    /* do memset ? */
    //memset(pa, 0, 20 * 1024 * 1024);
    //memset(pb, 0, 20 * 1024 * 1024);

    for (j = 0; j < 100; j++)
    {
        lStart = DEMO_TST_GetMs();

        int sum = 0;

        for (i = 0; i < 100; i++)
        {
            sum += i;

            memcpy(pa, pb, 20 * 1024 * 1024);
        }

        lEnd = DEMO_TST_GetMs();

        all += lEnd - lStart;
        //printf("%llu, %llu\n", lEnd - lStart, all / (j + 1));
    }

    free(pb);
    free(pa);

    return 0;
}
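
For reference, a minimal way to build and profile the test (the exact compiler flags are not stated in this thread, so -O2 is an assumption):

gcc -O2 -o memcpy_test memcpy_test.c
perf stat ./memcpy_test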

Memcpy test without memset: total time about 27 seconds (0.16% miss rate over all L1-dcache accesses):

Performance counter stats for 'memcpy_noset':

27379.08 msec task-clock # 0.997 CPUs utilized
3408 context-switches # 0.124 K/sec
1 cpu-migrations # 0.000 K/sec
669 page-faults # 0.024 K/sec
54712024432 cycles # 1.998 GHz
32828016573 instructions # 0.60 insn per cycle
<not supported> branches
418668 branch-misses

39342461969 L1-dcache-loads
61556500 L1-dcache-load-misses # 0.16% of all L1-dcache accesses
26222847268 L1-dcache-stores

27.270167320 seconds time elapsed

27.090683000 seconds user
0.011870000 seconds sys

Memcpy test with memset: total time about 83 seconds (0.44% miss rate over all L1-dcache accesses):

Performance counter stats for 'memcpy':
85824.02 msec task-clock # 0.997 CPUs utilized
10882 context-switches # 0.127 K/sec
0 cpu-migrations # 0.000 K/sec
677 page-faults # 0.008 K/sec
171502032179 cycles # 1.998 GHz
32956132885 instructions # 0.19 insn per cycle
<not supported> branches
1252779 branch-misses

39390732794 L1-dcache-loads
171766881 L1-dcache-load-misses # 0.44% of all L1-dcache accesses
26245423599 L1-dcache-stores

84.339231775 seconds time elapsed

83.769493000 seconds user
0.015800000 seconds sys

Thanks

quanli

  • Hi Quanli,

    Could you please answer the questions below so that I can reproduce the issue:

    • Which SDK version is being used?
    • Did you make any changes on Default SDK before testing?
    • What is the OS you are running this on?

    Best Regards,
    Keerthy

  • Hi Keerthy,

    Which SDK version is being used?

    SDK 8.1

    What is the OS you are running this on?

    Linux

    Thanks

    quanli

  • Hi Quanli,

    Apologies for the long delay. Can you comment out the source (pb) memset and keep only the destination (pa) memset, as below?


        /* do memset ? */
        memset(pa, 0, 20 * 1024 * 1024);
        //memset(pb, 0, 20 * 1024 * 1024);

    Since the destination will be overwritten by memcpy anyway, does it make sense to memset it before the copy?

    With the above I see the same performance as without any memset.

    - Keerthy

  • Hi Keerthy,

    1. Memset only the source buffer (pb): takes 80 s

        /* do memset ? */
        //memset(pa, 0, 20 * 1024 * 1024);
        memset(pb, 0, 20 * 1024 * 1024);

    root@j7-evm:/mnt# perf stat ./a.out
    Performance counter stats for './a.out':

    80636.50 msec task-clock # 1.000 CPUs utilized
    112 context-switches # 0.001 K/sec
    1 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.008 K/sec
    161270853392 cycles # 2.000 GHz
    32881973600 instructions # 0.20 insn per cycle
    <not supported> branches
    463447 branch-misses

    80.643656310 seconds time elapsed

    80.222293000 seconds user
    0.074828000 seconds sys

    2. Memset only the destination buffer (pa): takes 26 s

        /* do memset ? */
        memset(pa, 0, 20 * 1024 * 1024);
        //memset(pb, 0, 20 * 1024 * 1024);

    root@j7-evm:/mnt# perf stat ./a.out

    Performance counter stats for './a.out':

    26372.76 msec task-clock # 0.999 CPUs utilized
    36 context-switches # 0.001 K/sec
    0 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.025 K/sec
    52744680704 cycles # 2.000 GHz
    32808252931 instructions # 0.62 insn per cycle
    <not supported> branches
    181336 branch-misses

    26.392054970 seconds time elapsed

    26.229746000 seconds user
    0.031289000 seconds sys

    3. Memset both the source and destination buffers: takes 80 s

        /* do memset ? */
        memset(pa, 0, 20 * 1024 * 1024);
        memset(pb, 0, 20 * 1024 * 1024);


    root@j7-evm:/mnt# perf stat ./a.out

    Performance counter stats for './a.out':

    80740.41 msec task-clock # 1.000 CPUs utilized
    113 context-switches # 0.001 K/sec
    1 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.008 K/sec
    161478707405 cycles # 2.000 GHz
    32884543675 instructions # 0.20 insn per cycle
    <not supported> branches
    497833 branch-misses

    80.761189095 seconds time elapsed

    80.334582000 seconds user
    0.059161000 seconds sys

    4. No memset: takes 26 s

        /* do memset ? */
        //memset(pa, 0, 20 * 1024 * 1024);
        //memset(pb, 0, 20 * 1024 * 1024);

    root@j7-evm:/mnt# perf stat ./a.out

    Performance counter stats for './a.out':

    26352.74 msec task-clock # 1.000 CPUs utilized
    5 context-switches # 0.000 K/sec
    0 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.025 K/sec
    52704630211 cycles # 2.000 GHz
    32802721612 instructions # 0.62 insn per cycle
    <not supported> branches
    181438 branch-misses

    26.354800315 seconds time elapsed

    26.258809000 seconds user
    0.007984000 seconds sys


    Can you reproduce this issue on a TDA4VM EVM board?

    Thanks

    quanli

  • Can you reproduce this issue on a TDA4VM EVM board?

    Yes. That is why I commented:

    Since the destination will be overwritten by memcpy anyway, does it make sense to memset it before the copy?

    The destination is overwritten with new data by memcpy in any case, hence the comment above. Is the request here to triage why memcpy takes longer when we initialize the buffers using memset?

    - Keerthy

  • Hi Keerthy

    Since the destination will be overwritten by memcpy anyway, does it make sense to memset it before the copy?

    Agreed, it makes no sense to memset the destination before the copy.

    Is the request here to triage why memcpy takes longer when we initialize the buffers using memset?

    Yes. This problem exists on the TDA4VM platform only.

    On the TDA4VH/TDA4VE platforms the test case always takes about 80 s, whether or not memset is performed before memcpy, but on the TDA4VM platform memcpy takes only 26 s when no memset is done.

    So our algorithm runs less efficiently on the TDA4VH platform (compared with TDA4VM).

    quanli
  • Hi,

    I have reproduced the behavior on TDA4VE. I will investigate this and get back. Thanks for the test case.

    Best Regards,

    Keerthy

  • Hi Keerthy,

    I have reproduced the behavior on TDA4VE. I will investigate this and get back. Thanks for the test case.

    Is there any update on this?

    Thanks

    Quanli

  • Hi Quanli,

    Use the below commands from Linux:

    echo 3 > /proc/sys/vm/drop_caches
    echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

    With these settings I see that the time has reduced: the test case now completes in about 22 seconds, on TDA4VH as well.
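
    For completeness: in "madvise" mode the kernel only backs mappings with transparent hugepages when the application explicitly requests them, so code has to opt in per-allocation. Below is a minimal sketch of such an opt-in; the helper name and the 2 MiB alignment are illustrative assumptions, not something from this thread:

        #include <stdlib.h>
        #include <sys/mman.h>

        /* Allocate a buffer aligned to 2 MiB so the kernel can back it with
         * huge pages, then request huge pages with madvise(MADV_HUGEPAGE). */
        static void *alloc_hint_thp(size_t size)
        {
            void *p = NULL;

            if (posix_memalign(&p, 2 * 1024 * 1024, size) != 0)
                return NULL;

            /* Best effort: if madvise() fails, the buffer is still usable
             * with regular 4 KiB pages. */
            madvise(p, size, MADV_HUGEPAGE);

            return p;
        }

    In the test above, pa and pb could be allocated with this helper instead of plain malloc().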

    Best Regards,
    Keerthy