TDA4VM memcpy poor performance after memset zero

Part Number: TDA4VM
Other Parts Discussed in Thread: TDA4VH

Hi, experts:

On the TDA4VM platform, we found that memcpy performance drops after the buffers are zeroed with memset. Test code follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Despite the "Ms" in the name, this returns per-thread CPU time in microseconds. */
unsigned long long DEMO_TST_GetMs()
{
    struct timespec time = {0};

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &time);

    return (unsigned long long)time.tv_sec * 1000000ULL + time.tv_nsec / 1000;
}

int main()
{
    int i = 0, j = 0;
    unsigned long long lStart = 0;
    unsigned long long lEnd = 0;
    unsigned long long all = 0;

    void *pa = malloc(20 * 1024 * 1024);   /* destination of the memcpy */
    void *pb = malloc(20 * 1024 * 1024);   /* source of the memcpy */

    /* do memset ? */
    //memset(pa, 0, 20 * 1024 * 1024);
    //memset(pb, 0, 20 * 1024 * 1024);

    for (j = 0; j < 100; j++)
    {
        lStart = DEMO_TST_GetMs();

        int sum = 0;

        for (i = 0; i < 100; i++)
        {
            sum += i;

            memcpy(pa, pb, 20 * 1024 * 1024);
        }

        lEnd = DEMO_TST_GetMs();

        all += lEnd - lStart;
        //printf("%llu, %llu\n", lEnd - lStart, all / (j + 1));
    }

    free(pb);
    free(pa);

    return 0;
}
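
For reference, a minimal way to build and profile the test (the exact compiler flags are not stated in this thread, so -O2 is an assumption):

gcc -O2 -o memcpy_test memcpy_test.c
perf stat ./memcpy_test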

Memcpy test without memset: total time about 27 seconds (0.16% miss rate over all L1-dcache accesses):

Performance counter stats for 'memcpy_noset':

27379.08 msec task-clock # 0.997 CPUs utilized
3408 context-switches # 0.124 K/sec
1 cpu-migrations # 0.000 K/sec
669 page-faults # 0.024 K/sec
54712024432 cycles # 1.998 GHz
32828016573 instructions # 0.60 insn per cycle
<not supported> branches
418668 branch-misses

39342461969 L1-dcache-loads
61556500 L1-dcache-load-misses # 0.16% of all L1-dcache accesses
26222847268 L1-dcache-stores

27.270167320 seconds time elapsed

27.090683000 seconds user
0.011870000 seconds sys

Memcpy test with memset: total time about 83 seconds (0.44% miss rate over all L1-dcache accesses):

Performance counter stats for 'memcpy':
85824.02 msec task-clock # 0.997 CPUs utilized
10882 context-switches # 0.127 K/sec
0 cpu-migrations # 0.000 K/sec
677 page-faults # 0.008 K/sec
171502032179 cycles # 1.998 GHz
32956132885 instructions # 0.19 insn per cycle
<not supported> branches
1252779 branch-misses

39390732794 L1-dcache-loads
171766881 L1-dcache-load-misses # 0.44% of all L1-dcache accesses
26245423599 L1-dcache-stores

84.339231775 seconds time elapsed

83.769493000 seconds user
0.015800000 seconds sys

Thanks

quanli

  • Hi Quanli,

    Could you please answer the questions below so that I can reproduce the issue:

    • Which SDK version is being used?
    • Did you make any changes on Default SDK before testing?
    • What is the OS you are running this on?

    Best Regards,
    Keerthy

  • Hi Keerthy,

    Which SDK version is being used?

    SDK 8.1

    What is the OS you are running this on?

    Linux

    Thanks

    quanli

  • Hi Quanli,

    Apologies for the long delay. Can you comment out the source (pb) memset and keep only the destination (pa) memset, as below?


        /* do memset ? */
        memset(pa, 0, 20 * 1024 * 1024);
        //memset(pb, 0, 20 * 1024 * 1024);

    Since the destination will be overwritten by memcpy anyway, does it make sense to memset it before the copy?

    With the above I see the same performance as without any memset.

    - Keerthy

  • Hi Keerthy,

    1. Memset only the source buffer (pb): takes 80 s

        /* do memset ? */
        //memset(pa, 0, 20 * 1024 * 1024);
        memset(pb, 0, 20 * 1024 * 1024);

    root@j7-evm:/mnt# perf stat ./a.out
    Performance counter stats for './a.out':

    80636.50 msec task-clock # 1.000 CPUs utilized
    112 context-switches # 0.001 K/sec
    1 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.008 K/sec
    161270853392 cycles # 2.000 GHz
    32881973600 instructions # 0.20 insn per cycle
    <not supported> branches
    463447 branch-misses

    80.643656310 seconds time elapsed

    80.222293000 seconds user
    0.074828000 seconds sys

    2. Memset only the destination buffer (pa): takes 26 s

        /* do memset ? */
        memset(pa, 0, 20 * 1024 * 1024);
        //memset(pb, 0, 20 * 1024 * 1024);

    root@j7-evm:/mnt# perf stat ./a.out

    Performance counter stats for './a.out':

    26372.76 msec task-clock # 0.999 CPUs utilized
    36 context-switches # 0.001 K/sec
    0 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.025 K/sec
    52744680704 cycles # 2.000 GHz
    32808252931 instructions # 0.62 insn per cycle
    <not supported> branches
    181336 branch-misses

    26.392054970 seconds time elapsed

    26.229746000 seconds user
    0.031289000 seconds sys

    3. Memset both the source and destination buffers: takes 80 s

        /* do memset ? */
        memset(pa, 0, 20 * 1024 * 1024);
        memset(pb, 0, 20 * 1024 * 1024);


    root@j7-evm:/mnt# perf stat ./a.out

    Performance counter stats for './a.out':

    80740.41 msec task-clock # 1.000 CPUs utilized
    113 context-switches # 0.001 K/sec
    1 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.008 K/sec
    161478707405 cycles # 2.000 GHz
    32884543675 instructions # 0.20 insn per cycle
    <not supported> branches
    497833 branch-misses

    80.761189095 seconds time elapsed

    80.334582000 seconds user
    0.059161000 seconds sys

    4. No memset: takes 26 s

        /* do memset ? */
        //memset(pa, 0, 20 * 1024 * 1024);
        //memset(pb, 0, 20 * 1024 * 1024);

    root@j7-evm:/mnt# perf stat ./a.out

    Performance counter stats for './a.out':

    26352.74 msec task-clock # 1.000 CPUs utilized
    5 context-switches # 0.000 K/sec
    0 cpu-migrations # 0.000 K/sec
    666 page-faults # 0.025 K/sec
    52704630211 cycles # 2.000 GHz
    32802721612 instructions # 0.62 insn per cycle
    <not supported> branches
    181438 branch-misses

    26.354800315 seconds time elapsed

    26.258809000 seconds user
    0.007984000 seconds sys


    Can you reproduce this issue on a TDA4VM EVM board?

    Thanks

    quanli

  • Can you reproduce this issue on a TDA4VM EVM board?

    Yes. That is why I commented:

    Since the destination will be overwritten by memcpy anyway, does it make sense to memset it before the copy?

    The destination is overwritten with new data by memcpy in any case, hence the comment above. Is the request here to triage why memcpy takes longer when we initialize the buffers using memset?

    - Keerthy

  • Hi Keerthy

    Since the destination will be overwritten by memcpy anyway, does it make sense to memset it before the copy?

    Agreed, it makes no sense to memset the destination before the copy.

    Is the request here to triage why memcpy takes longer when we initialize the buffers using memset?

    Yes. This problem exists on the TDA4VM platform only.

    On the TDA4VH/TDA4VE platforms the test case always takes about 80 s, whether or not memset is performed before memcpy, but on the TDA4VM platform memcpy takes only 26 s when no memset is done.

    So our algorithm runs less efficiently on the TDA4VH platform (compared with TDA4VM).

    quanli
  • Hi,

    I have reproduced the behavior on TDA4VE. I will investigate this and get back. Thanks for the test case.

    Best Regards,

    Keerthy

  • Hi Keerthy,

    I have reproduced the behavior on TDA4VE. I will investigate this and get back. Thanks for the test case.

    Is there any update on this?

    Thanks

    Quanli

  • Hi Quanli,

    Use the below commands from Linux:

    echo 3 > /proc/sys/vm/drop_caches
    echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

    With these settings I see that the time has reduced: the test case now completes in about 22 seconds, on TDA4VH as well.
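
    For completeness: in "madvise" mode the kernel only backs mappings with transparent hugepages when the application explicitly requests them, so code has to opt in per-allocation. Below is a minimal sketch of such an opt-in; the helper name and the 2 MiB alignment are illustrative assumptions, not something from this thread:

        #include <stdlib.h>
        #include <sys/mman.h>

        /* Allocate a buffer aligned to 2 MiB so the kernel can back it with
         * huge pages, then request huge pages with madvise(MADV_HUGEPAGE). */
        static void *alloc_hint_thp(size_t size)
        {
            void *p = NULL;

            if (posix_memalign(&p, 2 * 1024 * 1024, size) != 0)
                return NULL;

            /* Best effort: if madvise() fails, the buffer is still usable
             * with regular 4 KiB pages. */
            madvise(p, size, MADV_HUGEPAGE);

            return p;
        }

    In the test above, pa and pb could be allocated with this helper instead of plain malloc().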

    Best Regards,
    Keerthy