
PROCESSOR-SDK-AM64X: Context switching underperformance on newer Linux SDK versions

Part Number: PROCESSOR-SDK-AM64X

We have an application use-case that is sensitive to context switching inefficiency. We see meaningfully reduced benchmark results as we move to newer versions of the Linux SDK. Using the AM64x development kit we collected data using both stress-ng and perf bench:

  • stress-ng --context 1 --perf --metrics-brief --timeout 5s
  • perf bench sched

To provide rough numbers from the stress-ng results, we see 1100 ops/s (Linux SDK 08.06.00.42, 5.10 kernel) -> 755 ops/s (Linux SDK 09.02.01.10, 6.1 kernel) -> 710 ops/s (Linux SDK 10.01.10.04, 6.6 kernel). The magnitude of difference is similar in perf bench results, and we see the same kind of impact when tracing and profiling our own applications.
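To put the magnitude in relative terms, the quoted throughputs work out to roughly a 31% drop from SDK 8.6 to 9.2.1.10 and a further ~6% from 9.2.1.10 to 10.1. A quick check using the exact figures above:

```shell
# Relative regression between the stress-ng --context throughputs quoted above.
awk 'BEGIN {
    sdk_8_6  = 1100   # ops/s, Linux SDK 08.06.00.42 (5.10 kernel)
    sdk_9_2  = 755    # ops/s, Linux SDK 09.02.01.10 (6.1 kernel)
    sdk_10_1 = 710    # ops/s, Linux SDK 10.01.10.04 (6.6 kernel)
    printf "8.6 -> 9.2.1.10: %.1f%% drop\n", (sdk_8_6 - sdk_9_2) / sdk_8_6 * 100
    printf "9.2.1.10 -> 10.1: %.1f%% drop\n", (sdk_9_2 - sdk_10_1) / sdk_9_2 * 100
    printf "8.6 -> 10.1: %.1f%% drop\n", (sdk_8_6 - sdk_10_1) / sdk_8_6 * 100
}'
```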

We're seeking some help driving down the difference. Thanks!

  • Hello Cory,

    I am reaching out to my team members who spend more time optimizing cyclictest benchmarks. Feel free to ping the thread if I have not replied by the middle of next week.

    Regards,

    Nick

  • Hello Cory,

    Apologies for the delayed responses here. I have missed the key team member 5 or 6 times now since the last post. I am trying to get ahold of them tomorrow, feel free to ping the thread if I have not replied again before the end of the week.

    Regards,

    Nick

  • Hello Cory,

    Ok, thank you for your patience.

    perf is not a tool that my coworkers or I have much experience with, so I will not be able to get as deep into the details as you might be hoping for. But let's cover some basic concepts that affect cyclictest and lmbench - my hope is that perf results are affected by similar factors.

    First, let's talk benchmarks 

    In general, I would expect the benchmark numbers to look the best on SDK 9.1 & 9.2. Let me summarize the (non-optimized) benchmark results reported over the different SDK releases.

    Lower is better for every value below. I just checked the lmbench results on 11.0, and everything is about the same except for lat_ctx-4-256k (us), where I got ~8 usec.

    SDK       cyclictest avg (us)  cyclictest max (us)  lat_ctx-2-128k (us)  lat_ctx-2-256k (us)  lat_ctx-4-128k (us)  lat_ctx-4-256k (us)
    8.6       8, 8                 152, 82              N/A                  N/A                  N/A                  N/A
    9.0       12, 13               168, 88              N/A                  N/A                  N/A                  N/A
    9.1       7, 7                 72, 48               6.01                 15.74                7.02                 10.76
    9.2.1.10  7, 8                 60, 50               5.12                 20.50                7.30                 10.17
    10.0      8, 9                 49, 66               6.35                 19.78                6.93                 3.52 (typo?)
    10.1      8, 8                 77, 105              5.52                 19.77                6.16                 0.00 (typo?)
    11.0      copied from 10.1     copied from 10.1     14.36                32.22                11.99                19.26
    11.1      copied from 10.1     copied from 10.1     10.52 (min 9.79,     25.39 (min 15.41,    11.98 (min 11.81,    12.54 (min 6.99,
                                                        max 11.79)           max 30.59)           max 12.29)           max 15.42)

    Why is SDK 9.1 the "best"?

    In general, the more "junk" running on the Linux kernel and filesystem, the worse the benchmarks will be. Each new kernel version enables more features in the default configs, which means more code running by default.

    In SDK 9.1, we updated the kernel configs to strip out as much of that unneeded stuff as possible. That's the "ti_arm64_prune.config" and "ti_rt.config" listed in the SDK docs: https://software-dl.ti.com/processor-sdk-linux/esd/AM64X/09_02_01_10/exports/docs/linux/Foundational_Components_Kernel_Users_Guide.html#using-default-configurations

    where you can find the files here:
    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/kernel/configs/ti_rt.config?h=ti-linux-6.12.y-cicd 
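    If it helps to reproduce that setup, fragments under kernel/configs/ can be layered on top of the defconfig using the kernel's standard config-fragment mechanism. A sketch (the fragment names come from the links above; the cross-compiler prefix is an assumption, so adjust it for your toolchain):

    ```shell
    # Apply TI's prune and RT fragments on top of the arm64 defconfig.
    # Run from the ti-linux-kernel source tree.
    make ARCH=arm64 CROSS_COMPILE=aarch64-none-linux-gnu- \
        defconfig ti_arm64_prune.config ti_rt.config
    ```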

    We have continued to disable some of the unneeded things in later SDKs, but it sounds like we have not dedicated the time to disable ALL the new stuff that has been added. So in general, you would expect to see performance getting slightly worse in the later SDKs.

    So how does that relate to your numbers?

    I am surprised that you actually saw a sharp drop in performance going from SDK 8.6 to SDK 9.2.1.10 if you are applying the kernel configs above. Could you tell me a bit more about your test environment?

    Another thing that might make a difference (I have not verified) is the filesystem size. I suspect we may see better performance on a tiny filesystem than on the default filesystem which has a lot more code running.
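    As a rough way to compare two filesystem images on this point, you could count processes and running services on the target after boot (a generic sketch, not specific to the TI SDK images):

    ```shell
    # Count scheduler-visible processes and running services on the target.
    # More of either generally means more background noise during a benchmark.
    echo "processes: $(ps -e | wc -l)"
    echo "running services: $(systemctl list-units --type=service --state=running --no-legend 2>/dev/null | wc -l)"
    ```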

    Regards,

    Nick