Keystone II Linux: Random High CPU usage - userspace application using 1 full core

Other Parts Discussed in Thread: 66AK2H06, 66AK2H12

Hi,

We are facing an issue on our custom board based on K2HK (2 ARM cores and 4 DSP cores): an ARM core becomes blocked when we start a userspace application. Whenever we start the application, we randomly end up in a state where one of the ARM cores is fully used by, and blocked with, the application.

In that state, top reports 50% CPU idle time.

We tried to reproduce the issue on the K2HK eval kit, and it could be reproduced there as well. The eval kit uses the 66AK2H12 (the version with 4 ARM cores and 8 DSP cores). In order to provoke the issue, we have to start and stop the userspace application many times until top shows 70% idle time. In the normal state, with the application running, CPU idle time is 98%.

We are using the kernel from the TI Keystone 2 git repository with tag K2_LINUX_03.10.10_14.12.

The kernel version is 3.10.10-00067-ge366686

We have also tested the latest kernel from the master branch of the TI Keystone 2 git repository, and the problem is still there. We tested the userspace application with the initial kernel version based on 3.8 and could not reproduce the issue.

We suspect that this is a kernel bug. We have not tried earlier 3.10 versions. We will try moving back to older kernel versions to see whether the issue is still there.

The userspace application creates a 10 ms timer. The code snippet is below.

Regards

Rams

// Build with C++11 or later (e.g. g++ -std=c++11); std::chrono and auto need it.
#include <chrono>
#include <cstdint>
#include <cstdio>

#include <string.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/timerfd.h>
#include <sys/types.h>

int main()
{
    // Create a timer that expires every 10 ms (first expiry after 1 ms).
    int ret = 0;
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    printf("FD: %d\n", fd);

    struct itimerspec ts;
    memset(&ts, 0, sizeof(ts));
    ts.it_interval.tv_sec = 0;
    ts.it_interval.tv_nsec = 10 * 1000 * 1000;
    ts.it_value.tv_sec = 0;
    ts.it_value.tv_nsec = 1 * 1000 * 1000;
    if (timerfd_settime(fd, 0, &ts, NULL) < 0)
    {
        printf("timerfd_settime Failed\n");
    }

    fd_set readSet;

    while (true)
    {
        // Wait for the timer to expire.
        FD_ZERO(&readSet);
        FD_SET(fd, &readSet);
        ret = select(fd + 1, &readSet, NULL, NULL, NULL);
        if (ret == -1)
        {
            perror("select()");
        }
        else if (ret)
        {
            if (FD_ISSET(fd, &readSet))
            {
                // read() returns the number of expirations since the last read.
                uint64_t timerExpCnt = 0;
                if (read(fd, &timerExpCnt, sizeof(uint64_t)) > 0)
                {
                    printf("timer expired\n");
                }
            }
            else
            {
                printf("Select returned, but the timer did not expire!\n");
                break;
            }
        }
        else
        {
            printf("select timed out\n");
        }

        // Do work for a small amount of time (busy-loop for 2 ms).
        const auto done = std::chrono::steady_clock::now() + std::chrono::milliseconds(2);

        int i = 0;
        while (std::chrono::steady_clock::now() < done)
        {
            ++i;
        }
    }

    close(fd);

    return 0;
}
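The read() on a timerfd returns a 64-bit count of expirations since the previous read, which the loop above discards. As a minimal sketch (not part of the original application), the count could be checked to see whether the 10 ms loop itself is falling behind when the core goes busy. The helper names make_periodic_timer and wait_expirations are hypothetical.

```cpp
#include <sys/timerfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: create a periodic CLOCK_MONOTONIC timerfd with a
// sub-second period. Returns the descriptor, or -1 on failure.
int make_periodic_timer(long period_ns)
{
    assert(period_ns > 0 && period_ns < 1000000000L);
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0)
        return -1;
    struct itimerspec ts = {};
    ts.it_interval.tv_nsec = period_ns;  // repeat period
    ts.it_value.tv_nsec = period_ns;     // first expiry
    if (timerfd_settime(fd, 0, &ts, NULL) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: block until the timer has fired and return the number
// of expirations since the previous read (0 on error).
uint64_t wait_expirations(int fd)
{
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    if (count > 1)
        fprintf(stderr, "missed %llu period(s)\n",
                (unsigned long long)(count - 1));
    return count;
}
```

A count that stays at 1 while top shows a fully busy core would suggest the time is being spent somewhere other than in missed timer periods.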

  • Hi,
    Have you tried an RT-based Linux kernel?
  • I tried the RT-based kernel as well. The behavior is the same.

    Regards

    Rams 

  • Hi, Rams,

    I see the comment in the code saying it does work for a small amount of time, 2 ms. Is that the case, rather than 10 ms? I tried to compile your code to reproduce the issue, but the g++ compiler rejects the line that sets done to 2 ms. I may modify it to use another mechanism. Would you be able to share the binary so I can run it and compare?

    Thanks!

    Rex

  • Hi Rex

    Thanks for your response. I am unable to attach the binary to the message. How can I share the binary image?

    Thanks

    Rams

  • Hi Rex,
    Are you compiling natively or cross-compiling with Linaro 4.7? Which compiler are you using?

    FYI: I tried 3.10.72 from the TI repository, and the bug is still there in this version. I picked up important commits from the 3.10.82 vanilla kernel and applied them to 3.10.72, and even then I was able to reproduce the issue.
    I wanted to try the 3.12 kernel branch, but I am not sure how difficult it is to port the Keystone 2 changes to 3.12.
    Also, there has been some work on 3.13 in v3.13/keystone-drivers, but this does not have the Keystone 2 changes.

    Thanks
    Rams
  • Hi, Rams,
    Would you be able to rename the file extension to bbb and see if you can attach it to e2e? If you still can't, I'll find another way.
    I am using the Linaro 4.7 cross-compiler. I am also checking internally to see if we can spot anything suspicious between 3.8 and 3.10.
    Rex
  • Hi Rex,

    Could you please check whether you are able to access the binary?

    /cfs-file/__key/communityserver-discussions-components-files/354/7840.timer_5F00_arm.mp4

    Thanks

    Rams

  • Hi, Rams,
    I am able to run it, and it displays "timer expired". I ran it about 15-20 times, exiting with Ctrl-C and re-running it, and CPU utilization stayed at 99.7% id. My kernel (from MCSDK 3.1.3.6) is one version newer than yours (3.1.2.5); it is v3.10.61_15.02. How long did you run it each time before pressing Ctrl-C and restarting? I'll try running it more times, and also on your kernel version, to see if I see the same behavior.
    Rex
  • Hi Rex,

    It is possible to reproduce within 15 to 20 iterations, pressing Ctrl-C a few seconds after every start. It was fairly easy to see the behaviour on the 2-ARM-core 66AK2H06. The eval kit is a 66AK2H12, and it took a few more attempts to see it there. But I managed to reproduce it on that as well and was therefore able to isolate this as a kernel issue.

    Thanks
    Rams
  • Hi, Rams,
    I see it happen, but it was CPU3 that brought down the % idle time. CPU0 still runs at 99.7% idle, and CPU1 and CPU2 are at 100% idle, so the average idle time becomes 75%. It is interesting that CPU3 gets busy. I'll try the 3.8 kernel.
    Rex
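The averaged idle figure hides which core is stuck: three cores near 100% idle and one at 0% average to about 75%, just as one stuck core out of two gives 50%. The following is a minimal sketch, assuming the standard /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal), of computing idle time per core from two snapshots. CpuTimes, parse_proc_stat and idle_percent are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct CpuTimes {
    uint64_t idle = 0;   // idle + iowait ticks
    uint64_t total = 0;  // sum of the first eight fields
};

// Parse the per-CPU lines ("cpu0 ...", "cpu1 ...") of a /proc/stat snapshot.
std::vector<CpuTimes> parse_proc_stat(const std::string& text)
{
    std::vector<CpuTimes> cpus;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        // Skip the aggregate "cpu " line and anything that is not "cpuN".
        if (line.size() < 4 || line.compare(0, 3, "cpu") != 0 || line[3] == ' ')
            continue;
        std::istringstream fields(line);
        std::string name;
        fields >> name;
        CpuTimes t;
        uint64_t value = 0;
        for (int i = 0; i < 8 && (fields >> value); ++i) {
            t.total += value;
            if (i == 3 || i == 4)  // 4th field is idle, 5th is iowait
                t.idle += value;
        }
        cpus.push_back(t);
    }
    return cpus;
}

// Idle percentage of one CPU between two snapshots.
double idle_percent(const CpuTimes& before, const CpuTimes& after)
{
    assert(after.total >= before.total && after.idle >= before.idle);
    const uint64_t total = after.total - before.total;
    if (total == 0)
        return 100.0;
    return 100.0 * double(after.idle - before.idle) / double(total);
}
```

Reading /proc/stat twice, a second or so apart, and passing the matching entries to idle_percent gives one figure per core instead of the system-wide average.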
  • Hi Rex,

    It is good to hear that you are able to see the issue. We have been troubled by this issue for quite a long time. We have to spawn at least two instances of a user application that has a 10 ms timer in our product.
    Thanks for the update.

    Thanks
    Rams
  • Hi Rex,

    Do you have any update for us on this issue?

    Thanks
    Rams
  • Hi, Rams,

    Sorry, no update. We are trying to understand the issue and root-cause it. The issue cannot be reproduced consistently, which adds to its complexity. I'll update you if we reach any conclusion.

    Rex

  • Hi, Rams,

    Just want to give you an update before the holiday. We still don't have any clue about the issue. We tried kernel v4.1 and it still happens. We also noticed that when the issue happens, the kernel seems to be in getnstimeofday(). We are still trying to understand why and what causes it. I am trying to get access to an Ubuntu 14.04 machine, which runs kernel 3.16, to see the behavior on it.

    Rex

  • Hi Rex,

    Thanks for the update. We have tested this application on a multicore PowerPC platform with kernel versions 3.14, 3.16 and 3.10, and we have not seen any issue there. The problem seems to be specific to the ARM architecture.

    Regards

    Rams

  • Hi Rex,

    I tried provoking the same issue with the TI Keystone 3.8 kernel, and I was able to reproduce it with 3.8 as well, though it was harder to provoke on that version.

    Regards
    Rams
  • Hi Rex,

    Do you have any update on this issue?

    Thanks
    Rams
  • Hi, Rams,
    The issue was reproduced on another, non-Keystone 2 ARM A15 platform. I have asked the TI Linux kernel team to investigate.
    Rex
  • Hi, Rams,

    I debugged further and found that the issue not only occurs across ARM A15 platforms, but also that when it happens, execution is in the do {} while () loop in __getnstimeofday(), waiting for a mutex resource. I sent this information to the TI Linux development team and expect them to contact ARM to resolve the issue.

    Rex
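For illustration only: in mainline kernels of this era, the timekeeping read path retries under a sequence counter (read_seqcount_begin() / read_seqcount_retry(), or the older read_seqbegin() / read_seqretry()) rather than taking a mutex. The userspace sketch below is not kernel code and rests on that assumption. It shows why a reader in such a loop spins at full CPU if the sequence never becomes stable. The thread does not establish that this is the actual failure mode. SeqTime and read_time are hypothetical names, and max_spins exists only so the sketch terminates.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified single-writer sequence counter protecting a 64-bit timestamp.
struct SeqTime {
    std::atomic<unsigned> seq{0};
    std::atomic<uint64_t> nsec{0};

    void write_begin() { seq.fetch_add(1); }  // sequence becomes odd
    void write_end()   { seq.fetch_add(1); }  // sequence becomes even again
};

// Reader: retry until a value was read under an even, unchanged sequence.
// Returns false if that never happened within max_spins attempts; the real
// kernel loop has no such bound and would keep spinning.
bool read_time(const SeqTime& t, uint64_t& out, unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; ++i) {
        const unsigned start = t.seq.load();
        if (start & 1u)
            continue;  // a write is in progress, try again
        const uint64_t value = t.nsec.load();
        if (t.seq.load() == start) {
            out = value;  // no write happened while reading
            return true;
        }
    }
    return false;
}
```

If a writer completes normally, the reader succeeds on its first attempt. If the sequence stays odd, or changes on every attempt, the reader never leaves the loop, which from userspace looks like one core pinned at 100%.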

  • Hello Rex,

    Any further news?

    Thanks
    Rams
  • Hi, Rams,
    No update. There are arch timer errata, but we aren't sure whether they are related to the issue. However, I am away from the office on a business trip and won't be able to work on this until I am back.
    Rex
  • Hello Rex,

    I just received an email from ARM. Is VDSO enabled in the glibc from the TI SDK? I checked the glibc version from Linaro at git.linaro.org/.../glibc.git and I do not see the VDSO feature for ARM.

    ARM: 

    Have glibc VDSO enabled?
    
    sourceware.org/.../msg00680.html
    
    sourceware.org/.../msg00059.html

    Please let me know your inputs.

    Thanks

    Rams

  • Hi, Rams,

    I doubt this is the cause. I think the high CPU utilization is misleading: one of the cores actually locks up, spinning in the while loop. I talked to a TI kernel developer, and he suggested trying the kernel at the branch tip for the other platform. The Keystone-2 kernel lags behind that branch and may not have the arch timer errata fix. If the issue still happens, we'll report it to ARM. The earliest I can access the other platform is tomorrow. I have the kernel built and ready for it.

    Rex

  • Hi, Rams,

    Just want to give you a quick update. I ran the latest kernel on the other A15 platform, on which I was previously able to reproduce the issue, and I have not seen it happen. That does not totally rule it out, though. I built the latest kernel for K2H, but it fails to mount the NFS file system. Once I resolve that, I should be able to verify whether the issue still exists. If it does not, then we need to find out which patch to bring in, or send you the steps to rebuild the K2H kernel from the latest TI Linux kernel.

    Rex

  • Hello Rex,

    Thanks for the update. So you don't think the arch timer errata, which you were planning to test earlier with another branch, are relevant for the Cortex-A15?

    I will await your test result. 

    Thanks

    Rams

  • Hi, Rams,

    We suspect an arch timer erratum is the cause, but not the one you pointed out, which to me is a new feature. The new kernel I tested is from the tip of our kernel branch, which has the arch timer errata fix but has not been merged to Keystone-2. I haven't been able to make the issue happen on the other A15 platform, where I was able to with the older kernel version. Not being able to make it happen could mean that it takes longer or needs different timing. I need to verify on K2H, where the issue is easier to reproduce on demand. If it doesn't happen on K2H with the new kernel, then I will be more comfortable believing the fix is in.

    Rex

  • Hi Rex,

    How did the test go? Do you have any further update?

    Thanks
    Rams
  • Hi, Rams,

    Not much to update. We verified on the same branch that it works for AM, but not for Keystone-2. I also went through the logs to see which erratum is related, but from looking at the code, it seems that the two platforms use different timer init sequences. If that is the case, I am not sure the errata matter. I'll need to consult with a kernel developer.

    Rex

  • Hi Rex,

    How do we proceed on this issue? It is important for us to get a fix.
    If this is going to take a long time, we have to find alternative workarounds until the issue is resolved.
    Also, can you share the steps to upgrade the kernel from 3.10 to 4.1?

    Thanks
    Rams
  • Hello,

    Has this issue been completely forgotten? There has been no update for a month. We have mitigated the issue by not using the timers, but that is not always possible.
    Are there any plans to fix this issue?

    Regards
    Rams
  • Hi, Rams,

    No, it is not forgotten, and we are still working on it. We ran several trials, but they made no difference. We are still trying to figure out which check-in makes the difference.

    Rex

  • Hi, Rams,
    If you don't need nanosecond granularity, could you try the kernel with HIGH_RES_TIMERS disabled? We are checking with upstream to see if anyone has seen an issue in that area.
    Rex
  • Hi Rex,
    Thanks a lot for the response. We don't need nanosecond granularity.
    I tried provoking the issue with HIGH_RES_TIMERS disabled, and I could not reproduce it. Did you also try with HIGH_RES_TIMERS disabled?
    I will do some more testing with our applications to check whether this fixes the problem.

    Thanks
    Rams
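When comparing kernels built with and without HIGH_RES_TIMERS, it can help to confirm from userspace which mode is actually running. On Linux, clock_getres() for CLOCK_MONOTONIC typically reports 1 ns when high-resolution timers are active and the tick period (for example 10 ms at HZ=100) otherwise. The sketch below relies on that behaviour as an assumption. The function names are hypothetical.

```cpp
#include <cassert>
#include <time.h>

// Resolution the kernel reports for CLOCK_MONOTONIC, in nanoseconds
// (-1 on error).
long long monotonic_resolution_ns()
{
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return -1;
    assert(res.tv_sec >= 0 && res.tv_nsec >= 0);  // sanity check on the result
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

// Hypothetical heuristic: a reported resolution of 1 ns is taken to mean
// that high-resolution timers are active.
bool high_res_timers_active()
{
    return monotonic_resolution_ns() == 1;
}
```

With high-resolution timers disabled, a 10 ms timerfd period is of the same order as the tick period itself, so the application's timing should be rechecked on such a build.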
  • Hi, Rams,

    During the investigation, none of my builds for the other platforms could reproduce the issue. Even the same commit ID as the image on which I had reproduced the issue failed to reproduce it after I rebuilt it. I recalled that KS2 was looping in the kernel code __getnstimeofday(), and that we had turned on the high-resolution timer a while back. With it disabled, I have not been able to reproduce the issue on K2H with your timer application. Hence, I asked you to give it a try. We still don't know why the high-resolution timer (ns) causes the issue, and we are probing the upstream community to see if anyone has seen it.

    If you don't need ns granularity, this would allow you to proceed while we continue to investigate at a lower priority.

    Rex

  • Hi Rex,

    Thanks for the support. I am setting up system testing with the high-resolution timer disabled. I hope that we will not require nsec granularity. I will update the status after our system testing. If things are under control, then the high-resolution timer issue could be investigated at a lower priority.
    I will keep you posted on the status.

    Thanks
    Rams
  • Hi Rex,

    Disabling the high-resolution timer does not seem to be a good idea. The load average is very high, greater than 6 on the 2-core ARM, and it remains high even when CPU idle time is 100%. I think it is essential to fix the high-resolution timer issue.

    We don't want to introduce yet another issue here.

    Let me know your feedback.

    Thanks

    Rams

  • Hi, Rams,

    Could you elaborate a bit more on "load average is very high greater than 6 for 2 core ARM and it continues to remain high even when CPU idle time is 100%"? What load are you talking about, what is its relationship with CPU utilization, and what is the significance of 6?

    Rex

  • Hi Rex,
    Sorry, the number 6 was just a typical value. The load average of the system as reported by top remains very high. We performed system tests, and they do not behave as expected; the system is broken. Applications are blocked from executing, which in turn increases the load average.
    Please see below the load average numbers when the system is not loaded at all.

    Mem: 101676K used, 931444K free, 780K shrd, 0K buff, 40456K cached
    CPU: 0.0% usr 0.0% sys 0.0% nic 100% idle 0.0% io 0.0% irq 0.0% sirq
    Load average: 4.90 4.45 4.33 1/91 891

    The load average numbers are even higher when we perform system tests under full load. This is observed only when we disable high-resolution timers.

    Thanks
    Rams
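On Linux, the load average counts tasks in uninterruptible sleep as well as runnable tasks, so it can stay high while the CPUs are idle. In the line quoted above, "1/91" is the number of currently runnable entities over the total, in the same format as /proc/loadavg. As a minimal sketch of reading these numbers programmatically, with the hypothetical names LoadAvg, parse_loadavg and load_without_runnable_tasks:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

struct LoadAvg {
    double one, five, fifteen;  // 1-, 5- and 15-minute load averages
    int runnable, total;        // currently runnable / total scheduling entities
    int last_pid;               // most recently created PID
};

// Parse a line in /proc/loadavg format, e.g. "4.90 4.45 4.33 1/91 891".
bool parse_loadavg(const std::string& line, LoadAvg& out)
{
    return std::sscanf(line.c_str(), "%lf %lf %lf %d/%d %d",
                       &out.one, &out.five, &out.fifteen,
                       &out.runnable, &out.total, &out.last_pid) == 6;
}

// Hypothetical check: the load exceeds the CPU count although almost nothing
// is runnable, which points at blocked tasks rather than at CPU demand.
bool load_without_runnable_tasks(const LoadAvg& l, int cpu_count)
{
    return l.one > cpu_count && l.runnable <= 1;
}
```

For the quoted sample on a 2-core system the check returns true. That is consistent with the report of applications being blocked, but it does not by itself identify what they are blocked on.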
  • I see. You meant the load average shown in "top". Thanks!
    Rex
  • Hello,

    Do you have any progress on finding a solution for the timer issue?

    Regards
    Rams
  • Hi Rex,

    If this problem is related to ARM HIGH_RES_TIMERS, should the ARM Linux community be made aware of it? I got in touch with some members of the ARM Linux community, and they said that they were not aware of the issue. They asked me to post it on the Linux ARM mailing list, and they asked whether the test was done on the latest kernel. Is it possible to apply the necessary patches to the latest kernel and try once again?

    Thanks
    Rams
  • Hi, Rams,

    The issue has been tried on the 4.1 kernel, and if I recall correctly, it still happens. I thought it also happened on the TI AM57x platform, but when I tried bisecting the check-ins, it turned out that all my builds of the AM57x kernel work, even with the same commit ID as the nightly build on which I could reproduce the issue. The conclusion was that the issue only happens on KS2, not AM57x. Both of them are A15 platforms. The issue is arch timer related on KS2 and could be due to differences in customization between KS2 and AM57x. We have a new group working on the KS2 Linux kernel, and I will escalate the issue to its attention.

    Rex