This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

K2HK: One of the two ARM core dies after reboot

Expert 1800 points
Other Parts Discussed in Thread: 66AK2H06

Hi,

We have 66AK2H06 based custom boards.  We see occasionally after reboot that one of the CPU core dies and shown as offline. 

cat /proc/cpuinfo shows only one core instead of two cores.

top and cat /proc/interrupts shows only one CPU.


This issue was initially detected after the system exhibited high cpu load condition as reported by top.  And cat /proc/cpuinfo showed only one core.  It also happens when the system starts up after soft reset.

We do not have any steps to reproduce this issue on our custom board.  We have not tried on K2HK EVM.

Do you see any reasons behind this behaviour?  What problems can exhibit these kind of symptoms?

Thanks

Rams

  • Hi Rams,
    Did you check with "cat /sys/devices/system/cpu/online" command ?

    You are not getting this behavior when you not have much CPU load ?

    Which process is causing CPU load much ?

    Use the following command to generate CPU load more and check the CPU offline status.

    while : ; do : ; done
  • Hi Titus,

    When the problem happened, only one cpu was shown online i.e 0. 

    I am unable to reproduce it though I see it randomly. 

    I tried to load using while : ; do : ; done with several instances, but it does not provoke the issue.

    What other issue can cause core 1 to shutdown/crash/fail?  We still dont know what happens to the other core. There were no error prints in dmesg.  Do you have any clue or debugging info which I can look for?

    Thanks

    Rams

  • Hi, Rams,

    Does the console boot logs show any failure on CPU 1-3? In the normal case, it should show CPU 1-3 booted:

    [   59.784221] CPU: Testing write buffer coherency: ok
    [   59.784400] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
    [   59.784410] psci: probing function IDs from device-tree
    [   59.784422] Setting up static identity map for 0xc050e020 - 0xc050e078
    [   59.842393] CPU1: Booted secondary processor
    [   59.842423] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
    [   59.900160] CPU2: Booted secondary processor
    [   59.900189] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
    [   59.957913] CPU3: Booted secondary processor
    [   59.957942] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
    [   59.958022] Brought up 4 CPUs

    You can try enable the debug in boot-monitor code. The build instruction is in: processors.wiki.ti.com/.../MCSDK_UG_Chapter_Exploring

    Change the global variable, debug in sec/skern.c to be 1. I added the following code before it returns in skern_init() 

    if (debug) {
                    skern_puts(">>>> REX:: from cpu_id:: ");
                    skern_putui(cpu_id);
                    skern_puts("\n\r");
            }

    It should tells if CPUs are powered up and initialized ok:


    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0c5f08c0
    >>>> REX:: from cpu_id:: 00000001
    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0c5f0cc0
    >>>> REX:: from cpu_id:: 00000002
    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0c5f10c0
    >>>> REX:: from cpu_id:: 00000003 

    By the way, I noticed that if I disable the high resolution timer, and run timer_arm.mp4 on TI K2H EVM, it has lower average load then that when high resolution timer is enabled.

    High Res Timer

       Avg load after running for 15 min :   1.0   0.98   0.71

       Avg load after running for 30 min :   1.0   0.99   0.87

     

    No High Res Timer

       Avg load after running for 15 min :   0.60   0.78   0.67

       Avg load after running for 30 min :   0.09   0.14   0.36

     

    The behavior seems to be different from what was observed on your hardware.

     

    Rex

  • Hi Rex,

    Thanks a lot for your support.  I added the debug code as suggested by you into the boot monitor code.

    Please find below the boot message.  We are running 2 core ARM 66AK2h06 chipset.  We should see two skern CPU i.ds messages right.  But I see only one cpu_id, cpu_id 0

    Do you see any issue here?

    Thanks

    Rams

    Loading file 'skern-k2hk.bin' to addr 0x0c5f0000 with size 45056 (0x0000b000)...
    Done
    Message2 from Secure Mode
    Core freq - 0x0bebc200
    >>>> REX:: from cpu_id:: 00000000
    ## installed monitor, freq [200000000], status 0
    Loading file 'uImage_proencplus.bin' to addr 0x88000000 with size 3481712 (0x00352070)...
    Done
    ## Booting kernel from Legacy Image at 88000000 ...
       Image Name:   Linux-3.10.72-02954-g52b7e84
       Created:      2015-10-12  11:30:32 UTC
       Image Type:   ARM Linux Kernel Image (uncompressed)
       Data Size:    3481648 Bytes = 3.3 MiB
       Load Address: 80008000
       Entry Point:  80008000
       Verifying Checksum ... OK
    ## Flattened Device Tree blob at 87000000
       Booting using the fdt blob at 0x87000000
       Loading Kernel Image ... OK
    OK
       Using Device Tree in place at 87000000, end 8700cfc4

    Starting kernel ...

    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0bebc200

  • Hi, Rams,

    1) Does the issue also happen if there isn't any load?

    2) hat do you mean the load is high? the average load in top? the cpu utilization? what is running at the time?

    3) I ran timer_arm.mp4 to get one core in 0% idle, and reboot, the TI K2H EVM, I don't see it happen. Should I try more times?

    4) Could you modify the pr_debug to printk in keystone_smp_boot_secondary() of arch/arm/mach-keystone/platsmp.c, and see if the kernel ever tried to power up the secondary cores? You may want to debug the kernel code to see why it fails bringing up the secondary core. Any bad return code during the bring up? any memory conflict?

    Rex

  • Hi Rex,

    1)  This issue was seen randomly during our system testing.  I tried to provoke this issue by rebooting the system several times with system load.  I did not manage to reproduce.  I was checking the cat /sys/devices/system/cpu/online and it showed 0-1, two cores.

    2) No, we saw the issue when there was no load on the system.  It happens randomly

    3)  For the timer issue, we have some dependencies with HIGH_RES_TIMERS, so the system does not behave correctly if disabled.  I did not debug more with this option disabled.  I expect that the real issue with HIGH Resolution timer needs to be addressed here

    4) I modified the pr_debug in keystone_smp_boot_secondary() as suggested by you, but it seems that I do not see any kernel initialization messages until pcie init.  I tried to enable early printk messages, but it looks like the UART is very late in the sequence.  I have not tested it on EVM.  Is that the same behaviour or am I missing something here??

    Thanks

    Rams

  • Hi, Rams,

    Below is the debug printk I changed:

    static int __cpuinit
    keystone_smp_boot_secondary(unsigned int cpu, struct task_struct *idle)
    {
            unsigned long start = virt_to_idmap(&secondary_startup);
            int error;

    /*      pr_debug("keystone-smp: booting cpu %d, vector %08lx\n", */
            printk(">>>>>>>keystone-smp: booting cpu %d, vector %08lx\n",
                     cpu, start);

            asm volatile (
                    "mov    r0, #0\n"       /* power on cmd */
                    "mov    r1, %1\n"       /* cpu          */
                    "mov    r2, %2\n"       /* start        */
                    "smc    #0\n"           /* SMI          */
                    "mov    %0, r0\n"
                    : "=r" (error)
                    : "r"(cpu), "r"(start)
                    : "cc", "r0", "r1", "r2", "memory"
            );

    /*      pr_debug("keystone-smp: monitor returned %d\n", error); */
            printk(">>>>>>>keystone-smp: monitor returned %d\n", error);

            return error;
    }

     and console capture shown below. Do you not see the same debug print at the end of the logs?


    Starting kernel ...

    >>>> REX:: poweron cpu:: 00000001
    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0c5f08c0
    >>>> REX:: from cpu_id:: 00000001
    >>>> REX:: poweron cpu:: 00000002
    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0c5f0cc0
    >>>> REX:: from cpu_id:: 00000002
    >>>> REX:: poweron cpu:: 00000003
    >>>> skern_poweron_cpu >>>>
    Message2 from Secure Mode
    Core freq - 0x0c5f10c0
    >>>> REX:: from cpu_id:: 00000003
    [    0.000000] Booting Linux on physical CPU 0x0
    [    0.000000] Linux version 3.10.61-dirty (a0850461local@uda0850461) (gcc version 4.7.3 20130226 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.03-20130313 - Linaro GCC 2013.03) ) #7 SMP Tue Oct 20 14:32:36 EDT 2015
    [    0.000000] CPU: ARMv7 Processor [412fc0f4] revision 4 (ARMv7), cr=30c7387d
    [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
    [    0.000000] Machine: KeyStone2, model: Texas Instruments Keystone 2 SoC
    [    0.000000] switching to high address space at 0x800000000
    [    0.000000] cma: CMA: reserved 16 MiB at 2e800000
    [    0.000000] Memory policy: ECC disabled, Data cache writealloc
    [    0.000000] PERCPU: Embedded 8 pages/cpu @c2826000 s11840 r8192 d12736 u32768
    [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 915984
    [    0.000000] Kernel command line: console=ttyS0,115200n8 rootwait=1 rootfstype=nfs root=/dev/nfs rw nfsroot=158.218.109.254:/nfs/k2h_fs,v3,tcp,rsize=4096,wsize=4096 ip=dhcp
    [    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
    [    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
    [    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
    [    0.000000] Memory: 1536MB 2048MB = 3584MB total
    [    0.000000] Memory: 3615592k/3615592k available, 54424k reserved, 2891776K highmem
    [    0.000000] Virtual kernel memory layout:
    [    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
    [    0.000000]     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
    [    0.000000]     vmalloc : 0xf0000000 - 0xff000000   ( 240 MB)
    [    0.000000]     lowmem  : 0xc0000000 - 0xef800000   ( 760 MB)
    [    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
    [    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
    [    0.000000]       .text : 0xc0008000 - 0xc072db28   (7319 kB)
    [    0.000000]       .init : 0xc072e000 - 0xc0780e40   ( 332 kB)
    [    0.000000]       .data : 0xc0782000 - 0xc07c9bd8   ( 287 kB)
    [    0.000000]        .bss : 0xc07c9bd8 - 0xc07fb87c   ( 200 kB)
    [    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
    [    0.000000] Hierarchical RCU implementation.
    [    0.000000] NR_IRQS:16 nr_irqs:16 16
    [    0.000000] ipc irq: irqchip registered, range 512-539
    [    0.000000] Main PLL clk (1200000000 Hz), parent (122880000 Hz),postdiv = 2, mult = 624, prediv = 31
    [    0.000000] Generic PLL clk (1200000000 Hz), parent (125000000 Hz),postdiv = 1, mult = 47, prediv = 4
    [    0.000000] Generic PLL clk (983040000 Hz), parent (122880000 Hz),postdiv = 2, mult = 15, prediv = 0
    [    0.000000] Generic PLL clk (400000000 Hz), parent (100000000 Hz),postdiv = 4, mult = 15, prediv = 0
    [    0.000000] Generic PLL clk (333333333 Hz), parent (100000000 Hz),postdiv = 6, mult = 19, prediv = 0
    [    0.000000] Architected local timer running at 200.00MHz (phys).
    [    0.000000] Switching to timer-based delay loop
    [    0.000000] sched_clock: ARM arch timer >56 bits at 200000kHz, resolution 640/128ns
    [    0.000000] keystone timer clock @200000000 MHz
    [    0.000000] sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 4294967286ms
    [    0.000000] Console: colour dummy device 80x30
    [63200.382490] Calibrating delay loop (skipped), value calculated using timer frequency.. 400.00 BogoMIPS (lpj=2000000)
    [63200.382500] pid_max: default: 4096 minimum: 301
    [63200.382636] Mount-cache hash table entries: 512
    [63200.391282] CPU: Testing write buffer coherency: ok
    [63200.391462] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
    [63200.391472] psci: probing function IDs from device-tree
    [63200.391485] Setting up static identity map for 0xc050e220 - 0xc050e278
    [63200.391975] >>>>>>>keystone-smp: booting cpu 1, vector 804ffca0
    [63200.398526] >>>>>>>keystone-smp: monitor returned 0
    [63200.452516] CPU1: Booted secondary processor
    [63200.452545] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
    [63200.452806] >>>>>>>keystone-smp: booting cpu 2, vector 804ffca0
    [63200.458227] >>>>>>>keystone-smp: monitor returned 0
    [63200.513349] CPU2: Booted secondary processor
    [63200.513379] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
    [63200.513619] >>>>>>>keystone-smp: booting cpu 3, vector 804ffca0
    [63200.519055] >>>>>>>keystone-smp: monitor returned 0
    [63200.574165] CPU3: Booted secondary processor
    [63200.574193] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
    [63200.574274] Brought up 4 CPUs
    [63200.574285] SMP: Total of 4 processors activated (1600.00 BogoMIPS).
    [63200.574290] CPU: All CPU(s) started in SVC mode.
    [63200.574745] devtmpfs: initialized

     

     

  • Hi Rex,
    I did not have the earlyprintk bootargs set. I just added that and now I am able to see the earlyprintk boot messages. But I do not see the second CPU booting message. This is very strange.
    But there is a message "Brought up 2 CPUs".

    Do you see any issue here?

    Regards
    Rams

    Starting kernel ...

    Uncompressing Linux... done, booting the kernel.
    [ 0.000000] Booting Linux on physical CPU 0x0
    [ 0.000000] Linux version 3.10.72-dirty (rams@builder) (gcc version 4.7.3 20130226 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.03-20130313 - Linaro GCC 2013.03) ) #6 SMP Wed Oct 21 08 :57:02 UTC 2015
    [ 0.000000] Machine: KeyStone2, model: Texas Instruments Keystone 2 SoC
    [ 0.000000] bootconsole [earlycon0] enabled
    [ 0.000000] cma: CMA: reserved 16 MiB at 9e000000
    [ 0.000000] PERCPU: Embedded 8 pages/cpu @c0f37000 s11264 r8192 d13312 u32768
    [ 0.000000] Kernel command line: console=ttyS0,115200n8 rootwait=1 hwrev=3.1174 artnr=10002493 serial=144600010 position=7 bootversion=2013.01 earlyprintk=serial rootfstype=ubifs root=ubi0: rootfs1 rootflags=sync rw ubi.mtd=4 vmalloc=512M
    [ 0.000000] PID hash table entries: 2048 (order: 1, 8192 bytes)
    [ 0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
    [ 0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
    [ 0.000000] Memory: 1024MB = 1024MB total
    [ 0.000000] Memory: 1016056k/1016056k available, 32520k reserved, 540672K highmem
    [ 0.000000] Virtual kernel memory layout:
    [ 0.000000] vector : 0xffff0000 - 0xffff1000 ( 4 kB)
    [ 0.000000] fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB)
    [ 0.000000] vmalloc : 0xdf800000 - 0xff000000 ( 504 MB)
    [ 0.000000] lowmem : 0xc0000000 - 0xdf000000 ( 496 MB)
    [ 0.000000] pkmap : 0xbfe00000 - 0xc0000000 ( 2 MB)
    [ 0.000000] modules : 0xbf000000 - 0xbfe00000 ( 14 MB)
    [ 0.000000] .text : 0xc0008000 - 0xc064d738 (6422 kB)
    [ 0.000000] .init : 0xc064e000 - 0xc0692c00 ( 275 kB)
    [ 0.000000] .data : 0xc0694000 - 0xc06eeb68 ( 363 kB)
    [ 0.000000] .bss : 0xc06eeb68 - 0xc071cbfc ( 185 kB)
    [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
    [ 0.000000] Hierarchical RCU implementation.
    [ 0.000000] RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=2.
    [ 0.000000] NR_IRQS:16 nr_irqs:16 16
    [ 0.000000] ipc irq: irqchip registered, range 512-539
    [ 0.000000] Main PLL clk (1200000000 Hz), parent (122880000 Hz),postdiv = 2, mult = 624, prediv = 31
    [ 0.000000] Generic PLL clk (1200000000 Hz), parent (125000000 Hz),postdiv = 1, mult = 47, prediv = 4
    [ 0.000000] Generic PLL clk (983040000 Hz), parent (122880000 Hz),postdiv = 2, mult = 15, prediv = 0
    [ 0.000000] Generic PLL clk (400000000 Hz), parent (100000000 Hz),postdiv = 4, mult = 15, prediv = 0
    [ 0.000000] Generic PLL clk (333333333 Hz), parent (100000000 Hz),postdiv = 6, mult = 19, prediv = 0
    [ 0.000000] Architected local timer running at 200.00MHz (phys).
    [ 0.000000] Switching to timer-based delay loop
    [ 0.000000] sched_clock: ARM arch timer >56 bits at 200000kHz, resolution 640/128ns
    [ 0.000000] keystone timer clock @200000000 MHz
    [ 0.000000] sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 4294967286ms
    [ 0.000000] Console: colour dummy device 80x30
    [20458.782017] Calibrating delay loop (skipped), value calculated using timer frequency.. 400.00 BogoMIPS (lpj=2000000)
    [20458.793105] pid_max: default: 4096 minimum: 301
    [20458.797892] Mount-cache hash table entries: 512
    [20458.809479] CPU: Testing write buffer coherency: ok
    [20458.814651] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
    [20458.820452] psci: probing function IDs from device-tree
    [20458.825824] Setting up static identity map for 0xc0469f60 - 0xc0469f94
    [20458.833735] keystone-smp: booting cpu 1, vector 8045bfc0
    >>>> skern_poweron_cpu >>>>
    [20458.841551] keystone-smp: monitor returned 0
    Message2 from Secure Mode
    Core freq - 0x0bebc200
    [20458.852136] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
    [20458.852215] Brought up 2 CPUs
    [20458.861094] SMP: Total of 2 processors activated (800.00 BogoMIPS).
    [20458.867504] CPU: All CPU(s) started in SVC mode.
    [20458.872682] devtmpfs: initialized
  • I don't see anything wrong In this case. It powers up 2 cores.  Do you see one core or 2 cores after it came up? You should see 2 cores.

  • Hello Rex,

    I see two cores after it came up.  As I mentioned in my previous comment, I had seen the single core issue randomly.  I am still trying to reproduce it.  I am rebooting the fairly loaded system using automatic test script and checking number of cores online.

    I will also add the printks and run the reboot tests.

    Thanks

    Rams