Soft lockup when running bonnie++ only at 1600 mt/s

Bug #1239800 reported by Pradeep
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

SRU Justification:

Impact: running a test like bonnie++ makes the system unstable and prone to hangs.

Fix: apply the attached patches and rebuild the kernel.

Test case: leave bonnie++ running in a loop for 24 hours.

--

When bonnie++ is run in a loop, the system hangs with the message
"rcu_sched: self-detected stall on CPU".
The time to failure is inconsistent: one run took 7 hours, the next more than 2 days.

Commands to reproduce the failure:
$ sudo apt-get install bonnie++
$ mkdir bonnie
$ while true; do bonnie++ -d bonnie; done &>>bonnie0.log &
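
The shell loop above runs until it is interrupted. For the 24-hour test case in the SRU justification, a time-bounded variant may be more convenient. The following is a minimal sketch, not part of the original report: the function name `run_in_loop` and its parameters are hypothetical, and the bonnie++ arguments in the usage comment simply mirror the command above.

```python
import subprocess
import time


def run_in_loop(command, duration_s, log_path):
    """Re-run `command` until `duration_s` seconds have elapsed.

    Output of every run is appended to `log_path`. Returns the number of
    completed runs. A run that is in progress when the deadline passes is
    allowed to finish; it is not interrupted.
    """
    deadline = time.monotonic() + duration_s
    runs = 0
    with open(log_path, "ab") as log:
        while time.monotonic() < deadline:
            # check=False: a failing run is logged, and the loop keeps going.
            subprocess.run(command, stdout=log, stderr=subprocess.STDOUT, check=False)
            runs += 1
    return runs


# Example (assumes bonnie++ is installed and ./bonnie exists):
#   run_in_loop(["bonnie++", "-d", "bonnie"], 24 * 60 * 60, "bonnie0.log")
```

If the machine hangs as described, the script hangs with it, so the stall still has to be read from the serial console or the kernel log after a reset.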

Stack trace:
[237019.072290] INFO: rcu_sched self-detected stall on CPU { 1} (t=19305216 jiffies g=580389 c=580388 q=84)
[237019.080901] CPU: 1 PID: 44 Comm: kswapd0 Tainted: GF 3.11.0-6-generic-lpae #12-Ubuntu
[237019.088879] [<c002bc00>] (unwind_backtrace+0x0/0x138) from [<c0026f1c>] (show_stack+0x10/0x14)
[237019.096700] [<c0026f1c>] (show_stack+0x10/0x14) from [<c05cbe50>] (dump_stack+0x74/0x90)
[237019.104051] [<c05cbe50>] (dump_stack+0x74/0x90) from [<c00bf37c>] (rcu_check_callbacks+0x31c/0x798)
[237019.112262] [<c00bf37c>] (rcu_check_callbacks+0x31c/0x798) from [<c00492a0>] (update_process_times+0x38/0x64)
[237019.121254] [<c00492a0>] (update_process_times+0x38/0x64) from [<c008cdbc>] (tick_sched_handle+0x54/0x60)
[237019.129933] [<c008cdbc>] (tick_sched_handle+0x54/0x60) from [<c008d00c>] (tick_sched_timer+0x44/0x74)
[237019.138300] [<c008d00c>] (tick_sched_timer+0x44/0x74) from [<c005db50>] (__run_hrtimer+0x74/0x1d4)
[237019.146433] [<c005db50>] (__run_hrtimer+0x74/0x1d4) from [<c005e6f8>] (hrtimer_interrupt+0x10c/0x2c0)
[237019.154800] [<c005e6f8>] (hrtimer_interrupt+0x10c/0x2c0) from [<c0492e44>] (arch_timer_handler_phys+0x28/0x30)
[237019.163871] [<c0492e44>] (arch_timer_handler_phys+0x28/0x30) from [<c00b8c2c>] (handle_percpu_devid_irq+0x6c/0x104)
[237019.173332] [<c00b8c2c>] (handle_percpu_devid_irq+0x6c/0x104) from [<c00b54ec>] (generic_handle_irq+0x20/0x30)
[237019.182402] [<c00b54ec>] (generic_handle_irq+0x20/0x30) from [<c0023ff4>] (handle_IRQ+0x38/0x94)
[237019.190378] [<c0023ff4>] (handle_IRQ+0x38/0x94) from [<c0008508>] (gic_handle_irq+0x28/0x5c)
[237019.198041] [<c0008508>] (gic_handle_irq+0x28/0x5c) from [<c05d1c00>] (__irq_svc+0x40/0x50)
[237019.205624] Exception stack(0xee2c1c18 to 0xee2c1c60)
[237019.210238] 1c00: 00000004 00000004
[237019.217666] 1c20: 00000008 00000001 ee2c1c8c ca208700 ca208700 0996b000 ca208708 00000001
[237019.225093] 1c40: 00000002 edb31300 00000003 ee2c1c60 c02f54fc c00923c8 200f0013 ffffffff
[237019.232523] [<c05d1c00>] (__irq_svc+0x40/0x50) from [<c00923c8>] (generic_exec_single+0x6c/0x94)
[237019.240500] [<c00923c8>] (generic_exec_single+0x6c/0x94) from [<c00924f4>] (smp_call_function_single+0x104/0x198)
[237019.249805] [<c00924f4>] (smp_call_function_single+0x104/0x198) from [<c0029920>] (broadcast_tlb_mm_a15_erratum+0x7c/0x84)
[237019.259812] [<c0029920>] (broadcast_tlb_mm_a15_erratum+0x7c/0x84) from [<c0029adc>] (flush_tlb_page+0x74/0xa8)
[237019.268882] [<c0029adc>] (flush_tlb_page+0x74/0xa8) from [<c011fc8c>] (ptep_clear_flush_young+0x6c/0xd0)
[237019.277484] [<c011fc8c>] (ptep_clear_flush_young+0x6c/0xd0) from [<c011a60c>] (page_referenced_one+0x64/0x1fc)
[237019.286554] [<c011a60c>] (page_referenced_one+0x64/0x1fc) from [<c011c034>] (page_referenced+0xf4/0x2e4)
[237019.295155] [<c011c034>] (page_referenced+0xf4/0x2e4) from [<c00fc410>] (shrink_active_list+0x1f0/0x35c)
[237019.303756] [<c00fc410>] (shrink_active_list+0x1f0/0x35c) from [<c00fdadc>] (shrink_lruvec+0x32c/0x598)
[237019.312279] [<c00fdadc>] (shrink_lruvec+0x32c/0x598) from [<c00fddb0>] (shrink_zone+0x68/0x180)
[237019.320176] [<c00fddb0>] (shrink_zone+0x68/0x180) from [<c00fe430>] (kswapd+0x568/0x9d4)
[237019.327527] [<c00fe430>] (kswapd+0x568/0x9d4) from [<c005aae0>] (kthread+0xa4/0xb0)
[237019.334487] [<c005aae0>] (kthread+0xa4/0xb0) from [<c0023198>] (ret_from_fork+0x14/0x3c)
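
Because the time to failure varies so much, it can help to scan saved console logs for the stall line automatically. The sketch below is illustrative only: the regular expression is written against the single stall line shown in this trace, and the names `STALL_RE` and `parse_stall` are hypothetical. Other kernel versions may format the line differently.

```python
import re

# Matches the "self-detected stall" line in the layout seen in the trace above.
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: rcu_sched self-detected stall on CPU"
    r"\s*\{\s*(?P<cpu>\d+)\s*\}\s*"
    r"\(t=(?P<t>\d+) jiffies g=(?P<g>\d+) c=(?P<c>\d+) q=(?P<q>\d+)\)"
)


def parse_stall(line):
    """Return the fields of an rcu_sched stall line as a dict, or None."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m.group("ts")),  # seconds since boot, from the log prefix
        "cpu": int(m.group("cpu")),         # CPU that detected the stall
        "jiffies": int(m.group("t")),       # the t= value reported by the kernel
        "g": int(m.group("g")),
        "c": int(m.group("c")),
        "q": int(m.group("q")),
    }
```

Applied to the first line of the trace, this reports CPU 1 and t=19305216 jiffies; lines that are not stall reports return None.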

Setup details:
Quad-core A15 server nodes on Calxeda Midway hardware.
The failure has been seen twice with the DDR setting DDR3 @ 1600 MT/s.

cat /proc/version_signature
Ubuntu 3.11.0-12.18-generic-lpae 3.11.3
The issue was first seen on Ubuntu 3.11.0-6.12-generic-lpae

cat /etc/issue
Ubuntu 13.04 \n \l

Additional debug information attached
---
Architecture: armhf
DistroRelease: Ubuntu 13.04
MarkForUpload: True
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=en_US:
 TERM=vt102
 PATH=(custom, no user)
 LANG=en_US
 SHELL=/bin/bash
Uname: Linux 3.11.0-12-generic-lpae armv7l
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo

Pradeep (pradeep-krishnamurthy) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1239800

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Pradeep (pradeep-krishnamurthy) wrote : HookError_cloud_archive.txt

apport information

tags: added: apport-collected
description: updated
Pradeep (pradeep-krishnamurthy) wrote : HookError_generic.txt

apport information

Pradeep (pradeep-krishnamurthy) wrote : HookError_source_linux.txt

apport information

Pradeep (pradeep-krishnamurthy) wrote : HookError_ubuntu.txt

apport information

Pradeep (pradeep-krishnamurthy) wrote :

Because this bug reports a hang, the apport log collection tool cannot be run at the time of the hang. I tried to run it after a reboot, but the Python script crashed. Please let me know if you need additional info.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Are you unable to reproduce this bug if you boot back into 3.11.0-5.11? If so, we should be able to perform a bisect to identify the commit that introduced this.

It might also be worthwhile to test the latest mainline kernel, which can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-rc5-saucy/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: kernel-da-key saucy
Pradeep (pradeep-krishnamurthy) wrote : Re: [Bug 1239800] Re: Soft lockup when running bonnie++ only at 1600 mt/s

Kernel 3.11.0-5.11 has been deleted from ports.ubuntu.com. Is there another way to get hold of this?

Andy Whitcroft (apw) wrote :

When this occurred, were there any symptoms other than the self-detected stall? Did the machine continue and recover?

@Pradeep -- all versions are in the Launchpad librarian under the source listing page for the linux package.

Joseph Salisbury (jsalisbury) wrote :

@Pradeep, a direct link to the 3.11.0-5.11 kernel is:
https://launchpad.net/ubuntu/+source/linux/3.11.0-5.11

The .deb files can be found by clicking the arch under "Builds".

Pradeep (pradeep-krishnamurthy) wrote :

The node was unresponsive and could not be recovered.
It needed a hard reset to continue.

Paolo Pisati (p-pisati) wrote :
Paolo Pisati (p-pisati) wrote :
Paolo Pisati (p-pisati) wrote :

Patch 0001 in #13 is a prerequisite for patch 0002 in #14, which actually solves the problem. Both are already in Linus's tree.

Paolo Pisati (p-pisati)
description: updated
tags: added: patch
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-saucy' to 'verification-done-saucy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-saucy
Manoj Iyer (manjo) wrote :

I can verify that I was unable to reproduce this bug with the latest proposed kernel on a highbank.

ubuntu@hb07-15:~$ cat /etc/issue
Ubuntu 13.10 \n \l

ubuntu@hb07-15:~$ uname -a
Linux hb07-15 3.11.0-15-generic #22-Ubuntu SMP Mon Dec 2 23:36:39 UTC 2013 armv7l armv7l armv7l GNU/Linux
ubuntu@hb07-15:~$

Raghuram Kota (rkota) wrote :

Added tag "verification-done-saucy" based on comment #17.

tags: added: verification-done-saucy
removed: verification-needed-saucy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.11.0-15.23

---------------
linux (3.11.0-15.23) saucy; urgency=low

  [Brad Figg]

  * Release Tracking Bug
    - LP: #1259259

  [ Tim Gardner ]

  * [Config] Build-in ohci-pci
    - LP: #1244176

linux (3.11.0-15.22) saucy; urgency=low

  [Brad Figg]

  * Release Tracking Bug
    - LP: #1257092

  [ Andy Whitcroft ]

  * [Config] CONFIG_DEBUG_BUGVERBOSE=y
    - LP: #1252353

  [ Benjamin Tissoires ]

  * SAUCE: (no-up) HID: appleir: force input to be set
    - LP: #1244505

  [ John Johansen ]

  * SAUCE: (no-up) apparmor: Fix tasks not subject to, reloaded policy
    - LP: #1236455

  [ Kamal Mostafa ]

  * SAUCE: (no-up) drm/i915: i915.disable_pch_pwm overrides PCH_PWM_ENABLE
    quirk
    - LP: #1163720

  [ Manoj Iyer ]

  * SAUCE: Enable earlyprintk via the PL011.
    - LP: #1248233

  [ Paolo Pisati ]

  * [Config] armhf: RTC_DRV_PL031=y
    - LP: #1252242
  * [Config] armhf: CPU_FREQ=y && ARM_HIGHBANK_CPUFREQ=y
    - LP: #1249397

  [ Rob Herring ]

  * [Config] armhf: PSTORE_RAM=y and PSTORE_CONSOLE=y
    - LP: #1248492
  * SAUCE: net: calxedaxgmac: add mac address learning
    - LP: #1248233

  [ Tim Gardner ]

  * [Debian] Re-sign modules after debug objcopy
    - LP: #1253155

  [ Upstream Kernel Changes ]

  * Revert "rt2x00pci: Use PCI MSIs whenever possible"
    - LP: #1257037
  * Revert "epoll: use freezable blocking call"
    - LP: #1257037
  * Revert "select: use freezable blocking call"
    - LP: #1257037
  * Revert "ima: policy for RAMFS"
    - LP: #1257037
  * ARM: tlb: don't perform inner-shareable invalidation for local TLB ops
    - LP: #1239800
  * ARM: 7855/1: Add check for Cortex-A15 errata 798181 ECO
    - LP: #1239800
  * mfd: rtsx: Modify rts5249_optimize_phy
    - LP: #1255297
  * usb: musb: start musb on the udc side, too
    - LP: #1257037
  * usb-storage: add quirk for mandatory READ_CAPACITY_16
    - LP: #1257037
  * USB: support new huawei devices in option.c
    - LP: #1257037
  * USB: quirks.c: add one device that cannot deal with suspension
    - LP: #1257037
  * USB: quirks: add touchscreen that is dazzeled by remote wakeup
    - LP: #1257037
  * USB: serial: ftdi_sio: add id for Z3X Box device
    - LP: #1257037
  * xhci: Don't enable/disable RWE on bus suspend/resume.
    - LP: #1257037
  * cifs: Fix inability to write files >2GB to SMB2/3 shares
    - LP: #1257037
  * x86: Update UV3 hub revision ID
    - LP: #1257037
  * cpufreq: s3c64xx: Rename index to driver_data
    - LP: #1257037
  * cpufreq / intel_pstate: Fix max_perf_pct on resume
    - LP: #1257037
  * bcache: Fixed incorrect order of arguments to bio_alloc_bioset()
    - LP: #1257037
  * HID: wiimote: add LEGO-wiimote VID
    - LP: #1257037
  * cgroup: fix to break the while loop in cgroup_attach_task() correctly
    - LP: #1257037
  * mac80211: correctly close cancelled scans
    - LP: #1257037
  * mac80211: drop spoofed packets in ad-hoc mode
    - LP: #1257037
  * mac80211: use sta_info_get_bss() for nl80211 tx and client probing
    - LP: #1257037
  * mac80211: update sta->last_rx on acked tx frames
    - LP: #1257037
  * mac80211: fix crash if bitrate calculation goes wrong
    - LP: #1257...
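
The changelog above pairs each entry title with the Launchpad bug it closes, so the entries that belong to this bug can be extracted mechanically. A minimal sketch, assuming the `* title` / `- LP: #number` layout shown here; the function name `entries_by_bug` is hypothetical. A reference that is cut off, such as the truncated last line, is skipped rather than guessed.

```python
import re
from collections import defaultdict


def entries_by_bug(changelog):
    """Map each LP bug number to the changelog entry titles that cite it."""
    by_bug = defaultdict(list)
    title = None
    for raw in changelog.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current entry.
            title = None
        elif line.startswith("* "):
            title = line[2:]
        elif line.startswith("- LP: #"):
            # Only accept a complete bug number; truncated references are ignored.
            m = re.fullmatch(r"- LP: #(\d+)", line)
            if m and title is not None:
                by_bug[int(m.group(1))].append(title)
        elif title is not None and not line.startswith("["):
            # Continuation of a title wrapped onto the next line.
            title += " " + line
    return dict(by_bug)
```

Looking up 1239800 in the result for the changelog above returns the two ARM entries: the TLB invalidation change and the Cortex-A15 erratum 798181 ECO check.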

Changed in linux (Ubuntu):
status: Incomplete → Fix Released