Comment 10 for bug 1469214

Ming Lei (tom-leiming) wrote : Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

Hi Colin,

On Sat, Jul 4, 2015 at 12:43 AM, Colin Ian King
<email address hidden> wrote:
> I was able to hit the following translation fault running
> sudo ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v

I suggest not running stress-ng as root. A failure that only shows up
under root is less serious because:

  - the root user can easily do harmful things, and it is quite easy to
    kill any process
  - in practice, most workloads run as non-root

If some system processes (irqbalance, systemd-*) are killed only
because stress-ng is running as root, it can be treated as a low-priority
issue. Otherwise we need to pay close attention to it.

I always run 'stress-ng' as the ubuntu user without sudo, which may
be why it is difficult for me to reproduce the problem.

Even with the two new approaches, it is still not easy for me to
reproduce the problem. In 6 hours I saw only one translation fault with
your first approach (./stress-ng --seq 0 ...), and I can't trigger it
with your second approach (the bash script).

The log I triggered follows[1], and I think it is very likely a userspace
issue. Using the irqbalance-dbgsym package, we can easily find that
'PC is at 0x406078' is an address in the text section, and it should be
inside the function 'place_irq_in_node' because the executable isn't
built as relocatable. One thing I still can't understand is why the
fault address is '0x00000040' in this context.
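
As a small aid for the fault-address question, the sketch below scans the register dump from the log in [1] for registers that hold the fault address. The register values are copied from that log; the helper name `registers_holding` is made up for illustration. A match only means the register held the same value when the fault was taken, so it is a hint about which operand the faulting instruction may have used, not a diagnosis.

```python
import re

# Register lines copied from the log in [1] (timestamps dropped).
REG_DUMP = """
x29: 0000007fc07ff2d0 x28: 00000000004095a0
x27: 0000000000409548 x26: 000000000041a000
x25: 0000000000405000 x24: 000000000041acf8
x23: 000000000041a000 x22: 000000000041a000
x21: 000000002e0d6050 x20: 000000000041a000
x19: 000000002e0e9020 x18: 0000000000000000
x17: 0000007fb5ac287c x16: 000000000041a188
x15: 003bdd2370f74a1c x14: 2030203020302030
x13: 2030203020302030 x12: 2030203020302030
x11: 2030203020302030 x10: 2030203020302030
x9 : 00000000000000a0 x8 : 0000000000000001
x7 : 0000000000000033 x6 : 000000002e0d6e08
x5 : 0000000000000040 x4 : 0000000000000000
x3 : 000000002e0d7008 x2 : 0000000000000000
x1 : 000000000000002c x0 : 0000000000000003
"""

# From "unhandled level 2 translation fault (11) at 0x00000040".
FAULT_ADDR = 0x40


def registers_holding(dump, value):
    """Return the names of registers whose dumped value equals `value`."""
    pairs = re.findall(r"(x\d+)\s*:\s*([0-9a-fA-F]{16})", dump)
    return [name for name, hexval in pairs if int(hexval, 16) == value]


matches = registers_holding(REG_DUMP, FAULT_ADDR)
print(matches)
```

In this dump only x5 holds 0x40. Confirming whether the instruction at 0x406078 actually loads through x5 would still need a disassembly of the irqbalance binary, which is not part of this log.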

[1]
[ 3616.333392] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 3616.333393] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 5316.367265] irqbalance[1457]: unhandled level 2 translation fault (11) at 0x00000040, esr 0x92000006
[ 5316.476937] pgd = ffffffcfb5478000
[ 5316.520692] [00000040] *pgd=0000004fb4a3c003, *pud=0000004fb4a3c003, *pmd=0000000000000000
[ 5316.620270]
[ 5316.638140] CPU: 7 PID: 1457 Comm: irqbalance Not tain-21-generic #21-Ubuntu
[ 5316.733212] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 5316.806382] task: ffffffcfb55e6e40 ti: ffffffcfa72b0000 task.ti: ffffffcfa72b0000
[ 5316.896258] PC is at 0x406078
[ 5316.931865] LR is at 0x404100
[ 5316.967457] pc : [<0000000000406078>] lr : [<0000000000404100>] pstate: 20000000
[ 5317.056268] sp : 0000007fc07ff2d0
[ 5317.096038] x29: 0000007fc07ff2d0 x28: 00000000004095a0
[ 5317.160023] x27: 0000000000409548 x26: 000000000041a000
[ 5317.223897] x25: 0000000000405000 x24: 000000000041acf8
[ 5317.287868] x23: 000000000041a000 x22: 000000000041a000
[ 5317.351841] x21: 000000002e0d6050 x20: 000000000041a000
[ 5317.415744] x19: 000000002e0e9020 x18: 0000000000000000
[ 5317.479620] x17: 0000007fb5ac287c x16: 000000000041a188
[ 5317.543490] x15: 003bdd2370f74a1c x14: 2030203020302030
[ 5317.607373] x13: 2030203020302030 x12: 2030203020302030
[ 5317.671263] x11: 2030203020302030 x10: 2030203020302030
[ 5317.735137] x9 : 00000000000000a0 x8 : 0000000000000001
[ 5317.799113] x7 : 0000000000000033 x6 : 000000002e0d6e08
[ 5317.862983] x5 : 0000000000000040 x4 : 0000000000000000
[ 5317.926867] x3 : 000000002e0d7008 x2 : 0000000000000000
[ 5317.990840] x1 : 000000000000002c x0 : 0000000000000003
[ 5318.054713]
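
The "esr 0x92000006" value in the log above can be decoded by hand. The sketch below splits it into the fields of the ARMv8 exception syndrome register, as far as they matter for a data abort: EC in bits 31:26, IL in bit 25, and within the ISS the WnR flag in bit 6 and the fault status code (DFSC) in bits 5:0. The function name `decode_esr` and the lookup tables are illustrative and cover only the data-abort classes and translation-fault codes relevant here, not the full architecture.

```python
# Exception classes relevant to this log (a deliberately partial table).
EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

# DFSC values 0b0001LL encode a translation fault at level LL.
DFSC_TRANSLATION_LEVEL = {0x04: 0, 0x05: 1, 0x06: 2, 0x07: 3}


def decode_esr(esr):
    """Split a data-abort ESR value into its main fields."""
    ec = (esr >> 26) & 0x3F    # exception class, bits 31:26
    il = (esr >> 25) & 0x1     # instruction length flag, bit 25
    iss = esr & 0x1FFFFFF      # instruction specific syndrome, bits 24:0
    wnr = (iss >> 6) & 0x1     # 1 = write access, 0 = read access
    dfsc = iss & 0x3F          # data fault status code, bits 5:0

    level = DFSC_TRANSLATION_LEVEL.get(dfsc)
    if level is not None:
        fault = f"translation fault, level {level}"
    else:
        fault = f"not a translation fault (DFSC {dfsc:#04x})"

    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not decoded by this sketch"),
        "il": il,
        "wnr": wnr,
        "dfsc": dfsc,
        "fault": fault,
        "access": "write" if wnr else "read",
    }


info = decode_esr(0x92000006)  # value taken from the log above
for key, value in info.items():
    print(f"{key}: {value}")
```

For 0x92000006 this gives EC 0x24 (a data abort taken from userspace), a read access, and a level 2 translation fault. That agrees with the kernel's own "unhandled level 2 translation fault" line and with the empty *pmd entry, and it is a different level from the level 3 fault in the bug title.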