Bug #1469214 “HP ProLiant m400 Server crashes with unhandled lev...” : Bugs : irqbalance package : Ubuntu

Colin Ian King (colin-king) on 2015-06-26

Changed in linux (Ubuntu):
assignee:	nobody → Colin Ian King (colin-king)
assignee:	Colin Ian King (colin-king) → dann frazier (dannf)

Colin Ian King (colin-king) on 2015-06-26

summary:

- HP ProLiant m400 Server
+ HP ProLiant m400 Server crashes with unhandled level 3 translation fault

Brad Figg (brad-figg) wrote on 2015-06-26: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1469214

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Joseph Salisbury (jsalisbury) on 2015-06-26

Changed in linux (Ubuntu):
importance:	Undecided → Medium
status:	Incomplete → Triaged

dann frazier (dannf) wrote on 2015-06-29:

#2

fyi, I ran this in a loop over the weekend and the issue has not reproduced.

Colin Ian King (colin-king) wrote on 2015-06-29:

#3

Hrm, OK, I'll see if I can find a better reproducer.

Ming Lei (tom-leiming) wrote on 2015-06-30:

#4

I can't reproduce it after running for half a day on ms10-36, and OOM is often triggered.

Ming Lei (tom-leiming) wrote on 2015-06-30:

#5

Oops, the test result in #4 is for LP1469218 instead of this one.

Ming Lei (tom-leiming) wrote on 2015-07-03: Re: [Bug 1469214] [NEW] HP ProLiant m400 Server crashes with unhandled level 3 translation fault

#6

This looks like a problem in systemd-timesyncd. From the pmap log [1],
neither the PC nor the faulting address is valid: both fall in the heap area,
but the faulting address (0x7fa8ea6008) shouldn't have been allocated
and is far away from the start address (0x7f9eb27000) of the heap area.

[1] pmap log

ubuntu@ms10-37-mcdivittB0:~$ ps -ax | grep systemd-timesyncd
  412 ?        Ssl    0:00 /lib/systemd/systemd-timesyncd
18058 pts/2    S+     0:00 grep --color=auto systemd-timesyncd
ubuntu@ms10-37-mcdivittB0:~$ sudo pmap 412 | tail -n 10
0000007f82730000    108K r-x-- systemd-timesyncd
0000007f8274e000     16K rw---   [ anon ]
0000007f82757000      8K rw---   [ anon ]
0000007f82759000      4K r----   [ anon ]
0000007f8275a000      4K r-x--   [ anon ]
0000007f8275b000      4K r---- systemd-timesyncd
0000007f8275c000      4K rw--- systemd-timesyncd
0000007f9eb27000    132K rw---   [ anon ]
0000007fd5e13000    132K rw---   [ stack ]
 total            77176K
ubuntu@ms10-37-mcdivittB0:~$ sudo pmap 412
412:   /lib/systemd/systemd-timesyncd
0000007f7c000000    132K rw---   [ anon ]
0000007f7c021000  65404K -----   [ anon ]
0000007f81c29000     16K r-x-- libnss_dns-2.21.so
0000007f81c2d000     64K ----- libnss_dns-2.21.so
0000007f81c3d000      4K r---- libnss_dns-2.21.so
0000007f81c3e000      4K rw--- libnss_dns-2.21.so
0000007f81c3f000      4K -----   [ anon ]
0000007f81c40000   8188K rw---   [ anon ]
0000007f8243f000     40K r-x-- libnss_files-2.21.so
0000007f82449000     60K ----- libnss_files-2.21.so
0000007f82458000      4K r---- libnss_files-2.21.so
0000007f82459000      4K rw--- libnss_files-2.21.so
0000007f8245a000     36K r-x-- libnss_nis-2.21.so
0000007f82463000     60K ----- libnss_nis-2.21.so
0000007f82472000      4K r---- libnss_nis-2.21.so
0000007f82473000      4K rw--- libnss_nis-2.21.so
0000007f82474000     72K r-x-- libnsl-2.21.so
0000007f82486000     60K ----- libnsl-2.21.so
0000007f82495000      4K r---- libnsl-2.21.so
0000007f82496000      4K rw--- libnsl-2.21.so
0000007f82497000      8K rw---   [ anon ]
0000007f82499000     24K r-x-- libnss_compat-2.21.so
0000007f8249f000     64K ----- libnss_compat-2.21.so
0000007f824af000      4K r---- libnss_compat-2.21.so
0000007f824b0000      4K rw--- libnss_compat-2.21.so
0000007f824b1000    580K r-x-- libm-2.21.so
0000007f82542000     60K ----- libm-2.21.so
0000007f82551000      4K r---- libm-2.21.so
0000007f82552000      4K rw--- libm-2.21.so
0000007f82553000     16K r-x-- libcap.so.2.24
0000007f82557000     60K ----- libcap.so.2.24
0000007f82566000      4K r---- libcap.so.2.24
0000007f82567000      4K rw--- libcap.so.2.24
0000007f82568000     68K r-x-- libresolv-2.21.so
0000007f82579000     64K ----- libresolv-2.21.so
0000007f82589000      4K r---- libresolv-2.21.so
0000007f8258a000      4K rw--- libresolv-2.21.so
0000007f8258b000      8K rw---   [ anon ]
0000007f8258d000   1216K r-x-- libc-2.21.so
0000007f826bd000     60K ----- libc-2.21.so
0000007f826cc000     16K r---- libc-2.21.so
0000007f826d0000      8K rw--- libc-2.21.so
0000007f826d2000     16K rw---   [ anon ]
0000007f826d6000     88K r-x-- libpthread-2.21.so
0000007f826ec000     60K ----- libpthread-2.21.so
0000007f826fb000      4K r---- libpthread-2.21.so
0000007f826fc000      4K rw--- libpthread-2.21.so
0000007f826fd000     16K rw---   [ anon ]
0000007f82701000    112K r-x-- ld-2.21.so
0000007f8272d000      4K r---- ld-2.21.so
0000007f8272e000      8K rw--- ld-2.21.so
0000007f82730000    108K r-x-- systemd-timesyncd
0000007f8274e000     16K rw---   [ anon ]
0000007f82757000      8K rw---   [ anon ]
0000007f82759000      4K r----   [ anon ]
0000007f8275a000      4K r-x--   [ anon ]
0000007f8275b000      4K r---- systemd-timesyncd
0000007f8275c000      4K rw--- systemd-timesyncd
0000007f9eb27000    132K rw---   [ anon ]
0000007fd5e13000    132K rw---   [ stack ]
 total            77176K

Colin Ian King (colin-king) wrote on 2015-07-03:

#7

I was able to hit the following translation fault running sudo ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v

[90103.913447] irqbalance[807]: unhandled level 2 translation fault (11) at 0x001754a4, esr 0x92000006
[90103.913454] pgd = ffffffcfb5926000
[90103.954271] [001754a4] *pgd=0000004fb5a8b003, *pud=0000004fb5a8b003, *pmd=0000000000000000

[90104.053696] CPU: 1 PID: 807 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
[90104.053698] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[90104.053701] task: ffffffcfb59c4980 ti: ffffffcfb5814000 task.ti: ffffffcfb5814000
[90104.053717] PC is at 0x7f95548834
[90104.053719] LR is at 0x7f955487f4
[90104.053721] pc : [<0000007f95548834>] lr : [<0000007f955487f4>] pstate: 80000000
[90104.053723] sp : 0000007fcf72a410
[90104.053725] x29: 0000007fcf72a410 x28: 00000000004095a0
[90104.053728] x27: 0000000000409548 x26: 000000000041a000
[90104.053731] x25: 0000000000000001 x24: 0000000000000010
[90104.053733] x23: 00000000175398a0 x22: 0000000017539880
[90104.053736] x21: 0000000000000018 x20: 0000007f955e4000
[90104.053738] x19: 0000000000000002 x18: 0000000000000000
[90104.053741] x17: 0000007f9524e8ec x16: 0000007f955e32e0
[90104.053743] x15: 0000000000000020 x14: 0000000000000001
[90104.053745] x13: 0000000000000000 x12: 0000000000000000
[90104.053748] x11: 0000007fcf727f80 x10: 0000000000000010
[90104.053750] x9 : 00000000000000a0 x8 : 0000000000000007
[90104.053753] x7 : 0000000000000033 x6 : 0000000017539c80
[90104.053755] x5 : 0000000000000001 x4 : 0000007f952672a0
[90104.053758] x3 : 0000000017539880 x2 : 0000000000000001
[90104.053760] x1 : 00000000000003fa x0 : 000000000017549c
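
For anyone reading the esr value in the log above, here is a minimal sketch of decoding it. It assumes the ARMv8-A ESR_ELx layout: exception class (EC) in bits 31:26, write-not-read (WnR) in bit 6 and the data fault status code (DFSC) in bits 5:0. decode_esr is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Sketch only: split an AArch64 ESR value into the fields that matter
# for a data abort. The field positions are an assumption taken from
# the ARMv8-A ESR_ELx layout.
decode_esr() {
    esr=$(( $1 ))                      # accepts 0x-prefixed hex
    ec=$(( (esr >> 26) & 0x3f ))       # exception class, bits 31:26
    wnr=$(( (esr >> 6) & 1 ))          # 1 = write access, 0 = read access
    dfsc=$(( esr & 0x3f ))             # data fault status code, bits 5:0
    printf 'ec=0x%02x dfsc=0x%02x wnr=%d\n' "$ec" "$dfsc" "$wnr"
}

# Example: decode_esr 0x92000006
```

For 0x92000006 this prints ec=0x24 dfsc=0x06 wnr=0. EC 0x24 is a data abort from a lower exception level (user space), DFSC 0x06 is a level 2 translation fault, and WnR 0 means the access was a read. The 0x92000007 value in the original bug description differs only in DFSC 0x07, a level 3 translation fault.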

Colin Ian King (colin-king) wrote on 2015-07-03:

#8

Running the following:

#!/bin/bash
tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr zero zombie"

for t in $tests
do
        echo $t
        echo $t > /dev/kmsg
        ./stress-ng --$t 0 -v -t 60
done

eventually tripped the translation fault in irqbalance. I ran this after a clean reboot.

[ 4901.799846] timerfd
[ 4961.807050] tsearch
[ 5021.884456] udp
[ 5081.895058] udp-flood
[ 5141.674365] irqbalance[827]: unhandled level 2 translation fault (11) at 0x002d6da4, esr 0x92000006
[ 5141.674376] pgd = ffffffcfb51a0000
[ 5141.715215] [002d6da4] *pgd=0000004fb677e003, *pud=0000004fb677e003, *pmd=0000000000000000

[ 5141.816183] CPU: 0 PID: 827 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
[ 5141.816185] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 5141.816188] task: ffffffcfac088000 ti: ffffffcfab710000 task.ti: ffffffcfab710000
[ 5141.816206] PC is at 0x7f88287834
[ 5141.816208] LR is at 0x7f882877f4
[ 5141.816210] pc : [<0000007f88287834>] lr : [<0000007f882877f4>] pstate: 80000000
[ 5141.816212] sp : 0000007ff2e46b30
[ 5141.816214] x29: 0000007ff2e46b30 x28: 00000000004095a0
[ 5141.816217] x27: 0000000000409548 x26: 000000000041a000
[ 5141.816220] x25: 0000000000000001 x24: 0000000000000010
[ 5141.816222] x23: 000000002d6c98a0 x22: 000000002d6c9880
[ 5141.816225] x21: 0000000000000018 x20: 0000007f88323000
[ 5141.816228] x19: 0000000000000002 x18: 0000000000000000
[ 5141.816230] x17: 0000007f87f8d8ec x16: 0000007f883222e0
[ 5141.816233] x15: 0000000000000020 x14: 0000000000000001
[ 5141.816235] x13: 0000000000000000 x12: 0000000000000000
[ 5141.816237] x11: 0000007ff2e446a0 x10: 0000000000000010
[ 5141.816240] x9 : 00000000000000a0 x8 : 0000000000000007
[ 5141.816242] x7 : 0000000000000033 x6 : 000000002d6c9c80
[ 5141.816245] x5 : 0000000000000001 x4 : 0000007f87fa62a0
[ 5141.816247] x3 : 000000002d6c9880 x2 : 0000000000000001
[ 5141.816250] x1 : 00000000000003fa x0 : 00000000002d6d9c

[ 5141.907792] urandom
[ 5201.928712] utime
[ 5261.934534] vecmath
[ 5321.940302] vfork
[ 5381.947904] vm
[ 5441.991784] vm-rw
[ 5502.017614] vm-splice
[ 5562.023334] wcs
[ 5622.037054] wait
[ 5682.043302] yield
[ 5742.056595] xattr
[ 5802.075772] zero
[ 5862.087396] zombie
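
The script above writes each stressor name to /dev/kmsg, so the kernel log can be scanned for the last marker seen before a fault. A minimal sketch, assuming marker lines consist of a timestamp followed by a single word, as in the log above; stressor_at_fault is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: read a kernel log on stdin and print, for each
# translation fault, the most recent stressor marker that preceded it.
stressor_at_fault() {
    awk '
        # Marker line: "[ timestamp] name" with nothing else on the line.
        /^\[ *[0-9]+\.[0-9]+\] [a-z0-9-]+ *$/ { marker = $NF; next }
        # Fault line as printed by the arm64 kernel for a user-space fault.
        /unhandled level [0-9] translation fault/ {
            print (marker == "" ? "unknown" : marker)
        }
    '
}

# Example: dmesg | stressor_at_fault
```

Applied to the log above it prints udp-flood: the fault at 5141.674365 comes after the udp-flood marker at 5081.895058 and just before the urandom marker at 5141.907792.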

Ming Lei (tom-leiming) wrote on 2015-07-03: Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

#9

Hi Colin,

That looks like progress, though it still takes time to reproduce;
I will try your new approach to reproduce it.

When you do that, could you dump the file
/proc/$(pidof irqbalance)/maps so that we can see where the faulting
address falls in the process's VM space?

thanks,

On Sat, Jul 4, 2015 at 4:10 AM, Colin Ian King
<1469214@bugs.launchpad.net> wrote:
> Running the following:
>
> #!/bin/bash
> tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr zero zombie"
>
> for t in $tests
> do
>         echo $t
>         echo $t > /dev/kmsg
>         ./stress-ng --$t 0 -v -t 60
> done
>
> eventually tripped the translation fault in irqbalance.  I ran this
> after a clean reboot.
>
> [ 4901.799846] timerfd
> [ 4961.807050] tsearch
> [ 5021.884456] udp
> [ 5081.895058] udp-flood
> [ 5141.674365] irqbalance[827]: unhandled level 2 translation fault (11) at 0x002d6da4, esr 0x92000006
> [ 5141.674376] pgd = ffffffcfb51a0000
> [ 5141.715215] [002d6da4] *pgd=0000004fb677e003, *pud=0000004fb677e003, *pmd=0000000000000000
>
> [ 5141.816183] CPU: 0 PID: 827 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
> [ 5141.816185] Hardware name: HP ProLiant m400 Server Cartridge (DT)
> [ 5141.816188] task: ffffffcfac088000 ti: ffffffcfab710000 task.ti: ffffffcfab710000
> [ 5141.816206] PC is at 0x7f88287834
> [ 5141.816208] LR is at 0x7f882877f4
> [ 5141.816210] pc : [<0000007f88287834>] lr : [<0000007f882877f4>] pstate: 80000000
> [ 5141.816212] sp : 0000007ff2e46b30
> [ 5141.816214] x29: 0000007ff2e46b30 x28: 00000000004095a0
> [ 5141.816217] x27: 0000000000409548 x26: 000000000041a000
> [ 5141.816220] x25: 0000000000000001 x24: 0000000000000010
> [ 5141.816222] x23: 000000002d6c98a0 x22: 000000002d6c9880
> [ 5141.816225] x21: 0000000000000018 x20: 0000007f88323000
> [ 5141.816228] x19: 0000000000000002 x18: 0000000000000000
> [ 5141.816230] x17: 0000007f87f8d8ec x16: 0000007f883222e0
> [ 5141.816233] x15: 0000000000000020 x14: 0000000000000001
> [ 5141.816235] x13: 0000000000000000 x12: 0000000000000000
> [ 5141.816237] x11: 0000007ff2e446a0 x10: 0000000000000010
> [ 5141.816240] x9 : 00000000000000a0 x8 : 0000000000000007
> [ 5141.816242] x7 : 0000000000000033 x6 : 000000002d6c9c80
> [ 5141.816245] x5 : 0000000000000001 x4 : 0000007f87fa62a0
> [ 5141.816247] x3 : 000000002d6c9880 x2 : 0000000000000001
> [ 5141.816250] x1 : 00000000000003fa x0 : 00000000002d6d9c
>
> [ 5141.907792] urandom
> [ 5201.928712] utime
> [ 5261.934534] vecmath
> [ 5321.940302] vfork
> [ 5381.947904] vm
> [ 5441.991784] vm-rw
> [ 5502.017614] vm-splice
> [ 5562.023334] wcs
> [ 5622.037054] wait
> [ 5682.043302] yield
> [ 5742.056595] xattr
> [ 5802.075772] zero
> [ 5862.087396] zombie
>
> --
> You received this bug notification because you are subscribed to linux
> in Ubuntu.
> https://bugs.launchpad.net/bugs/1469214
>
> Title:
>   HP ProLiant m400 Server crashes with unhandled level 3 translation
>   fault
>
> Status in linux package in Ubuntu:
>   Triaged
>
> Bug description:
>   Running stress-ng on an HP ProLiant m400 server can cause unhandled
>   level 3 translation faults:
>
>   use stress-ng from git://kernel.ubuntu.com/cking/stress-ng
>
>   ./stress-ng --seq 0 -t 60 -v
>
>   and after some time this trips the following:
>
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922560] systemd-timesyn[481]: unhandled level 3 translation fault (7) at 0x7fa8ea6008, esr 0x92000007
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922561] pgd = ffffffcfb563f000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922563] [7fa8ea6008] *pgd=0000004fb4f28003, *pud=0000004fb4f28003, *pmd=0000004fb4f38003, *pte=000000001d151c00
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922566]
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922569] CPU: 6 PID: 481 Comm: systemd-timesyn Not tainted 3.19.0-21-generic #21-Ubuntu
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922571] Hardware name: HP ProLiant m400 Server Cartridge (DT)
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922573] task: ffffffcfb4e3b100 ti: ffffffcfb4d2c000 task.ti: ffffffcfb4d2c000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922588] PC is at 0x7fa8d81824
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922589] LR is at 0x7fa8e3b3e4
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922590] pc : [<0000007fa8d81824>] lr : [<0000007fa8e3b3e4>] pstate: 80000000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922591] sp : 0000007ff120d660
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922592] x29: 0000007ff120d660 x28: 0000007fa8f1c000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922594] x27: 0000007fa8f32084 x26: 0000007fa8f32000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922595] x25: 0000007fa8f1d788 x24: 0000007fa8f1d888
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922597] x23: 0000000000000001 x22: 0000007fa8f1faa0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922599] x21: 0000007ff120d7f0 x20: 0000007ff120d7d0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922600] x19: 0000007fa8f31000 x18: 0000007fa8f1e000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922602] x17: 0000007fa8e3b3b8 x16: 0000007fa8ea6000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922603] x15: 003b9aca00000000 x14: 00219bbdd0000000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922605] x13: ffffffffaa751223 x12: 0000000000000000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922607] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922609] x9 : 37333c43484f5e46 x8 : 0000007ff120d818
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922610] x7 : 0000007ff120d8f0 x6 : 0000007ff120d828
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922612] x5 : ffffff80ffffffd0 x4 : 0000007ff120d8c0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922613] x3 : 0000007ff120d7d0 x2 : 0000007fa8f1faa0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922615] x1 : 0000000000000001 x0 : 0000000000000064
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922616]
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469214/+subscriptions
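
The maps dump requested in this comment can be checked against a fault address mechanically. A minimal sketch, assuming the usual /proc/PID/maps line layout (start-end, permissions, offset, device, inode, path) and addresses below 2^63, which holds for arm64 user space. find_mapping is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: print the maps line whose range contains the address,
# or "unmapped" if no range does.
#   $1 = address (0x-prefixed hex), $2 = saved copy of /proc/PID/maps
find_mapping() {
    fm_addr=$(( $1 ))
    while read -r fm_range fm_perms fm_rest; do
        fm_start=$(( 0x${fm_range%-*} ))   # text before the dash
        fm_end=$(( 0x${fm_range#*-} ))     # text after the dash (exclusive end)
        if [ "$fm_addr" -ge "$fm_start" ] && [ "$fm_addr" -lt "$fm_end" ]; then
            printf '%s %s %s\n' "$fm_range" "$fm_perms" "$fm_rest"
            return 0
        fi
    done < "$2"
    echo "unmapped"
    return 1
}

# Example (needs a maps file captured before the process dies):
#   sudo cat "/proc/$(pidof irqbalance)/maps" > irqbalance.maps
#   find_mapping 0x002d6da4 irqbalance.maps
```

The maps file has to be saved while irqbalance is still running, since the /proc entry disappears once the process is killed by the fault.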

Ming Lei (tom-leiming) wrote on 2015-07-04:

#10

Hi Colin,

On Sat, Jul 4, 2015 at 12:43 AM, Colin Ian King
<1469214@bugs.launchpad.net> wrote:
> I was able to hit the following translation fault running
> sudo ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v

I suggest not running stress-ng as root; otherwise the issue can be
less serious because:

- the root user can easily do bad things, and it is quite easy for
  root to kill any process
- in reality most loads are run as non-root

If some system processes (irqbalance, systemd-*) are only killed
because stress-ng is running as root, it can be a low-priority issue.
Otherwise we need to pay close attention to the issue.

I always run 'stress-ng' as the ubuntu user without sudo, which may
be the reason why it is difficult for me to reproduce this.

Even with the two new approaches, it is still not easy for me to
reproduce this. I saw only one translation fault with your first
approach (./stress-ng --seq 0 ...) in 6 hours, and can't trigger it
with your second approach (the bash script).

The log I triggered follows [1], and I think it is very likely a
userspace issue. Using the irqbalance-dbgsym package, we can easily
find that 'PC is at 0x406078' is an address in the text section, and
it should be inside the function 'place_irq_in_node' because the
executable isn't built as relocatable. One thing I still can't
understand is why the fault address is '0x00000040' in this context.

[1]
[ 3616.333392] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 3616.333393] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 5316.367265] irqbalance[1457]: unhandled level 2 translation fault (11) at 0x00000040, esr 0x92000006
[ 5316.476937] pgd = ffffffcfb5478000
[ 5316.520692] [00000040] *pgd=0000004fb4a3c003, *pud=0000004fb4a3c003, *pmd=0000000000000000
[ 5316.620270]
[ 5316.638140] CPU: 7 PID: 1457 Comm: irqbalance Not tain-21-generic #21-Ubuntu
[ 5316.733212] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 5316.806382] task: ffffffcfb55e6e40 ti: ffffffcfa72b0000 task.ti: ffffffcfa72b0000
[ 5316.896258] PC is at 0x406078
[ 5316.931865] LR is at 0x404100
[ 5316.967457] pc : [<0000000000406078>] lr : [<0000000000404100>] pstate: 20000000
[ 5317.056268] sp : 0000007fc07ff2d0
[ 5317.096038] x29: 0000007fc07ff2d0 x28: 00000000004095a0
[ 5317.160023] x27: 0000000000409548 x26: 000000000041a000
[ 5317.223897] x25: 0000000000405000 x24: 000000000041acf8
[ 5317.287868] x23: 000000000041a000 x22: 000000000041a000
[ 5317.351841] x21: 000000002e0d6050 x20: 000000000041a000
[ 5317.415744] x19: 000000002e0e9020 x18: 0000000000000000
[ 5317.479620] x17: 0000007fb5ac287c x16: 000000000041a188
[ 5317.543490] x15: 003bdd2370f74a1c x14: 2030203020302030
[ 5317.607373] x13: 2030203020302030 x12: 2030203020302030
[ 5317.671263] x11: 2030203020302030 x10: 2030203020302030
[ 5317.735137] x9 : 00000000000000a0 x8 : 0000000000000001
[ 5317.799113] x7 : 0000000000000033 x6 : 000000002e0d6e08
[ 5317.862983] x5 : 0000000000000040 x4 : 0000000000000000
[ 5317.926867] x3 : 000000002e0d7008 x2 : 0000000000000000
[ 5317.990840] x1 : 000000000000002c x0 : 0000000000000003
[ 5318.054713]
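
For reference, the esr value in a fault line like the one above can be decoded by hand. The following is only a sketch, assuming the ARMv8-A ESR_ELx layout (EC in bits 31:26, WnR in bit 6, fault status code in bits 5:0). decode_esr is a hypothetical helper name, not part of any tool mentioned in this bug.

```bash
#!/bin/bash
# Sketch: split an AArch64 ESR value into the fields that matter here.
# Assumed layout (ARMv8-A ESR_ELx):
#   EC  = bits [31:26]  exception class
#   WnR = bit 6         1 = write, 0 = read (data aborts)
#   FSC = bits [5:0]    fault status code
decode_esr() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3f ))
    wnr=$(( (esr >> 6) & 1 ))
    fsc=$(( esr & 0x3f ))
    kind="other"
    # FSC values 0x04..0x07 are translation faults at level 0..3
    if [ $(( fsc & 0x3c )) -eq 4 ]; then
        kind="translation fault level $(( fsc & 3 ))"
    fi
    printf 'ec=0x%02x wnr=%d fsc=0x%02x %s\n' "$ec" "$wnr" "$fsc" "$kind"
}

decode_esr 0x92000006
```

For 0x92000006 this reports exception class 0x24 (a data abort taken from userspace), a read access, and a level 2 translation fault, which agrees with the kernel message above.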

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-07-06:

#11

I re-ran this today with the following script as a non-root user:

#!/bin/bash
tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr zero zombie"

for t in $tests
do
        echo $t
        echo $t | sudo tee /dev/kmsg
        ./stress-ng --$t 0 -v -t 60
done

and hit this issue:

[14098.848615] urandom
[14111.696335] irqbalance[828]: unhandled level 2 translation fault (11) at 0x00004f64, esr 0x92000006
[14111.696341] pgd = ffffffcfef71b000
[14111.737149] [00004f64] *pgd=0000004fef1f3003, *pud=0000004fef1f3003, *pmd=0000000000000000

[14111.836705] CPU: 0 PID: 828 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
[14111.836707] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[14111.836710] task: ffffffcfefb0bd40 ti: ffffffcfb452c000 task.ti: ffffffcfb452c000
[14111.836723] PC is at 0x7fb1061834
[14111.836725] LR is at 0x7fb10617f4
[14111.836728] pc : [<0000007fb1061834>] lr : [<0000007fb10617f4>] pstate: 80000000
[14111.836729] sp : 0000007fc7cef6e0
[14111.836731] x29: 0000007fc7cef6e0 x28: 00000000004095a0
[14111.836735] x27: 0000000000409548 x26: 000000000041a000
[14111.836737] x25: 0000000000000001 x24: 0000000000000010
[14111.836740] x23: 00000000004e58a0 x22: 00000000004e5880
[14111.836750] x21: 0000000000000018 x20: 0000007fb10fd000
[14111.836762] x19: 0000000000000002 x18: 0000000000000000
[14111.836765] x17: 0000007fb0d678ec x16: 0000007fb10fc2e0
[14111.836768] x15: 0000000000000020 x14: 0000000000000001
[14111.836770] x13: 0000000000000000 x12: 0000000000000000
[14111.836773] x11: 0000007fc7ced250 x10: 0000000000000010
[14111.836775] x9 : 00000000000000a0 x8 : 0000000000000007
[14111.836778] x7 : 0000000000000033 x6 : 00000000004e5c80
[14111.836780] x5 : 0000000000000001 x4 : 0000007fb0d802a0
[14111.836783] x3 : 00000000004e5880 x2 : 0000000000000001
[14111.836785] x1 : 00000000000003fa x0 : 0000000000004f5c

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-06:

#12

0001-stress-ng-support-sequential-range.patch (2.5 KiB, text/x-patch; charset=US-ASCII; name="0001-stress-ng-support-sequential-range.patch")

On Mon, Jul 6, 2015 at 9:28 PM, Colin Ian King
<1469214@bugs.launchpad.net> wrote:
> I re-ran this today with the following script as a non-root user:
>
> #!/bin/bash
> tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr zero zombie"
>
> for t in $tests
> do
>         echo $t
>         echo $t | sudo tee /dev/kmsg
>         ./stress-ng --$t 0 -v -t 60
> done
>
> and hit this issue:
>
> [14098.848615] urandom
> [14111.696335] irqbalance[828]: unhandled level 2 translation fault (11) at 0x00004f64, esr 0x92000006
> [14111.696341] pgd = ffffffcfef71b000
> [14111.737149] [00004f64] *pgd=0000004fef1f3003, *pud=0000004fef1f3003, *pmd=0000000000000000
>

As I suggested, it should be helpful to provide /proc/$(pidof
irqbalance)/maps, otherwise we can't know where both the faulted
and PC address are.

Finally I have figured out one simple way to reproduce the issue:

1) apply the attached debug patch to stress-ng

2) run the following script:

sudo cat /proc/$(pidof irqbalance)/maps
/home/ubuntu/git/stress-ng/stress-ng --sequential 0 --seq-start 80
--seq-end 84 -t 60 --syslog --metrics --times -v

And the above command just runs the following 4 stresses in 4 minutes:

stress-ng: info:  [1067] dispatching hogs: 8 tsearch, 8 udp, 8 udp-flood,
    8  urandom

3) the above may trigger the following faults from irqbalance with
~3/4 probability. The faulted address is in heap and the PC points to
code of libglib-2.0.so, so it looks like a use-after-free in irqbalance or
libglib? Nothing indicates it is related to the kernel; also,
the four stresses are quite simple and shouldn't cause trouble for the
kernel.

# irqbalance memory maps
00400000-0040a000 r-xp 00000000 08:02 10496929
  /usr/sbin/irqbalance
00419000-0041a000 r-xp 00009000 08:02 10496929
  /usr/sbin/irqbalance
0041a000-0041b000 rwxp 0000a000 08:02 10496929
  /usr/sbin/irqbalance
16294000-162b5000 rwxp 00000000 00:00 0                                  [heap]
162b5000-162ce000 rwxp 00000000 00:00 0                                  [heap]
7f8fbf9000-7f8fbfb000 rwxp 00000000 00:00 0
7f8fbfb000-7f8fc11000 r-xp 00000000 08:02 4722034
  /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc11000-7f8fc20000 ---p 00016000 08:02 4722034
  /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc20000-7f8fc21000 r-xp 00015000 08:02 4722034
  /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc21000-7f8fc22000 rwxp 00016000 08:02 4722034
  /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc22000-7f8fc26000 rwxp 00000000 00:00 0
7f8fc26000-7f8fc7f000 r-xp 00000000 08:02 4718668
  /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc7f000-7f8fc8f000 ---p 00059000 08:02 4718668
  /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc8f000-7f8fc90000 r-xp 00059000 08:02 4718668
  /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc90000-7f8fc91000 rwxp 0005a000 08:02 4718668
  /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc91000-7f8fdc1000 r-xp 00000000 08:02 4722027
  /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdc1000-7f8fdd0000 ---p 00130000 08:02 4722027
  /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd0000-7f8fdd4000 r-xp 0012f000 08:02 4722027
  /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd4000-7f8fdd6000 rwxp 00133000 08:02 4722027
  /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd6000-7f8fdda000 rwxp 00000000 00:00 0
7f8fdda000-7f8fde3000 r-xp 00000000 08:02 10885206
  /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fde3000-7f8fdf2000 ---p 00009000 08:02 10885206
  /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf2000-7f8fdf3000 r-xp 00008000 08:02 10885206
  /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf3000-7f8fdf4000 rwxp 00009000 08:02 10885206
  /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf4000-7f8fdf8000 rwxp 00000000 00:00 0
7f8fdf8000-7f8fe89000 r-xp 00000000 08:02 4722041
  /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe89000-7f8fe98000 ---p 00091000 08:02 4722041
  /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe98000-7f8fe99000 r-xp 00090000 08:02 4722041
  /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe99000-7f8fe9a000 rwxp 00091000 08:02 4722041
  /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe9a000-7f8ff8c000 r-xp 00000000 08:02 4718610
  /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff8c000-7f8ff9c000 ---p 000f2000 08:02 4718610
  /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9c000-7f8ff9d000 r-xp 000f2000 08:02 4718610
  /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9d000-7f8ff9e000 rwxp 000f3000 08:02 4718610
  /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9e000-7f8ff9f000 rwxp 00000000 00:00 0
7f8ff9f000-7f8ffa3000 r-xp 00000000 08:02 10879730
  /usr/lib/aarch64-linux-gnu/libcap-ng.so.0.0.0
7f8ffa3000-7f8ffb2000 ---p 00004000 08:02 10879730
  /usr/lib/aarch64-linux-gnu/libcap-ng.so.0.0.0
7f8ffb2000-7f8ffb3000 r-xp 00003000 08:02 10879730
  /usr/lib/aarch64-linux-gnu/libcap-ng.so.0.0.0
7f8ffb3000-7f8ffb4000 rwxp 00004000 08:02 10879730
  /usr/lib/aarch64-linux-gnu/libcap-ng.so.0.0.0
7f8ffb4000-7f8ffd0000 r-xp 00000000 08:02 4722030
  /lib/aarch64-linux-gnu/ld-2.21.so
7f8ffd0000-7f8ffd3000 rwxp 00000000 00:00 0
7f8ffdc000-7f8ffde000 rwxp 00000000 00:00 0
7f8ffde000-7f8ffdf000 r--p 00000000 00:00 0                              [vvar]
7f8ffdf000-7f8ffe0000 r-xp 00000000 00:00 0                              [vdso]
7f8ffe0000-7f8ffe1000 r-xp 0001c000 08:02 4722030
  /lib/aarch64-linux-gnu/ld-2.21.so
7f8ffe1000-7f8ffe3000 rwxp 0001d000 08:02 4722030
  /lib/aarch64-linux-gnu/ld-2.21.so
7fecdb1000-7fecdd2000 rw-p 00000000 00:00 0                              [stack]

[  250.276095] irqbalance[779]: unhandled level 2 translation fault
(11) at 0x00162a54, esr 0x92000006
[  250.276103] pgd = ffffffc0ff812000
[  250.316917] [00162a54] *pgd=00000040ffa6b003,
*pud=00000040ffa6b003, *pmd=0000000000000000

[  250.416447] CPU: 5 PID: 779 Comm: irqbalance Not tainted
3.19.0-21-generic #21-Ubuntu
[  250.416450] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[  250.416452] task: ffffffcfb46cc980 ti: ffffffc0feba0000 task.ti:
ffffffc0feba0000
[  250.416464] PC is at 0x7f8ff02834
[  250.416467] LR is at 0x7f8ff027f4
[  250.416469] pc : [<0000007f8ff02834>] lr : [<0000007f8ff027f4>]
pstate: 80000000
[  250.416471] sp : 0000007fecdd1480
[  250.416472] x29: 0000007fecdd1480 x28: 000000000041a000
[  250.416476] x27: 000000000041a000 x26: 00000000004094e0
[  250.416478] x25: 0000000000000001 x24: 0000000000000010
[  250.416481] x23: 00000000162948a0 x22: 0000000016294880
[  250.416484] x21: 0000000000000018 x20: 0000007f8ff9e000
[  250.416486] x19: 0000000000000002 x18: 0000000000000000
[  250.416489] x17: 0000007f8fc088ec x16: 0000007f8ff9d2e0
[  250.416491] x15: 0000000000000020 x14: 0000000000000000
[  250.416494] x13: 0000000000000000 x12: 0000000000000000
[  250.416496] x11: 0000007fecdceff0 x10: 0000000000000010
[  250.416499] x9 : 00000000000000a0 x8 : 0000000000000007
[  250.416501] x7 : 0000000000000033 x6 : 0000000016294c80
[  250.416504] x5 : 0000000000000001 x4 : 0000007f8fc212a0
[  250.416506] x3 : 0000000016294880 x2 : 0000000000000001
[  250.416509] x1 : 00000000000003fa x0 : 0000000000162a4c
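
Matching the PC from the fault against the memory maps can be sketched as a small shell function. This is a minimal sketch, not a tool used in this bug: find_mapping is a hypothetical helper, and the sample lines are copied from the maps listing above.

```bash
#!/bin/bash
# Sketch: print the maps range (and permissions) that contains an address.
# Reads /proc/PID/maps style lines on stdin; wrapped path lines are skipped.
find_mapping() {
    addr=$(( $1 ))
    while read -r range perms rest; do
        case $range in
            *[!0-9a-f-]* | -* | *- | "") continue ;;  # not "start-end"
            *-*) ;;                                   # looks like a range
            *) continue ;;
        esac
        start=$(( 0x${range%%-*} ))
        end=$(( 0x${range#*-} ))
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            echo "$range $perms"
            return 0
        fi
    done
    echo "unmapped"
    return 1
}

# PC taken from the fault above
find_mapping 0x7f8ff02834 <<'EOF'
16294000-162b5000 rwxp 00000000 00:00 0 [heap]
7f8fe9a000-7f8ff8c000 r-xp 00000000 08:02 4718610
  /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
EOF
```

With the PC 0x7f8ff02834 this prints the r-xp range that the listing attributes to libglib-2.0.so. On a live system the same function could read /proc/$(pidof irqbalance)/maps directly.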

Ubuntu Foundations Team Bug Bot (crichton) on 2015-07-06

tags:

added: patch

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-07-06:

#13

captured irqbalance segfaulting:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
145 if (irq_numa_node(info)->number != -1) {
(gdb) where
#0 0x0000000000408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
#1 0x0000000000405154 in for_each_irq (list=0x2c3df660, cb=0x408f4c <place_irq_in_node>, data=0x0)
at classify.c:508
#2 0x000000000040923c in calculate_placement () at placement.c:196
#3 0x0000000000407800 in main (argc=2, argv=0x7fcd014928) at irqbalance.c:372

(gdb) print info
$1 = (struct irq_info *) 0x2c3d0050

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-07:

#14

On Tue, Jul 7, 2015 at 2:37 AM, Colin Ian King
<1469214@bugs.launchpad.net> wrote:
> captured irqbalance segfaulting:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000000408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
> 145             if (irq_numa_node(info)->number != -1) {
> (gdb) where
> #0  0x0000000000408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
> #1  0x0000000000405154 in for_each_irq (list=0x2c3df660, cb=0x408f4c <place_irq_in_node>, data=0x0)
>     at classify.c:508
> #2  0x000000000040923c in calculate_placement () at placement.c:196
> #3  0x0000000000407800 in main (argc=2, argv=0x7fcd014928) at irqbalance.c:372
>
> (gdb) print info
> $1 = (struct irq_info *) 0x2c3d0050

Assuming info is an address in the heap, it is valid, so the segfault
should be caused by an invalid info->numa_node.

Thanks

>
> --
> You received this bug notification because you are subscribed to linux
> in Ubuntu.
> https://bugs.launchpad.net/bugs/1469214
>
> Title:
>   HP ProLiant m400 Server crashes with unhandled level 3 translation
>   fault
>
> Status in linux package in Ubuntu:
>   Triaged
>
> Bug description:
>   Running stress-ng on a HP ProLiant m400 server can cause unhandled
>   level 3 translations faults:
>
>   use stress-ng from git://kernel.ubuntu.com/cking/stress-ng
>
>   ./stress-ng --seq 0 -t 60 -v
>
>   and after some time this trips the following:
>
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922560] systemd-timesyn[481]: unhandled level 3 translation fault (7) at 0x7fa8ea6008, esr 0x92000007
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922561] pgd = ffffffcfb563f000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922563] [7fa8ea6008] *pgd=0000004fb4f28003, *pud=0000004fb4f28003, *pmd=0000004fb4f38003, *pte=000000001d151c00
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922566]
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922569] CPU: 6 PID: 481 Comm: systemd-timesyn Not tainted 3.19.0-21-generic #21-Ubuntu
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922571] Hardware name: HP ProLiant m400 Server Cartridge (DT)
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922573] task: ffffffcfb4e3b100 ti: ffffffcfb4d2c000 task.ti: ffffffcfb4d2c000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922588] PC is at 0x7fa8d81824
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922589] LR is at 0x7fa8e3b3e4
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922590] pc : [<0000007fa8d81824>] lr : [<0000007fa8e3b3e4>] pstate: 80000000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922591] sp : 0000007ff120d660
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922592] x29: 0000007ff120d660 x28: 0000007fa8f1c000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922594] x27: 0000007fa8f32084 x26: 0000007fa8f32000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922595] x25: 0000007fa8f1d788 x24: 0000007fa8f1d888
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922597] x23: 0000000000000001 x22: 0000007fa8f1faa0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922599] x21: 0000007ff120d7f0 x20: 0000007ff120d7d0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922600] x19: 0000007fa8f31000 x18: 0000007fa8f1e000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922602] x17: 0000007fa8e3b3b8 x16: 0000007fa8ea6000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922603] x15: 003b9aca00000000 x14: 00219bbdd0000000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922605] x13: ffffffffaa751223 x12: 0000000000000000
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922607] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922609] x9 : 37333c43484f5e46 x8 : 0000007ff120d818
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922610] x7 : 0000007ff120d8f0 x6 : 0000007ff120d828
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922612] x5 : ffffff80ffffffd0 x4 : 0000007ff120d8c0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922613] x3 : 0000007ff120d7d0 x2 : 0000007fa8f1faa0
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922615] x1 : 0000000000000001 x0 : 0000000000000064
>   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922616]
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469214/+subscriptions

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-07:

#15

It looks like there are two kinds of translation fault from irqbalance:

1) the first happened in place_irq_in_node(), which can be reproduced with the vivid package

2) the 2nd one happened in glib2, which is built by myself, because
irqbalance can choose to use its own local glib if glib2 isn't available,
and glib2 does exist on the server on which I build irqbalance.

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-07:

#16

On Tue, Jul 7, 2015 at 11:16 AM, Ming Lei <email address hidden> wrote:
> Looks there are two kinds of translation fault from irqbalance:
>
> 1) happened in place_irq_in_node() which can reproduce in vivid package
>
> 2) the 2nd one happened in glib2, which is built by myself, because
> irqbalance can choose to use its own local glib if there isn't glib2 available,
> and the glib2 does exist in my server in which I build irqbalance.

Both of the above reports can be fixed by the following irqbalance commit:

NUMA is not available fix

https://github.com/Irqbalance/irqbalance/commit/a3c812eb6cd627cd3fae45b8345538558b86973c

It looks like stress-ng can find not only kernel bugs but also userspace
issues, :-)

Thanks,
Ming

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-07-07:

#17

Thanks Ming for finding the fix. I was going to do a bisect on the upstream code but ran out of time last night. Nice find!

Colin

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2015-07-07:

#18

Following Ming's identification of an irqbalance patch that fixes this issue, I'm marking the "Affected" status on "linux (Ubuntu)" as being "invalid".

Changed in linux (Ubuntu Trusty):
status:	New → Invalid
Changed in linux (Ubuntu Utopic):
status:	New → Invalid
Changed in linux (Ubuntu Vivid):
status:	New → Invalid
Changed in linux (Ubuntu Wily):
status:	Triaged → Invalid
Changed in irqbalance (Ubuntu Vivid):
status:	New → In Progress

Revision history for this message

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2015-07-07:

#19

The attachment "0001-stress-ng-support-sequential-range.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

Ming Lei (tom-leiming) on 2015-07-08

Changed in irqbalance (Ubuntu Vivid):
status:	In Progress → Confirmed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-07-08:

#20

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in irqbalance (Ubuntu Trusty):
status:	New → Confirmed
Changed in irqbalance (Ubuntu Utopic):
status:	New → Confirmed
Changed in irqbalance (Ubuntu):
status:	New → Confirmed

Alberto Salvia Novella (es20490446e) on 2015-07-08

Changed in irqbalance (Ubuntu Trusty):
assignee:	nobody → dann frazier (dannf)
Changed in irqbalance (Ubuntu Utopic):
assignee:	nobody → dann frazier (dannf)
Changed in irqbalance (Ubuntu Vivid):
assignee:	nobody → dann frazier (dannf)
Changed in irqbalance (Ubuntu Wily):
assignee:	nobody → dann frazier (dannf)
Changed in irqbalance (Ubuntu Trusty):
importance:	Undecided → Medium
Changed in irqbalance (Ubuntu Utopic):
importance:	Undecided → Medium
Changed in irqbalance (Ubuntu Vivid):
importance:	Undecided → Medium
Changed in irqbalance (Ubuntu Wily):
importance:	Undecided → Medium
no longer affects:	linux (Ubuntu)
no longer affects:	linux (Ubuntu Trusty)
no longer affects:	linux (Ubuntu Utopic)
no longer affects:	linux (Ubuntu Vivid)
no longer affects:	linux (Ubuntu Wily)
tags:	added: trusty utopic vivid wily

Alberto Salvia Novella (es20490446e) on 2015-07-08

Changed in irqbalance (Ubuntu Trusty):
status:	Confirmed → Triaged
Changed in irqbalance (Ubuntu Utopic):
status:	Confirmed → Triaged
Changed in irqbalance (Ubuntu Vivid):
status:	Confirmed → Triaged
Changed in irqbalance (Ubuntu Wily):
status:	Confirmed → Triaged

Ming Lei (tom-leiming) on 2015-07-08

description:

updated

Revision history for this message

dann frazier (dannf) wrote on 2015-07-09:

#23

On Tue, Jul 7, 2015 at 2:25 AM, Ming Lei <email address hidden> wrote:
> On Tue, Jul 7, 2015 at 11:16 AM, Ming Lei <email address hidden> wrote:
>> Looks there are two kinds of translation fault from irqbalance:
>>
>> 1) happened in place_irq_in_node() which can reproduce in vivid package
>>
>> 2) the 2nd one happened in glib2, which is built by myself, because
>> irqbalance can choose to use its own local glib if there isn't glib2 available,
>> and the glib2 does exist in my server in which I build irqbalance.
>
>
> Both of two above reports can be fixed by the following irqbalance commit:
>
> NUMA is not available fix
>
> https://github.com/Irqbalance/irqbalance/commit/a3c812eb6cd627cd3fae45b8345538558b86973c
>
> Looks stress-ng can't only find kernel bug, but also userspace
> issue, :-)

I was looking to upload a fix for wily, but I haven't been able to
reproduce it in order to verify the fix. I ran 'stress-ng --seq 0
-t 60 --syslog --metrics --times -v' overnight in a loop, but
irqbalance never crashed. How long should I expect this to take on
average? Does it usually crash in a single run?

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-13:

#24

Dann,

Please follow the steps in #12, in which you should trigger the crash in 4 minutes.

BTW, looks wily kernel can't boot to shell prompt on mcdivitt.

Thanks,

Revision history for this message

dann frazier (dannf) wrote on 2015-07-13:

#25

On Mon, Jul 13, 2015 at 9:27 AM, Ming Lei <email address hidden> wrote:
> Dann,
>
> Please follow the steps in #12, in which you should trigger the crash in
> 4 minutes.

I've been running that in a loop and I'm currently on iteration #76
w/o a crash :(

Maybe it's
Linux ms10-33-mcdivittB0 3.19.0-22-generic #22-Ubuntu SMP Tue Jun 16
17:18:17 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

> BTW, looks wily kernel can't boot to shell prompt on mcdivitt.

OK - mind filing a separate bug for that?

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-13:

#26

> BTW, looks wily kernel can't boot to shell prompt on mcdivitt.

That kernel(v4.0) isn't the final kernel for wily, so do we need to pay attention to that?

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-14:

#27

On Mon, Jul 13, 2015 at 9:27 AM, Ming Lei <email address hidden> wrote:
> Dann,
>
> Please follow the steps in #12, in which you should trigger the crash in
> 4 minutes.

> I've been running that in a loop and I'm currently on iteration #76
> w/o a crash :(

The issue has nothing to do with the kernel; just make sure that irqbalance
is running first.

I can easily reproduce the issue on trusty, utopic and vivid with the approach in #12.
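
The advice above (confirm irqbalance is running, then stress the machine until it dies) can be sketched as a small loop. This is only an illustration: the steps in #12 are not repeated here, so the stress-ng command is the one quoted earlier in this thread, and the helper names (irqbalance_running, run_until_crash) are hypothetical. It assumes pgrep is available to detect the daemon.

```python
import subprocess

# stress-ng invocation quoted earlier in this thread; swap in the
# command from #12 if that is the reproducer being followed.
STRESS_CMD = ["stress-ng", "--seq", "0", "-t", "60",
              "--syslog", "--metrics", "--times", "-v"]


def irqbalance_running():
    """Return True if a process named exactly 'irqbalance' exists."""
    result = subprocess.run(["pgrep", "-x", "irqbalance"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def run_until_crash(max_iterations, run_stress=None,
                    is_alive=irqbalance_running):
    """Run the stressor repeatedly and report when irqbalance disappears.

    Returns the 1-based iteration after which irqbalance was gone,
    or None if it survived every iteration.
    """
    if run_stress is None:
        def run_stress():
            subprocess.run(STRESS_CMD)

    # A run proves nothing if the daemon was never up to begin with.
    if not is_alive():
        raise RuntimeError("irqbalance is not running; start it first")

    for iteration in range(1, max_iterations + 1):
        run_stress()
        if not is_alive():
            return iteration
    return None


if __name__ == "__main__":
    crashed_at = run_until_crash(max_iterations=100)
    if crashed_at is None:
        print("irqbalance survived all iterations")
    else:
        print(f"irqbalance gone after iteration {crashed_at}")
```

Checking the daemon before the first iteration separates "the bug did not reproduce" from "irqbalance was never running", which is the distinction the comment above draws.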

Revision history for this message

dann frazier (dannf) wrote on 2015-07-14:

#28

Ming was able to help me reliably reproduce this with the command:
stress-ng --sequential 0 --seq-start 86 --seq-end 90 -t 60 --syslog --metrics --times -v

I prepared a wily package w/ the proposed upstream backport for testing:
lp:~dannf/ubuntu/wily/irqbalance/lp1469214

Unfortunately, I'm still seeing irqbalance crash even with this backport:

[ 2461.635168] irqbalance[558]: unhandled input address range fault (11) at 0x20202020202034, esr 0x92000004
[ 2461.635175] pgd = ffffffcfab3f3000
[ 2461.675979] [20202020202034] *pgd=0000000000000000

[ 2461.733566] CPU: 4 PID: 558 Comm: irqbalance Not tainted 3.13.0-57-generic #95-Ubuntu
[ 2461.733570] task: ffffffcfa9cdcd00 ti: ffffffcfa9df8000 task.ti: ffffffcfa9df8000
[ 2461.733577] PC is at 0x40605c
[ 2461.733580] LR is at 0x4040e4
[ 2461.733582] pc : [<000000000040605c>] lr : [<00000000004040e4>] pstate: 80000000
[ 2461.733584] sp : 0000007fd95cf7a0
[ 2461.733585] x29: 0000007fd95cf7a0 x28: 000000000041a000
[ 2461.733588] x27: 000000000041a000 x26: 0000000000409510
[ 2461.733591] x25: 000000000041a000 x24: 0000000000405000
[ 2461.733593] x23: 000000000041acf8 x22: 000000000041a000
[ 2461.733596] x21: 0000000014ab0130 x20: 000000000041a000
[ 2461.733598] x19: 0000000014a9f0e0 x18: 0000000000000000
[ 2461.733601] x17: 0000007fa72118ec x16: 0000007fa75a72e0
[ 2461.733603] x15: 003bcfb11b54656b x14: 2030203020302030
[ 2461.733606] x13: 2030203020302030 x12: 2030203020302030
[ 2461.733608] x11: 2030203020302030 x10: 2030203020302030
[ 2461.733611] x9 : 2030203020302030 x8 : 0000000014a9bc80
[ 2461.733613] x7 : 0000000000000020 x6 : 0000000014a9bc90
[ 2461.733616] x5 : 0000000000000001 x4 : 0000007fa722a2a0
[ 2461.733618] x3 : 0000000014a9b880 x2 : 0000000000000001
[ 2461.733620] x1 : 4320202020202020 x0 : 000000003355000a
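
Several values in the dump above (x9 to x14, x1, and the faulting address) look like printable text rather than pointers. A minimal sketch for checking that, assuming the bytes sit in memory in little-endian order (usual for aarch64 Linux, but an assumption here); the helper name decode_register is hypothetical:

```python
def decode_register(value, width=8):
    """Show a register value as the bytes it would occupy in
    little-endian memory, with non-printable bytes shown as '.'."""
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Values copied from the register dump and fault address above.
    for name, value in [
        ("x9..x14", 0x2030203020302030),
        ("x1", 0x4320202020202020),
        ("fault address", 0x20202020202034),
    ]:
        print(f"{name:14s} {value:#018x} -> {decode_register(value)!r}")
```

The repeated value decodes to '0 0 0 0 ' and x1 to seven spaces followed by 'C'. That would be consistent with text data ending up where a pointer was expected, though nothing in this thread confirms the exact mechanism.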

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-15:

#29

wily.log (14.8 KiB, text/plain)

> I prepared a wily package w/ the proposed upstream backport for testing:
> lp:~dannf/ubuntu/wily/irqbalance/lp1469214

> Unfortunately, I'm still seeing irqbalance crash even with this backport:

I guess you are still testing irqbalance on c33. It looks like the upgrade from trusty
didn't go well: I can see lots of this kind of fault in different processes (sshd, stress-ng,
systemd...) just after a fresh boot with irqbalance disabled (see attachment), which sounds like a bad upgrade.

If you verify the patch on trusty/utopic/vivid, it does fix the issue according to my tests.

Revision history for this message

Ming Lei (tom-leiming) wrote on 2015-07-16:

#30

Dann,

I have worked out patches that fix the wily kernel; see the following link:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1474171/comments/4

so you can reproduce the issue on a totally clean wily distribution, :-)

dann frazier (dannf) on 2015-09-09

Changed in irqbalance (Ubuntu Wily):
status:	Triaged → In Progress

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-09-10:

#31

This bug was fixed in the package irqbalance - 1.0.6-3ubuntu3

---------------
irqbalance (1.0.6-3ubuntu3) wily; urgency=medium

* d/p/NUMA-is-not-available-fix.patch: Avoid crashes when NUMA
is not available. (LP: #1469214)

-- dann frazier <email address hidden> Wed, 09 Sep 2015 17:35:26 -0600

Changed in irqbalance (Ubuntu Wily):
status:	In Progress → Fix Released

dann frazier (dannf) on 2015-09-10

Changed in irqbalance (Ubuntu Vivid):
status:	Triaged → In Progress
Changed in irqbalance (Ubuntu Trusty):
status:	Triaged → In Progress
Changed in irqbalance (Ubuntu Utopic):
status:	Triaged → Won't Fix

Revision history for this message

Timo Aaltonen (tjaalton) wrote on 2015-09-11: Please test proposed package

#32

Hello Colin, or anyone else affected,

Accepted irqbalance into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/irqbalance/1.0.6-3ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in irqbalance (Ubuntu Vivid):
status:	In Progress → Fix Committed
tags:	added: verification-needed
Changed in irqbalance (Ubuntu Trusty):
status:	In Progress → Fix Committed

Revision history for this message

Timo Aaltonen (tjaalton) wrote on 2015-09-11:

#33

Hello Colin, or anyone else affected,

Accepted irqbalance into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/irqbalance/1.0.6-2ubuntu0.14.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
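
When verifying a -proposed package, it helps to confirm that the installed irqbalance is at least the version named for the series before running the reproducer. A rough sketch, assuming a Debian/Ubuntu system with dpkg-query and dpkg on the PATH; the version strings are the ones announced in this bug, and the helper names (installed_version, dpkg_at_least, has_fix) are hypothetical:

```python
import subprocess

# Versions carrying the NUMA fix, as announced in this bug report.
FIXED_VERSIONS = {
    "trusty": "1.0.6-2ubuntu0.14.04.4",
    "vivid": "1.0.6-3ubuntu1.1",
    "wily": "1.0.6-3ubuntu3",
}


def installed_version(package="irqbalance"):
    """Ask dpkg-query for the installed version of a package."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def dpkg_at_least(version, minimum):
    """Compare two Debian version strings using dpkg's own rules."""
    result = subprocess.run(
        ["dpkg", "--compare-versions", version, "ge", minimum])
    return result.returncode == 0


def has_fix(series, version, at_least=dpkg_at_least):
    """Return True if `version` is at or above the fixed version
    for `series`; series without a listed fix return False."""
    minimum = FIXED_VERSIONS.get(series)
    if minimum is None:
        return False
    return at_least(version, minimum)


if __name__ == "__main__":
    version = installed_version()
    print(version, "has fix for trusty:", has_fix("trusty", version))
```

Delegating the comparison to dpkg avoids reimplementing Debian version ordering, which plain string comparison gets wrong.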

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-09-11:

#34

Hi there,

I need access to the machine to test this, any hints on the machine name and how to access it would be useful. Thanks.

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2015-09-21:

#35

Hi Colin, I believe you now have access to the necessary hardware, but please let me know if this is still an issue. Thanks.

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-09-21:

#36

I've tested 1.0.6-2ubuntu0.14.04.4 for several hours and the problem is fixed, I can't reproduce this at all.

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2015-09-21:

#37

Great! Many thanks...

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-09-21:

#38

Andrew, I'm still testing it for vivid, will be done in a few hours.

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-09-21:

#39

Tested with vivid 1.0.6-3ubuntu1.1, bug is fixed

Revision history for this message

Colin Ian King (colin-king) wrote on 2015-09-21:

#40

Tested with wily 1.0.6-3ubuntu3, bug fixed.

tags:

added: verification-done
removed: verification-needed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-09-23:

#41

This bug was fixed in the package irqbalance - 1.0.6-2ubuntu0.14.04.4

---------------
irqbalance (1.0.6-2ubuntu0.14.04.4) trusty; urgency=medium

* d/p/NUMA-is-not-available-fix.patch: Avoid crashes when NUMA
is not available. (LP: #1469214)

-- dann frazier <email address hidden> Thu, 10 Sep 2015 13:11:21 -0600

Changed in irqbalance (Ubuntu Trusty):
status:	Fix Committed → Fix Released

Revision history for this message

Chris J Arges (arges) wrote on 2015-09-23: Update Released

#42

The verification of the Stable Release Update for irqbalance has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-09-23:

#43

This bug was fixed in the package irqbalance - 1.0.6-3ubuntu1.1

---------------
irqbalance (1.0.6-3ubuntu1.1) vivid; urgency=medium

* d/p/NUMA-is-not-available-fix.patch: Avoid crashes when NUMA
is not available. (LP: #1469214)

-- dann frazier <email address hidden> Thu, 10 Sep 2015 13:01:56 -0600

Changed in irqbalance (Ubuntu Vivid):
status:	Fix Committed → Fix Released


	Status	Importance	Assigned to
irqbalance (Ubuntu)	Fix Released	Medium	dann frazier
Trusty	Fix Released	Medium	dann frazier
Utopic	Won't Fix	Medium	dann frazier
Vivid	Fix Released	Medium	dann frazier
Wily	Fix Released	Medium	dann frazier
