stress-ng mmap failing on zVM and LPAR

Bug #1569468 reported by Jeff Lane 
This bug affects 1 person
Affects             Status        Importance  Assigned to     Milestone
stress-ng (Ubuntu)  Fix Released  High        Colin Ian King
Xenial              Fix Released  High        Colin Ian King

Bug Description

[SRU, Xenial]

When running stress-ng on zVM and LPAR we are hitting SIGBUS errors. The mmap stressor uses a sparsely allocated mmap'd backing file; with memory overcommit and a full file system, pages cannot be backed when they are first touched, so memory accesses to those unbacked pages trigger a SIGBUS.

[REPRODUCER + FIX]
Run stress-ng --mmap 64 --maximize on a filesystem that is very nearly full; the stressor receives a SIGBUS and exits early. With the fix, the SIGBUS is caught and the stressor continues without exiting prematurely.

[REGRESSION POTENTIAL]
I am requesting syncing with 0.05.24 micro release as this contains the fix plus a few SIGSEGV stack trapping fixes. stress-ng is a universe leaf project and the fixes touch just a few of the stress tests. These have been regression checked on various architectures and the code passes static analysis on cppcheck, CoverityScan and clang's scan-build, so regression potential is minimal.

--------------------------------------------------------------------------

I am running stress-ng on s390 in all three modes (z/KVM, zVM and LPAR). It seems to work OK in z/KVM; however, on zVM and LPAR the mmap stressor started failing on Xenial a few days ago.

I tried to get detailed logs but either don't know the correct switches or they simply aren't there.

Here is the output when I used --log-file and --verbose on an LPAR:
root@s1lp10-jefflane:~# less stress-ng-mmap-fail.log
stress-ng: debug: [179421] 4 processors online, 4 processors configured
stress-ng: info: [179421] dispatching hogs: 4 mmap
stress-ng: debug: [179421] cache allocate: reducing cache level from L3 (too high) to L2
stress-ng: info: [179421] cache allocate: default cache size: 2048K
stress-ng: debug: [179421] starting stressors
stress-ng: debug: [179422] stress-ng-mmap: started [179422] (instance 0)
stress-ng: debug: [179423] stress-ng-mmap: started [179423] (instance 1)
stress-ng: debug: [179424] stress-ng-mmap: started [179424] (instance 2)
stress-ng: debug: [179421] 4 stressors spawned
stress-ng: debug: [179425] stress-ng-mmap: started [179425] (instance 3)
stress-ng: debug: [179424] stress-ng-mmap: exited [179424] (instance 2)
stress-ng: debug: [179422] stress-ng-mmap: exited [179422] (instance 0)
stress-ng: debug: [179421] process [179422] terminated
stress-ng: debug: [179421] process 179423 (stress-ng-mmap) terminated on signal: 7 (Bus error)
stress-ng: debug: [179421] process [179423] terminated
stress-ng: debug: [179421] process [179424] terminated
stress-ng: debug: [179421] process 179425 (stress-ng-mmap) terminated on signal: 7 (Bus error)
stress-ng: debug: [179421] process [179425] terminated
stress-ng: info: [179421] unsuccessful run completed in 300.68s (5 mins, 0.68 secs)

That is the only info I have for the failure, unfortunately.
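
For anyone triaging a similar log, the signal deaths can be pulled out mechanically. This is a minimal sketch only: the regular expression is derived from the debug lines in this one log, and the helper name signal_deaths is hypothetical, not part of stress-ng.

```python
import re

# Pattern for the parent's "terminated on signal" debug lines shown above.
# Assumption: the wording is taken from this log only; other stress-ng
# versions may format the line differently.
SIGNAL_LINE = re.compile(
    r"process (\d+) \((\S+)\) terminated on signal: (\d+) \((.+)\)"
)


def signal_deaths(log_text):
    """Return (pid, stressor, signal number, signal name) per killed child."""
    deaths = []
    for line in log_text.splitlines():
        match = SIGNAL_LINE.search(line)
        if match:
            pid, name, signo, signame = match.groups()
            deaths.append((int(pid), name, int(signo), signame))
    return deaths


if __name__ == "__main__":
    sample = (
        "stress-ng: debug: [179421] process [179422] terminated\n"
        "stress-ng: debug: [179421] process 179423 (stress-ng-mmap) "
        "terminated on signal: 7 (Bus error)\n"
    )
    for death in signal_deaths(sample):
        print(death)
```

Lines for children that exited normally do not match the pattern and are skipped.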

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: stress-ng 0.05.23-1
ProcVersionSignature: Ubuntu 4.4.0-18.34-generic 4.4.6
Uname: Linux 4.4.0-18-generic s390x
ApportVersion: 2.20.1-0ubuntu1
Architecture: s390x
Date: Tue Apr 12 12:47:49 2016
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: stress-ng
UpgradeStatus: No upgrade log present (probably fresh install)

Jeff Lane  (bladernr) wrote :
description: updated
Jeff Lane  (bladernr) wrote :

This is the command line used:
stress-ng --aggressive --verbose -t 300 --log-file stress-ng-mmap-fail.log --mmap 0

Jeff Lane  (bladernr) wrote :

This is the output from a successful run on zKVM using the same kernel as zVM and the LPAR.

stress-ng: debug: [17990] 2 processors online, 2 processors configured
stress-ng: info: [17990] dispatching hogs: 2 mmap
stress-ng: debug: [17990] cache allocate: reducing cache level from L3 (too high) to L2
stress-ng: info: [17990] cache allocate: default cache size: 2048K
stress-ng: debug: [17990] starting stressors
stress-ng: debug: [17991] stress-ng-mmap: started [17991] (instance 0)
stress-ng: debug: [17992] stress-ng-mmap: started [17992] (instance 1)
stress-ng: debug: [17990] 2 stressors spawned
stress-ng: debug: [17991] stress-ng-mmap: exited [17991] (instance 0)
stress-ng: debug: [17990] process [17991] terminated
stress-ng: debug: [17992] stress-ng-mmap: exited [17992] (instance 1)
stress-ng: debug: [17990] process [17992] terminated
stress-ng: info: [17990] successful run completed in 300.25s (5 mins, 0.25 secs)

Jeff Lane  (bladernr)
tags: added: blocks-hwcert-server
Colin Ian King (colin-king) wrote :

On the zVM and/or LPAR can you run:

strace -f stress-ng --aggressive --verbose -t 300 --log-file stress-ng-mmap-fail.log --mmap 0 >& strace.log

and attach the strace.log to the bug.

Jeff Lane  (bladernr) wrote :

Here's the strace and other log from zVM. I'll have the LPAR logs shortly; it's a much larger log.

Jeff Lane  (bladernr) wrote :

Here are the LPAR logs. Interestingly, these are smaller than the zVM ones, with the same amount of RAM on each.

Jeff Lane  (bladernr) wrote :

Colin, can you take a look at this and just confirm a theory for me...

I re-ran after adding an LVM volume to the zVM lpar to expand the filesystem and now the test passes.

I am thinking it may have been as simple as running out of disk space for writing temp data.

I'm trying to get the LPAR expanded to verify this on the LPAR too.

Just discovered this possibility a few moments ago after getting the LVM volumes created and attached.

Jeff Lane  (bladernr) wrote :

If that's the case, it may be useful to have stress-ng actually say it ran out of disk space. As it is, assuming the above is true, what I think is happening is that it's writing a bunch of data to some temp dirs, it runs out of disk space, errors out and then deletes the contents of the temp dirs behind itself.

So when I go in to investigate, I don't see that the filesystem was full, because the temp data has been cleaned up and the filesystem is no longer full.
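
One way to catch a transient "disk full" condition like this is to sample free space while the test runs, rather than checking afterwards. The following is a rough sketch, not something stress-ng provides: the helper names free_bytes and lowest_free_during are hypothetical, and path must point at whichever filesystem the stressor writes its temporary files to, which this report does not state.

```python
import os
import subprocess
import sys
import time


def free_bytes(path):
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def lowest_free_during(cmd, path=".", interval=0.5):
    """Run cmd and return (exit status, lowest free-byte sample seen)."""
    lowest = free_bytes(path)
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        # Sample while the child is still running, before it cleans up.
        lowest = min(lowest, free_bytes(path))
        time.sleep(interval)
    lowest = min(lowest, free_bytes(path))
    return proc.returncode, lowest


if __name__ == "__main__":
    # Placeholder command; substitute the stress-ng invocation under test.
    status, lowest = lowest_free_during([sys.executable, "-c", "pass"])
    print(status, lowest)
```

Sampling is coarse, so a very short dip between samples can be missed; a smaller interval narrows that window at the cost of more statvfs calls.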

Thanks
Jeff

Colin Ian King (colin-king) wrote :

The SIGBUS occurs on the following actions:

[pid 196657] mmap(NULL, 268435456, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x3ff78480000
[pid 196657] --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x3ff7f6ea000} ---
[pid 196657] +++ killed by SIGBUS +++

The mapping covers 0x3ff78480000 to 0x3ff78480000 + 268435456, i.e. 0x3ff78480000 to 0x3ff88480000.
The page that triggered the SIGBUS was 0x3ff7f6ea000, which lies between these addresses, so the fault occurred inside a valid mapping.
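
The range check can be reproduced with a few lines of arithmetic. This is just a sketch using the three values from the strace output; the function name mapping_contains is hypothetical.

```python
# Values copied from the strace lines above.
BASE = 0x3FF78480000      # address returned by mmap()
LENGTH = 268435456        # mapping length in bytes (256 MiB)
FAULT = 0x3FF7F6EA000     # si_addr reported with the SIGBUS


def mapping_contains(base, length, addr):
    """True if addr falls inside the half-open range [base, base + length)."""
    return base <= addr < base + length


if __name__ == "__main__":
    print(hex(BASE + LENGTH))                     # end of the mapping
    print(mapping_contains(BASE, LENGTH, FAULT))  # is the fault inside it?
    print(hex(FAULT - BASE))                      # offset of the faulting page
```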

The file-backed mapping is basically a sparse file that gets populated as the pages are touched. A SIGBUS therefore most probably occurs when we run out of free blocks on disk: the kernel can't supply the page mapping and we get a SIGBUS. I overlooked this corner case, so I'll work out a fix for stress-ng.
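
To illustrate what "sparse" means here, the sketch below extends an empty temporary file and compares its apparent size with the space actually allocated. It is an illustration only: this report does not show how stress-ng creates its backing file, so ftruncate is an assumption, as is the 512-byte unit for st_blocks (true on Linux). The helper name sparse_file_sizes is hypothetical.

```python
import os
import tempfile


def sparse_file_sizes(length, directory=None):
    """Extend an empty temp file to length; return (apparent, allocated) bytes."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Sets the file size without writing data, so on filesystems that
        # support sparse files few or no blocks are allocated yet.
        os.ftruncate(fd, length)
        st = os.fstat(fd)
        # Assumption: st_blocks counts 512-byte units, as on Linux.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    apparent, allocated = sparse_file_sizes(1 << 20)
    print(apparent, allocated)
```

The gap between the two numbers is space the filesystem has promised but not yet provided. If the disk fills before a page in that gap is first written through a shared mapping, the kernel cannot back the page and the access fails with SIGBUS.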

Changed in stress-ng (Ubuntu):
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :
Changed in stress-ng (Ubuntu):
status: New → Fix Committed
Jeff Lane  (bladernr) wrote : Re: [Bug 1569468] Re: stress-ng mmap failing on zVM and LPAR

Ahhh, thanks. In the test scenario, I have two different items, an
LPAR and a z/VM instance and both were configured with relatively
small disks. (6GB total). The zVM one at least had a couple extra
DASDs added that I could add via LVM and that helped a lot. The other
did not so I'm waiting for IS to expand it or give me more storage
somewhere.

Thanks a lot Colin

On Thu, Apr 14, 2016 at 4:58 AM, Colin Ian King
<email address hidden> wrote:
> Fix committed: http://kernel.ubuntu.com/git/cking/stress-ng.git/commit/?id=4621e3afd7af4ee950618e58200cded325a1401d
>
> ** Changed in: stress-ng (Ubuntu)
> Status: New => Fix Committed

description: updated
description: updated
description: updated
description: updated
Colin Ian King (colin-king) wrote :
Changed in stress-ng (Ubuntu Xenial):
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
status: New → In Progress
Martin Pitt (pitti) wrote : Please test proposed package

Hello Jeff, or anyone else affected,

Accepted stress-ng into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/stress-ng/0.05.23-1ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in stress-ng (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed
Jeff Lane  (bladernr) wrote :

Hi Martin,

Tested on z/KVM, z/VM and LPAR (two of which were initially bitten by this bug) and the version in proposed works great!

Colin Ian King (colin-king) wrote :

Thanks for testing. However, I found a regression in one of the other bugs being fixed by this update, so we're waiting for the -1ubuntu2 release to land in -proposed for final testing.

Jeff Lane  (bladernr) wrote : Re: [Bug 1569468] Re: stress-ng mmap failing on zVM and LPAR

ok, no worries. Let me know and I'll retest that too, if needed.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package stress-ng - 0.05.24-1

---------------
stress-ng (0.05.24-1) unstable; urgency=medium

  * Makefile: bump version
  * stress-mmap: handle SIGBUS signals (LP: #1569468)
  * stress-mmapmany: sanity check sysconf return
  * stress-mmapmany: detect SEGV deaths
  * stress-mlock: detect SEGV deaths
  * stress-brk: detect SEGV deaths
  * stress-bigheap: detect SEGV deaths
  * stress-memfd: detect SEGV deaths
  * stress-mmapmany: allocate mappings on heap rather than stack
  * stress-mlock: allocate mappings on heap rather than stack
  * stress-cpu: move sieve buffer to static to reduce stack size
  * stress-sem*: differentiate between which semaphore init that failed
  * stress-remap-file-pages: abort if remap fails
  * stress-fiemap: remove \n from pr_fail_err messages

 -- Colin King <email address hidden> Thu, 14 Apr 2016 11:00:11 +0100

Changed in stress-ng (Ubuntu):
status: Fix Committed → Fix Released
Martin Pitt (pitti) wrote : Please test proposed package

Hello Jeff, or anyone else affected,

Accepted stress-ng into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/stress-ng/0.05.23-1ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Colin Ian King (colin-king) wrote :

I've given this a test and it no longer fails, so the fix in -proposed looks good to me.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package stress-ng - 0.05.23-1ubuntu2

---------------
stress-ng (0.05.23-1ubuntu2) xenial; urgency=medium

  * Fix alignment mask to ensure stacks are 16 byte aligned (LP: #1573117)
    - incorrect mask used in previous fix, now using correct mask

stress-ng (0.05.23-1ubuntu1) xenial; urgency=medium

  * Ensure all clone() calls are 16 byte aligned for aarch64 (LP: #1573117)
  * stress-mmap: handle SIGBUS signals (LP: #1569468)

 -- Colin King <email address hidden> Tue, 26 Apr 2016 12:16:47 +0100

Changed in stress-ng (Ubuntu Xenial):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for stress-ng has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
