systemd autopkgtest broken on ppc64el with qemu 6.0

Bug #1935617 reported by Christian Ehrhardt 
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
Fix Released
High
bugproxy
qemu (Ubuntu)
Fix Released
Undecided
Unassigned
systemd (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

I'm not sure yet if this is flaky or a real issue, but I'm filing it to avoid multiple people analyzing the same.

The Qemu 6.0 upload https://launchpad.net/ubuntu/+source/qemu/1:6.0+dfsg-1~ubuntu2 triggers a test failure like
https://autopkgtest.ubuntu.com/results/autopkgtest-impish/impish/ppc64el/s/systemd/20210708_223311_e3bbb@/log.gz

I have tested the new qemu on ppc64 and it worked fine for device emulation and migration cases.
But this is suspicious.
Of the last tests exactly and only those with the new qemu failed.

impish
  ppc64el
    tests-in-lxd (F 5% f 0% S 0% B 0% => P 95%/) F.........F.............................
    systemd-fsckd (F 0% f 0% S 100% B 0% => P 0%/) SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
    upstream-1 (F 15% f 0% S 0% B 0% => P 85%/) F..FFFF....F............................
    upstream-2 (F 12% f 0% S 0% B 0% => P 87%/) F..FFFF.................................

For an insight in flakyness/reproducibility I've retriggered the missing qemu and the non-qemu cases a few times. If those reproduce all-bad vs all-good again this would further indicate a real issue.

Unfortunately the ppc maas seems down right now and canonistack also isn't too nice this week - overall that inhibits the testing a bit :-/

tags: added: update-excuse
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

My Repro-tests indeed seem to indicate that really qemu 6.0 is the trigger for this,
but as I said HW is rather unavailable at the moment. I was told that the MAAS is serviced but should be back soon - so I'll have a look later when this one is available again.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Unfortunately the ppc maas seems down right now

product engineering has access to PPC systems in maas?!? sigh...that sure would be nice for sustaining engineering to have access to also :(

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

> product engineering has access to PPC systems in maas?!? sigh...that sure would be nice
> for sustaining engineering to have access to also :(

That might sound more appealing than it is, the HW is often broken the availability is worse than gold-nuggets and I think I never got a system without pinging someone to free it.
But yeah, all that aside it is useful for some cases which just are not workable at all in a virtual environment.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

While canonistack was breaking hard on me last week I managed to get an instance running today.
Trying to reproduce the issue in there ...

Atfer a long while of upgrades, dependencies, ....

I was running with impish (as-is) vs impish+qemu6.0.
It was very reproucible working with 5.2 and always failing for the same subset of

TEST-08-ISSUE-2730: FAIL (120 s)
TEST-09-ISSUE-2691: FAIL ( 71 s)
TEST-11-ISSUE-3166: FAIL ( 71 s)
TEST-13-NSPAWN-SMOKE: FAIL ( 72 s)
TEST-14-MACHINE-ID: FAIL ( 72 s)
TEST-17-UDEV: FAIL ( 73 s)
TEST-19-DELEGATE: FAIL ( 71 s)
TEST-24-CRYPTSETUP: FAIL (126 s)

Now we need to find what these have in common that the others do not have ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Isolated case
 $ sudo env SELECTED_TESTS="TEST-08-*" debian/tests/upstream

It runs a VM Guest to do init stuff and that seems to break:

There are suspicious logs like systemd crashing, but those also exist (to some extent) in the good case. I compared that:

$ grep -e "of user 0 dumped core" -e "^TEST-[0-9][0-9]-.* SETUP:" systemd-ppc-good.txt
Just one hit in TEST-17-UDEV

While the bad case has hits in exactly the list of not failing tests.
So these crashes are indeed related (root cause or symptom is not yet clear yet) to the problem.

The boot&stop itself is fine if you just start the same commands.
But the tests redirect the journal and process it, in there is the crash - I need to find how it does that.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (3.6 KiB)

As root:

$ mkdir /tmp/debugtest
$ export ARTIFACT_DIRECTORY=/tmp/debugtest
$ export TEST_SHOW_JOURNAL=warning
$ export TEST_REQUIRE_INSTALL_TESTS=0
$ export TEST_PREFER_NSPAWN=1
$ export NO_BUILD=1
$ export SELECTED_TESTS="TEST-08-*"
./test/run-integration-tests.sh

Reproduces the issue, it will keep the journal in /tmp/debugtest and run in emulation.

The crash looks like:
[ 15.988730] kernel: ext4 filesystem being mounted at /root supports timestamps until 2038 (0x7fffffff)
[ 18.235447] systemd[1]: Failed to find module 'autofs4'
[ 21.938779] kernel: ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
[ 25.404984] systemd-coredump[371]: [🡕] Process 369 (rm) of user 0 dumped core.

                                      Stack trace of thread 369:
                                      #0 0x00000b1dcc8c61d0 n/a (rm + 0x61d0)
                                      #1 0x00000b1dcc8c8f64 n/a (rm + 0x8f64)
                                      #2 0x00000b1dcc8c3538 n/a (rm + 0x3538)
                                      #3 0x00000b1dcc8c4424 n/a (rm + 0x4424)
                                      #4 0x00000b1dcc8c2b70 n/a (rm + 0x2b70)
                                      #5 0x00007ee0760aa0b4 n/a (libc.so.6 + 0x2a0b4)
                                      #6 0x00007ee0760aa2a0 __libc_start_main (libc.so.6 + 0x2a2a0)
[ 25.480260] systemd[1]: testsuite-08.service: Control process exited, code=dumped, status=11/SEGV

It seems always "rm" that breaks, here from a fill log:

$ grep -e "of user 0 dumped core" systemd-ppc-qemu6.0.txt
[ 47.691335] systemd-coredump[286]: Process 279 (rm) of user 0 dumped core.
[ 44.911447] systemd-coredump[284]: Process 279 (rm) of user 0 dumped core.
[ 45.089287] systemd-coredump[284]: Process 282 (rm) of user 0 dumped core.
[ 44.799519] systemd-coredump[284]: Process 278 (rm) of user 0 dumped core.
[ 44.666997] systemd-coredump[285]: Process 283 (rm) of user 0 dumped core.
[ 46.306449] systemd-coredump[282]: Process 279 (rm) of user 0 dumped core.
[ 43.898582] systemd-coredump[283]: Process 280 (rm) of user 0 dumped core.
[ 53.298322] systemd-coredump[338]: Process 335 (rm) of user 0 dumped core.
[ 45.731700] systemd-coredump[284]: Process 282 (rm) of user 0 dumped core.
[ 45.354492] systemd-coredump[283]: Process 281 (rm) of user 0 dumped core.
[ 43.579587] systemd-coredump[281]: Process 278 (rm) of user 0 dumped core.
[ 45.047475] systemd-coredump[284]: Process 278 (rm) of user 0 dumped core.
[ 50.662839] systemd-coredump[281]: Process 278 (rm) of user 0 dumped core.
[ 45.634806] systemd-coredump[283]: Process 278 (rm) of user 0 dumped core.

It is interesting to see such a crash as all other services report a valid green "ok".

If I start the very same guest, but without the pre-defined systemd.wants=end.service (as it is meant to shut it down again) I can log in as root without a PW.

$ /bin/qemu-system-ppc64 -smp 4 -net none -m 512M -nographic -vga none -kernel /boot/vmlinux-5.11.0-22-generic -drive format=raw,cache=unsafe,file=/var/tmp/systemd-test.zx7kOu/test08.img -initrd /boot/initrd.img-5.11.0-22-ge...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The service has a StartPre that is like
  ExecStartPre=rm -f /failed /testok

And that works to reproduce:

bash-5.1# rm -f /failed /testok
Segmentation fault (core dumped)

In fact this RM is very very sad:
-bash-5.1# rm /doesnotexist
Segmentation fault (core dumped)
-bash-5.1# touch /doesexist
-bash-5.1# rm /doesexist
Segmentation fault (core dumped)
-bash-5.1# rm -f /doesexist
Segmentation fault (core dumped)

Other commands (as far as I can see) work as usual albeit with the restrictions of an emulated HVC console.

There is no ssh installed and no apt, so getting this into a better debuggable state isn't easy.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I tried the same on a bare metal machine (now one is available for a short time).

Got the new systemd source and test, ran it with qemu 5.2 and it is failing.
But not randomly, the same subset of tests fails on bare metal with 5.2.
But now they all fail on an OOM.

  kvm: OS terminated: OS panic: System is deadlocked on memory

bare-metal fails
TEST-01-BASIC: SUCCESS ( 56 s)
TEST-03-JOBS: SUCCESS ( 13 s)
TEST-04-JOURNAL: SUCCESS ( 17 s)
TEST-05-RLIMITS: SUCCESS ( 3 s)
TEST-06-SELINUX: SUCCESS ( 0 s)
TEST-07-ISSUE-1981: SUCCESS ( 3 s)
TEST-08-ISSUE-2730: FAIL ( 61 s)
TEST-09-ISSUE-2691: FAIL ( 19 s)
TEST-11-ISSUE-3166: FAIL ( 19 s)
TEST-12-ISSUE-3171: SUCCESS ( 5 s)
TEST-13-NSPAWN-SMOKE: FAIL ( 26 s)
TEST-14-MACHINE-ID: FAIL ( 25 s)
TEST-15-DROPIN: SUCCESS ( 14 s)
TEST-16-EXTEND-TIMEOUT: SUCCESS ( 67 s)
TEST-17-UDEV: FAIL ( 27 s)
TEST-18-FAILUREACTION: SUCCESS ( 3 s)
TEST-19-DELEGATE: FAIL ( 19 s)
TEST-20-MAINPIDGAMES: SUCCESS ( 4 s)
TEST-22-TMPFILES: SUCCESS ( 7 s)
TEST-23-TYPE-EXEC: SUCCESS ( 2 s)
TEST-24-CRYPTSETUP: FAIL ( 66 s)
TEST-26-SETENV: SUCCESS ( 3 s)
TEST-27-STDOUTFILE: SUCCESS ( 2 s)
TEST-28-PERCENTJ-WANTEDBY: SUCCESS ( 5 s)

Failing on qemu 6.0 on infra
TEST-01-BASIC: SUCCESS ( 54 s)
TEST-03-JOBS: SUCCESS ( 12 s)
TEST-04-JOURNAL: SUCCESS ( 15 s)
TEST-05-RLIMITS: SUCCESS ( 2 s)
TEST-06-SELINUX: SUCCESS ( 0 s)
TEST-07-ISSUE-1981: SUCCESS ( 3 s)
TEST-08-ISSUE-2730: FAIL (131 s)
TEST-09-ISSUE-2691: FAIL ( 77 s)
TEST-11-ISSUE-3166: FAIL ( 80 s)
TEST-12-ISSUE-3171: SUCCESS ( 5 s)
TEST-13-NSPAWN-SMOKE: FAIL ( 77 s)
TEST-14-MACHINE-ID: FAIL ( 80 s)
TEST-15-DROPIN: SUCCESS ( 23 s)
TEST-16-EXTEND-TIMEOUT: SUCCESS ( 66 s)
TEST-17-UDEV: FAIL ( 77 s)
TEST-18-FAILUREACTION: SUCCESS ( 3 s)
TEST-19-DELEGATE: FAIL ( 75 s)
TEST-20-MAINPIDGAMES: SUCCESS ( 3 s)
TEST-22-TMPFILES: SUCCESS ( 3 s)
TEST-23-TYPE-EXEC: SUCCESS ( 2 s)
TEST-24-CRYPTSETUP: FAIL (130 s)
TEST-26-SETENV: SUCCESS ( 2 s)
TEST-27-STDOUTFILE: SUCCESS ( 2 s)
TEST-28-PERCENTJ-WANTEDBY: SUCCESS ( 3 s)

Both lists are suspiciously 8/9/11/13/14/17/19/24.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since the bare metal case reports a memory issue on the crashes, let us give it a bit more memory - with 2G it works finally.

Instead of the other tests in launchpad/canonistack this uses KVM mode via "-machine accel=kvm -enable-kvm -cpu host". Switching that instead to "-machine accel=tcg" it fails the same.

But I found what consumed all the memory, the guest is (by default) started with the same CPU set as the host which means -smp 156. Disabling SMT in general on the host and starting the guest with 4 cpus allows to reduce the memory to 1G (but 512M still fails). Some other environment detail seems to bump the memory demand up.

As a cross check I can it on canonistack with 2G but it still fails with the segfaults.

With the adapted config it worked in qemu 5.2 in KVM and TCG mode.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

From here I was switching to qemu 6.0 on the bare metal machine.
6.0-KVM - works
6.0-TCG - works

But running the testsuite itself fails (due to using only 512M ram).
I found that switching from testsuite-08 to testsuite-09 allowed me to get into bad-mode in TCG (with enough memory, but into the segfaulting rm).

Now that (segfaulting 6.0 with TCG) I switched to KVM and it worked.

TL;DR for now:
- memory problems hit the same that are segfaulting
  - even though memory problems hit the same cases, more memory does not fix it for TCG+qemu6.0
- segfault not reproducible in KVM acceleration mode
- issue is TCG only and qemu 6.0 only

P.S. I do all the creation of the images and such with the working qemu 5.2, only then do I switch to 6.0 and re-run the failing test.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

I installed a ppc64 impish cloud image (all fine) but then ran it in TCG mode (to check if I can trigger the same).

That starts it like:
/usr/bin/qemu-system-ppc64le -name guest=i-systemdtest,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-i-systemdtest/master-key.aes"} -machine pseries-impish,accel=tcg,usb=off,dump-guest-core=off,memory-backend=ppc_spapr.ram -cpu POWER9 -m 2048 ...

So it uses accel=tcg as intended, but other than the systemd testbed it has ssh and apt for debugging.
Gladly it seems reproducible there as well.

So we can add to the above TL;DR:
 - also reproducible in impish cloud images (not just systemd)

In fact after a boot I already have plenty of crashes.

-rw-r----- 1 root root 253783 Jul 12 13:43 _usr_bin_chgrp.0.crash
-rw-r----- 1 root root 208344 Jul 12 13:47 _usr_bin_chmod.0.crash
-rw-r----- 1 root root 338077 Jul 12 13:52 _usr_bin_find.0.crash
-rw-r----- 1 root root 218387 Jul 12 13:52 _usr_bin_rm.0.crash
-rw-r----- 1 ubuntu ubuntu 226114 Jul 12 13:52 _usr_bin_rm.1000.crash
-rw-r----- 1 pollinate daemon 224894 Jul 12 13:43 _usr_bin_rm.111.crash
... growing ...

With that it seems odd that it boots and works just fine.
A praise to the stability of a modern Linux ...

Also apt is rather unlucky - like
Unpacking libdebuginfod-common (0.185-1) ...
dpkg: error while cleaning up:
 rm command for cleanup subprocess was killed by signal (Segmentation fault), core dumped
dpkg: error processing archive /tmp/apt-dpkg-install-1jDo19/1-libbabeltrace1_1.5.8-1build3_ppc64el.deb (--unpack):
 rm command for cleanup subprocess was killed by signal (Segmentation fault), core dumped
...

I let it crash a bunch of things and switched back to KVM mode to then do the debugging with GDB on the crash files.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (7.2 KiB)

Backtraces

Two times (on rm, only slight variations)

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000798b9f361d0 in quotearg_buffer_restyled (buffer=buffer@entry=0x798b9f60130 <slot0> "'", buffersize=buffersize@entry=256,
    arg=arg@entry=0x800369e198070000 <error: Cannot access memory at address 0x800369e198070000>, argsize=argsize@entry=18446744073709551615, quoting_style=shell_always_quoting_style,
    flags=flags@entry=1, quote_these_too=quote_these_too@entry=0x7fffcf48d7c0, left_quote=<optimized out>, right_quote=<optimized out>) at lib/quotearg.c:400
400 lib/quotearg.c: No such file or directory.
(gdb) bt
#0 0x00000798b9f361d0 in quotearg_buffer_restyled (buffer=buffer@entry=0x798b9f60130 <slot0> "'", buffersize=buffersize@entry=256,
    arg=arg@entry=0x800369e198070000 <error: Cannot access memory at address 0x800369e198070000>, argsize=argsize@entry=18446744073709551615, quoting_style=shell_always_quoting_style,
    flags=flags@entry=1, quote_these_too=quote_these_too@entry=0x7fffcf48d7c0, left_quote=<optimized out>, right_quote=<optimized out>) at lib/quotearg.c:400
#1 0x00000798b9f38f64 in quotearg_n_options (options=0x7fffcf48d7b8, argsize=18446744073709551615, arg=0x800369e198070000 <error: Cannot access memory at address 0x800369e198070000>, n=0)
    at lib/quotearg.c:907
#2 quotearg_n_style (arg=0x800369e198070000 <error: Cannot access memory at address 0x800369e198070000>, s=shell_escape_always_quoting_style, n=0) at lib/quotearg.c:958
#3 quotearg_style (s=shell_escape_always_quoting_style, arg=0x800369e198070000 <error: Cannot access memory at address 0x800369e198070000>) at lib/quotearg.c:972
#4 0x00000798b9f33538 in excise (fts=fts@entry=0x798e16902f0, ent=ent@entry=0x798e16915b0, x=x@entry=0x7fffcf48dbd8, is_dir=is_dir@entry=false) at src/remove.c:406
#5 0x00000798b9f34424 in rm_fts (x=<optimized out>, ent=0x798e16915b0, fts=<optimized out>) at src/remove.c:548
#6 rm (file=<optimized out>, x=0x7fffcf48dbd8) at src/remove.c:607
#7 0x00000798b9f32b70 in main (argc=<optimized out>, argv=<optimized out>) at src/rm.c:370

Once

Program terminated with signal SIGSEGV, Segmentation fault.
#0 last_component (name=0x804170e3450d0000 <error: Cannot access memory at address 0x804170e3450d0000>) at lib/basename-lgpl.c:36
36 lib/basename-lgpl.c: No such file or directory.
(gdb) bt
#0 last_component (name=0x804170e3450d0000 <error: Cannot access memory at address 0x804170e3450d0000>) at lib/basename-lgpl.c:36
#1 0x00000d45ade24298 in rm_fts (x=<optimized out>, ent=0xd45e37053b0, fts=<optimized out>) at src/remove.c:444
#2 rm (file=<optimized out>, x=0x7fffcbf8f508) at src/remove.c:607
#3 0x00000d45ade22b70 in main (argc=<optimized out>, argv=<optimized out>) at src/rm.c:370

Once (chgrp)

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000e0cb28f5190 in quotearg_buffer_restyled (buffer=buffer@entry=0xe0cb2920130 <slot0> "'", buffersize=buffersize@entry=256,
    arg=arg@entry=0x606513d00c0e0000 <error: Cannot access memory at address 0x606513d00c0e0000>, argsize=argsize@entry=18446744073709551615, quoting_style=shell_always_quoting_style,
    flags=flags@entry=1, quote_thes...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

quotearg seems to be from gettext
 https://sources.debian.org/src/gettext/0.21-4/gettext-tools/gnulib-lib/quotearg.c/?hl=1#L1

And is at least related in a few of the crashes (but not exclusively).
I wonder if the new TCG implements an instruction differently that now breaks.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (4.7 KiB)

We've had similar cases int he past where new releases glibc or compiler had new instructions breaking the TCG mode.
So I spawned a focal and a bionic image - and then remodeled them to run in emulation mode.

And indeed Bionic works without any crashes, so it must be something in the guest that isn't in Bionic yet. Focal OTOH crashes very similarly - so the userspace feature/instructions was added in 20.04. (BTW 20.04 feels more broken, e.g. it does not at all shut down anymore - but the crashes are just the same).

The focal crashes affect the same programs (rm, find) and look similar:

rm

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000727ed5063b0 in quotearg_buffer_restyled (buffer=buffer@entry=0x727ed530130 <slot0> "'", buffersize=buffersize@entry=256,
    arg=arg@entry=0x800314f727070000 <error: Cannot access memory at address 0x800314f727070000>, argsize=argsize@entry=18446744073709551615, quoting_style=shell_always_quoting_style,
    flags=flags@entry=1, quote_these_too=quote_these_too@entry=0x7fffdbb42770, left_quote=0x0, right_quote=0x0) at lib/quotearg.c:400
400 lib/quotearg.c: No such file or directory.
(gdb) bt
#0 0x00000727ed5063b0 in quotearg_buffer_restyled (buffer=buffer@entry=0x727ed530130 <slot0> "'", buffersize=buffersize@entry=256,
    arg=arg@entry=0x800314f727070000 <error: Cannot access memory at address 0x800314f727070000>, argsize=argsize@entry=18446744073709551615, quoting_style=shell_always_quoting_style,
    flags=flags@entry=1, quote_these_too=quote_these_too@entry=0x7fffdbb42770, left_quote=0x0, right_quote=0x0) at lib/quotearg.c:400
#1 0x00000727ed509144 in quotearg_n_options (options=0x7fffdbb42768, argsize=18446744073709551615, arg=0x800314f727070000 <error: Cannot access memory at address 0x800314f727070000>, n=0)
    at lib/quotearg.c:907
#2 quotearg_n_style (arg=0x800314f727070000 <error: Cannot access memory at address 0x800314f727070000>, s=shell_escape_always_quoting_style, n=0) at lib/quotearg.c:958
#3 quotearg_style (s=shell_escape_always_quoting_style, arg=0x800314f727070000 <error: Cannot access memory at address 0x800314f727070000>) at lib/quotearg.c:972
#4 0x00000727ed50377c in excise (ent=ent@entry=0x727f71415b0, x=x@entry=0x7fffdbb42bb8, is_dir=is_dir@entry=false, fts=<optimized out>) at src/remove.c:406
#5 0x00000727ed504644 in rm_fts (x=<optimized out>, ent=0x727f71415b0, fts=<optimized out>) at src/remove.c:548
#6 rm (file=<optimized out>, x=0x7fffdbb42bb8) at src/remove.c:607
#7 0x00000727ed502d98 in main (argc=<optimized out>, argv=<optimized out>) at src/rm.c:370

find
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __strlen_power8 () at ../sysdeps/powerpc/powerpc64/power8/strlen.S:36
36 ../sysdeps/powerpc/powerpc64/power8/strlen.S: No such file or directory.
(gdb) bt
#0 __strlen_power8 () at ../sysdeps/powerpc/powerpc64/power8/strlen.S:36
#1 0x00007e0f566971f8 in __vfprintf_internal (s=0x7e0f568417e8 <_IO_2_1_stdout_>, format=0x6b007417f10 "%s\n", ap=0x7ffff96f4690 "", mode_flags=<optimized out>) at vfprintf-internal.c:1688
#2 0x00007e0f56772750 in ___fprintf_chk (fp=<optimized out>, flag=<optimized out>, format=<optimized out>) a...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm, I can blindly read through the qemu commit log and try rebuilds or even a bisect at some point. But maybe this is already known. Since I'm soon on EOD I wonder if we could mirror this to IBM (now that we know a bit more) if this is a known issue to point us at an existing discussion or fix.
@Frank could you have a look at doing so?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (11.5 KiB)

I've collected the instructions that are around the failures and it is suspicious for
 lbzx / lvx / ld
All failures are around those:

Impish

ubuntu@i-systemdtest:/var/crash$ for i in */CoreDump; do p=$(dirname $i | tr -d '.0123456789' | tr '_' '/' ); gdb $p $i -q -ex "x/-10i \$pc" -ex "x/3i \$pc" -ex "q"; done
Reading symbols from /usr/bin/chgrp...
Reading symbols from /usr/lib/debug/.build-id/27/193823a3e38fa68e68ecfcb8bd98b2cb41d617.debug...
[New LWP 855]
Core was generated by `/bin/chgrp adm /var/log/dmesg '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000e0cb28f5190 in quotearg_buffer_restyled (buffer=buffer@entry=0xe0cb2920130 <slot0> "'", buffersize=buffersize@entry=256,
    arg=arg@entry=0x606513d00c0e0000 <error: Cannot access memory at address 0x606513d00c0e0000>, argsize=argsize@entry=18446744073709551615, quoting_style=shell_always_quoting_style,
    flags=flags@entry=1, quote_these_too=quote_these_too@entry=0x7fffd194aca0, left_quote=<optimized out>, right_quote=<optimized out>) at lib/quotearg.c:400
400 lib/quotearg.c: No such file or directory.
   0xe0cb28f5168 <quotearg_buffer_restyled+1256>: cmpdi r28,0
   0xe0cb28f516c <quotearg_buffer_restyled+1260>: cmpdi cr3,r22,-1
   0xe0cb28f5170 <quotearg_buffer_restyled+1264>: addi r9,r31,1
   0xe0cb28f5174 <quotearg_buffer_restyled+1268>: addi r29,r29,1
   0xe0cb28f5178 <quotearg_buffer_restyled+1272>: cmpld cr7,r9,r16
   0xe0cb28f517c <quotearg_buffer_restyled+1276>: addi r31,r9,1
   0xe0cb28f5180 <quotearg_buffer_restyled+1280>: blt cr7,0xe0cb28f51b0 <quotearg_buffer_restyled+1328>
   0xe0cb28f5184 <quotearg_buffer_restyled+1284>: bne 0xe0cb28f4eb0 <quotearg_buffer_restyled+560>
   0xe0cb28f5188 <quotearg_buffer_restyled+1288>: li r24,0
   0xe0cb28f518c <quotearg_buffer_restyled+1292>: bne cr3,0xe0cb28f4eb4 <quotearg_buffer_restyled+564>
=> 0xe0cb28f5190 <quotearg_buffer_restyled+1296>: lbzx r30,r23,r29
   0xe0cb28f5194 <quotearg_buffer_restyled+1300>: cntlzw r30,r30
   0xe0cb28f5198 <quotearg_buffer_restyled+1304>: rlwinm r30,r30,27,5,31
Reading symbols from /usr/bin/chmod...
Reading symbols from /usr/lib/debug/.build-id/4d/106a49cb4c9f6f8f4ab6db8ce5930efdec021e.debug...

warning: Can't open file /usr/lib/powerpc64le-linux-gnu/libc-2.31.so during file-backed mapping note processing

warning: Can't open file /usr/lib/powerpc64le-linux-gnu/ld-2.31.so during file-backed mapping note processing
[New LWP 2300]
Core was generated by `chmod 0711 /var/snap/lxd/common/lxd '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000147684c5050 in ?? ()
No line number information available for address 0x147684c504f
=> 0x147684c5050: lbzx r30,r24,r29
   0x147684c5054: cntlzw r30,r30
   0x147684c5058: rlwinm r30,r30,27,5,31
Reading symbols from /usr/bin/find...
Reading symbols from /usr/lib/debug/.build-id/3c/c37af50368782eb631f8c451e0830803c5e669.debug...
[New LWP 3876]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `find /var/lib/update-notifier/updates-available -newermt now-7 days '.
Program terminated with s...

Frank Heimes (fheimes)
tags: added: ppc64el reverse-proxy-bugzilla
Changed in ubuntu-power-systems:
assignee: nobody → bugproxy (bugproxy)
importance: Undecided → High
bugproxy (bugproxy)
tags: added: architecture-all bugnameltc-193617 severity-medium targetmilestone-inin---
bugproxy (bugproxy)
tags: added: architecture-ppc64le targetmilestone-inin2110
removed: architecture-all targetmilestone-inin---
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After some setup and preparation I was bisecting through v5.2.0..v6.0.0 to see which change exactly added this new behavior.

Summary:
 - build reduced qemu
 - build, configure and log-check errors will be skips
 - run the extracted systemd test
 - redirect serial to file
 - check log if the service failed
 - unlikely to be related, but to be sure I as using the same slof.bin on all runs
 - verified with self-built v5.2.0 (food) and v6.0.0 (bad)

The full log is here:
$ git bisect log
git bisect start
# new: [609d7596524ab204ccd71ef42c9eee4c7c338ea4] Update version for v6.0.0 release
git bisect new 609d7596524ab204ccd71ef42c9eee4c7c338ea4
# old: [553032db17440f8de011390e5a1cfddd13751b0b] Update version for v5.2.0 release
git bisect old 553032db17440f8de011390e5a1cfddd13751b0b
# new: [57206696058e3a8b1ee6ec8a129722baa6e7ec9f] hw/block/nvme: merge implicitly/explicitly opened processing masks
git bisect new 57206696058e3a8b1ee6ec8a129722baa6e7ec9f
# new: [4663b72a48fd540cbe16053b01d6839a95656440] intc/arm_gic: Fix gic_irq_signaling_enabled() for vCPUs
git bisect new 4663b72a48fd540cbe16053b01d6839a95656440
# new: [55810e90cc9a5ca18289c25aa5c1e0e2dc77eadb] ppc/spapr: cleanup -machine pseries,nvdimm=X handling
git bisect new 55810e90cc9a5ca18289c25aa5c1e0e2dc77eadb
# old: [cd08b80952657ab9f9ca5b8f374982d9c3159f4a] Replace HTTP links with HTTPS ones: documentation
git bisect old cd08b80952657ab9f9ca5b8f374982d9c3159f4a
# old: [5f3013654e879bb4b22876617fdb235aa22568d3] target/mips: Remove unused headers from op_helper.c
git bisect old 5f3013654e879bb4b22876617fdb235aa22568d3
# new: [ab32b78cd1b3b31950c4332f0fa8b192295d77fd] audio: Simplify audio_bug() removing old code
git bisect new ab32b78cd1b3b31950c4332f0fa8b192295d77fd
# new: [711dfb2423a85fba91884b5a90a8c0712c152ce2] hw/ppc: Do not re-read the clock on pre_save if doing savevm
git bisect new 711dfb2423a85fba91884b5a90a8c0712c152ce2
# old: [cf77e27f1e16d8eca761e5c966ea80291254a32c] target/ppc: replaced the TODO with LOG_UNIMP and add break for silence warnings
git bisect old cf77e27f1e16d8eca761e5c966ea80291254a32c
# new: [728aa6f6ffd8cf21d9485eff7ac1926b00155ed2] hw/ppc/spapr_tpm_proxy: Fix hexadecimal format string specifier
git bisect new 728aa6f6ffd8cf21d9485eff7ac1926b00155ed2
# old: [132954a8307635c08c8e3c4d63d4af9fa865a459] ppc/translate: Turn the helper macros into functions
git bisect old 132954a8307635c08c8e3c4d63d4af9fa865a459
# old: [91699dbf30a94dea2575ae193412c364c7f3a5fd] ppc/translate: Raise exceptions after setting the cc
git bisect old 91699dbf30a94dea2575ae193412c364c7f3a5fd
# new: [bcb0b7b1a1c05707304f80ca6f523d557816f85c] ppc/translate: Rewrite gen_lxvdsx to use gvec primitives
git bisect new bcb0b7b1a1c05707304f80ca6f523d557816f85c
# first new commit: [bcb0b7b1a1c05707304f80ca6f523d557816f85c] ppc/translate: Rewrite gen_lxvdsx to use gvec primitives

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

And that matches:

Offending commit:

bcb0b7b1a1c05707304f80ca6f523d557816f85c is the first new commit
commit bcb0b7b1a1c05707304f80ca6f523d557816f85c
Author: Giuseppe Musacchio <email address hidden>
Date: Mon Nov 9 10:17:11 2020 +0100

    ppc/translate: Rewrite gen_lxvdsx to use gvec primitives

    Make the implementation match the lxvwsx one.
    The code is now shorter smaller and potentially faster as the
    translation will use the host SIMD capabilities if available.

    No functional change.

    Signed-off-by: Giuseppe Musacchio <email address hidden>
    Message-Id: <a463dea379da4cb3a22de49c678932f74fb15dd7<email address hidden>>
    Reviewed-by: Richard Henderson <email address hidden>
    Signed-off-by: David Gibson <email address hidden>

 target/ppc/translate/vsx-impl.c.inc | 46 ++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

And if you remember the backtraces above, it was lbzx / lvx / ld that failed which could be implemented that way or conflict the the actual lxvdsx.

And - as always - once you know what you look for it is easier to identify the fix.
I think this one is worth a try:

commit 861f10fd528263a507476b8c4dda93a9588dfa5c
Author: Giuseppe Musacchio <email address hidden>
Date: Tue May 18 15:30:20 2021 +0200

    target/ppc: Fix load endianness for lxvwsx/lxvdsx

    TARGET_WORDS_BIGENDIAN may not match the machine endianness if that's a
    runtime-configurable parameter.

    Fixes: bcb0b7b1a1c05707304f80ca6f523d557816f85c
    Fixes: afae37d98ae991c0792c867dbd9f32f988044318
    Resolves: https://gitlab.com/qemu-project/qemu/-/issues/212

    Signed-off-by: Giuseppe Musacchio <email address hidden>
    Message-Id: <email address hidden>
    Tested-by: Paul A. Clarke <email address hidden>
    Signed-off-by: David Gibson <email address hidden>

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, that fix works fine in a test.
I'm uploading this for a full test and hope that there are no further issues with it that were hidden behind it so far.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, now that we know the root cause we can mark systemd as invalid and the ppc project as fixed.
The qemu task will be fixed once the upload migrates - but currently the test queue is rather full.

Changed in systemd (Ubuntu):
status: New → Invalid
Changed in ubuntu-power-systems:
status: New → Fix Released
Changed in qemu (Ubuntu):
status: New → Fix Committed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

While some other flaky tests still hold it back the issue is fixed.
Here the PPC test of the new upload:
https://autopkgtest.ubuntu.com/results/autopkgtest-impish/impish/ppc64el/s/systemd/20210714_130923_091d6@/log.gz

Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (6.0 KiB)

This bug was fixed in the package qemu - 1:6.0+dfsg-1~ubuntu3

---------------
qemu (1:6.0+dfsg-1~ubuntu3) impish; urgency=medium

  * d/p/u/lp-1935617-target-ppc-Fix-load-endianness-for-lxvwsx-lxvdsx.patch:
    fix TCG emulation for ppc64 (LP: #1935617)

qemu (1:6.0+dfsg-1~ubuntu2) impish; urgency=medium

  * d/control: remove fuse2 trial-build (LP 1934510)

qemu (1:6.0+dfsg-1~ubuntu1) impish; urgency=medium

  * Merge with Debian experimental, Among many other things this fixes LP Bugs:
    (LP: #1907952) broken arrow keys in -display gtk on aarch64
    - qemu-kvm to systemd unit
      - d/qemu-kvm-init: script for QEMU KVM preparation modules, ksm,
        hugepages and architecture specifics
      - d/qemu-system-common.qemu-kvm.service: systemd unit to call
        qemu-kvm-init
      - d/qemu-system-common.install: install helper script
      - d/qemu-system-common.qemu-kvm.default: defaults for
        /etc/default/qemu-kvm
      - d/rules: call dh_installinit and dh_installsystemd for qemu-kvm
    - Distribution specific machine type
      (LP: 1304107 1621042 1776189 1761372 1761372 1776189)
      - d/p/ubuntu/define-ubuntu-machine-types.patch: define distro machine
        types containing release versioned machine attributes
      - d/qemu-system-x86.NEWS Info on fixed machine type defintions
        for host-phys-bits=true
      - Add an info about -hpb machine type in debian/qemu-system-x86.NEWS
      - ubuntu-q35 alias added to auto-select the most recent q35 ubuntu type
    - Enable nesting by default
      - d/p/ubuntu/enable-svm-by-default.patch: Enable nested svm by default
        in qemu64 on amd
        [ No more strictly needed, but required for backward compatibility ]
    - improved dependencies
      - Make qemu-system-common depend on qemu-block-extra
      - Make qemu-utils depend on qemu-block-extra
      - Let qemu-utils recommend sharutils
    - tolerate ipxe size change on migrations to >=18.04 (LP: 1713490)
      - d/p/ubuntu/pre-bionic-256k-ipxe-efi-roms.patch: old machine types
        reference 256k path
      - d/control-in: depend on ipxe-qemu-256k-compat-efi-roms to be able to
        handle incoming migrations from former releases.
    - d/control-in: Disable capstone disassembler library support (universe)
    - d/qemu-system-x86.README.Debian: add info about updated nesting changes
    - d/control*, d/rules: disable xen by default, but provide universe
      package qemu-system-x86-xen as alternative
      [includes compat links changes of 5.0-5ubuntu4]
    - Fix upgrade module handling (LP 1905377)
      --enable-module-upgrades for qemu-xen which doesn't exist in Debian
  * Dropped Changes [in 6.0]:
    - d/p/ubuntu/lp-1907789-build-no-pie-is-no-functional-liker-flag.patch: fix
      ld usage of -no-pie (LP 1907789)
    - d/p/u/lp-1916230-hw-s390x-fix-build-for-virtio-9p-ccw.patch: fix
      virtio-9p-ccw being missing (LP 1916230)
    - d/p/u/lp-1916705-disas-Fix-build-with-glib2.0-2.67.3.patch: Fix FTFBS due
      to glib2.0 >=2.67.3 (LP 1916705)
    - d/p/u/lp-1921754*: add EPYC-Rome-v2 as v1 missed IBRS and thereby fails
      on some HW/Guest combinations e.g. Windows 10 on Threadripper chips
  ...

Read more...

Changed in qemu (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Still the systemd tests are very flaky.
I went through some re-runs and these last 7 tests all did test the very same packages.

impish
  amd64
    boot-and-services (F 14% f 0% S 0% B 0% => P 85%/) ...F...
    systemd-fsckd (F 0% f 0% S 100% B 0% => P 0%/) SSSSSSS
    upstream-1 (F 42% f 0% S 0% B 0% => P 57%/) ....FFF
    upstream-2 (F 14% f 0% S 0% B 0% => P 85%/) ....F..

We see that the last three all were happy.
But formerly without any reason other than being flaky one (upstream-1) or the other (upstream-2) or another (boot-and-services) failed.
If only we could re-trigger subtests only that would be much less wasted test-time :-/

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.