upstream-simulation-test-suite test fails on s390x with LTO turned on in dpkg

Bug #1921377 reported by Łukasz Zemczak
This bug affects 1 person
Affects: chrony (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Christian Ehrhardt
Milestone: (none listed)

Bug Description

The server team is already investigating, but I am filing a bug for tracking.

With the new dpkg (1.20.7.1ubuntu4), the chrony s390x autopkgtest fails, more specifically the upstream-simulation-test-suite:

108-peer .................... PASS
109-makestep .................... PASS
110-chronyc xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
xclknetsim failed
clknetsim failed
xxclknetsim failed
clknetsim failed
xxclknetsim failed
xclknetsim failed
xclknetsim failed
clknetsim failed
xx FAIL
111-knownclient clknetsim failed
.................... PASS
112-port .................... PASS

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute/hirsute/s390x/c/chrony/20210325_093733_d4718@/log.gz

Passes fine against release.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since this is triggered by the new dpkg: in the log, exactly at the failing
one of the 4 autopkgtests that chrony has, we see:
 Can't exec "gcc": No such file or directory at /usr/share/perl5/Dpkg/Arch.pm line 126.

=> Theory 0: dpkg changed behavior

I've found that older good runs had that as well, so this is ruled out as a red herring.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reproduced in an s390x LXD container:

$ apt update
$ apt build-dep chrony
$ apt source chrony
$ cd chrony-4.0
$ ./debian/rules build
$ mkdir /tmp/mytest/
$ export AUTOPKGTEST_TMP=/tmp/mytest/
$ export CLKNETSIM_PATH=/tmp/mytest/
$ cd test/simulation
$ ./110-chronyc

Chrony is not installed from the package for the test.

The binaries used are from the build that autopkgtest does due to the
"build-needed" flag.

clknetsim OTOH just uses make, not buildflags from dpkg

Before the new dpkg upload that sets LTO as default this was fine.

=> Theory I: LTO from new dpkg now takes place on that autopkgtest-triggered build and breaks things

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I broke down the tests and got to what it actually runs.
The one actually failing does:

LD_PRELOAD=/tmp/mytest//clknetsim.so CLKNETSIM_NODE=1 CLKNETSIM_SOCKET=tmp/sock chronyd -d -f tmp/conf.1 &
LD_PRELOAD=/tmp/mytest//clknetsim.so CLKNETSIM_NODE=2 CLKNETSIM_SOCKET=tmp/sock chronyc -h node1.net1.clk -m dns -n dns +n dns -4 dns -6 dns -46 timeout 200 retries 1 keygen keygen 10 MD5 128 keygen 11 MD5 40 help quit nosuchcommand &
 /tmp/mytest//clknetsim -o tmp/log.offset -f tmp/log.freq -p tmp/log.packets -R 1 -r 210 -l 1 -s tmp/sock tmp/conf 2

The clients try to connect, and do so once the server starts. What I
then get is:

Running simulation...
python3: client.c:2279: syscall: Assertion `0' failed.
python3: client.c:2279: syscall: Assertion `0' failed.
client 1 failed.
failed

So is it seccomp-based syscall filtering? We have had that in the past in
different cases.
=> Theory II: seccomp syscall filtering

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

dmesg also throws some disturbing messages:

[2679323.928166] User process fault: interruption code 003b ilc:2 in libc-2.33.so[3ff8d500000+1c5000]
[2679323.928179] Failing address: 00000000d1230000 TEID: 00000000d1230800
[2679323.928181] Fault in primary space mode while using user ASCE.
[2679323.928183] AS:00000007cc2e01c7 R3:0000000000000024
[2679323.928188] CPU: 12 PID: 4097677 Comm: chronyc Tainted: P O 5.4.0-66-generic #74-Ubuntu
[2679323.928189] Hardware name: IBM 2964 N63 400 (LPAR)
[2679323.928191] User PSW : 0705100180000000 000003ff8d634620
[2679323.928192] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:1 PM:0 RI:0 EA:3
[2679323.928194] User GPRS: 0000000000000000 0000000000000015 00000000ffffffff 0000000000000200
[2679323.928195] 0000000000000001 0000000000000014 00000000d1230123 000003fff907cac0
[2679323.928196] 0000000000000014 0000000000000014 00000000d1230123 0000000000000001
[2679323.928197] 000003ff8daacf98 000002aa398971f0 000002aa39887efc 000003fff907c960
[2679323.928204] User Code: 000003ff8d63460e: ec1200762065 clgrj %r1,%r2,2,000003ff8d6346fa
                            000003ff8d634614: ec93007f2065 clgrj %r9,%r3,2,000003ff8d634712
                           #000003ff8d63461a: ec980050007c cgij %r9,0,8,000003ff8d6346ba
                           >000003ff8d634620: 9180a002 tm 2(%r10),128
                            000003ff8d634624: a7740025 brc 7,000003ff8d63466e
                            000003ff8d634628: b24f0060 ear %r6,%a0
                            000003ff8d63462c: e310a0880004 lg %r1,136(%r10)
                            000003ff8d634632: eb660020000d sllg %r6,%r6,32
[2679323.928217] Last Breaking-Event-Address:
[2679323.928221] [<000002aa39883214>] 0x2aa39883214

Triggered by the very same test, but NOT when I just run the three commands
I identified. It really seems my manual commands block on these syscall things
before I reach the actual issue.
Note: this is a Hirsute container on a Focal host. There might be issues
with the new glibc using new features that the older kernel can't handle :-/
But the tests on the autopkgtest Infra happen on a current Hirsute kernel.

=> Theory III: Broken by the new glibc 2.33 (as we had other such cases)

But why is it then only happening when testing the new dpkg?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Differentiating I/II/III

The syscall assertion only happens in manual runs. My environment and start
commands might not yet be perfect. But what surely hits exactly when the real
test fails is the User process fault.
On x86, where the test works fine, these syscall messages also appear, so
"Theory II (syscall)" might be a red herring and will be ignored for now.

Whether it is related to LTO can be tested via d/rules by adding
  export DEB_BUILD_MAINT_OPTIONS=hardening=+all optimize=-lto
I've built chrony once with that applied and retried the test.
In that case the test works just fine \o/.
So indeed there is an issue with LTO; with it disabled, the user space crash no longer occurs.
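
To double check that a build really ran without LTO, one can look at the compiler flags it used. Below is a minimal sketch; has_lto is a hypothetical helper name (not part of dpkg or chrony), and the flag strings in the demo calls are made up for illustration.

```shell
# has_lto: hypothetical helper, prints "yes" if a compiler flag string
# contains an LTO option (-flto or -flto=...), and "no" otherwise.
has_lto() {
    case " $1 " in
        *" -flto "*|*" -flto="*) echo yes ;;
        *) echo no ;;
    esac
}

# Demo with literal flag strings (illustrative values only):
has_lto "-g -O2 -flto=auto -ffat-lto-objects"   # prints yes
has_lto "-g -O2"                                # prints no

# On a system with dpkg-dev installed, the real flags could be checked with:
#   has_lto "$(dpkg-buildflags --get CFLAGS)"
```

The same check could be applied to the flags seen in a build log, assuming the log shows the full compiler command lines.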

We will need:
- an upload of chrony that avoids building it with LTO
- an upstream report about the problem so that they can look into what
  might fail

Changed in chrony (Ubuntu):
status: New → Triaged
assignee: nobody → Christian Ehrhardt  (paelzer)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Builds and Tests are happy now.

And FYI: reported to upstream https://<email address hidden>/msg02366.html
It was hit on Fedora as well, where LTO was disabled for the time being.

PPA with the change: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4500
MP: https://code.launchpad.net/~paelzer/ubuntu/+source/chrony/+git/chrony/+merge/400182
Test ran against that PPA on s390x: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute-ci-train-ppa-service-4500/hirsute/s390x/c/chrony/20210325_142251_84acb@/log.gz

LGTM, it would be ready to upload ... but we are on it ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I was working with mlichvar (awesome upstream btw!) on getting a bit further on this.

Via valgrind logs of the failing case:
https://paste.ubuntu.com/p/nbB85C4Dp4/
https://paste.ubuntu.com/p/8Xf85cYBv8/

He spotted an issue with the fortified fread, which clknetsim needs to intercept.
The fix is then:
https://paste.ubuntu.com/p/rgGGfsdpcG/

Therefore let me test and upload that instead :-)
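
As background: with fortification enabled, glibc can route fread calls through __fread_chk, so an LD_PRELOAD library that only wraps fread may miss them. A rough way to see whether a binary is affected is to look for that symbol among its undefined dynamic symbols. This is only a sketch; uses_fortified_fread is a hypothetical helper name, and the symbol listings in the demo are made up rather than taken from the failing build.

```shell
# uses_fortified_fread: hypothetical helper, reads a symbol listing
# (for example from nm) on stdin and prints "yes" if it mentions
# __fread_chk, "no" otherwise.
uses_fortified_fread() {
    if grep -q '__fread_chk'; then
        echo yes
    else
        echo no
    fi
}

# Demo with made-up symbol listings:
printf ' U __fread_chk\n U fopen\n' | uses_fortified_fread   # prints yes
printf ' U fread\n U fopen\n' | uses_fortified_fread         # prints no

# Against a real binary (path is an assumption) it could be used as:
#   nm -D --undefined-only ./chronyc | uses_fortified_fread
```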

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

All tests and reviews good, uploaded to Hirsute.
Once available in proposed we can trigger the dpkg test against that and it will work.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package chrony - 4.0-5ubuntu3

---------------
chrony (4.0-5ubuntu3) hirsute; urgency=medium

  * d/t/upstream-simulation-test-suite: Update clknetsim version to fix
    a test failure on s390x when LTO is enabled at build time (LP: #1921377)

 -- Christian Ehrhardt <email address hidden> Thu, 25 Mar 2021 15:45:47 +0100

Changed in chrony (Ubuntu):
status: Triaged → Fix Released