Merge ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix into ubuntu/+source/qemu:ubuntu/bionic-devel

Proposed by Rafael David Tinoco
Status: Merged
Approved by: Rafael David Tinoco
Approved revision: 5b3022ea10386b4c68679384dce8a8666cec3660
Merged at revision: 5b3022ea10386b4c68679384dce8a8666cec3660
Proposed branch: ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic-refix
Merge into: ubuntu/+source/qemu:ubuntu/bionic-devel
Diff against target: 212 lines (+190/-0)
3 files modified
debian/changelog (+8/-0)
debian/patches/series (+1/-0)
debian/patches/ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch (+181/-0)
Reviewer Review Type Date Requested Status
Rafael David Tinoco (community) Approve
dann frazier (community) Approve
Christian Ehrhardt  (community) Needs Information
Canonical Server Pending
Review via email: mp+387269@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Christian,

This is a very clean backport of the commit:

commit 5710a3e09f
Author: Paolo Bonzini <email address hidden>
Date: Tue Apr 7 14:07:46 2020

    async: use explicit memory barriers

...

main responsible for fixing the Aarch64 race window we've seen.

After thinking a bit about the patches I had backported... I decided to "KISS" so I have kept current AIO logic intact - with existent, or not, BUGS - but did implement the stronger atomics, needed by Aarch64 and the extra barriers.

I've made some tests with and without i/o threads and it worked well.

Suggestion for next steps:

- Do a full battery of tests using your regression tests
  (including the io-threaded test this time)

- Try to reproduce the Aarch64 issue (from LP: #1805256) without this patch
  PROBLEM: I dont think we ever did reproduce this with Bionic's version.

- Decide (based on reproducer - or not) if you want to go on with this SRU.

Let me know your thoughts, please.

I'm uploading to the PPA now...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I think Ike was able to reproduce this with Bionic - you might ping/subscribe him for that.

In general the reduced backport might be more suited for the SRU anyway. So I like that approach in general.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

@ike,

would you mind reviewing this change ?

@paelzer thinks you were able to reproduce the issue in Bionic... I honestly don't remember if you did. I'm willing to provision an arm64 machine in our MAAS if you can't review or reproduce this (or does not remind).

Best,

-rafaeldtinoco

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Generally ok with me, the patch seems small and reasonable now.
And going forward this was the right change, but since the issue only ever occurred/reproduced/matched on armhf/arm64 I wonder a bit.

I know that Paolo has gone to length to improve the relax the writes.
But for the SRU I think this would be even safer if you'd throw in "if arch = arm* as preprocessor macros to all of these changes.

That way the regression risk would reduce a lot while giving the fix to the arch where it is needed.

I'll mark it as "need info" as this is more starting a discussion than anything else.

review: Needs Information
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Like verbally said already, I do agree with the ifdefs and had thought that myself also. The issue was never seen outside Aarch64 processor family based servers, so it makes sense to confine it to that architecture for the SRU, reducing risks.

I'll change the patch and update the BUG SRU template (regarding regressions risks and that measure).

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay, from all ideas I had.. this one is the less intrusive, most easy to manage and one without introducing anything that QEMU has not already have.

#ifndef TARGET_ARM

depends on me removing the gcc poisoning for "TARGET_ARM" definition. QEMU poisons TARGET_ARM just to make sure that there aren't any other pars in machine independent code taking decisions based on the target architecture. Our case was exactly that.. to take a decision based on target architecture inside the mach-independent code (AIO)..

I'm now sending this to the PPA to be tested for build and executions. Dan Frazier provided a good feedback on this change for the original case.

Thanks for reviewing/testing this o/

Revision history for this message
dann frazier (dannf) wrote :

Since I'm not familiar with TARGET_ARM, I did a quick check on arm64 to confirm it DTRT. I added the following code to util/async.c:

#ifdef TARGET_ARM
#error "TARGET_ARM DEFINED, YAY!"
#else
#error "ERROR: TARGET_ARM *NOT* DEFINED!"
#endif

And the build failed with:
/home/ubuntu/qemu-2.11+dfsg/util/async.c: In function ‘aio_ctx_prepare’:
/home/ubuntu/qemu-2.11+dfsg/util/async.c:227:2: error: #error "ERROR: TARGET_ARM *NOT* DEFINED!"
 #error "ERROR: TARGET_ARM *NOT* DEFINED!"
  ^~~~~

So is TARGET_ARM really what we want? Should we use the GCC defs instead - i.e. "#ifdef __aarch64__" - or "#if defined(__aarch64__) || defined(__arm__)", if we also want it to apply to armhf?

review: Needs Fixing
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Yes @dann! Thanks for checking that, it would be my next step before @paelzer was back... I was afraid this was never defined, but wanted to stick to what QEMU already had... let me get back to you today/tomorrow with this.. I'll make sure it is defined when it is supposed to before pushing it again...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Right,

so TARGET_ARM was only defined in some parts of the build and wasn't appropriate for async code (also because it would include armhf and that wasn't needed based in our observations).

Checking how QEMU used predefs for arch recognition, I saw the preferred way was #if defined(__arch__) so I'm using the #if !defined(__arch64__) version here.

Tested the predef in an aarch64 env and it works.

$ ./temp
aarch64

--
int
main(int argc, char *argv)
{
#if !defined(__aarch64__)
    printf("other\n");
#else
    printf("aarch64\n");
#endif
}
--

Submitted package to PPA again.

Thanks for the feedback @dannf

Revision history for this message
dann frazier (dannf) wrote :

On Tue, Jul 21, 2020 at 10:40 PM Rafael David Tinoco
<email address hidden> wrote:
>
> Right,
>
> so TARGET_ARM was only defined in some parts of the build and wasn't appropriate for async code (also because it would include armhf and that wasn't needed based in our observations).

I don't know that we actually did any direct testing on armhf.
Unfortunately we don't have any hw in the lab that both reproduces
this problem and supports 32-bit. But I spun up a Graviton2 metal
instance, and found that it does both. I'm running the test case in a
loop in an armhf chroot to see if it can be reproduced there.

  -dann

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Oh, in the beginning I did test with:

Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 12
NUMA node(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
BogoMIPS: 200.00
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 3 MiB
L3 cache: 4 MiB
NUMA node0 CPU(s): 0-23
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

and using armhf guests from it.. but iirc the issue only happened in machines with more than 1 socket (likely because of internal - to CPU - cache invalidation protocols and the barrier instructions enforcing). All our aarch64 machines had more than 1 socket but did not support aarch32 kvm mode.

Let me know your findings! Thanks!

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

@dann,

Could you +1 or -1 this after the last change ?

@paelzer,

This is "done" from my POV unless @dann points out something else we should fix or change.

Thanks to both of you

Revision history for this message
dann frazier (dannf) wrote :

@rafaeldtinoco: Testing all looks good on arm64/armhf. Can you change the name of the patch in debian/changelog to match the new name of the patch? It is missing the -arm-only suffix. With that, I'm +1. But, if you haven't already, I'd recommend testing w/ unifdef just to prove that the !aarch64 code is exactly what it was before.

btw, though I understand the reasoning, I'm personally not a fan of the #ifdef approach as that may cause maintenance headaches in upcoming years. If the submitter of bug 1885419 were to be kind enough to verify a non-ifdef'd version in his environment, would that give you enough confidence to ship it? Or is the concern about not-yet-uncovered regressions?

Thanks for continuing to stick with this nasty issue for so long Rafael!

 -dann

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

> @rafaeldtinoco: Testing all looks good on arm64/armhf. Can you change the name of the patch in debian/changelog to match the new name of the patch? It is missing the -arm-only suffix. With that, I'm +1. But, if you haven't already, I'd recommend testing w/ unifdef just to prove that the !aarch64 code is exactly what it was before.

I'll ask to @paelzer for a full set of regression tests from the package in the PPA so that will cover what you asked.

> btw, though I understand the reasoning, I'm personally not a fan of the #ifdef approach as that may cause maintenance headaches in upcoming years. If the submitter of bug 1885419 were to be kind enough to verify a non-ifdef'd version in his environment, would that give you enough confidence to ship it? Or is the concern about not-yet-uncovered regressions?

Based on the history: just seen in aarch64, difficult to reproduce (same cacheline, multiple cpus, aarch64), difficult to fix without performance impact (thus the long study from Paolo about using the minimum required mem ordering instructions), and my previous mistake in the SRU attempt.. and I do agree that #ifdef approach isn't great... Christian and I thought it was the safest AND something that would definitely raise the chances for the SRU to be accepted.

> Thanks for continuing to stick with this nasty issue for so long Rafael!

My pleasure! It became a question of honor #). Cheers o/

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Forgot to thank for the changelog and git log patch renaming warning, I'm pushing them fixed but that won't change binary code in the PPA in any way.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

@paelzer,

I guess this leaves it with you now. I'd like to request a full set of tests for the qemu binary packages in given PPA (https://launchpad.net/~rafaeldtinoco/+archive/ubuntu/lp1805256-bionic), specially amd64 and arm64 to check if both pass in the migration tests with no issues.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Alright, I had to re-push the changes rebased again in pkg/ubuntu/bionic-devel because qemu was likely re-imported as usd-importer migrates to new git-ubuntu. I think its all good now.

Revision history for this message
dann frazier (dannf) wrote :

On Mon, Jul 27, 2020 at 7:46 PM Rafael David Tinoco
<email address hidden> wrote:
>
> > @rafaeldtinoco: Testing all looks good on arm64/armhf. Can you change the name of the patch in debian/changelog to match the new name of the patch? It is missing the -arm-only suffix. With that, I'm +1. But, if you haven't already, I'd recommend testing w/ unifdef just to prove that the !aarch64 code is exactly what it was before.
>
> I'll ask to @paelzer for a full set of regression tests from the package in the PPA so that will cover what you asked.

I think it's still a good idea to the unifdef test in addition to
regression tests. I went ahead and did this:

$ for file in aio-posix.c aio-win32.c async.c; do \
  unifdef -U__aarch64__ qemu-2.11+dfsg-1ubuntu7.30/util/${file} > \
    qemu-2.11+dfsg-1ubuntu7.30/util/${file}.undef__aarch64__;
  diff -u qemu-2.11+dfsg-1ubuntu7.29/util/${file} \
    qemu-2.11+dfsg-1ubuntu7.30/util/${file}.undef__aarch64__;
done
--- qemu-2.11+dfsg-1ubuntu7.29/util/async.c 2017-12-13 10:27:20.000000000 -0700
+++ qemu-2.11+dfsg-1ubuntu7.30/util/async.c.undef__aarch64__
2020-07-28 18:24:12.870325172 -0600
@@ -338,6 +338,7 @@
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
     smp_mb();
+
     if (ctx->notify_me) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);

Looks good!

 -dann

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

unifdef is new to me =), sorry for not replying back about that specific item. I do code in Eclipse-CDT, so I usually rely in the IDE's functionality and linter for such checks. Thanks for checking it!

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

BTW @dann, could you +1 if you're good ? Just so it does not appear as -1 in our queue.

@paelzer, ball is with you on this, I suppose.

Revision history for this message
dann frazier (dannf) wrote :

+1 :)

review: Approve
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

After speaking with @paelzer, we agreed I could push and upload this SRU. Thanks @dannf for reviewing and working 4 hands with me.

(c)rafaeldtinoco@bionic:~/.../sources/ubuntu/qemu$ git ubuntu tag --upload

(c)rafaeldtinoco@bionic:~/.../sources/ubuntu/qemu$ git describe --tags
upload/1%2.11+dfsg-1ubuntu7.30

$ git push pkg upload/1%2.11+dfsg-1ubuntu7.30
Counting objects: 12, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (12/12), 3.48 KiB | 162.00 KiB/s, done.
Total 12 (delta 8), reused 5 (delta 2)
To ssh://git.launchpad.net/ubuntu/+source/qemu
 * [new tag] upload/1%2.11+dfsg-1ubuntu7.30 -> upload/1%2.11+dfsg-1ubuntu7.30

$ debdiff qemu*.dsc 2>&1 | diffstat
 changelog | 8 +
 patches/series | 1
 patches/ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch | 181 +++++++++++++++++++++++
 3 files changed, 190 insertions(+)

$ dput ubuntu qemu_2.11+dfsg-1ubuntu7.30_source.changes
...
Uploading to ubuntu (via ftp to upload.ubuntu.com):
  Uploading qemu_2.11+dfsg-1ubuntu7.30.dsc: done.
  Uploading qemu_2.11+dfsg-1ubuntu7.30.debian.tar.xz: done.
  Uploading qemu_2.11+dfsg-1ubuntu7.30_source.buildinfo: done.
  Uploading qemu_2.11+dfsg-1ubuntu7.30_source.changes: done.
Successfully uploaded packages.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/debian/changelog b/debian/changelog
2index 4d61c58..a82b170 100644
3--- a/debian/changelog
4+++ b/debian/changelog
5@@ -1,3 +1,11 @@
6+qemu (1:2.11+dfsg-1ubuntu7.30) bionic; urgency=medium
7+
8+ * d/p/ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch:
9+ - More conservative and less intrusive approach of the Aarch64 AIO
10+ race window fix. Contained to Aarch64 builds only. (LP: #1805256)
11+
12+ -- Rafael David Tinoco <rafaeldtinoco@ubuntu.com> Mon, 20 Jul 2020 11:48:06 +0000
13+
14 qemu (1:2.11+dfsg-1ubuntu7.29) bionic; urgency=medium
15
16 * allow vhost-user driver to ignore some unneeded mem regions,
17diff --git a/debian/patches/series b/debian/patches/series
18index d025219..c237eb7 100644
19--- a/debian/patches/series
20+++ b/debian/patches/series
21@@ -147,3 +147,4 @@ CVE-2019-20382.patch
22 CVE-2020-1983.patch
23 lp1887525/0001-vhost-fix-memslot-limit-check.patch
24 lp1887525/0002-vhost-allow-backends-to-filter-memory-sections.patch
25+ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch
26diff --git a/debian/patches/ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch b/debian/patches/ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch
27new file mode 100644
28index 0000000..07d16dc
29--- /dev/null
30+++ b/debian/patches/ubuntu/lp-1805256-async-use-explicit-mem-barriers-arm-only.patch
31@@ -0,0 +1,181 @@
32+Description: async: use explicit memory barriers
33+
34+When using C11 atomics, non-seqcst reads and writes do not participate
35+in the total order of seqcst operations. In util/async.c and util/aio-posix.c,
36+in particular, the pattern that we use
37+
38+ write ctx->notify_me write bh->scheduled
39+ read bh->scheduled read ctx->notify_me
40+ if !bh->scheduled, sleep if ctx->notify_me, notify
41+
42+needs to use seqcst operations for both the write and the read. In
43+general this is something that we do not want, because there can be
44+many sources that are polled in addition to bottom halves. The
45+alternative is to place a seqcst memory barrier between the write
46+and the read. This also comes with a disadvantage, in that the
47+memory barrier is implicit on strongly-ordered architectures and
48+it wastes a few dozen clock cycles.
49+
50+Fortunately, ctx->notify_me is never written concurrently by two
51+threads, so we can assert that and relax the writes to ctx->notify_me.
52+The resulting solution works and performs well on both aarch64 and x86.
53+
54+Note that the atomic_set/atomic_read combination is not an atomic
55+read-modify-write, and therefore it is even weaker than C11 ATOMIC_RELAXED;
56+on x86, ATOMIC_RELAXED compiles to a locked operation.
57+
58+ [Backport Note]
59+
60+ Backported ONLY atomics and barriers from upstream commit.
61+
62+ Note: AIO is a sensitive part of QEMU and we decided not to apply
63+ this SRU to all architectures, reducing chances of regressions to
64+ anything other than Aarch64 architecture, affected by the original
65+ issue.
66+
67+Analyzed-by: Ying Fang <fangying1@huawei.com>
68+Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
69+Tested-by: Ying Fang <fangying1@huawei.com>
70+Message-Id: <20200407140746.8041-6-pbonzini@redhat.com>
71+Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
72+[backported from upstream 5710a3e09f9b85801e5ce70797a4a511e5fc9e2c]
73+Signed-off-by: Rafael David Tinoco <rafaeldtinoco@ubuntu.com>
74+---
75+ util/aio-posix.c | 17 +++++++++++++++++
76+ util/aio-win32.c | 17 +++++++++++++++++
77+ util/async.c | 28 ++++++++++++++++++++++++++++
78+ 3 files changed, 62 insertions(+)
79+
80+diff --git a/util/aio-posix.c b/util/aio-posix.c
81+index 1427f49b4..721d69c8f 100644
82+--- a/util/aio-posix.c
83++++ b/util/aio-posix.c
84+@@ -590,7 +590,18 @@ bool aio_poll(AioContext *ctx, bool blocking)
85+ * so disable the optimization now.
86+ */
87+ if (blocking) {
88++#if !defined(__aarch64__)
89+ atomic_add(&ctx->notify_me, 2);
90++#else
91++ /* LP: #1805256 - ARM Bionic SRU */
92++ atomic_set(&ctx->notify_me, atomic_read(&ctx->notify_me) + 2);
93++ /*
94++ * Write ctx->notify_me before computing the timeout
95++ * (reading bottom half flags, etc.). Pairs with
96++ * smp_mb in aio_notify().
97++ */
98++ smp_mb();
99++#endif
100+ }
101+
102+ qemu_lockcnt_inc(&ctx->list_lock);
103+@@ -631,7 +642,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
104+ }
105+
106+ if (blocking) {
107++#if !defined(__aarch64__)
108+ atomic_sub(&ctx->notify_me, 2);
109++#else
110++ /* LP: #1805256 - ARM Bionic SRU */
111++ /* Finish the poll before clearing the flag. */
112++ atomic_store_release(&ctx->notify_me, atomic_read(&ctx->notify_me) - 2);
113++#endif
114+ }
115+
116+ /* Adjust polling time */
117+diff --git a/util/aio-win32.c b/util/aio-win32.c
118+index d6d5e02f0..49ba72918 100644
119+--- a/util/aio-win32.c
120++++ b/util/aio-win32.c
121+@@ -340,7 +340,18 @@ bool aio_poll(AioContext *ctx, bool blocking)
122+ * so disable the optimization now.
123+ */
124+ if (blocking) {
125++#if !defined(__aarch64__)
126+ atomic_add(&ctx->notify_me, 2);
127++#else
128++ /* LP: #1805256 - ARM Bionic SRU */
129++ atomic_set(&ctx->notify_me, atomic_read(&ctx->notify_me) + 2);
130++ /*
131++ * Write ctx->notify_me before computing the timeout
132++ * (reading bottom half flags, etc.). Pairs with
133++ * smp_mb in aio_notify().
134++ */
135++ smp_mb();
136++#endif
137+ }
138+
139+ qemu_lockcnt_inc(&ctx->list_lock);
140+@@ -373,7 +384,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
141+ ret = WaitForMultipleObjects(count, events, FALSE, timeout);
142+ if (blocking) {
143+ assert(first);
144++#if !defined(__aarch64__)
145+ atomic_sub(&ctx->notify_me, 2);
146++#else
147++ /* LP: #1805256 - ARM Bionic SRU */
148++ atomic_store_release(&ctx->notify_me,
149++ atomic_read(&ctx->notify_me) - 2);
150++#endif
151+ }
152+
153+ if (first) {
154+diff --git a/util/async.c b/util/async.c
155+index 4dd9d95a9..20f234e2b 100644
156+--- a/util/async.c
157++++ b/util/async.c
158+@@ -221,7 +221,19 @@ aio_ctx_prepare(GSource *source, gint *timeout)
159+ {
160+ AioContext *ctx = (AioContext *) source;
161+
162++#if !defined(__aarch64__)
163+ atomic_or(&ctx->notify_me, 1);
164++#else
165++ /* LP: #1805256 - ARM Bionic SRU */
166++ atomic_set(&ctx->notify_me, atomic_read(&ctx->notify_me) | 1);
167++
168++ /*
169++ * Write ctx->notify_me before computing the timeout
170++ * (reading bottom half flags, etc.). Pairs with
171++ * smp_mb in aio_notify().
172++ */
173++ smp_mb();
174++#endif
175+
176+ /* We assume there is no timeout already supplied */
177+ *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
178+@@ -239,7 +251,13 @@ aio_ctx_check(GSource *source)
179+ AioContext *ctx = (AioContext *) source;
180+ QEMUBH *bh;
181+
182++#if !defined(__aarch64__)
183+ atomic_and(&ctx->notify_me, ~1);
184++#else
185++ /* LP: #1805256 - ARM Bionic SRU */
186++ /* Finish computing the timeout before clearing the flag. */
187++ atomic_store_release(&ctx->notify_me, atomic_read(&ctx->notify_me) & ~1);
188++#endif
189+ aio_notify_accept(ctx);
190+
191+ for (bh = ctx->first_bh; bh; bh = bh->next) {
192+@@ -338,7 +356,17 @@ void aio_notify(AioContext *ctx)
193+ * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
194+ */
195+ smp_mb();
196++
197++#if !defined(__aarch64__)
198+ if (ctx->notify_me) {
199++#else
200++ /* LP: #1805256 - ARM Bionic SRU
201++ *
202++ * Write e.g. bh->scheduled before reading ctx->notify_me. Pairs
203++ * with smp_mb in aio_ctx_prepare or aio_poll.
204++ */
205++ if (atomic_read(&ctx->notify_me)) {
206++#endif
207+ event_notifier_set(&ctx->notifier);
208+ atomic_mb_set(&ctx->notified, true);
209+ }
210+--
211+2.17.1
212+

Subscribers

People subscribed via source and target branches