Code review comment for ~rafaeldtinoco/ubuntu/+source/qemu:lp1805256-bionic

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

> Could it just be that there are 2 different issues here?
>
> A) One that impacts all releases, which is fixed by "async: use
> explicit memory barriers"
>
> B) One that only impacts bionic, which requires both "aio-posix: Don't
> count ctx->notifier as progress when polling" and "aio: Do
> aio_notify_accept only during blocking aio_poll"? These 2 patches were
> submitted together to fix an IOThread hang:
> https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg00897.html
>
> I'll see if I can collect a backtrace w/ the fix for A) and not B) to
> see how it compares to the one we've attributed to B).

A more plausible scenario, discussed with @cpaelzer as well, would be
that the patch we just placed in the patchset, has added entropy in
between 2 lock acquire/releases. This way, just like adding variables in
between the 2 lock variables, we are mitigating the issue.

Considering the initial root cause:

- 2 locking variables adjacent in address space
- thus sharing the same cache line
- when being manipulated in 2 x different CPUs
- each CPU with its own cache line
- CPUs should invalidate each others cacheline

We have the following mitigations:

- replacing locking primitives by full mutex implementation
- removing variables from the same cache line
- changing entropy reducing/postponing lock reads/writes
  - this one is here

So I think its more of the same... explaining to @cpaelzer he felt that
this was also plausible. Only way to confirm would be to isolate
everything else and stick with a small reproducer.

« Back to merge proposal