> Could it just be that there are 2 different issues here?
>
> A) One that impacts all releases, which is fixed by "async: use
> explicit memory barriers"
>
> B) One that only impacts bionic, which requires both "aio-posix: Don't
> count ctx->notifier as progress when polling" and "aio: Do
> aio_notify_accept only during blocking aio_poll"? These 2 patches were
> submitted together to fix an IOThread hang:
> https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg00897.html
>
> I'll see if I can collect a backtrace w/ the fix for A) and not B) to
> see how it compares to the one we've attributed to B).
A more plausible scenario, also discussed with @cpaelzer, is that the patch we just added to the patchset introduces extra work ("entropy") between the two lock acquire/release sequences. Much like placing padding variables between the two lock variables, this mitigates the issue rather than fixing its root cause.
Considering the initial root cause:
- 2 locking variables adjacent in address space
- thus sharing the same cache line
- which is manipulated by 2 different CPUs
- each CPU holding its own copy of that cache line
- forcing the CPUs to invalidate each other's copy on every write
We have the following mitigations:
- replacing the locking primitives with a full mutex implementation
- moving the variables out of the shared cache line
- changing the timing ("entropy"), reducing/postponing lock reads/writes
  - this last one is what the current patch does
So I think it's more of the same... when I explained this to @cpaelzer he also found it plausible. The only way to confirm would be to isolate everything else and work with a small reproducer.