Mir

Intermittent mir_unit_tests.MesaDisplayTest.drm_device_change_event_triggers_handler test failure: libumockdev isn't thread safe

Bug #1336671 reported by Alexandros Frantzis
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mir
Invalid
Medium
Alexandros Frantzis
umockdev (Ubuntu)
Fix Released
Undecided
Martin Pitt

Bug Description

As seen in: http://s-jenkins.ubuntu-ci:8080/job/mir-clang-utopic-amd64-build/799/console

To reproduce locally:

bzr branch lp:mir/devel mir-devel && cd mir-devel
mkdir build && cd build && cmake .. && make -j4
umockdev-wrapper bin/mir_unit_tests --gtest_filter=MesaDisplayTest.drm_device_change_event_triggers_handler --gtest_repeat=-1 --gtest_break_on_failure

(Ignore the segfault when the test fails, it's a side effect of --gtest_break_on_failure)

Running with strace or on a system with high load increases the chances that we hit the problem. For example, running make -j4 in another mir branch while
running the tests does the trick for mir.

Tags: testsfail

Related branches

tags: added: testsfail
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

Trying to reproduce locally.

Changed in mir:
assignee: nobody → Alexandros Frantzis (afrantzis)
importance: Undecided → Medium
Revision history for this message
Alexandros Frantzis (afrantzis) wrote : Re: Intermittent mir_unit_tests.MesaDisplayTest.drm_device_change_event_triggers_handler test failure

Reproduced:

[ RUN ] MesaDisplayTest.drm_device_change_event_triggers_handler
/storage/work/mir/tests/unit-tests/graphics/mesa/test_display.cpp:773: Failure
Value of: call_count
  Actual: 10
Expected: expected_call_count
Which is: 11

summary: - Intermittent mir_unit_tests.MesaDisplayTest test failure
+ Intermittent
+ mir_unit_tests.MesaDisplayTest.drm_device_change_event_triggers_handler
+ test failure
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

Also see https://code.launchpad.net/~alan-griffiths/mir/frig-1283085/+merge/207682 for a workaround/frig related to this test .

description: updated
description: updated
Changed in mir:
status: New → In Progress
Changed in mir:
milestone: none → 0.5.0
Martin Pitt (pitti)
description: updated
Revision history for this message
Martin Pitt (pitti) wrote :

Some note keeping:

 * I applied http://paste.ubuntu.com/7746005/ to test_display.cpp to disable the sleeps between the change events (which makes the race condition much more likely to hit), and to also print when the register_configuration_change_handler() hits so that we can verify that this is called the expected number of times.

 * I added an udev_dbg() at the end of libudev-monitor.c udev_monitor_receive_device():

           udev_dbg(udev_monitor->udev, "udev_monitor_receive_device: success\n");

   With that we can easily count how many times this actually gets called.

 * I applied http://paste.ubuntu.com/7746012/ to umockdev to get debugging messages for only the netlink emulation bits, not for everything else (to make the output easier to read).

Then I run the test with using my locally modified libs and udev debug logging:

UDEV_LOG=debug PATH=~/upstream/umockdev/src:$PATH LD_LIBRARY_PATH=~/upstream/umockdev/.libs/:/tmp/systemd-204/build-deb/.libs/ umockdev-wrapper bin/mir_unit_tests --gtest_filter=MesaDisplayTest.drm_device_change_event_triggers_handler --gtest_break_on_failure

With that I get 11 instances of

libudev: udev_device_new_from_syspath: device 0x7f09b4001be0 has devpath '/devices/card2'
testbed wrapped recvmsg: netlink socket fd 7, got 91 bytes
libudev: udev_monitor_receive_device: udev_monitor_receive_device: success
display config change handler called

in the successful case; the first recvmsg reports 88 bytes, as the "add" event is slightly shorter than the "change" event.

But in the failure case with 9 out of 11, I see this in the middle:

libudev: udev_monitor_receive_device: unable to receive message
libudev: udev_device_new_from_syspath: device 0x7f36c4001be0 has devpath '/devices/card2'
testbed wrapped recvmsg: netlink socket fd 7, got 91 bytes
libudev: udev_device_new_from_syspath: device 0x7f36c4001be0 has devpath '/devices/card2'
testbed wrapped recvmsg: netlink socket fd 7, got 91 bytes
libudev: udev_device_new_from_syspath: device 0x7f36c4001be0 has devpath '/devices/card2'
testbed wrapped recvmsg: netlink socket fd 7, got 91 bytes
libudev: udev_monitor_receive_device: udev_monitor_receive_device: success
display config change handler called

The "unable to receive message" is quite harmless, it's printed by udev_monitor_receive_device() if recvmsg() fails; I added another %m to that message, and it always just says "Resource temporarily unavailable", which is perfectly normal on a nonblocking socket.

So what we see here is that the first two change events are correctly synthesized (by umockdev) and someone fetches them from the netlink socket (recvmsg), but they aren't returned by udev_monitor_receive_device().

I'm also quite sure after reading the code and the above output that the counting in test_display.cpp and the order of setting up the monitor and synthesizing/receiving events are correct. So this smells like either a bug/race condition in libudev, or using the nonblocking socket in an incorrect way; it's documentation says to either use event-based wakeup (poll/select) or blocking mode, not manual polling in nonblocking mode. But in theory the latter ought to work as well.

Revision history for this message
Martin Pitt (pitti) wrote :

More notekeeping: I added

  udev_dbg(udev_monitor->udev, "udev_monitor_receive_device: start\n");

at the very beginning of udev_monitor_receive_device(). In theory, any "start" of this should either be matched by a "success" at the end, or an "unable to receive message: Resource temporarily unavailable". In a successful test run this is true, but in a failed run I get 14 "start"s and 3 "Resource temporarily unavailable", which leaves exactly 11 "starts" which ought to succeed. But I only get 9 successes, reflecting the "actual: 9 expected: 11" failure.

Indeed I see two blocks

libudev: udev_monitor_receive_device: udev_monitor_receive_device: start
libudev: udev_device_new_from_syspath: device 0x7f2394001be0 has devpath '/devices/card2'
libudev: udev_device_new_from_syspath: device 0x7f2394001be0 has devpath '/devices/card2'

without either a success or a "unable to receive message", which means that there is somewhere an exit of udev_monitor_receive_device() which eats the event.

I now added udev_dbg()s to all exit paths which didn't yet have one. This revealed that the event was correctly read from the netlink socket, but then discarded here:

        /* skip device, if it does not pass the current filter */
        if (!passes_filter(udev_monitor, udev_device)) {
                struct pollfd pfd[1];
                int rc;

                udev_device_unref(udev_device);

src/platform/udev_wrapper.cpp has wrappers for udev_monitor_filter_add_match_subsystem_devtype(), and apparently for this test this is applied:

src/platform/graphics/mesa/display.cpp: monitor.filter_by_subsystem_and_type("drm", "drm_minor");

Mir does not use tag based filtering.

Further additions of udev_dbg() to passes_filter() reveals that the received uevent device delivers udev_device_get_devtype() == NULL, which is the reason for discarding it as the monitor filter only watches for devtype "drm_minor".

My current suspicion is that this is a race on the fake /sys "uevent" file for the device -- it gets read before it got completely written, or rather synced to disk. Thus the "DEVTYPE=" property would be missing. This would be the kind of thing which would get aggravated under high system load. Also, I ran the test case with just this patch:

                 // sleeping between calls to fake_devices hides race conditions
- std::this_thread::sleep_for(std::chrono::microseconds{500});
+ //std::this_thread::sleep_for(std::chrono::microseconds{500});
+ sync();

... and it is now through 500 iterations without failure. This doesn't prove that this is the bug as the sync() delays the iterations quite a bit (each test now takes ~ 200 ms), but it's currently the only plausible explanation that I have.

I'll create a similar test for that in umockdev's test suite, which will make this slightly easier to debug and ensure it stays fixed.

summary: Intermittent
mir_unit_tests.MesaDisplayTest.drm_device_change_event_triggers_handler
- test failure
+ test failure: device DEVTYPE is sometimes NULL
Revision history for this message
Martin Pitt (pitti) wrote : Re: Intermittent mir_unit_tests.MesaDisplayTest.drm_device_change_event_triggers_handler test failure: device DEVTYPE is sometimes NULL

FTR, when I completely take the thread::sleep_for() out, I also sometimes get failures like

mir_unit_tests: src/uevent_sender.c:213: uevent_sender_send: Assertion `subsystem != ((void *)0)' failed.

which is again a hint that the test process doesn't see a consistent view on the /tmp/umockdev.* tree.

Revision history for this message
Martin Pitt (pitti) wrote :

Another update: This is not an fsync() race after all. The /uevent files it tries to read already exist and can be successfully read in a previous event.

I adjusted the libudev debugging to http://paste.ubuntu.com/7747163/ and umockdev's to http://paste.ubuntu.com/7747174/ to get a non-overwhelming debug output, and still run the test like

With that, a "normal" event processing looks like that:

libudev: udev_monitor_receive_device: udev_monitor_receive_device: start
libudev: udev_device_read_uevent_file: reading /sys/devices/card2/uevent
testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.YI7DIX/sys/devices/card2/uevent
libudev: udev_device_read_uevent_file: line DEVTYPE=drm_minor
libudev: udev_monitor_receive_device: udev_monitor_receive_device: success

But I also see ones that look like that:

libudev: udev_monitor_receive_device: udev_monitor_receive_device: start
libudev: udev_device_read_uevent_file: reading /sys/devices/card2/uevent
testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.YI7DIX/sys/devices/card2/uevent
libudev: udev_device_new_from_syspath: device 0x7f2208001be0 has devpath '/devices/card2'
libudev: udev_device_read_uevent_file: reading /sys/devices/card2/uevent
testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.YI7DIX/sys/devices/card2/uevent
libudev: udev_device_read_uevent_file: line DEVTYPE=drm_minor
libudev: udev_device_read_uevent_file: line DEVTYPE=drm_minor
libudev: udev_monitor_receive_device: udev_monitor_receive_device: success

and for a failing case:

ibudev: udev_monitor_receive_device: udev_monitor_receive_device: start
libudev: udev_device_read_uevent_file: reading /sys/devices/card2/uevent
testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.YI7DIX/sys/devices/card2/uevent
libudev: udev_device_new_from_syspath: device 0x7f2208001be0 has devpath '/devices/card2'
libudev: udev_device_read_uevent_file: reading /sys/devices/card2/uevent
testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.YI7DIX/sys/devices/card2/uevent
libudev: passes_filter: passes_filter: monitor filters on DEVTYPE=drm_minor, but device does not have DEVTYPE
libudev: udev_monitor_receive_device: udev_monitor_receive_device: failed filter test, discarding
libudev: udev_device_read_uevent_file: line DEVTYPE=drm_minor
libudev: udev_device_read_uevent_file: reading /sys/devices/card2/uevent
testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.YI7DIX/sys/devices/card2/uevent
libudev: udev_device_read_uevent_file: line DEVTYPE=drm_minor
libudev: udev_monitor_receive_device: udev_monitor_receive_device: success

So the really strange thing is that there are often multiple parallel runs of these functions, like if udev_monitor_receive_device() was being called in parallel. I wonder if that's related to the test having three threads, so that the display's configuration handler monitor somehow also runs in the "t" and "watchdog" threads?

Revision history for this message
Martin Pitt (pitti) wrote :

An strace confirms that all three threads interact with libudev. That's bad as libudev is very far away from being thread safe. Only the main thread actually calls recvmsg(), i. e. udev_monitor_receive_device(). But the others do something as well. This isn't expected I figure, as at least the watchdog thread doesn't touch udev at all, and the "t" thread does not talk to libudev but only to umockdev. So *something* leaks into these threads which has no business there.

Revision history for this message
Martin Pitt (pitti) wrote :

Indeed, so as libumockdev isn't thread safe, it becomes utterly confused by several threads interacting with it at once:

[pid 27529] open("/tmp/umockdev.ZMUMIX/sys/devices/card2/subsystem", O_RDONLY <unfinished ...>
[pid 27574] write(2, "reading /sys/devices/card2/uevent\n", 34reading /sys/devices/card2/uevent
 <unfinished ...>
[pid 27529] <... open resumed> ) = 11
[pid 27574] <... write resumed> ) = 34
[pid 27529] fstat(11, <unfinished ...>
[pid 27574] write(2, "testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.ZMUMIX/sys/devices/card2/uevent\n", 98testbed wrapped fopen(/sys/devices/card2/uevent) -> /tmp/umockdev.ZMUMIX/sys/devices/card2/uevent
 <unfinished ...>
[pid 27529] <... fstat resumed> {st_dev=makedev(0, 21), st_ino=1776459, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=0, st_size=120, st_atime=2014/07/04-16:25:05, st_mtime=2014/07/04-16:25:05, st_ctime=2014/07/04-16:25:05}) = 0
[pid 27574] <... write resumed> ) = 98
[pid 27529] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 27574] open("/tmp/umockdev.ZMUMIX/sys/devices/card2/uevent", O_RDONLY <unfinished ...>
[pid 27529] <... mmap resumed> ) = 0x7f8518d4f000
[pid 27574] <... open resumed> ) = 12
[pid 27529] read(11, <unfinished ...>
[pid 27574] fstat(12, <unfinished ...>
[pid 27529] <... read resumed> 0x7f8518d4f000, 4096) = -1 EISDIR (Is a directory)

I. e. while one thread is handling the /sys/devices/card2/subsystem symlink, the other tries to read /sys/devices/card2/uevent, and they collide as the umockdev-preload's path resolution (and probably other things) are not thread safe.

summary: Intermittent
mir_unit_tests.MesaDisplayTest.drm_device_change_event_triggers_handler
- test failure: device DEVTYPE is sometimes NULL
+ test failure: libumockdev isn't thread safe
Revision history for this message
Martin Pitt (pitti) wrote :

Just in case I still need it next week, here's the complete strace with debugging.

Revision history for this message
Martin Pitt (pitti) wrote :

I now wrote two test cases for umockdev itself which reproduce the multi-thread errors and pretty much the same errors as the Mir test case. I now made the path redirection to testbed thread-safe for all wrapped functions, which does not yet address all MT issues, but is sufficient for this particular test scenario:

  https://github.com/martinpitt/umockdev/commit/3b852ea23

With that I ran that Mir test successfully through 20.000 iterations (with a version without thestd::this_thread::sleep_for(std::chrono::microseconds{500})) . It previously hardly survived two iterations without the sleep, and with the sleep it failed after ~ 1000 iterations on an idle system and ~ a hundred on a loaded one.

Changed in umockdev (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
status: New → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

I keep the Mir task open so that you can adjust the test case to take out the sleep (if you wish). If you want to leave it in, please just close as invalid.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Note fix committed to upstream and the LP mirror of it:
https://github.com/martinpitt/umockdev/commit/3b852ea2321b6b5c887b8b4943eee1a1cc24696b
http://bazaar.launchpad.net/~pitti/umockdev/trunk/revision/364

But technically not fix committed to Ubuntu yet. I'm sure it's coming soon:
https://launchpad.net/ubuntu/+source/umockdev

Revision history for this message
Martin Pitt (pitti) wrote :

Daniel: yeah, I (and many others) have traditionally used "fix committed" for "fixed in upstream task, but not released to Ubuntu yet". I know it's a slightly different semantics than for upstream tasks, but a very useful one.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package umockdev - 0.8.5-1

---------------
umockdev (0.8.5-1) unstable; urgency=medium

  * New upstream release, to fix build and test on 32 bit architectures.

 -- Martin Pitt <email address hidden> Fri, 11 Jul 2014 07:34:26 +0200

Changed in umockdev (Ubuntu):
status: Fix Committed → Fix Released
Changed in mir:
milestone: 0.5.0 → none
status: In Progress → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.