Code review comment for ~ahasenack/ubuntu/+source/pyzmq:eoan-pyzmq-test-poll-timeout-bump

Revision history for this message
Bryce Harrington (bryce) wrote :

The LP generated debdiff appears to be against a test build, so I constructed one against what's in eoan: http://paste.ubuntu.com/p/GBS8nD68Zg/

- Verified upstream-maintainer is updated
- Verified patch is correctly listed in debian/patches/series
- Verified debdiff targets correct release codename

Bumping the timeout seems safe and I don't have any strong arguments *against* doing so, and the packaging changes LGTM, so I feel comfortable giving a provisional +1 approval.

However, my guess is this will only partially solve the actual issue, so further thinking below:

The test cases in test_auth.py appear to be doing a bunch of connection tests against localhost. One should think these connections would be super-speedy. If they don't connect in under a second, then while giving it a larger timeout should reduce the frequency of the failures, presumably there'll still be cases where it will fail. I wonder if maybe a) use an even larger timeout??, b) skip the test if upstream already runs it "enough", or c) investigate why this particular test fails.

Spot checking the failure logs, the failures due indeed all seem to be occurring with the test_curve_user_id test case, although sometimes are under python2 and sometimes python3 so its an intermittent failure. The proposed patch would increase the timeout for the can_connect() routine, but there are a number of test cases in test_auth.py which call this routine so why is it this one test where that call fails? If the root cause here was simply machine sluggishness I would expect the failures to occur semi-randomly across a variety of callers; the test case consistency makes me wonder if there is a legit bug.

The most notable thing test_curve_user_id() is doing differently from the other test cases is setting up CURVE-based authentication certs. I wouldn't think the mere tacking on of some certs would cause this much delay, and wonder if something is attempting a network lookup somewhere under the hood. The failure errors in the log file aren't detailed enough to explore fully, but digging through the code I got as far as zmq/backend/cffi/_poll.py:zmq_poll() which has a 'while True' loop that might be where things are getting stuck; I couldn't guess why, but maybe tracing into this (or adding debug print's) would verify it's hanging in this routine, and at what part.

If bumping the timeout doesn't eliminate the test failures entirely, it would probably be a good idea to ensure there's a bug report for this for further investigation. If it ends up being more than just a sluggish machine, then this could be worth raising with upstream - particularly if a reproducible test case can be found.

review: Approve

« Back to merge proposal