Merge lp:~rodsmith/checkbox/network-fixes into lp:checkbox

Proposed by Rod Smith
Status: Merged
Approved by: Daniel Manrique
Approved revision: 3155
Merged at revision: 3155
Proposed branch: lp:~rodsmith/checkbox/network-fixes
Merge into: lp:checkbox
Diff against target: 201 lines (+66/-25)
1 file modified
providers/plainbox-provider-checkbox/bin/network (+66/-25)
To merge this branch: bzr merge lp:~rodsmith/checkbox/network-fixes
Reviewer Review Type Date Requested Status
Daniel Manrique (community) Approve
Zygmunt Krynicki (community) Needs Fixing
Review via email: mp+228923@code.launchpad.net

Commit message

Fixes problems with failing network tests and trashed routing tables in network tests.

Description of the change

Fixes for bug 1329029:

First, the script now tries to ping the iperf or FTP server for up to four minutes before proceeding. It does not proceed with tests until the server begins responding to pings. If the server doesn't respond to pings after ~4 minutes, the tests fail. This resolves problems with some smart switches, which can be sluggish to restore connections when they're restored via "ip link set dev eth0 up" (or similar for other interfaces). The original script used a 10-second pause, which is inadequate in these cases. Note that if the config file lacks an IP address for the test server, this will add a ~4-minute delay to each NIC's test time. (The tests would fail either way, of course; only the extra delay is new.)

Second, the script now saves the routing table (via "ip route save table all") to a temporary file (/tmp/plainbox-network.route) before running each test, then restores the routing table (via "ip route restore") from the same file after each test. This prevents the default route from being lost when its interface is tested; such loss has caused problems with running the virtualization test and submitting results once tests are run.

To post a comment you must log in.
Revision history for this message
Daniel Manrique (roadmr) wrote :

This all looks good, I suggested a few improvements but they are just style fixes, which I think would be nice to have but not mandatory. Let me know, if you're pressed for time we can merge as-is.

lp:~rodsmith/checkbox/network-fixes updated
3153. By Rod Smith

Implemented (most of) the suggested changes

Revision history for this message
Rod Smith (rodsmith) wrote :

I've implemented most of the changes suggested by Daniel. The one exception was a change to make a loop a bit more "Pythonic." My level of Python-fu is still fairly low, and the suggested changes were a little over my comfort level. Since it was described as being of little real benefit and the effort to figure it out would take me a while, I'm passing on that change for now.

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

Some more reviews from me inline below

review: Needs Fixing
Revision history for this message
Daniel Manrique (roadmr) wrote :

Thanks! Time permitting, do have a look at Python's iterator handling, you seldom need to manage incrementing a counter yourself. If you ever have some spare time and want to refactor this to use what I suggested, it would be a fun exercise.

http://www.diveintopython.net/file_handling/for_loops.html

Also, bear in mind that opening any file can throw an exception. The assumption that /dev/null can be successfully opened is reasonable, as is the one about TemporaryFile() being mostly likely to succeed; if they do fail, a very clear trace will be thrown and that will help debug the problem. But usually it's good practice to try to catch those file opening exceptions. I won't block this merge on that though, so we're golden in that respect.

With that in mind, there's only one tiny change I'd ask you to make (it slipped by both of us!). See below. Once that's in, we can proceed to approve and merge this.

review: Needs Fixing
lp:~rodsmith/checkbox/network-fixes updated
3154. By Rod Smith

Fixes in response to Zygmunt's and Daniel's comments

Revision history for this message
Rod Smith (rodsmith) wrote :

I've implemented changes in response to Zygmunt's comments, as well as Daniel's final one.

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

Almost okay.

One last comment: run 'python3 -m flake8' on that file and try to fix anything reported (haven't tried but there's always some things mentioned). Ping me on IRC if you have any questions about that please.

review: Needs Fixing
Revision history for this message
Daniel Manrique (roadmr) wrote :

The part about not using TemporaryFile as context manager is my fault, I told Rod it was OK not to do that as he would use the file descriptor later (one write, one read). Maybe I was mistaken :(

lp:~rodsmith/checkbox/network-fixes updated
3155. By Rod Smith

Fixes for pre-existing flake8-identified issues

Revision history for this message
Rod Smith (rodsmith) wrote :

These changes were all to fix pre-existing problems identified by flake8. The temp file access is being left without context protection because protecting it would cause the whole test to fail if the temp file couldn't be created; but the point of storing the routing table in the temp file is to fix a minor annoyance (a missing default route after the test completes), and a failed test would be a bigger annoyance than that. Also, if the temp file can't be created, that implies that some other test -- probably a disk test -- would fail spectacularly, so the underlying problem can be left to that test for debugging.

Revision history for this message
Daniel Manrique (roadmr) wrote :

Thanks! Let's approve this.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'providers/plainbox-provider-checkbox/bin/network'
2--- providers/plainbox-provider-checkbox/bin/network 2014-07-21 14:45:11 +0000
3+++ providers/plainbox-provider-checkbox/bin/network 2014-07-31 17:16:53 +0000
4@@ -23,7 +23,6 @@
5 ArgumentParser,
6 RawTextHelpFormatter
7 )
8-import configparser
9 import fcntl
10 import ftplib
11 from ftplib import FTP
12@@ -34,6 +33,7 @@
13 import socket
14 import struct
15 import subprocess
16+import tempfile
17 from subprocess import (
18 CalledProcessError,
19 check_call,
20@@ -71,12 +71,12 @@
21 if self.iface.max_speed == 0:
22 logging.warning("No max speed detected, assuming Wireless device "
23 "and continuing with test.")
24- # Otherwise, no sense in running if we're not running at full speed.
25+ # Otherwise, no sense in running if we're not running at full speed.
26 elif self.iface.max_speed > self.iface.link_speed:
27 logging.error("Detected link speed (%s) is lower than "
28 "detected max speed (%s)" %
29 (self.iface.link_speed, self.iface.max_speed))
30- logging.error("Please check your device configuration and try again")
31+ logging.error("Check your device configuration and try again")
32 return 1
33
34 cmd = "timeout 180 iperf -c {} -n {}".format(self.target, self.mbytes)
35@@ -116,7 +116,7 @@
36 scaled_throughput /= 1000
37 try:
38 percent = scaled_throughput / int(self.iface.max_speed) * 100
39- except (ZeroDivisionError, TypeError) as error:
40+ except (ZeroDivisionError, TypeError):
41 # Catches a condition where the interface functions fine but
42 # ethtool fails to properly report max speed. In this case
43 # it's up to the reviewer to pass or fail.
44@@ -134,7 +134,8 @@
45 return 0
46 # Below is guaranteed to not throw an exception because we'll
47 # have exited above if it did.
48- print("{:03.2f}% of theoretical max {} Mb/s\n".format(percent, int(self.iface.max_speed)))
49+ print("{:03.2f}% of theoretical max {} Mb/s\n".format(percent,
50+ int(self.iface.max_speed)))
51 if percent < self.fail_threshold:
52 logging.warn("Poor network performance detected")
53 return 30
54@@ -224,7 +225,8 @@
55 self.remote.set_debuglevel(2)
56 self.remote.set_pasv(True)
57 except socket.error as connect_exception:
58- logging.error("Failed to connect to: %s", self.target)
59+ logging.error("Failed to connect to: %s: %s", self.target,
60+ connect_exception)
61 return False
62
63 logging.debug("Logging in")
64@@ -233,7 +235,8 @@
65 try:
66 self.remote.login(self.username, self.password)
67 except ftplib.error_perm as login_exception:
68- logging.error("failed to log into target: %s", self.target)
69+ logging.error("failed to log into target: %s: %s", self.target,
70+ login_exception)
71 return False
72
73 default_out_dir = ""
74@@ -380,12 +383,12 @@
75 # Parse ethtool data for max speed since /sys/class/net/DEV/speed only
76 # reports link speed.
77
78- # Search for things that look like 100baseSX,
79- #40000baseNX, 10000baseT
80+ # Search for things that look like 100baseSX,
81+ # 40000baseNX, 10000baseT
82 try:
83 ethinfo = check_output(['ethtool', self.interface],
84- universal_newlines = True,
85- stderr=STDOUT).split(' ')
86+ universal_newlines=True,
87+ stderr=STDOUT).split(' ')
88 except FileNotFoundError:
89 logging.warning('ethtool not found! Unable to get max speed')
90 ethinfo = None
91@@ -450,8 +453,28 @@
92 return params
93
94
95+def can_ping(the_interface, test_target):
96+ working_interface = False
97+ num_loops = 0
98+ while (not working_interface) and (num_loops < 48):
99+ working_interface = True
100+
101+ try:
102+ check_call(["ping", "-I", the_interface, "-c", "1", test_target],
103+ stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
104+ except CalledProcessError as excp:
105+ working_interface = False
106+ logging.warning("Ping failure on %s (%s)", the_interface, excp)
107+
108+ if not working_interface:
109+ time.sleep(5)
110+ num_loops += 1
111+
112+ return working_interface
113+
114+
115 def interface_test(args):
116- if not "test_type" in vars(args):
117+ if not ("test_type" in vars(args)):
118 return
119
120 # Get the actual test data from one of two possible sources
121@@ -489,16 +512,20 @@
122
123 # Testing begins here!
124 #
125- # Check and make sure that interface is indeed connected
126+ # Make sure that the interface is indeed connected
127 try:
128- cmd = "ip link set dev %s up" % args.interface
129- check_call(shlex.split(cmd))
130+ check_call(["ip", "link", "set", "dev", args.interface, "up"])
131 except CalledProcessError as interface_failure:
132- logging.error("Failed to use %s:%s", cmd, interface_failure)
133+ logging.error("Failed to use %s:%s", args.interface, interface_failure)
134 return 1
135
136- # Give interface enough time to get DHCP address
137- time.sleep(10)
138+ # Back up routing table, since network down/up process
139+ # tends to trash it....
140+ temp = tempfile.TemporaryFile()
141+ try:
142+ check_call(["ip", "route", "save", "table", "all"], stdout=temp)
143+ except CalledProcessError as route_error:
144+ logging.warning("Unable to save routing table: %s", route_error)
145
146 result = 0
147 # Stop all other interfaces
148@@ -509,13 +536,18 @@
149 for iface in extra_interfaces:
150 logging.debug("Shutting down interface:%s", iface)
151 try:
152- cmd = "ip link set dev %s down" % iface
153- check_call(shlex.split(cmd))
154+ check_call(["ip", "link", "set", "dev", iface, "down"])
155 except CalledProcessError as interface_failure:
156- logging.error("Failed to use %s:%s", cmd, interface_failure)
157+ logging.error("Failed to shut down %s:%s", iface, interface_failure)
158 result = 3
159
160 if result == 0:
161+ # Ensure that interface is fully up by waiting until it can
162+ # ping the test server
163+ if not can_ping(args.interface, test_target):
164+ logging.error("Can't ping test server on %s", args.interface)
165+ return 1
166+
167 # Execute FTP transfer benchmarking test
168 if args.test_type.lower() == "ftp":
169 ftp_benchmark = FTPPerformanceTest(
170@@ -526,7 +558,7 @@
171 result = ftp_benchmark.run()
172
173 elif args.test_type.lower() == "iperf":
174- iperf_benchmark = IPerfPerformanceTest(args.interface, test_target,
175+ iperf_benchmark = IPerfPerformanceTest(args.interface, test_target,
176 args.fail_threshold)
177 result = iperf_benchmark.run()
178
179@@ -538,12 +570,21 @@
180 for iface in extra_interfaces:
181 logging.debug("Restoring interface:%s", iface)
182 try:
183- cmd = "ip link set dev %s up" % iface
184- check_call(shlex.split(cmd))
185+ check_call(["ip", "link", "set", "dev", iface, "up"])
186 except CalledProcessError as interface_failure:
187- logging.error("Failed to use %s:%s", cmd, interface_failure)
188+ logging.error("Failed to restore %s:%s", iface, interface_failure)
189 result = 3
190
191+ # Restore routing table to original state
192+ temp.seek(0)
193+ try:
194+ # Harmless "RTNETLINK answers: File exists" messages on stderr
195+ check_call(["ip", "route", "restore"], stdin=temp,
196+ stderr=subprocess.DEVNULL)
197+ except CalledProcessError as restore_failure:
198+ logging.warning("Unable to restore routing table: %s", restore_failure)
199+ temp.close()
200+
201 return result
202
203

Subscribers

People subscribed via source and target branches