Merge lp:~percona-toolkit-dev/percona-toolkit/pt-stalk-iter-1-bug-1070434 into lp:percona-toolkit/2.1

Proposed by Daniel Nichter
Status: Merged
Merged at revision: 428
Proposed branch: lp:~percona-toolkit-dev/percona-toolkit/pt-stalk-iter-1-bug-1070434
Merge into: lp:percona-toolkit/2.1
Diff against target: 280 lines (+110/-28)
2 files modified
bin/pt-stalk (+43/-0)
t/pt-stalk/pt-stalk.t (+67/-28)
To merge this branch: bzr merge lp:~percona-toolkit-dev/percona-toolkit/pt-stalk-iter-1-bug-1070434
Reviewer Review Type Date Requested Status
Daniel Nichter Approve
Review via email: mp+131093@code.launchpad.net
To post a comment you must log in.
429. By Daniel Nichter

Make "not stalking" tests more reliable.

430. By Daniel Nichter

Really wait for expected data in files.

431. By Daniel Nichter

Even more waiting for _really_ slow Ubuntu 12 box.

432. By Daniel Nichter

Include collector output in diag.

Revision history for this message
Daniel Nichter (daniel-nichter) :
review: Approve
433. By Daniel Nichter

Document extra --run-time-related waits.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'bin/pt-stalk'
--- bin/pt-stalk 2012-10-13 20:15:37 +0000
+++ bin/pt-stalk 2012-10-24 00:23:21 +0000
@@ -1210,6 +1210,37 @@
1210 purge_samples "$OPT_DEST" "$OPT_RETENTION_TIME"1210 purge_samples "$OPT_DEST" "$OPT_RETENTION_TIME"
1211 fi1211 fi
1212 done1212 done
1213
1214 # Before exiting, the last collector may still be running.
1215 # Wait for it to finish in case the tool is part of a script,
1216 # or part of a test, so the caller has access to the collected
1217 # data when the tool exists. collect() waits an additional
1218 # --run-time seconds for itself to complete, which means we
1219 # have to wait for 2 * run-time like it plus some overhead else
1220 # we may get in sync with the collector and kill it a microsecond
1221 # before it kills itself, thus 3 * run-time.
1222 # https://bugs.launchpad.net/percona-toolkit/+bug/1070434
1223 if [ "$(jobs)" ]; then
1224 local sleep_time=$((OPT_RUN_TIME * 3))
1225 log "Waiting up to $sleep_time seconds for collectors to finish..."
1226 local slept=0
1227 while [ -n "$(jobs)" -a $slept -lt $sleep_time ]; do
1228 sleep 1
1229 slept=$((slept + 1))
1230 done
1231
1232 if [ "$(jobs)" ]; then
1233 for pid in $(jobs -p); do
1234 # This isn't an warning (we don't want exit status 1) because
1235 # the system may be running slowly so it's just "natural" that
1236 # a collector may get stuck or run really slowly.
1237 log "Killing collector $pid"
1238 kill $pid >/dev/null 2>&1
1239 done
1240 else
1241 log "All collectors have finished"
1242 fi
1243 fi
1213}1244}
12141245
1215# ###########################################################################1246# ###########################################################################
@@ -1754,6 +1785,18 @@
1754seconds hasn't gathered enough diagnostic data, running longer is not likely to1785seconds hasn't gathered enough diagnostic data, running longer is not likely to
1755do so. In fact, in many cases a shorter collection period is appropriate.1786do so. In fact, in many cases a shorter collection period is appropriate.
17561787
1788This value is used two other times. After collecting, the collect subprocess
1789will wait another L<"--run-time"> seconds for its commands to finish. Some
1790commands can take awhile if the system is running very slowly (which can
1791likely be the case given that a collection was triggered). Since empty files
1792are deleted, the extra wait gives commands time to finish and write their
1793data. The value is potentially used again just before the tool exits to wait
1794again for any collect subprocesses to finish. In most cases this won't
1795happen because of the aforementioned extra wait. If it happens, the tool
1796will log "Waiting up to N seconds for collectors to finish..." where N is
1797three times L<"--run-time">. In both cases, after waiting, the tool kills
1798all of its subprocesses.
1799
1757=item --sleep1800=item --sleep
17581801
1759type: int; default: 3001802type: int; default: 300
17601803
=== modified file 't/pt-stalk/pt-stalk.t'
--- t/pt-stalk/pt-stalk.t 2012-10-13 18:58:55 +0000
+++ t/pt-stalk/pt-stalk.t 2012-10-24 00:23:21 +0000
@@ -86,20 +86,21 @@
86 "Creates --dest (collect) dir"86 "Creates --dest (collect) dir"
87);87);
8888
89chomp($pid = `cat $pid_file`);89chomp($pid = `cat $pid_file 2>/dev/null`);
90$retval = system("kill -0 $pid");90$retval = system("kill -0 $pid");
91is(91is(
92 $retval >> 0,92 $retval >> 0,
93 0,93 0,
94 "pt-stalk is running ($pid)"94 "pt-stalk is running"
95);95);
9696
97$output = `cat $log_file`;97PerconaTest::wait_for_sh("grep -q 'Check results' $log_file >/dev/null");
98$output = `cat $log_file 2>/dev/null`;
98like(99like(
99 $output,100 $output,
100 qr/Check results: Threads_running=\d+, matched=no, cycles_true=0/,101 qr/Check results: Threads_running=\d+, matched=no, cycles_true=0/,
101 "Check results logged"102 "Check results logged"
102);103) or diag(`cat $log_file 2>/dev/null`, `cat $dest/*-output 2>/dev/null`);
103104
104$retval = system("kill $pid 2>/dev/null");105$retval = system("kill $pid 2>/dev/null");
105is(106is(
@@ -108,19 +109,19 @@
108 "Killed pt-stalk"109 "Killed pt-stalk"
109);110);
110111
111sleep 1;112PerconaTest::wait_until(sub { !-f $pid_file });
112113
113ok(114ok(
114 ! -f $pid_file,115 ! -f $pid_file,
115 "Removes PID file"116 "Removes PID file"
116);117);
117118
118$output = `cat $log_file`;119$output = `cat $log_file 2>/dev/null`;
119like(120like(
120 $output,121 $output,
121 qr/Caught signal, exiting/,122 qr/Caught signal, exiting/,
122 "Caught signal logged"123 "Caught signal logged"
123);124) or diag(`cat $log_file 2>/dev/null`, `cat $dest/*-output 2>/dev/null`);
124125
125# ###########################################################################126# ###########################################################################
126# Test collect.127# Test collect.
@@ -134,42 +135,61 @@
134my (undef, $uptime) = $dbh->selectrow_array("SHOW STATUS LIKE 'Uptime'");135my (undef, $uptime) = $dbh->selectrow_array("SHOW STATUS LIKE 'Uptime'");
135my $threshold = $uptime + 2;136my $threshold = $uptime + 2;
136137
137$retval = system("$trunk/bin/pt-stalk --iterations 1 --dest $dest --variable Uptime --threshold $threshold --cycles 2 --run-time 2 --pid $pid_file -- --defaults-file=$cnf >$log_file 2>&1");138$retval = system("$trunk/bin/pt-stalk --iterations 1 --dest $dest --variable Uptime --threshold $threshold --cycles 2 --run-time 2 --pid $pid_file -- --defaults-file=$cnf >$log_file 2>&1");
138139
139sleep 3;140PerconaTest::wait_until(sub { !-f $pid_file });
140141
141$output = `cat $dest/*-trigger`;142$output = `cat $dest/*-trigger 2>/dev/null`;
142like(143like(
143 $output,144 $output,
144 qr/Check results: Uptime=\d+, matched=yes, cycles_true=2/,145 qr/Check results: Uptime=\d+, matched=yes, cycles_true=2/,
145 "Collect triggered"146 "Collect triggered"
147)
148or diag(
149 'output', $output,
150 'log file', `cat $log_file 2>/dev/null`,
151 'collector', `cat $dest/*-output 2>/dev/null`,
152 'dest', `ls -l $dest/ 2>/dev/null`,
153 'df', `cat $dest/*-df 2>/dev/null`,
146);154);
147155
148# There is some nondeterminism here. Sometimes it'll run for 2 samples because156# There is some nondeterminism here. Sometimes it'll run for 2 samples because
149# the samples may not be precisely 1 second apart.157# the samples may not be precisely 1 second apart.
150chomp($output = `cat $dest/*-df | grep -c '^TS'`);158chomp($output = `cat $dest/*-df 2>/dev/null | grep -c '^TS'`);
151ok(159ok(
152 $output >= 1 && $output <= 3,160 $output >= 1 && $output <= 3,
153 "Collect ran for --run-time"161 "Collect ran for --run-time"
154) or diag("Run time: $output");162)
163or diag(
164 'output', $output,
165 'log file', `cat $log_file 2>/dev/null`,
166 'collector', `cat $dest/*-output 2>/dev/null`,
167 'dest', `ls -l $dest/ 2>/dev/null`,
168 'df', `cat $dest/*-df 2>/dev/null`,
169);
155170
156ok(171ok(
157 PerconaTest::not_running("pt-stalk --iterations 1"),172 PerconaTest::not_running("pt-stalk --iterations 1"),
158 "pt-stalk is not running"173 "pt-stalk is not running"
159);174);
160175
161$output = `cat $dest/*-trigger`;176$output = `cat $dest/*-trigger 2>/dev/null`;
162like(177like(
163 $output,178 $output,
164 qr/pt-stalk ran with --function=status --variable=Uptime --threshold=$threshold/,179 qr/pt-stalk ran with --function=status --variable=Uptime --threshold=$threshold/,
165 "Trigger file logs how pt-stalk was ran"180 "Trigger file logs how pt-stalk was ran"
166);181);
167182
168chomp($output = `cat $log_file | grep 'Collector PID'`);183chomp($output = `cat $log_file 2>/dev/null | grep 'Collector PID'`);
169like(184like(
170 $output,185 $output,
171 qr/Collector PID \d+/,186 qr/Collector PID \d+/,
172 "Collector PID logged"187 "Collector PID logged"
188)
189or diag(
190 'output', $output,
191 'log file', `cat $log_file 2>/dev/null`,
192 'collector', `cat $dest/*-output 2>/dev/null`,
173);193);
174194
175# ###########################################################################195# ###########################################################################
@@ -184,9 +204,9 @@
184204
185$retval = system("$trunk/bin/pt-stalk --no-collect --iterations 1 --dest $dest --variable Uptime --threshold $threshold --cycles 1 --run-time 1 --pid $pid_file -- --defaults-file=$cnf >$log_file 2>&1");205$retval = system("$trunk/bin/pt-stalk --no-collect --iterations 1 --dest $dest --variable Uptime --threshold $threshold --cycles 1 --run-time 1 --pid $pid_file -- --defaults-file=$cnf >$log_file 2>&1");
186206
187sleep 2;207PerconaTest::wait_until(sub { !-f $pid_file });
188208
189$output = `cat $log_file`;209$output = `cat $log_file 2>/dev/null`;
190like(210like(
191 $output,211 $output,
192 qr/Collect triggered/,212 qr/Collect triggered/,
@@ -196,7 +216,7 @@
196ok(216ok(
197 ! -f "$dest/*",217 ! -f "$dest/*",
198 "No files collected"218 "No files collected"
199);219) or diag(`ls -l $dest/ 2>/dev/null`);
200220
201ok(221ok(
202 PerconaTest::not_running("pt-stalk --no-collect"),222 PerconaTest::not_running("pt-stalk --no-collect"),
@@ -221,7 +241,7 @@
221 "Killed pt-stalk"241 "Killed pt-stalk"
222);242);
223243
224$output = `cat $log_file`;244$output = `cat $log_file 2>/dev/null`;
225like(245like(
226 $output,246 $output,
227 qr/Check results: Aborted_connects=|variable=Aborted_connects/,247 qr/Check results: Aborted_connects=|variable=Aborted_connects/,
@@ -238,23 +258,32 @@
238diag(`rm $log_file 2>/dev/null`);258diag(`rm $log_file 2>/dev/null`);
239diag(`rm $dest/* 2>/dev/null`);259diag(`rm $dest/* 2>/dev/null`);
240260
241$retval = system("$trunk/bin/pt-stalk --no-stalk --run-time 2 --dest $dest --prefix nostalk -- --defaults-file=$cnf >$log_file 2>&1");261$retval = system("$trunk/bin/pt-stalk --no-stalk --run-time 2 --dest $dest --prefix nostalk --pid $pid_file -- --defaults-file=$cnf >$log_file 2>&1");
242262
243PerconaTest::wait_for_files("$dest/nostalk-trigger");263PerconaTest::wait_until(sub { !-f $pid_file });
244$output = `cat $dest/nostalk-trigger`;264
265$output = `cat $dest/nostalk-trigger 2>/dev/null`;
245like(266like(
246 $output,267 $output,
247 qr/Not stalking/,268 qr/Not stalking/,
248 "Not stalking, collect triggered"269 "Not stalking, collect triggered"
270)
271or diag(
272 'dest', `ls -l $dest/ 2>/dev/null`,
273 'log_file', `cat $log_file 2>/dev/null`,
274 'collector', `cat $dest/*-output 2>/dev/null`,
249);275);
250276
251PerconaTest::wait_for_files("$dest/nostalk-hostname");277chomp($output = `grep -c '^TS' $dest/nostalk-df 2>/dev/null`);
252PerconaTest::wait_for_sh("test \$(grep -c '^TS' $dest/nostalk-df) -ge 2");
253chomp($output = `grep -c '^TS' $dest/nostalk-df`);
254is(278is(
255 $output,279 $output,
256 2,280 2,
257 "Not stalking, collect ran for --run-time"281 "Not stalking, collect ran for --run-time"
282)
283or diag(
284 'dest', `ls -l $dest/ 2>/dev/null`,
285 'log_file', `cat $log_file 2>/dev/null`,
286 'collector', `cat $dest/*-output 2>/dev/null`,
258);287);
259288
260my $vmstat = `which vmstat 2>/dev/null`;289my $vmstat = `which vmstat 2>/dev/null`;
@@ -269,14 +298,24 @@
269};298};
270299
271is(300is(
272 `cat $dest/nostalk-hostname`,301 `cat $dest/nostalk-hostname 2>/dev/null`,
273 `hostname`,302 `hostname`,
274 "Not stalking, collect gathered data"303 "Not stalking, collect gathered data"
304)
305or diag(
306 'dest', `ls -l $dest/ 2>/dev/null`,
307 'log_file', `cat $log_file 2>/dev/null`,
308 'collector', `cat $dest/*-output 2>/dev/null`,
275);309);
276310
277ok(311ok(
278 PerconaTest::not_running("pt-stalk --no-stalk"),312 PerconaTest::not_running("pt-stalk --no-stalk"),
279 "Not stalking, pt-stalk is not running"313 "Not stalking, pt-stalk is not running"
314)
315or diag(
316 'dest', `ls -l $dest/ 2>/dev/null`,
317 'log_file', `cat $log_file 2>/dev/null`,
318 'collector', `cat $dest/*-output 2>/dev/null`,
280);319);
281320
282# ############################################################################321# ############################################################################

Subscribers

People subscribed via source and target branches