Merge lp:~percona-toolkit-dev/percona-toolkit/pxc-pt-heartbeat into lp:percona-toolkit/2.1
- pxc-pt-heartbeat
- Merge into 2.1
Status: | Merged |
---|---|
Approved by: | Daniel Nichter |
Approved revision: | 507 |
Merged at revision: | 506 |
Proposed branch: | lp:~percona-toolkit-dev/percona-toolkit/pxc-pt-heartbeat |
Merge into: | lp:percona-toolkit/2.1 |
Diff against target: | 582 lines (+512/-5), 4 files modified: bin/pt-heartbeat (+117/-4), sandbox/start-sandbox (+5/-1), sandbox/test-env (+6/-0), t/pt-heartbeat/pxc.t (+384/-0) |
To merge this branch: | bzr merge lp:~percona-toolkit-dev/percona-toolkit/pxc-pt-heartbeat |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Daniel Nichter | Approve | ||
Brian Fraser (community) | Approve | ||
Review via email: mp+139339@code.launchpad.net |
Commit message
Description of the change
Daniel Nichter (daniel-nichter) wrote:
Brian Fraser (fraserbn) wrote:
> 169 + /tmp/12345/stop >/dev/null
> 170 + /tmp/12345/start >/dev/null
>
> Is that really needed in test-env and at the end of pxc.t?
>
"Yes", but also no. For test-env, that's done so we can be sure the cluster actually has 3 members. If, after stop/update, the cluster has only 1 or 2 members, something went awfully wrong, and not catching it here means that code later down the line that stops/starts 12345 (to set up replication filters, for example) will break the sandbox.
For pxc.t, that's done to really restore the original state of the node. It's a trivial thing, but it might end up breaking tests: relay_master_
Unrealistic but easy way of seeing that it might break things: remove those two lines from pxc.t, then run this twice:
$ prove t/pt-heartbeat/
The first run will pass; the second will fail a test. So we could take it out, but we'd risk the admittedly small possibility of some unrelated test breaking later.
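To make that guard concrete, here is a minimal standalone sketch of the cluster-size check (hypothetical; the real check lives in sandbox/test-env and goes through the /tmp/12345/use wrapper, and the port and msandbox credentials are the sandbox defaults):

    use strict;
    use warnings;
    use DBI;

    # Assumes a sandbox cluster node listening on port 12345.
    my $dbh = DBI->connect(
        "DBI:mysql:host=127.0.0.1;port=12345",
        "msandbox", "msandbox",
        { RaiseError => 1 },
    );

    # SHOW STATUS returns (Variable_name, Value) rows.
    my (undef, $size) = $dbh->selectrow_array(
        "SHOW STATUS LIKE 'wsrep_cluster_size'"
    );
    die "Cluster size is not 3\n" unless defined $size && $size == 3;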
> 327 +is(
> 328 + $output,
> 329 + "0.00s [ 0.00s, 0.00s, 0.00s ]\n",
> 330 + "--monitor works"
> 331 +);
>
> A similar test for regular MySQL often fails because one of those 0.00 will be
> 0.01 or something.
>
That's preceded by $output =~ s/\d\.\d{2}/0.00/g, so it should be okay, although changing the is() to a like() might be a good idea.
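For example, the looser like() version might read like this (a sketch only; the merged test keeps the s/// normalization plus is()):

    like(
        $output,
        qr/\A\d+\.\d{2}s \[ \d+\.\d{2}s, \d+\.\d{2}s, \d+\.\d{2}s \]\n\z/,
        "--monitor works"
    );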
> 333 +# Try to generate some lag between cluster nodes. Rather brittle at
> the moment.
>
> I think we can remove code related to that because it's going to be quite slow
> and, as you note, brittle.
>
Bit of a sunk-cost argument here, but I spent quite a bit of time making that test work, so I'm going to argue against this. I ran pxc.t in a while-true loop for two hours, and that test never failed, so I think it's okay -- the comment was written against a version of pxc.t that only reloaded sakila, but combining it with the alter-active-table code seems to have hardened it quite a bit. I think that adding a /m to the regex would make it even more resilient.
Also, it's actually pretty fast: only the 5 seconds that --monitor is told to run, since the rest runs in the background.
If anything, can we try keeping it until it starts causing trouble?
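For reference, this is the lag assertion with /m applied, as it appears in the preview diff below; the ^ anchor can then match any line of the buffered --monitor output, not just the first:

    like(
        $output,
        qr/^(?:0\.(?:\d[1-9]|[1-9]\d)|\d*[1-9]\d*\.\d{2})s\s+\[/m,
        "pt-heartbeat can detect replication lag between nodes"
    );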
> 518 +diag(`$trunk/bin/pt-heartbeat --stop >/dev/null`);
> 519 +sleep 1;
>
> Looks like a timing-related failure waiting to happen.
>
Definite yes here. I could increase the sleep time to something much bigger, like 15, since the only thing that part needs to verify is that --stop actually works on instances of pt-heartbeat running on PXC, not that it stops them quickly.
The other option is doing away with those tests and just wait_until(
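A sketch of that polling approach, which is what stop_all_instances() in the merged pxc.t ends up doing (assuming the @pidfiles bookkeeping the test already keeps):

    # Signal all pt-heartbeat --update instances to exit, then wait for
    # each one to remove its pidfile instead of sleeping a fixed time.
    diag(`$trunk/bin/pt-heartbeat --stop >/dev/null`);
    PerconaTest::wait_until( sub { !-e $_ } ) for @pidfiles;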
> Need a "Percona XtraDB Cluster" section in pt-heartbeat mentioning what we talked
> about on IRC: that a cluster is as fast as the slowest node, but pt-heartbeat
> doesn't really report the cluster's overall lag. Although --monitor processes on
> various nodes can reveal how fast events are replicating to that node from the
> --update process. etc. etc.
Daniel Nichter (daniel-nichter) wrote:
On Dec 11, 2012, at 9:47 PM, Brian Fraser wrote:
>> 169 + /tmp/12345/stop >/dev/null
>> 170 + /tmp/12345/start >/dev/null
>>
>> Is that really needed in test-env and at the end of pxc.t?
>>
>
> "Yes", but also no. For test-env, that's done so we can be sure the cluster actually has 3 members. If, after stop/update, the cluster has only 1 or 2 members, something went awfully wrong, and not catching it here means that code later down the line that stops/starts 12345 (to set up replication filters, for example) will break the sandbox.
> For pxc.t, that's done to really restore the original state of the node. It's a trivial thing, but it might end up breaking tests: relay_master_
>
> Unrealistic but easy way of seeing that it might break things: remove those two lines from pxc.t, then run this twice:
>
> $ prove t/pt-heartbeat/
>
> The first run will pass; the second will fail a test. So we could take it out, but we'd risk the admittedly small possibility of some unrelated test breaking later.
Ok, we'll keep them in for now. I just hate to add any more delays to the test suite.
>> 327 +is(
>> 328 + $output,
>> 329 + "0.00s [ 0.00s, 0.00s, 0.00s ]\n",
>> 330 + "--monitor works"
>> 331 +);
>>
>> A similar test for regular MySQL often fails because one of those 0.00 will be
>> 0.01 or something.
>>
>
> That's preceded by $output =~ s/\d\.\d{2}/0.00/g, so it should be okay, although changing the is() to a like() might be a good idea.
Ok, that should work better then.
>> 333 +# Try to generate some lag between cluster nodes. Rather brittle at
>> the moment.
>>
>> I think we can remove code related to that because it's going to be quite slow
>> and, as you note, brittle.
>>
>
> Bit of a sunk-cost argument here, but I spent quite a bit of time making that test work, so I'm going to argue against this. I ran pxc.t in a while-true loop for two hours, and that test never failed, so I think it's okay -- the comment was written against a version of pxc.t that only reloaded sakila, but combining it with the alter-active-table code seems to have hardened it quite a bit. I think that adding a /m to the regex would make it even more resilient.
> Also, it's actually pretty fast: only the 5 seconds that --monitor is told to run, since the rest runs in the background.
> If anything, can we try keeping it until it starts causing trouble?
Alright, we'll keep it in until/if reasons arise to remove or change it.
>> 518 +diag(`$trunk/bin/pt-heartbeat --stop >/dev/null`);
>> 519 +sleep 1;
>>
>> Looks like a timing-related failure waiting to happen.
>>
>
> Definite yes here. I could increase the sleep time to something much bigger, like 15, since the only thing that part needs to verify is that --stop actually works on instances of pt-heartbeat running on PXC, not that it stops them quickly.
> The other option is doing away with those tests and just wait_until(
Brian Fraser (fraserbn) wrote:
Fixed & documented.
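The merged docs require --master-server-id for --monitor and --check instances on cluster nodes; a hypothetical invocation between sandbox nodes (ports 12345-12347 as set up by test-env) would look like:

    $ pt-heartbeat -h 127.0.0.1 -P 12346 -u msandbox -p msandbox \
        -D test --table heartbeat --monitor --master-server-id 12345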
Brian Fraser (fraserbn):
- 507. By Daniel Nichter: Tweak Percona XtraDB Cluster docs a little.
Daniel Nichter (daniel-nichter):
Preview Diff
1 | === modified file 'bin/pt-heartbeat' |
2 | --- bin/pt-heartbeat 2012-12-03 03:48:11 +0000 |
3 | +++ bin/pt-heartbeat 2012-12-14 00:42:20 +0000 |
4 | @@ -20,6 +20,7 @@ |
5 | Daemon |
6 | Quoter |
7 | TableParser |
8 | + Retry |
9 | Transformers |
10 | VersionCheck |
11 | HTTPMicro |
12 | @@ -2921,6 +2922,84 @@ |
13 | # ########################################################################### |
14 | |
15 | # ########################################################################### |
16 | +# Retry package |
17 | +# This package is a copy without comments from the original. The original |
18 | +# with comments and its test file can be found in the Bazaar repository at, |
19 | +# lib/Retry.pm |
20 | +# t/lib/Retry.t |
21 | +# See https://launchpad.net/percona-toolkit for more information. |
22 | +# ########################################################################### |
23 | +{ |
24 | +package Retry; |
25 | + |
26 | +use strict; |
27 | +use warnings FATAL => 'all'; |
28 | +use English qw(-no_match_vars); |
29 | +use constant PTDEBUG => $ENV{PTDEBUG} || 0; |
30 | + |
31 | +sub new { |
32 | + my ( $class, %args ) = @_; |
33 | + my $self = { |
34 | + %args, |
35 | + }; |
36 | + return bless $self, $class; |
37 | +} |
38 | + |
39 | +sub retry { |
40 | + my ( $self, %args ) = @_; |
41 | + my @required_args = qw(try fail final_fail); |
42 | + foreach my $arg ( @required_args ) { |
43 | + die "I need a $arg argument" unless $args{$arg}; |
44 | + }; |
45 | + my ($try, $fail, $final_fail) = @args{@required_args}; |
46 | + my $wait = $args{wait} || sub { sleep 1; }; |
47 | + my $tries = $args{tries} || 3; |
48 | + |
49 | + my $last_error; |
50 | + my $tryno = 0; |
51 | + TRY: |
52 | + while ( ++$tryno <= $tries ) { |
53 | + PTDEBUG && _d("Try", $tryno, "of", $tries); |
54 | + my $result; |
55 | + eval { |
56 | + $result = $try->(tryno=>$tryno); |
57 | + }; |
58 | + if ( $EVAL_ERROR ) { |
59 | + PTDEBUG && _d("Try code failed:", $EVAL_ERROR); |
60 | + $last_error = $EVAL_ERROR; |
61 | + |
62 | + if ( $tryno < $tries ) { # more retries |
63 | + my $retry = $fail->(tryno=>$tryno, error=>$last_error); |
64 | + last TRY unless $retry; |
65 | + PTDEBUG && _d("Calling wait code"); |
66 | + $wait->(tryno=>$tryno); |
67 | + } |
68 | + } |
69 | + else { |
70 | + PTDEBUG && _d("Try code succeeded"); |
71 | + return $result; |
72 | + } |
73 | + } |
74 | + |
75 | + PTDEBUG && _d('Try code did not succeed'); |
76 | + return $final_fail->(error=>$last_error); |
77 | +} |
78 | + |
79 | +sub _d { |
80 | + my ($package, undef, $line) = caller 0; |
81 | + @_ = map { (my $temp = $_) =~ s/\n/\n# /g; $temp; } |
82 | + map { defined $_ ? $_ : 'undef' } |
83 | + @_; |
84 | + print STDERR "# $package:$line $PID ", join(' ', @_), "\n"; |
85 | +} |
86 | + |
87 | +1; |
88 | +} |
89 | +# ########################################################################### |
90 | +# End Retry package |
91 | +# ########################################################################### |
92 | + |
93 | +# ########################################################################### |
94 | # Transformers package |
95 | # This package is a copy without comments from the original. The original |
96 | # with comments and its test file can be found in the Bazaar repository at, |
97 | @@ -4920,10 +4999,31 @@ |
98 | } |
99 | } |
100 | |
101 | - $sth->execute(ts(time), @vals); |
102 | - PTDEBUG && _d($sth->{Statement}); |
103 | - $sth->finish(); |
104 | - |
105 | + my $retry = Retry->new(); |
106 | + $retry->retry( |
107 | + tries => 3, |
108 | + wait => sub { sleep 0.25; return; }, |
109 | + try => sub { |
110 | + $sth->execute(ts(time), @vals); |
111 | + PTDEBUG && _d($sth->{Statement}); |
112 | + $sth->finish(); |
113 | + }, |
114 | + fail => sub { |
115 | + my (%args) = @_; |
116 | + my $error = $args{error}; |
117 | + if ( $error =~ m/Deadlock found/ ) { |
118 | + return 1; # try again |
119 | + } |
120 | + else { |
121 | + return 0; |
122 | + } |
123 | + }, |
124 | + final_fail => sub { |
125 | + my (%args) = @_; |
126 | + die $args{error}; |
127 | + } |
128 | + ); |
129 | + |
130 | return; |
131 | }; |
132 | } |
133 | @@ -5387,6 +5487,19 @@ |
134 | columns are optional. If any are present, their corresponding information |
135 | will be saved. |
136 | |
137 | +=head1 Percona XtraDB Cluster |
138 | + |
139 | +Although pt-heartbeat should work with all supported versions of Percona XtraDB |
140 | +Cluster (PXC), we recommend using 5.5.28-23.7 and newer. |
141 | + |
142 | +If you are setting up heartbeat instances between cluster nodes, keep in mind |
143 | +that, since the speed of the cluster is determined by its slowest node, |
144 | +pt-heartbeat will not report how fast the cluster itself is, but only how |
145 | +fast events are replicating from one node to another. |
146 | + |
147 | +You must specify L<"--master-server-id"> for L<"--monitor"> and L<"--check"> |
148 | +instances. |
149 | + |
150 | =head1 OPTIONS |
151 | |
152 | Specify at least one of L<"--stop">, L<"--update">, L<"--monitor">, or L<"--check">. |
153 | |
154 | === modified file 'sandbox/start-sandbox' |
155 | --- sandbox/start-sandbox 2012-11-16 19:08:49 +0000 |
156 | +++ sandbox/start-sandbox 2012-12-14 00:42:20 +0000 |
157 | @@ -52,6 +52,10 @@ |
158 | if [ -n "${master_port}" ]; then |
159 | local master_listen_port=$(($master_port + 10)) |
160 | cluster_address="gcomm://$ip:$master_listen_port" |
161 | + |
162 | + local this_listen_port=$(($port + 10)) |
163 | + local this_cluster_address="gcomm://$ip:$this_listen_port" |
164 | + sed -e "s!gcomm://\$!$this_cluster_address!g" -i.bak "/tmp/$master_port/my.sandbox.cnf" |
165 | fi |
166 | |
167 | sed -e "s/ADDR/$ip/g" -i.bak "/tmp/$port/my.sandbox.cnf" |
168 | @@ -118,7 +122,7 @@ |
169 | debug_sandbox $port |
170 | exit 1 |
171 | fi |
172 | - |
173 | + |
174 | # If the sandbox is a slave, start the slave. |
175 | if [ "$type" = "slave" ]; then |
176 | /tmp/$port/use -e "change master to master_host='127.0.0.1', master_user='msandbox', master_password='msandbox', master_port=$master_port" |
177 | |
178 | === modified file 'sandbox/test-env' |
179 | --- sandbox/test-env 2012-12-03 20:06:47 +0000 |
180 | +++ sandbox/test-env 2012-12-14 00:42:20 +0000 |
181 | @@ -299,6 +299,12 @@ |
182 | exit_status=$((exit_status | $?)) |
183 | |
184 | if [ "${2:-""}" = "cluster" ]; then |
185 | + # Bit of magic here. 'start-sandbox cluster new_node old_node' |
186 | + # changes old_node's my.sandbox.cnf's wsrep_cluster_address to |
187 | + # point to new_node. This is especially useful because otherwise, |
188 | + # calling stop/start like below on 12345 would create a new cluster. |
189 | + /tmp/12345/stop >/dev/null |
190 | + /tmp/12345/start >/dev/null |
191 | echo -n "Checking that the cluster size is correct... " |
192 | size=$(/tmp/12345/use -ss -e "SHOW STATUS LIKE 'wsrep_cluster_size'" | awk '{print $2}') |
193 | if [ ${size:-0} -ne 3 ]; then |
194 | |
195 | === added file 't/pt-heartbeat/pxc.t' |
196 | --- t/pt-heartbeat/pxc.t 1970-01-01 00:00:00 +0000 |
197 | +++ t/pt-heartbeat/pxc.t 2012-12-14 00:42:20 +0000 |
198 | @@ -0,0 +1,384 @@ |
199 | +#!/usr/bin/env perl |
200 | + |
201 | +BEGIN { |
202 | + die "The PERCONA_TOOLKIT_BRANCH environment variable is not set.\n" |
203 | + unless $ENV{PERCONA_TOOLKIT_BRANCH} && -d $ENV{PERCONA_TOOLKIT_BRANCH}; |
204 | + unshift @INC, "$ENV{PERCONA_TOOLKIT_BRANCH}/lib"; |
205 | +}; |
206 | + |
207 | +use strict; |
208 | +use warnings FATAL => 'all'; |
209 | +use English qw(-no_match_vars); |
210 | +use Test::More; |
211 | +use Data::Dumper; |
212 | + |
213 | +use File::Temp qw(tempfile); |
214 | + |
215 | +use PerconaTest; |
216 | +use Sandbox; |
217 | + |
218 | +require "$trunk/bin/pt-heartbeat"; |
219 | +# Do this after requiring pt-hb, since it uses Mo |
220 | +require VersionParser; |
221 | + |
222 | +my $dp = new DSNParser(opts=>$dsn_opts); |
223 | +my $sb = new Sandbox(basedir => '/tmp', DSNParser => $dp); |
224 | +my $node1 = $sb->get_dbh_for('node1'); |
225 | +my $node2 = $sb->get_dbh_for('node2'); |
226 | +my $node3 = $sb->get_dbh_for('node3'); |
227 | + |
228 | +if ( !$node1 ) { |
229 | + plan skip_all => 'Cannot connect to cluster node1'; |
230 | +} |
231 | +elsif ( !$node2 ) { |
232 | + plan skip_all => 'Cannot connect to cluster node2'; |
233 | +} |
234 | +elsif ( !$node3 ) { |
235 | + plan skip_all => 'Cannot connect to cluster node3'; |
236 | +} |
237 | + |
238 | +my $db_flavor = VersionParser->new($node1)->flavor(); |
239 | +if ( $db_flavor !~ /XtraDB Cluster/ ) { |
240 | + plan skip_all => "PXC tests"; |
241 | +} |
242 | + |
243 | +my $node1_dsn = $sb->dsn_for('node1'); |
244 | +my $node2_dsn = $sb->dsn_for('node2'); |
245 | +my $node3_dsn = $sb->dsn_for('node3'); |
246 | +my $node1_port = $sb->port_for('node1'); |
247 | +my $node2_port = $sb->port_for('node2'); |
248 | +my $node3_port = $sb->port_for('node3'); |
249 | + |
250 | +my $output; |
251 | +my $exit; |
252 | +my $base_pidfile = (tempfile("/tmp/pt-heartbeat-test.XXXXXXXX", OPEN => 0, UNLINK => 0))[1]; |
253 | +my $sample = "t/pt-heartbeat/samples/"; |
254 | + |
255 | +my $sentinel = '/tmp/pt-heartbeat-sentinel'; |
256 | + |
257 | +diag(`rm -rf $sentinel >/dev/null 2>&1`); |
258 | +$sb->create_dbs($node1, ['test']); |
259 | + |
260 | +my @exec_pids; |
261 | +my @pidfiles; |
262 | + |
263 | +sub start_update_instance { |
264 | + my ($port) = @_; |
265 | + my $pidfile = "$base_pidfile.$port.pid"; |
266 | + push @pidfiles, $pidfile; |
267 | + |
268 | + my $pid = fork(); |
269 | + die "Cannot fork: $OS_ERROR" unless defined $pid; |
270 | + if ( $pid == 0 ) { |
271 | + my $cmd = "$trunk/bin/pt-heartbeat"; |
272 | + exec { $cmd } $cmd, qw(-h 127.0.0.1 -u msandbox -p msandbox -P), $port, |
273 | + qw(--database test --table heartbeat --create-table), |
274 | + qw(--update --interval 0.5 --pid), $pidfile; |
275 | + exit 1; |
276 | + } |
277 | + push @exec_pids, $pid; |
278 | + |
279 | + PerconaTest::wait_for_files($pidfile); |
280 | + ok( |
281 | + -f $pidfile, |
282 | + "--update on $port started" |
283 | + ); |
284 | +} |
285 | + |
286 | +sub stop_all_instances { |
287 | + my @pids = @exec_pids, map { chomp; $_ } map { slurp_file($_) } @pidfiles; |
288 | + diag(`$trunk/bin/pt-heartbeat --stop >/dev/null`); |
289 | + |
290 | + waitpid($_, 0) for @pids; |
291 | + PerconaTest::wait_until(sub{ !-e $_ }) for @pidfiles; |
292 | + |
293 | + unlink $sentinel; |
294 | +} |
295 | + |
296 | +foreach my $port ( map { $sb->port_for($_) } qw(node1 node2 node3) ) { |
297 | + start_update_instance($port); |
298 | +} |
299 | + |
300 | +# ############################################################################# |
301 | +# Basic cluster tests |
302 | +# ############################################################################# |
303 | + |
304 | +my $rows = $node1->selectall_hashref("select * from test.heartbeat", 'server_id'); |
305 | + |
306 | +is( |
307 | + scalar keys %$rows, |
308 | + 3, |
309 | + "Sanity check: All nodes are in the heartbeat table" |
310 | +); |
311 | + |
312 | +my $only_slave_data = { |
313 | + map { |
314 | + $_ => { |
315 | + relay_master_log_file => $rows->{$_}->{relay_master_log_file}, |
316 | + exec_master_log_pos => $rows->{$_}->{exec_master_log_pos}, |
317 | + } } keys %$rows |
318 | +}; |
319 | + |
320 | +my $same_data = { relay_master_log_file => undef, exec_master_log_pos => undef }; |
321 | +is_deeply( |
322 | + $only_slave_data, |
323 | + { |
324 | + 12345 => $same_data, |
325 | + 12346 => $same_data, |
326 | + 12347 => $same_data, |
327 | + }, |
328 | + "Sanity check: No slave data (relay log or master pos) is stored" |
329 | +); |
330 | + |
331 | +$output = output(sub{ |
332 | + pt_heartbeat::main($node1_dsn, qw(-D test --check)), |
333 | + }, |
334 | + stderr => 1, |
335 | +); |
336 | + |
337 | +like( |
338 | + $output, |
339 | + qr/\QThe --master-server-id option must be specified because the heartbeat table `test`.`heartbeat`/, |
340 | + "pt-heartbeat --check + PXC doesn't autodetect a master if there isn't any" |
341 | +); |
342 | + |
343 | +$output = output(sub{ |
344 | + pt_heartbeat::main($node1_dsn, qw(-D test --check), |
345 | + '--master-server-id', $node3_port), |
346 | + }, |
347 | + stderr => 1, |
348 | +); |
349 | + |
350 | +$output =~ s/\d\.\d{2}/0.00/g; |
351 | +is( |
352 | + $output, |
353 | + "0.00\n", |
354 | + "pt-heartbeat --check + PXC works with --master-server-id" |
355 | +); |
356 | + |
357 | +# Test --monitor |
358 | + |
359 | +$output = output(sub { |
360 | + pt_heartbeat::main($node1_dsn, |
361 | + qw(-D test --monitor --run-time 1s), |
362 | + '--master-server-id', $node3_port) |
363 | + }, |
364 | + stderr => 1, |
365 | +); |
366 | + |
367 | +$output =~ s/\d\.\d{2}/0.00/g; |
368 | +is( |
369 | + $output, |
370 | + "0.00s [ 0.00s, 0.00s, 0.00s ]\n", |
371 | + "--monitor works" |
372 | +); |
373 | + |
374 | +# Try to generate some lag between cluster nodes. Rather brittle at the moment. |
375 | + |
376 | +# Lifted from alter active table |
377 | +my $pt_osc_sample = "t/pt-online-schema-change/samples"; |
378 | + |
379 | +my $query_table_stop = "/tmp/query_table.$PID.stop"; |
380 | +my $query_table_pid = "/tmp/query_table.$PID.pid"; |
381 | +my $query_table_output = "/tmp/query_table.$PID.output"; |
382 | + |
383 | +$sb->create_dbs($node1, ['pt_osc']); |
384 | +$sb->load_file('master', "$pt_osc_sample/basic_no_fks_innodb.sql"); |
385 | + |
386 | +$node1->do("USE pt_osc"); |
387 | +$node1->do("TRUNCATE TABLE t"); |
388 | +$node1->do("LOAD DATA INFILE '$trunk/$pt_osc_sample/basic_no_fks.data' INTO TABLE t"); |
389 | +$node1->do("ANALYZE TABLE t"); |
390 | +$sb->wait_for_slaves(); |
391 | + |
392 | +diag(`rm -rf $query_table_stop`); |
393 | +diag(`echo > $query_table_output`); |
394 | + |
395 | +my $cmd = "$trunk/$pt_osc_sample/query_table.pl"; |
396 | +system("$cmd 127.0.0.1 $node1_port pt_osc t id $query_table_stop $query_table_pid >$query_table_output 2>&1 &"); |
397 | +wait_until(sub{-e $query_table_pid}); |
398 | + |
399 | +# Reload sakila |
400 | +system "$trunk/sandbox/load-sakila-db $node1_port &"; |
401 | + |
402 | +$output = output(sub { |
403 | + pt_heartbeat::main($node3_dsn, |
404 | + qw(-D test --monitor --run-time 5s), |
405 | + '--master-server-id', $node1_port) |
406 | + }, |
407 | + stderr => 1, |
408 | +); |
409 | + |
410 | +like( |
411 | + $output, |
412 | + qr/^(?:0\.(?:\d[1-9]|[1-9]\d)|\d*[1-9]\d*\.\d{2})s\s+\[/m, |
413 | + "pt-heartbeat can detect replication lag between nodes" |
414 | +); |
415 | + |
416 | +diag(`touch $query_table_stop`); |
417 | +chomp(my $p = slurp_file($query_table_pid)); |
418 | +wait_until(sub{!kill 0, $p}); |
419 | + |
420 | +$node1->do(q{DROP DATABASE pt_osc}); |
421 | + |
422 | +$sb->wait_for_slaves(); |
423 | + |
424 | +# ############################################################################# |
425 | +# cluster, node1 -> slave, run on node1 |
426 | +# ############################################################################# |
427 | + |
428 | +my ($slave_dbh, $slave_dsn) = $sb->start_sandbox( |
429 | + server => 'cslave1', |
430 | + type => 'slave', |
431 | + master => 'node1', |
432 | + env => q/BINLOG_FORMAT="ROW"/, |
433 | +); |
434 | + |
435 | +$sb->create_dbs($slave_dbh, ['test']); |
436 | + |
437 | +start_update_instance($sb->port_for('cslave1')); |
438 | +PerconaTest::wait_for_table($slave_dbh, "test.heartbeat", "1=1"); |
439 | + |
440 | +$output = output(sub{ |
441 | + pt_heartbeat::main($slave_dsn, qw(-D test --check)), |
442 | + }, |
443 | + stderr => 1, |
444 | +); |
445 | + |
446 | +like( |
447 | + $output, |
448 | + qr/\d\.\d{2}\n/, |
449 | + "pt-heartbeat --check works on a slave of a cluster node" |
450 | +); |
451 | + |
452 | +$output = output(sub { |
453 | + pt_heartbeat::main($slave_dsn, |
454 | + qw(-D test --monitor --run-time 2s)) |
455 | + }, |
456 | + stderr => 1, |
457 | +); |
458 | + |
459 | +like( |
460 | + $output, |
461 | + qr/^\d.\d{2}s\s+\[/, |
462 | + "pt-heartbeat --monitor + slave of a node1, without --master-server-id" |
463 | +); |
464 | + |
465 | +$output = output(sub { |
466 | + pt_heartbeat::main($slave_dsn, |
467 | + qw(-D test --monitor --run-time 2s), |
468 | + '--master-server-id', $node3_port) |
469 | + }, |
470 | + stderr => 1, |
471 | +); |
472 | + |
473 | +like( |
474 | + $output, |
475 | + qr/^\d.\d{2}s\s+\[/, |
476 | + "pt-heartbeat --monitor + slave of node1, --master-server-id pointing to node3" |
477 | +); |
478 | + |
479 | +# ############################################################################# |
480 | +# master -> node1 in cluster |
481 | +# ############################################################################# |
482 | + |
483 | +# CAREFUL! See the comments in t/pt-table-checksum/pxc.t about cmaster. |
484 | +# Nearly everything applies here. |
485 | + |
486 | +my ($master_dbh, $master_dsn) = $sb->start_sandbox( |
487 | + server => 'cmaster', |
488 | + type => 'master', |
489 | + env => q/BINLOG_FORMAT="ROW"/, |
490 | +); |
491 | + |
492 | +my $cmaster_port = $sb->port_for('cmaster'); |
493 | + |
494 | +$sb->create_dbs($master_dbh, ['test']); |
495 | + |
496 | +$master_dbh->do("FLUSH LOGS"); |
497 | +$master_dbh->do("RESET MASTER"); |
498 | + |
499 | +$sb->set_as_slave('node1', 'cmaster'); |
500 | + |
501 | +start_update_instance($sb->port_for('cmaster')); |
502 | +PerconaTest::wait_for_table($node1, "test.heartbeat", "server_id=$cmaster_port"); |
503 | + |
504 | +$output = output(sub{ |
505 | + pt_heartbeat::main($node1_dsn, qw(-D test --check --print-master-server-id)), |
506 | + }, |
507 | + stderr => 1, |
508 | +); |
509 | + |
510 | +like( |
511 | + $output, |
512 | + qr/^\d.\d{2} $cmaster_port$/, |
513 | + "--print-master-id works for master -> $node1_port, when run from $node1_port" |
514 | +); |
515 | + |
516 | +# Wait until node2 & node3 get cmaster in their heartbeat tables |
517 | +$sb->wait_for_slaves(master => 'node1', slave => 'node2'); |
518 | +$sb->wait_for_slaves(master => 'node1', slave => 'node3'); |
519 | + |
520 | +foreach my $test ( |
521 | + [ $node2_port, $node2_dsn, $node2 ], |
522 | + [ $node3_port, $node3_dsn, $node3 ], |
523 | +) { |
524 | + my ($port, $dsn, $dbh) = @$test; |
525 | + |
526 | + $output = output(sub{ |
527 | + pt_heartbeat::main($dsn, qw(-D test --check --print-master-server-id)), |
528 | + }, |
529 | + stderr => 1, |
530 | + ); |
531 | + |
532 | + # This could be made to work, see the node autodiscovery branch |
533 | + TODO: { |
534 | + local $::TODO = "cmaster -> node1, other nodes can't autodetect the master"; |
535 | + like( |
536 | + $output, |
537 | + qr/$cmaster_port/, |
538 | + "--print-master-id works for master -> $node1_port, when run from $port" |
539 | + ); |
540 | + } |
541 | + |
542 | + $output = output(sub{ |
543 | + pt_heartbeat::main($dsn, qw(-D test --check --master-server-id), $cmaster_port), |
544 | + }, |
545 | + stderr => 1, |
546 | + ); |
547 | + |
548 | + $output =~ s/\d\.\d{2}/0.00/g; |
549 | + is( |
550 | + $output, |
551 | + "0.00\n", |
552 | + "--check + explicit --master-server-id work for master -> node1, run from $port" |
553 | + ); |
554 | +} |
555 | + |
556 | +# ############################################################################ |
557 | +# Stop the --update instances. |
558 | +# ############################################################################ |
559 | + |
560 | +stop_all_instances(); |
561 | + |
562 | +# ############################################################################ |
563 | +# Disconnect & stop the two servers we started |
564 | +# ############################################################################ |
565 | + |
566 | +# We have to do this after the --stop, otherwise the --update processes will |
567 | +# spew a bunch of warnings and clog |
568 | + |
569 | +$slave_dbh->disconnect; |
570 | +$master_dbh->disconnect; |
571 | +$sb->stop_sandbox('cslave1', 'cmaster'); |
572 | +$node1->do("STOP SLAVE"); |
573 | +$node1->do("RESET SLAVE"); |
574 | + |
575 | +# ############################################################################# |
576 | +# Done. |
577 | +# ############################################################################# |
578 | +$sb->wipe_clean($node1); |
579 | +diag(`/tmp/12345/stop`); |
580 | +diag(`/tmp/12345/start`); |
581 | +ok($sb->ok(), "Sandbox servers") or BAIL_OUT(__FILE__ . " broke the sandbox"); |
582 | +done_testing; |