Code review comment for lp:~percona-toolkit-dev/percona-toolkit/pxc-pt-heartbeat

Revision history for this message
Brian Fraser (fraserbn) wrote :

> 169 + /tmp/12345/stop >/dev/null
> 170 + /tmp/12345/start >/dev/null
>
> Is that really needed in test-env and at the end of pxt.t?
>

"Yes", but also no. For test-env, that's done so that we can be actually sure that the cluster will have 3 members. If after stop/update the cluster has 1 or 2 members, something went awfully wrong. And not catching this here means that code later down the line stop sop/start on 12345 (to set up replication filters, for example) will break the sandbox.
For pxc.t, that's done to truly restore the original state of the node. It's a trivial thing, but it might end up breaking tests: relay_master_log_file & exec_master_log_pos are both NULL before running the file, but become 0 after turning cmaster on & off and calling RESET SLAVE. stop/start returns them to NULL.

Unrealistic but easy way of seeing that it might break things: remove those two lines from pxc.t, then run

$ prove t/pt-heartbeat/pxc.t t/pt-heartbeat/pxc.t

The first run will pass; the second will fail a test. So we could take it out, but we'd risk the admittedly small possibility of some unrelated test breaking later.

> 327 +is(
> 328 + $output,
> 329 + "0.00s [ 0.00s, 0.00s, 0.00s ]\n",
> 330 + "--monitor works"
> 331 +);
>
> A similar test for regular MySQL often fails because one of those 0.00 will be
> 0.01 or something.
>

That's preceded by $output =~ s/\d\.\d{2}/0.00/g, so it should be okay, although changing the is() to a like() might be a good idea.
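For instance, the like() variant could anchor the normalized line while tolerating surrounding output (a sketch; the pattern is illustrative and the names come from the quoted test):

```perl
use Test::More;

# Collapse all timing values first, as the test already does:
$output =~ s/\d\.\d{2}/0.00/g;

# like() with an anchored, multiline pattern is more forgiving than an
# exact is() if extra lines or whitespace ever sneak into the output:
like(
    $output,
    qr/^0\.00s \[ 0\.00s, 0\.00s, 0\.00s \]$/m,
    "--monitor works"
);
```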

> 333 +# Try to generate some lag between cluster nodes. Rather brittle at
> the moment.
>
> I think we can remove code related to that because it's going to be quite slow
> and, as you note, brittle.
>

Bit of a sunk-cost argument here, but I spent quite a bit of time making that test work, so I'm going to argue against this. I actually ran pxc.t in a while-true loop for two hours, and that test never failed. So I think it's okay -- the comment was written against a version of pxc.t that only reloaded sakila, but combining that with the active ALTER code seems to have hardened it quite a bit. I think that adding a /m to the regex would make it even more resilient.
Also, it's actually pretty fast: only the 5 seconds that --monitor is told to run, since the rest runs in the background.
If anything, can we try keeping it until it starts causing trouble?

> 518 +diag(`$trunk/bin/pt-heartbeat --stop >/dev/null`);
> 519 +sleep 1;
>
> Looks like a timing-related failure waiting to happen.
>

Definite yes here. I could increase the sleep to something much bigger, like 15 seconds, since the only thing that interests us in that part is that --stop actually works on instances of pt-heartbeat running on PXC, not that it stops them quickly.
The other option is doing away with those tests and just using wait_until(sub { !kill 0, $pid }) for @pids, but then we might hang, or not be quite sure that --stop really works.
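A middle ground might be polling kill 0 ourselves with a deadline, so a broken --stop fails loudly instead of hanging (a sketch; @pids is assumed to hold the pt-heartbeat PIDs, and the 30-second cap is arbitrary):

```perl
use Test::More;

my $deadline = time() + 30;    # arbitrary upper bound, instead of hanging forever
my @running  = @pids;
while ( @running && time() < $deadline ) {
    # kill 0 sends no signal; it only checks whether the PID still exists.
    @running = grep { kill(0, $_) } @running;
    sleep 1 if @running;
}
ok(
    !@running,
    "--stop killed every pt-heartbeat instance"
) or diag("Still running after 30s: @running");
```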

> Need a "Percona XtraDB Cluster" section in pt-heartbeat mention what we talked
> about on IRC: that a cluster is as fast as the slowest node, but pt-heartbeat
> doesn't really report the cluster's after lag. Although --monitor processes
> on various nodes can reveal how fast events are replicating to that node from
> the --update process. etc. etc.

Aye. I haven't done this yet because it ties in with the timezone issue I mentioned on IRC, so let's discuss it later.
