pt-table-checksum doesn't honor --run-time while checking replication lag

Bug #1043438 reported by Baron Schwartz
This bug affects 1 person
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: High
Assigned to: Daniel Nichter

Bug Description

I ran pt-table-checksum against a server with badly lagging replication, using --run-time=6h so that it would start at 2am and end at 8am. Much later, well after the run time should have expired, I shut down and restarted the replica and got:

08-29T11:11:16 Fatal error checksumming table <....>: Lost connection to replica <....> while attempting to get its lag

Baron Schwartz (baron-xaprb) wrote :

This is becoming a problem because I'm ending up with dozens of pt-table-checksum instances running for many days. If the replica ever does catch up, they will all dive-bomb the server at the same time and probably interact in undesirable ways. I am making the following change in my local copy of the tool:

@@ -7048,6 +7048,7 @@
    my $master_cxn = $make_cxn->(set_vars => 1, dsn_string => shift @ARGV);
    my $master_dbh = $master_cxn->dbh(); # just for brevity
    my $master_dsn = $master_cxn->dsn(); # just for brevity
+   my $have_time;

    # ########################################################################
    # If this is not a dry run (--explain was not specified), then we're
@@ -7231,7 +7232,7 @@
       $replica_lag = new ReplicaLagWaiter(
          slaves => $slave_lag_cxns,
          max_lag => $o->get('max-lag'),
-         oktorun => sub { return $oktorun },
+         oktorun => sub { return $oktorun && $have_time->() },
          get_lag => $get_lag,
          sleep => $sleep,
       );
@@ -7334,7 +7335,6 @@
    # ########################################################################
    # Set up the run time, if any.
    # ########################################################################
-   my $have_time;
    if ( my $run_time = $o->get('run-time') ) {
       my $end = time() + $o->get('run-time');
       $have_time = sub { return time() < $end };
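To illustrate why the patched oktorun callback lets the lag wait give up, here is a minimal, self-contained sketch. It is not the tool's code: wait_for_lag is a hypothetical stand-in for the wait loop inside ReplicaLagWaiter, and the fake clock ($now, $end) is an assumption so the example finishes instantly instead of sleeping.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lag-wait loop: keep polling while the
# caller says it is ok to run and the replica is still too far behind.
sub wait_for_lag {
    my (%args) = @_;
    my $polls = 0;
    while ( $args{oktorun}->() && $args{get_lag}->() > $args{max_lag} ) {
        $polls++;
        $args{sleep}->();
    }
    return $polls;
}

my $oktorun = 1;
my $now     = 0;          # fake clock (assumption, replaces time())
my $end     = $now + 5;   # stands in for time() + --run-time
my $have_time = sub { return $now < $end };

my $polls = wait_for_lag(
    max_lag => 1,
    get_lag => sub { return 100 },   # replica never catches up
    sleep   => sub { $now++ },       # each "sleep" advances the fake clock
    # Same shape as the patched callback: stop when the run time is used up.
    oktorun => sub { return $oktorun && $have_time->() },
);

print "gave up after $polls polls\n";
```

With the original callback, sub { return $oktorun }, this loop would never exit while the replica stays behind, which matches the behavior described above.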

tags: added: wrong-behavior
Changed in percona-toolkit:
milestone: none → 2.1.5
status: New → Confirmed
Changed in percona-toolkit:
importance: Undecided → High
assignee: nobody → Daniel Nichter (daniel-nichter)
Changed in percona-toolkit:
status: Confirmed → In Progress
tags: added: run-time
removed: wrong-behavior
Daniel Nichter (daniel-nichter) wrote :

Baron, the attached branch is pretty much the same as your change. I also applied the same fix for --max-load. Want to try it on your end, since testing this kind of thing is tricky?

Changed in percona-toolkit:
status: In Progress → Fix Committed
Daniel Nichter (daniel-nichter) wrote :

Baron has left the building, so I've just gone ahead and merged this.

Brian Fraser (fraserbn)
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PT-330
