pt-table-checksum has ambiguous exit status

Bug #944051 reported by Marco
This bug affects 2 people
Affects: Percona Toolkit moved to https://jira.percona.com/projects/PT
Status: Fix Released
Importance: High
Assigned to: Daniel Nichter

Bug Description

From pt-table-checksum manual: "The tool’s exit status is nonzero if any differences are found, or if any warnings or errors occur."
It would be nice to distinguish, with different status codes, errors (e.g. a skipped table) from diffs (differing table checksums). Errors may occur temporarily and don't break replica integrity, while diffs do.

tags: added: ambiguity pt-table-checksum
Brian Fraser (fraserbn)
Changed in percona-toolkit:
importance: Undecided → Wishlist
Changed in percona-toolkit:
status: New → Triaged
Ryan Lowe (ryan-a-lowe) wrote :

pt-table-sync has the following:

STATUS  MEANING
======  =======================================================
0       Success.
1       Internal error.
2       At least one table differed on the destination.
3       Combination of 1 and 2.

I'd love to see something similar on pt-table-checksum along the lines of:

STATUS  MEANING
======  =======================================================
0       Success.
1       Could not start due to PID
2       Internal error.
3       At least one table differed on the destination.

Changed in percona-toolkit:
milestone: none → 2.2.5
assignee: nobody → Daniel Nichter (daniel-nichter)
importance: Wishlist → High
status: Triaged → In Progress
summary: - pt-table-checksum: exit status ambiguous
+ pt-table-checksum has ambiguous exit status
Daniel Nichter (daniel-nichter) wrote :

pt-table-checksum has three possible exit statuses: zero, 255, and any other
value, which is a bitmask with flags for different problems.

A zero exit status indicates no errors, warnings, checksum differences,
or skipped chunks or tables.

A 255 exit status indicates a fatal error. In other words: the tool died
or crashed. The error is printed to C<STDERR>.

If the exit status is not zero or 255, then its value functions as a bitmask
with these flags:

   FLAG              BIT VALUE  MEANING
   ================  =========  ==========================================
   ALREADY_RUNNING   4          --pid file exists and the PID is running
   NO_SLAVES_FOUND   8          No replicas or cluster nodes were found
   CAUGHT_SIGNAL     16         Caught SIGHUP, SIGINT, SIGPIPE, or SIGTERM
   ERROR             32         A non-fatal error occurred
   TABLE_DIFF        512        At least one diff was found
   SKIP_CHUNK        1024       At least one chunk was skipped
   SKIP_TABLE        2048       At least one table was skipped

If any flag is set, the exit status will be non-zero. Use the bitwise C<AND>
operation to check for a particular flag. For example, if C<$exit_status & 512>
is true, then at least one diff was found.

Changed in percona-toolkit:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Daniel Nichter (daniel-nichter) wrote :

I conflated Perl exit with standard Unix exit, and the latter is limited to a single byte. So the new list is:

   ERROR             1    A non-fatal error occurred
   ALREADY_RUNNING   2    --pid file exists and the PID is running
   CAUGHT_SIGNAL     4    Caught SIGHUP, SIGINT, SIGPIPE, or SIGTERM
   NO_SLAVES_FOUND   8    No replicas or cluster nodes were found
   TABLE_DIFF        16   At least one diff was found
   SKIP_CHUNK        32   At least one chunk was skipped
   SKIP_TABLE        64   At least one table was skipped
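
For anyone wrapping the tool in a script, here is a minimal sketch of decoding the exit status with the single-byte flag values listed above. The flag values and the 255 fatal status come from this thread; the names `ExitFlag` and `decode_exit_status` are hypothetical and are not part of pt-table-checksum.

```python
from enum import IntFlag
from typing import List


class ExitFlag(IntFlag):
    """Flag values as listed in the comment above (hypothetical wrapper)."""
    ERROR = 1
    ALREADY_RUNNING = 2
    CAUGHT_SIGNAL = 4
    NO_SLAVES_FOUND = 8
    TABLE_DIFF = 16
    SKIP_CHUNK = 32
    SKIP_TABLE = 64


FATAL = 255  # the tool died or crashed; not interpreted as a bitmask


def decode_exit_status(status: int) -> List[str]:
    """Return the names of the flags set in an exit status."""
    if status == 0:
        return []
    if status == FATAL:
        return ["FATAL"]
    # Bitwise AND against each flag, in the order the flags are defined.
    return [flag.name for flag in ExitFlag if status & flag]


if __name__ == "__main__":
    # 48 = 16 + 32: a diff was found and a chunk was skipped.
    print(decode_exit_status(48))
```

The 255 check has to come before the bitmask test, since 255 has every bit set and would otherwise report all seven flags.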

Daniel Nichter (daniel-nichter) wrote :

For the record, we had a debate about this: some people say skipped chunks or tables should not be a non-zero exit, and others say they should. More people, including myself, think the latter, so we'll stay with the previous comment. My thinking is: zero exit should be a true, total "AOK"--everything worked as expected. People who commonly have skipped chunks may find this change to be a pita, as it does break backwards-compat a little, but a skipped chunk really is an indication that something didn't work right, e.g. MySQL didn't use the index for a chunk, or the chunk was too large on the slave, etc. And since it's easy enough to isolate this exit status, people can still filter it out: "zero" exit == 0 || 32 || 96 (32 | 64).

Daniel Nichter (daniel-nichter) wrote :

Correction to previous comment: "zero" exit == 0 || 32, because in versions < 2.2.4 only skipped *chunks*, not skipped tables, did not cause a non-zero exit.

I have mentioned this change in the docs: As of pt-table-checksum 2.2.5, skipped chunks cause a non-zero exit status.
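
The filtering described above ("zero" exit == 0 || 32) can be sketched by masking the ignored flag out of the status before comparing it to zero. This is only an illustration under the flag values from this thread; `is_legacy_success` is a hypothetical helper, not something the tool provides.

```python
SKIP_CHUNK = 32  # value from the flag list in this thread
SKIP_TABLE = 64
FATAL = 255      # the tool died or crashed


def is_legacy_success(status: int, ignored: int = SKIP_CHUNK) -> bool:
    """Treat a status as "zero" if only the ignored flags are set."""
    if status == FATAL:
        return False
    # Clear the ignored bits; anything left over is a real problem.
    return (status & ~ignored) == 0


if __name__ == "__main__":
    for status in (0, 32, 48, 96):
        print(status, is_legacy_success(status))
```

Passing `ignored=SKIP_CHUNK | SKIP_TABLE` would also accept 64 and 96, for callers who want to ignore skipped tables as well.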

Changed in percona-toolkit:
status: In Progress → Fix Committed
Kenny Gryp (gryp) wrote :

Daniel, I totally agree with your decision. Thanks for 'fixing'/'adding that feature'!

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-300
