pt-table-checksum + PXC inconsistent results upon --resume

Bug #1311654 reported by Aurimas Mikalauskas
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: Medium
Assigned to: Frank Cizmich

Bug Description

If I interrupt and then resume a pt-table-checksum run that is checking two PXC nodes, about 20-30% of the time I get an incorrect result: a checksum mismatch. This is easily reproducible with small tables. Here's the command I am running:

/usr/bin/pt-table-checksum \
                        --recursion-method cluster \
                        --user $USER \
                        --password $PASSWORD \
                        --max-load Threads_running=$MAXTHREADS \
                        --progress time,3600 \
                        --chunk-size-limit 4 \
                        --pid $PID \
                        --databases db1,db2

The PTDEBUG output contains sensitive customer information, so it will be sent privately.

Daniel's hack, which adds an extra 1.5s delay before checking for the last chunk, reduced the mismatches to zero. However, we were testing with very small tables, so the waits added a lot of overhead, and I am guessing that in most cases I interrupted pt-table-checksum while it was waiting.

Tested with pt-table-checksum 2.2.7 and Percona XtraDB Cluster, Release 31.1, wsrep_25.9.r3928 (5.5.34-31.1)

tags: added: pt-table-checksum
Changed in percona-toolkit:
importance: Undecided → Medium
status: New → Incomplete
status: Incomplete → Fix Committed
milestone: none → 2.2.10
assignee: nobody → Frank Cizmich (frank-cizmich)
Frank Cizmich (frank-cizmich) wrote :

Discrepant table checksums are now re-checked a number of times at short intervals before they are reported as real differences.
This strategy does not add significant time to the overall run, since differences are usually rare and the re-check is done at most once per table.
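
The re-check strategy described above can be illustrated with a minimal shell sketch. This is not the code in pt-table-checksum (which is written in Perl); the function name recheck_diff, its arguments, and the idea of passing the comparison as a command are all assumptions made for illustration. It reports a difference only if every attempt still sees one.

```shell
# Hypothetical sketch, NOT pt-table-checksum's actual implementation.
# Usage: recheck_diff RETRIES INTERVAL CMD [ARGS...]
# CMD is assumed to exit 0 when the checksums match and non-zero when they differ.
# Returns 0 as soon as one attempt matches, 1 if all RETRIES attempts differ.
recheck_diff() {
  _retries=$1
  _interval=$2
  shift 2
  _attempt=1
  while [ "$_attempt" -le "$_retries" ]; do
    if "$@"; then
      return 0    # checksums agree, so the earlier mismatch was spurious
    fi
    # Sleep only when another attempt remains.
    if [ "$_attempt" -lt "$_retries" ]; then
      sleep "$_interval"
    fi
    _attempt=$((_attempt + 1))
  done
  return 1        # still different after every attempt
}
```

With RETRIES set to 1 the function performs a single check and no re-check, which mirrors the idea that extra cost is paid only when a discrepancy is actually seen.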

Frank Cizmich (frank-cizmich) wrote :

To preserve the default behavior, a new command-line parameter was added.

If you are having resume problems, you can now set --replicate-check-retries N, where N is the number of times to retry a discrepant checksum (default = 1, no retries).

Setting a value of 3 is enough to completely eliminate spurious differences.
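
As a rough illustration, a resumed run with the new option might be assembled like this. The variable names retries and cmd are placeholders, the database names come from the report above, and --resume is added here on the assumption that the run is being resumed (the original command did not show it). Credentials are assumed to be supplied separately, for example with --user and --password as in the report. The snippet only prints the command so it can be reviewed before running.

```shell
# Sketch only: builds and prints the command instead of executing it.
retries=3   # value suggested in the comment above

cmd="pt-table-checksum --recursion-method cluster --resume --replicate-check-retries $retries --databases db1,db2"

# Review the command, then run it manually once it looks right.
echo "$cmd"
```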

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PT-646
