MDEV-21322: Report slave progress to the master
This patch extends the command `SHOW REPLICA HOSTS` with three
columns:
1) Gtid_Pos_Sent. This represents the latest GTIDs sent to the
replica in each domain. It will always be populated, regardless of
the semi-sync status (i.e. asynchronous connections will still
update this column with the latest GTID state sent to the replica).
2) Gtid_Pos_Ack. For semi-synchronous connections (only), this
column represents the last GTID in each domain that the replica has
acknowledged.
3) Sync_Status. This value represents the synchronization status of
the replica, and is used to help determine how to interpret the
Gtid_Pos_Ack column. There are four possible values:
3.1) Initializing: This means the binlog dump thread is still
initializing, and has not yet determined the synchronization status
of the replica.
3.2) Asynchronous: This means the replica is not configured for
semi-sync replication, and therefore Gtid_Pos_Ack should always be
NULL.
3.3) Semi-sync Stale: This means the replica is configured for
semi-sync replication, but connected using an old state and is not
readily able to send ACKs for new transactions. Functionally, this
means that the primary will try to bring the replica up to date by
sending transactions which will not be ACKed. Additionally, the
value shown by Gtid_Pos_Ack will be NULL until the replica catches
up and ACKs its first transaction.
3.4) Semi-sync Active: This means the replica is configured for
semi-sync replication, and is readily sending ACKs for new
transactions it receives. It is possible for Gtid_Pos_Ack to be
NULL while Sync_Status is "Semi-sync Active" if no new transactions
have been executed on the primary since the replica has connected.
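As a rough illustration of how the four Sync_Status values relate to
Gtid_Pos_Ack, here is a minimal C++ sketch. The names SyncStatus,
sync_status_name() and gtid_pos_ack_column() are hypothetical and are
not taken from the patch; treating "Initializing" like the semi-sync
states when no ACK exists is also an assumption.

```cpp
#include <optional>
#include <string>

// Hypothetical enum; the patch's real identifiers may differ.
enum class SyncStatus { Initializing, Asynchronous, SemiSyncStale, SemiSyncActive };

// Text shown in the Sync_Status column, as listed in this message.
inline const char *sync_status_name(SyncStatus s)
{
  switch (s) {
  case SyncStatus::Initializing:   return "Initializing";
  case SyncStatus::Asynchronous:   return "Asynchronous";
  case SyncStatus::SemiSyncStale:  return "Semi-sync Stale";
  case SyncStatus::SemiSyncActive: return "Semi-sync Active";
  }
  return "Unknown";
}

// Value to display for Gtid_Pos_Ack. Asynchronous replicas always show
// NULL; otherwise NULL is shown until a first ACK has been recorded.
inline std::string gtid_pos_ack_column(SyncStatus s,
                                       const std::optional<std::string> &last_acked_gtid)
{
  if (s == SyncStatus::Asynchronous || !last_acked_gtid)
    return "NULL";
  return *last_acked_gtid;
}
```

In this sketch a NULL Gtid_Pos_Ack is only meaningful together with
Sync_Status, which is the reason the status column exists.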
Additionally, this patch creates a new semantic for the
configuration rpl_semi_sync_master_timeout=0. That is, it creates a
mode to mimic the asynchronous connection behavior, while allowing
one to monitor the progress of the replica via the new columns
Gtid_Pos_Sent and Gtid_Pos_Ack. When configured to 0, 1) new
transactions will not attempt to wait for an ACK before completing,
and 2) the primary will still request ACKs from the replica for new
transactions. This means that Gtid_Pos_Ack will be updated for each
ACK from the replica and Sync_Status will read as "Semi-sync Active".
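The timeout=0 behavior described above can be summarized with a small
C++ sketch. AckPolicy and ack_policy() are hypothetical names used
only for illustration, and the sketch deliberately ignores every other
condition the server considers (for example semi-sync having been
switched off after an earlier timeout).

```cpp
// Hypothetical summary of what the primary does per transaction.
struct AckPolicy {
  bool request_ack;   // primary asks the replica to ACK the transaction
  bool wait_for_ack;  // committing session blocks until ACK or timeout
};

// timeout_ms models rpl_semi_sync_master_timeout. With 0, ACKs are
// still requested (so Gtid_Pos_Ack advances) but nothing waits on them.
inline AckPolicy ack_policy(bool replica_is_semi_sync, unsigned long timeout_ms)
{
  if (!replica_is_semi_sync)
    return AckPolicy{false, false};
  return AckPolicy{true, timeout_ms != 0};
}
```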
Also note that a new error message was added to account for the case
that Gtid_Pos_(Sent/Ack) represents a binary log file that was
purged/cannot be found.
The overall implementation is rather simple. It leverages the
existing semi-sync framework, where the replica uses binlog file:pos
to ACK transactions, in order to infer GTID state by performing a
binlog lookup at the time `SHOW REPLICA HOSTS` is executed. In
particular, the Slave_info struct is extended to store 1) the binlog
file:pos pair of the transaction which was last sent to the replica,
2) the binlog file:pos pair that was last ACKed by the replica, and
3) an enum to represent the Sync_Status.
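To make the data flow concrete, here is a hedged C++ sketch of the
per-replica state and the lookup done at `SHOW REPLICA HOSTS` time.
ReplicaProgress, BinlogCoord, GtidLookup and resolve_column() are
hypothetical names, not the fields actually added to Slave_info, and
the lookup is passed in as a callback instead of reading a real binlog.
How the server reports a failed lookup is not modeled beyond a flag.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

struct BinlogCoord {
  std::string file;   // binlog file name
  std::uint64_t pos;  // offset within that file
};

enum class SyncStatus { Initializing, Asynchronous, SemiSyncStale, SemiSyncActive };

// Hypothetical stand-in for the state added to Slave_info.
struct ReplicaProgress {
  std::optional<BinlogCoord> last_sent;   // source of Gtid_Pos_Sent
  std::optional<BinlogCoord> last_acked;  // source of Gtid_Pos_Ack
  SyncStatus sync_status = SyncStatus::Initializing;
};

// Maps file:pos to a GTID state string; an empty result models a
// binlog file that was purged or cannot be found.
using GtidLookup = std::function<std::optional<std::string>(const BinlogCoord &)>;

struct ColumnValue {
  bool is_null;
  bool lookup_failed;  // would trigger the new error message
  std::string text;
};

// Resolve one column lazily, at the time the statement is executed.
inline ColumnValue resolve_column(const std::optional<BinlogCoord> &coord,
                                  const GtidLookup &lookup)
{
  if (!coord)
    return {true, false, ""};   // nothing sent/ACKed yet
  std::optional<std::string> gtid = lookup(*coord);
  if (!gtid)
    return {true, true, ""};    // binlog file no longer available
  return {false, false, *gtid};
}
```

Storing only file:pos keeps the hot path (sending events, receiving
ACKs) cheap; the cost of the binlog lookup is paid only when the
statement is run.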
Note that in kill_callback_collect(), ack_receiver.remove_slave(thd)
was removed from the mutual exclusion of LOCK_thd_data. This call
shouldn't need LOCK_thd_data in the first place, as the thread can go
on to be killed while we remove the slave from the ack receiver.
The call was moved out, though, because locking LOCK_thd_data from within
the ack receiver thread (which already holds m_mutex) would conflict
with the mutex ordering of kill_callback_collect(), which grabs
LOCK_thd_data before m_mutex.
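The lock-ordering concern can be illustrated with a minimal C++
sketch using std::mutex. Thd, AckReceiver and kill_replica() are
simplified, hypothetical stand-ins for the server's types, and calling
remove_slave() after the locked region (as opposed to before it) is an
assumption made for the example.

```cpp
#include <cstddef>
#include <mutex>
#include <set>

struct Thd {
  std::mutex lock_thd_data;  // stands in for LOCK_thd_data
  bool killed = false;
};

class AckReceiver {
  std::mutex m_mutex;        // stands in for the receiver's m_mutex
  std::set<Thd *> m_slaves;
public:
  void add_slave(Thd *t)    { std::lock_guard<std::mutex> g(m_mutex); m_slaves.insert(t); }
  void remove_slave(Thd *t) { std::lock_guard<std::mutex> g(m_mutex); m_slaves.erase(t); }
  std::size_t slave_count() { std::lock_guard<std::mutex> g(m_mutex); return m_slaves.size(); }
};

// remove_slave() takes m_mutex, so it is called with lock_thd_data
// released. The two mutexes are then never held together on this path,
// and no ordering constraint between them is created here.
inline void kill_replica(Thd &thd, AckReceiver &receiver)
{
  {
    std::lock_guard<std::mutex> g(thd.lock_thd_data);
    thd.killed = true;
  }
  receiver.remove_slave(&thd);
}
```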
This patch was initially started by @JackSlateur in PR#1427, where
it was then transferred to @an3l who buffed it out in PR#2374, and
final touches were put on by @bnestere.
Reviewed By
============
Kristian Nielsen <email address hidden>
Andrei Elkin <email address hidden>