lp:~percona-dev/percona-xtradb-cluster/galera-3.x
- Get this branch:
- bzr branch lp:~percona-dev/percona-xtradb-cluster/galera-3.x
Related bugs
Bug #1180791: RBR error on IST not zeroing grastate | Undecided | Fix Released
Bug #1237444: Statically linked galera library wrt. openssl | Undecided | Fix Released
Bug #1243729: Galera build fails when ssl is disabled | Undecided | Fix Committed
Bug #1247344: Relocatable RPMs | Undecided | Fix Released
Bug #1253882: Make galera packages interchangeable | Undecided | Fix Released
Bug #1256788: Port the garbd init script to debian and package it. | Undecided | Fix Released
Bug #1256887: Concomitant installation of galera packages | Undecided | Fix Released
Bug #1262171: Garbd debian init script startup issue | Undecided | Fix Released
Bug #1269836: Debian debug packages for galera | Undecided | Fix Released
Branch information
- Owner:
- Percona developers
- Status:
- Development
Recent revisions
- 215. By Raghavendra D Prabhu
- Merge galera-3.x up to revno 183, also reverting changes made in revno 209 for Bug#1285380
- 209. By Raghavendra D Prabhu
- Bug#1285380: Donor in desynced state makes the joiner wait indefinitely
So, this is how it looks:

    do {
        end = strchr(begin, ',');
        int len;
        if (NULL == end) {
            len = str_len - (begin - str);
        }
        else {
            len = end - begin;
        }
        assert (len >= 0);
        int const idx = len > 0 ? /* consider empty name as "any" */
            group_find_node_by_name (group, joiner_idx, begin, len, status) :
            /* err == -EAGAIN here means that at least one of the nodes in the
             * list will be available later, so don't try others. */
            (err == -EAGAIN ?
             err : group_find_node_by_state(group, joiner_idx, status));
        if (idx >= 0) return idx;
        /* once we hit -EAGAIN, don't try to change error code: this means
         * that at least one of the nodes in the list will become available. */
        if (-EAGAIN != err) err = idx;
        begin = end + 1; /* skip comma */
    } while (end != NULL);
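
The loop above can be tried in isolation with stubbed lookups. The following is a minimal sketch, not the Galera code: the `Nodes` type, `find_by_name`, `find_by_state` and `select_donor` are hypothetical stand-ins, and the return values of the stubs (including `-EHOSTUNREACH` as the initial "worst" error) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the group: (node name, is SYNCED) pairs.
// The joiner itself is left out of the list for brevity.
using Nodes = std::vector<std::pair<std::string, bool>>;

// Stub for group_find_node_by_name(). Assumed return values: the node index
// if it is SYNCED, -EAGAIN if it exists but is not SYNCED, -EHOSTUNREACH if
// no node has that name.
static int find_by_name(const Nodes& nodes, const char* name, int len)
{
    std::string const wanted(name, static_cast<std::size_t>(len));
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].first == wanted)
            return nodes[i].second ? static_cast<int>(i) : -EAGAIN;
    }
    return -EHOSTUNREACH;
}

// Stub for group_find_node_by_state(): index of the first SYNCED node,
// or -EAGAIN (assumed) if there is none.
static int find_by_state(const Nodes& nodes)
{
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].second) return static_cast<int>(i);
    }
    return -EAGAIN;
}

// Same control flow as the quoted loop, run against the stubs.
int select_donor(const Nodes& nodes, const std::string& list)
{
    const char* const str = list.c_str();
    int const str_len = static_cast<int>(list.size());
    const char* begin = str;
    const char* end;
    int err = -EHOSTUNREACH; // assumed initial error

    do {
        end = std::strchr(begin, ',');
        int const len = end ? static_cast<int>(end - begin)
                            : str_len - static_cast<int>(begin - str);
        assert(len >= 0);

        // An empty name means "any", but only if no listed node
        // has reported -EAGAIN so far.
        int const idx = len > 0
            ? find_by_name(nodes, begin, len)
            : (err == -EAGAIN ? err : find_by_state(nodes));

        if (idx >= 0) return idx;
        if (-EAGAIN != err) err = idx;
        if (end != nullptr) begin = end + 1; // skip comma
    } while (end != nullptr);

    return err;
}
```

With nodes A1 (not SYNCED), A2 and A4 (both SYNCED), `select_donor(nodes, "A1,A2,")` returns the index of A2, while `select_donor(nodes, "A1,")` returns `-EAGAIN` even though A2 and A4 could serve as donors.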

Based on my tests, when wsrep_sst_donor='A1,A2,' and A1 is
unavailable (non-SYNCED), A3 does SST from A2 without any issues.
*However*, if wsrep_sst_donor='A1,' and A1 is unavailable, then
it keeps looping without any bounds (there are no strict bounds on
number of retries):

    do
    {
        tries++;

        gcs_seqno_t seqno_l;

        ret = gcs_.request_state_transfer(req->req(), req->len(), sst_donor_,
                                          &seqno_l);

        if (ret < 0)
        {
            if (!retry_str(ret))
            {
                log_error << "Requesting state transfer failed: "
                          << ret << "(" << strerror(-ret) << ")";
            }
            else if (1 == tries)
            {
                log_info << "Requesting state transfer failed: "
                         << ret << "(" << strerror(-ret) << "). "
                         << "Will keep retrying every " << sst_retry_sec_
                         << " second(s)";
            }
        }

        if (seqno_l != GCS_SEQNO_ILL)
        {
            /* Check that we're not running out of space in monitor. */
            if (local_monitor_.would_block(seqno_l))
            {
                long const seconds = sst_retry_sec_ * tries;
                log_error << "We ran out of resources, seemingly because "
                          << "we've been unsuccessfully requesting state "
                          << "transfer for over " << seconds << " seconds. "
                          << "Please check that there is "
                          << "at least one fully synced member in the group. "
                          << "Application must be restarted.";
                ret = -EDEADLK;
            }
            else
            {
                // we are already holding local monitor
                LocalOrder lo(seqno_l);
                local_monitor_.self_cancel(lo);
            }
        }
    }
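
For comparison, a retry loop with an explicit cap could look roughly like the sketch below. It is not part of this branch: `request_with_retry_limit`, its parameters, and the choice of `-ETIMEDOUT` as the "gave up" result are all hypothetical, and only `-EAGAIN` is treated as retryable here, which may differ from what `retry_str()` accepts.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `request` until it returns something other
// than -EAGAIN or until `max_tries` attempts have been made.
int request_with_retry_limit(const std::function<int()>& request,
                             int max_tries,
                             std::chrono::milliseconds pause)
{
    assert(max_tries > 0);

    for (int tries = 1; tries <= max_tries; ++tries) {
        int const ret = request();

        // Success or a non-retryable error: hand it straight back.
        if (ret != -EAGAIN) return ret;

        // Sleep between attempts, but not after the last one.
        if (tries < max_tries) std::this_thread::sleep_for(pause);
    }

    // Assumption: exhausting the budget is reported as a timeout.
    return -ETIMEDOUT;
}
```

A caller would wrap the state transfer request in a lambda and pass, say, a budget of 60 tries with a one second pause. The original, unbounded loop is the one that produced the output quoted next.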

From log:

140309 0:01:32 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff528 after receiving -11
140309 0:01:32 [Note] WSREP: galera/src/replicator_str.cpp:send_state_request():560: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:16001,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140309 00:01:32.391)
140309 0:01:33 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff592 after receiving -11
140309 0:01:34 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff5fc after receiving -11
140309 0:01:35 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff666 after receiving -11
140309 0:01:36 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff6d0 after receiving -11
140309 0:01:37 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff73a after receiving -11
140309 0:01:38 [Note] [Debug] WSREP: gcs/src/gcs.c:gcs_replv():1568: Freeing gcache buffer 0x7f528bfff7a4 after receiving -11

So I see that, according to the current design, this looks correct.
The current design is that if a node listed in wsrep_sst_donor is
unavailable, there is no fallback at all (to
group_find_node_by_state). However, there is a flaw in that:
wsrep_sst_donor gives the user a choice of whether or not to leave a
dangling comma. A dangling comma implies checking all listed nodes
and then trying the fallback logic, whereas its absence implies
strict checking.

Currently, irrespective of whether the comma exists or not, it
does only strict membership checking for the donor without ever
falling back. When the user provides a dangling comma (for precisely
that reason), as in "A1,A2,", it should check A1 and A2 and, if both
are unavailable, check the others as well (maybe A3 or A5 is
available in a cluster of A1,A2,A3,A4,A5).

The number of retries should also be bounded, but that is for
another bug.
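
The behaviour asked for above, where a dangling comma always enables the fallback, could be sketched as follows. This is an illustration of the proposal only, not the code in this branch: `Node`, `by_name`, `any_synced` and `pick_donor` are hypothetical names, and the error values returned by the stubs are assumptions.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node record; the joiner is assumed not to be in the list.
struct Node { std::string name; bool synced; };

// Assumed semantics: index if SYNCED, -EAGAIN if present but not SYNCED,
// -EHOSTUNREACH if there is no such node.
static int by_name(const std::vector<Node>& nodes, const std::string& name)
{
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].name == name)
            return nodes[i].synced ? static_cast<int>(i) : -EAGAIN;
    }
    return -EHOSTUNREACH;
}

// Index of the first SYNCED node, or -EAGAIN (assumed) if there is none.
static int any_synced(const std::vector<Node>& nodes)
{
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].synced) return static_cast<int>(i);
    }
    return -EAGAIN;
}

// Proposed rule: an empty entry (such as the one after a dangling comma)
// always falls back to any SYNCED node, even after a listed node
// returned -EAGAIN. Without an empty entry the check stays strict.
int pick_donor(const std::vector<Node>& nodes, const std::string& donor_list)
{
    int err = -EHOSTUNREACH;
    std::size_t begin = 0;

    for (;;) {
        std::size_t const end = donor_list.find(',', begin);
        std::string const name = donor_list.substr(
            begin, end == std::string::npos ? std::string::npos : end - begin);

        int const idx = name.empty() ? any_synced(nodes)
                                     : by_name(nodes, name);
        if (idx >= 0) return idx;
        if (err != -EAGAIN) err = idx;

        if (end == std::string::npos) break;
        begin = end + 1; // skip comma
    }
    return err;
}
```

With A1, A2 and A4 not SYNCED and A5 SYNCED, `pick_donor(nodes, "A1,A2,")` returns the index of A5, while `pick_donor(nodes, "A1,A2")` still returns `-EAGAIN`, which keeps the strict form available to users who want it.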
Branch metadata
- Branch format:
- Branch format 7
- Repository format:
- Bazaar repository format 2a (needs bzr 1.16 or later)