Alexey - Thanks. Could you please expand on your comment on why you think this assumption is wrong - > The assumption this patch is based on looks wrong to me, i.e.: > > > + XA COMMIT. In contrast to that, the slave position is an > > + actual part of the changes made by this transaction and thus > > + must be updated in the XA PREPARE stage. */ because the following explanation matches my assumptions precisely: > A prepared transaction may either be committed or rolled back (see > xarecover_handlerton()) depending on whether the corresponding record made it > to the binary log (and if the binary log is used at all). Namely, why is the assumption "slave position update is a part of actual transaction changes" wrong in the light of this? If a prepared transaction is rolled back, the old slave position is restored from an undo seg. If it is 2pc-committed, the new slave position becomes permanent. Maybe you view this fix as incomplete or working by chance because of the following (which is the actual action sequence that happens in this bug): on crash recovery the relay status log overwrite will happen before the XA rollback, thus the slave position will point to as if the transaction was fully committed. Then the transaction will be rolled back, (which would require overwriting relay status log with the old position) and replayed from the binlog, (which would require overwriting the relay status log position again, but binlog does not have the required info for that). This results in correct positions, although with a shortcut taken. The assumption here that all replicated InnoDB XA prepared transactions will be eventually committed. It is not perfect but IMHO "slave position is a part of transaction itself" is a step to the right direction. > The correct solution would be to update slave coordinates when commiting the > corresponding XA transaction on recovery, rather than moving the code storing > slave coordinates to persistent storage from COMMIT to PREPARE stage. The > problem is that slave coordinates are not available at the point, i.e. when > committing prepared transactions on recovery. Fully agreed modulus that ATM I think it's "one of the possible correct solutions." > I wonder if we can fix this by introducing another set of slave coordinates in > the trx header. The we can only update those fields on PREPARE, and update the > regular TRX_SYS_MYSQL_RELAY_LOG_INFO fields on COMMIT. On recovery, we could > copy the "prepare" fields to "committed" ones and overwrite the relay log info > file, *if* the PREPAREd transaction is being committed, e.g. in > innobase_commit_by_xid(). What do you think? Upon the first thought it seems workable, however I would like to postpone further discussion until we agree on the previous point, because there are other bugs in crash-resistant replication bug too, fixing which might require implementing "transactional system table for slave relay log," and if we have to do that, then we should discuss such fixes as a whole. > Minor comments on the test case: > - do we really need the "rpl_" prefix in rpl_percona_crash_resistant_rpl.*? Looks funny to me too, I just didn't want the convention of all rpl suite tests having a "rpl_" prefix, even though that seems historical. > - please use > > SET GLOBAL debug="+d,keyword" Will do.