Merge lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure into lp:akiban-persistit
Status: | Merged | ||||
---|---|---|---|---|---|
Approved by: | Nathan Williams | ||||
Approved revision: | 384 | ||||
Merged at revision: | 378 | ||||
Proposed branch: | lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure | ||||
Merge into: | lp:akiban-persistit | ||||
Diff against target: |
367 lines (+209/-17) 8 files modified
src/main/java/com/persistit/Accumulator.java (+31/-11) src/main/java/com/persistit/CheckpointManager.java (+8/-1) src/main/java/com/persistit/Persistit.java (+3/-3) src/main/java/com/persistit/RecoveryManager.java (+1/-1) src/main/java/com/persistit/Transaction.java (+6/-0) src/main/java/com/persistit/TransactionPlayer.java (+7/-1) src/main/java/com/persistit/util/SequencerConstants.java (+10/-0) src/test/java/com/persistit/Bug1064565Test.java (+143/-0) |
||||
To merge this branch: | bzr merge lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure | ||||
Related bugs: |
|
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Nathan Williams | Approve | ||
Review via email: mp+129243@code.launchpad.net |
Description of the change
This proposal fixes https:/
The bug is caused by a subtle race condition in the protocol that determines when Persistit records Accumulator state. The bug mechanism is described in some detail in the bug report.
The central problem is the handling of the _checkpointRef field of the com.persistit.
The invariant regarding timing is this: if some committed transaction with commit timestamp tc has updated an Accumulator, and if the most recent checkpoint has timestamp ts0 then the _checkpointRef field must be non-null whenever there is any transaction such that tc > ts0. The converse is not true; it is permissible for _checkpointRef to be non-null even if there are no qualifying updates. The result in that case is that the Accumulator state is recorded redundantly, but not incorrectly.
The reason for this invariant is as follows. Checkpoint C0 will record the snapshot value of the Accumulator as of its start timestamp ts0. By definition of SI, updates committed after ts0 will not be part of the checkpoint even if the transaction that created them started before ts0. In the event C0 is the very last checkpoint recorded before recovery, the state is valid because those subsequent updates would be replayed from the journal during recovery. However, if one more checkpoint C1 is written (as it is under normal shutdown) with start timestamp ts1 > tc, recovery will ignore the transactions in the journal (by definition of Checkpoint). Therefore C1 must record the updates committed at tc in order for correct recovery after C1 has been written to the journal. It is the value of _checkpointRef that determines this behavior.
The bug occurred because this invariant was violated. The following is an informal proof that the proposed changes enforce the invariant.
Changes:
AccumulatorRef now includes a new AtomicLong field called _latestUpdate in which the commit timestamp of the most-recently-
The AccumulatorRef#
The checkpointNeede
Other minor change:
The Accumulator _baseValue field is now marked volatile because it is set during recovery by one thread and may first be read by a different thread. There is no race condition requiring synchronization, but the Java memory model would permit the second thread to receive stale data.
Informal Proof:
Two threads C and T are executing concurrently; C is calling takeCheckpointR
(Note that both of these methods are lock-free because they are called frequently.)
These methods modify two shared variables, _checkpointRef and _latestUpdate – for brevity call these R and L, respectively.
T updates L then sets R.
C reads L, possibly clears R, reads L and possibly sets R again.
By definition of Java, the operations “update”, “set”, “read” and “clear” are atomic. We’ll use the following abbreviations:
TuL – T updates L
TsR – T sets R
CrL1 – C reads L the first time
CcR – C clears R
CrL2 – C reads L the second time
CsR – C sets R
Clearly all the execution schedules in which T’s operations either precede or follow C’s operations (i.e., TuL, TsR, CrL1, CcR… and CrL1, CcR…, TuL, TsR) are safe.
All schedules in which TuL precedes CrL1 are safe because C will never clear R. The following are the two such possible executions:
TuL, TsR, CrL1
TuL, CrL1, TsR
Now consider schedules in which TuL follows CrL1 but precedes CcR.
CrL1, TuL, TsR, CcR, CrL2, CsR
In this schedule C sees an old value of L and acts by clearing R. In this case the second read CrL2 sees the updated value of L and restores the value of C, leaving C correctly non-null. (This is the case that motivates the CrL2, CsR sequence in takeCheckpointRef.)
CrL1, TuL, CcR, TsR, CrL2, CsR
In this schedule TsR occurs after CcR so the value of C is already non-null. The final execution of CsR simply redundantly sets R. Same analysis applies to the following cases:
CrL1, TuL, CcR, CrL2, TsR, CsR
CrL1, TuL, CcR, CrL2, CsR, TsR
Finally, all schedules in which TuL follows CcR are all safe because in all such cases TsR must follow CsR, e.g.:
CrL1, CcR, TuL, TsR, CrL2, CsR
CrL1, CcR, TuL, CrL2, TsR, CsR
CrL1, CcR, TuL, CrL2, CsR, TsR
With regards to takeCheckpointRef and checkpointNeeded, the description is very complete and the code looks consistent with it. I believe it is correct as-is.
I am a little wary of diff line 81 though. Unconditionally calling checkpointNeeded() with 0 requires that checkpointNeeded always sets the ref. It does today, but I was about to suggest diff line 47 could be return instead, without loss of correctness, to prevent one case of spurious saves.
It looks like there is exactly 1 caller of updateBaseValue() and a timestamp is readily available. Is there a reason we shouldn't pass the ts down so that we are always dealing with real ones in checkpointNeeded?
Minor though and otherwise looks good!