Akiban Persistit

Merge lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure into lp:akiban-persistit

fix-accumulator-checkpoint-failure
Merge into trunk

Proposed by Peter Beaman on 2012-10-11

Status: Merged
Approved by: Nathan Williams on 2012-10-11
Approved revision: 384
Merged at revision: 378
Proposed branch: lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure
Merge into: lp:akiban-persistit
Diff against target: 367 lines (+209/-17), 8 files modified

src/main/java/com/persistit/Accumulator.java (+31/-11)
src/main/java/com/persistit/CheckpointManager.java (+8/-1)
src/main/java/com/persistit/Persistit.java (+3/-3)
src/main/java/com/persistit/RecoveryManager.java (+1/-1)
src/main/java/com/persistit/Transaction.java (+6/-0)
src/main/java/com/persistit/TransactionPlayer.java (+7/-1)
src/main/java/com/persistit/util/SequencerConstants.java (+10/-0)
src/test/java/com/persistit/Bug1064565Test.java (+143/-0)

To merge this branch:

bzr merge lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure

Critical

Fix Released

Reviewer	Review Type	Date Requested	Status
Nathan Williams		2012-10-11	Approve on 2012-10-11
Review via email: mp+129243@code.launchpad.net

Description of the change

This proposal fixes https://bugs.launchpad.net/akiban-persistit/+bug/1064565, a data-loss bug affecting the state of Accumulator values after system shutdown/restart.

The bug is caused by a subtle race condition in the protocol that determines when Persistit records Accumulator state. The bug mechanism is described in some detail in the bug report.

The central problem is the handling of the _checkpointRef field of the com.persistit.Accumulator.AccumulatorRef inner class. This field is either null or a reference to an Accumulator. It is intended to be null when the Accumulator has not been updated since the last checkpoint, and non-null when there has been an update since the last checkpoint. Checkpoints happen by default every two minutes, so this loose description of the timing is fine for almost all of every two-minute interval. However, the meaning of “since” needs to be carefully defined in the case where a transaction that is performing Accumulator updates becomes concurrent with a checkpoint. That’s when the bug occurs.

The invariant regarding timing is this: if a committed transaction with commit timestamp tc has updated an Accumulator, and the most recent checkpoint has start timestamp ts0, then the _checkpointRef field must be non-null whenever tc > ts0. The converse does not hold; it is permissible for _checkpointRef to be non-null even if there are no qualifying updates. In that case the Accumulator state is recorded redundantly, but not incorrectly.

The reason for this invariant is as follows. Checkpoint C0 will record the snapshot value of the Accumulator as of its start timestamp ts0. By definition of snapshot isolation (SI), updates committed after ts0 will not be part of the checkpoint even if the transaction that created them started before ts0. In the event C0 is the very last checkpoint recorded before recovery, the state is valid because those subsequent updates will be replayed from the journal during recovery. However, if one more checkpoint C1 is written (as it is under normal shutdown) with start timestamp ts1 > tc, recovery will not replay the journaled transactions that committed before ts1 (by definition of a checkpoint). Therefore C1 must record the updates committed at tc for recovery to be correct once C1 has been written to the journal. It is the value of _checkpointRef that determines this behavior.
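The recovery argument above can be made concrete with a toy model. The sketch below is an illustration only, not Persistit code: RecoveryModel, recover and the timestamps 100, 105 and 110 are hypothetical, and it assumes that recovery replays exactly those journaled deltas whose commit timestamp is greater than the start timestamp of the last checkpoint.

```java
/** Toy model of Accumulator recovery; hypothetical, not part of Persistit. */
final class RecoveryModel {

    /**
     * Value seen after recovery: the value saved by the last checkpoint plus
     * every journaled delta committed after that checkpoint's timestamp.
     * Each journal entry is {commitTimestamp, delta}.
     */
    static long recover(final long savedValue, final long lastCheckpointTs, final long[][] journal) {
        long value = savedValue;
        for (final long[] txn : journal) {
            if (txn[0] > lastCheckpointTs) {
                value += txn[1];
            }
        }
        return value;
    }

    public static void main(final String[] args) {
        final long ts0 = 100, tc = 105, ts1 = 110; // ts0 < tc < ts1
        final long[][] journal = { { tc, 42 } };

        // C0 is the last checkpoint: the delta is replayed from the journal.
        System.out.println(recover(0, ts0, journal)); // prints 42

        // C1 is written but skips the Accumulator (_checkpointRef was null):
        // the stale saved value survives and the delta is not replayed.
        System.out.println(recover(0, ts1, journal)); // prints 0

        // C1 is written and records the Accumulator as required.
        System.out.println(recover(42, ts1, journal)); // prints 42
    }
}
```

The middle case is the data loss described in the bug report: the update committed at tc is covered neither by the saved value nor by journal replay.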

The bug occurred because this invariant was violated. The following is an informal proof that the proposed changes enforce the invariant.

Changes:

AccumulatorRef now includes a new AtomicLong field called _latestUpdate, which stores the latest commit timestamp of any transaction that has performed an update on the Accumulator. This field is modified using a lock-free CAS loop in AccumulatorRef#checkpointNeeded(…).

The AccumulatorRef#takeCheckpointRef() method has been modified to receive the start timestamp of the checkpoint transaction that is saving Accumulator state. This method sets _checkpointRef to null only if the checkpoint timestamp is greater than the then-current value of _latestUpdate.

The checkpointNeeded(timestamp) method is called from a different place than before. It is now called during the processing of Transaction#commit(…) at a time when the ultimate commit timestamp of the transaction has been determined.
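The following is a minimal, self-contained sketch of the two methods described above, handy for experimenting outside Persistit. It is a simplified stand-in rather than the real class: RefModel is a hypothetical name, Object takes the place of Accumulator, and the WeakReference and isLive() logic are left out. The method bodies follow the diff as of revision 384.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified stand-in for AccumulatorRef; hypothetical, not Persistit code. */
final class RefModel {
    final AtomicLong latestUpdate = new AtomicLong();
    volatile Object checkpointRef;

    /** Called by the checkpoint with its start timestamp. */
    Object takeCheckpointRef(final long timestamp) {
        final Object result = checkpointRef;
        if (timestamp > latestUpdate.get()) {
            checkpointRef = null;
            // Re-read: a commit may have raised latestUpdate in the meantime.
            if (timestamp <= latestUpdate.get()) {
                checkpointRef = result;
            }
        }
        return result;
    }

    /** Called during commit with the transaction's commit timestamp. */
    void checkpointNeeded(final Object acc, final long timestamp) {
        while (true) {
            final long latest = latestUpdate.get();
            if (latest > timestamp) {
                return; // a later commit has already been recorded
            }
            if (latestUpdate.compareAndSet(latest, timestamp)) {
                break;
            }
        }
        checkpointRef = acc;
    }

    public static void main(final String[] args) {
        final RefModel ref = new RefModel();
        final Object acc = new Object();
        ref.checkpointNeeded(acc, 20);
        // A checkpoint that started at 10 must leave the reference in place.
        System.out.println(ref.takeCheckpointRef(10) == acc && ref.checkpointRef == acc); // prints true
        // A checkpoint that started at 30 covers the commit and clears it.
        System.out.println(ref.takeCheckpointRef(30) == acc && ref.checkpointRef == null); // prints true
    }
}
```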

Other minor change:

The Accumulator _baseValue field is now marked volatile because it is set during recovery by one thread and may first be read by a different thread. There is no race condition requiring synchronization, but the Java memory model would permit the second thread to receive stale data.

Informal Proof:

Two threads C and T are executing concurrently; C is calling takeCheckpointRef(ts) (while performing a checkpoint), and T is calling checkpointNeeded(tc) (while committing a transaction). Once both calls are finished we need to prove that under all execution schedules, the value of _checkpointRef is non-null if tc > ts.

(Note that both of these methods are lock-free because they are called frequently.)

These methods modify two shared variables, _checkpointRef and _latestUpdate – for brevity call these R and L, respectively.

T updates L then sets R.
C reads L, possibly clears R, reads L and possibly sets R again.

Because _checkpointRef is a volatile field and _latestUpdate is an AtomicLong, the operations “update”, “set”, “read” and “clear” are each atomic. We’ll use the following abbreviations:

TuL – T updates L
TsR – T sets R
CrL1 – C reads L the first time
CcR – C clears R
CrL2 – C reads L the second time
CsR – C sets R

Clearly all the execution schedules in which T’s operations either precede or follow C’s operations (i.e., TuL, TsR, CrL1, CcR… and CrL1, CcR…, TuL, TsR) are safe.

All schedules in which TuL precedes CrL1 are safe because C will never clear R. The following are the two such possible executions:

TuL, TsR, CrL1
TuL, CrL1, TsR

Now consider schedules in which TuL follows CrL1 but precedes CcR.

CrL1, TuL, TsR, CcR, CrL2, CsR

In this schedule C sees an old value of L and acts by clearing R. The second read CrL2 then sees the updated value of L and restores the value of R, leaving R correctly non-null. (This is the case that motivates the CrL2, CsR sequence in takeCheckpointRef.)

CrL1, TuL, CcR, TsR, CrL2, CsR

In this schedule TsR occurs after CcR, so the value of R is already non-null. The final execution of CsR simply sets R redundantly. The same analysis applies to the following cases:

CrL1, TuL, CcR, CrL2, TsR, CsR
CrL1, TuL, CcR, CrL2, CsR, TsR

Finally, all schedules in which TuL follows CcR are safe because in all such cases TsR must follow CcR, e.g.:

CrL1, CcR, TuL, TsR, CrL2, CsR
CrL1, CcR, TuL, CrL2, TsR, CsR
CrL1, CcR, TuL, CrL2, CsR, TsR
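The case analysis above can also be checked mechanically by enumerating every interleaving. The sketch below is a hypothetical model, not Persistit code: ScheduleCheck and the timestamp values are made up, and it adds one step the abbreviations leave implicit, namely C capturing R into a local before CrL1. It ASSUMES that R is non-null when C starts, so that CsR restores a non-null reference; changing that initial value changes what CsR restores, and the result here says nothing about that case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical exhaustive check of the abbreviated steps; not Persistit code. */
final class ScheduleCheck {
    static final long TS = 10; // checkpoint start timestamp ts (illustrative)
    static final long TC = 20; // commit timestamp tc, chosen so that tc > ts

    /** Returns every interleaving that ends with R == null. */
    static List<String> unsafeSchedules() {
        final List<String> unsafe = new ArrayList<>();
        final int steps = 7; // T: TuL, TsR; C: capture R, CrL1, CcR, CrL2, CsR
        for (int mask = 0; mask < (1 << steps); mask++) {
            if (Integer.bitCount(mask) != 2) {
                continue; // the two set bits mark the slots taken by T
            }
            long l = 0;              // L starts below both ts and tc
            boolean r = true;        // ASSUMPTION: R is non-null when C starts
            boolean result = false;  // C's captured copy of R
            boolean entered = false; // outcome of CrL1: ts > L
            boolean restore = false; // outcome of CrL2: ts <= L
            int t = 0;
            int c = 0;
            final StringBuilder name = new StringBuilder();
            for (int slot = 0; slot < steps; slot++) {
                if ((mask & (1 << slot)) != 0) {
                    if (t++ == 0) {
                        l = Math.max(l, TC); // TuL: net effect of the CAS loop
                        name.append("TuL ");
                    } else {
                        r = true; // TsR
                        name.append("TsR ");
                    }
                } else {
                    // CcR and CsR are no-ops when their condition is false.
                    switch (c++) {
                    case 0: result = r; break;
                    case 1: entered = TS > l; name.append("CrL1 "); break;
                    case 2: if (entered) r = false; name.append("CcR "); break;
                    case 3: restore = entered && TS <= l; name.append("CrL2 "); break;
                    default: if (restore) r = result; name.append("CsR "); break;
                    }
                }
            }
            if (!r) {
                unsafe.add(name.toString().trim());
            }
        }
        return unsafe;
    }

    public static void main(final String[] args) {
        System.out.println("unsafe schedules: " + unsafeSchedules());
    }
}
```

Under the stated assumption the model visits 21 interleavings and the returned list is empty, which agrees with the case analysis.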

Nathan Williams (nwilliams) wrote on 2012-10-11:

With regards to takeCheckpointRef and checkpointNeeded, the description is very complete and the code looks consistent with it. I believe it is correct as-is.

I am a little wary of diff line 81 though. Unconditionally calling checkpointNeeded() with 0 requires that checkpointNeeded always sets the ref. It does today, but I was about to suggest diff line 47 could be return instead, without loss of correctness, to prevent one case of spurious saves.

It looks like there is exactly 1 caller of updateBaseValue() and a timestamp is readily available. Is there a reason we shouldn't pass the ts down so that we are always dealing with real ones in checkpointNeeded?

Minor though and otherwise looks good!

review: Needs Information

Peter Beaman (pbeaman) wrote on 2012-10-11:

I think my explanation of the updateBaseValue call is a little off. I'll fix the Javadoc. updateBaseValue is called from RecoveryManager.DefaultRecoveryListener to apply a change from a committed transaction that has not yet been checkpointed. So in fact the setting of _checkpointRef in checkpointNeeded will occur precisely when a checkpoint is in fact needed, and there should be no redundant checkpointing. Re 0 vs. the commit timestamp of the transaction: the choice is immaterial; the intent is to set _checkpointRef so that the first checkpoint cycle of the new epoch will checkpoint the accumulator. If updateBaseValue were engaged in a race with the checkpoint manager then the distinction would matter, but checkpoint manager starts after recovery is done. But upon reflection, given that someday someone might change that, I'll change it to pass commit timestamp.

Re return vs. break at +47: I think that might be okay, but it complicates the "proof." Need to think about that one. As is, I believe the code is correct, but the change could eliminate a redundant set operation on volatile _checkpointRef and that might be a tiny help with highly concurrent transactions.

Nathan Williams (nwilliams) wrote on 2012-10-11:

To be clear, I wasn't worried about a race for updateBaseValue. I was worried that it was passing zero and expecting that to *always* set the ref. This is purely for future proofing changes to checkpointNeeded and reducing "magic" values. Another option would be a checkpointNeeded that takes no timestamp and just unconditionally sets the ref.

lp:~pbeaman/akiban-persistit/fix-accumulator-checkpoint-failure updated on 2012-10-11

383. By Peter Beaman on 2012-10-11: Review comments: pass commitTimestamp to updateBaseValue.
384. By Peter Beaman on 2012-10-11: Review comment: return in diff +47

Peter Beaman (pbeaman) wrote on 2012-10-11:

Agree about the magic value. Changed it so that updateBaseValue passes the actual commitTimestamp. Also changed break to return in +47. I think that's fine. The important thing is to ensure that _checkpointRef is always set when it needs to be, and I think that property is preserved.

Nathan Williams (nwilliams) wrote on 2012-10-11:

Thanks for the tweak! I think this is good to go now.

review: Approve

Preview Diff

=== modified file 'src/main/java/com/persistit/Accumulator.java'
--- src/main/java/com/persistit/Accumulator.java	2012-08-24 13:57:19 +0000
+++ src/main/java/com/persistit/Accumulator.java	2012-10-11 20:43:21 +0000
@@ -122,7 +122,7 @@
     /*
      * Check-pointed value read during recovery.
      */
-    private long _baseValue;
+    private volatile long _baseValue;
     /*
      * Snapshot value at the most recent checkpoint
@@ -366,23 +366,37 @@
      */
     final static class AccumulatorRef {
         final WeakReference<Accumulator> _weakRef;
+        final AtomicLong _latestUpdate = new AtomicLong();
         volatile Accumulator _checkpointRef;
         AccumulatorRef(final Accumulator acc) {
             _weakRef = new WeakReference<Accumulator>(acc);
-            _checkpointRef = acc;
         }
-        Accumulator takeCheckpointRef() {
+        Accumulator takeCheckpointRef(final long timestamp) {
             final Accumulator result = _checkpointRef;
-            _checkpointRef = null;
+
+            if (timestamp > _latestUpdate.get()) {
+                _checkpointRef = null;
+                if (timestamp <= _latestUpdate.get()) {
+                    _checkpointRef = result;
+                }
+            }
+
             return result;
         }
-        void checkpointNeeded(final Accumulator acc) {
-            if (_checkpointRef == null) {
-                _checkpointRef = acc;
+        void checkpointNeeded(final Accumulator acc, final long timestamp) {
+            while (true) {
+                final long latest = _latestUpdate.get();
+                if (latest > timestamp) {
+                    return;
+                }
+                if (_latestUpdate.compareAndSet(latest, timestamp)) {
+                    break;
+                }
             }
+            _checkpointRef = acc;
         }
         boolean isLive() {
@@ -448,8 +462,8 @@
         return _accumulatorRef;
     }
-    void checkpointNeeded() {
-        _accumulatorRef.checkpointNeeded(this);
+    void checkpointNeeded(final long timestamp) {
+        _accumulatorRef.checkpointNeeded(this, timestamp);
     }
     long getBucketValue(final int hashIndex) {
@@ -563,9 +577,16 @@
      *
      * @param value
      */
-    void updateBaseValue(final long value) {
+    void updateBaseValue(final long value, final long commitTimestamp) {
         _baseValue = applyValue(_baseValue, value);
         _liveValue.set(_baseValue);
+        /*
+         * This method is called during recovery processing to handle a delta
+         * operation that was part of a transaction that committed after the
+         * keystone checkpoint. That update requires the accumulator to be saved
+         * on the next checkpoint.
+         */
+        checkpointNeeded(commitTimestamp);
     }
     /**
@@ -619,7 +640,6 @@
          */
         final long selectedValue = selectValue(value, updated);
         _transactionIndex.addOrCombineDelta(status, this, step, selectedValue);
-        checkpointNeeded();
         return updated;
     }
=== modified file 'src/main/java/com/persistit/CheckpointManager.java'
--- src/main/java/com/persistit/CheckpointManager.java	2012-10-04 20:40:40 +0000
+++ src/main/java/com/persistit/CheckpointManager.java	2012-10-11 20:43:21 +0000
@@ -15,6 +15,8 @@
 package com.persistit;
+import static com.persistit.util.SequencerConstants.ACCUMULATOR_CHECKPOINT_A;
+import static com.persistit.util.ThreadSequencer.sequence;
 import static com.persistit.util.Util.NS_PER_S;
 import java.text.SimpleDateFormat;
@@ -246,7 +248,12 @@
             txn.beginCheckpoint();
             try {
                 _persistit.flushTransactions(txn.getStartTimestamp());
-                final List<Accumulator> accumulators = _persistit.getCheckpointAccumulators();
+                /*
+                 * Test only: block here while Accumulator update occurs
+                 */
+                sequence(ACCUMULATOR_CHECKPOINT_A);
+
+                final List<Accumulator> accumulators = _persistit.takeCheckpointAccumulators(txn.getStartTimestamp());
                 _persistit.getTransactionIndex().checkpointAccumulatorSnapshots(txn.getStartTimestamp(), accumulators);
                 Accumulator.saveAccumulatorCheckpointValues(accumulators);
                 txn.commit(CommitPolicy.HARD);
=== modified file 'src/main/java/com/persistit/Persistit.java'
--- src/main/java/com/persistit/Persistit.java	2012-10-08 19:10:49 +0000
+++ src/main/java/com/persistit/Persistit.java	2012-10-11 20:43:21 +0000
@@ -2419,7 +2419,7 @@
                 }
             }
         }
-        if ((checkpointCount % ACCUMULATOR_CHECKPOINT_THRESHOLD) == 0) {
+        if (checkpointCount > 0 && (checkpointCount % ACCUMULATOR_CHECKPOINT_THRESHOLD) == 0) {
             try {
                 _checkpointManager.createCheckpoint();
             } catch (final PersistitException e) {
@@ -2441,7 +2441,7 @@
         }
     }
-    List<Accumulator> getCheckpointAccumulators() {
+    List<Accumulator> takeCheckpointAccumulators(final long timestamp) {
         final List<Accumulator> result = new ArrayList<Accumulator>();
         synchronized (_accumulators) {
             for (final Iterator<AccumulatorRef> refIterator = _accumulators.iterator(); refIterator.hasNext();) {
@@ -2449,7 +2449,7 @@
                 if (!ref.isLive()) {
                     refIterator.remove();
                 }
-                final Accumulator acc = ref.takeCheckpointRef();
+                final Accumulator acc = ref.takeCheckpointRef(timestamp);
                 if (acc != null) {
                     result.add(acc);
                 }
=== modified file 'src/main/java/com/persistit/RecoveryManager.java'
--- src/main/java/com/persistit/RecoveryManager.java	2012-10-03 16:04:16 +0000
+++ src/main/java/com/persistit/RecoveryManager.java	2012-10-11 20:43:21 +0000
@@ -246,7 +246,7 @@
                 final int accumulatorTypeOrdinal, final long value) throws PersistitException {
             final Accumulator.Type type = Accumulator.Type.values()[accumulatorTypeOrdinal];
             final Accumulator accumulator = tree.getAccumulator(type, index);
-            accumulator.updateBaseValue(value);
+            accumulator.updateBaseValue(value, timestamp);
         }
         @Override
=== modified file 'src/main/java/com/persistit/Transaction.java'
--- src/main/java/com/persistit/Transaction.java	2012-10-05 19:37:58 +0000
+++ src/main/java/com/persistit/Transaction.java	2012-10-11 20:43:21 +0000
@@ -854,6 +854,12 @@
             _commitTimestamp = _persistit.getTimestampAllocator().updateTimestamp();
             sequence(COMMIT_FLUSH_C);
             long flushedTimetimestamp = 0;
+
+            for (Delta delta = _transactionStatus.getDelta(); delta != null; delta = delta.getNext()) {
+                final Accumulator acc = delta.getAccumulator();
+                acc.checkpointNeeded(_commitTimestamp);
+            }
+
             boolean committed = false;
             try {
=== modified file 'src/main/java/com/persistit/TransactionPlayer.java'
--- src/main/java/com/persistit/TransactionPlayer.java	2012-10-03 16:04:16 +0000
+++ src/main/java/com/persistit/TransactionPlayer.java	2012-10-11 20:43:21 +0000
@@ -225,7 +225,13 @@
                 case D0.TYPE: {
                     final Exchange exchange = getExchange(D0.getTreeHandle(bb), address, startTimestamp);
-                    listener.delta(address, startTimestamp, exchange.getTree(), D0.getIndex(bb),
+                    /*
+                     * Note that the commitTimestamp, not startTimestamp is
+                     * passed to the delta method. The
+                     * Accumulator#updateBaseValue method needs the
+                     * commitTimestamp.
+                     */
+                    listener.delta(address, commitTimestamp, exchange.getTree(), D0.getIndex(bb),
                             D0.getAccumulatorTypeOrdinal(bb), 1);
                     appliedUpdates.incrementAndGet();
                     break;
=== modified file 'src/main/java/com/persistit/util/SequencerConstants.java'
--- src/main/java/com/persistit/util/SequencerConstants.java	2012-09-28 21:39:44 +0000
+++ src/main/java/com/persistit/util/SequencerConstants.java	2012-10-11 20:43:21 +0000
@@ -103,4 +103,14 @@
             array(DEALLOCATE_CHAIN_B), array(DEALLOCATE_CHAIN_A, DEALLOCATE_CHAIN_C),
             array(DEALLOCATE_CHAIN_A, DEALLOCATE_CHAIN_C) };
+    /*
+     * Used in testing delete/deallocate sequence in Bug1022567Test
+     */
+    int ACCUMULATOR_CHECKPOINT_A = allocate("ACCUMULATOR_CHECKPOINT_A");
+    int ACCUMULATOR_CHECKPOINT_B = allocate("ACCUMULATOR_CHECKPOINT_B");
+    int ACCUMULATOR_CHECKPOINT_C = allocate("ACCUMULATOR_CHECKPOINT_C");
+    int[][] ACCUMULATOR_CHECKPOINT_SCHEDULED = new int[][] { array(ACCUMULATOR_CHECKPOINT_A, ACCUMULATOR_CHECKPOINT_B),
+            array(ACCUMULATOR_CHECKPOINT_B), array(ACCUMULATOR_CHECKPOINT_A, ACCUMULATOR_CHECKPOINT_C),
+            array(ACCUMULATOR_CHECKPOINT_A, ACCUMULATOR_CHECKPOINT_C) };
+
 }
=== added file 'src/test/java/com/persistit/Bug1064565Test.java'
--- src/test/java/com/persistit/Bug1064565Test.java	1970-01-01 00:00:00 +0000
+++ src/test/java/com/persistit/Bug1064565Test.java	2012-10-11 20:43:21 +0000
@@ -0,0 +1,143 @@
+/**
+ * Copyright © 2012 Akiban Technologies, Inc.  All rights reserved.
+ *
+ * This program and the accompanying materials are made available
+ * under the terms of the Eclipse Public License v1.0 which
+ * accompanies this distribution, and is available at
+ * http://www.eclipse.org/legal/epl-v10.html
+ *
+ * This program may also be available under different license terms.
+ * For more information, see www.akiban.com or contact licensing@akiban.com.
+ *
+ * Contributors:
+ * Akiban Technologies, Inc.
+ */
+
+package com.persistit;
+
+import static com.persistit.unit.UnitTestProperties.VOLUME_NAME;
+import static com.persistit.util.SequencerConstants.ACCUMULATOR_CHECKPOINT_A;
+import static com.persistit.util.SequencerConstants.ACCUMULATOR_CHECKPOINT_B;
+import static com.persistit.util.SequencerConstants.ACCUMULATOR_CHECKPOINT_C;
+import static com.persistit.util.SequencerConstants.ACCUMULATOR_CHECKPOINT_SCHEDULED;
+import static com.persistit.util.ThreadSequencer.addSchedules;
+import static com.persistit.util.ThreadSequencer.enableSequencer;
+import static com.persistit.util.ThreadSequencer.sequence;
+import static com.persistit.util.ThreadSequencer.setCondition;
+import static org.junit.Assert.assertEquals;
+
+import java.util.concurrent.atomic.AtomicBoolean;
+
+import org.junit.Test;
+
+import com.persistit.exception.PersistitException;
+import com.persistit.util.ThreadSequencer.Condition;
+
+/**
+ * https://bugs.launchpad.net/akiban-persistit/+bug/1064565
+ *
+ * The state of an Accumulator is sometimes incorrect after shutting down and
+ * restarting Persistit and as a result an application can read a count or value
+ * that is inconsistent with the history of committed transactions.
+ *
+ * The bug mechanism is a race between the CheckpointManager#createCheckpoint
+ * method and the Accumulator#update method in which an update which occurs in a
+ * transaction that starts immediately after the checkpoint begins its
+ * transaction can be lost. The probability of failure is low but may be
+ * increased by intense I/O activity.
+ *
+ * This is a data loss error and is therefore critical.
+ */
+
+public class Bug1064565Test extends PersistitUnitTestCase {
+
+    private final static String TREE_NAME = "Bug1064565Test";
+
+    private Exchange getExchange() throws PersistitException {
+        return _persistit.getExchange(VOLUME_NAME, TREE_NAME, true);
+    }
+
+    @Test
+    public void accumulatorRace() throws Exception {
+        enableSequencer(false);
+        addSchedules(ACCUMULATOR_CHECKPOINT_SCHEDULED);
+        final AtomicBoolean once = new AtomicBoolean(true);
+        setCondition(ACCUMULATOR_CHECKPOINT_A, new Condition() {
+            @Override
+            public boolean enabled() {
+                return once.getAndSet(false);
+            }
+        });
+
+        Exchange exchange = getExchange();
+        Transaction txn = exchange.getTransaction();
+        final Thread t = new Thread(new Runnable() {
+            @Override
+            public void run() {
+                try {
+                    _persistit.checkpoint();
+                } catch (final PersistitException e) {
+                    throw new RuntimeException(e);
+                }
+            }
+        });
+        t.start();
+
+        txn.begin();
+        Accumulator acc = exchange.getTree().getAccumulator(Accumulator.Type.SUM, 0);
+        acc.update(42, txn);
+        sequence(ACCUMULATOR_CHECKPOINT_B);
+        txn.commit();
+        txn.end();
+        sequence(ACCUMULATOR_CHECKPOINT_C);
+
+        _persistit.close();
+
+        final Configuration config = _persistit.getConfiguration();
+        _persistit = new Persistit();
+        _persistit.initialize(config);
+
+        exchange = getExchange();
+        txn = exchange.getTransaction();
+        txn.begin();
+        acc = exchange.getTree().getAccumulator(Accumulator.Type.SUM, 0);
+        assertEquals("Accumulator state should have been checkpointed", 42, acc.getSnapshotValue(txn));
+        txn.commit();
+        txn.end();
+
+        _persistit.checkpoint();
+        _persistit.checkpoint();
+        _persistit.checkpoint();
+    }
+
+    /**
+     * ThreadSequencer is not even needed: this sequence shows how setting
+     * checkpointNeeded inside of the main transaction is not correctly
+     * sequenced against the checkpoint.
+     */
+    @Test
+    public void nathansVersion() throws Exception {
+        Exchange exchange = getExchange();
+        Transaction txn = exchange.getTransaction();
+        txn.begin();
+        Accumulator acc = exchange.getTree().getAccumulator(Accumulator.Type.SUM, 0);
+        acc.update(42, txn);
+        _persistit.checkpoint();
+        txn.commit();
+        txn.end();
+        _persistit.copyBackPages();
+        final Configuration config = _persistit.getConfiguration();
+        _persistit.close();
+        _persistit = new Persistit();
+        _persistit.initialize(config);
+
+        exchange = getExchange();
+        txn = exchange.getTransaction();
+        txn.begin();
+        acc = exchange.getTree().getAccumulator(Accumulator.Type.SUM, 0);
+        assertEquals("Accumulator state should have been checkpointed", 42, acc.getSnapshotValue(txn));
+        txn.commit();
+        txn.end();
+
+    }
+}
