Akiban Persistit

Merge lp:~pbeaman/akiban-persistit/fix-rebalance-exception2 into lp:akiban-persistit

fix-rebalance-exception2
Merge into trunk

Proposed by Peter Beaman on 2012-11-26

Status:	Merged
Approved by:	Nathan Williams on 2012-11-28
Approved revision:	400
Merged at revision:	400
Proposed branch:	lp:~pbeaman/akiban-persistit/fix-rebalance-exception2
Merge into:	lp:akiban-persistit
Diff against target:	420 lines (+279/-17), 6 files modified
	src/main/java/com/persistit/Buffer.java (+6/-5)
	src/main/java/com/persistit/CleanupManager.java (+1/-1)
	src/main/java/com/persistit/Exchange.java (+93/-6)
	src/main/java/com/persistit/TransactionPlayer.java (+3/-4)
	src/main/java/com/persistit/VolumeStructure.java (+1/-1)
	src/test/java/com/persistit/ExchangeRebalanceTest.java (+175/-0)
To merge this branch:	bzr merge lp:~pbeaman/akiban-persistit/fix-rebalance-exception2

Reviewer	Date Requested	Status
Nathan Williams	2012-11-26	Approve on 2012-11-28
Peter Beaman		Needs Resubmitting on 2012-11-28
Review via email: mp+136225@code.launchpad.net

Description of the change

This is a replacement for the original fix-rebalance-exception branch. This version has been demonstrated to work repeatedly in stress tests. I started from a fresh version of trunk at r396 in order to get rid of conflicts.

Here is the original description:

This proposal fixes two fairly delicate issues and should not be approved until Server 1.4.3 has shipped.

1. RebalanceException

There is a rarely encountered fringe case where deleting a record can require two pages to be split into three. The case requires two adjacent pages that are very full, and a pair of keys A and B at the left edge of the right sibling page that are long and have a small elided byte count. In other words, B is different from A in such a way that all or nearly all of the bytes in B are required to represent its value. In addition, the key to the left of A and A itself have a deep elision count. The RebalanceException occurs when deleting A requires the max key of the left sibling page to become B, and B is too big to fit there. In all previous versions of Persistit, the application would simply receive a RebalanceException in response to the attempt to remove the key.

This phenomenon has never been seen in application code, including Akiban Server testing, but it does occur routinely in Persistit stress tests.

This branch fixes the problem by catching the RebalanceException (thrown by Buffer#join(..)) and handling it properly in Exchange#raw_removeKeyRangeInternal(...). The code splits the left page and then sets up internal variables appropriately for a retry.
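The catch-split-retry idea can be illustrated with a toy model. This is only a sketch under stated assumptions: SketchPage, SketchRebalanceException and RebalanceSketch are hypothetical names, key elision is ignored, and a page is reduced to a list of keys plus a copy of its right sibling's first key. It is not Persistit's Buffer or Exchange code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for com.persistit.exception.RebalanceException.
class SketchRebalanceException extends Exception {
    SketchRebalanceException(final String message) {
        super(message);
    }
}

// Toy page (assumption): keys plus a copy of the right sibling's first key.
final class SketchPage {
    final int capacity;
    final List<String> keys;
    String upperBound;

    SketchPage(final int capacity, final String upperBound, final String... keys) {
        this.capacity = capacity;
        this.upperBound = upperBound;
        this.keys = new ArrayList<>(Arrays.asList(keys));
    }

    int used() {
        int bytes = upperBound.length();
        for (final String k : keys) {
            bytes += k.length();
        }
        return bytes;
    }
}

final class RebalanceSketch {

    // Remove the first key of 'right'; 'left' must then store the new first
    // key of 'right' as its bound. Nothing is mutated if that key cannot fit.
    static void join(final SketchPage left, final SketchPage right) throws SketchRebalanceException {
        final String newBound = right.keys.get(1);
        if (left.used() - left.upperBound.length() + newBound.length() > left.capacity) {
            throw new SketchRebalanceException("new bound does not fit in left page");
        }
        right.keys.remove(0);
        left.upperBound = newBound;
    }

    // Move the upper half of 'left' into a new page that becomes its right sibling.
    static SketchPage split(final SketchPage left) {
        if (left.keys.size() < 2) {
            throw new IllegalStateException("page too small to split");
        }
        final int mid = left.keys.size() / 2;
        final List<String> upper = left.keys.subList(mid, left.keys.size());
        final SketchPage inserted = new SketchPage(left.capacity, left.upperBound, upper.toArray(new String[0]));
        upper.clear();
        left.upperBound = inserted.keys.get(0);
        return inserted;
    }

    // Retry loop: on failure, split the left page and retry against the new page.
    static int removeFirstKeyOfRight(final List<SketchPage> chain, int leftIndex) {
        int splits = 0;
        while (true) {
            try {
                join(chain.get(leftIndex), chain.get(leftIndex + 1));
                return splits;
            } catch (final SketchRebalanceException e) {
                chain.add(leftIndex + 1, split(chain.get(leftIndex)));
                leftIndex++;
                splits++;
            }
        }
    }
}
```

In this model a failed join leaves both pages untouched, so the retry starts from a consistent state; the chain simply grows by one page, which is the same effect the new test checks by counting siblings.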

A new test, ExchangeRebalanceTest, creates the preconditions for exercising this code and then verifies by counting pages that the rebalance split has occurred. This code has also passed numerous stress test runs.

2. InUseException in stress tests

Thread A attempts to get page P from the buffer pool. Thread B attempts to get page Q, discovers that it needs to read the page from disk, and, before A gets a claim on the buffer, chooses the buffer containing P to evict and reuse to hold Q. In this scenario, thread A does not detect that the buffer it is waiting for has changed until it can get a claim on the buffer. Further, because the identities of the pages being awaited by the two threads are different, a deadlock can result.

This bug was described by https://bugs.launchpad.net/akiban-persistit/+bug/1021734. However, the fix isn't quite right.

The original fix was for thread A, while waiting for a claim on the buffer containing page P, to wait for only a short interval. A would then recheck the identity of the page contained in the awaited buffer to verify it still contains P before retrying.
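A minimal sketch of that short-wait-and-recheck pattern, assuming a buffer slot guarded by a java.util.concurrent ReentrantLock. ClaimSketch, Slot and claim are hypothetical names, and the IllegalStateException merely stands in for Persistit's InUseException; the real claim logic lives in Persistit's own SharedResource and is not reproduced here.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class ClaimSketch {

    // Hypothetical buffer slot: a lock plus the address of the page it holds.
    static final class Slot {
        final ReentrantLock lock = new ReentrantLock(true);
        volatile long pageAddress;

        Slot(final long pageAddress) {
            this.pageAddress = pageAddress;
        }
    }

    // Returns true with the lock held if the slot still holds 'wanted';
    // returns false (lock not held) if the slot was reused for another page.
    static boolean claim(final Slot slot, final long wanted, final long shortWaitMs, final long timeoutMs) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (slot.pageAddress != wanted) {
                return false; // identity changed while waiting; caller must look the page up again
            }
            try {
                if (slot.lock.tryLock(shortWaitMs, TimeUnit.MILLISECONDS)) {
                    if (slot.pageAddress == wanted) {
                        return true;
                    }
                    slot.lock.unlock();
                    return false;
                }
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("timed out"); // stand-in for InUseException
            }
        }
    }
}
```

Each expired tryLock in this sketch abandons the wait and starts a new one, so the waiter re-enters the lock's queue every time; that property is what the following paragraphs are concerned with.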

The problem with this is that on a very heavily loaded system (or when stress tests run on an EC2 instance with unreliable I/O times), there can be a livelock. Each time Thread A times out, it loses its place in the queue and goes back to the end. We think this is the mechanism for occasional timeouts now seen in stress tests, especially on EC2 instances.

The newly proposed solution is for Thread A to wait for page P using its original timeout (default is 60 seconds) and never give up its position in the queue. However, when Thread B changes the identity of the page held in the awaited buffer to Q, it ends Thread A's wait by interrupting it. Thread A receives the interrupt and uses a simple mechanism to detect whether the interrupt occurred for this reason or because there truly was an interrupt created within the environment.

Since creating the livelock in the first place is extremely difficult, I have no unit test for this. (In fact, I have no proof other than an empirical improvement in test failures to support the hypothesis about the mechanism.) The modified code has run through several stress test cycles, and we will be adding more experience with it over time.

Nathan Williams (nwilliams) wrote on 2012-11-28:

Just a few questions, but otherwise seems OK.

Why was MINIMUM_PRUNE_OBSOLETE_TRANSACTIONS_INTERVAL_NS decreased by 10x?

Why was _spareKey1/2 changed to 3/4, diff line 209ish? Are 1 and 2 not safe to use here? Can we add a comment or asserts to remind us if so?

The _id and changed() handling in SharedResource seems vulnerable to false failures. Because _id is a volatile int changed with ++, another thread coming in to claim it can have seen it in one of two states: pre- and post-increment. In the former, it will return out of the catch block with false (as intended). In the latter, a non-user-generated PersistitInterruptedException will be propagated. Is that what we want?

review: Needs Information

lp:~pbeaman/akiban-persistit/fix-rebalance-exception2 updated on 2012-11-28

398. By Peter Beaman on 2012-11-28: Remove interrupt-changed logic
399. By Peter Beaman on 2012-11-28: Merge from trunk

Peter Beaman (pbeaman) wrote on 2012-11-28:

Re MINIMUM_PRUNE_OBSOLETE_TRANSACTIONS_INTERVAL_NS -

The intended value was 5 seconds, and due to a typo, the value has instead been 50 seconds. (Yes, someday we should do a time-standard sweep through the whole product - we have a story for this in the icebox.) The original intention was to avoid spending CPU to perform the pruneObsoleteTransaction loop on every CLEANUP_MANAGER polling cycle, which when the system is busy can soak the CPU. But the problem with 50 seconds is that it can allow obsolete transactions to pile up for a very long time. 5 seconds seems like a good compromise.
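As an aside on how such a typo can be made harder to repeat, here is a small sketch that derives the nanosecond value from TimeUnit instead of a hand-typed literal. IntervalSketch and its members are hypothetical names used for illustration only; the branch itself keeps the literal shown in the diff.

```java
import java.util.concurrent.TimeUnit;

final class IntervalSketch {

    // Value before this branch: 50 seconds expressed in nanoseconds.
    static final long OLD_INTERVAL_NS = 50000000000L;

    // Intended value: 5 seconds, derived rather than typed out digit by digit.
    static final long NEW_INTERVAL_NS = TimeUnit.SECONDS.toNanos(5);

    // True when at least intervalNs has elapsed since the last run.
    static boolean due(final long lastRunNs, final long nowNs, final long intervalNs) {
        return nowNs - lastRunNs >= intervalNs;
    }
}
```

With the derived constant, the unit and magnitude are visible at the declaration, so a stray extra zero cannot silently turn 5 seconds into 50.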

This change is not related to fixing the rebalance issue, and if you think we really need to, I could propose it as a separate branch. However, all the stress tests I've been running recently have used the new value.

Re _spareKey1/2 vs _spareKey3/4

Yes, 1/2 are not safe to use here, as I discovered the hard way. Again, we need to refactor Exchange some day to get rid of all these undocumented private contracts, but for the purpose of this branch I simply reviewed the code path and determined that 3/4 are available and safe to use.

Re the changed() method and _id logic:

I have removed those changes and will explain why. I believe the increment of _changed is always safe because it was done only by a thread that had exclusive ownership of the page. However, there is an unsafe part of the logic and I could not figure out a way to avoid it. Specifically, if changed() detects that the buffer has changed identity, it will try to wake up all the threads waiting to claim it by interrupting them. The _id hack is there so that those threads can then try to figure out why they were interrupted. But even with the addition of the isQueued(Thread) call, I observed a case where a thread was interrupted while doing something else and so the application got a spurious PersistitInterruptedException.

The original purpose of this logic was to try to avoid a potential live-lock situation in which every time a thread wakes up after a short interval, it loses its place in the waiting thread queue. I believed but had no proof that this was the mechanism that caused some of the InUseException cases. However, recently I found an actual deadlock that probably explains those instances, so I am less convinced of the live-lock scenario.

Also, removing that stuff makes this proposal specific to the RebalanceException, which is cleaner. If we need the interrupt changes, we can always merge them in separately. A few more stress-test runs will tell the story.

Nathan Williams (nwilliams) wrote on 2012-11-28:

The interval change is fine. I just like to limit the number of things changing in each revision. It's already here and, as you say, tested.

Could we add asserts around the spareKey change? Yes, a large refactor could possibly fix it, but we might as well be as safe as we can until then.

I wasn't saying the increment of _id wasn't safe. I said it was vulnerable to false exceptions, which you then went on to say you saw (different cause, of course). Having it gone is all the better though.

Peter Beaman (pbeaman) wrote on 2012-11-28:

I would be happy with asserts on the spareKey change, but am not sure what
we would assert. Can you suggest the code you'd like to see?

Nathan Williams (nwilliams) wrote on 2012-11-28:

The problem is that raw_removeKeyRangeInternal already uses 1 and 2 in its implementation, right? So assert that the keys it is passed are not 1 and 2.

If for some reason that isn't how it is being used, at least add a comment next to the change so we don't ever use 1 and 2 again there.

lp:~pbeaman/akiban-persistit/fix-rebalance-exception2 updated on 2012-11-28

400. By Peter Beaman on 2012-11-28: Assert that keys in raw_removeKeyRangeInternal are safe and use safe keys in TransactionPlayer

Peter Beaman (pbeaman) wrote on 2012-11-28:

Actually, this led to a great catch, Nathan. Thanks. It used to be that sending _spareKey1/2 into raw_removeKeyRangeInternal was safe; the keys are mutated by the method but no caller cares. (And the only callers that did use them were TransactionPlayer and pruneLeftEdgeValue.)

But now the rebalanceSplit code causes a retry loop, and the mutated values do matter.

So, I added the requested asserts, modified TransactionPlayer (which required adding accessors for _spareKey3/4) and wrote a short comment.

The good catch is due to the possibility of a rebalanceSplit occurring during transaction replay from TransactionPlayer. We would have found it extremely difficult to diagnose such an occurrence in the field.
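The aliasing hazard described above can be shown with a toy sketch. Everything here is an assumption for illustration: SpareKeySketch, spare1, spare3 and removeRange are hypothetical, StringBuilder stands in for Key, and an explicit IllegalArgumentException is used where the branch uses a Java assert.

```java
final class SpareKeySketch {

    // Scratch state that removeRange overwrites on every attempt.
    private final StringBuilder spare1 = new StringBuilder();

    // Scratch state reserved for callers; removeRange never writes to it.
    private final StringBuilder spare3 = new StringBuilder();

    StringBuilder spare1() {
        return spare1;
    }

    StringBuilder spare3() {
        return spare3;
    }

    // Runs two attempts (the second models a retry) and returns the key
    // value that the final attempt actually saw.
    String removeRange(final StringBuilder key, final boolean guard) {
        if (guard && key == spare1) {
            throw new IllegalArgumentException("caller passed a scratch key that this method mutates");
        }
        String seen = null;
        for (int attempt = 0; attempt < 2; attempt++) {
            seen = key.toString();
            // Internal work clobbers the scratch key.
            spare1.setLength(0);
            spare1.append("scratch");
        }
        return seen;
    }
}
```

Without the guard, a caller that passes spare1 gets the right key on the first attempt and the clobbered value on the retry; the identity check turns that silent corruption into an immediate failure.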

Peter Beaman (pbeaman) on 2012-11-28:

review: Needs Resubmitting

Nathan Williams (nwilliams) wrote on 2012-11-28:

Glad my pestering led to something useful.

review: Approve

Preview Diff

 === modified file 'src/main/java/com/persistit/Buffer.java'
 --- src/main/java/com/persistit/Buffer.java	2012-10-26 19:44:17 +0000
 +++ src/main/java/com/persistit/Buffer.java	2012-11-28 18:57:32 +0000
@@ -246,6 +246,7 @@
      final static int LONGREC_PREFIX_OFFSET = 20;
      final static int LONGREC_SIZE = LONGREC_PREFIX_OFFSET + LONGREC_PREFIX_SIZE;
++    final static int ANTIVALUE_TYPE = Value.CLASS_ANTIVALUE;
      /**
       * Implicit overhead size
       */
@@ -1182,7 +1183,7 @@
          return pointer;
+     }
--    void setLongRecordPointer(final int foundAt, final long pointer) {
++    void neuterLongRecord(final int foundAt) {
          assert isDataPage() : "Invalid page type for long records: " + this;
          final int kbData = getInt(foundAt & P_MASK);
          final int tail = decodeKeyBlockTail(kbData);
@@ -1196,9 +1197,10 @@
          if ((_bytes[tail + _tailHeaderSize + klength] & 0xFF) != LONGREC_TYPE) {
              return;
+         }
--
--        putLong(tail + _tailHeaderSize + klength + LONGREC_PAGE_OFFSET, (int) pointer);
--
++        /*
++         * Value will now be undefined rather than a long record.
++         */
++        putByte(tail + _tailHeaderSize + klength, ANTIVALUE_TYPE);
+     }
      long getPointer(final int foundAt) throws PersistitException {
@@ -3597,7 +3599,6 @@
       * @throws PersistitException
       */
      boolean pruneMvvValues(final Tree tree, final boolean pruneLongMVVs) throws PersistitException {
--
          boolean changed = false;
          try {
              boolean hasLongMvvRecords = false;
 === modified file 'src/main/java/com/persistit/CleanupManager.java'
 --- src/main/java/com/persistit/CleanupManager.java	2012-09-26 18:34:01 +0000
 +++ src/main/java/com/persistit/CleanupManager.java	2012-11-28 18:57:32 +0000
@@ -44,7 +44,7 @@
      private final static long MINIMUM_MAINTENANCE_INTERVAL_NS = 1000000000L;
--    private final static long MINIMUM_PRUNE_OBSOLETE_TRANSACTIONS_INTERVAL_NS = 50000000000L;
++    private final static long MINIMUM_PRUNE_OBSOLETE_TRANSACTIONS_INTERVAL_NS = 5000000000L;
      private final static long DEFAULT_MINIMUM_PRUNING_DELAY_NS = 1000;
 === modified file 'src/main/java/com/persistit/Exchange.java'
 --- src/main/java/com/persistit/Exchange.java	2012-11-25 20:14:58 +0000
 +++ src/main/java/com/persistit/Exchange.java	2012-11-28 18:57:32 +0000
@@ -49,6 +49,7 @@
  import com.persistit.exception.PersistitException;
  import com.persistit.exception.PersistitInterruptedException;
  import com.persistit.exception.ReadOnlyVolumeException;
++import com.persistit.exception.RebalanceException;
  import com.persistit.exception.RetryException;
  import com.persistit.exception.RollbackException;
  import com.persistit.exception.TreeNotFoundException;
@@ -969,6 +970,26 @@
       * An additional <code>Key</code> maintained for the convenience of
       * {@link Transaction}, {@link PersistitMap} and {@link JournalManager}.
+      *
++     * @return spareKey3
++     */
++    Key getAuxiliaryKey3() {
++        return _spareKey3;
++    }
++
++    /**
++     * An additional <code>Key</code> maintained for the convenience of
++     * {@link Transaction}, {@link PersistitMap} and {@link JournalManager}.
++     *
++     * @return spareKey4
++     */
++    Key getAuxiliaryKey4() {
++        return _spareKey4;
++    }
++
++    /**
++     * An additional <code>Key</code> maintained for the convenience of
++     * {@link Transaction}, {@link PersistitMap} and {@link JournalManager}.
++     *
       * @return spareKey2
       */
      Key getAuxiliaryKey2() {
@@ -3196,6 +3217,13 @@
       */
      boolean raw_removeKeyRangeInternal(final Key key1, final Key key2, final boolean fetchFirst,
              final boolean removeOnlyAntiValue) throws PersistitException {
++        /*
++         * _spareKey1 and _spareKey2 are mutated within the method and are then
++         * wrong in the event of a retry loop.
++         */
++
++        assert key1 != _spareKey1 && key2 != _spareKey1 && key1 != _spareKey2 && key2 != _spareKey2;
++
          _persistit.checkClosed();
          _persistit.checkSuspended();
@@ -3422,8 +3450,15 @@
                              Debug.$assert0.t(_tree.isOwnedAsWriterByMe() && buffer1.isOwnedAsWriterByMe()
                                      && buffer2.isOwnedAsWriterByMe());
--                            final boolean rebalanced = buffer1.join(buffer2, foundAt1, foundAt2, _spareKey1,
--                                    _spareKey2, _joinPolicy);
++                            boolean rebalanced = false;
++                            try {
++                                rebalanced = buffer1.join(buffer2, foundAt1, foundAt2, _spareKey1, _spareKey2,
++                                        _joinPolicy);
++                            } catch (final RebalanceException rbe) {
++                                rebalanceSplit(lc);
++                                level++;
++                                continue;
++                            }
                              if (buffer1.isDataPage()) {
                                  _tree.bumpChangeCount();
+                             }
@@ -3581,6 +3616,58 @@
          return result;
+     }
++    /**
++     * Handle the extremely rare case where removing a key from a pair of
++     * adjacent pages requires the left page to be split. To split the page this
++     * method inserts an empty record with key being deleted, allowing the
++     * {@link Buffer#split(Buffer, Key, ValueHelper, int, Key, Sequence, SplitPolicy)}
++     * method to be used.
++     *
++     * @param lc
++     *            LevelCache set up by raw_removeKeyRangeInternal
++     * @throws PersistitException
++     */
++    private void rebalanceSplit(final LevelCache lc) throws PersistitException {
++        //
++        // Allocate a new page
++        //
++        final int level = lc._level;
++        final int foundAt = lc._leftFoundAt;
++        final Buffer left = lc._leftBuffer;
++        final Buffer inserted = _volume.getStructure().allocPage();
++        try {
++            final long timestamp = timestamp();
++            left.writePageOnCheckpoint(timestamp);
++            inserted.writePageOnCheckpoint(timestamp);
++
++            Debug.$assert0.t(inserted.getPageAddress() != 0);
++            Debug.$assert0.t(inserted != left);
++
++            inserted.init(left.getPageType());
++
++            final Value value = _persistit.getThreadLocalValue();
++            value.clear();
++            _rawValueWriter.init(value);
++            final Key key = _persistit.getThreadLocalKey();
++            lc._rightBuffer.nextKey(key, Buffer.HEADER_SIZE);
++
++            left.split(inserted, key, _rawValueWriter, foundAt | EXACT_MASK, _spareKey1, Sequence.NONE,
++                    SplitPolicy.EVEN_BIAS);
++
++            inserted.setRightSibling(left.getRightSibling());
++            left.setRightSibling(inserted.getPageAddress());
++            left.setDirtyAtTimestamp(timestamp);
++            inserted.setDirtyAtTimestamp(timestamp);
++            lc._leftBuffer = inserted;
++            lc._leftFoundAt = inserted.findKey(key);
++
++            _persistit.getCleanupManager().offer(
++                    new CleanupManager.CleanupIndexHole(_tree.getHandle(), inserted.getPageAddress(), level));
++        } finally {
++            left.releaseTouched();
++        }
++    }
++
      private void removeKeyRangeReleaseLevel(final int level) {
          final LevelCache lc = _levelCache[level];
@@ -3710,12 +3797,12 @@
                  final int offset = (int) (at >>> 32);
                  final int size = (int) at;
                  if (size == 1 && buffer.getBytes()[offset] == MVV.TYPE_ANTIVALUE) {
--                    buffer.nextKey(_spareKey1, Buffer.KEY_BLOCK_START);
++                    buffer.nextKey(_spareKey3, Buffer.KEY_BLOCK_START);
                      buffer.release();
                      buffer = null;
--                    _spareKey1.copyTo(_spareKey2);
--                    _spareKey2.nudgeDeeper();
--                    raw_removeKeyRangeInternal(_spareKey1, _spareKey2, false, true);
++                    _spareKey3.copyTo(_spareKey4);
++                    _spareKey4.nudgeDeeper();
++                    raw_removeKeyRangeInternal(_spareKey3, _spareKey4, false, true);
                      return true;
+                 }
+             }
 === modified file 'src/main/java/com/persistit/TransactionPlayer.java'
 --- src/main/java/com/persistit/TransactionPlayer.java	2012-10-11 20:35:03 +0000
 +++ src/main/java/com/persistit/TransactionPlayer.java	2012-11-28 18:57:32 +0000
@@ -199,8 +199,8 @@
                      final int elisionCount = DR.getKey2Elision(bb);
                      final Exchange exchange = getExchange(DR.getTreeHandle(bb), address, startTimestamp);
                      exchange.ignoreTransactions();
--                    final Key key1 = exchange.getAuxiliaryKey1();
--                    final Key key2 = exchange.getAuxiliaryKey2();
++                    final Key key1 = exchange.getAuxiliaryKey3();
++                    final Key key2 = exchange.getAuxiliaryKey4();
                      System.arraycopy(bb.array(), bb.position() + DR.OVERHEAD, key1.getEncodedBytes(), 0, key1Size);
                      key1.setEncodedSize(key1Size);
                      final int key2Size = innerSize - DR.OVERHEAD - key1Size;
@@ -208,8 +208,7 @@
                      System.arraycopy(bb.array(), bb.position() + DR.OVERHEAD + key1Size, key2.getEncodedBytes(),
                              elisionCount, key2Size);
                      key2.setEncodedSize(key2Size + elisionCount);
--                    listener.removeKeyRange(address, startTimestamp, exchange, exchange.getAuxiliaryKey1(),
--                            exchange.getAuxiliaryKey2());
++                    listener.removeKeyRange(address, startTimestamp, exchange, key1, key2);
                      appliedUpdates.incrementAndGet();
                      releaseExchange(exchange);
                      break;
 === modified file 'src/main/java/com/persistit/VolumeStructure.java'
 --- src/main/java/com/persistit/VolumeStructure.java	2012-10-03 14:43:25 +0000
 +++ src/main/java/com/persistit/VolumeStructure.java	2012-11-28 18:57:32 +0000
@@ -624,7 +624,7 @@
                       * Detects whether and prevents same pointer from being read
                       * and deallocated twice.
                       */
--                    buffer.setLongRecordPointer(p, INVALID_PAGE_ADDRESS);
++                    buffer.neuterLongRecord(p);
+                 }
+             }
+         }
 === added file 'src/test/java/com/persistit/ExchangeRebalanceTest.java'
 --- src/test/java/com/persistit/ExchangeRebalanceTest.java	1970-01-01 00:00:00 +0000
 +++ src/test/java/com/persistit/ExchangeRebalanceTest.java	2012-11-28 18:57:32 +0000
@@ -0,0 +1,175 @@
++/**
++ * Copyright © 2012 Akiban Technologies, Inc.  All rights reserved.
++ *
++ * This program and the accompanying materials are made available
++ * under the terms of the Eclipse Public License v1.0 which
++ * accompanies this distribution, and is available at
++ * http://www.eclipse.org/legal/epl-v10.html
++ *
++ * This program may also be available under different license terms.
++ * For more information, see www.akiban.com or contact licensing@akiban.com.
++ *
++ * Contributors:
++ * Akiban Technologies, Inc.
++ */
++
++package com.persistit;
++
++import static org.junit.Assert.assertEquals;
++
++import org.junit.Test;
++
++import com.persistit.exception.PersistitException;
++import com.persistit.policy.SplitPolicy;
++
++public class ExchangeRebalanceTest extends PersistitUnitTestCase {
++
++    final StringBuilder sb = new StringBuilder();
++
++    @Override
++    public void setUp() throws Exception {
++        super.setUp();
++        while (sb.length() < Buffer.MAX_BUFFER_SIZE) {
++            sb.append(RED_FOX);
++        }
++    }
++
++    /**
++     * Construct a tree with two adjacent pages that are nearly full such that
++     * the second key B of the right page is long and has very small elided byte
++     * count relative to the first key A on that page. Removing key A requires
++     * the left sibling to have B as its max key. If B won't fit on the left
++     * page, then Persistit splits the left page to make room for it. Note that
++     * changes in key or page structure will likely break this test.
++     *
++     * @throws Exception
++     */
++    @Test
++    public void testRebalanceException() throws Exception {
++        final Exchange exchange = _persistit.getExchange("persistit", "rebalance", true);
++        exchange.setSplitPolicy(SplitPolicy.LEFT_BIAS);
++        setUpPrettyFullBuffers(exchange, false, RED_FOX.length(), true);
++        System.out.printf("\ntestRebalanceException\n");
++        _persistit.checkAllVolumes();
++        final int beforeRemove = countSiblings(exchange, 0);
++        exchange.clear().append("b").previous(true);
++        exchange.remove();
++        _persistit.getCleanupManager().poll();
++        _persistit.checkAllVolumes();
++        final int afterRemove = countSiblings(exchange, 0);
++        assertEquals("Remove should have caused rebalance", beforeRemove + 1, afterRemove);
++    }
++
++    /**
++     * Similar logic to {@link #testRebalanceException()} except that the keys
++     * are carefully constructed on the index leaf level rather than the data
++     * level. To do this we use three long-ish records per major key in the data
++     * pages so that the key pattern in the index leaf table aligns as desired.
++     *
++     * @throws Exception
++     */
++    @Test
++    public void testRebalanceIndexException() throws Exception {
++        final Exchange exchange = _persistit.getExchange("persistit", "rebalance", true);
++        exchange.setSplitPolicy(SplitPolicy.LEFT_BIAS);
++        setUpPrettyFullBuffers(exchange, true, 0, true);
++        System.out.printf("\ntestRebalanceIndexException\n");
++        _persistit.checkAllVolumes();
++        final int beforeRemove = countSiblings(exchange, 1);
++
++        exchange.clear().append("b").previous(true);
++        exchange.cut();
++        exchange.remove(Key.GT);
++        _persistit.getCleanupManager().poll();
++        _persistit.checkAllVolumes();
++        final int afterRemove = countSiblings(exchange, 1);
++        assertEquals("Remove should have caused rebalance", beforeRemove + 1, afterRemove);
++
++    }
++
++    private void setUpPrettyFullBuffers(final Exchange ex, final boolean asIndex, final int valueLength,
++            final boolean discontinuous) throws PersistitException {
++
++        final int depth = asIndex ? 1 : 0;
++        int a, b;
++        long leftPage = 0;
++
++        // load page A with keys of increasing length
++        for (a = 10;; a++) {
++            setUpDeepKey(ex, 'a', a);
++            storeValue(ex, valueLength, asIndex);
++            if (ex.getTree().getDepth() > depth) {
++                final Buffer b1 = ex.fetchBufferCopy(depth);
++                // Stop when nearly full
++                if (b1.getAvailableSize() < valueLength + 20) {
++                    break;
++                }
++                leftPage = b1.getPageAddress();
++            }
++        }
++
++        // load additional keys into page B
++        for (b = a + 1;; b++) {
++            if (ex.getTree().getDepth() > depth) {
++                final Buffer b2 = ex.fetchBufferCopy(depth);
++                if (b2.getPageAddress() != leftPage && b2.getAvailableSize() < valueLength + 20) {
++                    break;
++                }
++            }
++            setUpDeepKey(ex, discontinuous && b > a ? 'b' : 'a', b);
++            storeValue(ex, valueLength, asIndex);
++        }
++    }
++
++    private void setUpDeepKey(final Exchange ex, final char fill, final int n) {
++        ex.getKey().clear().append(keyString(fill, n, n - 34, 4, n)).append(1);
++    }
++
++    private void setupValue(final Exchange ex, final int valueLength) {
++        ex.getValue().put(sb.toString().substring(0, valueLength));
++    }
++
++    private void storeValue(final Exchange ex, final int valueLength, final boolean asIndex) throws PersistitException {
++        if (asIndex) {
++            // Fill up one data page so that the base key will be inserted into
++            // an index page
++            setupValue(ex, ex.getBufferPool().getBufferSize() / 4);
++            for (int i = 1; i < 4; i++) {
++                ex.cut().append(i).store();
++            }
++        } else {
++            setupValue(ex, valueLength);
++            ex.store();
++        }
++    }
++
++    String keyString(final char fill, final int length, final int prefix, final int width, final int k) {
++        final StringBuilder sb = new StringBuilder();
++        for (int i = 0; i < prefix && i < length; i++) {
++            sb.append(fill);
++        }
++        sb.append(String.format("%0" + width + "d", k));
++        for (int i = length - sb.length(); i < length; i++) {
++            sb.append(fill);
++        }
++        sb.setLength(length);
++        return sb.toString();
++    }
++
++    private int countSiblings(final Exchange exchange, final int level) throws PersistitException {
++        int count = 0;
++        exchange.clear().append(Key.BEFORE);
++        Buffer b = exchange.fetchBufferCopy(level);
++        while (true) {
++            count++;
++            final long rightSibling = b.getRightSibling();
++            if (rightSibling != 0) {
++                b = exchange.getBufferPool().getBufferCopy(exchange.getVolume(), rightSibling);
++            } else {
++                break;
++            }
++        }
++        return count;
++    }
++
++}
