Code review comment for lp:~pbeaman/akiban-persistit/fix-rebalance-exception2

Revision history for this message
Peter Beaman (pbeaman) wrote :

Re MINIMUM_PRUNE_OBSOLETE_TRANSACTIONS_INTERVAL_MS -

The intended value was 5 seconds, and due to a typo, the value has instead been 50 seconds. (Yes, someday we should do a time-standard sweep through the whole product - we have a story for this in the icebox.) The original intention was to avoid spending CPU to perform the pruneObsoleteTransaction loop on every CLEANUP_MANAGER polling cycle, which when the system is busy can soak the CPU. But the problem with 50 seconds is that it can allow obsolete transactions to pile up for a very long time. 5 seconds seems like a good compromise.

This change is not related to fixing the rebalance issue, and if you think we really need to, I could propose it as a separate branch. However, all the stress tests I've been running recently have used the new value.

Re _spareKey1/2 vs _spareKey3/4

Yes 1/2 are not safe to use here as I discovered the hard way. Again, we need to refactor Exchange some day to get rid of all these undocumented private contracts, but for the purpose of this branch I simple reviewed the code path and determined that 3/4 available and safe to use.

Re the changed() method and _id logic:

I have removed those changes and will explain why. I believe the increment of _changed is always safe because it was done only by a thread that had exclusive ownership of the page. However, there is an unsafe part of the logic and I could not figure out a way to avoid it. Specifically, if changed() detects that the buffer has changed identity, it will try to wake up all the threads waiting to claim it by interrupting them. The _id hack is there so that those threads can then try to figure out why they were interrupted. But even with the addition of the isQueued(Thread) call, I observed a case where a thread was interrupted while doing something else and so the application got a spurious PersistitInterruptedException.

The original purpose of this logic was to try to avoid a potential live-lock situation in which every time a thread wakes up after a short interval, it loses its place in the waiting thread queue. I believed but had no proof that this was the mechanism that caused some of the InUseException cases. However, recently I found an actual deadlock that probably explains those instances, so am less convinced of the live-lock scenario.

Also, removing that stuff makes this proposal specific to the RebalanceException which is cleaner. If we need the interrupt changes, we can always merge them in separately. I few more stress-test runs will tell the story.

« Back to merge proposal