Merge lp:~laurynas-biveinis/percona-server/bp-split-5.6 into lp:percona-server/5.6

Proposed by Laurynas Biveinis
Status: Merged
Approved by: Alexey Kopytov
Approved revision: no longer in the source branch.
Merged at revision: 425
Proposed branch: lp:~laurynas-biveinis/percona-server/bp-split-5.6
Merge into: lp:percona-server/5.6
Diff against target: 4913 lines (+1001/-789)
22 files modified
Percona-Server/storage/innobase/btr/btr0cur.cc (+18/-6)
Percona-Server/storage/innobase/btr/btr0sea.cc (+5/-8)
Percona-Server/storage/innobase/buf/buf0buddy.cc (+46/-18)
Percona-Server/storage/innobase/buf/buf0buf.cc (+220/-196)
Percona-Server/storage/innobase/buf/buf0dblwr.cc (+1/-0)
Percona-Server/storage/innobase/buf/buf0dump.cc (+8/-8)
Percona-Server/storage/innobase/buf/buf0flu.cc (+153/-125)
Percona-Server/storage/innobase/buf/buf0lru.cc (+280/-177)
Percona-Server/storage/innobase/buf/buf0rea.cc (+41/-26)
Percona-Server/storage/innobase/fsp/fsp0fsp.cc (+1/-1)
Percona-Server/storage/innobase/handler/ha_innodb.cc (+12/-2)
Percona-Server/storage/innobase/handler/i_s.cc (+14/-9)
Percona-Server/storage/innobase/ibuf/ibuf0ibuf.cc (+1/-1)
Percona-Server/storage/innobase/include/buf0buddy.h (+4/-4)
Percona-Server/storage/innobase/include/buf0buddy.ic (+9/-10)
Percona-Server/storage/innobase/include/buf0buf.h (+91/-108)
Percona-Server/storage/innobase/include/buf0buf.ic (+63/-63)
Percona-Server/storage/innobase/include/buf0flu.h (+6/-7)
Percona-Server/storage/innobase/include/buf0flu.ic (+0/-2)
Percona-Server/storage/innobase/include/buf0lru.h (+9/-7)
Percona-Server/storage/innobase/include/sync0sync.h (+13/-4)
Percona-Server/storage/innobase/sync/sync0sync.cc (+6/-7)
To merge this branch: bzr merge lp:~laurynas-biveinis/percona-server/bp-split-5.6
Reviewer Review Type Date Requested Status
Alexey Kopytov (community) Approve
Laurynas Biveinis Pending
Review via email: mp+186711@code.launchpad.net

This proposal supersedes a proposal from 2013-09-09.

Description of the change

Repushed and resubmitted. The only change from the 3rd push is that the atomic ops have been split off to be handled later.

http://jenkins.percona.com/job/percona-server-5.6-param/259/

No BT or ST, but 5.6 GA prerequisite.

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

This is not a complete review; I'm only about 10% into the patch, but I'm posting the comments I have so far to parallelize things a bit:

  - the code in btr_blob_free() can be simplified: just initialize
    ‘freed’ with ‘false’, then assign to it the result of
    buf_LRU_free_page() whenever it is called, and then do this at the
    end:

    if (!freed) {
     mutex_exit(&buf_pool->LRU_list_mutex);
    }

    Would result in less hairy code.

  - wrong comments for buf_LRU_free_page(): s/block
    mutex/buf_page_get_mutex() mutex/

  - the comments for buf_LRU_free_page() say that both LRU_list_mutex
    and block_mutex may be released temporarily if ‘true’ is
    returned. But:
    1) even if ‘false’ is returned, block_mutex may also be released
       temporarily
    2) the comments don’t mention that if ‘true’ is returned,
       LRU_list_mutex is always released upon return, but block_mutex is
       always locked. And callers of buf_LRU_free_page() rely on that.

  - the following code in buf_LRU_free_page() is missing a
    buf_page_free_descriptor() call if b != NULL. Which is a potential
    memory leak.

  if (!buf_LRU_block_remove_hashed(bpage, zip)) {
+
+ mutex_exit(&buf_pool->LRU_list_mutex);
+
+ mutex_enter(block_mutex);
+
   return(true);
  }

  - the patch removes buf_pool_mutex_enter_all() from
    btr_search_validate_one_table(), but then does a number of dirty
    reads from ‘block’ before it locks block->mutex. Any reasons to not
    lock block->mutex earlier?

  - the following checks for mutex != NULL in buf_buddy_relocate() seem
    to be redundant, since they are made after mutex_enter(mutex), so we
    are guaranteed mutex != NULL if we reach that code:

@@ -584,7 +604,11 @@ buf_buddy_relocate(

  mutex_enter(mutex);

- if (buf_page_can_relocate(bpage)) {
+ rw_lock_s_unlock(hash_lock);
+
+ mutex_enter(&buf_pool->zip_free_mutex);
+
+ if (mutex && buf_page_can_relocate(bpage)) {

and

- mutex_exit(mutex);
+ if (mutex)
+ mutex_exit(mutex);
+
+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));

    and the last hunk is also missing braces (in case you decide to keep
    it).

  - asserting that zip_free_mutex is locked also looks redundant to me,
    because it is locked just a few lines above, and there’s nothing in
    the code path that could release it.

  - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
    need that stuff. Their names are misleading as they don’t enforce
    any atomicity. They should be named os_ordered_load_ulint() /
    os_ordered_store_ulint(), but... what specific order are you trying
    to enforce with those constructs?

  - I don’t see a point in maintaining multiple list nodes in buf_page_t
    (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
    page may only be in a single list at any point in time, so splitting
    the list node is purely cosmetic.

    On the other hand, we are looking at a non-trivial buf_page_t size
    increase (112 bytes before the patch, 144 bytes after). Leaving all
    cache and memory locality questions aside, that’s 64 MB of memory
    just for list node pointers on a system with a 32 GB buffer pool....
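
    (For the arithmetic, presumably assuming the default 16 KB page
    size: 32 GB / 16 KB is 2,097,152 pages, and 32 extra bytes for each
    buf_page_t gives 64 MB.)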


review: Needs Fixing
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> - the code in btr_blob_free() can be simplified

Simplified.
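
For illustration, the simplified shape could look roughly like this (my
sketch of the idea based on the existing 5.6 btr_blob_free() code, not the
exact committed hunk; 'space' and 'page_no' are saved earlier in the
function as before):

  bool freed = false;

  mutex_enter(&buf_pool->LRU_list_mutex);
  mutex_enter(&block->mutex);

  /* Only free the block if it is still allocated to the same file page. */
  if (buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE
      && buf_block_get_space(block) == space
      && buf_block_get_page_no(block) == page_no) {

    freed = buf_LRU_free_page(&block->page, all);

    if (!freed && all && block->page.zip.data) {
      /* Attempt to deallocate the uncompressed copy if the
      whole block cannot be freed. */
      freed = buf_LRU_free_page(&block->page, false);
    }
  }

  /* buf_LRU_free_page() releases the LRU list mutex itself when it
  returns true, so only release it here if nothing was freed. */
  if (!freed) {
    mutex_exit(&buf_pool->LRU_list_mutex);
  }

  mutex_exit(&block->mutex);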

> - wrong comments for buf_LRU_free_page(): s/block
> mutex/buf_page_get_mutex() mutex/

Edited. But there are numerous other places in this patch (and upstream) that would need this editing too, and "block mutex" is already an established shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention the pointer-to-mutex variables named block_mutex.

Do you want me to edit the other places too?

> - the comments for buf_LRU_free_page() say that both LRU_list_mutex
> and block_mutex may be released temporarily if ‘true’ is
> returned. But:
> 1) even if ‘false’ is returned, block_mutex may also be released
> temporarily
> 2) the comments don’t mention that if ‘true’ is returned,
> LRU_list_mutex is always released upon return, but block_mutex is
> always locked. And callers of buf_LRU_free_page() rely on that.

Indeed callers rely on the current, arguably messy, buf_LRU_free_page() locking. This is how I edited the header comment for this and the previous review comment:

/******************************************************************//**
Try to free a block. If bpage is a descriptor of a compressed-only
page, the descriptor object will be freed as well.

NOTE: If this function returns true, it will release the LRU list mutex,
and temporarily release and relock the buf_page_get_mutex() mutex.
Furthermore, the page frame will no longer be accessible via bpage. If this
function returns false, the buf_page_get_mutex() might be temporarily
released and relocked too.

The caller must hold the LRU list and buf_page_get_mutex() mutexes.

@return true if freed, false otherwise. */
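
In caller terms the contract above means roughly the following (an
illustrative sketch, not code from the patch; assuming the usual
buf_LRU_free_page(bpage, zip) signature):

  mutex_enter(&buf_pool->LRU_list_mutex);
  mutex_enter(block_mutex);

  if (buf_LRU_free_page(bpage, true)) {
    /* The LRU list mutex has already been released for us, and
    block_mutex was released and relocked, so it is held here. */
    mutex_exit(block_mutex);
  } else {
    /* Both mutexes are still held (block_mutex may have been
    temporarily released and relocked). */
    mutex_exit(block_mutex);
    mutex_exit(&buf_pool->LRU_list_mutex);
  }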

> - the following code in buf_LRU_free_page() is missing a
> buf_page_free_descriptor() call if b != NULL. Which is a potential
> memory leak.

Fixed.
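
Roughly along these lines (a sketch of the fix, not the exact committed
hunk):

  if (!buf_LRU_block_remove_hashed(bpage, zip)) {

    mutex_exit(&buf_pool->LRU_list_mutex);

    /* The preallocated compressed-only descriptor is not used on
    this path, so free it here to avoid the leak. */
    if (b != NULL) {
      buf_page_free_descriptor(b);
    }

    mutex_enter(block_mutex);

    return(true);
  }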

> - the patch removes buf_pool_mutex_enter_all() from
> btr_search_validate_one_table(), but then does a number of dirty
> reads from ‘block’ before it locks block->mutex. Any reasons to not
> lock block->mutex earlier?

I *think* there were actual reasons, but I cannot remember them, due to the number of things going on with this patch. And I don't see why locking block->mutex earlier is not possible now. I will look further.

> - the following checks for mutex != NULL in buf_buddy_relocate() seem
> to be redundant, since they are made after mutex_enter(mutex), so we
> are guaranteed mutex != NULL if we reach that code:

Fixed. This looks like a missed cleanup after removing 5.5's buf_page_get_mutex_enter().

> - asserting that zip_free_mutex is locked also looks redundant to me,
> because it is locked just a few lines above, and there’s nothing in
> the code path that could release it.

Removed. Added a comment to the function header "The caller must hold zip_free_mutex, and this
function will release and lock it again." instead.

> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
> need that stuff. Their names are misleading as they don’t enforce
> any atomicity.

The ops being load and store, their atom...


Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Wed, 11 Sep 2013 09:45:13 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>> - the code in btr_blob_free() can be simplified
>
> Simplified.
>
>> - wrong comments for buf_LRU_free_page(): s/block
>> mutex/buf_page_get_mutex() mutex/
>
> Edited. But there are numerous other places in this patch (and upstream) that would need this editing too, and "block mutex" is already an established shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention pointer to mutex variables named block_mutex.
>

I think editing existing comments is worth the efforts (and potential
extra maintenance cost in the future). I would be OK if this specific
comment was left intact too. It only caught my eye because the comment
was edited and I spent some time verifying it.

> Do you want me to edit the other places too?
>
>> - the comments for buf_LRU_free_page() say that both LRU_list_mutex
>> and block_mutex may be released temporarily if ‘true’ is
>> returned. But:
>> 1) even if ‘false’ is returned, block_mutex may also be released
>> temporarily
>> 2) the comments don’t mention that if ‘true’ is returned,
>> LRU_list_mutex is always released upon return, but block_mutex is
>> always locked. And callers of buf_LRU_free_page() rely on that.
>
> Indeed callers rely on the current, arguably messy, buf_LRU_free_page() locking. This is how I edited the header comment for this and the previous review comment:
>
> /******************************************************************//**
> Try to free a block. If bpage is a descriptor of a compressed-only
> page, the descriptor object will be freed as well.
>
> NOTE: If this function returns true, it will release the LRU list mutex,
> and temporarily release and relock the buf_page_get_mutex() mutex.
> Furthermore, the page frame will no longer be accessible via bpage. If this
> function returns false, the buf_page_get_mutex() might be temporarily
> released and relocked too.
>
> The caller must hold the LRU list and buf_page_get_mutex() mutexes.
>
> @return true if freed, false otherwise. */
>
>

Looks good.

>> - the patch removes buf_pool_mutex_enter_all() from
>> btr_search_validate_one_table(), but then does a number of dirty
>> reads from ‘block’ before it locks block->mutex. Any reasons to not
>> lock block->mutex earlier?
>
> I *think* there were actual reasons, but I cannot remember them, due to the number of things going on with this patch. And I don't see why it locking block->mutex earlier is not possible now. I will look further.
>

OK.

>> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>> need that stuff. Their names are misleading as they don’t enforce
>> any atomicity.
>
> The ops being load and store, their atomicity is enforced by the data type width.
>

Right, the atomicity is enforced by the data type width on those
architectures that provide it. And even those that do provide it have a
number of prerequisites. Neither of those 2 facts is taken care of in
os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
different with respect to atomi...

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

On Wed, 11 Sep 2013 16:06:21 +0400, Alexey Kopytov wrote:
> I think editing existing comments is worth the efforts (and potential
> extra maintenance cost in the future). I would be OK if this specific

Grr, that was supposed to be "NOT worth the efforts" of course.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

> > - the patch removes buf_pool_mutex_enter_all() from
> > btr_search_validate_one_table(), but then does a number of dirty
> > reads from ‘block’ before it locks block->mutex. Any reasons to not
> > lock block->mutex earlier?
>
> I *think* there were actual reasons, but I cannot remember them, due to the
> number of things going on with this patch. And I don't see why it locking
> block->mutex earlier is not possible now. I will look further.

A test run helped to recover my memories. So the problem is the buf_block_hash_get() call, which locks page_hash, which violates the locking order. Keeping the initial reads is OK, because 1) the block pointer will not be invalidated because we are reading AHI-indexed pages only, which can be only BUF_BLOCK_FILE_PAGE or BUF_BLOCK_REMOVE_HASH (which is a kind of BUF_BLOCK_FILE_PAGE for our purposes), 2) block space and page_id cannot change either while we hold the corresponding AHI X latch. Thus the initial dirty reads are not actually dirty.

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Wed, 11 Sep 2013 12:26:57 -0000, Laurynas Biveinis wrote:
>>> - the patch removes buf_pool_mutex_enter_all() from
>>> btr_search_validate_one_table(), but then does a number of dirty
>>> reads from ‘block’ before it locks block->mutex. Any reasons to not
>>> lock block->mutex earlier?
>>
>> I *think* there were actual reasons, but I cannot remember them, due to the
>> number of things going on with this patch. And I don't see why it locking
>> block->mutex earlier is not possible now. I will look further.
>
> A test run helped to recover my memories. So the problem is buf_block_hash_get() call, which locks page_hash, which violates locking order. Keeping the initial reads is OK, because 1) block pointer will not be invalidated because we are reading AHI-indexed pages only, which can be only BUF_BLOCK_FILE_PAGE or BUF_BLOCK_REMOVE_HASH (which is a kind of BUF_BLOCK_FILE_PAGE for our purposes), 2) block space and page_id cannot change neither while we hold a corresponding AHI X latch. Thus the initial dirty reads are not actually dirty.
>

Right, makes sense.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

> >> - wrong comments for buf_LRU_free_page(): s/block
> >> mutex/buf_page_get_mutex() mutex/
> >
> > Edited. But there are numerous other places in this patch (and upstream)
> that would need this editing too, and "block mutex" is already an established
> shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention
> pointer to mutex variables named block_mutex.
> >
>
> I think editing existing comments is not worth the efforts (and potential
> extra maintenance cost in the future). I would be OK if this specific
> comment was left intact too. It only caught my eye because the comment
> was edited and I spent some time verifying it.

Fair enough. I applied the same edit to the other comments changed by this patch.

> >> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
> >> need that stuff. Their names are misleading as they don’t enforce
> >> any atomicity.
> >
> > The ops being load and store, their atomicity is enforced by the data type
> width.
> >
>
> Right, the atomicity is enforced by the data type width on those
> architectures that provide it.

I forgot to mention that they also must not be misaligned, so that one access does not translate into two accesses.

> And even those that do provide it have a
> number of prerequisites. Neither of those 2 facts is taken care of in
> os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
> different with respect to atomicity as plain load/store of a ulint and
> thus, have misleading names.
>
> So to justify "atomic" in their names those functions should:
>
> - (if we want to be portable) protect those load/stores with a mutex

Why? I guess this question boils down to, what would the mutex implementation code additionally ensure here, let's say, on x86_64? Or is this referring to the 5.6 mutex fallbacks when no atomic ops are implemented for a platform?

> - (if we only care about x86/x86_64) make sure that values being
> loaded/stored do not cross cache lines or page boundaries. Which is of
> course impossible to guarantee in a generic function.

Why? We are talking about ulints here only, and I was not able to find such requirements in the x86_64 memory model descriptions. There is a requirement to be aligned, and misaligned stores/loads might indeed cross cache line or page boundaries, and anything that crosses them is indeed non-atomic. But alignment is possible to guarantee in a generic function (which doesn't even have to be generic: the x86_64 implementation is for x86_64 only, obviously).

Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide, Part 1, section 8.1.1, http://download.intel.com/products/processor/manual/253668.pdf:

"The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
(...)
• Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (...):
• Reading or writing a quadword aligned on a 64-bit boundary
"

My understanding of the above is that os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description, modulo alignment ...

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Wed, 11 Sep 2013 15:29:33 -0000, Laurynas Biveinis wrote:
>>>> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>>>> need that stuff. Their names are misleading as they don’t enforce
>>>> any atomicity.
>>>
>>> The ops being load and store, their atomicity is enforced by the data type
>> width.
>>>
>>
>> Right, the atomicity is enforced by the data type width on those
>> architectures that provide it.
>
>
> I forgot to mention that they also have to be not misaligned so that one access does not translate to two accesses.
>

Yes, but alignment does not guarantee atomicity, see below.

>
>> And even those that do provide it have a
>> number of prerequisites. Neither of those 2 facts is taken care of in
>> os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
>> different with respect to atomicity as plain load/store of a ulint and
>> thus, have misleading names.
>>
>> So to justify "atomic" in their names those functions should:
>>
>> - (if we want to be portable) protect those load/stores with a mutex
>
>
> Why? I guess this question boils down to, what would the mutex implementation code additionally ensure here, let's say, on x86_64? Or is this referring to the 5.6 mutex fallbacks when no atomic ops are implemented for a platform?
>

A mutex is the only portable way to ensure atomicity. You can use atomic
primitives provided by specific architectures, but then you either limit
support to those architectures or, yes, provide a mutex fallback.

>
>> - (if we only care about x86/x86_64) make sure that values being
>> loaded/stored do not cross cache lines or page boundaries. Which is of
>> course impossible to guarantee in a generic function.
>
>
> Why? We are talking about ulints here only, and I was not able to find such requirements in the x86_64 memory model descriptions. There is a requirement to be aligned, and misaligned stores/loads might indeed cross cache line or page boundaries, and anything that crosses them is indeed non-atomic. But alignment is possible to guarantee in a generic function (which doesn't even has to be generic: the x86_64 implementation is for x86_64 only, obviously).
>
> Intel® 64 and IA-32 Architectures Software Developer's Manual
> Volume 3A: System Programming Guide, Part 1, section 8.1.1, http://download.intel.com/products/processor/manual/253668.pdf:
>
> "The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
> always be carried out atomically:
> (...)
> • Reading or writing a doubleword aligned on a 32-bit boundary
>
> The Pentium processor (...):
> • Reading or writing a quadword aligned on a 64-bit boundary
> "

Why didn't you quote it further?

"
Accesses to cacheable memory that are split across cache lines and page
boundaries are not guaranteed to be atomic by <all processors>. <all
processors> provide bus control signals that permit external memory
subsystems to make split accesses atomic;
"

Which means even aligned accesses are not guaranteed to be atomic and
it's up to the implementation of "external memory subsystems" (that
probably means chipsets, motherboards, NUMA archi...

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> >> - (if we only care about x86/x86_64) make sure that values being
> >> loaded/stored do not cross cache lines or page boundaries. Which is of
> >> course impossible to guarantee in a generic function.
> >
> >
> > Why? We are talking about ulints here only, and I was not able to find such
> requirements in the x86_64 memory model descriptions. There is a requirement
> to be aligned, and misaligned stores/loads might indeed cross cache line or
> page boundaries, and anything that crosses them is indeed non-atomic. But
> alignment is possible to guarantee in a generic function (which doesn't even
> has to be generic: the x86_64 implementation is for x86_64 only, obviously).
> >
> > Intel® 64 and IA-32 Architectures Software Developer's Manual
> > Volume 3A: System Programming Guide, Part 1, section 8.1.1,
> http://download.intel.com/products/processor/manual/253668.pdf:
> >
> > "The Intel486 processor (and newer processors since) guarantees that the
> following basic memory operations will
> > always be carried out atomically:
> > (...)
> > • Reading or writing a doubleword aligned on a 32-bit boundary
> >
> > The Pentium processor (...):
> > • Reading or writing a quadword aligned on a 64-bit boundary
> > "
>
> Why didn't you quote it further?
>
> "
> Accesses to cacheable memory that are split across cache lines and page
> boundaries are not guaranteed to be atomic by <all processors>. <all
> processors> provide bus control signals that permit external memory
> subsystems to make split accesses atomic;
> "
>
> Which means even aligned accesses are not guaranteed to be atomic and
> it's up to the implementation of "external memory subsystems" (that
> probably means chipsets, motherboards, NUMA architectures and the like).

I didn't quote because we both have already acknowledged that cache line- or page boundary-crossing accesses are non-atomic, and because I don't see how it's relevant here. I don't see how a properly-aligned ulint can possibly cross a cache line boundary, when cache lines are 64-byte wide and 64-byte aligned. Or even 32 for older architectures.

> > My understanding of the above is that
> os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description,
> modulo alignment issues, if any. These are easy to ensure by ut_ad().
> >
>
> Modulo alignment, cache line boundary and page boundary issues.

Alignment only unless my reasoning above is wrong.

> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
> structure happens to be allocated in memory so that n_pages_written
> happens to be misaligned, or cross a cache line or a page boundary. How
> exactly ut_ad() is going to ensure that never happens at runtime?

A debug build would hit this assert and we'd fix the structure layout/allocation. Unless I'm mistaken, to get a misaligned ulint, we'd have to ask for this explicitly, by packing a struct, fetching a pointer to it from a byte array, etc. Thus ut_ad() seems reasonable to me.
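
For illustration, the check meant here could be as small as this (my
sketch, not the actual patch code):

  /* Debug-build check that the accessed ulint is machine-word aligned,
  so that a single load/store cannot straddle a cache line. */
  ut_ad(((ulint) ptr) % sizeof(ulint) == 0);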

> >>>> They should be named os_ordered_load_ulint() /
> >>>> os_ordered_store_ulint(),
> >>>
> >>> That's an option, but I needed atomicity, visibility, and ordering, and
> >> chose atomic for ...

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

buf_LRU_remove_all_pages() was ported incorrectly from 5.5, dropping the space id and I/O-fix re-checks after the block mutex acquisition. This has been caught by Roel as bug 1224282.

review: Needs Fixing
Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Thu, 12 Sep 2013 06:16:52 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>> - (if we only care about x86/x86_64) make sure that values being
>>>> loaded/stored do not cross cache lines or page boundaries. Which is of
>>>> course impossible to guarantee in a generic function.
>>>
>>>
>>> Why? We are talking about ulints here only, and I was not able to find such
>> requirements in the x86_64 memory model descriptions. There is a requirement
>> to be aligned, and misaligned stores/loads might indeed cross cache line or
>> page boundaries, and anything that crosses them is indeed non-atomic. But
>> alignment is possible to guarantee in a generic function (which doesn't even
>> has to be generic: the x86_64 implementation is for x86_64 only, obviously).
>>>
>>> Intel® 64 and IA-32 Architectures Software Developer's Manual
>>> Volume 3A: System Programming Guide, Part 1, section 8.1.1,
>> http://download.intel.com/products/processor/manual/253668.pdf:
>>>
>>> "The Intel486 processor (and newer processors since) guarantees that the
>> following basic memory operations will
>>> always be carried out atomically:
>>> (...)
>>> • Reading or writing a doubleword aligned on a 32-bit boundary
>>>
>>> The Pentium processor (...):
>>> • Reading or writing a quadword aligned on a 64-bit boundary
>>> "
>>
>> Why didn't you quote it further?
>>
>> "
>> Accesses to cacheable memory that are split across cache lines and page
>> boundaries are not guaranteed to be atomic by <all processors>. <all
>> processors> provide bus control signals that permit external memory
>> subsystems to make split accesses atomic;
>> "
>>
>> Which means even aligned accesses are not guaranteed to be atomic and
>> it's up to the implementation of "external memory subsystems" (that
>> probably means chipsets, motherboards, NUMA architectures and the like).
>
>
> I didn't quote because we both have already acknowledged that cache line- or page boundary-crossing accesses are non-atomic, and because I don't see how it's relevant here. I don't see how a properly-aligned ulint can possibly cross a cache line boundary, when cache lines are 64-byte wide and 64-byte aligned. Or even 32 for older architectures.
>

The array of buffer pool descriptors is allocated as follows:

 buf_pool_ptr = (buf_pool_t*) mem_zalloc(
  n_instances * sizeof *buf_pool_ptr);

so individual buf_pool_t instances are not guaranteed to have any
specific alignment, neither to cache line nor to page boundaries, right?

Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into
buf_pool_t so nothing prevents it from crossing a cache line or a page
boundary?

Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again,
nothing prevents them from crossing a cache line or a page boundary, right?

>
>>> My understanding of the above is that
>> os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description,
>> modulo alignment issues, if any. These are easy to ensure by ut_ad().
>>>
>>
>> Modulo alignment, cache line boundary and page boundary issues.
>
>
> Alignment only unless my reasoning above is wrong.
>

Yes.

>
>> I don't see how ut_ad() is going to help here. So a ...

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Repushed the branch with the 1st partial review comments addressed. Not a resubmission due to the partial review and ongoing discussions.

    Changes from the 1st MP.
    - Simplified btr_blob_free().
    - Added a note about mutexes to the header comment of
      buf_buddy_relocate().
    - Removed redundant mutex == NULL checks and mutex own assertions
      from buf_buddy_relocate().
    - Fixed locking notes in buf_LRU_free_page() header comment.
    - Removed a memory leak in one of the early exits in
      buf_LRU_free_page().
    - Clarified locking in a comment for buf_page_t::zip.
    - Added debug build checks to os_atomic_load_ulint() and
      os_atomic_store_ulint() x86_64 implementation that the accessed
      variable is properly aligned.
    - Re-added the buffer page space id and I/O fix 2nd checks (dropped
      by mistake) after the buf_page_get_mutex() has been locked in
      buf_LRU_remove_all_pages().

Please ignore the "Added debug build checks to os_atomic_load_ulint() and os_atomic_store_ulint() x86_64 implementation that the accessed variable is properly aligned." bit for now.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

A Jenkins run of the latest branch turned up bug 1224432. Logged a separate bug because I am not sure I'll manage to debug it during this MP cycle.

"...
2013-09-12 12:15:38 7ff450082700 InnoDB: Assertion failure in thread 140687291459328 in file buf0buf.cc line 3694
InnoDB: Failing assertion: buf_fix_count > 0
...

Which appears to be a race condition on buf_fix_count on a page that is a sentinel for buffer pool watch. How exactly this can happen is not clear to me currently. All the watch sentinel buf_fix_count changes happen under zip_mutex and a corresponding page_hash X lock. Further, buf_page_init_for_read() asserts buf_fix_count > 0 through buf_pool_watch_is_sentinel() at line 3647, thus this should have changed between 3647 and 3694, but the hash is X-latched throughout, even though the zip mutex is only acquired at 3670."

review: Needs Fixing
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> I don't see how a properly-aligned ulint can possibly
> cross a cache line boundary, when cache lines are 64-byte wide and 64-byte
> aligned. Or even 32 for older architectures.
> >
>
> The array of buffer pool descriptors is allocated as follows:

You are right, I failed to consider the base addresses returned by dynamic memory allocation. I also failed to notice your hint in that direction in one of the previous mails.

> buf_pool_ptr = (buf_pool_t*) mem_zalloc(
> n_instances * sizeof *buf_pool_ptr);
>
> so individual buf_pool_t instances are not guaranteed to have any
> specific alignment, neither to cache line nor to page boundaries, right?

Right.

> Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into
> buf_pool_t so nothing prevents it from crossing a cache line or a page
> boundary?

Right.

> Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again,
> nothing prevents them from crossing a cache line or a page boundary, right?

Right, nothing prevents an object of buf_pool_stat_t from crossing it. But that's OK. We only need the individual fields not to cross it.

> >> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
> >> structure happens to be allocated in memory so that n_pages_written
> >> happens to be misaligned, or cross a cache line or a page boundary. How
> >> exactly ut_ad() is going to ensure that never happens at runtime?
> >
> >
> > A debug build would hit this assert and we'd fix the structure
> layout/allocation. Unless I'm mistaken, to get a misaligned ulint, we'd have
> to ask for this explicitly, by packing a struct, fetching a pointer to it from
> a byte array, etc. Thus ut_ad() seems reasonable to me.
> >
>
> The only thing you can assume about dynamically allocated objects is
> that their addresses (and thus, the first member of a structure, if an
> object is a structure) is aligned to machine word size. Which is always
> lower than the cache line size. There are no guarantees on alignment of
> other structure members, no matter what compiler hints were used (those
> only matter for statically allocated objects).

Right. So, to conclude... no individual ulint is going to cross a cache line or a page boundary and we are good? We start with a machine-word aligned address returned from heap, and add a multiple of machine-word width to arrive at the address of an individual field, which is machine-word aligned and thus the individual field cannot cross anything?
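
A sketch of that argument in code (illustrative only; the member names are
the ones used in this discussion, and static_assert/offsetof are used here
purely for the example):

  /* mem_zalloc()/malloc() return at least machine-word-aligned memory,
  and every ulint member sits at an offset that is a multiple of
  sizeof(ulint), so each individual counter is word-aligned and cannot
  straddle a 64-byte cache line or a page boundary. */
  static_assert(offsetof(buf_pool_stat_t, n_pages_written)
                % sizeof(ulint) == 0,
                "stat counters must be word-aligned within the struct");
  ut_ad(((ulint) &buf_pool->stat.n_pages_written) % sizeof(ulint) == 0);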

> >>>>>> They should be named os_ordered_load_ulint() /
> >>>>>> os_ordered_store_ulint(),
> >>>>>
> >>>>> That's an option, but I needed atomicity, visibility, and ordering, and
> >>>> chose atomic for function name to match the existing CAS and atomic add
> >>>> operations, which also need all three.
> >>>>>
> >>>>
> >>>> I'm not sure you need all of those 3 in every occurrence of those
> >>>> functions, but see below.
> >>>
> >>>
> >>> That's right. And ordering probably is not needed anywhere, sorry about
> >> that, my understanding of atomics is far from fluent. But visibility
> should
> >> be needed in every occurrence if this is ever ported to a...

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

You are right that os_atomic_{load,store}_ulint() will work as the
name suggests (i.e. be atomic) as long as:

1. it is used to access machine word sized members of structures
2. we are on x86/x86_64

However, the patch implements them as generic primitives that do
nothing to enforce those restrictions, and that's why their names are
misleading. This is where this discussion started, and it is a
design flaw. I don't see any arguments in this discussion that would
dispel those concerns. You also acknowledged them in one of the comments.

In contrast, other atomic primitives in the existing code live up to their
promise of being atomic, i.e. they do not rely on implicit requirements.
But they also have mutex-guarded fallback implementations for those
architectures that do not provide atomics.

I also agree that this discussion may be endless and time is precious.
So I think we should implement whatever we both agree does work. That
is: instead of implementing generic atomic primitives that are only
atomic under implicit requirements that are not enforceable at compile
time, it must either use separate mutex(es) to protect them, or use true
atomic primitives provided by the existing code if they are available on
the target architecture and fall back to mutex-guarded access. The
latter is how it is implemented in the rest of InnoDB.

Thanks.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> Hi Laurynas,
>
> You are right that that os_atomic_{load,store}_ulint() will work as the
> name suggests (i.e. be atomic) as long as:
>
> 1. it is used to access machine word sized members of structures
> 2. we are on x86/x86_64

Right.

> However, the patch implements them as a generic primitives that do
> nothing to enforce those restrictions, and that's why their names are
> misleading.

1. is enforced through "ulint" in the name and args. ulint is commented in univ.i as "unsigned long integer which should be equal to the word size of the machine".
2. is enforced by platform #ifdefs not providing any other implementation except one for x86/x86_64 with GCC or a GCC-like compiler.

Thus I provide generic primitives, whose current implementations will work as designed. However, item 1 above also seems to be missing "properly-aligned", and that's where the design is debatable. On one hand it is possible to implement misaligned access atomically with a LOCK-prefixed instruction, and document that the primitives may be used with args of any alignment. But a better alternative to me seems to be to accept that misaligned accesses are bugs and document/allow aligned accesses only. That is enforceable in debug builds only, so it's not ideally perfect, but IMHO acceptable.
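
To make that concrete, the kind of x86_64-only implementation being
discussed might look like this (my sketch under the assumptions above, not
the actual patch code):

  #if defined(__GNUC__) && defined(__x86_64__)
  UNIV_INLINE
  ulint
  os_atomic_load_ulint(const volatile ulint* ptr)
  {
    /* An aligned word-sized read is a single MOV and hence atomic on
    x86_64; the empty asm is only a compiler barrier. */
    ut_ad(((ulint) ptr) % sizeof(ulint) == 0);
    ulint val = *ptr;
    __asm__ __volatile__("" ::: "memory");
    return(val);
  }

  UNIV_INLINE
  void
  os_atomic_store_ulint(volatile ulint* ptr, ulint val)
  {
    ut_ad(((ulint) ptr) % sizeof(ulint) == 0);
    __asm__ __volatile__("" ::: "memory");
    *ptr = val;
  }
  #endif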

> This is where this discussion has started, and it is a
> design flaw. I don't see any arguments in this discussion that would
> dispel those concerns. You also acknowledged them in one of the comments.

Addressed above.

> I also agree that this discussion may be endless and time is precious.

But it also cannot end prematurely.

> So I think we should implement whatever we both agree does work.

I suggest that the above either is working already or requires some improvements re. alignment only.

> That
> is: instead of implementing generic atomic primitives that are only
> atomic under implicit requirements that are not enforceable at compile
> time, it must either use separate mutex(es) to protect them, or use true
> atomic primitives provided by the existing code if they are available on
> the target architecture

If you either show that my handling of 1. and 2. above is incorrect, or show that the alignment issue is major and insurmountable, then I'll implement load as an increment by zero that returns the value, and store as a dirty read + CAS in a loop using the existing primitives.

Thanks,
Laurynas

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Fri, 13 Sep 2013 07:40:10 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>> Hi Laurynas,
>>
>> You are right that that os_atomic_{load,store}_ulint() will work as the
>> name suggests (i.e. be atomic) as long as:
>>
>> 1. it is used to access machine word sized members of structures
>> 2. we are on x86/x86_64
>
>
> Right.
>
>
>> However, the patch implements them as a generic primitives that do
>> nothing to enforce those restrictions, and that's why their names are
>> misleading.
>
>
> 1. is enforced through "ulint" in the name and args. ulint is commented in univ.i as "unsigned long integer which should be equal to the word size of the machine".

It is not enforced, because nothing prevents me from passing a
misaligned address to those functions and expect them to be atomic as
the name implies.

For example, os_atomic_inc_ulint() is guaranteed to be atomic for any
arguments on any platform. But os_atomic_load_ulint() is not. That is
the problem.

> 2. is enforced by platform #ifdefs not providing any other implementation except one for x86/x86_64 with GCC or a GCC-like compiler.
>

That's correct. I only mentioned #2 for completeness.

> Thus I provide generic primitives, whose current implementations will work as designed. However the 1. above also seems to be missing "properly-aligned" and that's where the design is debatable. On one hand it is possible to implement misaligned access atomically by LOCK MOV, and document that the primitives may be used with args of any alignment. But a better alternative to me seems to accept that misaligned accesses are bugs and document/allow aligned accesses only. Even though that's enforceable in debug builds only, so that's not ideally perfect, but IMHO acceptable.
>

You don't.

>
> If you either show that how I address 1. and 2. above is incorrect, either show that the alignment issue is major and unsurmountable, then I'll implement load as inc by zero, return old value, and store as dirty read + CAS in a loop using the existing primitives.
>

Yes, please do.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> >> You are right that that os_atomic_{load,store}_ulint() will work as the
> >> name suggests (i.e. be atomic) as long as:
> >>
> >> 1. it is used to access machine word sized members of structures
> >> 2. we are on x86/x86_64
> >
> >
> > Right.
> >
> >
> >> However, the patch implements them as a generic primitives that do
> >> nothing to enforce those restrictions, and that's why their names are
> >> misleading.
> >
> >
> > 1. is enforced through "ulint" in the name and args. ulint is commented in
> univ.i as "unsigned long integer which should be equal to the word size of the
> machine".
>
> It is not enforced, because nothing prevents me from passing a
> misaligned address to those functions and expect them to be atomic as
> the name implies.

This is exactly what I discussed below.

> For example, os_atomic_inc_ulint() is guaranteed to be atomic for any
> arguments on any platform. But os_atomic_load_ulint() is not. That is
> the problem.

Right. os_atomic_load_ulint() has additional restrictions on its arg over os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It is a performance bug in any case to perform misaligned atomic ops, even with those ops that make it technically possible. I have added ut_ad()s to catch this. I can rename the os_atomic_ prefix to os_atomic_aligned_ too, although that looks like overkill to me.

> > 2. is enforced by platform #ifdefs not providing any other implementation
> except one for x86/x86_64 with GCC or a GCC-like compiler.
> >
>
> That's correct. I only mentioned #2 for completeness.

OK, but I am not sure what #2 completes then.

> > Thus I provide generic primitives, whose current implementations will work
> as designed. However the 1. above also seems to be missing "properly-aligned"
> and that's where the design is debatable. On one hand it is possible to
> implement misaligned access atomically by LOCK MOV, and document that the
> primitives may be used with args of any alignment. But a better alternative
> to me seems to accept that misaligned accesses are bugs and document/allow
> aligned accesses only. Even though that's enforceable in debug builds only,
> so that's not ideally perfect, but IMHO acceptable.
> >
>
> You don't.

Will you reply to the rest of that paragraph too please? I am acknowledging that alignment is an issue, so let's see how to resolve it.

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Wed, 11 Sep 2013 15:29:33 -0000, Laurynas Biveinis wrote:
>>>> - the following hunk simply removes reference to buf_pool->mutex. As I
>>>> understand, it should just replace buf_pool->mutex with
>> zip_free_mutex?
>>>> + page_zip_des_t zip; /*!< compressed page; state
>>>> + == BUF_BLOCK_ZIP_PAGE and zip.data
>>>> + == NULL means an active
>>>
>>> Hm, it looked to me that it's protected not with zip_free_mutex but with
>> zip_mutex in its page mutex capacity. I will check.
>>>
>>
>> There was a place in the code that asserted zip_free_mutex locked when
>> bpage->zip.data is modified. But I'm not sure if that is correct.
>
>
> I have checked, I believe it's indeed zip_mutex. Re. zip_free_mutex, you must be referring to this bit in buf_buddy_relocate():
>
> mutex_enter(&buf_pool->zip_free_mutex);
>
> if (buf_page_can_relocate(bpage)) {
> ...
> bpage->zip.data = (page_zip_t*) dst;
> mutex_exit(mutex);
>
> buf_buddy_stat_t* buddy_stat = &buf_pool->buddy_stat[i];
> buddy_stat->relocated++;
> buddy_stat->relocated_usec += ut_time_us(NULL) - usec;
> ...
>
> Here zip_free_mutex happens to protect buddy_stat and pushing it down to if clause would require an else clause to appear that locks the same mutex.
>

No, I was referring to buf_pool_contains_zip(). It traverses the buffer
pool and examines (but does not modify) block->page.zip.data for each
block. However, the patch changes the assertion in
buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
rather than zip_mutex. In fact, in one of the code paths calling
buf_pool_contains_zip() we assert that zip_mutex is NOT locked. Don't we
have a bug here?

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

On Fri, 13 Sep 2013 11:10:36 -0000, Laurynas Biveinis wrote:
>
>
> Right. os_atomic_load_ulint() has additional restrictions on its arg over os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It is a performance bug in any case to perform misaligned atomic ops even with those ops that make it technically possible. I have added ut_ad()s to catch this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too, although that one looks like an overkill to me.
>

The same restrictions would apply even if os_atomic_load_ulint() didn't
exist, right? I.e. the same restrictions would apply if we simply
accessed those variables without any helper functions?

Let me ask you a few simple questions and this time around I demand
"yes/no" answers.

- Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
not do what they promise to do?

- Do you agree that naming them os_ordered_load_ulint() /
os_ordered_store_ulint() would better reflect what they do?

- Do you agree that naming them that way also makes it obvious that
using them in most places is simply unnecessary (e.g. in
buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(),
buf_get_n_pending_read_ios(), etc.)?

>
>>> 2. is enforced by platform #ifdefs not providing any other implementation
>> except one for x86/x86_64 with GCC or a GCC-like compiler.
>>>
>>
>> That's correct. I only mentioned #2 for completeness.
>
>
> OK, but I am not sure what does the #2 complete then.
>
>
>>> Thus I provide generic primitives, whose current implementations will work
>> as designed. However the 1. above also seems to be missing "properly-aligned"
>> and that's where the design is debatable. On one hand it is possible to
>> implement misaligned access atomically by LOCK MOV, and document that the
>> primitives may be used with args of any alignment. But a better alternative
>> to me seems to accept that misaligned accesses are bugs and document/allow
>> aligned accesses only. Even though that's enforceable in debug builds only,
>> so that's not ideally perfect, but IMHO acceptable.
>>>
>>
>> You don't.
>
>
> Will you reply to the rest of that paragraph too please? I am acknowledging that alignment is an issue, so let's see how to resolve it.
>

I don't think enforcing requirements in debug builds only is acceptable.
It must be a compile-time assertion, not a run-time one.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

> > Right. os_atomic_load_ulint() has additional restrictions on its arg over
> os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It
> is a performance bug in any case to perform misaligned atomic ops even with
> those ops that make it technically possible. I have added ut_ad()s to catch
> this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too,
> although that one looks like an overkill to me.
> >
>
> The same restrictions would apply even if os_atomic_load_ulint() didn't
> exist, right? I.e. the same restrictions would apply if we simply
> accessed those variables without any helper functions?

What would be the desired access semantics in that case? If "anything goes", then no restrictions would apply.

> Let me ask you a few simple questions and this time around I demand
> "yes/no" answers.
>
> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> not do what they promise to do?

Yes.

> - Do you agree that naming them os_ordered_load_ulint() /
> os_ordered_store_ulint() would better reflect what they do?

No.

> - Do you agree that naming them that way also makes it obvious that
> using them in most places is simply unnecessary (e.g. in
> buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(),
> buf_get_n_pending_read_ios(), etc.)?

No.

The first answer would be No if not the alignment issue.

> >>> Thus I provide generic primitives, whose current implementations will work
> >> as designed. However the 1. above also seems to be missing "properly-
> aligned"
> >> and that's where the design is debatable. On one hand it is possible to
> >> implement misaligned access atomically by LOCK MOV, and document that the
> >> primitives may be used with args of any alignment. But a better
> alternative
> >> to me seems to accept that misaligned accesses are bugs and document/allow
> >> aligned accesses only. Even though that's enforceable in debug builds
> only,
> >> so that's not ideally perfect, but IMHO acceptable.
> >>>
> >>
> >> You don't.
> >
> >
> > Will you reply to the rest of that paragraph too please? I am acknowledging
> that alignment is an issue, so let's see how to resolve it.
> >
>
> I don't think enforcing requirements in debug builds only is acceptable.
> It must be a compile-time assertion, not a run-time one.

And as we both know, this is not enforceable at compile time. I think that requesting extra protections on top of the already provided ones, when the only way to get a misaligned ulint is to ask for one explicitly, is overkill. But that's my hand-waving against your hand-waving. Thus let's say that yes, "the alignment issue is major and insurmountable", and I'll proceed to do what was offered previously: implement load as an atomic add of zero that returns the value, and store as a dirty read and CAS until success. The reason I didn't like these implementations is that they are pessimized. But that's OK.
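
For the record, that fallback would be built purely from the existing
primitives, roughly like this (a sketch, assuming the existing
os_atomic_increment_ulint()/os_compare_and_swap_ulint() macros; not
tested):

  UNIV_INLINE
  ulint
  os_atomic_load_ulint(ulint* ptr)
  {
    /* An atomic add of zero returns the current value atomically. */
    return(os_atomic_increment_ulint(ptr, 0));
  }

  UNIV_INLINE
  void
  os_atomic_store_ulint(ulint* ptr, ulint new_val)
  {
    /* Dirty-read the old value, then CAS until the swap succeeds. */
    ulint old_val;
    do {
      old_val = *ptr;
    } while (!os_compare_and_swap_ulint(ptr, old_val, new_val));
  }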

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

On Fri, 13 Sep 2013 12:54:44 -0000, Laurynas Biveinis wrote:
>>> Right. os_atomic_load_ulint() has additional restrictions on its arg over
>> os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It
>> is a performance bug in any case to perform misaligned atomic ops even with
>> those ops that make it technically possible. I have added ut_ad()s to catch
>> this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too,
>> although that one looks like an overkill to me.
>>>
>>
>> The same restrictions would apply even if os_atomic_load_ulint() didn't
>> exist, right? I.e. the same restrictions would apply if we simply
>> accessed those variables without any helper functions?
>
>
> What would be the desired access semantics in that case? If "anything goes", then no restrictions would apply.
>
>
>> Let me ask you a few simple questions and this time around I demand
>> "yes/no" answers.
>>
>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>> not do what they promise to do?
>
>
> Yes.

OK, so we agree that naming is unfortunate.

>
>
>> - Do you agree that naming them os_ordered_load_ulint() /
>> os_ordered_store_ulint() would better reflect what they do?
>
>
> No.

What would be a better naming then?

>
>
>> - Do you agree that naming them that way also makes it obvious that
>> using them in most places is simply unnecessary (e.g. in
>> buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(),
>> buf_get_n_pending_read_ios(), etc.)?
>
>
> No.

OK, then 2 followup questions:

1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
example? Here's an example of a valid answer: "Because that will result
in incorrect values being used in case ...". And some examples of
invalid answers: "non-cache-coherent architectures, visibility, memory
model, sunspots, crop circles, global warming, ...".

2. Why are only 2 out of 9 values loaded with
os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
ones need it?

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> >> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >> not do what they promise to do?
> >
> >
> > Yes.
>
> OK, so we agree that naming is unfortunate.

Vanishingly slightly so, due to the alignment issue, which I believe is mostly theoretical, but nevertheless am ready to address now.

> >> - Do you agree that naming them os_ordered_load_ulint() /
> >> os_ordered_store_ulint() would better reflect what they do?
> >
> >
> > No.
>
> What would be a better naming then?

os_atomic_load_aligned_ulint().

> OK, then 2 followup questions:
>
> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> example? Here's an example of a valid answer: "Because that will result
> in incorrect values being used in case ...". And some examples of
> invalid answers: "non-cache-coherent architectures, visibility, memory
> model, sunspots, crop circles, global warming, ...".

We have gone through this with a disassembly example already, haven't we? We need os_atomic_load_ulint() because we don't want a dirty read. We may well decide that a dirty read there is fine and then replace it. But that's orthogonal to what this primitive is and why it exists.

Are you also objecting to mutex protection here? If not, why? Note that the three n_flush values here are completely independent.

  mutex_enter(&buf_pool->flush_state_mutex);

  pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
  pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
  pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];

  mutex_exit(&buf_pool->flush_state_mutex);

And please don't group some of the valid answers with looney stuff.

> 2. Why do only 2 out of 9 values are being loaded with
> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
> ones need them?

Upstream reads all 9 dirtily. I replaced two of them to be clean instead. Maybe I need to replace all 9. Maybe 0. But that's again orthogonal.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> No, I was referring to buf_pool_contains_zip(). It traverses the buffer
> pool and examines (but not modifies) block->page.zip.data for each
> block. However, the patch changes the assertion in
> buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
> rather than zip_mutex.

Thanks for the pointer, I have reviewed buf_pool_contains_zip() further. Here, as you say, we iterate through the buffer pool's uncompressed pages trying to find the uncompressed page of a given compressed page, in order to assert that no uncompressed page points to it; the compressed page can either have just been taken from the buddy allocator or have just been removed from the buffer pool. In both cases it's an invalid pointer value and it seems to me that we could iterate through the buffer pool without any locking at all. I have tried to think of possible race conditions in the case of no locking, i.e. can the pointer become valid again somehow, by some uncompressed page starting to point to this zip page. And the transition of the given zip page pointer unallocated -> allocated is protected by zip_free_mutex. Thus, as long as all its callers hold zip_free_mutex and it's only called with invalid pointer values to assert that FALSE is returned, buf_pool_contains_zip() itself does not need any locking. If my reasoning is correct, I can remove the assert from this function. But maybe this should be documented better somehow?

> In fact, in one of the code paths calling
> buf_pool_contains_zip() we assert that zip_mutex is NOT locked. Don't we
> have a bug here?

buf_buddy_free_low(), right? This function is called for a compressed page that is already removed from the buffer pool, that is, no pointer to it should be live in any other thread (nor in the buffer pool, as is asserted by !buf_pool_contains_zip()). Thus it's OK not to hold zip_mutex.

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Fri, 13 Sep 2013 14:21:58 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>> not do what they promise to do?
>>>
>>>
>>> Yes.
>>
>> OK, so we agree that naming is unfortunate.
>
>
> Vanishingly slightly so, due to the alignment issue, which I believe is mostly theoretical, but nevertheless am ready to address now.
>

It doesn't do anything to enforce atomicity, does it? I.e. the following
implementation would be equally "atomic":

ulint
os_atomic_load_ulint(ulint *ptr)
{
 printf("Hello, world!\n");
 return(*ptr);
}

>
>>>> - Do you agree that naming them os_ordered_load_ulint() /
>>>> os_ordered_store_ulint() would better reflect what they do?
>>>
>>>
>>> No.
>>
>> What would be a better naming then?
>
>
> os_atomic_load_aligned_ulint().
>

No, it doesn't do anything to enforce atomicity. That is a caller's
responsibility.

>
>> OK, then 2 followup questions:
>>
>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
>> example? Here's an example of a valid answer: "Because that will result
>> in incorrect values being used in case ...". And some examples of
>> invalid answers: "non-cache-coherent architectures, visibility, memory
>> model, sunspots, crop circles, global warming, ...".
>
>
> We have gone through this with a disassembly example already, haven't we? We need os_atomic_load_ulint() because we don't want a dirty read. We may well decide that a dirty read there is fine and then replace it. But that's orthogonal to what is this primitive and why.
>

We need an atomic read rather than os_atomic_load_ulint() for the above
reasons. And it will be atomic without using any helper functions. Since
I see no answer in the form "Because that will result in incorrect
values being used in case ...", I assume you don't have an answer to
that question.

> Are you also objecting to mutex protection here? If not, why? Note that the three n_flush values here are completely independent.
>
> mutex_enter(&buf_pool->flush_state_mutex);
>
> pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
> pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
> pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
>
> mutex_exit(&buf_pool->flush_state_mutex);
>

I'm not objecting to mutex protection in that code. Why would I?

> And please don't group some of the valid answers with looney stuff.
>
>
>> 2. Why do only 2 out of 9 values are being loaded with
>> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
>> ones need them?
>
>
> Upstream reads all 9 dirtily. I replaced two of them to be clean instead. Maybe I need to replace all 9. Maybe 0. But that's again orthogonal.
>

All 9 reads are atomic. But 7 of them don't use compiler barriers
because they don't need them. Neither do the remaining 2, but you are
being quite creative in order to avoid accepting this simple fact.

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

On Sun, 15 Sep 2013 07:47:59 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>> No, I was referring to buf_pool_contains_zip(). It traverses the buffer
>> pool and examines (but not modifies) block->page.zip.data for each
>> block. However, the patch changes the assertion in
>> buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
>> rather than zip_mutex.
>
>
> Thanks for the pointer, I have reviewed buf_pool_contains_zip() further. Here, as you say, we iterate through the buffer pool uncompressed pages trying to find the uncompressed page of a given compressed page, in order to assert that no uncompressed page points to it, which can be either just taken from the buddy allocator, either be just removed from the buffer pool. In both cases it's an invalid pointer value and it seems to me that we could iterate through the buffer pool without any locking at all. I have tried to think of possible race conditions in the case of no locking, i.e. can the pointer become valid again somehow, by some uncompressed page starting pointing to this zip page. And the transition of the given zip page pointer unallocated -> allocated is protected by zip_free_mutex. Thus, as long as all its callers hold zip_free_mutex and it's only called with invalid pointer values to assert that FALSE is returned, buf_pool_contains_zip() itself does not need any locking. If my reasoning is correct, I can remove the assert from this function. But maybe this should be documented better somehow?
>

Looks correct to me. Let's remove the zip_free_mutex assertion then?

>
>> In fact, in one of the code paths calling
>> buf_pool_contains_zip() we assert that zip_mutex is NOT locked. Don't we
>> have a bug here?
>
>
> buf_buddy_free_low(), right? This function is called for a compressed page that is already removed from the buffer pool, that is, no pointer to it should be live in any other thread (and in the buffer pool, as it's asserted by !buf_pool_contains_zip()). Thus it's ok not hold zip_mutex.
>

OK, thanks for clarification.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> > I have reviewed buf_pool_contains_zip() further.
...
> Looks correct to me. Let's remove the zip_free_mutex assertion then?

Yes.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal
Download full text (4.7 KiB)

Alexey -

> >>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >>>> not do what they promise to do?
> >>>
> >>>
> >>> Yes.
> >>
> >> OK, so we agree that naming is unfortunate.
> >
> >
> > Vanishingly slightly so, due to the alignment issue, which I believe is
> mostly theoretical, but nevertheless am ready to address now.
> >
>
> It doesn't do anything to enforce atomicity, does it? I.e. the following
> implementation would be equally "atomic":
>
> ulint
> os_atomic_load_ulint(ulint *ptr)
> {
> printf("Hello, world!\n");
> return(*ptr);
> }

Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64 as long as the pointer is aligned.

> >>>> - Do you agree that naming them os_ordered_load_ulint() /
> >>>> os_ordered_store_ulint() would better reflect what they do?
> >>>
> >>>
> >>> No.
> >>
> >> What would be a better naming then?
> >
> >
> > os_atomic_load_aligned_ulint().
> >
>
> No, it doesn't do anything to enforce atomicity. That is a caller's
> responsibility.

As in, don't pass misaligned values? In that case, yes, it is a caller's responsibility not to pass misaligned values. But where would InnoDB get a misaligned pointer to a ulint that we'd wish to access atomically? Hence enforcing alignment in debug builds only seemed like a reasonable compromise, but OK, that's debatable.
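
To be clear, the debug-only enforcement I have in mind is nothing heavier than the following (illustrative; ptr stands for whatever pointer the primitive receives):

        /* Debug builds only: assert the pointer is naturally aligned
        for an atomic ulint access; compiled out in release builds. */
        ut_ad(!ut_align_offset(ptr, sizeof(ulint)));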

> >> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> >> example? Here's an example of a valid answer: "Because that will result
> >> in incorrect values being used in case ...". And some examples of
> >> invalid answers: "non-cache-coherent architectures, visibility, memory
> >> model, sunspots, crop circles, global warming, ...".
> >
> >
> > We have gone through this with a disassembly example already, haven't we?
> We need os_atomic_load_ulint() because we don't want a dirty read. We may
> well decide that a dirty read there is fine and then replace it. But that's
> orthogonal to what is this primitive and why.
> >
>
> We need an atomic read rather than os_atomic_load_ulint() for the above
> reasons. An it will be atomic without using any helper functions.

OK, so is the problem that I wanted to introduce the primitives for such access, that would also document how the variable is accessed, and that they don't have to do much besides a compiler barrier on x86_64?

> Since
> I see no answer in the form "Because that will result in incorrect
> values being used in case ...", I assume you don't have an answer to
> that question.

I know that they are not resulting in incorrect values currently, and that the worst that can happen on x86_64 with most possible future code changes is that the value loads could be moved earlier, resulting in more out-of-date values being used. This and the fact that accessing the variable through the primitive serves as self-documentation seem good enough reasons to me.
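
As a purely hypothetical illustration of "loads moved earlier" (this is not code from the tree; the function is made up): without a compiler barrier, nothing stops the compiler from hoisting a load out of a loop and reusing the cached value:

void
wait_for_lru_flush_to_finish(buf_pool_t* buf_pool)
{
        /* With a plain load the compiler may legally read n_flush once
        before the loop and then spin on the cached value; the barrier
        inside the primitive forces a fresh load on every iteration. */
        while (buf_pool->n_flush[BUF_FLUSH_LRU] > 0) {
                /* busy-wait */
        }
}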

> > Are you also objecting to mutex protection here? If not, why? Note that the
> three n_flush values here are completely independent.
> >
> > mutex_enter(&buf_pool->flush_state_mutex);
> >
> > pending_io += buf_pool->n_flush[BUF_FLUSH_...

Read more...

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal
Download full text (5.6 KiB)

On Mon, 16 Sep 2013 09:05:39 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>>>> not do what they promise to do?
>>>>>
>>>>>
>>>>> Yes.
>>>>
>>>> OK, so we agree that naming is unfortunate.
>>>
>>>
>>> Vanishingly slightly so, due to the alignment issue, which I believe is
>> mostly theoretical, but nevertheless am ready to address now.
>>>
>>
>> It doesn't do anything to enforce atomicity, does it? I.e. the following
>> implementation would be equally "atomic":
>>
>> ulint
>> os_atomic_load_ulint(ulint *ptr)
>> {
>> printf("Hello, world!\n");
>> return(*ptr);
>> }
>
>
> Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64 as long as pointer is aligned.
>

So... we don't need it after all?

>
>>>>>> - Do you agree that naming them os_ordered_load_ulint() /
>>>>>> os_ordered_store_ulint() would better reflect what they do?
>>>>>
>>>>>
>>>>> No.
>>>>
>>>> What would be a better naming then?
>>>
>>>
>>> os_atomic_load_aligned_ulint().
>>>
>>
>> No, it doesn't do anything to enforce atomicity. That is a caller's
>> responsibility.
>
>
> As in, don't pass misaligned values? In that case, yes, it is a caller's responsibility not to pass misaligned values. But where would InnoDB get a misaligned pointer to ulint from that we'd wish to access atomically? Hence enforcing alignment on debug build only seemed like a reasonable compromise, but OK, that's debatable.
>
>
>>>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
>>>> example? Here's an example of a valid answer: "Because that will result
>>>> in incorrect values being used in case ...". And some examples of
>>>> invalid answers: "non-cache-coherent architectures, visibility, memory
>>>> model, sunspots, crop circles, global warming, ...".
>>>
>>>
>>> We have gone through this with a disassembly example already, haven't we?
>> We need os_atomic_load_ulint() because we don't want a dirty read. We may
>> well decide that a dirty read there is fine and then replace it. But that's
>> orthogonal to what is this primitive and why.
>>>
>>
>> We need an atomic read rather than os_atomic_load_ulint() for the above
>> reasons. An it will be atomic without using any helper functions.
>
>
> OK, so is the problem that I wanted to introduce the primitives for such access, that would also document how is the variable accessed, and that they don't have to do much besides a compiler barrier on x86_64?
>

Yes, the problem is that you introduce primitives that basically do
nothing, and then use those primitives unnecessarily and inconsistently.
That in turn blows up the patch size, increases the maintenance burden,
and opens the door to wrong assumptions when reading the existing code
and implementing new code.

>
>> Since
>> I see no answer in the form "Because that will result in incorrect
>> values being used in case ...", I assume you don't have an answer to
>> that question.
>
>
> I know that they are not resulting in incorrect values currently, and that the worst can happen in x86_64 with the most of possible future code changes is that the va...

Read more...

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal
Download full text (5.1 KiB)

More comments on the patch:

  - typo (double “get”) in the updated buf_LRU_free_page() comments:
   “function returns false, the buf_page_get_get_mutex() might be temporarily”

  - what’s the reason for changing buf_page_t::in_LRU_list to be present
    in release builds? Unless I’m missing something in the current code,
    it is only assigned, but never read in release builds?

  - spurious blank line changes in buf_buddy_relocate(),
    buf_buddy_free_low(), buf_pool_watch_set(), buf_page_get_gen(),
    buf_page_init_for_read(), buf_pool_validate_instance(),
    buf_flush_check_neighbor(), buf_flush_LRU_list_batch(), buf_LRU_drop_page_hash_for_tablespace()
    and innodb_buffer_pool_evict_uncompressed().

  - the following change in buf_block_try_discard_uncompressed() does
    not release block_mutex if buf_LRU_free_page() returns false.

  bpage = buf_page_hash_get(buf_pool, space, offset);

  if (bpage) {
- buf_LRU_free_page(bpage, false);
+
+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
+
+ mutex_enter(block_mutex);
+
+ if (buf_LRU_free_page(bpage, false)) {
+
+ mutex_exit(block_mutex);
+ return;
+ }
  }

  - do you really need this change?

@@ -2114,8 +2098,8 @@ buf_page_get_zip(
   break;
  case BUF_BLOCK_ZIP_PAGE:
  case BUF_BLOCK_ZIP_DIRTY:
+ buf_enter_zip_mutex_for_page(bpage);
   block_mutex = &buf_pool->zip_mutex;
- mutex_enter(block_mutex);
   bpage->buf_fix_count++;
   goto got_block;
  case BUF_BLOCK_FILE_PAGE:

  - in the following change:

@@ -2721,13 +2707,14 @@ buf_page_get_gen(
   }

   bpage = &block->page;
+ ut_ad(buf_own_zip_mutex_for_page(bpage));

   if (bpage->buf_fix_count
       || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
    /* This condition often occurs when the buffer
    is not buffer-fixed, but I/O-fixed by
    buf_page_init_for_read(). */
- mutex_exit(block_mutex);
+ mutex_exit(&buf_pool->zip_mutex);
 wait_until_unfixed:
    /* The block is buffer-fixed or I/O-fixed.
    Try again later. */

     is there a reason for replacing block_mutex with zip_mutex? you
     could just assert that block_mutex is zip_mutex next to, or even
     instead of the buf_own_zip_mutex_for_page() call.

  - same comments for this change:

@@ -2737,11 +2724,11 @@ buf_page_get_gen(
   }

   /* Allocate an uncompressed page. */
- mutex_exit(block_mutex);
+ mutex_exit(&buf_pool->zip_mutex);
   block = buf_LRU_get_free_block(buf_pool);
   ut_a(block);

- buf_pool_mutex_enter(buf_pool);
+ mutex_enter(&buf_pool->LRU_list_mutex);

   /* As we have released the page_hash lock and the
   block_mutex to allocate an uncompressed page it is

  - in buf_mark_space_corrupt() I see LRU_list_mutex, hash_lock and
    buf_page_get_mutex() being acquired, but only LRU_list_mutex and
    buf_page_get_mutex() being released, i.e. it returns with hash_lock
    acquired?

  - it looks like with the changes in buf_page_io_complete() for io_type
    == BUF_IO_WRITE we don’t set io_fix to BUF_IO_NONE, though we did
    before the changes?

  - more spurious changes:

@@ -475,8 +473,8 @@ buf_flush_insert_sorted_into_flush_list(
  if (prev_b == NULL) {
   UT_LIST_ADD_FIRST(list, buf_pool->flush_list, &block->page);
...

Read more...

review: Needs Fixing
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

Replying to this bit separately as it may need further discussion while I am addressing the rest of comments.

> - what’s the reason for changing buf_page_t::in_LRU_list to be present
> in release builds? Unless I’m missing something in the current code,
> it is only assigned, but never read in release builds?

This is another "automerge" from 5.5 and indeed serves no release
build purpose in the current 5.6 code. But it points out a
non-trivial thing. In 5.5 it is used as follows:
-) for checking whether a given page is still on the LRU list if both
   block and LRU mutexes were temporarily released:
   buf_page_get_zip(), buf_LRU_free_block() (both are different code
   from 5.6).
-) for iterating through the LRU list without holding the LRU list
   mutex at all: buf_LRU_free_from_common_LRU_list(),
   buf_LRU_free_block(),
   buf_flush_LRU_recommendation()/buf_flush_ready_for_replace(). I
   think this is unsafe and a bug in 5.5 due to page relocations
   potentially resulting in wild pointers, even if it does wonders for
   the LRU list contention. 5.6 holds the LRU list mutex in the corresponding code.
-) redundant checks, ie. on LRU list iteration where the mutex is not
   released: buf_LRU_insert_zip_clean(),
   buf_LRU_free_from_unzip_LRU_list().

Thus I think 1) the in_LRU_list changes should be reverted now. 2) 5.5
might need fixing. 3) The LRU list mutex is hot in 5.6. If there is
a safe way not to hold it in 5.6 (for example, for
BUF_BLOCK_FILE_PAGE, but it's hard to tell the page type without
dereferencing the page pointer - maybe by comparing the page address
against the buffer pool chunk address range?), then it's worth looking
into it.
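
For the record, the "compare against the chunk address range" idea would look roughly like this (the helper name is made up; buf_chunk_t layout as in the 5.6 sources):

static ibool
buf_pointer_is_block(const buf_pool_t* buf_pool, const void* ptr)
{
        const buf_chunk_t*      chunk = buf_pool->chunks;
        ulint                   n;

        /* Only compares addresses, so the possibly stale page pointer
        is never dereferenced. */
        for (n = buf_pool->n_chunks; n--; chunk++) {
                if (ptr >= (const void*) chunk->blocks
                    && ptr < (const void*) (chunk->blocks + chunk->size)) {
                        return(TRUE);
                }
        }

        return(FALSE);
}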

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal
Download full text (5.3 KiB)

Alexey -

> - typo (double “get”) in the updated buf_LRU_free_page() comments:
> “function returns false, the buf_page_get_get_mutex() might be temporarily”

Fixed.

> - what’s the reason for changing buf_page_t::in_LRU_list to be present
> in release builds? Unless I’m missing something in the current code,
> it is only assigned, but never read in release builds?

in_LRU_list changes have been reverted with the exception of an extra
assert in buf_page_set_sticky(). I also found that in_unzip_LRU_list
was converted to a release build flag but no uses were converted.
Reverted that too.

> - spurious blank line changes in buf_buddy_relocate(),
> buf_buddy_free_low(), buf_pool_watch_set(),

Fixed.

> buf_page_get_gen(),

I didn't find this one. Will re-check before the final push.

> buf_page_init_for_read(), buf_pool_validate_instance(),
> buf_flush_check_neighbor(), buf_flush_LRU_list_batch(),
> buf_LRU_drop_page_hash_for_tablespace()
> and innodb_buffer_pool_evict_uncompressed().

Fixed. Also removed a diagnostic printf from
buf_pool_validate_instance().

> - the following change in buf_block_try_discard_uncompressed() does
> not release block_mutex if buf_LRU_free_page() returns false.

Fixed.

> - do you really need this change?
>
> @@ -2114,8 +2098,8 @@ buf_page_get_zip(
> break;
> case BUF_BLOCK_ZIP_PAGE:
> case BUF_BLOCK_ZIP_DIRTY:
> + buf_enter_zip_mutex_for_page(bpage);
> block_mutex = &buf_pool->zip_mutex;
> - mutex_enter(block_mutex);
> bpage->buf_fix_count++;
> goto got_block;
> case BUF_BLOCK_FILE_PAGE:

No, I don't. A debugging leftover, reverted.

> - in the following change:
>
> @@ -2721,13 +2707,14 @@ buf_page_get_gen(
> }
>
> bpage = &block->page;
> + ut_ad(buf_own_zip_mutex_for_page(bpage));
>
> if (bpage->buf_fix_count
> || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
> /* This condition often occurs when the buffer
> is not buffer-fixed, but I/O-fixed by
> buf_page_init_for_read(). */
> - mutex_exit(block_mutex);
> + mutex_exit(&buf_pool->zip_mutex);
> wait_until_unfixed:
> /* The block is buffer-fixed or I/O-fixed.
> Try again later. */
>
> is there a reason for replacing block_mutex with zip_mutex? you
> could just assert that block_mutex is zip_mutex next to, or even
> instead of the buf_own_zip_mutex_for_page() call.

Yes. Replaced buf_own_zip_mutex_for_page() with ut_ad(block_mutex ==
&buf_pool->zip_mutex), which also happens to be symmetric with "!="
assert above for BUF_BLOCK_FILE_PAGE. Reverted the mutex_exit change
too.

> - same comments for this change:
>
> @@ -2737,11 +2724,11 @@ buf_page_get_gen(
> }
>
> /* Allocate an uncompressed page. */
> - mutex_exit(block_mutex);
> + mutex_exit(&buf_pool->zip_mutex);...

Read more...

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal
Download full text (4.4 KiB)

buf_enter_zip_mutex_for_page(buf_page_t *) is a bad idea, unless a buffer pool pointer is passed too, which then makes it overkill, because it dereferences an unprotected bpage pointer, resulting in the crash below. Will remove it and go back to mutex_enter(&buf_pool->zip_mutex).
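
The problem boils down to this (paraphrasing the removed helper from the backtrace below, not verbatim):

void
buf_enter_zip_mutex_for_page(buf_page_t* bpage)
{
        /* Dereferences bpage without holding any mutex: the descriptor
        may already have been freed or relocated by the time we get
        here, which is what the assertion in buf_pool_from_bpage()
        catches in the backtrace below. */
        buf_pool_t*     buf_pool = buf_pool_from_bpage(bpage);

        mutex_enter(&buf_pool->zip_mutex);
}

The caller in buf_page_get_gen() already knows the buf_pool instance, so a plain mutex_enter(&buf_pool->zip_mutex) there avoids the unprotected dereference entirely.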

http://jenkins.percona.com/job/percona-server-5.6-param/273/BUILD_TYPE=debug,Host=debian-wheezy-x64/testReport/junit/%28root%29/innodb/innodb_wl5522_zip/

Thread 1 (Thread 0x7f3acebfe700 (LWP 321)):
#0 __pthread_kill (threadid=<optimized out>, signo=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:63
#1 0x0000000000ae2f45 in my_write_core (sig=6) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/mysys/stacktrace.c:422
#2 0x000000000075eaa9 in handle_fatal_signal (sig=6) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/sql/signal_handler.cc:251
#3 <signal handler called>
#4 0x00007f3ae835c475 in *__GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5 0x00007f3ae835f6f0 in *__GI_abort () at abort.c:92
#6 0x0000000000d9b44c in buf_pool_from_bpage (bpage=0x33912c0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/buf0buf.ic:88
#7 0x0000000000d9cb3c in buf_enter_zip_mutex_for_page (bpage=0x33912c0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/buf0buf.ic:1412
#8 0x0000000000da2e5b in buf_page_get_gen (space=431, zip_size=8192, offset=9, rw_latch=1, guess=0x0, mode=10, file=0x1109668 "/mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/btr/btr0pcur.cc", line=438, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/buf/buf0buf.cc:2743
#9 0x0000000000d8c45a in btr_block_get_func (space=431, zip_size=8192, page_no=9, mode=1, file=0x1109668 "/mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/btr/btr0pcur.cc", line=438, index=0x3393d18, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/btr0btr.ic:60
#10 0x0000000000d8df82 in btr_pcur_move_to_next_page (cursor=0x7f3acebfd420, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/btr/btr0pcur.cc:438
#11 0x0000000000df3a47 in btr_pcur_move_to_next_user_rec (cursor=0x7f3acebfd420, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/btr0pcur.ic:323
#12 0x0000000000df5a6e in dict_stats_analyze_index_level (index=0x3393d18, level=0, n_diff=0x3496418, total_recs=0x7f3acebfdac0, total_pages=0x7f3acebfdab8, n_diff_boundaries=0x0, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0s...

Read more...

review: Needs Fixing
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal
Download full text (6.9 KiB)

Alexey -

> >>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >>>>>> not do what they promise to do?
> >>>>>
> >>>>>
> >>>>> Yes.
> >>>>
> >>>> OK, so we agree that naming is unfortunate.
> >>>
> >>>
> >>> Vanishingly slightly so, due to the alignment issue, which I believe is
> >> mostly theoretical, but nevertheless am ready to address now.
> >>>
> >>
> >> It doesn't do anything to enforce atomicity, does it? I.e. the following
> >> implementation would be equally "atomic":
> >>
> >> ulint
> >> os_atomic_load_ulint(ulint *ptr)
> >> {
> >> printf("Hello, world!\n");
> >> return(*ptr);
> >> }
> >
> >
> > Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64
> as long as pointer is aligned.
> >
>
> So... we don't need it after all?

But we don't want to ignore visibility and ordering.

> >>>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> >>>> example? Here's an example of a valid answer: "Because that will result
> >>>> in incorrect values being used in case ...". And some examples of
> >>>> invalid answers: "non-cache-coherent architectures, visibility, memory
> >>>> model, sunspots, crop circles, global warming, ...".
> >>>
> >>>
> >>> We have gone through this with a disassembly example already, haven't we?
> >> We need os_atomic_load_ulint() because we don't want a dirty read. We may
> >> well decide that a dirty read there is fine and then replace it. But
> that's
> >> orthogonal to what is this primitive and why.
> >>>
> >>
> >> We need an atomic read rather than os_atomic_load_ulint() for the above
> >> reasons. An it will be atomic without using any helper functions.
> >
> >
> > OK, so is the problem that I wanted to introduce the primitives for such
> access, that would also document how is the variable accessed, and that they
> don't have to do much besides a compiler barrier on x86_64?
> >
>
> Yes, the problem is that you introduce primitives that basically do
> nothing, and then use those primitives unnecessarily and inconsistently.
> Which in turn leads to blowing up the patch size, increased maintenance
> burden, and opens the door for wrong assumptions made when reading the
> existing code and implementing new code.

OK, now I see your concerns, and understand a big part of them but not all of them. What wrong assumptions are encouraged by the primitives?

> >> Since
> >> I see no answer in the form "Because that will result in incorrect
> >> values being used in case ...", I assume you don't have an answer to
> >> that question.
> >
> >
> > I know that they are not resulting in incorrect values currently, and that
> the worst can happen in x86_64 with the most of possible future code changes
> is that the value loads could be moved earlier, resulting in more out-of-date
> values used. This and the fact that accessing the variable through the
> primitive serves as self-documentation seem good enough reasons to me.
> >
>
> Whether they are more "out-of-date" or less "out-of-date" depends on the
> definition of "date". By defining "date" as the "point in time when the
> function is called", "more out-of-date" can be easily...

Read more...

Revision history for this message
Alexey Kopytov (akopytov) wrote : Posted in a previous version of this proposal

Hi Laurynas,

On Tue, 17 Sep 2013 12:41:18 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>>>>>> not do what they promise to do?
>>>>>>>
>>>>>>>
>>>>>>> Yes.
>>>>>>
>>>>>> OK, so we agree that naming is unfortunate.
>>>>>
>>>>>
>>>>> Vanishingly slightly so, due to the alignment issue, which I believe is
>>>> mostly theoretical, but nevertheless am ready to address now.
>>>>>
>>>>
>>>> It doesn't do anything to enforce atomicity, does it? I.e. the following
>>>> implementation would be equally "atomic":
>>>>
>>>> ulint
>>>> os_atomic_load_ulint(ulint *ptr)
>>>> {
>>>> printf("Hello, world!\n");
>>>> return(*ptr);
>>>> }
>>>
>>>
>>> Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64
>> as long as pointer is aligned.
>>>
>>
>> So... we don't need it after all?
>
>
> But we don't want to ignore visibility and ordering.
>

Yeah, you forgot non-cache-coherent architectures.

This discussion has been running in circles for almost a week now, and I
have a feeling you deliberately keep it this way. Since I have not been
presented any technical arguments for keeping that code, and have better
things to do, I'm going to wrap up the democrazy and stop it forcefully.

I will not approve this MP with "atomic" primitives present in the code.
The discussion is over.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Alexey -

> This discussion has been running in circles for almost a week now, and I
> have a feeling you deliberately keep it this way. Since I have not been
> presented any technical arguments for keeping that code, and have better
> things to do, I'm going to wrap up the democrazy and stop it forcefully.
>
> I will not approve this MP with "atomic" primitives present in the code.
> The discussion is over.

I did not do anything to deserve this kind of treatment. Your feeling that I am keeping it this way deliberately is plain wrong (what do I have to gain? Minus one week of my copious time? A smug feeling of being right?). I have addressed every single one of your comments without hesitation, coming from the assumption that you are right, in this MP and in tens of MPs before.

I have tried my best to explain why the code is correct. You seem to disagree, but I have trouble understanding why. I am well within my rights to ask you to explain further, and the burden is on you to show why the code is wrong. Hence your refusal to continue the review is stalling it right now. Please continue the technical discussion.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Repushed. Changes from the 2nd push. Not a resubmission yet.

    - Removed ut_ad(mutex_own(&buf_pool->zip_free_mutex)) from
      buf_pool_contains_zip().
    - Fixed comment typos.
    - Reverted in_LRU_list and in_unzip_LRU_list changes, with the
      exception of an extra assert in buf_page_set_sticky().
    - Reverted spurious whitespace changes.
    - Removed spurious diagnostic printf from
      buf_pool_validate_instance().
    - Fixed locking in buf_block_try_discard_uncompressed().
    - Reverted redundant locking changes in buf_page_get_zip() and
      buf_page_get_gen().
    - Removed buf_enter_zip_mutex_for_page().
    - Removed os_atomic_load_ulint() uses from
      buf_get_total_stat(), buf_print_instance(),
      buf_stat_get_pool_info(), buf_LRU_evict_from_unzip_LRU(), the
      last instance in buf_read_recv_pages(), and
      srv_mon_process_existing_counter().

Revision history for this message
Alexey Kopytov (akopytov) :
review: Approve

Preview Diff

1=== modified file 'Percona-Server/storage/innobase/btr/btr0cur.cc'
2--- Percona-Server/storage/innobase/btr/btr0cur.cc 2013-09-06 13:40:39 +0000
3+++ Percona-Server/storage/innobase/btr/btr0cur.cc 2013-09-20 05:29:11 +0000
4@@ -4503,12 +4503,14 @@
5 buf_pool_t* buf_pool = buf_pool_from_block(block);
6 ulint space = buf_block_get_space(block);
7 ulint page_no = buf_block_get_page_no(block);
8+ bool freed = false;
9
10 ut_ad(mtr_memo_contains(mtr, block, MTR_MEMO_PAGE_X_FIX));
11
12 mtr_commit(mtr);
13
14- buf_pool_mutex_enter(buf_pool);
15+ mutex_enter(&buf_pool->LRU_list_mutex);
16+ mutex_enter(&block->mutex);
17
18 /* Only free the block if it is still allocated to
19 the same file page. */
20@@ -4518,16 +4520,26 @@
21 && buf_block_get_space(block) == space
22 && buf_block_get_page_no(block) == page_no) {
23
24- if (!buf_LRU_free_page(&block->page, all)
25- && all && block->page.zip.data) {
26+ freed = buf_LRU_free_page(&block->page, all);
27+
28+ if (!freed && all && block->page.zip.data
29+ /* Now, buf_LRU_free_page() may release mutexes
30+ temporarily */
31+ && buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE
32+ && buf_block_get_space(block) == space
33+ && buf_block_get_page_no(block) == page_no) {
34+
35 /* Attempt to deallocate the uncompressed page
36 if the whole block cannot be deallocted. */
37-
38- buf_LRU_free_page(&block->page, false);
39+ freed = buf_LRU_free_page(&block->page, false);
40 }
41 }
42
43- buf_pool_mutex_exit(buf_pool);
44+ if (!freed) {
45+ mutex_exit(&buf_pool->LRU_list_mutex);
46+ }
47+
48+ mutex_exit(&block->mutex);
49 }
50
51 /*******************************************************************//**
52
53=== modified file 'Percona-Server/storage/innobase/btr/btr0sea.cc'
54--- Percona-Server/storage/innobase/btr/btr0sea.cc 2013-09-06 13:40:39 +0000
55+++ Percona-Server/storage/innobase/btr/btr0sea.cc 2013-09-20 05:29:11 +0000
56@@ -1922,19 +1922,15 @@
57
58 rec_offs_init(offsets_);
59
60- buf_pool_mutex_enter_all();
61-
62 cell_count = hash_get_n_cells(btr_search_sys->hash_tables[t]);
63
64 for (i = 0; i < cell_count; i++) {
65 /* We release btr_search_latch every once in a while to
66 give other queries a chance to run. */
67 if ((i != 0) && ((i % chunk_size) == 0)) {
68- buf_pool_mutex_exit_all();
69 btr_search_x_unlock_all();
70 os_thread_yield();
71 btr_search_x_lock_all();
72- buf_pool_mutex_enter_all();
73 }
74
75 node = (ha_node_t*)
76@@ -1942,7 +1938,7 @@
77 i)->node;
78
79 for (; node != NULL; node = node->next) {
80- const buf_block_t* block
81+ buf_block_t* block
82 = buf_block_align((byte*) node->data);
83 const buf_block_t* hash_block;
84 buf_pool_t* buf_pool;
85@@ -1983,6 +1979,8 @@
86 == BUF_BLOCK_REMOVE_HASH);
87 }
88
89+ mutex_enter(&block->mutex);
90+
91 ut_a(!dict_index_is_ibuf(block->index));
92
93 page_index_id = btr_page_get_index_id(block->frame);
94@@ -2038,6 +2036,8 @@
95 n_page_dumps++;
96 }
97 }
98+
99+ mutex_exit(&block->mutex);
100 }
101 }
102
103@@ -2047,11 +2047,9 @@
104 /* We release btr_search_latch every once in a while to
105 give other queries a chance to run. */
106 if (i != 0) {
107- buf_pool_mutex_exit_all();
108 btr_search_x_unlock_all();
109 os_thread_yield();
110 btr_search_x_lock_all();
111- buf_pool_mutex_enter_all();
112 }
113
114 if (!ha_validate(btr_search_sys->hash_tables[t], i,
115@@ -2060,7 +2058,6 @@
116 }
117 }
118
119- buf_pool_mutex_exit_all();
120 if (UNIV_LIKELY_NULL(heap)) {
121 mem_heap_free(heap);
122 }
123
124=== modified file 'Percona-Server/storage/innobase/buf/buf0buddy.cc'
125--- Percona-Server/storage/innobase/buf/buf0buddy.cc 2013-08-06 15:16:34 +0000
126+++ Percona-Server/storage/innobase/buf/buf0buddy.cc 2013-09-20 05:29:11 +0000
127@@ -205,7 +205,7 @@
128 {
129 const ulint size = BUF_BUDDY_LOW << i;
130
131- ut_ad(buf_pool_mutex_own(buf_pool));
132+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));
133 ut_ad(!ut_align_offset(buf, size));
134 ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
135
136@@ -278,7 +278,7 @@
137 ulint i) /*!< in: index of
138 buf_pool->zip_free[] */
139 {
140- ut_ad(buf_pool_mutex_own(buf_pool));
141+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));
142 ut_ad(buf_pool->zip_free[i].start != buf);
143
144 buf_buddy_stamp_free(buf, i);
145@@ -297,7 +297,7 @@
146 ulint i) /*!< in: index of
147 buf_pool->zip_free[] */
148 {
149- ut_ad(buf_pool_mutex_own(buf_pool));
150+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));
151 ut_ad(buf_buddy_check_free(buf_pool, buf, i));
152
153 UT_LIST_REMOVE(list, buf_pool->zip_free[i], buf);
154@@ -316,7 +316,7 @@
155 {
156 buf_buddy_free_t* buf;
157
158- ut_ad(buf_pool_mutex_own(buf_pool));
159+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));
160 ut_a(i < BUF_BUDDY_SIZES);
161 ut_a(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
162
163@@ -369,10 +369,11 @@
164 buf_page_t* bpage;
165 buf_block_t* block;
166
167- ut_ad(buf_pool_mutex_own(buf_pool));
168 ut_ad(!mutex_own(&buf_pool->zip_mutex));
169 ut_a(!ut_align_offset(buf, UNIV_PAGE_SIZE));
170
171+ mutex_enter(&buf_pool->zip_hash_mutex);
172+
173 HASH_SEARCH(hash, buf_pool->zip_hash, fold, buf_page_t*, bpage,
174 ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_MEMORY
175 && bpage->in_zip_hash && !bpage->in_page_hash),
176@@ -384,6 +385,8 @@
177 ut_d(bpage->in_zip_hash = FALSE);
178 HASH_DELETE(buf_page_t, hash, buf_pool->zip_hash, fold, bpage);
179
180+ mutex_exit(&buf_pool->zip_hash_mutex);
181+
182 ut_d(memset(buf, 0, UNIV_PAGE_SIZE));
183 UNIV_MEM_INVALID(buf, UNIV_PAGE_SIZE);
184
185@@ -406,7 +409,6 @@
186 {
187 buf_pool_t* buf_pool = buf_pool_from_block(block);
188 const ulint fold = BUF_POOL_ZIP_FOLD(block);
189- ut_ad(buf_pool_mutex_own(buf_pool));
190 ut_ad(!mutex_own(&buf_pool->zip_mutex));
191 ut_ad(buf_block_get_state(block) == BUF_BLOCK_READY_FOR_USE);
192
193@@ -418,7 +420,10 @@
194 ut_ad(!block->page.in_page_hash);
195 ut_ad(!block->page.in_zip_hash);
196 ut_d(block->page.in_zip_hash = TRUE);
197+
198+ mutex_enter(&buf_pool->zip_hash_mutex);
199 HASH_INSERT(buf_page_t, hash, buf_pool->zip_hash, fold, &block->page);
200+ mutex_exit(&buf_pool->zip_hash_mutex);
201
202 ut_d(buf_pool->buddy_n_frames++);
203 }
204@@ -438,6 +443,7 @@
205 of buf_pool->zip_free[] */
206 {
207 ulint offs = BUF_BUDDY_LOW << j;
208+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));
209 ut_ad(j <= BUF_BUDDY_SIZES);
210 ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
211 ut_ad(j >= i);
212@@ -461,8 +467,8 @@
213
214 /**********************************************************************//**
215 Allocate a block. The thread calling this function must hold
216-buf_pool->mutex and must not hold buf_pool->zip_mutex or any block->mutex.
217-The buf_pool_mutex may be released and reacquired.
218+buf_pool->LRU_list_mutex and must not hold buf_pool->zip_mutex or any
219+block->mutex. The buf_pool->LRU_list_mutex may be released and reacquired.
220 @return allocated block, never NULL */
221 UNIV_INTERN
222 void*
223@@ -474,23 +480,25 @@
224 ibool* lru) /*!< in: pointer to a variable that
225 will be assigned TRUE if storage was
226 allocated from the LRU list and
227- buf_pool->mutex was temporarily
228- released */
229+ buf_pool->LRU_list_mutex was
230+ temporarily released */
231 {
232 buf_block_t* block;
233
234 ut_ad(lru);
235- ut_ad(buf_pool_mutex_own(buf_pool));
236+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
237 ut_ad(!mutex_own(&buf_pool->zip_mutex));
238 ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
239
240 if (i < BUF_BUDDY_SIZES) {
241 /* Try to allocate from the buddy system. */
242+ mutex_enter(&buf_pool->zip_free_mutex);
243 block = (buf_block_t*) buf_buddy_alloc_zip(buf_pool, i);
244
245 if (block) {
246 goto func_exit;
247 }
248+ mutex_exit(&buf_pool->zip_free_mutex);
249 }
250
251 /* Try allocating from the buf_pool->free list. */
252@@ -502,24 +510,28 @@
253 }
254
255 /* Try replacing an uncompressed page in the buffer pool. */
256- buf_pool_mutex_exit(buf_pool);
257+ mutex_exit(&buf_pool->LRU_list_mutex);
258 block = buf_LRU_get_free_block(buf_pool);
259 *lru = TRUE;
260- buf_pool_mutex_enter(buf_pool);
261+ mutex_enter(&buf_pool->LRU_list_mutex);
262
263 alloc_big:
264 buf_buddy_block_register(block);
265
266+ mutex_enter(&buf_pool->zip_free_mutex);
267 block = (buf_block_t*) buf_buddy_alloc_from(
268 buf_pool, block->frame, i, BUF_BUDDY_SIZES);
269
270 func_exit:
271 buf_pool->buddy_stat[i].used++;
272+ mutex_exit(&buf_pool->zip_free_mutex);
273+
274 return(block);
275 }
276
277 /**********************************************************************//**
278-Try to relocate a block.
279+Try to relocate a block. The caller must hold zip_free_mutex, and this
280+function will release and lock it again.
281 @return true if relocated */
282 static
283 bool
284@@ -536,8 +548,9 @@
285 ib_mutex_t* mutex;
286 ulint space;
287 ulint offset;
288+ rw_lock_t* hash_lock;
289
290- ut_ad(buf_pool_mutex_own(buf_pool));
291+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));
292 ut_ad(!mutex_own(&buf_pool->zip_mutex));
293 ut_ad(!ut_align_offset(src, size));
294 ut_ad(!ut_align_offset(dst, size));
295@@ -556,7 +569,9 @@
296
297 ut_ad(space != BUF_BUDDY_STAMP_FREE);
298
299- bpage = buf_page_hash_get(buf_pool, space, offset);
300+ mutex_exit(&buf_pool->zip_free_mutex);
301+ /* Lock page hash to prevent a relocation for the target page */
302+ bpage = buf_page_hash_get_s_locked(buf_pool, space, offset, &hash_lock);
303
304 if (!bpage || bpage->zip.data != src) {
305 /* The block has probably been freshly
306@@ -564,6 +579,10 @@
307 added to buf_pool->page_hash yet. Obviously,
308 it cannot be relocated. */
309
310+ if (bpage) {
311+ rw_lock_s_unlock(hash_lock);
312+ }
313+ mutex_enter(&buf_pool->zip_free_mutex);
314 return(false);
315 }
316
317@@ -573,6 +592,8 @@
318 For the sake of simplicity, give up. */
319 ut_ad(page_zip_get_size(&bpage->zip) < size);
320
321+ rw_lock_s_unlock(hash_lock);
322+ mutex_enter(&buf_pool->zip_free_mutex);
323 return(false);
324 }
325
326@@ -584,6 +605,10 @@
327
328 mutex_enter(mutex);
329
330+ rw_lock_s_unlock(hash_lock);
331+
332+ mutex_enter(&buf_pool->zip_free_mutex);
333+
334 if (buf_page_can_relocate(bpage)) {
335 /* Relocate the compressed page. */
336 ullint usec = ut_time_us(NULL);
337@@ -618,17 +643,19 @@
338 {
339 buf_buddy_free_t* buddy;
340
341- ut_ad(buf_pool_mutex_own(buf_pool));
342 ut_ad(!mutex_own(&buf_pool->zip_mutex));
343 ut_ad(i <= BUF_BUDDY_SIZES);
344 ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN));
345+
346+ mutex_enter(&buf_pool->zip_free_mutex);
347+
348 ut_ad(buf_pool->buddy_stat[i].used > 0);
349-
350 buf_pool->buddy_stat[i].used--;
351 recombine:
352 UNIV_MEM_ASSERT_AND_ALLOC(buf, BUF_BUDDY_LOW << i);
353
354 if (i == BUF_BUDDY_SIZES) {
355+ mutex_exit(&buf_pool->zip_free_mutex);
356 buf_buddy_block_free(buf_pool, buf);
357 return;
358 }
359@@ -695,4 +722,5 @@
360 buf_buddy_add_to_free(buf_pool,
361 reinterpret_cast<buf_buddy_free_t*>(buf),
362 i);
363+ mutex_exit(&buf_pool->zip_free_mutex);
364 }
365
366=== modified file 'Percona-Server/storage/innobase/buf/buf0buf.cc'
367--- Percona-Server/storage/innobase/buf/buf0buf.cc 2013-09-02 10:01:38 +0000
368+++ Percona-Server/storage/innobase/buf/buf0buf.cc 2013-09-20 05:29:11 +0000
369@@ -119,24 +119,9 @@
370
371 Buffer pool struct
372 ------------------
373-The buffer buf_pool contains a single mutex which protects all the
374+The buffer buf_pool contains several mutexes which protect all the
375 control data structures of the buf_pool. The content of a buffer frame is
376 protected by a separate read-write lock in its control block, though.
377-These locks can be locked and unlocked without owning the buf_pool->mutex.
378-The OS events in the buf_pool struct can be waited for without owning the
379-buf_pool->mutex.
380-
381-The buf_pool->mutex is a hot-spot in main memory, causing a lot of
382-memory bus traffic on multiprocessor systems when processors
383-alternately access the mutex. On our Pentium, the mutex is accessed
384-maybe every 10 microseconds. We gave up the solution to have mutexes
385-for each control block, for instance, because it seemed to be
386-complicated.
387-
388-A solution to reduce mutex contention of the buf_pool->mutex is to
389-create a separate mutex for the page hash table. On Pentium,
390-accessing the hash table takes 2 microseconds, about half
391-of the total buf_pool->mutex hold time.
392
393 Control blocks
394 --------------
395@@ -311,6 +296,11 @@
396 UNIV_INTERN mysql_pfs_key_t buffer_block_mutex_key;
397 UNIV_INTERN mysql_pfs_key_t buf_pool_mutex_key;
398 UNIV_INTERN mysql_pfs_key_t buf_pool_zip_mutex_key;
399+UNIV_INTERN mysql_pfs_key_t buf_pool_flush_state_mutex_key;
400+UNIV_INTERN mysql_pfs_key_t buf_pool_LRU_list_mutex_key;
401+UNIV_INTERN mysql_pfs_key_t buf_pool_free_list_mutex_key;
402+UNIV_INTERN mysql_pfs_key_t buf_pool_zip_free_mutex_key;
403+UNIV_INTERN mysql_pfs_key_t buf_pool_zip_hash_mutex_key;
404 UNIV_INTERN mysql_pfs_key_t flush_list_mutex_key;
405 #endif /* UNIV_PFS_MUTEX */
406
407@@ -1186,7 +1176,6 @@
408 buf_chunk_t* chunk = buf_pool->chunks;
409
410 ut_ad(buf_pool);
411- ut_ad(buf_pool_mutex_own(buf_pool));
412 for (n = buf_pool->n_chunks; n--; chunk++) {
413
414 buf_block_t* block = buf_chunk_contains_zip(chunk, data);
415@@ -1265,8 +1254,6 @@
416 ulint i;
417 ulint curr_size = 0;
418
419- buf_pool_mutex_enter_all();
420-
421 for (i = 0; i < srv_buf_pool_instances; i++) {
422 buf_pool_t* buf_pool;
423
424@@ -1276,8 +1263,6 @@
425
426 srv_buf_pool_curr_size = curr_size;
427 srv_buf_pool_old_size = srv_buf_pool_size;
428-
429- buf_pool_mutex_exit_all();
430 }
431
432 /********************************************************************//**
433@@ -1297,12 +1282,18 @@
434
435 /* 1. Initialize general fields
436 ------------------------------- */
437- mutex_create(buf_pool_mutex_key,
438- &buf_pool->mutex, SYNC_BUF_POOL);
439+ mutex_create(buf_pool_LRU_list_mutex_key,
440+ &buf_pool->LRU_list_mutex, SYNC_BUF_LRU_LIST);
441+ mutex_create(buf_pool_free_list_mutex_key,
442+ &buf_pool->free_list_mutex, SYNC_BUF_FREE_LIST);
443+ mutex_create(buf_pool_zip_free_mutex_key,
444+ &buf_pool->zip_free_mutex, SYNC_BUF_ZIP_FREE);
445+ mutex_create(buf_pool_zip_hash_mutex_key,
446+ &buf_pool->zip_hash_mutex, SYNC_BUF_ZIP_HASH);
447 mutex_create(buf_pool_zip_mutex_key,
448 &buf_pool->zip_mutex, SYNC_BUF_BLOCK);
449-
450- buf_pool_mutex_enter(buf_pool);
451+ mutex_create(buf_pool_flush_state_mutex_key,
452+ &buf_pool->flush_state_mutex, SYNC_BUF_FLUSH_STATE);
453
454 if (buf_pool_size > 0) {
455 buf_pool->n_chunks = 1;
456@@ -1316,8 +1307,6 @@
457 mem_free(chunk);
458 mem_free(buf_pool);
459
460- buf_pool_mutex_exit(buf_pool);
461-
462 return(DB_ERROR);
463 }
464
465@@ -1361,8 +1350,6 @@
466
467 buf_pool->try_LRU_scan = TRUE;
468
469- buf_pool_mutex_exit(buf_pool);
470-
471 return(DB_SUCCESS);
472 }
473
474@@ -1537,7 +1524,7 @@
475
476 fold = buf_page_address_fold(bpage->space, bpage->offset);
477
478- ut_ad(buf_pool_mutex_own(buf_pool));
479+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
480 ut_ad(buf_page_hash_lock_held_x(buf_pool, bpage));
481 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
482 ut_a(buf_page_get_io_fix(bpage) == BUF_IO_NONE);
483@@ -1670,14 +1657,14 @@
484 return(bpage);
485 }
486 /* Add to an existing watch. */
487+ mutex_enter(&buf_pool->zip_mutex);
488 bpage->buf_fix_count++;
489+ mutex_exit(&buf_pool->zip_mutex);
490 return(NULL);
491 }
492
493 /* From this point this function becomes fairly heavy in terms
494- of latching. We acquire the buf_pool mutex as well as all the
495- hash_locks. buf_pool mutex is needed because any changes to
496- the page_hash must be covered by it and hash_locks are needed
497+ of latching. We acquire all the hash_locks. They are needed
498 because we don't want to read any stale information in
499 buf_pool->watch[]. However, it is not in the critical code path
500 as this function will be called only by the purge thread. */
501@@ -1686,18 +1673,16 @@
502 /* To obey latching order first release the hash_lock. */
503 rw_lock_x_unlock(hash_lock);
504
505- buf_pool_mutex_enter(buf_pool);
506 hash_lock_x_all(buf_pool->page_hash);
507
508 /* We have to recheck that the page
509 was not loaded or a watch set by some other
510 purge thread. This is because of the small
511 time window between when we release the
512- hash_lock to acquire buf_pool mutex above. */
513+ hash_lock to acquire all the hash locks above. */
514
515 bpage = buf_page_hash_get_low(buf_pool, space, offset, fold);
516 if (UNIV_LIKELY_NULL(bpage)) {
517- buf_pool_mutex_exit(buf_pool);
518 hash_unlock_x_all_but(buf_pool->page_hash, hash_lock);
519 goto page_found;
520 }
521@@ -1716,21 +1701,19 @@
522 ut_ad(!bpage->in_page_hash);
523 ut_ad(bpage->buf_fix_count == 0);
524
525- /* bpage is pointing to buf_pool->watch[],
526- which is protected by buf_pool->mutex.
527- Normally, buf_page_t objects are protected by
528- buf_block_t::mutex or buf_pool->zip_mutex or both. */
529+ mutex_enter(&buf_pool->zip_mutex);
530
531 bpage->state = BUF_BLOCK_ZIP_PAGE;
532 bpage->space = space;
533 bpage->offset = offset;
534 bpage->buf_fix_count = 1;
535
536+ mutex_exit(&buf_pool->zip_mutex);
537+
538 ut_d(bpage->in_page_hash = TRUE);
539 HASH_INSERT(buf_page_t, hash, buf_pool->page_hash,
540 fold, bpage);
541
542- buf_pool_mutex_exit(buf_pool);
543 /* Once the sentinel is in the page_hash we can
544 safely release all locks except just the
545 relevant hash_lock */
546@@ -1777,7 +1760,8 @@
547 ut_ad(rw_lock_own(hash_lock, RW_LOCK_EX));
548 #endif /* UNIV_SYNC_DEBUG */
549
550- ut_ad(buf_pool_mutex_own(buf_pool));
551+ ut_ad(buf_page_get_state(watch) == BUF_BLOCK_ZIP_PAGE);
552+ ut_ad(buf_own_zip_mutex_for_page(watch));
553
554 HASH_DELETE(buf_page_t, hash, buf_pool->page_hash, fold, watch);
555 ut_d(watch->in_page_hash = FALSE);
556@@ -1801,13 +1785,6 @@
557 rw_lock_t* hash_lock = buf_page_hash_lock_get(buf_pool,
558 fold);
559
560- /* We only need to have buf_pool mutex in case where we end
561- up calling buf_pool_watch_remove but to obey latching order
562- we acquire it here before acquiring hash_lock. This should
563- not cause too much grief as this function is only ever
564- called from the purge thread. */
565- buf_pool_mutex_enter(buf_pool);
566-
567 rw_lock_x_lock(hash_lock);
568
569 bpage = buf_page_hash_get_low(buf_pool, space, offset, fold);
570@@ -1825,12 +1802,13 @@
571 } else {
572 ut_a(bpage->buf_fix_count > 0);
573
574+ mutex_enter(&buf_pool->zip_mutex);
575 if (UNIV_LIKELY(!--bpage->buf_fix_count)) {
576 buf_pool_watch_remove(buf_pool, fold, bpage);
577 }
578+ mutex_exit(&buf_pool->zip_mutex);
579 }
580
581- buf_pool_mutex_exit(buf_pool);
582 rw_lock_x_unlock(hash_lock);
583 }
584
585@@ -1877,13 +1855,13 @@
586 {
587 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
588
589- buf_pool_mutex_enter(buf_pool);
590+ mutex_enter(&buf_pool->LRU_list_mutex);
591
592 ut_a(buf_page_in_file(bpage));
593
594 buf_LRU_make_block_young(bpage);
595
596- buf_pool_mutex_exit(buf_pool);
597+ mutex_exit(&buf_pool->LRU_list_mutex);
598 }
599
600 /********************************************************************//**
601@@ -1897,10 +1875,6 @@
602 buf_page_t* bpage) /*!< in/out: buffer block of a
603 file page */
604 {
605-#ifdef UNIV_DEBUG
606- buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
607- ut_ad(!buf_pool_mutex_own(buf_pool));
608-#endif /* UNIV_DEBUG */
609 ut_a(buf_page_in_file(bpage));
610
611 if (buf_page_peek_if_too_old(bpage)) {
612@@ -1921,16 +1895,12 @@
613 buf_block_t* block;
614 buf_pool_t* buf_pool = buf_pool_get(space, offset);
615
616- buf_pool_mutex_enter(buf_pool);
617-
618 block = (buf_block_t*) buf_page_hash_get(buf_pool, space, offset);
619
620 if (block && buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE) {
621 ut_ad(!buf_pool_watch_is_sentinel(buf_pool, &block->page));
622 block->check_index_page_at_flush = FALSE;
623 }
624-
625- buf_pool_mutex_exit(buf_pool);
626 }
627
628 #if defined UNIV_DEBUG_FILE_ACCESSES || defined UNIV_DEBUG
629@@ -2014,21 +1984,32 @@
630 buf_page_t* bpage;
631 buf_pool_t* buf_pool = buf_pool_get(space, offset);
632
633- /* Since we need to acquire buf_pool mutex to discard
634- the uncompressed frame and because page_hash mutex resides
635- below buf_pool mutex in sync ordering therefore we must
636- first release the page_hash mutex. This means that the
637- block in question can move out of page_hash. Therefore
638- we need to check again if the block is still in page_hash. */
639- buf_pool_mutex_enter(buf_pool);
640+ /* Since we need to acquire buf_pool->LRU_list_mutex to discard
641+ the uncompressed frame and because page_hash mutex resides below
642+ buf_pool->LRU_list_mutex in sync ordering therefore we must first
643+ release the page_hash mutex. This means that the block in question
644+ can move out of page_hash. Therefore we need to check again if the
645+ block is still in page_hash. */
646+
647+ mutex_enter(&buf_pool->LRU_list_mutex);
648
649 bpage = buf_page_hash_get(buf_pool, space, offset);
650
651 if (bpage) {
652- buf_LRU_free_page(bpage, false);
653+
654+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
655+
656+ mutex_enter(block_mutex);
657+
658+ if (buf_LRU_free_page(bpage, false)) {
659+
660+ mutex_exit(block_mutex);
661+ return;
662+ }
663+ mutex_exit(block_mutex);
664 }
665
666- buf_pool_mutex_exit(buf_pool);
667+ mutex_exit(&buf_pool->LRU_list_mutex);
668 }
669
670 /********************************************************************//**
671@@ -2324,7 +2305,8 @@
672 ut_ad(block->frame == page_align(ptr));
673 #ifdef UNIV_DEBUG
674 /* A thread that updates these fields must
675- hold buf_pool->mutex and block->mutex. Acquire
676+ hold one of the buf_pool mutexes, depending on the
677+ page state, and block->mutex. Acquire
678 only the latter. */
679 mutex_enter(&block->mutex);
680
681@@ -2708,6 +2690,7 @@
682 buf_page_t* bpage;
683
684 case BUF_BLOCK_FILE_PAGE:
685+ ut_ad(block_mutex != &buf_pool->zip_mutex);
686 break;
687
688 case BUF_BLOCK_ZIP_PAGE:
689@@ -2721,13 +2704,14 @@
690 }
691
692 bpage = &block->page;
693+ ut_ad(block_mutex == &buf_pool->zip_mutex);
694
695 if (bpage->buf_fix_count
696 || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
697 /* This condition often occurs when the buffer
698 is not buffer-fixed, but I/O-fixed by
699 buf_page_init_for_read(). */
700- mutex_exit(block_mutex);
701+ mutex_exit(&buf_pool->zip_mutex);
702 wait_until_unfixed:
703 /* The block is buffer-fixed or I/O-fixed.
704 Try again later. */
705@@ -2737,11 +2721,11 @@
706 }
707
708 /* Allocate an uncompressed page. */
709- mutex_exit(block_mutex);
710+ mutex_exit(&buf_pool->zip_mutex);
711 block = buf_LRU_get_free_block(buf_pool);
712 ut_a(block);
713
714- buf_pool_mutex_enter(buf_pool);
715+ mutex_enter(&buf_pool->LRU_list_mutex);
716
717 /* As we have released the page_hash lock and the
718 block_mutex to allocate an uncompressed page it is
719@@ -2754,22 +2738,25 @@
720 offset, fold);
721
722 mutex_enter(&block->mutex);
723+ mutex_enter(&buf_pool->zip_mutex);
724+
725 if (bpage != hash_bpage
726 || bpage->buf_fix_count
727 || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
728 buf_LRU_block_free_non_file_page(block);
729- buf_pool_mutex_exit(buf_pool);
730+ mutex_exit(&buf_pool->LRU_list_mutex);
731+ mutex_exit(&buf_pool->zip_mutex);
732 rw_lock_x_unlock(hash_lock);
733 mutex_exit(&block->mutex);
734
735 if (bpage != hash_bpage) {
736 /* The buf_pool->page_hash was modified
737- while buf_pool->mutex was not held
738+ while buf_pool->LRU_list_mutex was not held
739 by this thread. */
740 goto loop;
741 } else {
742 /* The block was buffer-fixed or
743- I/O-fixed while buf_pool->mutex was
744+ I/O-fixed while buf_pool->LRU_list_mutex was
745 not held by this thread. */
746 goto wait_until_unfixed;
747 }
748@@ -2778,8 +2765,6 @@
749 /* Move the compressed page from bpage to block,
750 and uncompress it. */
751
752- mutex_enter(&buf_pool->zip_mutex);
753-
754 buf_relocate(bpage, &block->page);
755 buf_block_init_low(block);
756 block->lock_hash_val = lock_rec_hash(space, offset);
757@@ -2808,6 +2793,8 @@
758 /* Insert at the front of unzip_LRU list */
759 buf_unzip_LRU_add_block(block, FALSE);
760
761+ mutex_exit(&buf_pool->LRU_list_mutex);
762+
763 block->page.buf_fix_count = 1;
764 buf_block_set_io_fix(block, BUF_IO_READ);
765 rw_lock_x_lock_inline(&block->lock, 0, file, line);
766@@ -2816,8 +2803,7 @@
767
768 rw_lock_x_unlock(hash_lock);
769
770- buf_pool->n_pend_unzip++;
771- buf_pool_mutex_exit(buf_pool);
772+ os_atomic_increment_ulint(&buf_pool->n_pend_unzip, 1);
773
774 access_time = buf_page_is_accessed(&block->page);
775 mutex_exit(&block->mutex);
776@@ -2826,7 +2812,7 @@
777 buf_page_free_descriptor(bpage);
778
779 /* Decompress the page while not holding
780- buf_pool->mutex or block->mutex. */
781+ any buf_pool or block->mutex. */
782
783 /* Page checksum verification is already done when
784 the page is read from disk. Hence page checksum
785@@ -2845,12 +2831,10 @@
786 }
787
788 /* Unfix and unlatch the block. */
789- buf_pool_mutex_enter(buf_pool);
790 mutex_enter(&block->mutex);
791 block->page.buf_fix_count--;
792 buf_block_set_io_fix(block, BUF_IO_NONE);
793- buf_pool->n_pend_unzip--;
794- buf_pool_mutex_exit(buf_pool);
795+ os_atomic_decrement_ulint(&buf_pool->n_pend_unzip, 1);
796 rw_lock_x_unlock(&block->lock);
797
798 break;
799@@ -2885,23 +2869,18 @@
800 insert buffer (change buffer) as much as possible. */
801
802 /* To obey the latching order, release the
803- block->mutex before acquiring buf_pool->mutex. Protect
804+ block->mutex before acquiring buf_pool->LRU_list_mutex. Protect
805 the block from changes by temporarily buffer-fixing it
806 for the time we are not holding block->mutex. */
807+
808 buf_block_buf_fix_inc(block, file, line);
809 mutex_exit(&block->mutex);
810- buf_pool_mutex_enter(buf_pool);
811+ mutex_enter(&buf_pool->LRU_list_mutex);
812 mutex_enter(&block->mutex);
813 buf_block_buf_fix_dec(block);
814- mutex_exit(&block->mutex);
815-
816- /* Now we are only holding the buf_pool->mutex,
817- not block->mutex or hash_lock. Blocks cannot be
818- relocated or enter or exit the buf_pool while we
819- are holding the buf_pool->mutex. */
820
821 if (buf_LRU_free_page(&block->page, true)) {
822- buf_pool_mutex_exit(buf_pool);
823+ mutex_exit(&block->mutex);
824 rw_lock_x_lock(hash_lock);
825
826 if (mode == BUF_GET_IF_IN_POOL_OR_WATCH) {
827@@ -2931,10 +2910,11 @@
828 "innodb_change_buffering_debug evict %u %u\n",
829 (unsigned) space, (unsigned) offset);
830 return(NULL);
831+ } else {
832+
833+ mutex_exit(&buf_pool->LRU_list_mutex);
834 }
835
836- mutex_enter(&block->mutex);
837-
838 if (buf_flush_page_try(buf_pool, block)) {
839 fprintf(stderr,
840 "innodb_change_buffering_debug flush %u %u\n",
841@@ -2944,8 +2924,6 @@
842 }
843
844 /* Failed to evict the page; change it directly */
845-
846- buf_pool_mutex_exit(buf_pool);
847 }
848 #endif /* UNIV_DEBUG || UNIV_IBUF_DEBUG */
849
850@@ -3420,7 +3398,6 @@
851 buf_page_t* hash_page;
852
853 ut_ad(buf_pool == buf_pool_get(space, offset));
854- ut_ad(buf_pool_mutex_own(buf_pool));
855
856 ut_ad(mutex_own(&(block->mutex)));
857 ut_a(buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE);
858@@ -3455,11 +3432,17 @@
859 if (UNIV_LIKELY(!hash_page)) {
860 } else if (buf_pool_watch_is_sentinel(buf_pool, hash_page)) {
861 /* Preserve the reference count. */
862+
863+ mutex_enter(&buf_pool->zip_mutex);
864+
865 ulint buf_fix_count = hash_page->buf_fix_count;
866
867 ut_a(buf_fix_count > 0);
868 block->page.buf_fix_count += buf_fix_count;
869 buf_pool_watch_remove(buf_pool, fold, hash_page);
870+
871+ mutex_exit(&buf_pool->zip_mutex);
872+
873 } else {
874 fprintf(stderr,
875 "InnoDB: Error: page %lu %lu already found"
876@@ -3469,7 +3452,6 @@
877 (const void*) hash_page, (const void*) block);
878 #if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
879 mutex_exit(&block->mutex);
880- buf_pool_mutex_exit(buf_pool);
881 buf_print();
882 buf_LRU_print();
883 buf_validate();
884@@ -3556,7 +3538,7 @@
885 fold = buf_page_address_fold(space, offset);
886 hash_lock = buf_page_hash_lock_get(buf_pool, fold);
887
888- buf_pool_mutex_enter(buf_pool);
889+ mutex_enter(&buf_pool->LRU_list_mutex);
890 rw_lock_x_lock(hash_lock);
891
892 watch_page = buf_page_hash_get_low(buf_pool, space, offset, fold);
893@@ -3564,6 +3546,7 @@
894 /* The page is already in the buffer pool. */
895 watch_page = NULL;
896 err_exit:
897+ mutex_exit(&buf_pool->LRU_list_mutex);
898 rw_lock_x_unlock(hash_lock);
899 if (block) {
900 mutex_enter(&block->mutex);
901@@ -3596,6 +3579,7 @@
902
903 /* The block must be put to the LRU list, to the old blocks */
904 buf_LRU_add_block(bpage, TRUE/* to old blocks */);
905+ mutex_exit(&buf_pool->LRU_list_mutex);
906
907 /* We set a pass-type x-lock on the frame because then
908 the same thread which called for the read operation
909@@ -3610,15 +3594,16 @@
910 buf_page_set_io_fix(bpage, BUF_IO_READ);
911
912 if (zip_size) {
913- /* buf_pool->mutex may be released and
914+ /* buf_pool->LRU_list_mutex may be released and
915 reacquired by buf_buddy_alloc(). Thus, we
916 must release block->mutex in order not to
917 break the latching order in the reacquisition
918- of buf_pool->mutex. We also must defer this
919+ of buf_pool->LRU_list_mutex. We also must defer this
920 operation until after the block descriptor has
921 been added to buf_pool->LRU and
922 buf_pool->page_hash. */
923 mutex_exit(&block->mutex);
924+ mutex_enter(&buf_pool->LRU_list_mutex);
925 data = buf_buddy_alloc(buf_pool, zip_size, &lru);
926 mutex_enter(&block->mutex);
927 block->page.zip.data = (page_zip_t*) data;
928@@ -3630,6 +3615,7 @@
929 after block->page.zip.data is set. */
930 ut_ad(buf_page_belongs_to_unzip_LRU(&block->page));
931 buf_unzip_LRU_add_block(block, TRUE);
932+ mutex_exit(&buf_pool->LRU_list_mutex);
933 }
934
935 mutex_exit(&block->mutex);
936@@ -3645,8 +3631,9 @@
937 rw_lock_x_lock(hash_lock);
938
939 /* If buf_buddy_alloc() allocated storage from the LRU list,
940- it released and reacquired buf_pool->mutex. Thus, we must
941- check the page_hash again, as it may have been modified. */
942+ it released and reacquired buf_pool->LRU_list_mutex. Thus, we
943+ must check the page_hash again, as it may have been
944+ modified. */
945 if (UNIV_UNLIKELY(lru)) {
946
947 watch_page = buf_page_hash_get_low(
948@@ -3657,6 +3644,7 @@
949 watch_page))) {
950
951 /* The block was added by some other thread. */
952+ mutex_exit(&buf_pool->LRU_list_mutex);
953 rw_lock_x_unlock(hash_lock);
954 watch_page = NULL;
955 buf_buddy_free(buf_pool, data, zip_size);
956@@ -3700,6 +3688,7 @@
957 /* Preserve the reference count. */
958 ulint buf_fix_count = watch_page->buf_fix_count;
959 ut_a(buf_fix_count > 0);
960+ ut_ad(buf_own_zip_mutex_for_page(bpage));
961 bpage->buf_fix_count += buf_fix_count;
962 ut_ad(buf_pool_watch_is_sentinel(buf_pool, watch_page));
963 buf_pool_watch_remove(buf_pool, fold, watch_page);
964@@ -3716,15 +3705,15 @@
965 #if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
966 buf_LRU_insert_zip_clean(bpage);
967 #endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
968+ mutex_exit(&buf_pool->LRU_list_mutex);
969
970 buf_page_set_io_fix(bpage, BUF_IO_READ);
971
972 mutex_exit(&buf_pool->zip_mutex);
973 }
974
975- buf_pool->n_pend_reads++;
976+ os_atomic_increment_ulint(&buf_pool->n_pend_reads, 1);
977 func_exit:
978- buf_pool_mutex_exit(buf_pool);
979
980 if (mode == BUF_READ_IBUF_PAGES_ONLY) {
981
982@@ -3773,7 +3762,7 @@
983 fold = buf_page_address_fold(space, offset);
984 hash_lock = buf_page_hash_lock_get(buf_pool, fold);
985
986- buf_pool_mutex_enter(buf_pool);
987+ mutex_enter(&buf_pool->LRU_list_mutex);
988 rw_lock_x_lock(hash_lock);
989
990 block = (buf_block_t*) buf_page_hash_get_low(
991@@ -3790,8 +3779,8 @@
992 #endif /* UNIV_DEBUG_FILE_ACCESSES || UNIV_DEBUG */
993
994 /* Page can be found in buf_pool */
995- buf_pool_mutex_exit(buf_pool);
996 rw_lock_x_unlock(hash_lock);
997+ mutex_exit(&buf_pool->LRU_list_mutex);
998
999 buf_block_free(free_block);
1000
1001@@ -3827,17 +3816,17 @@
1002 ibool lru;
1003
1004 /* Prevent race conditions during buf_buddy_alloc(),
1005- which may release and reacquire buf_pool->mutex,
1006+ which may release and reacquire buf_pool->LRU_list_mutex,
1007 by IO-fixing and X-latching the block. */
1008
1009 buf_page_set_io_fix(&block->page, BUF_IO_READ);
1010 rw_lock_x_lock(&block->lock);
1011
1012 mutex_exit(&block->mutex);
1013- /* buf_pool->mutex may be released and reacquired by
1014+ /* buf_pool->LRU_list_mutex may be released and reacquired by
1015 buf_buddy_alloc(). Thus, we must release block->mutex
1016 in order not to break the latching order in
1017- the reacquisition of buf_pool->mutex. We also must
1018+ the reacquisition of buf_pool->LRU_list_mutex. We also must
1019 defer this operation until after the block descriptor
1020 has been added to buf_pool->LRU and buf_pool->page_hash. */
1021 data = buf_buddy_alloc(buf_pool, zip_size, &lru);
1022@@ -3856,7 +3845,7 @@
1023 rw_lock_x_unlock(&block->lock);
1024 }
1025
1026- buf_pool_mutex_exit(buf_pool);
1027+ mutex_exit(&buf_pool->LRU_list_mutex);
1028
1029 mtr_memo_push(mtr, block, MTR_MEMO_BUF_FIX);
1030
1031@@ -3907,6 +3896,8 @@
1032 const byte* frame;
1033 monitor_id_t counter;
1034
1035+ ut_ad(mutex_own(buf_page_get_mutex(bpage)));
1036+
1037 /* If the counter module is not turned on, just return */
1038 if (!MONITOR_IS_ON(MONITOR_MODULE_BUF_PAGE)) {
1039 return;
1040@@ -4014,9 +4005,13 @@
1041 == BUF_BLOCK_FILE_PAGE);
1042 ulint space = bpage->space;
1043 ibool ret = TRUE;
1044+ const ulint fold = buf_page_address_fold(bpage->space,
1045+ bpage->offset);
1046+ rw_lock_t* hash_lock = buf_page_hash_lock_get(buf_pool, fold);
1047
1048 /* First unfix and release lock on the bpage */
1049- buf_pool_mutex_enter(buf_pool);
1050+ mutex_enter(&buf_pool->LRU_list_mutex);
1051+ rw_lock_x_lock(hash_lock);
1052 mutex_enter(buf_page_get_mutex(bpage));
1053 ut_ad(buf_page_get_io_fix(bpage) == BUF_IO_READ);
1054 ut_ad(bpage->buf_fix_count == 0);
1055@@ -4030,19 +4025,18 @@
1056 BUF_IO_READ);
1057 }
1058
1059- mutex_exit(buf_page_get_mutex(bpage));
1060-
1061 /* Find the table with specified space id, and mark it corrupted */
1062 if (dict_set_corrupted_by_space(space)) {
1063 buf_LRU_free_one_page(bpage);
1064 } else {
1065+ mutex_exit(buf_page_get_mutex(bpage));
1066 ret = FALSE;
1067 }
1068
1069+ mutex_exit(&buf_pool->LRU_list_mutex);
1070+
1071 ut_ad(buf_pool->n_pend_reads > 0);
1072- buf_pool->n_pend_reads--;
1073-
1074- buf_pool_mutex_exit(buf_pool);
1075+ os_atomic_decrement_ulint(&buf_pool->n_pend_reads, 1);
1076
1077 return(ret);
1078 }
1079@@ -4061,6 +4055,7 @@
1080 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1081 const ibool uncompressed = (buf_page_get_state(bpage)
1082 == BUF_BLOCK_FILE_PAGE);
1083+ bool have_LRU_mutex = false;
1084
1085 ut_a(buf_page_in_file(bpage));
1086
1087@@ -4070,7 +4065,7 @@
1088 ensures that this is the only thread that handles the i/o for this
1089 block. */
1090
1091- io_type = buf_page_get_io_fix(bpage);
1092+ io_type = buf_page_get_io_fix_unlocked(bpage);
1093 ut_ad(io_type == BUF_IO_READ || io_type == BUF_IO_WRITE);
1094
1095 if (io_type == BUF_IO_READ) {
1096@@ -4080,15 +4075,16 @@
1097
1098 if (buf_page_get_zip_size(bpage)) {
1099 frame = bpage->zip.data;
1100- buf_pool->n_pend_unzip++;
1101+ os_atomic_increment_ulint(&buf_pool->n_pend_unzip, 1);
1102 if (uncompressed
1103 && !buf_zip_decompress((buf_block_t*) bpage,
1104 FALSE)) {
1105
1106- buf_pool->n_pend_unzip--;
1107+ os_atomic_decrement_ulint(
1108+ &buf_pool->n_pend_unzip, 1);
1109 goto corrupt;
1110 }
1111- buf_pool->n_pend_unzip--;
1112+ os_atomic_decrement_ulint(&buf_pool->n_pend_unzip, 1);
1113 } else {
1114 ut_a(uncompressed);
1115 frame = ((buf_block_t*) bpage)->frame;
1116@@ -4255,8 +4251,37 @@
1117 }
1118 }
1119
1120- buf_pool_mutex_enter(buf_pool);
1121- mutex_enter(buf_page_get_mutex(bpage));
1122+ if (io_type == BUF_IO_WRITE
1123+ && (
1124+#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
1125+ /* to keep consistency at buf_LRU_insert_zip_clean() */
1126+ buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY ||
1127+#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
1128+ buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU)) {
1129+
1130+ have_LRU_mutex = TRUE; /* optimistic */
1131+ }
1132+retry_mutex:
1133+ if (have_LRU_mutex) {
1134+ mutex_enter(&buf_pool->LRU_list_mutex);
1135+ }
1136+
1137+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
1138+ mutex_enter(block_mutex);
1139+
1140+ if (UNIV_UNLIKELY(io_type == BUF_IO_WRITE
1141+ && (
1142+#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
1143+ buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY
1144+ ||
1145+#endif
1146+ buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU)
1147+ && !have_LRU_mutex)) {
1148+
1149+ mutex_exit(block_mutex);
1150+ have_LRU_mutex = TRUE;
1151+ goto retry_mutex;
1152+ }
1153
1154 #ifdef UNIV_IBUF_COUNT_DEBUG
1155 if (io_type == BUF_IO_WRITE || uncompressed) {
1156@@ -4271,17 +4296,20 @@
1157 removes the newest lock debug record, without checking the thread
1158 id. */
1159
1160- buf_page_set_io_fix(bpage, BUF_IO_NONE);
1161-
1162 switch (io_type) {
1163 case BUF_IO_READ:
1164+
1165+ buf_page_set_io_fix(bpage, BUF_IO_NONE);
1166+
1167 /* NOTE that the call to ibuf may have moved the ownership of
1168 the x-latch to this OS thread: do not let this confuse you in
1169 debugging! */
1170
1171 ut_ad(buf_pool->n_pend_reads > 0);
1172- buf_pool->n_pend_reads--;
1173- buf_pool->stat.n_pages_read++;
1174+ os_atomic_decrement_ulint(&buf_pool->n_pend_reads, 1);
1175+ os_atomic_increment_ulint(&buf_pool->stat.n_pages_read, 1);
1176+
1177+ ut_ad(!have_LRU_mutex);
1178
1179 if (uncompressed) {
1180 rw_lock_x_unlock_gen(&((buf_block_t*) bpage)->lock,
1181@@ -4296,13 +4324,17 @@
1182
1183 buf_flush_write_complete(bpage);
1184
1185+ os_atomic_increment_ulint(&buf_pool->stat.n_pages_written, 1);
1186+
1187+ if (have_LRU_mutex) {
1188+ mutex_exit(&buf_pool->LRU_list_mutex);
1189+ }
1190+
1191 if (uncompressed) {
1192 rw_lock_s_unlock_gen(&((buf_block_t*) bpage)->lock,
1193 BUF_IO_WRITE);
1194 }
1195
1196- buf_pool->stat.n_pages_written++;
1197-
1198 break;
1199
1200 default:
1201@@ -4320,8 +4352,7 @@
1202 }
1203 #endif /* UNIV_DEBUG */
1204
1205- mutex_exit(buf_page_get_mutex(bpage));
1206- buf_pool_mutex_exit(buf_pool);
1207+ mutex_exit(block_mutex);
1208
1209 return(true);
1210 }
1211@@ -4340,14 +4371,16 @@
1212
1213 ut_ad(buf_pool);
1214
1215- buf_pool_mutex_enter(buf_pool);
1216-
1217 chunk = buf_pool->chunks;
1218
1219 for (i = buf_pool->n_chunks; i--; chunk++) {
1220
1221+ mutex_enter(&buf_pool->LRU_list_mutex);
1222+
1223 const buf_block_t* block = buf_chunk_not_freed(chunk);
1224
1225+ mutex_exit(&buf_pool->LRU_list_mutex);
1226+
1227 if (UNIV_LIKELY_NULL(block)) {
1228 fprintf(stderr,
1229 "Page %lu %lu still fixed or dirty\n",
1230@@ -4357,8 +4390,6 @@
1231 }
1232 }
1233
1234- buf_pool_mutex_exit(buf_pool);
1235-
1236 return(TRUE);
1237 }
1238
1239@@ -4372,7 +4403,9 @@
1240 {
1241 ulint i;
1242
1243- buf_pool_mutex_enter(buf_pool);
1244+ ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
1245+
1246+ mutex_enter(&buf_pool->flush_state_mutex);
1247
1248 for (i = BUF_FLUSH_LRU; i < BUF_FLUSH_N_TYPES; i++) {
1249
1250@@ -4390,21 +4423,20 @@
1251 if (buf_pool->n_flush[i] > 0) {
1252 buf_flush_t type = static_cast<buf_flush_t>(i);
1253
1254- buf_pool_mutex_exit(buf_pool);
1255+ mutex_exit(&buf_pool->flush_state_mutex);
1256 buf_flush_wait_batch_end(buf_pool, type);
1257- buf_pool_mutex_enter(buf_pool);
1258+ mutex_enter(&buf_pool->flush_state_mutex);
1259 }
1260 }
1261-
1262- buf_pool_mutex_exit(buf_pool);
1263+ mutex_exit(&buf_pool->flush_state_mutex);
1264
1265 ut_ad(buf_all_freed_instance(buf_pool));
1266
1267- buf_pool_mutex_enter(buf_pool);
1268-
1269 while (buf_LRU_scan_and_free_block(buf_pool, TRUE)) {
1270 }
1271
1272+ mutex_enter(&buf_pool->LRU_list_mutex);
1273+
1274 ut_ad(UT_LIST_GET_LEN(buf_pool->LRU) == 0);
1275 ut_ad(UT_LIST_GET_LEN(buf_pool->unzip_LRU) == 0);
1276
1277@@ -4412,10 +4444,10 @@
1278 buf_pool->LRU_old = NULL;
1279 buf_pool->LRU_old_len = 0;
1280
1281+ mutex_exit(&buf_pool->LRU_list_mutex);
1282+
1283 memset(&buf_pool->stat, 0x00, sizeof(buf_pool->stat));
1284 buf_refresh_io_stats(buf_pool);
1285-
1286- buf_pool_mutex_exit(buf_pool);
1287 }
1288
1289 /*********************************************************************//**
1290@@ -4460,8 +4492,11 @@
1291
1292 ut_ad(buf_pool);
1293
1294- buf_pool_mutex_enter(buf_pool);
1295+ mutex_enter(&buf_pool->LRU_list_mutex);
1296 hash_lock_x_all(buf_pool->page_hash);
1297+ mutex_enter(&buf_pool->zip_mutex);
1298+ mutex_enter(&buf_pool->free_list_mutex);
1299+ mutex_enter(&buf_pool->flush_state_mutex);
1300
1301 chunk = buf_pool->chunks;
1302
1303@@ -4474,8 +4509,6 @@
1304
1305 for (j = chunk->size; j--; block++) {
1306
1307- mutex_enter(&block->mutex);
1308-
1309 switch (buf_block_get_state(block)) {
1310 case BUF_BLOCK_POOL_WATCH:
1311 case BUF_BLOCK_ZIP_PAGE:
1312@@ -4486,6 +4519,7 @@
1313 break;
1314
1315 case BUF_BLOCK_FILE_PAGE:
1316+
1317 space = buf_block_get_space(block);
1318 offset = buf_block_get_page_no(block);
1319 fold = buf_page_address_fold(space, offset);
1320@@ -4496,14 +4530,15 @@
1321 == &block->page);
1322
1323 #ifdef UNIV_IBUF_COUNT_DEBUG
1324- ut_a(buf_page_get_io_fix(&block->page)
1325+ ut_a(buf_page_get_io_fix_unlocked(&block->page)
1326 == BUF_IO_READ
1327 || !ibuf_count_get(buf_block_get_space(
1328 block),
1329 buf_block_get_page_no(
1330 block)));
1331 #endif
1332- switch (buf_page_get_io_fix(&block->page)) {
1333+ switch (buf_page_get_io_fix_unlocked(
1334+ &block->page)) {
1335 case BUF_IO_NONE:
1336 break;
1337
1338@@ -4511,17 +4546,8 @@
1339 switch (buf_page_get_flush_type(
1340 &block->page)) {
1341 case BUF_FLUSH_LRU:
1342- n_lru_flush++;
1343- goto assert_s_latched;
1344 case BUF_FLUSH_SINGLE_PAGE:
1345- n_page_flush++;
1346-assert_s_latched:
1347- ut_a(rw_lock_is_locked(
1348- &block->lock,
1349- RW_LOCK_SHARED));
1350- break;
1351 case BUF_FLUSH_LIST:
1352- n_list_flush++;
1353 break;
1354 default:
1355 ut_error;
1356@@ -4552,13 +4578,9 @@
1357 /* do nothing */
1358 break;
1359 }
1360-
1361- mutex_exit(&block->mutex);
1362 }
1363 }
1364
1365- mutex_enter(&buf_pool->zip_mutex);
1366-
1367 /* Check clean compressed-only blocks. */
1368
1369 for (b = UT_LIST_GET_FIRST(buf_pool->zip_clean); b;
1370@@ -4604,7 +4626,9 @@
1371 case BUF_BLOCK_ZIP_DIRTY:
1372 n_lru++;
1373 n_zip++;
1374- switch (buf_page_get_io_fix(b)) {
1375+ /* fallthrough */
1376+ case BUF_BLOCK_FILE_PAGE:
1377+ switch (buf_page_get_io_fix_unlocked(b)) {
1378 case BUF_IO_NONE:
1379 case BUF_IO_READ:
1380 case BUF_IO_PIN:
1381@@ -4624,11 +4648,10 @@
1382 ut_error;
1383 }
1384 break;
1385+ default:
1386+ ut_error;
1387 }
1388 break;
1389- case BUF_BLOCK_FILE_PAGE:
1390- /* uncompressed page */
1391- break;
1392 case BUF_BLOCK_POOL_WATCH:
1393 case BUF_BLOCK_ZIP_PAGE:
1394 case BUF_BLOCK_NOT_USED:
1395@@ -4658,6 +4681,9 @@
1396 }
1397
1398 ut_a(UT_LIST_GET_LEN(buf_pool->LRU) == n_lru);
1399+
1400+ mutex_exit(&buf_pool->LRU_list_mutex);
1401+
1402 if (UT_LIST_GET_LEN(buf_pool->free) != n_free) {
1403 fprintf(stderr, "Free list len %lu, free blocks %lu\n",
1404 (ulong) UT_LIST_GET_LEN(buf_pool->free),
1405@@ -4665,11 +4691,13 @@
1406 ut_error;
1407 }
1408
1409+ mutex_exit(&buf_pool->free_list_mutex);
1410+
1411 ut_a(buf_pool->n_flush[BUF_FLUSH_LIST] == n_list_flush);
1412 ut_a(buf_pool->n_flush[BUF_FLUSH_LRU] == n_lru_flush);
1413 ut_a(buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE] == n_page_flush);
1414
1415- buf_pool_mutex_exit(buf_pool);
1416+ mutex_exit(&buf_pool->flush_state_mutex);
1417
1418 ut_a(buf_LRU_validate());
1419 ut_a(buf_flush_validate(buf_pool));
1420@@ -4727,8 +4755,7 @@
1421
1422 counts = static_cast<ulint*>(mem_alloc(sizeof(ulint) * size));
1423
1424- buf_pool_mutex_enter(buf_pool);
1425- buf_flush_list_mutex_enter(buf_pool);
1426+ /* Dirty reads below */
1427
1428 fprintf(stderr,
1429 "buf_pool size %lu\n"
1430@@ -4755,12 +4782,12 @@
1431 (ulong) buf_pool->stat.n_pages_created,
1432 (ulong) buf_pool->stat.n_pages_written);
1433
1434- buf_flush_list_mutex_exit(buf_pool);
1435-
1436 /* Count the number of blocks belonging to each index in the buffer */
1437
1438 n_found = 0;
1439
1440+ mutex_enter(&buf_pool->LRU_list_mutex);
1441+
1442 chunk = buf_pool->chunks;
1443
1444 for (i = buf_pool->n_chunks; i--; chunk++) {
1445@@ -4796,7 +4823,7 @@
1446 }
1447 }
1448
1449- buf_pool_mutex_exit(buf_pool);
1450+ mutex_exit(&buf_pool->LRU_list_mutex);
1451
1452 for (i = 0; i < n_found; i++) {
1453 index = dict_index_get_if_in_cache(index_ids[i]);
1454@@ -4853,7 +4880,8 @@
1455 buf_chunk_t* chunk;
1456 ulint fixed_pages_number = 0;
1457
1458- buf_pool_mutex_enter(buf_pool);
1459+ /* The LRU list mutex is enough to protect the required fields below */
1460+ mutex_enter(&buf_pool->LRU_list_mutex);
1461
1462 chunk = buf_pool->chunks;
1463
1464@@ -4870,18 +4898,17 @@
1465 continue;
1466 }
1467
1468- mutex_enter(&block->mutex);
1469-
1470 if (block->page.buf_fix_count != 0
1471- || buf_page_get_io_fix(&block->page)
1472+ || buf_page_get_io_fix_unlocked(&block->page)
1473 != BUF_IO_NONE) {
1474 fixed_pages_number++;
1475 }
1476
1477- mutex_exit(&block->mutex);
1478 }
1479 }
1480
1481+ mutex_exit(&buf_pool->LRU_list_mutex);
1482+
1483 mutex_enter(&buf_pool->zip_mutex);
1484
1485 /* Traverse the lists of clean and dirty compressed-only blocks. */
1486@@ -4925,7 +4952,6 @@
1487
1488 buf_flush_list_mutex_exit(buf_pool);
1489 mutex_exit(&buf_pool->zip_mutex);
1490- buf_pool_mutex_exit(buf_pool);
1491
1492 return(fixed_pages_number);
1493 }
1494@@ -5073,9 +5099,6 @@
1495 /* Find appropriate pool_info to store stats for this buffer pool */
1496 pool_info = &all_pool_info[pool_id];
1497
1498- buf_pool_mutex_enter(buf_pool);
1499- buf_flush_list_mutex_enter(buf_pool);
1500-
1501 pool_info->pool_unique_id = pool_id;
1502
1503 pool_info->pool_size = buf_pool->curr_size;
1504@@ -5094,6 +5117,8 @@
1505
1506 pool_info->n_pend_reads = buf_pool->n_pend_reads;
1507
1508+ mutex_enter(&buf_pool->flush_state_mutex);
1509+
1510 pool_info->n_pending_flush_lru =
1511 (buf_pool->n_flush[BUF_FLUSH_LRU]
1512 + buf_pool->init_flush[BUF_FLUSH_LRU]);
1513@@ -5106,7 +5131,7 @@
1514 (buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE]
1515 + buf_pool->init_flush[BUF_FLUSH_SINGLE_PAGE]);
1516
1517- buf_flush_list_mutex_exit(buf_pool);
1518+ mutex_exit(&buf_pool->flush_state_mutex);
1519
1520 current_time = time(NULL);
1521 time_elapsed = 0.001 + difftime(current_time,
1522@@ -5189,7 +5214,6 @@
1523 pool_info->unzip_cur = buf_LRU_stat_cur.unzip;
1524
1525 buf_refresh_io_stats(buf_pool);
1526- buf_pool_mutex_exit(buf_pool);
1527 }
1528
1529 /*********************************************************************//**
1530@@ -5398,22 +5422,22 @@
1531 ulint i;
1532 ulint pending_io = 0;
1533
1534- buf_pool_mutex_enter_all();
1535-
1536 for (i = 0; i < srv_buf_pool_instances; i++) {
1537- const buf_pool_t* buf_pool;
1538+ buf_pool_t* buf_pool;
1539
1540 buf_pool = buf_pool_from_array(i);
1541
1542- pending_io += buf_pool->n_pend_reads
1543- + buf_pool->n_flush[BUF_FLUSH_LRU]
1544- + buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE]
1545- + buf_pool->n_flush[BUF_FLUSH_LIST];
1546-
1547+ pending_io += buf_pool->n_pend_reads;
1548+
1549+ mutex_enter(&buf_pool->flush_state_mutex);
1550+
1551+ pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
1552+ pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
1553+ pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
1554+
1555+ mutex_exit(&buf_pool->flush_state_mutex);
1556 }
1557
1558- buf_pool_mutex_exit_all();
1559-
1560 return(pending_io);
1561 }
1562
1563@@ -5429,11 +5453,11 @@
1564 {
1565 ulint len;
1566
1567- buf_pool_mutex_enter(buf_pool);
1568+ mutex_enter(&buf_pool->free_list_mutex);
1569
1570 len = UT_LIST_GET_LEN(buf_pool->free);
1571
1572- buf_pool_mutex_exit(buf_pool);
1573+ mutex_exit(&buf_pool->free_list_mutex);
1574
1575 return(len);
1576 }
1577
1578=== modified file 'Percona-Server/storage/innobase/buf/buf0dblwr.cc'
1579--- Percona-Server/storage/innobase/buf/buf0dblwr.cc 2013-06-20 15:16:00 +0000
1580+++ Percona-Server/storage/innobase/buf/buf0dblwr.cc 2013-09-20 05:29:11 +0000
1581@@ -936,6 +936,7 @@
1582 ulint zip_size;
1583
1584 ut_a(buf_page_in_file(bpage));
1585+ ut_ad(!mutex_own(&buf_pool_from_bpage(bpage)->LRU_list_mutex));
1586
1587 try_again:
1588 mutex_enter(&buf_dblwr->mutex);
1589
1590=== modified file 'Percona-Server/storage/innobase/buf/buf0dump.cc'
1591--- Percona-Server/storage/innobase/buf/buf0dump.cc 2012-12-04 08:24:59 +0000
1592+++ Percona-Server/storage/innobase/buf/buf0dump.cc 2013-09-20 05:29:11 +0000
1593@@ -28,7 +28,7 @@
1594 #include <stdarg.h> /* va_* */
1595 #include <string.h> /* strerror() */
1596
1597-#include "buf0buf.h" /* buf_pool_mutex_enter(), srv_buf_pool_instances */
1598+#include "buf0buf.h" /* srv_buf_pool_instances */
1599 #include "buf0dump.h"
1600 #include "db0err.h"
1601 #include "dict0dict.h" /* dict_operation_lock */
1602@@ -58,8 +58,8 @@
1603 static ibool buf_load_abort_flag = FALSE;
1604
1605 /* Used to temporary store dump info in order to avoid IO while holding
1606-buffer pool mutex during dump and also to sort the contents of the dump
1607-before reading the pages from disk during load.
1608+buffer pool LRU list mutex during dump and also to sort the contents of the
1609+dump before reading the pages from disk during load.
1610 We store the space id in the high 32 bits and page no in low 32 bits. */
1611 typedef ib_uint64_t buf_dump_t;
1612
1613@@ -218,15 +218,15 @@
1614
1615 buf_pool = buf_pool_from_array(i);
1616
1617- /* obtain buf_pool mutex before allocate, since
1618+ /* obtain buf_pool LRU list mutex before allocate, since
1619 UT_LIST_GET_LEN(buf_pool->LRU) could change */
1620- buf_pool_mutex_enter(buf_pool);
1621+ mutex_enter(&buf_pool->LRU_list_mutex);
1622
1623 n_pages = UT_LIST_GET_LEN(buf_pool->LRU);
1624
1625 /* skip empty buffer pools */
1626 if (n_pages == 0) {
1627- buf_pool_mutex_exit(buf_pool);
1628+ mutex_exit(&buf_pool->LRU_list_mutex);
1629 continue;
1630 }
1631
1632@@ -234,7 +234,7 @@
1633 ut_malloc(n_pages * sizeof(*dump))) ;
1634
1635 if (dump == NULL) {
1636- buf_pool_mutex_exit(buf_pool);
1637+ mutex_exit(&buf_pool->LRU_list_mutex);
1638 fclose(f);
1639 buf_dump_status(STATUS_ERR,
1640 "Cannot allocate " ULINTPF " bytes: %s",
1641@@ -256,7 +256,7 @@
1642
1643 ut_a(j == n_pages);
1644
1645- buf_pool_mutex_exit(buf_pool);
1646+ mutex_exit(&buf_pool->LRU_list_mutex);
1647
1648 for (j = 0; j < n_pages && !SHOULD_QUIT(); j++) {
1649 ret = fprintf(f, ULINTPF "," ULINTPF "\n",
1650
1651=== modified file 'Percona-Server/storage/innobase/buf/buf0flu.cc'
1652--- Percona-Server/storage/innobase/buf/buf0flu.cc 2013-08-16 09:11:51 +0000
1653+++ Percona-Server/storage/innobase/buf/buf0flu.cc 2013-09-20 05:29:11 +0000
1654@@ -350,7 +350,6 @@
1655 buf_block_t* block, /*!< in/out: block which is modified */
1656 lsn_t lsn) /*!< in: oldest modification */
1657 {
1658- ut_ad(!buf_pool_mutex_own(buf_pool));
1659 ut_ad(log_flush_order_mutex_own());
1660 ut_ad(mutex_own(&block->mutex));
1661
1662@@ -409,15 +408,14 @@
1663 buf_page_t* prev_b;
1664 buf_page_t* b;
1665
1666- ut_ad(!buf_pool_mutex_own(buf_pool));
1667 ut_ad(log_flush_order_mutex_own());
1668 ut_ad(mutex_own(&block->mutex));
1669 ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
1670
1671 buf_flush_list_mutex_enter(buf_pool);
1672
1673- /* The field in_LRU_list is protected by buf_pool->mutex, which
1674- we are not holding. However, while a block is in the flush
1675+ /* The field in_LRU_list is protected by buf_pool->LRU_list_mutex,
1676+ which we are not holding. However, while a block is in the flush
1677 list, it is dirty and cannot be discarded, not from the
1678 page_hash or from the LRU list. At most, the uncompressed
1679 page frame of a compressed block may be discarded or created
1680@@ -501,7 +499,7 @@
1681 {
1682 #ifdef UNIV_DEBUG
1683 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1684- ut_ad(buf_pool_mutex_own(buf_pool));
1685+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
1686 #endif
1687 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
1688 ut_ad(bpage->in_LRU_list);
1689@@ -535,17 +533,13 @@
1690 buf_page_in_file(bpage) */
1691 buf_flush_t flush_type)/*!< in: type of flush */
1692 {
1693-#ifdef UNIV_DEBUG
1694- buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1695- ut_ad(buf_pool_mutex_own(buf_pool));
1696-#endif /* UNIV_DEBUG */
1697-
1698+ ut_ad(flush_type < BUF_FLUSH_N_TYPES);
1699+ ut_ad(mutex_own(buf_page_get_mutex(bpage))
1700+ || flush_type == BUF_FLUSH_LIST);
1701 ut_a(buf_page_in_file(bpage));
1702- ut_ad(mutex_own(buf_page_get_mutex(bpage)));
1703- ut_ad(flush_type < BUF_FLUSH_N_TYPES);
1704
1705 if (bpage->oldest_modification == 0
1706- || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
1707+ || buf_page_get_io_fix_unlocked(bpage) != BUF_IO_NONE) {
1708 return(false);
1709 }
1710
1711@@ -583,8 +577,11 @@
1712 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1713 ulint zip_size;
1714
1715- ut_ad(buf_pool_mutex_own(buf_pool));
1716 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
1717+#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
1718+ ut_ad(buf_page_get_state(bpage) != BUF_BLOCK_ZIP_DIRTY
1719+ || mutex_own(&buf_pool->LRU_list_mutex));
1720+#endif
1721 ut_ad(bpage->in_flush_list);
1722
1723 buf_flush_list_mutex_enter(buf_pool);
1724@@ -655,7 +652,6 @@
1725 buf_page_t* prev_b = NULL;
1726 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1727
1728- ut_ad(buf_pool_mutex_own(buf_pool));
1729 /* Must reside in the same buffer pool. */
1730 ut_ad(buf_pool == buf_pool_from_bpage(dpage));
1731
1732@@ -663,13 +659,6 @@
1733
1734 buf_flush_list_mutex_enter(buf_pool);
1735
1736- /* FIXME: At this point we have both buf_pool and flush_list
1737- mutexes. Theoretically removal of a block from flush list is
1738- only covered by flush_list mutex but currently we do
1739- have buf_pool mutex in buf_flush_remove() therefore this block
1740- is guaranteed to be in the flush list. We need to check if
1741- this will work without the assumption of block removing code
1742- having the buf_pool mutex. */
1743 ut_ad(bpage->in_flush_list);
1744 ut_ad(dpage->in_flush_list);
1745
1746@@ -720,14 +709,15 @@
1747 /*=====================*/
1748 buf_page_t* bpage) /*!< in: pointer to the block in question */
1749 {
1750- buf_flush_t flush_type;
1751+ buf_flush_t flush_type = buf_page_get_flush_type(bpage);
1752 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1753
1754- ut_ad(bpage);
1755+ mutex_enter(&buf_pool->flush_state_mutex);
1756
1757 buf_flush_remove(bpage);
1758
1759- flush_type = buf_page_get_flush_type(bpage);
1760+ buf_page_set_io_fix(bpage, BUF_IO_NONE);
1761+
1762 buf_pool->n_flush[flush_type]--;
1763
1764 /* fprintf(stderr, "n pending flush %lu\n",
1765@@ -742,6 +732,8 @@
1766 }
1767
1768 buf_dblwr_update(bpage, flush_type);
1769+
1770+ mutex_exit(&buf_pool->flush_state_mutex);
1771 }
1772 #endif /* !UNIV_HOTBACKUP */
1773
1774@@ -890,7 +882,7 @@
1775
1776 #ifdef UNIV_DEBUG
1777 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
1778- ut_ad(!buf_pool_mutex_own(buf_pool));
1779+ ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
1780 #endif
1781
1782 #ifdef UNIV_LOG_DEBUG
1783@@ -899,15 +891,14 @@
1784
1785 ut_ad(buf_page_in_file(bpage));
1786
1787- /* We are not holding buf_pool->mutex or block_mutex here.
1788+ /* We are not holding block_mutex here.
1789 Nevertheless, it is safe to access bpage, because it is
1790 io_fixed and oldest_modification != 0. Thus, it cannot be
1791 relocated in the buffer pool or removed from flush_list or
1792 LRU_list. */
1793- ut_ad(!buf_pool_mutex_own(buf_pool));
1794 ut_ad(!buf_flush_list_mutex_own(buf_pool));
1795 ut_ad(!mutex_own(buf_page_get_mutex(bpage)));
1796- ut_ad(buf_page_get_io_fix(bpage) == BUF_IO_WRITE);
1797+ ut_ad(buf_page_get_io_fix_unlocked(bpage) == BUF_IO_WRITE);
1798 ut_ad(bpage->oldest_modification != 0);
1799
1800 #ifdef UNIV_IBUF_COUNT_DEBUG
1801@@ -989,9 +980,8 @@
1802 Writes a flushable page asynchronously from the buffer pool to a file.
1803 NOTE: in simulated aio we must call
1804 os_aio_simulated_wake_handler_threads after we have posted a batch of
1805-writes! NOTE: buf_pool->mutex and buf_page_get_mutex(bpage) must be
1806-held upon entering this function, and they will be released by this
1807-function. */
1808+writes! NOTE: buf_page_get_mutex(bpage) must be held upon entering this
1809+function, and it will be released by this function. */
1810 UNIV_INTERN
1811 void
1812 buf_flush_page(
1813@@ -1005,7 +995,7 @@
1814 ibool is_uncompressed;
1815
1816 ut_ad(flush_type < BUF_FLUSH_N_TYPES);
1817- ut_ad(buf_pool_mutex_own(buf_pool));
1818+ ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
1819 ut_ad(buf_page_in_file(bpage));
1820 ut_ad(!sync || flush_type == BUF_FLUSH_SINGLE_PAGE);
1821
1822@@ -1014,6 +1004,8 @@
1823
1824 ut_ad(buf_flush_ready_for_flush(bpage, flush_type));
1825
1826+ mutex_enter(&buf_pool->flush_state_mutex);
1827+
1828 buf_page_set_io_fix(bpage, BUF_IO_WRITE);
1829
1830 buf_page_set_flush_type(bpage, flush_type);
1831@@ -1025,6 +1017,8 @@
1832
1833 buf_pool->n_flush[flush_type]++;
1834
1835+ mutex_exit(&buf_pool->flush_state_mutex);
1836+
1837 is_uncompressed = (buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
1838 ut_ad(is_uncompressed == (block_mutex != &buf_pool->zip_mutex));
1839
1840@@ -1042,7 +1036,6 @@
1841 }
1842
1843 mutex_exit(block_mutex);
1844- buf_pool_mutex_exit(buf_pool);
1845
1846 /* Even though bpage is not protected by any mutex at
1847 this point, it is safe to access bpage, because it is
1848@@ -1080,11 +1073,10 @@
1849 }
1850
1851 /* Note that the s-latch is acquired before releasing the
1852- buf_pool mutex: this ensures that the latch is acquired
1853- immediately. */
1854+ buf_page_get_mutex() mutex: this ensures that the latch is
1855+ acquired immediately. */
1856
1857 mutex_exit(block_mutex);
1858- buf_pool_mutex_exit(buf_pool);
1859 break;
1860
1861 default:
1862@@ -1109,9 +1101,9 @@
1863 # if defined UNIV_DEBUG || defined UNIV_IBUF_DEBUG
1864 /********************************************************************//**
1865 Writes a flushable page asynchronously from the buffer pool to a file.
1866-NOTE: buf_pool->mutex and block->mutex must be held upon entering this
1867-function, and they will be released by this function after flushing.
1868-This is loosely based on buf_flush_batch() and buf_flush_page().
1869+NOTE: block->mutex must be held upon entering this function, and it will be
1870+released by this function after flushing. This is loosely based on
1871+buf_flush_batch() and buf_flush_page().
1872 @return TRUE if the page was flushed and the mutexes released */
1873 UNIV_INTERN
1874 ibool
1875@@ -1120,7 +1112,6 @@
1876 buf_pool_t* buf_pool, /*!< in/out: buffer pool instance */
1877 buf_block_t* block) /*!< in/out: buffer control block */
1878 {
1879- ut_ad(buf_pool_mutex_own(buf_pool));
1880 ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
1881 ut_ad(mutex_own(&block->mutex));
1882
1883@@ -1149,21 +1140,27 @@
1884 buf_page_t* bpage;
1885 buf_pool_t* buf_pool = buf_pool_get(space, offset);
1886 bool ret;
1887+ rw_lock_t* hash_lock;
1888+ ib_mutex_t* block_mutex;
1889
1890 ut_ad(flush_type == BUF_FLUSH_LRU
1891 || flush_type == BUF_FLUSH_LIST);
1892
1893- buf_pool_mutex_enter(buf_pool);
1894-
1895 /* We only want to flush pages from this buffer pool. */
1896- bpage = buf_page_hash_get(buf_pool, space, offset);
1897+ bpage = buf_page_hash_get_s_locked(buf_pool, space, offset,
1898+ &hash_lock);
1899
1900 if (!bpage) {
1901
1902- buf_pool_mutex_exit(buf_pool);
1903 return(false);
1904 }
1905
1906+ block_mutex = buf_page_get_mutex(bpage);
1907+
1908+ mutex_enter(block_mutex);
1909+
1910+ rw_lock_s_unlock(hash_lock);
1911+
1912 ut_a(buf_page_in_file(bpage));
1913
1914 /* We avoid flushing 'non-old' blocks in an LRU flush,
1915@@ -1171,15 +1168,13 @@
1916
1917 ret = false;
1918 if (flush_type != BUF_FLUSH_LRU || buf_page_is_old(bpage)) {
1919- ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
1920
1921- mutex_enter(block_mutex);
1922 if (buf_flush_ready_for_flush(bpage, flush_type)) {
1923 ret = true;
1924 }
1925- mutex_exit(block_mutex);
1926 }
1927- buf_pool_mutex_exit(buf_pool);
1928+
1929+ mutex_exit(block_mutex);
1930
1931 return(ret);
1932 }
1933@@ -1207,6 +1202,8 @@
1934 buf_pool_t* buf_pool = buf_pool_get(space, offset);
1935
1936 ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_LIST);
1937+ ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
1938+ ut_ad(!buf_flush_list_mutex_own(buf_pool));
1939
1940 if (UT_LIST_GET_LEN(buf_pool->LRU) < BUF_LRU_OLD_MIN_LEN
1941 || srv_flush_neighbors == 0) {
1942@@ -1262,6 +1259,8 @@
1943 for (i = low; i < high; i++) {
1944
1945 buf_page_t* bpage;
1946+ rw_lock_t* hash_lock;
1947+ ib_mutex_t* block_mutex;
1948
1949 if ((count + n_flushed) >= n_to_flush) {
1950
1951@@ -1280,17 +1279,21 @@
1952
1953 buf_pool = buf_pool_get(space, i);
1954
1955- buf_pool_mutex_enter(buf_pool);
1956-
1957 /* We only want to flush pages from this buffer pool. */
1958- bpage = buf_page_hash_get(buf_pool, space, i);
1959+ bpage = buf_page_hash_get_s_locked(buf_pool, space, i,
1960+ &hash_lock);
1961
1962 if (!bpage) {
1963
1964- buf_pool_mutex_exit(buf_pool);
1965 continue;
1966 }
1967
1968+ block_mutex = buf_page_get_mutex(bpage);
1969+
1970+ mutex_enter(block_mutex);
1971+
1972+ rw_lock_s_unlock(hash_lock);
1973+
1974 ut_a(buf_page_in_file(bpage));
1975
1976 /* We avoid flushing 'non-old' blocks in an LRU flush,
1977@@ -1299,9 +1302,6 @@
1978 if (flush_type != BUF_FLUSH_LRU
1979 || i == offset
1980 || buf_page_is_old(bpage)) {
1981- ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
1982-
1983- mutex_enter(block_mutex);
1984
1985 if (buf_flush_ready_for_flush(bpage, flush_type)
1986 && (i == offset || !bpage->buf_fix_count)) {
1987@@ -1316,14 +1316,12 @@
1988
1989 buf_flush_page(buf_pool, bpage, flush_type, false);
1990 ut_ad(!mutex_own(block_mutex));
1991- ut_ad(!buf_pool_mutex_own(buf_pool));
1992 count++;
1993 continue;
1994- } else {
1995- mutex_exit(block_mutex);
1996 }
1997 }
1998- buf_pool_mutex_exit(buf_pool);
1999+
2000+ mutex_exit(block_mutex);
2001 }
2002
2003 if (count > 0) {
2004@@ -1341,8 +1339,9 @@
2005 Check if the block is modified and ready for flushing. If the block
2006 is ready to flush then flush the page and try to flush its neighbors.
2007
2008-@return TRUE if buf_pool mutex was released during this function.
2009-This does not guarantee that some pages were written as well.
2010+@return TRUE if, depending on the flush type, either LRU or flush list
2011+mutex was released during this function. This does not guarantee that some
2012+pages were written as well.
2013 Number of pages written are incremented to the count. */
2014 static
2015 ibool
2016@@ -1358,16 +1357,21 @@
2017 ulint* count) /*!< in/out: number of pages
2018 flushed */
2019 {
2020- ib_mutex_t* block_mutex;
2021+ ib_mutex_t* block_mutex = NULL;
2022 ibool flushed = FALSE;
2023 #ifdef UNIV_DEBUG
2024 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
2025 #endif /* UNIV_DEBUG */
2026
2027- ut_ad(buf_pool_mutex_own(buf_pool));
2028+ ut_ad((flush_type == BUF_FLUSH_LRU
2029+ && mutex_own(&buf_pool->LRU_list_mutex))
2030+ || (flush_type == BUF_FLUSH_LIST
2031+ && buf_flush_list_mutex_own(buf_pool)));
2032
2033- block_mutex = buf_page_get_mutex(bpage);
2034- mutex_enter(block_mutex);
2035+ if (flush_type == BUF_FLUSH_LRU) {
2036+ block_mutex = buf_page_get_mutex(bpage);
2037+ mutex_enter(block_mutex);
2038+ }
2039
2040 ut_a(buf_page_in_file(bpage));
2041
2042@@ -1378,14 +1382,20 @@
2043
2044 buf_pool = buf_pool_from_bpage(bpage);
2045
2046- buf_pool_mutex_exit(buf_pool);
2047+ if (flush_type == BUF_FLUSH_LRU) {
2048+ mutex_exit(&buf_pool->LRU_list_mutex);
2049+ }
2050
2051- /* These fields are protected by both the
2052- buffer pool mutex and block mutex. */
2053+ /* These fields are protected by the buf_page_get_mutex()
2054+ mutex. */
2055 space = buf_page_get_space(bpage);
2056 offset = buf_page_get_page_no(bpage);
2057
2058- mutex_exit(block_mutex);
2059+ if (flush_type == BUF_FLUSH_LRU) {
2060+ mutex_exit(block_mutex);
2061+ } else {
2062+ buf_flush_list_mutex_exit(buf_pool);
2063+ }
2064
2065 /* Try to flush also all the neighbors */
2066 *count += buf_flush_try_neighbors(space,
2067@@ -1394,13 +1404,20 @@
2068 *count,
2069 n_to_flush);
2070
2071- buf_pool_mutex_enter(buf_pool);
2072+ if (flush_type == BUF_FLUSH_LRU) {
2073+ mutex_enter(&buf_pool->LRU_list_mutex);
2074+ } else {
2075+ buf_flush_list_mutex_enter(buf_pool);
2076+ }
2077 flushed = TRUE;
2078- } else {
2079+ } else if (flush_type == BUF_FLUSH_LRU) {
2080 mutex_exit(block_mutex);
2081 }
2082
2083- ut_ad(buf_pool_mutex_own(buf_pool));
2084+ ut_ad((flush_type == BUF_FLUSH_LRU
2085+ && mutex_own(&buf_pool->LRU_list_mutex))
2086+ || (flush_type == BUF_FLUSH_LIST
2087+ && buf_flush_list_mutex_own(buf_pool)));
2088
2089 return(flushed);
2090 }
2091@@ -1428,22 +1445,31 @@
2092 ulint free_len = UT_LIST_GET_LEN(buf_pool->free);
2093 ulint lru_len = UT_LIST_GET_LEN(buf_pool->unzip_LRU);
2094
2095- ut_ad(buf_pool_mutex_own(buf_pool));
2096+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2097
2098 block = UT_LIST_GET_LAST(buf_pool->unzip_LRU);
2099 while (block != NULL && count < max
2100 && free_len < srv_LRU_scan_depth
2101 && lru_len > UT_LIST_GET_LEN(buf_pool->LRU) / 10) {
2102
2103+ ib_mutex_t* block_mutex = buf_page_get_mutex(&block->page);
2104+
2105 ++scanned;
2106+
2107+ mutex_enter(block_mutex);
2108+
2109 if (buf_LRU_free_page(&block->page, false)) {
2110- /* Block was freed. buf_pool->mutex potentially
2111+
2112+ mutex_exit(block_mutex);
2113+ /* Block was freed. LRU list mutex potentially
2114 released and reacquired */
2115 ++count;
2116+ mutex_enter(&buf_pool->LRU_list_mutex);
2117 block = UT_LIST_GET_LAST(buf_pool->unzip_LRU);
2118
2119 } else {
2120
2121+ mutex_exit(block_mutex);
2122 block = UT_LIST_GET_PREV(unzip_LRU, block);
2123 }
2124
2125@@ -1451,7 +1477,7 @@
2126 lru_len = UT_LIST_GET_LEN(buf_pool->unzip_LRU);
2127 }
2128
2129- ut_ad(buf_pool_mutex_own(buf_pool));
2130+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2131
2132 if (scanned) {
2133 MONITOR_INC_VALUE_CUMULATIVE(
2134@@ -1485,7 +1511,7 @@
2135 ulint free_len = UT_LIST_GET_LEN(buf_pool->free);
2136 ulint lru_len = UT_LIST_GET_LEN(buf_pool->LRU);
2137
2138- ut_ad(buf_pool_mutex_own(buf_pool));
2139+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2140
2141 bpage = UT_LIST_GET_LAST(buf_pool->LRU);
2142 while (bpage != NULL && count < max
2143@@ -1494,13 +1520,20 @@
2144
2145 ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2146 ibool evict;
2147-
2148- mutex_enter(block_mutex);
2149- evict = buf_flush_ready_for_replace(bpage);
2150- mutex_exit(block_mutex);
2151+ ulint failed_acquire;
2152
2153 ++scanned;
2154
2155+ failed_acquire = mutex_enter_nowait(block_mutex);
2156+
2157+ evict = UNIV_LIKELY(!failed_acquire)
2158+ && buf_flush_ready_for_replace(bpage);
2159+
2160+ if (UNIV_LIKELY(!failed_acquire) && !evict) {
2161+
2162+ mutex_exit(block_mutex);
2163+ }
2164+
2165 /* If the block is ready to be replaced we try to
2166 free it i.e.: put it on the free list.
2167 Otherwise we try to flush the block and its
2168@@ -1514,28 +1547,35 @@
2169 O(n*n). */
2170 if (evict) {
2171 if (buf_LRU_free_page(bpage, true)) {
2172- /* buf_pool->mutex was potentially
2173- released and reacquired. */
2174+
2175+ mutex_exit(block_mutex);
2176+ mutex_enter(&buf_pool->LRU_list_mutex);
2177 bpage = UT_LIST_GET_LAST(buf_pool->LRU);
2178 } else {
2179+
2180 bpage = UT_LIST_GET_PREV(LRU, bpage);
2181+ mutex_exit(block_mutex);
2182 }
2183- } else if (buf_flush_page_and_try_neighbors(
2184+ } else if (UNIV_LIKELY(!failed_acquire)) {
2185+
2186+ if (buf_flush_page_and_try_neighbors(
2187 bpage,
2188 BUF_FLUSH_LRU, max, &count)) {
2189
2190- /* buf_pool->mutex was released.
2191- Restart the scan. */
2192- bpage = UT_LIST_GET_LAST(buf_pool->LRU);
2193- } else {
2194- bpage = UT_LIST_GET_PREV(LRU, bpage);
2195+ /* LRU list mutex was released.
2196+ Restart the scan. */
2197+ bpage = UT_LIST_GET_LAST(buf_pool->LRU);
2198+ } else {
2199+
2200+ bpage = UT_LIST_GET_PREV(LRU, bpage);
2201+ }
2202 }
2203
2204 free_len = UT_LIST_GET_LEN(buf_pool->free);
2205 lru_len = UT_LIST_GET_LEN(buf_pool->LRU);
2206 }
2207
2208- ut_ad(buf_pool_mutex_own(buf_pool));
2209+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2210
2211 /* We keep track of all flushes happening as part of LRU
2212 flush. When estimating the desired rate at which flush_list
2213@@ -1604,8 +1644,6 @@
2214 ulint count = 0;
2215 ulint scanned = 0;
2216
2217- ut_ad(buf_pool_mutex_own(buf_pool));
2218-
2219 /* Start from the end of the list looking for a suitable
2220 block to be flushed. */
2221 buf_flush_list_mutex_enter(buf_pool);
2222@@ -1629,16 +1667,12 @@
2223 prev = UT_LIST_GET_PREV(list, bpage);
2224 buf_flush_set_hp(buf_pool, prev);
2225
2226- buf_flush_list_mutex_exit(buf_pool);
2227-
2228 #ifdef UNIV_DEBUG
2229 bool flushed =
2230 #endif /* UNIV_DEBUG */
2231 buf_flush_page_and_try_neighbors(
2232 bpage, BUF_FLUSH_LIST, min_n, &count);
2233
2234- buf_flush_list_mutex_enter(buf_pool);
2235-
2236 ut_ad(flushed || buf_flush_is_hp(buf_pool, prev));
2237
2238 if (!buf_flush_is_hp(buf_pool, prev)) {
2239@@ -1663,8 +1697,6 @@
2240 MONITOR_FLUSH_BATCH_SCANNED_PER_CALL,
2241 scanned);
2242
2243- ut_ad(buf_pool_mutex_own(buf_pool));
2244-
2245 return(count);
2246 }
2247
2248@@ -1701,13 +1733,13 @@
2249 || sync_thread_levels_empty_except_dict());
2250 #endif /* UNIV_SYNC_DEBUG */
2251
2252- buf_pool_mutex_enter(buf_pool);
2253-
2254- /* Note: The buffer pool mutex is released and reacquired within
2255+ /* Note: The buffer pool mutexes are released and reacquired within
2256 the flush functions. */
2257 switch (flush_type) {
2258 case BUF_FLUSH_LRU:
2259+ mutex_enter(&buf_pool->LRU_list_mutex);
2260 count = buf_do_LRU_batch(buf_pool, min_n);
2261+ mutex_exit(&buf_pool->LRU_list_mutex);
2262 break;
2263 case BUF_FLUSH_LIST:
2264 count = buf_do_flush_list_batch(buf_pool, min_n, lsn_limit);
2265@@ -1716,8 +1748,6 @@
2266 ut_error;
2267 }
2268
2269- buf_pool_mutex_exit(buf_pool);
2270-
2271 #ifdef UNIV_DEBUG
2272 if (buf_debug_prints && count > 0) {
2273 fprintf(stderr, flush_type == BUF_FLUSH_LRU
2274@@ -1765,21 +1795,21 @@
2275 buf_flush_t flush_type) /*!< in: BUF_FLUSH_LRU
2276 or BUF_FLUSH_LIST */
2277 {
2278- buf_pool_mutex_enter(buf_pool);
2279+ mutex_enter(&buf_pool->flush_state_mutex);
2280
2281 if (buf_pool->n_flush[flush_type] > 0
2282- || buf_pool->init_flush[flush_type] == TRUE) {
2283+ || buf_pool->init_flush[flush_type] == TRUE) {
2284
2285 /* There is already a flush batch of the same type running */
2286
2287- buf_pool_mutex_exit(buf_pool);
2288+ mutex_exit(&buf_pool->flush_state_mutex);
2289
2290 return(FALSE);
2291 }
2292
2293 buf_pool->init_flush[flush_type] = TRUE;
2294
2295- buf_pool_mutex_exit(buf_pool);
2296+ mutex_exit(&buf_pool->flush_state_mutex);
2297
2298 return(TRUE);
2299 }
2300@@ -1794,7 +1824,7 @@
2301 buf_flush_t flush_type) /*!< in: BUF_FLUSH_LRU
2302 or BUF_FLUSH_LIST */
2303 {
2304- buf_pool_mutex_enter(buf_pool);
2305+ mutex_enter(&buf_pool->flush_state_mutex);
2306
2307 buf_pool->init_flush[flush_type] = FALSE;
2308
2309@@ -1807,7 +1837,7 @@
2310 os_event_set(buf_pool->no_flush[flush_type]);
2311 }
2312
2313- buf_pool_mutex_exit(buf_pool);
2314+ mutex_exit(&buf_pool->flush_state_mutex);
2315 }
2316
2317 /******************************************************************//**
2318@@ -1989,7 +2019,7 @@
2319 ibool freed;
2320 bool evict_zip;
2321
2322- buf_pool_mutex_enter(buf_pool);
2323+ mutex_enter(&buf_pool->LRU_list_mutex);
2324
2325 for (bpage = UT_LIST_GET_LAST(buf_pool->LRU), scanned = 1;
2326 bpage != NULL;
2327@@ -2006,6 +2036,8 @@
2328 mutex_exit(block_mutex);
2329 }
2330
2331+ mutex_exit(&buf_pool->LRU_list_mutex);
2332+
2333 MONITOR_INC_VALUE_CUMULATIVE(
2334 MONITOR_LRU_SINGLE_FLUSH_SCANNED,
2335 MONITOR_LRU_SINGLE_FLUSH_SCANNED_NUM_CALL,
2336@@ -2014,22 +2046,20 @@
2337
2338 if (!bpage) {
2339 /* Can't find a single flushable page. */
2340- buf_pool_mutex_exit(buf_pool);
2341 return(FALSE);
2342 }
2343
2344- /* The following call will release the buffer pool and
2345- block mutex. */
2346+ /* The following call will release the buf_page_get_mutex() mutex. */
2347 buf_flush_page(buf_pool, bpage, BUF_FLUSH_SINGLE_PAGE, true);
2348
2349 /* At this point the page has been written to the disk.
2350- As we are not holding buffer pool or block mutex therefore
2351+ As we are not holding LRU list or buf_page_get_mutex() mutex therefore
2352 we cannot use the bpage safely. It may have been plucked out
2353 of the LRU list by some other thread or it may even have
2354 relocated in case of a compressed page. We need to start
2355 the scan of LRU list again to remove the block from the LRU
2356 list and put it on the free list. */
2357- buf_pool_mutex_enter(buf_pool);
2358+ mutex_enter(&buf_pool->LRU_list_mutex);
2359
2360 for (bpage = UT_LIST_GET_LAST(buf_pool->LRU);
2361 bpage != NULL;
2362@@ -2040,23 +2070,25 @@
2363 block_mutex = buf_page_get_mutex(bpage);
2364 mutex_enter(block_mutex);
2365 ready = buf_flush_ready_for_replace(bpage);
2366- mutex_exit(block_mutex);
2367 if (ready) {
2368 break;
2369 }
2370+ mutex_exit(block_mutex);
2371
2372 }
2373
2374 if (!bpage) {
2375 /* Can't find a single replaceable page. */
2376- buf_pool_mutex_exit(buf_pool);
2377+ mutex_exit(&buf_pool->LRU_list_mutex);
2378 return(FALSE);
2379 }
2380
2381 evict_zip = !buf_LRU_evict_from_unzip_LRU(buf_pool);;
2382
2383 freed = buf_LRU_free_page(bpage, evict_zip);
2384- buf_pool_mutex_exit(buf_pool);
2385+ if (!freed)
2386+ mutex_exit(&buf_pool->LRU_list_mutex);
2387+ mutex_exit(block_mutex);
2388
2389 return(freed);
2390 }
2391@@ -2082,9 +2114,7 @@
2392
2393 /* srv_LRU_scan_depth can be arbitrarily large value.
2394 We cap it with current LRU size. */
2395- buf_pool_mutex_enter(buf_pool);
2396 scan_depth = UT_LIST_GET_LEN(buf_pool->LRU);
2397- buf_pool_mutex_exit(buf_pool);
2398
2399 scan_depth = ut_min(srv_LRU_scan_depth, scan_depth);
2400
2401@@ -2143,15 +2173,15 @@
2402
2403 buf_pool = buf_pool_from_array(i);
2404
2405- buf_pool_mutex_enter(buf_pool);
2406+ mutex_enter(&buf_pool->flush_state_mutex);
2407
2408 if (buf_pool->n_flush[BUF_FLUSH_LRU] > 0
2409 || buf_pool->init_flush[BUF_FLUSH_LRU]) {
2410
2411- buf_pool_mutex_exit(buf_pool);
2412+ mutex_exit(&buf_pool->flush_state_mutex);
2413 buf_flush_wait_batch_end(buf_pool, BUF_FLUSH_LRU);
2414 } else {
2415- buf_pool_mutex_exit(buf_pool);
2416+ mutex_exit(&buf_pool->flush_state_mutex);
2417 }
2418 }
2419 }
2420@@ -2629,7 +2659,6 @@
2421 {
2422 ulint count = 0;
2423
2424- buf_pool_mutex_enter(buf_pool);
2425 buf_flush_list_mutex_enter(buf_pool);
2426
2427 buf_page_t* bpage;
2428@@ -2648,7 +2677,6 @@
2429 }
2430
2431 buf_flush_list_mutex_exit(buf_pool);
2432- buf_pool_mutex_exit(buf_pool);
2433
2434 return(count);
2435 }
2436
2437=== modified file 'Percona-Server/storage/innobase/buf/buf0lru.cc'
2438--- Percona-Server/storage/innobase/buf/buf0lru.cc 2013-08-16 09:11:51 +0000
2439+++ Percona-Server/storage/innobase/buf/buf0lru.cc 2013-09-20 05:29:11 +0000
2440@@ -75,7 +75,7 @@
2441 /** When dropping the search hash index entries before deleting an ibd
2442 file, we build a local array of pages belonging to that tablespace
2443 in the buffer pool. Following is the size of that array.
2444-We also release buf_pool->mutex after scanning this many pages of the
2445+We also release buf_pool->LRU_list_mutex after scanning this many pages of the
2446 flush_list when dropping a table. This is to ensure that other threads
2447 are not blocked for extended period of time when using very large
2448 buffer pools. */
2449@@ -133,7 +133,7 @@
2450 If the block is compressed-only (BUF_BLOCK_ZIP_PAGE),
2451 the object will be freed.
2452
2453-The caller must hold buf_pool->mutex, the buf_page_get_mutex() mutex
2454+The caller must hold buf_pool->LRU_list_mutex, the buf_page_get_mutex() mutex
2455 and the appropriate hash_lock. This function will release the
2456 buf_page_get_mutex() and the hash_lock.
2457
2458@@ -170,7 +170,7 @@
2459 buf_page_t* bpage, /*!< in: control block */
2460 buf_pool_t* buf_pool) /*!< in: buffer pool instance */
2461 {
2462- ut_ad(buf_pool_mutex_own(buf_pool));
2463+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2464 ulint zip_size = page_zip_get_size(&bpage->zip);
2465 buf_pool->stat.LRU_bytes += zip_size ? zip_size : UNIV_PAGE_SIZE;
2466 ut_ad(buf_pool->stat.LRU_bytes <= buf_pool->curr_pool_size);
2467@@ -189,7 +189,7 @@
2468 ulint io_avg;
2469 ulint unzip_avg;
2470
2471- ut_ad(buf_pool_mutex_own(buf_pool));
2472+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2473
2474 /* If the unzip_LRU list is empty, we can only use the LRU. */
2475 if (UT_LIST_GET_LEN(buf_pool->unzip_LRU) == 0) {
2476@@ -276,7 +276,7 @@
2477 page_arr = static_cast<ulint*>(ut_malloc(
2478 sizeof(ulint) * BUF_LRU_DROP_SEARCH_SIZE));
2479
2480- buf_pool_mutex_enter(buf_pool);
2481+ mutex_enter(&buf_pool->LRU_list_mutex);
2482 num_entries = 0;
2483
2484 scan_again:
2485@@ -285,6 +285,7 @@
2486 while (bpage != NULL) {
2487 buf_page_t* prev_bpage;
2488 ibool is_fixed;
2489+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2490
2491 prev_bpage = UT_LIST_GET_PREV(LRU, bpage);
2492
2493@@ -301,10 +302,10 @@
2494 continue;
2495 }
2496
2497- mutex_enter(&((buf_block_t*) bpage)->mutex);
2498+ mutex_enter(block_mutex);
2499 is_fixed = bpage->buf_fix_count > 0
2500 || !((buf_block_t*) bpage)->index;
2501- mutex_exit(&((buf_block_t*) bpage)->mutex);
2502+ mutex_exit(block_mutex);
2503
2504 if (is_fixed) {
2505 goto next_page;
2506@@ -320,18 +321,18 @@
2507 goto next_page;
2508 }
2509
2510- /* Array full. We release the buf_pool->mutex to obey
2511+ /* Array full. We release the buf_pool->LRU_list_mutex to obey
2512 the latching order. */
2513- buf_pool_mutex_exit(buf_pool);
2514+ mutex_exit(&buf_pool->LRU_list_mutex);
2515
2516 buf_LRU_drop_page_hash_batch(
2517 id, zip_size, page_arr, num_entries);
2518
2519 num_entries = 0;
2520
2521- buf_pool_mutex_enter(buf_pool);
2522+ mutex_enter(&buf_pool->LRU_list_mutex);
2523
2524- /* Note that we released the buf_pool mutex above
2525+ /* Note that we released the buf_pool->LRU_list_mutex above
2526 after reading the prev_bpage during processing of a
2527 page_hash_batch (i.e.: when the array was full).
2528 Because prev_bpage could belong to a compressed-only
2529@@ -345,15 +346,15 @@
2530 guarantee that ALL such entries will be dropped. */
2531
2532 /* If, however, bpage has been removed from LRU list
2533- to the free list then we should restart the scan.
2534- bpage->state is protected by buf_pool mutex. */
2535+ to the free list then we should restart the scan. */
2536+
2537 if (bpage
2538 && buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE) {
2539 goto scan_again;
2540 }
2541 }
2542
2543- buf_pool_mutex_exit(buf_pool);
2544+ mutex_exit(&buf_pool->LRU_list_mutex);
2545
2546 /* Drop any remaining batch of search hashed pages. */
2547 buf_LRU_drop_page_hash_batch(id, zip_size, page_arr, num_entries);
2548@@ -373,27 +374,25 @@
2549 buf_pool_t* buf_pool, /*!< in/out: buffer pool instance */
2550 buf_page_t* bpage) /*!< in/out: current page */
2551 {
2552- ib_mutex_t* block_mutex;
2553+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2554
2555- ut_ad(buf_pool_mutex_own(buf_pool));
2556+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2557+ ut_ad(mutex_own(block_mutex));
2558 ut_ad(buf_page_in_file(bpage));
2559
2560- block_mutex = buf_page_get_mutex(bpage);
2561-
2562- mutex_enter(block_mutex);
2563 /* "Fix" the block so that the position cannot be
2564 changed after we release the buffer pool and
2565 block mutexes. */
2566 buf_page_set_sticky(bpage);
2567
2568- /* Now it is safe to release the buf_pool->mutex. */
2569- buf_pool_mutex_exit(buf_pool);
2570+ /* Now it is safe to release the LRU list mutex */
2571+ mutex_exit(&buf_pool->LRU_list_mutex);
2572
2573 mutex_exit(block_mutex);
2574 /* Try and force a context switch. */
2575 os_thread_yield();
2576
2577- buf_pool_mutex_enter(buf_pool);
2578+ mutex_enter(&buf_pool->LRU_list_mutex);
2579
2580 mutex_enter(block_mutex);
2581 /* "Unfix" the block now that we have both the
2582@@ -413,21 +412,47 @@
2583 /*================*/
2584 buf_pool_t* buf_pool, /*!< in/out: buffer pool instance */
2585 buf_page_t* bpage, /*!< in/out: bpage to remove */
2586- ulint processed) /*!< in: number of pages processed */
2587+ ulint processed, /*!< in: number of pages processed */
2588+ bool* must_restart) /*!< in/out: if true, we have to
2589+ restart the flush list scan */
2590 {
2591 /* Every BUF_LRU_DROP_SEARCH_SIZE iterations in the
2592- loop we release buf_pool->mutex to let other threads
2593+ loop we release buf_pool->LRU_list_mutex to let other threads
2594 do their job but only if the block is not IO fixed. This
2595 ensures that the block stays in its position in the
2596 flush_list. */
2597
2598 if (bpage != NULL
2599 && processed >= BUF_LRU_DROP_SEARCH_SIZE
2600- && buf_page_get_io_fix(bpage) == BUF_IO_NONE) {
2601+ && buf_page_get_io_fix_unlocked(bpage) == BUF_IO_NONE) {
2602+
2603+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2604
2605 buf_flush_list_mutex_exit(buf_pool);
2606
2607- /* Release the buffer pool and block mutex
2608+ /* We don't have to worry about bpage becoming a dangling
2609+ pointer by a compressed page flush list relocation because
2610+ buf_page_get_gen() won't be called for pages from this
2611+ tablespace. */
2612+
2613+ mutex_enter(block_mutex);
2614+ /* Recheck the I/O fix and the flush list presence now that we
2615+ hold the right mutex */
2616+ if (UNIV_UNLIKELY(buf_page_get_io_fix(bpage) != BUF_IO_NONE
2617+ || bpage->oldest_modification == 0)) {
2618+
2619+ mutex_exit(block_mutex);
2620+
2621+ *must_restart = true;
2622+
2623+ buf_flush_list_mutex_enter(buf_pool);
2624+
2625+ return false;
2626+ }
2627+
2628+ *must_restart = false;
2629+
2630+ /* Release the LRU list and buf_page_get_mutex() mutex
2631 to give the other threads a go. */
2632
2633 buf_flush_yield(buf_pool, bpage);
2634@@ -456,18 +481,22 @@
2635 /*=====================*/
2636 buf_pool_t* buf_pool, /*!< in/out: buffer pool instance */
2637 buf_page_t* bpage, /*!< in/out: bpage to remove */
2638- bool flush) /*!< in: flush to disk if true but
2639+ bool flush, /*!< in: flush to disk if true but
2640 don't remove else remove without
2641 flushing to disk */
2642+ bool* must_restart) /*!< in/out: if true, must restart the
2643+ flush list scan */
2644 {
2645- ut_ad(buf_pool_mutex_own(buf_pool));
2646+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2647+
2648+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2649 ut_ad(buf_flush_list_mutex_own(buf_pool));
2650
2651- /* bpage->space and bpage->io_fix are protected by
2652- buf_pool->mutex and block_mutex. It is safe to check
2653- them while holding buf_pool->mutex only. */
2654+ /* It is safe to check bpage->space and bpage->io_fix while holding
2655+ buf_pool->LRU_list_mutex only. */
2656
2657- if (buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
2658+ if (UNIV_UNLIKELY(buf_page_get_io_fix_unlocked(bpage)
2659+ != BUF_IO_NONE)) {
2660
2661 /* We cannot remove this page during this scan
2662 yet; maybe the system is currently reading it
2663@@ -476,22 +505,31 @@
2664
2665 }
2666
2667- ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2668 bool processed = false;
2669
2670- /* We have to release the flush_list_mutex to obey the
2671- latching order. We are however guaranteed that the page
2672- will stay in the flush_list and won't be relocated because
2673- buf_flush_remove() and buf_flush_relocate_on_flush_list()
2674- need buf_pool->mutex as well. */
2675-
2676 buf_flush_list_mutex_exit(buf_pool);
2677
2678+ /* We don't have to worry about bpage becoming a dangling
2679+ pointer by a compressed page flush list relocation because
2680+ buf_page_get_gen() won't be called for pages from this
2681+ tablespace. */
2682+
2683 mutex_enter(block_mutex);
2684
2685- ut_ad(bpage->oldest_modification != 0);
2686-
2687- if (!flush) {
2688+ /* Recheck the page I/O fix and the flush list presence now
2689+ that we hold the right mutex. */
2690+ if (UNIV_UNLIKELY(buf_page_get_io_fix(bpage) != BUF_IO_NONE
2691+ || bpage->oldest_modification == 0)) {
2692+
2693+ /* The page became I/O-fixed or is not on the flush
2694+ list anymore; this invalidates any flush-list-page
2695+ pointers we have. */
2696+
2697+ mutex_exit(block_mutex);
2698+
2699+ *must_restart = true;
2700+
2701+ } else if (!flush) {
2702
2703 buf_flush_remove(bpage);
2704
2705@@ -502,8 +540,10 @@
2706 } else if (buf_flush_ready_for_flush(bpage,
2707 BUF_FLUSH_SINGLE_PAGE)) {
2708
2709- /* The following call will release the buffer pool
2710- and block mutex. */
2711+ mutex_exit(&buf_pool->LRU_list_mutex);
2712+
2713+ /* The following call will release the buf_page_get_mutex()
2714+ mutex. */
2715 buf_flush_page(buf_pool, bpage, BUF_FLUSH_SINGLE_PAGE, false);
2716 ut_ad(!mutex_own(block_mutex));
2717
2718@@ -511,7 +551,7 @@
2719 post the writes to the operating system */
2720 os_aio_simulated_wake_handler_threads();
2721
2722- buf_pool_mutex_enter(buf_pool);
2723+ mutex_enter(&buf_pool->LRU_list_mutex);
2724
2725 processed = true;
2726 } else {
2727@@ -525,7 +565,7 @@
2728 buf_flush_list_mutex_enter(buf_pool);
2729
2730 ut_ad(!mutex_own(block_mutex));
2731- ut_ad(buf_pool_mutex_own(buf_pool));
2732+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2733
2734 return(processed);
2735 }
2736@@ -558,6 +598,7 @@
2737 buf_flush_list_mutex_enter(buf_pool);
2738
2739 rescan:
2740+ bool must_restart = false;
2741 bool all_freed = true;
2742
2743 for (bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
2744@@ -576,15 +617,16 @@
2745 /* Skip this block, as it does not belong to
2746 the target space. */
2747
2748- } else if (!buf_flush_or_remove_page(buf_pool, bpage, flush)) {
2749+ } else if (!buf_flush_or_remove_page(buf_pool, bpage, flush,
2750+ &must_restart)) {
2751
2752 /* Remove was unsuccessful, we have to try again
2753 by scanning the entire list from the end.
2754 This also means that we never released the
2755- buf_pool mutex. Therefore we can trust the prev
2756+ flush list mutex. Therefore we can trust the prev
2757 pointer.
2758 buf_flush_or_remove_page() released the
2759- flush list mutex but not the buf_pool mutex.
2760+ flush list mutex but not the LRU list mutex.
2761 Therefore it is possible that a new page was
2762 added to the flush list. For example, in case
2763 where we are at the head of the flush list and
2764@@ -602,17 +644,23 @@
2765 } else if (flush) {
2766
2767 /* The processing was successful. And during the
2768- processing we have released the buf_pool mutex
2769+ processing we have released all the buf_pool mutexes
2770 when calling buf_page_flush(). We cannot trust
2771 prev pointer. */
2772 goto rescan;
2773+ } else if (UNIV_UNLIKELY(must_restart)) {
2774+
2775+ ut_ad(!all_freed);
2776+ break;
2777 }
2778
2779 ++processed;
2780
2781 /* Yield if we have hogged the CPU and mutexes for too long. */
2782- if (buf_flush_try_yield(buf_pool, prev, processed)) {
2783+ if (buf_flush_try_yield(buf_pool, prev, processed,
2784+ &must_restart)) {
2785
2786+ ut_ad(!must_restart);
2787 /* Reset the batch size counter if we had to yield. */
2788
2789 processed = 0;
2790@@ -658,11 +706,11 @@
2791 dberr_t err;
2792
2793 do {
2794- buf_pool_mutex_enter(buf_pool);
2795+ mutex_enter(&buf_pool->LRU_list_mutex);
2796
2797 err = buf_flush_or_remove_pages(buf_pool, id, flush, trx);
2798
2799- buf_pool_mutex_exit(buf_pool);
2800+ mutex_exit(&buf_pool->LRU_list_mutex);
2801
2802 ut_ad(buf_flush_validate(buf_pool));
2803
2804@@ -695,7 +743,7 @@
2805 ibool all_freed;
2806
2807 scan_again:
2808- buf_pool_mutex_enter(buf_pool);
2809+ mutex_enter(&buf_pool->LRU_list_mutex);
2810
2811 all_freed = TRUE;
2812
2813@@ -712,15 +760,16 @@
2814
2815 prev_bpage = UT_LIST_GET_PREV(LRU, bpage);
2816
2817- /* bpage->space and bpage->io_fix are protected by
2818- buf_pool->mutex and the block_mutex. It is safe to check
2819- them while holding buf_pool->mutex only. */
2820+ /* It is safe to check bpage->space and bpage->io_fix while
2821+ holding buf_pool->LRU_list_mutex only and later recheck
2822+ while holding the buf_page_get_mutex() mutex. */
2823
2824 if (buf_page_get_space(bpage) != id) {
2825 /* Skip this block, as it does not belong to
2826 the space that is being invalidated. */
2827 goto next_page;
2828- } else if (buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
2829+ } else if (UNIV_UNLIKELY(buf_page_get_io_fix_unlocked(bpage)
2830+ != BUF_IO_NONE)) {
2831 /* We cannot remove this page during this scan
2832 yet; maybe the system is currently reading it
2833 in, or flushing the modifications to the file */
2834@@ -738,7 +787,11 @@
2835 block_mutex = buf_page_get_mutex(bpage);
2836 mutex_enter(block_mutex);
2837
2838- if (bpage->buf_fix_count > 0) {
2839+ if (UNIV_UNLIKELY(
2840+ buf_page_get_space(bpage) != id
2841+ || bpage->buf_fix_count > 0
2842+ || (buf_page_get_io_fix(bpage)
2843+ != BUF_IO_NONE))) {
2844
2845 mutex_exit(block_mutex);
2846
2847@@ -772,15 +825,15 @@
2848 ulint page_no;
2849 ulint zip_size;
2850
2851- buf_pool_mutex_exit(buf_pool);
2852+ mutex_exit(&buf_pool->LRU_list_mutex);
2853
2854 zip_size = buf_page_get_zip_size(bpage);
2855 page_no = buf_page_get_page_no(bpage);
2856
2857+ mutex_exit(block_mutex);
2858+
2859 rw_lock_x_unlock(hash_lock);
2860
2861- mutex_exit(block_mutex);
2862-
2863 /* Note that the following call will acquire
2864 and release block->lock X-latch. */
2865
2866@@ -800,7 +853,10 @@
2867 /* Remove from the LRU list. */
2868
2869 if (buf_LRU_block_remove_hashed(bpage, true)) {
2870+
2871+ mutex_enter(block_mutex);
2872 buf_LRU_block_free_hashed_page((buf_block_t*) bpage);
2873+ mutex_exit(block_mutex);
2874 } else {
2875 ut_ad(block_mutex == &buf_pool->zip_mutex);
2876 }
2877@@ -817,7 +873,7 @@
2878 bpage = prev_bpage;
2879 }
2880
2881- buf_pool_mutex_exit(buf_pool);
2882+ mutex_exit(&buf_pool->LRU_list_mutex);
2883
2884 if (!all_freed) {
2885 os_thread_sleep(20000);
2886@@ -921,7 +977,8 @@
2887 buf_page_t* b;
2888 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
2889
2890- ut_ad(buf_pool_mutex_own(buf_pool));
2891+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2892+ ut_ad(mutex_own(&buf_pool->zip_mutex));
2893 ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_PAGE);
2894
2895 /* Find the first successor of bpage in the LRU list
2896@@ -961,7 +1018,7 @@
2897 ibool freed;
2898 ulint scanned;
2899
2900- ut_ad(buf_pool_mutex_own(buf_pool));
2901+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2902
2903 if (!buf_LRU_evict_from_unzip_LRU(buf_pool)) {
2904 return(FALSE);
2905@@ -976,12 +1033,16 @@
2906 buf_block_t* prev_block = UT_LIST_GET_PREV(unzip_LRU,
2907 block);
2908
2909+ mutex_enter(&block->mutex);
2910+
2911 ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
2912 ut_ad(block->in_unzip_LRU_list);
2913 ut_ad(block->page.in_LRU_list);
2914
2915 freed = buf_LRU_free_page(&block->page, false);
2916
2917+ mutex_exit(&block->mutex);
2918+
2919 block = prev_block;
2920 }
2921
2922@@ -1009,7 +1070,7 @@
2923 ibool freed;
2924 ulint scanned;
2925
2926- ut_ad(buf_pool_mutex_own(buf_pool));
2927+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
2928
2929 for (bpage = UT_LIST_GET_LAST(buf_pool->LRU),
2930 scanned = 1, freed = FALSE;
2931@@ -1020,12 +1081,19 @@
2932 unsigned accessed;
2933 buf_page_t* prev_bpage = UT_LIST_GET_PREV(LRU,
2934 bpage);
2935+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
2936
2937 ut_ad(buf_page_in_file(bpage));
2938 ut_ad(bpage->in_LRU_list);
2939
2940 accessed = buf_page_is_accessed(bpage);
2941+
2942+ mutex_enter(block_mutex);
2943+
2944 freed = buf_LRU_free_page(bpage, true);
2945+
2946+ mutex_exit(block_mutex);
2947+
2948 if (freed && !accessed) {
2949 /* Keep track of pages that are evicted without
2950 ever being accessed. This gives us a measure of
2951@@ -1057,11 +1125,24 @@
2952 if TRUE, otherwise scan only
2953 'old' blocks. */
2954 {
2955- ut_ad(buf_pool_mutex_own(buf_pool));
2956-
2957- return(buf_LRU_free_from_unzip_LRU_list(buf_pool, scan_all)
2958- || buf_LRU_free_from_common_LRU_list(
2959- buf_pool, scan_all));
2960+ ibool freed = FALSE;
2961+ bool use_unzip_list = UT_LIST_GET_LEN(buf_pool->unzip_LRU) > 0;
2962+
2963+ mutex_enter(&buf_pool->LRU_list_mutex);
2964+
2965+ if (use_unzip_list) {
2966+ freed = buf_LRU_free_from_unzip_LRU_list(buf_pool, scan_all);
2967+ }
2968+
2969+ if (!freed) {
2970+ freed = buf_LRU_free_from_common_LRU_list(buf_pool, scan_all);
2971+ }
2972+
2973+ if (!freed) {
2974+ mutex_exit(&buf_pool->LRU_list_mutex);
2975+ }
2976+
2977+ return(freed);
2978 }
2979
2980 /******************************************************************//**
2981@@ -1082,8 +1163,6 @@
2982
2983 buf_pool = buf_pool_from_array(i);
2984
2985- buf_pool_mutex_enter(buf_pool);
2986-
2987 if (!recv_recovery_on
2988 && UT_LIST_GET_LEN(buf_pool->free)
2989 + UT_LIST_GET_LEN(buf_pool->LRU)
2990@@ -1091,8 +1170,6 @@
2991
2992 ret = TRUE;
2993 }
2994-
2995- buf_pool_mutex_exit(buf_pool);
2996 }
2997
2998 return(ret);
2999@@ -1110,9 +1187,9 @@
3000 {
3001 buf_block_t* block;
3002
3003- ut_ad(buf_pool_mutex_own(buf_pool));
3004+ mutex_enter(&buf_pool->free_list_mutex);
3005
3006- block = (buf_block_t*) UT_LIST_GET_FIRST(buf_pool->free);
3007+ block = (buf_block_t*) UT_LIST_GET_LAST(buf_pool->free);
3008
3009 if (block) {
3010
3011@@ -1122,18 +1199,23 @@
3012 ut_ad(!block->page.in_LRU_list);
3013 ut_a(!buf_page_in_file(&block->page));
3014 UT_LIST_REMOVE(list, buf_pool->free, (&block->page));
3015-
3016- mutex_enter(&block->mutex);
3017-
3018 buf_block_set_state(block, BUF_BLOCK_READY_FOR_USE);
3019+
3020+ mutex_exit(&buf_pool->free_list_mutex);
3021+
3022+ mutex_enter(&block->mutex);
3023+
3024 UNIV_MEM_ALLOC(block->frame, UNIV_PAGE_SIZE);
3025
3026 ut_ad(buf_pool_from_block(block) == buf_pool);
3027
3028 mutex_exit(&block->mutex);
3029+ return(block);
3030 }
3031
3032- return(block);
3033+ mutex_exit(&buf_pool->free_list_mutex);
3034+
3035+ return(NULL);
3036 }
3037
3038 /******************************************************************//**
3039@@ -1147,8 +1229,6 @@
3040 /*===================================*/
3041 const buf_pool_t* buf_pool) /*!< in: buffer pool instance */
3042 {
3043- ut_ad(buf_pool_mutex_own(buf_pool));
3044-
3045 if (!recv_recovery_on && UT_LIST_GET_LEN(buf_pool->free)
3046 + UT_LIST_GET_LEN(buf_pool->LRU) < buf_pool->curr_size / 20) {
3047 ut_print_timestamp(stderr);
3048@@ -1253,10 +1333,10 @@
3049 ibool mon_value_was = FALSE;
3050 ibool started_monitor = FALSE;
3051
3052+ ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
3053+
3054 MONITOR_INC(MONITOR_LRU_GET_FREE_SEARCH);
3055 loop:
3056- buf_pool_mutex_enter(buf_pool);
3057-
3058 buf_LRU_check_size_of_non_data_objects(buf_pool);
3059
3060 /* If there is a block in the free list, take it */
3061@@ -1264,7 +1344,6 @@
3062
3063 if (block) {
3064
3065- buf_pool_mutex_exit(buf_pool);
3066 ut_ad(buf_pool_from_block(block) == buf_pool);
3067 memset(&block->page.zip, 0, sizeof block->page.zip);
3068
3069@@ -1275,22 +1354,28 @@
3070 return(block);
3071 }
3072
3073+ mutex_enter(&buf_pool->flush_state_mutex);
3074+
3075 if (buf_pool->init_flush[BUF_FLUSH_LRU]
3076 && srv_use_doublewrite_buf
3077 && buf_dblwr != NULL) {
3078
3079+ mutex_exit(&buf_pool->flush_state_mutex);
3080+
3081 /* If there is an LRU flush happening in the background
3082 then we wait for it to end instead of trying a single
3083 page flush. If, however, we are not using doublewrite
3084 buffer then it is better to do our own single page
3085 flush instead of waiting for LRU flush to end. */
3086- buf_pool_mutex_exit(buf_pool);
3087 buf_flush_wait_batch_end(buf_pool, BUF_FLUSH_LRU);
3088 goto loop;
3089 }
3090
3091+ mutex_exit(&buf_pool->flush_state_mutex);
3092+
3093 freed = FALSE;
3094 if (buf_pool->try_LRU_scan || n_iterations > 0) {
3095+
3096 /* If no block was in the free list, search from the
3097 end of the LRU list and try to free a block there.
3098 If we are doing for the first time we'll scan only
3099@@ -1308,8 +1393,6 @@
3100 }
3101 }
3102
3103- buf_pool_mutex_exit(buf_pool);
3104-
3105 if (freed) {
3106 goto loop;
3107
3108@@ -1395,7 +1478,7 @@
3109 ulint new_len;
3110
3111 ut_a(buf_pool->LRU_old);
3112- ut_ad(buf_pool_mutex_own(buf_pool));
3113+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3114 ut_ad(buf_pool->LRU_old_ratio >= BUF_LRU_OLD_RATIO_MIN);
3115 ut_ad(buf_pool->LRU_old_ratio <= BUF_LRU_OLD_RATIO_MAX);
3116 #if BUF_LRU_OLD_RATIO_MIN * BUF_LRU_OLD_MIN_LEN <= BUF_LRU_OLD_RATIO_DIV * (BUF_LRU_OLD_TOLERANCE + 5)
3117@@ -1461,7 +1544,7 @@
3118 {
3119 buf_page_t* bpage;
3120
3121- ut_ad(buf_pool_mutex_own(buf_pool));
3122+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3123 ut_a(UT_LIST_GET_LEN(buf_pool->LRU) == BUF_LRU_OLD_MIN_LEN);
3124
3125 /* We first initialize all blocks in the LRU list as old and then use
3126@@ -1496,7 +1579,7 @@
3127 ut_ad(buf_pool);
3128 ut_ad(bpage);
3129 ut_ad(buf_page_in_file(bpage));
3130- ut_ad(buf_pool_mutex_own(buf_pool));
3131+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3132
3133 if (buf_page_belongs_to_unzip_LRU(bpage)) {
3134 buf_block_t* block = (buf_block_t*) bpage;
3135@@ -1521,7 +1604,7 @@
3136
3137 ut_ad(buf_pool);
3138 ut_ad(bpage);
3139- ut_ad(buf_pool_mutex_own(buf_pool));
3140+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3141
3142 ut_a(buf_page_in_file(bpage));
3143
3144@@ -1601,7 +1684,7 @@
3145
3146 ut_ad(buf_pool);
3147 ut_ad(block);
3148- ut_ad(buf_pool_mutex_own(buf_pool));
3149+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3150
3151 ut_a(buf_page_belongs_to_unzip_LRU(&block->page));
3152
3153@@ -1630,7 +1713,7 @@
3154
3155 ut_ad(buf_pool);
3156 ut_ad(bpage);
3157- ut_ad(buf_pool_mutex_own(buf_pool));
3158+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3159
3160 ut_a(buf_page_in_file(bpage));
3161
3162@@ -1686,7 +1769,7 @@
3163
3164 ut_ad(buf_pool);
3165 ut_ad(bpage);
3166- ut_ad(buf_pool_mutex_own(buf_pool));
3167+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3168
3169 ut_a(buf_page_in_file(bpage));
3170 ut_ad(!bpage->in_LRU_list);
3171@@ -1770,7 +1853,7 @@
3172 {
3173 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
3174
3175- ut_ad(buf_pool_mutex_own(buf_pool));
3176+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3177
3178 if (bpage->old) {
3179 buf_pool->stat.n_pages_made_young++;
3180@@ -1796,12 +1879,14 @@
3181 Try to free a block. If bpage is a descriptor of a compressed-only
3182 page, the descriptor object will be freed as well.
3183
3184-NOTE: If this function returns true, it will temporarily
3185-release buf_pool->mutex. Furthermore, the page frame will no longer be
3186-accessible via bpage.
3187-
3188-The caller must hold buf_pool->mutex and must not hold any
3189-buf_page_get_mutex() when calling this function.
3190+NOTE: If this function returns true, it will release the LRU list mutex,
3191+and temporarily release and relock the buf_page_get_mutex() mutex.
3192+Furthermore, the page frame will no longer be accessible via bpage. If this
3193+function returns false, the buf_page_get_mutex() might be temporarily released
3194+and relocked too.
3195+
3196+The caller must hold the LRU list and buf_page_get_mutex() mutexes.
3197+
3198 @return true if freed, false otherwise. */
3199 UNIV_INTERN
3200 bool
3201@@ -1819,13 +1904,11 @@
3202
3203 ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
3204
3205- ut_ad(buf_pool_mutex_own(buf_pool));
3206+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3207+ ut_ad(mutex_own(block_mutex));
3208 ut_ad(buf_page_in_file(bpage));
3209 ut_ad(bpage->in_LRU_list);
3210
3211- rw_lock_x_lock(hash_lock);
3212- mutex_enter(block_mutex);
3213-
3214 #if UNIV_WORD_SIZE == 4
3215 /* On 32-bit systems, there is no padding in buf_page_t. On
3216 other systems, Valgrind could complain about uninitialized pad
3217@@ -1836,7 +1919,7 @@
3218 if (!buf_page_can_relocate(bpage)) {
3219
3220 /* Do not free buffer-fixed or I/O-fixed blocks. */
3221- goto func_exit;
3222+ return(false);
3223 }
3224
3225 #ifdef UNIV_IBUF_COUNT_DEBUG
3226@@ -1848,7 +1931,7 @@
3227 /* Do not completely free dirty blocks. */
3228
3229 if (bpage->oldest_modification) {
3230- goto func_exit;
3231+ return(false);
3232 }
3233 } else if ((bpage->oldest_modification)
3234 && (buf_page_get_state(bpage)
3235@@ -1857,18 +1940,13 @@
3236 ut_ad(buf_page_get_state(bpage)
3237 == BUF_BLOCK_ZIP_DIRTY);
3238
3239-func_exit:
3240- rw_lock_x_unlock(hash_lock);
3241- mutex_exit(block_mutex);
3242 return(false);
3243
3244 } else if (buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE) {
3245 b = buf_page_alloc_descriptor();
3246 ut_a(b);
3247- memcpy(b, bpage, sizeof *b);
3248 }
3249
3250- ut_ad(buf_pool_mutex_own(buf_pool));
3251 ut_ad(buf_page_in_file(bpage));
3252 ut_ad(bpage->in_LRU_list);
3253 ut_ad(!bpage->in_flush_list == !bpage->oldest_modification);
3254@@ -1887,12 +1965,45 @@
3255 }
3256 #endif /* UNIV_DEBUG */
3257
3258-#ifdef UNIV_SYNC_DEBUG
3259- ut_ad(rw_lock_own(hash_lock, RW_LOCK_EX));
3260-#endif /* UNIV_SYNC_DEBUG */
3261- ut_ad(buf_page_can_relocate(bpage));
3262+ mutex_exit(block_mutex);
3263+
3264+ rw_lock_x_lock(hash_lock);
3265+ mutex_enter(block_mutex);
3266+
3267+ if (UNIV_UNLIKELY(!buf_page_can_relocate(bpage)
3268+ || ((zip || !bpage->zip.data)
3269+ && bpage->oldest_modification))) {
3270+
3271+not_freed:
3272+ rw_lock_x_unlock(hash_lock);
3273+ if (b) {
3274+ buf_page_free_descriptor(b);
3275+ }
3276+
3277+ return(false);
3278+ } else if (UNIV_UNLIKELY(bpage->oldest_modification
3279+ && (buf_page_get_state(bpage)
3280+ != BUF_BLOCK_FILE_PAGE))) {
3281+
3282+ ut_ad(buf_page_get_state(bpage)
3283+ == BUF_BLOCK_ZIP_DIRTY);
3284+ goto not_freed;
3285+ }
3286+
3287+ if (b) {
3288+ memcpy(b, bpage, sizeof *b);
3289+ }
3290
3291 if (!buf_LRU_block_remove_hashed(bpage, zip)) {
3292+
3293+ mutex_exit(&buf_pool->LRU_list_mutex);
3294+
3295+ if (b) {
3296+ buf_page_free_descriptor(b);
3297+ }
3298+
3299+ mutex_enter(block_mutex);
3300+
3301 return(true);
3302 }
3303
3304@@ -1997,6 +2108,8 @@
3305 buf_LRU_add_block_low(b, buf_page_is_old(b));
3306 }
3307
3308+ mutex_enter(&buf_pool->zip_mutex);
3309+ rw_lock_x_unlock(hash_lock);
3310 if (b->state == BUF_BLOCK_ZIP_PAGE) {
3311 #if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
3312 buf_LRU_insert_zip_clean(b);
3313@@ -2008,35 +2121,16 @@
3314
3315 bpage->zip.data = NULL;
3316 page_zip_set_size(&bpage->zip, 0);
3317- mutex_exit(block_mutex);
3318
3319 /* Prevent buf_page_get_gen() from
3320- decompressing the block while we release
3321- buf_pool->mutex and block_mutex. */
3322- block_mutex = buf_page_get_mutex(b);
3323- mutex_enter(block_mutex);
3324+ decompressing the block while we release block_mutex. */
3325 buf_page_set_sticky(b);
3326- mutex_exit(block_mutex);
3327-
3328- rw_lock_x_unlock(hash_lock);
3329-
3330- } else {
3331-
3332- /* There can be multiple threads doing an LRU scan to
3333- free a block. The page_cleaner thread can be doing an
3334- LRU batch whereas user threads can potentially be doing
3335- multiple single page flushes. As we release
3336- buf_pool->mutex below we need to make sure that no one
3337- else considers this block as a victim for page
3338- replacement. This block is already out of page_hash
3339- and we are about to remove it from the LRU list and put
3340- it on the free list. */
3341- mutex_enter(block_mutex);
3342- buf_page_set_sticky(bpage);
3343- mutex_exit(block_mutex);
3344+ mutex_exit(&buf_pool->zip_mutex);
3345+ mutex_exit(block_mutex);
3346+
3347 }
3348
3349- buf_pool_mutex_exit(buf_pool);
3350+ mutex_exit(&buf_pool->LRU_list_mutex);
3351
3352 /* Remove possible adaptive hash index on the page.
3353 The page was declared uninitialized by
3354@@ -2069,13 +2163,17 @@
3355 checksum);
3356 }
3357
3358- buf_pool_mutex_enter(buf_pool);
3359-
3360 mutex_enter(block_mutex);
3361- buf_page_unset_sticky(b != NULL ? b : bpage);
3362- mutex_exit(block_mutex);
3363+
3364+ if (b) {
3365+ mutex_enter(&buf_pool->zip_mutex);
3366+ buf_page_unset_sticky(b);
3367+ mutex_exit(&buf_pool->zip_mutex);
3368+ }
3369
3370 buf_LRU_block_free_hashed_page((buf_block_t*) bpage);
3371+ ut_ad(mutex_own(block_mutex));
3372+ ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
3373 return(true);
3374 }
3375
3376@@ -2091,7 +2189,6 @@
3377 buf_pool_t* buf_pool = buf_pool_from_block(block);
3378
3379 ut_ad(block);
3380- ut_ad(buf_pool_mutex_own(buf_pool));
3381 ut_ad(mutex_own(&block->mutex));
3382
3383 switch (buf_block_get_state(block)) {
3384@@ -2109,8 +2206,6 @@
3385 ut_ad(!block->page.in_flush_list);
3386 ut_ad(!block->page.in_LRU_list);
3387
3388- buf_block_set_state(block, BUF_BLOCK_NOT_USED);
3389-
3390 UNIV_MEM_ALLOC(block->frame, UNIV_PAGE_SIZE);
3391 #ifdef UNIV_DEBUG
3392 /* Wipe contents of page to reveal possible stale pointers to it */
3393@@ -2125,18 +2220,19 @@
3394 if (data) {
3395 block->page.zip.data = NULL;
3396 mutex_exit(&block->mutex);
3397- buf_pool_mutex_exit_forbid(buf_pool);
3398
3399 buf_buddy_free(
3400 buf_pool, data, page_zip_get_size(&block->page.zip));
3401
3402- buf_pool_mutex_exit_allow(buf_pool);
3403 mutex_enter(&block->mutex);
3404 page_zip_set_size(&block->page.zip, 0);
3405 }
3406
3407+ mutex_enter(&buf_pool->free_list_mutex);
3408+ buf_block_set_state(block, BUF_BLOCK_NOT_USED);
3409 UT_LIST_ADD_FIRST(list, buf_pool->free, (&block->page));
3410 ut_d(block->page.in_free_list = TRUE);
3411+ mutex_exit(&buf_pool->free_list_mutex);
3412
3413 UNIV_MEM_ASSERT_AND_FREE(block->frame, UNIV_PAGE_SIZE);
3414 }
3415@@ -2146,7 +2242,7 @@
3416 If the block is compressed-only (BUF_BLOCK_ZIP_PAGE),
3417 the object will be freed.
3418
3419-The caller must hold buf_pool->mutex, the buf_page_get_mutex() mutex
3420+The caller must hold buf_pool->LRU_list_mutex, the buf_page_get_mutex() mutex
3421 and the appropriate hash_lock. This function will release the
3422 buf_page_get_mutex() and the hash_lock.
3423
3424@@ -2171,7 +2267,7 @@
3425 rw_lock_t* hash_lock;
3426
3427 ut_ad(bpage);
3428- ut_ad(buf_pool_mutex_own(buf_pool));
3429+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3430 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
3431
3432 fold = buf_page_address_fold(bpage->space, bpage->offset);
3433@@ -2287,7 +2383,7 @@
3434 #if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
3435 mutex_exit(buf_page_get_mutex(bpage));
3436 rw_lock_x_unlock(hash_lock);
3437- buf_pool_mutex_exit(buf_pool);
3438+ mutex_exit(&buf_pool->LRU_list_mutex);
3439 buf_print();
3440 buf_LRU_print();
3441 buf_validate();
3442@@ -2314,13 +2410,11 @@
3443
3444 mutex_exit(&buf_pool->zip_mutex);
3445 rw_lock_x_unlock(hash_lock);
3446- buf_pool_mutex_exit_forbid(buf_pool);
3447
3448 buf_buddy_free(
3449 buf_pool, bpage->zip.data,
3450 page_zip_get_size(&bpage->zip));
3451
3452- buf_pool_mutex_exit_allow(buf_pool);
3453 buf_page_free_descriptor(bpage);
3454 return(false);
3455
3456@@ -2344,14 +2438,15 @@
3457 page_hash. Only possibility is when while invalidating
3458 a tablespace we buffer fix the prev_page in LRU to
3459 avoid relocation during the scan. But that is not
3460- possible because we are holding buf_pool mutex.
3461+ possible because we are holding LRU list mutex.
3462
3463 2) Not possible because in buf_page_init_for_read()
3464- we do a look up of page_hash while holding buf_pool
3465- mutex and since we are holding buf_pool mutex here
3466+ we do a look up of page_hash while holding LRU list
3467+ mutex and since we are holding LRU list mutex here
3468 and by the time we'll release it in the caller we'd
3469 have inserted the compressed only descriptor in the
3470 page_hash. */
3471+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3472 rw_lock_x_unlock(hash_lock);
3473 mutex_exit(&((buf_block_t*) bpage)->mutex);
3474
3475@@ -2363,13 +2458,11 @@
3476 ut_ad(!bpage->in_free_list);
3477 ut_ad(!bpage->in_flush_list);
3478 ut_ad(!bpage->in_LRU_list);
3479- buf_pool_mutex_exit_forbid(buf_pool);
3480
3481 buf_buddy_free(
3482 buf_pool, data,
3483 page_zip_get_size(&bpage->zip));
3484
3485- buf_pool_mutex_exit_allow(buf_pool);
3486 page_zip_set_size(&bpage->zip, 0);
3487 }
3488
3489@@ -2397,16 +2490,11 @@
3490 buf_block_t* block) /*!< in: block, must contain a file page and
3491 be in a state where it can be freed */
3492 {
3493-#ifdef UNIV_DEBUG
3494- buf_pool_t* buf_pool = buf_pool_from_block(block);
3495- ut_ad(buf_pool_mutex_own(buf_pool));
3496-#endif
3497+ ut_ad(mutex_own(&block->mutex));
3498
3499- mutex_enter(&block->mutex);
3500 buf_block_set_state(block, BUF_BLOCK_MEMORY);
3501
3502 buf_LRU_block_free_non_file_page(block);
3503- mutex_exit(&block->mutex);
3504 }
3505
3506 /******************************************************************//**
3507@@ -2419,19 +2507,26 @@
3508 be in a state where it can be freed; there
3509 may or may not be a hash index to the page */
3510 {
3511+#if defined(UNIV_DEBUG) || defined(UNIV_SYNC_DEBUG)
3512 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
3513+#endif
3514+#ifdef UNIV_SYNC_DEBUG
3515 const ulint fold = buf_page_address_fold(bpage->space,
3516 bpage->offset);
3517 rw_lock_t* hash_lock = buf_page_hash_lock_get(buf_pool, fold);
3518+#endif
3519 ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
3520
3521- ut_ad(buf_pool_mutex_own(buf_pool));
3522-
3523- rw_lock_x_lock(hash_lock);
3524- mutex_enter(block_mutex);
3525+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
3526+ ut_ad(mutex_own(block_mutex));
3527+#ifdef UNIV_SYNC_DEBUG
3528+ ut_ad(rw_lock_own(hash_lock, RW_LOCK_EX));
3529+#endif
3530
3531 if (buf_LRU_block_remove_hashed(bpage, true)) {
3532+ mutex_enter(block_mutex);
3533 buf_LRU_block_free_hashed_page((buf_block_t*) bpage);
3534+ mutex_exit(block_mutex);
3535 }
3536
3537 /* buf_LRU_block_remove_hashed() releases hash_lock and block_mutex */
3538@@ -2466,7 +2561,7 @@
3539 }
3540
3541 if (adjust) {
3542- buf_pool_mutex_enter(buf_pool);
3543+ mutex_enter(&buf_pool->LRU_list_mutex);
3544
3545 if (ratio != buf_pool->LRU_old_ratio) {
3546 buf_pool->LRU_old_ratio = ratio;
3547@@ -2478,7 +2573,7 @@
3548 }
3549 }
3550
3551- buf_pool_mutex_exit(buf_pool);
3552+ mutex_exit(&buf_pool->LRU_list_mutex);
3553 } else {
3554 buf_pool->LRU_old_ratio = ratio;
3555 }
3556@@ -2583,7 +2678,7 @@
3557 ulint new_len;
3558
3559 ut_ad(buf_pool);
3560- buf_pool_mutex_enter(buf_pool);
3561+ mutex_enter(&buf_pool->LRU_list_mutex);
3562
3563 if (UT_LIST_GET_LEN(buf_pool->LRU) >= BUF_LRU_OLD_MIN_LEN) {
3564
3565@@ -2641,6 +2736,10 @@
3566
3567 ut_a(buf_pool->LRU_old_len == old_len);
3568
3569+ mutex_exit(&buf_pool->LRU_list_mutex);
3570+
3571+ mutex_enter(&buf_pool->free_list_mutex);
3572+
3573 UT_LIST_VALIDATE(list, buf_page_t, buf_pool->free, CheckInFreeList());
3574
3575 for (bpage = UT_LIST_GET_FIRST(buf_pool->free);
3576@@ -2650,6 +2749,10 @@
3577 ut_a(buf_page_get_state(bpage) == BUF_BLOCK_NOT_USED);
3578 }
3579
3580+ mutex_exit(&buf_pool->free_list_mutex);
3581+
3582+ mutex_enter(&buf_pool->LRU_list_mutex);
3583+
3584 UT_LIST_VALIDATE(
3585 unzip_LRU, buf_block_t, buf_pool->unzip_LRU,
3586 CheckUnzipLRUAndLRUList());
3587@@ -2663,7 +2766,7 @@
3588 ut_a(buf_page_belongs_to_unzip_LRU(&block->page));
3589 }
3590
3591- buf_pool_mutex_exit(buf_pool);
3592+ mutex_exit(&buf_pool->LRU_list_mutex);
3593 }
3594
3595 /**********************************************************************//**
3596@@ -2699,7 +2802,7 @@
3597 const buf_page_t* bpage;
3598
3599 ut_ad(buf_pool);
3600- buf_pool_mutex_enter(buf_pool);
3601+ mutex_enter(&buf_pool->LRU_list_mutex);
3602
3603 bpage = UT_LIST_GET_FIRST(buf_pool->LRU);
3604
3605@@ -2756,7 +2859,7 @@
3606 bpage = UT_LIST_GET_NEXT(LRU, bpage);
3607 }
3608
3609- buf_pool_mutex_exit(buf_pool);
3610+ mutex_exit(&buf_pool->LRU_list_mutex);
3611 }
3612
3613 /**********************************************************************//**
3614
3615=== modified file 'Percona-Server/storage/innobase/buf/buf0rea.cc'
3616--- Percona-Server/storage/innobase/buf/buf0rea.cc 2013-08-06 15:16:34 +0000
3617+++ Percona-Server/storage/innobase/buf/buf0rea.cc 2013-09-20 05:29:11 +0000
3618@@ -63,10 +63,15 @@
3619 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
3620 const bool uncompressed = (buf_page_get_state(bpage)
3621 == BUF_BLOCK_FILE_PAGE);
3622+ const ulint fold = buf_page_address_fold(bpage->space,
3623+ bpage->offset);
3624+ rw_lock_t* hash_lock = buf_page_hash_lock_get(buf_pool, fold);
3625+
3626+ mutex_enter(&buf_pool->LRU_list_mutex);
3627+ rw_lock_x_lock(hash_lock);
3628+ mutex_enter(buf_page_get_mutex(bpage));
3629
3630 /* First unfix and release lock on the bpage */
3631- buf_pool_mutex_enter(buf_pool);
3632- mutex_enter(buf_page_get_mutex(bpage));
3633 ut_ad(buf_page_get_io_fix(bpage) == BUF_IO_READ);
3634 ut_ad(bpage->buf_fix_count == 0);
3635
3636@@ -79,15 +84,13 @@
3637 BUF_IO_READ);
3638 }
3639
3640- mutex_exit(buf_page_get_mutex(bpage));
3641-
3642 /* remove the block from LRU list */
3643 buf_LRU_free_one_page(bpage);
3644
3645+ mutex_exit(&buf_pool->LRU_list_mutex);
3646+
3647 ut_ad(buf_pool->n_pend_reads > 0);
3648- buf_pool->n_pend_reads--;
3649-
3650- buf_pool_mutex_exit(buf_pool);
3651+ os_atomic_decrement_ulint(&buf_pool->n_pend_reads, 1);
3652 }
3653
3654 /********************************************************************//**
3655@@ -216,6 +219,7 @@
3656 #endif
3657
3658 ut_ad(buf_page_in_file(bpage));
3659+ ut_ad(!mutex_own(&buf_pool_from_bpage(bpage)->LRU_list_mutex));
3660
3661 if (sync) {
3662 thd_wait_begin(NULL, THD_WAIT_DISKIO);
3663@@ -332,11 +336,8 @@
3664 high = fil_space_get_size(space);
3665 }
3666
3667- buf_pool_mutex_enter(buf_pool);
3668-
3669 if (buf_pool->n_pend_reads
3670 > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
3671- buf_pool_mutex_exit(buf_pool);
3672
3673 return(0);
3674 }
3675@@ -345,8 +346,12 @@
3676 that is, reside near the start of the LRU list. */
3677
3678 for (i = low; i < high; i++) {
3679+
3680+ rw_lock_t* hash_lock;
3681+
3682 const buf_page_t* bpage =
3683- buf_page_hash_get(buf_pool, space, i);
3684+ buf_page_hash_get_s_locked(buf_pool, space, i,
3685+ &hash_lock);
3686
3687 if (bpage
3688 && buf_page_is_accessed(bpage)
3689@@ -357,13 +362,16 @@
3690 if (recent_blocks
3691 >= BUF_READ_AHEAD_RANDOM_THRESHOLD(buf_pool)) {
3692
3693- buf_pool_mutex_exit(buf_pool);
3694+ rw_lock_s_unlock(hash_lock);
3695 goto read_ahead;
3696 }
3697 }
3698+
3699+ if (bpage) {
3700+ rw_lock_s_unlock(hash_lock);
3701+ }
3702 }
3703
3704- buf_pool_mutex_exit(buf_pool);
3705 /* Do nothing */
3706 return(0);
3707
3708@@ -551,6 +559,7 @@
3709 buf_page_t* bpage;
3710 buf_frame_t* frame;
3711 buf_page_t* pred_bpage = NULL;
3712+ unsigned pred_bpage_is_accessed = 0;
3713 ulint pred_offset;
3714 ulint succ_offset;
3715 ulint count;
3716@@ -602,10 +611,7 @@
3717
3718 tablespace_version = fil_space_get_version(space);
3719
3720- buf_pool_mutex_enter(buf_pool);
3721-
3722 if (high > fil_space_get_size(space)) {
3723- buf_pool_mutex_exit(buf_pool);
3724 /* The area is not whole, return */
3725
3726 return(0);
3727@@ -613,7 +619,6 @@
3728
3729 if (buf_pool->n_pend_reads
3730 > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
3731- buf_pool_mutex_exit(buf_pool);
3732
3733 return(0);
3734 }
3735@@ -636,7 +641,11 @@
3736 fail_count = 0;
3737
3738 for (i = low; i < high; i++) {
3739- bpage = buf_page_hash_get(buf_pool, space, i);
3740+
3741+ rw_lock_t* hash_lock;
3742+
3743+ bpage = buf_page_hash_get_s_locked(buf_pool, space, i,
3744+ &hash_lock);
3745
3746 if (bpage == NULL || !buf_page_is_accessed(bpage)) {
3747 /* Not accessed */
3748@@ -653,7 +662,7 @@
3749 a little against this. */
3750 int res = ut_ulint_cmp(
3751 buf_page_is_accessed(bpage),
3752- buf_page_is_accessed(pred_bpage));
3753+ pred_bpage_is_accessed);
3754 /* Accesses not in the right order */
3755 if (res != 0 && res != asc_or_desc) {
3756 fail_count++;
3757@@ -662,12 +671,20 @@
3758
3759 if (fail_count > threshold) {
3760 /* Too many failures: return */
3761- buf_pool_mutex_exit(buf_pool);
3762+ if (bpage) {
3763+ rw_lock_s_unlock(hash_lock);
3764+ }
3765 return(0);
3766 }
3767
3768- if (bpage && buf_page_is_accessed(bpage)) {
3769- pred_bpage = bpage;
3770+ if (bpage) {
3771+ if (buf_page_is_accessed(bpage)) {
3772+ pred_bpage = bpage;
3773+ pred_bpage_is_accessed
3774+ = buf_page_is_accessed(bpage);
3775+ }
3776+
3777+ rw_lock_s_unlock(hash_lock);
3778 }
3779 }
3780
3781@@ -677,7 +694,6 @@
3782 bpage = buf_page_hash_get(buf_pool, space, offset);
3783
3784 if (bpage == NULL) {
3785- buf_pool_mutex_exit(buf_pool);
3786
3787 return(0);
3788 }
3789@@ -703,8 +719,6 @@
3790 pred_offset = fil_page_get_prev(frame);
3791 succ_offset = fil_page_get_next(frame);
3792
3793- buf_pool_mutex_exit(buf_pool);
3794-
3795 if ((offset == low) && (succ_offset == offset + 1)) {
3796
3797 /* This is ok, we can continue */
3798@@ -961,7 +975,8 @@
3799
3800 os_aio_print_debug = FALSE;
3801 buf_pool = buf_pool_get(space, page_nos[i]);
3802- while (buf_pool->n_pend_reads >= recv_n_pool_free_frames / 2) {
3803+ while (buf_pool->n_pend_reads
3804+ >= recv_n_pool_free_frames / 2) {
3805
3806 os_aio_simulated_wake_handler_threads();
3807 os_thread_sleep(10000);
3808
3809=== modified file 'Percona-Server/storage/innobase/fsp/fsp0fsp.cc'
3810--- Percona-Server/storage/innobase/fsp/fsp0fsp.cc 2013-05-30 12:47:19 +0000
3811+++ Percona-Server/storage/innobase/fsp/fsp0fsp.cc 2013-09-20 05:29:11 +0000
3812@@ -2870,7 +2870,7 @@
3813
3814 /* The convoluted mutex acquire is to overcome latching order
3815 issues: The problem is that the fil_mutex is at a lower level
3816- than the tablespace latch and the buffer pool mutex. We have to
3817+ than the tablespace latch and the buffer pool mutexes. We have to
3818 first prevent any operations on the file system by acquiring the
3819 dictionary mutex. Then acquire the tablespace latch to obey the
3820 latching order and then release the dictionary mutex. That way we
3821
3822=== modified file 'Percona-Server/storage/innobase/handler/ha_innodb.cc'
3823--- Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-08-30 13:23:53 +0000
3824+++ Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-09-20 05:29:11 +0000
3825@@ -294,6 +294,11 @@
3826 # endif /* !PFS_SKIP_BUFFER_MUTEX_RWLOCK */
3827 {&buf_pool_mutex_key, "buf_pool_mutex", 0},
3828 {&buf_pool_zip_mutex_key, "buf_pool_zip_mutex", 0},
3829+ {&buf_pool_LRU_list_mutex_key, "buf_pool_LRU_list_mutex", 0},
3830+ {&buf_pool_free_list_mutex_key, "buf_pool_free_list_mutex", 0},
3831+ {&buf_pool_zip_free_mutex_key, "buf_pool_zip_free_mutex", 0},
3832+ {&buf_pool_zip_hash_mutex_key, "buf_pool_zip_hash_mutex", 0},
3833+ {&buf_pool_flush_state_mutex_key, "buf_pool_flush_state_mutex", 0},
3834 {&cache_last_read_mutex_key, "cache_last_read_mutex", 0},
3835 {&dict_foreign_err_mutex_key, "dict_foreign_err_mutex", 0},
3836 {&dict_sys_mutex_key, "dict_sys_mutex", 0},
3837@@ -15485,7 +15490,7 @@
3838 for (ulint i = 0; i < srv_buf_pool_instances; i++) {
3839 buf_pool_t* buf_pool = &buf_pool_ptr[i];
3840
3841- buf_pool_mutex_enter(buf_pool);
3842+ mutex_enter(&buf_pool->LRU_list_mutex);
3843
3844 for (buf_block_t* block = UT_LIST_GET_LAST(
3845 buf_pool->unzip_LRU);
3846@@ -15497,14 +15502,19 @@
3847 ut_ad(block->in_unzip_LRU_list);
3848 ut_ad(block->page.in_LRU_list);
3849
3850+ mutex_enter(&block->mutex);
3851 if (!buf_LRU_free_page(&block->page, false)) {
3852+ mutex_exit(&block->mutex);
3853 all_evicted = false;
3854+ } else {
3855+ mutex_exit(&block->mutex);
3856+ mutex_enter(&buf_pool->LRU_list_mutex);
3857 }
3858
3859 block = prev_block;
3860 }
3861
3862- buf_pool_mutex_exit(buf_pool);
3863+ mutex_exit(&buf_pool->LRU_list_mutex);
3864 }
3865
3866 return(all_evicted);
3867
3868=== modified file 'Percona-Server/storage/innobase/handler/i_s.cc'
3869--- Percona-Server/storage/innobase/handler/i_s.cc 2013-08-14 03:57:21 +0000
3870+++ Percona-Server/storage/innobase/handler/i_s.cc 2013-09-20 05:29:11 +0000
3871@@ -2103,7 +2103,7 @@
3872
3873 buf_pool = buf_pool_from_array(i);
3874
3875- buf_pool_mutex_enter(buf_pool);
3876+ mutex_enter(&buf_pool->zip_free_mutex);
3877
3878 for (uint x = 0; x <= BUF_BUDDY_SIZES; x++) {
3879 buf_buddy_stat_t* buddy_stat;
3880@@ -2122,7 +2122,8 @@
3881 (ulong) (buddy_stat->relocated_usec / 1000000));
3882
3883 if (reset) {
3884- /* This is protected by buf_pool->mutex. */
3885+ /* This is protected by
3886+ buf_pool->zip_free_mutex. */
3887 buddy_stat->relocated = 0;
3888 buddy_stat->relocated_usec = 0;
3889 }
3890@@ -2133,7 +2134,7 @@
3891 }
3892 }
3893
3894- buf_pool_mutex_exit(buf_pool);
3895+ mutex_exit(&buf_pool->zip_free_mutex);
3896
3897 if (status) {
3898 break;
3899@@ -4954,12 +4955,16 @@
3900 out: structure filled with scanned
3901 info */
3902 {
3903+ ib_mutex_t* mutex = buf_page_get_mutex(bpage);
3904+
3905 ut_ad(pool_id < MAX_BUFFER_POOLS);
3906
3907 page_info->pool_id = pool_id;
3908
3909 page_info->block_id = pos;
3910
3911+ mutex_enter(mutex);
3912+
3913 page_info->page_state = buf_page_get_state(bpage);
3914
3915 /* Only fetch information for buffers that map to a tablespace,
3916@@ -4998,6 +5003,7 @@
3917 break;
3918 case BUF_IO_READ:
3919 page_info->page_type = I_S_PAGE_TYPE_UNKNOWN;
3920+ mutex_exit(mutex);
3921 return;
3922 }
3923
3924@@ -5018,6 +5024,8 @@
3925 } else {
3926 page_info->page_type = I_S_PAGE_TYPE_UNKNOWN;
3927 }
3928+
3929+ mutex_exit(mutex);
3930 }
3931
3932 /*******************************************************************//**
3933@@ -5075,7 +5083,6 @@
3934 buffer pool info printout, we are not required to
3935 preserve the overall consistency, so we can
3936 release mutex periodically */
3937- buf_pool_mutex_enter(buf_pool);
3938
3939 /* GO through each block in the chunk */
3940 for (n_blocks = num_to_process; n_blocks--; block++) {
3941@@ -5086,8 +5093,6 @@
3942 num_page++;
3943 }
3944
3945- buf_pool_mutex_exit(buf_pool);
3946-
3947 /* Fill in information schema table with information
3948 just collected from the buffer chunk scan */
3949 status = i_s_innodb_buffer_page_fill(
3950@@ -5609,9 +5614,9 @@
3951 DBUG_ENTER("i_s_innodb_fill_buffer_lru");
3952 RETURN_IF_INNODB_NOT_STARTED(tables->schema_table_name);
3953
3954- /* Obtain buf_pool mutex before allocate info_buffer, since
3955+ /* Obtain buf_pool->LRU_list_mutex before allocate info_buffer, since
3956 UT_LIST_GET_LEN(buf_pool->LRU) could change */
3957- buf_pool_mutex_enter(buf_pool);
3958+ mutex_enter(&buf_pool->LRU_list_mutex);
3959
3960 lru_len = UT_LIST_GET_LEN(buf_pool->LRU);
3961
3962@@ -5645,7 +5650,7 @@
3963 ut_ad(lru_pos == UT_LIST_GET_LEN(buf_pool->LRU));
3964
3965 exit:
3966- buf_pool_mutex_exit(buf_pool);
3967+ mutex_exit(&buf_pool->LRU_list_mutex);
3968
3969 if (info_buffer) {
3970 status = i_s_innodb_buf_page_lru_fill(
3971
3972=== modified file 'Percona-Server/storage/innobase/ibuf/ibuf0ibuf.cc'
3973--- Percona-Server/storage/innobase/ibuf/ibuf0ibuf.cc 2013-09-02 10:01:38 +0000
3974+++ Percona-Server/storage/innobase/ibuf/ibuf0ibuf.cc 2013-09-20 05:29:11 +0000
3975@@ -4611,7 +4611,7 @@
3976 ut_ad(!block || buf_block_get_space(block) == space);
3977 ut_ad(!block || buf_block_get_page_no(block) == page_no);
3978 ut_ad(!block || buf_block_get_zip_size(block) == zip_size);
3979- ut_ad(!block || buf_block_get_io_fix(block) == BUF_IO_READ);
3980+ ut_ad(!block || buf_block_get_io_fix_unlocked(block) == BUF_IO_READ);
3981
3982 if (srv_force_recovery >= SRV_FORCE_NO_IBUF_MERGE
3983 || trx_sys_hdr_page(space, page_no)) {
3984
3985=== modified file 'Percona-Server/storage/innobase/include/buf0buddy.h'
3986--- Percona-Server/storage/innobase/include/buf0buddy.h 2013-08-06 15:16:34 +0000
3987+++ Percona-Server/storage/innobase/include/buf0buddy.h 2013-09-20 05:29:11 +0000
3988@@ -36,8 +36,8 @@
3989
3990 /**********************************************************************//**
3991 Allocate a block. The thread calling this function must hold
3992-buf_pool->mutex and must not hold buf_pool->zip_mutex or any
3993-block->mutex. The buf_pool->mutex may be released and reacquired.
3994+buf_pool->LRU_list_mutex and must not hold buf_pool->zip_mutex or any
3995+block->mutex. The buf_pool->LRU_list_mutex may be released and reacquired.
3996 This function should only be used for allocating compressed page frames.
3997 @return allocated block, never NULL */
3998 UNIV_INLINE
3999@@ -52,8 +52,8 @@
4000 ibool* lru) /*!< in: pointer to a variable
4001 that will be assigned TRUE if
4002 storage was allocated from the
4003- LRU list and buf_pool->mutex was
4004- temporarily released */
4005+ LRU list and buf_pool->LRU_list_mutex
4006+ was temporarily released */
4007 __attribute__((malloc, nonnull));
4008
4009 /**********************************************************************//**
4010
4011=== modified file 'Percona-Server/storage/innobase/include/buf0buddy.ic'
4012--- Percona-Server/storage/innobase/include/buf0buddy.ic 2013-08-06 15:16:34 +0000
4013+++ Percona-Server/storage/innobase/include/buf0buddy.ic 2013-09-20 05:29:11 +0000
4014@@ -35,8 +35,8 @@
4015
4016 /**********************************************************************//**
4017 Allocate a block. The thread calling this function must hold
4018-buf_pool->mutex and must not hold buf_pool->zip_mutex or any block->mutex.
4019-The buf_pool_mutex may be released and reacquired.
4020+buf_pool->LRU_list_mutex and must not hold buf_pool->zip_mutex or any
4021+block->mutex. The buf_pool->LRU_list_mutex may be released and reacquired.
4022 @return allocated block, never NULL */
4023 UNIV_INTERN
4024 void*
4025@@ -48,8 +48,8 @@
4026 ibool* lru) /*!< in: pointer to a variable that
4027 will be assigned TRUE if storage was
4028 allocated from the LRU list and
4029- buf_pool->mutex was temporarily
4030- released */
4031+ buf_pool->LRU_list_mutex was
4032+ temporarily released */
4033 __attribute__((malloc, nonnull));
4034
4035 /**********************************************************************//**
4036@@ -88,8 +88,8 @@
4037
4038 /**********************************************************************//**
4039 Allocate a block. The thread calling this function must hold
4040-buf_pool->mutex and must not hold buf_pool->zip_mutex or any
4041-block->mutex. The buf_pool->mutex may be released and reacquired.
4042+buf_pool->LRU_list_mutex and must not hold buf_pool->zip_mutex or any
4043+block->mutex. The buf_pool->LRU_list_mutex may be released and reacquired.
4044 This function should only be used for allocating compressed page frames.
4045 @return allocated block, never NULL */
4046 UNIV_INLINE
4047@@ -104,10 +104,10 @@
4048 ibool* lru) /*!< in: pointer to a variable
4049 that will be assigned TRUE if
4050 storage was allocated from the
4051- LRU list and buf_pool->mutex was
4052- temporarily released */
4053+ LRU list and buf_pool->LRU_list_mutex
4054+ was temporarily released */
4055 {
4056- ut_ad(buf_pool_mutex_own(buf_pool));
4057+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
4058 ut_ad(ut_is_2pow(size));
4059 ut_ad(size >= UNIV_ZIP_SIZE_MIN);
4060 ut_ad(size <= UNIV_PAGE_SIZE);
4061@@ -129,7 +129,6 @@
4062 ulint size) /*!< in: block size,
4063 up to UNIV_PAGE_SIZE */
4064 {
4065- ut_ad(buf_pool_mutex_own(buf_pool));
4066 ut_ad(ut_is_2pow(size));
4067 ut_ad(size >= UNIV_ZIP_SIZE_MIN);
4068 ut_ad(size <= UNIV_PAGE_SIZE);
4069
4070=== modified file 'Percona-Server/storage/innobase/include/buf0buf.h'
4071--- Percona-Server/storage/innobase/include/buf0buf.h 2013-09-02 10:01:38 +0000
4072+++ Percona-Server/storage/innobase/include/buf0buf.h 2013-09-20 05:29:11 +0000
4073@@ -208,19 +208,6 @@
4074 };
4075
4076 #ifndef UNIV_HOTBACKUP
4077-/********************************************************************//**
4078-Acquire mutex on all buffer pool instances */
4079-UNIV_INLINE
4080-void
4081-buf_pool_mutex_enter_all(void);
4082-/*===========================*/
4083-
4084-/********************************************************************//**
4085-Release mutex on all buffer pool instances */
4086-UNIV_INLINE
4087-void
4088-buf_pool_mutex_exit_all(void);
4089-/*==========================*/
4090
4091 /********************************************************************//**
4092 Creates the buffer pool.
4093@@ -581,7 +568,7 @@
4094 page frame */
4095 /********************************************************************//**
4096 Increments the modify clock of a frame by 1. The caller must (1) own the
4097-buf_pool->mutex and block bufferfix count has to be zero, (2) or own an x-lock
4098+LRU list mutex and block bufferfix count has to be zero, (2) or own an x-lock
4099 on the block. */
4100 UNIV_INLINE
4101 void
4102@@ -938,6 +925,17 @@
4103 const buf_block_t* block) /*!< in: pointer to the control block */
4104 __attribute__((pure));
4105 /*********************************************************************//**
4106+Gets the io_fix state of a block. Does not assert that the
4107+buf_page_get_mutex() mutex is held, to be used in the cases where it is safe
4108+not to hold it.
4109+@return io_fix state */
4110+UNIV_INLINE
4111+enum buf_io_fix
4112+buf_page_get_io_fix_unlocked(
4113+/*=========================*/
4114+ const buf_page_t* bpage) /*!< in: pointer to the control block */
4115+ __attribute__((pure));
4116+/*********************************************************************//**
4117 Sets the io_fix state of a block. */
4118 UNIV_INLINE
4119 void
4120@@ -955,7 +953,7 @@
4121 enum buf_io_fix io_fix);/*!< in: io_fix state */
4122 /*********************************************************************//**
4123 Makes a block sticky. A sticky block implies that even after we release
4124-the buf_pool->mutex and the block->mutex:
4125+the buf_pool->LRU_list_mutex and the block->mutex:
4126 * it cannot be removed from the flush_list
4127 * the block descriptor cannot be relocated
4128 * it cannot be removed from the LRU list
4129@@ -1410,6 +1408,19 @@
4130
4131 #endif /* !UNIV_HOTBACKUP */
4132
4133+#ifdef UNIV_DEBUG
4134+/********************************************************************//**
4135+Checks if buf_pool->zip_mutex is owned and is serving for a given page as its
4136+block mutex.
4137+@return true if buf_pool->zip_mutex is owned. */
4138+UNIV_INLINE
4139+bool
4140+buf_own_zip_mutex_for_page(
4141+/*=======================*/
4142+ const buf_page_t* bpage)
4143+ __attribute__((nonnull,warn_unused_result));
4144+#endif /* UNIV_DEBUG */
4145+
4146 /** The common buffer control block structure
4147 for compressed and uncompressed frames */
4148
4149@@ -1421,18 +1432,14 @@
4150 None of these bit-fields must be modified without holding
4151 buf_page_get_mutex() [buf_block_t::mutex or
4152 buf_pool->zip_mutex], since they can be stored in the same
4153- machine word. Some of these fields are additionally protected
4154- by buf_pool->mutex. */
4155+ machine word. */
4156 /* @{ */
4157
4158- unsigned space:32; /*!< tablespace id; also protected
4159- by buf_pool->mutex. */
4160- unsigned offset:32; /*!< page number; also protected
4161- by buf_pool->mutex. */
4162+ unsigned space:32; /*!< tablespace id. */
4163+ unsigned offset:32; /*!< page number. */
4164
4165 unsigned state:BUF_PAGE_STATE_BITS;
4166- /*!< state of the control block; also
4167- protected by buf_pool->mutex.
4168+ /*!< state of the control block.
4169 State transitions from
4170 BUF_BLOCK_READY_FOR_USE to
4171 BUF_BLOCK_MEMORY need not be
4172@@ -1450,11 +1457,21 @@
4173 #ifndef UNIV_HOTBACKUP
4174 unsigned flush_type:2; /*!< if this block is currently being
4175 flushed to disk, this tells the
4176- flush_type.
4177+ flush_type. Writes during flushing
4178+ protected by buf_page_get_mutex_enter()
4179+ mutex and the corresponding flush state
4180+ mutex.
4181 @see buf_flush_t */
4182- unsigned io_fix:2; /*!< type of pending I/O operation;
4183- also protected by buf_pool->mutex
4184- @see enum buf_io_fix */
4185+ unsigned io_fix:2; /*!< type of pending I/O operation.
4186+ Transitions from BUF_IO_NONE to
4187+ BUF_IO_WRITE and back are protected by
4188+ the buf_page_get_mutex() mutex and the
4189+ corresponding flush state mutex. The
4190+ flush state mutex protection for io_fix
4191+ and flush_type is not strictly
4192+ required, but it ensures consistent
4193+ buffer pool instance state snapshots in
4194+ buf_pool_validate_instance(). */
4195 unsigned buf_fix_count:19;/*!< count of how manyfold this block
4196 is currently bufferfixed */
4197 unsigned buf_pool_index:6;/*!< index number of the buffer pool
4198@@ -1466,7 +1483,7 @@
4199 #endif /* !UNIV_HOTBACKUP */
4200 page_zip_des_t zip; /*!< compressed page; zip.data
4201 (but not the data it points to) is
4202- also protected by buf_pool->mutex;
4203+ protected by buf_pool->zip_mutex;
4204 state == BUF_BLOCK_ZIP_PAGE and
4205 zip.data == NULL means an active
4206 buf_pool->watch */
4207@@ -1479,15 +1496,13 @@
4208 ibool in_zip_hash; /*!< TRUE if in buf_pool->zip_hash */
4209 #endif /* UNIV_DEBUG */
4210
4211- /** @name Page flushing fields
4212- All these are protected by buf_pool->mutex. */
4213+ /** @name Page flushing fields */
4214 /* @{ */
4215
4216 UT_LIST_NODE_T(buf_page_t) list;
4217 /*!< based on state, this is a
4218 list node, protected either by
4219- buf_pool->mutex or by
4220- buf_pool->flush_list_mutex,
4221+ a corresponding list mutex,
4222 in one of the following lists in
4223 buf_pool:
4224
4225@@ -1500,7 +1515,8 @@
4226 then the node pointers are
4227 covered by buf_pool->flush_list_mutex.
4228 Otherwise these pointers are
4229- protected by buf_pool->mutex.
4230+ protected by a corresponding list
4231+ mutex.
4232
4233 The contents of the list node
4234 is undefined if !in_flush_list
4235@@ -1523,8 +1539,8 @@
4236 reads can happen while holding
4237 any one of the two mutexes */
4238 ibool in_free_list; /*!< TRUE if in buf_pool->free; when
4239- buf_pool->mutex is free, the following
4240- should hold: in_free_list
4241+ buf_pool->free_list_mutex is free, the
4242+ following should hold: in_free_list
4243 == (state == BUF_BLOCK_NOT_USED) */
4244 #endif /* UNIV_DEBUG */
4245 lsn_t newest_modification;
4246@@ -1547,9 +1563,7 @@
4247 reads can happen while holding
4248 any one of the two mutexes */
4249 /* @} */
4250- /** @name LRU replacement algorithm fields
4251- These fields are protected by buf_pool->mutex only (not
4252- buf_pool->zip_mutex or buf_block_t::mutex). */
4253+ /** @name LRU replacement algorithm fields */
4254 /* @{ */
4255
4256 UT_LIST_NODE_T(buf_page_t) LRU;
4257@@ -1560,7 +1574,10 @@
4258 debugging */
4259 #endif /* UNIV_DEBUG */
4260 unsigned old:1; /*!< TRUE if the block is in the old
4261- blocks in buf_pool->LRU_old */
4262+ blocks in buf_pool->LRU_old. Protected
4263+ by the LRU list mutex. May be read for
4264+ heuristic purposes under the block
4265+ mutex instead. */
4266 unsigned freed_page_clock:31;/*!< the value of
4267 buf_pool->freed_page_clock
4268 when this block was the last
4269@@ -1612,8 +1629,7 @@
4270 used in debugging */
4271 #endif /* UNIV_DEBUG */
4272 ib_mutex_t mutex; /*!< mutex protecting this block:
4273- state (also protected by the buffer
4274- pool mutex), io_fix, buf_fix_count,
4275+ state, io_fix, buf_fix_count,
4276 and accessed; we introduce this new
4277 mutex in InnoDB-5.1 to relieve
4278 contention on the buffer pool mutex */
4279@@ -1622,8 +1638,8 @@
4280 unsigned lock_hash_val:32;/*!< hashed value of the page address
4281 in the record lock hash table;
4282 protected by buf_block_t::lock
4283- (or buf_block_t::mutex, buf_pool->mutex
4284- in buf_page_get_gen(),
4285+ (or buf_block_t::mutex in
4286+ buf_page_get_gen(),
4287 buf_page_init_for_read()
4288 and buf_page_create()) */
4289 ibool check_index_page_at_flush;
4290@@ -1646,8 +1662,8 @@
4291 positioning: if the modify clock has
4292 not changed, we know that the pointer
4293 is still valid; this field may be
4294- changed if the thread (1) owns the
4295- pool mutex and the page is not
4296+ changed if the thread (1) owns the LRU
4297+ list mutex and the page is not
4298 bufferfixed, or (2) the thread has an
4299 x-latch on the block */
4300 /* @} */
4301@@ -1754,11 +1770,11 @@
4302 ulint n_page_gets; /*!< number of page gets performed;
4303 also successful searches through
4304 the adaptive hash index are
4305- counted as page gets; this field
4306- is NOT protected by the buffer
4307- pool mutex */
4308- ulint n_pages_read; /*!< number read operations */
4309- ulint n_pages_written;/*!< number write operations */
4310+ counted as page gets. */
4311+ ulint n_pages_read; /*!< number read operations. Accessed
4312+ ulint n_pages_read; /*!< number of read operations. Accessed
4313+ atomically. */
4314+ ulint n_pages_written;/*!< number of write operations. Accessed
4315+ atomically. */
4316 in the pool with no read */
4317 ulint n_ra_pages_read_rnd;/*!< number of pages read in
4318@@ -1798,12 +1814,16 @@
4319
4320 /** @name General fields */
4321 /* @{ */
4322- ib_mutex_t mutex; /*!< Buffer pool mutex of this
4323- instance */
4324 ib_mutex_t zip_mutex; /*!< Zip mutex of this buffer
4325 pool instance, protects compressed
4326 only pages (of type buf_page_t, not
4327 buf_block_t */
4328+ ib_mutex_t LRU_list_mutex;
4329+ ib_mutex_t free_list_mutex;
4330+ ib_mutex_t zip_free_mutex;
4331+ ib_mutex_t zip_hash_mutex;
4332+ ib_mutex_t flush_state_mutex; /*!< Flush state protection
4333+ mutex */
4334 ulint instance_no; /*!< Array index of this buffer
4335 pool instance */
4336 ulint old_pool_size; /*!< Old pool size in bytes */
4337@@ -1814,9 +1834,6 @@
4338 ulint buddy_n_frames; /*!< Number of frames allocated from
4339 the buffer pool to the buddy system */
4340 #endif
4341-#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
4342- ulint mutex_exit_forbidden; /*!< Forbid release mutex */
4343-#endif
4344 ulint n_chunks; /*!< number of buffer pool chunks */
4345 buf_chunk_t* chunks; /*!< buffer pool chunks */
4346 ulint curr_size; /*!< current pool size in pages */
4347@@ -1828,26 +1845,23 @@
4348 buf_page_in_file() == TRUE,
4349 indexed by (space_id, offset).
4350 page_hash is protected by an
4351- array of mutexes.
4352- Changes in page_hash are protected
4353- by buf_pool->mutex and the relevant
4354- page_hash mutex. Lookups can happen
4355- while holding the buf_pool->mutex or
4356- the relevant page_hash mutex. */
4357+ array of mutexes. */
4358 hash_table_t* zip_hash; /*!< hash table of buf_block_t blocks
4359 whose frames are allocated to the
4360 zip buddy system,
4361 indexed by block->frame */
4362 ulint n_pend_reads; /*!< number of pending read
4363- operations */
4364- ulint n_pend_unzip; /*!< number of pending decompressions */
4365+ operations. Accessed atomically */
4366+ ulint n_pend_unzip; /*!< number of pending decompressions.
4367+ Accesssed atomically */
4368
4369 time_t last_printout_time;
4370 /*!< when buf_print_io was last time
4371- called */
4372+ called. Accesses not protected */
4373 buf_buddy_stat_t buddy_stat[BUF_BUDDY_SIZES_MAX + 1];
4374 /*!< Statistics of buddy system,
4375- indexed by block size */
4376+ indexed by block size. Protected by
4377+ zip_free_mutex. */
4378 buf_pool_stat_t stat; /*!< current statistics */
4379 buf_pool_stat_t old_stat; /*!< old statistics */
4380
4381@@ -1874,10 +1888,12 @@
4382 list */
4383 ibool init_flush[BUF_FLUSH_N_TYPES];
4384 /*!< this is TRUE when a flush of the
4385- given type is being initialized */
4386+ given type is being initialized.
4387+ Protected by flush_state_mutex. */
4388 ulint n_flush[BUF_FLUSH_N_TYPES];
4389 /*!< this is the number of pending
4390- writes in the given flush type */
4391+ writes in the given flush type.
4392+ Protected by flush_state_mutex. */
4393 os_event_t no_flush[BUF_FLUSH_N_TYPES];
4394 /*!< this is in the set state
4395 when there is no flush batch
4396@@ -1904,7 +1920,8 @@
4397 billion! A thread is allowed
4398 to read this for heuristic
4399 purposes without holding any
4400- mutex or latch */
4401+ mutex or latch. For non-heuristic
4402+ purposes protected by LRU_list_mutex */
4403 ibool try_LRU_scan; /*!< Set to FALSE when an LRU
4404 scan for free block fails. This
4405 flag is used to avoid repeated
4406@@ -1913,8 +1930,7 @@
4407 available in the scan depth for
4408 eviction. Set to TRUE whenever
4409 we flush a batch from the
4410- buffer pool. Protected by the
4411- buf_pool->mutex */
4412+ buffer pool. Accessed atomically. */
4413 /* @} */
4414
4415 /** @name LRU replacement algorithm fields */
4416@@ -1942,7 +1958,8 @@
4417
4418 UT_LIST_BASE_NODE_T(buf_block_t) unzip_LRU;
4419 /*!< base node of the
4420- unzip_LRU list */
4421+ unzip_LRU list. The list is protected
4422+ by LRU list mutex. */
4423
4424 /* @} */
4425 /** @name Buddy allocator fields
4426@@ -1959,8 +1976,7 @@
4427
4428 buf_page_t* watch;
4429 /*!< Sentinel records for buffer
4430- pool watches. Protected by
4431- buf_pool->mutex. */
4432+ pool watches. */
4433
4434 #if BUF_BUDDY_LOW > UNIV_ZIP_SIZE_MIN
4435 # error "BUF_BUDDY_LOW > UNIV_ZIP_SIZE_MIN"
4436@@ -1968,18 +1984,10 @@
4437 /* @} */
4438 };
4439
4440-/** @name Accessors for buf_pool->mutex.
4441-Use these instead of accessing buf_pool->mutex directly. */
4442+/** @name Accessors for buffer pool mutexes
4443+Use these instead of accessing buffer pool mutexes directly. */
4444 /* @{ */
4445
4446-/** Test if a buffer pool mutex is owned. */
4447-#define buf_pool_mutex_own(b) mutex_own(&b->mutex)
4448-/** Acquire a buffer pool mutex. */
4449-#define buf_pool_mutex_enter(b) do { \
4450- ut_ad(!mutex_own(&b->zip_mutex)); \
4451- mutex_enter(&b->mutex); \
4452-} while (0)
4453-
4454 /** Test if flush list mutex is owned. */
4455 #define buf_flush_list_mutex_own(b) mutex_own(&b->flush_list_mutex)
4456
4457@@ -2035,31 +2043,6 @@
4458 # define buf_block_hash_lock_held_s_or_x(b, p) (TRUE)
4459 #endif /* UNIV_SYNC_DEBUG */
4460
4461-#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
4462-/** Forbid the release of the buffer pool mutex. */
4463-# define buf_pool_mutex_exit_forbid(b) do { \
4464- ut_ad(buf_pool_mutex_own(b)); \
4465- b->mutex_exit_forbidden++; \
4466-} while (0)
4467-/** Allow the release of the buffer pool mutex. */
4468-# define buf_pool_mutex_exit_allow(b) do { \
4469- ut_ad(buf_pool_mutex_own(b)); \
4470- ut_a(b->mutex_exit_forbidden); \
4471- b->mutex_exit_forbidden--; \
4472-} while (0)
4473-/** Release the buffer pool mutex. */
4474-# define buf_pool_mutex_exit(b) do { \
4475- ut_a(!b->mutex_exit_forbidden); \
4476- mutex_exit(&b->mutex); \
4477-} while (0)
4478-#else
4479-/** Forbid the release of the buffer pool mutex. */
4480-# define buf_pool_mutex_exit_forbid(b) ((void) 0)
4481-/** Allow the release of the buffer pool mutex. */
4482-# define buf_pool_mutex_exit_allow(b) ((void) 0)
4483-/** Release the buffer pool mutex. */
4484-# define buf_pool_mutex_exit(b) mutex_exit(&b->mutex)
4485-#endif
4486 #endif /* !UNIV_HOTBACKUP */
4487 /* @} */
4488
4489
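The hunk above splits the former buf_pool->mutex into structure-specific mutexes (LRU_list_mutex, free_list_mutex, zip_free_mutex, zip_hash_mutex, flush_state_mutex) and removes the buf_pool_mutex_enter()/buf_pool_mutex_exit() accessors together with the exit-forbid debug machinery. The caller-side effect is that code now takes only the mutex covering the data it actually touches. A minimal sketch of that pattern; buf_pool_count_old_pages() is a hypothetical helper and is not part of the patch:

    /* Hypothetical helper: counts "old" pages in one buffer pool
    instance.  Only the LRU list is read, so only the LRU list mutex
    is taken; the former pool-wide mutex is no longer needed. */
    static ulint
    buf_pool_count_old_pages(
    /*=====================*/
        buf_pool_t*   buf_pool)   /*!< in: buffer pool instance */
    {
        ulint         count = 0;
        buf_page_t*   bpage;

        mutex_enter(&buf_pool->LRU_list_mutex);

        for (bpage = UT_LIST_GET_FIRST(buf_pool->LRU);
             bpage != NULL;
             bpage = UT_LIST_GET_NEXT(LRU, bpage)) {

            if (buf_page_is_old(bpage)) {
                count++;
            }
        }

        mutex_exit(&buf_pool->LRU_list_mutex);

        return(count);
    }

Code that works on the free list, the buddy allocator lists or the flush state takes free_list_mutex, zip_free_mutex/zip_hash_mutex or flush_state_mutex instead, and the fields documented above as accessed atomically need no mutex at all.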
4490=== modified file 'Percona-Server/storage/innobase/include/buf0buf.ic'
4491--- Percona-Server/storage/innobase/include/buf0buf.ic 2013-06-25 13:13:06 +0000
4492+++ Percona-Server/storage/innobase/include/buf0buf.ic 2013-09-20 05:29:11 +0000
4493@@ -121,7 +121,7 @@
4494 /*==========================*/
4495 const buf_page_t* bpage) /*!< in: block */
4496 {
4497- /* This is sometimes read without holding buf_pool->mutex. */
4498+ /* This is sometimes read without holding any buffer pool mutex. */
4499 return(bpage->freed_page_clock);
4500 }
4501
4502@@ -420,8 +420,21 @@
4503 /*================*/
4504 const buf_page_t* bpage) /*!< in: pointer to the control block */
4505 {
4506- ut_ad(bpage != NULL);
4507+ ut_ad(mutex_own(buf_page_get_mutex(bpage)));
4508+ return buf_page_get_io_fix_unlocked(bpage);
4509+}
4510
4511+/*********************************************************************//**
4512+Gets the io_fix state of a block. Does not assert that the
4513+buf_page_get_mutex() mutex is held, to be used in the cases where it is safe
4514+not to hold it.
4515+@return io_fix state */
4516+UNIV_INLINE
4517+enum buf_io_fix
4518+buf_page_get_io_fix_unlocked(
4519+/*=========================*/
4520+ const buf_page_t* bpage) /*!< in: pointer to the control block */
4521+{
4522 enum buf_io_fix io_fix = (enum buf_io_fix) bpage->io_fix;
4523 #ifdef UNIV_DEBUG
4524 switch (io_fix) {
4525@@ -449,6 +462,21 @@
4526 }
4527
4528 /*********************************************************************//**
4529+Gets the io_fix state of a block. Does not assert that the
4530+buf_page_get_mutex() mutex is held, to be used in the cases where it is safe
4531+not to hold it.
4532+@return io_fix state */
4533+UNIV_INLINE
4534+enum buf_io_fix
4535+buf_block_get_io_fix_unlocked(
4536+/*==========================*/
4537+ const buf_block_t* block) /*!< in: pointer to the control block */
4538+{
4539+ return(buf_page_get_io_fix_unlocked(&block->page));
4540+}
4541+
4542+
4543+/*********************************************************************//**
4544 Sets the io_fix state of a block. */
4545 UNIV_INLINE
4546 void
4547@@ -457,10 +485,6 @@
4548 buf_page_t* bpage, /*!< in/out: control block */
4549 enum buf_io_fix io_fix) /*!< in: io_fix state */
4550 {
4551-#ifdef UNIV_DEBUG
4552- buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4553- ut_ad(buf_pool_mutex_own(buf_pool));
4554-#endif
4555 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
4556
4557 bpage->io_fix = io_fix;
4558@@ -481,7 +505,7 @@
4559
4560 /*********************************************************************//**
4561 Makes a block sticky. A sticky block implies that even after we release
4562-the buf_pool->mutex and the block->mutex:
4563+the buf_pool->LRU_list_mutex and the block->mutex:
4564 * it cannot be removed from the flush_list
4565 * the block descriptor cannot be relocated
4566 * it cannot be removed from the LRU list
4567@@ -496,10 +520,11 @@
4568 {
4569 #ifdef UNIV_DEBUG
4570 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4571- ut_ad(buf_pool_mutex_own(buf_pool));
4572+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
4573 #endif
4574 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
4575 ut_ad(buf_page_get_io_fix(bpage) == BUF_IO_NONE);
4576+ ut_ad(bpage->in_LRU_list);
4577
4578 bpage->io_fix = BUF_IO_PIN;
4579 }
4580@@ -512,10 +537,6 @@
4581 /*==================*/
4582 buf_page_t* bpage) /*!< in/out: control block */
4583 {
4584-#ifdef UNIV_DEBUG
4585- buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4586- ut_ad(buf_pool_mutex_own(buf_pool));
4587-#endif
4588 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
4589 ut_ad(buf_page_get_io_fix(bpage) == BUF_IO_PIN);
4590
4591@@ -531,10 +552,6 @@
4592 /*==================*/
4593 const buf_page_t* bpage) /*!< control block being relocated */
4594 {
4595-#ifdef UNIV_DEBUG
4596- buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4597- ut_ad(buf_pool_mutex_own(buf_pool));
4598-#endif
4599 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
4600 ut_ad(buf_page_in_file(bpage));
4601 ut_ad(bpage->in_LRU_list);
4602@@ -554,8 +571,12 @@
4603 {
4604 #ifdef UNIV_DEBUG
4605 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4606- ut_ad(buf_pool_mutex_own(buf_pool));
4607 #endif
4608+ /* Buffer page mutex is not strictly required here for heuristic
4609+ purposes even if LRU mutex is not being held. Keep the assertion
4610+ for now since all the callers hold it. */
4611+ ut_ad(mutex_own(buf_page_get_mutex(bpage))
4612+ || mutex_own(&buf_pool->LRU_list_mutex));
4613 ut_ad(buf_page_in_file(bpage));
4614
4615 return(bpage->old);
4616@@ -574,7 +595,7 @@
4617 buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4618 #endif /* UNIV_DEBUG */
4619 ut_a(buf_page_in_file(bpage));
4620- ut_ad(buf_pool_mutex_own(buf_pool));
4621+ ut_ad(mutex_own(&buf_pool->LRU_list_mutex));
4622 ut_ad(bpage->in_LRU_list);
4623
4624 #ifdef UNIV_LRU_DEBUG
4625@@ -619,11 +640,7 @@
4626 /*==================*/
4627 buf_page_t* bpage) /*!< in/out: control block */
4628 {
4629-#ifdef UNIV_DEBUG
4630- buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4631- ut_ad(!buf_pool_mutex_own(buf_pool));
4632 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
4633-#endif
4634 ut_a(buf_page_in_file(bpage));
4635
4636 if (!bpage->access_time) {
4637@@ -885,10 +902,6 @@
4638 /*===========*/
4639 buf_block_t* block) /*!< in, own: block to be freed */
4640 {
4641- buf_pool_t* buf_pool = buf_pool_from_bpage((buf_page_t*) block);
4642-
4643- buf_pool_mutex_enter(buf_pool);
4644-
4645 mutex_enter(&block->mutex);
4646
4647 ut_a(buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE);
4648@@ -896,8 +909,6 @@
4649 buf_LRU_block_free_non_file_page(block);
4650
4651 mutex_exit(&block->mutex);
4652-
4653- buf_pool_mutex_exit(buf_pool);
4654 }
4655 #endif /* !UNIV_HOTBACKUP */
4656
4657@@ -962,7 +973,7 @@
4658
4659 /********************************************************************//**
4660 Increments the modify clock of a frame by 1. The caller must (1) own the
4661-buf_pool mutex and block bufferfix count has to be zero, (2) or own an x-lock
4662+LRU list mutex and block bufferfix count has to be zero, (2) or own an x-lock
4663 on the block. */
4664 UNIV_INLINE
4665 void
4666@@ -973,7 +984,7 @@
4667 #ifdef UNIV_SYNC_DEBUG
4668 buf_pool_t* buf_pool = buf_pool_from_bpage((buf_page_t*) block);
4669
4670- ut_ad((buf_pool_mutex_own(buf_pool)
4671+ ut_ad((mutex_own(&buf_pool->LRU_list_mutex)
4672 && (block->page.buf_fix_count == 0))
4673 || rw_lock_own(&(block->lock), RW_LOCK_EXCLUSIVE));
4674 #endif /* UNIV_SYNC_DEBUG */
4675@@ -1371,39 +1382,6 @@
4676 sync_thread_add_level(&block->lock, level, FALSE);
4677 }
4678 #endif /* UNIV_SYNC_DEBUG */
4679-/********************************************************************//**
4680-Acquire mutex on all buffer pool instances. */
4681-UNIV_INLINE
4682-void
4683-buf_pool_mutex_enter_all(void)
4684-/*==========================*/
4685-{
4686- ulint i;
4687-
4688- for (i = 0; i < srv_buf_pool_instances; i++) {
4689- buf_pool_t* buf_pool;
4690-
4691- buf_pool = buf_pool_from_array(i);
4692- buf_pool_mutex_enter(buf_pool);
4693- }
4694-}
4695-
4696-/********************************************************************//**
4697-Release mutex on all buffer pool instances. */
4698-UNIV_INLINE
4699-void
4700-buf_pool_mutex_exit_all(void)
4701-/*=========================*/
4702-{
4703- ulint i;
4704-
4705- for (i = 0; i < srv_buf_pool_instances; i++) {
4706- buf_pool_t* buf_pool;
4707-
4708- buf_pool = buf_pool_from_array(i);
4709- buf_pool_mutex_exit(buf_pool);
4710- }
4711-}
4712 /*********************************************************************//**
4713 Get the nth chunk's buffer block in the specified buffer pool.
4714 @return the nth chunk's buffer block. */
4715@@ -1421,4 +1399,26 @@
4716 *chunk_size = chunk->size;
4717 return(chunk->blocks);
4718 }
4719+
4720+#ifdef UNIV_DEBUG
4721+/********************************************************************//**
4722+Checks if buf_pool->zip_mutex is owned and is serving for a given page as its
4723+block mutex.
4724+@return true if buf_pool->zip_mutex is owned. */
4725+UNIV_INLINE
4726+bool
4727+buf_own_zip_mutex_for_page(
4728+/*=======================*/
4729+ const buf_page_t* bpage)
4730+{
4731+ buf_pool_t* buf_pool = buf_pool_from_bpage(bpage);
4732+
4733+ ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_PAGE
4734+ || buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY);
4735+ ut_ad(buf_page_get_mutex(bpage) == &buf_pool->zip_mutex);
4736+
4737+ return(mutex_own(&buf_pool->zip_mutex));
4738+}
4739+#endif /* UNIV_DEBUG */
4740+
4741 #endif /* !UNIV_HOTBACKUP */
4742
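buf0buf.ic now provides both a checked accessor, buf_page_get_io_fix(), which asserts that the caller owns buf_page_get_mutex(), and unchecked variants (buf_page_get_io_fix_unlocked(), buf_block_get_io_fix_unlocked()) for places where a possibly stale value is acceptable. A sketch of the intended division of labour, assuming two hypothetical helpers that are not in the patch:

    /* Cheap, possibly stale answer: adequate for heuristics such as
    deciding whether an eviction attempt is worth starting at all. */
    static bool
    page_io_maybe_pending(
    /*==================*/
        const buf_page_t*   bpage)   /*!< in: control block */
    {
        return(buf_page_get_io_fix_unlocked(bpage) != BUF_IO_NONE);
    }

    /* Exact answer: the block mutex must be held, and
    buf_page_get_io_fix() asserts exactly that in debug builds. */
    static enum buf_io_fix
    page_io_state(
    /*==========*/
        const buf_page_t*   bpage)   /*!< in: control block */
    {
        ut_ad(mutex_own(buf_page_get_mutex(bpage)));

        return(buf_page_get_io_fix(bpage));
    }

For compressed-only pages the analogous debug check is the new buf_own_zip_mutex_for_page(), which also verifies that zip_mutex really is the block mutex serving that page.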
4743=== modified file 'Percona-Server/storage/innobase/include/buf0flu.h'
4744--- Percona-Server/storage/innobase/include/buf0flu.h 2013-08-16 09:11:51 +0000
4745+++ Percona-Server/storage/innobase/include/buf0flu.h 2013-09-20 05:29:11 +0000
4746@@ -37,7 +37,7 @@
4747 extern ibool buf_page_cleaner_is_active;
4748
4749 /********************************************************************//**
4750-Remove a block from the flush list of modified blocks. */
4751+Remove a block from the flush list of modified blocks. */
4752 UNIV_INTERN
4753 void
4754 buf_flush_remove(
4755@@ -75,9 +75,9 @@
4756 # if defined UNIV_DEBUG || defined UNIV_IBUF_DEBUG
4757 /********************************************************************//**
4758 Writes a flushable page asynchronously from the buffer pool to a file.
4759-NOTE: buf_pool->mutex and block->mutex must be held upon entering this
4760-function, and they will be released by this function after flushing.
4761-This is loosely based on buf_flush_batch() and buf_flush_page().
4762+NOTE: block->mutex must be held upon entering this function, and it will be
4763+released by this function after flushing. This is loosely based on
4764+buf_flush_batch() and buf_flush_page().
4765 @return TRUE if the page was flushed and the mutexes released */
4766 UNIV_INTERN
4767 ibool
4768@@ -232,9 +232,8 @@
4769 Writes a flushable page asynchronously from the buffer pool to a file.
4770 NOTE: in simulated aio we must call
4771 os_aio_simulated_wake_handler_threads after we have posted a batch of
4772-writes! NOTE: buf_pool->mutex and buf_page_get_mutex(bpage) must be
4773-held upon entering this function, and they will be released by this
4774-function. */
4775+writes! NOTE: buf_page_get_mutex(bpage) must be held upon entering this
4776+function, and it will be released by this function.
4777 UNIV_INTERN
4778 void
4779 buf_flush_page(
4780
4781=== modified file 'Percona-Server/storage/innobase/include/buf0flu.ic'
4782--- Percona-Server/storage/innobase/include/buf0flu.ic 2013-08-06 15:16:34 +0000
4783+++ Percona-Server/storage/innobase/include/buf0flu.ic 2013-09-20 05:29:11 +0000
4784@@ -69,7 +69,6 @@
4785 ut_ad(rw_lock_own(&(block->lock), RW_LOCK_EX));
4786 #endif /* UNIV_SYNC_DEBUG */
4787
4788- ut_ad(!buf_pool_mutex_own(buf_pool));
4789 ut_ad(!buf_flush_list_mutex_own(buf_pool));
4790 ut_ad(!mtr->made_dirty || log_flush_order_mutex_own());
4791
4792@@ -116,7 +115,6 @@
4793 ut_ad(rw_lock_own(&(block->lock), RW_LOCK_EX));
4794 #endif /* UNIV_SYNC_DEBUG */
4795
4796- ut_ad(!buf_pool_mutex_own(buf_pool));
4797 ut_ad(!buf_flush_list_mutex_own(buf_pool));
4798 ut_ad(log_flush_order_mutex_own());
4799
4800
4801=== modified file 'Percona-Server/storage/innobase/include/buf0lru.h'
4802--- Percona-Server/storage/innobase/include/buf0lru.h 2013-08-16 09:11:51 +0000
4803+++ Percona-Server/storage/innobase/include/buf0lru.h 2013-09-20 05:29:11 +0000
4804@@ -79,12 +79,14 @@
4805 Try to free a block. If bpage is a descriptor of a compressed-only
4806 page, the descriptor object will be freed as well.
4807
4808-NOTE: If this function returns true, it will temporarily
4809-release buf_pool->mutex. Furthermore, the page frame will no longer be
4810-accessible via bpage.
4811-
4812-The caller must hold buf_pool->mutex and must not hold any
4813-buf_page_get_mutex() when calling this function.
4814+NOTE: If this function returns true, it will release the LRU list mutex,
4815+and temporarily release and relock the buf_page_get_mutex() mutex.
4816+Furthermore, the page frame will no longer be accessible via bpage. If this
4817+function returns false, the buf_page_get_mutex() might be temporarily released
4818+and relocked too.
4819+
4820+The caller must hold the LRU list and buf_page_get_mutex() mutexes.
4821+
4822 @return true if freed, false otherwise. */
4823 UNIV_INTERN
4824 bool
4825@@ -291,7 +293,7 @@
4826 extern buf_LRU_stat_t buf_LRU_stat_cur;
4827
4828 /** Running sum of past values of buf_LRU_stat_cur.
4829-Updated by buf_LRU_stat_update(). Protected by buf_pool->mutex. */
4830+Updated by buf_LRU_stat_update(). */
4831 extern buf_LRU_stat_t buf_LRU_stat_sum;
4832
4833 /********************************************************************//**
4834
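The comment rewritten above gives buf_LRU_free_page() an asymmetric contract: the caller enters holding both the LRU list mutex and the block mutex; on a true return the LRU list mutex has been released and bpage is no longer usable, while the block mutex has been dropped and re-acquired; on a false return both mutexes are still held, though the block mutex may have been released and relocked in between. A caller sketch under those assumptions; try_evict_one() is hypothetical and the two-argument signature buf_LRU_free_page(bpage, zip) is assumed from the 5.6 code base:

    /* Hypothetical caller, not part of this patch. */
    static bool
    try_evict_one(
    /*==========*/
        buf_pool_t*   buf_pool,   /*!< in: buffer pool instance */
        buf_page_t*   bpage)      /*!< in: candidate page */
    {
        ib_mutex_t*   block_mutex = buf_page_get_mutex(bpage);

        ut_ad(mutex_own(&buf_pool->LRU_list_mutex));

        mutex_enter(block_mutex);

        if (buf_LRU_free_page(bpage, true)) {
            /* Freed: the LRU list mutex was released inside the
            call and bpage must not be dereferenced any more.  The
            block mutex is held again and is released here. */
            mutex_exit(block_mutex);

            return(true);
        }

        /* Not freed: both mutexes are still held; release only the
        block mutex so the caller can keep scanning the LRU list. */
        mutex_exit(block_mutex);

        return(false);
    }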
4835=== modified file 'Percona-Server/storage/innobase/include/sync0sync.h'
4836--- Percona-Server/storage/innobase/include/sync0sync.h 2013-08-06 15:16:34 +0000
4837+++ Percona-Server/storage/innobase/include/sync0sync.h 2013-09-20 05:29:11 +0000
4838@@ -71,6 +71,11 @@
4839 extern mysql_pfs_key_t buffer_block_mutex_key;
4840 extern mysql_pfs_key_t buf_pool_mutex_key;
4841 extern mysql_pfs_key_t buf_pool_zip_mutex_key;
4842+extern mysql_pfs_key_t buf_pool_LRU_list_mutex_key;
4843+extern mysql_pfs_key_t buf_pool_free_list_mutex_key;
4844+extern mysql_pfs_key_t buf_pool_zip_free_mutex_key;
4845+extern mysql_pfs_key_t buf_pool_zip_hash_mutex_key;
4846+extern mysql_pfs_key_t buf_pool_flush_state_mutex_key;
4847 extern mysql_pfs_key_t cache_last_read_mutex_key;
4848 extern mysql_pfs_key_t dict_foreign_err_mutex_key;
4849 extern mysql_pfs_key_t dict_sys_mutex_key;
4850@@ -632,7 +637,7 @@
4851 Search system mutex
4852 |
4853 V
4854-Buffer pool mutex
4855+Buffer pool mutexes
4856 |
4857 V
4858 Log mutex
4859@@ -723,11 +728,15 @@
4860 SYNC_SEARCH_SYS, as memory allocation
4861 can call routines there! Otherwise
4862 the level is SYNC_MEM_HASH. */
4863-#define SYNC_BUF_POOL 150 /* Buffer pool mutex */
4864+#define SYNC_BUF_LRU_LIST 151
4865 #define SYNC_BUF_PAGE_HASH 149 /* buf_pool->page_hash rw_lock */
4866 #define SYNC_BUF_BLOCK 146 /* Block mutex */
4867-#define SYNC_BUF_FLUSH_LIST 145 /* Buffer flush list mutex */
4868-#define SYNC_DOUBLEWRITE 140
4869+#define SYNC_BUF_FREE_LIST 145
4870+#define SYNC_BUF_ZIP_FREE 144
4871+#define SYNC_BUF_ZIP_HASH 143
4872+#define SYNC_BUF_FLUSH_STATE 142
4873+#define SYNC_BUF_FLUSH_LIST 141 /* Buffer flush list mutex */
4874+#define SYNC_DOUBLEWRITE 139
4875 #define SYNC_ANY_LATCH 135
4876 #define SYNC_MEM_HASH 131
4877 #define SYNC_MEM_POOL 130
4878
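The single SYNC_BUF_POOL level is replaced by one level per new mutex, slotted between the page hash and doublewrite levels so the existing UNIV_SYNC_DEBUG order checker keeps working: LRU list (151) above page hash (149), block (146), free list (145), zip free (144), zip hash (143), flush state (142) and flush list (141). Latches must be acquired in descending level order, so a thread already holding the LRU list mutex may still take a block mutex and then the free list mutex, but never the reverse. A schematic sketch of a legal sequence; relink_lru_block() is hypothetical and only illustrates the ordering:

    /* Hypothetical function, not part of this patch: shown only to
    illustrate an acquisition order that the new levels allow. */
    static void
    relink_lru_block(
    /*=============*/
        buf_pool_t*    buf_pool,   /*!< in: buffer pool instance */
        buf_block_t*   block)      /*!< in: block to work on */
    {
        /* SYNC_BUF_LRU_LIST (151) first ... */
        mutex_enter(&buf_pool->LRU_list_mutex);

        /* ... then SYNC_BUF_BLOCK (146) ... */
        mutex_enter(&block->mutex);

        /* ... then SYNC_BUF_FREE_LIST (145).  Taking these in the
        opposite order would fire the sync_thread_levels_g() check
        in sync0sync.cc. */
        mutex_enter(&buf_pool->free_list_mutex);

        mutex_exit(&buf_pool->free_list_mutex);
        mutex_exit(&block->mutex);
        mutex_exit(&buf_pool->LRU_list_mutex);
    }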
4879=== modified file 'Percona-Server/storage/innobase/sync/sync0sync.cc'
4880--- Percona-Server/storage/innobase/sync/sync0sync.cc 2013-09-02 10:01:38 +0000
4881+++ Percona-Server/storage/innobase/sync/sync0sync.cc 2013-09-20 05:29:11 +0000
4882@@ -1201,7 +1201,11 @@
4883 /* fallthrough */
4884 }
4885 case SYNC_BUF_FLUSH_LIST:
4886- case SYNC_BUF_POOL:
4887+ case SYNC_BUF_LRU_LIST:
4888+ case SYNC_BUF_FREE_LIST:
4889+ case SYNC_BUF_ZIP_FREE:
4890+ case SYNC_BUF_ZIP_HASH:
4891+ case SYNC_BUF_FLUSH_STATE:
4892 /* We can have multiple mutexes of this type therefore we
4893 can only check whether the greater than condition holds. */
4894 if (!sync_thread_levels_g(array, level-1, TRUE)) {
4895@@ -1215,17 +1219,12 @@
4896
4897 case SYNC_BUF_PAGE_HASH:
4898 /* Multiple page_hash locks are only allowed during
4899- buf_validate and that is where buf_pool mutex is already
4900- held. */
4901+ buf_validate. */
4902 /* Fall through */
4903
4904 case SYNC_BUF_BLOCK:
4905- /* Either the thread must own the buffer pool mutex
4906- (buf_pool->mutex), or it is allowed to latch only ONE
4907- buffer block (block->mutex or buf_pool->zip_mutex). */
4908 if (!sync_thread_levels_g(array, level, FALSE)) {
4909 ut_a(sync_thread_levels_g(array, level - 1, TRUE));
4910- ut_a(sync_thread_levels_contain(array, SYNC_BUF_POOL));
4911 }
4912 break;
4913 case SYNC_REC_LOCK:
