Hi Laurynas,

On Wed, 11 Sep 2013 09:45:13 -0000, Laurynas Biveinis wrote:

> Alexey -
>
>> - the code in btr_blob_free() can be simplified
>
> Simplified.
>
>> - wrong comments for buf_LRU_free_page():
>>   s/block mutex/buf_page_get_mutex() mutex/
>
> Edited. But there are numerous other places in this patch (and
> upstream) that would need this editing too, and "block mutex" is
> already an established shorthand for "really a block mutex or
> buf_pool->zip_mutex". Not to mention pointer-to-mutex variables named
> block_mutex.

I think editing the existing comments is worth the effort (and the
potential extra maintenance cost in the future). I would be OK if this
specific comment was left intact too. It only caught my eye because the
comment was edited and I spent some time verifying it.

> Do you want me to edit the other places too?
>
>> - the comments for buf_LRU_free_page() say that both LRU_list_mutex
>>   and block_mutex may be released temporarily if ‘true’ is
>>   returned. But:
>>   1) even if ‘false’ is returned, block_mutex may also be released
>>      temporarily
>>   2) the comments don’t mention that if ‘true’ is returned,
>>      LRU_list_mutex is always released upon return, but block_mutex
>>      is always locked. And callers of buf_LRU_free_page() rely on
>>      that.
>
> Indeed callers rely on the current, arguably messy,
> buf_LRU_free_page() locking. This is how I edited the header comment
> for this and the previous review comment:
>
> /******************************************************************//**
> Try to free a block. If bpage is a descriptor of a compressed-only
> page, the descriptor object will be freed as well.
>
> NOTE: If this function returns true, it will release the LRU list
> mutex, and temporarily release and relock the buf_page_get_mutex()
> mutex. Furthermore, the page frame will no longer be accessible via
> bpage. If this function returns false, the buf_page_get_mutex()
> might be temporarily released and relocked too.
>
> The caller must hold the LRU list and buf_page_get_mutex() mutexes.
>
> @return true if freed, false otherwise. */

Looks good.

>> - the patch removes buf_pool_mutex_enter_all() from
>>   btr_search_validate_one_table(), but then does a number of dirty
>>   reads from ‘block’ before it locks block->mutex. Any reasons to
>>   not lock block->mutex earlier?
>
> I *think* there were actual reasons, but I cannot remember them, due
> to the number of things going on with this patch. And I don't see why
> locking block->mutex earlier is not possible now. I will look
> further.

OK.

>> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think
>>   we need that stuff. Their names are misleading as they don’t
>>   enforce any atomicity.
>
> The ops being load and store, their atomicity is enforced by the data
> type width.

Right, the atomicity is enforced by the data type width on those
architectures that provide it. And even the architectures that do
provide it come with a number of prerequisites. Neither of those two
facts is taken care of in os_atomic_load_ulint() /
os_atomic_store_ulint(). So with respect to atomicity they are no
different from a plain load/store of a ulint, and thus have misleading
names.

So to justify "atomic" in their names those functions should either:

- (if we want to be portable) protect those loads/stores with a mutex,
  as sketched below, or
- (if we only care about x86/x86_64) make sure that the values being
  loaded/stored do not cross cache lines or page boundaries, which is
  of course impossible to guarantee in a generic function.
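To make the first, portable option concrete, here is a rough
standalone sketch (plain C with pthreads rather than InnoDB's mutex
API; the mutex and function names are invented for the example and are
not from the patch):

#include <pthread.h>

typedef unsigned long ulint;

static pthread_mutex_t load_store_mutex = PTHREAD_MUTEX_INITIALIZER;

/* A load that actually deserves "atomic" in its name on any platform:
readers and writers serialize on the same mutex, so a half-written
value can never be observed. */
static ulint
locked_load_ulint(const ulint* ptr)
{
	ulint	val;

	pthread_mutex_lock(&load_store_mutex);
	val = *ptr;
	pthread_mutex_unlock(&load_store_mutex);

	return val;
}

static void
locked_store_ulint(ulint* ptr, ulint val)
{
	pthread_mutex_lock(&load_store_mutex);
	*ptr = val;
	pthread_mutex_unlock(&load_store_mutex);
}

The price is obviously a mutex round-trip per access, which is why I
would rather fold the affected fields under an existing or dedicated
mutex than keep a misleadingly named generic wrapper.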
>> They should be named os_ordered_load_ulint() /
>> os_ordered_store_ulint(),
>
> That's an option, but I needed atomicity, visibility, and ordering,
> and chose atomic for the function name to match the existing CAS and
> atomic add operations, which also need all three.

I'm not sure you need all three of those in every occurrence of those
functions, but see below.

>> but... what specific order are you trying to enforce with those
>> constructs?
>
> In buf_read_recv_pages() and buf_read_ibuf_page()
>
> while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
>        > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
>         os_thread_sleep(500000);
> }
>
> it is necessary to have a load barrier (compiler only for x86/x86_64,
> potentially a CPU barrier too elsewhere for visibility and ordering).

One needs compiler/CPU barriers when the code, after loading a value
from one location, expects values from other locations to be in a
certain state (in other words, when it assumes certain cross-location
visibility). What assumptions about other value states are made in the
above code after reading buf_pool->n_pend_reads?

It makes even less sense in other occurrences of
os_atomic_load_ulint(), e.g. in buf_get_total_stat(). I disassembled
the code generated for buf_get_total_stat() by GCC with compiler
barriers both enabled and disabled. With barriers disabled the code
corresponding to "tot_stat->n_pages_written +=
os_atomic_load_ulint(&buf_stat->n_pages_written)" is:

# Load buf_stat->n_pages_written and add it to the tot_stat value in
# a register
add    -0x30(%r13,%rdx,1),%r14
# Store the computed value to the tot_stat field
mov    %r14,0x10(%rdi)

With barriers enabled the same code is:

# Load the current tot_stat field into a register
mov    0x10(%rdi),%rax
# Add the value of buf_stat->n_pages_written
add    -0x30(%r8,%rcx,1),%rax
# Store the computed value back to the tot_stat field
mov    %rax,0x10(%rdi)

So what has the compiler barrier achieved here, other than preventing
the compiler from keeping the accumulated tot_stat value in a register
and thus producing less efficient code?

> In buf_LRU_get_free_block() there is a loop, which can be executed by
> multiple query threads concurrently, and it both loads and stores
> buf_pool->try_LRU_scan in it, which requires immediate visibility to
> any concurrent threads, so again it is necessary to have the load and
> store barriers.

OK, compiler barriers force the compiler to perform _multiple_
load/store ops in the desired order, and CPU barriers impose the same
requirements on the CPU. Which means they only matter for
cross-location visibility, but have nothing to do with the visibility
of a single value. Aren't you reinventing volatile here? But taking the
atomicity considerations into account, I'm fairly sure it's much better
to protect those values with a separate mutex and be done with it.
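To spell out the volatile remark, here is a tiny standalone comparison
(plain C with a GCC-style barrier; I am assuming the x86/x86_64
implementation of os_atomic_load_ulint() boils down to a compiler
barrier plus an ordinary load, which is what the disassembly above
suggests):

typedef unsigned long ulint;

/* Roughly what the barrier-wrapped load amounts to on x86/x86_64: the
barrier forces a fresh load from memory and orders it against other
compiler-visible accesses, nothing more. */
static inline ulint
barrier_load_ulint(const ulint* ptr)
{
	__asm__ __volatile__("" ::: "memory");	/* GCC compiler barrier */
	return *ptr;				/* plain load, not made atomic */
}

/* Much the same single-value effect, spelled as a volatile access: the
compiler must emit a real load here and may not cache the value. */
static inline ulint
volatile_load_ulint(const ulint* ptr)
{
	return *(const volatile ulint*) ptr;
}

Neither variant buys atomicity or CPU-level visibility beyond what an
aligned ulint already has, which is why I keep coming back to the
mutex option.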
>> - I don’t see a point in maintaining multiple list nodes in
>>   buf_page_t (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I
>>   understand, each page may only be in a single list at any point in
>>   time, so splitting the list node is purely cosmetic.
>>
>>   On the other hand, we are looking at a non-trivial buf_page_t size
>>   increase (112 bytes before the patch, 144 bytes after). Leaving
>>   all cache and memory locality questions aside, that’s 64 MB of
>>   memory just for list node pointers on a system with a 32 GB buffer
>>   pool.
>>
>>   What’s the “optimistic use” mentioned in the comment for those
>>   nodes in buf_page_t?
>
> These are my concerns too. I don't address them in the current MP as
> these bits date all the way to XtraDB 5.1. Thus I think it's best
> addressed in a follow-up that is common with 5.5.

I'm not sure it should be addressed separately. It fits rather well
into the other bpsplit cleanups you have already done. As in, it's
about removing unnecessary code, not about adding new
code/functionality on top of bpsplit in 5.5. The cleanup might be done
for 5.5 as well if we had unlimited resources. But as my cleanups in
AHI partitioning have shown, implementing, merging and reviewing
things like this separately is more expensive in terms of man-hours.

>> - the following hunk simply removes the reference to
>>   buf_pool->mutex. As I understand, it should just replace
>>   buf_pool->mutex with zip_free_mutex?
>>
>> -	page_zip_des_t	zip;	/*!< compressed page; zip.data
>> -					(but not the data it points to) is
>> -					also protected by buf_pool->mutex;
>> -					state == BUF_BLOCK_ZIP_PAGE and
>> -					zip.data == NULL means an active
>> +	page_zip_des_t	zip;	/*!< compressed page; state
>> +					== BUF_BLOCK_ZIP_PAGE and zip.data
>> +					== NULL means an active
>
> Hm, it looked to me that it's protected not with zip_free_mutex but
> with zip_mutex in its page mutex capacity. I will check.

There was a place in the code that asserted zip_free_mutex locked when
bpage->zip.data is modified. But I'm not sure if that is correct.

>> - I’m not sure creating a separate mutex per n_flush[]/init_flush[]
>>   element is worth the effort. Those mutexes are basically locked to
>>   read/update a few memory locations on code paths that don’t look
>>   critical. I don’t think there’s any measurable effect from split
>>   mutexes (and it can potentially even have a negative impact in
>>   cases where we need to lock them all). Let’s just have a single
>>   buf_pool->flush_state_mutex.
>
> First, the effort is actually very little, the per-flush-type split
> is very natural. The cases of needing to lock them all are:
> 1) buf_pool_validate_instance() locks them all at once,
> 2) buf_stats_get_pool_info() locks them sequentially,
> 3) buf_pool_check_pending_io() does so sequentially too.
> 1) is debug-only code, 2) should not be called with such frequency
> that it becomes a problem, 3) is startup/shutdown code only.

Well, having two non-debug sites that need to lock all of those
mutexes sequentially is probably a hint that splitting them is
overkill? Note that a mutex acquisition/release may be an expensive
operation in itself. So, for example, depending on many factors, 3
sequential mutex_enter() calls may be much more expensive than a
single one, which may be especially visible if the critical section
itself is rather short.

> Second, on how I arrived at the current mutexes. The initial patch
> version did not have the flush_state_mutex array, and it kept the
> remaining buf_pool->mutex just like the current 5.5 version does. But
> we observed contention on this remaining mutex. Then I converted all
> the variables protected by it (stats, init_flush[], n_flush[],
> try_LRU_scan, maybe something else I forgot) to atomic variables, and
> this is the current state of things on the experimental branches. But
> while working on the current MP I worked out that it is impossible
> for init_flush and n_flush to be independent atomic variables, and if
> one wants to observe a consistent buffer pool state snapshot, as in
> buf_pool_validate_instance(), then some of the buffer page I/O fix,
> flush state and flush list presence transitions are not independent
> either. Hence the mutex, and since the per-flush-type split appeared
> to be very natural, the mutex array.

I see, thanks for the clarifications.

So the question is: have you considered the option of protecting just
init_flush and n_flush with a single mutex? This looks like a good
trade-off between two extremes: 1) protecting them with an overused
mutex that also guards a bunch of other things; 2) having individual
mutexes per flush type (with the implied performance overhead from
mutex_enter()/mutex_exit() and context switches, plus the code
maintenance overhead).
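For illustration, here is a rough standalone sketch of that middle
option (plain C with pthreads rather than InnoDB's mutex API; the
enum, struct and function names are invented for the example and only
mirror the real buf_pool fields):

#include <pthread.h>

typedef unsigned long ulint;

enum flush_type { FLUSH_LRU = 0, FLUSH_LIST, FLUSH_SINGLE_PAGE,
		  FLUSH_N_TYPES };

struct flush_state {
	pthread_mutex_t	mutex;	/* the single flush_state_mutex,
				initialized with pthread_mutex_init()
				at buffer pool creation; protects both
				arrays below */
	int		init_flush[FLUSH_N_TYPES];
	ulint		n_flush[FLUSH_N_TYPES];
};

/* A consistent snapshot across all flush types costs one mutex
acquisition instead of one per flush type. */
static ulint
flush_state_total_pending(struct flush_state* fs)
{
	ulint	total = 0;
	int	i;

	pthread_mutex_lock(&fs->mutex);
	for (i = 0; i < FLUSH_N_TYPES; i++) {
		total += fs->n_flush[i];
	}
	pthread_mutex_unlock(&fs->mutex);

	return total;
}

With something like this, the two non-debug callers above would pay
for one mutex acquisition per snapshot instead of three, while
init_flush[] and n_flush[] still change together under the same mutex.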
> Do you want me to repush the branch as I address the comments or
> wait for a full first review pass first?

As you wish. Repushing would probably be slightly better, but I'm OK
either way.