> >> - wrong comments for buf_LRU_free_page(): s/block
> >> mutex/buf_page_get_mutex() mutex/
> >
> > Edited. But there are numerous other places in this patch (and
> > upstream) that would need this editing too, and "block mutex" is
> > already an established shorthand for "really a block mutex or
> > buf_pool->zip_mutex", not to mention the pointer-to-mutex variables
> > named block_mutex.
> >
> > I think editing the existing comments is not worth the effort (and
> > the potential extra maintenance cost in the future).
>
> I would be OK if this specific comment were left intact too. It only
> caught my eye because the comment was edited and I spent some time
> verifying it.

Fair enough. I have applied the same edit to the other comments changed
by this patch.

> >> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don't think
> >> we need that stuff. Their names are misleading as they don't
> >> enforce any atomicity.
> >
> > The ops being load and store, their atomicity is enforced by the
> > data type width.
>
> Right, the atomicity is enforced by the data type width on those
> architectures that provide it.

I forgot to mention that the values also have to be properly aligned, so
that one access does not translate into two accesses.

> And even those that do provide it have a number of prerequisites.
> Neither of those two facts is taken care of in os_atomic_load_ulint()
> / os_atomic_store_ulint(). So they are no different with respect to
> atomicity from a plain load/store of a ulint and thus have misleading
> names.
>
> So, to justify "atomic" in their names, those functions should:
>
> - (if we want to be portable) protect those loads/stores with a mutex

Why? I guess this question boils down to: what would the mutex
implementation additionally ensure here, say, on x86_64? Or is this
referring to the 5.6 mutex fallbacks that are used when no atomic ops
are implemented for a platform?

> - (if we only care about x86/x86_64) make sure that the values being
> loaded/stored do not cross cache lines or page boundaries. Which is of
> course impossible to guarantee in a generic function.

Why? We are talking only about ulints here, and I was not able to find
such requirements in the x86_64 memory model descriptions. There is a
requirement to be aligned, and misaligned stores/loads might indeed
cross cache line or page boundaries, and anything that crosses them is
indeed non-atomic. But alignment is possible to guarantee in a generic
function (which does not even have to be generic: the x86_64
implementation is for x86_64 only, obviously).

Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A:
System Programming Guide, Part 1, section 8.1.1,
http://download.intel.com/products/processor/manual/253668.pdf:

"The Intel486 processor (and newer processors since) guarantees that the
following basic memory operations will always be carried out atomically:
(...)
• Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (...):
• Reading or writing a quadword aligned on a 64-bit boundary"

My understanding of the above is that os_atomic_load_ulint() /
os_atomic_store_ulint() fit that description, modulo alignment issues,
if any. Alignment is easy to assert with ut_ad(); see the sketch below.
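
Roughly what I have in mind, as a standalone illustration of the
argument rather than the patch code; assert() stands in for ut_ad() and
the barrier is GCC-style inline asm:

    #include <assert.h>
    #include <stdint.h>

    typedef unsigned long ulint;

    /* Compiler-only barrier: on x86_64 no CPU fence is needed for plain
    aligned loads/stores, and cache coherency provides the visibility. */
    #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

    /* Atomic only because the value is word-sized AND naturally aligned;
    a misaligned ulint could straddle a cache line or page boundary and
    be split into two accesses, hence the assertion. */
    inline ulint
    atomic_load_ulint(const ulint* ptr)
    {
        assert(((uintptr_t) ptr) % sizeof(ulint) == 0);

        COMPILER_BARRIER();   /* force a fresh load from memory */
        return *ptr;
    }

    inline void
    atomic_store_ulint(ulint* ptr, ulint val)
    {
        assert(((uintptr_t) ptr) % sizeof(ulint) == 0);

        *ptr = val;
        COMPILER_BARRIER();   /* keep the compiler from sinking the store */
    }

In the real code the assert would of course be a ut_ad(), and any ulint
that is an ordinary (non-packed) struct member should already be
naturally aligned, so the check only documents and enforces the
precondition.
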
> >> They should be named os_ordered_load_ulint() /
> >> os_ordered_store_ulint(),
> >
> > That's an option, but I needed atomicity, visibility, and ordering,
> > and chose "atomic" for the function names to match the existing CAS
> > and atomic add operations, which also need all three.
>
> I'm not sure you need all of those three in every occurrence of those
> functions, but see below.

That's right. And ordering is probably not needed anywhere; sorry about
that, my understanding of atomics is far from fluent. But visibility
should be needed in every occurrence if this is ever ported to a
non-cache-coherent architecture, or if some non-temporal store source
code optimization is implemented on x86_64.

> >> but... what specific order are you trying
> >> to enforce with those constructs?
> >
> > In buf_read_recv_pages() and buf_read_ibuf_page(),
> >
> >     while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
> >            > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
> >             os_thread_sleep(500000);
> >     }
> >
> > it is necessary to have a load barrier (compiler-only for
> > x86/x86_64, potentially a CPU barrier too elsewhere for visibility
> > and ordering).
>
> One needs compiler/CPU barriers when the code, after loading a value
> from one location, expects values from other locations to be in a
> certain state (in other words, assumes certain cross-location
> visibility).

That is ordering, yes, but I don't think that is an exhaustive list of
when barriers are needed. Here a compiler barrier is needed to prevent
the hoisting of the load above the loop, which would turn its body into
an infinite loop.

> What assumptions about other value states are made in the above code
> after reading buf_pool->n_pend_reads?

None; thus this question does not apply.

> It makes even less sense in other occurrences of
> os_atomic_load_ulint(), e.g. in buf_get_total_stat(). I disassembled
> the code generated for buf_get_total_stat() by GCC with compiler
> barriers both enabled and disabled.
>
> With barriers disabled the code corresponding to
> "tot_stat->n_pages_written +=
> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
>
>     # Load buf_stat->n_pages_written and add it to the tot_stat value
>     # in a register
>     add    -0x30(%r13,%rdx,1),%r14
>     # Store the computed value to the tot_stat field
>     mov    %r14,0x10(%rdi)
>
> With barriers enabled the same code is:
>
>     # Load the current tot_stat field to a register
>     mov    0x10(%rdi),%rax
>     # Add the value of buf_stat->n_pages_written
>     add    -0x30(%r8,%rcx,1),%rax
>     # Store the computed value back to the tot_stat field
>     mov    %rax,0x10(%rdi)

I'm mildly curious how this would compare to a volatile n_pages_written,
too.

> So what has been achieved by a compiler barrier here except the fact
> that the compiler avoids storing buf_stat->n_pages_written to a
> register and thus produces less efficient code?

The load of buf_stat->n_pages_written from memory is forced to happen
next to where the value is used. On non-cache-coherent architectures a
correct os_atomic_load_ulint() implementation would additionally ensure
that the value loaded is the latest value written by any other thread.
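
To make the hoisting point above concrete, here is a tiny standalone
model of the buf_read_recv_pages() wait loop, with made-up names and the
same GCC-style barrier (the real loop calls os_thread_sleep() in the
body):

    /* Model of buf_pool->n_pend_reads: decremented elsewhere, by the
    I/O completion path, which this translation unit knows nothing
    about. */
    static unsigned long n_pend_reads_model;

    #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

    void
    wait_for_pending_reads(unsigned long limit)
    {
        while (n_pend_reads_model > limit) {
            /* Without the barrier (or some call the compiler cannot see
            through) nothing stops the compiler from loading the counter
            once, keeping it in a register and turning this loop into an
            infinite one. The barrier forces a fresh load from memory on
            every iteration. */
            COMPILER_BARRIER();
        }
    }

Strictly speaking a racing plain load like this is still a data race in
C11/C++11 terms; the model only shows the code generation issue
discussed above, which is what the compiler barrier addresses.
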
> > In buf_LRU_get_free_block() there is a loop which can be executed by
> > multiple query threads concurrently, and it both loads and stores
> > buf_pool->try_LRU_scan, which requires immediate visibility to any
> > concurrent threads; again it is necessary to have the load and store
> > barriers.
>
> OK, compiler barriers force the compiler to perform _multiple_
> load/store ops in the desired order. And CPU barriers impose the same
> requirements on the CPU. Which means they only matter for
> cross-location visibility, but have nothing to do with single-value
> visibility.

By "multiple" above I believe you are again referring to the ordering of
accesses to different memory locations. And, as above, compiler barriers
are required for single-location accesses too, and CPU barriers would be
as well. Without a compiler barrier there is nothing that prevents the
compiler from compiling buf_LRU_get_free_block() into

    register = try_LRU_scan
    ...loop that sets and accesses try_LRU_scan through the register...
    try_LRU_scan = register

> Aren't you reinventing volatile here?

Volatile is not enough: on a non-cache-coherent architecture volatile
would force a load/store to be performed, yes, but it would not
necessarily make a store visible elsewhere, nor would a load necessarily
be performed in a cache-coherent manner.

> But taking the atomicity considerations into account, I'm fairly sure
> it's much better to protect those values with a separate mutex and be
> done with it.

IMHO the jury is still out.

> >> - I don't see a point in maintaining multiple list nodes in
> >> buf_page_t (i.e. 'free', 'flush_list' and 'zip_list'). As I
> >> understand, each page may only be in a single list at any point in
> >> time, so splitting the list node is purely cosmetic. ...
> >
> > These are my concerns too. I don't address them in the current MP as
> > these bits date all the way back to XtraDB 5.1. Thus I think this is
> > best addressed in a follow-up that is common with 5.5.
>
> I'm not sure it should be addressed separately. It fits rather well
> into the other bpsplit cleanups you have already done. As in, it's
> about removing unnecessary code, not about adding new
> code/functionality on top of bpsplit in 5.5.

OK, I will do this and log a Wishlist bug to backport it to 5.5.

> The cleanup might be done for 5.5 as well if we had unlimited
> resources. But as my cleanups in AHI partitioning have shown,
> implementing, merging, and reviewing things like this separately is
> more expensive in terms of man-hours.

The AHI example is a surprise to me. I see the additional time needed
for separate testing runs, but I personally always prefer to do common
things across the versions in common, and separate things separately.
This probably depends on the person.

> >> - the following hunk simply removes the reference to
> >> buf_pool->mutex. As I understand, it should just replace
> >> buf_pool->mutex with zip_free_mutex?
> >>
> >> +    page_zip_des_t  zip;        /*!< compressed page; state
> >> +                                == BUF_BLOCK_ZIP_PAGE and zip.data
> >> +                                == NULL means an active
> >
> > Hm, it looked to me like it is protected not with zip_free_mutex but
> > with zip_mutex in its page mutex capacity. I will check.
>
> There was a place in the code that asserted zip_free_mutex locked when
> bpage->zip.data is modified. But I'm not sure if that is correct.

I have checked, and I believe it is indeed zip_mutex. Re. zip_free_mutex,
you must be referring to this bit in buf_buddy_relocate():

    mutex_enter(&buf_pool->zip_free_mutex);
    if (buf_page_can_relocate(bpage)) {
        ...
        bpage->zip.data = (page_zip_t*) dst;
        mutex_exit(mutex);
        buf_buddy_stat_t* buddy_stat = &buf_pool->buddy_stat[i];
        buddy_stat->relocated++;
        buddy_stat->relocated_usec += ut_time_us(NULL) - usec;
    ...

Here zip_free_mutex happens to protect buddy_stat, and pushing it down
into the if clause would require an else clause to appear that locks the
same mutex, along the lines of the sketch below.
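
Roughly like this; a toy model rather than the real function, with
std::mutex and invented names standing in for the InnoDB ones, and
assuming (which is how I read the surrounding code) that the work after
the if clause also runs under zip_free_mutex:

    #include <mutex>

    struct buddy_stat_model { unsigned long relocated = 0; };

    static std::mutex       zip_free_mutex;   /* buf_pool->zip_free_mutex */
    static buddy_stat_model buddy_stat;       /* buf_pool->buddy_stat[i]  */

    /* Current shape: one lock taken before the branch covers both the
    buddy_stat update inside the if clause and the work after it. */
    bool
    relocate_model(bool can_relocate)
    {
        bool success = false;

        zip_free_mutex.lock();

        if (can_relocate) {
            /* ... the relocation itself ... */
            buddy_stat.relocated++;        /* needs zip_free_mutex */
            success = true;
        }
        /* If the lock() above were pushed down into the if clause, an
        else clause acquiring the same mutex would have to appear,
        because the code below still needs it held. */

        /* ... further work under zip_free_mutex ... */

        zip_free_mutex.unlock();
        return success;
    }
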
> >> - I'm not sure creating a separate mutex per each
> >> n_flush[]/init_flush[] element is worth the effort. Those mutexes
> >> are basically locked to read/update a few memory locations on code
> >> paths that don't look critical. I don't think there's any
> >> measurable effect from split mutexes (and it can potentially even
> >> have a negative impact in cases where we need to lock them all).
> >> Let's just have a single buf_pool->flush_state_mutex.
> >
> > First, the effort is actually very little: the per-flush-type split
> > is very natural. The cases that need to lock them all are 1)
> > buf_pool_validate_instance(), which locks them all at once, 2)
> > buf_stats_get_pool_info(), which locks them sequentially, and 3)
> > buf_pool_check_pending_io(), sequentially too. 1) is debug-only
> > code, 2) should not be called with such frequency that it becomes a
> > problem, and 3) is startup/shutdown code only.
>
> Well, two non-debug sites that need to lock all of those mutexes
> sequentially are probably a hint that splitting them is overkill?
> Note that a mutex acquisition/release may be an expensive operation in
> itself. So, for example, depending on many factors, three sequential
> mutex_enter() calls may be much more expensive than a single one,
> which may be especially visible if the critical section itself is
> rather short.

OK, I will convert it to a single mutex, roughly as sketched below. For
the record, I don't see big advantages for either way over the other,
but let's take the mild advantage of a single mutex.
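
For reference, the shape I have in mind after the conversion; a
standalone model with invented types (the real code would use the InnoDB
mutex API and the buf_flush_t enum), showing that a
buf_pool_check_pending_io()-style reader then takes one lock instead of
three:

    #include <mutex>

    enum buf_flush_model {
        BUF_FLUSH_LRU = 0,
        BUF_FLUSH_LIST,
        BUF_FLUSH_SINGLE_PAGE,
        BUF_FLUSH_N_TYPES
    };

    struct buf_pool_model {
        std::mutex     flush_state_mutex;  /* protects the arrays below */
        bool           init_flush[BUF_FLUSH_N_TYPES];
        unsigned long  n_flush[BUF_FLUSH_N_TYPES];
    };

    /* One enter/exit pair covers all flush types, instead of three
    sequential enter/exit pairs with the per-type mutexes. */
    unsigned long
    pending_flush_io(buf_pool_model* buf_pool)
    {
        std::lock_guard<std::mutex> guard(buf_pool->flush_state_mutex);

        return buf_pool->n_flush[BUF_FLUSH_LRU]
             + buf_pool->n_flush[BUF_FLUSH_LIST]
             + buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
    }
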