Hi Laurynas,

On Wed, 11 Sep 2013 15:29:33 -0000, Laurynas Biveinis wrote:

>>>> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>>>> need that stuff. Their names are misleading as they don’t enforce
>>>> any atomicity.
>>>
>>> The ops being load and store, their atomicity is enforced by the data
>>> type width.
>>
>> Right, the atomicity is enforced by the data type width on those
>> architectures that provide it.
>
> I forgot to mention that they also have to be not misaligned, so that one
> access does not translate to two accesses.

Yes, but alignment does not guarantee atomicity, see below.

>> And even those that do provide it have a number of prerequisites.
>> Neither of those 2 facts is taken care of in os_atomic_load_ulint() /
>> os_atomic_store_ulint(). So they are no different with respect to
>> atomicity from a plain load/store of a ulint and thus have misleading
>> names.
>>
>> So to justify "atomic" in their names those functions should:
>>
>> - (if we want to be portable) protect those loads/stores with a mutex
>
> Why? I guess this question boils down to: what would the mutex
> implementation code additionally ensure here, let's say, on x86_64? Or is
> this referring to the 5.6 mutex fallbacks when no atomic ops are
> implemented for a platform?

A mutex is the only portable way to ensure atomicity. You can use atomic
primitives provided by specific architectures, but then you either limit
support to those architectures or, yes, provide a mutex fallback.

>> - (if we only care about x86/x86_64) make sure that values being
>> loaded/stored do not cross cache lines or page boundaries. Which is of
>> course impossible to guarantee in a generic function.
>
> Why? We are talking about ulints here only, and I was not able to find
> such requirements in the x86_64 memory model descriptions.
There is a requirement to be aligned, and misaligned stores/loads might
indeed cross cache line or page boundaries, and anything that crosses them
is indeed non-atomic. But alignment is possible to guarantee in a generic
function (which doesn't even have to be generic: the x86_64 implementation
is for x86_64 only, obviously).

> Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A:
> System Programming Guide, Part 1, section 8.1.1,
> http://download.intel.com/products/processor/manual/253668.pdf:
>
> "The Intel486 processor (and newer processors since) guarantees that the
> following basic memory operations will always be carried out atomically:
> (...)
> • Reading or writing a doubleword aligned on a 32-bit boundary
>
> The Pentium processor (...):
> • Reading or writing a quadword aligned on a 64-bit boundary
> "

Why didn't you quote it further?

"Accesses to cacheable memory that are split across cache lines and page
boundaries are not guaranteed to be atomic by (...). (...) provide bus
control signals that permit external memory subsystems to make split
accesses atomic."

Which means even aligned accesses are not guaranteed to be atomic, and it's
up to the implementation of "external memory subsystems" (that probably
means chipsets, motherboards, NUMA architectures and the like). That's the
answer to the "why?" question.

> My understanding of the above is that os_atomic_load_ulint() /
> os_atomic_store_ulint() fit the above description, modulo alignment
> issues, if any. These are easy to ensure by ut_ad().

Modulo alignment, cache line boundary and page boundary issues. I don't see
how ut_ad() is going to help here. Suppose a buf_pool_stat_t structure
happens to be allocated in memory so that n_pages_written happens to be
misaligned, or to cross a cache line or a page boundary. How exactly is
ut_ad() going to ensure that never happens at runtime?
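For what it's worth, a natural-alignment check along the lines being
debated is easy to sketch. This is my own illustration, not code from the
patch, and the function name is invented; the point is that a ulint aligned
to its own size cannot straddle a cache line or a page, since both are
power-of-two multiples of sizeof(ulint). The open question above is whether
anything actually guarantees such alignment for a field like
buf_pool_stat_t::n_pages_written at runtime:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned long ulint;

/* A debug check in the spirit of ut_ad(): nonzero iff the pointer is
aligned to sizeof(ulint).  A naturally aligned ulint cannot cross a
cache line or page boundary, because cache lines and pages are
power-of-two multiples of sizeof(ulint). */
static int
ulint_is_naturally_aligned(const ulint* ptr)
{
	return ((uintptr_t) ptr % sizeof(ulint)) == 0;
}
```

Whether such an assertion is worth anything depends, of course, on the
allocator actually providing the alignment, which is exactly the point of
contention.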
>>>> They should be named os_ordered_load_ulint() /
>>>> os_ordered_store_ulint(),
>>>
>>> That's an option, but I needed atomicity, visibility, and ordering, and
>>> chose "atomic" for the function names to match the existing CAS and
>>> atomic add operations, which also need all three.
>>
>> I'm not sure you need all of those 3 in every occurrence of those
>> functions, but see below.
>
> That's right. And ordering probably is not needed anywhere, sorry about
> that, my understanding of atomics is far from fluent. But visibility
> should be needed in every occurrence if this is ever ported to a
> non-cache-coherent architecture, or if some non-temporal store source
> code optimization is implemented on x86_64.

Grr.

- the "atomic" word is misused, because that code is not really atomic
- right, but I need atomicity + new_entity_1 + new_entity_2
- you don't need new_entity_1 and new_entity_2
- that's right, but new_entity_1 should be needed for new_entity_3 and
  new_entity_4

What do compiler barriers have to do with cache coherence?

>>>> but... what specific order are you trying to enforce with those
>>>> constructs?
>>>
>>> In buf_read_recv_pages() and buf_read_ibuf_page().
>>>
>>> while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
>>>        > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
>>>     os_thread_sleep(500000);
>>> }
>>>
>>> it is necessary to have a load barrier (compiler only for x86/x86_64,
>>> potentially a CPU barrier too elsewhere for visibility and ordering).
>>
>> That doesn't answer my question: what specific ordering are you trying
>> to enforce with those constructs?
>>
>> One needs compiler/CPU barriers when the code, after loading a value
>> from one location, expects values from other locations in a certain
>> state (or assumes certain cross-location visibility, in other words).
>
> That is ordering, yes, but I don't think this is an exhaustive list of
> when barriers are needed.
Here a compiler barrier is needed to prevent the hoisting of the load above
the loop, which would turn it into an infinite loop.

> The same problem in mutex_spin_wait(), for example, is addressed by
> ib_mutex_t::lock_word being defined with the volatile keyword.

However, no atomicity, visibility or ordering is enforced when reading that
value in mutex_spin_wait(), because it doesn't need them. We need atomicity
for counters, so the volatile keyword is not sufficient. But it's not
provided by any barriers either.

>> What assumptions about other value states are made in the above code
>> after reading buf_pool->n_pend_reads?
>
> Thus this question does not apply.

It does, because the answer is "none", and it shows that "volatile" would
be a sufficient replacement for os_atomic_load/store if not for the
atomicity issues, which are not addressed by either approach.

>> It makes even less sense in other occurrences of os_atomic_load_ulint(),
>> e.g. in buf_get_total_stat(). I disassembled the code generated for
>> buf_get_total_stat() by GCC with compiler barriers both enabled and
>> disabled.
>>
>> With barriers disabled the code corresponding to
>> "tot_stat->n_pages_written +=
>> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
>>
>> # Load buf_stat->n_pages_written and add it to the tot_stat value in
>> # a register
>> add -0x30(%r13,%rdx,1),%r14
>> # Store the computed value to the tot_stat field
>> mov %r14,0x10(%rdi)
>>
>> With barriers enabled the same code is:
>>
>> # Load the current tot_stat field to a register
>> mov 0x10(%rdi),%rax
>> # Add the value of buf_stat->n_pages_written
>> add -0x30(%r8,%rcx,1),%rax
>> # Store the computed value back to the tot_stat field
>> mov %rax,0x10(%rdi)
>
> I'm mildly curious how this would compare to a volatile n_pages_written
> too.

You are welcome to experiment yourself.
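To make the hoisting point concrete, here is a minimal sketch (my own,
GCC-specific illustration with invented names, not code from the patch) of
the two ways a spin loop like the one above can force the load to be
repeated on each iteration: a compiler barrier in the loop body, or
declaring the location volatile, as mutex_spin_wait() does with lock_word:

```c
typedef unsigned long ulint;

/* GCC idiom: an empty asm with a "memory" clobber forbids the compiler
from caching memory values in registers across this point. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static ulint n_pend_reads;	/* stands in for buf_pool->n_pend_reads */

/* Variant 1: plain variable; the barrier forces a fresh load of
n_pend_reads on every iteration, so the load cannot be hoisted above
the loop. */
static void
wait_with_barrier(ulint limit)
{
	while (n_pend_reads > limit) {
		COMPILER_BARRIER();
		/* os_thread_sleep(500000) would go here */
	}
}

/* Variant 2: the mutex_spin_wait() approach; the volatile qualifier
alone makes every read go to memory. */
static volatile ulint lock_word;	/* stands in for ib_mutex_t::lock_word */

static void
wait_with_volatile(ulint limit)
{
	while (lock_word > limit) {
		/* each iteration re-reads lock_word from memory */
	}
}
```

Note that neither variant adds any atomicity; on x86_64 both merely stop
the compiler from spinning forever on a stale register copy.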
>> So what has been achieved by a compiler barrier here, except the fact
>> that the compiler avoids storing buf_stat->n_pages_written to a register
>> and thus produces less efficient code?
>
> The load of buf_stat->n_pages_written from memory is forced to happen
> next to where the value is used. On non-cache-coherent architectures it
> would additionally be ensured by a correct os_atomic_load_ulint()
> implementation that the value loaded is the latest value written by any
> other thread.

No, in both cases the value is used to modify a register and then the
register is stored to memory. So the correct answer to my question is
"nothing".

>>> In buf_LRU_get_free_block() there is a loop, which can be executed by
>>> multiple query threads concurrently, and it both loads and stores
>>> buf_pool->try_LRU_scan in it, which requires immediate visibility to
>>> any concurrent threads, so again it is necessary to have the load and
>>> store barriers.
>>
>> Er, and how exactly do compiler barriers implement immediate visibility
>> to any concurrent threads? That also needs atomicity, but that's not
>> what barriers guarantee.
>>
>> OK, compiler barriers force the compiler to perform _multiple_
>> load/store ops in the desired order. And CPU barriers impose the same
>> requirements on the CPU. Which means they only matter for cross-location
>> visibility, but have nothing to do with single value visibility.
>
> By "multiple" above I believe you are again referring to ordering of
> accesses to different memory locations. And, as above, compiler barriers
> are required for single location access too, and CPU barriers would be as
> well.
>
> Without a compiler barrier there is nothing that prevents the compiler
> from compiling buf_LRU_get_free_block() into
>
> register = try_LRU_scan
> ...loop that sets and accesses try_LRU_scan through the register...
> try_LRU_scan = register

That's what the volatile keyword is for.

>> Aren't you reinventing volatile here?
> Volatile is not enough: on a non-cache-coherent architecture volatile
> would force a load/store to be performed, yes, but it would not
> necessarily make a store visible elsewhere, nor would a load necessarily
> be performed in a cache-coherent manner.

*sigh* and compiler barriers solve that problem by ...?

>> But taking the atomicity considerations into account, I'm fairly sure
>> it's much better to protect those values with a separate mutex and be
>> done with it.
>
> IMHO the jury is still out.

*shrug* if you have better ideas, I'd love to hear them.

>>>> - I don’t see a point in maintaining multiple list nodes in buf_page_t
>>>> (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
>>>> page may only be in a single list at any point in time, so splitting
>>>> the list node is purely cosmetic.
...
>>> These are my concerns too. I don't address them in the current MP as
>>> these bits date all the way back to XtraDB 5.1. Thus I think it's best
>>> addressed in a follow-up that is common with 5.5.
>>
>> I'm not sure it should be addressed separately. It fits rather well into
>> the other bpsplit cleanups you have already done. As in, it's about
>> removing unnecessary code, not about adding new code/functionality on
>> top of bpsplit in 5.5.
>
> OK, I will do this and log a Wishlist bug for 5.5 to backport.

Thanks.

>> The cleanup might be done for 5.5 as well if we had unlimited resources.
>> But as my cleanups in AHI partitioning have shown, implementing,
>> merging, and reviewing things like this separately is more expensive in
>> terms of man-hours.
>
> The AHI example is a surprise to me. I see the additional time needed for
> separate testing runs, but I personally always prefer to do common things
> across the versions commonly, and separate things separately. This
> probably depends on the person.

The relevant code has diverged between 5.5 and 5.6 considerably.
So implementing those cleanups in 5.5 and then merging them into 5.6 was
far from trivial. That's one thing. Another thing is that implementing
those cleanups in 5.5 is less important than in 5.6, so it can wait longer
than the 5.6 one. Finally, 1 MP/review cycle is fewer man-hours than 1 + 2
extra MP/review cycles. It's that simple.
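For completeness, the "separate mutex" approach I'm advocating above for
the counters could look roughly like this. This is a sketch using pthreads
with invented names, not a patch against the actual buf_pool_stat_t code;
in InnoDB the mutex would of course be an ib_mutex_t:

```c
#include <pthread.h>

typedef unsigned long ulint;

/* One mutex guarding a buffer pool statistics counter.  Every reader
and writer takes it, which portably provides atomicity, visibility and
ordering on any architecture, cache-coherent or not, regardless of the
counter's alignment. */
static pthread_mutex_t	buf_stat_mutex = PTHREAD_MUTEX_INITIALIZER;
static ulint		n_pages_written;

static void
buf_stat_inc_pages_written(void)
{
	pthread_mutex_lock(&buf_stat_mutex);
	n_pages_written++;
	pthread_mutex_unlock(&buf_stat_mutex);
}

static ulint
buf_stat_get_pages_written(void)
{
	ulint	val;

	pthread_mutex_lock(&buf_stat_mutex);
	val = n_pages_written;
	pthread_mutex_unlock(&buf_stat_mutex);

	return val;
}
```

The cost is a lock/unlock pair per access, but it needs no per-platform
reasoning about alignment, cache lines, or memory models, which is the
trade-off being discussed.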