Alexey -

> I don't see how a properly-aligned ulint can possibly cross a cache
> line boundary, when cache lines are 64-byte wide and 64-byte aligned.
> Or even 32 for older architectures.
>
> The array of buffer pool descriptors is allocated as follows:

You are right, I failed to consider the base addresses returned by
dynamic memory allocation. I also failed to notice your hint in that
direction in one of the previous mails.

>	buf_pool_ptr = (buf_pool_t*) mem_zalloc(
>		n_instances * sizeof *buf_pool_ptr);
>
> so individual buf_pool_t instances are not guaranteed to have any
> specific alignment, neither to cache line nor to page boundaries,
> right?

Right.

> Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into
> buf_pool_t, so nothing prevents it from crossing a cache line or a
> page boundary?

Right.

> Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again,
> nothing prevents them from crossing a cache line or a page boundary,
> right?

Right, nothing prevents an object of buf_pool_stat_t from crossing it.
But that's OK: we only need the individual fields not to cross it.

> >> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
> >> structure happens to be allocated in memory so that n_pages_written
> >> happens to be misaligned, or to cross a cache line or a page
> >> boundary. How exactly is ut_ad() going to ensure that never happens
> >> at runtime?
> >
> > A debug build would hit this assert and we'd fix the structure
> > layout/allocation. Unless I'm mistaken, to get a misaligned ulint we'd
> > have to ask for it explicitly, by packing a struct, fetching a pointer
> > to it from a byte array, etc. Thus ut_ad() seems reasonable to me.
>
> The only thing you can assume about dynamically allocated objects is
> that their addresses (and thus the first member of a structure, if an
> object is a structure) are aligned to the machine word size. Which is
> always lower than the cache line size. There are no guarantees on the
> alignment of other structure members, no matter what compiler hints
> were used (those only matter for statically allocated objects).

Right. So, to conclude... no individual ulint is going to cross a cache
line or a page boundary, and we are good? We start with a machine-word
aligned address returned from the heap, and add a multiple of the
machine word width to arrive at the address of an individual field.
That address is machine-word aligned too, and a machine-word aligned
machine word cannot cross anything?
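If that holds, the ut_ad() I proposed reduces to a runtime restatement
of that arithmetic. A minimal standalone sketch, with a simplified
stand-in for buf_pool_stat_t and a 64-byte cache line assumed (names
are invented for illustration):

	#include <assert.h>
	#include <stdint.h>

	typedef unsigned long	ulint;

	#define CACHE_LINE_SIZE	64

	/* stand-in for buf_pool_stat_t: every member is word-sized */
	typedef struct {
		ulint	n_pages_read;
		ulint	n_pages_written;
		/* ... */
	} stat_t;

	static void
	check_field_alignment(const stat_t* stat)
	{
		uintptr_t	addr = (uintptr_t) &stat->n_pages_written;

		/* the heap base is word-aligned and the field offset
		is a multiple of sizeof(ulint), so the field itself is
		word-aligned... */
		assert(addr % sizeof(ulint) == 0);

		/* ...and a word-aligned word cannot straddle a 64-byte
		line, because the line size is a multiple of the word
		size */
		assert(addr / CACHE_LINE_SIZE
		       == (addr + sizeof(ulint) - 1) / CACHE_LINE_SIZE);
	}

The second assert can never fire once the first one holds, which is the
point of the conclusion above.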
> >>>>>> They should be named os_ordered_load_ulint() /
> >>>>>> os_ordered_store_ulint(),
> >>>>>
> >>>>> That's an option, but I needed atomicity, visibility, and
> >>>>> ordering, and chose atomic for the function name to match the
> >>>>> existing CAS and atomic add operations, which also need all
> >>>>> three.
> >>>>
> >>>> I'm not sure you need all of those 3 in every occurrence of those
> >>>> functions, but see below.
> >>>
> >>> That's right. And ordering probably is not needed anywhere, sorry
> >>> about that, my understanding of atomics is far from fluent. But
> >>> visibility should be needed in every occurrence if this is ever
> >>> ported to a non-cache-coherent architecture, or if some
> >>> non-temporal store source code optimization is implemented on
> >>> x86_64.
> >>
> >> Grr.
> >>
> >> - the "atomic" word is misused, because that code is not really
> >>   atomic
> >> - right, but I need atomicity + new_entity_1 + new_entity_2
> >> - you don't need new_entity_1 and new_entity_2
> >> - that's right, but new_entity_1 should be needed for new_entity_3
> >>   and new_entity_4
> >
> > I am getting confused by the term "ordering". It appears to me,
> > perhaps wrongly, that by ordering you refer to an ordering between
> > accesses to different memory locations. If yes, then I don't need
> > ordering, but the primitives provide it, as there is no point in not
> > providing it anyway. But "ordering" can also mean the ordering of
> > accesses to a single memory location in a single thread. In that
> > case I do need the ordering. I need atomicity and visibility without
> > question.
>
> You have a set of one element: {x}. Please describe what actions must
> be taken to "order" it.

It's probably ordered already. But I cannot understand your question.
Can you please tell me what your working definition of memory access
ordering is? Is "{x}" a theoretical example of a set, or do you arrive
at it from the "single memory location in a single thread" above?
Because in the latter case it is still plural accesses that have to be
ordered. I'm sure it's my fault that my basic understanding differs
from yours, thus the sooner I know yours, the sooner I start making
sense and agreeing.

> > I don't understand your objection re. new_entity_3 and new_entity_4.
> > My goal is to provide primitives that abstract away the underlying
> > architecture. That's a sane goal, and atomic+visible+ordered
> > load/store is a sane model. Any porting anywhere, or even an x86_64
> > optimization, is then a matter of implementing the correct
> > primitives, not of auditing the call sites because we suddenly lost
> > some property.
>
> You don't abstract anything in terms of atomicity and visibility, but
> you do put "atomic" in the function name. That's the problem.

I can rename. I also don't understand what's not atomic and not visible
about them.

> >> What do compiler barriers have to do with cache coherence?
> >
> > You cannot add CPU barriers to achieve cache coherence without
> > having compiler barriers there too.
>
> That doesn't answer my question. I cannot add CPU barriers to achieve
> cache coherence, no matter what code is generated by the compiler.
> Cache coherence is a property of the hardware.

You're right, it's a hardware property. But the hardware also provides
the instructions needed to work with it. For example, if you do a
non-temporal store on x86_64, visibility is relaxed, and you need a CPU
barrier to guarantee visibility at the point where the barrier is
placed. So in that case a CPU barrier is what restores the visibility
that cache coherence normally gives you.
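To make that example concrete, a sketch with SSE2 intrinsics (the
function and variable names are invented for illustration):

	#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
	#include <stdint.h>

	static int64_t	counter;
	static int64_t	published;

	static void
	publish_counter(int64_t val)
	{
		/* a non-temporal store bypasses the cache hierarchy
		and is weakly ordered: without a fence, a later regular
		store may become visible to other CPUs before it */
		_mm_stream_si64((long long*) &counter, val);

		/* the store fence forces the streaming store to become
		globally visible before anything that follows */
		_mm_sfence();

		published = 1;	/* regular store, now safely ordered */
	}

Without the _mm_sfence(), a reader could observe published == 1 and
still load a stale counter value, cache coherence notwithstanding.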
> > The buf_pool->n_pend_reads load must happen after the previous loop
> > iteration's buf_pool->n_pend_reads load.
>
> So it might happen that the buf_pool->n_pend_reads load is performed
> _before_ the previous loop iteration's buf_pool->n_pend_reads load?
> See, "order" is a before/after relation.
>
> What you want to do here is to prevent the compiler from optimizing
> the load away (so that no loads happen in loop iterations).

You're right. And my way appears to implement that correctly, unless
I'm missing something.

> That's what "volatile" is for.

Volatile will prevent the load from being optimized away. But volatile
only inhibits compiler optimizations; it has nothing to do with a
memory model, beyond the side effect that the inhibited optimizations
happen to be the ones that would move memory accesses around. It will
not say that, e.g., these memory accesses are now visible, or ordered
with respect to non-volatile memory accesses. Why should we use a less
precise mechanism with partial guarantees instead of a more precise
one? This is of course handwaving too, but so is saying that volatile
is what this is for.

> >>>> One needs compiler/CPU barriers when the code, after loading a
> >>>> value from one location, expects values from other locations in a
> >>>> certain state (or assumes certain cross-location visibility in
> >>>> other words).
> >>>
> >>> That is ordering, yes, but I don't think this is an exhaustive
> >>> list of when the barriers are needed. Here a compiler barrier is
> >>> needed to prevent the hoisting of the load above the loop, which
> >>> would turn its body into an infinite loop.
> >>
> >> The same problem in mutex_spin_wait(), for example, is addressed by
> >> ib_mutex_t::lock_word being defined with the volatile keyword.
> >
> > I don't think that's an example to follow; elaborated below.
>
> >> However, no atomicity, visibility or ordering is enforced when
> >> reading that value in mutex_spin_wait(), because it doesn't need
> >> them.
> >
> > I don't think mutex_spin_wait() ever wants, regarding 1) atomicity:
> > to observe a partial load, or to perform a store such that another
> > thread observes a partial load; 2) visibility: to miss a mutex lock
> > word write by another thread, or to make its own write invisible to
> > another thread; 3) ordering: to have the mutex lock word load
> > optimized out of the loop. Unless I'm mistaken, all three are
> > needed, and the fact that the current code has volatile instead
> > troubles me slightly.
> >
> > In fact, mutex_exit_func() has the following comment: ...
>
> The comment refers to a completely different problem...

Yes, so we both agree that a barrier would solve this here.

> I was referring to this specific piece of code in mutex_spin_wait():
>
>	while (mutex_get_lock_word(mutex) != 0 && i < SYNC_SPIN_ROUNDS) {
>		if (srv_spin_wait_delay) {
>			ut_delay(ut_rnd_interval(0, srv_spin_wait_delay));
>		}
>
>		i++;
>	}

My comments 1), 2), 3) in the paragraph above refer to the same piece.
Can you please explain how this code does not need atomicity,
visibility and ordering, as you say above?

> Which only works correctly (i.e. is not optimized into an infinite
> loop) because 'lock_word' is volatile.

Same comment as above re. volatile.

> That's the problem we were discussing with respect to
> buf_pool->n_pend_reads in buf_read_recv_pages(), right?

The problem we are discussing also depends on 1), 2), 3) above. I know
that adding volatile will keep the loop, probably. But we also know
exactly how n_pend_reads should be accessed concurrently. So we can
perform these accesses using the necessary primitives and be sure
about atomicity, visibility and ordering, as opposed to selectively
disabling compiler optimizations.
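For concreteness, this is roughly what I have in mind for the x86_64
implementation; a sketch resting on the alignment conclusion above,
not the actual patch:

	typedef unsigned long	ulint;

	#define os_compiler_barrier()	__asm__ __volatile__("" ::: "memory")

	/* On x86_64 an aligned word load/store is atomic and becomes
	visible to other threads through cache coherence; the compiler
	barriers pin the access in place, so it can be neither hoisted
	out of a loop nor cached in a register. */

	static inline ulint
	os_atomic_load_ulint(const ulint* ptr)
	{
		ulint	val;

		os_compiler_barrier();
		val = *ptr;
		os_compiler_barrier();

		return(val);
	}

	static inline void
	os_atomic_store_ulint(ulint* ptr, ulint val)
	{
		os_compiler_barrier();
		*ptr = val;
		os_compiler_barrier();
	}

The wait in buf_read_recv_pages() then reads as, say:

	/* 'limit' stands for the existing threshold expression */
	while (os_atomic_load_ulint(&buf_pool->n_pend_reads) > limit) {
		os_thread_sleep(10000);
	}

and the intent, an atomic and visible load on every iteration, is
stated at the call site instead of being an accident of a volatile
declaration.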
> >> We need atomicity for counters, so the volatile keyword is not
> >> sufficient. But it's not provided by any barriers either.
> >
> > Counter atomicity is ensured by an atomic inc or an atomic CAS
> > operation. That operation will include all the compiler/CPU barriers
> > needed to ensure visibility and ordering too. An atomic load or
> > store is not any less atomic than inc/CAS even if there is no
> > computation with intermediate results.
>
> *sigh* that's a different kind of atomicity.

How many kinds of atomicity are there? An operation has either
completed or it has not, with no possibility of observing an
intermediate state. This applies to CAS and INC just the same as to a
load or a store. Or am I missing something here?

> atomic inc ensures that a value will be loaded, modified and stored
> back atomically. It doesn't enforce any atomicity on other code
> reading that value.

No, why should it?

> Neither does os_atomic_load_ulint().

That depends on the outcome of the alignment question. Right now I
believe it does.
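To spell the two kinds out in code, using the same GCC builtins the
existing CAS/add wrappers are built on (the reader side reuses the
os_atomic_load_ulint() sketch above; the names are invented):

	typedef unsigned long	ulint;

	/* kind 1: an atomic read-modify-write; no other thread can
	observe or clobber the counter between the load and the store
	back */
	static void
	counter_inc(ulint* counter)
	{
		(void) __sync_add_and_fetch(counter, 1);
	}

	/* kind 2: an atomic (untorn) load; the reader sees the value
	either before some increment or after it, never a half-written
	word. On x86_64 any aligned word load already has this
	property. */
	static ulint
	counter_get(const ulint* counter)
	{
		return(os_atomic_load_ulint(counter));
	}

The inc indeed does nothing for the readers, and I never claimed it
did: the readers' atomicity comes from their own aligned word
accesses, which is exactly the alignment question we settled above.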
> One needs compiler/CPU barriers when the code, after loading a value
> from one location, expects values from other locations in a certain
> state (or assumes certain cross-location visibility in other words).
> >>>> What assumptions about other value states are made in the above
> >>>> code after reading buf_pool->n_pend_reads?
> >>>
> >>> Thus this question does not apply.
> >>
> >> It does, because the answer is "none", and it shows that "volatile"
> >> would be a sufficient replacement for os_atomic_load/store if not
> >> for the atomicity issues, which are not addressed by either
> >> approach.
> >
> > It does not, because we don't need "other values" to have a reason
> > to consider how the memory model applies to this value alone and how
> > to access it.
>
> That's just hand-waving.

I apologize. I'm trying hard not to hand-wave, but it's hard, and we
seem to think from different models. I added back the context ("One
needs ...") above the question ("What assumptions ..."); I hope I have
not misedited it. If I'm not mistaken, your question comes from the
premise that barriers are only needed for ordering accesses to
different memory locations. Right? If so, the answer to your question
is indeed "none". The problem is that my premise is that barriers are
needed not only for that, which is why my answer is "none, and the
memory barriers are used here not for cross-location ordering".

> >>>> So what has been achieved by a compiler barrier here except the
> >>>> fact that the compiler avoids keeping buf_stat->n_pages_written
> >>>> in a register and thus produces less efficient code?
> >>>
> >>> The buf_stat->n_pages_written load from memory is forced to happen
> >>> next to where the value is used. On non-cache-coherent
> >>> architectures a correct os_atomic_load_ulint() implementation
> >>> would additionally ensure that the value loaded is the latest
> >>> value written by any other thread.
> >>
> >> No, in both cases the value is used to modify a register and then
> >> the register is stored to memory. So the correct answer to my
> >> question is "nothing".
> >
> > Your answer seems to ignore my bit about "On non-cache-coherent
> > ...". Well, it ignores the first sentence too; an x86_64 assembly
> > comparison contributes little to this discussion, because we don't
> > need CPU barriers here.
>
> I'm asking simple questions which require single-word answers.
> Instead, I see abstract hand-waving.
>
> You saw the code with compiler barriers and without them. The
> question is: "what difference is there in terms of program state?".
> You may describe the difference either for cache-coherent
> architectures or for non-cache-coherent ones, as you like.

I apologize again. The problem is not that I'm trying to avoid a
single-word answer that would collapse my reasoning; I'm not. The
problem is that I fail to see how exactly these questions are
relevant. In the two assembly examples above the difference is
immaterial. The atomic op example will load the value right next to
where it's needed, atomically, etc. So the second example does what I
want it to do and guarantees that it's done. The first example
_happens_ to do the same thing here too, but without the guarantee.
Hence there is no significant difference, and that's why I don't think
this question provides any insight into the issue.

> >>>>> In buf_LRU_get_free_block() there is a loop, which can be
> >>>>> executed by multiple query threads concurrently, and it both
> >>>>> loads and stores buf_pool->try_LRU_scan in it, which requires
> >>>>> immediate visibility to any concurrent threads; again it is
> >>>>> necessary to have the load and store barriers.
> >>
> >> Er, and how exactly do compiler barriers implement immediate
> >> visibility to any concurrent threads?
> >
> > A compiler barrier forces a memory store to happen where it is. A
> > memory store will be immediately visible on x86_64 to everybody
> > else. A compiler barrier forces a memory load to happen too, which
> > means that a store by another thread becomes visible to the current
> > thread.
>
> No, a compiler barrier does not force a memory store to happen where
> it is. It forces the compiler to generate load/store instructions in
> a specific order and to avoid caching values in registers. That does
> not in itself guarantee any visibility.

Unless I'm misunderstanding you, you're fully right except for the
conclusion: I believe it does in itself guarantee visibility just
fine, because regular x86_64 memory loads and stores are guaranteed
to become visible.

> >> That also needs atomicity, but that's not what barriers guarantee.
> >
> > I don't think I ever said that a _compiler barrier_ implements
> > _atomicity_. Have I?
>
> os_atomic_load_ulint() has "atomic" in its name, but the only thing it
> does is inject a compiler barrier?

It also has "ulint" in its name and carries the alignment assumption.
Does that help?

> > Volatile semantics are not very clear and have little to do with
> > correct concurrency. Compilers emit code that happens to, partially
> > or fully, do what we want when volatile is encountered, by emitting
> > the memory loads and stores and not moving them around, but there is
> > no guarantee that a store is fully retired before the next load, and
> > no visibility guarantee. Correct multithreaded programming must be
> > done with correct primitives that are aware of the memory model.
>
> And os_atomic_load_ulint() / os_atomic_store_ulint() address that
> problem by ...?

By providing primitives for atomic, visible, ordered access. Would you
agree that having such primitives, under any correct implementation,
addresses this problem? I think yes, since you said previously that
mutex-protected access would work. Or is that too brave an assumption
on my part?
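That is, I take the mutex variant as the agreed-upon correct baseline,
something equivalent to this (pthreads and the names are for
illustration only; InnoDB would use ib_mutex_t):

	#include <pthread.h>

	typedef unsigned long	ulint;

	static pthread_mutex_t	stat_mutex = PTHREAD_MUTEX_INITIALIZER;
	static ulint		n_pages_written;

	static void
	stat_inc(void)
	{
		pthread_mutex_lock(&stat_mutex);
		n_pages_written++;
		pthread_mutex_unlock(&stat_mutex);
	}

	static ulint
	stat_get(void)
	{
		ulint	val;

		pthread_mutex_lock(&stat_mutex);
		val = n_pages_written;
		pthread_mutex_unlock(&stat_mutex);

		return(val);
	}

My claim is only that, for a single aligned word, the load/store
primitives provide the same atomicity and visibility without the round
trip through the mutex.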
> >>>> Aren't you reinventing volatile here?
> >>>
> >>> Volatile is not enough: on a non-cache-coherent architecture
> >>> volatile would force a load/store to be performed, yes, but it
> >>> would not necessarily make a store visible elsewhere, nor would a
> >>> load necessarily be performed in a cache-coherent manner.
> >>
> >> *sigh* and compiler barriers solve that problem by ...?
> >
> > Where did I say that _compiler_ barriers solve this problem on a
> > _non_-cache-coherent architecture? CPU barriers are needed there,
> > and compiler barriers are a prerequisite for them.
>
> You've been referring to _non_-cache-coherent architectures whenever I
> asked you simple questions on compiler barriers.

But your question above referred to the _non_ case, so I must have
misunderstood something.

> And no, CPU barriers do not help with non-cache-coherent
> architectures. Google it.

I have given the non-temporal store on x86_64 example above, where a
CPU barrier does help, which I believe is a counterexample. My
googling for a positive example failed. Can you please point me to
one?

> >>>> But taking the atomicity considerations into account, I'm fairly
> >>>> sure it's much better to protect those values with a separate
> >>>> mutex and be done with it.
> >>>
> >>> IMHO the jury is still out.
> >>
> >> *shrug* if you have better ideas, I'd love to hear them.
> >
> > I still have the same idea, but obviously I fail to explain it well.
>
> I think you misunderstand your idea in the first place.

That's probable. I will drop the idea the moment I see I'm wrong; I'm
not married to it. But so far I believe I misunderstand your
objections more than my idea.

If you allow me a detour: I could drop it just to save us both time
(probably significant by now), but honestly, I don't want to drop what
still looks technically sound from my point of view. For example, I'd
have replaced the existing hacks in log0online with atomic LSN
loads/stores, instead of simulating an atomic access with an "add
zero, atomically, get the old value" op. Further, there is the new
counter framework in 5.6, get_sched_indexer_t, which is a per-CPU
counter array, and which is currently unused. My hunch is that it's
unused because it did not perform, and non-temporal stores are what it
needs to perform. And for that the database needs both a proper atomic
ops framework and developers with a working knowledge of it and of the
memory model. Our discussion shows that the latter is sorely missing,
and I believe it's critical to work this out.

Thanks.
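P.S. To make the get_sched_indexer_t point concrete, this is roughly
the shape of counter I have in mind (names invented; sched_getcpu() is
Linux-specific and only used to pick a slot; the read side reuses the
os_atomic_load_ulint() sketch above):

	#define _GNU_SOURCE
	#include <sched.h>	/* sched_getcpu() */

	typedef unsigned long	ulint;

	#define N_SLOTS		64	/* >= number of CPUs */
	#define CACHE_LINE_SIZE	64

	/* one counter per cache line, so that writers on different
	CPUs never share a line */
	struct padded_slot {
		ulint	count;
		char	pad[CACHE_LINE_SIZE - sizeof(ulint)];
	};

	static struct padded_slot	slots[N_SLOTS];

	static void
	split_counter_inc(void)
	{
		unsigned	slot = (unsigned) sched_getcpu() % N_SLOTS;

		/* threads are not pinned, so two threads may still
		land on the same slot: the add itself must stay
		atomic */
		(void) __sync_add_and_fetch(&slots[slot].count, 1);
	}

	static ulint
	split_counter_read(void)
	{
		ulint	sum = 0;
		int	i;

		for (i = 0; i < N_SLOTS; i++) {
			sum += os_atomic_load_ulint(&slots[i].count);
		}

		/* approximate while writers are active, exact once
		they have stopped */
		return(sum);
	}

Making the per-slot increments non-temporal would then be a local
change inside split_counter_inc(), provided the primitives and the
fences are in place.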