Alexey -

> >> - (if we only care about x86/x86_64) make sure that values being
> >> loaded/stored do not cross cache lines or page boundaries. Which is of
> >> course impossible to guarantee in a generic function.
> >
> > Why? We are talking about ulints here only, and I was not able to find
> > such requirements in the x86_64 memory model descriptions. There is a
> > requirement to be aligned, and misaligned stores/loads might indeed
> > cross cache line or page boundaries, and anything that crosses them is
> > indeed non-atomic. But alignment is possible to guarantee in a generic
> > function (which doesn't even have to be generic: the x86_64
> > implementation is for x86_64 only, obviously).
> >
> > Intel® 64 and IA-32 Architectures Software Developer's Manual,
> > Volume 3A: System Programming Guide, Part 1, section 8.1.1,
> > http://download.intel.com/products/processor/manual/253668.pdf:
> >
> > "The Intel486 processor (and newer processors since) guarantees that
> > the following basic memory operations will always be carried out
> > atomically:
> > (...)
> > • Reading or writing a doubleword aligned on a 32-bit boundary
> >
> > The Pentium processor (...):
> > • Reading or writing a quadword aligned on a 64-bit boundary
> > "
>
> Why didn't you quote it further?
>
> "
> Accesses to cacheable memory that are split across cache lines and page
> boundaries are not guaranteed to be atomic by [...]. [These] processors
> provide bus control signals that permit external memory subsystems to
> make split accesses atomic;
> "
>
> Which means even aligned accesses are not guaranteed to be atomic and
> it's up to the implementation of "external memory subsystems" (that
> probably means chipsets, motherboards, NUMA architectures and the like).

I didn't quote it further because we have both already acknowledged that
accesses crossing a cache line or page boundary are non-atomic, and because
I don't see how it's relevant here: I don't see how a properly-aligned
ulint can possibly cross a cache line boundary when cache lines are 64
bytes wide and 64-byte aligned (or 32 bytes on older architectures).

> > My understanding of the above is that
> > os_atomic_load_ulint()/os_atomic_store_ulint() fit the above
> > description, modulo alignment issues, if any. Alignment is easy to
> > ensure by ut_ad().
>
> Modulo alignment, cache line boundary and page boundary issues.

Alignment only, unless my reasoning above is wrong.

> I don't see how ut_ad() is going to help here. Say a buf_pool_stat_t
> structure happens to be allocated in memory so that n_pages_written
> happens to be misaligned, or crosses a cache line or a page boundary.
> How exactly is ut_ad() going to ensure that never happens at runtime?

A debug build would hit the assertion, and we would fix the structure
layout/allocation. Unless I'm mistaken, to get a misaligned ulint we would
have to ask for it explicitly: by packing a struct, by fetching a pointer
to it from a byte array, etc. Thus ut_ad() seems reasonable to me.
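To make the ut_ad() suggestion concrete, here is roughly what I have in
mind. This is a minimal self-contained sketch, not the actual patch: plain
assert() stands in for ut_ad(), and the barrier/visibility side of these
functions is left out here (it comes up again below).

    #include <assert.h>
    #include <stdint.h>

    typedef unsigned long int ulint;

    /* x86_64-only sketch: a naturally aligned ulint cannot straddle a
    cache line or a page, so the architectural guarantee that aligned
    quadword loads and stores are atomic applies. */
    static inline ulint
    os_atomic_load_ulint(const ulint* ptr)
    {
        assert((uintptr_t) ptr % sizeof(ulint) == 0);

        return(*ptr);
    }

    static inline void
    os_atomic_store_ulint(ulint* ptr, ulint val)
    {
        assert((uintptr_t) ptr % sizeof(ulint) == 0);

        *ptr = val;
    }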
> >>>> They should be named os_ordered_load_ulint() /
> >>>> os_ordered_store_ulint(),
> >>>
> >>> That's an option, but I needed atomicity, visibility, and ordering,
> >>> and chose "atomic" for the function names to match the existing CAS
> >>> and atomic add operations, which also need all three.
> >>
> >> I'm not sure you need all 3 of those in every occurrence of those
> >> functions, but see below.
> >
> > That's right. And ordering probably is not needed anywhere, sorry
> > about that; my understanding of atomics is far from fluent. But
> > visibility should be needed in every occurrence if this is ever ported
> > to a non-cache-coherent architecture, or if some non-temporal store
> > source code optimization is implemented on x86_64.
>
> Grr.
>
> - the "atomic" word is misused, because that code is not really atomic
> - right, but I need atomicity + new_entity_1 + new_entity_2
> - you don't need new_entity_1 and new_entity_2
> - that's right, but new_entity_1 should be needed for new_entity_3 and
> new_entity_4

I am getting confused by the term "ordering". It appears to me, perhaps
wrongly, that by "ordering" you mean an ordering between accesses to
different memory locations. If so, then I don't need that ordering, but
the primitives provide it, as there is no point in not providing it
anyway. But "ordering" can also mean the ordering of accesses to a single
memory location within a single thread, and that ordering I do need.
Atomicity and visibility I need without question.

I don't understand your objection regarding new_entity_3 and new_entity_4.
My goal is to provide primitives that abstract away the underlying
architecture. That is a sane goal, and an atomic, visible, and ordered
load/store is a sane model. Any porting anywhere, or even an x86_64
optimization, then becomes a matter of implementing the primitives
correctly, not of re-checking every call site because we suddenly lost
some property.

> What do compiler barriers have to do with cache coherence?

You cannot add CPU barriers to achieve cache coherence without having
compiler barriers there too.

> >>>> but... what specific order are you trying
> >>>> to enforce with those constructs?
> >>>
> >>> In buf_read_recv_pages() and buf_read_ibuf_page():
> >>>
> >>> while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
> >>>        > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
> >>>         os_thread_sleep(500000);
> >>> }
> >>>
> >>> it is necessary to have a load barrier (compiler-only for
> >>> x86/x86_64, potentially a CPU barrier too elsewhere, for visibility
> >>> and ordering).
>
> That doesn't answer my question: what specific ordering are you trying
> to enforce with those constructs?

The buf_pool->n_pend_reads load in each iteration must happen after the
buf_pool->n_pend_reads load of the previous iteration.

> >> One needs compiler/CPU barriers when the code, after loading a value
> >> from one location, expects values from other locations in a certain
> >> state (or assumes certain cross-location visibility, in other words).
> >
> > That is ordering, yes, but I don't think this is an exhaustive list of
> > when barriers are needed. Here a compiler barrier is needed to prevent
> > hoisting the load above the loop, which would turn the loop body into
> > an infinite loop.
> >
> > The same problem in mutex_spin_wait(), for example, is addressed by
> > ib_mutex_t::lock_word being defined with the volatile keyword.

I don't think that's an example to follow, as elaborated below.

> However, no atomicity, visibility or ordering is enforced when reading
> that value in mutex_spin_wait(), because it doesn't need them.

I don't think mutex_spin_wait() ever wants, regarding
1) atomicity: to observe a torn load, or to perform a store such that
another thread observes a torn load;
2) visibility: to miss a lock word write by another thread, or to have
its own write remain invisible to another thread;
3) ordering: to have the lock word load optimized out of the loop.
Unless I'm mistaken, all three are needed, and the fact that the current
code relies on volatile instead troubles me slightly.

In fact, mutex_exit_func() has the following comment:

	/* A problem: we assume that mutex_reset_lock_word
	is a memory barrier, that is when we read the waiters
	field next, the read must be serialized in memory
	after the reset. A speculative processor might
	perform the read first, which could leave a waiting
	thread hanging indefinitely.

	Our current solution call every second
	sync_arr_wake_threads_if_sema_free()
	to wake up possible hanging threads if
	they are missed in mutex_signal_object. */

That is where concurrent programming with volatile only and without
barriers leads. What I'm proposing is to acknowledge the memory model and
use correct primitives to operate on it.
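On x86_64 the "correct primitives" I'm proposing would look roughly like
the sketch below. The barrier macro name is hypothetical; the point is
that the barrier is part of the primitive, not something call sites have
to remember.

    /* A GCC-style compiler barrier: it emits no instructions, but the
    compiler may not cache memory values in registers across it, nor
    move memory accesses around it. */
    #define os_compiler_barrier() __asm__ __volatile__("" ::: "memory")

    static inline ulint
    os_atomic_load_ulint(const ulint* ptr)
    {
        ulint	val;

        /* The load may not be hoisted above this point, e.g. out of
        a polling loop such as the n_pend_reads one... */
        os_compiler_barrier();

        val = *ptr;

        /* ...nor sunk below this one. On a weakly-ordered or
        non-cache-coherent architecture a CPU read barrier would be
        issued here as well; on x86_64 none is needed. */
        os_compiler_barrier();

        return(val);
    }

The store is symmetric: a compiler barrier on each side of a plain aligned
store, plus whatever CPU write barrier the architecture requires.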
> We need atomicity for counters, so the volatile keyword is not
> sufficient. But it's not provided by any barriers either.

Counter atomicity is ensured by an atomic increment or an atomic CAS
operation, and that operation also includes all the compiler/CPU barriers
needed to ensure visibility and ordering. An atomic load or store is not
any less atomic than an increment or a CAS just because it involves no
computation with an intermediate result.
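As an aside, this is what I read the existing counter primitives to boil
down to, assuming the GCC __sync builtins (the wrapper name here is
hypothetical):

    /* __sync_add_and_fetch() is an atomic read-modify-write and a
    full barrier: on x86_64 it compiles to a LOCK-prefixed
    instruction, and the compiler may not move memory accesses across
    it. Atomicity, visibility and ordering all follow from this one
    builtin. */
    static inline ulint
    os_atomic_increment_ulint(volatile ulint* ptr, ulint amount)
    {
        return(__sync_add_and_fetch(ptr, amount));
    }

An atomic load is the degenerate case of the same contract: a single
aligned read needs no LOCK prefix on x86_64, but the caller must still get
a whole value, never a torn one.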
> >> What assumptions about other value states are made in the above code
> >> after reading buf_pool->n_pend_reads?
> >
> > Thus this question does not apply.
>
> It does, because the answer is "none" and it shows that "volatile" would
> be a sufficient replacement for os_atomic_load/store if not for the
> atomicity issues, which are not addressed by either approach.

It does not, because no "other values" need to be involved for the memory
model to matter: it applies to accesses to this single value as well.

> >> It makes even less sense in other occurrences of
> >> os_atomic_load_ulint(), e.g. in buf_get_total_stat(). I disassembled
> >> the code generated for buf_get_total_stat() by GCC with compiler
> >> barriers both enabled and disabled.
> >>
> >> With barriers disabled the code corresponding to
> >> "tot_stat->n_pages_written +=
> >> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
> >> ...
> >>
> >> With barriers enabled the same code is:
> >> ...
> >
> > I'm mildly curious how this would compare to a volatile
> > n_pages_written, too.
>
> You are welcome to experiment yourself.

I didn't mean to imply that you, or I, or anybody else has to do it.

> >> So what has been achieved by a compiler barrier here except the fact
> >> that the compiler avoids keeping buf_stat->n_pages_written in a
> >> register and thus produces less efficient code?
> >
> > The buf_stat->n_pages_written load from memory is forced to happen
> > next to where the value is used. On non-cache-coherent architectures a
> > correct os_atomic_load_ulint() implementation would additionally
> > ensure that the value loaded is the latest value written by any other
> > thread.
>
> No, in both cases the value is used to modify a register and then the
> register is stored to memory. So the correct answer to my question is
> "nothing".

Your answer ignores the "on non-cache-coherent architectures" part. It
ignores the first sentence too. An x86_64 assembly comparison contributes
little to this discussion, because we don't need CPU barriers there.

> >>> In buf_LRU_get_free_block() there is a loop, which can be executed
> >>> by multiple query threads concurrently, and it both loads and stores
> >>> buf_pool->try_LRU_scan in it, which requires immediate visibility to
> >>> any concurrent threads, so again it is necessary to have the load
> >>> and store barriers.
>
> Er, and how exactly do compiler barriers implement immediate visibility
> to any concurrent threads?

A compiler barrier forces a memory store to happen exactly where it is
written, and on x86_64 a memory store is immediately visible to everybody
else. A compiler barrier likewise forces a memory load to actually happen,
which means that a store by another thread becomes visible to the current
thread.

> That also needs atomicity, but that's not what barriers guarantee.

I don't think I ever said that a _compiler barrier_ implements
_atomicity_. Have I?

> >> OK, compiler barriers force the compiler to perform _multiple_
> >> load/store ops in the desired order. And CPU barriers impose the same
> >> requirements on the CPU. Which means they only matter for
> >> cross-location visibility, but have nothing to do with single-value
> >> visibility.
> >
> > By "multiple" above I believe you are again referring to the ordering
> > of accesses to different memory locations. And, as above, compiler
> > barriers are required for single-location accesses too, and CPU
> > barriers would be as well.
> >
> > Without a compiler barrier there is nothing that prevents the compiler
> > from compiling buf_LRU_get_free_block() to
> >
> > register = try_LRU_scan
> > ...loop that sets and accesses try_LRU_scan through the register...
> > try_LRU_scan = register
>
> That's what the volatile keyword is for.

Volatile semantics are not very clear, and they have little to do with
correct concurrency. Compilers emit code that happens to do what we want,
partially or fully, when volatile is encountered, by emitting the memory
loads and stores and not moving them around; but there is no guarantee
that a store is fully retired before the next load, and no visibility
guarantee. Correct multithreaded programming must be done with correct
primitives that are aware of the memory model.
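If a new enough GCC may be assumed (the __atomic builtins appeared in GCC
4.7), such primitives can even be expressed directly in terms of an
explicit memory model, instead of relying on whatever volatile happens to
do. A sketch, again with hypothetical names:

    static inline ulint
    os_atomic_load_ulint(const ulint* ptr)
    {
        /* An atomic load-acquire: a plain load plus a compiler
        barrier on x86_64; a weakly-ordered architecture gets the
        CPU barrier it needs. */
        return(__atomic_load_n(ptr, __ATOMIC_ACQUIRE));
    }

    static inline void
    os_atomic_store_ulint(ulint* ptr, ulint val)
    {
        /* The matching atomic store-release. */
        __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
    }

With these, the try_LRU_scan loop is correct by construction: every load
and store happens exactly where it is written, is never torn, and becomes
visible to the other query threads.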
> >> Aren't you reinventing volatile here?
> >
> > Volatile is not enough: on a non-cache-coherent architecture volatile
> > would force a load/store to be performed, yes, but it would not
> > necessarily make a store visible elsewhere, nor would a load
> > necessarily be performed in a cache-coherent manner.
>
> *sigh* and compiler barriers solve that problem by ...?

Where did I say that _compiler_ barriers solve this problem on a
_non_-cache-coherent architecture? CPU barriers are needed there, and
compiler barriers are a prerequisite for them.

> >> But taking the atomicity considerations into account, I'm fairly sure
> >> it's much better to protect those values with a separate mutex and be
> >> done with it.
> >
> > IMHO the jury is still out.
>
> *shrug* if you have better ideas, I'd love to hear them.

I still have the same idea, but obviously I am failing to explain it well.

> >> The cleanup might be done for 5.5 as well if we had unlimited
> >> resources. But as my cleanups in AHI partitioning have shown,
> >> implementing, merging, and reviewing things like this separately is
> >> more expensive in terms of man-hours.
> >
> > The AHI example is a surprise to me. I see the additional time needed
> > for separate testing runs, but I personally always prefer to do common
> > things across the versions commonly, and separate things separately.
> > This probably depends on the person.
>
> The relevant code has diverged between 5.5 and 5.6 considerably. So
> implementing those cleanups in 5.5 and then merging them to 5.6 was far
> from trivial. That's one thing.
>
> Another thing is that implementing those cleanups in 5.5 is less
> important than in 5.6, so it could wait longer.
>
> Finally, 1 MP/review cycle is less man-hours than 1 + 2 extra MP/review
> cycles. It's that simple.

Thanks for the explanations.