Percona Server moved to https://jira.percona.com/projects/PS

Merge lp:~laurynas-biveinis/percona-server/bp-split-5.6 into lp:percona-server/5.6

bp-split-5.6
Merge into 5.6

Proposed by Laurynas Biveinis on 2013-09-20

Status:

Merged

Approved by:

Alexey Kopytov on 2013-09-20

Approved revision:

no longer in the source branch.

Merged at revision:

425

Proposed branch:

lp:~laurynas-biveinis/percona-server/bp-split-5.6

Merge into:

lp:percona-server/5.6

Diff against target:

4913 lines (+1001/-789)

22 files modified

Percona-Server/storage/innobase/btr/btr0cur.cc (+18/-6)
Percona-Server/storage/innobase/btr/btr0sea.cc (+5/-8)
Percona-Server/storage/innobase/buf/buf0buddy.cc (+46/-18)
Percona-Server/storage/innobase/buf/buf0buf.cc (+220/-196)
Percona-Server/storage/innobase/buf/buf0dblwr.cc (+1/-0)
Percona-Server/storage/innobase/buf/buf0dump.cc (+8/-8)
Percona-Server/storage/innobase/buf/buf0flu.cc (+153/-125)
Percona-Server/storage/innobase/buf/buf0lru.cc (+280/-177)
Percona-Server/storage/innobase/buf/buf0rea.cc (+41/-26)
Percona-Server/storage/innobase/fsp/fsp0fsp.cc (+1/-1)
Percona-Server/storage/innobase/handler/ha_innodb.cc (+12/-2)
Percona-Server/storage/innobase/handler/i_s.cc (+14/-9)
Percona-Server/storage/innobase/ibuf/ibuf0ibuf.cc (+1/-1)
Percona-Server/storage/innobase/include/buf0buddy.h (+4/-4)
Percona-Server/storage/innobase/include/buf0buddy.ic (+9/-10)
Percona-Server/storage/innobase/include/buf0buf.h (+91/-108)
Percona-Server/storage/innobase/include/buf0buf.ic (+63/-63)
Percona-Server/storage/innobase/include/buf0flu.h (+6/-7)
Percona-Server/storage/innobase/include/buf0flu.ic (+0/-2)
Percona-Server/storage/innobase/include/buf0lru.h (+9/-7)
Percona-Server/storage/innobase/include/sync0sync.h (+13/-4)
Percona-Server/storage/innobase/sync/sync0sync.cc (+6/-7)

To merge this branch:

bzr merge lp:~laurynas-biveinis/percona-server/bp-split-5.6

Related bugs:

Bug #1219842: buf_LRU_buf_pool_running_out() does not need to lock any mutexes	Medium	Fix Released
Bug #1220544: Is buf_LRU_free_page() really supposed to make non-zip block sticky at the end?	Medium	Fix Released

Link a bug report

Reviewer	Date Requested	Status
Alexey Kopytov (community)	2013-09-20	Approve on 2013-09-20
Laurynas Biveinis		Pending
Review via email: mp+186711@code.launchpad.net

This proposal supersedes a proposal from 2013-09-09.

Description of the change

Repushed and resubmitted. The only change from the 3rd push is atomic ops split off to be handled later.

http://jenkins.percona.com/job/percona-server-5.6-param/259/

No BT or ST, but 5.6 GA prerequisite.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-10: Posted in a previous version of this proposal

Download full text (4.1 KiB)

Hi Laurynas,

This is not a complete review, I'm only about 10% into the patch, but I'm posting the comments I have so far to parallelize things a bit:

  - the code in btr_blob_free() can be simplified. just initialize
    ‘freed’ with ‘false’, then assign to it the result of
    buf_LRU_free_page() whenever it is called, and then do this at the
    end:

    if (!freed) {
     mutex_exit(&buf_pool->LRU_list_mutex);
    }

Would result in less hairy code.

- wrong comments for buf_LRU_free_page(): s/block
mutex/buf_page_get_mutex() mutex/

  - the comments for buf_LRU_free_page() say that both LRU_list_mutex
    and block_mutex may be released temporarily if ‘true’ is
    returned. But:
    1) even if ‘false’ is returned, block_mutex may also be released
       temporarily
    2) the comments don’t mention that if ‘true’ is returned,
       LRU_list_mutex is always released upon return, but block_mutex is
       always locked. And callers of buf_LRU_free_page() rely on that.

  - the following code in buf_LRU_free_page() is missing a
    buf_page_free_descriptor() call if b != NULL. Which is a potential
    memory leak.

  if (!buf_LRU_block_remove_hashed(bpage, zip)) {
+
+ mutex_exit(&buf_pool->LRU_list_mutex);
+
+ mutex_enter(block_mutex);
+
   return(true);
  }

  - the patch removes buf_pool_mutex_enter_all() from
    btr_search_validate_one_table(), but then does a number of dirty
    reads from ‘block’ before it locks block->mutex. Any reasons to not
    lock block->mutex earlier?

  - the following checks for mutex != NULL in buf_buddy_relocate() seem
    to be redundant, since they are made after mutex_enter(mutex), so we
    are guaranteed mutex != NULL if we reach that code:

@@ -584,7 +604,11 @@ buf_buddy_relocate(

mutex_enter(mutex);

- if (buf_page_can_relocate(bpage)) {
+ rw_lock_s_unlock(hash_lock);
+
+ mutex_enter(&buf_pool->zip_free_mutex);
+
+ if (mutex && buf_page_can_relocate(bpage)) {

and

- mutex_exit(mutex);
+ if (mutex)
+ mutex_exit(mutex);
+
+ ut_ad(mutex_own(&buf_pool->zip_free_mutex));

and the last hunk is also missing braces (in case you decide to keep
it).

  - asserting that zip_free_mutex is locked also looks redundant to me,
    because it is locked just a few lines above, and there’s nothing in
    the code path that could release it.

  - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
    need that stuff. Their names are misleading as they don’t enforce
    any atomicity. They should be named os_ordered_load_ulint() /
    os_ordered_store_ulint(), but... what specific order are you trying
    to enforce with those constructs?

  - I don’t see a point in maintaining multiple list nodes in buf_page_t
    (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
    page may only be in a single list at any point in time, so splitting
    the list node is purely cosmetic.

    On the other hand, we are looking at a non-trivial buf_page_t size
    increase (112 bytes before the patch, 144 bytes after). Leaving all
    cache and memory locality questions aside, that’s 64 MB of memory
    just for list node pointers on a system with a 32 GB buffer pool....

Hi Laurynas,

This is not a complete review, I'm only about 10% into the patch, but I'm posting the comments I have so far to parallelize things a bit:

if (!freed) {
    	mutex_exit(&buf_pool->LRU_list_mutex);
    }
    
    Would result in less hairy code.

- wrong comments for buf_LRU_free_page(): s/block
    mutex/buf_page_get_mutex() mutex/

- the following code in buf_LRU_free_page() is missing a
    buf_page_free_descriptor() call if b != NULL. Which is a potential
    memory leak.

if (!buf_LRU_block_remove_hashed(bpage, zip)) {
+
+		mutex_exit(&buf_pool->LRU_list_mutex);
+
+		mutex_enter(block_mutex);
+
 		return(true);
 	}

- the following checks for mutex != NULL in buf_buddy_relocate() seem
    to be redundant, since they are made after mutex_enter(mutex), so we
    are guaranteed mutex != NULL if we reach that code:

@@ -584,7 +604,11 @@ buf_buddy_relocate(
 
 	mutex_enter(mutex);
 
-	if (buf_page_can_relocate(bpage)) {
+	rw_lock_s_unlock(hash_lock);
+
+	mutex_enter(&buf_pool->zip_free_mutex);
+
+	if (mutex && buf_page_can_relocate(bpage)) {

and

-	mutex_exit(mutex);
+	if (mutex)
+		mutex_exit(mutex);
+
+	ut_ad(mutex_own(&buf_pool->zip_free_mutex));

and the last hunk is also missing braces (in case you decide to keep
    it).

- asserting that zip_free_mutex is locked also looks redundant to me,
    because it is locked just a few lines above, and there’s nothing in
    the code path that could release it.

What’s the “optimistic use” mentioned in the comment for those nodes
    in buf_page_t?

- the following hunk simply removes reference to buf_pool->mutex. As I
    understand, it should just replace buf_pool->mutex with zip_free_mutex?

-	page_zip_des_t	zip;		/*!< compressed page; zip.data
-					(but not the data it points to) is
-					also protected by buf_pool->mutex;
-					state == BUF_BLOCK_ZIP_PAGE and
-					zip.data == NULL means an active
+	page_zip_des_t	zip;		/*!< compressed page; state
+					== BUF_BLOCK_ZIP_PAGE and zip.data
+					== NULL means an active

- I’m not sure creating separate mutex per each n_flush[]/init_flush[]
    element is worth the effort. Those mutexes are basically locked to
    read/update a few memory locations on code paths that don’t look
    critical. I don’t think there’s any measurable effect from split
    mutexes (and it can potentially even have negative impact in cases
    where we need to lock them all). Let’s just have a single
    buf_pool->flush_state_mutex.

review: Needs Fixing

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-11: Posted in a previous version of this proposal

Download full text (7.7 KiB)

Alexey -

> - the code in btr_blob_free() can be simplified

Simplified.

> - wrong comments for buf_LRU_free_page(): s/block
> mutex/buf_page_get_mutex() mutex/

Edited. But there are numerous other places in this patch (and upstream) that would need this editing too, and "block mutex" is already an established shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention pointer to mutex variables named block_mutex.

Do you want me to edit the other places too?

> - the comments for buf_LRU_free_page() say that both LRU_list_mutex
> and block_mutex may be released temporarily if ‘true’ is
> returned. But:
> 1) even if ‘false’ is returned, block_mutex may also be released
> temporarily
> 2) the comments don’t mention that if ‘true’ is returned,
> LRU_list_mutex is always released upon return, but block_mutex is
> always locked. And callers of buf_LRU_free_page() rely on that.

Indeed callers rely on the current, arguably messy, buf_LRU_free_page() locking. This is how I edited the header comment for this and the previous review comment:

/******************************************************************//**
Try to free a block. If bpage is a descriptor of a compressed-only
page, the descriptor object will be freed as well.

NOTE: If this function returns true, it will release the LRU list mutex,
and temporarily release and relock the buf_page_get_mutex() mutex.
Furthermore, the page frame will no longer be accessible via bpage. If this
function returns false, the buf_page_get_get_mutex() might be temporarily
released and relocked too.

The caller must hold the LRU list and buf_page_get_mutex() mutexes.

@return true if freed, false otherwise. */

> - the following code in buf_LRU_free_page() is missing a
> buf_page_free_descriptor() call if b != NULL. Which is a potential
> memory leak.

Fixed.

> - the patch removes buf_pool_mutex_enter_all() from
> btr_search_validate_one_table(), but then does a number of dirty
> reads from ‘block’ before it locks block->mutex. Any reasons to not
> lock block->mutex earlier?

I *think* there were actual reasons, but I cannot remember them, due to the number of things going on with this patch. And I don't see why it locking block->mutex earlier is not possible now. I will look further.

> - the following checks for mutex != NULL in buf_buddy_relocate() seem
> to be redundant, since they are made after mutex_enter(mutex), so we
> are guaranteed mutex != NULL if we reach that code:

Fixed. This looks like a missed cleanup after removing 5.5's buf_page_get_mutex_enter().

> - asserting that zip_free_mutex is locked also looks redundant to me,
> because it is locked just a few lines above, and there’s nothing in
> the code path that could release it.

Removed. Added a comment to the function header "The caller must hold zip_free_mutex, and this
function will release and lock it again." instead.

> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
> need that stuff. Their names are misleading as they don’t enforce
> any atomicity.

The ops being load and store, their atom...

Alexey -

>   - the code in btr_blob_free() can be simplified

Simplified.

>   - wrong comments for buf_LRU_free_page(): s/block
>     mutex/buf_page_get_mutex() mutex/

Do you want me to edit the other places too?

>   - the comments for buf_LRU_free_page() say that both LRU_list_mutex
>     and block_mutex may be released temporarily if ‘true’ is
>     returned. But:
>     1) even if ‘false’ is returned, block_mutex may also be released
>        temporarily
>     2) the comments don’t mention that if ‘true’ is returned,
>        LRU_list_mutex is always released upon return, but block_mutex is
>        always locked. And callers of buf_LRU_free_page() rely on that.

Indeed callers rely on the current, arguably messy, buf_LRU_free_page() locking. This is how I edited the header comment for this and the previous review comment:

/******************************************************************//**
Try to free a block.  If bpage is a descriptor of a compressed-only
page, the descriptor object will be freed as well.

NOTE: If this function returns true, it will release the LRU list mutex,
and temporarily release and relock the buf_page_get_mutex() mutex.
Furthermore, the page frame will no longer be accessible via bpage.  If this
function returns false, the buf_page_get_get_mutex() might be temporarily
released and relocked too.

The caller must hold the LRU list and buf_page_get_mutex() mutexes.

@return true if freed, false otherwise. */

>   - the following code in buf_LRU_free_page() is missing a
>     buf_page_free_descriptor() call if b != NULL. Which is a potential
>     memory leak.

Fixed.

>   - the patch removes buf_pool_mutex_enter_all() from
>     btr_search_validate_one_table(), but then does a number of dirty
>     reads from ‘block’ before it locks block->mutex. Any reasons to not
>     lock block->mutex earlier?

>   - the following checks for mutex != NULL in buf_buddy_relocate() seem
>     to be redundant, since they are made after mutex_enter(mutex), so we
>     are guaranteed mutex != NULL if we reach that code:

Fixed. This looks like a missed cleanup after removing 5.5's buf_page_get_mutex_enter().

>   - asserting that zip_free_mutex is locked also looks redundant to me,
>     because it is locked just a few lines above, and there’s nothing in
>     the code path that could release it.

Removed. Added a comment to the function header "The caller must hold zip_free_mutex, and this
function will release and lock it again." instead.

>   - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>     need that stuff. Their names are misleading as they don’t enforce
>     any atomicity.

The ops being load and store, their atomicity is enforced by the data type width.

> They should be named os_ordered_load_ulint() /
>     os_ordered_store_ulint(),

That's an option, but I needed atomicity, visibility, and ordering, and chose atomic for function name to match the existing CAS and atomic add operations, which also need all three.

> but... what specific order are you trying
>     to enforce with those constructs?

In buf_read_recv_pages() and buf_read_ibuf_page().

while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
		       > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
			os_thread_sleep(500000);
		}

it is necessary to have a load barrier (compiler only for x86/x86_64, potentially a CPU barrier too elsewhere for visibility and ordering).

In buf_LRU_get_free_block() there is a loop, which can be executed by multiple query threads concurrently, and it both loads and stores buf_pool->try_LRU_scan in it, which requires immediate visibility to any concurrent threads, again necessary to have the load and store barriers.

>   - I don’t see a point in maintaining multiple list nodes in buf_page_t
>     (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
>     page may only be in a single list at any point in time, so splitting
>     the list node is purely cosmetic.
> 
>     On the other hand, we are looking at a non-trivial buf_page_t size
>     increase (112 bytes before the patch, 144 bytes after). Leaving all
>     cache and memory locality questions aside, that’s 64 MB of memory
>     just for list node pointers on a system with a 32 GB buffer pool.
> 
>     What’s the “optimistic use” mentioned in the comment for those nodes
>     in buf_page_t?

These are my concerns too. I don't address them in the current MP as these bits date all the way to the XtraDB 5.1. Thus I think it's best addressed in a follow-up that is common with 5.5.

>   - the following hunk simply removes reference to buf_pool->mutex. As I
>     understand, it should just replace buf_pool->mutex with zip_free_mutex?
> 
> -       page_zip_des_t  zip;            /*!< compressed page; zip.data
> -                                       (but not the data it points to) is
> -                                       also protected by buf_pool->mutex;
> -                                       state == BUF_BLOCK_ZIP_PAGE and
> -                                       zip.data == NULL means an active
> +       page_zip_des_t  zip;            /*!< compressed page; state
> +                                       == BUF_BLOCK_ZIP_PAGE and zip.data
> +                                       == NULL means an active

Hm, it looked to me that it's protected not with zip_free_mutex but with zip_mutex in its page mutex capacity. I will check.

>   - I’m not sure creating separate mutex per each n_flush[]/init_flush[]
>     element is worth the effort. Those mutexes are basically locked to
>     read/update a few memory locations on code paths that don’t look
>     critical. I don’t think there’s any measurable effect from split
>     mutexes (and it can potentially even have negative impact in cases
>     where we need to lock them all). Let’s just have a single
>     buf_pool->flush_state_mutex.

First, the effort is very little actually, the per-flush-type split is very natural. The cases of needing to lock them all are 1) buf_pool_validate_instance() to lock them all at once, 2) buf_stats_get_pool_info() locks them sequentially, 3) buf_pool_check_pending_io() sequentially too. 1) is debug-only code, 2) should not be called with such frequency that it becomes a problem, 3) is startup/shutdown code only.

Second, on how I arrived at the current mutexes. The initial patch version did not have the flush_state_mutex array, and it kept the remaining buf_pool->mutex just like the current 5.5 version does. But we observed contention on this remaining mutex. Then I converted all the variables protected by it (stats, init_flush[], n_flush[], try_LRU_scan, maybe something else I forgot) to atomic variables and this is the current state of things on the experimental branches. But while working on the current MP I worked out that it is impossible for init_flush and n_flush to be independent atomic variables, and if one wants to observe a consistent buffer pool state snapshot as in buf_pool_validate_instance(), then some of the buffer page I/O fix, flush state and flush list presence transitions are not independent neither. Hence the mutex, and since the per-flush-type split appeared to be very natural, the mutex array.

Do you want me to repush the branch as I address the comments or wait for a full first review pass first?

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-11: Posted in a previous version of this proposal

Download full text (11.5 KiB)

Hi Laurynas,

On Wed, 11 Sep 2013 09:45:13 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>> - the code in btr_blob_free() can be simplified
>
> Simplified.
>
>> - wrong comments for buf_LRU_free_page(): s/block
>> mutex/buf_page_get_mutex() mutex/
>
> Edited. But there are numerous other places in this patch (and upstream) that would need this editing too, and "block mutex" is already an established shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention pointer to mutex variables named block_mutex.
>

I think editing existing comments is worth the efforts (and potential
extra maintenance cost in the future). I would be OK if this specific
comment was left intact too. It only caught my eye because the comment
was edited and I spent some time verifying it.

> Do you want me to edit the other places too?
>
>> - the comments for buf_LRU_free_page() say that both LRU_list_mutex
>> and block_mutex may be released temporarily if ‘true’ is
>> returned. But:
>> 1) even if ‘false’ is returned, block_mutex may also be released
>> temporarily
>> 2) the comments don’t mention that if ‘true’ is returned,
>> LRU_list_mutex is always released upon return, but block_mutex is
>> always locked. And callers of buf_LRU_free_page() rely on that.
>
> Indeed callers rely on the current, arguably messy, buf_LRU_free_page() locking. This is how I edited the header comment for this and the previous review comment:
>
> /******************************************************************//**
> Try to free a block. If bpage is a descriptor of a compressed-only
> page, the descriptor object will be freed as well.
>
> NOTE: If this function returns true, it will release the LRU list mutex,
> and temporarily release and relock the buf_page_get_mutex() mutex.
> Furthermore, the page frame will no longer be accessible via bpage. If this
> function returns false, the buf_page_get_get_mutex() might be temporarily
> released and relocked too.
>
> The caller must hold the LRU list and buf_page_get_mutex() mutexes.
>
> @return true if freed, false otherwise. */
>
>

Looks good.

>> - the patch removes buf_pool_mutex_enter_all() from
>> btr_search_validate_one_table(), but then does a number of dirty
>> reads from ‘block’ before it locks block->mutex. Any reasons to not
>> lock block->mutex earlier?
>
> I *think* there were actual reasons, but I cannot remember them, due to the number of things going on with this patch. And I don't see why it locking block->mutex earlier is not possible now. I will look further.
>

OK.

>> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>> need that stuff. Their names are misleading as they don’t enforce
>> any atomicity.
>
> The ops being load and store, their atomicity is enforced by the data type width.
>

Right, the atomicity is enforced by the data type width on those
architectures that provide it. And even those that do provide it have a
number of prerequisites. Neither of those 2 facts is taken care of in
os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
different with respect to atomi...

Hi Laurynas,

On Wed, 11 Sep 2013 09:45:13 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>>    - the code in btr_blob_free() can be simplified
>
> Simplified.
>
>>    - wrong comments for buf_LRU_free_page(): s/block
>>      mutex/buf_page_get_mutex() mutex/
>
> Edited. But there are numerous other places in this patch (and upstream) that would need this editing too, and "block mutex" is already an established shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention pointer to mutex variables named block_mutex.
>

I think editing existing comments is worth the efforts (and potential 
extra maintenance cost in the future). I would be OK if this specific 
comment was left intact too. It only caught my eye because the comment 
was edited and I spent some time verifying it.

> Do you want me to edit the other places too?
>
>>    - the comments for buf_LRU_free_page() say that both LRU_list_mutex
>>      and block_mutex may be released temporarily if ‘true’ is
>>      returned. But:
>>      1) even if ‘false’ is returned, block_mutex may also be released
>>         temporarily
>>      2) the comments don’t mention that if ‘true’ is returned,
>>         LRU_list_mutex is always released upon return, but block_mutex is
>>         always locked. And callers of buf_LRU_free_page() rely on that.
>
> Indeed callers rely on the current, arguably messy, buf_LRU_free_page() locking. This is how I edited the header comment for this and the previous review comment:
>
> /******************************************************************//**
> Try to free a block.  If bpage is a descriptor of a compressed-only
> page, the descriptor object will be freed as well.
>
> NOTE: If this function returns true, it will release the LRU list mutex,
> and temporarily release and relock the buf_page_get_mutex() mutex.
> Furthermore, the page frame will no longer be accessible via bpage.  If this
> function returns false, the buf_page_get_get_mutex() might be temporarily
> released and relocked too.
>
> The caller must hold the LRU list and buf_page_get_mutex() mutexes.
>
> @return true if freed, false otherwise. */
>
>

Looks good.

>>    - the patch removes buf_pool_mutex_enter_all() from
>>      btr_search_validate_one_table(), but then does a number of dirty
>>      reads from ‘block’ before it locks block->mutex. Any reasons to not
>>      lock block->mutex earlier?
>
> I *think* there were actual reasons, but I cannot remember them, due to the number of things going on with this patch. And I don't see why it locking block->mutex earlier is not possible now. I will look further.
>

OK.

>>    - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>>      need that stuff. Their names are misleading as they don’t enforce
>>      any atomicity.
>
> The ops being load and store, their atomicity is enforced by the data type width.
>

Right, the atomicity is enforced by the data type width on those 
architectures that provide it. And even those that do provide it have a 
number of prerequisites. Neither of those 2 facts is taken care of in 
os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any 
different with respect to atomicity as plain load/store of a ulint and 
thus, have misleading names.

So to justify "atomic" in their names those functions should:

- (if we want to be portable) protect those load/stores with a mutex
- (if we only care about x86/x86_64) make sure that values being 
loaded/stored do not cross cache lines or page boundaries. Which is of 
course impossible to guarantee in a generic function.

>> They should be named os_ordered_load_ulint() /
>>      os_ordered_store_ulint(),
>
> That's an option, but I needed atomicity, visibility, and ordering, and chose atomic for function name to match the existing CAS and atomic add operations, which also need all three.
>

I'm not sure you need all of those 3 in every occurrence of those 
functions, but see below.

>> but... what specific order are you trying
>>      to enforce with those constructs?
>
> In buf_read_recv_pages() and buf_read_ibuf_page().
>
> 		while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
> 		       > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
> 			os_thread_sleep(500000);
> 		}
>
> it is necessary to have a load barrier (compiler only for x86/x86_64, potentially a CPU barrier too elsewhere for visibility and ordering).
>

One needs compiler/CPU barriers when the code, after loading a value 
from one location, expects values from other locations in a certain 
state (or assumes certain cross-location visibility in other words). 
What assumptions about other value states are made in the above code 
after reading buf_pool->n_pend_reads?

It makes even less sense in other occurrences of os_atomic_load_ulint(), 
e.g. in buf_get_total_stat(). I disassembled the code generated for 
buf_get_total_statw() by GCC with compiler barriers both enabled and 
disabled.

With barriers disabled the code corresponding to 
"tot_stat->n_pages_written += 
os_atomic_load_ulint(&buf_stat->n_pages_written)" is:

# Load buf_stat->n_pages_written and add it to tot_stat value in
# a register
add    -0x30(%r13,%rdx,1),%r14
# store the computed value to the tot_stat field
mov    %r14,0x10(%rdi)

With barriers enabled the same code is:

# Load the current tot_stat field to a register
mov    0x10(%rdi),%rax
# Add the value of buf_stat->n_pages_written
add    -0x30(%r8,%rcx,1),%rax
# store the computed value back to the tot_stat field
mov    %rax,0x10(%rdi)

So what has been achieved by a compiler barrier here except the fact 
that the compiler avoids storing buf_stat->n_pages_written to a register 
and thus produces less efficient code?

> In buf_LRU_get_free_block() there is a loop, which can be executed by multiple query threads concurrently, and it both loads and stores buf_pool->try_LRU_scan in it, which requires immediate visibility to any concurrent threads, again necessary to have the load and store barriers.
>

OK, compiler barriers force the compiler to perform _multiple_ 
load/store ops in the desired order. And CPU barriers impose the same 
reqs on CPU. Which means they only matter for cross-location visibility, 
but have nothing to do with a single value visibility.

Aren't you reinventing volatile here?

But taking the atomicity considerations into account, I'm fairly sure 
it's much better to protect those values with a separate mutex and be 
done with it.

>>    - I don’t see a point in maintaining multiple list nodes in buf_page_t
>>      (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
>>      page may only be in a single list at any point in time, so splitting
>>      the list node is purely cosmetic.
>>
>>      On the other hand, we are looking at a non-trivial buf_page_t size
>>      increase (112 bytes before the patch, 144 bytes after). Leaving all
>>      cache and memory locality questions aside, that’s 64 MB of memory
>>      just for list node pointers on a system with a 32 GB buffer pool.
>>
>>      What’s the “optimistic use” mentioned in the comment for those nodes
>>      in buf_page_t?
>
> These are my concerns too. I don't address them in the current MP as these bits date all the way to the XtraDB 5.1. Thus I think it's best addressed in a follow-up that is common with 5.5.
>

I'm not sure it should be addressed separately. It fits rather well into 
other bpsplit cleanups you have already done. As in, it's about removing 
unnecessary code, not about adding new code/functionality on top of 
bpsplit in 5.5.

The cleanup might be done for 5.5 as well if we had unlimited resources. 
But as my cleanups in AHI partitioning have shown, implementing, 
merging, reviewing things like this separately is more expensive in 
terms of man-hours.

>>    - the following hunk simply removes reference to buf_pool->mutex. As I
>>      understand, it should just replace buf_pool->mutex with zip_free_mutex?
>>
>> -       page_zip_des_t  zip;            /*!< compressed page; zip.data
>> -                                       (but not the data it points to) is
>> -                                       also protected by buf_pool->mutex;
>> -                                       state == BUF_BLOCK_ZIP_PAGE and
>> -                                       zip.data == NULL means an active
>> +       page_zip_des_t  zip;            /*!< compressed page; state
>> +                                       == BUF_BLOCK_ZIP_PAGE and zip.data
>> +                                       == NULL means an active
>
> Hm, it looked to me that it's protected not with zip_free_mutex but with zip_mutex in its page mutex capacity. I will check.
>

There was a place in the code that asserted zip_free_mutex locked when 
bpage->zip.data is modified. But I'm not sure if that is correct.

>>    - I’m not sure creating separate mutex per each n_flush[]/init_flush[]
>>      element is worth the effort. Those mutexes are basically locked to
>>      read/update a few memory locations on code paths that don’t look
>>      critical. I don’t think there’s any measurable effect from split
>>      mutexes (and it can potentially even have negative impact in cases
>>      where we need to lock them all). Let’s just have a single
>>      buf_pool->flush_state_mutex.
>
> First, the effort is very little actually, the per-flush-type split is very natural. The cases of needing to lock them all are 1) buf_pool_validate_instance() to lock them all at once, 2) buf_stats_get_pool_info() locks them sequentially, 3) buf_pool_check_pending_io() sequentially too. 1) is debug-only code, 2) should not be called with such frequency that it becomes a problem, 3) is startup/shutdown code only.
>

Well, 2 non-debug sites that need to lock all of those mutexes 
sequentially is probably a hint that splitting them is an overkill? Note 
that a mutex acquisition/release may be an expensive operation in 
itself. So, for example, depending on many factors, 3 sequential 
mutex_enter() calls may be much more expensive than a single one. Which 
may be especially visible if the critical section itself is rather short.

> Second, on how I arrived at the current mutexes. The initial patch version did not have the flush_state_mutex array, and it kept the remaining buf_pool->mutex just like the current 5.5 version does. But we observed contention on this remaining mutex. Then I converted all the variables protected by it (stats, init_flush[], n_flush[], try_LRU_scan, maybe something else I forgot) to atomic variables and this is the current state of things on the experimental branches. But while working on the current MP I worked out that it is impossible for init_flush and n_flush to be independent atomic variables, and if one wants to observe a consistent buffer pool state snapshot as in buf_pool_validate_instance(), then some of the buffer page I/O fix, flush state and flush list presence transitions are not independent neither. Hence the mutex, and since the per-flush-type split appeared to be very natural, the mutex array.
>

I see, thanks for clarifications. So the question is: have you 
considered the option to protect just init_flush and n_flush with a 
single mutex? This looks like a good trade-off between 2 extremes: 1) 
protecting them with an overused mutex that is also used to protect a 
bunch of other things; 2) have individual mutexes per each flush type 
(with the implied performance overhead from mutex_enter()/mutex_exit(), 
context switches, and code maintenance overhead).

> Do you want me to repush the branch as I address the comments or wait for a full first review pass first?
>

As you wish. Repushing would probably be slightly better, but I'm OK 
either way.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-11: Posted in a previous version of this proposal

On Wed, 11 Sep 2013 16:06:21 +0400, Alexey Kopytov wrote:
> I think editing existing comments is worth the efforts (and potential
> extra maintenance cost in the future). I would be OK if this specific

Grr, that was supposed to be "NOT worth the efforts" of course.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-11: Posted in a previous version of this proposal

> > - the patch removes buf_pool_mutex_enter_all() from
> > btr_search_validate_one_table(), but then does a number of dirty
> > reads from ‘block’ before it locks block->mutex. Any reasons to not
> > lock block->mutex earlier?
>
> I *think* there were actual reasons, but I cannot remember them, due to the
> number of things going on with this patch. And I don't see why it locking
> block->mutex earlier is not possible now. I will look further.

A test run helped to recover my memories. So the problem is buf_block_hash_get() call, which locks page_hash, which violates locking order. Keeping the initial reads is OK, because 1) block pointer will not be invalidated because we are reading AHI-indexed pages only, which can be only BUF_BLOCK_FILE_PAGE or BUF_BLOCK_REMOVE_HASH (which is a kind of BUF_BLOCK_FILE_PAGE for our purposes), 2) block space and page_id cannot change neither while we hold a corresponding AHI X latch. Thus the initial dirty reads are not actually dirty.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-11: Posted in a previous version of this proposal

Hi Laurynas,

On Wed, 11 Sep 2013 12:26:57 -0000, Laurynas Biveinis wrote:
>>> - the patch removes buf_pool_mutex_enter_all() from
>>> btr_search_validate_one_table(), but then does a number of dirty
>>> reads from ‘block’ before it locks block->mutex. Any reasons to not
>>> lock block->mutex earlier?
>>
>> I *think* there were actual reasons, but I cannot remember them, due to the
>> number of things going on with this patch. And I don't see why it locking
>> block->mutex earlier is not possible now. I will look further.
>
> A test run helped to recover my memories. So the problem is buf_block_hash_get() call, which locks page_hash, which violates locking order. Keeping the initial reads is OK, because 1) block pointer will not be invalidated because we are reading AHI-indexed pages only, which can be only BUF_BLOCK_FILE_PAGE or BUF_BLOCK_REMOVE_HASH (which is a kind of BUF_BLOCK_FILE_PAGE for our purposes), 2) block space and page_id cannot change neither while we hold a corresponding AHI X latch. Thus the initial dirty reads are not actually dirty.
>

Right, makes sense.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-11: Posted in a previous version of this proposal

Download full text (11.7 KiB)

> >> - wrong comments for buf_LRU_free_page(): s/block
> >> mutex/buf_page_get_mutex() mutex/
> >
> > Edited. But there are numerous other places in this patch (and upstream)
> that would need this editing too, and "block mutex" is already an established
> shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention
> pointer to mutex variables named block_mutex.
> >
>
> I think editing existing comments is not worth the efforts (and potential
> extra maintenance cost in the future). I would be OK if this specific
> comment was left intact too. It only caught my eye because the comment
> was edited and I spent some time verifying it.

Fair enough. I applied same edit through the other comments changed by this patch.

> >> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
> >> need that stuff. Their names are misleading as they don’t enforce
> >> any atomicity.
> >
> > The ops being load and store, their atomicity is enforced by the data type
> width.
> >
>
> Right, the atomicity is enforced by the data type width on those
> architectures that provide it.

I forgot to mention that they also have to be not misaligned so that one access does not translate to two accesses.

> And even those that do provide it have a
> number of prerequisites. Neither of those 2 facts is taken care of in
> os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
> different with respect to atomicity as plain load/store of a ulint and
> thus, have misleading names.
>
> So to justify "atomic" in their names those functions should:
>
> - (if we want to be portable) protect those load/stores with a mutex

Why? I guess this question boils down to, what would the mutex implementation code additionally ensure here, let's say, on x86_64? Or is this referring to the 5.6 mutex fallbacks when no atomic ops are implemented for a platform?

> - (if we only care about x86/x86_64) make sure that values being
> loaded/stored do not cross cache lines or page boundaries. Which is of
> course impossible to guarantee in a generic function.

Why? We are talking about ulints here only, and I was not able to find such requirements in the x86_64 memory model descriptions. There is a requirement to be aligned, and misaligned stores/loads might indeed cross cache line or page boundaries, and anything that crosses them is indeed non-atomic. But alignment is possible to guarantee in a generic function (which doesn't even has to be generic: the x86_64 implementation is for x86_64 only, obviously).

Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide, Part 1, section 8.1.1, http://download.intel.com/products/processor/manual/253668.pdf:

"The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
(...)
• Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (...):
• Reading or writing a quadword aligned on a 64-bit boundary
"

My understanding of the above is that os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description, modulo alignment ...

> >>    - wrong comments for buf_LRU_free_page(): s/block
> >>      mutex/buf_page_get_mutex() mutex/
> >
> > Edited. But there are numerous other places in this patch (and upstream)
> that would need this editing too, and "block mutex" is already an established
> shorthand for "really a block mutex or buf_pool->zip_mutex". Not to mention
> pointer to mutex variables named block_mutex.
> >
> 
> I think editing existing comments is not worth the efforts (and potential
> extra maintenance cost in the future). I would be OK if this specific
> comment was left intact too. It only caught my eye because the comment
> was edited and I spent some time verifying it.

Fair enough. I applied same edit through the other comments changed by this patch.

> >>    - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
> >>      need that stuff. Their names are misleading as they don’t enforce
> >>      any atomicity.
> >
> > The ops being load and store, their atomicity is enforced by the data type
> width.
> >
> 
> Right, the atomicity is enforced by the data type width on those
> architectures that provide it.

I forgot to mention that they also have to be not misaligned so that one access does not translate to two accesses.

> And even those that do provide it have a
> number of prerequisites. Neither of those 2 facts is taken care of in
> os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
> different with respect to atomicity as plain load/store of a ulint and
> thus, have misleading names.
> 
> So to justify "atomic" in their names those functions should:
> 
> - (if we want to be portable) protect those load/stores with a mutex

> - (if we only care about x86/x86_64) make sure that values being
> loaded/stored do not cross cache lines or page boundaries. Which is of
> course impossible to guarantee in a generic function.

Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide, Part 1, section 8.1.1, http://download.intel.com/products/processor/manual/253668.pdf:

The Pentium processor (...):
• Reading or writing a quadword aligned on a 64-bit boundary
"

My understanding of the above is that os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description, modulo alignment issues, if any. These are easy to ensure by ut_ad().

> >> They should be named os_ordered_load_ulint() /
> >>      os_ordered_store_ulint(),
> >
> > That's an option, but I needed atomicity, visibility, and ordering, and
> chose atomic for function name to match the existing CAS and atomic add
> operations, which also need all three.
> >
> 
> I'm not sure you need all of those 3 in every occurrence of those
> functions, but see below.

That's right. And ordering probably is not needed anywhere, sorry about that, my understanding of atomics is far from fluent.  But visibility should be needed in every occurrence if this is ever ported to a non-cache-coherent architecture, or if some non-temporal store source code optimization is implemented on x86_64.

> >> but... what specific order are you trying
> >>      to enforce with those constructs?
> >
> > In buf_read_recv_pages() and buf_read_ibuf_page().
> >
> >               while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
> >                      > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
> >                       os_thread_sleep(500000);
> >               }
> >
> > it is necessary to have a load barrier (compiler only for x86/x86_64,
> potentially a CPU barrier too elsewhere for visibility and ordering).
> >
> 
> One needs compiler/CPU barriers when the code, after loading a value
> from one location, expects values from other locations in a certain
> state (or assumes certain cross-location visibility in other words).

That is ordering, yes, but I don't think this is an exhaustive list of when the barriers is needed. Here a compiler barrier is needed to prevent the hoisting of the load above the loop, turning its body to an infinite loop.

> What assumptions about other value states are made in the above code
> after reading buf_pool->n_pend_reads?

Thus this question does not apply.

> It makes even less sense in other occurrences of os_atomic_load_ulint(),
> e.g. in buf_get_total_stat(). I disassembled the code generated for
> buf_get_total_stat() by GCC with compiler barriers both enabled and
> disabled.
> 
> With barriers disabled the code corresponding to
> "tot_stat->n_pages_written +=
> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
> 
> # Load buf_stat->n_pages_written and add it to tot_stat value in
> # a register
> add    -0x30(%r13,%rdx,1),%r14
> # store the computed value to the tot_stat field
> mov    %r14,0x10(%rdi)
> 
> With barriers enabled the same code is:
> 
> # Load the current tot_stat field to a register
> mov    0x10(%rdi),%rax
> # Add the value of buf_stat->n_pages_written
> add    -0x30(%r8,%rcx,1),%rax
> # store the computed value back to the tot_stat field
> mov    %rax,0x10(%rdi)

I'm mildly curious how would this compare to a volatile n_pages_written too.

> So what has been achieved by a compiler barrier here except the fact
> that the compiler avoids storing buf_stat->n_pages_written to a register
> and thus produces less efficient code?

buf_stat->n_pages_written load from memory is forced to happen next to where the value is used. On non-cache-coherent architectures it would be additionally ensured by a correct os_atomic_load_ulint() implementation that the value loaded is the latest value written by any other thread.

> > In buf_LRU_get_free_block() there is a loop, which can be executed by
> multiple query threads concurrently, and it both loads and stores
> buf_pool->try_LRU_scan in it, which requires immediate visibility to any
> concurrent threads, again necessary to have the load and store barriers.
> >
> 
> OK, compiler barriers force the compiler to perform _multiple_
> load/store ops in the desired order. And CPU barriers impose the same
> reqs on CPU. Which means they only matter for cross-location visibility,
> but have nothing to do with a single value visibility.

By "multiple" above I believe you are again referring to ordering different memory location accesses. And, as above, compiler barriers are required for single location access too, and CPU barriers would be as well.

Without a compiler barrier there is nothing that prevents the compiler to compiling buf_LRU_get_free_block to

> Aren't you reinventing volatile here?

Volatile is not enough: on a non-cache coherent architecture the volatile would force a load/store to be performed, yes, but not necessarily would make a store visible elsewhere nor a load would necessarily be performed in a cache-coherent manner.

> But taking the atomicity considerations into account, I'm fairly sure
> it's much better to protect those values with a separate mutex and be
> done with it.

IMHO the jury is still out.

> >>    - I don’t see a point in maintaining multiple list nodes in buf_page_t
> >>      (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
> >>      page may only be in a single list at any point in time, so splitting
> >>      the list node is purely cosmetic.
...
> > These are my concerns too. I don't address them in the current MP as these
> bits date all the way to the XtraDB 5.1. Thus I think it's best addressed in a
> follow-up that is common with 5.5.
> >
> 
> I'm not sure it should be addressed separately. It fits rather well into
> other bpsplit cleanups you have already done. As in, it's about removing
> unnecessary code, not about adding new code/functionality on top of
> bpsplit in 5.5.

OK, I will do this and log a Wishlist bug for 5.5 to backport.

> The cleanup might be done for 5.5 as well if we had unlimited resources.
> But as my cleanups in AHI partitioning have shown, implementing,
> merging, reviewing things like this separately is more expensive in
> terms of man-hours.

The AHI example is a surprise to me.  I see the additional time need for separate testing runs, but I personally always prefer to do common things across the versions commonly, and separate things separately.  This probably depends on a person.

> >>    - the following hunk simply removes reference to buf_pool->mutex. As I
> >>      understand, it should just replace buf_pool->mutex with
> zip_free_mutex?
> >> +       page_zip_des_t  zip;            /*!< compressed page; state
> >> +                                       == BUF_BLOCK_ZIP_PAGE and zip.data
> >> +                                       == NULL means an active
> >
> > Hm, it looked to me that it's protected not with zip_free_mutex but with
> zip_mutex in its page mutex capacity. I will check.
> >
> 
> There was a place in the code that asserted zip_free_mutex locked when
> bpage->zip.data is modified. But I'm not sure if that is correct.

I have checked, I believe it's indeed zip_mutex.  Re. zip_free_mutex, you must be referring to this bit in buf_buddy_relocate():

mutex_enter(&buf_pool->zip_free_mutex);

if (buf_page_can_relocate(bpage)) {
...
		bpage->zip.data = (page_zip_t*) dst;
		mutex_exit(mutex);

buf_buddy_stat_t*	buddy_stat = &buf_pool->buddy_stat[i];
		buddy_stat->relocated++;
		buddy_stat->relocated_usec += ut_time_us(NULL) - usec;
...

Here zip_free_mutex happens to protect buddy_stat and pushing it down to if clause would require an else clause to appear that locks the same mutex.

> >>    - I’m not sure creating separate mutex per each n_flush[]/init_flush[]
> >>      element is worth the effort. Those mutexes are basically locked to
> >>      read/update a few memory locations on code paths that don’t look
> >>      critical. I don’t think there’s any measurable effect from split
> >>      mutexes (and it can potentially even have negative impact in cases
> >>      where we need to lock them all). Let’s just have a single
> >>      buf_pool->flush_state_mutex.
> >
> > First, the effort is very little actually, the per-flush-type split is very
> natural. The cases of needing to lock them all are 1)
> buf_pool_validate_instance() to lock them all at once, 2)
> buf_stats_get_pool_info() locks them sequentially, 3)
> buf_pool_check_pending_io() sequentially too. 1) is debug-only code, 2) should
> not be called with such frequency that it becomes a problem, 3) is
> startup/shutdown code only.
> >
> 
> Well, 2 non-debug sites that need to lock all of those mutexes
> sequentially is probably a hint that splitting them is an overkill? Note
> that a mutex acquisition/release may be an expensive operation in
> itself. So, for example, depending on many factors, 3 sequential
> mutex_enter() calls may be much more expensive than a single one. Which
> may be especially visible if the critical section itself is rather short.

OK. I will convert to a single mutex. For the record I don't see big advantages either way over the other way, but let's have the mild advantage of a single mutex.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-11: Posted in a previous version of this proposal

Download full text (11.6 KiB)

Hi Laurynas,

On Wed, 11 Sep 2013 15:29:33 -0000, Laurynas Biveinis wrote:
>>>> - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>>>> need that stuff. Their names are misleading as they don’t enforce
>>>> any atomicity.
>>>
>>> The ops being load and store, their atomicity is enforced by the data type
>> width.
>>>
>>
>> Right, the atomicity is enforced by the data type width on those
>> architectures that provide it.
>
>
> I forgot to mention that they also have to be not misaligned so that one access does not translate to two accesses.
>

Yes, but alignment does not guarantee atomicity, see below.

>
>> And even those that do provide it have a
>> number of prerequisites. Neither of those 2 facts is taken care of in
>> os_atomic_load_ulint() / os_atomic_store_ulint(). So they are not any
>> different with respect to atomicity as plain load/store of a ulint and
>> thus, have misleading names.
>>
>> So to justify "atomic" in their names those functions should:
>>
>> - (if we want to be portable) protect those load/stores with a mutex
>
>
> Why? I guess this question boils down to, what would the mutex implementation code additionally ensure here, let's say, on x86_64? Or is this referring to the 5.6 mutex fallbacks when no atomic ops are implemented for a platform?
>

mutex the only portable way to ensure atomicity. You can use atomic
primitives provided by specific architectures, but then you either limit
support for those architectures or yes, provide a mutex fallback.

>
>> - (if we only care about x86/x86_64) make sure that values being
>> loaded/stored do not cross cache lines or page boundaries. Which is of
>> course impossible to guarantee in a generic function.
>
>
> Why? We are talking about ulints here only, and I was not able to find such requirements in the x86_64 memory model descriptions. There is a requirement to be aligned, and misaligned stores/loads might indeed cross cache line or page boundaries, and anything that crosses them is indeed non-atomic. But alignment is possible to guarantee in a generic function (which doesn't even has to be generic: the x86_64 implementation is for x86_64 only, obviously).
>
> Intel® 64 and IA-32 Architectures Software Developer's Manual
> Volume 3A: System Programming Guide, Part 1, section 8.1.1, http://download.intel.com/products/processor/manual/253668.pdf:
>
> "The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
> always be carried out atomically:
> (...)
> • Reading or writing a doubleword aligned on a 32-bit boundary
>
> The Pentium processor (...):
> • Reading or writing a quadword aligned on a 64-bit boundary
> "

Why didn't you quote it further?

"
Accesses to cacheable memory that are split across cache lines and page
boundaries are not guaranteed to be atomic by <all processors>. <all
processors> provide bus control signals that permit external memory
subsystems to make split accesses atomic;
"

Which means even aligned accesses are not guaranteed to be atomic and
it's up to the implementation of "external memory subsystems" (that
probably means chipsets, motherboards, NUMA archi...

Hi Laurynas,

On Wed, 11 Sep 2013 15:29:33 -0000, Laurynas Biveinis wrote:
>>>>     - os_atomic_load_ulint() / os_atomic_store_ulint()... I don’t think we
>>>>       need that stuff. Their names are misleading as they don’t enforce
>>>>       any atomicity.
>>>
>>> The ops being load and store, their atomicity is enforced by the data type
>> width.
>>>
>>
>> Right, the atomicity is enforced by the data type width on those
>> architectures that provide it.
>
>
> I forgot to mention that they also have to be not misaligned so that one access does not translate to two accesses.
>

Yes, but alignment does not guarantee atomicity, see below.

mutex the only portable way to ensure atomicity. You can use atomic 
primitives provided by specific architectures, but then you either limit 
support for those architectures or yes, provide a mutex fallback.

Why didn't you quote it further?

"
Accesses to cacheable memory that are split across cache lines and page 
boundaries are not guaranteed to be atomic by <all processors>. <all 
processors> provide bus control signals that permit external memory 
subsystems to make split accesses atomic;
"

Which means even aligned accesses are not guaranteed to be atomic and 
it's up to the implementation of "external memory subsystems" (that 
probably means chipsets, motherboards, NUMA architectures and the like).

That's the answer to the "why?" question.

>
> My understanding of the above is that os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description, modulo alignment issues, if any. These are easy to ensure by ut_ad().
>

Modulo alignment, cache line boundary and page boundary issues.

I don't see how ut_ad() is going to help here. So a buf_pool_stat_t 
structure happens to be allocated in memory so that n_pages_written 
happens to be misaligned, or cross a cache line or a page boundary. How 
exactly ut_ad() is going to ensure that never happens at runtime?

>
>>>> They should be named os_ordered_load_ulint() /
>>>>       os_ordered_store_ulint(),
>>>
>>> That's an option, but I needed atomicity, visibility, and ordering, and
>> chose atomic for function name to match the existing CAS and atomic add
>> operations, which also need all three.
>>>
>>
>> I'm not sure you need all of those 3 in every occurrence of those
>> functions, but see below.
>
>
> That's right. And ordering probably is not needed anywhere, sorry about that, my understanding of atomics is far from fluent.  But visibility should be needed in every occurrence if this is ever ported to a non-cache-coherent architecture, or if some non-temporal store source code optimization is implemented on x86_64.
>

Grr.

- the "atomic" word is misused, because that code is not really atomic
- right, but I need atomicity + new_entity_1 + new_entity_2
- you don't need new_entity_1 and new_entity_2
- that's right, but new_entity_1 should be needed for new_entity_3 and 
new_entity_4

What do compiler barriers have to do with cache coherence?

>
>>>> but... what specific order are you trying
>>>>       to enforce with those constructs?
>>>
>>> In buf_read_recv_pages() and buf_read_ibuf_page().
>>>
>>>                while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
>>>                       > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
>>>                        os_thread_sleep(500000);
>>>                }
>>>
>>> it is necessary to have a load barrier (compiler only for x86/x86_64,
>> potentially a CPU barrier too elsewhere for visibility and ordering).
>>>

That doesn't answer my question: what specific ordering are you trying 
to enforce with those constructs?

>>
>> One needs compiler/CPU barriers when the code, after loading a value
>> from one location, expects values from other locations in a certain
>> state (or assumes certain cross-location visibility in other words).
>
>
> That is ordering, yes, but I don't think this is an exhaustive list of when the barriers is needed. Here a compiler barrier is needed to prevent the hoisting of the load above the loop, turning its body to an infinite loop.
>

The same problem in mutex_spin_wait(), for example, is addressed by 
ib_mutex_t::lock_word being defined with the volatile keyword.

However, no atomicity, visibility or ordering is enforced when reading 
that value in mutex_spin_wait(), because it doesn't need them.

We need atomicity for counters, so the volatile keyword is not 
sufficient. But it's not provided by any barriers either.

>
>> What assumptions about other value states are made in the above code
>> after reading buf_pool->n_pend_reads?
>
>
> Thus this question does not apply.
>

It does, because the answer is "none" and it shows that "volatile" would 
be a sufficient replacement for os_atomic_load/store if not the 
atomicity issues, which are not addressed by either approach.

>
>> It makes even less sense in other occurrences of os_atomic_load_ulint(),
>> e.g. in buf_get_total_stat(). I disassembled the code generated for
>> buf_get_total_stat() by GCC with compiler barriers both enabled and
>> disabled.
>>
>> With barriers disabled the code corresponding to
>> "tot_stat->n_pages_written +=
>> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
>>
>> # Load buf_stat->n_pages_written and add it to tot_stat value in
>> # a register
>> add    -0x30(%r13,%rdx,1),%r14
>> # store the computed value to the tot_stat field
>> mov    %r14,0x10(%rdi)
>>
>> With barriers enabled the same code is:
>>
>> # Load the current tot_stat field to a register
>> mov    0x10(%rdi),%rax
>> # Add the value of buf_stat->n_pages_written
>> add    -0x30(%r8,%rcx,1),%rax
>> # store the computed value back to the tot_stat field
>> mov    %rax,0x10(%rdi)
>
>
> I'm mildly curious how would this compare to a volatile n_pages_written too.
>

You are welcome to experiment yourself.
>
>> So what has been achieved by a compiler barrier here except the fact
>> that the compiler avoids storing buf_stat->n_pages_written to a register
>> and thus produces less efficient code?
>
>
> buf_stat->n_pages_written load from memory is forced to happen next to where the value is used. On non-cache-coherent architectures it would be additionally ensured by a correct os_atomic_load_ulint() implementation that the value loaded is the latest value written by any other thread.
>

No, in both cases the value is used to modify a register and then the 
register is stored to memory. So the correct answer to my question is 
"nothing".

>
>>> In buf_LRU_get_free_block() there is a loop, which can be executed by
>> multiple query threads concurrently, and it both loads and stores
>> buf_pool->try_LRU_scan in it, which requires immediate visibility to any
>> concurrent threads, again necessary to have the load and store barriers.
>>>

Er, and how exactly do compiler barriers implement immediate visibility 
to any concurrent threads?

That also needs atomicity, but that's not what barriers guarantee.

>>
>> OK, compiler barriers force the compiler to perform _multiple_
>> load/store ops in the desired order. And CPU barriers impose the same
>> reqs on CPU. Which means they only matter for cross-location visibility,
>> but have nothing to do with a single value visibility.
>
>
> By "multiple" above I believe you are again referring to ordering different memory location accesses. And, as above, compiler barriers are required for single location access too, and CPU barriers would be as well.
>
> Without a compiler barrier there is nothing that prevents the compiler to compiling buf_LRU_get_free_block to
>
> register = try_LRU_scan
> ...loop that sets and accesses try_LRU_scan through the register...
> try_LRU_scan = register
>

That's what the volatile keyword is for.

>
>> Aren't you reinventing volatile here?
>
>
> Volatile is not enough: on a non-cache coherent architecture the volatile would force a load/store to be performed, yes, but not necessarily would make a store visible elsewhere nor a load would necessarily be performed in a cache-coherent manner.
>

*sigh* and compiler barriers solve that problem by ...?

>
>> But taking the atomicity considerations into account, I'm fairly sure
>> it's much better to protect those values with a separate mutex and be
>> done with it.
>
>
> IMHO the jury is still out.
>

*shrug* if you have better ideas, I'd love to hear them.

>
>>>>     - I don’t see a point in maintaining multiple list nodes in buf_page_t
>>>>       (i.e. ‘free’, ‘flush_list’ and ‘zip_list’). As I understand, each
>>>>       page may only be in a single list at any point in time, so splitting
>>>>       the list node is purely cosmetic.
> ...
>>> These are my concerns too. I don't address them in the current MP as these
>> bits date all the way to the XtraDB 5.1. Thus I think it's best addressed in a
>> follow-up that is common with 5.5.
>>>
>>
>> I'm not sure it should be addressed separately. It fits rather well into
>> other bpsplit cleanups you have already done. As in, it's about removing
>> unnecessary code, not about adding new code/functionality on top of
>> bpsplit in 5.5.
>
>
> OK, I will do this and log a Wishlist bug for 5.5 to backport.
>

Thanks.

>
>> The cleanup might be done for 5.5 as well if we had unlimited resources.
>> But as my cleanups in AHI partitioning have shown, implementing,
>> merging, reviewing things like this separately is more expensive in
>> terms of man-hours.
>
>
> The AHI example is a surprise to me.  I see the additional time need for separate testing runs, but I personally always prefer to do common things across the versions commonly, and separate things separately.  This probably depends on a person.
>

The relevant code has diverged between 5.5 and 5.6 considerably. So 
implementing those cleanups in 5.5 and then merging them to 5.6 was far 
from trivial. That's one thing.

Another thing is that implementing those cleanups in 5.5 is less 
important than 5.6, so it could wait longer than 5.6.

Finally, 1 MP/review cycle is less man-hours than 1 + 2 extra MP/review 
cycles. It's that simple.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-12: Posted in a previous version of this proposal

Download full text (14.0 KiB)

Alexey -

> >> - (if we only care about x86/x86_64) make sure that values being
> >> loaded/stored do not cross cache lines or page boundaries. Which is of
> >> course impossible to guarantee in a generic function.
> >
> >
> > Why? We are talking about ulints here only, and I was not able to find such
> requirements in the x86_64 memory model descriptions. There is a requirement
> to be aligned, and misaligned stores/loads might indeed cross cache line or
> page boundaries, and anything that crosses them is indeed non-atomic. But
> alignment is possible to guarantee in a generic function (which doesn't even
> has to be generic: the x86_64 implementation is for x86_64 only, obviously).
> >
> > Intel® 64 and IA-32 Architectures Software Developer's Manual
> > Volume 3A: System Programming Guide, Part 1, section 8.1.1,
> http://download.intel.com/products/processor/manual/253668.pdf:
> >
> > "The Intel486 processor (and newer processors since) guarantees that the
> following basic memory operations will
> > always be carried out atomically:
> > (...)
> > • Reading or writing a doubleword aligned on a 32-bit boundary
> >
> > The Pentium processor (...):
> > • Reading or writing a quadword aligned on a 64-bit boundary
> > "
>
> Why didn't you quote it further?
>
> "
> Accesses to cacheable memory that are split across cache lines and page
> boundaries are not guaranteed to be atomic by <all processors>. <all
> processors> provide bus control signals that permit external memory
> subsystems to make split accesses atomic;
> "
>
> Which means even aligned accesses are not guaranteed to be atomic and
> it's up to the implementation of "external memory subsystems" (that
> probably means chipsets, motherboards, NUMA architectures and the like).

I didn't quote because we both have already acknowledged that cache line- or page boundary-crossing accesses are non-atomic, and because I don't see how it's relevant here. I don't see how a properly-aligned ulint can possibly cross a cache line boundary, when cache lines are 64-byte wide and 64-byte aligned. Or even 32 for older architectures.

> > My understanding of the above is that
> os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description,
> modulo alignment issues, if any. These are easy to ensure by ut_ad().
> >
>
> Modulo alignment, cache line boundary and page boundary issues.

Alignment only unless my reasoning above is wrong.

> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
> structure happens to be allocated in memory so that n_pages_written
> happens to be misaligned, or cross a cache line or a page boundary. How
> exactly ut_ad() is going to ensure that never happens at runtime?

A debug build would hit this assert and we'd fix the structure layout/allocation. Unless I'm mistaken, to get a misaligned ulint, we'd have to ask for this explicitly, by packing a struct, fetching a pointer to it from a byte array, etc. Thus ut_ad() seems reasonable to me.

> >>>> They should be named os_ordered_load_ulint() /
> >>>> os_ordered_store_ulint(),
> >>>
> >>> That's an option, but I needed atomicity, visibility, and ordering, and
> >> chose atomic for ...

Alexey -

> >> - (if we only care about x86/x86_64) make sure that values being
> >> loaded/stored do not cross cache lines or page boundaries. Which is of
> >> course impossible to guarantee in a generic function.
> >
> >
> > Why? We are talking about ulints here only, and I was not able to find such
> requirements in the x86_64 memory model descriptions. There is a requirement
> to be aligned, and misaligned stores/loads might indeed cross cache line or
> page boundaries, and anything that crosses them is indeed non-atomic. But
> alignment is possible to guarantee in a generic function (which doesn't even
> has to be generic: the x86_64 implementation is for x86_64 only, obviously).
> >
> > Intel® 64 and IA-32 Architectures Software Developer's Manual
> > Volume 3A: System Programming Guide, Part 1, section 8.1.1,
> http://download.intel.com/products/processor/manual/253668.pdf:
> >
> > "The Intel486 processor (and newer processors since) guarantees that the
> following basic memory operations will
> > always be carried out atomically:
> > (...)
> > • Reading or writing a doubleword aligned on a 32-bit boundary
> >
> > The Pentium processor (...):
> > • Reading or writing a quadword aligned on a 64-bit boundary
> > "
> 
> Why didn't you quote it further?
> 
> "
> Accesses to cacheable memory that are split across cache lines and page
> boundaries are not guaranteed to be atomic by <all processors>. <all
> processors> provide bus control signals that permit external memory
> subsystems to make split accesses atomic;
> "
> 
> Which means even aligned accesses are not guaranteed to be atomic and
> it's up to the implementation of "external memory subsystems" (that
> probably means chipsets, motherboards, NUMA architectures and the like).

I didn't quote because we both have already acknowledged that cache line- or page boundary-crossing accesses are non-atomic, and because I don't see how it's relevant here.  I don't see how a properly-aligned ulint can possibly cross a cache line boundary, when cache lines are 64-byte wide and 64-byte aligned. Or even 32 for older architectures.

> > My understanding of the above is that
> os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description,
> modulo alignment issues, if any. These are easy to ensure by ut_ad().
> >
> 
> Modulo alignment, cache line boundary and page boundary issues.

Alignment only unless my reasoning above is wrong.

A debug build would hit this assert and we'd fix the structure layout/allocation.  Unless I'm mistaken, to get a misaligned ulint, we'd have to ask for this explicitly, by packing a struct, fetching a pointer to it from a byte array, etc.  Thus ut_ad() seems reasonable to me.

> >>>> They should be named os_ordered_load_ulint() /
> >>>>       os_ordered_store_ulint(),
> >>>
> >>> That's an option, but I needed atomicity, visibility, and ordering, and
> >> chose atomic for function name to match the existing CAS and atomic add
> >> operations, which also need all three.
> >>>
> >>
> >> I'm not sure you need all of those 3 in every occurrence of those
> >> functions, but see below.
> >
> >
> > That's right. And ordering probably is not needed anywhere, sorry about
> that, my understanding of atomics is far from fluent.  But visibility should
> be needed in every occurrence if this is ever ported to a non-cache-coherent
> architecture, or if some non-temporal store source code optimization is
> implemented on x86_64.
> >
> 
> Grr.
> 
> - the "atomic" word is misused, because that code is not really atomic
> - right, but I need atomicity + new_entity_1 + new_entity_2
> - you don't need new_entity_1 and new_entity_2
> - that's right, but new_entity_1 should be needed for new_entity_3 and
> new_entity_4

I am getting confused by the term "ordering".  It appears to me, perhaps wrongly, that by ordering you refer to a ordering between different memory location accesses.  If yes, then I don't need ordering, but the primitives provide it, as there is no point in not providing it anyway.  But "ordering" can mean memory access ordering to a single memory location in a single thread.  In this case I need this ordering already.  I need atomicity and visibility without questions.

I don't understand your objection re. new_entity_3 and new_entity_4. My goal is to provide primitives that abstract away the underlying architecture.  That's a sane goal and atomic+visible+ordered load/store is a sane model. Any porting anywhere, or even x86_64 optimization is a matter of implementing the correct primitives, not checking the call sites because we suddenly lost some property.

> What do compiler barriers have to do with cache coherence?

You cannot add CPU barriers to achieve cache coherence without having compiler barriers there too.

> >>>> but... what specific order are you trying
> >>>>       to enforce with those constructs?
> >>>
> >>> In buf_read_recv_pages() and buf_read_ibuf_page().
> >>>
> >>>                while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
> >>>                       > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
> >>>                        os_thread_sleep(500000);
> >>>                }
> >>>
> >>> it is necessary to have a load barrier (compiler only for x86/x86_64,
> >> potentially a CPU barrier too elsewhere for visibility and ordering).
> >>>
> 
> That doesn't answer my question: what specific ordering are you trying
> to enforce with those constructs?

buf_pool->n_pend_reads load must happen after the previous loop iteration buf_pool->n_pend_reads load.

> >> One needs compiler/CPU barriers when the code, after loading a value
> >> from one location, expects values from other locations in a certain
> >> state (or assumes certain cross-location visibility in other words).
> >
> >
> > That is ordering, yes, but I don't think this is an exhaustive list of when
> the barriers is needed. Here a compiler barrier is needed to prevent the
> hoisting of the load above the loop, turning its body to an infinite loop.
> >
> 
> The same problem in mutex_spin_wait(), for example, is addressed by
> ib_mutex_t::lock_word being defined with the volatile keyword.

I don't think that's an example to follow, elaborated below.

> However, no atomicity, visibility or ordering is enforced when reading
> that value in mutex_spin_wait(), because it doesn't need them.

I don't think mutex_spin_wait() wants ever, regarding 1) atomicity: to observe a partial load or perform a store so that another thread observes a partial load; 2) visibility: to not see a mutex lock word write by another thread or make its own write not visible to another thread; 3) ordering: to optimize mutex lock word load out of the loop.  Unless I'm mistaken, all three are needed, and the fact that the current code has volatile instead is slightly troubling me.

In fact, mutex_exit_func() has the following comment:

/* A problem: we assume that mutex_reset_lock word
	is a memory barrier, that is when we read the waiters
	field next, the read must be serialized in memory
	after the reset. A speculative processor might
	perform the read first, which could leave a waiting
	thread hanging indefinitely.

Our current solution call every second
	sync_arr_wake_threads_if_sema_free()
	to wake up possible hanging threads if
	they are missed in mutex_signal_object. */

So, that's concurrent programming with volatile only and without barriers.  What I'm proposing is to acknowledge the memory model and use the correct primitives to operate on them.

> We need atomicity for counters, so the volatile keyword is not
> sufficient. But it's not provided by any barriers either.

Counter atomicity is ensured by atomic inc or atomic CAS operation.  This operation will include all the compiler/CPU barriers to ensure visibility and ordering too.  An atomic load or store is not any less atomic than inc/CAS even if there is no computation with intermediate results.

> >> What assumptions about other value states are made in the above code
> >> after reading buf_pool->n_pend_reads?
> >
> >
> > Thus this question does not apply.
> >
> 
> It does, because the answer is "none" and it shows that "volatile" would
> be a sufficient replacement for os_atomic_load/store if not the
> atomicity issues, which are not addressed by either approach.

It does not, because we don't need to have "other values" to have a reason to consider the memory model applies to this value only and how to access it.

> >> It makes even less sense in other occurrences of os_atomic_load_ulint(),
> >> e.g. in buf_get_total_stat(). I disassembled the code generated for
> >> buf_get_total_stat() by GCC with compiler barriers both enabled and
> >> disabled.
> >>
> >> With barriers disabled the code corresponding to
> >> "tot_stat->n_pages_written +=
> >> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
> >>
...
> >>
> >> With barriers enabled the same code is:
> >>
...
> >
> >
> > I'm mildly curious how would this compare to a volatile n_pages_written too.
> >
> 
> You are welcome to experiment yourself.

I didn't mean to imply to that you or I or anybody has to do it.

> >> So what has been achieved by a compiler barrier here except the fact
> >> that the compiler avoids storing buf_stat->n_pages_written to a register
> >> and thus produces less efficient code?
> >
> >
> > buf_stat->n_pages_written load from memory is forced to happen next to where
> the value is used. On non-cache-coherent architectures it would be
> additionally ensured by a correct os_atomic_load_ulint() implementation that
> the value loaded is the latest value written by any other thread.
> >
> 
> No, in both cases the value is used to modify a register and then the
> register is stored to memory. So the correct answer to my question is
> "nothing".

Your answer seems to ignore my bit "On non-cache-coherent ... ".  Well, it ignores the first sentence too, x86_64 assembly comparison will provide little for this discussion, because we don't need CPU barriers here.

> >>> In buf_LRU_get_free_block() there is a loop, which can be executed by
> >> multiple query threads concurrently, and it both loads and stores
> >> buf_pool->try_LRU_scan in it, which requires immediate visibility to any
> >> concurrent threads, again necessary to have the load and store barriers.
> >>>
> 
> Er, and how exactly do compiler barriers implement immediate visibility
> to any concurrent threads?

Compiler barrier forces a memory store to happen where it is.  A memory store will be immediately visible on x86_64 to everybody else.  A compiler barrier forces a memory load to happen too.  Which means, that a store by another thread is visible to the current thread.

> That also needs atomicity, but that's not what barriers guarantee.

I don't think I ever said that it is a _compiler barrier_ implements _atomicity_.  Have I?

> >> OK, compiler barriers force the compiler to perform _multiple_
> >> load/store ops in the desired order. And CPU barriers impose the same
> >> reqs on CPU. Which means they only matter for cross-location visibility,
> >> but have nothing to do with a single value visibility.
> >
> >
> > By "multiple" above I believe you are again referring to ordering different
> memory location accesses. And, as above, compiler barriers are required for
> single location access too, and CPU barriers would be as well.
> >
> > Without a compiler barrier there is nothing that prevents the compiler to
> compiling buf_LRU_get_free_block to
> >
> > register = try_LRU_scan
> > ...loop that sets and accesses try_LRU_scan through the register...
> > try_LRU_scan = register
> >
> 
> That's what the volatile keyword is for.

Volatile semantics are not very clear and they have little to do with correct concurrency.  The compilers emit code that happens to, partially or fully, do what we want when volatile is encountered by emitting memory loads and stores and not moving them around, but there is no guarantee that a store is fully retired before the next load, no visibility guarantee.  The correct multithreaded programming must happen with correct primitives that are aware of the memory model.

> >> Aren't you reinventing volatile here?
> >
> >
> > Volatile is not enough: on a non-cache coherent architecture the volatile
> would force a load/store to be performed, yes, but not necessarily would make
> a store visible elsewhere nor a load would necessarily be performed in a
> cache-coherent manner.
> >
> 
> *sigh* and compiler barriers solve that problem by ...?

Where did I say that _compiler_ barriers solve this problem on a _non_-cache coherent architecture?  CPU barriers are needed there, and compiler barriers are a prerequisite for them.

> >> But taking the atomicity considerations into account, I'm fairly sure
> >> it's much better to protect those values with a separate mutex and be
> >> done with it.
> >
> >
> > IMHO the jury is still out.
> >
> 
> *shrug* if you have better ideas, I'd love to hear them.

I still have the same idea but obviously I fail to explain it well.

> >> The cleanup might be done for 5.5 as well if we had unlimited resources.
> >> But as my cleanups in AHI partitioning have shown, implementing,
> >> merging, reviewing things like this separately is more expensive in
> >> terms of man-hours.
> >
> >
> > The AHI example is a surprise to me.  I see the additional time need for
> separate testing runs, but I personally always prefer to do common things
> across the versions commonly, and separate things separately.  This probably
> depends on a person.
> >
> 
> The relevant code has diverged between 5.5 and 5.6 considerably. So
> implementing those cleanups in 5.5 and then merging them to 5.6 was far
> from trivial. That's one thing.
> 
> Another thing is that implementing those cleanups in 5.5 is less
> important than 5.6, so it could wait longer than 5.6.
> 
> Finally, 1 MP/review cycle is less man-hours than 1 + 2 extra MP/review
> cycles. It's that simple.

Thanks for the explanations.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-12: Posted in a previous version of this proposal

buf_LRU_remove_all_pages() was ported incorrectly from 5.5, dropping the space id and I/O-fix re-checks after the block mutex acquisition. This has been caught by Roel as bug 1224282.

review: Needs Fixing

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-12: Posted in a previous version of this proposal

Download full text (17.1 KiB)

Hi Laurynas,

On Thu, 12 Sep 2013 06:16:52 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>> - (if we only care about x86/x86_64) make sure that values being
>>>> loaded/stored do not cross cache lines or page boundaries. Which is of
>>>> course impossible to guarantee in a generic function.
>>>
>>>
>>> Why? We are talking about ulints here only, and I was not able to find such
>> requirements in the x86_64 memory model descriptions. There is a requirement
>> to be aligned, and misaligned stores/loads might indeed cross cache line or
>> page boundaries, and anything that crosses them is indeed non-atomic. But
>> alignment is possible to guarantee in a generic function (which doesn't even
>> has to be generic: the x86_64 implementation is for x86_64 only, obviously).
>>>
>>> Intel® 64 and IA-32 Architectures Software Developer's Manual
>>> Volume 3A: System Programming Guide, Part 1, section 8.1.1,
>> http://download.intel.com/products/processor/manual/253668.pdf:
>>>
>>> "The Intel486 processor (and newer processors since) guarantees that the
>> following basic memory operations will
>>> always be carried out atomically:
>>> (...)
>>> • Reading or writing a doubleword aligned on a 32-bit boundary
>>>
>>> The Pentium processor (...):
>>> • Reading or writing a quadword aligned on a 64-bit boundary
>>> "
>>
>> Why didn't you quote it further?
>>
>> "
>> Accesses to cacheable memory that are split across cache lines and page
>> boundaries are not guaranteed to be atomic by <all processors>. <all
>> processors> provide bus control signals that permit external memory
>> subsystems to make split accesses atomic;
>> "
>>
>> Which means even aligned accesses are not guaranteed to be atomic and
>> it's up to the implementation of "external memory subsystems" (that
>> probably means chipsets, motherboards, NUMA architectures and the like).
>
>
> I didn't quote because we both have already acknowledged that cache line- or page boundary-crossing accesses are non-atomic, and because I don't see how it's relevant here. I don't see how a properly-aligned ulint can possibly cross a cache line boundary, when cache lines are 64-byte wide and 64-byte aligned. Or even 32 for older architectures.
>

The array of buffer pool descriptors is allocated as follows:

buf_pool_ptr = (buf_pool_t*) mem_zalloc(
n_instances * sizeof *buf_pool_ptr);

so individual buf_pool_t instances are not guaranteed to have any
specific alignment, neither to cache line nor to page boundaries, right?

Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into
buf_pool_t so nothing prevents it from crossing a cache line or a page
boundary?

Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again,
nothing prevents them from crossing a cache line or a page boundary, right?

>
>>> My understanding of the above is that
>> os_atomic_load_ulint()/os_atomic_store_ulint() fit the above description,
>> modulo alignment issues, if any. These are easy to ensure by ut_ad().
>>>
>>
>> Modulo alignment, cache line boundary and page boundary issues.
>
>
> Alignment only unless my reasoning above is wrong.
>

Yes.

>
>> I don't see how ut_ad() is going to help here. So a ...

Hi Laurynas,

On Thu, 12 Sep 2013 06:16:52 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>> - (if we only care about x86/x86_64) make sure that values being
>>>> loaded/stored do not cross cache lines or page boundaries. Which is of
>>>> course impossible to guarantee in a generic function.
>>>
>>>
>>> Why? We are talking about ulints here only, and I was not able to find such
>> requirements in the x86_64 memory model descriptions. There is a requirement
>> to be aligned, and misaligned stores/loads might indeed cross cache line or
>> page boundaries, and anything that crosses them is indeed non-atomic. But
>> alignment is possible to guarantee in a generic function (which doesn't even
>> has to be generic: the x86_64 implementation is for x86_64 only, obviously).
>>>
>>> Intel® 64 and IA-32 Architectures Software Developer's Manual
>>> Volume 3A: System Programming Guide, Part 1, section 8.1.1,
>> http://download.intel.com/products/processor/manual/253668.pdf:
>>>
>>> "The Intel486 processor (and newer processors since) guarantees that the
>> following basic memory operations will
>>> always be carried out atomically:
>>> (...)
>>> • Reading or writing a doubleword aligned on a 32-bit boundary
>>>
>>> The Pentium processor (...):
>>> • Reading or writing a quadword aligned on a 64-bit boundary
>>> "
>>
>> Why didn't you quote it further?
>>
>> "
>> Accesses to cacheable memory that are split across cache lines and page
>> boundaries are not guaranteed to be atomic by <all processors>. <all
>> processors> provide bus control signals that permit external memory
>> subsystems to make split accesses atomic;
>> "
>>
>> Which means even aligned accesses are not guaranteed to be atomic and
>> it's up to the implementation of "external memory subsystems" (that
>> probably means chipsets, motherboards, NUMA architectures and the like).
>
>
> I didn't quote because we both have already acknowledged that cache line- or page boundary-crossing accesses are non-atomic, and because I don't see how it's relevant here.  I don't see how a properly-aligned ulint can possibly cross a cache line boundary, when cache lines are 64-byte wide and 64-byte aligned. Or even 32 for older architectures.
>

The array of buffer pool descriptors is allocated as follows:

buf_pool_ptr = (buf_pool_t*) mem_zalloc(
		n_instances * sizeof *buf_pool_ptr);

so individual buf_pool_t instances are not guaranteed to have any 
specific alignment, neither to cache line nor to page boundaries, right?

Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into 
buf_pool_t so nothing prevents it from crossing a cache line or a page 
boundary?

Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again, 
nothing prevents them from crossing a cache line or a page boundary, right?

Yes.

>
>> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
>> structure happens to be allocated in memory so that n_pages_written
>> happens to be misaligned, or cross a cache line or a page boundary. How
>> exactly ut_ad() is going to ensure that never happens at runtime?
>
>
> A debug build would hit this assert and we'd fix the structure layout/allocation.  Unless I'm mistaken, to get a misaligned ulint, we'd have to ask for this explicitly, by packing a struct, fetching a pointer to it from a byte array, etc.  Thus ut_ad() seems reasonable to me.
>

The only thing you can assume about dynamically allocated objects is 
that their addresses (and thus, the first member of a structure, if an 
object is a structure) is aligned to machine word size. Which is always 
lower than the cache line size. There are no guarantees on alignment of 
other structure members, no matter what compiler hints were used (those 
only matter for statically allocated objects).

>
>>>>>> They should be named os_ordered_load_ulint() /
>>>>>>        os_ordered_store_ulint(),
>>>>>
>>>>> That's an option, but I needed atomicity, visibility, and ordering, and
>>>> chose atomic for function name to match the existing CAS and atomic add
>>>> operations, which also need all three.
>>>>>
>>>>
>>>> I'm not sure you need all of those 3 in every occurrence of those
>>>> functions, but see below.
>>>
>>>
>>> That's right. And ordering probably is not needed anywhere, sorry about
>> that, my understanding of atomics is far from fluent.  But visibility should
>> be needed in every occurrence if this is ever ported to a non-cache-coherent
>> architecture, or if some non-temporal store source code optimization is
>> implemented on x86_64.
>>>
>>
>> Grr.
>>
>> - the "atomic" word is misused, because that code is not really atomic
>> - right, but I need atomicity + new_entity_1 + new_entity_2
>> - you don't need new_entity_1 and new_entity_2
>> - that's right, but new_entity_1 should be needed for new_entity_3 and
>> new_entity_4
>
>
> I am getting confused by the term "ordering".  It appears to me, perhaps wrongly, that by ordering you refer to a ordering between different memory location accesses.  If yes, then I don't need ordering, but the primitives provide it, as there is no point in not providing it anyway.  But "ordering" can mean memory access ordering to a single memory location in a single thread.  In this case I need this ordering already.  I need atomicity and visibility without questions.
>

You have a set of one element: {x}. Please describe what actions must be 
taken to "order" it.

> I don't understand your objection re. new_entity_3 and new_entity_4. My goal is to provide primitives that abstract away the underlying architecture.  That's a sane goal and atomic+visible+ordered load/store is a sane model. Any porting anywhere, or even x86_64 optimization is a matter of implementing the correct primitives, not checking the call sites because we suddenly lost some property.
>

You don't abstract anything in terms of atomicity and visibility. But 
putting "atomic" in the function name. That's the problem.

>
>> What do compiler barriers have to do with cache coherence?
>
>
> You cannot add CPU barriers to achieve cache coherence without having compiler barriers there too.
>

That doesn't answer my question. I cannot add CPU barriers to achieve 
cache coherence, not matter what code is generated by the compiler. 
Cache coherence is a property of the hardware.

>
>>>>>> but... what specific order are you trying
>>>>>>        to enforce with those constructs?
>>>>>
>>>>> In buf_read_recv_pages() and buf_read_ibuf_page().
>>>>>
>>>>>                 while (os_atomic_load_ulint(&buf_pool->n_pend_reads)
>>>>>                        > buf_pool->curr_size / BUF_READ_AHEAD_PEND_LIMIT) {
>>>>>                         os_thread_sleep(500000);
>>>>>                 }
>>>>>
>>>>> it is necessary to have a load barrier (compiler only for x86/x86_64,
>>>> potentially a CPU barrier too elsewhere for visibility and ordering).
>>>>>
>>
>> That doesn't answer my question: what specific ordering are you trying
>> to enforce with those constructs?
>
>
> buf_pool->n_pend_reads load must happen after the previous loop iteration buf_pool->n_pend_reads load.
>

So it might happen that buf_pool->n_pend_reads load is performed 
_before_ the previous loop iteration buf_pool->n_pend_reads load. See, 
"order" is a before/after relation.

What you want to do here is to prevent compiler from optimizing the load 
away (so that no loads happen in loop iterations). That's what 
"volatile" is for.

>
>>>> One needs compiler/CPU barriers when the code, after loading a value
>>>> from one location, expects values from other locations in a certain
>>>> state (or assumes certain cross-location visibility in other words).
>>>
>>>
>>> That is ordering, yes, but I don't think this is an exhaustive list of when
>> the barriers is needed. Here a compiler barrier is needed to prevent the
>> hoisting of the load above the loop, turning its body to an infinite loop.
>>>
>>
>> The same problem in mutex_spin_wait(), for example, is addressed by
>> ib_mutex_t::lock_word being defined with the volatile keyword.
>
>
> I don't think that's an example to follow, elaborated below.
>
>
>> However, no atomicity, visibility or ordering is enforced when reading
>> that value in mutex_spin_wait(), because it doesn't need them.
>
>
> I don't think mutex_spin_wait() wants ever, regarding 1) atomicity: to observe a partial load or perform a store so that another thread observes a partial load; 2) visibility: to not see a mutex lock word write by another thread or make its own write not visible to another thread; 3) ordering: to optimize mutex lock word load out of the loop.  Unless I'm mistaken, all three are needed, and the fact that the current code has volatile instead is slightly troubling me.
>
> In fact, mutex_exit_func() has the following comment:
>
> 	/* A problem: we assume that mutex_reset_lock word
> 	is a memory barrier, that is when we read the waiters
> 	field next, the read must be serialized in memory
> 	after the reset. A speculative processor might
> 	perform the read first, which could leave a waiting
> 	thread hanging indefinitely.
>
> 	Our current solution call every second
> 	sync_arr_wake_threads_if_sema_free()
> 	to wake up possible hanging threads if
> 	they are missed in mutex_signal_object. */
>
> So, that's concurrent programming with volatile only and without barriers.  What I'm proposing is to acknowledge the memory model and use the correct primitives to operate on them.
>

The comment refers to a completely different problem: that code assumes 
'waiters' to be in a certain state after 'lock_word' changes. That's 
cross-location ordering, and that's where barriers would come handy.

I was referring to this specific piece of code in mutex_spin_wait():

while (mutex_get_lock_word(mutex) != 0 && i < SYNC_SPIN_ROUNDS) {
		if (srv_spin_wait_delay) {
			ut_delay(ut_rnd_interval(0, srv_spin_wait_delay));
		}

i++;
	}

Which only works correctly (i.e. is not optimized into an infinite 
loop), because 'lock_word' is volatile. That's the problem we were 
discussing with respect to buf_pool->n_pend_reads in 
buf_read_recv_pages(), right?

>
>> We need atomicity for counters, so the volatile keyword is not
>> sufficient. But it's not provided by any barriers either.
>
>
> Counter atomicity is ensured by atomic inc or atomic CAS operation.  This operation will include all the compiler/CPU barriers to ensure visibility and ordering too.  An atomic load or store is not any less atomic than inc/CAS even if there is no computation with intermediate results.
>

*sigh* that's a different kind of atomicity. atomic inc ensures that a 
value will be loaded, modified and stored back atomically. It doesn't 
enforce any atomicity on other code reading that value. Neither does 
os_atomic_load_ulint().

>
>>>> What assumptions about other value states are made in the above code
>>>> after reading buf_pool->n_pend_reads?
>>>
>>>
>>> Thus this question does not apply.
>>>
>>
>> It does, because the answer is "none" and it shows that "volatile" would
>> be a sufficient replacement for os_atomic_load/store if not the
>> atomicity issues, which are not addressed by either approach.
>
>
> It does not, because we don't need to have "other values" to have a reason to consider the memory model applies to this value only and how to access it.
>

That's just hand-waving.

>
>>>> It makes even less sense in other occurrences of os_atomic_load_ulint(),
>>>> e.g. in buf_get_total_stat(). I disassembled the code generated for
>>>> buf_get_total_stat() by GCC with compiler barriers both enabled and
>>>> disabled.
>>>>
>>>> With barriers disabled the code corresponding to
>>>> "tot_stat->n_pages_written +=
>>>> os_atomic_load_ulint(&buf_stat->n_pages_written)" is:
>>>>
> ...
>>>>
>>>> With barriers enabled the same code is:
>>>>
> ...
>>>
>>>
>>> I'm mildly curious how would this compare to a volatile n_pages_written too.
>>>
>>
>> You are welcome to experiment yourself.
>
>
> I didn't mean to imply to that you or I or anybody has to do it.
>

I know, I was just pointing out that you are welcome to experiment yourself.

>
>>>> So what has been achieved by a compiler barrier here except the fact
>>>> that the compiler avoids storing buf_stat->n_pages_written to a register
>>>> and thus produces less efficient code?
>>>
>>>
>>> buf_stat->n_pages_written load from memory is forced to happen next to where
>> the value is used. On non-cache-coherent architectures it would be
>> additionally ensured by a correct os_atomic_load_ulint() implementation that
>> the value loaded is the latest value written by any other thread.
>>>
>>
>> No, in both cases the value is used to modify a register and then the
>> register is stored to memory. So the correct answer to my question is
>> "nothing".
>
>
> Your answer seems to ignore my bit "On non-cache-coherent ... ".  Well, it ignores the first sentence too, x86_64 assembly comparison will provide little for this discussion, because we don't need CPU barriers here.
>

I'm asking simple questions which require single-word answers. Instead, 
I see abstract hand-waving.

You saw the code with compiler barriers and without them. The question 
is: "what difference is there in terms of program state?". You may 
describe the difference either for cache coherent architectures or 
non-cache-coherent architectures, as you like.

>
>>>>> In buf_LRU_get_free_block() there is a loop, which can be executed by
>>>> multiple query threads concurrently, and it both loads and stores
>>>> buf_pool->try_LRU_scan in it, which requires immediate visibility to any
>>>> concurrent threads, again necessary to have the load and store barriers.
>>>>>
>>
>> Er, and how exactly do compiler barriers implement immediate visibility
>> to any concurrent threads?
>
>
> Compiler barrier forces a memory store to happen where it is.  A memory store will be immediately visible on x86_64 to everybody else.  A compiler barrier forces a memory load to happen too.  Which means, that a store by another thread is visible to the current thread.
>

No, a compiler barrier does not force a memory store to happen where it 
is. It forces the compiler to generate load/store instructions in 
specific order and to avoid caching them in registers. That does not in 
itself guarantee any visibility.

>
>> That also needs atomicity, but that's not what barriers guarantee.
>
>
> I don't think I ever said that it is a _compiler barrier_ implements _atomicity_.  Have I?
>

os_atomic_load_ulint() has "atomic" in its name, but the only thing it 
does is injecting a compiler barrier?

>
>>>> OK, compiler barriers force the compiler to perform _multiple_
>>>> load/store ops in the desired order. And CPU barriers impose the same
>>>> reqs on CPU. Which means they only matter for cross-location visibility,
>>>> but have nothing to do with a single value visibility.
>>>
>>>
>>> By "multiple" above I believe you are again referring to ordering different
>> memory location accesses. And, as above, compiler barriers are required for
>> single location access too, and CPU barriers would be as well.
>>>
>>> Without a compiler barrier there is nothing that prevents the compiler to
>> compiling buf_LRU_get_free_block to
>>>
>>> register = try_LRU_scan
>>> ...loop that sets and accesses try_LRU_scan through the register...
>>> try_LRU_scan = register
>>>
>>
>> That's what the volatile keyword is for.
>
>
> Volatile semantics are not very clear and they have little to do with correct concurrency.  The compilers emit code that happens to, partially or fully, do what we want when volatile is encountered by emitting memory loads and stores and not moving them around, but there is no guarantee that a store is fully retired before the next load, no visibility guarantee.  The correct multithreaded programming must happen with correct primitives that are aware of the memory model.
>

And os_atomic_load_ulint() / os_atomic_store_ulint() address that 
problem by ...?

>
>>>> Aren't you reinventing volatile here?
>>>
>>>
>>> Volatile is not enough: on a non-cache coherent architecture the volatile
>> would force a load/store to be performed, yes, but not necessarily would make
>> a store visible elsewhere nor a load would necessarily be performed in a
>> cache-coherent manner.
>>>
>>
>> *sigh* and compiler barriers solve that problem by ...?
>
>
> Where did I say that _compiler_ barriers solve this problem on a _non_-cache coherent architecture?  CPU barriers are needed there, and compiler barriers are a prerequisite for them.
>

You've been referring to _non_-cache coherent architectures whenever I 
asked you simple questions on compiler barriers. And no, CPU barriers do 
not help with non-cache coherent architectures. Google it.

>
>>>> But taking the atomicity considerations into account, I'm fairly sure
>>>> it's much better to protect those values with a separate mutex and be
>>>> done with it.
>>>
>>>
>>> IMHO the jury is still out.
>>>
>>
>> *shrug* if you have better ideas, I'd love to hear them.
>
>
> I still have the same idea but obviously I fail to explain it well.
>

I think you misunderstand your idea in the first place.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-12: Posted in a previous version of this proposal

Repushed branch with the 1st partial review comments. Not a resubmission due to partial review and ongoing discussions.

    Changes from the 1st MP.
    - Simplified btr_blob_free().
    - Added a note about mutexes to the header comment of
      buf_buddy_relocate().
    - Removed redundant mutex == NULL checks and mutex own assertions
      from buf_buddy_relocate().
    - Fixed locking notes in buf_LRU_free_page() header comment.
    - Removed a memory leak in one of the early exits in
      buf_LRU_free_page().
    - Clarified locking in a comment for buf_page_t::zip.
    - Added debug build checks to os_atomic_load_ulint() and
      os_atomic_store_ulint() x86_64 implementation that the accessed
      variable is properly aligned.
    - Added dropped by mistake buffer page space id and I/O fix 2nd
      checks after the buf_page_get_mutex() has been locked in
      buf_LRU_remove_all_pages().

Please ignore the "Added debug build checks to os_atomic_load_ulint() and os_atomic_store_ulint() x86_64 implementation that the accessed variable is properly aligned." bit for now.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-12: Posted in a previous version of this proposal

A Jenkins run of the latest branch turned up bug 1224432. Logged a separate bug because I am not sure I'll manage to debug it during this MP cycle.

"...
2013-09-12 12:15:38 7ff450082700 InnoDB: Assertion failure in thread 140687291459328 in file buf0buf.cc line 3694
InnoDB: Failing assertion: buf_fix_count > 0
...

Which appears to be a race condition on buf_fix_count on a page that is a sentinel for buffer pool watch. How exactly this can happen is not clear to me currently. All the watch sentinel buf_fix_count changes happen under zip_mutex and a corresponding page_hash X lock. Further, buf_page_init_for_read() asserts buf_fix_count > 0 through buf_pool_watch_is_sentinel() at line 3647, thus this should have changed between 3647 and 3694, but the hash is X-latched throughout, even though the zip mutex is only acquired at 3670."

review: Needs Fixing

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-12: Posted in a previous version of this proposal

Download full text (19.0 KiB)

Alexey -

> I don't see how a properly-aligned ulint can possibly
> cross a cache line boundary, when cache lines are 64-byte wide and 64-byte
> aligned. Or even 32 for older architectures.
> >
>
> The array of buffer pool descriptors is allocated as follows:

You are right, I failed to consider the base addresses returned by dynamic memory allocation. I also failed to notice your hint to that direction in one of the previous mails.

> buf_pool_ptr = (buf_pool_t*) mem_zalloc(
> n_instances * sizeof *buf_pool_ptr);
>
> so individual buf_pool_t instances are not guaranteed to have any
> specific alignment, neither to cache line nor to page boundaries, right?

Right.

> Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into
> buf_pool_t so nothing prevents it from crossing a cache line or a page
> boundary?

Right.

> Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again,
> nothing prevents them from crossing a cache line or a page boundary, right?

Right, nothing prevents an object of buf_pool_stat_t from crossing it. But that's OK. We only need the individual fields not to cross it.

> >> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
> >> structure happens to be allocated in memory so that n_pages_written
> >> happens to be misaligned, or cross a cache line or a page boundary. How
> >> exactly ut_ad() is going to ensure that never happens at runtime?
> >
> >
> > A debug build would hit this assert and we'd fix the structure
> layout/allocation. Unless I'm mistaken, to get a misaligned ulint, we'd have
> to ask for this explicitly, by packing a struct, fetching a pointer to it from
> a byte array, etc. Thus ut_ad() seems reasonable to me.
> >
>
> The only thing you can assume about dynamically allocated objects is
> that their addresses (and thus, the first member of a structure, if an
> object is a structure) is aligned to machine word size. Which is always
> lower than the cache line size. There are no guarantees on alignment of
> other structure members, no matter what compiler hints were used (those
> only matter for statically allocated objects).

Right. So, to conclude... no individual ulint is going to cross a cache line or a page boundary and we are good? We start with a machine-word aligned address returned from heap, and add a multiple of machine-word width to arrive at the address of an individual field, which is machine-word aligned and thus the individual field cannot cross anything?

Alexey -

> I don't see how a properly-aligned ulint can possibly
> cross a cache line boundary, when cache lines are 64-byte wide and 64-byte
> aligned. Or even 32 for older architectures.
> >
> 
> The array of buffer pool descriptors is allocated as follows:

You are right, I failed to consider the base addresses returned by dynamic memory allocation.  I also failed to notice your hint to that direction in one of the previous mails.

>         buf_pool_ptr = (buf_pool_t*) mem_zalloc(
>                 n_instances * sizeof *buf_pool_ptr);
> 
> so individual buf_pool_t instances are not guaranteed to have any
> specific alignment, neither to cache line nor to page boundaries, right?

Right.

> Now, the 'stat' member of buf_pool_t has the offset of 736 bytes into
> buf_pool_t so nothing prevents it from crossing a cache line or a page
> boundary?

Right.

> Now, offsets of the buf_pool_stat_t members vary from 0 to 88. Again,
> nothing prevents them from crossing a cache line or a page boundary, right?

Right, nothing prevents an object of buf_pool_stat_t from crossing it. But that's OK. We only need the individual fields not to cross it.

> >> I don't see how ut_ad() is going to help here. So a buf_pool_stat_t
> >> structure happens to be allocated in memory so that n_pages_written
> >> happens to be misaligned, or cross a cache line or a page boundary. How
> >> exactly ut_ad() is going to ensure that never happens at runtime?
> >
> >
> > A debug build would hit this assert and we'd fix the structure
> layout/allocation.  Unless I'm mistaken, to get a misaligned ulint, we'd have
> to ask for this explicitly, by packing a struct, fetching a pointer to it from
> a byte array, etc.  Thus ut_ad() seems reasonable to me.
> >
> 
> The only thing you can assume about dynamically allocated objects is
> that their addresses (and thus, the first member of a structure, if an
> object is a structure) is aligned to machine word size. Which is always
> lower than the cache line size. There are no guarantees on alignment of
> other structure members, no matter what compiler hints were used (those
> only matter for statically allocated objects).

Right.  So, to conclude... no individual ulint is going to cross a cache line or a page boundary and we are good? We start with a machine-word aligned address returned from heap, and add a multiple of machine-word width to arrive at the address of an individual field, which is machine-word aligned and thus the individual field cannot cross anything?

> >>>>>> They should be named os_ordered_load_ulint() /
> >>>>>>        os_ordered_store_ulint(),
> >>>>>
> >>>>> That's an option, but I needed atomicity, visibility, and ordering, and
> >>>> chose atomic for function name to match the existing CAS and atomic add
> >>>> operations, which also need all three.
> >>>>>
> >>>>
> >>>> I'm not sure you need all of those 3 in every occurrence of those
> >>>> functions, but see below.
> >>>
> >>>
> >>> That's right. And ordering probably is not needed anywhere, sorry about
> >> that, my understanding of atomics is far from fluent.  But visibility
> should
> >> be needed in every occurrence if this is ever ported to a non-cache-
> coherent
> >> architecture, or if some non-temporal store source code optimization is
> >> implemented on x86_64.
> >>>
> >>
> >> Grr.
> >>
> >> - the "atomic" word is misused, because that code is not really atomic
> >> - right, but I need atomicity + new_entity_1 + new_entity_2
> >> - you don't need new_entity_1 and new_entity_2
> >> - that's right, but new_entity_1 should be needed for new_entity_3 and
> >> new_entity_4
> >
> >
> > I am getting confused by the term "ordering".  It appears to me, perhaps
> wrongly, that by ordering you refer to a ordering between different memory
> location accesses.  If yes, then I don't need ordering, but the primitives
> provide it, as there is no point in not providing it anyway.  But "ordering"
> can mean memory access ordering to a single memory location in a single
> thread.  In this case I need this ordering already.  I need atomicity and
> visibility without questions.
> >
> 
> You have a set of one element: {x}. Please describe what actions must be
> taken to "order" it.

It's probably ordered already.  But I cannot understand your question.  Can you please tell me what is your working definition of memory access ordering?  Is "{x}" an theoretical example of a set or do you arrive at it from the "single memory location in a single thread" above?  Because in the latter case it is still plural accesses that have to be ordered.  I'm sure it's my fault that my basic understanding differs from yours, thus the sooner I know yours, the sooner I start making sense and agreeing.

> > I don't understand your objection re. new_entity_3 and new_entity_4. My goal
> is to provide primitives that abstract away the underlying architecture.
> That's a sane goal and atomic+visible+ordered load/store is a sane model. Any
> porting anywhere, or even x86_64 optimization is a matter of implementing the
> correct primitives, not checking the call sites because we suddenly lost some
> property.
> >
> 
> You don't abstract anything in terms of atomicity and visibility. But
> putting "atomic" in the function name. That's the problem.

I can rename.  I also don't understand what's not atomic and not visible about them.

> >> What do compiler barriers have to do with cache coherence?
> >
> >
> > You cannot add CPU barriers to achieve cache coherence without having
> compiler barriers there too.
> >
> 
> That doesn't answer my question. I cannot add CPU barriers to achieve
> cache coherence, not matter what code is generated by the compiler.
> Cache coherence is a property of the hardware.

You're right, it's a hardware property.  The hardware also provides the necessary instructions to work with it.  For example, if you do a non-temporal store in x86_64, visibility is relaxed and you need CPU barriers to guarantee visibility where they are placed.  So a CPU barrier achieves cache coherence.

> > buf_pool->n_pend_reads load must happen after the previous loop iteration
> buf_pool->n_pend_reads load.
> >
> 
> So it might happen that buf_pool->n_pend_reads load is performed
> _before_ the previous loop iteration buf_pool->n_pend_reads load. See,
> "order" is a before/after relation.
> 
> What you want to do here is to prevent compiler from optimizing the load
> away (so that no loads happen in loop iterations).

You're right.  And my way appears to implement it correctly, unless I'm missing something.

> That's what
> "volatile" is for.

Volatile will prevent the load from optimizing away.  Volatile inhibits compiler optimizations.  It has nothing to do with a memory model besides having an effect that inhibited optimizations is what helps us to not to move memory accesses around.  It will not say, that e.g. now these memory accesses are visible or e.g. ordered with non-volatile memory accesses.  Why should we use a less correct mechanism with partial guarantees instead of a more correct one?  Now this comment is of course handwaving, but so is saying that volatile is for this.

> >>>> One needs compiler/CPU barriers when the code, after loading a value
> >>>> from one location, expects values from other locations in a certain
> >>>> state (or assumes certain cross-location visibility in other words).
> >>>
> >>>
> >>> That is ordering, yes, but I don't think this is an exhaustive list of
> when
> >> the barriers is needed. Here a compiler barrier is needed to prevent the
> >> hoisting of the load above the loop, turning its body to an infinite loop.
> >>>
> >>
> >> The same problem in mutex_spin_wait(), for example, is addressed by
> >> ib_mutex_t::lock_word being defined with the volatile keyword.
> >
> >
> > I don't think that's an example to follow, elaborated below.
> >
> >
> >> However, no atomicity, visibility or ordering is enforced when reading
> >> that value in mutex_spin_wait(), because it doesn't need them.
> >
> >
> > I don't think mutex_spin_wait() wants ever, regarding 1) atomicity: to
> observe a partial load or perform a store so that another thread observes a
> partial load; 2) visibility: to not see a mutex lock word write by another
> thread or make its own write not visible to another thread; 3) ordering: to
> optimize mutex lock word load out of the loop.  Unless I'm mistaken, all three
> are needed, and the fact that the current code has volatile instead is
> slightly troubling me.
> >

> > In fact, mutex_exit_func() has the following comment:
...
> The comment refers to a completely different problem...

Yes, so we both agree that a barrier would solve this here.

> I was referring to this specific piece of code in mutex_spin_wait():
> 
>         while (mutex_get_lock_word(mutex) != 0 && i < SYNC_SPIN_ROUNDS) {
>                 if (srv_spin_wait_delay) {
>                         ut_delay(ut_rnd_interval(0, srv_spin_wait_delay));
>                 }
> 
>                 i++;
>         }
>

My comments 1) 2) 3) in a paragraph above are referring to the same piece too.  Can you please explain how this code does not need atomicity, visibility, ordering as you say above?

> Which only works correctly (i.e. is not optimized into an infinite
> loop), because 'lock_word' is volatile.

Same comment as above re. volatile.

> That's the problem we were
> discussing with respect to buf_pool->n_pend_reads in
> buf_read_recv_pages(), right?

The problem we are discussing also depends on 1) 2) 3) above too.

I know that adding volatile will keep the loop, probably.  But we also know exactly how n_pend_reads should be accessed concurrently.  Then, we can perform these accesses in the code using the necessary primitives and be sure about atomicity, visibility, ordering.  As opposed to selectively disabling compiler optimizations.

> >> We need atomicity for counters, so the volatile keyword is not
> >> sufficient. But it's not provided by any barriers either.
> >
> >
> > Counter atomicity is ensured by atomic inc or atomic CAS operation.  This
> operation will include all the compiler/CPU barriers to ensure visibility and
> ordering too.  An atomic load or store is not any less atomic than inc/CAS
> even if there is no computation with intermediate results.
> >
> 
> *sigh* that's a different kind of atomicity.

How many kinds of atomicity are there?  The operation either has completed, either has not, without a possibility of observing any intermediate state.  This applies the same to CAS, INC, as well as to load and store.  Or am I missing something here?

> atomic inc ensures that a
> value will be loaded, modified and stored back atomically. It doesn't
> enforce any atomicity on other code reading that value.

No, why should it?

> Neither does
> os_atomic_load_ulint().

That depends on the alignment issue outcome.  Right now I believe it does.

> One needs compiler/CPU barriers when the code, after loading a value 
> from one location, expects values from other locations in a certain 
> state (or assumes certain cross-location visibility in other words).

> >>>> What assumptions about other value states are made in the above code
> >>>> after reading buf_pool->n_pend_reads?
> >>>
> >>>
> >>> Thus this question does not apply.
> >>>
> >>
> >> It does, because the answer is "none" and it shows that "volatile" would
> >> be a sufficient replacement for os_atomic_load/store if not the
> >> atomicity issues, which are not addressed by either approach.
> >
> >
> > It does not, because we don't need to have "other values" to have a reason
> to consider the memory model applies to this value only and how to access it.
> >
> 
> That's just hand-waving.

I apologize.  I'm trying hard not to, but it's hard and we seem to think from different models.

I added back the context ("One needs ... ") above to the question ("What assumptions").  I hope I not misedited it.  If I'm not mistaken, your question comes from the premise that barriers are only needed for ordering accesses to different memory locations.  Right?  If so, the answer to your question is indeed "none".  Now the problem is that my premise is that barriers are needed not only for that, and that's why my answer to this question "none, the memory barriers are used here not for cross-location ordering".

> >>>> So what has been achieved by a compiler barrier here except the fact
> >>>> that the compiler avoids storing buf_stat->n_pages_written to a register
> >>>> and thus produces less efficient code?
> >>>
> >>>
> >>> buf_stat->n_pages_written load from memory is forced to happen next to
> where
> >> the value is used. On non-cache-coherent architectures it would be
> >> additionally ensured by a correct os_atomic_load_ulint() implementation
> that
> >> the value loaded is the latest value written by any other thread.
> >>>
> >>
> >> No, in both cases the value is used to modify a register and then the
> >> register is stored to memory. So the correct answer to my question is
> >> "nothing".
> >
> >
> > Your answer seems to ignore my bit "On non-cache-coherent ... ".  Well, it
> ignores the first sentence too, x86_64 assembly comparison will provide little
> for this discussion, because we don't need CPU barriers here.
> >
> 
> I'm asking simple questions which require single-word answers. Instead,
> I see abstract hand-waving.
> 
> You saw the code with compiler barriers and without them. The question
> is: "what difference is there in terms of program state?". You may
> describe the difference either for cache coherent architectures or
> non-cache-coherent architectures, as you like.

I apologize again.  The problem is not that I'm trying to avoid a single-word answer which will collapse my reasoning.  I'm not.  The problem is that I'm failing to see how exactly the questions are relevant.

In the two assembly examples above the difference is immaterial.  The atomic op example will load the value right next to where it's needed, atomically, etc.  So the second example does what I want it to do and guarantees it's done.  The first example _happens_ to do the same thing too here, without guarantees.  Hence no significant differences and that's why I don't think this question provides any insights in the issue.

> >>>>> In buf_LRU_get_free_block() there is a loop, which can be executed by
> >>>> multiple query threads concurrently, and it both loads and stores
> >>>> buf_pool->try_LRU_scan in it, which requires immediate visibility to any
> >>>> concurrent threads, again necessary to have the load and store barriers.
> >>>>>
> >>
> >> Er, and how exactly do compiler barriers implement immediate visibility
> >> to any concurrent threads?
> >
> >
> > Compiler barrier forces a memory store to happen where it is.  A memory
> store will be immediately visible on x86_64 to everybody else.  A compiler
> barrier forces a memory load to happen too.  Which means, that a store by
> another thread is visible to the current thread.
> >
> 
> No, a compiler barrier does not force a memory store to happen where it
> is. It forces the compiler to generate load/store instructions in
> specific order and to avoid caching them in registers. That does not in
> itself guarantee any visibility.

Unless I'm misunderstanding you above, you're fully right except for the conclusion, where I believe that it in itself guarantees visibility just fine.  x86_64 memory regular loads and stores are guaranteed visible.

> >> That also needs atomicity, but that's not what barriers guarantee.
> >
> >
> > I don't think I ever said that it is a _compiler barrier_ implements
> _atomicity_.  Have I?
> >
> 
> os_atomic_load_ulint() has "atomic" in its name, but the only thing it
> does is injecting a compiler barrier?

It also has an "ulint" in its name and carries the alignment assumption.  Does that help?

> > Volatile semantics are not very clear and they have little to do with
> correct concurrency.  The compilers emit code that happens to, partially or
> fully, do what we want when volatile is encountered by emitting memory loads
> and stores and not moving them around, but there is no guarantee that a store
> is fully retired before the next load, no visibility guarantee.  The correct
> multithreaded programming must happen with correct primitives that are aware
> of the memory model.
> >
> 
> And os_atomic_load_ulint() / os_atomic_store_ulint() address that
> problem by ...?

By providing the primitives for atomic, visible, ordered access.  Would you agree that having such primitives, assuming any implementation, addresses this problem?  I think yes, since you said previously that mutex-protected access would work?  Or is it too brave an assumption on my part?

> >>>> Aren't you reinventing volatile here?
> >>>
> >>>
> >>> Volatile is not enough: on a non-cache coherent architecture the volatile
> >> would force a load/store to be performed, yes, but not necessarily would
> make
> >> a store visible elsewhere nor a load would necessarily be performed in a
> >> cache-coherent manner.
> >>>
> >>
> >> *sigh* and compiler barriers solve that problem by ...?
> >
> >
> > Where did I say that _compiler_ barriers solve this problem on a _non_-cache
> coherent architecture?  CPU barriers are needed there, and compiler barriers
> are a prerequisite for them.
> >
> 
> You've been referring to _non_-cache coherent architectures whenever I
> asked you simple questions on compiler barriers.

But you question above referred to the _non_ case, thus I must have misunderstood something.

> And no, CPU barriers do
> not help with non-cache coherent architectures. Google it.

I have provided the non-temporal store on x86_64 example above where CPU barrier helps, which I believe is a counterexample.  My googling trying to find a positive example failed.  Can you please point me to one?

> >>>> But taking the atomicity considerations into account, I'm fairly sure
> >>>> it's much better to protect those values with a separate mutex and be
> >>>> done with it.
> >>>
> >>>
> >>> IMHO the jury is still out.
> >>>
> >>
> >> *shrug* if you have better ideas, I'd love to hear them.
> >
> >
> > I still have the same idea but obviously I fail to explain it well.
> >
> 
> I think you misunderstand your idea in the first place.

That's probable.  I will drop the idea the moment I see I'm wrong, I'm not married to it.  But so far I believe I misunderstand your objections more than my idea.

If you allow me a detour.  I could drop it just to save us both time (probably significant by now), but honestly, I don't want to drop what still looks technically sound from my point of view.  For example, I'd have replaced the existing hacks in log0online to store/load LSNs atomically instead of simulating the store with "add zero, atomically, get the old value" op.  Further, there is this new counter framework in 5.6, get_sched_indexer_t, which is a per-CPU based counter array, and which is currently not used.  My hunch is that it's not used because it did not perform, and non-temporal stores is what is needed for it to perform.  And for that the database needs to have both the proper atomic ops framework and the working knowledge of it and memory model in the developers.  Our discussion shows that the latter is sorely missing and I believe it's critical to work this out.

Thanks.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-13: Posted in a previous version of this proposal

Hi Laurynas,

You are right that that os_atomic_{load,store}_ulint() will work as the
name suggests (i.e. be atomic) as long as:

1. it is used to access machine word sized members of structures
2. we are on x86/x86_64

However, the patch implements them as a generic primitives that do
nothing to enforce those restrictions, and that's why their names are
misleading. This is where this discussion has started, and it is a
design flaw. I don't see any arguments in this discussion that would
dispel those concerns. You also acknowledged them in one of the comments.

In contrast, other atomic primitives in existing code keep up to their
promise of being atomic, i.e. do not enforce any implicit requirements.
But they also have mutex-guarded fall back implementations for those
architectures that do not provide atomics.

I also agree that this discussion may be endless and time is precious.
So I think we should implement whatever we both agree does work. That
is: instead of implementing generic atomic primitives that are only
atomic under implicit requirements that are not enforceable at compile
time, it must either use separate mutex(es) to protect them, or use true
atomic primitives provided by the existing code if they are available on
the target architecture and fall back to mutex-guarded access. The
latter is how it is implemented in the rest of InnoDB.

Thanks.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-13: Posted in a previous version of this proposal

Alexey -

> Hi Laurynas,
>
> You are right that that os_atomic_{load,store}_ulint() will work as the
> name suggests (i.e. be atomic) as long as:
>
> 1. it is used to access machine word sized members of structures
> 2. we are on x86/x86_64

Right.

> However, the patch implements them as a generic primitives that do
> nothing to enforce those restrictions, and that's why their names are
> misleading.

1. is enforced through "ulint" in the name and args. ulint is commented in univ.i as "unsigned long integer which should be equal to the word size of the machine".
2. is enforced by platform #ifdefs not providing any other implementation except one for x86/x86_64 with GCC or a GCC-like compiler.

Thus I provide generic primitives, whose current implementations will work as designed. However the 1. above also seems to be missing "properly-aligned" and that's where the design is debatable. On one hand it is possible to implement misaligned access atomically by LOCK MOV, and document that the primitives may be used with args of any alignment. But a better alternative to me seems to accept that misaligned accesses are bugs and document/allow aligned accesses only. Even though that's enforceable in debug builds only, so that's not ideally perfect, but IMHO acceptable.

> This is where this discussion has started, and it is a
> design flaw. I don't see any arguments in this discussion that would
> dispel those concerns. You also acknowledged them in one of the comments.

Addressed above.

> I also agree that this discussion may be endless and time is precious.

But it also cannot end prematurely.

> So I think we should implement whatever we both agree does work.

I suggest that the above either is working already or requires some improvements re. alignment only.

> That
> is: instead of implementing generic atomic primitives that are only
> atomic under implicit requirements that are not enforceable at compile
> time, it must either use separate mutex(es) to protect them, or use true
> atomic primitives provided by the existing code if they are available on
> the target architecture

If you either show that how I address 1. and 2. above is incorrect, either show that the alignment issue is major and unsurmountable, then I'll implement load as inc by zero, return old value, and store as dirty read + CAS in a loop using the existing primitives.

Thanks,
Laurynas

Alexey -

> Hi Laurynas,
> 
> You are right that that os_atomic_{load,store}_ulint() will work as the
> name suggests (i.e. be atomic) as long as:
> 
> 1. it is used to access machine word sized members of structures
> 2. we are on x86/x86_64

Right.

> However, the patch implements them as a generic primitives that do
> nothing to enforce those restrictions, and that's why their names are
> misleading.

1. is enforced through "ulint" in the name and args.  ulint is commented in univ.i as "unsigned long integer which should be equal to the word size of the machine".
2. is enforced by platform #ifdefs not providing any other implementation except one for x86/x86_64 with GCC or a GCC-like compiler.

Thus I provide generic primitives, whose current implementations will work as designed.  However the 1. above also seems to be missing "properly-aligned" and that's where the design is debatable.  On one hand it is possible to implement misaligned access atomically by LOCK MOV, and document that the primitives may be used with args of any alignment.  But a better alternative to me seems to accept that misaligned accesses are bugs and document/allow aligned accesses only.  Even though that's enforceable in debug builds only, so that's not ideally perfect, but IMHO acceptable.

Addressed above.

> I also agree that this discussion may be endless and time is precious.

But it also cannot end prematurely.

> So I think we should implement whatever we both agree does work.

I suggest that the above either is working already or requires some improvements re. alignment only.

Thanks,
Laurynas

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-13: Posted in a previous version of this proposal

Hi Laurynas,

On Fri, 13 Sep 2013 07:40:10 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>> Hi Laurynas,
>>
>> You are right that that os_atomic_{load,store}_ulint() will work as the
>> name suggests (i.e. be atomic) as long as:
>>
>> 1. it is used to access machine word sized members of structures
>> 2. we are on x86/x86_64
>
>
> Right.
>
>
>> However, the patch implements them as a generic primitives that do
>> nothing to enforce those restrictions, and that's why their names are
>> misleading.
>
>
> 1. is enforced through "ulint" in the name and args. ulint is commented in univ.i as "unsigned long integer which should be equal to the word size of the machine".

It is not enforced, because nothing prevents me from passing a
misaligned address to those functions and expect them to be atomic as
the name implies.

For example, os_atomic_inc_ulint() is guaranteed to be atomic for any
arguments on any platform. But os_atomic_load_ulint() is not. That is
the problem.

> 2. is enforced by platform #ifdefs not providing any other implementation except one for x86/x86_64 with GCC or a GCC-like compiler.
>

That's correct. I only mentioned #2 for completeness.

> Thus I provide generic primitives, whose current implementations will work as designed. However the 1. above also seems to be missing "properly-aligned" and that's where the design is debatable. On one hand it is possible to implement misaligned access atomically by LOCK MOV, and document that the primitives may be used with args of any alignment. But a better alternative to me seems to accept that misaligned accesses are bugs and document/allow aligned accesses only. Even though that's enforceable in debug builds only, so that's not ideally perfect, but IMHO acceptable.
>

You don't.

>
> If you either show that how I address 1. and 2. above is incorrect, either show that the alignment issue is major and unsurmountable, then I'll implement load as inc by zero, return old value, and store as dirty read + CAS in a loop using the existing primitives.
>

Yes, please do.

Hi Laurynas,

On Fri, 13 Sep 2013 07:40:10 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>> Hi Laurynas,
>>
>> You are right that that os_atomic_{load,store}_ulint() will work as the
>> name suggests (i.e. be atomic) as long as:
>>
>> 1. it is used to access machine word sized members of structures
>> 2. we are on x86/x86_64
>
>
> Right.
>
>
>> However, the patch implements them as a generic primitives that do
>> nothing to enforce those restrictions, and that's why their names are
>> misleading.
>
>
> 1. is enforced through "ulint" in the name and args.  ulint is commented in univ.i as "unsigned long integer which should be equal to the word size of the machine".

It is not enforced, because nothing prevents me from passing a 
misaligned address to those functions and expect them to be atomic as 
the name implies.

For example, os_atomic_inc_ulint() is guaranteed to be atomic for any 
arguments on any platform. But os_atomic_load_ulint() is not. That is 
the problem.

> 2. is enforced by platform #ifdefs not providing any other implementation except one for x86/x86_64 with GCC or a GCC-like compiler.
>

That's correct. I only mentioned #2 for completeness.

> Thus I provide generic primitives, whose current implementations will work as designed.  However the 1. above also seems to be missing "properly-aligned" and that's where the design is debatable.  On one hand it is possible to implement misaligned access atomically by LOCK MOV, and document that the primitives may be used with args of any alignment.  But a better alternative to me seems to accept that misaligned accesses are bugs and document/allow aligned accesses only.  Even though that's enforceable in debug builds only, so that's not ideally perfect, but IMHO acceptable.
>

You don't.

Yes, please do.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-13: Posted in a previous version of this proposal

Alexey -

> >> You are right that that os_atomic_{load,store}_ulint() will work as the
> >> name suggests (i.e. be atomic) as long as:
> >>
> >> 1. it is used to access machine word sized members of structures
> >> 2. we are on x86/x86_64
> >
> >
> > Right.
> >
> >
> >> However, the patch implements them as a generic primitives that do
> >> nothing to enforce those restrictions, and that's why their names are
> >> misleading.
> >
> >
> > 1. is enforced through "ulint" in the name and args. ulint is commented in
> univ.i as "unsigned long integer which should be equal to the word size of the
> machine".
>
> It is not enforced, because nothing prevents me from passing a
> misaligned address to those functions and expect them to be atomic as
> the name implies.

This is exactly what I discussed below.

> For example, os_atomic_inc_ulint() is guaranteed to be atomic for any
> arguments on any platform. But os_atomic_load_ulint() is not. That is
> the problem.

Right. os_atomic_load_ulint() has additional restrictions on its arg over os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It is a performance bug in any case to perform misaligned atomic ops even with those ops that make it technically possible. I have added ut_ad()s to catch this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too, although that one looks like an overkill to me.

> > 2. is enforced by platform #ifdefs not providing any other implementation
> except one for x86/x86_64 with GCC or a GCC-like compiler.
> >
>
> That's correct. I only mentioned #2 for completeness.

OK, but I am not sure what does the #2 complete then.

> > Thus I provide generic primitives, whose current implementations will work
> as designed. However the 1. above also seems to be missing "properly-aligned"
> and that's where the design is debatable. On one hand it is possible to
> implement misaligned access atomically by LOCK MOV, and document that the
> primitives may be used with args of any alignment. But a better alternative
> to me seems to accept that misaligned accesses are bugs and document/allow
> aligned accesses only. Even though that's enforceable in debug builds only,
> so that's not ideally perfect, but IMHO acceptable.
> >
>
> You don't.

Will you reply to the rest of that paragraph too please? I am acknowledging that alignment is an issue, so let's see how to resolve it.

Alexey -

> >> You are right that that os_atomic_{load,store}_ulint() will work as the
> >> name suggests (i.e. be atomic) as long as:
> >>
> >> 1. it is used to access machine word sized members of structures
> >> 2. we are on x86/x86_64
> >
> >
> > Right.
> >
> >
> >> However, the patch implements them as a generic primitives that do
> >> nothing to enforce those restrictions, and that's why their names are
> >> misleading.
> >
> >
> > 1. is enforced through "ulint" in the name and args.  ulint is commented in
> univ.i as "unsigned long integer which should be equal to the word size of the
> machine".
> 
> It is not enforced, because nothing prevents me from passing a
> misaligned address to those functions and expect them to be atomic as
> the name implies.

This is exactly what I discussed below.

> For example, os_atomic_inc_ulint() is guaranteed to be atomic for any
> arguments on any platform. But os_atomic_load_ulint() is not. That is
> the problem.

Right. os_atomic_load_ulint() has additional restrictions on its arg over os_atomic_inc_ulint().  I believe that these restrictions are reasonable.  It is a performance bug in any case to perform misaligned atomic ops even with those ops that make it technically possible.  I have added ut_ad()s to catch this.  I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too, although that one looks like an overkill to me.

> > 2. is enforced by platform #ifdefs not providing any other implementation
> except one for x86/x86_64 with GCC or a GCC-like compiler.
> >
> 
> That's correct. I only mentioned #2 for completeness.

OK, but I am not sure what does the #2 complete then.

> > Thus I provide generic primitives, whose current implementations will work
> as designed.  However the 1. above also seems to be missing "properly-aligned"
> and that's where the design is debatable.  On one hand it is possible to
> implement misaligned access atomically by LOCK MOV, and document that the
> primitives may be used with args of any alignment.  But a better alternative
> to me seems to accept that misaligned accesses are bugs and document/allow
> aligned accesses only.  Even though that's enforceable in debug builds only,
> so that's not ideally perfect, but IMHO acceptable.
> >
> 
> You don't.

Will you reply to the rest of that paragraph too please?  I am acknowledging that alignment is an issue, so let's see how to resolve it.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-13: Posted in a previous version of this proposal

Hi Laurynas,

On Wed, 11 Sep 2013 15:29:33 -0000, Laurynas Biveinis wrote:
>>>> - the following hunk simply removes reference to buf_pool->mutex. As I
>>>> understand, it should just replace buf_pool->mutex with
>> zip_free_mutex?
>>>> + page_zip_des_t zip; /*!< compressed page; state
>>>> + == BUF_BLOCK_ZIP_PAGE and zip.data
>>>> + == NULL means an active
>>>
>>> Hm, it looked to me that it's protected not with zip_free_mutex but with
>> zip_mutex in its page mutex capacity. I will check.
>>>
>>
>> There was a place in the code that asserted zip_free_mutex locked when
>> bpage->zip.data is modified. But I'm not sure if that is correct.
>
>
> I have checked, I believe it's indeed zip_mutex. Re. zip_free_mutex, you must be referring to this bit in buf_buddy_relocate():
>
> mutex_enter(&buf_pool->zip_free_mutex);
>
> if (buf_page_can_relocate(bpage)) {
> ...
> bpage->zip.data = (page_zip_t*) dst;
> mutex_exit(mutex);
>
> buf_buddy_stat_t* buddy_stat = &buf_pool->buddy_stat[i];
> buddy_stat->relocated++;
> buddy_stat->relocated_usec += ut_time_us(NULL) - usec;
> ...
>
> Here zip_free_mutex happens to protect buddy_stat and pushing it down to if clause would require an else clause to appear that locks the same mutex.
>

No, I was referring to buf_pool_contains_zip(). It traverses the buffer
pool and examines (but not modifies) block->page.zip.data for each
block. However, the patch changes the assertion in
buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
rather than zip_mutex. In fact, in one of the code paths calling
buf_pool_contains_zip() we assert that zip_mutex is NOT locked. Don't we
have a bug here?

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-13: Posted in a previous version of this proposal

On Fri, 13 Sep 2013 11:10:36 -0000, Laurynas Biveinis wrote:
>
>
> Right. os_atomic_load_ulint() has additional restrictions on its arg over os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It is a performance bug in any case to perform misaligned atomic ops even with those ops that make it technically possible. I have added ut_ad()s to catch this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too, although that one looks like an overkill to me.
>

The same restrictions would apply even if os_atomic_load_ulint() didn't
exist, right? I.e. the same restrictions would apply if we simply
accessed those variables without any helper functions?

Let me ask you a few simple questions and this time around I demand
"yes/no" answers.

- Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
not do what they promise to do?

- Do you agree that naming them os_ordered_load_ulint() /
os_ordered_store_ulint() would better reflect what they do?

- Do you agree that naming them that way also makes it obvious that
using them in most places is simply unnecessary (e.g. in
buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(),
buf_get_n_pending_read_ios(), etc.)?

>
>>> 2. is enforced by platform #ifdefs not providing any other implementation
>> except one for x86/x86_64 with GCC or a GCC-like compiler.
>>>
>>
>> That's correct. I only mentioned #2 for completeness.
>
>
> OK, but I am not sure what does the #2 complete then.
>
>
>>> Thus I provide generic primitives, whose current implementations will work
>> as designed. However the 1. above also seems to be missing "properly-aligned"
>> and that's where the design is debatable. On one hand it is possible to
>> implement misaligned access atomically by LOCK MOV, and document that the
>> primitives may be used with args of any alignment. But a better alternative
>> to me seems to accept that misaligned accesses are bugs and document/allow
>> aligned accesses only. Even though that's enforceable in debug builds only,
>> so that's not ideally perfect, but IMHO acceptable.
>>>
>>
>> You don't.
>
>
> Will you reply to the rest of that paragraph too please? I am acknowledging that alignment is an issue, so let's see how to resolve it.
>

I don't think enforcing requirements in debug builds only is acceptable.
It must be a compile-time assertion, not a run-time one.

On Fri, 13 Sep 2013 11:10:36 -0000, Laurynas Biveinis wrote:
>
>
> Right. os_atomic_load_ulint() has additional restrictions on its arg over os_atomic_inc_ulint().  I believe that these restrictions are reasonable.  It is a performance bug in any case to perform misaligned atomic ops even with those ops that make it technically possible.  I have added ut_ad()s to catch this.  I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too, although that one looks like an overkill to me.
>

The same restrictions would apply even if os_atomic_load_ulint() didn't 
exist, right? I.e. the same restrictions would apply if we simply 
accessed those variables without any helper functions?

Let me ask you a few simple questions and this time around I demand 
"yes/no" answers.

- Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do 
not do what they promise to do?

- Do you agree that naming them os_ordered_load_ulint() / 
os_ordered_store_ulint() would better reflect what they do?

- Do you agree that naming them that way also makes it obvious that 
using them in most places is simply unnecessary (e.g. in 
buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(), 
buf_get_n_pending_read_ios(), etc.)?

>
>>> 2. is enforced by platform #ifdefs not providing any other implementation
>> except one for x86/x86_64 with GCC or a GCC-like compiler.
>>>
>>
>> That's correct. I only mentioned #2 for completeness.
>
>
> OK, but I am not sure what does the #2 complete then.
>
>
>>> Thus I provide generic primitives, whose current implementations will work
>> as designed.  However the 1. above also seems to be missing "properly-aligned"
>> and that's where the design is debatable.  On one hand it is possible to
>> implement misaligned access atomically by LOCK MOV, and document that the
>> primitives may be used with args of any alignment.  But a better alternative
>> to me seems to accept that misaligned accesses are bugs and document/allow
>> aligned accesses only.  Even though that's enforceable in debug builds only,
>> so that's not ideally perfect, but IMHO acceptable.
>>>
>>
>> You don't.
>
>
> Will you reply to the rest of that paragraph too please?  I am acknowledging that alignment is an issue, so let's see how to resolve it.
>

I don't think enforcing requirements in debug builds only is acceptable. 
It must be a compile-time assertion, not a run-time one.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-13: Posted in a previous version of this proposal

> > Right. os_atomic_load_ulint() has additional restrictions on its arg over
> os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It
> is a performance bug in any case to perform misaligned atomic ops even with
> those ops that make it technically possible. I have added ut_ad()s to catch
> this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too,
> although that one looks like an overkill to me.
> >
>
> The same restrictions would apply even if os_atomic_load_ulint() didn't
> exist, right? I.e. the same restrictions would apply if we simply
> accessed those variables without any helper functions?

What would be the desired access semantics in that case? If "anything goes", then no restrictions would apply.

> Let me ask you a few simple questions and this time around I demand
> "yes/no" answers.
>
> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> not do what they promise to do?

Yes.

> - Do you agree that naming them os_ordered_load_ulint() /
> os_ordered_store_ulint() would better reflect what they do?

No.

> - Do you agree that naming them that way also makes it obvious that
> using them in most places is simply unnecessary (e.g. in
> buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(),
> buf_get_n_pending_read_ios(), etc.)?

No.

The first answer would be No if not the alignment issue.

> >>> Thus I provide generic primitives, whose current implementations will work
> >> as designed. However the 1. above also seems to be missing "properly-
> aligned"
> >> and that's where the design is debatable. On one hand it is possible to
> >> implement misaligned access atomically by LOCK MOV, and document that the
> >> primitives may be used with args of any alignment. But a better
> alternative
> >> to me seems to accept that misaligned accesses are bugs and document/allow
> >> aligned accesses only. Even though that's enforceable in debug builds
> only,
> >> so that's not ideally perfect, but IMHO acceptable.
> >>>
> >>
> >> You don't.
> >
> >
> > Will you reply to the rest of that paragraph too please? I am acknowledging
> that alignment is an issue, so let's see how to resolve it.
> >
>
> I don't think enforcing requirements in debug builds only is acceptable.
> It must be a compile-time assertion, not a run-time one.

And as we both know, this is not enforceable at compile time. I think that requesting extra protections on the top of already provided ones, when the only way to get a misaligned ulint is to ask for one explicitly, is an overkill. But that's my hand-waving against your hand-waving. Thus let's say that yes, "the alignment issue is major and unsurmountable", and I proceed to do what was offered previously: implement load as atomic add zero and return value, and store as dirty read and CAS until success. The reason I didn't like these implementations is that they are pessimized. But that's OK.

> > Right. os_atomic_load_ulint() has additional restrictions on its arg over
> os_atomic_inc_ulint().  I believe that these restrictions are reasonable.  It
> is a performance bug in any case to perform misaligned atomic ops even with
> those ops that make it technically possible.  I have added ut_ad()s to catch
> this.  I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too,
> although that one looks like an overkill to me.
> >
> 
> The same restrictions would apply even if os_atomic_load_ulint() didn't
> exist, right? I.e. the same restrictions would apply if we simply
> accessed those variables without any helper functions?

What would be the desired access semantics in that case?  If "anything goes", then no restrictions would apply.

> Let me ask you a few simple questions and this time around I demand
> "yes/no" answers.
> 
> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> not do what they promise to do?

Yes.

> - Do you agree that naming them os_ordered_load_ulint() /
> os_ordered_store_ulint() would better reflect what they do?

No.

The first answer would be No if not the alignment issue.

> >>> Thus I provide generic primitives, whose current implementations will work
> >> as designed.  However the 1. above also seems to be missing "properly-
> aligned"
> >> and that's where the design is debatable.  On one hand it is possible to
> >> implement misaligned access atomically by LOCK MOV, and document that the
> >> primitives may be used with args of any alignment.  But a better
> alternative
> >> to me seems to accept that misaligned accesses are bugs and document/allow
> >> aligned accesses only.  Even though that's enforceable in debug builds
> only,
> >> so that's not ideally perfect, but IMHO acceptable.
> >>>
> >>
> >> You don't.
> >
> >
> > Will you reply to the rest of that paragraph too please?  I am acknowledging
> that alignment is an issue, so let's see how to resolve it.
> >
> 
> I don't think enforcing requirements in debug builds only is acceptable.
> It must be a compile-time assertion, not a run-time one.

And as we both know, this is not enforceable at compile time.  I think that requesting extra protections on the top of already provided ones, when the only way to get a misaligned ulint is to ask for one explicitly, is an overkill.  But that's my hand-waving against your hand-waving.  Thus let's say that yes, "the alignment issue is major and unsurmountable", and I proceed to do what was offered previously: implement load as atomic add zero and return value, and store as dirty read and CAS until success.  The reason I didn't like these implementations is that they are pessimized. But that's OK.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-13: Posted in a previous version of this proposal

On Fri, 13 Sep 2013 12:54:44 -0000, Laurynas Biveinis wrote:
>>> Right. os_atomic_load_ulint() has additional restrictions on its arg over
>> os_atomic_inc_ulint(). I believe that these restrictions are reasonable. It
>> is a performance bug in any case to perform misaligned atomic ops even with
>> those ops that make it technically possible. I have added ut_ad()s to catch
>> this. I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too,
>> although that one looks like an overkill to me.
>>>
>>
>> The same restrictions would apply even if os_atomic_load_ulint() didn't
>> exist, right? I.e. the same restrictions would apply if we simply
>> accessed those variables without any helper functions?
>
>
> What would be the desired access semantics in that case? If "anything goes", then no restrictions would apply.
>
>
>> Let me ask you a few simple questions and this time around I demand
>> "yes/no" answers.
>>
>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>> not do what they promise to do?
>
>
> Yes.

OK, so we agree that naming is unfortunate.

>
>
>> - Do you agree that naming them os_ordered_load_ulint() /
>> os_ordered_store_ulint() would better reflect what they do?
>
>
> No.

What would be a better naming then?

>
>
>> - Do you agree that naming them that way also makes it obvious that
>> using them in most places is simply unnecessary (e.g. in
>> buf_get_total_stat(), buf_mark_space_corrupt(), buf_print_instance(),
>> buf_get_n_pending_read_ios(), etc.)?
>
>
> No.

OK, then 2 followup questions:

1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
example? Here's an example of a valid answer: "Because that will result
in incorrect values being used in case ...". And some examples of
invalid answers: "non-cache-coherent architectures, visibility, memory
model, sunspots, crop circles, global warming, ...".

2. Why do only 2 out of 9 values are being loaded with
os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
ones need them?

On Fri, 13 Sep 2013 12:54:44 -0000, Laurynas Biveinis wrote:
>>> Right. os_atomic_load_ulint() has additional restrictions on its arg over
>> os_atomic_inc_ulint().  I believe that these restrictions are reasonable.  It
>> is a performance bug in any case to perform misaligned atomic ops even with
>> those ops that make it technically possible.  I have added ut_ad()s to catch
>> this.  I can rename os_atomic_ prefix to os_atomic_aligned_ prefix too,
>> although that one looks like an overkill to me.
>>>
>>
>> The same restrictions would apply even if os_atomic_load_ulint() didn't
>> exist, right? I.e. the same restrictions would apply if we simply
>> accessed those variables without any helper functions?
>
>
> What would be the desired access semantics in that case?  If "anything goes", then no restrictions would apply.
>
>
>> Let me ask you a few simple questions and this time around I demand
>> "yes/no" answers.
>>
>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>> not do what they promise to do?
>
>
> Yes.

OK, so we agree that naming is unfortunate.

>
>
>> - Do you agree that naming them os_ordered_load_ulint() /
>> os_ordered_store_ulint() would better reflect what they do?
>
>
> No.

What would be a better naming then?

OK, then 2 followup questions:

1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for 
example? Here's an example of a valid answer: "Because that will result 
in incorrect values being used in case ...". And some examples of 
invalid answers: "non-cache-coherent architectures, visibility, memory 
model, sunspots, crop circles, global warming, ...".

2. Why do only 2 out of 9 values are being loaded with 
os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining 
ones need them?

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-13: Posted in a previous version of this proposal

Alexey -

> >> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >> not do what they promise to do?
> >
> >
> > Yes.
>
> OK, so we agree that naming is unfortunate.

Vanishingly slightly so, due to the alignment issue, which I believe is mostly theoretical, but nevertheless am ready to address now.

> >> - Do you agree that naming them os_ordered_load_ulint() /
> >> os_ordered_store_ulint() would better reflect what they do?
> >
> >
> > No.
>
> What would be a better naming then?

os_atomic_load_aligned_ulint().

> OK, then 2 followup questions:
>
> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> example? Here's an example of a valid answer: "Because that will result
> in incorrect values being used in case ...". And some examples of
> invalid answers: "non-cache-coherent architectures, visibility, memory
> model, sunspots, crop circles, global warming, ...".

We have gone through this with a disassembly example already, haven't we? We need os_atomic_load_ulint() because we don't want a dirty read. We may well decide that a dirty read there is fine and then replace it. But that's orthogonal to what is this primitive and why.

Are you also objecting to mutex protection here? If not, why? Note that the three n_flush values here are completely independent.

mutex_enter(&buf_pool->flush_state_mutex);

  pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
  pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
  pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];

mutex_exit(&buf_pool->flush_state_mutex);

And please don't group some of the valid answers with looney stuff.

> 2. Why do only 2 out of 9 values are being loaded with
> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
> ones need them?

Upstream reads all 9 dirtily. I replaced two of them to be clean instead. Maybe I need to replace all 9. Maybe 0. But that's again orthogonal.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-15: Posted in a previous version of this proposal

Alexey -

> No, I was referring to buf_pool_contains_zip(). It traverses the buffer
> pool and examines (but not modifies) block->page.zip.data for each
> block. However, the patch changes the assertion in
> buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
> rather than zip_mutex.

Thanks for the pointer, I have reviewed buf_pool_contains_zip() further. Here, as you say, we iterate through the buffer pool uncompressed pages trying to find the uncompressed page of a given compressed page, in order to assert that no uncompressed page points to it, which can be either just taken from the buddy allocator, either be just removed from the buffer pool. In both cases it's an invalid pointer value and it seems to me that we could iterate through the buffer pool without any locking at all. I have tried to think of possible race conditions in the case of no locking, i.e. can the pointer become valid again somehow, by some uncompressed page starting pointing to this zip page. And the transition of the given zip page pointer unallocated -> allocated is protected by zip_free_mutex. Thus, as long as all its callers hold zip_free_mutex and it's only called with invalid pointer values to assert that FALSE is returned, buf_pool_contains_zip() itself does not need any locking. If my reasoning is correct, I can remove the assert from this function. But maybe this should be documented better somehow?

> In fact, in one of the code paths calling
> buf_pool_contains_zip() we assert that zip_mutex is NOT locked. Don't we
> have a bug here?

buf_buddy_free_low(), right? This function is called for a compressed page that is already removed from the buffer pool, that is, no pointer to it should be live in any other thread (and in the buffer pool, as it's asserted by !buf_pool_contains_zip()). Thus it's ok not hold zip_mutex.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-15: Posted in a previous version of this proposal

Hi Laurynas,

On Fri, 13 Sep 2013 14:21:58 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>> not do what they promise to do?
>>>
>>>
>>> Yes.
>>
>> OK, so we agree that naming is unfortunate.
>
>
> Vanishingly slightly so, due to the alignment issue, which I believe is mostly theoretical, but nevertheless am ready to address now.
>

It doesn't do anything to enforce atomicity, does it? I.e. the following
implementation would be equally "atomic":

ulint
os_atomic_load_ulint(ulint *ptr)
{
printf("Hello, world!\n");
return(*ptr);
}

>
>>>> - Do you agree that naming them os_ordered_load_ulint() /
>>>> os_ordered_store_ulint() would better reflect what they do?
>>>
>>>
>>> No.
>>
>> What would be a better naming then?
>
>
> os_atomic_load_aligned_ulint().
>

No, it doesn't do anything to enforce atomicity. That is a caller's
responsibility.

>
>> OK, then 2 followup questions:
>>
>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
>> example? Here's an example of a valid answer: "Because that will result
>> in incorrect values being used in case ...". And some examples of
>> invalid answers: "non-cache-coherent architectures, visibility, memory
>> model, sunspots, crop circles, global warming, ...".
>
>
> We have gone through this with a disassembly example already, haven't we? We need os_atomic_load_ulint() because we don't want a dirty read. We may well decide that a dirty read there is fine and then replace it. But that's orthogonal to what is this primitive and why.
>

We need an atomic read rather than os_atomic_load_ulint() for the above
reasons. An it will be atomic without using any helper functions. Since
I see no answer in the form "Because that will result in incorrect
values being used in case ...", I assume you don't have an answer to
that question.

> Are you also objecting to mutex protection here? If not, why? Note that the three n_flush values here are completely independent.
>
> mutex_enter(&buf_pool->flush_state_mutex);
>
> pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
> pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
> pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
>
> mutex_exit(&buf_pool->flush_state_mutex);
>

I'm not objecting to mutex protection in that code. Why would I?

> And please don't group some of the valid answers with looney stuff.
>
>
>> 2. Why do only 2 out of 9 values are being loaded with
>> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
>> ones need them?
>
>
> Upstream reads all 9 dirtily. I replaced two of them to be clean instead. Maybe I need to replace all 9. Maybe 0. But that's again orthogonal.
>

All 9 reads are atomic. But 7 of them don't use compiler barriers
because they don't need it. Neither do the remaining 2, but you are
quite creative to avoid accepting this simple fact.

Hi Laurynas,

It doesn't do anything to enforce atomicity, does it? I.e. the following 
implementation would be equally "atomic":

ulint
os_atomic_load_ulint(ulint *ptr)
{
	printf("Hello, world!\n");
	return(*ptr);
}

No, it doesn't do anything to enforce atomicity. That is a caller's 
responsibility.

>
>> OK, then 2 followup questions:
>>
>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
>> example? Here's an example of a valid answer: "Because that will result
>> in incorrect values being used in case ...". And some examples of
>> invalid answers: "non-cache-coherent architectures, visibility, memory
>> model, sunspots, crop circles, global warming, ...".
>
>
> We have gone through this with a disassembly example already, haven't we?  We need os_atomic_load_ulint() because we don't want a dirty read.  We may well decide that a dirty read there is fine and then replace it.  But that's orthogonal to what is this primitive and why.
>

We need an atomic read rather than os_atomic_load_ulint() for the above 
reasons. An it will be atomic without using any helper functions. Since 
I see no answer in the form "Because that will result in incorrect 
values being used in case ...", I assume you don't have an answer to 
that question.

> Are you also objecting to mutex protection here? If not, why? Note that the three n_flush values here are completely independent.
>
> 		mutex_enter(&buf_pool->flush_state_mutex);
>
> 		pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
> 		pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
> 		pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
>
> 		mutex_exit(&buf_pool->flush_state_mutex);
>

I'm not objecting to mutex protection in that code. Why would I?

> And please don't group some of the valid answers with looney stuff.
>
>
>> 2. Why do only 2 out of 9 values are being loaded with
>> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
>> ones need them?
>
>
> Upstream reads all 9 dirtily.  I replaced two of them to be clean instead.  Maybe I need to replace all 9.  Maybe 0.  But that's again orthogonal.
>

All 9 reads are atomic. But 7 of them don't use compiler barriers 
because they don't need it. Neither do the remaining 2, but you are 
quite creative to avoid accepting this simple fact.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-15: Posted in a previous version of this proposal

On Sun, 15 Sep 2013 07:47:59 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>> No, I was referring to buf_pool_contains_zip(). It traverses the buffer
>> pool and examines (but not modifies) block->page.zip.data for each
>> block. However, the patch changes the assertion in
>> buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
>> rather than zip_mutex.
>
>
> Thanks for the pointer, I have reviewed buf_pool_contains_zip() further. Here, as you say, we iterate through the buffer pool uncompressed pages trying to find the uncompressed page of a given compressed page, in order to assert that no uncompressed page points to it, which can be either just taken from the buddy allocator, either be just removed from the buffer pool. In both cases it's an invalid pointer value and it seems to me that we could iterate through the buffer pool without any locking at all. I have tried to think of possible race conditions in the case of no locking, i.e. can the pointer become valid again somehow, by some uncompressed page starting pointing to this zip page. And the transition of the given zip page pointer unallocated -> allocated is protected by zip_free_mutex. Thus, as long as all its callers hold zip_free_mutex and it's only called with invalid pointer values to assert that FALSE is returned, buf_pool_contains_zip() itself does not need
any locki
ng. If my reasoning is correct, I can remove the assert from this function. But maybe this should be documented better somehow?
>

Looks correct to me. Let's remove the zip_free_mutex assertion then?

>
>> In fact, in one of the code paths calling
>> buf_pool_contains_zip() we assert that zip_mutex is NOT locked. Don't we
>> have a bug here?
>
>
> buf_buddy_free_low(), right? This function is called for a compressed page that is already removed from the buffer pool, that is, no pointer to it should be live in any other thread (and in the buffer pool, as it's asserted by !buf_pool_contains_zip()). Thus it's ok not hold zip_mutex.
>

OK, thanks for clarification.

On Sun, 15 Sep 2013 07:47:59 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>> No, I was referring to buf_pool_contains_zip(). It traverses the buffer
>> pool and examines (but not modifies) block->page.zip.data for each
>> block. However, the patch changes the assertion in
>> buf_pool_contains_zip() to make sure that zip_free_mutex is locked,
>> rather than zip_mutex.
>
>
> Thanks for the pointer, I have reviewed buf_pool_contains_zip() further.  Here, as you say, we iterate through the buffer pool uncompressed pages trying to find the uncompressed page of a given compressed page, in order to assert that no uncompressed page points to it, which can be either just taken from the buddy allocator, either be just removed from the buffer pool.  In both cases it's an invalid pointer value and it seems to me that we could iterate through the buffer pool without any locking at all.  I have tried to think of possible race conditions in the case of no locking, i.e. can the pointer become valid again somehow, by some uncompressed page starting pointing to this zip page.  And the transition of the given zip page pointer unallocated -> allocated is protected by zip_free_mutex.  Thus, as long as all its callers hold zip_free_mutex and it's only called with invalid pointer values to assert that FALSE is returned, buf_pool_contains_zip() itself does not need 
 any locki
ng.  If my reasoning is correct, I can remove the assert from this function.  But maybe this should be documented better somehow?
>

Looks correct to me. Let's remove the zip_free_mutex assertion then?

OK, thanks for clarification.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-16: Posted in a previous version of this proposal

Alexey -

> > I have reviewed buf_pool_contains_zip() further.
...
> Looks correct to me. Let's remove the zip_free_mutex assertion then?

Yes.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-16: Posted in a previous version of this proposal

Download full text (4.7 KiB)

Alexey -

> >>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >>>> not do what they promise to do?
> >>>
> >>>
> >>> Yes.
> >>
> >> OK, so we agree that naming is unfortunate.
> >
> >
> > Vanishingly slightly so, due to the alignment issue, which I believe is
> mostly theoretical, but nevertheless am ready to address now.
> >
>
> It doesn't do anything to enforce atomicity, does it? I.e. the following
> implementation would be equally "atomic":
>
> ulint
> os_atomic_load_ulint(ulint *ptr)
> {
> printf("Hello, world!\n");
> return(*ptr);
> }

Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64 as long as pointer is aligned.

> >>>> - Do you agree that naming them os_ordered_load_ulint() /
> >>>> os_ordered_store_ulint() would better reflect what they do?
> >>>
> >>>
> >>> No.
> >>
> >> What would be a better naming then?
> >
> >
> > os_atomic_load_aligned_ulint().
> >
>
> No, it doesn't do anything to enforce atomicity. That is a caller's
> responsibility.

As in, don't pass misaligned values? In that case, yes, it is a caller's responsibility not to pass misaligned values. But where would InnoDB get a misaligned pointer to ulint from that we'd wish to access atomically? Hence enforcing alignment on debug build only seemed like a reasonable compromise, but OK, that's debatable.

> >> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> >> example? Here's an example of a valid answer: "Because that will result
> >> in incorrect values being used in case ...". And some examples of
> >> invalid answers: "non-cache-coherent architectures, visibility, memory
> >> model, sunspots, crop circles, global warming, ...".
> >
> >
> > We have gone through this with a disassembly example already, haven't we?
> We need os_atomic_load_ulint() because we don't want a dirty read. We may
> well decide that a dirty read there is fine and then replace it. But that's
> orthogonal to what is this primitive and why.
> >
>
> We need an atomic read rather than os_atomic_load_ulint() for the above
> reasons. An it will be atomic without using any helper functions.

OK, so is the problem that I wanted to introduce the primitives for such access, that would also document how is the variable accessed, and that they don't have to do much besides a compiler barrier on x86_64?

> Since
> I see no answer in the form "Because that will result in incorrect
> values being used in case ...", I assume you don't have an answer to
> that question.

I know that they are not resulting in incorrect values currently, and that the worst can happen in x86_64 with the most of possible future code changes is that the value loads could be moved earlier, resulting in more out-of-date values used. This and the fact that accessing the variable through the primitive serves as self-documentation seem good enough reasons to me.

Alexey -

> >>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >>>> not do what they promise to do?
> >>>
> >>>
> >>> Yes.
> >>
> >> OK, so we agree that naming is unfortunate.
> >
> >
> > Vanishingly slightly so, due to the alignment issue, which I believe is
> mostly theoretical, but nevertheless am ready to address now.
> >
> 
> It doesn't do anything to enforce atomicity, does it? I.e. the following
> implementation would be equally "atomic":
> 
> ulint
> os_atomic_load_ulint(ulint *ptr)
> {
>         printf("Hello, world!\n");
>         return(*ptr);
> }

Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64 as long as pointer is aligned.

> >>>> - Do you agree that naming them os_ordered_load_ulint() /
> >>>> os_ordered_store_ulint() would better reflect what they do?
> >>>
> >>>
> >>> No.
> >>
> >> What would be a better naming then?
> >
> >
> > os_atomic_load_aligned_ulint().
> >
> 
> No, it doesn't do anything to enforce atomicity. That is a caller's
> responsibility.

As in, don't pass misaligned values?  In that case, yes, it is a caller's responsibility not to pass misaligned values.  But where would InnoDB get a misaligned pointer to ulint from that we'd wish to access atomically?   Hence enforcing alignment on debug build only seemed like a reasonable compromise, but OK, that's debatable.

> >> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> >> example? Here's an example of a valid answer: "Because that will result
> >> in incorrect values being used in case ...". And some examples of
> >> invalid answers: "non-cache-coherent architectures, visibility, memory
> >> model, sunspots, crop circles, global warming, ...".
> >
> >
> > We have gone through this with a disassembly example already, haven't we?
> We need os_atomic_load_ulint() because we don't want a dirty read.  We may
> well decide that a dirty read there is fine and then replace it.  But that's
> orthogonal to what is this primitive and why.
> >
> 
> We need an atomic read rather than os_atomic_load_ulint() for the above
> reasons. An it will be atomic without using any helper functions.

> Since
> I see no answer in the form "Because that will result in incorrect
> values being used in case ...", I assume you don't have an answer to
> that question.

I know that they are not resulting in incorrect values currently, and that the worst can happen in x86_64 with the most of possible future code changes is that the value loads could be moved earlier, resulting in more out-of-date values used.  This and the fact that accessing the variable through the primitive serves as self-documentation seem good enough reasons to me.

> > Are you also objecting to mutex protection here? If not, why? Note that the
> three n_flush values here are completely independent.
> >
> >               mutex_enter(&buf_pool->flush_state_mutex);
> >
> >               pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
> >               pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
> >               pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
> >
> >               mutex_exit(&buf_pool->flush_state_mutex);
> >
> 
> I'm not objecting to mutex protection in that code. Why would I?

Because n_flush[] are independent aligned ulints.  If replacing os_atomic_load(foo) with = *foo is OK, then removing the mutex protection is exactly as OK, isn't it?  If not, what is the difference?

> >> 2. Why do only 2 out of 9 values are being loaded with
> >> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
> >> ones need them?
> >
> >
> > Upstream reads all 9 dirtily.  I replaced two of them to be clean instead.
> Maybe I need to replace all 9.  Maybe 0.  But that's again orthogonal.
> >
> 
> All 9 reads are atomic. But 7 of them don't use compiler barriers
> because they don't need it. Neither do the remaining 2, but you are
> quite creative to avoid accepting this simple fact.

I am not sure where exactly you disagree with my reply here.  Yes, all 9 are atomic on x86_64 since they are aligned ulints.  But it's not only about atomicity.  Using a primitive makes the intention to get the most up-to-date value clear.  Perhaps there is no need for that and a stale value is OK, then the variables might be simply read as-is.  And I think I stated the same in my previous reply.  But that does not mean the primitive does not do what it's supposed to do, it's just we simply don't want its guarantee here and hence using it would be redundant.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-16: Posted in a previous version of this proposal

Download full text (5.6 KiB)

On Mon, 16 Sep 2013 09:05:39 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>>>> not do what they promise to do?
>>>>>
>>>>>
>>>>> Yes.
>>>>
>>>> OK, so we agree that naming is unfortunate.
>>>
>>>
>>> Vanishingly slightly so, due to the alignment issue, which I believe is
>> mostly theoretical, but nevertheless am ready to address now.
>>>
>>
>> It doesn't do anything to enforce atomicity, does it? I.e. the following
>> implementation would be equally "atomic":
>>
>> ulint
>> os_atomic_load_ulint(ulint *ptr)
>> {
>> printf("Hello, world!\n");
>> return(*ptr);
>> }
>
>
> Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64 as long as pointer is aligned.
>

So... we don't need it after all?

>
>>>>>> - Do you agree that naming them os_ordered_load_ulint() /
>>>>>> os_ordered_store_ulint() would better reflect what they do?
>>>>>
>>>>>
>>>>> No.
>>>>
>>>> What would be a better naming then?
>>>
>>>
>>> os_atomic_load_aligned_ulint().
>>>
>>
>> No, it doesn't do anything to enforce atomicity. That is a caller's
>> responsibility.
>
>
> As in, don't pass misaligned values? In that case, yes, it is a caller's responsibility not to pass misaligned values. But where would InnoDB get a misaligned pointer to ulint from that we'd wish to access atomically? Hence enforcing alignment on debug build only seemed like a reasonable compromise, but OK, that's debatable.
>
>
>>>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
>>>> example? Here's an example of a valid answer: "Because that will result
>>>> in incorrect values being used in case ...". And some examples of
>>>> invalid answers: "non-cache-coherent architectures, visibility, memory
>>>> model, sunspots, crop circles, global warming, ...".
>>>
>>>
>>> We have gone through this with a disassembly example already, haven't we?
>> We need os_atomic_load_ulint() because we don't want a dirty read. We may
>> well decide that a dirty read there is fine and then replace it. But that's
>> orthogonal to what is this primitive and why.
>>>
>>
>> We need an atomic read rather than os_atomic_load_ulint() for the above
>> reasons. An it will be atomic without using any helper functions.
>
>
> OK, so is the problem that I wanted to introduce the primitives for such access, that would also document how is the variable accessed, and that they don't have to do much besides a compiler barrier on x86_64?
>

Yes, the problem is that you introduce primitives that basically do
nothing, and then use those primitives unnecessarily and inconsistently.
Which in turn leads to blowing up the patch size, increased maintenance
burden, and opens the door for wrong assumptions made when reading the
existing code and implementing new code.

On Mon, 16 Sep 2013 09:05:39 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>>>> not do what they promise to do?
>>>>>
>>>>>
>>>>> Yes.
>>>>
>>>> OK, so we agree that naming is unfortunate.
>>>
>>>
>>> Vanishingly slightly so, due to the alignment issue, which I believe is
>> mostly theoretical, but nevertheless am ready to address now.
>>>
>>
>> It doesn't do anything to enforce atomicity, does it? I.e. the following
>> implementation would be equally "atomic":
>>
>> ulint
>> os_atomic_load_ulint(ulint *ptr)
>> {
>>          printf("Hello, world!\n");
>>          return(*ptr);
>> }
>
>
> Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64 as long as pointer is aligned.
>

So... we don't need it after all?

>
>>>>>> - Do you agree that naming them os_ordered_load_ulint() /
>>>>>> os_ordered_store_ulint() would better reflect what they do?
>>>>>
>>>>>
>>>>> No.
>>>>
>>>> What would be a better naming then?
>>>
>>>
>>> os_atomic_load_aligned_ulint().
>>>
>>
>> No, it doesn't do anything to enforce atomicity. That is a caller's
>> responsibility.
>
>
> As in, don't pass misaligned values?  In that case, yes, it is a caller's responsibility not to pass misaligned values.  But where would InnoDB get a misaligned pointer to ulint from that we'd wish to access atomically?   Hence enforcing alignment on debug build only seemed like a reasonable compromise, but OK, that's debatable.
>
>
>>>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
>>>> example? Here's an example of a valid answer: "Because that will result
>>>> in incorrect values being used in case ...". And some examples of
>>>> invalid answers: "non-cache-coherent architectures, visibility, memory
>>>> model, sunspots, crop circles, global warming, ...".
>>>
>>>
>>> We have gone through this with a disassembly example already, haven't we?
>> We need os_atomic_load_ulint() because we don't want a dirty read.  We may
>> well decide that a dirty read there is fine and then replace it.  But that's
>> orthogonal to what is this primitive and why.
>>>
>>
>> We need an atomic read rather than os_atomic_load_ulint() for the above
>> reasons. An it will be atomic without using any helper functions.
>
>
> OK, so is the problem that I wanted to introduce the primitives for such access, that would also document how is the variable accessed, and that they don't have to do much besides a compiler barrier on x86_64?
>

Yes, the problem is that you introduce primitives that basically do 
nothing, and then use those primitives unnecessarily and inconsistently. 
Which in turn leads to blowing up the patch size, increased maintenance 
burden, and opens the door for wrong assumptions made when reading the 
existing code and implementing new code.

>
>> Since
>> I see no answer in the form "Because that will result in incorrect
>> values being used in case ...", I assume you don't have an answer to
>> that question.
>
>
> I know that they are not resulting in incorrect values currently, and that the worst can happen in x86_64 with the most of possible future code changes is that the value loads could be moved earlier, resulting in more out-of-date values used.  This and the fact that accessing the variable through the primitive serves as self-documentation seem good enough reasons to me.
>

Whether they are more "out-of-date" or less "out-of-date" depends on the 
definition of "date". By defining "date" as the "point in time when the 
function is called", "more out-of-date" can be easily converted to "more 
up-to-date".

>
>
>>> Are you also objecting to mutex protection here? If not, why? Note that the
>> three n_flush values here are completely independent.
>>>
>>>                mutex_enter(&buf_pool->flush_state_mutex);
>>>
>>>                pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
>>>                pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
>>>                pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
>>>
>>>                mutex_exit(&buf_pool->flush_state_mutex);
>>>
>>
>> I'm not objecting to mutex protection in that code. Why would I?
>
>
> Because n_flush[] are independent aligned ulints.  If replacing os_atomic_load(foo) with = *foo is OK, then removing the mutex protection is exactly as OK, isn't it?  If not, what is the difference?
>

But that code is correct and does what it wants to do. That's the 
difference, and that's why I have no reasons to object.

>
>>>> 2. Why do only 2 out of 9 values are being loaded with
>>>> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
>>>> ones need them?
>>>
>>>
>>> Upstream reads all 9 dirtily.  I replaced two of them to be clean instead.
>> Maybe I need to replace all 9.  Maybe 0.  But that's again orthogonal.
>>>
>>
>> All 9 reads are atomic. But 7 of them don't use compiler barriers
>> because they don't need it. Neither do the remaining 2, but you are
>> quite creative to avoid accepting this simple fact.
>
>
> I am not sure where exactly you disagree with my reply here.  Yes, all 9 are atomic on x86_64 since they are aligned ulints.  But it's not only about atomicity.  Using a primitive makes the intention to get the most up-to-date value clear.  Perhaps there is no need for that and a stale value is OK, then the variables might be simply read as-is.  And I think I stated the same in my previous reply.  But that does not mean the primitive does not do what it's supposed to do, it's just we simply don't want its guarantee here and hence using it would be redundant.
>

Finally, "perhaps there is no need for that"! Yes, there's no need for 
that, and that's what I've been trying to explain for a few days so far.

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-16: Posted in a previous version of this proposal

Download full text (5.1 KiB)

More comments on the patch:

- typo (double “get”) in the updated buf_LRU_free_page() comments:
“function returns false, the buf_page_get_get_mutex() might be temporarily”

  - what’s the reason for changing buf_page_t::in_LRU_list to be present
    in release builds? Unless I’m missing something in the current code,
    it is only assigned, but never read in release builds?

  - spurious blank line changes in buf_buddy_relocate(),
    buf_buddy_free_low(), buf_pool_watch_set(), buf_page_get_gen(),
    buf_page_init_for_read(), buf_pool_validate_instance(),
    buf_flush_check_neighbor(), buf_flush_LRU_list_batch(), buf_LRU_drop_page_hash_for_tablespace()
    and innodb_buffer_pool_evict_uncompressed().

- the following change in buf_block_try_discard_uncompressed() does
not release block_mutex if buf_LRU_free_page() returns false.

bpage = buf_page_hash_get(buf_pool, space, offset);

if (bpage) {
- buf_LRU_free_page(bpage, false);
+
+ ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
+
+ mutex_enter(block_mutex);
+
+ if (buf_LRU_free_page(bpage, false)) {
+
+ mutex_exit(block_mutex);
+ return;
+ }
}

- do you really need this change?

@@ -2114,8 +2098,8 @@ buf_page_get_zip(
   break;
  case BUF_BLOCK_ZIP_PAGE:
  case BUF_BLOCK_ZIP_DIRTY:
+ buf_enter_zip_mutex_for_page(bpage);
   block_mutex = &buf_pool->zip_mutex;
- mutex_enter(block_mutex);
   bpage->buf_fix_count++;
   goto got_block;
  case BUF_BLOCK_FILE_PAGE:

- in the following change:

@@ -2721,13 +2707,14 @@ buf_page_get_gen(
}

bpage = &block->page;
+ ut_ad(buf_own_zip_mutex_for_page(bpage));

   if (bpage->buf_fix_count
       || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
    /* This condition often occurs when the buffer
    is not buffer-fixed, but I/O-fixed by
    buf_page_init_for_read(). */
- mutex_exit(block_mutex);
+ mutex_exit(&buf_pool->zip_mutex);
wait_until_unfixed:
    /* The block is buffer-fixed or I/O-fixed.
    Try again later. */

     is there a reason for replacing block_mutex with zip_mutex? you
     could just assert that block_mutex is zip_mutex next to, or even
     instead of the buf_own_zip_mutex_for_page() call.

- same comments for this change:

@@ -2737,11 +2724,11 @@ buf_page_get_gen(
}

   /* Allocate an uncompressed page. */
- mutex_exit(block_mutex);
+ mutex_exit(&buf_pool->zip_mutex);
   block = buf_LRU_get_free_block(buf_pool);
   ut_a(block);

- buf_pool_mutex_enter(buf_pool);
+ mutex_enter(&buf_pool->LRU_list_mutex);

/* As we have released the page_hash lock and the
block_mutex to allocate an uncompressed page it is

  - in buf_mark_space_corrupt() I see LRU_list_mutex, hash_lock and
    buf_page_get_mutex() being acquired, but only LRU_list_mutex and
    buf_page_get_mutex() being released, i.e. it returns with hash_lock
    acquired?

  - it looks like with the changes in buf_page_io_complete() for io_type
    == BUF_IO_WRITE we don’t set io_fix to BUF_IO_NONE, though we did
    before the changes?

- more spurious changes:

@@ -475,8 +473,8 @@ buf_flush_insert_sorted_into_flush_list(
if (prev_b == NULL) {
UT_LIST_ADD_FIRST(list, buf_pool->flush_list, &block->page);
...

More comments on the patch:

- typo (double “get”) in the updated buf_LRU_free_page() comments:
   “function returns false, the buf_page_get_get_mutex() might be temporarily”

- the following change in buf_block_try_discard_uncompressed() does
    not release block_mutex if buf_LRU_free_page() returns false.

bpage = buf_page_hash_get(buf_pool, space, offset);
 
 	if (bpage) {
-		buf_LRU_free_page(bpage, false);
+
+		ib_mutex_t* block_mutex = buf_page_get_mutex(bpage);
+
+		mutex_enter(block_mutex);
+
+		if (buf_LRU_free_page(bpage, false)) {
+
+			mutex_exit(block_mutex);
+			return;
+		}
 	}
 
  - do you really need this change?

@@ -2114,8 +2098,8 @@ buf_page_get_zip(
 		break;
 	case BUF_BLOCK_ZIP_PAGE:
 	case BUF_BLOCK_ZIP_DIRTY:
+		buf_enter_zip_mutex_for_page(bpage);
 		block_mutex = &buf_pool->zip_mutex;
-		mutex_enter(block_mutex);
 		bpage->buf_fix_count++;
 		goto got_block;
 	case BUF_BLOCK_FILE_PAGE:

- in the following change:

@@ -2721,13 +2707,14 @@ buf_page_get_gen(
 		}
 
 		bpage = &block->page;
+		ut_ad(buf_own_zip_mutex_for_page(bpage));
 
 		if (bpage->buf_fix_count
 		    || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
 			/* This condition often occurs when the buffer
 			is not buffer-fixed, but I/O-fixed by
 			buf_page_init_for_read(). */
-			mutex_exit(block_mutex);
+			mutex_exit(&buf_pool->zip_mutex);
 wait_until_unfixed:
 			/* The block is buffer-fixed or I/O-fixed.
 			Try again later. */

is there a reason for replacing block_mutex with zip_mutex? you
     could just assert that block_mutex is zip_mutex next to, or even
     instead of the buf_own_zip_mutex_for_page() call.

- same comments for this change:

@@ -2737,11 +2724,11 @@ buf_page_get_gen(
 		}
 
 		/* Allocate an uncompressed page. */
-		mutex_exit(block_mutex);
+		mutex_exit(&buf_pool->zip_mutex);
 		block = buf_LRU_get_free_block(buf_pool);
 		ut_a(block);
 
-		buf_pool_mutex_enter(buf_pool);
+		mutex_enter(&buf_pool->LRU_list_mutex);
 
 		/* As we have released the page_hash lock and the
 		block_mutex to allocate an uncompressed page it is

- it looks like with the changes in buf_page_io_complete() for io_type
    == BUF_IO_WRITE we don’t set io_fix to BUF_IO_NONE, though we did
    before the changes?

- more spurious changes:

@@ -475,8 +473,8 @@ buf_flush_insert_sorted_into_flush_list(
 	if (prev_b == NULL) {
 		UT_LIST_ADD_FIRST(list, buf_pool->flush_list, &block->page);
 	} else {
-		UT_LIST_INSERT_AFTER(list, buf_pool->flush_list,
-				     prev_b, &block->page);
+		UT_LIST_INSERT_AFTER(list, buf_pool->flush_list, prev_b,
+				     &block->page);
 	}
 
 	incr_flush_list_size_in_bytes(block, buf_pool);

@@ -573,7 +567,7 @@ buf_flush_ready_for_flush(
 }

and
 
 /********************************************************************//**
-Remove a block from the flush list of modified blocks. */
+Remove a block from the flush list of modified blocks.  */
 UNIV_INTERN
 void
 buf_flush_remove(

and

@@ -1693,7 +1729,7 @@ buf_flush_batch(
 					(if their number does not exceed
 					min_n), otherwise ignored */
 {
-	ulint		count	= 0;
+	ulint	count	= 0;
 
 	ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_LIST);
 #ifdef UNIV_SYNC_DEBUG

and

@@ -1836,7 +1870,7 @@ buf_flush_wait_batch_end(
 		}
 	} else {
 		thd_wait_begin(NULL, THD_WAIT_DISKIO);
-	os_event_wait(buf_pool->no_flush[type]);
+		os_event_wait(buf_pool->no_flush[type]);
 		thd_wait_end(NULL);
 	}
 }

and

if (buf_pool->n_flush[BUF_FLUSH_LRU] > 0
-		   || buf_pool->init_flush[BUF_FLUSH_LRU]) {
+		    || buf_pool->init_flush[BUF_FLUSH_LRU]) {
 
and

@@ -958,10 +1018,10 @@ buf_LRU_free_from_unzip_LRU_list(
 					srv_LRU_scan_depth / 2 blocks. */
 {
 	buf_block_t*	block;
-	ibool 		freed;
+	ibool		freed;
 	ulint		scanned;

- in the following hunk there’s no really need for the “else” block,
    and a call to mutex_exit() can be moved out of the “if” statement:

@@ -1316,14 +1317,14 @@ buf_flush_try_neighbors(
 
 				buf_flush_page(buf_pool, bpage, flush_type, false);
 				ut_ad(!mutex_own(block_mutex));
-				ut_ad(!buf_pool_mutex_own(buf_pool));
 				count++;
 				continue;
-			} else {
-				mutex_exit(block_mutex);
 			}
+			mutex_exit(block_mutex);
+		} else {
+
+			mutex_exit(block_mutex);
 		}
-		buf_pool_mutex_exit(buf_pool);
 	}
 
 	if (count > 0) {

- typo in the comments:

+	/* The following call will release the buf_pgae_get_mutex() mutex. */
...
+	As we are not holding LRU list or buf_pgae_get_mutex() mutex therefore

review: Needs Fixing

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-16: Posted in a previous version of this proposal

Alexey -

Replying to this bit separately as it may need further discussion while I am addressing the rest of comments.

> - what’s the reason for changing buf_page_t::in_LRU_list to be present
> in release builds? Unless I’m missing something in the current code,
> it is only assigned, but never read in release builds?

This is another "automerge" from 5.5 and indeed serves no release
build purpose in the current 5.6 code. But it points out a
non-trivial thing. In 5.5 it is used as follows:
-) for checking whether a given page is still on the LRU list if both
   block and LRU mutexes were temporarily released:
   buf_page_get_zip(), buf_LRU_free_block() (both are different code
   from 5.6).
-) for iterating through the LRU list without holding the LRU list
   mutex at all: buf_LRU_free_from_common_LRU_list(),
   buf_LRU_free_block(),
   buf_flush_LRU_recommendation()/buf_flush_ready_for_replace(). I
   think this is unsafe and a bug in 5.5 due to page relocations
   potentially resulting in wild pointers, even if it does wonders for
   the LRU list contention. 5.6 holds the LRU list mutex in the corresponding code.
-) redundant checks, ie. on LRU list iteration where the mutex is not
   released: buf_LRU_insert_zip_clean(),
   buf_LRU_free_from_unzip_LRU_list().

Thus I think 1) in_LRU_list changes should be reverted now. 2) 5.5
might need fixing. 3) The LRU list mutex is hot in 5.6. If there is
a safe way not to hold it in 5.6 (for example, for
BUF_BLOCK_FILE_PAGE, but hard to tell the page type without
dereferencing page pointer - maybe by comparing the page address
against buffer pool chunk address range?), then it's worth looking
into it.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-17: Posted in a previous version of this proposal

Download full text (5.3 KiB)

Alexey -

> - typo (double “get”) in the updated buf_LRU_free_page() comments:
> “function returns false, the buf_page_get_get_mutex() might be temporarily”

Fixed.

in_LRU_list changes have been reverted with the exception of an extra
assert in buf_page_set_sticky(). I also found that in_unzip_LRU_list
was converted to a release build flag but no uses were converted.
Reverted that too.

> - spurious blank line changes in buf_buddy_relocate(),
> buf_buddy_free_low(), buf_pool_watch_set(),

Fixed.

> buf_page_get_gen(),

I didn't find this one. Will re-check before the final push.

> buf_page_init_for_read(), buf_pool_validate_instance(),
> buf_flush_check_neighbor(), buf_flush_LRU_list_batch(),
> buf_LRU_drop_page_hash_for_tablespace()
> and innodb_buffer_pool_evict_uncompressed().

Fixed. Also removed a diagnostic printf from
buf_pool_validate_instance().

> - the following change in buf_block_try_discard_uncompressed() does
> not release block_mutex if buf_LRU_free_page() returns false.

Fixed.

> - do you really need this change?
>
> @@ -2114,8 +2098,8 @@ buf_page_get_zip(
> break;
> case BUF_BLOCK_ZIP_PAGE:
> case BUF_BLOCK_ZIP_DIRTY:
> + buf_enter_zip_mutex_for_page(bpage);
> block_mutex = &buf_pool->zip_mutex;
> - mutex_enter(block_mutex);
> bpage->buf_fix_count++;
> goto got_block;
> case BUF_BLOCK_FILE_PAGE:

No, I don't. A debugging leftover, reverted.

> - in the following change:
>
> @@ -2721,13 +2707,14 @@ buf_page_get_gen(
> }
>
> bpage = &block->page;
> + ut_ad(buf_own_zip_mutex_for_page(bpage));
>
> if (bpage->buf_fix_count
> || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
> /* This condition often occurs when the buffer
> is not buffer-fixed, but I/O-fixed by
> buf_page_init_for_read(). */
> - mutex_exit(block_mutex);
> + mutex_exit(&buf_pool->zip_mutex);
> wait_until_unfixed:
> /* The block is buffer-fixed or I/O-fixed.
> Try again later. */
>
> is there a reason for replacing block_mutex with zip_mutex? you
> could just assert that block_mutex is zip_mutex next to, or even
> instead of the buf_own_zip_mutex_for_page() call.

Yes. Replaced buf_own_zip_mutex_for_page() with ut_ad(block_mutex ==
&buf_pool->zip_mutex), which also happens to be symmetric with "!="
assert above for BUF_BLOCK_FILE_PAGE. Reverted the mutex_exit change
too.

> - same comments for this change:
>
> @@ -2737,11 +2724,11 @@ buf_page_get_gen(
> }
>
> /* Allocate an uncompressed page. */
> - mutex_exit(block_mutex);
> + mutex_exit(&buf_pool->zip_mutex);...

Alexey -

>   - typo (double “get”) in the updated buf_LRU_free_page() comments:
>    “function returns false, the buf_page_get_get_mutex() might be temporarily”

Fixed.

>   - what’s the reason for changing buf_page_t::in_LRU_list to be present
>     in release builds? Unless I’m missing something in the current code,
>     it is only assigned, but never read in release builds?

in_LRU_list changes have been reverted with the exception of an extra
assert in buf_page_set_sticky().  I also found that in_unzip_LRU_list
was converted to a release build flag but no uses were converted.
Reverted that too.

>   - spurious blank line changes in buf_buddy_relocate(),
>     buf_buddy_free_low(), buf_pool_watch_set(),

Fixed.

> buf_page_get_gen(),

I didn't find this one.  Will re-check before the final push.

>     buf_page_init_for_read(), buf_pool_validate_instance(),
>     buf_flush_check_neighbor(), buf_flush_LRU_list_batch(),
> buf_LRU_drop_page_hash_for_tablespace()
>     and innodb_buffer_pool_evict_uncompressed().

Fixed.  Also removed a diagnostic printf from
buf_pool_validate_instance().

>   - the following change in buf_block_try_discard_uncompressed() does
>     not release block_mutex if buf_LRU_free_page() returns false.

Fixed.

>   - do you really need this change?
> 
> @@ -2114,8 +2098,8 @@ buf_page_get_zip(
>                 break;
>         case BUF_BLOCK_ZIP_PAGE:
>         case BUF_BLOCK_ZIP_DIRTY:
> +               buf_enter_zip_mutex_for_page(bpage);
>                 block_mutex = &buf_pool->zip_mutex;
> -               mutex_enter(block_mutex);
>                 bpage->buf_fix_count++;
>                 goto got_block;
>         case BUF_BLOCK_FILE_PAGE:

No, I don't.  A debugging leftover, reverted.

>   - in the following change:
> 
> @@ -2721,13 +2707,14 @@ buf_page_get_gen(
>                 }
> 
>                 bpage = &block->page;
> +               ut_ad(buf_own_zip_mutex_for_page(bpage));
> 
>                 if (bpage->buf_fix_count
>                     || buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
>                         /* This condition often occurs when the buffer
>                         is not buffer-fixed, but I/O-fixed by
>                         buf_page_init_for_read(). */
> -                       mutex_exit(block_mutex);
> +                       mutex_exit(&buf_pool->zip_mutex);
>  wait_until_unfixed:
>                         /* The block is buffer-fixed or I/O-fixed.
>                         Try again later. */
> 
>      is there a reason for replacing block_mutex with zip_mutex? you
>      could just assert that block_mutex is zip_mutex next to, or even
>      instead of the buf_own_zip_mutex_for_page() call.

Yes.  Replaced buf_own_zip_mutex_for_page() with ut_ad(block_mutex ==
&buf_pool->zip_mutex), which also happens to be symmetric with "!="
assert above for BUF_BLOCK_FILE_PAGE.  Reverted the mutex_exit change
too.

>   - same comments for this change:
> 
> @@ -2737,11 +2724,11 @@ buf_page_get_gen(
>                 }
> 
>                 /* Allocate an uncompressed page. */
> -               mutex_exit(block_mutex);
> +               mutex_exit(&buf_pool->zip_mutex);
>                 block = buf_LRU_get_free_block(buf_pool);
>                 ut_a(block);

Reverted too.

>   - in buf_mark_space_corrupt() I see LRU_list_mutex, hash_lock and
>     buf_page_get_mutex() being acquired, but only LRU_list_mutex and
>     buf_page_get_mutex() being released, i.e. it returns with hash_lock
>     acquired?

I don't see buf_page_get_mutex() being released directly in
buf_mark_space_corrupt()?   buf_LRU_free_one_page() will release hash
lock and buf_page_get_mutex(), thus no bug.  Should
buf_LRU_free_one_page() locking be documented better in its header
comment?

>   - it looks like with the changes in buf_page_io_complete() for io_type
>     == BUF_IO_WRITE we don’t set io_fix to BUF_IO_NONE, though we did
>     before the changes?

It's pushed down to buf_flush_write_complete().  The reason is that
it must happen under flush_state_mutex in order to observe consistent
buffer pool in buf_pool_validate_instance().

>   - more spurious changes:

> -               UT_LIST_INSERT_AFTER(list, buf_pool->flush_list,
> -                                    prev_b, &block->page);
> +               UT_LIST_INSERT_AFTER(list, buf_pool->flush_list, prev_b,
> +                                    &block->page);
> -Remove a block from the flush list of modified blocks. */
> +Remove a block from the flush list of modified blocks.  */
> -       ulint           count   = 0;
> +       ulint   count   = 0;
> -       os_event_wait(buf_pool->no_flush[type]);
> +               os_event_wait(buf_pool->no_flush[type]);
> -                  || buf_pool->init_flush[BUF_FLUSH_LRU]) {
> +                   || buf_pool->init_flush[BUF_FLUSH_LRU]) {
> -       ibool           freed;
> +       ibool           freed;

All reverted.

>   - in the following hunk there’s no really need for the “else” block,
>     and a call to mutex_exit() can be moved out of the “if” statement:
> 
> @@ -1316,14 +1317,14 @@ buf_flush_try_neighbors(

Yes, fixed.

>   - typo in the comments:
> 
> +       /* The following call will release the buf_pgae_get_mutex() mutex. */
> ...
> +       As we are not holding LRU list or buf_pgae_get_mutex() mutex therefore

Oh. Fixed.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-17: Posted in a previous version of this proposal

Download full text (4.4 KiB)

buf_enter_zip_mutex_for_page(buf_page_t *) is a bad idea, unless a buffer pool pointer is passed too, which then makes it an overkill, because it dereferences an unprotected bpage pointer, resulting in the below. Will remove it and replace back with mutex_enter(&buf_pool->zip_mutex).

http://jenkins.percona.com/job/percona-server-5.6-param/273/BUILD_TYPE=debug,Host=debian-wheezy-x64/testReport/junit/%28root%29/innodb/innodb_wl5522_zip/

buf_enter_zip_mutex_for_page(buf_page_t *) is a bad idea, unless a buffer pool pointer is passed too, which then makes it an overkill, because it dereferences an unprotected bpage pointer, resulting in the below.  Will remove it and replace back with mutex_enter(&buf_pool->zip_mutex).

http://jenkins.percona.com/job/percona-server-5.6-param/273/BUILD_TYPE=debug,Host=debian-wheezy-x64/testReport/junit/%28root%29/innodb/innodb_wl5522_zip/

Thread 1 (Thread 0x7f3acebfe700 (LWP 321)):
#0  __pthread_kill (threadid=<optimized out>, signo=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:63
#1  0x0000000000ae2f45 in my_write_core (sig=6) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/mysys/stacktrace.c:422
#2  0x000000000075eaa9 in handle_fatal_signal (sig=6) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/sql/signal_handler.cc:251
#3  <signal handler called>
#4  0x00007f3ae835c475 in *__GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007f3ae835f6f0 in *__GI_abort () at abort.c:92
#6  0x0000000000d9b44c in buf_pool_from_bpage (bpage=0x33912c0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/buf0buf.ic:88
#7  0x0000000000d9cb3c in buf_enter_zip_mutex_for_page (bpage=0x33912c0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/buf0buf.ic:1412
#8  0x0000000000da2e5b in buf_page_get_gen (space=431, zip_size=8192, offset=9, rw_latch=1, guess=0x0, mode=10, file=0x1109668 "/mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/btr/btr0pcur.cc", line=438, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/buf/buf0buf.cc:2743
#9  0x0000000000d8c45a in btr_block_get_func (space=431, zip_size=8192, page_no=9, mode=1, file=0x1109668 "/mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/btr/btr0pcur.cc", line=438, index=0x3393d18, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/btr0btr.ic:60
#10 0x0000000000d8df82 in btr_pcur_move_to_next_page (cursor=0x7f3acebfd420, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/btr/btr0pcur.cc:438
#11 0x0000000000df3a47 in btr_pcur_move_to_next_user_rec (cursor=0x7f3acebfd420, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/include/btr0pcur.ic:323
#12 0x0000000000df5a6e in dict_stats_analyze_index_level (index=0x3393d18, level=0, n_diff=0x3496418, total_recs=0x7f3acebfdac0, total_pages=0x7f3acebfdab8, n_diff_boundaries=0x0, mtr=0x7f3acebfd5e0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0stats.cc:1024
#13 0x0000000000df6acd in dict_stats_analyze_index (index=0x3393d18) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0stats.cc:1821
#14 0x0000000000df72d3 in dict_stats_update_persistent (table=0x32505e8) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0stats.cc:2058
#15 0x0000000000df8dea in dict_stats_update (table=0x32505e8, stats_upd_option=DICT_STATS_RECALC_PERSISTENT) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0stats.cc:2947
#16 0x0000000000dfba8f in dict_stats_process_entry_from_recalc_pool () at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0stats_bg.cc:313
#17 0x0000000000dfbb6c in dict_stats_thread (arg=0x0) at /mnt/workspace/percona-server-5.6-param/BUILD_TYPE/debug/Host/debian-wheezy-x64/Percona-Server/storage/innobase/dict/dict0stats_bg.cc:355
#18 0x00007f3ae969eb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#19 0x00007f3ae8404a7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#20 0x0000000000000000 in ?? ()

review: Needs Fixing

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-17: Posted in a previous version of this proposal

Download full text (6.9 KiB)

Alexey -

> >>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >>>>>> not do what they promise to do?
> >>>>>
> >>>>>
> >>>>> Yes.
> >>>>
> >>>> OK, so we agree that naming is unfortunate.
> >>>
> >>>
> >>> Vanishingly slightly so, due to the alignment issue, which I believe is
> >> mostly theoretical, but nevertheless am ready to address now.
> >>>
> >>
> >> It doesn't do anything to enforce atomicity, does it? I.e. the following
> >> implementation would be equally "atomic":
> >>
> >> ulint
> >> os_atomic_load_ulint(ulint *ptr)
> >> {
> >> printf("Hello, world!\n");
> >> return(*ptr);
> >> }
> >
> >
> > Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64
> as long as pointer is aligned.
> >
>
> So... we don't need it after all?

But we don't want to ignore visibility and ordering.

> >>>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> >>>> example? Here's an example of a valid answer: "Because that will result
> >>>> in incorrect values being used in case ...". And some examples of
> >>>> invalid answers: "non-cache-coherent architectures, visibility, memory
> >>>> model, sunspots, crop circles, global warming, ...".
> >>>
> >>>
> >>> We have gone through this with a disassembly example already, haven't we?
> >> We need os_atomic_load_ulint() because we don't want a dirty read. We may
> >> well decide that a dirty read there is fine and then replace it. But
> that's
> >> orthogonal to what is this primitive and why.
> >>>
> >>
> >> We need an atomic read rather than os_atomic_load_ulint() for the above
> >> reasons. An it will be atomic without using any helper functions.
> >
> >
> > OK, so is the problem that I wanted to introduce the primitives for such
> access, that would also document how is the variable accessed, and that they
> don't have to do much besides a compiler barrier on x86_64?
> >
>
> Yes, the problem is that you introduce primitives that basically do
> nothing, and then use those primitives unnecessarily and inconsistently.
> Which in turn leads to blowing up the patch size, increased maintenance
> burden, and opens the door for wrong assumptions made when reading the
> existing code and implementing new code.

OK, now I see your concerns, and understand a big part of them but not all of them. What wrong assumptions are encouraged by the primitives?

> >> Since
> >> I see no answer in the form "Because that will result in incorrect
> >> values being used in case ...", I assume you don't have an answer to
> >> that question.
> >
> >
> > I know that they are not resulting in incorrect values currently, and that
> the worst can happen in x86_64 with the most of possible future code changes
> is that the value loads could be moved earlier, resulting in more out-of-date
> values used. This and the fact that accessing the variable through the
> primitive serves as self-documentation seem good enough reasons to me.
> >
>
> Whether they are more "out-of-date" or less "out-of-date" depends on the
> definition of "date". By defining "date" as the "point in time when the
> function is called", "more out-of-date" can be easily...

Alexey -

> >>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
> >>>>>> not do what they promise to do?
> >>>>>
> >>>>>
> >>>>> Yes.
> >>>>
> >>>> OK, so we agree that naming is unfortunate.
> >>>
> >>>
> >>> Vanishingly slightly so, due to the alignment issue, which I believe is
> >> mostly theoretical, but nevertheless am ready to address now.
> >>>
> >>
> >> It doesn't do anything to enforce atomicity, does it? I.e. the following
> >> implementation would be equally "atomic":
> >>
> >> ulint
> >> os_atomic_load_ulint(ulint *ptr)
> >> {
> >>          printf("Hello, world!\n");
> >>          return(*ptr);
> >> }
> >
> >
> > Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64
> as long as pointer is aligned.
> >
> 
> So... we don't need it after all?

But we don't want to ignore visibility and ordering.

> >>>> 1. Why do we need os_atomic_load_ulint() in buf_get_total_stat(), for
> >>>> example? Here's an example of a valid answer: "Because that will result
> >>>> in incorrect values being used in case ...". And some examples of
> >>>> invalid answers: "non-cache-coherent architectures, visibility, memory
> >>>> model, sunspots, crop circles, global warming, ...".
> >>>
> >>>
> >>> We have gone through this with a disassembly example already, haven't we?
> >> We need os_atomic_load_ulint() because we don't want a dirty read.  We may
> >> well decide that a dirty read there is fine and then replace it.  But
> that's
> >> orthogonal to what is this primitive and why.
> >>>
> >>
> >> We need an atomic read rather than os_atomic_load_ulint() for the above
> >> reasons. An it will be atomic without using any helper functions.
> >
> >
> > OK, so is the problem that I wanted to introduce the primitives for such
> access, that would also document how is the variable accessed, and that they
> don't have to do much besides a compiler barrier on x86_64?
> >
> 
> Yes, the problem is that you introduce primitives that basically do
> nothing, and then use those primitives unnecessarily and inconsistently.
> Which in turn leads to blowing up the patch size, increased maintenance
> burden, and opens the door for wrong assumptions made when reading the
> existing code and implementing new code.

OK, now I see your concerns, and understand a big part of them but not all of them.  What wrong assumptions are encouraged by the primitives?

> >> Since
> >> I see no answer in the form "Because that will result in incorrect
> >> values being used in case ...", I assume you don't have an answer to
> >> that question.
> >
> >
> > I know that they are not resulting in incorrect values currently, and that
> the worst can happen in x86_64 with the most of possible future code changes
> is that the value loads could be moved earlier, resulting in more out-of-date
> values used.  This and the fact that accessing the variable through the
> primitive serves as self-documentation seem good enough reasons to me.
> >
> 
> Whether they are more "out-of-date" or less "out-of-date" depends on the
> definition of "date". By defining "date" as the "point in time when the
> function is called", "more out-of-date" can be easily converted to "more
> up-to-date".

I am trying to understand this but with little success.  Does this refer to the case where we use values for heuristics and stats, where a value from one point in time is as good as value from some other not too removed point in time?  I couldn't come up with another case where "date" as "point in time when the function is called" would be used.

> >>> Are you also objecting to mutex protection here? If not, why? Note that
> the
> >> three n_flush values here are completely independent.
> >>>
> >>>                mutex_enter(&buf_pool->flush_state_mutex);
> >>>
> >>>                pending_io += buf_pool->n_flush[BUF_FLUSH_LRU];
> >>>                pending_io += buf_pool->n_flush[BUF_FLUSH_SINGLE_PAGE];
> >>>                pending_io += buf_pool->n_flush[BUF_FLUSH_LIST];
> >>>
> >>>                mutex_exit(&buf_pool->flush_state_mutex);
> >>>
> >>
> >> I'm not objecting to mutex protection in that code. Why would I?
> >
> >
> > Because n_flush[] are independent aligned ulints.  If replacing
> os_atomic_load(foo) with = *foo is OK, then removing the mutex protection is
> exactly as OK, isn't it?  If not, what is the difference?
> >
> 
> But that code is correct and does what it wants to do. That's the
> difference, and that's why I have no reasons to object.

Now this I don't understand.  The variables are atomic, we don't care about "up-to-dateness" here (or do we?).  Thus

mutex_enter(...)
pending_io += ..
mutex_exit(...)

and

pending_io += os_atomic_load_ulint(...);
...

and

pending_io += buf_pool->n_flush[...];
...

are the same?  If yes, why do we go through the mutex trouble in order to call the code correct?

> >>>> 2. Why do only 2 out of 9 values are being loaded with
> >>>> os_atomic_load_ulint() in buf_get_total_stat()? Why don't the remaining
> >>>> ones need them?
> >>>
> >>>
> >>> Upstream reads all 9 dirtily.  I replaced two of them to be clean instead.
> >> Maybe I need to replace all 9.  Maybe 0.  But that's again orthogonal.
> >>>
> >>
> >> All 9 reads are atomic. But 7 of them don't use compiler barriers
> >> because they don't need it. Neither do the remaining 2, but you are
> >> quite creative to avoid accepting this simple fact.
> >
> >
> > I am not sure where exactly you disagree with my reply here.  Yes, all 9 are
> atomic on x86_64 since they are aligned ulints.  But it's not only about
> atomicity.  Using a primitive makes the intention to get the most up-to-date
> value clear.  Perhaps there is no need for that and a stale value is OK, then
> the variables might be simply read as-is.  And I think I stated the same in my
> previous reply.  But that does not mean the primitive does not do what it's
> supposed to do, it's just we simply don't want its guarantee here and hence
> using it would be redundant.
> >
> 
> Finally, "perhaps there is no need for that"! Yes, there's no need for
> that, and that's what I've been trying to explain for a few days so far.

There is a big difference between "there is no need for that because we don't need such guarantees at this location" and "there is no need for that because there is no use for this primitive at all".

I am removing os_atomic_load_ulint() uses from buf_get_total_stat(), buf_print_instance(), buf_stat_get_pool_info(), buf_LRU_evict_from_unzip_LRU(), the last instance in buf_read_recv_pages(), srv_mon_process_existing_counter() (where it always has been a bug, as called on a private variable).

This would leave them in buf_mark_space_corrupt(), buf_page_io_complete(), buf_get_n_pending_read_ios(), buf_pool_check_no_pending_io(), buf_LRU_get_free_block(), buf_read_page_handle_error(), buf_read_ahead_random(), buf_read_ahead_linear(), buf_read_ibuf_merge_pages(), all but the last instance in buf_read_recv_pages().  Would you object?

Thanks

Revision history for this message

Alexey Kopytov (akopytov) wrote on 2013-09-18: Posted in a previous version of this proposal

Hi Laurynas,

On Tue, 17 Sep 2013 12:41:18 -0000, Laurynas Biveinis wrote:
> Alexey -
>
>
>>>>>>>> - Do you agree that os_atomic_load_ulint() / os_atomic_store_ulint() do
>>>>>>>> not do what they promise to do?
>>>>>>>
>>>>>>>
>>>>>>> Yes.
>>>>>>
>>>>>> OK, so we agree that naming is unfortunate.
>>>>>
>>>>>
>>>>> Vanishingly slightly so, due to the alignment issue, which I believe is
>>>> mostly theoretical, but nevertheless am ready to address now.
>>>>>
>>>>
>>>> It doesn't do anything to enforce atomicity, does it? I.e. the following
>>>> implementation would be equally "atomic":
>>>>
>>>> ulint
>>>> os_atomic_load_ulint(ulint *ptr)
>>>> {
>>>> printf("Hello, world!\n");
>>>> return(*ptr);
>>>> }
>>>
>>>
>>> Yes, it would be equally atomic (ignoring visibility and ordering) on x86_64
>> as long as pointer is aligned.
>>>
>>
>> So... we don't need it after all?
>
>
> But we don't want to ignore visibility and ordering.
>

Yeah, you forgot non-cache-coherent architectures.

This discussion has been running in circles for almost a week now, and I
have a feeling you deliberately keep it this way. Since I have not been
presented any technical arguments for keeping that code, and have better
things to do, I'm going to wrap up the democrazy and stop it forcefully.

I will not approve this MP with "atomic" primitives present in the code.
The discussion is over.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-18: Posted in a previous version of this proposal

Alexey -

> This discussion has been running in circles for almost a week now, and I
> have a feeling you deliberately keep it this way. Since I have not been
> presented any technical arguments for keeping that code, and have better
> things to do, I'm going to wrap up the democrazy and stop it forcefully.
>
> I will not approve this MP with "atomic" primitives present in the code.
> The discussion is over.

I did not do anything to deserve this kind of treatment. Your feeling on me keeping this deliberately is plain wrong (What do I have to gain? Minus one week of my copious time? Smug feeling of being right?). I have addressed every single your comment without hesitation and coming from an assumption that you are right in this MP and tens of MPs before.

I have tried my best to explain why the code is correct. You seem to disagree but I have trouble understanding why. I am well within in my rights to ask you to explain further and the burden is on you to show why the code is wrong. Hence your refusal to continue the review is stalling the review right now. Please continue the technical discussion.

Revision history for this message

Laurynas Biveinis (laurynas-biveinis) wrote on 2013-09-18: Posted in a previous version of this proposal

Repushed. Changes from the 2nd push. Not a resubmission yet.

    - Removed ut_ad(mutex_own(&buf_pool->zip_free_mutex)) from
      buf_pool_contains_zip().
    - Fixed comment typos.
    - Reverted in_LRU_list and in_unzip_LRU_list changes, with the
      exception of an extra assert in buf_page_set_sticky().
    - Reverted spurious whitespace changes.
    - Removed spurious diagnostic printf from
      buf_pool_validate_instance().
    - Fixed locking in buf_block_try_discard_uncompressed().
    - Reverted redundant locking changes in buf_page_get_zip() and
      buf_page_get_gen().
    - Removed buf_enter_zip_mutex_for_page().
    - Removed os_atomic_load_ulint() uses from from
      buf_get_total_stat(), buf_print_instance(),
      buf_stat_get_pool_info(), buf_LRU_evict_from_unzip_LRU(), the
      last instance in buf_read_recv_pages(), and.
      srv_mon_process_existing_counter().