Merge lp:~laurynas-biveinis/percona-server/bug1086680-5.1 into lp:percona-server/5.1

Proposed by Laurynas Biveinis
Status: Rejected
Rejected by: Alexey Kopytov
Proposed branch: lp:~laurynas-biveinis/percona-server/bug1086680-5.1
Merge into: lp:percona-server/5.1
Diff against target: 237 lines (+63/-6)
5 files modified
Percona-Server/storage/innodb_plugin/buf/buf0buf.c (+29/-3)
Percona-Server/storage/innodb_plugin/buf/buf0flu.c (+29/-3)
Percona-Server/storage/innodb_plugin/include/buf0buf.h (+3/-0)
Percona-Server/storage/innodb_plugin/include/sync0sync.h (+1/-0)
Percona-Server/storage/innodb_plugin/sync/sync0sync.c (+1/-0)
To merge this branch: bzr merge lp:~laurynas-biveinis/percona-server/bug1086680-5.1
Reviewer Review Type Date Requested Status
Alexey Kopytov (community) Needs Resubmitting
Laurynas Biveinis Pending
Review via email: mp+144470@code.launchpad.net

This proposal supersedes a proposal from 2013-01-18.

Description of the change

Fix bug 1086680 for 5.1
http://jenkins.percona.com/job/percona-server-5.1-param/508/
Additionally tested locally with UNIV_SYNC_DEBUG, MTR --suite=innodb_plugin.
No BT or ST, but blocks BT 16274 QA.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote : Posted in a previous version of this proposal

Left a race condition: the state of a block that is about to be relocated might change from ZIP_DIRTY to ZIP_CLEAN or back while the mutexes are released in buf_page_get_gen. It's an issue because the decision whether to take the zip dirty flush mutex is made earlier.

Easy to fix by rechecking the block state after the final mutex reacquisition.
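
The recheck, as folded into the existing buffer-fix/I/O-fix check in buf_page_get_gen() in the resubmitted version (condensed from the diff below; the cleanup before the retry is omitted):

if (UNIV_UNLIKELY
    (bpage->buf_fix_count
     || buf_page_get_io_fix(bpage) != BUF_IO_NONE
     || is_zip_dirty != (buf_page_get_state(bpage)
                         == BUF_BLOCK_ZIP_DIRTY))) {

        /* The state changed from zip clean to zip dirty or back
        while the mutexes were released: free the block that was
        allocated and try again. */
        goto wait_until_unfixed;
}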

review: Needs Fixing
Revision history for this message
Alexey Kopytov (akopytov) wrote :

The patch itself looks correct in the sense that it does solve the
problem under the described conditions. Though it is not clear if there
are other conditions resulting in a dirty bpage read in
buf_flush_batch(), e.g. not related to compressed page relocations, and
if those will be solved by the patch as well.

However, I don't like that we:

1) introduce another mutex. It's another potential source of regressions
and another thing to care about on merges.

2) lock a global buffer pool mutex for the entire flushing loop in
buf_flush_batch() when we in fact only want to protect individual pages
on every loop iteration. Which means a buf_page_get_gen() call in
another thread trying to relocate a completely unrelated (e.g. not even
touched by the buf_flush_batch() loop) page will have to wait for
buf_flush_batch() to complete flushing all the pages it wants to flush.

Are there any reasons to not follow the same logic in buf_flush_batch()
wrt. flush_list_mutex as we do wrt. LRU_list_mutex? After all, the root
cause of the problem is that we release flush_list_mutex too early and
then read from the block while it's not protected either by
flush_list_mutex or the block mutex. So let's just not release
flush_list_mutex until we lock the block mutex a few lines down in the
code, and that will also prevent relocations? flush_list_mutex can then
be released temporarily while buf_flush_try_neighbors() is in
progress. It will likely result in a much smaller & easier to maintain
patch, and even increase locking granularity as opposed to a separate
global mutex approach?
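
In sketch form, what this suggests for the BUF_FLUSH_LIST batch (hypothetical, never implemented in this MP, and, as discussed in the replies below, it conflicts with the established latching order):

mutex_enter(&flush_list_mutex);
bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
/* keep flush_list_mutex until the block mutex is locked, so bpage
is never read unprotected and cannot be relocated in between */
mutex_enter(buf_page_get_mutex(bpage));
mutex_exit(&flush_list_mutex);
/* flush the page; flush_list_mutex can be released temporarily
around buf_flush_try_neighbors() */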

review: Needs Fixing
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

> Though it is not clear if there
> are other conditions resulting in a dirty bpage read in
> buf_flush_batch(), e.g. not related to compressed page relocations, and
> if those will be solved by the patch as well.

IIRC I reviewed the code for other such issues.

> However, I don't like that we:
>
> 1) introduce another mutex. It's another potential source of regressions
> and another thing to care about on merges.

I don't like it either, and I'd be glad to reimplement it some other way. But that was the least bad alternative I could come up with.

> 2) lock a global buffer pool mutex for the entire flushing loop in
> buf_flush_batch() when we in fact only want to protect individual pages
> on every loop iteration. Which means a buf_page_get_gen() call in
> another thread trying to relocate a completely unrelated (e.g. not even
> touched by the buf_flush_batch() loop) page will have to wait for
> buf_flush_batch() to complete flushing all the pages it wants to flush.

But in this patch the new mutex is temporarily released once the block-to-be-flushed is fixed right before the flushing itself, where we don't have any block pointers acquired from the flush list. Thus at least the blocking is not very long, and what this mutex serializes are 1) single page relocations; 2) single page flushes.
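
For reference, the release/reacquire pattern from the buf_flush_page() hunk in the diff below, condensed:

buf_page_set_io_fix(bpage, BUF_IO_WRITE);

/* I/O fix prevents the block relocation, thus OK to release
temporarily */
if (flush_type == BUF_FLUSH_LIST) {
        mutex_exit(&zip_dirty_flush_mutex);
}
/* ... write out the page ... */
buf_flush_write_block_low(bpage);

if (flush_type == BUF_FLUSH_LIST) {
        mutex_enter(&zip_dirty_flush_mutex);
}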

> Are there any reasons to not follow the same logic in buf_flush_batch()
> wrt. flush_list_mutex as we do wrt. LRU_list_mutex? After all, the root
> cause of the problem is that we release flush_list_mutex too early and
> then read from the block while it's not protected either by
> flush_list_mutex or the block mutex. So let's just not release
> flush_list_mutex until we lock the block mutex a few lines down in the
> code, and that will also prevent relocations?

Unfortunately, the block and flush list mutex priority is wrong for this solution, which would obviously be the easiest one if it were possible. Block mutexes have to be locked before the flush list mutex. The LRU list mutex is different: it has to be locked before the block mutexes.
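
For reference, the relevant sync levels in this tree after the patch (a higher level must be latched earlier; the first define is the one this MP adds):

#define SYNC_BUF_ZIP_DIRTY_FLUSH 158
#define SYNC_BUF_LRU_LIST 157
#define SYNC_BUF_PAGE_HASH 156
#define SYNC_BUF_BLOCK 155

SYNC_BUF_FLUSH_LIST sits below SYNC_BUF_BLOCK, which is why a thread already holding the flush list mutex may not take a block mutex.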

> flush_list_mutex can then
> be released temporarily while buf_flush_try_neighbors() is in
> progress. It will likely result in a much smaller & easier to maintain
> patch, and even increase locking granularity as opposed to a separate
> global mutex approach?

Yes, it would, if not for the mutex priority.

The alternatives I see:
- Rework buffer pool mutexes so that flush list mutex comes before the block mutex. Not something I'd want to maintain for the upstream merges, and I haven't looked into what kind of issues such an implementation would raise.
- Overload (rename?) the LRU list mutex to take on the job of the new mutex too since it has the right priority for the job. This comes close to unsplitting the buffer pool mutex split, and has most of the disadvantages of the current solution.

Maybe I'm missing some other alternative or my reasoning above is wrong?

Thanks,

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Hi Laurynas,

On Fri, 15 Mar 2013 13:14:20 -0000, Laurynas Biveinis wrote:
>> 2) lock a global buffer pool mutex for the entire flushing loop in
>> buf_flush_batch() when we in fact only want to protect individual pages
>> on every loop iteration. Which means a buf_page_get_gen() call in
>> another thread trying to relocate a completely unrelated (e.g. not even
>> touched by the buf_flush_batch() loop) page will have to wait for
>> buf_flush_batch() to complete flushing all the pages it wants to flush.
>
> But in this patch the new mutex is temporarily released once the block-to-be-flushed is fixed right before the flushing itself, where we don't have any block pointers acquired from the flush list. Thus at least the blocking is not very long, and what this mutex serializes are 1) single page relocations; 2) single page flushes.
>

Right, but temporarily unlocking zip_dirty_flush_mutex only occurs after
locking another two global locks: page_hash_latch and the buffer pool
mutex. That is still bad, because whether it is "not very long" or not
depends on many other factors.
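
Condensed from the buf_flush_batch() hunk below, the window in question:

if (flush_type == BUF_FLUSH_LRU) {
        mutex_enter(&LRU_list_mutex);
} else {
        ut_ad(flush_type == BUF_FLUSH_LIST);
        mutex_enter(&zip_dirty_flush_mutex);
}

for (;;) {
        mutex_enter(&flush_list_mutex);
        /* pick bpage, read bpage->oldest_modification */
        mutex_exit(&flush_list_mutex);
        /* zip_dirty_flush_mutex is only released temporarily inside
        buf_flush_page(), after page_hash_latch and the block mutex
        have been taken and the I/O fix has been set */
}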

>> Are there any reasons to not follow the same logic in buf_flush_batch()
>> wrt. flush_list_mutex as we do wrt. LRU_list_mutex? After all, the root
>> cause of the problem is that we release flush_list_mutex too early and
>> then read from the block while it's not protected either by
>> flush_list_mutex or the block mutex. So let's just not release
>> flush_list_mutex until we lock the block mutex a few lines down in the
>> code, and that will also prevent relocations?
>
> Unfortunately, the block and flush list mutex priority is wrong for this solution, which would obviously be the easiest one if it were possible. Block mutexes have to be locked before the flush list mutex. The LRU list mutex is different: it has to be locked before the block mutexes.
>

I don't really see reasons for them to be different. The order relation
between LRU_list_mutex, flush_list_mutex and block_mutex priorities was
defined by our patches. And something tells me they have either been
defined arbitrarily, or to hide some implementation "drawback" :)

Seriously, what's the reason for them to be different with respect to the
block mutex priority?

>> flush_list_mutex can then
>> be released temporarily while buf_flush_try_neighbors() is in
>> progress. It will likely result in a much smaller & easier to maintain
>> patch, and even increase locking granularity as opposed to a separate
>> global mutex approach?
>
> Yes, it would, if not for the mutex priority.
>
> The alternatives I see:
> - Rework buffer pool mutexes so that flush list mutex comes before the block mutex. Not something I'd want to maintain for the upstream merges, and I haven't looked into what kind of issues such an implementation would raise.

It doesn't apply to upstream merges. I'd first try to answer the "why is
it different?" question and then see how feasible it is to change that.

> - Overload (rename?) the LRU list mutex to take on the job of the new mutex too since it has the right priority for the job. This comes close to unsplitting the buffer pool mutex split, and has most of the disadvantages of the current solution.
>
Yep, this doesn't ...


Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Hi Alexey -

> The order relation
> between LRU_list_mutex, flush_list_mutex and block_mutex priorities was
> defined by our patches.

flush_list_mutex and block_mutex are upstream mutexes, with their order defined in the upstream:

#define SYNC_BUF_BLOCK 146 /* Block mutex */
#define SYNC_BUF_FLUSH_LIST 145 /* Buffer flush list mutex */

LRU_list_mutex is of course ours.

> And something tells me they have either been
> defined arbitrarily, or to hide some implementation "drawback" :)
>
> Seriously, what's the reason for them to be different with respect to the
> block mutex priority?

Arguably it is the LRU_list_mutex that has the right (and, with some thought, I guess the only possible) priority WRT the block mutex, even though it differs from the flush list mutex one. It's more of an upstream "drawback" IMHO :)

> > - Rework buffer pool mutexes so that flush list mutex comes before the block
> mutex. Not something I'd want to maintain for the upstream merges, and I haven't
> looked into what kind of issues such an implementation would raise.
>
> It doesn't apply to upstream merges. I'd first try to answer the "why is
> it different?" question and then see how feasible it is to change that.

Since these two are upstream mutexes, I believe it does apply to the merges. I will spend some time looking into the priority reversal and its implications. Perhaps it's not very bad code-wise, but the merges add an additional risk here.

Thanks,
Laurynas

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Hi Laurynas,

On Fri, 15 Mar 2013 15:08:36 -0000, Laurynas Biveinis wrote:
> Hi Alexey -
>
>> The order relation
>> between LRU_list_mutex, flush_list_mutex and block_mutex priorities was
>> defined by our patches.
>
> flush_list_mutex and block_mutex are upstream mutexes, with their order defined in the upstream:
>
> #define SYNC_BUF_BLOCK 146 /* Block mutex */
> #define SYNC_BUF_FLUSH_LIST 145 /* Buffer flush list mutex */
>

That's from upstream 5.5. This is a 5.1 MP, and upstream 5.1 doesn't
have flush_list_mutex.

> LRU_list_mutex is of course ours.
>
>> And something tells me they have either been
>> defined arbitrarily, or to hide some implementation "drawback" :)
>>
>> Seriously, what's the reason for them to be different with respect to the
>> block mutex priority?
>
> Arguably it is the LRU_list_mutex that has the right (and, with some thought, I guess the only possible) priority WRT the block mutex, even though it differs from the flush list mutex one. It's more of an upstream "drawback" IMHO :)
>

In fact it doesn't matter. If flush_list_mutex has a higher priority
than block_mutex, that just means it can be used _instead_ of the
block_mutex. So we only want to acquire it for BUF_FLUSH_LRU, but
acquiring flush_list_mutex and holding it until all necessary bpage
fields are accessed looks possible.
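
A minimal sketch of that idea (hypothetical, never implemented in this MP; it assumes page relocation always happens under flush_list_mutex):

/* BUF_FLUSH_LIST batch: keep flush_list_mutex across the bpage
field reads instead of dropping it early */
mutex_enter(&flush_list_mutex);
bpage = UT_LIST_GET_LAST(buf_pool->flush_list);

if (bpage && bpage->oldest_modification < lsn_limit) {
        /* safe: buf_page_get_gen() must also take flush_list_mutex
        to relocate the page, so bpage cannot dangle here */
}
mutex_exit(&flush_list_mutex);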

The 5.5 implementation will be a bit more tricky, as you need a slightly
different implementation of buf_flush_page_and_try_neighbors() in
buf_flush_flush_list_batch() (or just don't use it and call
buf_flush_try_neighbors() directly from that function).

Revision history for this message
Alexey Kopytov (akopytov) :
review: Needs Resubmitting

Unmerged revisions

519. By Laurynas Biveinis

Fix bug 1086680 (Valgrind: free in buf_page_get_gen (Invalid read in
buf_flush_batch / buf_flush_list) + free in buf_page_get_gen (Invalid
read in buf_flush_page_and_try_neighbors)) in XtraDB.

The Valgrind errors and crashes happen because of a race condition
involving a dirty compressed page block for which there is no
uncompressed page image in the buffer pool.

First, a master thread (or possibly another query thread) does a flush
list flush, and for that acquires the flush list mutex, gets a pointer
to a page, and releases the flush list mutex.

At this point another thread starts reading the same page into the
buffer pool. Since it is a dirty compressed page, it allocates an
uncompressed page, relocates the page on the flush list, and frees the
compressed page descriptor.

At this point the flushing thread proceeds to use the pointer, which
is now dangling, causing the issue.

To fix this, a new buffer pool mutex is introduced that protects flush
list relocations and flush list page reads performed with no flush list
mutex taken. The mutex priority is between the page latch and the LRU
list mutex priorities.

Acquire this mutex in buf_page_get_gen() when we are about to relocate
the page on the flush list. Also acquire it in buf_flush_batch()
around the flushing. To obey the latching order, temporarily release
it in buf_flush_page() after an I/O fix is set on the page we are
about to flush, as the I/O fix also prevents the flush list relocation.
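
The racy interleaving, condensed from the unpatched code paths:

/* Flusher, buf_flush_batch(): */
mutex_enter(&flush_list_mutex);
bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
mutex_exit(&flush_list_mutex);

/* Reader, buf_page_get_gen(), runs here: allocates an uncompressed
block, relocates the page on the flush list and frees the
buf_page_t that bpage still points to */

/* Flusher continues with a dangling pointer: */
if (!bpage || bpage->oldest_modification >= lsn_limit) {
        /* invalid read reported by Valgrind */
}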

Preview Diff

=== modified file 'Percona-Server/storage/innodb_plugin/buf/buf0buf.c'
--- Percona-Server/storage/innodb_plugin/buf/buf0buf.c 2013-01-18 03:33:06 +0000
+++ Percona-Server/storage/innodb_plugin/buf/buf0buf.c 2013-01-23 10:49:24 +0000
@@ -294,6 +294,9 @@
 UNIV_INTERN mutex_t free_list_mutex;
 UNIV_INTERN mutex_t zip_free_mutex;
 UNIV_INTERN mutex_t zip_hash_mutex;
+/** mutex protecting the compressed page relocation and flush list page
+accesses after the flush list mutex release */
+UNIV_INTERN mutex_t zip_dirty_flush_mutex;
 /** mutex protecting the control blocks of compressed-only pages
 (of type buf_page_t, not buf_block_t) */
 UNIV_INTERN mutex_t buf_pool_zip_mutex;
@@ -980,6 +983,7 @@
 mutex_create(&free_list_mutex, SYNC_BUF_FREE_LIST);
 mutex_create(&zip_free_mutex, SYNC_BUF_ZIP_FREE);
 mutex_create(&zip_hash_mutex, SYNC_BUF_ZIP_HASH);
+ mutex_create(&zip_dirty_flush_mutex, SYNC_BUF_ZIP_DIRTY_FLUSH);

 mutex_create(&buf_pool_zip_mutex, SYNC_BUF_BLOCK);

@@ -1931,6 +1935,7 @@
 switch (buf_block_get_state(block)) {
 buf_page_t* bpage;
 ibool success;
+ ibool is_zip_dirty;

 case BUF_BLOCK_FILE_PAGE:
 if (block_mutex == &buf_pool_zip_mutex) {
@@ -1943,6 +1948,8 @@
 case BUF_BLOCK_ZIP_PAGE:
 case BUF_BLOCK_ZIP_DIRTY:
 ut_ad(block_mutex == &buf_pool_zip_mutex);
+ is_zip_dirty
+ = (buf_block_get_state(block) == BUF_BLOCK_ZIP_DIRTY);
 bpage = &block->page;
 /* Protect bpage->buf_fix_count. */
 /* Already proteced here. */
@@ -1974,6 +1981,13 @@
 block_mutex = &block->mutex;

 //buf_pool_mutex_enter();
+ if (is_zip_dirty) {
+ /* We have a dirty compressed block for this page, but
+ no uncompressed block. As we will replace the block on
+ the flush list, block any flushes from acquiring this
+ block. */
+ mutex_enter(&zip_dirty_flush_mutex);
+ }
 mutex_enter(&LRU_list_mutex);
 rw_lock_x_lock(&page_hash_latch);
 mutex_enter(block_mutex);
@@ -1997,6 +2011,9 @@
 }
 rw_lock_x_unlock(&page_hash_latch);
 mutex_exit(&LRU_list_mutex);
+ if (is_zip_dirty) {
+ mutex_exit(&zip_dirty_flush_mutex);
+ }
 goto loop2;
 }
 }
@@ -2005,11 +2022,14 @@

 if (UNIV_UNLIKELY
 (bpage->buf_fix_count
- || buf_page_get_io_fix(bpage) != BUF_IO_NONE)) {
+ || buf_page_get_io_fix(bpage) != BUF_IO_NONE
+ || is_zip_dirty != (buf_page_get_state(bpage)
+ == BUF_BLOCK_ZIP_DIRTY))) {

 mutex_exit(&buf_pool_zip_mutex);
- /* The block was buffer-fixed or I/O-fixed
- while buf_pool_mutex was not held by this thread.
+ /* The block was buffer-fixed or I/O-fixed or changed
+ from zip clean to zip dirty or back while the mutexes
+ were not held by this thread.
 Free the block that was allocated and try again.
 This should be extremely unlikely. */

@@ -2018,6 +2038,9 @@

 rw_lock_x_unlock(&page_hash_latch);
 mutex_exit(&LRU_list_mutex);
+ if (is_zip_dirty) {
+ mutex_exit(&zip_dirty_flush_mutex);
+ }
 goto wait_until_unfixed;
 }

@@ -2060,6 +2083,9 @@
 buf_unzip_LRU_add_block(block, FALSE);

 mutex_exit(&LRU_list_mutex);
+ if (is_zip_dirty) {
+ mutex_exit(&zip_dirty_flush_mutex);
+ }

 block->page.buf_fix_count = 1;
 buf_block_set_io_fix(block, BUF_IO_READ);

=== modified file 'Percona-Server/storage/innodb_plugin/buf/buf0flu.c'
--- Percona-Server/storage/innodb_plugin/buf/buf0flu.c 2011-11-24 02:00:47 +0000
+++ Percona-Server/storage/innodb_plugin/buf/buf0flu.c 2013-01-23 10:49:24 +0000
@@ -400,7 +400,9 @@
 //ut_a(buf_page_in_file(bpage));
 //ut_ad(buf_pool_mutex_own()); /*optimistic...*/
 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
- ut_ad(flush_type == BUF_FLUSH_LRU || BUF_FLUSH_LIST);
+ ut_ad(flush_type == BUF_FLUSH_LRU
+ || (flush_type == BUF_FLUSH_LIST
+ && mutex_own(&zip_dirty_flush_mutex)));

 if (buf_page_in_file(bpage) && bpage->oldest_modification != 0
 && buf_page_get_io_fix(bpage) == BUF_IO_NONE) {
@@ -499,6 +501,8 @@
 buf_page_t* prev_b = NULL;

 //ut_ad(buf_pool_mutex_own());
+ ut_ad((buf_page_get_state(bpage) != BUF_BLOCK_ZIP_DIRTY)
+ || mutex_own(&zip_dirty_flush_mutex));
 ut_ad(mutex_own(&flush_list_mutex));

 ut_ad(mutex_own(buf_page_get_mutex(bpage)));
@@ -1170,7 +1174,9 @@
 mutex_t* block_mutex;
 ibool is_uncompressed;

- ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_LIST);
+ ut_ad(flush_type == BUF_FLUSH_LRU
+ || (flush_type == BUF_FLUSH_LIST
+ && mutex_own(&zip_dirty_flush_mutex)));
 //ut_ad(buf_pool_mutex_own());
 #ifdef UNIV_SYNC_DEBUG
 ut_ad(rw_lock_own(&page_hash_latch, RW_LOCK_EX)
@@ -1188,6 +1194,12 @@

 buf_page_set_io_fix(bpage, BUF_IO_WRITE);

+ /* I/O fix prevents the block relocation, thus OK to release
+ temporarily */
+ if (flush_type == BUF_FLUSH_LIST) {
+ mutex_exit(&zip_dirty_flush_mutex);
+ }
+
 buf_page_set_flush_type(bpage, flush_type);

 if (buf_pool->n_flush[flush_type] == 0) {
@@ -1273,6 +1285,10 @@
 }
 #endif /* UNIV_DEBUG */
 buf_flush_write_block_low(bpage);
+
+ if (flush_type == BUF_FLUSH_LIST) {
+ mutex_enter(&zip_dirty_flush_mutex);
+ }
 }

 /***********************************************************//**
@@ -1293,7 +1309,9 @@
 ulint count = 0;
 ulint i;

- ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_LIST);
+ ut_ad(flush_type == BUF_FLUSH_LRU
+ || (flush_type == BUF_FLUSH_LIST
+ && mutex_own(&zip_dirty_flush_mutex)));

 if (UT_LIST_GET_LEN(buf_pool->LRU) < BUF_LRU_OLD_MIN_LEN || !flush_neighbors) {
 /* If there is little space, it is better not to flush any
@@ -1426,6 +1444,9 @@

 if (flush_type == BUF_FLUSH_LRU) {
 mutex_enter(&LRU_list_mutex);
+ } else {
+ ut_ad(flush_type == BUF_FLUSH_LIST);
+ mutex_enter(&zip_dirty_flush_mutex);
 }

 for (;;) {
@@ -1451,6 +1472,9 @@
 prev_bpage = UT_LIST_GET_PREV(flush_list, bpage);
 }
 mutex_exit(&flush_list_mutex);
+ /* After the flush list mutex is released, the block is
+ protected from flush list reallocation by
+ zip_dirty_flush_mutex. */
 if (!bpage
 || bpage->oldest_modification >= lsn_limit) {
 /* We have flushed enough */
@@ -1528,6 +1552,8 @@

 if (flush_type == BUF_FLUSH_LRU) {
 mutex_exit(&LRU_list_mutex);
+ } else {
+ mutex_exit(&zip_dirty_flush_mutex);
 }

 mutex_enter(&buf_pool_mutex);

=== modified file 'Percona-Server/storage/innodb_plugin/include/buf0buf.h'
--- Percona-Server/storage/innodb_plugin/include/buf0buf.h 2012-10-16 04:49:59 +0000
+++ Percona-Server/storage/innodb_plugin/include/buf0buf.h 2013-01-23 10:49:24 +0000
@@ -1606,6 +1606,9 @@
 extern mutex_t free_list_mutex;
 extern mutex_t zip_free_mutex;
 extern mutex_t zip_hash_mutex;
+/** mutex protecting the compressed page relocation and flush list page
+accesses after the flush list mutex release */
+extern mutex_t zip_dirty_flush_mutex;
 /** mutex protecting the control blocks of compressed-only pages
 (of type buf_page_t, not buf_block_t) */
 extern mutex_t buf_pool_zip_mutex;

=== modified file 'Percona-Server/storage/innodb_plugin/include/sync0sync.h'
--- Percona-Server/storage/innodb_plugin/include/sync0sync.h 2011-11-24 16:33:30 +0000
+++ Percona-Server/storage/innodb_plugin/include/sync0sync.h 2013-01-23 10:49:24 +0000
@@ -487,6 +487,7 @@
 SYNC_SEARCH_SYS, as memory allocation
 can call routines there! Otherwise
 the level is SYNC_MEM_HASH. */
+#define SYNC_BUF_ZIP_DIRTY_FLUSH 158
 #define SYNC_BUF_LRU_LIST 157
 #define SYNC_BUF_PAGE_HASH 156
 #define SYNC_BUF_BLOCK 155

=== modified file 'Percona-Server/storage/innodb_plugin/sync/sync0sync.c'
--- Percona-Server/storage/innodb_plugin/sync/sync0sync.c 2011-11-24 16:33:30 +0000
+++ Percona-Server/storage/innodb_plugin/sync/sync0sync.c 2013-01-23 10:49:24 +0000
@@ -1170,6 +1170,7 @@
 case SYNC_TRX_SYS_HEADER:
 case SYNC_FILE_FORMAT_TAG:
 case SYNC_DOUBLEWRITE:
+ case SYNC_BUF_ZIP_DIRTY_FLUSH:
 case SYNC_BUF_LRU_LIST:
 case SYNC_BUF_FLUSH_LIST:
 case SYNC_BUF_PAGE_HASH:
