Merge lp:~sergei.glushchenko/percona-xtrabackup/2.1-ST43302-xb-bug1340717 into lp:percona-xtrabackup/2.1

Proposed by Sergei Glushchenko
Status: Merged
Approved by: Alexey Kopytov
Approved revision: no longer in the source branch.
Merged at revision: 760
Proposed branch: lp:~sergei.glushchenko/percona-xtrabackup/2.1-ST43302-xb-bug1340717
Merge into: lp:percona-xtrabackup/2.1
Diff against target: 425 lines (+81/-43)
2 files modified
patches/innodb56.patch (+46/-43)
test/t/bug1340717.sh (+35/-0)
To merge this branch: bzr merge lp:~sergei.glushchenko/percona-xtrabackup/2.1-ST43302-xb-bug1340717
Reviewer Review Type Date Requested Status
Alexey Kopytov (community) Approve
Review via email: mp+227022@code.launchpad.net

Description of the change

    Bug 1340717: Crash on prepare partial backup with 5.6-based
    xtrabackup

    The root cause of the crash is that purge_sys is not initialized
    at the moment when xtrabackup trying to delete missing tablespace.

    Fix for upstream Bug#16593427 "ROLLBACK OF RECOVERED TRANSACTION
    CORRUPTS NON-ONLINE ADD INDEX" introduced function
    trx_resurrect_table_locks which being executed before purge_sys
    initialized. It loads some tablepsaces in data dictionary. Crash
    happens when tablespace is missing from partial backup.

    Fix is not to remove missing tablespaces from data dictionary
    until purge_sys is initialized. They will be eventually removed
    during subsequent operation.

To post a comment you must log in.
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

http://jenkins.percona.com/view/PXB%202.1/job/percona-xtrabackup-2.1-param/584/

lots of Jenkins failures are due to migration. hopefully Alexey will fix it and I re-run it.

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Sergei,

Some questions:

1. " Fix is not to remove missing tablespaces from data dictionary
    until purge_sys is initialized. They will be eventually removed
    during subsequent operation."

What are those "subsequent operations"? If it is supposed to be removed when a redo log record is applied, what happens when there are no corresponding redo log records to replay?

2. What about a test case? Why isn't this case covered by current tests? I thought we have some partial prepared cases in the suite.

review: Needs Information
Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Hello Alexey,

Let me explain the bug more verbosely. New code has been added to
MySQL 5.6 to resurrect table locks. This code opens some tables. It
being done before purge_sys initialized and here is where the problem
is.

trx_resurrect_table_locks parses undo log and resurrects locks for
tables in it. It is done while purge_sys is still down. In order to
catch this crash we need to take a backup at the moment when there are
active transactions. Since our test suite communicate with MySQL via
mysql cli tool we cannot produce uncommitted transactions. Only option
for us is to run mysql in parallel and implement some hacks for
synchronization.

Subsequent operations are dict_check_tablespaces_and_store_max_id
invoked from innobase_start_or_create_for_mysql.

Hope this makes more sense.

Revision history for this message
Alexey Kopytov (akopytov) wrote :

It does make sense, thanks.

I still think a test case is possible. Have a look at start_uncommitted_transaction() in undo_tablespaces.sh. Which again makes me wonder why that specific test case passes. Any ideas?

review: Needs Information
Revision history for this message
Alexey Kopytov (akopytov) wrote :

ping?

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

I'll try to reproduce this crash. Will likely have results by Monday.

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

I constructed the test case based on start_uncommitted_transactio(), but I added additional sleep to make sure that transaction really started. It crashes the xtrabackup on prepare as expected. Patch fixes the crash. Will post link to Jenkins run once JEN-248 is fixed.

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Are you saying that undo_tablespaces.sh is broken because it doesn't wait for the transaction to start, and that's why the test suite didn't trigger this bug before? If so, can you also fix the broken test?

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

This bug needs partial backup to crash. Recovery crashes when tablespace is not found and there is uncommitted transaction for this tablespace. Both conditions are essential. undo_tablespaces.sh does not do partial backup. Maybe it also doesn't wait for server to write something into undo tablespace. It takes backup immediately in parallel with delete statement. It is possible that delete work some time using only redo log and without flushing, thus not writing to undo tablespace.

Revision history for this message
Alexey Kopytov (akopytov) :
review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'patches/innodb56.patch'
2--- patches/innodb56.patch 2014-05-01 06:29:32 +0000
3+++ patches/innodb56.patch 2014-08-17 09:44:59 +0000
4@@ -112,16 +112,17 @@
5 bpage->offset, buf_page_get_zip_size(bpage),
6 --- a/storage/innobase/fil/fil0fil.cc
7 +++ b/storage/innobase/fil/fil0fil.cc
8-@@ -43,6 +43,8 @@
9+@@ -43,6 +43,9 @@
10 #include "dict0dict.h"
11 #include "page0page.h"
12 #include "page0zip.h"
13 +#include "pars0pars.h"
14 +#include "que0que.h"
15++#include "trx0purge.h"
16 #include "trx0sys.h"
17 #include "row0mysql.h"
18 #ifndef UNIV_HOTBACKUP
19-@@ -313,17 +315,15 @@
20+@@ -313,17 +316,15 @@
21
22 /** The tablespace memory cache. This variable is NULL before the module is
23 initialized. */
24@@ -145,7 +146,7 @@
25 #else /* __WIN__ */
26 # define fil_buffering_disabled(s) (0)
27 #endif /* __WIN__ */
28-@@ -791,6 +791,7 @@
29+@@ -791,6 +792,7 @@
30 "InnoDB: Error: tablespace id is %lu"
31 " in the data dictionary\n"
32 "InnoDB: but in file %s it is %lu!\n",
33@@ -153,7 +154,7 @@
34 space->id, node->name, space_id);
35
36 ut_error;
37-@@ -830,8 +831,9 @@
38+@@ -830,8 +832,9 @@
39 }
40
41 if (size_bytes >= 1024 * 1024) {
42@@ -165,7 +166,7 @@
43 }
44
45 if (!fsp_flags_is_compressed(flags)) {
46-@@ -1302,12 +1304,6 @@
47+@@ -1302,12 +1305,6 @@
48
49 if (!fil_system->space_id_reuse_warned) {
50 fil_system->space_id_reuse_warned = TRUE;
51@@ -178,7 +179,7 @@
52 }
53
54 fil_system->max_assigned_id = id;
55-@@ -2031,12 +2027,6 @@
56+@@ -2031,12 +2028,6 @@
57 contain sensible data */
58 ulint* flags, /*!< out: tablespace flags */
59 ulint* space_id, /*!< out: tablespace ID */
60@@ -191,7 +192,7 @@
61 lsn_t* min_flushed_lsn, /*!< out: min of flushed
62 lsn values in data files */
63 lsn_t* max_flushed_lsn) /*!< out: max of flushed
64-@@ -2061,7 +2051,7 @@
65+@@ -2061,7 +2052,7 @@
66
67 flushed_lsn = mach_read_from_8(page + FIL_PAGE_FILE_FLUSH_LSN);
68
69@@ -200,7 +201,7 @@
70 check_msg = fil_check_first_page(page);
71 }
72
73-@@ -2074,10 +2064,6 @@
74+@@ -2074,10 +2065,6 @@
75 if (!one_read_already) {
76 *min_flushed_lsn = flushed_lsn;
77 *max_flushed_lsn = flushed_lsn;
78@@ -211,7 +212,7 @@
79 return(NULL);
80 }
81
82-@@ -2087,14 +2073,6 @@
83+@@ -2087,14 +2074,6 @@
84 if (*max_flushed_lsn < flushed_lsn) {
85 *max_flushed_lsn = flushed_lsn;
86 }
87@@ -226,7 +227,7 @@
88
89 return(NULL);
90 }
91-@@ -2195,7 +2173,7 @@
92+@@ -2195,7 +2174,7 @@
93 mem_free(path);
94 }
95
96@@ -235,7 +236,7 @@
97 /********************************************************//**
98 Writes a log record about an .ibd file create/rename/delete. */
99 static
100-@@ -2423,7 +2401,7 @@
101+@@ -2423,7 +2402,7 @@
102 space_id, name, path, flags,
103 DICT_TF2_USE_TABLESPACE,
104 FIL_IBD_FILE_INITIAL_SIZE) != DB_SUCCESS) {
105@@ -244,7 +245,7 @@
106 }
107 }
108
109-@@ -2782,7 +2760,7 @@
110+@@ -2782,7 +2761,7 @@
111 }
112
113 if (err == DB_SUCCESS) {
114@@ -253,7 +254,7 @@
115 /* Write a log record about the deletion of the .ibd
116 file, so that ibbackup can replay it in the
117 --apply-log phase. We use a dummy mtr and the familiar
118-@@ -3137,7 +3115,7 @@
119+@@ -3137,7 +3116,7 @@
120
121 mutex_exit(&fil_system->mutex);
122
123@@ -262,7 +263,7 @@
124 if (success && !recv_recovery_on) {
125 mtr_t mtr;
126
127-@@ -3368,7 +3346,21 @@
128+@@ -3368,7 +3347,21 @@
129 ibool success;
130 /* TRUE if a table is created with CREATE TEMPORARY TABLE */
131 bool is_temp = !!(flags2 & DICT_TF2_TEMPORARY);
132@@ -284,7 +285,7 @@
133
134 ut_a(space_id > 0);
135 ut_ad(!srv_read_only_mode);
136-@@ -3521,7 +3513,7 @@
137+@@ -3521,7 +3514,7 @@
138 goto error_exit_1;
139 }
140
141@@ -293,7 +294,7 @@
142 {
143 mtr_t mtr;
144 ulint mlog_file_flag = 0;
145-@@ -3599,6 +3591,97 @@
146+@@ -3599,6 +3592,97 @@
147 (ulong) expected_id, (ulong) expected_flags);
148 }
149
150@@ -391,7 +392,7 @@
151 /********************************************************************//**
152 Tries to open a single-table tablespace and optionally checks that the
153 space id in it is correct. If this does not succeed, print an error message
154-@@ -3664,6 +3747,9 @@
155+@@ -3664,6 +3748,9 @@
156 in the default location. If it is remote, it should not be here. */
157 def.filepath = fil_make_ibd_name(tablename, false);
158
159@@ -401,7 +402,7 @@
160 /* The path_in was read from SYS_DATAFILES. */
161 if (path_in) {
162 if (strcmp(def.filepath, path_in)) {
163-@@ -3711,6 +3797,7 @@
164+@@ -3711,6 +3798,7 @@
165 tablespaces_found++;
166 }
167 }
168@@ -409,7 +410,7 @@
169
170 /* Always look for a file at the default location. */
171 ut_a(def.filepath);
172-@@ -3732,9 +3819,6 @@
173+@@ -3732,9 +3820,6 @@
174 if (def.success) {
175 def.check_msg = fil_read_first_page(
176 def.file, FALSE, &def.flags, &def.id,
177@@ -419,7 +420,7 @@
178 &def.lsn, &def.lsn);
179 def.valid = !def.check_msg;
180
181-@@ -3757,9 +3841,6 @@
182+@@ -3757,9 +3842,6 @@
183 if (remote.success) {
184 remote.check_msg = fil_read_first_page(
185 remote.file, FALSE, &remote.flags, &remote.id,
186@@ -429,7 +430,7 @@
187 &remote.lsn, &remote.lsn);
188 remote.valid = !remote.check_msg;
189
190-@@ -3783,9 +3864,6 @@
191+@@ -3783,9 +3865,6 @@
192 if (dict.success) {
193 dict.check_msg = fil_read_first_page(
194 dict.file, FALSE, &dict.flags, &dict.id,
195@@ -439,7 +440,7 @@
196 &dict.lsn, &dict.lsn);
197 dict.valid = !dict.check_msg;
198
199-@@ -3810,11 +3888,15 @@
200+@@ -3810,11 +3889,17 @@
201 /* The following call prints an error message */
202 os_file_get_last_error(true);
203
204@@ -452,11 +453,13 @@
205 + ib_logf(IB_LOG_LEVEL_WARN,
206 + "It will be removed from the data dictionary.");
207 +
208-+ fil_remove_invalid_table_from_data_dict(tablename);
209++ if (purge_sys) {
210++ fil_remove_invalid_table_from_data_dict(tablename);
211++ }
212
213 err = DB_CORRUPTION;
214
215-@@ -4187,9 +4269,6 @@
216+@@ -4187,9 +4272,6 @@
217 fsp->success = TRUE;
218 if (const char* check_msg = fil_read_first_page(
219 fsp->file, FALSE, &fsp->flags, &fsp->id,
220@@ -466,7 +469,7 @@
221 &fsp->lsn, &fsp->lsn)) {
222 ib_logf(IB_LOG_LEVEL_ERROR,
223 "%s in tablespace %s (table %s)",
224-@@ -4266,6 +4345,9 @@
225+@@ -4266,6 +4348,9 @@
226 #ifdef UNIV_HOTBACKUP
227 fil_space_t* space;
228 #endif
229@@ -476,7 +479,7 @@
230
231 memset(&def, 0, sizeof(def));
232 memset(&remote, 0, sizeof(remote));
233-@@ -4310,6 +4392,9 @@
234+@@ -4310,6 +4395,9 @@
235 # endif /* !UNIV_HOTBACKUP */
236 #endif
237
238@@ -486,7 +489,7 @@
239 /* Check for a link file which locates a remote tablespace. */
240 remote.success = fil_open_linked_file(
241 tablename, &remote.filepath, &remote.file);
242-@@ -4320,8 +4405,20 @@
243+@@ -4320,8 +4408,20 @@
244 if (!remote.success) {
245 os_file_close(remote.file);
246 mem_free(remote.filepath);
247@@ -507,7 +510,7 @@
248
249
250 /* Try to open the tablespace in the datadir. */
251-@@ -4334,6 +4431,17 @@
252+@@ -4334,6 +4434,17 @@
253 fil_validate_single_table_tablespace(tablename, &def);
254 if (!def.success) {
255 os_file_close(def.file);
256@@ -525,7 +528,7 @@
257 }
258 }
259
260-@@ -4418,7 +4526,7 @@
261+@@ -4418,7 +4529,7 @@
262 /* At this point, only one tablespace is open */
263 ut_a(def.success == !remote.success);
264
265@@ -534,7 +537,7 @@
266
267 /* Get and test the file size. */
268 size = os_file_get_size(fsp->file);
269-@@ -4437,9 +4545,9 @@
270+@@ -4437,9 +4548,9 @@
271
272 /* Every .ibd file is created >= 4 pages in size. Smaller files
273 cannot be ok. */
274@@ -546,7 +549,7 @@
275 ib_logf(IB_LOG_LEVEL_ERROR,
276 "The size of single-table tablespace file %s "
277 "is only " UINT64PF ", should be at least %lu!",
278-@@ -4520,7 +4628,7 @@
279+@@ -4520,7 +4631,7 @@
280 }
281 mutex_exit(&fil_system->mutex);
282 #endif /* UNIV_HOTBACKUP */
283@@ -555,7 +558,7 @@
284 tablename, fsp->id, fsp->flags, FIL_TABLESPACE);
285
286 if (!file_space_create_success) {
287-@@ -4547,13 +4655,55 @@
288+@@ -4547,13 +4658,55 @@
289 }
290
291 func_exit:
292@@ -615,7 +618,7 @@
293 mem_free(tablename);
294 if (remote.success) {
295 mem_free(remote.filepath);
296-@@ -4567,7 +4717,7 @@
297+@@ -4567,7 +4720,7 @@
298 idea is to read as much good data as we can and jump over bad data.
299 @return 0 if ok, -1 if error even after the retries, 1 if at the end
300 of the directory */
301@@ -624,7 +627,7 @@
302 int
303 fil_file_readdir_next_file(
304 /*=======================*/
305-@@ -4607,7 +4757,7 @@
306+@@ -4607,7 +4760,7 @@
307 @return DB_SUCCESS or error number */
308 UNIV_INTERN
309 dberr_t
310@@ -633,7 +636,7 @@
311 /*===================================*/
312 {
313 int ret;
314-@@ -4663,7 +4813,9 @@
315+@@ -4663,7 +4816,9 @@
316 "%s/%s", fil_path_to_mysql_datadir, dbinfo.name);
317 srv_normalize_path_for_win(dbpath);
318
319@@ -644,7 +647,7 @@
320
321 if (dbdir != NULL) {
322
323-@@ -4684,9 +4836,15 @@
324+@@ -4684,9 +4839,15 @@
325 && (0 == strcmp(fileinfo.name
326 + strlen(fileinfo.name) - 4,
327 ".ibd")
328@@ -662,7 +665,7 @@
329 /* The name ends in .ibd or .isl;
330 try opening the file */
331 fil_load_single_table_tablespace(
332-@@ -4842,6 +5000,7 @@
333+@@ -4842,6 +5003,7 @@
334 {
335 fil_space_t* fnamespace;
336 fil_space_t* space;
337@@ -670,7 +673,7 @@
338
339 ut_ad(fil_system);
340
341-@@ -4919,6 +5078,10 @@
342+@@ -4919,6 +5081,10 @@
343 if (fnamespace == NULL) {
344 if (print_error_if_does_not_exist) {
345 fil_report_missing_tablespace(name, id);
346@@ -681,18 +684,18 @@
347 }
348 } else {
349 ut_print_timestamp(stderr);
350-@@ -4942,6 +5105,10 @@
351+@@ -4942,6 +5108,10 @@
352
353 mutex_exit(&fil_system->mutex);
354
355-+ if (remove_from_data_dict) {
356++ if (remove_from_data_dict && purge_sys) {
357 + fil_remove_invalid_table_from_data_dict(name);
358 + }
359 +
360 return(FALSE);
361 }
362
363-@@ -5575,6 +5742,16 @@
364+@@ -5575,6 +5745,16 @@
365
366 ut_ad(mode != OS_AIO_IBUF || space->purpose == FIL_TABLESPACE);
367
368@@ -709,7 +712,7 @@
369 node = UT_LIST_GET_FIRST(space->chain);
370
371 for (;;) {
372-@@ -6502,6 +6679,7 @@
373+@@ -6502,6 +6682,7 @@
374 return(err);
375 }
376
377@@ -717,7 +720,7 @@
378 /****************************************************************//**
379 Generate redo logs for swapping two .ibd files */
380 UNIV_INTERN
381-@@ -6528,3 +6706,4 @@
382+@@ -6528,3 +6709,4 @@
383 0, 0, new_name, old_name, mtr);
384 }
385 }
386
387=== added file 'test/t/bug1340717.sh'
388--- test/t/bug1340717.sh 1970-01-01 00:00:00 +0000
389+++ test/t/bug1340717.sh 2014-08-17 09:44:59 +0000
390@@ -0,0 +1,35 @@
391+########################################################################
392+# Test resurrect table locks
393+########################################################################
394+
395+. inc/common.sh
396+
397+require_server_version_higher_than 5.6.0
398+
399+################################################################################
400+# Start an uncommitted transaction pause "indefinitely" to keep the connection
401+# open
402+################################################################################
403+function start_uncomitted_transaction()
404+{
405+ run_cmd $MYSQL $MYSQL_ARGS sakila <<EOF
406+START TRANSACTION;
407+DELETE FROM payment;
408+SELECT SLEEP(10000);
409+EOF
410+}
411+
412+start_server
413+load_sakila
414+
415+start_uncomitted_transaction &
416+job_master=$!
417+
418+sleep 5;
419+
420+innobackupex --no-timestamp --include="sakila.actor" $topdir/backup
421+
422+kill -SIGKILL $job_master
423+stop_server
424+
425+innobackupex --apply-log $topdir/backup

Subscribers

People subscribed via source and target branches