Merge lp:~vlad-lesin/percona-server/5.6-logical-readahead into lp:percona-server/5.6

Proposed by Vlad Lesin on 2014-04-23
Status: Rejected
Rejected by: Laurynas Biveinis on 2015-02-25
Proposed branch: lp:~vlad-lesin/percona-server/5.6-logical-readahead
Merge into: lp:percona-server/5.6
Diff against target: 2278 lines (+1520/-71)
40 files modified
client/client_priv.h (+4/-1)
client/mysqldump.c (+38/-0)
mysql-test/include/have_native_aio.inc (+6/-0)
mysql-test/suite/innodb/r/innodb_logical_read_ahead.result (+47/-0)
mysql-test/suite/innodb/r/innodb_logical_read_ahead_correctness.result (+88/-0)
mysql-test/suite/innodb/r/innodb_merge_read.result (+30/-0)
mysql-test/suite/innodb/t/innodb_logical_read_ahead-master.opt (+2/-0)
mysql-test/suite/innodb/t/innodb_logical_read_ahead.test (+55/-0)
mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness-master.opt (+2/-0)
mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness.test (+107/-0)
mysql-test/suite/innodb/t/innodb_merge_read-master.opt (+1/-0)
mysql-test/suite/innodb/t/innodb_merge_read.test (+42/-0)
mysql-test/suite/sys_vars/r/innodb_lra_n_node_recs_before_sleep_basic.result (+28/-0)
mysql-test/suite/sys_vars/r/innodb_lra_size_basic.result (+28/-0)
mysql-test/suite/sys_vars/r/innodb_lra_sleep_basic.result (+30/-0)
mysql-test/suite/sys_vars/r/innodb_lra_test_basic.result (+8/-0)
mysql-test/suite/sys_vars/t/innodb_lra_n_node_recs_before_sleep_basic.test (+14/-0)
mysql-test/suite/sys_vars/t/innodb_lra_size_basic-master.opt (+1/-0)
mysql-test/suite/sys_vars/t/innodb_lra_size_basic.test (+14/-0)
mysql-test/suite/sys_vars/t/innodb_lra_sleep_basic.test (+14/-0)
mysql-test/suite/sys_vars/t/innodb_lra_test_basic-master.opt (+1/-0)
mysql-test/suite/sys_vars/t/innodb_lra_test_basic.test (+8/-0)
storage/innobase/btr/btr0cur.cc (+1/-0)
storage/innobase/btr/btr0pcur.cc (+26/-18)
storage/innobase/buf/buf0rea.cc (+33/-10)
storage/innobase/fil/fil0fil.cc (+8/-3)
storage/innobase/handler/ha_innodb.cc (+61/-0)
storage/innobase/include/btr0pcur.h (+3/-2)
storage/innobase/include/btr0pcur.ic (+51/-17)
storage/innobase/include/buf0rea.h (+37/-0)
storage/innobase/include/fil0fil.h (+5/-2)
storage/innobase/include/os0file.h (+24/-6)
storage/innobase/include/os0file.ic (+6/-2)
storage/innobase/include/srv0srv.h (+42/-0)
storage/innobase/include/trx0trx.h (+96/-1)
storage/innobase/os/os0file.cc (+99/-9)
storage/innobase/row/row0purge.cc (+9/-0)
storage/innobase/row/row0sel.cc (+319/-0)
storage/innobase/srv/srv0srv.cc (+9/-0)
storage/innobase/trx/trx0trx.cc (+123/-0)
To merge this branch: bzr merge lp:~vlad-lesin/percona-server/5.6-logical-readahead
Reviewer                       Review Type  Date Requested  Status
Laurynas Biveinis (community)               2014-04-23      Resubmit on 2015-02-25
Review via email: mp+216857@code.launchpad.net

Description of the change

Port of the logical read-ahead feature from the Facebook branch. The original patches are:

https://github.com/facebook/mysql-5.6/commit/f9d1a5332eb2c82c028638d3b93b5a3592a69ffa
https://github.com/facebook/mysql-5.6/commit/f8e361952612d00979f7cf744f487e48b15cb5a6
https://github.com/facebook/mysql-5.6/commit/f69a4ea522bce24e4cdcc7696d5fad29587cf87a

The main difference from the original implementation is that batched submission of multiple I/O requests is enabled only for logical read-ahead in this branch, whereas the original enables it by default for all operations. See the commit comments for the explanation.

Jenkins testing:
http://jenkins.percona.com/view/PS 5.6/job/percona-server-5.6-param/589

This is not a full review, but addressing this bit can happen in parallel with the rest of the review: the MP needs a blueprint, or even several blueprints, corresponding to each commit (the commits IMHO are well-split). The blueprints must be self-contained.

review: Needs Information

Review of commit 576. I think it would be easier if it had its own MP (and probably the other two commits too).

    Code

    - s/ibool/bool/g (in all three commits, where applicable)
    - I'd add defensive asserts in os_aio_free that count[x] == 0 for
      all arrays. This would catch any unsubmitted buffered read
      request, and any request buffered on a non-read array.
    - s/ut_malloc+memset(0)/calloc in os_aio_array_create
    - Why does buf_read_recv_pages call
      os_aio_linux_dispatch_read_array_submit? It does not appear to
      submit any buffered requests.
    - fil_extend_space_to_desired_size calling os_aio with
      should_buffer == TRUE is a (benign) typo?
    - The abstraction level for buffered request submitting seems to
      be off. I'd rename os_aio_linux_dispatch_read_array_submit to
      os_aio_submit_buffered_requests and push #ifdef LINUX_NATIVE_AIO
      down to it.
    - buf_read_page_low header comment @return tag: edit to "1 if
      read request is issued or buffered" to clarify that the function
      returns the same for both buffered and immediately issued read
      requests.
    - s/read/ready in the buf_read_page_low should_buffer
      comment. (https://github.com/facebook/mysql-5.6/commit/b1b4c7977136d57170f8bf500aaedba740b1c333)
    - Make sure the patch does not break the build with performance
      schema configured out
    - os_aio_linux_dispatch_read_array_submit would need a /*===*/
      comment, os_aio_func should_buffer arg declaration is misaligned
      </pedantic>
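
A minimal sketch of the calloc and defensive-assert suggestions above, in plain C. The names (`slot_array`, `slot_array_create`, `slot_array_free`, `n_buffered`) are hypothetical stand-ins, not the actual os_aio_array_create/os_aio_free code; the sketch only assumes that each array tracks a count of buffered, not yet submitted requests.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an AIO array; not the InnoDB structure. */
struct slot_array {
  size_t n_slots;     /* number of slots in the array */
  size_t n_buffered;  /* requests buffered but not yet submitted */
  int   *slots;       /* per-slot state, all zero after creation */
};

/* calloc() returns zeroed memory, replacing a malloc + memset(0) pair. */
static struct slot_array *slot_array_create(size_t n_slots)
{
  struct slot_array *arr;

  assert(n_slots > 0);
  arr = calloc(1, sizeof *arr);
  if (arr == NULL)
    return NULL;
  arr->slots = calloc(n_slots, sizeof *arr->slots);
  if (arr->slots == NULL) {
    free(arr);
    return NULL;
  }
  arr->n_slots = n_slots;
  return arr;
}

/* Defensive check at teardown: a nonzero count means some buffered
   request was never submitted. */
static void slot_array_free(struct slot_array *arr)
{
  if (arr == NULL)
    return;
  assert(arr->n_buffered == 0);
  free(arr->slots);
  free(arr);
}
```

In a debug build the assert fires at shutdown if a caller buffered a request and never flushed it, which is the class of bug the review item is meant to catch.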

    Testcase

    - --disable_warnings/DROP TABLE IF EXISTS/--enable_warnings idiom
      is obsolete and should be removed (in all three commits, where
      applicable).
    - innodb_merge_read.test needs --source
      include/have_innodb_16k.inc, as the number of readahead requests
      is likely to differ for other page sizes.
    - innodb_merge_read-master.opt is redundant, as
      --innodb-use-native-aio=1 is the default. The source
      include/have_native_aio.inc check in the testcase itself is
      enough.
    - Add a newline at the end of have_native_aio.inc.
    - I'd extend the innodb_merge_read testcase to check that linear
      read-ahead read buffering works for compressed tablespaces too
      (there is code if (zip_size) then read(... should_buffer) else
      read(... should_buffer)). That would also require moving the
      testcase to the innodb_zip suite.
    - (wishlist) Consider submitting
      https://github.com/webscalesql/webscalesql-5.6/commit/32e49b4d66eaa392d9f06198596db4b16e8b8d04
      for Percona Server so that we exercise AIO with MTR --mem.

review: Needs Fixing

Work on this MP must continue on GitHub.

review: Resubmit

Unmerged revisions

578. By Vlad Lesin on 2014-04-22

Add mysqldump support for logical read ahead

Summary:
Adds options to mysqldump:
 --lra-size=X
 --lra-sleep=X
 --lra-n-node-recs-before-sleep=X

These just inject SET statements to set these session variables.
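
The mapping from the three options to SET statements can be sketched as follows. `lra_build_set_statements` and `LRA_STMT_LEN` are hypothetical names used only for illustration; the actual patch formats each statement with my_snprintf() and sends it through mysql_query_with_error_report() inside connect_to_db(). The sketch mirrors the nesting in the diff: the sleep and record-count statements are only produced when the size option is nonzero.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LRA_STMT_LEN 128

/* Hypothetical helper, not part of mysqldump. Fills stmts[] with the SET
   statements the three options would produce and returns their count. */
static int lra_build_set_statements(unsigned long lra_size,
                                    unsigned long lra_sleep,
                                    unsigned long lra_n_node_recs,
                                    char stmts[3][LRA_STMT_LEN])
{
  int n = 0;

  assert(stmts != NULL);
  /* Clear the output so unused entries are empty strings. */
  memset(stmts, 0, 3 * LRA_STMT_LEN);

  /* As in the diff, nothing is sent unless lra_size is nonzero. */
  if (lra_size == 0)
    return 0;

  snprintf(stmts[n++], LRA_STMT_LEN, "SET innodb_lra_size=%lu", lra_size);
  if (lra_sleep != 0)
    snprintf(stmts[n++], LRA_STMT_LEN, "SET innodb_lra_sleep=%lu", lra_sleep);
  if (lra_n_node_recs != 0)
    snprintf(stmts[n++], LRA_STMT_LEN,
             "SET innodb_lra_n_node_recs_before_sleep=%lu", lra_n_node_recs);
  return n;
}
```

For example, sizes 1024/100/128 yield three statements, while a size of 0 yields none regardless of the other two values.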

The original implementation is here:
https://github.com/facebook/mysql-5.6/commit/f69a4ea522bce24e4cdcc7696d5fad29587cf87a

577. By Vlad Lesin on 2014-04-22

When the session variable innodb_lra_size is set to N, we issue async
read requests for the next M logical pages where the total size of the M
pages on disk is N megabytes. The max allowed value of innodb_lra_size
is 16384, which corresponds to prefetching 16GB of data. We may choose
to use smaller values in production.
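
The relation between N and M can be made concrete with a small calculation. This is a sketch that assumes uncompressed pages, so a page occupies exactly its page size on disk; `lra_pages_for_budget` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>

/* Hypothetical helper: how many pages of page_size_bytes fit into a
   budget of lra_size_mb megabytes. Assumes uncompressed pages, so the
   on-disk size of a page equals page_size_bytes. */
static unsigned long long lra_pages_for_budget(unsigned long lra_size_mb,
                                               unsigned long page_size_bytes)
{
  assert(page_size_bytes > 0);
  return ((unsigned long long) lra_size_mb * 1024ULL * 1024ULL)
         / page_size_bytes;
}
```

With the default 16KB InnoDB page size, innodb_lra_size=1 covers 64 pages, and the maximum of 16384 covers 1,048,576 pages, which is the 16GB mentioned above.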

The original implementation can be found here:
https://github.com/facebook/mysql-5.6/commit/f8e361952612d00979f7cf744f487e48b15cb5a6

This implementation does not contain code for flashcache.

576. By Vlad Lesin on 2014-04-21

Merge aio page read requests

Summary:
Tries to submit multiple aio page read requests together to improve read
performance.

The original code and description can be found here:
https://github.com/facebook/mysql-5.6/commit/f9d1a5332eb2c82c028638d3b93b5a3592a69ffa

The difference from the original implementation is in the default. In the
original, the fil_io() macro invokes the _fil_io() function with I/O buffering
enabled by default, which can cause errors in code that waits for I/O
completion right after calling fil_io().

For example, log_archive_do() waits on log_sys->archive_lock for the I/O to
finish, but the lock is never released: the I/Os were buffered and never
submitted, so io_handler_thread() has no completions to process in
fil_aio_wait(). Other callers may have the same problem, so I/O buffering
is disabled by default and enabled only in the logical read-ahead code.

Preview Diff

=== modified file 'client/client_priv.h'
--- client/client_priv.h 2014-02-25 17:05:01 +0000
+++ client/client_priv.h 2014-04-23 10:58:56 +0000
@@ -106,7 +106,10 @@
   OPT_INNODB_OPTIMIZE_KEYS,
   OPT_REWRITE_DB,
   OPT_LOCK_FOR_BACKUP,
-  OPT_MAX_CLIENT_OPTION
+  OPT_MAX_CLIENT_OPTION,
+  OPT_LRA_SIZE,
+  OPT_LRA_SLEEP,
+  OPT_LRA_N_NODE_RECS_BEFORE_SLEEP
 };

 /**

=== modified file 'client/mysqldump.c'
--- client/mysqldump.c 2014-03-03 17:51:33 +0000
+++ client/mysqldump.c 2014-04-23 10:58:56 +0000
@@ -133,6 +133,9 @@
 /* Server supports character_set_results session variable? */
 static my_bool server_supports_switching_charsets= TRUE;
 static ulong opt_compatible_mode= 0;
+static ulong opt_lra_size = 0;
+static ulong opt_lra_sleep = 0;
+static ulong opt_lra_n_node_recs_before_sleep = 0;
 #define MYSQL_OPT_MASTER_DATA_EFFECTIVE_SQL 1
 #define MYSQL_OPT_MASTER_DATA_COMMENTED_SQL 2
 #define MYSQL_OPT_SLAVE_DATA_EFFECTIVE_SQL 1
@@ -567,6 +570,18 @@
    "Default authentication client-side plugin to use.",
    &opt_default_auth, &opt_default_auth, 0,
    GET_STR, REQUIRED_ARG, 0, 0, 0, 0, 0, 0},
+  {"lra_size", OPT_LRA_SIZE,
+   "Set innodb_lra_size for the session of this dump.",
+   &opt_lra_size, &opt_lra_size, 0,
+   GET_ULONG, REQUIRED_ARG, 0, 0, 16384, 0, 0, 0},
+  {"lra_sleep", OPT_LRA_SLEEP,
+   "Set innodb_lra_sleep for the session of this dump.",
+   &opt_lra_sleep, &opt_lra_sleep, 0,
+   GET_ULONG, REQUIRED_ARG, 0, 0, 1000, 0, 0, 0},
+  {"lra_n_node_recs_before_sleep", OPT_LRA_N_NODE_RECS_BEFORE_SLEEP,
+   "Set innodb_lra_n_node_recs_before_sleep for the session of this dump.",
+   &opt_lra_n_node_recs_before_sleep, &opt_lra_n_node_recs_before_sleep, 0,
+   GET_ULONG, REQUIRED_ARG, 1024, 128, ULONG_MAX, 0, 0, 0},
   {0, 0, 0, 0, 0, 0, GET_NO_ARG, NO_ARG, 0, 0, 0, 0, 0, 0}
 };

@@ -1611,6 +1626,29 @@
     if (mysql_query_with_error_report(mysql, 0, buff))
       DBUG_RETURN(1);
   }
+
+  if (opt_lra_size)
+  {
+    my_snprintf(buff, sizeof(buff), "SET innodb_lra_size=%lu", opt_lra_size);
+    if (mysql_query_with_error_report(mysql, 0, buff))
+      DBUG_RETURN(1);
+    if (opt_lra_sleep)
+    {
+      my_snprintf(buff, sizeof(buff), "SET innodb_lra_sleep=%lu",
+                  opt_lra_sleep);
+      if (mysql_query_with_error_report(mysql, 0, buff))
+        DBUG_RETURN(1);
+    }
+    if (opt_lra_n_node_recs_before_sleep)
+    {
+      my_snprintf(buff, sizeof(buff),
+                  "SET innodb_lra_n_node_recs_before_sleep=%lu",
+                  opt_lra_n_node_recs_before_sleep);
+      if (mysql_query_with_error_report(mysql, 0, buff))
+        DBUG_RETURN(1);
+    }
+  }
+
   DBUG_RETURN(0);
 } /* connect_to_db */


=== added file 'mysql-test/include/have_native_aio.inc'
--- mysql-test/include/have_native_aio.inc 1970-01-01 00:00:00 +0000
+++ mysql-test/include/have_native_aio.inc 2014-04-23 10:58:56 +0000
@@ -0,0 +1,6 @@
+--disable_query_log
+if (`select @@global.innodb_use_native_aio != 1`)
+{
+ --skip native AIO is not in use
+}
+--enable_query_log
\ No newline at end of file

=== added file 'mysql-test/suite/innodb/r/innodb_logical_read_ahead.result'
--- mysql-test/suite/innodb/r/innodb_logical_read_ahead.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/r/innodb_logical_read_ahead.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,47 @@
+DROP TABLE if exists t1;
+CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+INSERT INTO t1 VALUES (0, REPEAT('a',256));
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+show global status like "innodb_buffered_aio_submitted";
+Variable_name Value
+Innodb_buffered_aio_submitted 0
+show global status like "innodb_logical_read_ahead_misses";
+Variable_name Value
+Innodb_logical_read_ahead_misses 0
+show global status like "innodb_logical_read_ahead_prefetched";
+Variable_name Value
+Innodb_logical_read_ahead_prefetched 0
+show global status like "innodb_logical_read_ahead_in_buf_pool";
+Variable_name Value
+Innodb_logical_read_ahead_in_buf_pool 0
+SET SESSION innodb_lra_size=1024;
+SET SESSION innodb_lra_n_node_recs_before_sleep=128;
+SET SESSION innodb_lra_sleep=100;
+checksum table t1;
+Table Checksum
+test.t1 2920207201
+show global status like "innodb_logical_read_ahead_misses";
+Variable_name Value
+Innodb_logical_read_ahead_misses 0
+select variable_value > 1000 from information_schema.global_status where variable_name="innodb_logical_read_ahead_prefetched";
+variable_value > 1000
+1
+select variable_value < 100 from information_schema.global_status where variable_name="innodb_logical_read_ahead_in_buf_pool";
+variable_value < 100
+1
+DROP TABLE t1;

=== added file 'mysql-test/suite/innodb/r/innodb_logical_read_ahead_correctness.result'
--- mysql-test/suite/innodb/r/innodb_logical_read_ahead_correctness.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/r/innodb_logical_read_ahead_correctness.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,88 @@
+DROP TABLE IF EXISTS t1_small;
+DROP TABLE IF EXISTS t1;
+DROP TABLE IF EXISTS t1_lra;
+DROP TABLE IF EXISTS t2_small;
+DROP TABLE IF EXISTS t3_small;
+CREATE TABLE t1_small(a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+SET SESSION innodb_lra_size=1;
+SELECT * FROM t1_small;
+a b
+SET SESSION innodb_lra_size=0;
+INSERT INTO t1_small(b) VALUES(REPEAT('a',256));
+SET SESSION innodb_lra_size=1;
+SELECT a, LENGTH(b) FROM t1_small;
+a LENGTH(b)
+1 256
+SET SESSION innodb_lra_size=0;
+DROP TABLE t1_small;
+CREATE TABLE `t2_small` (
+`id1` bigint(20) unsigned NOT NULL DEFAULT '0',
+`time` bigint(20) unsigned NOT NULL DEFAULT '0',
+`id2` bigint(20) unsigned NOT NULL DEFAULT '0',
+`id2_type` int(10) unsigned DEFAULT NULL,
+`data` text,
+`status` tinyint(3) unsigned DEFAULT NULL,
+PRIMARY KEY (`id1`,`time`,`id2`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1;
+SET SESSION innodb_lra_size=1;
+SELECT * FROM t2_small;
+id1 time id2 id2_type data status
+DROP TABLE t2_small;
+CREATE TABLE `t3_small` (
+`id` bigint(20) NOT NULL,
+`a` text,
+`b` text,
+`c` text,
+`d` text,
+`e` text,
+`f` text,
+`g` text,
+PRIMARY KEY (`id`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1;
+SET SESSION innodb_lra_size=1;
+SELECT * FROM t3_small;
+id a b c d e f g
+DROP TABLE t3_small;
+CREATE TABLE t1(a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+CREATE TABLE t1_lra(a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+INSERT INTO t1 VALUES (0, REPEAT('a',256));
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1_lra SELECT * FROM t1;
+CHECKSUM TABLE t1;
+Table Checksum
+test.t1 2793042655
+SET SESSION innodb_lra_size=1;
+SET SESSION innodb_lra_n_node_recs_before_sleep=128;
+SET SESSION innodb_lra_sleep=100;
+CHECKSUM TABLE t1_lra;
+Table Checksum
+test.t1_lra 2793042655
+DELETE FROM t1 WHERE a >= 5480 AND a < 5520;
+DELETE FROM t1 WHERE a >= 5520 AND a < 5550;
+CHECKSUM TABLE t1;
+Table Checksum
+test.t1 1005864202
+SET GLOBAL innodb_lra_test=1;
+DELETE FROM t1_lra WHERE a >= 5480 AND a < 5520;
+DELETE FROM t1_lra WHERE a >= 5520 AND a < 5550;
+SET SESSION innodb_lra_size=1;
+SET SESSION innodb_lra_n_node_recs_before_sleep=128;
+SET SESSION innodb_lra_sleep=100;
+CHECKSUM TABLE t1_lra;
+Table Checksum
+test.t1_lra 1005864202
+DROP TABLE t1;
+DROP TABLE t1_lra;

=== added file 'mysql-test/suite/innodb/r/innodb_merge_read.result'
--- mysql-test/suite/innodb/r/innodb_merge_read.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/r/innodb_merge_read.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,30 @@
+DROP TABLE if exists t1;
+CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+INSERT INTO t1 VALUES (0, REPEAT('a',256));
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+show global status like "innodb_buffered_aio_submitted";
+Variable_name Value
+Innodb_buffered_aio_submitted 0
+select * from t1;
+select count(*) from t1;
+count(*)
+65536
+show global status like "innodb_buffered_aio_submitted";
+Variable_name Value
+Innodb_buffered_aio_submitted 2397
+DROP TABLE t1;

=== added file 'mysql-test/suite/innodb/t/innodb_logical_read_ahead-master.opt'
--- mysql-test/suite/innodb/t/innodb_logical_read_ahead-master.opt 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/t/innodb_logical_read_ahead-master.opt 2014-04-23 10:58:56 +0000
@@ -0,0 +1,2 @@
+--innodb_use_native_aio=1
+--force-restart

=== added file 'mysql-test/suite/innodb/t/innodb_logical_read_ahead.test'
--- mysql-test/suite/innodb/t/innodb_logical_read_ahead.test 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/t/innodb_logical_read_ahead.test 2014-04-23 10:58:56 +0000
@@ -0,0 +1,55 @@
+--source include/have_innodb.inc
+--source include/have_native_aio.inc
+
+--disable_warnings
+DROP TABLE if exists t1;
+--enable_warnings
+
+# Create table.
+CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+
+# Populate table.
+INSERT INTO t1 VALUES (0, REPEAT('a',256));
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+
+--source include/restart_mysqld.inc
+
+show global status like "innodb_buffered_aio_submitted";
+show global status like "innodb_logical_read_ahead_misses";
+show global status like "innodb_logical_read_ahead_prefetched";
+show global status like "innodb_logical_read_ahead_in_buf_pool";
+
+# set the logical read ahead large enough to prefetch
+# the entire table.
+SET SESSION innodb_lra_size=1024;
+SET SESSION innodb_lra_n_node_recs_before_sleep=128;
+SET SESSION innodb_lra_sleep=100;
+checksum table t1;
+
+# there should be no misses, all pages must have been
+# prefetched by the logical read ahead.
+show global status like "innodb_logical_read_ahead_misses";
+# the total number of pages prefetched must be close to the number
+# of leaf pages of the table.
+select variable_value > 1000 from information_schema.global_status where variable_name="innodb_logical_read_ahead_prefetched";
+# innodb_logical_read_ahead_in_buf_pool is the number of pages
+# of the table that were already in the buffer pool while doing the scan.
+# This should be small.
+select variable_value < 100 from information_schema.global_status where variable_name="innodb_logical_read_ahead_in_buf_pool";
+
+DROP TABLE t1;

=== added file 'mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness-master.opt'
--- mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness-master.opt 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness-master.opt 2014-04-23 10:58:56 +0000
@@ -0,0 +1,2 @@
+--innodb_use_native_aio=1
+--force-restart

=== added file 'mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness.test'
--- mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness.test 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/t/innodb_logical_read_ahead_correctness.test 2014-04-23 10:58:56 +0000
@@ -0,0 +1,107 @@
+--source include/have_debug.inc
+--source include/have_innodb.inc
+--source include/have_native_aio.inc
+
+--disable_warnings
+DROP TABLE IF EXISTS t1_small;
+DROP TABLE IF EXISTS t1;
+DROP TABLE IF EXISTS t1_lra;
+DROP TABLE IF EXISTS t2_small;
+DROP TABLE IF EXISTS t3_small;
+--enable_warnings
+
+# The small table is for checking against a bug where the table's only page is the
+# root page. In such a case the function called for getting the parent page caused
+# the server to crash.
+CREATE TABLE t1_small(a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+
+SET SESSION innodb_lra_size=1;
+SELECT * FROM t1_small;
+
+SET SESSION innodb_lra_size=0;
+INSERT INTO t1_small(b) VALUES(REPEAT('a',256));
+SET SESSION innodb_lra_size=1;
+SELECT a, LENGTH(b) FROM t1_small;
+SET SESSION innodb_lra_size=0;
+
+DROP TABLE t1_small;
+
+CREATE TABLE `t2_small` (
+ `id1` bigint(20) unsigned NOT NULL DEFAULT '0',
+ `time` bigint(20) unsigned NOT NULL DEFAULT '0',
+ `id2` bigint(20) unsigned NOT NULL DEFAULT '0',
+ `id2_type` int(10) unsigned DEFAULT NULL,
+ `data` text,
+ `status` tinyint(3) unsigned DEFAULT NULL,
+ PRIMARY KEY (`id1`,`time`,`id2`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1;
+
+SET SESSION innodb_lra_size=1;
+SELECT * FROM t2_small;
+DROP TABLE t2_small;
+
+CREATE TABLE `t3_small` (
+ `id` bigint(20) NOT NULL,
+ `a` text,
+ `b` text,
+ `c` text,
+ `d` text,
+ `e` text,
+ `f` text,
+ `g` text,
+ PRIMARY KEY (`id`)
+) ENGINE=InnoDB DEFAULT CHARSET=latin1;
+
+SET SESSION innodb_lra_size=1;
+SELECT * FROM t3_small;
+DROP TABLE t3_small;
+
+CREATE TABLE t1(a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+CREATE TABLE t1_lra(a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+
+# Populate tables.
+INSERT INTO t1 VALUES (0, REPEAT('a',256));
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+INSERT INTO t1(b) SELECT b FROM t1;
+
+INSERT INTO t1_lra SELECT * FROM t1;
+
+--source include/restart_mysqld.inc
+
+CHECKSUM TABLE t1;
+
+SET SESSION innodb_lra_size=1;
+SET SESSION innodb_lra_n_node_recs_before_sleep=128;
+SET SESSION innodb_lra_sleep=100;
+CHECKSUM TABLE t1_lra;
+
+--source include/restart_mysqld.inc
+
+DELETE FROM t1 WHERE a >= 5480 AND a < 5520;
+DELETE FROM t1 WHERE a >= 5520 AND a < 5550;
+
+CHECKSUM TABLE t1;
+
+SET GLOBAL innodb_lra_test=1;
+DELETE FROM t1_lra WHERE a >= 5480 AND a < 5520;
+DELETE FROM t1_lra WHERE a >= 5520 AND a < 5550;
+
+SET SESSION innodb_lra_size=1;
+SET SESSION innodb_lra_n_node_recs_before_sleep=128;
+SET SESSION innodb_lra_sleep=100;
+CHECKSUM TABLE t1_lra;
+
+DROP TABLE t1;
+DROP TABLE t1_lra;

=== added file 'mysql-test/suite/innodb/t/innodb_merge_read-master.opt'
--- mysql-test/suite/innodb/t/innodb_merge_read-master.opt 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/t/innodb_merge_read-master.opt 2014-04-23 10:58:56 +0000
@@ -0,0 +1,1 @@
+--innodb-use-native-aio=1

=== added file 'mysql-test/suite/innodb/t/innodb_merge_read.test'
--- mysql-test/suite/innodb/t/innodb_merge_read.test 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/innodb/t/innodb_merge_read.test 2014-04-23 10:58:56 +0000
@@ -0,0 +1,42 @@
+--source include/have_innodb.inc
+--source include/have_native_aio.inc
+
+--disable_warnings
+DROP TABLE if exists t1;
+--enable_warnings
+
+# Create table.
+CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(256)) ENGINE=INNODB;
+
+# Populate table.
+INSERT INTO t1 VALUES (0, REPEAT('a',256));
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+INSERT INTO t1 SELECT 0, b FROM t1;
+
+--source include/restart_mysqld.inc
+
+show global status like "innodb_buffered_aio_submitted";
+
+--disable_result_log
+select * from t1;
+--enable_result_log
+
+select count(*) from t1;
+
+show global status like "innodb_buffered_aio_submitted";
+
+DROP TABLE t1;

=== added file 'mysql-test/suite/sys_vars/r/innodb_lra_n_node_recs_before_sleep_basic.result'
--- mysql-test/suite/sys_vars/r/innodb_lra_n_node_recs_before_sleep_basic.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/sys_vars/r/innodb_lra_n_node_recs_before_sleep_basic.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,28 @@
+SET GLOBAL innodb_lra_n_node_recs_before_sleep = 128;
+SELECT @@GLOBAL.innodb_lra_n_node_recs_before_sleep;
+@@GLOBAL.innodb_lra_n_node_recs_before_sleep
+128
+SET SESSION innodb_lra_n_node_recs_before_sleep=1000000;
+SELECT @@SESSION.innodb_lra_n_node_recs_before_sleep;
+@@SESSION.innodb_lra_n_node_recs_before_sleep
+1000000
+SET SESSION innodb_lra_n_node_recs_before_sleep=0;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_n_node_recs_before_sl value: '0'
+SELECT @@SESSION.innodb_lra_n_node_recs_before_sleep;
+@@SESSION.innodb_lra_n_node_recs_before_sleep
+128
+SET SESSION innodb_lra_n_node_recs_before_sleep=16384;
+SELECT @@SESSION.innodb_lra_n_node_recs_before_sleep;
+@@SESSION.innodb_lra_n_node_recs_before_sleep
+16384
+SET GLOBAL innodb_lra_n_node_recs_before_sleep=-1;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_n_node_recs_before_sl value: '-1'
+SELECT @@GLOBAL.innodb_lra_n_node_recs_before_sleep;
+@@GLOBAL.innodb_lra_n_node_recs_before_sleep
+128
+SET GLOBAL innodb_lra_n_node_recs_before_sleep = default;
+SELECT @@GLOBAL.innodb_lra_n_node_recs_before_sleep;
+@@GLOBAL.innodb_lra_n_node_recs_before_sleep
+1024

=== added file 'mysql-test/suite/sys_vars/r/innodb_lra_size_basic.result'
--- mysql-test/suite/sys_vars/r/innodb_lra_size_basic.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/sys_vars/r/innodb_lra_size_basic.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,28 @@
+SET GLOBAL innodb_lra_size = 128;
+SELECT @@GLOBAL.innodb_lra_size;
+@@GLOBAL.innodb_lra_size
+128
+SET SESSION innodb_lra_size=1000000;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_size value: '1000000'
+SELECT @@SESSION.innodb_lra_size;
+@@SESSION.innodb_lra_size
+16384
+SET SESSION innodb_lra_size=0;
+SELECT @@SESSION.innodb_lra_size;
+@@SESSION.innodb_lra_size
+0
+SET SESSION innodb_lra_size=16384;
+SELECT @@SESSION.innodb_lra_size;
+@@SESSION.innodb_lra_size
+16384
+SET GLOBAL innodb_lra_size=-1;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_size value: '-1'
+SELECT @@GLOBAL.innodb_lra_size;
+@@GLOBAL.innodb_lra_size
+0
+SET GLOBAL innodb_lra_size = default;
+SELECT @@GLOBAL.innodb_lra_size;
+@@GLOBAL.innodb_lra_size
+0

=== added file 'mysql-test/suite/sys_vars/r/innodb_lra_sleep_basic.result'
--- mysql-test/suite/sys_vars/r/innodb_lra_sleep_basic.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/sys_vars/r/innodb_lra_sleep_basic.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,30 @@
+SET GLOBAL innodb_lra_sleep = 128;
+SELECT @@GLOBAL.innodb_lra_sleep;
+@@GLOBAL.innodb_lra_sleep
+128
+SET SESSION innodb_lra_sleep=1000000;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_sleep value: '1000000'
+SELECT @@SESSION.innodb_lra_sleep;
+@@SESSION.innodb_lra_sleep
+1000
+SET SESSION innodb_lra_sleep=0;
+SELECT @@SESSION.innodb_lra_sleep;
+@@SESSION.innodb_lra_sleep
+0
+SET SESSION innodb_lra_sleep=16384;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_sleep value: '16384'
+SELECT @@SESSION.innodb_lra_sleep;
+@@SESSION.innodb_lra_sleep
+1000
+SET GLOBAL innodb_lra_sleep=-1;
+Warnings:
+Warning 1292 Truncated incorrect innodb_lra_sleep value: '-1'
+SELECT @@GLOBAL.innodb_lra_sleep;
+@@GLOBAL.innodb_lra_sleep
+0
+SET GLOBAL innodb_lra_sleep = default;
+SELECT @@GLOBAL.innodb_lra_sleep;
+@@GLOBAL.innodb_lra_sleep
+50

=== added file 'mysql-test/suite/sys_vars/r/innodb_lra_test_basic.result'
--- mysql-test/suite/sys_vars/r/innodb_lra_test_basic.result 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/sys_vars/r/innodb_lra_test_basic.result 2014-04-23 10:58:56 +0000
@@ -0,0 +1,8 @@
+set global innodb_lra_test=1;
+select @@global.innodb_lra_test;
+@@global.innodb_lra_test
+1
+set global innodb_lra_test=default;
+select @@global.innodb_lra_test;
+@@global.innodb_lra_test
+0

=== added file 'mysql-test/suite/sys_vars/t/innodb_lra_n_node_recs_before_sleep_basic.test'
--- mysql-test/suite/sys_vars/t/innodb_lra_n_node_recs_before_sleep_basic.test 1970-01-01 00:00:00 +0000
+++ mysql-test/suite/sys_vars/t/innodb_lra_n_node_recs_before_sleep_basic.test 2014-04-23 10:58:56 +0000
@@ -0,0 +1,14 @@
+--source include/have_innodb.inc
+
+SET GLOBAL innodb_lra_n_node_recs_before_sleep = 128;
+SELECT @@GLOBAL.innodb_lra_n_node_recs_before_sleep;
+SET SESSION innodb_lra_n_node_recs_before_sleep=1000000;
+SELECT @@SESSION.innodb_lra_n_node_recs_before_sleep;
+SET SESSION innodb_lra_n_node_recs_before_sleep=0;
+SELECT @@SESSION.innodb_lra_n_node_recs_before_sleep;
+SET SESSION innodb_lra_n_node_recs_before_sleep=16384;
+SELECT @@SESSION.innodb_lra_n_node_recs_before_sleep;
+SET GLOBAL innodb_lra_n_node_recs_before_sleep=-1;
+SELECT @@GLOBAL.innodb_lra_n_node_recs_before_sleep;
+SET GLOBAL innodb_lra_n_node_recs_before_sleep = default;
+SELECT @@GLOBAL.innodb_lra_n_node_recs_before_sleep;

644=== added file 'mysql-test/suite/sys_vars/t/innodb_lra_size_basic-master.opt'
645--- mysql-test/suite/sys_vars/t/innodb_lra_size_basic-master.opt 1970-01-01 00:00:00 +0000
646+++ mysql-test/suite/sys_vars/t/innodb_lra_size_basic-master.opt 2014-04-23 10:58:56 +0000
647@@ -0,0 +1,1 @@
648+--innodb-use-native-aio=1
649
650=== added file 'mysql-test/suite/sys_vars/t/innodb_lra_size_basic.test'
651--- mysql-test/suite/sys_vars/t/innodb_lra_size_basic.test 1970-01-01 00:00:00 +0000
652+++ mysql-test/suite/sys_vars/t/innodb_lra_size_basic.test 2014-04-23 10:58:56 +0000
653@@ -0,0 +1,14 @@
654+--source include/have_innodb.inc
655+
656+SET GLOBAL innodb_lra_size = 128;
657+SELECT @@GLOBAL.innodb_lra_size;
658+SET SESSION innodb_lra_size=1000000;
659+SELECT @@SESSION.innodb_lra_size;
660+SET SESSION innodb_lra_size=0;
661+SELECT @@SESSION.innodb_lra_size;
662+SET SESSION innodb_lra_size=16384;
663+SELECT @@SESSION.innodb_lra_size;
664+SET GLOBAL innodb_lra_size=-1;
665+SELECT @@GLOBAL.innodb_lra_size;
666+SET GLOBAL innodb_lra_size = default;
667+SELECT @@GLOBAL.innodb_lra_size;
668
669=== added file 'mysql-test/suite/sys_vars/t/innodb_lra_sleep_basic.test'
670--- mysql-test/suite/sys_vars/t/innodb_lra_sleep_basic.test 1970-01-01 00:00:00 +0000
671+++ mysql-test/suite/sys_vars/t/innodb_lra_sleep_basic.test 2014-04-23 10:58:56 +0000
672@@ -0,0 +1,14 @@
673+--source include/have_innodb.inc
674+
675+SET GLOBAL innodb_lra_sleep = 128;
676+SELECT @@GLOBAL.innodb_lra_sleep;
677+SET SESSION innodb_lra_sleep=1000000;
678+SELECT @@SESSION.innodb_lra_sleep;
679+SET SESSION innodb_lra_sleep=0;
680+SELECT @@SESSION.innodb_lra_sleep;
681+SET SESSION innodb_lra_sleep=16384;
682+SELECT @@SESSION.innodb_lra_sleep;
683+SET GLOBAL innodb_lra_sleep=-1;
684+SELECT @@GLOBAL.innodb_lra_sleep;
685+SET GLOBAL innodb_lra_sleep = default;
686+SELECT @@GLOBAL.innodb_lra_sleep;
687
688=== added file 'mysql-test/suite/sys_vars/t/innodb_lra_test_basic-master.opt'
689--- mysql-test/suite/sys_vars/t/innodb_lra_test_basic-master.opt 1970-01-01 00:00:00 +0000
690+++ mysql-test/suite/sys_vars/t/innodb_lra_test_basic-master.opt 2014-04-23 10:58:56 +0000
691@@ -0,0 +1,1 @@
692+--innodb-use-native-aio=1
693
694=== added file 'mysql-test/suite/sys_vars/t/innodb_lra_test_basic.test'
695--- mysql-test/suite/sys_vars/t/innodb_lra_test_basic.test 1970-01-01 00:00:00 +0000
696+++ mysql-test/suite/sys_vars/t/innodb_lra_test_basic.test 2014-04-23 10:58:56 +0000
697@@ -0,0 +1,8 @@
698+--source include/have_debug.inc
699+--source include/have_innodb.inc
700+--source include/have_native_aio.inc
701+
702+set global innodb_lra_test=1;
703+select @@global.innodb_lra_test;
704+set global innodb_lra_test=default;
705+select @@global.innodb_lra_test;
706\ No newline at end of file
707
708=== modified file 'storage/innobase/btr/btr0cur.cc'
709--- storage/innobase/btr/btr0cur.cc 2014-03-03 17:51:33 +0000
710+++ storage/innobase/btr/btr0cur.cc 2014-04-23 10:58:56 +0000
711@@ -548,6 +548,7 @@
712 btr_search_enabled below, and btr_search_guess_on_hash()
713 will have to check it again. */
714 && UNIV_LIKELY(btr_search_enabled)
715+ && !level
716 && btr_search_guess_on_hash(index, info, tuple, mode,
717 latch_mode, cursor,
718 has_search_latch, mtr)) {
719
720=== modified file 'storage/innobase/btr/btr0pcur.cc'
721--- storage/innobase/btr/btr0pcur.cc 2014-03-03 17:51:33 +0000
722+++ storage/innobase/btr/btr0pcur.cc 2014-04-23 10:58:56 +0000
723@@ -227,6 +227,7 @@
724 /*===========================*/
725 ulint latch_mode, /*!< in: BTR_SEARCH_LEAF, ... */
726 btr_pcur_t* cursor, /*!< in: detached persistent cursor */
727+ ulint level,
728 const char* file, /*!< in: file name */
729 ulint line, /*!< in: line where called */
730 mtr_t* mtr) /*!< in: mtr */
731@@ -255,7 +256,7 @@
732 btr_cur_open_at_index_side(
733 cursor->rel_pos == BTR_PCUR_BEFORE_FIRST_IN_TREE,
734 index, latch_mode,
735- btr_pcur_get_btr_cur(cursor), 0, mtr);
736+ btr_pcur_get_btr_cur(cursor), level, mtr);
737
738 cursor->latch_mode = latch_mode;
739 cursor->pos_state = BTR_PCUR_IS_POSITIONED;
740@@ -267,8 +268,12 @@
741 ut_a(cursor->old_rec);
742 ut_a(cursor->old_n_fields);
743
744- if (UNIV_LIKELY(latch_mode == BTR_SEARCH_LEAF)
745- || UNIV_LIKELY(latch_mode == BTR_MODIFY_LEAF)) {
746+ if (true
747+#ifdef UNIV_DEBUG
748+ && !level
749+#endif
750+ && (UNIV_LIKELY(latch_mode == BTR_SEARCH_LEAF)
751+ || UNIV_LIKELY(latch_mode == BTR_MODIFY_LEAF))) {
752 /* Try optimistic restoration. */
753
754 if (buf_page_optimistic_get(latch_mode,
755@@ -325,24 +330,27 @@
756
757 /* Save the old search mode of the cursor */
758 old_mode = cursor->search_mode;
759-
760- switch (cursor->rel_pos) {
761- case BTR_PCUR_ON:
762+ if (level > 0) {
763 mode = PAGE_CUR_LE;
764- break;
765- case BTR_PCUR_AFTER:
766- mode = PAGE_CUR_G;
767- break;
768- case BTR_PCUR_BEFORE:
769- mode = PAGE_CUR_L;
770- break;
771- default:
772- ut_error;
773- mode = 0;
774+ } else {
775+ switch (cursor->rel_pos) {
776+ case BTR_PCUR_ON:
777+ mode = PAGE_CUR_LE;
778+ break;
779+ case BTR_PCUR_AFTER:
780+ mode = PAGE_CUR_G;
781+ break;
782+ case BTR_PCUR_BEFORE:
783+ mode = PAGE_CUR_L;
784+ break;
785+ default:
786+ ut_error;
787+ mode = 0;
788+ }
789 }
790
791- btr_pcur_open_with_no_init_func(index, tuple, mode, latch_mode,
792- cursor, 0, file, line, mtr);
793+ btr_pcur_open_with_no_init_func_low(index, tuple, mode, latch_mode,
794+ cursor, level, 0, file, line, mtr);
795
796 /* Restore the old search mode */
797 cursor->search_mode = old_mode;
798
799=== modified file 'storage/innobase/buf/buf0rea.cc'
800--- storage/innobase/buf/buf0rea.cc 2013-10-23 08:48:28 +0000
801+++ storage/innobase/buf/buf0rea.cc 2014-04-23 10:58:56 +0000
802@@ -123,7 +123,12 @@
803 use to stop dangling page reads from a tablespace
804 which we have DISCARDed + IMPORTed back */
805 ulint offset, /*!< in: page number */
806- trx_t* trx)
807+ trx_t* trx, /*!< in: transaction object */
808+ ibool should_buffer) /*!< in: whether to buffer an aio request.
809+ AIO read ahead uses this. If you plan to
810+ use this parameter, make sure you remember
811+ to call os_aio_linux_dispatch_read_array_submit
812+ when you are ready to commit all your requests.*/
813 {
814 buf_page_t* bpage;
815 ulint wake_later;
816@@ -229,14 +234,16 @@
817 *err = _fil_io(OS_FILE_READ | wake_later
818 | ignore_nonexistent_pages,
819 sync, space, zip_size, offset, 0, zip_size,
820- bpage->zip.data, bpage, trx);
821+ bpage->zip.data, bpage, trx,
822+ should_buffer);
823 } else {
824 ut_a(buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
825
826 *err = _fil_io(OS_FILE_READ | wake_later
827 | ignore_nonexistent_pages,
828 sync, space, 0, offset, 0, UNIV_PAGE_SIZE,
829- ((buf_block_t*) bpage)->frame, bpage, trx);
830+ ((buf_block_t*) bpage)->frame, bpage, trx,
831+ should_buffer);
832 }
833
834 if (sync) {
835@@ -395,7 +402,7 @@
836 &err, false,
837 ibuf_mode | OS_AIO_SIMULATED_WAKE_LATER,
838 space, zip_size, FALSE,
839- tablespace_version, i, trx);
840+ tablespace_version, i, trx, FALSE);
841 if (err == DB_TABLESPACE_DELETED) {
842 ut_print_timestamp(stderr);
843 fprintf(stderr,
844@@ -459,7 +466,7 @@
845
846 count = buf_read_page_low(&err, true, BUF_READ_ANY_PAGE, space,
847 zip_size, FALSE,
848- tablespace_version, offset, trx);
849+ tablespace_version, offset, trx, FALSE);
850 srv_stats.buf_pool_reads.add(count);
851 if (err == DB_TABLESPACE_DELETED) {
852 ut_print_timestamp(stderr);
853@@ -507,7 +514,7 @@
854 | OS_AIO_SIMULATED_WAKE_LATER
855 | BUF_READ_IGNORE_NONEXISTENT_PAGES,
856 space, zip_size, FALSE,
857- tablespace_version, offset, NULL);
858+ tablespace_version, offset, NULL, FALSE);
859 srv_stats.buf_pool_reads.add(count);
860
861 /* We do not increment number of I/O operations used for LRU policy
862@@ -584,6 +591,12 @@
863 return(0);
864 }
865
866+ /* linear read ahead is disabled if user requested logical read ahead.
867+ */
868+ if (trx && trx->lra_size) {
869+ return(0);
870+ }
871+
872 low = (offset / buf_read_ahead_linear_area)
873 * buf_read_ahead_linear_area;
874 high = (offset / buf_read_ahead_linear_area + 1)
875@@ -773,7 +786,8 @@
876 count += buf_read_page_low(
877 &err, false,
878 ibuf_mode,
879- space, zip_size, FALSE, tablespace_version, i, trx);
880+ space, zip_size, FALSE, tablespace_version, i, trx,
881+ TRUE);
882 if (err == DB_TABLESPACE_DELETED) {
883 ut_print_timestamp(stderr);
884 fprintf(stderr,
885@@ -786,6 +800,10 @@
886 }
887 }
888 }
889+#if defined(LINUX_NATIVE_AIO)
890+ /* Tell aio to submit all buffered requests. */
891+ ut_a(os_aio_linux_dispatch_read_array_submit());
892+#endif
893
894 /* In simulated aio we wake the aio handler threads only after
895 queuing all aio requests, in native aio the following call does
896@@ -863,7 +881,7 @@
897 buf_read_page_low(&err, sync && (i + 1 == n_stored),
898 BUF_READ_ANY_PAGE, space_ids[i],
899 zip_size, TRUE, space_versions[i],
900- page_nos[i], NULL);
901+ page_nos[i], NULL, FALSE);
902
903 if (UNIV_UNLIKELY(err == DB_TABLESPACE_DELETED)) {
904 tablespace_deleted:
905@@ -1003,15 +1021,20 @@
906 if ((i + 1 == n_stored) && sync) {
907 buf_read_page_low(&err, true, BUF_READ_ANY_PAGE, space,
908 zip_size, TRUE, tablespace_version,
909- page_nos[i], NULL);
910+ page_nos[i], NULL, FALSE);
911 } else {
912 buf_read_page_low(&err, false, BUF_READ_ANY_PAGE
913 | OS_AIO_SIMULATED_WAKE_LATER,
914 space, zip_size, TRUE,
915- tablespace_version, page_nos[i], NULL);
916+ tablespace_version, page_nos[i], NULL,
917+ FALSE);
918 }
919 }
920
921+#ifdef LINUX_NATIVE_AIO
922+ ut_a(os_aio_linux_dispatch_read_array_submit());
923+#endif
924+
925 os_aio_simulated_wake_handler_threads();
926
927 #ifdef UNIV_DEBUG
928
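Note on the buf0rea.cc changes above: read requests issued with should_buffer set to TRUE are only queued, and a single call to os_aio_linux_dispatch_read_array_submit() afterwards hands the whole batch over. The sketch below illustrates only that queue-then-submit shape. `ReadBatch` and its methods are hypothetical; no kernel AIO call is made and this is not the patch's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a buffered read array: requests are
// collected first and handed over in one batch on submit().
class ReadBatch {
public:
    // Buffer one page read request; nothing is dispatched yet.
    void queue(unsigned long page_no) { pending_.push_back(page_no); }

    // Dispatch everything buffered so far and return how many
    // requests were handed over. A real implementation would call
    // into the kernel here; this sketch only records the pages.
    std::size_t submit() {
        std::size_t n = pending_.size();
        submitted_.insert(submitted_.end(), pending_.begin(), pending_.end());
        pending_.clear();
        return n;
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t submitted() const { return submitted_.size(); }

private:
    std::vector<unsigned long> pending_;    // queued, not yet dispatched
    std::vector<unsigned long> submitted_;  // already handed over
};
```

Forgetting the final submit() leaves requests sitting in the pending list, which is the failure mode the should_buffer comments warn about.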
929=== modified file 'storage/innobase/fil/fil0fil.cc'
930--- storage/innobase/fil/fil0fil.cc 2014-03-05 11:54:14 +0000
931+++ storage/innobase/fil/fil0fil.cc 2014-04-23 10:58:56 +0000
932@@ -5168,7 +5168,7 @@
933 success = os_aio(OS_FILE_WRITE, OS_AIO_SYNC,
934 node->name, node->handle, buf,
935 offset, page_size * n_pages,
936- NULL, NULL, space_id, NULL);
937+ NULL, NULL, space_id, NULL, TRUE);
938 #endif /* UNIV_HOTBACKUP */
939 if (success) {
940 os_has_said_disk_full = FALSE;
941@@ -5545,7 +5545,12 @@
942 appropriately aligned */
943 void* message, /*!< in: message for aio handler if non-sync
944 aio used, else ignored */
945- trx_t* trx)
946+ trx_t* trx,
947+ ibool should_buffer) /*!< in: whether to buffer an aio request.
948+ AIO read ahead uses this. If you plan to
949+ use this parameter, make sure you remember
950+ to call os_aio_linux_dispatch_read_array_submit
951+ when you are ready to commit all your requests.*/
952 {
953 ulint mode;
954 fil_space_t* space;
955@@ -5762,7 +5767,7 @@
956
957 /* Queue the aio request */
958 ret = os_aio(type, mode | wake_later, node->name, node->handle, buf,
959- offset, len, node, message, space_id, trx);
960+ offset, len, node, message, space_id, trx, should_buffer);
961
962 #else
963 /* In ibbackup do normal i/o, not aio */
964
965=== modified file 'storage/innobase/handler/ha_innodb.cc'
966--- storage/innobase/handler/ha_innodb.cc 2014-03-03 17:51:33 +0000
967+++ storage/innobase/handler/ha_innodb.cc 2014-04-23 10:58:56 +0000
968@@ -106,6 +106,11 @@
969 #include "i_s.h"
970 #include "xtradb_i_s.h"
971
972+#ifdef TARGET_OS_LINUX
973+#include <sys/syscall.h>
974+#include <sys/ioctl.h>
975+#endif /* TARGET_OS_LINUX */
976+
977 # ifndef MYSQL_PLUGIN_IMPORT
978 # define MYSQL_PLUGIN_IMPORT /* nothing */
979 # endif /* MYSQL_PLUGIN_IMPORT */
980@@ -634,6 +639,30 @@
981 "Timeout in seconds an InnoDB transaction may wait for a lock before being rolled back. Values above 100000000 disable the timeout.",
982 NULL, NULL, 50, 1, 1024 * 1024 * 1024, 0);
983
984+static MYSQL_THDVAR_ULONG(lra_size, PLUGIN_VAR_OPCMDARG,
985+ "The total size (in MB) of the pages that InnoDB will prefetch "
986+ "while scanning a table during this session. This is meant to be used only "
987+ "for table scans. The upper limit of this variable is 16384 which "
988+ "corresponds to prefetching 16GB of data. When set to max, this algorithm "
989+ "may use 100M memory.", NULL, NULL, 0, 0, 16384, 0);
990+
991+static MYSQL_THDVAR_ULONG(lra_n_node_recs_before_sleep, PLUGIN_VAR_OPCMDARG,
992+ "innodb_lra_n_node_recs_before_sleep is the number of node pointer records "
993+ "traversed while holding the index lock before releasing the index lock "
994+ "and sleeping for a short period of time so that the other threads get a "
995+ "chance to x-latch the index lock. innodb_lra_sleep is the sleep time in "
996+ "milliseconds.",
997+ NULL, NULL, 1024, 128, ULINT_MAX, 0);
998+
999+static MYSQL_THDVAR_ULONG(lra_sleep, PLUGIN_VAR_OPCMDARG,
1000+ "innodb_lra_n_node_recs_before_sleep is the number of node pointer records "
1001+ "traversed while holding the index lock before releasing the index lock "
1002+ "and sleeping for a short period of time so that the other threads get a "
1003+ "chance to x-latch the index lock. innodb_lra_sleep is the sleep time in "
1004+ "milliseconds.",
1005+ NULL, NULL, 50, 0, 1000, 0);
1006+
1007+
1008 static MYSQL_THDVAR_STR(ft_user_stopword_table,
1009 PLUGIN_VAR_OPCMDARG|PLUGIN_VAR_MEMALLOC,
1010 "User supplied stopword table name, effective in the session level.",
1011@@ -851,6 +880,14 @@
1012 (char*) &export_vars.innodb_x_lock_spin_rounds, SHOW_LONGLONG},
1013 {"x_lock_spin_waits",
1014 (char*) &export_vars.innodb_x_lock_spin_waits, SHOW_LONGLONG},
1015+ {"buffered_aio_submitted",
1016+ (char*) &export_vars.innodb_buffered_aio_submitted, SHOW_LONG},
1017+ {"logical_read_ahead_misses",
1018+ (char*) &export_vars.innodb_logical_read_ahead_misses, SHOW_LONG},
1019+ {"logical_read_ahead_prefetched",
1020+ (char*) &export_vars.innodb_logical_read_ahead_prefetched, SHOW_LONG},
1021+ {"logical_read_ahead_in_buf_pool",
1022+ (char*) &export_vars.innodb_logical_read_ahead_in_buf_pool, SHOW_LONG},
1023 {NullS, NullS, SHOW_LONG}
1024 };
1025
1026@@ -2294,6 +2331,10 @@
1027 thd, OPTION_RELAXED_UNIQUE_CHECKS);
1028
1029 trx->fake_changes = THDVAR(thd, fake_changes);
1030+ trx_lra_reset(trx,
1031+ THDVAR(thd, lra_size),
1032+ THDVAR(thd, lra_n_node_recs_before_sleep),
1033+ THDVAR(thd, lra_sleep));
1034
1035 #ifdef EXTENDED_SLOWLOG
1036 if (thd_log_slow_verbosity(thd) & (1ULL << SLOG_V_INNODB)) {
1037@@ -2326,6 +2367,10 @@
1038 trx = trx_allocate_for_mysql();
1039
1040 trx->mysql_thd = thd;
1041+ trx_lra_reset(trx,
1042+ THDVAR(thd, lra_size),
1043+ THDVAR(thd, lra_n_node_recs_before_sleep),
1044+ THDVAR(thd, lra_sleep));
1045
1046 innobase_trx_init(thd, trx);
1047
1048@@ -3860,6 +3905,7 @@
1049 /*================*/
1050 trx_t* trx) /*!< in: transaction handle */
1051 {
1052+ trx_lra_reset(trx, 0, 0, 0);
1053 if (trx_is_started(trx)) {
1054
1055 trx_commit_for_mysql(trx);
1056@@ -17573,6 +17619,17 @@
1057 "It is to create artificially the situation the purge view have been updated "
1058 "but the each purges were not done yet.",
1059 NULL, NULL, FALSE);
1060+
1061+#ifdef UNIV_DEBUG
1062+extern my_bool row_lra_test;
1063+#endif
1064+
1065+static MYSQL_SYSVAR_BOOL(lra_test, row_lra_test,
1066+ PLUGIN_VAR_NOCMDARG,
1067+ "When set to true, the purge thread stops until the logical read ahead "
1068+ "sets this variable to TRUE. Used for testing edge cases regarding the "
1069+ "purge thread and logical read ahead.",
1070+ NULL, NULL, FALSE);
1071 #endif /* UNIV_DEBUG */
1072
1073 const char *corrupt_table_action_names[]=
1074@@ -17789,10 +17846,14 @@
1075 MYSQL_SYSVAR(trx_rseg_n_slots_debug),
1076 MYSQL_SYSVAR(limit_optimistic_insert_debug),
1077 MYSQL_SYSVAR(trx_purge_view_update_only_debug),
1078+ MYSQL_SYSVAR(lra_test),
1079 #endif /* UNIV_DEBUG */
1080 MYSQL_SYSVAR(corrupt_table_action),
1081 MYSQL_SYSVAR(fake_changes),
1082 MYSQL_SYSVAR(locking_fake_changes),
1083+ MYSQL_SYSVAR(lra_size),
1084+ MYSQL_SYSVAR(lra_n_node_recs_before_sleep),
1085+ MYSQL_SYSVAR(lra_sleep),
1086 NULL
1087 };
1088
1089
1090=== modified file 'storage/innobase/include/btr0pcur.h'
1091--- storage/innobase/include/btr0pcur.h 2014-02-17 11:12:40 +0000
1092+++ storage/innobase/include/btr0pcur.h 2014-04-23 10:58:56 +0000
1093@@ -262,11 +262,12 @@
1094 /*===========================*/
1095 ulint latch_mode, /*!< in: BTR_SEARCH_LEAF, ... */
1096 btr_pcur_t* cursor, /*!< in: detached persistent cursor */
1097+ ulint level,
1098 const char* file, /*!< in: file name */
1099 ulint line, /*!< in: line where called */
1100 mtr_t* mtr); /*!< in: mtr */
1101-#define btr_pcur_restore_position(l,cur,mtr) \
1102- btr_pcur_restore_position_func(l,cur,__FILE__,__LINE__,mtr)
1103+#define btr_pcur_restore_position(l, cur, mtr) \
1104+ btr_pcur_restore_position_func(l, cur, 0, __FILE__, __LINE__, mtr)
1105 /*********************************************************//**
1106 Gets the rel_pos field for a cursor whose position has been stored.
1107 @return BTR_PCUR_ON, ... */
1108
1109=== modified file 'storage/innobase/include/btr0pcur.ic'
1110--- storage/innobase/include/btr0pcur.ic 2014-02-17 11:12:40 +0000
1111+++ storage/innobase/include/btr0pcur.ic 2014-04-23 10:58:56 +0000
1112@@ -448,6 +448,54 @@
1113 cursor. */
1114 UNIV_INLINE
1115 void
1116+btr_pcur_open_with_no_init_func_low(
1117+/*============================*/
1118+ dict_index_t* index, /*!< in: index */
1119+ const dtuple_t* tuple, /*!< in: tuple on which search done */
1120+ ulint mode, /*!< in: PAGE_CUR_L, ...;
1121+ NOTE that if the search is made using a unique
1122+ prefix of a record, mode should be
1123+ PAGE_CUR_LE, not PAGE_CUR_GE, as the latter
1124+ may end up on the previous page of the
1125+ record! */
1126+ ulint latch_mode,/*!< in: BTR_SEARCH_LEAF, ...;
1127+ NOTE that if has_search_latch != 0 then
1128+ we maybe do not acquire a latch on the cursor
1129+ page, but assume that the caller uses his
1130+ btr search latch to protect the record! */
1131+ btr_pcur_t* cursor, /*!< in: memory buffer for persistent cursor */
1132+ ulint level,
1133+ ulint has_search_latch,/*!< in: latch mode the caller
1134+ currently has on btr_search_latch:
1135+ RW_S_LATCH, or 0 */
1136+ const char* file, /*!< in: file name */
1137+ ulint line, /*!< in: line where called */
1138+ mtr_t* mtr) /*!< in: mtr */
1139+{
1140+ btr_cur_t* btr_cursor;
1141+
1142+ cursor->latch_mode = latch_mode;
1143+ cursor->search_mode = mode;
1144+
1145+ /* Search with the tree cursor */
1146+
1147+ btr_cursor = btr_pcur_get_btr_cur(cursor);
1148+
1149+ btr_cur_search_to_nth_level(index, level, tuple, mode, latch_mode,
1150+ btr_cursor, has_search_latch,
1151+ file, line, mtr);
1152+ cursor->pos_state = BTR_PCUR_IS_POSITIONED;
1153+
1154+ cursor->old_stored = BTR_PCUR_OLD_NOT_STORED;
1155+
1156+ cursor->trx_if_known = NULL;
1157+}
1158+
1159+/**************************************************************//**
1160+Opens a persistent cursor to an index tree without initializing the
1161+cursor. */
1162+UNIV_INLINE
1163+void
1164 btr_pcur_open_with_no_init_func(
1165 /*============================*/
1166 dict_index_t* index, /*!< in: index */
1167@@ -471,23 +519,9 @@
1168 ulint line, /*!< in: line where called */
1169 mtr_t* mtr) /*!< in: mtr */
1170 {
1171- btr_cur_t* btr_cursor;
1172-
1173- cursor->latch_mode = latch_mode;
1174- cursor->search_mode = mode;
1175-
1176- /* Search with the tree cursor */
1177-
1178- btr_cursor = btr_pcur_get_btr_cur(cursor);
1179-
1180- btr_cur_search_to_nth_level(index, 0, tuple, mode, latch_mode,
1181- btr_cursor, has_search_latch,
1182- file, line, mtr);
1183- cursor->pos_state = BTR_PCUR_IS_POSITIONED;
1184-
1185- cursor->old_stored = BTR_PCUR_OLD_NOT_STORED;
1186-
1187- cursor->trx_if_known = NULL;
1188+ return btr_pcur_open_with_no_init_func_low(
1189+ index, tuple, mode, latch_mode, cursor,
1190+ 0, has_search_latch, file, line, mtr);
1191 }
1192
1193 /*****************************************************************//**
1194
1195=== modified file 'storage/innobase/include/buf0rea.h'
1196--- storage/innobase/include/buf0rea.h 2013-10-23 08:48:28 +0000
1197+++ storage/innobase/include/buf0rea.h 2014-04-23 10:58:56 +0000
1198@@ -30,6 +30,43 @@
1199 #include "buf0types.h"
1200
1201 /********************************************************************//**
1202+Low-level function which reads a page asynchronously from a file to the
1203+buffer buf_pool if it is not already there, in which case does nothing.
1204+Sets the io_fix flag and sets an exclusive lock on the buffer frame. The
1205+flag is cleared and the x-lock released by an i/o-handler thread.
1206+@return 1 if a read request was queued, 0 if the page already resided
1207+in buf_pool, or if the page is in the doublewrite buffer blocks in
1208+which case it is never read into the pool, or if the tablespace does
1209+not exist or is being dropped
1210+@return 1 if read request is issued. 0 if it is not */
1211+UNIV_INTERN
1212+ulint
1213+buf_read_page_low(
1214+/*==============*/
1215+ dberr_t* err, /*!< out: DB_SUCCESS or DB_TABLESPACE_DELETED
1216+ if we are trying to read from a non-existent
1217+ tablespace, or a tablespace which is just now being
1218+ dropped */
1219+ bool sync, /*!< in: TRUE if synchronous aio is desired */
1220+ ulint mode, /*!< in: BUF_READ_IBUF_PAGES_ONLY, ...,
1221+ ORed to OS_AIO_SIMULATED_WAKE_LATER (see below
1222+ at read-ahead functions) */
1223+ ulint space, /*!< in: space id */
1224+ ulint zip_size,/*!< in: compressed page size, or 0 */
1225+ ibool unzip, /*!< in: TRUE=request uncompressed page */
1226+ ib_int64_t tablespace_version, /*!< in: if the space memory object has
1227+ this timestamp different from what we are giving here,
1228+ treat the tablespace as dropped; this is a timestamp
1229+ we use to stop dangling page reads from a tablespace
1230+ which we have DISCARDed + IMPORTed back */
1231+ ulint offset, /*!< in: page number */
1232+ trx_t* trx, /*!< in: transaction object */
1233+ ibool should_buffer); /*!< in: whether to buffer an aio request.
1234+ AIO read ahead uses this. If you plan to
1235+ use this parameter, make sure you remember
1236+ to call os_aio_linux_dispatch_read_array_submit
1237+ when you are ready to commit all your requests.*/
1238+/********************************************************************//**
1239 High-level function which reads a page asynchronously from a file to the
1240 buffer buf_pool if it is not already there. Sets the io_fix flag and sets
1241 an exclusive lock on the buffer frame. The flag is cleared and the x-lock
1242
1243=== modified file 'storage/innobase/include/fil0fil.h'
1244--- storage/innobase/include/fil0fil.h 2014-02-17 11:12:40 +0000
1245+++ storage/innobase/include/fil0fil.h 2014-04-23 10:58:56 +0000
1246@@ -724,7 +724,7 @@
1247 @return DB_SUCCESS, or DB_TABLESPACE_DELETED if we are trying to do
1248 i/o on a tablespace which does not exist */
1249 #define fil_io(type, sync, space_id, zip_size, block_offset, byte_offset, len, buf, message) \
1250- _fil_io(type, sync, space_id, zip_size, block_offset, byte_offset, len, buf, message, NULL)
1251+ _fil_io(type, sync, space_id, zip_size, block_offset, byte_offset, len, buf, message, NULL, FALSE)
1252
1253 UNIV_INTERN
1254 dberr_t
1255@@ -755,7 +755,10 @@
1256 appropriately aligned */
1257 void* message, /*!< in: message for aio handler if non-sync
1258 aio used, else ignored */
1259- trx_t* trx)
1260+ trx_t* trx,
1261+ ibool should_buffer /*!< in: whether to buffer an aio request.
1262+ Only used by aio read ahead*/
1263+)
1264 __attribute__((nonnull(8)));
1265 /**********************************************************************//**
1266 Waits for an aio operation to complete. This function is used to write the
1267
1268=== modified file 'storage/innobase/include/os0file.h'
1269--- storage/innobase/include/os0file.h 2014-02-17 11:12:40 +0000
1270+++ storage/innobase/include/os0file.h 2014-04-23 10:58:56 +0000
1271@@ -321,10 +321,11 @@
1272 pfs_os_file_close_func(file, __FILE__, __LINE__)
1273
1274 # define os_aio(type, mode, name, file, buf, offset, \
1275- n, message1, message2, space_id, trx) \
1276+ n, message1, message2, space_id, trx, \
1277+ should_buffer) \
1278 pfs_os_aio_func(type, mode, name, file, buf, offset, \
1279 n, message1, message2, space_id, trx, \
1280- __FILE__, __LINE__)
1281+ __FILE__, __LINE__, should_buffer)
1282
1283 # define os_file_read(file, buf, offset, n) \
1284 pfs_os_file_read_func(file, buf, offset, n, NULL, \
1285@@ -371,9 +372,9 @@
1286 # define os_file_close(file) os_file_close_func(file)
1287
1288 # define os_aio(type, mode, name, file, buf, offset, n, message1, \
1289- message2, space_id, trx) \
1290+ message2, space_id, trx, should_buffer) \
1291 os_aio_func(type, mode, name, file, buf, offset, n, \
1292- message1, message2, space_id, trx)
1293+ message1, message2, space_id, trx, should_buffer)
1294
1295 # define os_file_read(file, buf, offset, n) \
1296 os_file_read_func(file, buf, offset, n, NULL)
1297@@ -777,7 +778,13 @@
1298 ulint space_id,
1299 trx_t* trx,
1300 const char* src_file,/*!< in: file name where func invoked */
1301- ulint src_line);/*!< in: line where the func invoked */
1302+ ulint src_line,/*!< in: line where the func invoked */
1303+ ibool should_buffer);
1304+ /*!< in: Whether to buffer an aio request.
1305+ AIO read ahead uses this. If you plan to
1306+ use this parameter, make sure you remember
1307+ to call os_aio_linux_dispatch_read_array_submit
1308+ when you are ready to commit all your requests.*/
1309 /*******************************************************************//**
1310 NOTE! Please use the corresponding macro os_file_write(), not directly
1311 this function!
1312@@ -1148,7 +1155,12 @@
1313 aio operation); ignored if mode is
1314 OS_AIO_SYNC */
1315 ulint space_id,
1316- trx_t* trx);
1317+ trx_t* trx,
1318+ ibool should_buffer); /*!< in: Whether to buffer an aio request.
1319+ AIO read ahead uses this. If you plan to
1320+ use this parameter, make sure you remember
1321+ to call os_aio_linux_dispatch_read_array_submit
1322+ when you are ready to commit all your requests.*/
1323 /************************************************************************//**
1324 Wakes up all async i/o threads so that they know to exit themselves in
1325 shutdown. */
1326@@ -1315,6 +1327,12 @@
1327 restart the operation. */
1328 ulint* type, /*!< out: OS_FILE_WRITE or ..._READ */
1329 ulint* space_id);
1330+/*******************************************************************//**
1331+Submit buffered AIO requests on the given segment to the kernel.
1332+@return TRUE on success. */
1333+UNIV_INTERN
1334+ibool
1335+os_aio_linux_dispatch_read_array_submit();
1336 #endif /* LINUX_NATIVE_AIO */
1337
1338 #ifndef UNIV_NONINL
1339
1340=== modified file 'storage/innobase/include/os0file.ic'
1341--- storage/innobase/include/os0file.ic 2013-10-23 08:48:28 +0000
1342+++ storage/innobase/include/os0file.ic 2014-04-23 10:58:56 +0000
1343@@ -213,7 +213,10 @@
1344 ulint space_id,
1345 trx_t* trx,
1346 const char* src_file,/*!< in: file name where func invoked */
1347- ulint src_line)/*!< in: line where the func invoked */
1348+ ulint src_line,/*!< in: line where the func invoked */
1349+ ibool should_buffer)
1350+ /*!< in: whether to buffer an aio request.
1351+ Only used by aio read ahead*/
1352 {
1353 ibool result;
1354 struct PSI_file_locker* locker = NULL;
1355@@ -227,7 +230,8 @@
1356 src_file, src_line);
1357
1358 result = os_aio_func(type, mode, name, file, buf, offset,
1359- n, message1, message2, space_id, trx);
1360+ n, message1, message2, space_id, trx,
1361+ should_buffer);
1362
1363 register_pfs_file_io_end(locker, n);
1364
1365
1366=== modified file 'storage/innobase/include/srv0srv.h'
1367--- storage/innobase/include/srv0srv.h 2014-02-17 11:12:40 +0000
1368+++ storage/innobase/include/srv0srv.h 2014-04-23 10:58:56 +0000
1369@@ -129,6 +129,23 @@
1370 ulint_ctr_1_t lock_deadlock_count;
1371
1372 ulint_ctr_1_t n_lock_max_wait_time;
1373+
1374+ /** Number of buffered aio requests submitted */
1375+ ulint_ctr_64_t n_aio_submitted;
1376+
1377+ /** total number of pages that logical-read-ahead missed while doing
1378+ a table scan. The number is the total for all transactions that used a
1379+ non-zero innodb_lra_size. */
1380+ ulint_ctr_64_t n_logical_read_ahead_misses;
1381+ /** total number of pages that logical-read-ahead prefetched. The
1382+ number is the total for all transactions that used a non-zero
1383+ innodb_lra_size. */
1384+ ulint_ctr_64_t n_logical_read_ahead_prefetched;
1385+ /** total number of pages that logical-read-ahead did not need to
1386+ prefetch because these pages were already in the buffer pool. The
1387+ number is the total for all transactions that used a non-zero
1388+ innodb_lra_size. */
1389+ ulint_ctr_64_t n_logical_read_ahead_in_buf_pool;
1390 };
1391
1392 extern const char* srv_main_thread_op_info;
1393@@ -1060,6 +1077,31 @@
1394 ulint innodb_purge_view_trx_id_age; /*!< rw_max_trx_id
1395 - purged view's min trx_id */
1396 #endif /* UNIV_DEBUG */
1397+ ulint innodb_buffered_aio_submitted;
1398+ ulint innodb_logical_read_ahead_misses; /*!< total number of pages that
1399+ logical-read-ahead missed
1400+ during a table scan.
1401+ The number is the total for all
1402+ the transactions that used a
1403+ non-zero
1404+ innodb_lra_size.
1405+ */
1406+ ulint innodb_logical_read_ahead_prefetched; /*!< total number of pages
1407+ that logical-read-ahead
1408+ prefetched. The number is the
1409+ total for all the transactions
1410+ that used a non-zero
1411+ innodb_lra_size.
1412+ */
1413+ ulint innodb_logical_read_ahead_in_buf_pool; /*!< total number of pages
1414+ that logical-read-ahead did not
1415+ need to prefetch because these
1416+ pages were already in the
1417+ buffer pool. The number is the
1418+ total for all transactions that
1419+ used a non-zero
1420+ innodb_lra_size.
1421+ */
1422 };
1423
1424 /** Thread slot in the thread table. */
1425
1426=== modified file 'storage/innobase/include/trx0trx.h'
1427--- storage/innobase/include/trx0trx.h 2014-02-17 11:12:40 +0000
1428+++ storage/innobase/include/trx0trx.h 2014-04-23 10:58:56 +0000
1429@@ -39,6 +39,13 @@
1430 #include "trx0xa.h"
1431 #include "ut0vec.h"
1432 #include "fts0fts.h"
1433+#include "btr0types.h"
1434+
1435+#ifdef TARGET_OS_LINUX
1436+#include <sys/syscall.h>
1437+#include <sys/ioctl.h>
1438+#endif /* TARGET_OS_LINUX */
1439+
1440
1441 /** Dummy session used currently in MySQL interface */
1442 extern sess_t* trx_dummy_sess;
1443@@ -135,7 +142,27 @@
1444 #define trx_start_if_not_started_xa(t) \
1445 trx_start_if_not_started_xa_low((t))
1446 #endif /* UNIV_DEBUG */
1447-
1448+/*************************************************************//**
1449+Creates or frees data structures related to logical-read-ahead
1450+based on the value of lra_size. */
1451+UNIV_INTERN
1452+void
1453+trx_lra_reset(
1454+ trx_t* trx, /*!< in: transaction */
1455+ ulint lra_size, /*!< in: lra_size in MB.
1456+ If 0, the fields that are related
1457+ to logical-read-ahead will be free'd
1458+ if they were initialized. */
1459+ ulint lra_n_node_recs_before_sleep,
1460+ /*!< in: lra_n_node_recs_before_sleep
1461+ is the number of node pointer records
1462+ traversed while holding the index lock
1463+ before releasing the index lock and
1464+ sleeping for a short period of time so
1465+ that the other threads get a chance to
1466+ x-latch the index lock. */
1467+ ulint lra_sleep); /*!< in: lra_sleep is the sleep time in
1468+ milliseconds. */
1469 /*************************************************************//**
1470 Starts the transaction if it is not yet started. */
1471 UNIV_INTERN
1472@@ -650,6 +677,15 @@
1473
1474 #define TRX_MAGIC_N 91118598
1475
1476+/*******************************************************************//**
1477+Helper data structure to store page numbers in an internally-linked hash
1478+table. */
1479+typedef struct page_no_holder_struct page_no_holder_t;
1480+struct page_no_holder_struct {
1481+ ulint page_no;
1482+ page_no_holder_t* hash;
1483+};
1484+
1485 /** The transaction handle
1486
1487 Normally, there is a 1:1 relationship between a transaction handle
1488@@ -804,6 +840,65 @@
1489 150 bytes in the undo log size as then
1490 we skip XA steps */
1491 ulint fake_changes;
1492+ ulint lra_size; /* Total size (in MBs) of the
1493+ pages that will be prefetched by
1494+ logical read ahead. */
1495+ ulint lra_n_pages; /* Number of pages that lra prefetches
1496+ every time. This is computed using
1497+ lra_size and the currently scanned
1498+ table's block size */
1499+ ulint lra_space_id; /* The last space id that the scanning
1500+ transaction accessed. If the scanning
1501+ trx accesses multiple tables, we need
1502+ to reset the data structures that lra
1503+ uses. */
1504+ ulint lra_page_no; /* The last page that was visited
1505+ by the trx. Used by the
1506+ logical-read-ahead algorithm to
1507+ determine if a new prefetch should be
1508+ performed. */
1509+ hash_table_t* lra_ht1;
1510+ hash_table_t* lra_ht2; /* Hash tables store the leaf page
1511+ numbers for the already prefetched
1512+ pages. Each hash table will typically
1513+ have lra_n_pages pages and when the
1514+ scanning trx visits all lra_n_pages
1515+ pages in one of them, we will empty
1516+ that one and prefetch another batch of
1517+ lra_n_pages pages. */
1518+ hash_table_t* lra_ht; /* lra_ht points to lra_ht1 and lra_ht2
1519+ alternately. */
1520+ ulint lra_n_pages_since;/* number of leaf pages visited since
1521+ the last prefetch operation. We require
1522+ that no prefetch be done until the
1523+ scanning trx scans lra_n_pages pages.
1524+ */
1525+ ulint* lra_sort_arr; /* Array used for sorting the page
1526+ numbers before issuing the read
1527+ requests */
1528+ page_no_holder_t* lra_arr1; /* Pre-allocated array of
1529+ page_no_holder objects which are used
1530+ by the logical-read-ahead algorithm for
1531+ lra_ht1. */
1532+ page_no_holder_t* lra_arr2; /* Pre-allocated array of
1533+ page_no_holder objects which are used
1534+ by the logical-read-ahead algorithm for
1535+ lra_ht2. */
1536+ btr_pcur_t* lra_cur; /* The persistent cursor that points
1537+ to the first node pointer record for
1538+ which the associated leaf page is not
1539+ prefetched by LRA. */
1540+ ulint lra_n_node_recs_before_sleep;
1541+ /* lra_n_node_recs_before_sleep
1542+ is the number of node pointer records
1543+ traversed while holding the index lock
1544+ before releasing the index lock and
1545+ sleeping for a short period of time so
1546+ that the other threads get a chance to
1547+ x-latch the index lock. */
1548+ ulint lra_sleep; /* lra_sleep is the sleep time in
1549+ milliseconds. */
1550+ ulint lra_tree_height;
1551 ulint flush_log_later;/* In 2PC, we hold the
1552 prepare_commit mutex across
1553 both phases. In that case, we
1554
1555=== modified file 'storage/innobase/os/os0file.cc'
1556--- storage/innobase/os/os0file.cc 2014-03-03 17:51:33 +0000
1557+++ storage/innobase/os/os0file.cc 2014-04-23 10:58:56 +0000
1558@@ -245,6 +245,16 @@
1559 There is one such event for each
1560 possible pending IO. The size of the
1561 array is equal to n_slots. */
1562+ struct iocb** pending;
1563+ /* Array that buffers aio requests not yet
1564+ submitted. The array length is n_slots.
1565+ It is divided into n_segments segments.
1566+ Pending requests on each segment are buffered
1567+ separately. */
1568+ ulint* count;
1569+ /* Array of length n_segments. Each element
1570+ counts the number of not-yet-submitted aio
1571+ requests on that segment. */
1572 #endif /* LINUX_NATIV_AIO */
1573 };
1574
1575@@ -3926,6 +3936,13 @@
1576 memset(io_event, 0x0, sizeof(*io_event) * n);
1577 array->aio_events = io_event;
1578
1579+ array->pending = static_cast<struct iocb**>(
1580+ ut_malloc(n * sizeof(struct iocb*)));
1581+ memset(array->pending, 0x0, sizeof(struct iocb*) * n);
1582+ array->count = static_cast<ulint*>(
1583+ ut_malloc(n_segments * sizeof(ulint)));
1584+ memset(array->count, 0x0, sizeof(ulint) * n_segments);
1585+
1586 skip_native_aio:
1587 #endif /* LINUX_NATIVE_AIO */
1588 for (ulint i = 0; i < n; i++) {
1589@@ -3982,6 +3999,8 @@
1590 if (srv_use_native_aio) {
1591 ut_free(array->aio_events);
1592 ut_free(array->aio_ctx);
1593+ ut_free(array->pending);
1594+ ut_free(array->count);
1595 }
1596 #endif /* LINUX_NATIVE_AIO */
1597
1598@@ -4605,6 +4624,49 @@
1599
1600 #if defined(LINUX_NATIVE_AIO)
1601 /*******************************************************************//**
1602+Submit the buffered AIO read requests on all segments to the kernel.
1603+@return TRUE on success. */
1604+UNIV_INTERN
1605+ibool
1606+os_aio_linux_dispatch_read_array_submit()
1607+{
1608+ os_aio_array_t* array = os_aio_read_array;
1609+ ulint total_submitted = 0;
1610+ ulint total_count = 0;
1611+ if (!srv_use_native_aio) {
1612+ return TRUE;
1613+ }
1614+ os_mutex_enter(array->mutex);
1615+ /* Submit aio requests buffered on all segments. */
1616+ for (ulint i = 0; i < array->n_segments; i++) {
1617+ ulint count = array->count[i];
1618+ if (count > 0) {
1619+ ulint iocb_index = i * array->n_slots
1620+ / array->n_segments;
1621+ total_count += count;
1622+ total_submitted += io_submit(array->aio_ctx[i], count,
1623+ &(array->pending[iocb_index]));
1624+ }
1625+ }
1626+ /* Reset the aio request buffer. */
1627+ memset(array->pending, 0x0,
1628+ sizeof(struct iocb*) * array->n_slots);
1629+ memset(array->count, 0x0, sizeof(ulint) * array->n_segments);
1630+ os_mutex_exit(array->mutex);
1631+
1632+ srv_stats.n_aio_submitted.add(total_count);
1633+
1634+ /* io_submit returns number of successfully
1635+ queued requests or -errno. */
1636+ if (UNIV_UNLIKELY(total_count != total_submitted)) {
1637+ errno = -total_submitted;
1638+ return(FALSE);
1639+ }
1640+
1641+ return(TRUE);
1642+}
1643+
1644+/*******************************************************************//**
1645 Dispatch an AIO request to the kernel.
1646 @return TRUE on success. */
1647 static
1648@@ -4612,24 +4674,46 @@
1649 os_aio_linux_dispatch(
1650 /*==================*/
1651 os_aio_array_t* array, /*!< in: io request array. */
1652- os_aio_slot_t* slot) /*!< in: an already reserved slot. */
1653+ os_aio_slot_t* slot, /*!< in: an already reserved slot. */
1654+ ibool should_buffer) /*!< in: should buffer the request
1655+ rather than submit. */
1656 {
1657 int ret;
1658- ulint io_ctx_index;
1659+ ulint io_ctx_index = 0;
1660 struct iocb* iocb;
1661+ ulint slots_per_segment;
1662
1663- ut_ad(slot != NULL);
1664+ ut_ad(slot);
1665 ut_ad(array);
1666-
1667 ut_a(slot->reserved);
1668
1669 /* Find out what we are going to work with.
1670 The iocb struct is directly in the slot.
1671 The io_context is one per segment. */
1672
1673+ slots_per_segment = array->n_slots / array->n_segments;
1674 iocb = &slot->control;
1675- io_ctx_index = (slot->pos * array->n_segments) / array->n_slots;
1676-
1677+ io_ctx_index = slot->pos / slots_per_segment;
1678+ if (should_buffer) {
1679+ ulint n;
1680+ os_mutex_enter(array->mutex);
1681+ /* There are array->n_slots elements in array->pending,
1682+ which is divided into array->n_segments areas of equal size.
1683+ The iocbs of each segment are buffered consecutively, in
1684+ arrival order, in that segment's area of the pending array.
1685+ array->count[i] records the number of buffered aio requests
1686+ in the ith segment. */
1687+ n = io_ctx_index * slots_per_segment
1688+ + array->count[io_ctx_index];
1689+ array->pending[n] = iocb;
1690+ array->count[io_ctx_index] ++;
1691+ os_mutex_exit(array->mutex);
1692+ if (array->count[io_ctx_index] == slots_per_segment) {
1693+ return os_aio_linux_dispatch_read_array_submit();
1694+ }
1695+ return(TRUE);
1696+ }
1697+ /* Submit the given request. */
1698 ret = io_submit(array->aio_ctx[io_ctx_index], 1, &iocb);
1699
1700 #if defined(UNIV_AIO_DEBUG)
1701@@ -4689,7 +4773,12 @@
1702 aio operation); ignored if mode is
1703 OS_AIO_SYNC */
1704 ulint space_id,
1705- trx_t* trx)
1706+ trx_t* trx,
1707+ ibool should_buffer) /*!< in: Whether to buffer an aio request.
1708+ AIO read ahead uses this. If you plan to
1709+ use this parameter, make sure you remember
1710+ to call os_aio_linux_dispatch_read_array_submit
1711+ when you are ready to submit all your requests. */
1712 {
1713 os_aio_array_t* array;
1714 os_aio_slot_t* slot;
1715@@ -4802,7 +4891,8 @@
1716 &(slot->control));
1717
1718 #elif defined(LINUX_NATIVE_AIO)
1719- if (!os_aio_linux_dispatch(array, slot)) {
1720+ if (!os_aio_linux_dispatch(array, slot,
1721+ should_buffer)) {
1722 goto err_exit;
1723 }
1724 #endif /* WIN_ASYNC_IO */
1725@@ -4822,7 +4912,7 @@
1726 &(slot->control));
1727
1728 #elif defined(LINUX_NATIVE_AIO)
1729- if (!os_aio_linux_dispatch(array, slot)) {
1730+ if (!os_aio_linux_dispatch(array, slot, FALSE)) {
1731 goto err_exit;
1732 }
1733 #endif /* WIN_ASYNC_IO */
1734
1735=== modified file 'storage/innobase/row/row0purge.cc'
1736--- storage/innobase/row/row0purge.cc 2013-06-20 15:16:00 +0000
1737+++ storage/innobase/row/row0purge.cc 2014-04-23 10:58:56 +0000
1738@@ -187,6 +187,10 @@
1739 return(success);
1740 }
1741
1742+#ifdef UNIV_DEBUG
1743+extern my_bool row_lra_test;
1744+#endif
1745+
1746 /***********************************************************//**
1747 Removes a clustered index record if it has not been modified after the delete
1748 marking.
1749@@ -203,6 +207,11 @@
1750 return(true);
1751 }
1752
1753+#ifdef UNIV_DEBUG
1754+ while (row_lra_test) {
1755+ os_thread_sleep(300000);
1756+ }
1757+#endif
1758 for (ulint n_tries = 0;
1759 n_tries < BTR_CUR_RETRY_DELETE_N_TIMES;
1760 n_tries++) {
1761
1762=== modified file 'storage/innobase/row/row0sel.cc'
1763--- storage/innobase/row/row0sel.cc 2014-03-03 17:51:33 +0000
1764+++ storage/innobase/row/row0sel.cc 2014-04-23 10:58:56 +0000
1765@@ -60,6 +60,8 @@
1766 #include "srv0start.h"
1767 #include "m_string.h" /* for my_sys.h */
1768 #include "my_sys.h" /* DEBUG_SYNC_C */
1769+#include "ut0sort.h"
1770+#include <algorithm>
1771
1772 #include "my_compare.h" /* enum icp_result */
1773
1774@@ -3632,6 +3634,318 @@
1775 return(result);
1776 }
1777
1778+/**********************************************************************//**
1779+Determines the page numbers for the next batch of pages that will be
1780+prefetched for logical read ahead and stores them in the hash_table and
1781+page_no_array. Does not issue read requests. */
1782+static
1783+void
1784+row_read_ahead_logical_low(
1785+ hash_table_t* hash_table, /* in/out: This hash table is emptied and
1786+ then filled with the next batch of page
1787+ numbers that should be prefetched. */
1788+ ulint* n_prefetched_ptr, /* in/out: the value this pointer refers to
1789+ is incremented by the number of pages
1790+ that are added to the hash_table */
1791+ ulint* page_no_array, /* out: the page numbers that will be
1792+ prefetched are stored in this array */
1793+ dict_index_t* index, /* in: index object for the table */
1794+ mtr_t* mtr, /* in: mini transaction object used for
1795+ acquiring and releasing the necessary locks */
1796+ ulint* offsets, /* in/out: temporary storage for offsets */
1797+ mem_heap_t* heap, /* in: temporary memory heap */
1798+ trx_t *trx)
1799+{
1800+ page_no_holder_t* page_no_holder;
1801+ page_no_holder_t* lra_arr;
1802+ ulint page_no;
1803+ ulint n_prefetched = 0;
1804+ rec_t* rec;
1805+ /* empty the hash table because we don't want it to grow to hold
1806+ all leaf page numbers of the table. The concern is not memory, but
1807+ the lookup time. */
1808+ hash_table_clear(hash_table);
1809+ if (hash_table == trx->lra_ht1) {
1810+ lra_arr = trx->lra_arr1;
1811+ } else {
1812+ lra_arr = trx->lra_arr2;
1813+ }
1814+ while (!btr_pcur_is_after_last_in_tree(trx->lra_cur, mtr)
1815+ && n_prefetched < trx->lra_n_pages) {
1816+ if (UNIV_UNLIKELY(trx_is_interrupted(trx))) {
1817+ return;
1818+ }
1819+ rec = btr_pcur_get_rec(trx->lra_cur);
1820+ if (page_rec_is_supremum(rec) || page_rec_is_infimum(rec)) {
1821+ btr_pcur_move_to_next(trx->lra_cur, mtr);
1822+ continue;
1823+ }
1824+ offsets = rec_get_offsets(rec, index, offsets,
1825+ ULINT_UNDEFINED, &heap);
1826+ page_no = btr_node_ptr_get_child_page_no(rec, offsets);
1827+ page_no_holder = &lra_arr[n_prefetched];
1828+ page_no_holder->page_no = page_no;
1829+ page_no_holder->hash = NULL;
1830+ HASH_INSERT(page_no_holder_t, hash, hash_table,
1831+ page_no, page_no_holder);
1832+ btr_pcur_move_to_next(trx->lra_cur, mtr);
1833+ page_no_array[n_prefetched] = page_no;
1834+ ++n_prefetched;
1835+ if (trx->lra_n_node_recs_before_sleep
1836+ && trx->lra_sleep
1837+ &&
1838+ ((n_prefetched % trx->lra_n_node_recs_before_sleep) == 0))
1839+ {
1840+ btr_pcur_store_position(trx->lra_cur, mtr);
1841+ mtr_commit(mtr);
1842+ os_thread_sleep(trx->lra_sleep * 1000);
1843+ mtr_start(mtr);
1844+ btr_pcur_restore_position_func(
1845+ BTR_SEARCH_LEAF, trx->lra_cur, 1,
1846+ __FILE__, __LINE__, mtr);
1847+ }
1848+ }
1849+ *n_prefetched_ptr += n_prefetched;
1850+}
1851+
1852+/*********************************************************************//**
1853+Returns TRUE if the page specified by page_no was prefetched.
1854+@return: TRUE if the page was prefetched before. */
1855+UNIV_INLINE
1856+ibool
1857+row_lra_is_prefetched(
1858+ const trx_t* trx, /* in: trx->lra_ht1 and trx->lra_ht2 are
1859+ probed to see if page page_no was prefetched */
1860+ ulint page_no) /* in: page no for the page that is being checked */
1861+{
1862+ page_no_holder_t* page_no_holder = NULL;
1863+ hash_table_t* other_table;
1864+ HASH_SEARCH(hash, trx->lra_ht, page_no, page_no_holder_t*,
1865+ page_no_holder, ut_a(1),
1866+ page_no_holder->page_no == page_no);
1867+ if (page_no_holder) {
1868+ return TRUE;
1869+ }
1870+ other_table = trx->lra_ht == trx->lra_ht1 ? trx->lra_ht2
1871+ : trx->lra_ht1;
1872+ HASH_SEARCH(hash, other_table, page_no, page_no_holder_t*,
1873+ page_no_holder, ut_a(1),
1874+ page_no_holder->page_no == page_no);
1875+ if (page_no_holder) {
1876+ return TRUE;
1877+ }
1878+ return FALSE;
1879+}
1880+
1881+#ifdef UNIV_DEBUG
1882+my_bool row_lra_test = FALSE;
1883+#endif
1884+
1885+/*********************************************************************//**
1886+This function submits io requests for pages that are logical successors to
1887+the page that pcur points to. It is meant to be called during a sequential
1888+scan to prefetch pages and speed up the scan. The number of pages that are
1889+prefetched is determined by the session variable innodb_lra_size
1890+divided by the block size of the table. It is ok to call this function
1891+successively even if pcur did not move to the next page because this function
1892+keeps track of the page numbers it prefetched and won't duplicate io requests.
1893+This function may temporarily release the block latches held by pcur and
1894+re-acquire them.
1895+@return: TRUE if the function released the block latch and re-acquired it and
1896+now the cursor pcur points to a new record that must be processed by the
1897+caller. */
1898+static
1899+ibool
1900+row_read_ahead_logical(
1901+ btr_pcur_t* pcur, /* in/out: Cursor from which the current page
1902+ number is obtained. Cursor's position may change
1903+ and this is indicated in the return value. */
1904+ dict_index_t* index, /* in: index object for the table */
1905+ mtr_t* mtr, /* in: mini-transaction. May be committed and
1906+ restarted */
1907+ ulint* offsets, /* in: temporary storage for offsets */
1908+ mem_heap_t** heap_ptr, /* in/out: If *heap_ptr is not NULL then this
1909+ heap is used for memory allocations, otherwise
1910+ a new heap is created and stored in *heap_ptr.
1911+ The caller is responsible for freeing the heap */
1912+ trx_t *trx)
1913+{
1914+ buf_block_t* block = btr_cur_get_block(&pcur->btr_cur);
1915+ ibool same_user_rec;
1916+ mem_heap_t* heap;
1917+ ib_int64_t tablespace_version;
1918+ ulint page_no = buf_block_get_page_no(block);
1919+ ulint space = buf_block_get_space(block);
1920+ ulint zip_size = buf_block_get_zip_size(block);
1921+ rec_t* rec;
1922+ dtuple_t* tuple;
1923+ dberr_t err;
1924+ ulint num_prefetched = 0;
1925+ ulint num_read_requests = 0;
1926+ ulint i;
1927+ ulint root_page_no;
1928+ buf_block_t* root_block;
1929+
1930+ if (!trx->lra_size) {
1931+ return FALSE;
1932+ }
1933+ if (trx->lra_space_id == space && trx->lra_tree_height <= 1) {
1934+ return FALSE;
1935+ }
1936+ if (trx->lra_space_id == space && trx->lra_page_no == page_no) {
1937+ /* the cursor is on the same page as the last time this
1938+ function was called. */
1939+ return FALSE;
1940+ }
1941+
1942+ /* Set the last page number to page_no only if we are scanning the
1943+ same table. */
1944+ if (trx->lra_space_id == space) {
1945+ trx->lra_page_no = page_no;
1946+ }
1947+
1948+ /* In order not to prefetch extraneously, we do not issue prefetches
1949+ until the scan processes lra_n_pages pages. This may cause misses if
1950+ there are too many splits/merges going on but in such a case it
1951+ would be hard to guess which pages to prefetch anyway
1952+ because it is equivalent to guessing which pages would split
1953+ (or merge). */
1954+ if (trx->lra_space_id == space
1955+ && ++trx->lra_n_pages_since <= trx->lra_n_pages) {
1956+ if (!row_lra_is_prefetched(trx, page_no)) {
1957+ srv_stats.n_logical_read_ahead_misses.add(1);
1958+ }
1959+ return FALSE;
1960+ }
1961+ rec = page_rec_get_next(page_get_infimum_rec(
1962+ buf_block_get_frame(block)));
1963+ if (!rec || page_rec_is_supremum(rec)) {
1964+ /* Do not start prefetching because we can not get the
1965+ parent page of an empty page */
1966+ trx->lra_page_no = 0;
1967+ return FALSE;
1968+ }
1969+ tablespace_version = fil_space_get_version(space);
1970+
1971+ if (!*heap_ptr) {
1972+ *heap_ptr = mem_heap_create(100);
1973+ }
1974+ heap = *heap_ptr;
1975+ tuple = dict_index_build_node_ptr(index, rec, 0, heap, 1);
1976+ trx->lra_n_pages_since = 0;
1977+
1978+ btr_pcur_store_position(pcur, mtr);
1979+ mtr_commit(mtr);
1980+#ifdef UNIV_DEBUG
1981+ if (row_lra_test && trx->lra_space_id == space) {
1982+ row_lra_test = FALSE;
1983+ os_thread_sleep(1000000);
1984+ }
1985+#endif
1986+ mtr_start(mtr);
1987+
1988+ if (UNIV_LIKELY(trx->lra_space_id == space)) {
1989+#ifdef UNIV_DEBUG
1990+ memset(trx->lra_sort_arr, 0,
1991+ 2 * trx->lra_n_pages * sizeof(ulint));
1992+#endif
1993+ btr_pcur_restore_position_func(
1994+ BTR_SEARCH_LEAF, trx->lra_cur, 1,
1995+ __FILE__, __LINE__, mtr);
1996+ row_read_ahead_logical_low(
1997+ trx->lra_ht, &num_prefetched,
1998+ trx->lra_sort_arr, index, mtr,
1999+ offsets, heap, trx);
2000+ if (trx->lra_ht == trx->lra_ht1) {
2001+ trx->lra_ht = trx->lra_ht2;
2002+ } else {
2003+ trx->lra_ht = trx->lra_ht1;
2004+ }
2005+ } else {
2006+ /* The transaction started to scan a new table. Set
2007+ the values for lra_space_id and lra_n_pages based on the
2008+ new table */
2009+ trx_lra_reset(trx,
2010+ trx->lra_size,
2011+ trx->lra_n_node_recs_before_sleep,
2012+ trx->lra_sleep);
2013+ trx->lra_space_id = space;
2014+ trx->lra_n_pages = (trx->lra_size << 20L)
2015+ / (zip_size ? zip_size : UNIV_PAGE_SIZE);
2016+ trx->lra_page_no = page_no;
2017+ mtr_s_lock(dict_index_get_lock(index), mtr);
2018+ /* Get root page to get the B-tree depth */
2019+ root_page_no = dict_index_get_page(index);
2020+ root_block = buf_page_get_gen(space, zip_size, root_page_no,
2021+ RW_NO_LATCH, NULL, BUF_GET,
2022+ __FILE__, __LINE__, mtr);
2023+ trx->lra_tree_height = btr_page_get_level(
2024+ buf_block_get_frame(root_block),
2025+ mtr) + 1;
2026+ if (trx->lra_tree_height > 1) {
2027+#ifdef UNIV_DEBUG
2028+ memset(trx->lra_sort_arr, 0,
2029+ 2 * trx->lra_n_pages * sizeof(ulint));
2030+#endif
2031+ mtr_commit(mtr);
2032+ mtr_start(mtr);
2033+ btr_pcur_open_low(index, 1, tuple, PAGE_CUR_LE,
2034+ BTR_SEARCH_LEAF, trx->lra_cur,
2035+ __FILE__, __LINE__, mtr);
2036+ row_read_ahead_logical_low(
2037+ trx->lra_ht1, &num_prefetched,
2038+ trx->lra_sort_arr, index, mtr,
2039+ offsets, heap, trx);
2040+ row_read_ahead_logical_low(
2041+ trx->lra_ht2, &num_prefetched,
2042+ &trx->lra_sort_arr[num_prefetched], index, mtr,
2043+ offsets, heap, trx);
2044+ }
2045+ trx->lra_ht = trx->lra_ht1;
2046+ }
2047+ if (trx->lra_tree_height > 1)
2048+ btr_pcur_store_position(trx->lra_cur, mtr);
2049+ mtr_commit(mtr);
2050+ if (num_prefetched) {
2051+ /* We sort the page numbers before issuing read requests for two
2052+ reasons:
2053+ 1- The block layer in the Linux kernel currently sorts the read
2054+ requests and merges them, but there is a possibility that
2055+ this algorithm does not detect the sequential read and
2056+ coalesce the iops.
2057+ 2- Even if the block layer algorithm is perfect, the
2058+ asynchronous read array size may be small, in which case
2059+ the read requests will have a lower chance of being
2060+ coalesced by the block layer.
2061+
2062+ Sorting is cheap in comparison to the iops that are about to
2063+ be done so we always sort. */
2064+ std::sort(trx->lra_sort_arr, trx->lra_sort_arr + num_prefetched);
2065+ /* TODO(nizamordulu): Here we call buf_read_page_low() which
2066+ acquires the related buffer pool shard lock and checks if the
2067+ page is in that shard for each page. This could be made more
2068+ efficient if we checked for all pages at once or batched the
2069+ check for multiple pages after acquiring the related latch. */
2070+ for (i = 0; i < num_prefetched; ++i) {
2071+ num_read_requests += buf_read_page_low(
2072+ &err, FALSE,
2073+ BUF_READ_ANY_PAGE | OS_AIO_SIMULATED_WAKE_LATER,
2074+ space, zip_size, FALSE, tablespace_version,
2075+ trx->lra_sort_arr[i], trx, TRUE);
2076+ }
2077+#ifdef LINUX_NATIVE_AIO
2078+ os_aio_linux_dispatch_read_array_submit();
2079+#endif
2080+ srv_stats.n_logical_read_ahead_prefetched.add(
2081+ num_read_requests);
2082+ srv_stats.n_logical_read_ahead_in_buf_pool.add(
2083+ num_prefetched - num_read_requests);
2084+ }
2085+ mtr_start(mtr);
2086+ return sel_restore_position_for_mysql(&same_user_rec, BTR_SEARCH_LEAF,
2087+ pcur, TRUE, mtr);
2088+}
2089+
2090 /********************************************************************//**
2091 Searches for rows in the database. This is used in the interface to
2092 MySQL. This function opens a cursor, and also implements fetch next
2093@@ -5014,6 +5328,11 @@
2094 }
2095
2096 if (moves_up) {
2097+ if (trx
2098+ && row_read_ahead_logical(
2099+ pcur, index, &mtr, offsets, &heap, trx)) {
2100+ goto rec_loop;
2101+ }
2102 if (UNIV_UNLIKELY(!btr_pcur_move_to_next(pcur, &mtr))) {
2103 not_moved:
2104 btr_pcur_store_position(pcur, &mtr);
2105
2106=== modified file 'storage/innobase/srv/srv0srv.cc'
2107--- storage/innobase/srv/srv0srv.cc 2014-03-03 17:51:33 +0000
2108+++ storage/innobase/srv/srv0srv.cc 2014-04-23 10:58:56 +0000
2109@@ -1863,6 +1863,15 @@
2110 }
2111 #endif /* UNIV_DEBUG */
2112
2113+ export_vars.innodb_buffered_aio_submitted =
2114+ srv_stats.n_aio_submitted;
2115+ export_vars.innodb_logical_read_ahead_misses =
2116+ srv_stats.n_logical_read_ahead_misses;
2117+ export_vars.innodb_logical_read_ahead_prefetched =
2118+ srv_stats.n_logical_read_ahead_prefetched;
2119+ export_vars.innodb_logical_read_ahead_in_buf_pool =
2120+ srv_stats.n_logical_read_ahead_in_buf_pool;
2121+
2122 mutex_exit(&srv_innodb_monitor_mutex);
2123 }
2124
2125
2126=== modified file 'storage/innobase/trx/trx0trx.cc'
2127--- storage/innobase/trx/trx0trx.cc 2014-02-17 11:12:40 +0000
2128+++ storage/innobase/trx/trx0trx.cc 2014-04-23 10:58:56 +0000
2129@@ -48,6 +48,7 @@
2130 #include "ha_prototypes.h"
2131 #include "srv0mon.h"
2132 #include "ut0vec.h"
2133+#include "btr0pcur.h"
2134
2135 #include<set>
2136
2137@@ -212,6 +213,121 @@
2138 trx_sys->descr_n_used--;
2139 }
2140
2141+/*************************************************************//**
2142+Creates or frees data structures related to logical-read-ahead,
2143+based on the value of lra_size. */
2144+UNIV_INTERN
2145+void
2146+trx_lra_reset(
2147+ trx_t* trx, /*!< in: transaction */
2148+ ulint lra_size, /*!< in: lra_size in MB.
2149+ If 0, the fields that are related
2150+ to logical-read-ahead will be freed
2151+ if they were initialized. */
2152+ ulint lra_n_node_recs_before_sleep,
2153+ /*!< in: lra_n_node_recs_before_sleep
2154+ is the number of node pointer records
2155+ traversed while holding the index lock
2156+ before releasing the index lock and
2157+ sleeping for a short period of time so
2158+ that the other threads get a chance to
2159+ x-latch the index lock. */
2160+ ulint lra_sleep) /*!< in: lra_sleep is the sleep time in
2161+ milliseconds. */
2162+{
2163+#ifndef TARGET_OS_LINUX
2164+ if (lra_size) {
2165+ ib_logf(IB_LOG_LEVEL_WARN,
2166+ "Logical read ahead is supported only on linux.");
2167+ lra_size = 0;
2168+ }
2169+#else /* TARGET_OS_LINUX */
2170+ if (!srv_use_native_aio && lra_size) {
2171+ ib_logf(IB_LOG_LEVEL_WARN,
2172+ "In order to use logical read ahead please enable "
2173+ "native aio by setting innodb_use_native_aio=1 in "
2174+ "my.cnf and restarting the server.");
2175+ lra_size = 0;
2176+ }
2177+#endif /* TARGET_OS_LINUX */
2178+ trx->lra_size = lra_size;
2179+ trx->lra_space_id = 0;
2180+ trx->lra_n_pages = 0;
2181+ trx->lra_n_pages_since = 0;
2182+ trx->lra_page_no = 0;
2183+ trx->lra_n_node_recs_before_sleep = lra_n_node_recs_before_sleep;
2184+ trx->lra_sleep = lra_sleep;
2185+ trx->lra_tree_height = 0;
2186+ if (lra_size) {
2187+ ulint n_pages_max =
2188+ (lra_size << 20L) / UNIV_ZIP_SIZE_MIN;
2189+ ulint mem = n_pages_max * (2 * sizeof(ulint)
2190+ + 2 * sizeof(page_no_holder_t))
2191+ + sizeof(btr_pcur_t);
2192+ if (trx->lra_ht) {
2193+ ut_a(trx->lra_ht1);
2194+ ut_a(trx->lra_ht2);
2195+ ut_a(trx->lra_sort_arr);
2196+ ut_a(trx->lra_cur);
2197+ hash_table_clear(trx->lra_ht1);
2198+ hash_table_clear(trx->lra_ht2);
2199+ trx->lra_ht = trx->lra_ht1;
2200+#ifdef UNIV_DEBUG
2201+ /* following resets lra_sort_arr,
2202+ * lra_arr1, lra_arr2, and lra_cur.
2203+ */
2204+ memset(trx->lra_sort_arr, 0, mem);
2205+#endif
2206+ btr_pcur_init(trx->lra_cur);
2207+ } else {
2208+ byte* alloc;
2209+ ut_a(!trx->lra_ht1);
2210+ ut_a(!trx->lra_ht2);
2211+ ut_a(!trx->lra_sort_arr);
2212+ trx->lra_ht1 = hash_create(16384);
2213+ trx->lra_ht2 = hash_create(16384);
2214+ trx->lra_ht = trx->lra_ht1;
2215+ alloc = (byte*)ut_malloc(mem);
2216+#ifdef UNIV_DEBUG
2217+ memset(alloc, 0, mem);
2218+#endif
2219+ trx->lra_sort_arr = (ulint*)alloc;
2220+ alloc += 2 * sizeof(ulint) * n_pages_max;
2221+ trx->lra_arr1 = (page_no_holder_t*) alloc;
2222+ alloc += sizeof(page_no_holder_t) * n_pages_max;
2223+ trx->lra_arr2 = (page_no_holder_t*) alloc;
2224+ alloc += sizeof(page_no_holder_t) * n_pages_max;
2225+ trx->lra_cur = (btr_pcur_t*) alloc;
2226+ btr_pcur_init(trx->lra_cur);
2227+
2228+ }
2229+ } else {
2230+ if (trx->lra_ht) {
2231+ ut_a(trx->lra_ht1);
2232+ ut_a(trx->lra_ht2);
2233+ ut_a(trx->lra_sort_arr);
2234+ hash_table_free(trx->lra_ht1);
2235+ hash_table_free(trx->lra_ht2);
2236+ btr_pcur_close(trx->lra_cur);
2237+ ut_free(trx->lra_sort_arr);
2238+ trx->lra_sort_arr = NULL;
2239+ trx->lra_ht = NULL;
2240+ trx->lra_ht1 = NULL;
2241+ trx->lra_ht2 = NULL;
2242+ trx->lra_arr1 = NULL;
2243+ trx->lra_arr2 = NULL;
2244+ trx->lra_cur = NULL;
2245+ } else {
2246+ ut_a(!trx->lra_ht1);
2247+ ut_a(!trx->lra_ht2);
2248+ ut_a(!trx->lra_sort_arr);
2249+ ut_a(!trx->lra_cur);
2250+ ut_a(!trx->lra_arr1);
2251+ ut_a(!trx->lra_arr2);
2252+ }
2253+ }
2254+}
2255+
2256 /****************************************************************//**
2257 Creates and initializes a transaction object. It must be explicitly
2258 started with trx_start_if_not_started() before using it. The default
2259@@ -294,6 +410,12 @@
2260 trx->lock.table_locks = ib_vector_create(
2261 heap_alloc, sizeof(void**), 32);
2262
2263+ trx->lra_ht = NULL;
2264+ trx->lra_cur = NULL;
2265+ trx->lra_ht1 = NULL;
2266+ trx->lra_ht2 = NULL;
2267+ trx_lra_reset(trx, 0, 0, 0);
2268+
2269 return(trx);
2270 }
2271
2272@@ -388,6 +510,7 @@
2273 }
2274
2275 mutex_free(&trx->mutex);
2276+ trx_lra_reset(trx, 0, 0, 0);
2277
2278 read_view_free(trx->prebuilt_view);
2279
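
The per-segment buffering that the os0file.cc hunks add to os_aio_linux_dispatch() can be modelled on its own. The following is only a minimal sketch of the index arithmetic, not code from the branch: PendingBuffer, buffer() and flush() are hypothetical names, plain integer request ids stand in for struct iocb*, and the mutex and the io_submit() calls are left out so that it builds without libaio.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the pending/count pair added to os_aio_array_t.
// Integer request ids replace struct iocb* so the sketch needs no libaio.
struct PendingBuffer {
    std::size_t n_slots;
    std::size_t n_segments;
    std::vector<int> pending;        // n_slots cells, -1 means empty
    std::vector<std::size_t> count;  // one counter per segment

    PendingBuffer(std::size_t slots, std::size_t segments)
        : n_slots(slots), n_segments(segments),
          pending(slots, -1), count(segments, 0) {}

    std::size_t slots_per_segment() const { return n_slots / n_segments; }

    // Mirrors io_ctx_index = slot->pos / slots_per_segment.
    std::size_t segment_of(std::size_t slot_pos) const {
        return slot_pos / slots_per_segment();
    }

    // Stores a request in the next free cell of its segment's area and
    // returns true when that segment's area has become full.
    bool buffer(std::size_t slot_pos, int request_id) {
        const std::size_t seg = segment_of(slot_pos);
        const std::size_t n = seg * slots_per_segment() + count[seg];
        pending[n] = request_id;
        ++count[seg];
        return count[seg] == slots_per_segment();
    }

    // Returns the number of buffered requests and clears the buffer,
    // standing in for the io_submit() loop over all segments.
    std::size_t flush() {
        std::size_t total = 0;
        for (std::size_t c : count) total += c;
        pending.assign(n_slots, -1);
        count.assign(n_segments, 0);
        return total;
    }
};
```

With 8 slots and 2 segments, a request on slot 5 belongs to segment 1 and lands in cell 4, the first cell of that segment's area. A caller that gets true back from buffer() would flush, which is the point at which the patch calls os_aio_linux_dispatch_read_array_submit().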
