~thopiekar/linux/+git/linux-stable:linux-4.7.y

Last commit made on 2016-10-22
Get this branch:
git clone -b linux-4.7.y https://git.launchpad.net/~thopiekar/linux/+git/linux-stable

Branch merges

Branch information

Name:
linux-4.7.y
Repository:
lp:~thopiekar/linux/+git/linux-stable

Recent commits

b3afc45... by Greg Kroah-Hartman <email address hidden>

Linux 4.7.10

9442f9a... by Glauber Costa <email address hidden>

cfq: fix starvation of asynchronous writes

commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.

While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.

The problem seems to happen because XFS may block on pending metadata updates
under certain circumnstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
high "Dispatch wait" times, on the dozens of seconds range and consistent with
the open() times I have saw in that run.

Still from the blktrace output, we can after searching a bit, identify the
request that wasn't dispatched:

  8,0 11 152 81.092472813 804 A WM 141698288 + 8 <- (8,1) 141696240
  8,0 11 153 81.092472889 804 Q WM 141698288 + 8 [xfsaild/sda1]
  8,0 11 154 81.092473207 804 G WM 141698288 + 8 [xfsaild/sda1]
  8,0 11 206 81.092496118 804 I WM 141698288 + 8 ( 22911) [xfsaild/sda1]
  <==== 'I' means Inserted (into the IO scheduler) ===================================>
  8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]
  <==== Only 15s later the CFQ scheduler dispatches the request ======================>

As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):

  8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
  8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
  8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
  8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
  8,0 1 0 81.117914755 0 m N cfq767A / resid=40
  8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
  8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.

Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.

Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.

When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such situation before we have the chance to execute anything (as seen in the
trace above).

This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired)

Care is taken during preempt to still allow rt requests to preempt us
regardless.

Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):
 value |-------------------------------------------------- count
     0 | 11
     1 |@@@@ 161
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
     4 |@ 54
     8 | 36
    16 | 7
    32 | 0
    64 | 0
       ~
  1024 | 0
  2048 | 0
  4096 | 1
  8192 | 1
 16384 | 2
 32768 | 0
 65536 | 0
131072 | 1
262144 | 0
524288 | 0

Signed-off-by: Glauber Costa <email address hidden>
CC: Jens Axboe <email address hidden>
CC: <email address hidden>
CC: <email address hidden>
Signed-off-by: Glauber Costa <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

6a9e153... by David Howells

cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]

commit a818101d7b92e76db2f9a597e4830734767473b9 upstream.

An NULL-pointer dereference happens in cachefiles_mark_object_inactive()
when it tries to read i_blocks so that it can tell the cachefilesd daemon
how much space it's making available.

The problem is that cachefiles_drop_object() calls
cachefiles_mark_object_inactive() after calling cachefiles_delete_object()
because the object being marked active staves off attempts to (re-)use the
file at that filename until after it has been deleted. This means that
d_inode is NULL by the time we come to try to access it.

To fix the problem, have the caller of cachefiles_mark_object_inactive()
supply the number of blocks freed up.

Without this, the following oops may occur:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
...
CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G I ------------ 3.10.0-470.el7.x86_64 #1
Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000
RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
RSP: 0018:ffff8800b77c3d70 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034
RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8
RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000
R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600
R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498
FS: 0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0
 ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658
 ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600
Call Trace:
 [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles]
 [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache]
 [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache]
 [<ffffffff810a605b>] process_one_work+0x17b/0x470
 [<ffffffff810a6e96>] worker_thread+0x126/0x410
 [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460
 [<ffffffff810ae64f>] kthread+0xcf/0xe0
 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff81695418>] ret_from_fork+0x58/0x90
 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140

The oopsing code shows:

 callq 0xffffffff810af6a0 <wake_up_bit>
 mov 0xf8(%r12),%rax
 mov 0x30(%rax),%rax
 mov 0x98(%rax),%rax <---- oops here
 lock add %rax,0x130(%rbx)

where this is:

 d_backing_inode(object->dentry)->i_blocks

Fixes: a5b3a80b899bda0f456f1246c4c5a1191ea01519 (CacheFiles: Provide read-and-reset release counters for cachefilesd)
Reported-by: Jianhong Yin <email address hidden>
Signed-off-by: David Howells <email address hidden>
Reviewed-by: Jeff Layton <email address hidden>
Reviewed-by: Steve Dickson <email address hidden>
Signed-off-by: Al Viro <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

3f38ae1... by Miklos Szeredi <email address hidden>

vfs: move permission checking into notify_change() for utimes(NULL)

commit f2b20f6ee842313a0d681dbbf7f87b70291a6a3b upstream.

This fixes a bug where the permission was not properly checked in
overlayfs. The testcase is ltp/utimensat01.

It is also cleaner and safer to do the permission checking in the vfs
helper instead of the caller.

This patch introduces an additional ia_valid flag ATTR_TOUCH (since
touch(1) is the most obvious user of utimes(NULL)) that is passed into
notify_change whenever the conditions for this special permission checking
mode are met.

Reported-by: Aihua Zhang <email address hidden>
Signed-off-by: Miklos Szeredi <email address hidden>
Tested-by: Aihua Zhang <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

4530bf8... by Marcelo Ricardo Leitner <email address hidden>

dlm: free workqueues after the connections

commit 3a8db79889ce16930aff19b818f5b09651bb7644 upstream.

After backporting commit ee44b4bc054a ("dlm: use sctp 1-to-1 API")
series to a kernel with an older workqueue which didn't use RCU yet, it
was noticed that we are freeing the workqueues in dlm_lowcomms_stop()
too early as free_conn() will try to access that memory for canceling
the queued works if any.

This issue was introduced by commit 0d737a8cfd83 as before it such
attempt to cancel the queued works wasn't performed, so the issue was
not present.

This patch fixes it by simply inverting the free order.

Fixes: 0d737a8cfd83 ("dlm: fix race while closing connections")
Signed-off-by: Marcelo Ricardo Leitner <email address hidden>
Signed-off-by: David Teigland <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

c2a55c3... by Marcelo Cerri

crypto: vmx - Fix memory corruption caused by p8_ghash

commit 80da44c29d997e28c4442825f35f4ac339813877 upstream.

This patch changes the p8_ghash driver to use ghash-generic as a fixed
fallback implementation. This allows the correct value of descsize to be
defined directly in its shash_alg structure and avoids problems with
incorrect buffer sizes when its state is exported or imported.

Reported-by: Jan Stancek <email address hidden>
Fixes: cc333cd68dfa ("crypto: vmx - Adding GHASH routines for VMX module")
Signed-off-by: Marcelo Cerri <email address hidden>
Signed-off-by: Herbert Xu <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

826ed40... by Marcelo Cerri

crypto: ghash-generic - move common definitions to a new header file

commit a397ba829d7f8aff4c90af3704573a28ccd61a59 upstream.

Move common values and types used by ghash-generic to a new header file
so drivers can directly use ghash-generic as a fallback implementation.

Fixes: cc333cd68dfa ("crypto: vmx - Adding GHASH routines for VMX module")
Signed-off-by: Marcelo Cerri <email address hidden>
Signed-off-by: Herbert Xu <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

824bd2f... by Jan Kara <email address hidden>

ext4: unmap metadata when zeroing blocks

commit 9b623df614576680cadeaa4d7e0b5884de8f7c17 upstream.

When zeroing blocks for DAX allocations, we also have to unmap aliases
in the block device mappings. Otherwise writeback can overwrite zeros
with stale data from block device page cache.

Signed-off-by: Jan Kara <email address hidden>
Signed-off-by: Theodore Ts'o <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

9de4a46... by gmail <email address hidden>

ext4: release bh in make_indexed_dir

commit e81d44778d1d57bbaef9e24c4eac7c8a7a401d40 upstream.

The commit 6050d47adcad: "ext4: bail out from make_indexed_dir() on
first error" could end up leaking bh2 in the error path.

[ Also avoid renaming bh2 to bh, which just confuses things --tytso ]

Signed-off-by: yangsheng <email address hidden>
Signed-off-by: Theodore Ts'o <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>

e54ffbb... by Ross Zwisler <email address hidden>

ext4: allow DAX writeback for hole punch

commit cca32b7eeb4ea24fa6596650e06279ad9130af98 upstream.

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <email address hidden>
Reviewed-by: Jan Kara <email address hidden>
Signed-off-by: Theodore Ts'o <email address hidden>
Signed-off-by: Greg Kroah-Hartman <email address hidden>