~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial:hibernate-master

Last commit made on 2019-05-24
Get this branch:
git clone -b hibernate-master https://git.launchpad.net/~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial
Only Kamal Mostafa can upload to this branch.

Branch merges

Branch information

Name:
hibernate-master
Repository:
lp:~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial

Recent commits

f9b66a4... by Kamal Mostafa

TEST KERNEL 4.4.0-1084.94+hibernate20190524

2b4e1f4... by Andrea Righi

UBUNTU SAUCE [aws]: mm: aggressive swapoff

Improve swapoff performance at the expense of overall system
performance by not sleeping on lock_page() in try_to_unuse().

This allows try_to_unuse() to trigger a read_swap_cache_async() on all the
swapped-out pages and greatly increases swapoff performance (at the risk
of completely killing interactive performance).
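
A minimal sketch of the idea, assuming the change boils down to replacing
the blocking page lock in the try_to_unuse() scan with a trylock (a
hypothetical illustration, not the verbatim SAUCE hunk):

 /* Stock behaviour: the scan sleeps until each swap-cache page is
  * unlocked, stalling behind the swap-in I/O it just issued. */
 lock_page(page);
 /* ... unuse the page ... */
 unlock_page(page);

 /* Aggressive variant: never sleep on the page lock. Skip busy pages
  * and revisit them on a later pass, so the loop keeps queueing
  * read_swap_cache_async() requests instead of blocking. */
 if (!trylock_page(page)) {
         put_page(page);         /* drop the reference taken by the lookup */
         continue;
 }
 /* ... unuse the page ... */
 unlock_page(page);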

Test case: swapoff called on a swap file containing about 32G of data in
a VM with 8 cpus, 64G RAM.

Result:

 - stock kernel:

 # time swapoff /swap-hibinit

 real 40m13.072s
 user 0m0.000s
 sys 17m18.971s

 - with this patch applied:

 # time swapoff /swap-hibinit

 real 1m59.496s
 user 0m0.000s
 sys 0m21.370s

Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

09ad3a5... by Hugh Dickins <email address hidden>

mm: swapoff: shmem_unuse() stop eviction without igrab()

The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by GPF.

Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().

Instead, have shmem_unuse() just raise a "stop_eviction" count in the
shmem-specific inode, and have shmem_evict_inode() wait for that to go
down to 0.
Call it "stop_eviction" rather than "swapoff_busy" because it can be put
to use for others later (huge tmpfs patches expect to use it).

That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.
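
A rough sketch of that mechanism as it appears in the mainline commit,
using the wait_var_event() API backported below (field names and argument
lists may differ slightly in this 4.4 tree):

 /* shmem_unuse(): pin the inode against eviction while the
  * shmem_swaplist_mutex is dropped to swap its pages back in. */
 atomic_inc(&info->stop_eviction);
 mutex_unlock(&shmem_swaplist_mutex);
 error = shmem_unuse_inode(&info->vfs_inode, type);
 mutex_lock(&shmem_swaplist_mutex);
 if (atomic_dec_and_test(&info->stop_eviction))
         wake_up_var(&info->stop_eviction);

 /* shmem_evict_inode(): wait until no swapoff is using this inode. */
 wait_var_event(&info->stop_eviction, !atomic_read(&info->stop_eviction));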

[<email address hidden>: remove incorrect list_del()]
  Link: http://<email address hidden>
Link: http://<email address hidden>
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <email address hidden>
Cc: "Alex Xu (Hello71)" <email address hidden>
Cc: Huang Ying <email address hidden>
Cc: Kelley Nielsen <email address hidden>
Cc: Konstantin Khlebnikov <email address hidden>
Cc: Rik van Riel <email address hidden>
Cc: Vineeth Pillai <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>
(backported from commit af53d3e9e04024885de5b4fda51e5fa362ae2bd8)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

f3b016d... by Peter Zijlstra <email address hidden>

sched/wait: Introduce wait_var_event()

As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.

The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW, it's much more widely
applicable/flexible.

It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.

The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This address is
hashed to obtain a wait_queue_head from the bit_wait_table.

Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.

The following are (mostly) identical replacements:

 wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
 wake_up_atomic_t(&my_atomic);

 wait_var_event(&my_atomic, !atomic_read(&my_atomic));
 wake_up_var(&my_atomic);

The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little consequence, since most callers are
already conditional on atomic_dec_and_test() and the ones that are
not are trivial to make so.
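
For callers of wake_up_var() that do want the old "last one out"
semantics, the usual pattern is to gate the wakeup on
atomic_dec_and_test(); a small sketch with a hypothetical refcount field:

 /* Release side: only the final put issues the (unconditional) wakeup. */
 if (atomic_dec_and_test(&obj->refs))
         wake_up_var(&obj->refs);

 /* Wait side: the condition expression spells out what used to be the
  * implicit atomic_read() == 0 check of wait_on_atomic_t(). */
 wait_var_event(&obj->refs, !atomic_read(&obj->refs));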

Tested-by: Dan Williams <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: David Howells <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(backported from commit 6b2bb7265f0b62605e8caee3613449ed0db270b9)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

bef6cfc... by Ingo Molnar <email address hidden>

sched/wait: Standardize wait_bit_queue naming

So wait-bit-queue variables are often named:

 struct wait_bit_queue *q

... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!

They are misnomers in two ways:

 - the 'wait_bit_queue' leaves open the question of whether
   it's an entry or a head

 - the 'q' parameter and local variable naming falsely implies
   that it's a 'queue' - while it's an entry.

This resulted in sometimes confusing cases such as:

 finish_wait(wq, &q->wait);

where the 'q' is not a wait-queue head, but a wait-bit-queue entry.

So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:

 struct wait_bit_queue => struct wait_bit_queue_entry
 q => wbq_entry

Which makes it all much clearer:

 struct wait_bit_queue_entry *wbq_entry

... and turns the former confusing piece of code into:

 finish_wait(wq_head, &wbq_entry->wq_entry);

which IMHO makes it immediately clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.

I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.

Cc: Linus Torvalds <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(backported from commit 76c85ddc4695bb7b8209bfeff11f5156088f9197)
Signed-off-by: Kamal Mostafa <email address hidden>

5a2f668... by Oleg Nesterov <email address hidden>

sched/wait: Introduce init_wait_entry()

The partial initialization of wait_queue_t in prepare_to_wait_event() looks
ugly. This was done to shrink .text, but we can simply add a new helper
which does the full initialization and shrink the compiled code a bit more.

Also, this way prepare_to_wait_event() can have more users. In particular we
are ready to remove the signal_pending_state() checks from wait_bit_action_f
helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().
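
For reference, a sketch of such a helper against the pre-4.13
wait_queue_t layout used in this tree (the backported version may differ
in detail):

 void init_wait_entry(wait_queue_t *wait, int flags)
 {
         wait->flags = flags;
         wait->private = current;
         wait->func = autoremove_wake_function;
         INIT_LIST_HEAD(&wait->task_list);
 }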

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 0176beaffbe9ed627b6a4dfa61d640f1a848086f)
Signed-off-by: Kamal Mostafa <email address hidden>

2b0b134... by Oleg Nesterov <email address hidden>

sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()

__wait_on_bit_lock() doesn't need abort_exclusive_wait() either. Right
now it can't use prepare_to_wait_event() (see the next change), but
it can do the additional finish_wait() if action() fails.

abort_exclusive_wait() no longer has callers, remove it.

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit eaf9ef52241b545fe63621266bfc6fd8b06559ff)
Signed-off-by: Kamal Mostafa <email address hidden>

f6696c9... by Oleg Nesterov <email address hidden>

sched/wait: Avoid abort_exclusive_wait() in ___wait_event()

___wait_event() doesn't really need abort_exclusive_wait(); we can simply
change prepare_to_wait_event() to remove the waiter from q->task_list if
it was interrupted.

This simplifies the code/logic, and this way prepare_to_wait_event() can
have more users, see the next change.
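
A simplified sketch of the resulting prepare_to_wait_event(), following
the upstream change (the backported hunk may differ slightly):

 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
         unsigned long flags;
         long ret = 0;

         spin_lock_irqsave(&q->lock, flags);
         if (unlikely(signal_pending_state(state, current))) {
                 /* Signal pending: unlink the waiter right here, under
                  * q->lock, so callers no longer need
                  * abort_exclusive_wait(). */
                 list_del_init(&wait->task_list);
                 ret = -ERESTARTSYS;
         } else {
                 if (list_empty(&wait->task_list)) {
                         if (wait->flags & WQ_FLAG_EXCLUSIVE)
                                 __add_wait_queue_tail(q, wait);
                         else
                                 __add_wait_queue(q, wait);
                 }
                 set_current_state(state);
         }
         spin_unlock_irqrestore(&q->lock, flags);

         return ret;
 }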

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
--
 include/linux/wait.h | 7 +------
 kernel/sched/wait.c | 35 +++++++++++++++++++++++++----------
 2 files changed, 26 insertions(+), 16 deletions(-)

(cherry picked from commit b1ea06a90f528e516929a4da1d9b8838752bceb9)
Signed-off-by: Kamal Mostafa <email address hidden>

dd6eda6... by Oleg Nesterov <email address hidden>

sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up()

Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.

If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wake an uninterruptible waiter.

The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.

Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die, see the next change.
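
A sketch of the wakeup path being described (paraphrased, not the
verbatim diff): abort_exclusive_wait() either unlinks the aborting waiter
or, if a wakeup already removed it, passes the event on with TASK_NORMAL
so that an uninterruptible exclusive waiter is also eligible:

 spin_lock_irqsave(&q->lock, flags);
 if (!list_empty(&wait->task_list))
         list_del_init(&wait->task_list);
 else if (waitqueue_active(q))
         __wake_up_locked_key(q, TASK_NORMAL, key);      /* was: mode */
 spin_unlock_irqrestore(&q->lock, flags);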

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 38a3e1fc1dac480f3672ab22fc97e1f995c80ed7)
Signed-off-by: Kamal Mostafa <email address hidden>

6be3994... by Vineeth Remanan Pillai <email address hidden>

mm: rid swapoff of quadratic complexity

This patch was initially posted by Kelley Nielsen. Reposting the patch
with all review comments addressed and with minor modifications and
optimizations. Also, folding in the fixes offered by Hugh Dickins and
Huang Ying. Tests were rerun and commit message updated with new
results.

try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.

This new implementation of try_to_unuse() reduces its complexity to
linear. It iterates over the system's mms once, unusing all the affected
entries as it walks each set of page tables. The patch also makes similar
changes to shmem_unuse().

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM (8 CPUs, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation: about 1200M calls (8 min, avg 80% CPU util).
Prototype: about 9.0K calls (3 min, avg 5% CPU util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.
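
Putting the above together, a simplified sketch of the new control flow
(locking and most arguments elided; the two retry helpers are hypothetical
names standing in for the logic described above):

 int try_to_unuse(unsigned int type)
 {
         struct mm_struct *mm;
         int retries = 0;

         do {
                 /* 1) tmpfs/shmem: one pass over shmem_swaplist */
                 shmem_unuse(type);

                 /* 2) one pass over every mm, unusing each pte that refers
                  *    to this swap type during its page table walk */
                 list_for_each_entry(mm, &init_mm.mmlist, mmlist)
                         unuse_mm(mm, type);

                 /* 3) drop orphaned swap-cache entries located with
                  *    find_next_to_unuse(), e.g. left behind by a
                  *    preempted shmem_writepage() */
                 drop_orphaned_swap_cache_entries(type); /* hypothetical helper */
         } while (swap_type_still_in_use(type) && ++retries < 3); /* hypothetical helper */

         return 0;
 }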

Link: http://<email address hidden>
Signed-off-by: Vineeth Remanan Pillai <email address hidden>
Signed-off-by: Kelley Nielsen <email address hidden>
Signed-off-by: Huang Ying <email address hidden>
Acked-by: Hugh Dickins <email address hidden>
Cc: Rik van Riel <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

This patch is based on the original prototype patch (that doesn't depend
on xarray) and it includes the following fixes:

 64165b1affc5 mm: swapoff: take notice of completion sooner
 dd862deb151a mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
 870395465444 mm: swapoff: shmem_find_swap_entries() filter out other types

Along with other changes to make it more similar to the mainline patch.

Link: https://patchwork.kernel.org/patch/6048271/
(backported from commit b56a2d8af9147a4efe4011b60d93779c0461ca97)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>