~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial:hibernate-lp1831940

Last commit made on 2019-06-18
Get this branch:
git clone -b hibernate-lp1831940 https://git.launchpad.net/~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial
Only Kamal Mostafa can upload to this branch.

Branch information

Name:
hibernate-lp1831940
Repository:
lp:~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial

Recent commits

d4c9e1b... by Andrea Righi

UBUNTU SAUCE [aws]: PM / hibernate: make sure pm_async is always disabled

BugLink: https://bugs.launchpad.net/bugs/1831940

We have experienced deadlock conditions on hibernate under memory
pressure with pm_async enabled.

To prevent such deadlocks, make sure that pm_async is never enabled.
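
The change presumably pins the value inside the kernel; for reference, the
same knob is reachable from userspace through the standard /sys/power/pm_async
control. A minimal userspace sketch (not part of this patch):

  /*
   * Userspace sketch, not part of the patch: clear /sys/power/pm_async so
   * device suspend/resume callbacks run synchronously.  The SAUCE change
   * above enforces this inside the kernel; this merely shows the standard
   * runtime knob.
   */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      FILE *f = fopen("/sys/power/pm_async", "w");

      if (!f) {
          perror("open /sys/power/pm_async");
          return EXIT_FAILURE;
      }
      if (fputs("0\n", f) == EOF || fclose(f) == EOF) {
          perror("write /sys/power/pm_async");
          return EXIT_FAILURE;
      }
      return EXIT_SUCCESS;
  }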

Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

a1f7191... by Michal Hocko <email address hidden>

mm, vmscan: get rid of throttle_vm_writeout

BugLink: https://bugs.launchpad.net/bugs/1831940

throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
excessive pageout activity during reclaim. Too many pages could be put
under writeback, leaving the LRUs full of unreclaimable pages until the IO
completed, which in turn could trigger the OOM killer.

There have been some important changes in the reclaim path since then,
though. Writers are throttled by balance_dirty_pages when initiating
buffered IO; later, under memory pressure, direct reclaim is throttled by
wait_iff_congested if the node is considered congested by dirty pages on
the LRUs and the underlying bdi is congested by the queued IO. kswapd is
throttled as well if it encounters pages marked for immediate reclaim or
under writeback, which signals that there are too many pages under
writeback already. Finally, should_reclaim_retry does congestion_wait if
the reclaim cannot make any progress and there are too many
dirty/writeback pages.

Another important aspect is that we do not issue any IO from the direct
reclaim context anymore. Under a heavy parallel load this could queue a
lot of IO that would be very scattered and thus inefficient, which would
just make the problem worse.

These three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure, so yet another
throttling point doesn't really seem helpful. Quite the contrary: Mikulas
Patocka has reported that swap backed by dm-crypt doesn't work properly
because the swapout IO cannot make sufficient progress, as the writeout
path depends on the dm_crypt worker, which has to allocate memory to
perform the encryption. In order to guarantee forward progress it relies
on the mempool allocator. mempool_alloc(), however, prefers to use the
underlying (usually page) allocator before it grabs objects from the
pool. Such an allocation can dive into the memory reclaim and consequently
into throttle_vm_writeout. If there are too many dirty pages or pages
under writeback, it will get throttled even though it is in fact a flusher
meant to clear pending pages.

  kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
  Workqueue: kcryptd kcryptd_crypt [dm_crypt]
  Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]
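
To make the mempool behaviour concrete, here is a hypothetical kernel-module
sketch (illustrative names, not code from this patch): the reserved elements
only guarantee forward progress after the backing allocator has failed, so
the allocation can still enter direct reclaim first.

  #include <linux/module.h>
  #include <linux/mempool.h>
  #include <linux/slab.h>

  struct crypt_req { char pad[256]; };    /* stand-in for a dm-crypt work item */

  static struct kmem_cache *req_cache;
  static mempool_t *req_pool;

  static int __init mempool_demo_init(void)
  {
      struct crypt_req *req;

      req_cache = KMEM_CACHE(crypt_req, 0);
      if (!req_cache)
          return -ENOMEM;

      /* Reserve 16 elements to guarantee forward progress. */
      req_pool = mempool_create_slab_pool(16, req_cache);
      if (!req_pool) {
          kmem_cache_destroy(req_cache);
          return -ENOMEM;
      }

      /*
       * mempool_alloc() tries the slab allocator first and only then falls
       * back to the reserved elements, so under memory pressure this call
       * can dive into direct reclaim (and, before this change, into
       * throttle_vm_writeout()).
       */
      req = mempool_alloc(req_pool, GFP_NOIO);
      if (req)
          mempool_free(req, req_pool);
      return 0;
  }

  static void __exit mempool_demo_exit(void)
  {
      mempool_destroy(req_pool);
      kmem_cache_destroy(req_cache);
  }

  module_init(mempool_demo_init);
  module_exit(mempool_demo_exit);
  MODULE_LICENSE("GPL");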

Let's just drop throttle_vm_writeout altogether. It is not very helpful
anymore.

I have tried to test a potential writeback IO runaway similar to the one
described in the original patch which introduced it [1]: a small virtual
machine (512MB RAM, 4 CPUs, 2G of swap space and a disk image on a rather
slow NFS mount in sync mode on the host) with 8 parallel writers each
writing 1G worth of data. As soon as the pagecache fills up and direct
reclaim kicks in, I start an anonymous-memory consumer in a loop
(allocating 300M and exiting after populating it) in the background, to
make the memory pressure even stronger and to disrupt the steady state of
the IO. Direct reclaim is throttled because of the congestion, and kswapd
hits congestion_wait due to nr_immediate, but throttle_vm_writeout never
triggers the sleep throughout the test. Dirty+writeback stay close to
nr_dirty_threshold, with some fluctuations caused by the anon consumer.

[1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Link: http://<email address hidden>
Signed-off-by: Michal Hocko <email address hidden>
Reported-by: Mikulas Patocka <email address hidden>
Cc: Marcelo Tosatti <email address hidden>
Cc: NeilBrown <email address hidden>
Cc: Ondrej Kozina <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>
(cherry picked from commit bf48438354a79df50fadd2e1c0b81baa2619a8b6)
Signed-off-by: Kamal Mostafa <email address hidden>

aad8e95... by Andrea Righi

UBUNTU SAUCE [aws]: mm: aggressive swapoff

BugLink: https://bugs.launchpad.net/bugs/1831940

Improve swapoff performance at the expense of overall system performance
by avoiding sleeping on lock_page() in try_to_unuse().

This allows triggering read_swap_cache_async() on all the swapped-out
pages and strongly increases swapoff performance (at the risk of
completely killing interactive performance).

Test case: swapoff called on a swap file containing about 32G of data in
a VM with 8 CPUs and 64G of RAM.

Result:

 - stock kernel:

 # time swapoff /swap-hibinit

 real 40m13.072s
 user 0m0.000s
 sys 17m18.971s

 - with this patch applied:

 # time swapoff /swap-hibinit

 real 1m59.496s
 user 0m0.000s
 sys 0m21.370s
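
The diff itself is not shown on this page. A rough kernel-context sketch of
the idea, with a hypothetical helper name (the real change lives in
try_to_unuse()):

  /*
   * Hypothetical sketch, not the actual patch: instead of sleeping in
   * lock_page() for every swapped-out page, skip pages that are currently
   * locked and let the caller revisit them on a later pass, so readahead
   * via read_swap_cache_async() keeps flowing.
   */
  #include <linux/mm.h>
  #include <linux/pagemap.h>

  static bool unuse_page_nonblocking(struct page *page)
  {
      if (!trylock_page(page))    /* rather than lock_page(page) */
          return false;           /* caller retries this entry later */

      /* ... unmap the page and drop it from the swap cache ... */

      unlock_page(page);
      return true;
  }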

Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

e3b7779... by Hugh Dickins <email address hidden>

mm: swapoff: shmem_unuse() stop eviction without igrab()

BugLink: https://bugs.launchpad.net/bugs/1831940

The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by a GPF.

Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().

Instead, shmem_unuse() just raises a "stop_eviction" count in the shmem-
specific inode, and shmem_evict_inode() waits for that to go down to 0.
Call it "stop_eviction" rather than "swapoff_busy" because it can be put
to other uses later (the huge tmpfs patches expect to use it).

That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.
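
A condensed sketch of the pattern described above (field and helpers
reconstructed from the text, so treat the details as approximate; this is
also why the wait_var_event() backport appears further down this list):

  /* wait_var_event()/wake_up_var() live in <linux/wait_bit.h> in mainline;
   * older series keep the bit-wait code in <linux/wait.h>. */
  #include <linux/atomic.h>
  #include <linux/wait_bit.h>

  struct shmem_inode_info_sketch {
      atomic_t stop_eviction;    /* raised by swapoff, honoured by eviction */
      /* ... rest of the shmem-specific inode info ... */
  };

  /* swapoff side (shmem_unuse()): pin the inode without igrab() */
  static void swapoff_hold(struct shmem_inode_info_sketch *info)
  {
      atomic_inc(&info->stop_eviction);
  }

  static void swapoff_release(struct shmem_inode_info_sketch *info)
  {
      if (atomic_dec_and_test(&info->stop_eviction))
          wake_up_var(&info->stop_eviction);
  }

  /* eviction side (shmem_evict_inode()): wait until swapoff is done with us */
  static void wait_for_swapoff(struct shmem_inode_info_sketch *info)
  {
      wait_var_event(&info->stop_eviction,
                     !atomic_read(&info->stop_eviction));
  }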

[<email address hidden>: remove incorrect list_del()]
  Link: http://<email address hidden>
Link: http://<email address hidden>
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <email address hidden>
Cc: "Alex Xu (Hello71)" <email address hidden>
Cc: Huang Ying <email address hidden>
Cc: Kelley Nielsen <email address hidden>
Cc: Konstantin Khlebnikov <email address hidden>
Cc: Rik van Riel <email address hidden>
Cc: Vineeth Pillai <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>
(backported from commit af53d3e9e04024885de5b4fda51e5fa362ae2bd8)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

475581c... by Peter Zijlstra <email address hidden>

sched/wait: Introduce wait_var_event()

BugLink: https://bugs.launchpad.net/bugs/1831940

As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.

The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW, it's much more widely
applicable/flexible.

It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.

The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This address is hashed to
obtain a wait_queue_head from the bit_wait_table.

Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.

The following are (mostly) identical replacements:

 wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
 wake_up_atomic_t(&my_atomic);

 wait_var_event(&my_atomic, !atomic_read(&my_atomic));
 wake_up_var(&my_atomic);

The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little consequence, since most callers are
already conditional on atomic_dec_and_test() and the ones that are not
are trivial to make so.
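
A self-contained kernel-context sketch of such a conversion (illustrative
names; mainline keeps these helpers in <linux/wait_bit.h>, older series in
<linux/wait.h>):

  #include <linux/atomic.h>
  #include <linux/wait_bit.h>

  static atomic_t pending_requests = ATOMIC_INIT(0);

  static void request_done(void)
  {
      /* Most callers are already conditional on atomic_dec_and_test(), so
       * the unconditional wake_up_var() changes nothing in practice. */
      if (atomic_dec_and_test(&pending_requests))
          wake_up_var(&pending_requests);
  }

  static void wait_for_requests(void)
  {
      /* old: wait_on_atomic_t(&pending_requests, atomic_t_wait,
       *                       TASK_UNINTERRUPTIBLE); */
      wait_var_event(&pending_requests, !atomic_read(&pending_requests));
  }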

Tested-by: Dan Williams <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: David Howells <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(backported from commit 6b2bb7265f0b62605e8caee3613449ed0db270b9)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

6cb1108... by Ingo Molnar <email address hidden>

sched/wait: Standardize wait_bit_queue naming

BugLink: https://bugs.launchpad.net/bugs/1831940

So wait-bit-queue head variables are often named:

 struct wait_bit_queue *q

... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!

They are misnomers in two ways:

 - the 'wait_bit_queue' leaves open the question of whether
   it's an entry or a head

 - the 'q' parameter and local variable naming falsely implies
   that it's a 'queue' - while it's an entry.

This resulted in sometimes confusing cases such as:

 finish_wait(wq, &q->wait);

where the 'q' is not a wait-queue head, but a wait-bit-queue entry.

So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:

 struct wait_bit_queue => struct wait_bit_queue_entry
 q => wbq_entry

Which makes it all much clearer:

 struct wait_bit_queue_entry *wbq_entry

... and turns the former confusing piece of code into:

 finish_wait(wq_head, &wbq_entry->wq_entry);

which IMHO makes it immediately clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.

I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.

Cc: Linus Torvalds <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(backported from commit 76c85ddc4695bb7b8209bfeff11f5156088f9197)
Signed-off-by: Kamal Mostafa <email address hidden>

47eea2b... by Oleg Nesterov <email address hidden>

sched/wait: Introduce init_wait_entry()

BugLink: https://bugs.launchpad.net/bugs/1831940

The partial initialization of wait_queue_t in prepare_to_wait_event() looks
ugly. This was done to shrink .text, but we can simply add the new helper
which does the full initialization and shrink the compiled code a bit more.

And this way prepare_to_wait_event() can have more users. In particular,
we are ready to remove the signal_pending_state() checks from
wait_bit_action_f helpers and change __wait_on_bit_lock() to use
prepare_to_wait_event().
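
The helper itself is not quoted here; reconstructed from the description it
presumably amounts to the following (the authoritative body is in the
referenced upstream commit; field names follow the 4.x-era wait_queue_t):

  #include <linux/list.h>
  #include <linux/sched.h>
  #include <linux/wait.h>

  /* Full initialization in one place, instead of the partial setup that
   * prepare_to_wait_event() used to do. */
  void init_wait_entry(wait_queue_t *wait, int flags)
  {
      wait->flags = flags;
      wait->private = current;
      wait->func = autoremove_wake_function;
      INIT_LIST_HEAD(&wait->task_list);
  }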

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 0176beaffbe9ed627b6a4dfa61d640f1a848086f)
Signed-off-by: Kamal Mostafa <email address hidden>

f8c56f8... by Oleg Nesterov <email address hidden>

sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()

BugLink: https://bugs.launchpad.net/bugs/1831940

__wait_on_bit_lock() doesn't need abort_exclusive_wait() either. Right
now it can't use prepare_to_wait_event() (see the next change), but
it can do the additional finish_wait() if action() fails.

abort_exclusive_wait() no longer has callers, remove it.

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit eaf9ef52241b545fe63621266bfc6fd8b06559ff)
Signed-off-by: Kamal Mostafa <email address hidden>

070487f... by Oleg Nesterov <email address hidden>

sched/wait: Avoid abort_exclusive_wait() in ___wait_event()

BugLink: https://bugs.launchpad.net/bugs/1831940

___wait_event() doesn't really need abort_exclusive_wait(); we can simply
change prepare_to_wait_event() to remove the waiter from q->task_list if
it was interrupted.

This simplifies the code/logic, and this way prepare_to_wait_event() can
have more users, see the next change.

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
--
 include/linux/wait.h | 7 +------
 kernel/sched/wait.c | 35 +++++++++++++++++++++++++----------
 2 files changed, 26 insertions(+), 16 deletions(-)

(cherry picked from commit b1ea06a90f528e516929a4da1d9b8838752bceb9)
Signed-off-by: Kamal Mostafa <email address hidden>

0d00ac8... by Oleg Nesterov <email address hidden>

sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up()

BugLink: https://bugs.launchpad.net/bugs/1831940

Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.

If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wake an uninterruptible waiter.

The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.

Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die, see the next change.
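
The mode-matching point can be illustrated with a tiny standalone program
(userspace mock-up; the constants mirror <linux/sched.h> and the mask test is
the one try_to_wake_up() applies to a waiter's state):

  #include <stdio.h>

  #define TASK_INTERRUPTIBLE   0x0001
  #define TASK_UNINTERRUPTIBLE 0x0002
  #define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

  /* A wake-up with mode 'wake_mode' only wakes waiters whose sleep state
   * intersects it. */
  static int would_wake(unsigned int waiter_state, unsigned int wake_mode)
  {
      return (waiter_state & wake_mode) != 0;
  }

  int main(void)
  {
      /* interruptible-mode wake-up misses an uninterruptible waiter: 0 */
      printf("%d\n", would_wake(TASK_UNINTERRUPTIBLE, TASK_INTERRUPTIBLE));
      /* TASK_NORMAL covers both sleep states, so the waiter is woken: 1 */
      printf("%d\n", would_wake(TASK_UNINTERRUPTIBLE, TASK_NORMAL));
      return 0;
  }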

Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 38a3e1fc1dac480f3672ab22fc97e1f995c80ed7)
Signed-off-by: Kamal Mostafa <email address hidden>