~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial:hibernate-lp1831940b

Last commit made on 2019-06-25
Get this branch:
git clone -b hibernate-lp1831940b https://git.launchpad.net/~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial

Branch information

Name:
hibernate-lp1831940b
Repository:
lp:~kamalmostafa/ubuntu/+source/linux-aws/+git/xenial

Recent commits

a11b38b... by Andrea Righi

UBUNTU SAUCE [aws] PM / hibernate: reduce memory pressure during image writing

BugLink: https://bugs.launchpad.net/bugs/1831940

Get rid of the reqd_free_pages logic and make sure I/O requests are
completed every time a swap page is written.

This reduces the risk of running out of memory during hibernation
if the system is under memory pressure.

Signed-off-by: Andrea Righi <email address hidden>

38bdd73... by Andrea Righi

UBUNTU SAUCE [aws] PM / hibernate: set image_size to total RAM size by default

BugLink: https://bugs.launchpad.net/bugs/1831940

If the size of the image created when hibernating is bigger than
image_size, the kernel will try to reduce the size of the image. This
can slow down hibernation considerably, so setting image_size to the
total amount of RAM by default ensures that the kernel doesn't waste
time reducing the image, making hibernation faster.

Signed-off-by: Andrea Righi <email address hidden>

a6abfd2... by Andrea Righi

UBUNTU: [Config] aws: disable CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS

BugLink: https://bugs.launchpad.net/bugs/1831940

When swapping out THP (Transparent Huge Pages), we sometimes have to
fall back to splitting the THP into normal pages before swapping,
because no free swap clusters are available. This can make hibernation
really slow under memory pressure, so disable transparent huge pages by
default.

Signed-off-by: Andrea Righi <email address hidden>

4f58b94... by "Rafael J. Wysocki" <email address hidden>

PM / hibernate: Simplify mark_unsafe_pages()

BugLink: https://bugs.launchpad.net/bugs/1831940

Rework mark_unsafe_pages() to use a simpler method of clearing
all bits in free_pages_map and to set the bits for the "unsafe"
pages (ie. pages that were used by the image kernel before
hibernation) with the help of duplicate_memory_bitmap().

For this purpose, move the pfn_valid() check from mark_unsafe_pages()
to unpack_orig_pfns() where the "unsafe" pages are discovered.

Signed-off-by: Rafael J. Wysocki <email address hidden>
(cherry picked from commit 6dbecfd345a617888da370b13d5b190c9ff3df53)
Signed-off-by: Andrea Righi <email address hidden>

3caef48... by "Rafael J. Wysocki" <email address hidden>

PM / hibernate: Recycle safe pages after image restoration

BugLink: https://bugs.launchpad.net/bugs/1831940

One of the memory bitmaps used by the hibernation image restoration
code is freed after the image has been loaded.

That is not quite efficient, though, because the memory pages used
for building that bitmap are known to be safe (ie. they were not
used by the image kernel before hibernation) and the arch-specific
code finalizing the image restoration may need them. In that case
it needs to allocate those pages again via the memory management
subsystem, check if they are really safe again by consulting the
other bitmaps and so on.

To avoid that, recycle those pages by putting them into the global
list of known safe pages so that they can be given to the arch code
right away when necessary.

Signed-off-by: Rafael J. Wysocki <email address hidden>
(cherry picked from commit 307c5971c972ef2bfd541d2850b36a692c6354c9)
Signed-off-by: Andrea Righi <email address hidden>

006e5f7... by "Rafael J. Wysocki" <email address hidden>

PM / hibernate: Do not free preallocated safe pages during image restore

BugLink: https://bugs.launchpad.net/bugs/1831940

The core image restoration code preallocates some safe pages
(ie. pages that weren't used by the image kernel before hibernation)
for future use before allocating the bulk of memory for loading the
image data. Those safe pages are then freed so they can be allocated
again (with the memory management subsystem's help). That's done to
ensure that there will be enough safe pages for temporary data
structures needed during image restoration.

However, it is not really necessary to free those pages after they
have been allocated. They can be added to the (global) list of
safe pages right away and then picked up from there when needed
without freeing.

That reduces the overhead related to using safe pages, especially
in the arch-specific code, so modify the code accordingly.

Signed-off-by: Rafael J. Wysocki <email address hidden>
(cherry picked from commit 9c744481c003697de453e8fc039468143ba604aa)
Signed-off-by: Andrea Righi <email address hidden>

b0c5a78... by Keith Busch

NVMe: Allow request merges

BugLink: https://bugs.launchpad.net/bugs/1831940

It is generally more efficient to submit larger IO.

Signed-off-by: Keith Busch <email address hidden>
Reviewed-by: Johannes Thumshirn <email address hidden>
Reviewed-by: Sagi Grimberg <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit ef2d4615c59efb312e531a5e949970f37ca1c841)
Signed-off-by: Andrea Righi <email address hidden>

d4c9e1b... by Andrea Righi

UBUNTU SAUCE [aws]: PM / hibernate: make sure pm_async is always disabled

BugLink: https://bugs.launchpad.net/bugs/1831940

We have experienced deadlock conditions on hibernate under memory
pressure with pm_async enabled.

To prevent such deadlocks, make sure that pm_async is never enabled.

Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

a1f7191... by Michal Hocko <email address hidden>

mm, vmscan: get rid of throttle_vm_writeout

BugLink: https://bugs.launchpad.net/bugs/1831940

throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
excessive pageout activity during the reclaim. Too many pages could be
put under writeback therefore LRUs would be full of unreclaimable pages
until the IO completes and in turn the OOM killer could be invoked.

There have been some important changes introduced since then in the
reclaim path though. Writers are throttled by balance_dirty_pages when
initiating the buffered IO; later, under memory pressure, the direct
reclaim is throttled by wait_iff_congested if the node is considered
congested by dirty pages on LRUs and the underlying bdi is congested by
the queued IO. Kswapd is throttled as well if it encounters pages
marked for immediate reclaim or under writeback, which signals that
there are too many pages under writeback already. Finally,
should_reclaim_retry does congestion_wait if the reclaim cannot make
any progress and there are too many dirty/writeback pages.

Another important aspect is that we do not issue any IO from the direct
reclaim context anymore. In a heavy parallel load this could queue a
lot of IO which would be very scattered and thus inefficient, which
would just make the problem worse.

These three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure, so yet another
throttling point doesn't really seem helpful. Quite the contrary,
Mikulas Patocka has reported that swap backed by dm-crypt doesn't work
properly
because the swapout IO cannot make sufficient progress as the writeout
path depends on dm_crypt worker which has to allocate memory to perform
the encryption. In order to guarantee a forward progress it relies on
the mempool allocator. mempool_alloc(), however, prefers to use the
underlying (usually page) allocator before it grabs objects from the
pool. Such an allocation can dive into the memory reclaim and
consequently into throttle_vm_writeout. If there are too many dirty
pages or pages under writeback, it will get throttled even though it is
in fact a flusher trying to clear pending pages.

  kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
  Workqueue: kcryptd kcryptd_crypt [dm_crypt]
  Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]

Let's just drop throttle_vm_writeout altogether. It is no longer very
helpful.

I have tried to test a potential writeback IO runaway similar to the one
described in the original patch which has introduced that [1]. Small
virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
rather slow NFS in a sync mode on the host) with 8 parallel writers each
writing 1G worth of data. As soon as the pagecache fills up and the
direct reclaim hits, I start an anon memory consumer in a loop
(allocating 300M and exiting after populating it) in the background to
make the memory pressure even stronger as well as to disrupt the steady
state for the IO. The direct reclaim is throttled because of the
congestion as well as kswapd hitting congestion_wait due to nr_immediate
but throttle_vm_writeout doesn't ever trigger the sleep throughout the
test. Dirty+writeback are close to nr_dirty_threshold with some
fluctuations caused by the anon consumer.

[1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Link: http://<email address hidden>
Signed-off-by: Michal Hocko <email address hidden>
Reported-by: Mikulas Patocka <email address hidden>
Cc: Marcelo Tosatti <email address hidden>
Cc: NeilBrown <email address hidden>
Cc: Ondrej Kozina <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>
(cherry picked from commit bf48438354a79df50fadd2e1c0b81baa2619a8b6)
Signed-off-by: Kamal Mostafa <email address hidden>

aad8e95... by Andrea Righi

UBUNTU SAUCE [aws]: mm: aggressive swapoff

BugLink: https://bugs.launchpad.net/bugs/1831940

Improve swapoff performance at the expense of overall system
performance by avoiding sleeping on lock_page() in try_to_unuse().

This triggers a read_swap_cache_async() on all the swapped-out pages
and greatly increases swapoff performance (at the risk of completely
killing interactive performance).

Test case: swapoff called on a swap file containing about 32G of data in
a VM with 8 cpus, 64G RAM.

Result:

 - stock kernel:

 # time swapoff /swap-hibinit

 real 40m13.072s
 user 0m0.000s
 sys 17m18.971s

 - with this patch applied:

 # time swapoff /swap-hibinit

 real 1m59.496s
 user 0m0.000s
 sys 0m21.370s

Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>