~kamalmostafa/ubuntu/+source/linux-aws/+git/bionic:hibernate-master

Last commit made on 2019-05-24
Get this branch:
git clone -b hibernate-master https://git.launchpad.net/~kamalmostafa/ubuntu/+source/linux-aws/+git/bionic

Branch information

Name:
hibernate-master
Repository:
lp:~kamalmostafa/ubuntu/+source/linux-aws/+git/bionic

Recent commits

c5ea37a... by Kamal Mostafa

TEST KERNEL 4.15.0-1040.42+hibernate20150524

0614c59... by Andrea Righi

UBUNTU SAUCE [aws]: mm: aggressive swapoff

Improve swapoff performance, at the expense of overall system
performance, by avoiding sleeping on lock_page() in try_to_unuse().

This allows a read_swap_cache_async() to be triggered for all the
swapped-out pages and greatly increases swapoff performance (at the
risk of completely killing interactive performance).
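
For illustration, a minimal sketch of the idea (not the literal patch;
the loop context and retry handling are assumed, and
read_swap_cache_async() is the stock 4.15 swap-cache API):

 /* Inside the try_to_unuse() scan loop: issue the async read, but
  * don't sleep on a locked page; move on so readahead keeps the
  * I/O pipeline full, and revisit the entry on a later pass.
  */
 page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
                              vma, addr, false);
 if (page && !trylock_page(page)) {
         put_page(page);
         continue;       /* page busy: skip instead of lock_page() */
 }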

Test case: swapoff called on a swap file containing about 32G of data,
in a VM with 8 CPUs and 64G of RAM.

Result:

 - stock kernel:

 # time swapoff /swap-hibinit

 real 40m13.072s
 user 0m0.000s
 sys 17m18.971s

 - with this patch applied:

 # time swapoff /swap-hibinit

 real 1m59.496s
 user 0m0.000s
 sys 0m21.370s

Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

3f3d89e... by Hugh Dickins <email address hidden>

mm: swapoff: shmem_unuse() stop eviction without igrab()

The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by GPF.

Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().

Instead, shmem_unuse() just raises a "stop_eviction" count in the shmem-
specific inode, and shmem_evict_inode() waits for that to go down to 0.
Call it "stop_eviction" rather than "swapoff_busy" because it can be put
to use for others later (huge tmpfs patches expect to use it).

That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.
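
Condensed from the upstream commit for illustration (locking details
and the shmem_unuse_inode() arguments are simplified here):

 /* shmem_unuse(): pin the inode against eviction without igrab() */
 atomic_inc(&info->stop_eviction);
 mutex_unlock(&shmem_swaplist_mutex);

 error = shmem_unuse_inode(&info->vfs_inode, type);

 mutex_lock(&shmem_swaplist_mutex);
 if (atomic_dec_and_test(&info->stop_eviction))
         wake_up_var(&info->stop_eviction);

 /* shmem_evict_inode(): wait while shmem_unuse() is doing faults */
 wait_var_event(&info->stop_eviction,
                !atomic_read(&info->stop_eviction));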

[<email address hidden>: remove incorrect list_del()]
  Link: http://<email address hidden>
Link: http://<email address hidden>
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <email address hidden>
Cc: "Alex Xu (Hello71)" <email address hidden>
Cc: Huang Ying <email address hidden>
Cc: Kelley Nielsen <email address hidden>
Cc: Konstantin Khlebnikov <email address hidden>
Cc: Rik van Riel <email address hidden>
Cc: Vineeth Pillai <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>
(cherry picked from commit af53d3e9e04024885de5b4fda51e5fa362ae2bd8)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

fcfed59... by Peter Zijlstra <email address hidden>

sched/wait: Introduce wait_var_event()

As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.

The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW, it's much more widely
applicable/flexible.

It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.

The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This address is
hashed to obtain a wait_queue_head from the bit_wait_table.

Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.

The following are (mostly) identical replacements:

 wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
 wake_up_atomic_t(&my_atomic);

 wait_var_event(&my_atomic, !atomic_read(&my_atomic));
 wake_up_var(&my_atomic);

The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little consequence, since most callers are
already conditional on atomic_dec_and_test(), and the ones that are
not are trivial to make so.
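
A minimal sketch of the pattern (assumed driver-side context;
wait_var_event() and wake_up_var() are declared in <linux/wait_bit.h>):

 #include <linux/atomic.h>
 #include <linux/wait_bit.h>     /* wait_var_event(), wake_up_var() */

 static atomic_t users = ATOMIC_INIT(0);

 static void put_user_ref(void)
 {
         /* wake_up_var() is unconditional and waiters re-check the
          * condition, so only wake when it may have become true. */
         if (atomic_dec_and_test(&users))
                 wake_up_var(&users);
 }

 static void wait_for_no_users(void)
 {
         /* Sleeps on the waitqueue hashed from the address &users. */
         wait_var_event(&users, !atomic_read(&users));
 }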

Tested-by: Dan Williams <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: David Howells <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 6b2bb7265f0b62605e8caee3613449ed0db270b9)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

c40c008... by Vineeth Remanan Pillai <email address hidden>

mm: rid swapoff of quadratic complexity

This patch was initially posted by Kelley Nielsen. Reposting the patch
with all review comments addressed and with minor modifications and
optimizations. Also, folding in the fixes offered by Hugh Dickins and
Huang Ying. Tests were rerun and commit message updated with new
results.

try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.

This new proposed implementation of try_to_unuse() reduces its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM (8 CPUs, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation: about 1200M calls (8 min, avg 80% CPU util).
Prototype: about 9.0K calls (3 min, avg 5% CPU util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.
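
The overall shape of the loop, heavily condensed for illustration
(helper signatures are simplified; locking, refcounting, frontswap, and
error handling are elided; drop_swap_cache_entry() is a hypothetical
stand-in for the real page lookup plus delete_from_swap_cache()):

 int retries = 0;
 unsigned int i = 0;

 while (si->inuse_pages) {
         struct mm_struct *mm;

         /* Pass over tmpfs inodes on the shmem_swaplist. */
         shmem_unuse(type);

         /* One page-table walk per mm, swapping in every pte that
          * references this swap type. */
         list_for_each_entry(mm, &init_mm.mmlist, mmlist)
                 unuse_mm(mm, type);

         /* Sweep entries orphaned in the swap cache during the walk. */
         while ((i = find_next_to_unuse(si, i)) != 0)
                 drop_swap_cache_entry(si, i);   /* hypothetical helper */

         if (++retries >= 3)     /* "a maximum of three times" */
                 return -EBUSY;
 }
 return 0;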

Link: http://<email address hidden>
Signed-off-by: Vineeth Remanan Pillai <email address hidden>
Signed-off-by: Kelley Nielsen <email address hidden>
Signed-off-by: Huang Ying <email address hidden>
Acked-by: Hugh Dickins <email address hidden>
Cc: Rik van Riel <email address hidden>
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

This patch is based on the original prototype patch (which doesn't
depend on xarray) and includes the following fixes:

 64165b1affc5 mm: swapoff: take notice of completion sooner
 dd862deb151a mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
 870395465444 mm: swapoff: shmem_find_swap_entries() filter out other types

Along with other changes to make it more similar to the mainline patch.

Link: https://patchwork.kernel.org/patch/6048271/
(backported from commit b56a2d8af9147a4efe4011b60d93779c0461ca97)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

267bd52... by Frank van der Linden <email address hidden>

UBUNTU SAUCE [aws]: xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.

Restoring all PIRQs, which is the right thing to do, was causing problems
on larger instances. This is a horrible workaround until this issue is fully
understood.

Signed-off-by: Frank van der Linden <email address hidden>
Reviewed-by: Alakesh Haloi <email address hidden>
Reviewed-by: Anchal Agarwal <email address hidden>
Reviewed-by: Qian Lu <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

eff560c... by Frank van der Linden <email address hidden>

UBUNTU SAUCE [aws]: xen: restore pirqs on resume from hibernation.

The hibernation code unlinks event channels from these (legacy) IRQs, so they
must be reinitialized on wakeup, much like in the Xen suspend/resume case.
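
For reference, a condensed sketch of what re-binding one legacy PIRQ
looks like in the mainline resume path (restore_pirqs() in
drivers/xen/events/events_base.c); whether this SAUCE patch reuses that
code verbatim is an assumption:

 /* Per-IRQ loop and error handling elided: re-map the GSI to its
  * pirq, then restart the pirq to bind a fresh event channel.
  */
 struct physdev_map_pirq map_irq = {
         .domid = DOMID_SELF,
         .type  = MAP_PIRQ_TYPE_GSI,
         .index = gsi,
         .pirq  = pirq,
 };

 if (HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq) == 0)
         __startup_pirq(irq);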

Signed-off-by: Frank van der Linden <email address hidden>
Reviewed-by: Cristian Gafton <email address hidden>
Reviewed-by: Anchal Agarwal <email address hidden>
Reviewed-by: Alakesh Haloi <email address hidden>

CR: https://code.amazon.com/reviews/CR-3702953/
Signed-off-by: Kamal Mostafa <email address hidden>

87acebb... by Anchal Agarwal <email address hidden>

UBUNTU SAUCE [aws]: ACPICA: Enable sleep button on ACPI legacy wake

Currently we do not see the sleep_enable bit set after the guest resumes
from hibernation. Hibernation is triggered in the guest on receiving a
sleep trigger from the hypervisor (S4 state). We see that the power
button is enabled on wakeup, but the sleep button isn't. This causes a
second sleep trigger to fail, since the PMEN register does not have
bit 9 set on resume, which the hypervisor expects to be set before it
can send an SCI interrupt to the guest.
Expected PMEN=0x320 as on a fresh boot; after resume, PMEN=0x120.
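
A hedged sketch of what this implies, modeled on the power-button
enable already performed in ACPICA's legacy wake path (the exact
placement in the SAUCE patch is an assumption):

 /* Re-enable the sleep-button fixed event on wake as well, so PMEN
  * has SLPBTN_EN (bit 9) set and the hypervisor can deliver the
  * next S4 SCI.
  */
 (void)acpi_write_bit_register(acpi_gbl_fixed_event_info
                               [ACPI_EVENT_SLEEP_BUTTON].enable_register_id,
                               ACPI_ENABLE_EVENT);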

Signed-off-by: Anchal Agarwal <email address hidden>
Reviewed-by: Balbir Singh <email address hidden>
Reviewed-by: Frank van der Linden <email address hidden>

CR: https://code.amazon.com/reviews/CR-3704872
Signed-off-by: Kamal Mostafa <email address hidden>

af137b7... by Eduardo Valentin <email address hidden>

UBUNTU SAUCE [aws]: block: xen-blkfront: consider new dom0 features on restore

On a regular start, the instance performs a regular boot, in which the
rootfs is mounted according to the xen-blkback features (in particular
feature-barrier and feature-flush-cache). That sets up the journal
according to the features advertised on the superblock.

On a start from hibernation, the instance boots, detects that a
hibernation image is present, loads the image into memory, and jumps
back to where it was. There is no regular mount of the rootfs; it uses
the data structures already in the previously saved memory image.

When the instance hibernates, it may move from its original dom0 to a
new dom0 when it is restarted. So, given the above, if the xen-blkback
features change, the guest can be in trouble. The original assumption
was that the dom0 environment would be preserved. I did a couple of
experiments, and I confirm that these particular features change quite
a lot across hibernation attempts:
[ 2343.157903] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[ 2444.712339] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[ 2537.105884] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 2636.641298] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[ 2729.868349] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 2827.118979] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 2924.812599] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3018.063399] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3116.685040] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3209.164475] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3317.981362] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3415.939725] blkfront: xvda: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3514.202478] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
[ 3619.355791] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;

Now, considering the above, this patch fixes the following scenario:
a. The instance boots and sets up its bio queue on a dom0 A that
supports softbarrier.
b. The instance hibernates.
c. When asked to restore, the instance comes back on a dom0 B that does
not support softbarrier.
d. Restoration goes well until the next journal commit is issued.
Remember that it is still using the previous image's rootfs data
structures, and therefore will request a softbarrier.
e. The bio errors out with an "operation not supported" message and
causes the journal to fail, and the filesystem is remounted read-only:
[ 1138.909290] print_req_error: operation not supported error, dev xvda, sector 4470400, flags 6008
[ 1139.025685] Aborting journal on device xvda1-8.
[ 1139.029758] print_req_error: operation not supported error, dev xvda, sector 4460544, flags 26008
[ 1139.326119] Buffer I/O error on dev xvda1, logical block 0, lost sync page write
[ 1139.331398] EXT4-fs error (device xvda1): ext4_journal_check_start:61: Detected aborted journal
[ 1139.337296] EXT4-fs (xvda1): Remounting filesystem read-only
[ 1139.341006] EXT4-fs (xvda1): previous I/O error to superblock detected
[ 1139.345704] print_req_error: operation not supported error, dev xvda, sector 4096, flags 26008

The fix is essentially to query xenbus for the new xen-blkback
capabilities and update the request queue accordingly.
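
A minimal sketch of that approach (hypothetical helper name; the
struct blkfront_info fields and the xenbus_read_unsigned() /
blk_queue_write_cache() APIs are the stock 4.15 ones):

 /* On restore, re-read the backend's flush/barrier features from
  * xenstore and update the live request queue, instead of trusting
  * the capabilities captured in the hibernation image.
  */
 static void blkfront_update_flush_features(struct blkfront_info *info)
 {
         unsigned int flush, barrier;

         flush = xenbus_read_unsigned(info->xbdev->otherend,
                                      "feature-flush-cache", 0);
         barrier = xenbus_read_unsigned(info->xbdev->otherend,
                                        "feature-barrier", 0);

         blk_queue_write_cache(info->rq, flush || barrier, false);
 }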

Reviewed-by: Balbir Singh <email address hidden>
Reviewed-by: Vallish Vaidyeshwara <email address hidden>
Signed-off-by: Eduardo Valentin <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>

07976b6... by Connor Kuehl

UBUNTU: Ubuntu-aws-4.15.0-1040.42

Signed-off-by: Connor Kuehl <email address hidden>