Improve swapoff performance at the expense of the entire system
performance by avoiding to sleep on lock_page() in try_to_unuse().
This allows to trigger a read_swap_cache_async() on all the swapped out
pages and strongly increase swapoff performance (at the risk of
completely killing interactive performance).
Test case: swapoff called on a swap file containing about 32G of data in
a VM with 8 cpus, 64G RAM.
Result:
- stock kernel:
# time swapoff /swap-hibinit
real 40m13.072s
user 0m0.000s
sys 17m18.971s
- with this patch applied:
# time swapoff /swap-hibinit
real 1m59.496s
user 0m0.000s
sys 0m21.370s
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>
mm: swapoff: shmem_unuse() stop eviction without igrab()
The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by GPF.
Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().
Instead, shmem_unuse() just raise a "stop_eviction" count in the shmem-
specific inode, and shmem_evict_inode() wait for that to go down to 0.
Call it "stop_eviction" rather than "swapoff_busy" because it can be put
to use for others later (huge tmpfs patches expect to use it).
That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.
f3b016d...
by
Peter Zijlstra <email address hidden>
sched/wait: Introduce wait_var_event()
As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.
The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW. it's much more widely
applicable/flexible.
It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.
The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This addresses is
hashed to obtain a wait_queue_head from the bit_wait_table.
Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.
The following are (mostly) identical replacements:
The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition here. This is of little concequence, since most callers are
already conditional on atomic_dec_and_test() and the ones that are
not, are trivial to make so.
Tested-by: Dan Williams <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: David Howells <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(backported from commit 6b2bb7265f0b62605e8caee3613449ed0db270b9)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>
... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!
They are misnomers in two ways:
- the 'wait_bit_queue' leaves open the question of whether
it's an entry or a head
- the 'q' parameter and local variable naming falsely implies
that it's a 'queue' - while it's an entry.
This resulted in sometimes confusing cases such as:
finish_wait(wq, &q->wait);
where the 'q' is not a wait-queue head, but a wait-bit-queue entry.
So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:
... and turns the former confusing piece of code into:
finish_wait(wq_head, &wbq_entry->wq_entry;
which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.
I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.
Cc: Linus Torvalds <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Cc: <email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(backported from commit 76c85ddc4695bb7b8209bfeff11f5156088f9197)
Signed-off-by: Kamal Mostafa <email address hidden>
5a2f668...
by
Oleg Nesterov <email address hidden>
sched/wait: Introduce init_wait_entry()
The partial initialization of wait_queue_t in prepare_to_wait_event() looks
ugly. This was done to shrink .text, but we can simply add the new helper
which does the full initialization and shrink the compiled code a bit more.
And. This way prepare_to_wait_event() can have more users. In particular we
are ready to remove the signal_pending_state() checks from wait_bit_action_f
helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().
Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 0176beaffbe9ed627b6a4dfa61d640f1a848086f)
Signed-off-by: Kamal Mostafa <email address hidden>
2b0b134...
by
Oleg Nesterov <email address hidden>
sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
__wait_on_bit_lock() doesn't need abort_exclusive_wait() too. Right
now it can't use prepare_to_wait_event() (see the next change), but
it can do the additional finish_wait() if action() fails.
abort_exclusive_wait() no longer has callers, remove it.
Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit eaf9ef52241b545fe63621266bfc6fd8b06559ff)
Signed-off-by: Kamal Mostafa <email address hidden>
f6696c9...
by
Oleg Nesterov <email address hidden>
sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
___wait_event() doesn't really need abort_exclusive_wait(), we can simply
change prepare_to_wait_event() to remove the waiter from q->task_list if
it was interrupted.
This simplifies the code/logic, and this way prepare_to_wait_event() can
have more users, see the next change.
Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
--
include/linux/wait.h | 7 +------
kernel/sched/wait.c | 35 +++++++++++++++++++++++++----------
2 files changed, 26 insertions(+), 16 deletions(-)
(cherry picked from commit b1ea06a90f528e516929a4da1d9b8838752bceb9)
Signed-off-by: Kamal Mostafa <email address hidden>
dd6eda6...
by
Oleg Nesterov <email address hidden>
sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up()
Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.
If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wait an uninterruptible waiter.
The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.
Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die, see the next change.
Signed-off-by: Oleg Nesterov <email address hidden>
Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
Cc: Al Viro <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Johannes Weiner <email address hidden>
Cc: Linus Torvalds <email address hidden>
Cc: Mike Galbraith <email address hidden>
Cc: Neil Brown <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Thomas Gleixner <email address hidden>
Link: http://<email address hidden>
Signed-off-by: Ingo Molnar <email address hidden>
(cherry picked from commit 38a3e1fc1dac480f3672ab22fc97e1f995c80ed7)
Signed-off-by: Kamal Mostafa <email address hidden>
6be3994...
by
Vineeth Remanan Pillai <email address hidden>
mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen. Reposting the patch
with all review comments addressed and with minor modifications and
optimizations. Also, folding in the fixes offered by Hugh Dickins and
Huang Ying. Tests were rerun and commit message updated with new
results.
try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.
This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.
Improvement
swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.
Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about 9.0K calls(3min, avg 5% cpu util).
Details
In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.
In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.
Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().
Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.
This patch is based on the original prototype patch (that doesn't depend
on xarray) and it includes the following fixes:
64165b1affc5 mm: swapoff: take notice of completion sooner
dd862deb151a mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
870395465444 mm: swapoff: shmem_find_swap_entries() filter out other types
Along with other changes to make it more similar to the mainline patch.
Link: https://patchwork.kernel.org/patch/6048271/
(backported from commit b56a2d8af9147a4efe4011b60d93779c0461ca97)
Signed-off-by: Andrea Righi <email address hidden>
Signed-off-by: Kamal Mostafa <email address hidden>