~sunforce/ubuntu/+source/linux/+git/mainline-crack:mmu_gather-race-fix

Last commit made on 2022-10-31
Get this branch:
git clone -b mmu_gather-race-fix https://git.launchpad.net/~sunforce/ubuntu/+source/linux/+git/mainline-crack

Branch information

Name:
mmu_gather-race-fix
Repository:
lp:~sunforce/ubuntu/+source/linux/+git/mainline-crack

Recent commits

6200eb6... by Linus Torvalds <email address hidden>

mm: delay rmap removal until after TLB flush

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.

And that is a problem, because while the TLB entry exists, we could end
up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of
     the page

 (b) another CPU could continue to use the stale and dirty TLB entry
     and continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This commit uses the same old TLB gather array that we use to delay the
freeing of the page to also say 'remove from rmap after flush', so that
we can keep the rmap entries alive until all TLB entries have been
flushed.
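
As a rough illustration of the 'remove from rmap after flush' idea, here
is a minimal userspace sketch in C. It is not the kernel's code: struct
toy_gather, TOY_DELAY_RMAP and every toy_* name are hypothetical, and
tagging the low bit of the page pointer is only one assumed way to carry
the flag in the gather array. The point is the ordering in
toy_gather_flush(): TLB flush first, delayed rmap removal second, page
freeing last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DELAY_RMAP 1ul   /* hypothetical flag kept in the pointer's low bit */
#define TOY_BATCH_MAX  8

/* Toy stand-in for a page; the int member guarantees the low bit is free. */
struct toy_page {
	int  id;
	bool in_rmap;   /* still visible to reverse-map walkers */
	bool freed;
};

/* Toy stand-in for the gather array that delays page freeing. */
struct toy_gather {
	uintptr_t encoded[TOY_BATCH_MAX];
	size_t    nr;
	bool      tlb_flushed;
};

/* Queue a page for freeing, optionally asking for delayed rmap removal. */
static bool toy_gather_add(struct toy_gather *tlb, struct toy_page *page,
			   bool delay_rmap)
{
	uintptr_t bits = (uintptr_t)page;

	assert((bits & TOY_DELAY_RMAP) == 0);  /* pointer must be aligned */
	if (tlb->nr == TOY_BATCH_MAX)
		return false;                  /* caller must flush first */
	if (!delay_rmap)
		page->in_rmap = false;         /* early removal, the racy ordering */
	tlb->encoded[tlb->nr++] = bits | (delay_rmap ? TOY_DELAY_RMAP : 0);
	return true;
}

static void toy_gather_flush(struct toy_gather *tlb)
{
	size_t i;

	/* 1. Stand-in for the real TLB flush: no stale entries after this. */
	tlb->tlb_flushed = true;

	/* 2. Only now drop the rmap entries that were marked as delayed. */
	for (i = 0; i < tlb->nr; i++) {
		struct toy_page *page =
			(struct toy_page *)(tlb->encoded[i] & ~TOY_DELAY_RMAP);

		if (tlb->encoded[i] & TOY_DELAY_RMAP)
			page->in_rmap = false;
	}

	/* 3. Finally free the pages. */
	for (i = 0; i < tlb->nr; i++) {
		struct toy_page *page =
			(struct toy_page *)(tlb->encoded[i] & ~TOY_DELAY_RMAP);

		page->freed = true;
	}
	tlb->nr = 0;
}
```

In this model a page added with delay_rmap set stays visible to rmap
until toy_gather_flush() has run its first step, which is the property
the commit is after.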

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even when user space goes off the rails.
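
The user-space pattern described in the NOTE can be sketched roughly as
follows. This only illustrates the access pattern and is not a
reproducer: it makes no attempt to detect lost data, and
run_zap_vs_write_pattern() and struct race_ctx are hypothetical names.
It assumes Linux, a writable /tmp, and POSIX threads (build with
-pthread).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct race_ctx {
	unsigned char *map;
	size_t         len;
	atomic_int     stop;
};

/* Thread B: keeps dirtying the shared file mapping. */
static void *writer_thread(void *arg)
{
	struct race_ctx *ctx = arg;
	unsigned char v = 0;

	while (!atomic_load(&ctx->stop))
		memset(ctx->map, ++v, ctx->len);
	return NULL;
}

/* Thread A: zaps the same range with MADV_DONTNEED while B writes. */
int run_zap_vs_write_pattern(int iterations)
{
	char path[] = "/tmp/zap-vs-write-XXXXXX";
	struct race_ctx ctx;
	pthread_t tid;
	int fd, i, rc = 0;

	assert(iterations > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);                       /* file lives until fd is closed */

	ctx.len = (size_t)sysconf(_SC_PAGESIZE);
	if (ftruncate(fd, (off_t)ctx.len) != 0) {
		close(fd);
		return -1;
	}

	/* A shared, file-backed mapping: the case where dirty bits matter. */
	ctx.map = mmap(NULL, ctx.len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ctx.map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	atomic_init(&ctx.stop, 0);

	if (pthread_create(&tid, NULL, writer_thread, &ctx) != 0) {
		munmap(ctx.map, ctx.len);
		close(fd);
		return -1;
	}

	for (i = 0; i < iterations; i++) {
		if (madvise(ctx.map, ctx.len, MADV_DONTNEED) != 0)
			rc = -1;
	}

	atomic_store(&ctx.stop, 1);
	pthread_join(tid, NULL);
	munmap(ctx.map, ctx.len);
	close(fd);
	return rc;
}
```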

Reported-by: Nadav Amit <email address hidden>
Link: https://<email address hidden>/
Cc: Will Deacon <email address hidden>
Cc: Aneesh Kumar <email address hidden>
Cc: Andrew Morton <email address hidden>
Cc: Nick Piggin <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: Heiko Carstens <email address hidden>
Cc: Vasily Gorbik <email address hidden>
Cc: Alexander Gordeev <email address hidden>
Cc: Christian Borntraeger <email address hidden>
Cc: Sven Schnelle <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

d12cec6... by Linus Torvalds <email address hidden>

mm: re-unify the simplified page_zap_*_rmap() function

Now that we've simplified both the anonymous and file-backed page zap
functions, they end up being identical except for which page statistic
they update, and we can re-unify the implementation of that much
simplified code.

To make it very clear that this is only for the final pte zapping (since
a lot of the simplifications depended on that), name the unified
function 'page_zap_pte_rmap()'.
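
To show what "identical except for which page statistic they update"
can look like, here is a small userspace model in C. It is a sketch
under assumptions, not the kernel function: struct toy_page, toy_stats
and the TOY_* counters are hypothetical stand-ins, and the only
behaviour modelled is "drop one mapping, and if it was the last one,
decrement the matching statistic".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-type "mapped pages" statistics. */
enum toy_stat { TOY_ANON_MAPPED, TOY_FILE_MAPPED, TOY_NR_STATS };

struct toy_page {
	atomic_int mapcount;   /* -1 means "not mapped anywhere" */
	bool       anon;       /* anonymous or file-backed */
};

static long toy_stats[TOY_NR_STATS];

static void toy_page_init(struct toy_page *page, bool anon, int mappings)
{
	assert(mappings >= 1);
	atomic_init(&page->mapcount, mappings - 1);
	page->anon = anon;
}

/* One function for both page types; only the statistic differs. */
static void toy_page_zap_pte_rmap(struct toy_page *page)
{
	/* A non-negative new value means other mappings still exist. */
	if (atomic_fetch_sub(&page->mapcount, 1) - 1 >= 0)
		return;

	toy_stats[page->anon ? TOY_ANON_MAPPED : TOY_FILE_MAPPED]--;
}
```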

Link: https://<email address hidden>/
Cc: Nadav Amit <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: John Hubbard <email address hidden>
Cc: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

4305e9c... by Linus Torvalds <email address hidden>

mm: inline simpler case of page_remove_file_rmap()

Now that we have a simplified special case of 'page_remove_rmap()' that
doesn't deal with the 'compound' case and always gets a file-mapped (ie
not anonymous) page, it ended up doing just

 lock_page_memcg(page);
 page_remove_file_rmap(page, false);
 unlock_page_memcg(page);

but 'page_remove_file_rmap()' is actually trivial when 'compound' is false.

So just inline that non-compound case in the caller, and - like we did
in the previous commit for the anon pages - only do the memcg locking for
the parts that actually matter: the page statistics.

Also, as the previous commit did for anonymous pages, knowing we only
get called for the last-level page table entries allows for a further
simplification: we can get rid of the 'PageHuge(page)' case too.

You can't map a huge-page in a pte without splitting it (and the full
code in the generic page_remove_file_rmap() function has a comment to
that effect: "hugetlb pages are always mapped with pmds").

That means that the page_zap_file_rmap() case of that whole function is
really small and trivial.

Link: https://<email address hidden>/
Cc: Nadav Amit <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: John Hubbard <email address hidden>
Cc: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

e2dd770... by Linus Torvalds <email address hidden>

mm: introduce simplified versions of 'page_remove_rmap()'

The rmap handling is proving a bit problematic, and part of it comes
from the complexities of all the different cases of our implementation
of 'page_remove_rmap()'.

And a large part of that complexity comes from the fact that while we
have multiple different versions of _adding_ an rmap, this 'remove rmap'
function tries to deal with all possible cases.

So we have these specific versions for page_add_anon_rmap(),
page_add_new_anon_rmap() and page_add_file_rmap() which all do slightly
different things, but then 'page_remove_rmap()' has to handle all the
cases.

That's particularly annoying for 'zap_pte_range()', which already knows
which special case it's dealing with. It already checked for its own
reasons whether it's an anonymous page, and it already knows it's not
the compound page case and passed in an unconditional 'false' argument.

So this introduces the specialized versions of 'page_remove_rmap()' for
the cases that zap_pte_range() wants. We also make it the job of the
caller to do the munlock_vma_page(), which is really unrelated and is
the only thing that cares about the 'vma'.

This just means that we end up with several simplifications:

 - there's no 'vma' argument any more, because it's not used

 - there's no 'compound' argument any more, because it was always false

 - we can get rid of the tests for 'compound' and 'PageAnon()' since we
   know what they are

and so instead of having that fairly complicated page_remove_rmap()
function, we end up with a couple of specialized functions that are
_much_ simpler.

There is supposed to be no semantic difference from this change,
although this does end up simplifying the code further by moving the
atomic_add_negative() on the PageAnon mapcount to outside the memcg
locking.

That locking protects other data structures (the page state statistics),
and this avoids not only an ugly 'goto', but means that we don't need to
take and release the lock when we're not actually doing anything with
the state statistics.
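
The locking change can be pictured with a toy model in C. Everything
here is hypothetical (struct toy_memcg, toy_lock(), toy_zap_anon_rmap());
the "lock" is just a counter so the effect is observable, and
atomic_fetch_sub() stands in for the kernel's atomic_add_negative(). The
sketch only shows the shape described above: the mapcount update happens
outside the lock, and the lock is taken only when the statistics are
actually touched.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the memcg lock plus the statistics it protects. */
struct toy_memcg {
	bool held;
	long acquisitions;     /* how many times the lock was taken */
	long nr_anon_mapped;   /* statistic protected by the lock */
};

static void toy_lock(struct toy_memcg *m)
{
	assert(!m->held);
	m->held = true;
	m->acquisitions++;
}

static void toy_unlock(struct toy_memcg *m)
{
	assert(m->held);
	m->held = false;
}

static void toy_zap_anon_rmap(atomic_int *mapcount, struct toy_memcg *m)
{
	/* Mapcount update is done outside the lock; early return, no goto. */
	if (atomic_fetch_sub(mapcount, 1) - 1 >= 0)
		return;

	/* Only the last unmap touches the statistics, so only it locks. */
	toy_lock(m);
	m->nr_anon_mapped--;
	toy_unlock(m);
}
```

With a page mapped twice, the first call returns without ever taking the
lock; only the second call, which drops the last mapping, acquires it.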

We also remove the test for PageTransCompound(), since this is only
called for the final pte level from zap_pte_range().

Cc: Nadav Amit <email address hidden>
Cc: Peter Zijlstra <email address hidden>
Cc: John Hubbard <email address hidden>
Cc: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

30a0b95... by Linus Torvalds <email address hidden>

Linux 6.1-rc3

b72018a... by Linus Torvalds <email address hidden>

Merge tag 'fbdev-for-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes from Helge Deller:
 "A use-after-free bugfix in the smscufx driver and various minor error
  path fixes, smaller build fixes, sysfs fixes and typos in comments in
  the stifb, sisfb, da8xxfb, xilinxfb, sm501fb, gbefb and cyber2000fb
  drivers"

* tag 'fbdev-for-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbdev: cyber2000fb: fix missing pci_disable_device()
  fbdev: sisfb: use explicitly signed char
  fbdev: smscufx: Fix several use-after-free bugs
  fbdev: xilinxfb: Make xilinxfb_release() return void
  fbdev: sisfb: fix repeated word in comment
  fbdev: gbefb: Convert sysfs snprintf to sysfs_emit
  fbdev: sm501fb: Convert sysfs snprintf to sysfs_emit
  fbdev: stifb: Fall back to cfb_fillrect() on 32-bit HCRX cards
  fbdev: da8xx-fb: Fix error handling in .remove()
  fbdev: MIPS supports iomem addresses

9f12754... by Linus Torvalds <email address hidden>

Merge tag 'char-misc-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc fixes from Greg KH:
 "Some small driver fixes for 6.1-rc3. They include:

   - iio driver bugfixes

   - counter driver bugfixes

   - coresight bugfixes, including a revert and then a second fix to get
     it right.

  All of these have been in linux-next with no reported problems"

* tag 'char-misc-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
  misc: sgi-gru: use explicitly signed char
  coresight: cti: Fix hang in cti_disable_hw()
  Revert "coresight: cti: Fix hang in cti_disable_hw()"
  counter: 104-quad-8: Fix race getting function mode and direction
  counter: microchip-tcb-capture: Handle Signal1 read and Synapse
  coresight: cti: Fix hang in cti_disable_hw()
  coresight: Fix possible deadlock with lock dependency
  counter: ti-ecap-capture: fix IS_ERR() vs NULL check
  counter: Reduce DEFINE_COUNTER_ARRAY_POLARITY() to defining counter_array
  iio: bmc150-accel-core: Fix unsafe buffer attributes
  iio: adxl367: Fix unsafe buffer attributes
  iio: adxl372: Fix unsafe buffer attributes
  iio: at91-sama5d2_adc: Fix unsafe buffer attributes
  iio: temperature: ltc2983: allocate iio channels once
  tools: iio: iio_utils: fix digit calculation
  iio: adc: stm32-adc: fix channel sampling time init
  iio: adc: mcp3911: mask out device ID in debug prints
  iio: adc: mcp3911: use correct id bits
  iio: adc: mcp3911: return proper error code on failure to allocate trigger
  iio: adc: mcp3911: fix sizeof() vs ARRAY_SIZE() bug
  ...

c4d25ce... by Linus Torvalds <email address hidden>

Merge tag 'usb-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "A few small USB fixes for 6.1-rc3. Included in here are:

   - MAINTAINERS update, including a big one for the USB gadget
     subsystem. Many thanks to Felipe for all of the years of hard work
     he has done on this codebase, it was greatly appreciated.

   - dwc3 driver fixes for reported problems.

   - xhci driver fixes for reported problems.

   - typec driver fixes for minor issues

   - uvc gadget driver change, and then revert as it wasn't relevant for
     6.1-final, as it is a new feature and people are still reviewing
     and modifying it.

  All of these have been in the linux-next tree with no reported issues"

* tag 'usb-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: dwc3: gadget: Don't set IMI for no_interrupt
  usb: dwc3: gadget: Stop processing more requests on IMI
  Revert "usb: gadget: uvc: limit isoc_sg to super speed gadgets"
  xhci: Remove device endpoints from bandwidth list when freeing the device
  xhci-pci: Set runtime PM as default policy on all xHC 1.2 or later devices
  xhci: Add quirk to reset host back to default state at shutdown
  usb: xhci: add XHCI_SPURIOUS_SUCCESS to ASM1042 despite being a V0.96 controller
  usb: dwc3: st: Rely on child's compatible instead of name
  usb: gadget: uvc: limit isoc_sg to super speed gadgets
  usb: bdc: change state when port disconnected
  usb: typec: ucsi: acpi: Implement resume callback
  usb: typec: ucsi: Check the connection on resume
  usb: gadget: aspeed: Fix probe regression
  usb: gadget: uvc: fix sg handling during video encode
  usb: gadget: uvc: fix sg handling in error case
  usb: gadget: uvc: fix dropped frame after missed isoc
  usb: dwc3: gadget: Don't delay End Transfer on delayed_status
  usb: dwc3: Don't switch OTG -> peripheral if extcon is present
  MAINTAINERS: Update maintainers for broadcom USB
  MAINTAINERS: move USB gadget and phy entries under the main USB entry

ef3c094... by Linus Torvalds <email address hidden>

Merge tag 'gpio-fixes-for-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

 - convert gpio-tegra to using an immutable irqchip

 - MAINTAINERS update

* tag 'gpio-fixes-for-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  MAINTAINERS: Change myself to a maintainer
  gpio: tegra: Convert to immutable irq chip

4347660... by Linus Torvalds <email address hidden>

Merge tag 'perf_urgent_for_v6.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Rename a perf memory level event define to denote it is of CXL type

 - Add Alder and Raptor Lakes support to RAPL

 - Make sure raw sample data is output with tracepoints

* tag 'perf_urgent_for_v6.1_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/mem: Rename PERF_MEM_LVLNUM_EXTN_MEM to PERF_MEM_LVLNUM_CXL
  perf/x86/rapl: Add support for Intel Raptor Lake
  perf/x86/rapl: Add support for Intel AlderLake-N
  perf: Fix missing raw data on tracepoint events