~cascardo/ubuntu/+source/linux/+git/mantic:io_uring_backport

Last commit made on 2023-11-16
Get this branch:
git clone -b io_uring_backport https://git.launchpad.net/~cascardo/ubuntu/+source/linux/+git/mantic
Only Thadeu Lima de Souza Cascardo can upload to this branch.

Branch information

Name:
io_uring_backport
Repository:
lp:~cascardo/ubuntu/+source/linux/+git/mantic

Recent commits

208fd9c... by Al Viro <email address hidden>

io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed

BugLink: https://bugs.launchpad.net/bugs/2043730

->ki_pos value is unreliable in such cases. For an obvious example,
consider O_DSYNC write - we feed the data to page cache and start IO,
then we make sure it's completed. Update of ->ki_pos is dealt with
by the first part; failure in the second ends up with negative value
returned _and_ ->ki_pos left advanced as if sync had been successful.
In the same situation write(2) does not advance the file position
at all.
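
As a hedged illustration of the rule the fix enforces (a self-contained toy model, not the kernel's kiocb_done(); the names below are invented):

```c
#include <assert.h>

/* Toy model: if ->read_iter()/->write_iter() failed, the (possibly
 * advanced) ->ki_pos must not be trusted; report the position the
 * caller started from instead, matching what write(2) does. */
struct kiocb_model {
	long long ki_pos;	/* position the filesystem may have advanced */
};

long long kiocb_done_pos(const struct kiocb_model *kiocb,
			 long long start_pos, long ret)
{
	if (ret < 0)
		return start_pos;	/* failure: ->ki_pos is unreliable */
	return kiocb->ki_pos;		/* success: trust the new position */
}
```

In the O_DSYNC example, a failed sync leaves ki_pos advanced; the model keeps the reported position at start_pos.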

Reviewed-by: Christian Brauner <email address hidden>
Reviewed-by: Jens Axboe <email address hidden>
Signed-off-by: Al Viro <email address hidden>
(cherry picked from commit 1939316bf988f3e49a07d9c4dd6f660bf4daa53d)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

22f3e69... by axboe

io_uring/rw: disable IOCB_DIO_CALLER_COMP

BugLink: https://bugs.launchpad.net/bugs/2043730

If an application does O_DIRECT writes with io_uring and the file system
supports IOCB_DIO_CALLER_COMP, then completions of the dio write side are
done from the task_work that will post the completion event for said
write as well.

Whenever a dio write is done against a file, the inode i_dio_count is
elevated. This enables other callers to use inode_dio_wait() to wait for
previous writes to complete. If we defer the full dio completion to
task_work, we are dependent on that task_work being run before the
inode i_dio_count can be decremented.

If the same task that issues io_uring dio writes with
IOCB_DIO_CALLER_COMP performs a synchronous system call that calls
inode_dio_wait(), then we can deadlock as we're blocked sleeping on
the event to become true, but not processing the completions that will
result in the inode i_dio_count being decremented.

Until we can guarantee that this is the case, then disable the deferred
caller completions.
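
A hedged sketch of the workaround (a toy model, not the kernel code; the flag names and values below are invented for illustration):

```c
#include <assert.h>

/* Toy model: whatever kiocb flags the write path would otherwise
 * request, IOCB_DIO_CALLER_COMP is masked out, so dio completion (and
 * the i_dio_count decrement) never waits on io_uring task_work. */
enum {
	MODEL_IOCB_WRITE           = 1 << 0,
	MODEL_IOCB_DIRECT          = 1 << 1,
	MODEL_IOCB_DIO_CALLER_COMP = 1 << 2,
};

int rw_iocb_flags(int requested)
{
	return requested & ~MODEL_IOCB_DIO_CALLER_COMP;
}
```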

Fixes: 099ada2c8726 ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP")
Reported-by: Andres Freund <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 838b35bb6a89c36da07ca39520ec071d9250334d)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

ccbdc85... by axboe

io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid

BugLink: https://bugs.launchpad.net/bugs/2043730

We could race with SQ thread exit, and if we do, we'll hit a NULL pointer
dereference when the thread is cleared. Grab the SQPOLL data lock before
attempting to get the task cpu and pid for fdinfo, this ensures we have a
stable view of it.
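
The shape of the fix can be sketched as follows (a self-contained model using a pthread mutex in place of the kernel lock; names are illustrative, not the actual io_uring structures):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy model: read the SQPOLL thread's cpu/pid only while holding the
 * sqd lock, and tolerate the thread having already exited (NULL)
 * instead of dereferencing it. */
struct sq_thread_model { int cpu; int pid; };

struct sqd_model {
	pthread_mutex_t lock;
	struct sq_thread_model *thread;	/* NULL once the thread exits */
};

/* Returns 1 and fills cpu/pid if the thread is still alive, else 0. */
int sqd_read_thread_info(struct sqd_model *sqd, int *cpu, int *pid)
{
	int alive = 0;

	pthread_mutex_lock(&sqd->lock);
	if (sqd->thread) {	/* stable: exit path also takes the lock */
		*cpu = sqd->thread->cpu;
		*pid = sqd->thread->pid;
		alive = 1;
	}
	pthread_mutex_unlock(&sqd->lock);
	return alive;
}
```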

Cc: <email address hidden>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032
Reviewed-by: Gabriel Krisman Bertazi <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 7644b1a1c9a7ae8ab99175989bfc8676055edb46)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

65bbf13... by axboe

io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring address

BugLink: https://bugs.launchpad.net/bugs/2043730

If we specify a valid CQ ring address but an invalid SQ ring address,
we'll correctly spot this and free the allocated pages and clear them
to NULL. However, we don't clear the ring page count, and hence will
attempt to free the pages again. We've already cleared the address of
the page array when freeing them, but we don't check for that. This
causes the following crash:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Oops [#1]
Modules linked in:
CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56
Hardware name: ucbbar,riscvemu-bare (DT)
Workqueue: events_unbound io_ring_exit_work
epc : io_pages_free+0x2a/0x58
 ra : io_rings_free+0x3a/0x50
 epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0
 status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d
 [<ffffffff808811a2>] io_pages_free+0x2a/0x58
 [<ffffffff80881406>] io_rings_free+0x3a/0x50
 [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424
 [<ffffffff80027234>] process_one_work+0x10c/0x1f4
 [<ffffffff8002756e>] worker_thread+0x252/0x31c
 [<ffffffff8002f5e4>] kthread+0xc4/0xe0
 [<ffffffff8000332a>] ret_from_fork+0xa/0x1c

Check for a NULL array in io_pages_free(), but also clear the page counts
when we free them, to be on the safe side.
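
The hardening pattern can be sketched as a self-contained model (illustrative names, not the kernel's io_pages_free()):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: bail out on a NULL page array, and zero the count after
 * freeing, so a second call is a harmless no-op rather than a NULL
 * dereference as in the crash above. */
struct pages_model {
	void **pages;
	int npages;
};

void pages_free(struct pages_model *p)
{
	int i;

	if (!p->pages)		/* already freed, or never allocated */
		return;
	for (i = 0; i < p->npages; i++)
		free(p->pages[i]);
	free(p->pages);
	p->pages = NULL;
	p->npages = 0;		/* make repeated free attempts harmless */
}
```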

Reported-by: <email address hidden>
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Cc: <email address hidden>
Reviewed-by: Jeff Moyer <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 8b51a3956d44ea6ade962874ade14de9a7d16556)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

6540575... by Jeff Moyer <email address hidden>

io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()

BugLink: https://bugs.launchpad.net/bugs/2043730

I received a bug report with the following signature:

[ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 1759.944564] #PF: supervisor read access in kernel mode
[ 1759.949732] #PF: error_code(0x0000) - not-present page
[ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0
[ 1759.960596] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X --------- --- 5.14.0-362.3.1.el9_3.x86_64 #1
[ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018
[ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48
[ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286
[ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000
[ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400
[ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000
[ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50
[ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548
[ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000
[ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0
[ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1760.086325] PKRU: 55555554
[ 1760.089044] Call Trace:
[ 1760.091501] <TASK>
[ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0
[ 1760.106584] ? __die_body.cold+0x8/0xd
[ 1760.110356] ? page_fault_oops+0x134/0x170
[ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110
[ 1760.119298] ? exc_page_fault+0xa8/0x150
[ 1760.123247] ? asm_exc_page_fault+0x22/0x30
[ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1760.142527] __io_wq_cpu_online+0x54/0xb0
[ 1760.146558] cpuhp_invoke_callback+0x109/0x460
[ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10
[ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 1760.160320] cpuhp_thread_fun+0x8d/0x140
[ 1760.164266] smpboot_thread_fn+0xd3/0x1a0
[ 1760.168297] kthread+0xdd/0x100
[ 1760.171457] ? __pfx_kthread+0x10/0x10
[ 1760.175225] ret_from_fork+0x29/0x50
[ 1760.178826] </TASK>
[ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1760.248623] CR2: ffffffffffffffe8

A cpu hotplug callback was issued before wq->all_list was initialized.
This results in a null pointer dereference. The fix is to fully setup
the io_wq before calling cpuhp_state_add_instance_nocalls().
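
The ordering rule can be sketched with a toy model (fields and names below are invented; all_list_ready stands in for an initialized wq->all_list):

```c
#include <assert.h>

/* Toy model: everything the hotplug callback walks must be initialized
 * before the wq is exposed via cpuhp_state_add_instance_nocalls(). */
struct wq_model {
	int all_list_ready;	/* stands in for an initialized wq->all_list */
	int hotplug_visible;	/* set once the hotplug instance is added */
};

/* Models the hotplug callback: returns -1 where the real code would
 * dereference an uninitialized list. */
int wq_cpu_online_cb(const struct wq_model *wq)
{
	return wq->all_list_ready ? 0 : -1;
}

void wq_create_fixed(struct wq_model *wq)
{
	wq->all_list_ready = 1;		/* fully set up the wq first... */
	wq->hotplug_visible = 1;	/* ...then register with hotplug */
}
```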

Signed-off-by: Jeff Moyer <email address hidden>
Link: https://<email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 0f8baa3c9802fbfe313c901e1598397b61b91ada)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

2fc01ca... by axboe

io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages

BugLink: https://bugs.launchpad.net/bugs/2043730

On at least arm32, but presumably any arch with highmem, if the
application passes in memory that resides in highmem for the rings,
then we should fail that ring creation. We fail it with -EINVAL, which
is what kernels that don't support IORING_SETUP_NO_MMAP will do as well.
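
A hedged sketch of the validation (a toy model, not the kernel code; is_highmem stands in for PageHighMem()):

```c
#include <assert.h>

/* Toy model: after pinning the user's ring pages, reject the setup if
 * any page sits in highmem, since page_address() has no mapping there. */
struct page_model { int is_highmem; };

int rings_check_pages(const struct page_model *pages, int npages)
{
	int i;

	for (i = 0; i < npages; i++)
		if (pages[i].is_highmem)
			return -22;	/* -EINVAL, same as older kernels */
	return 0;
}
```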

Cc: <email address hidden>
Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 223ef474316466e9f61f6e0064f3a6fe4923a2c5)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

55fa957... by axboe

io_uring: ensure io_lockdep_assert_cq_locked() handles disabled rings

BugLink: https://bugs.launchpad.net/bugs/2043730

io_lockdep_assert_cq_locked() checks that locking is correctly done when
a CQE is posted. If the ring is setup in a disabled state with
IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until
the ring is later enabled. We generally don't post CQEs in this state,
as no SQEs can be submitted. However it is possible to generate a CQE
if tagged resources are being updated. If this happens and PROVE_LOCKING
is enabled, then the locking check helper will dereference
ctx->submitter_task, which hasn't been set yet.

Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While
at it, convert it to a static inline as well, so that generated line
offsets will actually reflect which condition failed, rather than just
the line offset for io_lockdep_assert_cq_locked() itself.
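
The guard can be sketched as follows (a self-contained toy model; the structures and the returned sentinel are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a ring created with IORING_SETUP_R_DISABLED has no
 * submitter_task yet, so the lockdep-style check must skip the
 * comparison instead of dereferencing NULL. */
struct task_model { int pid; };

struct cq_ctx_model {
	struct task_model *submitter_task;  /* NULL until ring is enabled */
};

/* Returns the pid the assertion would compare against, or -1 when the
 * check is (correctly) skipped for a still-disabled ring. */
int cq_locked_check_pid(const struct cq_ctx_model *ctx)
{
	if (!ctx->submitter_task)
		return -1;	/* disabled ring: nothing to assert yet */
	return ctx->submitter_task->pid;
}
```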

Reported-and-tested-by: <email address hidden>
Fixes: f26cc9593581 ("io_uring: lockdep annotate CQ locking")
Cc: <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 1658633c04653578429ff5dfc62fdc159203a8f2)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

d8ee0f6... by axboe

io_uring/kbuf: don't allow registered buffer rings on highmem pages

BugLink: https://bugs.launchpad.net/bugs/2043730

syzbot reports that registering a mapped buffer ring on arm32 can
trigger an OOPS. Registered buffer rings have two modes, one of them
is the application passing in the memory that the buffer ring should
reside in. Once those pages are mapped, we use page_address() to get
a virtual address. This will obviously fail on highmem pages, which
aren't mapped.

Add a check if we have any highmem pages after mapping, and fail the
attempt to register a provided buffer ring if we do. This will return
the same error as kernels that don't support provided buffer rings to
begin with.

Link: https://<email address hidden>/
Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring")
Cc: <email address hidden>
Reported-by: <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit f8024f1f36a30a082b0457d5779c8847cea57f57)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

9e5e85a... by axboe

io_uring/fs: remove sqe->rw_flags checking from LINKAT

BugLink: https://bugs.launchpad.net/bugs/2043730

This field is unionized with the actual link flags, so they can of course
be set and will be evaluated further down. Rejecting a nonzero rw_flags
therefore fails any LINKAT that needs to set option flags.
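
The aliasing that made the check wrong can be illustrated with a toy union (the layout below is illustrative only, not the real SQE):

```c
#include <assert.h>

/* Toy model: in the SQE these fields share storage, so a legitimate
 * linkat(2) option flag also reads back as a nonzero rw_flags. */
union sqe_flags_model {
	int rw_flags;
	int hardlink_flags;	/* e.g. AT_SYMLINK_FOLLOW (0x400) */
};
```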

Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT")
Cc: <email address hidden>
Reported-by: Thomas Leonard <email address hidden>
Link: https://github.com/axboe/liburing/issues/955
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit a52d4f657568d6458e873f74a9602e022afe666f)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>

cde57da... by Pavel Begunkov <email address hidden>

io_uring/net: fix iter retargeting for selected buf

BugLink: https://bugs.launchpad.net/bugs/2043730

When using the selected buffer feature, io_uring delays data iter setup
until later. If io_setup_async_msg() is called before that, it might see
a not-yet-initialized iterator. Pre-initialize nr_segs and judge from its
state whether we are repointing.
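
The idea can be sketched with a minimal model (names are illustrative, not the kernel's iov_iter internals):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: nr_segs is pre-initialized to 0, so the async setup path
 * can distinguish "iterator never configured" (selected buffer not
 * picked yet) from "iterator needs repointing". */
struct iter_model {
	size_t nr_segs;		/* 0 until the data iter is configured */
};

int iter_needs_repointing(const struct iter_model *it)
{
	return it->nr_segs != 0;	/* only a configured iter is repointed */
}
```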

Cc: <email address hidden>
Reported-by: <email address hidden>
Fixes: 0455d4ccec548 ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg")
Signed-off-by: Pavel Begunkov <email address hidden>
Link: https://<email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit c21a8027ad8a68c340d0d58bf1cc61dcb0bc4d2f)
Signed-off-by: Thadeu Lima de Souza Cascardo <email address hidden>