~mhcerri/ubuntu/+source/linux/+git/bionic:lp1848739

Last commit made on 2019-12-13
Get this branch:
git clone -b lp1848739 https://git.launchpad.net/~mhcerri/ubuntu/+source/linux/+git/bionic

Branch information

Name:
lp1848739
Repository:
lp:~mhcerri/ubuntu/+source/linux/+git/bionic

Recent commits

f249c87... by axboe

blk-mq: punt failed direct issue to dispatch list

BugLink: https://bugs.launchpad.net/bugs/1848739

After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.

Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.

This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: <email address hidden>
Suggested-by: Ming Lei <email address hidden>
Reported-by: Bart Van Assche <email address hidden>
Tested-by: Ming Lei <email address hidden>
Acked-by: Mike Snitzer <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit c616cbee97aed4bc6178f148a7240206dcdb85a6)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

3344dc9... by Ming Lei <email address hidden>

blk-mq: fail the request in case issue failure

BugLink: https://bugs.launchpad.net/bugs/1848739

Inside blk_mq_try_issue_list_directly(), if issuing the request fails,
we shouldn't try to issue it again; otherwise the warning in
blk_mq_start_request() will be triggered. This change aligns with the
behaviour of the other request issue and dispatch paths.
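
The rule above can be illustrated with a small user-space sketch. It is
not the kernel code: every name here (sketch_status, sketch_request,
sketch_issue_list_directly) is hypothetical, and treating the two
"resource" statuses as a retry-later condition is an assumption added
for the illustration. What it shows is that a request whose issue
failed outright is completed with the error and never issued again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's blk_status_t values. */
enum sketch_status { STS_OK, STS_RESOURCE, STS_DEV_RESOURCE, STS_IOERR };

struct sketch_request {
    enum sketch_status issue_result; /* what the simulated driver returns */
    int ended;                       /* set once the request is completed */
    enum sketch_status end_status;   /* status it was completed with */
};

/*
 * Issue each request in the list once. Returns the index of the first
 * request that hit a busy condition and must be handled later, or n if
 * the whole list was consumed.
 */
size_t sketch_issue_list_directly(struct sketch_request *rqs, size_t n)
{
    assert(rqs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        enum sketch_status ret = rqs[i].issue_result;

        if (ret == STS_OK)
            continue;

        /* Assumption: busy statuses stop the loop and leave the rest queued. */
        if (ret == STS_RESOURCE || ret == STS_DEV_RESOURCE)
            return i;

        /* Hard failure: complete the request with the error, never re-issue. */
        rqs[i].ended = 1;
        rqs[i].end_status = ret;
    }
    return n;
}
```

With a list of three requests returning OK, an I/O error, and a busy
status, the second is ended with the error and the function returns 2,
leaving the busy request for a later attempt.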

Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: Kashyap Desai <email address hidden>
Cc: Laurence Oberman <email address hidden>
Cc: Omar Sandoval <email address hidden>
Cc: Christoph Hellwig <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Hannes Reinecke <email address hidden>
Cc: kernel test robot <email address hidden>
Cc: LKP <lkp@01.org>
Reported-by: kernel test robot <email address hidden>
Signed-off-by: Ming Lei <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 8824f62246bef288173a6624a363352f0d4d3b09)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

a303abc... by axboe

blk-mq: fix corruption with direct issue

BugLink: https://bugs.launchpad.net/bugs/1848739

If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already set up SG tables etc. for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.
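
As a rough illustration of the fix described above, the following
user-space sketch models a request that becomes unmergeable once a
direct issue attempt has returned busy. The types, the flag's bit
position, and the function names are hypothetical and are not the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_REQ_NOMERGE (1u << 0) /* hypothetical bit position */

enum sketch_issue_status { SKETCH_OK, SKETCH_BUSY };

struct sketch_rq {
    unsigned int flags;
    unsigned int nr_sectors; /* size the driver may already have prepared for */
};

/* Grow the request by `extra` sectors unless it has been marked unmergeable. */
bool sketch_try_merge(struct sketch_rq *rq, unsigned int extra)
{
    assert(rq != 0);
    if (rq->flags & SKETCH_REQ_NOMERGE)
        return false;
    rq->nr_sectors += extra;
    return true;
}

/*
 * Record the outcome of a direct issue attempt. On busy, the driver may
 * have set up per-command state for the current size, so freeze the size.
 */
void sketch_after_direct_issue(struct sketch_rq *rq,
                               enum sketch_issue_status ret)
{
    assert(rq != 0);
    if (ret == SKETCH_BUSY)
        rq->flags |= SKETCH_REQ_NOMERGE;
}
```

Before the busy return a merge succeeds and changes nr_sectors; after
it, sketch_try_merge() refuses and the size stays what the driver saw.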

See also:
  https://bugzilla.kernel.org/show_bug.cgi?id=201685

Tested-by: Guenter Roeck <email address hidden>
Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit ffe81d45322cc3cb140f0db080a4727ea284661e)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

d5f72c6... by Ming Lei <email address hidden>

blk-mq: issue directly if hw queue isn't busy in case of 'none'

BugLink: https://bugs.launchpad.net/bugs/1848739

With the 'none' io scheduler, when the hw queue isn't busy, there is no
need to enqueue a request to the sw queue and then dequeue it again,
because the request can be submitted to the hw queue immediately at no
extra cost. Meanwhile, there shouldn't be many requests in the sw queue,
so we don't need to worry about the effect on IO merging.

There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.

This patch improves IOPS and decreases CPU utilization on megaraid_sas,
per Kashyap's test.

Cc: Kashyap Desai <email address hidden>
Cc: Laurence Oberman <email address hidden>
Cc: Omar Sandoval <email address hidden>
Cc: Christoph Hellwig <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Hannes Reinecke <email address hidden>
Reported-by: Kashyap Desai <email address hidden>
Tested-by: Kashyap Desai <email address hidden>
Signed-off-by: Ming Lei <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 6ce3dd6eec114930cf2035a8bcb1e80477ed79a8)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

7c5e7cd... by Ming Lei <email address hidden>

blk-mq: dequeue request one by one from sw queue if hctx is busy

BugLink: https://bugs.launchpad.net/bugs/1848739

Dequeuing requests one by one from the sw queue is not efficient, but
we have to do it when the queue is busy to get better merge performance.

This patch uses an Exponential Weighted Moving Average (EWMA) to
determine whether the queue is busy, and dequeues requests one by one
from the sw queue only when it is.
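
The EWMA idea can be sketched in user space as follows. This is an
illustration only: the struct, the function names, and the weight and
factor constants (8 and 4) are assumptions chosen for the example, not
values taken from this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_EWMA_WEIGHT 8 /* assumed: each update keeps 7/8 of the old value */
#define SKETCH_EWMA_FACTOR 4 /* assumed: a busy sample contributes 1 << 4 */

struct sketch_hctx {
    unsigned int dispatch_busy; /* integer EWMA of recent busy results */
};

/* Fold one dispatch outcome (busy or not) into the moving average. */
void sketch_update_dispatch_busy(struct sketch_hctx *hctx, bool busy)
{
    assert(hctx != NULL);

    unsigned int ewma = hctx->dispatch_busy;

    if (!ewma && !busy)
        return; /* already idle and still idle: nothing to update */

    ewma *= SKETCH_EWMA_WEIGHT - 1;
    if (busy)
        ewma += 1u << SKETCH_EWMA_FACTOR;
    ewma /= SKETCH_EWMA_WEIGHT;

    hctx->dispatch_busy = ewma;
}

/* Dequeue one request at a time only while the average says "busy". */
bool sketch_dequeue_one_by_one(const struct sketch_hctx *hctx)
{
    assert(hctx != NULL);
    return hctx->dispatch_busy != 0;
}
```

Starting from zero, a single busy sample raises the average to 2, and
two non-busy samples decay it back to 1 and then 0, at which point the
sketch returns to batch dequeuing.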

Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <email address hidden>
Cc: Laurence Oberman <email address hidden>
Cc: Omar Sandoval <email address hidden>
Cc: Christoph Hellwig <email address hidden>
Cc: Bart Van Assche <email address hidden>
Cc: Hannes Reinecke <email address hidden>
Reported-by: Kashyap Desai <email address hidden>
Tested-by: Kashyap Desai <email address hidden>
Signed-off-by: Ming Lei <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 6e768717304bdbe8d2897ca8298f6b58863fdc41)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

c48a87a... by axboe

blk-mq: don't queue more if we get a busy return

BugLink: https://bugs.launchpad.net/bugs/1848739

Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queueable command indefinitely.

Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
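
A simplified user-space model of the fixed behaviour is sketched below.
The device model, the command type, and the function names are all
hypothetical; the only rule taken from the text above is that the
dispatch loop stops pulling from the scheduler as soon as the driver
reports busy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_cmd {
    bool queueable; /* false models a non-queueable command */
};

struct sketch_dev {
    unsigned int inflight; /* commands currently owned by the device */
};

/*
 * Simulated driver: a non-queueable command is rejected (busy) while
 * anything else is in flight. Returns true if the command was accepted.
 */
bool sketch_queue_rq(struct sketch_dev *dev, const struct sketch_cmd *cmd)
{
    assert(dev != NULL && cmd != NULL);
    if (!cmd->queueable && dev->inflight > 0)
        return false;
    dev->inflight++;
    return true;
}

/*
 * Dispatch commands in scheduler order and stop at the first busy
 * return, so later commands cannot overtake and starve the rejected one.
 * Returns how many commands were handed to the device.
 */
size_t sketch_dispatch(struct sketch_dev *dev,
                       const struct sketch_cmd *sched, size_t n)
{
    size_t done = 0;

    while (done < n) {
        if (!sketch_queue_rq(dev, &sched[done]))
            break; /* busy: do not pull more from the scheduler */
        done++;
    }
    return done;
}
```

With the scheduler holding queueable, queueable, non-queueable,
queueable, the first pass dispatches two commands and stops. The fourth
command stays behind the non-queueable one until the device drains.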

Reviewed-by: Omar Sandoval <email address hidden>
Reviewed-by: Ming Lei <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 1f57f8d442f8017587eeebd8617913bfc3661d3d)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

b286aa2... by Bart Van Assche <email address hidden>

blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()

BugLink: https://bugs.launchpad.net/bugs/1848739

Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
However, the function name blk_mq_request_direct_issue is an exception.
Hence rename this function. This patch does not change any functionality.

Reviewed-by: Mike Snitzer <email address hidden>
Signed-off-by: Bart Van Assche <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit c77ff7fd03ddca8face268c4cf093c0edf4bcf1f)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

0765aad... by Ming Lei <email address hidden>

blk-mq: introduce BLK_STS_DEV_RESOURCE

BugLink: https://bugs.launchpad.net/bugs/1848739

This status is returned from the driver to the block layer if a
device-related resource is unavailable, but the driver can guarantee
that IO dispatch will be triggered in the future when the resource
becomes available.

Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurrently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.
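
The cases above reduce to a small decision, sketched here in user
space. The enum and function names are hypothetical, and the sketch
only restates the rules from this message: run the queue immediately
when SCHED_RESTART is not set, rerun after a delay (3 ms in the text)
when it is set and the driver returned BLK_STS_RESOURCE, and otherwise
rely on the completion of in-flight IO to restart dispatch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the busy statuses discussed above. */
enum sketch_status { STS_OK, STS_RESOURCE, STS_DEV_RESOURCE };

enum sketch_action {
    RUN_NOW,            /* run the hw queue immediately */
    RUN_DELAYED,        /* rerun the hw queue after a short delay */
    WAIT_FOR_COMPLETION /* an in-flight IO completion will restart dispatch */
};

/* Decide how the queue gets rerun after a dispatch stopped on `ret`. */
enum sketch_action sketch_after_busy(enum sketch_status ret,
                                     bool sched_restart_set)
{
    assert(ret != STS_OK); /* only meaningful after a busy return */

    if (!sched_restart_set)
        return RUN_NOW;
    if (ret == STS_RESOURCE)
        return RUN_DELAYED;
    return WAIT_FOR_COMPLETION;
}
```

In every branch something reruns the queue, which is the invariant the
message relies on to avoid IO hangs.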

Suggested-by: Jens Axboe <email address hidden>
Tested-by: Laurence Oberman <email address hidden>
Signed-off-by: Ming Lei <email address hidden>
Signed-off-by: Mike Snitzer <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 86ff7c2a80cd357f6156a53b354f6a0b357dc0c9)
[<email address hidden>: Fixed context in
 include/linux/blk_types.h, the missing context is from commit
 9111e5686c8c ("block: Provide blk_status_t decoding for path errors")
 which is not necessary]
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

26105c0... by Ming Lei <email address hidden>

blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

BugLink: https://bugs.launchpad.net/bugs/1848739

If blk_mq_request_direct_issue() is called while the queue is busy, we
don't want to dispatch the request into hctx->dispatch_list. Instead,
we return the queue busy status to the caller so that the caller can
handle it.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Reported-by: Laurence Oberman <email address hidden>
Reviewed-by: Mike Snitzer <email address hidden>
Signed-off-by: Ming Lei <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 23d4ee19e789ae3dce3e04bd24e3d1537965475f)
Signed-off-by: Marcelo Henrique Cerri <email address hidden>

9f9c97e... by Mike Snitzer <email address hidden>

blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request

BugLink: https://bugs.launchpad.net/bugs/1848739

After commit:

923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq")

we no longer use the 'can_block' argument in
blk_mq_sched_insert_request(). Kill it.

Signed-off-by: Mike Snitzer <email address hidden>

Added actual commit message as to why it's being removed.

Signed-off-by: Jens Axboe <email address hidden>
(cherry picked from commit 9e97d2951a7e6ee6e204f87f6bda4ff754a8cede)
[<email address hidden>: fixed conflict in blk_mq_requeue_work()
 because the commit aef1897cd36d ("blk-mq: insert rq with DONTPREP to
 hctx dispatch list when requeue") was already applied]
Signed-off-by: Marcelo Henrique Cerri <email address hidden>