Comment 27 for bug 1371591

Revision history for this message
Arvind Kumar (arvindkumar) wrote : Re: file not initialized to 0s under some conditions on VMWare

Hi Chris,

This is Arvind Kumar from VMware. Recently the issue discussed in this bug was brought to VMware's notice. We looked at the patch (https://lkml.org/lkml/2014/9/23/509) that was made to address the issue. Since the patch is in the mptsas driver, it addresses the issue only for the lsilogic controller; if the user uses some other controller, e.g. pvscsi or buslogic, the issue remains. Moreover, the patch disables WRITE SAME completely on lsilogic, which implies that VMware will never be able to support WRITE SAME on lsilogic. As I understand from the bug, it was concluded that WRITE SAME is not properly implemented by VMware. Actually, we don't support WRITE SAME at all.

We investigated the issue internally, and per our understanding it is not VMware-specific but rather appears to be in the kernel; it could very well happen on real hardware too if the disk doesn't support the WRITE SAME command. Below are the details of the investigation by Petr Vandrovec.

--

In blk-lib.c, line 294 checks whether the bdev supports write_same. With LVM, bdev here is dm-0. It says yes, it is supported, and so write_same is invoked (note that the check is racy in case the device loses its write_same capability between the test and the moment the bio is issued):

    291 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
    292                          sector_t nr_sects, gfp_t gfp_mask)
    293 {
    294         if (bdev_write_same(bdev)) {
    295                 unsigned char bdn[BDEVNAME_SIZE];
    296
    297                 if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
    298                                              ZERO_PAGE(0)))
    299                         return 0;
    300
    301                 bdevname(bdev, bdn);
    302                 pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
    303         }
    304
    305         return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
    306 }
    307 EXPORT_SYMBOL(blkdev_issue_zeroout);

Then it gets to LVM, and LVM forwards the request to sda. When it fails, the kernel clears bdev_write_same() on sda and returns -121 (EREMOTEIO).
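
(Context note, not part of Petr's write-up: bdev_write_same() is just a read of a cached per-queue limit, which is why clearing the capability on sda's queue does not change what dm-0's queue reports. Paraphrasing include/linux/blkdev.h of that era:)

    static inline unsigned int bdev_write_same(struct block_device *bdev)
    {
            struct request_queue *q = bdev_get_queue(bdev);

            /* reports the cached limit of this device's own queue only */
            if (q)
                    return q->limits.max_write_same_sectors;

            return 0;
    }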

Now the next request comes. Nobody cleared bdev_write_same() on dm-0; it got cleared only on sda. So the request gets to LVM, which forwards it to sda, where it hits a snag in blk-core.c (line 1824):

    if (bio->bi_rw & REQ_WRITE_SAME && !bdev_write_same(bio->bi_bdev)) {
            err = -EOPNOTSUPP;
            goto end_io;
    }

bi_bdev here is sda, and the I/O fails with EOPNOTSUPP without WRITE_SAME ever being issued. Then it hits the completion code, which treats EOPNOTSUPP as success:

    static void bio_batch_end_io(struct bio *bio, int err)
    {
            struct bio_batch *bb = bio->bi_private;

            if (err && (err != -EOPNOTSUPP))
                    clear_bit(BIO_UPTODATE, &bb->flags);
            if (atomic_dec_and_test(&bb->done))
                    complete(bb->wait);
            bio_put(bio);
    }

So everybody outside of blkdev_issue_write_same() thinks the I/O succeeded, while in reality the kernel did not even issue the request!
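
(To illustrate the consequence, here is a hypothetical caller pattern, not a specific kernel call site: the return value is the only signal a caller of blkdev_issue_zeroout() gets, so the spurious 0 makes it proceed as if the blocks were zeroed.)

    /* Hypothetical caller pattern -- not a real kernel call site.
     * A spurious 0 from blkdev_issue_zeroout() makes the caller hand
     * out blocks as "guaranteed zero" even though no I/O was issued. */
    static int prepare_new_blocks(struct block_device *bdev,
                                  sector_t start, sector_t count)
    {
            int err = blkdev_issue_zeroout(bdev, start, count, GFP_NOFS);

            if (err)
                    return err;     /* would fall back or fail loudly */

            /* blocks are now treated as zero-filled */
            return 0;
    }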

The fix should:

1. Use a different error code when a WRITE_SAME request is thrown away, or remove the special EOPNOTSUPP handling from the end_io callback. I assume EOPNOTSUPP is supposed to ignore failures from discarded commands, but then nobody else should be using EOPNOTSUPP (a minimal sketch of this option appears after this list); and

2. Propagate the WRITE_SAME failure from sda up to dm-0.
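
(As a rough illustration of option 1 only, this is a sketch, not the attached patch: dropping the EOPNOTSUPP carve-out from bio_batch_end_io() would make a thrown-away WRITE SAME clear BIO_UPTODATE, so blkdev_issue_write_same() reports failure and blkdev_issue_zeroout() falls back to manual zeroing.)

    /* Sketch of option 1 only -- not the attached patch.  Treat every
     * error, including -EOPNOTSUPP, as a failure so the submitter sees
     * that the WRITE SAME bio was thrown away and can fall back. */
    static void bio_batch_end_io(struct bio *bio, int err)
    {
            struct bio_batch *bb = bio->bi_private;

            if (err)                /* no special case for -EOPNOTSUPP */
                    clear_bit(BIO_UPTODATE, &bb->flags);
            if (atomic_dec_and_test(&bb->done))
                    complete(bb->wait);
            bio_put(bio);
    }

(This would also affect any other submitter sharing the same end_io callback, e.g. discards, which is presumably why the carve-out exists; option 2 would additionally require the stacked dm-0 queue to drop its cached write_same limit once sda loses the capability.)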

--

Our understanding is that we should revert the fix in the mptsas driver and instead make the right fix as described above. I am attaching the patch from Petr, who did the investigation, and CC'ing all involved people from VMware too. Could you please evaluate the patch and suggest further steps?

Thanks!
Arvind