Comment 15 for bug 1570195

Christian Ehrhardt  (paelzer) wrote :

Before going into discussions of how it "should" be, I added more debug code and gathered some good-case vs. bad-case data.

First of all it is "ok" to have no more buffers.
I had a printk in a codepath that only triggers when !more_used is true.
And I've seen plenty of hits for all kinds of idx values.
On adding virtio traffic it triggers a few times as well.
After all, that is what the loop is for: to wait until there is a buffer that it can get.
So things aren't broken just because this triggers at all - but of course they are if the state never changes.

IIRC: last_used == vring_used->idx just means nothing happened since our last interaction (to be confirmed).
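To make the index comparison concrete, here is a minimal userspace sketch of the check as the debug output reports it (both indices equal means no buffer is available). The struct and function names are made up for illustration and are not the kernel's definitions; only the two field meanings are taken from the log lines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a virtqueue. The field names mirror the debug
 * output, but this is NOT the kernel's struct vring_virtqueue. */
struct model_vq {
    uint16_t last_used_idx; /* driver side: next used entry we would consume */
    uint16_t used_idx;      /* device side: stands in for vring.used->idx */
};

/* True when the device has published used entries that the driver has not
 * consumed yet. Equal indices mean "nothing new since last time". */
static bool model_more_used(const struct model_vq *vq)
{
    return vq->last_used_idx != vq->used_idx;
}
```

With this model, the bad case is simply a queue where both values stay at 2 forever, so the check never turns true.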

Good case:
Some !more_used hits might occur, but they are unrelated and do not repeat infinitely
[ 393.542550] __virtqueue_get_buf: No more buffers in vq ffff8801b74b3000 - vq->last_used_idx 303 == vq->vring.used->idx 303
[ 394.097117] __virtqueue_get_buf: No more buffers in vq ffff8801b74b3000 - vq->last_used_idx 304 == vq->vring.used->idx 304
[ 394.097413] __virtqueue_get_buf: No more buffers in vq ffff8801b74b4000 - vq->last_used_idx 125 == vq->vring.used->idx 125
[...]
[ 394.449672] __virtqueue_get_buf: Entry checks passed - vq ffff8800bbaef000 from _vq ffff8800bbaef000
[ 394.452734] __virtqueue_get_buf: Exit checks passed - ffff8801b74b5840 vq->data[i]
[ 394.455087] __virtqueue_get_buf: Returning ret ffff8801b74b5840
Done

Bad case (after DPDK ran):
Now both debug printk's trigger
I get a LOT of
[ 552.018862] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
with a sequence like this in between
[ 554.157376] __virtqueue_get_buf: No more buffers in vq ffff8800bbaef000 - vq->last_used_idx 2 == vq->vring.used->idx 2
[ 554.158916] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
[ 554.160135] __virtqueue_get_buf: No more buffers in vq ffff8800bbaef000 - vq->last_used_idx 2 == vq->vring.used->idx 2
[ 554.161583] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
[ 554.162776] __virtqueue_get_buf: No more buffers in vq ffff8800bbaef000 - vq->last_used_idx 2 == vq->vring.used->idx 2
[ 554.164189] __virtqueue_is_broken: - vq ffff8800bbaef000 from _vq ffff8800bbaef000 -> broken 0
[...] (infinite loop)

Current assumption: DPDK disables something in the host part of the virtio device that makes the host no longer respond "correctly".
By unbinding/binding the driver we can reinitialize that, but if we don't we will run into this hang.
Remember: we only initialize DPDK with testpmd, no load whatsoever is driven by it.

We likely need two fixes:
1. find what DPDK does "to" the device and avoid it
2. the kernel should give up after some number of retries and return a failure (not good, but much better than hanging)
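For fix 2, a rough userspace sketch of what a bounded wait could look like. This is only an illustration of the idea, not a proposed kernel patch: the retry limit, the struct, and the return codes are all hypothetical, and the real code would call cpu_relax() or sleep between attempts.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_RETRIES 1000u /* hypothetical limit, not a kernel constant */

enum model_wait_result {
    MODEL_WAIT_OK = 0,      /* a used buffer showed up */
    MODEL_WAIT_BROKEN = -1, /* queue was marked broken */
    MODEL_WAIT_TIMEOUT = -2 /* gave up after MODEL_MAX_RETRIES polls */
};

/* Simplified stand-in for a virtqueue, not the kernel structure. */
struct retry_vq {
    uint16_t last_used_idx;    /* driver side */
    volatile uint16_t used_idx; /* device side, may change behind our back */
    bool broken;
};

/* Poll for a used buffer, but stop after a fixed number of attempts
 * instead of spinning forever when the device never answers. */
static enum model_wait_result model_wait_for_buf(struct retry_vq *vq)
{
    for (unsigned int tries = 0; tries < MODEL_MAX_RETRIES; tries++) {
        if (vq->broken)
            return MODEL_WAIT_BROKEN;
        if (vq->last_used_idx != vq->used_idx) {
            vq->last_used_idx++; /* consume one used entry */
            return MODEL_WAIT_OK;
        }
        /* the kernel would relax the CPU or sleep here */
    }
    return MODEL_WAIT_TIMEOUT;
}
```

In the bad case above (both indices stuck at 2, broken 0) this returns MODEL_WAIT_TIMEOUT instead of looping forever, and the caller can then fail the command.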