Accelerated networking (SR-IOV VF) broken in 18.10 daily

Bug #1794477 reported by Chris Valean
This bug affects 2 people
Affects                Status         Importance   Assigned to        Milestone
linux (Ubuntu)         Invalid        High         Joseph Salisbury
  Cosmic               Invalid        Undecided    Unassigned
linux-azure (Ubuntu)   Fix Released   High         Marcelo Cerri
  Cosmic               Fix Released   Undecided    Marcelo Cerri

Bug Description

While testing the Ubuntu 18.10 daily image from the cloud-images repo on Azure, we discovered that accelerated networking wasn’t working inside the VM.
No VF shows up inside the VM, and lspci doesn’t show any Mellanox drivers in use.
We also tested the daily build on Hyper-V, but there the Mellanox VF is functional with the same mlx4 drivers.

To give more details about this:
• No Mellanox logs show up in dmesg or syslog.
• modinfo mlx4_core/mlx4_en finds the module, but lsmod doesn’t show it as loaded. Accelerated Networking is enabled for the Azure VM, so the module should load transparently.
• modprobe -r mlx4_core && modprobe mlx4_core returns exit code 0, but nothing actually happens, and no Mellanox messages are logged in dmesg/syslog.
• There are no entries in the logs about the drivers or netvsc/pci-hyperv that might relate to this issue.
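
The checks listed above can be collected into a small helper. This is a minimal sketch and not something taken from the report: the function names `has_mellanox_vf`, `has_mlx4_module`, and `vf_status` are hypothetical, and the helpers read saved `lspci` and `lsmod` output from files so that they also work on logs attached to a bug.

```shell
#!/bin/sh
# Hypothetical helpers: summarise the Mellanox VF state from saved command output.

# Succeeds if lspci output on stdin lists a Mellanox Virtual Function.
has_mellanox_vf() {
    grep -Eiq 'mellanox.*virtual function'
}

# Succeeds if lsmod output on stdin shows an mlx4 module as loaded.
has_mlx4_module() {
    grep -Eq '^mlx4_(core|en|ib)[[:space:]]'
}

# vf_status LSPCI_FILE LSMOD_FILE
# Prints a one-word verdict based on the two saved outputs.
vf_status() {
    if ! has_mellanox_vf < "$1"; then
        echo "no-vf-device"
    elif ! has_mlx4_module < "$2"; then
        echo "vf-present-driver-not-loaded"
    else
        echo "vf-ok"
    fi
}

# On a live VM (needs pciutils and kmod):
#   lspci > /tmp/lspci.txt; lsmod > /tmp/lsmod.txt
#   vf_status /tmp/lspci.txt /tmp/lsmod.txt
```

A "no-vf-device" verdict matches the symptom described here: the VF never appears on the PCI bus, so loading mlx4_core by hand has nothing to bind to.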

Kernel: 4.18.0-7-generic

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1794477

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: cosmic
Chris Valean (cvalean)
description: updated
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Marcelo Cerri (mhcerri)
tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
Chris Valean (cvalean) wrote : Re: [Azure] Accelerated networking broken in 18.10 daily

This is now also broken on Hyper-V VMs, tested with 4.18.0-8-generic and the daily 18.10 server image.

Chris Valean (cvalean)
summary: - [Azure] Accelerated networking broken in 18.10 daily
+ Accelerated networking (SR-IOV VF) broken in 18.10 daily
Revision history for this message
Chris Valean (cvalean) wrote :

The bug is now also present in the linux-azure-edge kernels on the 4.18 tree; these should not be promoted from proposed to release.

Revision history for this message
Joshua R. Poulson (jrp) wrote :

This is a blocker for 4.18 in 18.04. This bug should be extended to bionic "linux-azure".

no longer affects: linux-azure (Ubuntu Cosmic)
no longer affects: linux-azure (Ubuntu Bionic)
no longer affects: linux (Ubuntu Bionic)
Changed in linux-azure (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Do you have a way to test some kernels out? If so, we could perform a kernel bisect to narrow down the commit that introduced this.

tags: added: kernel-key
removed: kernel-da-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

That is, if this is a regression. Was this working on a prior kernel version?

Revision history for this message
Chris Valean (cvalean) wrote :

What I know is that this works with the 4.15 series azure kernels, but not with 4.18, ever since we got the Cosmic dailies with it.

If these versions are separate branches in the Ubuntu kernel tree, then I don't know how a bisect would work...

Revision history for this message
Chris Valean (cvalean) wrote :

If you can point me to the Ubuntu kernel sources where the 4.15 and 4.18 kernels live, I can check whether a bisect is possible.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I can build you kernels for the bisect. You test the kernel and post back if it has the bug or not. Then I will build the next kernel based on your results. Repeat.

To start a bisect, we first need to identify the last good kernel version and the first bad one. Can you test the following kernels:

4.15.0-29: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/15137612
4.17.0-4: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/bootstrap/+build/15068818
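
The build, test, and repeat loop described above is a binary search between a known-good and a known-bad kernel. The sketch below only illustrates that search over an ordered list of builds; `first_bad` and the `is_good` callback are hypothetical names, and a real bisect over commits would normally use `git bisect start`, `git bisect good`, and `git bisect bad` instead.

```shell
#!/bin/sh
# first_bad TEST_CMD VERSION...
# Versions are ordered oldest to newest. The first is assumed good and the
# last is assumed bad. TEST_CMD VERSION must exit 0 when that build is good.
# Prints the first bad version. Runs in a subshell so no variables leak.
first_bad() (
    test_cmd=$1
    shift
    lo=1      # index of the newest build known to be good
    hi=$#     # index of the oldest build known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$test_cmd" "$candidate"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    eval "echo \${$hi}"
)

# Example (is_good is a placeholder for "boot this build and check for the VF"):
#   first_bad is_good build-a build-b build-c build-d
```

Each round halves the remaining range, which is why agreeing on the last good and first bad kernel before starting saves so many test builds.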

Changed in linux (Ubuntu Cosmic):
assignee: Marcelo Cerri (mhcerri) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Marcelo Cerri (mhcerri) → Joseph Salisbury (jsalisbury)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, the Ubuntu kernel trees are located at:

4.15: git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/
4.18: git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/cosmic/

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Regarding my comment #9, I read your comment #7 incorrectly. I literally read you don't know how a bisect would work. I understand clearly now.

We still should be able to bisect using the azure tree. I'll review the commits between 4.15 and 4.18 and see if anything sticks out.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Just to confirm, the 4.18 kernel you see this bug in is the standard Ubuntu kernel?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

For 4.18 based azure images, there are two repos:

lp:~canonical-kernel/ubuntu/+source/linux-azure/+git/cosmic

or the azure-edge-next branch in the bionic azure tree.

I'm going to review both to see if the commits going from 4.15 to 4.18 are linear. If they are, we may be able to bisect.

Changed in linux-azure (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Cosmic):
importance: High → Critical
Changed in linux (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Cosmic):
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Chris Valean (cvalean) wrote :

"Just to confirm, the 4.18 kernel you see this bug in is the standard Ubuntu kernel?"
> A: this first was seen in the daily cosmic 18.10 images (before RTM), and when azure-edge kernels were created we've seen this problem on those as well.
That includes azure-edge for Bionic, for example, not only the Cosmic one.

Both kernels provided in comment #9 have the issue: 4.15.0-29 and 4.17.0-4.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. That's interesting because neither of those kernels is the custom azure kernel, and you say you don't see this bug on the 4.15 azure kernel. However, it is seen on the non-azure 4.15 kernel.

Can you see if this kernel exhibits the bug:
https://launchpad.net/ubuntu/+source/linux/4.15.0-20.21/+build/14791489

In parallel, I'm going to create an image on azure to see if I can reproduce the bug as well. Are there any particular options I should pass to azure when creating the VM or just the default?

Revision history for this message
David Coronel (davecore) wrote :

I can reproduce the issue in Azure with Ubuntu 18.04 and the kernel linux-azure-edge 4.18.0.1004.5 from bionic-proposed.

I use an instance of type "Standard F4s_v2 (4 vcpus, 8 GB memory)"

I launch the instance and get the 4.15.0-1030-azure kernel:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

$ uname -a
Linux davecore-an 4.15.0-1030-azure #31-Ubuntu SMP Tue Oct 30 18:35:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ dmesg | grep -i mella
[ 27.398120] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 27.425228] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 27.658916] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

$ dpkg -l | grep linux-azure
ii linux-azure 4.15.0.1030.30 amd64 Complete Linux kernel for Azure systems.
ii linux-azure-cloud-tools-4.15.0-1030 4.15.0-1030.31 amd64 Linux kernel version specific cloud tools for version 4.15.0-1030
ii linux-azure-headers-4.15.0-1030 4.15.0-1030.31 all Header files related to Linux kernel version 4.15.0
ii linux-azure-tools-4.15.0-1030 4.15.0-1030.31 amd64 Linux kernel version specific tools for version 4.15.0-1030

$ lsmod | grep -i mlx
mlx4_en 114688 0

$ lspci
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
0001:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

I then enable bionic-proposed and install the linux-azure-edge kernel and reboot. I can't see the Mellanox device anymore:

$ dmesg | grep -i mella

$ uname -a
Linux davecore-an 4.18.0-1004-azure #4~18.04.1-Ubuntu SMP Thu Oct 25 14:25:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep linux-azure
ii linux-azure 4.15.0.1030.30 amd64 Complete Linux kernel for Azure systems.
ii linux-azure-cloud-tools-4.15.0-1030 4.15.0-1030.31 amd64 Linux kernel version specific cloud tools for version 4.15.0-1030
ii linux-azure-edge 4.18.0.1004.5 amd64 Complete Linux kernel for Azure systems.
ii linux-azure-edge-cloud-tools-4.18.0-1004 4.18.0-1004.4~18.04.1 amd64 Linux kernel version specific cloud tools for version 4.18.0-1004
ii linux-azure-edge-tools-4.18.0-1004 4.18.0-1004.4~18.04.1 amd64 Linux kernel version specific tools for version 4.18.0-1004
ii linux-azure-headers-4.15.0-1030 4.15.0-1030.31 all Header...


Revision history for this message
David Coronel (davecore) wrote :

Hi Chris. I spoke to Joseph and I think we have everything we need to start the bisect. We'll get started and keep you posted.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi David,

Before starting a bisect, we can try to narrow down the exact versions first. To start, I built a 4.17 kernel based on the azure tree.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1794477

Can you test this kernel and see if it exhibits this bug?

Revision history for this message
David Coronel (davecore) wrote :

Hi Joseph,

Is there a special way I should install these? I tried:

sudo dpkg -i *

But I get:

Selecting previously unselected package linux-azure-cloud-tools-4.17.0-1001.
(Reading database ... 56547 files and directories currently installed.)
Preparing to unpack linux-azure-cloud-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-azure-cloud-tools-4.17.0-1001 (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-azure-tools-4.17.0-1001.
Preparing to unpack linux-azure-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-azure-tools-4.17.0-1001 (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-cloud-tools-4.17.0-1001-azure.
Preparing to unpack linux-cloud-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-cloud-tools-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-headers-4.17.0-1001-azure.
Preparing to unpack linux-headers-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-headers-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-image-unsigned-4.17.0-1001-azure.
Preparing to unpack linux-image-unsigned-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-image-unsigned-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-modules-4.17.0-1001-azure.
Preparing to unpack linux-modules-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-modules-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-modules-extra-4.17.0-1001-azure.
Preparing to unpack linux-modules-extra-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-modules-extra-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Selecting previously unselected package linux-tools-4.17.0-1001-azure.
Preparing to unpack linux-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb ...
Unpacking linux-tools-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
Setting up linux-azure-cloud-tools-4.17.0-1001 (4.17.0-1001.2~lp1794477) ...
dpkg: dependency problems prevent configuration of linux-azure-tools-4.17.0-1001:
 linux-azure-tools-4.17.0-1001 depends on libc6 (>= 2.28); however:
  Version of libc6:amd64 on system is 2.27-3ubuntu1.

dpkg: error processing package linux-azure-tools-4.17.0-1001 (--install):
 dependency problems - leaving unconfigured
Setting up linux-cloud-tools-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
dpkg: dependency problems prevent configuration of linux-headers-4.17.0-1001-azure:
 linux-headers-4.17.0-1001-azure depends on linux-azure-headers-4.17.0-1001; however:
  Package linux-azure-headers-4.17.0-1001 is not installed.

dpkg: error processing package linux-headers-4.17.0-1001-azure (--install):
 dependency problems - leaving unconfigured
Setting up linux-modules-4.17.0-1001-azure (4.17.0-1001.2~lp1794477) ...
dpkg: dependency problems prevent configuration of linux-modules-extra-4.17.0-1001-azure:
 linux-modules-extra-4.17.0-1001-azure depends on crda | wireless-crda; however:
  Package crda is not install...


Revision history for this message
David Coronel (davecore) wrote :

These are the packages I see installed out of the box in the instance:

ubuntu@lp1794477:~/41701001$ dpkg -l | grep 4.15.0-1030.31 | sort

ii linux-azure-cloud-tools-4.15.0-1030 4.15.0-1030.31 amd64 Linux kernel version specific cloud tools for version 4.15.0-1030
ii linux-azure-headers-4.15.0-1030 4.15.0-1030.31 all Header files related to Linux kernel version 4.15.0
ii linux-azure-tools-4.15.0-1030 4.15.0-1030.31 amd64 Linux kernel version specific tools for version 4.15.0-1030
ii linux-cloud-tools-4.15.0-1030-azure 4.15.0-1030.31 amd64 Linux kernel version specific cloud tools for version 4.15.0-1030
ii linux-headers-4.15.0-1030-azure 4.15.0-1030.31 amd64 Linux kernel headers for version 4.15.0 on 64 bit x86 SMP
ii linux-image-4.15.0-1030-azure 4.15.0-1030.31 amd64 Signed kernel image azure
ii linux-modules-4.15.0-1030-azure 4.15.0-1030.31 amd64 Linux kernel extra modules for version 4.15.0 on 64 bit x86 SMP
ii linux-tools-4.15.0-1030-azure 4.15.0-1030.31 amd64 Linux kernel version specific tools for version 4.15.0-1030

And these are the files I downloaded from your link:

ubuntu@lp1794477:~/41701001$ ls -1 | sort

linux-azure-cloud-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb
linux-azure-tools-4.17.0-1001_4.17.0-1001.2~lp1794477_amd64.deb
linux-cloud-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-headers-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-image-unsigned-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-modules-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-modules-extra-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb
linux-tools-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

You at least need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages in that order. You should be able to do that with 'sudo dpkg -i package_name.deb'
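
As an illustration of that ordering, the sketch below picks the three required packages out of a directory of downloaded .deb files and prints them in install order. The function name `kernel_install_order` is hypothetical, and the prefix matching assumes the file naming seen in this thread (`linux-modules-<version>`, `linux-modules-extra-<version>`, `linux-image-...`).

```shell
#!/bin/sh
# kernel_install_order FILE...
# Prints the given .deb names in the order modules, modules-extra, image.
# Other packages (headers, tools) are left out of the output.
kernel_install_order() (
    for prefix in 'linux-modules-[0-9]' 'linux-modules-extra-' 'linux-image-'; do
        for deb in "$@"; do
            # Match on the file name only, ignoring any directory part.
            case ${deb##*/} in
                ${prefix}*) echo "$deb" ;;
            esac
        done
    done
)

# Example, run in the download directory:
#   sudo dpkg -i $(kernel_install_order *.deb)
```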

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

For this particular bug, the tools packages are probably not needed.

Revision history for this message
David Coronel (davecore) wrote :

I can see the Mellanox devices with that 4.17.0-1001-azure kernel.

I installed the new kernel:

sudo dpkg -i linux-modules-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb linux-modules-extra-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb linux-image-unsigned-4.17.0-1001-azure_4.17.0-1001.2~lp1794477_amd64.deb

Fixed the missing dependencies with:

sudo apt --fix-broken install

Updated grub just in case:

sudo update-grub

I rebooted and I see the devices:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.17.0-1001-azure #2~lp1794477 SMP Tue Nov 20 22:40:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 5.534382] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 5.564313] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 5.588721] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing, David. That means this regression was introduced in 4.18 and narrows down the number of commits we need to bisect through. Can you next test the first 4.18 based kernel in the azure tree:

http://kernel.ubuntu.com/~jsalisbury/lp1794477

Install the .debs in the same order as before.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

To try and speed up the kernel build time, I already built 4.18.0-1002 as well. There are now two 4.18 kernels available for testing:

4.18.0-1001: http://kernel.ubuntu.com/~jsalisbury/lp1794477/4.18.0-1001/
4.18.0-1002: http://kernel.ubuntu.com/~jsalisbury/lp1794477/4.18.0-1002/

Revision history for this message
David Coronel (davecore) wrote :

4.18.0-1001:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1001-azure #2~lp1794477 SMP Wed Nov 21 16:06:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 5.701988] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 5.730013] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 5.780748] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
David Coronel (davecore) wrote :

4.18.0-1002:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1002-azure #3~lp1794477 SMP Wed Nov 21 16:41:12 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 5.723350] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 5.751281] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 5.806824] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the 4.18.0-1003 kernel, which can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1794477/4.18.0-1003

Can you give this one a test?

Revision history for this message
David Coronel (davecore) wrote :

4.18.0-1003:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1003-azure #4~lp1794477 SMP Wed Nov 21 18:32:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 5.701478] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 5.732267] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 5.773481] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the 4.18.0-1004 kernel, which can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1794477/4.18.0-1004

Can you give this one a test?

If this kernel does not exhibit the bug, I'll start using the bionic azure-edge tree. These prior test kernels are with the azure cosmic tree, so the commit that causes this bug may not have been introduced there.

Revision history for this message
David Coronel (davecore) wrote :

4.18.0-1004:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1004-azure #5~lp1794477 SMP Wed Nov 21 19:19:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 6.009608] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 6.045063] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 6.085565] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the 4.18.0-1004 kernel, but this time using the bionic tree and azure-edge-next branch. This kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1794477/bionic-tree/4.18.0-1004

Revision history for this message
David Coronel (davecore) wrote :

bionic-tree/4.18.0-1004:

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1004-azure #4~18.04.1 SMP Wed Nov 21 20:07:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 6.337786] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 6.374260] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 6.393764] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built one more 1004 kernel from the bionic tree, but this time including all of the packages. This kernel can also be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1794477/bionic-tree/4.18.0-1004

Revision history for this message
David Coronel (davecore) wrote :

bionic-tree/4.18.0-1004 with all the packages:

$ sudo dpkg -i linux-modules-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-modules-extra-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-image-unsigned-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-azure-headers-4.18.0-1004_4.18.0-1004.4~18.04.1LP1794477_all.deb \
 linux-headers-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-azure-edge-tools-4.18.0-1004_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-tools-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-azure-edge-cloud-tools-4.18.0-1004_4.18.0-1004.4~18.04.1LP1794477_amd64.deb \
 linux-cloud-tools-4.18.0-1004-azure_4.18.0-1004.4~18.04.1LP1794477_amd64.deb

$ dpkg -l | grep 4.18.0
ii linux-azure-edge-cloud-tools-4.18.0-1004 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-azure-edge-tools-4.18.0-1004 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-azure-headers-4.18.0-1004 4.18.0-1004.4~18.04.1LP1794477 all
ii linux-cloud-tools-4.18.0-1004-azure 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-headers-4.18.0-1004-azure 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-image-unsigned-4.18.0-1004-azure 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-modules-4.18.0-1004-azure 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-modules-extra-4.18.0-1004-azure 4.18.0-1004.4~18.04.1LP1794477 amd64
ii linux-tools-4.18.0-1004-azure 4.18.0-1004.4~18.04.1LP1794477 amd64

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1004-azure #4~18.04.1LP1794477 SMP Wed Nov 21 20:57:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
[ 5.937218] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 5.971114] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 6.016903] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0

ubuntu@lp1794477:~$ lspci | grep -i mell
ae9f:00:02.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ubuntu@lp1794477:~$ lsmod | grep -i mlx
mlx4_en 114688 0

Revision history for this message
Marcelo Cerri (mhcerri) wrote :

We have already identified the problem with bionic/azure-edge 4.18.0-1004. There was an issue in this version's packaging that caused most of the modules to be included in the linux-modules-extra package instead of the linux-modules package.

The azure kernels do not install the linux-modules-extra package by default, and that's the reason the Mellanox driver wasn't found. As a workaround for 4.18.0-1004, the extra package can be installed with the command "apt-get install linux-modules-extra-azure-edge".

The version for the current cycle (4.18.0-1005) is already fixed and should be built in our PPA soon. We will post an update once this version is available in the PPA.

The corresponding cosmic/azure was not affected.
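
One way to confirm which package carries a given module is to search the package file listings. The sketch below assumes the output of `dpkg-deb -c <package>.deb` has been saved to one file per package, named after the package with a `.list` suffix; the function names and that file naming are hypothetical. On an installed system, `dpkg -S mlx4_core.ko` answers the same question directly.

```shell
#!/bin/sh
# Succeeds if the package listing on stdin contains MODULE.ko.
listing_has_module() {
    grep -q "/$1\.ko\$"
}

# which_package MODULE LISTING_FILE...
# Prints the name of the first listing (without .list) that ships MODULE,
# or "not-found" with a non-zero exit status.
which_package() (
    module=$1
    shift
    for listing in "$@"; do
        if listing_has_module "$module" < "$listing"; then
            basename "$listing" .list
            exit 0
        fi
    done
    echo "not-found"
    exit 1
)

# Example:
#   dpkg-deb -c linux-modules-extra-*.deb > linux-modules-extra.list
#   dpkg-deb -c linux-modules-[0-9]*.deb  > linux-modules.list
#   which_package mlx4_core linux-modules.list linux-modules-extra.list
```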

Changed in linux-azure (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu):
importance: Critical → High
Changed in linux (Ubuntu Cosmic):
importance: Critical → High
Changed in linux-azure (Ubuntu):
importance: Critical → High
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Marcelo Cerri (mhcerri) wrote :
Revision history for this message
David Coronel (davecore) wrote :

Hi Marcelo,

Maybe I'm doing something wrong but I still don't see the Mellanox devices with this 4.18.0-1005 kernel from the CKT PPA:

ubuntu@lp1794477:~$ sudo add-apt-repository ppa:canonical-kernel-team/ppa
ubuntu@lp1794477:~$ sudo apt install linux-azure-edge
ubuntu@lp1794477:~$ sudo update-grub
ubuntu@lp1794477:~$ sudo reboot

ubuntu@lp1794477:~$ uname -a
Linux lp1794477 4.18.0-1005-azure #5~18.04.1-Ubuntu SMP Thu Nov 22 00:01:08 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@lp1794477:~$ dmesg | grep -i mella
ubuntu@lp1794477:~$ lspci | grep -i mell
ubuntu@lp1794477:~$ lsmod | grep -i mlx

ubuntu@lp1794477:~$ apt-cache policy linux-image-4.18.0-1005-azure
linux-image-4.18.0-1005-azure:
  Installed: 4.18.0-1005.5~18.04.1
  Candidate: 4.18.0-1005.5~18.04.1
  Version table:
 *** 4.18.0-1005.5~18.04.1 500
        500 http://ppa.launchpad.net/canonical-kernel-team/ppa/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status

Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Confirmed
Changed in linux-azure (Ubuntu):
status: Fix Committed → Confirmed
Revision history for this message
Marcelo Cerri (mhcerri) wrote :

There's a second issue that also affects cosmic/linux-azure:

In 4.18, the module pci-hyperv moved from "drivers/pci/host/pci-hyperv.ko" to "drivers/pci/controller/pci-hyperv.ko". That caused the module to be included in the linux-modules-extra package instead of the linux-modules package.

I'm preparing a fix for it and adding some additional checks to prevent modules from silently moving to linux-modules-extra when their base location has changed.
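
To illustrate the kind of check that could catch such a move, the sketch below compares two lists of module paths and reports every module whose directory changed between them. It is not the check that was added to the Ubuntu packaging; the function name `moved_modules` and the one-path-per-line list format are assumptions made for this example.

```shell
#!/bin/sh
# moved_modules OLD_LIST NEW_LIST
# Each list holds one module path per line, relative to the kernel tree.
# Prints "name: old-dir -> new-dir" for every module whose directory changed.
moved_modules() {
    awk '
        {
            # Split the path into its directory and file name.
            n = split($0, parts, "/")
            name = parts[n]
            dir = substr($0, 1, length($0) - length(name) - 1)
        }
        # First file: remember where each module used to live.
        NR == FNR { old[name] = dir; next }
        # Second file: report modules found in a different directory.
        (name in old) && old[name] != dir {
            printf "%s: %s -> %s\n", name, old[name], dir
        }
    ' "$1" "$2"
}

# A list can be produced on a running system with, for example:
#   (cd "/lib/modules/$(uname -r)/kernel" && find . -name '*.ko') > modules.txt
```

A packaging inclusion list that matches modules by path would miss a module reported here until its entry is updated, which is the failure mode described above.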

Marcelo Cerri (mhcerri)
no longer affects: linux (Ubuntu Cosmic)
Changed in linux-azure (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux-azure (Ubuntu Cosmic):
status: New → Fix Committed
assignee: nobody → Marcelo Cerri (mhcerri)
Revision history for this message
Chris Valean (cvalean) wrote :

This does look resolved on WS2016 and Azure.
However, SR-IOV is broken in another way now on WS2019.

Affected proposed kernels:
- cosmic proposed
- bionic proposed edge - 4.18 based.

Tested this with cosmic linux-azure 4.18.0.1006.7 from proposed, same vhd:
- SR-IOV with Mellanox CX3 works fine on WS2016, all testing has passed.
- SR-IOV with Mellanox CX3/CX4 is broken on WS2019.

These are the relevant log portions showing the issue when the kernel attempts to load the driver:

dmesg:
[ 21.059766] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 21.059775] mlx4_core: Initializing 9488:00:02.0
[ 21.191481] mlx4_core 9488:00:02.0: Detected virtual function - running in slave mode
[ 21.191508] mlx4_core 9488:00:02.0: Sending reset
[ 21.191602] mlx4_core 9488:00:02.0: Sending vhcr0
[ 21.193338] mlx4_core 9488:00:02.0: HCA minimum page size:512
[ 21.193804] mlx4_core 9488:00:02.0: Timestamping is not supported in slave mode
[ 93.148028] mlx4_core 9488:00:02.0: communication channel command 0x5 (op=0x31) timed out
[ 93.148031] mlx4_core 9488:00:02.0: device is going to be reset
[ 93.171917] mlx4_core 9488:00:02.0: VF is sending reset request to Firmware
[ 93.172584] mlx4_core 9488:00:02.0: VF Reset succeed
[ 93.172585] mlx4_core 9488:00:02.0: device was reset successfully
[ 93.195311] mlx4_core 9488:00:02.0: NOP command failed to generate MSI-X interrupt IRQ 24)
[ 93.195312] mlx4_core 9488:00:02.0: Trying again without MSI-X
[ 93.196258] mlx4_core 9488:00:02.0: Failed to close slave function
[ 93.196866] mlx4_core: probe of 9488:00:02.0 failed with error -5

----

syslog:

Dec 4 14:35:18 ubuntu kernel: [ 21.059766] mlx4_core: Mellanox ConnectX core driver v4.0-0
Dec 4 14:35:18 ubuntu kernel: [ 21.059775] mlx4_core: Initializing 9488:00:02.0
Dec 4 14:35:18 ubuntu kernel: [ 21.191481] mlx4_core 9488:00:02.0: Detected virtual function - running in slave mode
Dec 4 14:35:18 ubuntu kernel: [ 21.191508] mlx4_core 9488:00:02.0: Sending reset
Dec 4 14:35:18 ubuntu kernel: [ 21.191602] mlx4_core 9488:00:02.0: Sending vhcr0
Dec 4 14:35:18 ubuntu kernel: [ 21.193338] mlx4_core 9488:00:02.0: HCA minimum page size:512
Dec 4 14:35:18 ubuntu kernel: [ 21.193804] mlx4_core 9488:00:02.0: Timestamping is not supported in slave mode
Dec 4 14:35:18 ubuntu kernel: [ 93.148028] mlx4_core 9488:00:02.0: communication channel command 0x5 (op=0x31) timed out
Dec 4 14:35:18 ubuntu kernel: [ 93.148031] mlx4_core 9488:00:02.0: device is going to be reset
Dec 4 14:35:18 ubuntu kernel: [ 93.171917] mlx4_core 9488:00:02.0: VF is sending reset request to Firmware
Dec 4 14:35:18 ubuntu kernel: [ 93.172584] mlx4_core 9488:00:02.0: VF Reset succeed
Dec 4 14:35:18 ubuntu kernel: [ 93.172585] mlx4_core 9488:00:02.0: device was reset successfully
Dec 4 14:35:18 ubuntu kernel: [ 93.195311] mlx4_core 9488:00:02.0: NOP command failed to generate MSI-X interrupt IRQ 24)
Dec 4 14:35:18 ubuntu kernel: [ 93.195312] mlx4_core 9488:00:02.0: Trying again without MSI-X
Dec 4 14:35:18 ubuntu kernel: [ 93.196258] mlx4_core 9488:00:02.0: Failed to close slave function
Dec 4 14:35:18 ubuntu kernel: [ 93.196866] mlx4_co...
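
The decisive line in the log above is the failed probe. As a small sketch for pulling such lines out of a longer dmesg or syslog capture, the hypothetical helper `failed_probes` below prints the driver, the PCI address, and the error code of each failed probe. Error -5 corresponds to EIO on Linux.

```shell
#!/bin/sh
# failed_probes
# Reads kernel log text on stdin and prints "driver device error" for each
# line of the form "<driver>: probe of <device> failed with error <code>".
failed_probes() {
    sed -n 's/.*\] \([a-z0-9_]*\): probe of \([^ ]*\) failed with error \(-[0-9]*\).*/\1 \2 \3/p'
}

# Example:
#   dmesg | failed_probes
```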


Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.18.0-1006.6

---------------
linux-azure (4.18.0-1006.6) cosmic; urgency=medium

  * linux-azure: 4.18.0-1006.6 -proposed tracker (LP: #1805244)

  * Accelerated networking (SR-IOV VF) broken in 18.10 daily (LP: #1794477)
    - [Packaging] Move pci-hyperv and autofs4 back to linux-modules

linux-azure (4.18.0-1005.5) cosmic; urgency=medium

  * linux-azure: 4.18.0-1005.5 -proposed tracker (LP: #1802752)

  * [Hyper-V] Fix IRQ spreading on NVMe devices with lower numbers of channels
    (LP: #1802358)
    - SAUCE: genirq/affinity: Spread IRQs to all available NUMA nodes
    - SAUCE: irq/matrix: Split out the CPU selection code into a helper
    - SAUCE: irq/matrix: Spread managed interrupts on allocation
    - SAUCE: genirq/matrix: Improve target CPU selection for managed interrupts.

  [ Ubuntu: 4.18.0-12.13 ]

  * linux: 4.18.0-12.13 -proposed tracker (LP: #1802743)
  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.
  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
  * CVE-2018-18955: nested user namespaces with more than fiv...

Changed in linux-azure (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Cosmic):
status: New → Invalid
Changed in linux (Ubuntu):
status: Fix Committed → Invalid
Changed in linux-azure (Ubuntu):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Andy Whitcroft (apw) wrote :

This bug was erroneously marked for verification in bionic; verification is not required and verification-needed-bionic is being removed.

tags: added: kernel-fixup-verification-needed-bionic verification-done-bionic
removed: verification-needed-bionic
Brad Figg (brad-figg)
tags: added: cscc