MAAS deploys fail if host has NIC w/ random MAC

Bug #1936972 reported by dann frazier
This bug affects 4 people
Affects      Status     Importance  Assigned to  Milestone
MAAS         Triaged    Medium      Unassigned
3.3          Won't Fix  Medium      Unassigned
3.4          Won't Fix  Medium      Unassigned
cloud-init   Expired    Undecided   Unassigned
curtin       New        Undecided   Unassigned

Bug Description

The Nvidia DGX A100 server includes a USB Redfish Host Interface NIC. This NIC apparently provides no MAC address of its own, so the driver generates a random MAC for it:

./drivers/net/usb/cdc_ether.c:

static int usbnet_cdc_zte_bind(struct usbnet *dev, struct usb_interface *intf)
{
        int status = usbnet_cdc_bind(dev, intf);

        if (!status && (dev->net->dev_addr[0] & 0x02))
                eth_hw_addr_random(dev->net);

        return status;
}

This causes a problem with MAAS because, during deployment, MAAS sees this as a normal NIC and records the MAC. The post-install reboot then fails:

[ 43.652573] cloud-init[3761]: init.apply_network_config(bring_up=not args.local)
[ 43.700516] cloud-init[3761]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 735, in apply_network_config
[ 43.724496] cloud-init[3761]: self.distro.networking.wait_for_physdevs(netcfg)
[ 43.740509] cloud-init[3761]: File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 177, in wait_for_physdevs
[ 43.764523] cloud-init[3761]: raise RuntimeError(msg)
[ 43.780511] cloud-init[3761]: RuntimeError: Not all expected physical devices present: {'fe:b8:63:69:9f:71'}

I'm not sure what the best answer for MAAS is here, but here's some thoughts:

1) Ignore all Redfish system interfaces. These are a connection between the host and the BMC, so they don't really have a use-case in the MAAS model AFAICT. These devices can be identified using the SMBIOS as described in the Redfish Host Interface Specification, section 8:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP0270_1.3.0.pdf
Which can be read from within Linux using dmidecode.

2) Ignore (or specially handle) all NICs with randomly generated MAC addresses. While this is the only time I've seen a random MAC on production server hardware, it is something I've seen on e.g. ARM development boards. Problem is, I don't know how to detect a generated MAC. I'd hoped the permanent MAC (ethtool -P) would be NULL, but it seems to also be set to the generated MAC :(

fyi, 2 workarounds for this that seem to work:
 1) Delete the NIC from the MAAS model in the MAAS UI after every commissioning.
 2) Use a tag's kernel_opts field to modprobe.blacklist the driver used for the Redfish NIC.

Revision history for this message
Taihsiang Ho (tai271828) wrote :

It seems Ampere Mt. Jade Platform is also impacted by this issue.

Its dmidecode output (which unfortunately does not seem to provide much useful information):

ubuntu@howzit:~$ sudo dmidecode -t bios -t 42; sudo dmidecode -t 42 -u
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: Ampere(R)
        Version: 1.6.20210526 (SCP: 1.06.20210526)
        Release Date: 2021/05/26
        ROM Size: 7680 kB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                Boot from CD is supported
                Selectable boot is supported
                ACPI is supported
                UEFI is supported
        BIOS Revision: 5.15
        Firmware Revision: 1.6

Handle 0x0029, DMI type 13, 22 bytes
BIOS Language Information
        Language Description Format: Long
        Installable Languages: 1
                en|US|iso8859-1
        Currently Installed Language: en|US|iso8859-1

Handle 0x0055, DMI type 42, 17 bytes
Management Controller Host Interface
        Host Interface Type: OEM

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x0055, DMI type 42, 17 bytes
        Header and Data:
                2A 11 55 00 F0 04 FF 00 00 00 01 02 04 FF FF FF
                FF
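
For readers who want to decode the raw type 42 bytes above, here is a minimal sketch. It assumes the SMBIOS 3.2+ layout of the Management Controller Host Interface structure (interface type at offset 4, then a length-prefixed data block, then a count of length-prefixed protocol records) and the type codes 0x40 (Network Host Interface), 0xF0 (OEM) and 0x04 (Redfish over IP). The layout and constants are my reading of the specs, and the function names are hypothetical, so verify against the specs before relying on this.

```python
from dataclasses import dataclass

# Assumed constants (check against the SMBIOS and Redfish Host Interface specs).
INTERFACE_TYPE_NETWORK = 0x40    # Network Host Interface
INTERFACE_TYPE_OEM = 0xF0        # OEM-defined
PROTOCOL_REDFISH_OVER_IP = 0x04  # Redfish over IP


@dataclass
class HostInterface:
    interface_type: int
    interface_data: bytes
    protocol_types: list


def parse_type42(raw: bytes) -> HostInterface:
    """Parse the formatted area of one SMBIOS type 42 structure."""
    if len(raw) < 6 or raw[0] != 42:
        raise ValueError("not a type 42 structure")
    length = raw[1]
    if length > len(raw):
        raise ValueError("structure is truncated")
    raw = raw[:length]

    interface_type = raw[4]
    data_len = raw[5]
    pos = 6 + data_len
    interface_data = raw[6:pos]

    # Protocol records: a count byte, then (type, length, data) per record.
    protocol_types = []
    if pos < len(raw):
        count = raw[pos]
        pos += 1
        for _ in range(count):
            if pos + 2 > len(raw):
                break
            protocol_type, protocol_len = raw[pos], raw[pos + 1]
            protocol_types.append(protocol_type)
            pos += 2 + protocol_len
    return HostInterface(interface_type, interface_data, protocol_types)


def is_redfish_host_interface(hi: HostInterface) -> bool:
    """True for a network host interface that advertises Redfish over IP."""
    return (hi.interface_type == INTERFACE_TYPE_NETWORK
            and PROTOCOL_REDFISH_OVER_IP in hi.protocol_types)


# The hex dump from the dmidecode -u output above.
dump = "2A 11 55 00 F0 04 FF 00 00 00 01 02 04 FF FF FF FF"
record = parse_type42(bytes.fromhex(dump))
print(hex(record.interface_type), record.protocol_types,
      is_redfish_host_interface(record))
```

Under these assumptions the record decodes as an OEM interface type with a single non-Redfish protocol record, which is consistent with dmidecode reporting "Host Interface Type: OEM" and with the observation that this table does not identify the NIC as a Redfish interface on this platform.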

Revision history for this message
James Falcon (falcojr) wrote :

From cloud-init's perspective, this is working as expected. If we're presented a network device through the metadata and then can't find it on the system, we intentionally complain about it.

We can of course add logic to ignore or make special cases for specific devices, but until we decide the best course of action, I'm going to mark this as incomplete for cloud-init.
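
To illustrate the behavior described above, here is a simplified sketch of this kind of check. It is not cloud-init's actual code (the real wait_for_physdevs also waits for devices to settle before giving up); the function name and the second MAC address are made up for illustration.

```python
def check_expected_physdevs(expected_macs, present_macs):
    """Raise if any MAC named in the network config is absent from the system."""
    expected = {mac.lower() for mac in expected_macs}
    present = {mac.lower() for mac in present_macs}
    missing = expected - present
    if missing:
        raise RuntimeError(
            "Not all expected physical devices present: %s" % missing
        )


# MAC recorded at commissioning time (taken from the log in the description).
recorded = ["fe:b8:63:69:9f:71"]
# After the reboot the driver has generated a different random MAC
# (hypothetical value), so the recorded one can no longer be found.
after_reboot = ["3a:1c:5d:00:11:22"]

try:
    check_expected_physdevs(recorded, after_reboot)
except RuntimeError as err:
    print(err)
```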

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

Generated MAC addresses should have the U/L bit set (2nd-least-significant bit of the first octet):

x2‑xx‑xx‑xx‑xx‑xx
x6‑xx‑xx‑xx‑xx‑xx
xA‑xx‑xx‑xx‑xx‑xx
xE‑xx‑xx‑xx‑xx‑xx

MAAS could filter these devices when not configured.
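
A minimal sketch of the bit test described above, assuming MAC addresses written as colon- or hyphen-separated hex octets. The function name is hypothetical and this is not MAAS code. The QEMU-style address is included because, as discussed later in this thread, virtual NICs commonly carry this bit too, so the test alone cannot distinguish a randomly generated MAC from a deliberately assigned one.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the U/L bit (0x02 of the first octet) is set."""
    first_octet = int(mac.strip()[:2], 16)
    return bool(first_octet & 0x02)


for mac in (
    "fe:b8:63:69:9f:71",  # MAC from the cloud-init error in the description
    "52:54:00:12:34:56",  # address with the 52:54:00 prefix used by QEMU
    "00:11:22:33:44:55",  # made-up address with the U/L bit clear
):
    print(mac, is_locally_administered(mac))
```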

Related to: https://bugs.launchpad.net/maas/+bug/1931735

Revision history for this message
Bill Wear (billwear) wrote :

Marking as invalid because this sounds like a feature request for MAAS, not a bug, i.e., MAAS is working as designed. You can (a) file a feature request in the "Features" category, or (b) help me understand why this should be considered an actual error in the MAAS code.

Changed in maas:
status: New → Invalid
Revision history for this message
dann frazier (dannf) wrote :

Thanks Bill. I don't happen to know the criteria MAAS uses to put things into bug vs. feature request buckets, so I don't feel like I have the rules for making the (b) argument either way. What we know is that MAAS (and/or its dependencies) assumes that all NICs have static MAC addresses. We're now seeing that assumption be proven incorrect, and it's causing MAAS to not support industry standard servers from large HW vendors that otherwise work fine with Ubuntu.

To be clear, I'm not asking that MAAS learn to model these devices. Maybe there's a reason it could/should, but that certainly seems like a new feature. Rather, I'm asking that MAAS not fail deployments due to their presence. Ignoring these easily-detected (Comment #3) devices during commissioning seems like a reasonable fix IMO.

Revision history for this message
dann frazier (dannf) wrote :

I attempted to fix this by ignoring NICs with the U/L bit set:
  https://code.launchpad.net/~dannf/maas/+git/maas/+merge/409001
As noted there, QEMU uses a prefix with this bit set, and that may be true of other virtualization as well, so I don't think we can key on the MAC. I think instead we may want to specifically try to detect Redfish NICs and ignore them.

Revision history for this message
dann frazier (dannf) wrote :

Assuming it is a reasonable solution to hide NICs that can be determined to be RedFish NICs, where is the appropriate place to do this? Should the machine-resources binary be expanded to provide the necessary information, e.g. via a new record?

"resources": {
  "management_controller": [
    {
      "type": "redfish",
      "interface_type": "network_host_interface",
      "device_type": "usb_network_interface",
      "usb_network_interface": {
        "vendor_id": "1234",
        "product_id": "5678",
        "serial_number_type": "string",
        "serial_number": "abc123",
      [...]

Presumably then, update_interfaces() could introspect the data, figure out which interface it maps to, and omit it from the model.
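
A rough sketch of that filtering step, under stated assumptions: the "management_controller" record follows the shape proposed above (it does not exist in machine-resources today), each network interface entry is assumed to carry "vendor_id" and "product_id" keys, and both function names are hypothetical. This is not MAAS's update_interfaces(), only an illustration of the matching idea.

```python
def redfish_usb_ids(resources):
    """Collect (vendor_id, product_id) pairs of Redfish USB host interfaces."""
    ids = set()
    for ctrl in resources.get("management_controller", []):
        if ctrl.get("type") != "redfish":
            continue
        if ctrl.get("device_type") != "usb_network_interface":
            continue
        usb = ctrl.get("usb_network_interface", {})
        vendor, product = usb.get("vendor_id"), usb.get("product_id")
        # Only record complete pairs so that interfaces lacking USB IDs
        # are never matched by accident.
        if vendor and product:
            ids.add((vendor, product))
    return ids


def drop_redfish_interfaces(resources, interfaces):
    """Return the interfaces that do not match a Redfish USB host interface."""
    ids = redfish_usb_ids(resources)
    return [
        iface for iface in interfaces
        if (iface.get("vendor_id"), iface.get("product_id")) not in ids
    ]


resources = {
    "management_controller": [
        {
            "type": "redfish",
            "interface_type": "network_host_interface",
            "device_type": "usb_network_interface",
            "usb_network_interface": {"vendor_id": "1234", "product_id": "5678"},
        }
    ]
}
interfaces = [
    {"name": "eno1", "vendor_id": "abcd", "product_id": "0001"},
    {"name": "usb0", "vendor_id": "1234", "product_id": "5678"},
    {"name": "eno2"},  # no USB IDs at all; must be kept
]
print([iface["name"] for iface in drop_redfish_interfaces(resources, interfaces)])
```

Matching on USB vendor and product IDs rather than on the MAC sidesteps the problem that the MAC changes on every boot.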

Revision history for this message
Björn Tillenius (bjornt) wrote :

I don't think this is a feature request. Ignoring the NIC in MAAS might be reasonable, although it's odd that the NIC doesn't have a MAC of its own. Is that a hardware feature, or is it the driver that doesn't surface the physical MAC?

Also, could you please provide the current output from the machine-resources binary for that machine?

Changed in maas:
status: Invalid → Incomplete
Revision history for this message
dann frazier (dannf) wrote :

I don't know for sure if the absent MAC is by design - you're right, it could be missing support in the driver. I'll see if I can find someone who knows the firmware's intent.

machine-resources output attached.

Revision history for this message
Alberto Donato (ack) wrote :

I agree that in MAAS we should skip devices without a MAC.

Changed in maas:
status: Incomplete → Triaged
Revision history for this message
dann frazier (dannf) wrote : Re: [Bug 1936972] Re: MAAS deploys fail if host has NIC w/ random MAC

On Fri, Oct 22, 2021 at 3:35 AM Alberto Donato
<email address hidden> wrote:
>
> I agree that in MAAS we should skip devices without a MAC.

I'm not sure if there is a reliable way to detect a NIC without a MAC.
If the driver doesn't find a "burned in" MAC, it will generate a
random MAC for the device on every boot. The only thing we'll know for
sure about the random MAC is that it will have the "local assignment
bit" set. But we can't skip a device just because it has that bit set
in the MAC, because that bit is also set in MACs used by QEMU
instances, and likely other forms of virtualization. So my best
suggestion is to tackle the specific subset of devices with random
MACs that are causing real problems today - the RedFish NICs, which
can be easily identified from the SMBIOS table.

 -dann

Changed in maas:
importance: Undecided → High
assignee: nobody → Björn Tillenius (bjornt)
Changed in maas:
assignee: Björn Tillenius (bjornt) → nobody
milestone: none → 3.2.0
Alberto Donato (ack)
Changed in maas:
assignee: nobody → Alberto Donato (ack)
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Workarounds exist, and there is no obvious way to deal with random MAC addresses in a general way.

Changed in maas:
assignee: Alberto Donato (ack) → nobody
importance: High → Medium
milestone: 3.2.0 → 3.3.0
Revision history for this message
Alberto Donato (ack) wrote :

For 3.3, we should investigate dropping the use of match/set-name for netplan config, since interface names are now stable. In that case, we wouldn't need to create a config stanza for interfaces that are unconfigured.
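
A sketch of the difference this would make, written as Python that builds a netplan-style dictionary. The `match`, `macaddress` and `set-name` keys are real netplan keys; the interface records, the function name and the assumption that a stanza is currently emitted for every interface are illustrative only and not taken from MAAS code.

```python
def build_netplan(interfaces, use_match=True):
    """Build a netplan v2 'ethernets' config from simple interface records."""
    ethernets = {}
    for iface in interfaces:
        settings = {}
        if iface.get("dhcp4"):
            settings["dhcp4"] = True
        if iface.get("addresses"):
            settings["addresses"] = list(iface["addresses"])

        if use_match:
            # Every interface gets a stanza that pins its name to its MAC,
            # even when it has no configuration of its own.
            stanza = {
                "match": {"macaddress": iface["mac"]},
                "set-name": iface["name"],
            }
            stanza.update(settings)
        elif settings:
            # Keyed by name only; the MAC never appears in the config.
            stanza = settings
        else:
            # Unconfigured interface: nothing to emit.
            continue
        ethernets[iface["name"]] = stanza
    return {"network": {"version": 2, "ethernets": ethernets}}


interfaces = [
    {"name": "eno1", "mac": "00:11:22:33:44:55", "dhcp4": True},
    {"name": "usb0", "mac": "fe:b8:63:69:9f:71"},  # unconfigured Redfish NIC
]
print(build_netplan(interfaces, use_match=True))
print(build_netplan(interfaces, use_match=False))
```

Without match/set-name, the unconfigured NIC with the random MAC drops out of the generated config, so nothing would be left waiting for a MAC address that no longer exists.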

Revision history for this message
dann frazier (dannf) wrote :

Ignoring NICs that the firmware tells you are RedFish controllers (and therefore not wired up to any routable network) seems like a reasonable way to tackle the known/biggest source of this problem.

Regarding stable interface names, I don't think we've reached that panacea quite yet. Predictable names can change based on OS series - and even in HWE kernels within a single series. See bugs 1940860 and 1945225 for examples.

Changed in maas:
milestone: 3.3.0 → 3.4.0
Revision history for this message
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Incomplete → Expired
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Revision history for this message
Rod Smith (rodsmith) wrote :

I've run into a similar problem: If a node was commissioned with a kernel that detects a particular network device, but the system is then deployed with a kernel that does not detect that device, the deployment fails with the same message as noted in the original bug report. I've encountered this most recently with a Redfish device on blubi (an Ericsson AB CRU 0201 server), but ISTR running into something similar with a "real" NIC in the past. I understand curtin complaining about this, but when the result is that the post-install reboot, and therefore the entire deployment, fails, it can be unclear to the user why the deployment failed -- MAAS logs show a successful installation followed by no output from the node. I had to use the system's remote KVM and watch the entire deployment to figure it out. I worked around it by deleting the Redfish device, which isn't important for us or MAAS; but if it were a real device, that solution would leave the device unconfigured, even when the node was deployed using a kernel that could use it. A more robust solution would enable the deployment to succeed, even if the device was inactive.

Revision history for this message
Zachary Mance (zmance) wrote :

Is there any plan to create a workaround or fix for this issue?

Changed in maas:
milestone: 3.4.x → 3.5.x
no longer affects: maas/3.3