[2.2 beta3] Node failed to be deployed, because of the following error: {"gateway_link_ipv4": ["Static IP Address instance with id 248066 does not exist."]}

Bug #1671891 reported by Larry Michel
This bug affects 2 people
Affects  Status        Importance  Assigned to    Milestone
MAAS     Fix Released  High        Mike Pontillo
2.1      Fix Released  High        Unassigned

Bug Description

After calling unlink_subnet and link_subnet through the API to set the link on eth1 to Auto Assign, and then doing the same to go back to unconfigured, the nodes can no longer be deployed because MAAS can't find the eth0 interface.

When I try to deploy the nodes, the deployment fails with this error: "Node failed to be deployed, because of the following error: {"gateway_link_ipv4": ["Static IP Address instance with id 248066 does not exist."]}"

This looks to be related to bug 1671651 (those systems are not getting past Allocated when deployed by Juju) and bug 1659607 (changes for fixing unlink/link subnet). For bug 1671651, hayward-08, hayward-20, hayward-40 and hayward-17 are hitting this after I restarted maas-regiond to enable the Debug flag, and the other 4 systems deploy OK.

I am attaching new logs after enabling debug. These should include my latest attempt to deploy, around 17:25 UTC.

ubuntu@maas2-integration-daily:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-=================================================
ii maas 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

I'll also attach node details for hayward-08 system shortly.

Larry Michel (lmic) wrote :

hayward-08 machine read info.

summary: - [2.2 beta2] Node failed to be deployed, because of the following error:
+ [2.2 beta3] Node failed to be deployed, because of the following error:
{"gateway_link_ipv4": ["Static IP Address instance with id 248066 does
not exist."]}
tags: added: cdo-qa-blocker
Changed in maas:
milestone: none → 2.2.0
Mike Pontillo (mpontillo) wrote :

I've been looking at this issue for some time now, and have not been able to convince MAAS to leave an incorrect link in the gateway_link_* fields (which is apparently what causes this issue). If a subnet link is completely deleted, the gateway_link_* fields on the machine should be automatically cleared.

I see that there may be some related bugs here (for example, if you link a subnet, set it to be the gateway, and then change the link so that it no longer specifies a subnet - such as to be LINK_UP on that VLAN - then we might store a gateway that cannot be used). But I wasn't able to see the exact problem you saw.

One theory I have is that the database contains stale entries from an earlier version. Can you try the following workaround:

sudo maas-region dbshell
UPDATE maasserver_node
    SET gateway_link_ipv4_id=NULL
    WHERE gateway_link_ipv4_id NOT IN (
        SELECT id from maasserver_staticipaddress);

Mike Pontillo (mpontillo) wrote :

Below is some more background on this one.

In `models/node.py` we have the following code:

    # Default IPv4 subnet link on an interface for this node. This is used to
    # define the default IPv4 route the node should use.
    gateway_link_ipv4 = ForeignKey(
        StaticIPAddress, default=None, blank=True, null=True,
        editable=False, related_name='+', on_delete=SET_NULL)

    # Default IPv6 subnet link on an interface for this node. This is used to
    # define the default IPv6 route the node should use.
    gateway_link_ipv6 = ForeignKey(
        StaticIPAddress, default=None, blank=True, null=True,
        editable=False, related_name='+', on_delete=SET_NULL)

The on_delete=SET_NULL means that there should never be a gateway_link_* field containing an IP address link if that IP address link has been removed.
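As a rough illustration of the behaviour SET_NULL is expected to give, here is a minimal sketch using SQLite instead of the MAAS database. The table layout, the hostname 'node-a' and the reuse of id 248066 as sample data are assumptions for the example only. It is also only an analogy: here the database enforces the rule through a foreign key constraint, whereas Django applies on_delete in the ORM when an object is deleted through it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE staticipaddress (id INTEGER PRIMARY KEY, ip TEXT);
    CREATE TABLE node (
        id INTEGER PRIMARY KEY,
        hostname TEXT,
        -- Hypothetical stand-in for the gateway_link_ipv4 foreign key.
        gateway_link_ipv4_id INTEGER
            REFERENCES staticipaddress(id) ON DELETE SET NULL
    );
    INSERT INTO staticipaddress (id, ip) VALUES (248066, NULL);
    INSERT INTO node (id, hostname, gateway_link_ipv4_id)
        VALUES (1, 'node-a', 248066);
""")

# Deleting the referenced address should clear the gateway link
# rather than leave a dangling id behind.
conn.execute("DELETE FROM staticipaddress WHERE id = 248066")
gateway_link = conn.execute(
    "SELECT gateway_link_ipv4_id FROM node WHERE id = 1"
).fetchone()[0]
print(gateway_link)
```

If the link is cleared as expected, a node can never point at a deleted address, which is why a dangling id in gateway_link_ipv4 is surprising.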

This bug is incomplete until we can determine how the gateway_link_* fields have a value referencing a deleted StaticIPAddress.

The SQL in comment #3 will clean up the stale IP address link. If you can run the SQL and then determine the steps to reproduce the issue, please reopen this bug.

Changed in maas:
status: New → Incomplete
Mike Pontillo (mpontillo) wrote :

If you want to see which nodes this query would affect, run:

SELECT hostname, gateway_link_ipv4_id FROM maasserver_node
    WHERE gateway_link_ipv4_id NOT IN (
        SELECT id from maasserver_staticipaddress);
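To try the preview and cleanup queries safely before touching a real database, the following sketch runs them against an in-memory SQLite database. The two tables are cut down to the columns the queries use, and the hostnames and the id 1001 are made-up sample data; only the table and column names come from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE maasserver_staticipaddress (id INTEGER PRIMARY KEY);
    CREATE TABLE maasserver_node (
        id INTEGER PRIMARY KEY,
        hostname TEXT,
        gateway_link_ipv4_id INTEGER
    );
    INSERT INTO maasserver_staticipaddress (id) VALUES (1001);
    -- node-a points at a missing address, node-b at an existing one,
    -- node-c has no gateway link at all.
    INSERT INTO maasserver_node (hostname, gateway_link_ipv4_id) VALUES
        ('node-a', 248066),
        ('node-b', 1001),
        ('node-c', NULL);
""")

STALE_QUERY = """
    SELECT hostname, gateway_link_ipv4_id FROM maasserver_node
    WHERE gateway_link_ipv4_id NOT IN (
        SELECT id FROM maasserver_staticipaddress)
"""

# Preview which nodes the cleanup would touch.
stale_before = conn.execute(STALE_QUERY).fetchall()

# Apply the workaround: clear only the dangling links.
conn.execute("""
    UPDATE maasserver_node
    SET gateway_link_ipv4_id = NULL
    WHERE gateway_link_ipv4_id NOT IN (
        SELECT id FROM maasserver_staticipaddress)
""")

stale_after = conn.execute(STALE_QUERY).fetchall()
all_nodes = conn.execute(
    "SELECT hostname, gateway_link_ipv4_id FROM maasserver_node "
    "ORDER BY hostname"
).fetchall()
print(stale_before, stale_after, all_nodes)
```

Nodes with a valid link or with no link at all are left alone, because a NULL gateway_link_ipv4_id never satisfies NOT IN.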

Larry Michel (lmic) wrote :

Data from triage:

ubuntu@maas2-integration-daily:~$ maas jason machine deploy 4y3hfg
{"gateway_link_ipv4": ["Static IP Address instance with id 250123 does not exist."]}
ubuntu@maas2-integration-daily:~$ maas jason interfaces read 4y3hfg | jq '.[] | {id:.id, name:.name, mac:.mac_address, vid:.vlan.vid, fabric:.vlan.fabric, links:.links[] |{id: .id, ip:.ip_address, mode:.mode, subnet:.subnet.cidr}}' --compact-output
{"id":490,"name":"rename8","mac":"00:22:99:e0:06:04","vid":0,"fabric":"fabric-0","links":{"id":249934,"ip":null,"mode":"link_up","subnet":null}}
{"id":488,"name":"rename6","mac":"00:22:99:e0:06:02","vid":0,"fabric":"fabric-0","links":{"id":249935,"ip":null,"mode":"link_up","subnet":null}}
{"id":486,"name":"rename4","mac":"00:22:99:e0:06:00","vid":0,"fabric":"fabric-0","links":{"id":249936,"ip":null,"mode":"link_up","subnet":null}}
{"id":489,"name":"rename7","mac":"00:22:99:e0:06:03","vid":0,"fabric":"fabric-0","links":{"id":249937,"ip":null,"mode":"link_up","subnet":null}}
{"id":487,"name":"rename5","mac":"00:22:99:e0:06:01","vid":0,"fabric":"fabric-0","links":{"id":249938,"ip":null,"mode":"link_up","subnet":null}}
{"id":491,"name":"rename9","mac":"00:22:99:e0:06:05","vid":0,"fabric":"fabric-0","links":{"id":249939,"ip":null,"mode":"link_up","subnet":null}}
{"id":493,"name":"eth0","mac":"00:22:99:e0:06:07","vid":0,"fabric":"fabric-0","links":{"id":250123,"ip":null,"mode":"auto","subnet":"10.244.192.0/18"}}
{"id":492,"name":"eth1","mac":"00:22:99:e0:06:06","vid":0,"fabric":"fabric-2","links":{"id":249940,"ip":null,"mode":"link_up","subnet":null}}

Mike Pontillo (mpontillo) wrote :

Thanks for the triage, Larry.

This bug seems to occur only when an automatically assigned IP address is set as the gateway link. The issue is that during deployment, MAAS allocates a new IP address and replaces the old one. This invalidates the gateway link, but the gateway link isn't properly cleared out by Django. (If it were, we never would have noticed this, AND the node wouldn't get the correct gateway!)

The simplest fix is to avoid replacing the address; instead delete the newly-allocated address and "steal" its IP address for the existing entry.
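The difference between the two approaches can be sketched with a toy in-memory model. This is not the MAAS code: the Store, StaticIP and Node classes, the function names and the address 10.244.192.10 are all hypothetical, and the sketch only shows why keeping the existing entry keeps the gateway link valid.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional


@dataclass
class StaticIP:
    id: int
    ip: Optional[str]


@dataclass
class Node:
    hostname: str
    gateway_link_ipv4_id: Optional[int] = None


class Store:
    """Toy stand-in for the static IP address table."""

    def __init__(self) -> None:
        self.ips: Dict[int, StaticIP] = {}
        self._next_id = count(1)

    def create(self, ip: Optional[str]) -> StaticIP:
        row = StaticIP(next(self._next_id), ip)
        self.ips[row.id] = row
        return row


def replace_address(store: Store, old: StaticIP, new_ip: str) -> StaticIP:
    """Problem case: the new entry replaces the old one, so the old id vanishes."""
    new = store.create(new_ip)
    del store.ips[old.id]
    return new


def steal_address(store: Store, old: StaticIP, new_ip: str) -> StaticIP:
    """Proposed fix: copy the new IP onto the old entry, drop the new entry."""
    new = store.create(new_ip)
    old.ip = new.ip
    del store.ips[new.id]
    return old


# Replacing leaves the node pointing at an id that no longer exists.
buggy_store = Store()
buggy_link = buggy_store.create(ip=None)
buggy_node = Node("node-a", gateway_link_ipv4_id=buggy_link.id)
replace_address(buggy_store, buggy_link, "10.244.192.10")
dangling = buggy_node.gateway_link_ipv4_id not in buggy_store.ips

# Stealing keeps the id, so the gateway link stays valid.
fixed_store = Store()
fixed_link = fixed_store.create(ip=None)
fixed_node = Node("node-a", gateway_link_ipv4_id=fixed_link.id)
steal_address(fixed_store, fixed_link, "10.244.192.10")
intact = fixed_node.gateway_link_ipv4_id in fixed_store.ips
print(dangling, intact)
```

Because the existing entry keeps its id, nothing that references it needs to be updated.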

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
status: Triaged → In Progress
assignee: nobody → Mike Pontillo (mpontillo)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released