MAAS 3.0 fails to initialize regiond when an IPoIB device is present

Bug #1939456 reported by Fifi
This bug affects 4 people
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Adam Collard

Bug Description

I decided to give MaaS 3.0 a spin the other day and followed the docs to run "maas init region+rack". The script said all was ok, but I couldn't open the web UI on port 5240. That was no surprise, as there was no process listening on port 5240.

Searching the logs it appeared regiond had some trouble starting:

2021-08-07 12:14:01 maasserver.start_up: [error] Database error during start-up
Traceback (most recent call last):
  File "/snap/maas/15003/usr/lib/python3/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type macaddr: "a0:00:03:00:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4f:06:e2"
LINE 1: ...ss" IN ('44:a8:42:ba:a3:b4', '10:98:36:99:7d:9e', 'a0:00:03:...
                                                             ^

The (long) MAC address belongs to an InfiniBand interface created by the IPoIB module. The error comes from PostgreSQL refusing to store an address that does not conform to type 'macaddr'. But shouldn't regiond be able to handle this error gracefully and continue, for example by ignoring the problematic interface altogether?
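
To illustrate the mismatch: PostgreSQL's 'macaddr' type holds a 6-octet address, while the address in the log above has 20 octets. The following is a minimal sketch, not MAAS code; the helper names octet_count and fits_macaddr are hypothetical, and it only handles the colon-separated notation seen in the traceback.

```python
import re

# Six colon-separated hex octets: the notation used in the log above for
# addresses that fit PostgreSQL's 6-octet macaddr type.
_MACADDR_RE = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$")


def octet_count(hw_address: str) -> int:
    """Return the number of colon-separated octets in a hardware address."""
    return len(hw_address.split(":"))


def fits_macaddr(hw_address: str) -> bool:
    """Return True if the address is a plain 6-octet, colon-separated MAC."""
    return _MACADDR_RE.match(hw_address) is not None


# Both addresses are taken from the traceback in this report.
ethernet = "44:a8:42:ba:a3:b4"
ipoib = "a0:00:03:00:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4f:06:e2"

print(octet_count(ethernet), fits_macaddr(ethernet))  # 6 True
print(octet_count(ipoib), fits_macaddr(ipoib))        # 20 False
```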

To get regiond started, I had to remove the IPoIB driver.

Re-enabling IPoIB afterwards isn't an option: even though regiond had started ok, it still had trouble running the 50-maas-01-commissioning script for much the same reason, and things went south again.

A full trace of the regiond start up problem is attached.

I am new to MaaS but if guided I can provide any additional information required as well as test fixes.

Thanks,
-K.

Tags: trivial

Bill Wear (billwear) wrote :

Can you get the output of sudo /snap/maas/current/usr/share/maas/machine-resources/amd64? It would help.

Changed in maas:
status: New → Incomplete
Fifi (koukou73gr) wrote :

Sure, I am attaching output with and without IPoIB loaded, although I assume you are not interested in the latter.

Björn Tillenius (bjornt) wrote :

Ok, thanks. So we can either look at the "protocol", or simply look at the length of the MAC. Ignoring it would be simple, but it would be good to at least show that there's an InfiniBand interface there.

tags: added: trivial
Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
milestone: none → next
Kilian Schnelle (kischnelle) wrote :

Hi,

Is there a solution to this problem?
I'm having the same issue when trying to install 3.1 region+rack on a node with an InfiniBand interface.

Jerzy Husakowski (jhusakowski) wrote :

This issue is targeted after MAAS 3.2

no longer affects: maas/3.2
no longer affects: maas/3.3
Changed in maas:
milestone: next → 3.3.0
summary: - MaaS 3.0 fails to initialize regiond when an IPoIB device is present
+ MAAS 3.0 fails to initialize regiond when an IPoIB device is present
Jerzy Husakowski (jhusakowski) wrote :

We will attempt a fix in 3.2, based on unit-test validation as we don't have access to IB HW at the moment. If the issue reoccurs, we will reopen this ticket.

Changed in maas:
milestone: 3.3.0 → 3.2.0
Changed in maas:
assignee: nobody → Adam Collard (adam-collard)
Changed in maas:
status: Triaged → Fix Committed
Adam Collard (adam-collard) wrote :

The fix that landed here is to "simply" skip over IP over Infiniband like devices which have extra-long MACs, both at controller start up, and during refresh.
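
For readers who want a picture of what skipping by MAC length can look like, here is a minimal sketch. It is not the code that landed in MAAS: the function split_by_mac_length, the constant ETHERNET_OCTETS, and the interface names are hypothetical, and the only assumption taken from this report is that extra-long addresses mark IPoIB-like devices. Returning the skipped interfaces separately leaves room to still report that they exist.

```python
from typing import Dict, Tuple

# Assumption for this sketch: anything longer than a 6-octet Ethernet MAC
# is treated as an IPoIB-like device and skipped.
ETHERNET_OCTETS = 6


def split_by_mac_length(
    interfaces: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a name -> MAC mapping into (kept, skipped) by octet count."""
    kept: Dict[str, str] = {}
    skipped: Dict[str, str] = {}
    for name, mac in interfaces.items():
        if len(mac.split(":")) > ETHERNET_OCTETS:
            skipped[name] = mac  # extra-long address, e.g. IPoIB
        else:
            kept[name] = mac
    return kept, skipped


# Interface names are made up; the addresses come from the traceback above.
interfaces = {
    "eth0": "44:a8:42:ba:a3:b4",
    "eth1": "10:98:36:99:7d:9e",
    "ib0": "a0:00:03:00:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4f:06:e2",
}

kept, skipped = split_by_mac_length(interfaces)
print(sorted(kept))     # ['eth0', 'eth1']
print(sorted(skipped))  # ['ib0']
```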

Changed in maas:
milestone: 3.2.0 → 3.2.0-beta5
status: Fix Committed → Fix Released