Comment 26 for bug 1874719

Andreas Hasenack (ahasenack) wrote:

I did some tests to see how the cluster behaves when changing the node name and/or id. It all boils down to this: whatever is being changed, one has to be aware that the change is being made to a live, real cluster, even though it's a simple one-node cluster. That's what you get right after the package is installed, a single-node cluster (see the sketch after this list):
- the node name is either "node1" or `uname -n`, depending on the Ubuntu release; in the case of focal, the topic of this bug, it's currently "node1"
- the node id is 1
- ring0_addr is localhost
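
Putting those values together, the stock nodelist should look roughly like this (a sketch reconstructed from the points above, not copied from the package):

nodelist {
  node {
    name: node1
    nodeid: 1
    ring0_addr: 127.0.0.1
  }
}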

The charm makes 3 changes compared to the default config file (see the sketch after this list):
- the node name is set back to localhost
- the node id is 1000 + the application unit number (a Juju thing: for example, postgresql/0 is unit 0 of the application postgresql)
- ring0_addr gets the real IP of the application unit, instead of 127.0.0.1
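
So for, say, postgresql/0, the rendered node stanza would end up as something like this (a sketch based on the list above, not the charm's actual template; the IP is a made-up example):

node {
  name: localhost
  nodeid: 1000
  ring0_addr: 10.0.0.5 # the unit's real IP
}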

Changing the nodeid *AND* the node name essentially creates a new node in the cluster. The old "node1" name and id will remain around, but offline, because no live host is responding as that identity anymore.

If you change just one of the two (node name or id), then the cluster seems to be able to coalesce them together again, and you get a plain rename. I haven't tested this exhaustively, but it seems to be the case: inspecting the current cib.raw XML file on each node, and diffing it against a previous copy, shows the rename.
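
For the record, this is roughly how to do that comparison (standard pacemaker tools; it assumes a copy of the CIB was saved before the change):

crm_node -l # list the nodes the cluster knows about, with their ids
cibadmin --query > /tmp/cib-now.xml
diff /tmp/cib-before.xml /tmp/cib-now.xml # the rename shows up here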

Let's walk through a user story, showing how one could deploy 3 nodes manually from these focal packages.

After installing pacemaker and corosync on all 3 nodes (let's call them f1, f2 and f3), we get:
- f1: node id = 1, node name = node1, cluster name = debian
- f2: node id = 1, node name = node1, cluster name = debian
- f3: node id = 1, node name = node1, cluster name = debian

All with the identical config. These are essentially 3 isolated clusters called "debian", with one node called "node1" in each.
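
That's easy to confirm on each host; both commands below only ever see the local node (outputs paraphrased from memory):

corosync-quorumtool -s # total votes: 1, only the local member
crm_node -l # lists just "1 node1 member"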

The following set of changes will work and not show a phantom "node1" node at the end:
- on f1, adjust corosync.conf with this node list:
nodelist {
  node {
    # name: node1
    nodeid: 1
    ring0_addr: f1 # (or f1's ip)
  }
  node {
    nodeid: 2
    ring0_addr: f2 # (or f2's ip)
  }
  node {
    nodeid: 3
    ring0_addr: f3 # (or f3's ip)
  }
}

Then scp this file to the other nodes, and restart corosync and pacemaker there.
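
Something along these lines (assuming root ssh access between the nodes; adjust to taste):

for h in f2 f3; do
  scp /etc/corosync/corosync.conf root@$h:/etc/corosync/corosync.conf
  ssh root@$h 'systemctl restart corosync pacemaker'
done
systemctl restart corosync pacemaker # f1 needs the restart too, after its own edit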

We kept the nodeid on f1 as 1 and just got rid of its name. That renames the node to `uname -n`, because the id was kept at 1.
The other nodes also got a new name, but their ids changed. Crucially, node id 1 still exists in the cluster (it's f1), so it all works out.

If you were to also change the node id range together with the name, like the charm does, then it's an entirely new node, and you would have to get rid of node1 with a crm or pcs command, or just "crm_node --remove node1".
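
For example:

crm_node --remove node1
crm_node -l # node1 should be gone from the list now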

All in all, it's best to either start with the correct configuration (which the charm does nowadays), or clear everything beforehand (with "pcs cluster destroy", perhaps; see the note after this list). "pcs cluster destroy" is quite comprehensive; it does:
- rm -f /etc/corosync/{corosync.conf,authkey} /etc/pacemaker/authkey /var/lib/pcsd/disaster-recovery
- removes many files from /var/lib/pacemaker (cib, pengine/pe*bz2, hostcache, cts, others)
- stops the services
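
So the minimal clean-slate recipe would be to run this on every node, before laying down the real configuration:

pcs cluster destroy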

One has to be very careful if changing node names and node ids in a live cluster, and a live cluster is what you get right after installing the packages.

I still haven't made up my mind about this focal SRU. I definitely prefer to have the node name default to the hostname (uname -n), but making such a change via an SRU is debatable. We might have to "bite the bullet" and live with this different behavior in focal only :/