Comment 20 for bug 1874719

Robie Basak (racb) wrote:

James said:

> Setting the pacemaker distro task back to new - it seems very odd that a system designed to manage a cluster of servers would install on every node with a non-unique node id, which is a change in behaviour from older versions of the same software.

In Jammy, I think this is still the case? corosync.conf still ships with a default nodeid of 1; it's just the node name that's no longer supplied.

My understanding of normal corosync use on Ubuntu is that corosync.conf is essentially always replaced wholesale after the package is installed. I believe the hacluster charm does this too.

So am I right that the issue is that corosync started briefly before being configured by the charm, leaving state behind? In that case, I think the charm was possibly buggy in two ways:

1) It should use policy-rc.d to prevent the corosync daemon from starting before corosync.conf is written out, or perhaps write the file out in advance. It looks like Billy's commit already fixed this in the charm. FWIW, I find it surprising that charms don't routinely override service startup with policy-rc.d and start services manually.

2) After rewriting the corosync configuration, it should clear out the corosync state files entirely before restarting the daemon. This is no longer strictly necessary given the other fix. (Both fixes are sketched below.)
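
For illustration, here's roughly what I mean, as a minimal sketch rather than a claim about what the charm actually does. The /usr/sbin/policy-rc.d path and the 101 exit status are standard invoke-rc.d behaviour; the /var/lib/corosync state directory and the write_real_corosync_conf callback are assumptions on my part, for illustration only:

#!/usr/bin/env python3
"""Illustrative sketch only -- not the actual charm code."""
import os
import shutil
import subprocess

POLICY_RC_D = "/usr/sbin/policy-rc.d"
# Assumption: corosync keeps its ring/cluster state under /var/lib/corosync.
COROSYNC_STATE_DIR = "/var/lib/corosync"


def deny_service_starts():
    """Fix (1): forbid daemon starts while the package is being installed.

    invoke-rc.d consults /usr/sbin/policy-rc.d; exit status 101 means
    "action forbidden by policy", so corosync never runs against the
    placeholder corosync.conf.
    """
    with open(POLICY_RC_D, "w") as f:
        f.write("#!/bin/sh\nexit 101\n")
    os.chmod(POLICY_RC_D, 0o755)


def allow_service_starts():
    """Remove the policy override once the real configuration is in place."""
    if os.path.exists(POLICY_RC_D):
        os.remove(POLICY_RC_D)


def clear_corosync_state():
    """Fix (2): drop any state accumulated under the default configuration."""
    if os.path.isdir(COROSYNC_STATE_DIR):
        for entry in os.listdir(COROSYNC_STATE_DIR):
            path = os.path.join(COROSYNC_STATE_DIR, entry)
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)


def reconfigure(write_real_corosync_conf):
    """write_real_corosync_conf is a hypothetical charm-provided callback
    that writes out the real cluster configuration."""
    subprocess.run(["systemctl", "stop", "corosync"], check=False)
    write_real_corosync_conf()
    clear_corosync_state()
    allow_service_starts()
    subprocess.run(["systemctl", "start", "corosync"], check=True)

The key point is the ordering: if no corosync process ever runs against the placeholder configuration, there is no state to clean up, and the explicit cleanup step is just belt-and-braces for anything that can't guarantee that.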

Both of those fixes apply to anything configuring corosync on Ubuntu, not just the charm. So it's not clear to me that there's a bug in the corosync packaging in Ubuntu in Focal at all: we merely ship a default single-node cluster configuration that isn't useful until it is replaced correctly.

From an SRU perspective, I have further concerns for existing users.

1) It's a conffile change. Since corosync.conf is almost always modified by users, they are going to be prompted on upgrade if the upgrade is interactive. This is a little alarming and not useful. Is there any realistic case where existing users would be running with the default configuration file? Note also that since the issue is with state, changing the configuration file wouldn't avoid the issue for existing users anyway.

2) Changing the node name on an existing cluster seems dangerous to me.

For the SRU, then, what problem are we actually solving? The charm is fixed and no longer affected. Are we trying to avoid leaving dirty state behind when users follow the broken installation flow of starting corosync with the default configuration and then changing it? If so, it seems to me that the proposed fix only happens to work by chance in that case. The real fix is to make sure that the state is properly cleaned, and I'm not sure how to do that from packaging except by trying to guide users away from the broken installation flow.

Therefore I'm soft-declining this SRU for now, but further discussion is welcome if you disagree, and I'll take another look.