An update after using this patch suggested by Alberto: http://paste.ubuntu.com/p/fDbrrsBmdP/
This was applied onto:
dpkg -l maas-region-api | grep ii
ii maas-region-api 2.5.1-7508-gb70537f3d-0ubuntu1~18.04.1 all Region controller API service for MAAS
(2.5.1 as I rebuilt the env from scratch and got the newly released version)
Also, I applied pretty aggressive system-wide TCP keepalive settings so that the sockets MAAS creates to connect to postgres time out much faster (1 + 3 * 1 = 4 seconds), because MAAS does not tune them itself.
# sysctl -a --pattern keepalive
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 1
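For reference, a minimal sketch of how these settings could be persisted across reboots via a sysctl drop-in (the file name is illustrative; the values match the 1 + 3 * 1 = 4 second detection window above):

```shell
# Sketch: persist the aggressive keepalive settings so they survive a reboot.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-tcp-keepalive.conf
net.ipv4.tcp_keepalive_time = 1
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3
EOF
# Reload all sysctl configuration, including the new drop-in.
sudo sysctl --system
```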
The default kernel settings were not adequate (7200 seconds ~ 2 hours before keepalives are even sent): https://paste.ubuntu.com/p/qy38pQHGRr/
# time before failover
root@maas-vhost1:~# date
Tue Feb 26 21:20:28 UTC 2019
https://paste.ubuntu.com/p/mwCMwzY2sT/
I managed to get a 500 for both start actions on the 2 surviving nodes, which led to:
Failed Actions:
* res_maas_region_hostname_start_0 on maas-vhost1 'unknown error' (1): call=86, status=complete, exitreason='',
    last-rc-change='Tue Feb 26 21:20:51 2019', queued=0ms, exec=688ms
* res_maas_region_hostname_start_0 on maas-vhost3 'unknown error' (1): call=127, status=complete, exitreason='',
    last-rc-change='Tue Feb 26 21:20:50 2019', queued=0ms, exec=1030ms
This is expected and I can make a couple of retry attempts in the RA before failing.
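A minimal sketch of what such a retry loop in the resource agent's start action could look like. This assumes a helper `maas_update_hostname` that performs the actual API call and returns non-zero on failure; the helper name, attempt count, and delay are all illustrative, not the actual RA code:

```shell
# Sketch only: retry the start action a few times before reporting failure
# to pacemaker, so a transient 500 from the MAAS API does not immediately
# fail the resource on that node.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

retry_start() {
    attempts=3
    delay=1
    i=1
    while [ "$i" -le "$attempts" ]; do
        if maas_update_hostname; then
            return "$OCF_SUCCESS"
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return "$OCF_ERR_GENERIC"
}
```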
Since I had not tuned failure-timeout, the failover happened only after a few minutes: https://paste.ubuntu.com/p/kJRnYjjKQg/
When I tuned meta failure-timeout to 3 seconds for the DNS resource the failover happened much faster because pacemaker did a stop/start on that resource faster after start failure.
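For illustration, one way to set that meta attribute with crmsh (the resource name `res_maas_dns` is assumed here; substitute the actual DNS resource name from `crm configure show`):

```shell
# Let pacemaker forget a failure after 3 seconds, so it is allowed to
# retry the resource much sooner. Resource name is hypothetical.
crm resource meta res_maas_dns set failure-timeout 3s

# Verify the attribute was applied.
crm resource meta res_maas_dns show failure-timeout
```

Note that failures are actually expired on the cluster-recheck-interval boundary (10s here), so the effective retry delay is bounded by that interval as well.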
So I believe the suggested patch and TCP keepalive tuning combined give a much better result. For example, I started seeing OperationalErrors from psycopg2, which means the keepalive settings did work.
The timeline with the last test I performed with failure-timeout=5: https://paste.ubuntu.com/p/b6Nn3KvgfD/
Considering that there's a 4-second gap for TCP timeouts, a 5-second monitor on res_maas_region_hostname, some time before the new PG VIP and master come up, and some time for MAAS to recover, a ~20-second failover time seems somewhat sane and can be tweaked further.
crm configure show | grep cluster-recheck-interval
cluster-recheck-interval=10s
The good side is that patching MAAS and tweaking keepalive settings worked.
I will upload more logs in the morning to illustrate how the patched version behaved.