Comment 19 for bug 1817484

Dmitrii Shcherbakov (dmitriis) wrote:

An update after applying the patch suggested by Alberto: http://paste.ubuntu.com/p/fDbrrsBmdP/

The patch was applied on top of:

dpkg -l maas-region-api | grep ii
ii maas-region-api 2.5.1-7508-gb70537f3d-0ubuntu1~18.04.1 all Region controller API service for MAAS

(2.5.1, since I rebuilt the environment from scratch and got the newly released version)

I also applied fairly aggressive system-wide TCP keepalive settings so that the sockets MAAS creates to connect to PostgreSQL time out much faster (1 + 3 * 1 = 4 seconds), since MAAS does not tune keepalives on those sockets itself.

# sysctl -a --pattern keepalive
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 1

The default kernel settings were not adequate (7200 seconds, i.e. ~2 hours, before keepalives are even sent):
https://paste.ubuntu.com/p/qy38pQHGRr/
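For reference, a minimal sketch of making the keepalive values persistent across reboots (the drop-in file name is just an example):

# /etc/sysctl.d/90-pg-keepalive.conf
net.ipv4.tcp_keepalive_time = 1
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3

# apply without rebooting
sysctl --system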

# time before failover
root@maas-vhost1:~# date
Tue Feb 26 21:20:28 UTC 2019

https://paste.ubuntu.com/p/mwCMwzY2sT/

I managed to get a 500 for the start actions on both surviving nodes, which led to:

Failed Actions:
* res_maas_region_hostname_start_0 on maas-vhost1 'unknown error' (1): call=86, status=complete, exitreason='',
    last-rc-change='Tue Feb 26 21:20:51 2019', queued=0ms, exec=688ms
* res_maas_region_hostname_start_0 on maas-vhost3 'unknown error' (1): call=127, status=complete, exitreason='',
    last-rc-change='Tue Feb 26 21:20:50 2019', queued=0ms, exec=1030ms

This is expected, and I can add a couple of retry attempts in the RA before failing.
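A rough sketch of what such a retry in the RA's start action could look like (the function name, service name and retry/sleep values below are illustrative, not the actual resource agent code):

maas_region_start() {
    # Illustrative only: retry the start a few times before reporting failure.
    local i
    for i in 1 2 3; do
        if systemctl start maas-regiond; then
            return $OCF_SUCCESS
        fi
        sleep 5
    done
    return $OCF_ERR_GENERIC
}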

Since I had not tuned failure-timeout, the failover only happened after a few minutes:
https://paste.ubuntu.com/p/kJRnYjjKQg/

When I tuned the failure-timeout meta attribute to 3 seconds for the DNS resource, the failover happened much faster because Pacemaker did a stop/start on that resource sooner after the start failure.
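For completeness, setting that meta attribute with crmsh looks roughly like this (the DNS resource name below is a placeholder, not necessarily what this cluster uses):

crm resource meta res_maas_dns_hostname set failure-timeout 3s
crm resource meta res_maas_dns_hostname show failure-timeout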

So I believe the suggested patch and the TCP keepalive tuning combined give a much better result. For example, I started seeing OperationalErrors from psycopg2, which means the keepalive settings did work.
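As an aside: since psycopg2 goes through libpq, a similar effect could in principle be achieved per connection via libpq's keepalive parameters instead of system-wide sysctls. A hypothetical example (host, database and user are placeholders):

psql "host=10.30.19.1 dbname=maasdb user=maas keepalives=1 keepalives_idle=1 keepalives_interval=1 keepalives_count=3"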

The timeline of the last test I performed, with failure-timeout=5:
https://paste.ubuntu.com/p/b6Nn3KvgfD/

Considering that there is a 4-second window for TCP timeouts, a 5-second monitor interval on res_maas_region_hostname, some time before the new PG VIP and master come up, and some time for MAAS to recover, a ~20-second failover time seems reasonably sane and can be tweaked further.

crm configure show | grep cluster-recheck-interval
 cluster-recheck-interval=10s
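For reference, this property can be adjusted with crmsh roughly like this:

crm configure property cluster-recheck-interval=10s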

The good news is that patching MAAS and tweaking the keepalive settings worked.

I will upload more logs in the morning to illustrate how the patched version behaved.