cert requests not handled when the original leader vault is not available

Bug #1836348 reported by Yoshi Kadokawa
This bug affects 1 person
Affects      Status        Importance  Assigned to  Milestone
vault-charm  Fix Released  High        Cory Johns

Bug Description

After the leader unit of vault becomes unavailable or is removed for any reason,
a unit added to kubernetes-master gets stuck in the "Waiting for master components to start" status.

The steps to reproduce are as follows.

1. Deploy CDK with Vault HA
I used this bundle.
2. Remove the leader unit of vault, or take it down
$ juju run -a vault is-leader
- Stdout: |
    False
  UnitId: vault/0
- Stdout: |
    True
  UnitId: vault/1
- Stdout: |
    False
  UnitId: vault/2
$ juju remove-unit --force vault/1
3. Add a unit to kubernetes-master
$ juju add-unit kubernetes-master
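
Step 2 depends on picking the leader out of the `juju run -a vault is-leader` output. A minimal Python sketch of that lookup follows. It assumes the output has exactly the layout shown above (a `Stdout` block followed by `UnitId`), and `find_leader` is a hypothetical helper name, not part of juju or any charm.

```python
from typing import Optional


def find_leader(juju_run_output: str) -> Optional[str]:
    """Return the UnitId whose is-leader Stdout was True, or None."""
    last_stdout = None
    for raw in juju_run_output.splitlines():
        line = raw.strip()
        if line.startswith("- "):
            # A new list entry starts; forget the previous unit's value.
            last_stdout = None
            line = line[2:].strip()
        if line.startswith("Stdout:"):
            continue  # the value is on the following indented line
        if line in ("True", "False"):
            last_stdout = line
        elif line.startswith("UnitId:") and last_stdout == "True":
            return line.split(":", 1)[1].strip()
    return None


# Output copied from the reproduction steps above.
SAMPLE = """\
- Stdout: |
    False
  UnitId: vault/0
- Stdout: |
    True
  UnitId: vault/1
- Stdout: |
    False
  UnitId: vault/2
"""

print(find_leader(SAMPLE))  # prints vault/1
```

The returned unit name is the one to pass to `juju remove-unit --force`.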

After a while, the added kubernetes-master unit gets stuck in the "Waiting for master components to start" status.
Since the "tls_client.certs.saved" flag is not set and no certificates are found under /root/cdk/,
I believe the update_certs() function [0] somehow fails to retrieve the certificates from Vault when the original leader unit is gone.

$ juju run -a kubernetes-master -- "charms.reactive -p get_flags | grep tls_client.certs.saved"
- Stdout: |2
     'tls_client.certs.saved',
  UnitId: kubernetes-master/0
- ReturnCode: 1
  Stdout: ""
  UnitId: kubernetes-master/1
$ juju run -a kubernetes-master -- sudo ls -al /root/cdk
- Stdout: |
    total 52
    drwxrwx--- 4 root root 4096 Jul 12 09:22 .
    drwx------ 7 root root 4096 Jul 12 09:29 ..
    drwxr-xr-x 2 root root 4096 Jul 12 09:23 audit
    -rw-r--r-- 1 root root 61 Jul 12 09:05 basic_auth.csv
    -r--r----- 1 root root 1245 Jul 12 09:22 ca.crt
    -rw-r--r-- 1 root root 1406 Jul 12 09:22 client.crt
    -rw-r--r-- 1 root root 1678 Jul 12 09:22 client.key
    drwxr-xr-x 2 root root 4096 Jul 12 09:22 etcd
    -rw-r--r-- 1 root root 385 Jul 12 09:10 known_tokens.csv
    -rw------- 1 root root 2014 Jul 12 09:33 kubeproxyconfig
    -rw-r--r-- 1 root root 1670 Jul 12 09:22 server.crt
    -rw-r--r-- 1 root root 1674 Jul 12 09:22 server.key
    -rw------- 1 root root 1675 Jul 12 09:05 serviceaccount.key
  UnitId: kubernetes-master/0
- Stdout: |
    total 20
    drwxr-xr-x 2 root root 4096 Jul 12 09:40 .
    drwx------ 6 root root 4096 Jul 12 09:46 ..
    -rw-r--r-- 1 root root 60 Jul 12 10:11 basic_auth.csv
    -rw-r--r-- 1 root root 385 Jul 12 10:11 known_tokens.csv
    -rw-r--r-- 1 root root 1675 Jul 12 10:11 serviceaccount.key
  UnitId: kubernetes-master/1

[0] https://github.com/juju-solutions/layer-tls-client/blob/master/reactive/tls_client.py#L93
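
The flag check above can be automated across units. The following Python sketch lists the units whose `get_flags` output does not contain `tls_client.certs.saved`. It assumes the `juju run` output layout shown above, with `UnitId` as the last key of each entry; `units_missing_flag` is a hypothetical helper name.

```python
from typing import List

FLAG = "tls_client.certs.saved"


def units_missing_flag(juju_run_output: str, flag: str = FLAG) -> List[str]:
    """List the units whose entry in the output does not mention the flag."""
    missing = []
    seen_flag = False
    for raw in juju_run_output.splitlines():
        line = raw.strip()
        if line.startswith("- "):
            # A new list entry starts; reset the state for the next unit.
            seen_flag = False
            line = line[2:].strip()
        if flag in line:
            seen_flag = True
        elif line.startswith("UnitId:") and not seen_flag:
            missing.append(line.split(":", 1)[1].strip())
    return missing


# Output copied from the get_flags command above.
SAMPLE = """\
- Stdout: |2
     'tls_client.certs.saved',
  UnitId: kubernetes-master/0
- ReturnCode: 1
  Stdout: ""
  UnitId: kubernetes-master/1
"""

print(units_missing_flag(SAMPLE))  # prints ['kubernetes-master/1']
```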

Yoshi Kadokawa (yoshikadokawa) wrote :

Here is the bundle that I have used.

Yoshi Kadokawa (yoshikadokawa) wrote :

Subscribing this to field-critical, since it is the last item blocking completion of a project.

Yoshi Kadokawa (yoshikadokawa) wrote :

To unseal the vault cluster, I followed exactly the steps described here:
https://ubuntu.com/kubernetes/docs/using-vault

I could also reproduce this issue with totally-unsecure-auto-unlock=true.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
importance: Undecided → Critical
Cory Johns (johnsca)
Changed in vault-charm:
assignee: nobody → Cory Johns (johnsca)
status: New → In Progress
Cory Johns (johnsca) wrote :

This is actually being caused by a bug in the vault charm. When leadership changes, the flag indicating that the CA has been configured doesn't get updated on the new leader.

Until a fix is available in the vault charm, you can recover from this bad state by running:

juju run --unit vault/2 -- 'charms.reactive set_flag charm.vault.ca.ready ; hooks/update-status'

(assuming vault/2 is the new leader)
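
Since the workaround has to target whichever unit is currently the leader, here is a small Python sketch that builds the command for a given unit. Only the command string comes from this comment; `workaround_command` and its unit-name check are hypothetical helpers added for illustration.

```python
import shlex
from typing import List

# The remote command from the workaround above, passed to juju as one argument.
WORKAROUND = "charms.reactive set_flag charm.vault.ca.ready ; hooks/update-status"


def workaround_command(leader_unit: str) -> List[str]:
    """Build the argv for the workaround, targeting the given unit."""
    if "/" not in leader_unit:
        raise ValueError(f"expected a unit name such as vault/2, got {leader_unit!r}")
    return ["juju", "run", "--unit", leader_unit, "--", WORKAROUND]


# Print the command for review instead of executing it.
argv = workaround_command("vault/2")
print(shlex.join(argv))
```

For vault/2 the printed line matches the command above. `shlex.join` needs Python 3.8 or later. Passing `argv` to `subprocess.run` would execute the command, which only makes sense on a machine with a juju client connected to the affected model.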

Cory Johns (johnsca) wrote :
no longer affects: charm-kubernetes-master
summary: - add-unit kubernetes-master will stuck in "Waiting for master components
- to start" when the original leader vault is not available
+ cert requests not handled when the original leader vault is not
+ available
Yoshi Kadokawa (yoshikadokawa) wrote :

Thank you for the workaround.
I can confirm that setting the charm.vault.ca.ready flag mitigates the issue.

Ryan Beisner (1chb1n) wrote :

FYI, the fix has landed in charm-vault @ master, and I've cherry-picked it to the stable/19.04 branch as a backport. Track status at:

https://review.opendev.org/#/q/topic:bug/1836348+(status:open+OR+status:merged)

Ryan Beisner (1chb1n)
Changed in vault-charm:
status: In Progress → Fix Committed
milestone: none → 19.07
importance: Undecided → High
status: Fix Committed → Fix Released