kubernetes provider: juju removed existing ingress after controller restart

Bug #1884674 reported by Haw Loeung
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Yang Kelvin Liu

Bug Description

Hi,

Earlier today, after a controller restart (3 nodes; the jujuds were restarted as close to simultaneously as we could manage), we found that the charm-created ingress for an app had been removed. The application uses the k8s-wordpress charm, which manually creates k8s ingresses due to LP:1849725.

| https://git.launchpad.net/charm-k8s-wordpress/tree/src/charm.py#n202

From the controller logs:

machine-0: 21:18:00 DEBUG juju.kubernetes.provider deleting ingress resource for wordpress-k8s
machine-0: 21:18:00 DEBUG juju.kubernetes.provider created/updated ingress resources for "wordpress-k8s".
machine-2: 21:18:02 DEBUG juju.kubernetes.provider deleting ingress resource for wordpress-k8s
machine-2: 21:18:03 DEBUG juju.kubernetes.provider created/updated ingress resources for "wordpress-k8s".

The controller in question has two separate models with applications named wordpress-k8s. Only the ingress for one of them was lost. The k8s cluster is running v1.16.10. There is no persistent storage configured, so we added the cluster to Juju with --skip-storage.

Thomas Cuthbert (tcuthbert) wrote :

I believe it was because the ingress wasn't able to find the TLS secret:

2020/06/22 22:48:42 [alert] 31082#31082: *98047390 no ssl_certificate_by_lua* defined in server blog.launchpad.net while loading SSL certificate by lua, client: 91.189.91.49, server: 0.0.0.0:443
2020/06/22 22:48:42 [crit] 31082#31082: *98047390 SSL_do_handshake() failed (SSL: error:1417A179:SSL routines:tls_post_process_client_hello:cert cb error) while loading SSL certificate by lua, client: 91.189.91.49, server: 0.0.0.0:443

This certificate definitely exists and is correct in k8s.

ubuntu@juju-7cf688-prod-is-external-kubernetes-9:~$ kubectl get secrets blog-launchpad-net-tls -n prod-launchpad-blog-k8s
NAME TYPE DATA AGE
blog-launchpad-net-tls Opaque 2 11d

https://pastebin.canonical.com/p/fnzVN9Kkhg/

Yang Kelvin Liu (kelvin.liu) wrote :

I was trying to deploy this wordpress charm following the README, but I got some missing-config errors.

Thank you for pointing out the cause, Thomas.

What happens if kubectl applies the ingress yaml?

Tom Haddon (mthaddon) wrote :

Our current theory is that this was actually a Kubernetes issue (possibly https://github.com/kubernetes/ingress-nginx/issues/5588) rather than a Juju issue. However, we think there is still a bug here, in the sense that Juju should be enforcing the state of the ingress it expects to exist.

In this case the ingress had been deleted, and an update_status hook which runs every 5 minutes has a section where it configures the ingress (https://git.launchpad.net/charm-k8s-wordpress/tree/src/charm.py#n178). Juju could notice that things had changed in kubernetes and update things.

Let us know if it makes sense to file that as a separate bug, or reuse this one for that?
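
To illustrate the "enforce the expected state" idea from the comment above, here is a minimal sketch of a reconcile step. It is not Juju or charm code: the in-memory `cluster` dict stands in for the Kubernetes API, and `reconcile_ingress` and the ingress name used in the demo are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the cluster: ingress name -> spec.
Cluster = Dict[str, dict]


def reconcile_ingress(cluster: Cluster, name: str, desired: dict) -> str:
    """Bring the named ingress in line with the desired spec.

    Returns "created", "updated", or "unchanged" so the caller can log it.
    """
    observed: Optional[dict] = cluster.get(name)
    if observed is None:
        # The ingress is missing (for example, deleted out of band): recreate it.
        cluster[name] = dict(desired)
        return "created"
    if observed != desired:
        # The ingress exists but has drifted from what we expect: overwrite it.
        cluster[name] = dict(desired)
        return "updated"
    return "unchanged"


if __name__ == "__main__":
    desired = {"host": "blog.launchpad.net", "tls_secret": "blog-launchpad-net-tls"}
    cluster: Cluster = {}  # the ingress has disappeared
    print(reconcile_ingress(cluster, "wordpress-k8s-ingress", desired))
    print(reconcile_ingress(cluster, "wordpress-k8s-ingress", desired))
```

Run periodically (for example from a hook), a check of this shape would restore a missing ingress on the next pass instead of waiting for a manual config change.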

Tom Haddon (mthaddon) wrote :

So actually it looks like it was Juju; we've confirmed this by looking at the audit logs in Kubernetes: https://pastebin.canonical.com/p/fpjNCY8p8z/ (sorry, Canonical only).

Tom Haddon (mthaddon) wrote :

Just to reply to Kelvin's comment, as discussed on IRC, manually applying the ingress yaml restores service, as does updating juju config to retrigger a pod-spec-set. So there is a workaround to this issue to restore service in case of an outage, but the question remains as to why it's happening in the first place.

Thomas Miller (tlmiller) wrote :

Hey Tom,

I checked out the pastebin and I am not sure of the origin of the delete request. Those logs suggest that the delete request is being processed by our mutating webhook before being applied to the cluster. At the moment I can't see a reason why Juju would be deleting ingress resources. Are you able to confirm what service is using the admin account in the cluster? Juju should be using its own credentials in the cluster and "should" have a different username.

Paul Collins (pjdc) wrote :

Parsing and filtering the k8s audit logging reveals this sequence of events for the ingress that survived, which seems to be the normal case:

2020-06-22T21:18:02.991378Z delete /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:03.313206Z create /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses 201 jujud/v0.0.0 (linux/amd64) kubernetes/$Format

However, for the ingress that did not survive, we see:
2020-06-22T21:18:00.362358Z create /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses 409 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.460626Z delete /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.630608Z get /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.668371Z update /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format

Given the timestamp and the 409 response status, it's clear that the create was received and processed by k8s before the delete. But if the ingress has been deleted, why does the next get return 200? And ditto update?

I've uploaded the raw audit records to https://pastebin.canonical.com/p/rx3TTCVrJr/ (sorry, Canonical-only).
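
The filtering described above can be sketched roughly as follows. This is not the script that was actually used: it only re-parses the summary lines quoted in this comment (with the user-agent column trimmed for brevity), groups them by namespace, and orders them by timestamp. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Summary lines copied from the comment above, user-agent column trimmed.
AUDIT_LINES = """\
2020-06-22T21:18:02.991378Z delete /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses/wordpress-k8s 200
2020-06-22T21:18:03.313206Z create /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses 201
2020-06-22T21:18:00.362358Z create /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses 409
2020-06-22T21:18:00.460626Z delete /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200
2020-06-22T21:18:00.630608Z get /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200
2020-06-22T21:18:00.668371Z update /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200
""".splitlines()


def parse(line):
    """Split one summary line into (namespace, timestamp, verb, status)."""
    ts, verb, path, status = line.split()[:4]
    parts = path.split("/")
    namespace = parts[parts.index("namespaces") + 1]
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return namespace, when, verb, int(status)


def sequences(lines):
    """Return {namespace: [(verb, status), ...]} ordered by timestamp."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        namespace, when, verb, status = parse(line)
        grouped[namespace].append((when, verb, status))
    return {
        ns: [(verb, status) for _, verb, status in sorted(events)]
        for ns, events in grouped.items()
    }


if __name__ == "__main__":
    for ns, events in sequences(AUDIT_LINES).items():
        print(ns, events)
```

Laid out this way, the difference is easy to see: the surviving ingress shows a delete followed by a create, while the lost one shows a create that hits 409 before the delete, followed by a get and an update.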

Paul Collins (pjdc)
summary: - kubernetes provider: juju removed existing ingresses
+ kubernetes provider: juju removed existing ingress after controller
+ restart
Thomas Miller (tlmiller)
Changed in juju:
importance: Undecided → High
assignee: nobody → Thomas Miller (tlmiller)
milestone: none → 2.8.1
Paul Collins (pjdc)
description: updated
Paul Collins (pjdc) wrote :

> Are you able to confirm what service is using the admin account in the cluster? Juju should be using its own credentials in the cluster and "should" have a different username.

For various reasons we're still using a single account for all access to each of our k8s clusters.

Paul Collins (pjdc)
description: updated
description: updated
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Paul,

When you run juju add-k8s, Juju creates a new Service Account with the proper RBAC setup, and that will be the SA used by Juju later.
The SA name has a prefix like `juju-credentia-`.

Paul Collins (pjdc) wrote :

> When you run juju add-k8s, Juju creates a new Service Account with the proper RBAC setup, and that will be the SA used by Juju later. The SA name has a prefix like `juju-credentia-`.

We don't have a user backend set up, though, so to my knowledge everything will end up mapping to the same account. This is what we expect, and why we have a bunch of separate k8s clusters.

Paul Collins (pjdc)
description: updated
description: updated
Yang Kelvin Liu (kelvin.liu) wrote :

So the root cause is that the charm uses the application name as the ingress name, which is the same name Juju uses for the ingress created by juju expose.

Currently, an ingress created by juju expose and one defined in the PodSpec can conflict if both use the application name as the resource name. This is a known issue, tracked at https://bugs.launchpad.net/juju/+bug/1854123

We can treat the application name as an ingress resource name reserved for Juju's use only.
Ingresses defined in the PodSpec should never be named using the application name.

Changed in juju:
assignee: Thomas Miller (tlmiller) → Yang Kelvin Liu (kelvin.liu)
status: New → Triaged
status: Triaged → In Progress
Yang Kelvin Liu (kelvin.liu) wrote :

Adding validation to raise an error if any ingress resource name uses the application name would be a fix for this bug for now.

We will need to work on a real fix in https://bugs.launchpad.net/juju/+bug/1854123

A workaround would be to rename the ingress and then run upgrade-charm.
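
As a rough sketch of the rename workaround, a charm could derive the ingress name from the application name plus a suffix, so the two can never be equal. The helper names, the `-ingress` suffix, the port, and the exact dictionary layout below are assumptions for illustration; they are not taken from the k8s-wordpress charm, and the field layout should be checked against the pod spec format the charm actually targets.

```python
def ingress_name(app_name: str, suffix: str = "ingress") -> str:
    """Return an ingress name that can never equal the bare application name."""
    return f"{app_name}-{suffix}"


def ingress_resource(app_name: str, hostname: str, tls_secret: str) -> dict:
    """Build one ingress entry (extensions/v1beta1-style spec, layout assumed)."""
    return {
        "name": ingress_name(app_name),
        "spec": {
            "rules": [
                {
                    "host": hostname,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                # Port 80 is an assumption for this example.
                                "backend": {"serviceName": app_name, "servicePort": 80},
                            }
                        ]
                    },
                }
            ],
            "tls": [{"hosts": [hostname], "secretName": tls_secret}],
        },
    }


if __name__ == "__main__":
    resource = ingress_resource(
        "wordpress-k8s", "blog.launchpad.net", "blog-launchpad-net-tls"
    )
    print(resource["name"])
```

With a name like this, the charm's ingress no longer shares a resource name with the one Juju manages for juju expose.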

Yang Kelvin Liu (kelvin.liu) wrote :

https://github.com/juju/juju/pull/11748 added a validation to prevent any new deployment from using the application name for any ingresses.
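
For readers who want the shape of such a check, here is a minimal sketch in Python. It is not the code from the pull request (Juju itself is written in Go); `validate_ingress_names` and `ReservedIngressNameError` are hypothetical names, and the check shown is only the rule stated in this thread: an ingress name must not equal the application name.

```python
from typing import Iterable


class ReservedIngressNameError(ValueError):
    """Raised when an ingress uses the name reserved for juju expose."""


def validate_ingress_names(app_name: str, ingress_names: Iterable[str]) -> None:
    """Reject any ingress whose name equals the application name."""
    for name in ingress_names:
        if name == app_name:
            raise ReservedIngressNameError(
                f"ingress name {name!r} is reserved for juju expose; "
                "choose a different name"
            )


if __name__ == "__main__":
    # Passes silently: the name differs from the application name.
    validate_ingress_names("wordpress-k8s", ["wordpress-k8s-ingress"])
    try:
        validate_ingress_names("wordpress-k8s", ["wordpress-k8s"])
    except ReservedIngressNameError as err:
        print(f"rejected: {err}")
```

Failing at validation time turns a silent delete-and-recreate conflict into an explicit error the charm author sees at deploy time.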

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released