Canonical Juju

kubernetes provider: juju removed existing ingress after controller restart

Bug #1884674 reported by Haw Loeung on 2020-06-22

This bug affects 1 person

Affects          Status         Importance   Assigned to       Milestone
Canonical Juju   Fix Released   High         Yang Kelvin Liu   Canonical Juju 2.8.1

Bug Description

Hi,

Earlier today, on controller restart (3 nodes; the jujuds were restarted as close to simultaneously as we could manage), we found that charm-created ingresses for an app were removed. The application uses the k8s-wordpress charm, which manually creates k8s ingresses due to LP:1849725.

| https://git.launchpad.net/charm-k8s-wordpress/tree/src/charm.py#n202

From the controller logs:

machine-0: 21:18:00 DEBUG juju.kubernetes.provider deleting ingress resource for wordpress-k8s
machine-0: 21:18:00 DEBUG juju.kubernetes.provider created/updated ingress resources for "wordpress-k8s".
machine-2: 21:18:02 DEBUG juju.kubernetes.provider deleting ingress resource for wordpress-k8s
machine-2: 21:18:03 DEBUG juju.kubernetes.provider created/updated ingress resources for "wordpress-k8s".

The controller in question has two separate models with applications named wordpress-k8s. Only the ingress for one of them was lost. The k8s cluster is running v1.16.10. There is no persistent storage configured, so we added the cluster to Juju with --skip-storage.

Thomas Cuthbert (tcuthbert) wrote on 2020-06-23:

I believe it was because the ingress wasn't able to find the TLS secret:

2020/06/22 22:48:42 [alert] 31082#31082: *98047390 no ssl_certificate_by_lua* defined in server blog.launchpad.net while loading SSL certificate by lua, client: 91.189.91.49, server: 0.0.0.0:443
2020/06/22 22:48:42 [crit] 31082#31082: *98047390 SSL_do_handshake() failed (SSL: error:1417A179:SSL routines:tls_post_process_client_hello:cert cb error) while loading SSL certificate by lua, client: 91.189.91.49, server: 0.0.0.0:443

This certificate definitely exists and is correct in k8s.

ubuntu@juju-7cf688-prod-is-external-kubernetes-9:~$ kubectl get secrets blog-launchpad-net-tls -n prod-launchpad-blog-k8s
NAME                     TYPE     DATA   AGE
blog-launchpad-net-tls   Opaque   2      11d

https://pastebin.canonical.com/p/fnzVN9Kkhg/

Yang Kelvin Liu (kelvin.liu) wrote on 2020-06-23:

I tried to deploy this wordpress charm following the README, but I got some missing-config errors.

Thank you for pointing out the cause, Thomas.

What happens if you apply the ingress YAML with kubectl?

Tom Haddon (mthaddon) wrote on 2020-06-23:

Our current theory is that this was actually a Kubernetes issue (possibly https://github.com/kubernetes/ingress-nginx/issues/5588) rather than a Juju issue. However, we think there is still a bug here, in the sense that Juju should be enforcing the state of the ingress it expects to exist.

In this case the ingress had been deleted, and an update_status hook which runs every 5 minutes has a section where it configures the ingress (https://git.launchpad.net/charm-k8s-wordpress/tree/src/charm.py#n178). Juju could notice that things had changed in kubernetes and update things.
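The kind of enforcement suggested here could look roughly like the sketch below. It is only an illustration: an in-memory dictionary stands in for the cluster, and `FakeCluster`, `reconcile_ingress` and the spec fields are hypothetical names that do not reflect the actual code in Juju or in the charm.

```python
class FakeCluster:
    """In-memory stand-in for the Kubernetes API (assumption, for illustration only)."""

    def __init__(self):
        self.ingresses = {}

    def get(self, namespace, name):
        # Return the stored spec, or None if the ingress does not exist.
        return self.ingresses.get((namespace, name))

    def apply(self, namespace, name, spec):
        # Create or overwrite the ingress with a copy of the desired spec.
        self.ingresses[(namespace, name)] = dict(spec)

    def delete(self, namespace, name):
        self.ingresses.pop((namespace, name), None)


def reconcile_ingress(cluster, namespace, name, desired):
    """Bring the ingress back to the desired spec and report what was done."""
    current = cluster.get(namespace, name)
    if current is None:
        cluster.apply(namespace, name, desired)
        return "created"
    if current != desired:
        cluster.apply(namespace, name, desired)
        return "updated"
    return "unchanged"


if __name__ == "__main__":
    cluster = FakeCluster()
    # Hypothetical spec fields; the host and secret names are taken from this report.
    desired = {"host": "blog.launchpad.net", "tls_secret": "blog-launchpad-net-tls"}
    print(reconcile_ingress(cluster, "prod-launchpad-blog-k8s", "wordpress-k8s", desired))
```

Run periodically (for example from a hook such as update_status), a check of this shape would recreate an ingress that had been deleted out from under the application.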

Let us know if it makes sense to file that as a separate bug, or reuse this one for that?

Tom Haddon (mthaddon) wrote on 2020-06-23:

So actually it looks like it was Juju; we've confirmed this by looking at the audit logs in Kubernetes: https://pastebin.canonical.com/p/fpjNCY8p8z/ (sorry, Canonical only).

Tom Haddon (mthaddon) wrote on 2020-06-23:

Just to reply to Kelvin's comment, as discussed on IRC, manually applying the ingress yaml restores service, as does updating juju config to retrigger a pod-spec-set. So there is a workaround to this issue to restore service in case of an outage, but the question remains as to why it's happening in the first place.

Thomas Miller (tlmiller) wrote on 2020-06-23:

Hey Tom,

I checked out the pastebin and I am not sure of the origin of the delete request. Those logs suggest that the delete request is being processed by our mutating webhook before being applied to the cluster. At the moment I can't see a reason why Juju would be deleting ingress resources. Are you able to confirm what service is using the admin account in the cluster? Juju should be using its own credentials in the cluster and "should" have a different username.

Paul Collins (pjdc) wrote on 2020-06-23:

Parsing and filtering the k8s audit logging reveals this sequence of events for the ingress that survived, which seems to be the normal case:

2020-06-22T21:18:02.991378Z delete /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:03.313206Z create /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses 201 jujud/v0.0.0 (linux/amd64) kubernetes/$Format

However, for the ingress that did not survive, we see:

2020-06-22T21:18:00.362358Z create /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses 409 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.460626Z delete /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.630608Z get /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.668371Z update /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format

Given the timestamp and the 409 response status, it's clear that the create was received and processed by k8s before the delete. But if the ingress has been deleted, why does the next get return 200? And ditto update?

I've uploaded the raw audit records to https://pastebin.canonical.com/p/rx3TTCVrJr/ (sorry, Canonical-only).
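The filtering described in this comment can be sketched as follows. This is a minimal sketch, assuming the audit records have already been reduced to the one-line form quoted above (timestamp, verb, request path, response code, user agent). The names `parse_line`, `group_by_namespace` and `summarize` are hypothetical and not part of any Juju or Kubernetes tooling.

```python
from collections import defaultdict
from datetime import datetime

# Lines copied from the audit excerpts quoted in this comment.
AUDIT_LINES = """
2020-06-22T21:18:02.991378Z delete /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:03.313206Z create /apis/extensions/v1beta1/namespaces/stg-wordpress-k8s/ingresses 201 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.362358Z create /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses 409 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.460626Z delete /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.630608Z get /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
2020-06-22T21:18:00.668371Z update /apis/extensions/v1beta1/namespaces/prod-launchpad-blog-k8s/ingresses/wordpress-k8s 200 jujud/v0.0.0 (linux/amd64) kubernetes/$Format
""".strip().splitlines()


def parse_line(line):
    """Split one summarized audit line into its fields."""
    ts, verb, path, code, agent = line.split(None, 4)
    parts = path.split("/")
    # The namespace is the path segment that follows "namespaces".
    namespace = parts[parts.index("namespaces") + 1]
    return {
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ"),
        "verb": verb,
        "namespace": namespace,
        "code": int(code),
        "agent": agent,
    }


def group_by_namespace(lines):
    """Group events per namespace, each group sorted by timestamp."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            event = parse_line(line)
            grouped[event["namespace"]].append(event)
    for events in grouped.values():
        events.sort(key=lambda e: e["time"])
    return dict(grouped)


def summarize(lines):
    """Return {namespace: [(verb, response_code), ...]} in time order."""
    return {
        ns: [(e["verb"], e["code"]) for e in events]
        for ns, events in group_by_namespace(lines).items()
    }


if __name__ == "__main__":
    for ns, sequence in summarize(AUDIT_LINES).items():
        print(ns, sequence)
```

Ordering by timestamp within each namespace makes the difference between the two models easy to compare: delete followed by create in one, and a create that returns 409 before the delete in the other.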

Paul Collins (pjdc) on 2020-06-23

summary:

- kubernetes provider: juju removed existing ingresses
+ kubernetes provider: juju removed existing ingress after controller restart

Thomas Miller (tlmiller) on 2020-06-23

Changed in juju:
importance:	Undecided → High
assignee:	nobody → Thomas Miller (tlmiller)
milestone:	none → 2.8.1

Paul Collins (pjdc) on 2020-06-23

description:	updated

Paul Collins (pjdc) wrote on 2020-06-23:

> Are you able to confirm what service is using the admin account in the cluster? Juju should be using its own credentials in the cluster and "should" have a different username.

For various reasons we're still using a single account for all access to each of our k8s clusters.

Paul Collins (pjdc) on 2020-06-24

description:	updated
description:	updated

Yang Kelvin Liu (kelvin.liu) wrote on 2020-06-24:

Hi Paul,

When you run juju add-k8s, Juju creates a new Service Account with the proper RBAC setup, and that is the SA Juju uses from then on.
The SA name has a prefix like `juju-credentia-`.

Paul Collins (pjdc) wrote on 2020-06-24:

#10

> When you run juju add-k8s, Juju creates a new Service Account with the proper RBAC setup, and that is the SA Juju uses from then on. The SA name has a prefix like `juju-credentia-`.

We don't have a user backend set up, though, so to my knowledge everything will end up mapping to the same account. This is what we expect, and why we have a bunch of separate k8s clusters.

Paul Collins (pjdc) on 2020-06-24

description:	updated
description:	updated

Yang Kelvin Liu (kelvin.liu) wrote on 2020-06-24:

#11

So the root cause is that the charm uses the application name as the ingress name, which is the same name Juju uses for the ingress created by juju expose.

Currently, an ingress created by juju expose and one defined in the PodSpec can conflict if both use the application name as the resource name. This is a known issue, tracked at https://bugs.launchpad.net/juju/+bug/1854123

We can treat "applicationName" as a reserved ingress resource name for Juju's use only.
Ingresses defined in the PodSpec should never be named using the application name.

Yang Kelvin Liu (kelvin.liu) on 2020-06-24

Changed in juju:
assignee:	Thomas Miller (tlmiller) → Yang Kelvin Liu (kelvin.liu)
status:	New → Triaged
status:	Triaged → In Progress

Yang Kelvin Liu (kelvin.liu) wrote on 2020-06-24:

#12

Adding validation to raise an error if any ingress resource name uses the application name would be a fix for this bug for now.

We will need to work on a real fix in https://bugs.launchpad.net/juju/+bug/1854123

A workaround would be to rename the ingress and then run upgrade-charm.
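The validation proposed in this comment might look like the sketch below. It is written in Python purely for illustration; the real check lives in Juju itself and may differ in detail. `validate_ingress_names` and the shape of the `ingress_resources` list are assumptions.

```python
def validate_ingress_names(app_name, ingress_resources):
    """Reject any ingress resource that reuses the application name.

    `ingress_resources` is assumed to be a list of dicts, each with a "name" key.
    """
    for resource in ingress_resources:
        if resource.get("name") == app_name:
            raise ValueError(
                f"ingress name {app_name!r} is reserved for the ingress "
                "created by juju expose; use a different name"
            )


if __name__ == "__main__":
    # A renamed ingress passes validation (the new name is a hypothetical example).
    validate_ingress_names("wordpress-k8s", [{"name": "wordpress-k8s-ingress"}])

    # An ingress named after the application is rejected.
    try:
        validate_ingress_names("wordpress-k8s", [{"name": "wordpress-k8s"}])
    except ValueError as err:
        print("rejected:", err)
```

Renaming the charm's ingress so that it no longer matches the application name, as in the first call, is the shape of the workaround described above.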

Yang Kelvin Liu (kelvin.liu) wrote on 2020-06-24:

#13

https://github.com/juju/juju/pull/11748 added validation to prevent any new deployment from using the application name for any ingress.

Ian Booth (wallyworld) on 2020-06-29

Changed in juju:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2020-07-13

Changed in juju:
status:	Fix Committed → Fix Released
