stuck k8s workload unit following upgrade-charm with new image

Bug #1878329 reported by Paul Collins
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Yang Kelvin Liu

Bug Description

Following a "juju upgrade-charm --path /local/mycharm --resource mycharm_image=image_url mycharm" to test a newer image for a charm I'm working on, I've ended up with a stuck unit that is blocking subsequent charm upgrades, although all the changes on the k8s side have been executed correctly.

[agnew(~)] snap info juju | grep ^installed:
installed: 2.8-rc1 (11992) 134MB classic
[agnew(~)] juju version
2.8-rc1-focal-amd64
[agnew(~)] juju debug-log --no-tail
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "migration-inactive-flag" manifold worker started at 2020-05-13 03:18:16.620862773 +0000 UTC
application-mattermost: 15:18:16 INFO juju.worker.migrationminion migration phase is now: NONE
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "charm-dir" manifold worker started at 2020-05-13 03:18:16.631889653 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.logger initial log config: "<root>=DEBUG"
application-mattermost: 15:18:16 INFO juju.worker.logger logger worker started
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "logging-config-updater" manifold worker started at 2020-05-13 03:18:16.632448582 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "api-address-updater" manifold worker started at 2020-05-13 03:18:16.633285603 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "proxy-config-updater" manifold worker started at 2020-05-13 03:18:16.633441178 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.logger reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING"
application-mattermost: 15:32:59 ERROR juju.worker.uniter resolver loop error: executing operation "remote init": caas-unit-init for unit "mattermost/0" failed: ERROR failed to remove unit tools dir /var/lib/juju/tools/unit-mattermost-0: unlinkat /var/lib/juju/tools/unit-mattermost-0/goal-state: permission denied
ERROR failed to remove unit tools dir /var/lib/juju/tools/unit-mattermost-0: unlinkat /var/lib/juju/tools/unit-mattermost-0/goal-state: permission denied
: command terminated with exit code 1
[agnew(~)] juju status --format yaml
model:
  name: mattermost
  type: caas
  controller: beta
  cloud: k8s
  region: localhost
  version: 2.8-rc1
  model-status:
    current: available
    since: 13 May 2020 15:17:30+12:00
  sla: unsupported
machines: {}
applications:
  mattermost:
    charm: local:kubernetes/mattermost-3
    series: kubernetes
    os: kubernetes
    charm-origin: local
    charm-name: mattermost
    charm-rev: 3
    charm-version: 89783cd-dirty
    scale: 1
    provider-id: 4fb6cece-d45d-43a0-907e-f7a08d71bdae
    address: 10.152.183.25
    exposed: false
    application-status:
      current: active
      since: 13 May 2020 15:46:04+12:00
    units:
      mattermost/0:
        workload-status:
          current: terminated
          message: unit stopped by the cloud
          since: 13 May 2020 15:33:04+12:00
        juju-status:
          current: idle
          since: 13 May 2020 15:33:04+12:00
        upgrading-from: local:kubernetes/mattermost-1
        open-ports:
        - 8000/TCP
        address: 10.1.1.24
        provider-id: fef2495d-2654-421e-8d6d-9909d7b62feb
      mattermost/1:
        workload-status:
          current: waiting
          message: Not a leader
          since: 13 May 2020 15:33:02+12:00
        juju-status:
          current: idle
          since: 13 May 2020 15:38:51+12:00
        leader: true
        upgrading-from: local:kubernetes/mattermost-1
        open-ports:
        - 8000/TCP
        address: 10.1.1.25
        provider-id: 4fc05039-ee2c-4a05-9ad4-546fb0425c97
storage: {}
controller:
  timestamp: 15:48:51+12:00
[agnew(~)] kubectl get all -n mattermost
NAME                              READY   STATUS    RESTARTS   AGE
pod/mattermost-6bfd44b4f6-xnpml   1/1     Running   4          15m
pod/mattermost-operator-0         1/1     Running   0          30m

NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/mattermost            ClusterIP   10.152.183.25    <none>        8000/TCP    30m
service/mattermost-operator   ClusterIP   10.152.183.252   <none>        30666/TCP   30m

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mattermost   1/1     1            1           30m

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/mattermost-6bfd44b4f6   1         1         1       15m

NAME                                   READY   AGE
statefulset.apps/mattermost-operator   1/1     30m
[agnew(~)] _
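
As a diagnostic sketch only (pod name and namespace taken from the output above, and assuming the workload image ships a shell with ls and that the tools path is mounted in the workload container), the ownership of the directory that the remote init step failed to remove can be inspected directly:

  # Check who owns the unit tools dir that unlinkat reported "permission denied" on.
  kubectl exec -n mattermost mattermost-6bfd44b4f6-xnpml -- \
      ls -lan /var/lib/juju/tools /var/lib/juju/tools/unit-mattermost-0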

Tags: k8s
Paul Collins (pjdc)
summary: - stuck k8s workload unit following upgrade-charm with new imagea
+ stuck k8s workload unit following upgrade-charm with new image
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.8-rc2
importance: Undecided → High
status: New → Triaged
Revision history for this message
Ian Booth (wallyworld) wrote :

The root cause here is the issue identified in bug 1877935.
Juju needs to take into account the pod's phase - whether it is still running init containers - before deciding how to interact with it.
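
For illustration only (not the fix itself), the pod phase and init-container state being referred to can be observed with kubectl against the pod from the report above, assuming the pod defines init containers:

  # Pod phase: Pending while init containers are still running, Running afterwards.
  kubectl get pod mattermost-6bfd44b4f6-xnpml -n mattermost -o jsonpath='{.status.phase}{"\n"}'
  # Per-init-container readiness; exec'ing into the pod before these complete is the race in question.
  kubectl get pod mattermost-6bfd44b4f6-xnpml -n mattermost \
      -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{" ready="}{.ready}{"\n"}{end}'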

tags: added: k8s
Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Paul Collins (pjdc) wrote :

This seems to still happen with 2.8-rc2:

[agnew(charm-mattermost-k8s)] juju deploy ./ --resource mattermost_image='localhost:32000/mattermost:v4' mattermost
Deploying charm "local:kubernetes/mattermost-0".
[agnew(charm-mattermost-k8s)] juju status
Model                Controller  Cloud/Region   Version  SLA          Timestamp
betabeta-mattermost  betabeta    k8s/localhost  2.8-rc2  unsupported  13:23:05+12:00

App         Version  Status   Scale  Charm       Store  Rev  OS          Address  Notes
mattermost           waiting  0/1    mattermost  local  0    kubernetes           agent initializing

Unit          Workload  Agent       Address  Ports  Message
mattermost/0  waiting   allocating                  agent initializing

[agnew(charm-mattermost-k8s)] juju status
Model                Controller  Cloud/Region   Version  SLA          Timestamp
betabeta-mattermost  betabeta    k8s/localhost  2.8-rc2  unsupported  13:24:37+12:00

App         Version  Status  Scale  Charm       Store  Rev  OS          Address         Notes
mattermost           active  1      mattermost  local  0    kubernetes  10.152.183.176

Unit           Workload  Agent  Address    Ports     Message
mattermost/0*  active    idle   10.1.1.86  8000/TCP

[agnew(charm-mattermost-k8s)] kubectl get all -n betabeta-mattermost
NAME                                 READY   STATUS    RESTARTS   AGE
pod/mattermost-99bb9c774-7g62w       1/1     Running   0          88s
pod/mattermost-operator-0            1/1     Running   0          96s
pod/modeloperator-687fdfbcf8-jqwv5   1/1     Running   0          111s

NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/mattermost            ClusterIP   10.152.183.176   <none>        8000/TCP    88s
service/mattermost-operator   ClusterIP   10.152.183.225   <none>        30666/TCP   96s
service/modeloperator         ClusterIP   10.152.183.118   <none>        17071/TCP   112s

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mattermost      1/1     1            1           88s
deployment.apps/modeloperator   1/1     1            1           112s

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/mattermost-99bb9c774       1         1         1       88s
replicaset.apps/modeloperator-687fdfbcf8   1         1         1       112s

NAME                                   READY   AGE
statefulset.apps/mattermost-operator   1/1     96s
[agnew(charm-mattermost-k8s)] juju upgrade-charm --path ./ --resource mattermost_image=localhost:32000/mattermost:v5 mattermost
Added charm "local:kubernetes/mattermost-1" to the model.
[agnew(charm-mattermost-k8s)] juju status
Model                Controller  Cloud/Region   Version  SLA          Timestamp
betabeta-mattermost  betabeta    k8s/localhost  2.8-rc2  unsupported  13:25:46+12:00

App         Version  Status  Scale  Charm       Store  Rev  OS          Address         Notes
mattermost           active  1      mattermost  local  1    kubernetes  10.152.183.176

Unit          Workload    Agent  Address    Ports     Message
mattermost/0  terminated  idle   10.1.1.86  8000/TCP  unit stopped by th...

Read more...

Changed in juju:
status: Fix Released → New
Revision history for this message
Ian Booth (wallyworld) wrote :

There's a different root cause issue here - a new bug 1879598 has been opened to track it.

The fix here was for the case where there was a race exec'ing into the workload pod while its init containers were still running.

The new issue appears related to pod termination when a rolling update is done by the k8s deployment controller.

It's been hard to reproduce - any chance of getting access to the mattermost charm so we can try it ourselves and see if we can trigger the misbehaviour?
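
In case it helps with reproduction, one rough way to watch the rolling update and pod termination described above while re-running the upgrade-charm step (namespace and deployment names assumed from the earlier transcripts):

  # In one terminal, watch pods being replaced as the deployment rolls:
  kubectl get pods -n betabeta-mattermost --watch
  # In another, follow the rollout triggered by the charm/image upgrade:
  kubectl rollout status deployment/mattermost -n betabeta-mattermost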

Changed in juju:
status: New → Fix Released
Revision history for this message
Paul Collins (pjdc) wrote :

I'll go update the other bug, but I'm not seeing the difference between the reproductions I supplied here. What's different?

Revision history for this message
Ian Booth (wallyworld) wrote :

The difference is the underlying Juju issue that needs to be fixed.

The symptoms appear similar in this case, but because the underlying root cause is different there may also be other, dissimilar manifestations of the issue. It's best to raise a new bug to track the work, since the changes need to be made to a different piece of code.
