destroy-unit --force

Bug #1089289 reported by William Reade
This bug affects 16 people
Affects: juju-core
Status: Won't Fix
Importance: High
Assigned to: William Reade
Milestone: none

Bug Description

At some stage, we need to implement this flag, which forcibly sets the Unit to Dead; this will be used to work around the possibility of non-responsive (or just plain broken) unit agents (which can otherwise block the destruction of machines, services, and relations).

This is potentially tricky because it may necessitate the cleanup of large numbers of relations, subordinates, the subordinates' relations, and potentially some services.

William Reade (fwereade)
description: updated
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

This has bitten me a few times already: units get wedged (with more than one underlying cause) and there's no recourse except destroying the environment.

Revision history for this message
William Reade (fwereade) wrote :

With reference to lp:1173224, I'm inclined to redefine the desired action as "the unit agent should run all hooks appropriate to Dying as usual, but ignore all errors". Tolerable?

Changed in juju-core:
status: New → Confirmed
Revision history for this message
David Britton (dpb) wrote :

It seems this state gets entered quite often if there is ever an error in your deployment. I can repeat it with:

juju deploy service
# make sure ^ has some kind of deployment error
juju destroy-service service # unit will not go away
juju resolved service
juju destroy-service service # unit is gone, but service is in 'dying' state
# at this point you are stuck.

Revision history for this message
William Reade (fwereade) wrote :

David, I haven't been able to reproduce that situation. I'd expect that a charm in which every hook failed would need to have 4 hooks resolved (install, config-changed, start, stop) before it was finally removed; although, in the course of investigating this, I did find that the unit agent can sometimes resolve more than one error in response to a single request.

Next time you encounter it, would you ping me in #juju-dev so I can try to investigate a bit more?
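
For illustration, a minimal sketch of the workflow that implies, assuming a charm whose hooks all fail (service and unit names are made up):

juju destroy-service broken-service   # the unit must still work through its remaining hooks
juju resolved broken-service/0        # repeat once per failed hook (install, config-changed, start, stop)
juju status broken-service            # the unit disappears from the output once it is finally removed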

Changed in juju-core:
status: Confirmed → Triaged
importance: Undecided → High
Revision history for this message
William Reade (fwereade) wrote :

See also lp:1190715

Nick Veitch (evilnick)
tags: added: doc
William Reade (fwereade)
summary: - remove-unit --force
+ destroy-unit --force
Curtis Hovey (sinzui)
tags: added: destroy-unit
Revision history for this message
Jeff Lane  (bladernr) wrote :

I can reliably create this in EC2 with the quantum-gateway charm:

ubuntu@ip-10-0-0-14:~$ juju status quantum-gateway
environment: amazon
machines:
  "14":
    agent-state: started
    agent-version: 1.16.0
    dns-name: ec2-54-205-199-95.compute-1.amazonaws.com
    instance-id: i-cfa5c3a8
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
services:
  quantum-gateway:
    charm: cs:precise/quantum-gateway-7
    exposed: false
    relations:
      amqp:
      - rabbitmq-server
      cluster:
      - quantum-gateway
      quantum-network-service:
      - nova-cloud-controller
      shared-db:
      - mysql
    units:
      quantum-gateway/0:
        agent-state: down
        agent-state-info: (installed)
        agent-version: 1.16.0
        life: dying
        machine: "14"

so I did this:

juju deploy --config $YAML quantum-gateway
juju add-relation quantum-gateway mysql
juju add-relation quantum-gateway nova-cloud-controller
juju add-relation quantum-gateway rabbitmq-server

my yaml file has this for quantum-gateway:
quantum-gateway:
  openstack-origin: cloud:precise-grizzly/updates
  ext-port: 'eth0'

and with this, every time I try to deploy quantum-gateway, the instance spins up and then the EC2 dashboard shows it failing the second status check (connectivity after boot).

I am unable to contact this node at all, via juju ssh, direct ssh, or any other means, so I suspect something in the charm may be rewriting the network config, but that's really just a guess, as I can't access the node to check the logs and see what happens.

Curtis Hovey (sinzui)
tags: added: docs
removed: doc
Curtis Hovey (sinzui)
tags: added: cts-cloud-review
William Reade (fwereade)
Changed in juju-core:
assignee: nobody → William Reade (fwereade)
milestone: none → 2.0
status: Triaged → In Progress
Revision history for this message
William Reade (fwereade) wrote :

Sorry, the progress reported on this was for lp:1089291 -- I had a miswiring in my brain.

Changed in juju-core:
status: In Progress → Triaged
Revision history for this message
William Reade (fwereade) wrote :

A unit that has not yet started running can already be removed with a plain `destroy-unit`, but once it's started a forcible removal becomes unsafe -- any processes started by the charm will continue to run, will interact unpredictably with new units on the same machine, and may even become a security risk. `destroy-machine --force` allows a whole machine (and all its units) to be decommissioned at once, and is the only safe way to accomplish forcible removal of running units; hence, WONTFIX.
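
In practice, then, the supported recovery path is to take out the whole machine; a minimal sketch, assuming the stuck unit lives on machine 14 (the machine number is hypothetical):

juju destroy-machine --force 14   # decommissions the machine and every unit on it, responsive or not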

Changed in juju-core:
status: Triaged → Won't Fix
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0 → none
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

This was marked as a prerequisite for destroy-service --force. The machine is not reusable for further deploys (it's marked dirty). This overabundance of caution when cleaning up is causing usability issues. Either juju is a tool for deployments at scale (i.e. managing thousands of units) or it's not; having the admin manually clean up after juju, after already expressing intent, undermines that. Alternatively, we should document that killing the machine on force suffices, and default to that for destroy-service --force.

Revision history for this message
HeinMueck (cperz) wrote :

It would be funny, would it not make you weep.

- Juju can run one charm at a time. So you either have one machine or one unit per charm.
- You go with the flow and install some openstack components in units
- one breaks, you want to reinstall?

Well, no problem, just kill your fine infrastructure at once and rebuild it from scratch - your customers will love you for that.

Sorry guys, no kidding. How would you convince me that using maas and juju to set up an infrastructure is any good idea?

Wontfix = dontuse

Amazing.

Revision history for this message
William Reade (fwereade) wrote :

I don't quite understand the missing use case here.

If you want to deploy one unit per machine, you can, and then force-destroying the machine is equivalent to force-destroying the unit.

If you want to deploy multiple units per machine, you can, in two ways:

If you want to hulk-smash them together in the same OS, you risk unexpected interactions; and because force-destroying a unit on such a system would leave the machine in a dangerously unknowable state, we just forbid that and require that you clear down the whole machine.

BUT, if you want your units to be on the same hardware, but nicely isolated from one another -- which you probably want anyway -- you can deploy units into containers on the top-level machines; and then if a unit misbehaves you can force-destroy its container (which is just another machine to juju) and leave the parent machine -- and its other containers -- untouched.
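
A minimal sketch of that container-based layout, assuming an LXC-capable provider and made-up machine numbers:

juju deploy mysql --to lxc:0           # place the unit in an LXC container on machine 0
juju destroy-machine --force 0/lxc/0   # if that unit misbehaves, clear down only its container;
                                       # machine 0 and its other containers are untouched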

What's the scenario that causes you to have to rebuild anything from scratch?

Revision history for this message
HeinMueck (cperz) wrote :

My understanding was that this behavior is intended, that you go for destroying the machine with all units.

My use case is an OpenStack deployment (a small environment, not many machines, but also not much traffic), where all the "minor" services go onto one machine as separate units. Compute gets its own machine.

Now, I wanted to extend the original installation, and this required a configuration change to one of the components. As this setting only applies at deploy time, I had to redeploy this component. It did not work.

As I understand it, your intended workflow is to destroy the machine and all its units - thus "the cloud" - and redeploy.

It left me frustrated. Reading your statement, the case of managing a cloud installation on bare metal with MaaS and juju in particular seems questionable. Since for most customers workloads are "cheap", I understand the vote for destroy-and-rebuild in that domain.

Revision history for this message
Jason Meinzer (meinzerj) wrote :

I'm running into similar problems while testing juju and learning about OpenStack. I've had to completely destroy my MAAS, Juju, etc. environments more times than I can count. While this is fine during testing, I've started to wonder whether it's even possible to use Juju in production. Re-bootstrapping everything when you run into inevitable problems isn't a solution.

Revision history for this message
Fabrice Matrat (fabricematrat) wrote :

I managed to deploy a subordinate unit but destroyed it and the service just after.
The charm wasn't even installed, so the subordinate unit wants to run its install hook before destroying itself, which will never happen.
I can't force-destroy the subordinate unit, so the only thing left might be to delete the machine with all the units that are already well in place.
This is the kind of situation where I would really love a force destroy of a unit.
