upgrade-juju from 1.18 to 1.20.14 leaves some agents down

Bug #1473450 reported by Mario Splivalo
This bug affects 2 people
Affects: juju-core | Status: Fix Released | Importance: High | Assigned to: Unassigned | Milestone: none

Bug Description

I'm testing an upgrade path from 1.18 to 1.22 (or 1.24, whichever works). Since a direct upgrade fails, I am using 1.20 as an intermediate step.

Here is how I was doing it:

ubuntu@mariosplivalo-bastion:~$ sudo apt-get install juju-core=1.18.1-0ubuntu1
ubuntu@mariosplivalo-bastion:~$ juju deploy ubuntu -n5
ubuntu@mariosplivalo-bastion:~$ juju upgrade-juju

(The full paste of executed commands is here: http://pastebin.ubuntu.com/11855933/)

But this leaves some of my agents in the 'down' state (see attached juju.status.txt).

I'm also attaching the machine log from the failed unit (machine-1.log).

If I restart the unit, the machine's jujud won't start. Here is what happens when I try to start it manually:

root@juju-mariosplivalo-machine-1:~# /var/lib/juju/tools/machine-1/jujud machine --data-dir '/var/lib/juju' --machine-id 1 --debug
2015-07-10 13:34:41 INFO juju.cmd supercommand.go:37 running jujud [1.20.14-trusty-i386 gc]
2015-07-10 13:34:41 INFO juju.cmd.jujud machine.go:158 machine agent machine-1 start (1.20.14-trusty-i386 [gc])
2015-07-10 13:34:41 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2015-07-10 13:34:41 INFO juju.cmd.jujud machine.go:169 no upgrade steps required or upgrade steps for 1.20.14 have already been run.
2015-07-10 13:34:41 INFO juju.worker runner.go:260 start "api"
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x1 pc=0x80bedda]

goroutine 7 [running]:
runtime.panic(0x87cc1e0, 0x94cc4c8)
 /usr/lib/go/src/pkg/runtime/panic.c:266 +0xac
github.com/juju/juju/agent.(*configInternal).APIInfo(0x19877620, 0x198bb298)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/agent/agent.go:611 +0xaa
main.openAPIState(0xb750a7f8, 0x19877620, 0xb750a988, 0x198a23c0, 0x0, ...)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/agent.go:212 +0x7f
main.(*MachineAgent).APIWorker(0x198a23c0, 0x8, 0xb7436f5c, 0x1, 0x1)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/machine.go:247 +0xdc
main.*MachineAgent.APIWorker·fm(0x0, 0x0, 0x0, 0x0)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/machine.go:179 +0x4a
github.com/juju/juju/worker.(*runner).runWorker(0x198aa930, 0x0, 0x0, 0x8878020, 0x3, ...)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/worker/runner.go:261 +0x2ab
created by github.com/juju/juju/worker.(*runner).run
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/worker/runner.go:177 +0x2d7

goroutine 1 [select]:
github.com/juju/juju/worker.(*runner).StartWorker(0x198aa930, 0x88a7f18, 0xb, 0x8a1e1ac, 0x0, ...)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/worker/runner.go:100 +0xee
main.(*MachineAgent).Run(0x198a23c0, 0x198aa3c0, 0x0, 0x0)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/machine.go:183 +0x5dc
github.com/juju/cmd.(*SuperCommand).Run(0x198a4540, 0x198aa3c0, 0x198aa3c0, 0x0)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/cmd/supercommand.go:321 +0x299
github.com/juju/cmd.Main(0xb750a490, 0x198a4540, 0x198aa3c0, 0x1980a008, 0x6, ...)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/cmd/cmd.go:247 +0x1bf
main.jujuDMain(0x1980a000, 0x7, 0x7, 0x198aa3c0, 0x1981c501, ...)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/main.go:107 +0x1f9
main.Main(0x1980a000, 0x7, 0x7)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/main.go:122 +0x186
main.main()
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/cmd/jujud/main.go:139 +0x3e

goroutine 3 [syscall]:
os/signal.loop()
 /usr/lib/go/src/pkg/os/signal/signal_unix.go:21 +0x21
created by os/signal.init·1
 /usr/lib/go/src/pkg/os/signal/signal_unix.go:27 +0x34

goroutine 6 [runnable]:
github.com/juju/juju/worker.(*runner).run(0x198aa930, 0x806cc50, 0x806cc60)
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/worker/runner.go:160 +0x811
github.com/juju/juju/worker.func·004()
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/worker/runner.go:84 +0x4e
created by github.com/juju/juju/worker.NewRunner
 /build/buildd/juju-core-1.20.14/src/github.com/juju/juju/worker/runner.go:85 +0x12e
root@juju-mariosplivalo-machine-1:~#

Checking the agent.conf file inside /var/lib/juju/agents/machine-1, I see no mention of apiaddress or apipassword. On the working unit those are present.
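A quick way to spot the affected agents is to check each agent.conf for the two keys. This is a hedged sketch (the function name is mine, not part of Juju); it only reads the file:

```shell
# Sketch: check a single agent.conf for the keys the upgrade may drop.
# Prints one line per missing key; prints nothing when the file is complete.
check_agent_conf() {
  conf=$1
  for key in apiaddresses apipassword; do
    grep -q "^${key}:" "$conf" || echo "$conf: missing ${key}"
  done
}

# Typical use on a machine (path is Juju's default data dir):
# for c in /var/lib/juju/agents/machine-*/agent.conf; do check_agent_conf "$c"; done
```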

Revision history for this message
Curtis Hovey (sinzui) wrote :

This could be a duplicate of bug 1444912

tags: added: upgrade-juju
Changed in juju-core:
status: New → Triaged
milestone: none → 1.25.0
importance: Undecided → High
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

I am observing the same behavior when upgrading from 1.18 to 1.22.6, but in this case only 'apiaddress' was missing in agent.conf.
Out of the 6 deployed units, only one failed.

Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

When I added apiaddress: to agent.conf, I was able to start the agent without issues. I'm assuming this worked because I had apipassword: present inside agent.conf.

However, some units don't have it, and I'm not sure where to get the password from.

Revision history for this message
Ian Booth (wallyworld) wrote :

I tried reproducing this using the latest 1.18 release that was published (1.18.4).

I bootstrapped an environment and deployed 5 ubuntu charms.
At that point I checked the agent conf files and, as expected, they all contained apiaddress and password entries.
I then ran upgrade-juju and everything went to 1.20.14 with no problems.

I don't know why the apiaddress and password entries were missing from the environment deployed when this bug was filed. These entries should have been in the agent conf files from the start. Was the environment in question perhaps itself upgraded from 1.16 or earlier?

summary: - upgrade-juju from 1.18 to 1.20.4 leaves some agents down
+ upgrade-juju from 1.18 to 1.20.14 leaves some agents down
Revision history for this message
Ian Booth (wallyworld) wrote :

I ran the experiment again and after upgrading to 1.20.14 I could not add-unit. I'll comment in bug 1473517 as that bug is tracking the status of the environment after upgrade. For now, I'll mark this bug as incomplete unless we can reproduce the problem again.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

I have tested this again, several times. I deployed a 6-unit ubuntu charm twice: once it worked (no issues), and once I hit the same issue.
I tried repeating with 20 units (as I need to upgrade a customer who is running over 100 units), and it failed every time.
I'm adding the log files from the environments as attachments here. There is terminal.log, the copied output of the commands run in the terminal, and there is all-machines.log from the bootstrap node.
I have all the other log files archived and can provide them if needed.

There is, however, a workaround: first save the agent.conf files from each unit, and then add apiaddress/apipassword where needed.
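That workaround can be sketched roughly as below. The function name is illustrative, and the API address must come from somewhere you trust, e.g. a still-working unit's agent.conf; this is a sketch, not a tested tool:

```shell
# Sketch of the workaround: back up agent.conf, then re-add apiaddresses
# if the upgrade dropped it. The address should be taken from a working
# unit's agent.conf (or the environment's .jenv file).
restore_apiaddresses() {
  conf=$1; addr=$2
  cp "$conf" "$conf.bak"                      # keep a copy before editing
  if ! grep -q '^apiaddresses:' "$conf"; then
    printf 'apiaddresses:\n- %s\n' "$addr" >> "$conf"
  fi
}

# Illustrative use (machine id and address are examples from this report):
# restore_apiaddresses /var/lib/juju/agents/machine-5/agent.conf \
#     ip-10-180-222-9.ec2.internal:17070
```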

Changed in juju-core:
status: Incomplete → New
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
Revision history for this message
Ian Booth (wallyworld) wrote :

I looked at the terminal log and see that the grep for apiaddresses looks like this:

sudo grep api /var/lib/juju/agents/machine-${m}/agent.conf; done

resulting in:

apiaddresses:
apipassword: fljMcZUmiNWmLpeBg0S34N9+

But this does not show api addresses are missing. The yaml looks like this:

stateaddresses:
- localhost:37017
statepassword: zZZ0dN2yOBSU0/Ns+GachPzo
apiaddresses:
- ip-10-180-222-9.ec2.internal:17070
apipassword: zZZ0dN2yOBSU0/Ns+GachPzo
oldpassword: GsrxfqfkbYS84WN2BvGa0m8P

apiaddresses is a list, so grepping will just show the key line, not the contents.
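To see the list entries as well, grep can be asked for trailing context. A self-contained sketch, reconstructing a sample agent.conf from the values quoted above:

```shell
# Reconstructed sample of the agent.conf fields (values from this report):
cat > agent.conf <<'EOF'
stateaddresses:
- localhost:37017
statepassword: zZZ0dN2yOBSU0/Ns+GachPzo
apiaddresses:
- ip-10-180-222-9.ec2.internal:17070
apipassword: zZZ0dN2yOBSU0/Ns+GachPzo
EOF

# A plain grep shows only the key line, because the YAML list items
# live on the following lines; -A1 also prints the first list entry:
grep '^apiaddresses:' agent.conf      # prints just "apiaddresses:"
grep -A1 '^apiaddresses:' agent.conf  # also prints the address line
```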

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hi, Ian.

If you check the run-2 terminal log, you can see that I ran the grep against the agent.conf files on all units twice: once before running 'upgrade-juju' and once after the upgrade completed.

Before the upgrade was run, the grep on all the units returned something like this:

Warning: Permanently added '10.5.2.230' (ECDSA) to the list of known hosts.
sudo: unable to resolve host juju-mariosplivalo-machine-5
apiaddresses:
apipassword: DaoJjc0YsshLEsyzkVM9Z+1F
Connection to 10.5.2.230 closed.

This shows that both apiaddresses and apipassword were present in the agent.conf file.

When you check the output of the grep after the upgrade, for some units it looks like this:
Warning: Permanently added '10.5.2.230' (ECDSA) to the list of known hosts.
sudo: unable to resolve host juju-mariosplivalo-machine-5
apipassword: DaoJjc0YsshLEsyzkVM9Z+1F
Connection to 10.5.2.230 closed.

So, after the upgrade, the agent.conf file on machine-5 is missing apiaddresses.
This happened on every unit except machine-1.

I still have my environment deployed; are there any other log files that might be of interest? Would it help if I granted you access to the environment so you can investigate?

Revision history for this message
Ian Booth (wallyworld) wrote :

Mario, sorry, I misread what you were saying.

This could be another manifestation of the bug where jujud agents need to be restarted to get them to know about the apiaddresses. As per https://bugs.launchpad.net/juju-core/+bug/1473517/comments/3

Unfortunately, we are not going to do another 1.20 release. I think the only viable option is to automate the workaround you used, i.e. record the apiaddresses (which should also be in the jenv file for easy access) and then, after the upgrade, insert them into the agent conf files where they are missing.

Felipe Reyes (freyes)
tags: added: sts
Curtis Hovey (sinzui)
Changed in juju-core:
status: Incomplete → Fix Released
milestone: 1.25.0 → none