Tickets take 4.5 hours to successfully complete

Bug #1290716 reported by Evan
This bug affects 1 person
Affects: Ubuntu CI Engine
Status: In Progress
Importance: Medium
Assigned to: Unassigned
Milestone: phase-0.5

Bug Description

tl;dr: we need to chip away at all the places where we're moving slowly: instances that are too small (make them bigger), polling that's too infrequent (go from 1 min down to 15 seconds, etc.), waiting on builds (get PPA scores increased), and so on.

--

I fed a ticket all the way through CI Core last night. However, this took quite a while to complete:

Created: Mon, 10 Mar 2014 22:38:44 +0000
Updated: Tue, 11 Mar 2014 03:10:18 +0000

About 40 minutes of that was waiting on the PPA. We'll just have to bump build scores for that (and, longer term, get PPAs deployed as microservices):

22:41:48 INFO PROGRESS_TRIGGER: WAITING: {u'source_packages': [u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2_source.changes', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2.debian.tar.gz', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0.orig.tar.gz', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2.dsc'], u'progress_trigger': u'lander_master-1-bsbuilder', u'series': u'saucy', u'ppa': u'ppa:ev/ci-pool-002', u'ticket_id': u'2', u'archive': u'ppa:ci-engineering-airline/ci-archive'}

*snip*

23:03:42 INFO PROGRESS_TRIGGER: STATUS: {u'artifacts': [], u'status': {u'building': u'set([python-scipy - 0.12.0-2ubuntu2~ev2])', u'failed': u'set([])', u'pending': u'set([])'}, u'source_packages': [u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2_source.changes', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2.debian.tar.gz', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0.orig.tar.gz', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2.dsc'], u'ppa': u'ppa:ev/ci-pool-002'}

*snip*

23:23:55 INFO PROGRESS_TRIGGER: COMPLETED: {u'status': {u'building': u'set([])', u'failed': u'set([])', u'pending': u'set([])'}, u'source_packages': [u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2_source.changes', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2.debian.tar.gz', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0.orig.tar.gz', u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/python-scipy_0.12.0-2ubuntu2~ev2.dsc'], u'artifacts': [], u'ppa': u'ppa:ev/ci-pool-002', u'exit': True, u'result': u'PASSED'}

The real problem seems to be in the test runner.

Part of this may just be that the scipy tests take ages to run on an m1.small instance. If that's the case, we should give the test runner's default instance multiple cores.
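For example (just a sketch; the flavor, image, and keypair names below are placeholders, not what the engine actually uses), the testbed could be booted from a multi-core flavor via nova:

# list the available flavors and their vCPU/RAM sizes
nova flavor-list
# boot the testbed from a larger flavor instead of the small default
nova boot --flavor standard.medium --image <testbed-image-id> --key-name <keypair> testbed-ticket-2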

http://paste.ubuntu.com/7072218/

It looks like it took 94.5846 minutes to complete the tests:

# sum the per-run durations (the trailing "... tests in <N>s" values) in the worker log and convert to minutes
sudo cat /var/log/upstart/tr_rabbit_worker.log | grep "tests in " | sed 's/.* //' | sed 's/s//' | awk '{ sum += $1 } END { print sum / 60.0 }'

From ticket-2/lander_test_runner.log:
23:33:54 INFO root: Waiting for messages. ^C to exit.
23:33:54 INFO PROGRESS_TRIGGER: WAITING: {u'progress_trigger': u'lander_master-1-testrunner', u'image_id': u'5fb841da-a8ac-11e3-a788-fa163e39aa80.img', u'ticket_id': u'2', u'package_list': [u'python-scipy']}
23:33:54 INFO PROGRESS_TRIGGER: STATUS: {u'progress_trigger': u'lander_master-1-testrunner', u'image_id': u'5fb841da-a8ac-11e3-a788-fa163e39aa80.img', u'ticket_id': u'2', u'package_list': [u'python-scipy']}
23:36:37 INFO PROGRESS_TRIGGER: STATUS: {u'message': u'testing python-scipy'}
03:10:07 INFO PROGRESS_TRIGGER: COMPLETED: {u'artifacts': [{u'type': u'RESULTS', u'name': u'test_runner.python-scipy-subunit-stream', u'reference': u'https://region-a.geo-1.objects.hpcloudsvc.com/v1/11597203075020/ticket-2/test_runner.python-scipy-subunit-stream'}], u'exit': True, u'result': u'PASSED'}
03:10:07 DEBUG amqplib: Closed channel #1

Tags: airline

Related branches

Evan (ev)
description: updated
Evan (ev)
description: updated
description: updated
Evan (ev)
summary: - Tickets take a very long time to successfully complete
+ Tickets take 4.5 hours to successfully complete
Revision history for this message
Evan (ev) wrote :

Marking this as medium for now, as vila says his tickets take 20 minutes to complete. I'll bump this up to high/critical if a few other people can't confirm similarly quick times.

Changed in ubuntu-ci-services-itself:
importance: Undecided → Medium
Evan (ev)
Changed in ubuntu-ci-services-itself:
assignee: nobody → Parameswaran Sivatharman (psivaa)
Revision history for this message
Vincent Ladeuil (vila) wrote :

https://launchpad.net/ubuntu/saucy/+source/python-scipy/+builds

Between 41 mins and 1h50 to build depending on arch.

> It looks like it took 94.5846 minutes to complete the tests:

http://d-jenkins.ubuntu-ci:8080/job/saucy-adt-python-scipy/buildTimeTrend

http://d-jenkins.ubuntu-ci:8080/job/saucy-adt-python-scipy/52/testReport

http://d-jenkins.ubuntu-ci:8080/job/saucy-adt-python-scipy/52/ARCH=amd64,label=adt/

22 minutes.

The hardware is different, and I suspect the processor itself matters more than the number of cores (the test runner testbed runs one dep-8 test after the other in both cases).

Let's resume the comparison once we have more logs out of the testbed.

Note that having timestamps in http://d-jenkins.ubuntu-ci:8080/job/saucy-adt-python-scipy/52/ARCH=amd64,label=adt/artifact/results/log would go a long way toward making comparisons easier.
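For what it's worth, a minimal sketch of prefixing wall-clock timestamps onto the test output (assuming gawk on the testbed; the command and log path are only examples):

# prefix every output line with a timestamp so stage durations can be compared
./run-tests 2>&1 | awk '{ print strftime("%H:%M:%S"), $0; fflush() }' | tee results/log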

Evan (ev)
description: updated
Revision history for this message
Para Siva (psivaa) wrote :

One of my tickets for the 'adduser' package took 1 hour and 8 mins:
                                  bsb: 15 mins
                                  imagebuilder: 47 mins
                                  testrunner: 5 mins
The imagebuild-worker, which has 1G of RAM, took 21 mins for apt-get clean in the chroot environment: http://pastebin.ubuntu.com/7073343/

Another ticket of the same package finished in 33 mins with imagebuild-worker taking just 10 mins.

Increasing the RAM could possibly improve this timing; still investigating with some more runs and altered configs.
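A quick way to check whether the worker is memory-starved during that step (the chroot path here is illustrative, not the engine's actual layout):

# memory and swap pressure on the imagebuild-worker
free -m
# time the suspect step inside the chroot
sudo chroot /srv/imagebuilder/chroot bash -c 'time apt-get clean'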

Revision history for this message
Andy Doan (doanac) wrote :

We have a RAM issue for the bsbuilder also. Should we just default all our services to use more memory?

Revision history for this message
Evan (ev) wrote :

Not every service needs it. Apache will run comfortably on a small instance, for example. Our deployment is already pretty big and resource-intensive given the number of instances we have, so we should be careful about making it larger without a clear reason to. Equally, if a service needs more memory, we should find out why - if there's a memleak, using an m1.medium will only save us for so long.

So yeah, pick a larger instance size where you need one.

Revision history for this message
Paul Larson (pwlars) wrote :

The kernel tests I've been going through take hours, but almost all of that is spent in the PPA build. I seem to get a builder assigned pretty quickly, so I don't think bumping the score would help much in this case. We do have a RAM issue on the bsb-worker, but only in some cases (kernel so far). I've bumped it to 4G locally and I'll include that in my patch later today, which will hopefully fix some other things too, once I can make it all the way through the build/image and at least see it trying to run the test. I suspect that for my case we'll want to bump up the CPU/memory on the test runner too. NB: this would not be the tr-worker, but rather the instance it starts to actually run the tests on the image we create.

Revision history for this message
Vincent Ladeuil (vila) wrote :

> NB: this would not be the tr-worker, but rather the instance it starts to actually run the tests on the image we create.

/me nods

Until bug #1288617 is fixed, bug #1287955 provides a way to address that by changing the instance flavor in ./juju-deployer/configs/unit_config.yaml.tmpl, for example:
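(A sketch only; the exact key name and default flavor value in the template may differ.)

# switch the testbed flavor in the deployer template before redeploying
sed -i 's/standard\.small/standard.medium/' ./juju-deployer/configs/unit_config.yaml.tmpl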

Revision history for this message
Para Siva (psivaa) wrote :

Still working on this, but from what I see, bsb, imagebuilder and test runner are the time-consuming stages, and we've already increased the bsb memory to 4G.

Along with increasing the memory/cores/scores, we also need to see whether we should impose timeouts for each stage. For now at least, it is impossible to foresee all the reasons a stage might wait/hang forever. As of rev365 I don't see any timeouts, at least on the ppa-assigner stage.

Reducing the polling interval from 60s to 15s won't save us a lot in my view, especially when we have stages that take minutes.

I am running consecutive tickets for the same package to compare the timing variation with the deployment's default configs. I'm not sure how many tickets we need to run to really spot a memory leak, but I will keep running them, and later with some increased config values.

Revision history for this message
Evan (ev) wrote :

> Reducing the polling interval from 60s to 15s won't save us a lot in my view,
> especially when we have stages that take minutes.

If in the optimal case a ticket is taking us a few minutes, shaving 45 seconds off here and there would make a noticeable difference.
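Rough numbers (assuming around six polled hand-offs per ticket, with an average wait of half the interval at each): 6 x 30s = about 3 minutes of dead time at a 60-second interval, versus 6 x 7.5s = about 45 seconds at 15 seconds.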

Changed in ubuntu-ci-services-itself:
status: New → Confirmed
Revision history for this message
Para Siva (psivaa) wrote :

Ev,

Do you have a package that runs through the system in a few minutes, and how many minutes is "a few minutes"? :)

The shortest I saw in my runs was 18 minutes.

https://docs.google.com/a/canonical.com/spreadsheet/ccc?key=0AvOb-Wd__AD8dEhkeGRWUzBnSmZ5SWhOT0RlQ0tyWmc&usp=drive_web#gid=0

This contains the timings from about 15 successful runs; here is a summary:

+'s
1. testrunner takes almost half the time when I use standard.medium instead of standard.small (6-9 mins -> 3 mins)
2. imagebuilder completes in 3 mins instead of 8-9 mins when 4G of RAM is used

-'s
1. bsb does not conclusively gain anything from increasing either RAM or CPU cores
2. ppa-assigner stays waiting forever because Launchpad takes too long to clean the 'dirty' PPAs. I've had to force the PPAs to become available to continue.

I am going to do an MP for imagebuilder and test runner as a first step.

Revision history for this message
Para Siva (psivaa) wrote :

Attached a branch with improved capacities for the imagebuilder and testbed instances. Looking into polling-interval improvements now.

Revision history for this message
Andy Doan (doanac) wrote :

psivaa has merged some code that should help this out quite a bit. Let's keep an eye on ticket times over the next few days before we call this "fixed".

Changed in ubuntu-ci-services-itself:
status: Confirmed → In Progress
Revision history for this message
Para Siva (psivaa) wrote :

The ticket for python-scipy took 2hr 2min for me to complete with failed tests: http://paste.ubuntu.com/7086926/

Testrunner took 1 hr 4 min on a "standard.2xlarge" (30GB RAM | 8 VCPU | 30GB disk) testbed instance, and logging into the testbed during the test showed the tests themselves consuming 99+% CPU.

On alderamin, as Vincent points out, this takes less time. This is definitely package-specific, and making a single flavor selection for all packages is overkill.

Making the testbed specification configurable by the user, probably via the CLI, looks like the best solution to me, maybe post phase-0.

Revision history for this message
Andy Doan (doanac) wrote :

Another data point I discovered. The image-builder is super slow for me tonight:

23:59:30 INFO PROGRESS_TRIGGER: STATUS: {u'message': u'Adding requested packages to the image...'}
00:30:51 INFO PROGRESS_TRIGGER: STATUS: {u'message': u'updating chroot to be more compressible...'}

So it took 30 minutes to run "apt-get install phablet-tools". I logged in while it was running, and "apt-get install" seemed to be waiting on two instances of /usr/lib/apt/methods/http.

This doesn't seem to have anything to do with capacity, but rather with network I/O or something. What's really odd is that I then ran the same command by hand and it finished in under 60 seconds. Not sure what this means.
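Next time it happens, a couple of quick checks from the worker would tell us whether apt is stuck on the network rather than on CPU or RAM (the chroot path below is a placeholder):

# show how long the apt http workers have been running and what they're fetching
ps -eo pid,etime,args | grep '[a]pt/methods/http'
# re-run the same install by hand inside the chroot and time it
sudo chroot /path/to/chroot bash -c 'time apt-get -y install phablet-tools'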

Revision history for this message
Para Siva (psivaa) wrote :

@Andy: The only difference I can think of that would explain the significantly different times is that they probably used different archive locations: http://az3.clouds.archive.ubuntu.com vs http://archive.ubuntu.com/ubuntu/, and the latter might have been too slow at the time. The apt-get update within the chroot hits the latter, but on the instance itself it hits the az3.clouds one.
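If that turns out to be the cause, pointing the chroot's sources at the in-cloud mirror would be a cheap fix; a sketch, assuming a standard sources.list and a placeholder chroot path:

# make the chroot use the same in-cloud mirror as the host instance
sudo sed -i 's|http://archive.ubuntu.com/ubuntu|http://az3.clouds.archive.ubuntu.com/ubuntu|' /path/to/chroot/etc/apt/sources.list
sudo chroot /path/to/chroot apt-get update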

Also, there is a zero-file creation/removal step that's used to speed up compression in the image-builder; this used to make the process slow, but after increasing the RAM I have not seen it. From the logs it's not clear whether this was the case here.

Just to double-check whether there is anything specific to phablet-tools causing the delay, I ran another ticket for it. The apt-get install only took 3 mins with rev388 of trunk.
17:59:23 INFO PROGRESS_TRIGGER: STATUS: {u'message': u'Adding requested packages to the image...'}
18:02:06 INFO PROGRESS_TRIGGER: STATUS: {u'message': u'Shrinking the converted image...'}

Will run it a couple more times to see whether it recurs.

Revision history for this message
Andy Doan (doanac) wrote : Re: [Bug 1290716] Re: Tickets take 4.5 hours to successfully complete

On 03/14/2014 02:22 PM, Parameswaran Sivatharman wrote:
> @Andy: Only difference that I can think of why it took significantly
> different times is that they could have probably used different archive
> locations: http://az3.clouds.archive.ubuntu.com vs
> http://archive.ubuntu.com/ubuntu/ and the latter might have been too
> slow at the time. The apt-get update within the chroot is hitting the
> latter but on the instance it's hitting the az3.cloud one.

It's not happening anymore, so I think it was just temporary LP insanity.

> Also there is a zerofile creation/removal step that's used to speed up
> the compression in the image-builder and this used to make the process
> slow but after increasing the RAM I have not seen it. From the logs
> it's not clear whether this was the case.

This was purely in apt-get install. The compression step ran pretty fast.

Revision history for this message
Andy Doan (doanac) wrote :

I'm seeing package installation go really slowly again in the image builder. I've hacked mine up so I can easily follow the download speed while grabbing what's required to install my kernel. I'm getting about 50 kB/s, and it's telling me the download will take 40 minutes.

One thing we need to be cognizant of is that hpcloud seems to be much slower on grabbing stuff from our archives. Kind of makes sense I suppose.

Revision history for this message
Para Siva (psivaa) wrote :

On 15/03/14 04:56, Andy Doan wrote:
> I'm going at about 50kB/s and its telling me it will take 40
> minutes to download.
I was able to reproduce this; it took 24 mins for me to install
phablet-tools. I had a chat with HP support, and according to them
there was apparently nothing wrong with the imagebuilder instance;
"It could be the nature of the way the load balancer handles
duplicate requests" in archive.u.c.

>
> One thing we need to be cognizant of is that hpcloud seems to be
> much slower on grabbing stuff from our archives. Kind of makes
> sense I suppose.
The issue does appear to be on the archive.u.c side. Download speed on
the image builder instance from archive.ubuntu.com is 55.9 KB/s, as
opposed to 11.9 MB/s when downloading an iso from cdimage.u.c. At the
same point in time this almost matches the timings on my local machine,
so I don't think there is any issue with the instance itself or on the
HP side. It could be that archive.u.c is heavily loaded at this time.
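A quick way to reproduce the comparison from the instance (URLs are only examples; wget reports the average throughput when each download finishes):

# throughput from the main archive vs. cdimage, measured from the image-builder instance
wget -O /dev/null http://archive.ubuntu.com/ubuntu/dists/saucy/main/binary-amd64/Packages.gz
wget -O /dev/null http://cdimage.ubuntu.com/<path-to-an-iso>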

I'll poke IS about this.

Revision history for this message
Para Siva (psivaa) wrote :

RT 68480 has been filed with IS to look into this.

Revision history for this message
Para Siva (psivaa) wrote :

https://app.asana.com/0/9548224959691/11003358517617 has been created to update the images to use suitable archive mirrors.

Evan (ev)
Changed in ubuntu-ci-services-itself:
milestone: phase-0 → phase-0.5
Ursula Junque (ursinha)
Changed in uci-engine:
assignee: nobody → Parameswaran Sivatharman (psivaa)
importance: Undecided → Medium
milestone: none → phase-0.5
status: New → In Progress
Evan (ev)
no longer affects: ubuntu-ci-services-itself
Para Siva (psivaa)
Changed in uci-engine:
assignee: Parameswaran Sivatharman (psivaa) → nobody