Merge lp:~pfalcon/linaro-aws-tools/elaborate-slave-check into lp:linaro-aws-tools

Proposed by Paul Sokolovsky
Status: Rejected
Rejected by: Paul Sokolovsky
Proposed branch: lp:~pfalcon/linaro-aws-tools/elaborate-slave-check
Merge into: lp:linaro-aws-tools
Diff against target: 64 lines (+31/-3)
1 file modified
monitor-ec2-build-slaves (+31/-3)
To merge this branch: bzr merge lp:~pfalcon/linaro-aws-tools/elaborate-slave-check
Reviewer                       Review Type  Date Requested  Status
Paul Sokolovsky                                             Approve
James Tunnicliffe (community)                               Needs Information
Review via email: mp+104894@code.launchpad.net

Description of the change

Further elaboration of monitor-ec2-build-slaves, which has been spamming me for a while: it now shows which builds are actually running on long-running instances. A long-running instance is not a problem in itself - we have instances running for 8+ hours serving a number of builds. The real problem is long-running builds, not instances. So this change just shows that info; the next step is to act on it.

Paul Sokolovsky (pfalcon) wrote :

Ah, and Jenkins API doesn't provide any useful info about a slave in this case, so I do HTML scraping.

Loïc Minier (lool) wrote :

On Mon, May 07, 2012, Paul Sokolovsky wrote:
> Ah, and Jenkins API doesn't provide any useful info about a slave in this case, so I do HTML scraping.

 Ah I was wondering the same, but what about result / timestamp / id?

 In progress build:
{
[...]
    "duration": 0,
[...]
    "id": "2012-05-07_12-34-17",
[...]
    "result": null,
    "timestamp": 1336394057152,
[...]

 Successful build:
{
    "duration": 31411,
    "id": "2012-05-07_11-34-10",
    "result": "SUCCESS",
    "timestamp": 1336390450834,

 Failed build:
{
    "duration": 576533,
    "id": "2012-05-07_11-30-18",
    "result": "FAILURE",
    "timestamp": 1336390218117,

 Timestamp is always the start timestamp in ms since the epoch, result
 is unset when in progress and indicates SUCCESS or FAILURE otherwise,
 duration is the elapsed time in ms once the job is done (failed or successful).

 id is based on the start date, but best not to use that if you can avoid it

 (date -d @1336390450 -> Mon, 07 May 2012 11:34:10)

 JSON is so much machine-friendlier than the tooltip :-)
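A minimal sketch of how the fields above could be used to get the running time of an in-progress build. It only relies on what is described in this comment ("result" is null while the build runs, "timestamp" is the start time in ms since the epoch); the function name `build_running_time` is made up for illustration, and the sample dict reuses the in-progress build shown above. It is written for Python 3, unlike the python2.7 script under review.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def build_running_time(build, now):
    """Return how long an in-progress build has been running, or None.

    `build` is a dict decoded from the build's JSON and `now` is an aware
    UTC datetime. A build counts as in progress when "result" is null.
    (Hypothetical helper, not part of the script under review.)
    """
    if build.get("result") is not None:
        return None  # build already finished (SUCCESS or FAILURE)
    # "timestamp" is the start time in milliseconds since the epoch.
    started = EPOCH + datetime.timedelta(milliseconds=build["timestamp"])
    return now - started


in_progress = {"duration": 0, "id": "2012-05-07_12-34-17",
               "result": None, "timestamp": 1336394057152}
now = datetime.datetime(2012, 5, 7, 13, 34, 17, tzinfo=datetime.timezone.utc)
print(build_running_time(in_progress, now))
```

In a real monitor `now` would be the current UTC time rather than a fixed value.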

--
Loïc Minier

James Tunnicliffe (dooferlad) wrote :

Assume I am waiting for a response about Loic's comment...

review: Needs Information
Paul Sokolovsky (pfalcon) wrote :

> Ah I was wondering the same, but what about result / timestamp / id?

Jenkins has some rudimentary API to query a slave, e.g. https://android-build.linaro.org/jenkins/computer/i-0df2286b/api/xml , https://android-build.linaro.org/jenkins/computer/i-0df2286b/loadStatistics/api/ (specific ids will expire, of course).

But the info returned by those API calls doesn't include *builds running on a slave*, or even "build slave name/type/description" or "build slave labels" (the latter is good enough to see what kind of instance the slave is). And those 2 types of data are what we need. So, I had to resort to scraping the HTML "slave info" page to get the list of current jobs and labels. But once I did that, it turned out that the page also includes enough info about each build (running duration). So, I could either scrape the page for the list of builds and then do N calls (each of them taking time and requiring error handling) to the normal build API to get build info, or keep scraping for build info which is already on our hands. I chose the latter.

It's true that the tooltip format may change, but that's just as true for the job list, for example. The scraping code didn't turn out to be too big or complex (I luv BeautifulSoup), so imho it's ok for v1 ;-).

Paul Sokolovsky (pfalcon) wrote :

I'm sorry, I wanted to push further updates to the script, and just did "bzr push", which went to main repo... Why doesn't bzr remember previously specified push location?

Loïc Minier (lool) wrote :

On Wed, May 09, 2012, Paul Sokolovsky wrote:
> I'm sorry, I wanted to push further updates to the script, and just
> did "bzr push", which went to main repo... Why doesn't bzr remember
> previously specified push location?

 It does automatically remember the push location the first time you
 push; if you want to replace it, pass --remember to bzr push.

--
Loïc Minier

Loïc Minier (lool) wrote :

On Wed, May 09, 2012, Paul Sokolovsky wrote:
> Jenkins has some rudimentary API to query slave, e.g.
> https://android-build.linaro.org/jenkins/computer/i-0df2286b/api/xml ,
> https://android-build.linaro.org/jenkins/computer/i-0df2286b/loadStatistics/api/
> (specific ids will expire of course).
>
> But info returned by those API calls doesn't include *builds running
> on a slave*, or even "build slave name/type/description" or "build
> slave labels" (the latter is good enough to see what kind of instance
> is the slave).
[...]
> So, I could scrape
> the page for list of build, and then do N calls (each of them taking
> time and requiring error handling) to normal build API to get build
> info, or keep scraping for build info which is already on our hands.

 I'm not sure you've seen that in the Jenkins API documentation, but
 passing a "depth" allows getting as much data as you want from the
 Jenkins state object.

 e.g. compare:
 https://android-build.linaro.org/jenkins/computer/api/json
 with:
 https://android-build.linaro.org/jenkins/computer/api/json?depth=2

 (NB: you need at least an executor present to see a difference!)

 I love BeautifulSoup too, but it does seem to me that a single API call
 would return the needed info, so I'd rather we use a semi-stable API.
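A rough sketch of what consuming the depth=2 response could look like. The field names used here ("computer", "displayName", "executors", "currentExecutable", "fullDisplayName") are assumptions about the shape of the JSON and should be checked against real output from the URL above; the sample data is invented, apart from the instance id quoted earlier in this thread. No network call is made, so the parsing can be tried offline.

```python
import json

# Invented, trimmed-down stand-in for .../computer/api/json?depth=2.
# All field names are assumptions to be verified against the real response.
SAMPLE = json.loads("""
{"computer": [
  {"displayName": "i-0df2286b", "idle": false,
   "executors": [
     {"currentExecutable": {"fullDisplayName": "example-job #12",
                            "timestamp": 1336394057152}}]},
  {"displayName": "i-00000000", "idle": true,
   "executors": [{"currentExecutable": null}]}
]}
""")


def running_builds(computers):
    """Map each slave name to the list of builds its executors are running.

    Hypothetical helper; an idle slave maps to an empty list.
    """
    result = {}
    for computer in computers.get("computer", []):
        builds = []
        for executor in computer.get("executors", []):
            build = executor.get("currentExecutable")
            if build:  # null or missing when the executor is idle
                builds.append(build)
        result[computer["displayName"]] = builds
    return result


for name, builds in sorted(running_builds(SAMPLE).items()):
    print(name, [b["fullDisplayName"] for b in builds] or "Idle")
```

If the real response matches this shape, a single request would cover every slave, instead of one HTML page fetch per instance.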

--
Loïc Minier

Paul Sokolovsky (pfalcon) wrote :

> https://android-build.linaro.org/jenkins/computer/api/json?depth=2

That's really cool indeed; it appears to have most of the info needed, and possibly even idle-time detection can be done with it.

Paul Sokolovsky (pfalcon) wrote :

Anyway, will submit a new merge request, closing this.

Paul Sokolovsky (pfalcon) :
review: Approve
Loïc Minier (lool) wrote :

Isn't this superseded instead?

Paul Sokolovsky (pfalcon) wrote :

Well, this was merged, that's why I set it. But actually, I just want to get rid of it in the list of active reviews...

Unmerged revisions

83. By Paul Sokolovsky

Set lower build threshold for testing time.

82. By Paul Sokolovsky

Get rid of test value for HOUR.

81. By Paul Sokolovsky

Add logic to actually analyze *build* times to decide if instance is runaway.

So, an instance running for long is generally ok if its current build(s) have
been running for a sensible time. Still, report instances running very long
(>10hrs); in our experience these almost certainly hang soon, if they haven't
already. The problematic case is idle instances: on one side, instances idling
for a short time are ok - they're waiting for new builds or heading to
expiration (in 30 mins). But long-idling instances are bad. As we don't have
idle time on our hands, accept the possibility of false positive reports (with
a 20min monitoring period, the real problem is getting 3+ consecutive reports
of an idle instance).
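The rules in this commit message could be sketched roughly as follows. This is an illustration, not the code from the revision: the 10 hour limit and the "3 consecutive idle reports" rule are taken from the message, while `MAX_BUILD_AGE`, the state names and both function names are placeholders invented here.

```python
import datetime

MAX_INSTANCE_AGE = datetime.timedelta(hours=10)  # ">10hrs" from the message
MAX_BUILD_AGE = datetime.timedelta(hours=4)      # hypothetical threshold
IDLE_REPORTS_BEFORE_ALARM = 3                    # "3+ consecutive reports"


def classify(instance_age, build_ages):
    """Classify one instance from its age and the ages of its running builds.

    Returns "very-long-running", "idle", "long-build" or "ok".
    """
    if instance_age > MAX_INSTANCE_AGE:
        return "very-long-running"  # reported regardless of its builds
    if not build_ages:
        return "idle"  # may be a false positive for a single report
    if any(age > MAX_BUILD_AGE for age in build_ages):
        return "long-build"
    return "ok"


def idle_is_a_problem(recent_states):
    """True when the last three monitoring runs all reported "idle"."""
    tail = recent_states[-IDLE_REPORTS_BEFORE_ALARM:]
    return (len(tail) == IDLE_REPORTS_BEFORE_ALARM
            and all(state == "idle" for state in tail))


hours = lambda n: datetime.timedelta(hours=n)
print(classify(hours(9), [hours(5)]))
print(idle_is_a_problem(["ok", "idle", "idle", "idle"]))
```

With a 20 minute monitoring period, three consecutive idle reports correspond to an instance that has been idle for roughly 40 minutes or more, which is past the 30 minute expiration mentioned above.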

80. By Paul Sokolovsky

Add comment about HTML scraping

79. By Paul Sokolovsky

Handle idle slaves when querying running builds properly.

78. By Paul Sokolovsky

monitor-ec2-build-slaves: Query and show what actual builds are running.

77. By Paul Sokolovsky

Typo fix in comment.

Preview Diff

=== modified file 'monitor-ec2-build-slaves'
--- monitor-ec2-build-slaves 2012-04-06 14:47:27 +0000
+++ monitor-ec2-build-slaves 2012-05-07 15:26:19 +0000
@@ -1,17 +1,20 @@
 #!/usr/bin/env python2.7
 #
-# Note: this requiers recent boto (say, 2.0)
+# Note: this requires recent boto (say, 2.0)
 # Older Ubuntu version don't have recent enough boto, so we instead use
 # non-system python2.7 and install needed packages using easy_install:
 #
-# easy_install-2.7 boto lxml pycrypto
+# easy_install-2.7 boto lxml pycrypto BeautifulSoup
 #
 
 import datetime
 import logging
+import re
+import urllib2
 
 from boto import connect_ec2, connect_s3, ec2, utils
 from lxml.etree import fromstring
+from BeautifulSoup import BeautifulSoup
 
 
 ACTIVE_REGION = "us-east-1"
@@ -42,6 +45,29 @@
     secret_key = get_cleartext(nodes[0].text)
     return access_id, secret_key
 
+
+def print_slave_info(instance_id, host):
+    f = urllib2.urlopen("https://%s/jenkins/computer/%s/" % (host, instance_id))
+    soup = BeautifulSoup(f)
+    n = soup.find(text=re.compile("Labels:")).parent
+    print "Labels:", ", ".join([t.contents[0] for t in n.findAll("a")])
+
+    n = soup.find(text=re.compile("Building"))
+    if not n:
+        print "Idle"
+    else:
+        print "Running builds:"
+        n = n.parent
+        for t in n.findAll("a", href=re.compile(r"/\d+/")):
+            m = re.match(r"/jenkins/job/(.+?)/(\d+?)/console", t["href"])
+            job_name = m.group(1)
+            build_no = m.group(2)
+            print "%s #%s" % (job_name, build_no)
+            m = re.match(r"Started (.+?) ago", t["tooltip"])
+            print "Running for:", m.group(1)
+            print "https://%s/jenkins/job/%s/%s/" % (host, job_name, build_no)
+
+
 if __name__ == "__main__":
     #logging.basicConfig(level=logging.DEBUG)
     key, secret = get_credentials()
@@ -58,4 +84,6 @@
             run_time = now - utils.parse_ts(i.launch_time)
             # print i.id, i.key_name, i.state, i.launch_time, run_time
             if run_time > datetime.timedelta(minutes=RUN_TIME_WARNING):
-                print "Instance %s (%s) is running for too long (%s)!" % (i.id, owner, run_time)
+                print "Build slave %s (%s) is running for too long (%s)!" % (i.id, owner, run_time)
+                print_slave_info(i.id, owner)
+                print
