Merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission into lp:linaro-ci

Proposed by Deepti B. Kalakeri
Status: Merged
Approved by: James Tunnicliffe
Approved revision: 61
Merged at revision: 61
Proposed branch: lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Merge into: lp:linaro-ci
Diff against target: 325 lines (+172/-59)
3 files modified
download_content_yes_to_lic.py (+166/-49)
download_file (+3/-9)
find_latest.py (+3/-1)
To merge this branch: bzr merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Reviewer: James Tunnicliffe (community)
Review status: Approve
Review via email: mp+104737@code.launchpad.net

Description of the change

Fixes the CI LAVA submissions that were failing for the origen and snowball jobs.
Also aligns these scripts with the latest changes for license acceptance and for downloading restricted files.

Revision history for this message
James Tunnicliffe (dooferlad) wrote:

Looks fine. Not critical but I may change this section a little:

=== modified file 'download_file'
--- download_file 2012-01-09 07:18:32 +0000
+++ download_file 2012-05-04 13:33:19 +0000

@@ -18,17 +18,11 @@
 args = parser.parse_args()
 
 fetcher = LicenseProtectedFileFetcher()
-content = fetcher.get(args.url[0])
 
 # Get file name from URL
 file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
-
-# If file name can not be found (for example, we have got a directory
-# index), provide a default.
 if not file_name:
-    file_name = "unnamed.out"
+    file_name = "downloaded"
 
-out = open(file_name, 'w')
-out.write(content)
-out.close()
+fetcher.get(args.url[0], file_name)
 fetcher.close()

Since the script is called download_file and you don't appear to be using the default file name at all (it used to be unnamed.out and is now downloaded; I didn't see any other references to that change), I would probably do this:

if not file_name:
    print >> sys.stderr, "Could not derive file name from URL - aborting"
    exit(1)

You will need to import sys.

It looks like using that script to download something that isn't an explicitly named in the URL file is an error, so you might as well catch it early.
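
For reference, a minimal sketch of download_file with that suggestion applied. It is Python 2, like the rest of the branch; the argument definition and the import of LicenseProtectedFileFetcher from download_content_yes_to_lic are assumptions, since the top of the script is not shown in the diff:

#!/usr/bin/env python
# Sketch only, not the committed script. The import below assumes the
# fetcher class lives in download_content_yes_to_lic.
import argparse
import os
import sys
import urlparse

from download_content_yes_to_lic import LicenseProtectedFileFetcher

parser = argparse.ArgumentParser(description="Download a file, accepting "
                                 "any licenses required to do so.")
parser.add_argument("url", nargs=1, help="URL of the file to download")
args = parser.parse_args()

# Derive the output file name from the URL path.
file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)

if not file_name:
    # A trailing-slash URL (a directory index, for example) has no file
    # component, so fail early instead of writing to a default name.
    print >> sys.stderr, "Could not derive file name from URL - aborting"
    sys.exit(1)

fetcher = LicenseProtectedFileFetcher()
fetcher.get(args.url[0], file_name)
fetcher.close()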

Other than that, looks good.

Just FYI (because my comments weren't updated everywhere), LicenseProtectedFileFetcher.get() will only return the first 1MB of a file. It always stores the full file to disk if file_name is set. I can't see you using it to download large files and expecting them to be returned in this way, so it looks safe, but thought it was worth drawing your attention to the change just in case. I would like to update it to only keep files in RAM if file_name isn't set and throw an exception if the file is too large. Of course, we may not have this script for much longer, so there is little point in me changing it until we know its future. Just thought I would keep you in the loop :-)
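
To make that buffering behaviour concrete, a short usage sketch (the URLs are hypothetical; it assumes the get() API from this branch):

fetcher = LicenseProtectedFileFetcher()

# With file_name set, the complete download is streamed to disk, but the
# returned body is capped at the internal buffer limit.
body = fetcher.get("http://server/dir/image.img.gz", "image.img.gz")

# Without file_name, nothing is written to disk and only the buffered
# portion of the body is available for parsing.
page = fetcher.get("http://server/dir/index.html")

fetcher.close()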

review: Approve
Revision history for this message
Deepti B. Kalakeri (deeptik) wrote:

On Fri, May 4, 2012 at 7:36 PM, James Tunnicliffe <email address hidden> wrote:

> Review: Approve
>
> Looks fine. Not critical but I may change this section a little:
>
> === modified file 'download_file'
> --- download_file 2012-01-09 07:18:32 +0000
> +++ download_file 2012-05-04 13:33:19 +0000
>
> @@ -18,17 +18,11 @@
>  args = parser.parse_args()
> 
>  fetcher = LicenseProtectedFileFetcher()
> -content = fetcher.get(args.url[0])
> 
>  # Get file name from URL
>  file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
> -
> -# If file name can not be found (for example, we have got a directory
> -# index), provide a default.
>  if not file_name:
> -    file_name = "unnamed.out"
> +    file_name = "downloaded"
> 
> -out = open(file_name, 'w')
> -out.write(content)
> -out.close()
> +fetcher.get(args.url[0], file_name)
>  fetcher.close()
>
> Since the script is called download_file and it would appear that you
> aren't using the default file name at all (it used to be unnamed.out, now
> is downloaded, didn't see any more references to that change), I would
> probably do this:
>
> if not file_name:
>     print >> sys.stderr, "Could not derive file name from URL - aborting"
>     exit(1)
>
The file_name could be used later in the build scripts even if not in
download_file, so I would retain it. I will make the changes to use it that
way in another MP and send it for review.

> You will need to import sys.
>
> It looks like using that script to download something that isn't an
> explicitly named in the URL file is an error, so you might as well catch it
> early.
>
I did not get this completely. Can you please elaborate?

> Other than that, looks good.
>
> Just FYI (because my comments weren't updated everywhere),
> LicenseProtectedFileFetcher.get() will only return the first 1MB of a file.
> It always stores the full file to disk if file_name is set. I can't see you
> using it to download large files and expecting them to be returned in this
> way, so it looks safe, but thought it was worth drawing your attention to
> the change just in case. I would like to update it to only keep files in
> RAM if file_name isn't set and throw an exception if the file is too large.
> Of course, we may not have this script for much longer, so there is little
> point in me changing it until we know its future. Just thought I would keep
> you in the loop :-)
> --
>
> https://code.launchpad.net/~deeptik/linaro-ci/fix_bug_994573_lava_submission/+merge/104737
> You are the owner of lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission.
>

--
Thanks and Regards,
Deepti
Infrastructure Team Member, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

Revision history for this message
James Tunnicliffe (dooferlad) wrote:

>> It looks like using that script to download something that isn't an
>> explicitly named in the URL file is an error, so you might as well catch it
>> early.
>>
> I did not get this completely. Can you please elaborate ?

Sorry, didn't manage to use English at that point! You seem to be
using the script to download a file that comes from a URL like
http://server/dir/file. In this case we get the file name from the URL
and save to a file with that name. The potential problem is the
download script will download what a server returns when there is a
trailing slash on the end of a URL, such as a directory listing from
http://server/dir/. In this case it won't know what to name the file,
so it uses a default file name. You don't seem to be using this
default file name, so you could use your knowledge of how you intend
to use the script to modify it to print a helpful error message and
quit in this case (I am assuming that it would be easier to debug this
than downloading to the default file name and failing later).
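
A quick illustration of the two cases (hypothetical URLs, Python 2):

import os
import urlparse

# http://server/dir/file - the path ends in a usable file name.
print os.path.basename(urlparse.urlparse("http://server/dir/file").path)
# -> "file"

# http://server/dir/ - a trailing slash leaves nothing to name the file by.
print os.path.basename(urlparse.urlparse("http://server/dir/").path)
# -> ""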

Hope that is clear!

--
James Tunnicliffe

Preview Diff

=== modified file 'download_content_yes_to_lic.py'
--- download_content_yes_to_lic.py 2012-01-09 07:18:32 +0000
+++ download_content_yes_to_lic.py 2012-05-04 13:33:19 +0000
@@ -1,11 +1,14 @@
+# Changes required to address EULA for the origen hwpacks
+
 #!/usr/bin/env python
 
-# Changes required to address EULA for the origen hwpacks
-
+import argparse
 import os
 import pycurl
 import re
 import urlparse
+import html2text
+from BeautifulSoup import BeautifulSoup
 
 class LicenseProtectedFileFetcher:
     """Fetch a file from the web that may be protected by a license redirect
@@ -27,25 +30,104 @@
     downloads.
 
     """
-    def __init__(self):
+    def __init__(self, cookie_file="cookies.txt"):
         """Set up cURL"""
         self.curl = pycurl.Curl()
-        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
         self.curl.setopt(pycurl.WRITEFUNCTION, self._write_body)
         self.curl.setopt(pycurl.HEADERFUNCTION, self._write_header)
-        self.curl.setopt(pycurl.COOKIEFILE, "cookies.txt")
-        self.curl.setopt(pycurl.COOKIEJAR, "cookies.txt")
+        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
+        self.curl.setopt(pycurl.COOKIEFILE, cookie_file)
+        self.curl.setopt(pycurl.COOKIEJAR, cookie_file)
+        self.file_out = None
 
     def _get(self, url):
         """Clear out header and body storage, fetch URL, filling them in."""
-        self.curl.setopt(pycurl.URL, url)
-
-        self.body = ""
-        self.header = ""
-
-        self.curl.perform()
-
-    def get(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        if self.file_name:
+            self.file_out = open(self.file_name, 'w')
+        else:
+            self.file_out = None
+
+        self.curl.perform()
+        self._parse_headers(url)
+
+        if self.file_out:
+            self.file_out.close()
+
+    def _parse_headers(self, url):
+        header = {}
+        for line in self.header.splitlines():
+            # Header lines typically are of the form thing: value...
+            test_line = re.search("^(.*?)\s*:\s*(.*)$", line)
+
+            if test_line:
+                header[test_line.group(1)] = test_line.group(2)
+
+        # The location attribute is sometimes relative, but we would
+        # like to have it as always absolute...
+
+        if 'Location' in header.keys():
+            parsed_location = urlparse.urlparse(header["Location"])
+
+            # If not an absolute location...
+            if not parsed_location.netloc:
+                parsed_source_url = urlparse.urlparse(url)
+                new_location = ["", "", "", "", ""]
+
+                new_location[0] = parsed_source_url.scheme
+                new_location[1] = parsed_source_url.netloc
+                new_location[2] = header["Location"]
+
+                # Update location with absolute URL
+                header["Location"] = urlparse.urlunsplit(new_location)
+
+        self.header_text = self.header
+        self.header = header
+
+    def get_headers(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        # Setting NOBODY causes CURL to just fetch the header.
+        self.curl.setopt(pycurl.NOBODY, True)
+        self.curl.perform()
+        self.curl.setopt(pycurl.NOBODY, False)
+
+        self._parse_headers(url)
+
+        return self.header
+
+    def get_or_return_license(self, url, file_name=None):
+        """Get file at the requested URL or, if behind a license, return that.
+
+        If the URL provided does not redirect us to a license, then return the
+        body of that file. If we are redirected to a license click through
+        then return (the license as plain text, url to accept the license).
+
+        If the user of this function accepts the license, then they should
+        call get_protected_file."""
+
+        self.file_name = file_name
+
+        # Get the license details. If this returns None, the file isn't license
+        # protected and we can just return the file we started to get in the
+        # function (self.body).
+        license_details = self._get_license(url)
+
+        if license_details:
+            return license_details
+
+        return self.body
+
+    def get(self, url, file_name=None):
         """Fetch the requested URL, accepting licenses, returns file body
 
         Fetches the file at url. If a redirect is encountered, it is
@@ -53,13 +135,34 @@
         then download the original file.
 
         """
-        self._get(url)
-
-        location = self._get_location()
-        if location:
-            # Off to the races - we have been redirected.
-            # Expect to find a link to self.location with -accepted inserted
-            # before the .html, i.e. ste.html -> ste-accepted.html
+
+        self.file_name = file_name
+        license_details = self._get_license(url)
+
+        if license_details:
+            # Found a license. Accept the license without looking at it and
+            # start fetching the file we originally wanted.
+            accept_url = license_details[1]
+            self.get_protected_file(accept_url, url)
+
+        else:
+            # If we got here, there wasn't a license protecting the file
+            # so we just fetch it.
+            self._get(url)
+
+        return self.body
+
+    def _get_license(self, url):
+        """Return (license, accept URL) if found, else return None"""
+
+        self.get_headers(url)
+
+        if "Location" in self.header and self.header["Location"] != url:
+            # We have been redirected to a new location - the license file
+            location = self.header["Location"]
+
+            # Fetch the license HTML
+            self._get(location)
 
             # Get the file from the URL (full path)
             file = urlparse.urlparse(location).path
@@ -68,50 +171,64 @@
             file = os.path.split(file)[-1]
 
             # Look for a link with accepted.html in the page name. Follow it.
-            new_file = None
             for line in self.body.splitlines():
                 link_search = re.search("""href=.*?["'](.*?-accepted.html)""",
                                         line)
                 if link_search:
                     # Have found license accept URL!
                     new_file = link_search.group(1)
-
-            if new_file:
-                # Accept the license...
-                accept_url = re.sub(file, new_file, location)
-                self._get(accept_url)
-
-                # The above get *should* take us to the file requested via
-                # a redirect. If we manually need to follow that redirect,
-                # do that now.
-
-                if self._get_location():
-                    # If we haven't been redirected to our original file,
-                    # we should be able to just download it now.
-                    self._get(url)
-
-        return self.body
-
-    def _search_header(self, field):
-        """Search header for the supplied field, return field / None"""
-        for line in self.header.splitlines():
-            search = re.search(field + ":\s+(.*?)$", line)
-            if search:
-                return search.group(1)
+                    accept_url = re.sub(file, new_file, location)
+
+            # Parse the HTML using BeautifulSoup
+            soup = BeautifulSoup(self.body)
+
+            # The license is in a div with the ID license-text, so we
+            # use this to pull just the license out of the HTML.
+            html_license = u""
+            for chunk in soup.findAll(id="license-text"):
+                # Output of chunk.prettify is UTF8, but comes back
+                # as a str, so convert it here.
+                html_license += chunk.prettify().decode("utf-8")
+
+            text_license = html2text.html2text(html_license)
+
+            return text_license, accept_url
+
         return None
 
-    def _get_location(self):
-        """Return content of Location field in header / None"""
-        return self._search_header("Location")
+    def get_protected_file(self, accept_url, url):
+        """Gets the file redirected to by the accept_url"""
+
+        self._get(accept_url) # Accept the license
+
+        if not("Location" in self.header and self.header["Location"] == url):
+            # If we got here, we don't have the file yet (weren't redirected
+            # to it). Fetch our target file. This should work now that we have
+            # the right cookie.
+            self._get(url) # Download the target file
+
+        return self.body
 
     def _write_body(self, buf):
         """Used by curl as a sink for body content"""
-        self.body += buf
+
+        # If we have a target file to write to, write to it
+        if self.file_out:
+            self.file_out.write(buf)
+
+        # Only buffer first 1MB of body. This should be plenty for anything
+        # we wish to parse internally.
+        if len(self.body) < 1024*1024:
+            self.body += buf
 
     def _write_header(self, buf):
         """Used by curl as a sink for header content"""
         self.header += buf
 
+    def register_progress_callback(self, callback):
+        self.curl.setopt(pycurl.NOPROGRESS, 0)
+        self.curl.setopt(pycurl.PROGRESSFUNCTION, callback)
+
     def close(self):
         """Wrapper to close curl - this will allow curl to write out cookies"""
         self.curl.close()
 
=== modified file 'download_file'
--- download_file 2012-01-09 07:18:32 +0000
+++ download_file 2012-05-04 13:33:19 +0000
@@ -8,7 +8,7 @@
 import urlparse
 import os
 
-#Download file specified on command line
+"""Download file specified on command line"""
 parser = argparse.ArgumentParser(description="Download a file, accepting "
                                  "any licenses required to do so.")
 
@@ -18,17 +18,11 @@
 args = parser.parse_args()
 
 fetcher = LicenseProtectedFileFetcher()
-content = fetcher.get(args.url[0])
 
 # Get file name from URL
 file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
-
-# If file name can not be found (for example, we have got a directory
-# index), provide a default.
 if not file_name:
-    file_name = "unnamed.out"
+    file_name = "downloaded"
 
-out = open(file_name, 'w')
-out.write(content)
-out.close()
+fetcher.get(args.url[0], file_name)
 fetcher.close()
 
=== modified file 'find_latest.py'
--- find_latest.py 2012-04-30 07:40:00 +0000
+++ find_latest.py 2012-05-04 13:33:19 +0000
@@ -127,7 +127,8 @@
     :param url: The base url to search
     :param extra: The extra path needed to complete the url
     """
-    builddates = geturl(url)
+    fetcher = LicenseProtectedFileFetcher()
+    builddates = fetcher.get(url)
     dates = find_ci_builds(builddates)
     dates = sorted(dates, key=lambda x: x[1])
 
@@ -139,4 +140,5 @@
             raise StopIteration()
     except StopIteration:
         pass
+    fetcher.close()
     return filename

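For reference, a sketch of how a caller might drive the license-handling API added in this diff (the URL is hypothetical; the flow follows the class docstrings above):

# Sketch only; shows the intended get_or_return_license() /
# get_protected_file() flow, not code from this branch.
fetcher = LicenseProtectedFileFetcher()
url = "http://server/hwpacks/hwpack_origen.tar.gz"

# Returns either the file body (no license in the way) or a tuple of
# (license text, accept URL) when a click-through page is found.
result = fetcher.get_or_return_license(url, file_name="hwpack_origen.tar.gz")

if isinstance(result, tuple):
    text_license, accept_url = result
    # Show text_license to the user here; fetch only if they accept.
    fetcher.get_protected_file(accept_url, url)

fetcher.close()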