Merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission into lp:linaro-ci
Status: Merged
Approved by: James Tunnicliffe
Approved revision: 61
Merged at revision: 61
Proposed branch: lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Merge into: lp:linaro-ci
Diff against target: 325 lines (+172/-59), 3 files modified:
  download_content_yes_to_lic.py (+166/-49)
  download_file (+3/-9)
  find_latest.py (+3/-1)
To merge this branch: bzr merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Reviewer: James Tunnicliffe (community) - Approve
Review via email: mp+104737@code.launchpad.net
Description of the change
Fixes the CI LAVA submissions that were failing for the origen and snowball jobs.
Also aligns the scripts with the latest changes to license acceptance and restricted-file downloading.
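For context, a minimal sketch of how a build script might drive the fetcher after this change, using the names from the preview diff below; the hwpack URL is a hypothetical placeholder, not a real origen link:

    from download_content_yes_to_lic import LicenseProtectedFileFetcher

    fetcher = LicenseProtectedFileFetcher()
    # get() accepts any license click-through automatically and, when
    # file_name is given, streams the download straight to disk.
    fetcher.get("http://example.org/hwpacks/hwpack_origen.tar.gz",
                "hwpack_origen.tar.gz")
    fetcher.close()  # lets cURL write its cookie jar out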
Deepti B. Kalakeri (deeptik) wrote:
On Fri, May 4, 2012 at 7:36 PM, James Tunnicliffe
<email address hidden> wrote:
> Review: Approve
>
> Looks fine. Not critical but I may change this section a little:
>
> === modified file 'download_file'
> --- download_file 2012-01-09 07:18:32 +0000
> +++ download_file 2012-05-04 13:33:19 +0000
>
> @@ -18,17 +18,11 @@
>  args = parser.parse_args()
>
>  fetcher = LicenseProtectedFileFetcher()
> -content = fetcher.get(args.url[0])
>
>  # Get file name from URL
>  file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
> -
> -# If file name can not be found (for example, we have got a directory
> -# index), provide a default.
>  if not file_name:
> -    file_name = "unnamed.out"
> +    file_name = "downloaded"
>
> -out = open(file_name, 'w')
> -out.write(content)
> -out.close()
> +fetcher.get(args.url[0], file_name)
>  fetcher.close()
>
> Since the script is called download_file and it would appear that you
> aren't using the default file name at all (it used to be unnamed.out, now
> is downloaded, didn't see any more references to that change), I would
> probably do this:
>
> if not file_name:
>     print >> sys.stderr, "Could not derive file name from URL - aborting"
>     exit(1)
>
The file_name could be used later in the build scripts if not in
download_file, so I would retain it. I will make the changes for using
this in another MP and send it for review.
> You will need to import sys.
>
> It looks like using that script to download something that isn't an
> explicitly named in the URL file is an error, so you might as well catch it
> early.
>
I did not get this completely. Can you please elaborate?
> Other than that, looks good.
>
> Just FYI (because my comments weren't updated everywhere),
> LicenseProtectedFileFetcher.get() will only return the first 1MB of a file.
> It always stores the full file to disk if file_name is set. I can't see you
> using it to download large files and expecting them to be returned in this
> way, so it looks safe, but thought it was worth drawing your attention to
> the change just in case. I would like to update it to only keep files in
> RAM if file_name isn't set and throw an exception if the file is too large.
> Of course, we may not have this script for much longer, so there is little
> point in me changing it until we know its future. Just thought I would keep
> you in the loop :-)
> --
> https://code.launchpad.net/~deeptik/linaro-ci/fix_bug_994573_lava_submission/+merge/104737
> You are the owner of lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission.
>
--
Thanks and Regards,
Deepti
Infrastructure Team Member, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
James Tunnicliffe (dooferlad) wrote:
>> It looks like using that script to download something that isn't an
>> explicitly named in the URL file is an error, so you might as well catch it
>> early.
>>
> I did not get this completely. Can you please elaborate?
Sorry, I didn't manage to use English at that point! You seem to be
using the script to download a file from a URL that ends in an explicit
file name, and to save it to a file with that name. The potential
problem is that the download script will download whatever the server
returns when there is a trailing slash on the end of a URL, such as a
directory listing, and in that case it falls back to a default file
name. You don't seem to be using this default file name, so you could
use your knowledge of how you intend to use the script to make it print
a helpful error message and quit in this case (I am assuming that it
would be easier to debug this than downloading to the default file name
and failing later).
Hope that is clear!
--
James Tunnicliffe
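For reference, James's suggested guard would slot into download_file roughly as follows. This is a sketch only: the positional url argument is assumed from the script's argparse setup (only args = parser.parse_args() is visible in the diff), and sys.exit is used in place of the bare exit:

    import argparse
    import os
    import sys
    import urlparse

    parser = argparse.ArgumentParser(description="Download a file, accepting "
                                     "any licenses required to do so.")
    # Assumed: the real add_argument call is outside the quoted diff context.
    parser.add_argument("url", nargs=1, help="URL of the file to download")
    args = parser.parse_args()

    # Derive the output name from the URL path, as download_file already does.
    file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)

    if not file_name:
        # Trailing-slash URL (e.g. a directory index): no name can be
        # derived, so fail early instead of downloading under a default
        # name and failing later.
        print >> sys.stderr, "Could not derive file name from URL - aborting"
        sys.exit(1)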
Preview Diff
=== modified file 'download_content_yes_to_lic.py'
--- download_content_yes_to_lic.py 2012-01-09 07:18:32 +0000
+++ download_content_yes_to_lic.py 2012-05-04 13:33:19 +0000
@@ -1,11 +1,14 @@
+# Changes required to address EULA for the origen hwpacks
+
 #!/usr/bin/env python
 
-# Changes required to address EULA for the origen hwpacks
-
+import argparse
 import os
 import pycurl
 import re
 import urlparse
+import html2text
+from BeautifulSoup import BeautifulSoup
 
 class LicenseProtectedFileFetcher:
     """Fetch a file from the web that may be protected by a license redirect
@@ -27,25 +30,104 @@
     downloads.
 
     """
-    def __init__(self):
+    def __init__(self, cookie_file="cookies.txt"):
         """Set up cURL"""
         self.curl = pycurl.Curl()
-        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
         self.curl.setopt(pycurl.WRITEFUNCTION, self._write_body)
         self.curl.setopt(pycurl.HEADERFUNCTION, self._write_header)
-        self.curl.setopt(pycurl.COOKIEFILE, "cookies.txt")
-        self.curl.setopt(pycurl.COOKIEJAR, "cookies.txt")
+        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
+        self.curl.setopt(pycurl.COOKIEFILE, cookie_file)
+        self.curl.setopt(pycurl.COOKIEJAR, cookie_file)
+        self.file_out = None
 
     def _get(self, url):
         """Clear out header and body storage, fetch URL, filling them in."""
-        self.curl.setopt(pycurl.URL, url)
-
-        self.body = ""
-        self.header = ""
-
-        self.curl.perform()
-
-    def get(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        if self.file_name:
+            self.file_out = open(self.file_name, 'w')
+        else:
+            self.file_out = None
+
+        self.curl.perform()
+        self._parse_headers(url)
+
+        if self.file_out:
+            self.file_out.close()
+
+    def _parse_headers(self, url):
+        header = {}
+        for line in self.header.splitlines():
+            # Header lines typically are of the form thing: value...
+            test_line = re.search("^(.*?)\s*:\s*(.*)$", line)
+
+            if test_line:
+                header[test_line.group(1)] = test_line.group(2)
+
+        # The location attribute is sometimes relative, but we would
+        # like to have it as always absolute...
+
+        if 'Location' in header.keys():
+            parsed_location = urlparse.urlparse(header["Location"])
+
+            # If not an absolute location...
+            if not parsed_location.netloc:
+                parsed_source_url = urlparse.urlparse(url)
+                new_location = ["", "", "", "", ""]
+
+                new_location[0] = parsed_source_url.scheme
+                new_location[1] = parsed_source_url.netloc
+                new_location[2] = header["Location"]
+
+                # Update location with absolute URL
+                header["Location"] = urlparse.urlunsplit(new_location)
+
+        self.header_text = self.header
+        self.header = header
+
+    def get_headers(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        # Setting NOBODY causes CURL to just fetch the header.
+        self.curl.setopt(pycurl.NOBODY, True)
+        self.curl.perform()
+        self.curl.setopt(pycurl.NOBODY, False)
+
+        self._parse_headers(url)
+
+        return self.header
+
+    def get_or_return_license(self, url, file_name=None):
+        """Get file at the requested URL or, if behind a license, return that.
+
+        If the URL provided does not redirect us to a license, then return the
+        body of that file. If we are redirected to a license click through
+        then return (the license as plain text, url to accept the license).
+
+        If the user of this function accepts the license, then they should
+        call get_protected_file."""
+
+        self.file_name = file_name
+
+        # Get the license details. If this returns None, the file isn't license
+        # protected and we can just return the file we started to get in the
+        # function (self.body).
+        license_details = self._get_license(url)
+
+        if license_details:
+            return license_details
+
+        return self.body
+
+    def get(self, url, file_name=None):
         """Fetch the requested URL, accepting licenses, returns file body
 
         Fetches the file at url. If a redirect is encountered, it is
@@ -53,13 +135,34 @@
         then download the original file.
 
         """
-        self._get(url)
-
-        location = self._get_location()
-        if location:
-            # Off to the races - we have been redirected.
-            # Expect to find a link to self.location with -accepted inserted
-            # before the .html, i.e. ste.html -> ste-accepted.html
+
+        self.file_name = file_name
+        license_details = self._get_license(url)
+
+        if license_details:
+            # Found a license. Accept the license without looking at it and
+            # start fetching the file we originally wanted.
+            accept_url = license_details[1]
+            self.get_protected_file(accept_url, url)
+
+        else:
+            # If we got here, there wasn't a license protecting the file
+            # so we just fetch it.
+            self._get(url)
+
+        return self.body
+
+    def _get_license(self, url):
+        """Return (license, accept URL) if found, else return None"""
+
+        self.get_headers(url)
+
+        if "Location" in self.header and self.header["Location"] != url:
+            # We have been redirected to a new location - the license file
+            location = self.header["Location"]
+
+            # Fetch the license HTML
+            self._get(location)
 
             # Get the file from the URL (full path)
             file = urlparse.urlparse(location).path
@@ -68,50 +171,64 @@
             file = os.path.split(file)[-1]
 
             # Look for a link with accepted.html in the page name. Follow it.
-            new_file = None
             for line in self.body.splitlines():
                 link_search = re.search("""href=.*?["'](.*?-accepted.html)""",
                     line)
                 if link_search:
                     # Have found license accept URL!
                     new_file = link_search.group(1)
-
-        if new_file:
-            # Accept the license...
-            accept_url = re.sub(file, new_file, location)
-            self._get(accept_url)
-
-            # The above get *should* take us to the file requested via
-            # a redirect. If we manually need to follow that redirect,
-            # do that now.
-
-            if self._get_location():
-                # If we haven't been redirected to our original file,
-                # we should be able to just download it now.
-                self._get(url)
-
-        return self.body
-
-    def _search_header(self, field):
-        """Search header for the supplied field, return field / None"""
-        for line in self.header.splitlines():
-            search = re.search(field + ":\s+(.*?)$", line)
-            if search:
-                return search.group(1)
+                    accept_url = re.sub(file, new_file, location)
+
+            # Parse the HTML using BeautifulSoup
+            soup = BeautifulSoup(self.body)
+
+            # The license is in a div with the ID license-text, so we
+            # use this to pull just the license out of the HTML.
+            html_license = u""
+            for chunk in soup.findAll(id="license-text"):
+                # Output of chunk.prettify is UTF8, but comes back
+                # as a str, so convert it here.
+                html_license += chunk.prettify().decode("utf-8")
+
+            text_license = html2text.html2text(html_license)
+
+            return text_license, accept_url
+
         return None
 
-    def _get_location(self):
-        """Return content of Location field in header / None"""
-        return self._search_header("Location")
+    def get_protected_file(self, accept_url, url):
+        """Gets the file redirected to by the accept_url"""
+
+        self._get(accept_url)  # Accept the license
+
+        if not("Location" in self.header and self.header["Location"] == url):
+            # If we got here, we don't have the file yet (weren't redirected
+            # to it). Fetch our target file. This should work now that we have
+            # the right cookie.
+            self._get(url)  # Download the target file
+
+        return self.body
 
     def _write_body(self, buf):
         """Used by curl as a sink for body content"""
-        self.body += buf
+
+        # If we have a target file to write to, write to it
+        if self.file_out:
+            self.file_out.write(buf)
+
+        # Only buffer first 1MB of body. This should be plenty for anything
+        # we wish to parse internally.
+        if len(self.body) < 1024*1024*1024:
+            self.body += buf
 
     def _write_header(self, buf):
         """Used by curl as a sink for header content"""
         self.header += buf
 
+    def register_progress_callback(self, callback):
+        self.curl.setopt(pycurl.NOPROGRESS, 0)
+        self.curl.setopt(pycurl.PROGRESSFUNCTION, callback)
+
     def close(self):
         """Wrapper to close curl - this will allow curl to write out cookies"""
         self.curl.close()

=== modified file 'download_file'
--- download_file 2012-01-09 07:18:32 +0000
+++ download_file 2012-05-04 13:33:19 +0000
@@ -8,7 +8,7 @@
 import urlparse
 import os
 
-#Download file specified on command line
+"""Download file specified on command line"""
 parser = argparse.ArgumentParser(description="Download a file, accepting "
                                  "any licenses required to do so.")
 
@@ -18,17 +18,11 @@
 args = parser.parse_args()
 
 fetcher = LicenseProtectedFileFetcher()
-content = fetcher.get(args.url[0])
 
 # Get file name from URL
 file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
-
-# If file name can not be found (for example, we have got a directory
-# index), provide a default.
 if not file_name:
-    file_name = "unnamed.out"
+    file_name = "downloaded"
 
-out = open(file_name, 'w')
-out.write(content)
-out.close()
+fetcher.get(args.url[0], file_name)
 fetcher.close()

=== modified file 'find_latest.py'
--- find_latest.py 2012-04-30 07:40:00 +0000
+++ find_latest.py 2012-05-04 13:33:19 +0000
@@ -127,7 +127,8 @@
     :param url: The base url to search
     :param extra: The extra path needed to complete the url
     """
-    builddates = geturl(url)
+    fetcher = LicenseProtectedFileFetcher()
+    builddates = fetcher.get(url)
     dates = find_ci_builds(builddates)
     dates = sorted(dates, key=lambda x: x[1])
 
@@ -139,4 +140,5 @@
         raise StopIteration()
     except StopIteration:
         pass
+    fetcher.close()
     return filename
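As a usage note, the two-step interface added above (get_or_return_license plus get_protected_file) lets a caller show the license before accepting it, instead of the blanket acceptance that get() performs. A sketch following the docstrings above; the URL is a hypothetical placeholder:

    url = "http://example.org/hwpacks/hwpack_origen.tar.gz"
    fetcher = LicenseProtectedFileFetcher()

    # Returns either the file body (no license in the way) or a
    # (license_text, accept_url) tuple, per the docstring.
    result = fetcher.get_or_return_license(url,
                                           file_name="hwpack_origen.tar.gz")

    if isinstance(result, tuple):
        license_text, accept_url = result
        print license_text  # let the user read it before accepting
        fetcher.get_protected_file(accept_url, url)

    fetcher.close()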