Merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission into lp:linaro-ci

Proposed by Deepti B. Kalakeri
Status: Merged
Approved by: James Tunnicliffe
Approved revision: 61
Merged at revision: 61
Proposed branch: lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Merge into: lp:linaro-ci
Diff against target: 325 lines (+172/-59)
3 files modified
download_content_yes_to_lic.py (+166/-49)
download_file (+3/-9)
find_latest.py (+3/-1)
To merge this branch: bzr merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Reviewer Review Type Date Requested Status
James Tunnicliffe (community) Approve
Review via email: mp+104737@code.launchpad.net

Description of the change

Fix the CI LAVA submissions that were failing for the origen and snowball jobs.
Also, align with the latest changes for license acceptance and downloading of restricted files.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Looks fine. Not critical but I may change this section a little:

=== modified file 'download_file'
--- download_file 2012-01-09 07:18:32 +0000
+++ download_file 2012-05-04 13:33:19 +0000

@@ -18,17 +18,11 @@
 args = parser.parse_args()

 fetcher = LicenseProtectedFileFetcher()
-content = fetcher.get(args.url[0])

 # Get file name from URL
 file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
-
-# If file name can not be found (for example, we have got a directory
-# index), provide a default.
 if not file_name:
-    file_name = "unnamed.out"
+    file_name = "downloaded"

-out = open(file_name, 'w')
-out.write(content)
-out.close()
+fetcher.get(args.url[0], file_name)
 fetcher.close()

Since the script is called download_file and it would appear that you aren't using the default file name at all (it used to be unnamed.out and is now downloaded; I didn't see any other references to that change), I would probably do this:

if not file_name:
    print >> sys.stderr, "Could not derive file name from URL - aborting"
    exit(1)

You will need to import sys.

It looks like using that script to download something that isn't an explicitly named in the URL file is an error, so you might as well catch it early.
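
Roughly, the tail of the script would then look like this (an untested
sketch; it assumes the script's existing imports plus the new "import sys"):

    # Get file name from URL
    file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
    if not file_name:
        # The URL ended in a slash (a directory index, say), so there is
        # nothing sensible to name the file - fail loudly and early.
        print >> sys.stderr, "Could not derive file name from URL - aborting"
        sys.exit(1)

    fetcher.get(args.url[0], file_name)
    fetcher.close()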

Other than that, looks good.

Just FYI (because my comments weren't updated everywhere), LicenseProtectedFileFetcher.get() will only return the first 1MB of a file. It always stores the full file to disk if file_name is set. I can't see you using it to download large files and expecting them to be returned in this way, so it looks safe, but I thought it was worth drawing your attention to the change just in case.

I would like to update it to only keep files in RAM if file_name isn't set, and to throw an exception if the file is too large. Of course, we may not have this script for much longer, so there is little point in me changing it until we know its future. Just thought I would keep you in the loop :-)
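
If I do get around to that, it would be something along these lines (a
sketch of the idea only, with a made-up MAX_IN_RAM constant; this is not
code in this branch):

    MAX_IN_RAM = 1024 * 1024  # 1MB cap for bodies kept in memory

    def _write_body(self, buf):
        """Used by curl as a sink for body content"""
        if self.file_out:
            # Streaming to disk - no need to hold the body in RAM as well.
            self.file_out.write(buf)
        elif len(self.body) + len(buf) > MAX_IN_RAM:
            # No target file and the body has outgrown the RAM budget.
            raise RuntimeError("File too large to hold in RAM; "
                               "pass file_name to stream it to disk")
        else:
            self.body += buf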

review: Approve
Revision history for this message
Deepti B. Kalakeri (deeptik) wrote :

On Fri, May 4, 2012 at 7:36 PM, James Tunnicliffe
<email address hidden> wrote:

> Review: Approve
>
> Looks fine. Not critical but I may change this section a little:
>
> === modified file 'download_file'
> --- download_file 2012-01-09 07:18:32 +0000
> +++ download_file 2012-05-04 13:33:19 +0000
>
> @@ -18,17 +18,11 @@
> args = parser.parse_args()
>
> fetcher = LicenseProtectedFileFetcher()
> -content = fetcher.get(args.url[0])
>
> # Get file name from URL
> file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
> -
> -# If file name can not be found (for example, we have got a directory
> -# index), provide a default.
> if not file_name:
> -    file_name = "unnamed.out"
> +    file_name = "downloaded"
>
> -out = open(file_name, 'w')
> -out.write(content)
> -out.close()
> +fetcher.get(args.url[0], file_name)
> fetcher.close()
>
> Since the script is called download_file and it would appear that you
> aren't using the default file name at all (it used to be unnamed.out, now
> is downloaded, didn't see any more references to that change), I would
> probably do this:
>
> if not file_name:
>     print >> sys.stderr, "Could not derive file name from URL - aborting"
>     exit(1)
>
The file_name could be used later in the build scripts, even if not in
download_file, so I would retain it. I will make the changes to use it
there in another MP and send that for review.

> You will need to import sys.
>
> It looks like using that script to download something that isn't an
> explicitly named in the URL file is an error, so you might as well catch it
> early.
>
I did not get this completely. Can you please elaborate?

> Other than that, looks good.
>
> Just FYI (because my comments weren't updated everywhere),
> LicenseProtectedFileFetcher.get() will only return the first 1MB of a file.
> It always stores the full file to disk if file_name is set. I can't see you
> using it to download large files and expecting them to be returned in this
> way, so it looks safe, but thought it was worth drawing your attention to
> the change just in case. I would like to update it to only keep files in
> RAM if file_name isn't set and throw an exception if the file is too large.
> Of course, we may not have this script for much longer, so there is little
> point in me changing it until we know its future. Just thought I would keep
> you in the loop :-)
> --
>
> https://code.launchpad.net/~deeptik/linaro-ci/fix_bug_994573_lava_submission/+merge/104737
> You are the owner of lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission.
>

--
Thanks and Regards,
Deepti
Infrastructure Team Member, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

>> It looks like using that script to download something that isn't an
>> explicitly named in the URL file is an error, so you might as well catch it
>> early.
>>
> I did not get this completely. Can you please elaborate?

Sorry, didn't manage to use English at that point! You seem to be
using the script to download a file that comes from a URL like
http://server/dir/file. In this case we get the file name from the URL
and save to a file with that name. The potential problem is the
download script will download what a server returns when there is a
trailing slash on the end of a URL, such as a directory listing from
http://server/dir/. In this case it won't know what to name the file,
so it uses a default file name. You don't seem to be using this
default file name, so you could use your knowledge of how you intend
to use the script to modify it to print a helpful error message and
quit in this case (I am assuming that it would be easier to debug this
than downloading to the default file name and failing later).
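
For example (Python 2 urlparse, as the script uses):

    import os
    import urlparse

    # A URL that names a file gives us something usable...
    print os.path.basename(urlparse.urlparse("http://server/dir/file").path)
    # -> "file"

    # ...but a trailing slash (a directory index) leaves nothing to use.
    print os.path.basename(urlparse.urlparse("http://server/dir/").path)
    # -> ""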

Hope that is clear!

--
James Tunnicliffe

Preview Diff

=== modified file 'download_content_yes_to_lic.py'
--- download_content_yes_to_lic.py 2012-01-09 07:18:32 +0000
+++ download_content_yes_to_lic.py 2012-05-04 13:33:19 +0000
@@ -1,11 +1,14 @@
+# Changes required to address EULA for the origen hwpacks
+
 #!/usr/bin/env python
 
-# Changes required to address EULA for the origen hwpacks
-
+import argparse
 import os
 import pycurl
 import re
 import urlparse
+import html2text
+from BeautifulSoup import BeautifulSoup
 
 class LicenseProtectedFileFetcher:
     """Fetch a file from the web that may be protected by a license redirect
@@ -27,25 +30,104 @@
     downloads.
 
     """
-    def __init__(self):
+    def __init__(self, cookie_file="cookies.txt"):
         """Set up cURL"""
         self.curl = pycurl.Curl()
-        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
         self.curl.setopt(pycurl.WRITEFUNCTION, self._write_body)
         self.curl.setopt(pycurl.HEADERFUNCTION, self._write_header)
-        self.curl.setopt(pycurl.COOKIEFILE, "cookies.txt")
-        self.curl.setopt(pycurl.COOKIEJAR, "cookies.txt")
+        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
+        self.curl.setopt(pycurl.COOKIEFILE, cookie_file)
+        self.curl.setopt(pycurl.COOKIEJAR, cookie_file)
+        self.file_out = None
 
     def _get(self, url):
         """Clear out header and body storage, fetch URL, filling them in."""
-        self.curl.setopt(pycurl.URL, url)
-
-        self.body = ""
-        self.header = ""
-
-        self.curl.perform()
-
-    def get(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        if self.file_name:
+            self.file_out = open(self.file_name, 'w')
+        else:
+            self.file_out = None
+
+        self.curl.perform()
+        self._parse_headers(url)
+
+        if self.file_out:
+            self.file_out.close()
+
+    def _parse_headers(self, url):
+        header = {}
+        for line in self.header.splitlines():
+            # Header lines typically are of the form thing: value...
+            test_line = re.search("^(.*?)\s*:\s*(.*)$", line)
+
+            if test_line:
+                header[test_line.group(1)] = test_line.group(2)
+
+        # The location attribute is sometimes relative, but we would
+        # like to have it as always absolute...
+
+        if 'Location' in header.keys():
+            parsed_location = urlparse.urlparse(header["Location"])
+
+            # If not an absolute location...
+            if not parsed_location.netloc:
+                parsed_source_url = urlparse.urlparse(url)
+                new_location = ["", "", "", "", ""]
+
+                new_location[0] = parsed_source_url.scheme
+                new_location[1] = parsed_source_url.netloc
+                new_location[2] = header["Location"]
+
+                # Update location with absolute URL
+                header["Location"] = urlparse.urlunsplit(new_location)
+
+        self.header_text = self.header
+        self.header = header
+
+    def get_headers(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        # Setting NOBODY causes CURL to just fetch the header.
+        self.curl.setopt(pycurl.NOBODY, True)
+        self.curl.perform()
+        self.curl.setopt(pycurl.NOBODY, False)
+
+        self._parse_headers(url)
+
+        return self.header
+
+    def get_or_return_license(self, url, file_name=None):
+        """Get file at the requested URL or, if behind a license, return that.
+
+        If the URL provided does not redirect us to a license, then return the
+        body of that file. If we are redirected to a license click through
+        then return (the license as plain text, url to accept the license).
+
+        If the user of this function accepts the license, then they should
+        call get_protected_file."""
+
+        self.file_name = file_name
+
+        # Get the license details. If this returns None, the file isn't license
+        # protected and we can just return the file we started to get in the
+        # function (self.body).
+        license_details = self._get_license(url)
+
+        if license_details:
+            return license_details
+
+        return self.body
+
+    def get(self, url, file_name=None):
         """Fetch the requested URL, accepting licenses, returns file body
 
         Fetches the file at url. If a redirect is encountered, it is
@@ -53,13 +135,34 @@
         then download the original file.
 
         """
-        self._get(url)
-
-        location = self._get_location()
-        if location:
-            # Off to the races - we have been redirected.
-            # Expect to find a link to self.location with -accepted inserted
-            # before the .html, i.e. ste.html -> ste-accepted.html
+
+        self.file_name = file_name
+        license_details = self._get_license(url)
+
+        if license_details:
+            # Found a license. Accept the license without looking at it and
+            # start fetching the file we originally wanted.
+            accept_url = license_details[1]
+            self.get_protected_file(accept_url, url)
+
+        else:
+            # If we got here, there wasn't a license protecting the file
+            # so we just fetch it.
+            self._get(url)
+
+        return self.body
+
+    def _get_license(self, url):
+        """Return (license, accept URL) if found, else return None"""
+
+        self.get_headers(url)
+
+        if "Location" in self.header and self.header["Location"] != url:
+            # We have been redirected to a new location - the license file
+            location = self.header["Location"]
+
+            # Fetch the license HTML
+            self._get(location)
 
             # Get the file from the URL (full path)
             file = urlparse.urlparse(location).path
@@ -68,50 +171,64 @@
             file = os.path.split(file)[-1]
 
             # Look for a link with accepted.html in the page name. Follow it.
-            new_file = None
             for line in self.body.splitlines():
                 link_search = re.search("""href=.*?["'](.*?-accepted.html)""",
                                         line)
                 if link_search:
                     # Have found license accept URL!
                     new_file = link_search.group(1)
-
-            if new_file:
-                # Accept the license...
-                accept_url = re.sub(file, new_file, location)
-                self._get(accept_url)
-
-                # The above get *should* take us to the file requested via
-                # a redirect. If we manually need to follow that redirect,
-                # do that now.
-
-                if self._get_location():
-                    # If we haven't been redirected to our original file,
-                    # we should be able to just download it now.
-                    self._get(url)
-
-        return self.body
+                    accept_url = re.sub(file, new_file, location)
+
+                    # Parse the HTML using BeautifulSoup
+                    soup = BeautifulSoup(self.body)
+
+                    # The license is in a div with the ID license-text, so we
+                    # use this to pull just the license out of the HTML.
+                    html_license = u""
+                    for chunk in soup.findAll(id="license-text"):
+                        # Output of chunk.prettify is UTF8, but comes back
+                        # as a str, so convert it here.
+                        html_license += chunk.prettify().decode("utf-8")
+
+                    text_license = html2text.html2text(html_license)
+
+                    return text_license, accept_url
 
-    def _search_header(self, field):
-        """Search header for the supplied field, return field / None"""
-        for line in self.header.splitlines():
-            search = re.search(field + ":\s+(.*?)$", line)
-            if search:
-                return search.group(1)
         return None
 
-    def _get_location(self):
-        """Return content of Location field in header / None"""
-        return self._search_header("Location")
+    def get_protected_file(self, accept_url, url):
+        """Gets the file redirected to by the accept_url"""
+
+        self._get(accept_url)  # Accept the license
+
+        if not("Location" in self.header and self.header["Location"] == url):
+            # If we got here, we don't have the file yet (weren't redirected
+            # to it). Fetch our target file. This should work now that we have
+            # the right cookie.
+            self._get(url)  # Download the target file
+
+        return self.body
 
     def _write_body(self, buf):
         """Used by curl as a sink for body content"""
-        self.body += buf
+
+        # If we have a target file to write to, write to it
+        if self.file_out:
+            self.file_out.write(buf)
+
+        # Only buffer first 1MB of body. This should be plenty for anything
+        # we wish to parse internally.
+        if len(self.body) < 1024*1024*1024:
+            self.body += buf
 
     def _write_header(self, buf):
         """Used by curl as a sink for header content"""
         self.header += buf
 
+    def register_progress_callback(self, callback):
+        self.curl.setopt(pycurl.NOPROGRESS, 0)
+        self.curl.setopt(pycurl.PROGRESSFUNCTION, callback)
+
     def close(self):
         """Wrapper to close curl - this will allow curl to write out cookies"""
         self.curl.close()

=== modified file 'download_file'
--- download_file 2012-01-09 07:18:32 +0000
+++ download_file 2012-05-04 13:33:19 +0000
@@ -8,7 +8,7 @@
 import urlparse
 import os
 
-#Download file specified on command line
+"""Download file specified on command line"""
 parser = argparse.ArgumentParser(description="Download a file, accepting "
     "any licenses required to do so.")
 
@@ -18,17 +18,11 @@
 args = parser.parse_args()
 
 fetcher = LicenseProtectedFileFetcher()
-content = fetcher.get(args.url[0])
 
 # Get file name from URL
 file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
-
-# If file name can not be found (for example, we have got a directory
-# index), provide a default.
 if not file_name:
-    file_name = "unnamed.out"
+    file_name = "downloaded"
 
-out = open(file_name, 'w')
-out.write(content)
-out.close()
+fetcher.get(args.url[0], file_name)
 fetcher.close()

=== modified file 'find_latest.py'
--- find_latest.py 2012-04-30 07:40:00 +0000
+++ find_latest.py 2012-05-04 13:33:19 +0000
@@ -127,7 +127,8 @@
     :param url: The base url to search
     :param extra: The extra path needed to complete the url
     """
-    builddates = geturl(url)
+    fetcher = LicenseProtectedFileFetcher()
+    builddates = fetcher.get(url)
     dates = find_ci_builds(builddates)
     dates = sorted(dates, key=lambda x: x[1])
 
@@ -139,4 +140,5 @@
             raise StopIteration()
         except StopIteration:
             pass
+    fetcher.close()
     return filename
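
For reference, the reworked fetcher above is driven roughly like this (an
illustrative sketch only; the import path and URL here are assumptions,
not part of the merge):

    from download_content_yes_to_lic import LicenseProtectedFileFetcher

    url = "http://server/dir/file"  # hypothetical download URL
    fetcher = LicenseProtectedFileFetcher()

    # One-shot: accept any license redirect and stream the file to disk.
    fetcher.get(url, "file")

    # Two-step: surface the license text to the caller before accepting.
    result = fetcher.get_or_return_license(url, "file")
    if isinstance(result, tuple):
        license_text, accept_url = result
        # ...show license_text to the user, then, on acceptance:
        fetcher.get_protected_file(accept_url, url)

    fetcher.close()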
