Merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission into lp:linaro-ci
Status: Merged
Approved by: James Tunnicliffe
Approved revision: 61
Merged at revision: 61
Proposed branch: lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Merge into: lp:linaro-ci
Diff against target: 325 lines (+172/-59), 3 files modified:
  download_content_yes_to_lic.py (+166/-49)
  download_file (+3/-9)
  find_latest.py (+3/-1)
To merge this branch: bzr merge lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission
Reviewer: James Tunnicliffe (community) - Approve
Review via email: mp+104737@code.launchpad.net
Description of the change
Fixes the CI LAVA submissions that were failing for the origen and snowball jobs.
Also aligns the scripts with the latest changes to license acceptance and restricted-file downloading.
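For context, a minimal sketch of how a build script might drive the fetcher after this change, using the names from the preview diff below; the hwpack URL is a hypothetical placeholder, not a real origen link:

    from download_content_yes_to_lic import LicenseProtectedFileFetcher

    fetcher = LicenseProtectedFileFetcher()
    # get() accepts any license click-through automatically and, when
    # file_name is given, streams the download straight to disk.
    fetcher.get("http://example.org/hwpacks/hwpack_origen.tar.gz",
                "hwpack_origen.tar.gz")
    fetcher.close()  # lets cURL write its cookie jar out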
Deepti B. Kalakeri (deeptik) wrote:
On Fri, May 4, 2012 at 7:36 PM, James Tunnicliffe
<email address hidden> wrote:
> Review: Approve
>
> Looks fine. Not critical but I may change this section a little:
>
> === modified file 'download_file'
> --- download_file 2012-01-09 07:18:32 +0000
> +++ download_file 2012-05-04 13:33:19 +0000
>
> @@ -18,17 +18,11 @@
>  args = parser.parse_args()
>
>  fetcher = LicenseProtectedFileFetcher()
> -content = fetcher.get(args.url[0])
>
>  # Get file name from URL
>  file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
> -
> -# If file name can not be found (for example, we have got a directory
> -# index), provide a default.
>  if not file_name:
> -    file_name = "unnamed.out"
> +    file_name = "downloaded"
>
> -out = open(file_name, 'w')
> -out.write(content)
> -out.close()
> +fetcher.get(args.url[0], file_name)
>  fetcher.close()
>
> Since the script is called download_file and it would appear that you
> aren't using the default file name at all (it used to be unnamed.out, now
> is downloaded, didn't see any more references to that change), I would
> probably do this:
>
> if not file_name:
>     print >> sys.stderr, "Could not derive file name from URL - aborting"
>     exit(1)
>
The file_name could be used later in the build scripts if not in
download_file, so I would retain it. I will make the changes for using
this in another MP and send it for review.
> You will need to import sys.
>
> It looks like using that script to download something that isn't an
> explicitly named in the URL file is an error, so you might as well catch it
> early.
>
I did not get this completely. Can you please elaborate?
> Other than that, looks good.
>
> Just FYI (because my comments weren't updated everywhere),
> LicenseProtectedFileFetcher.get() will only return the first 1MB of a file.
> It always stores the full file to disk if file_name is set. I can't see you
> using it to download large files and expecting them to be returned in this
> way, so it looks safe, but thought it was worth drawing your attention to
> the change just in case. I would like to update it to only keep files in
> RAM if file_name isn't set and throw an exception if the file is too large.
> Of course, we may not have this script for much longer, so there is little
> point in me changing it until we know its future. Just thought I would keep
> you in the loop :-)
> --
> https://code.launchpad.net/~deeptik/linaro-ci/fix_bug_994573_lava_submission/+merge/104737
> You are the owner of lp:~deeptik/linaro-ci/fix_bug_994573_lava_submission.
>
--
Thanks and Regards,
Deepti
Infrastructure Team Member, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
James Tunnicliffe (dooferlad) wrote:
>> It looks like using that script to download something that isn't an
>> explicitly named in the URL file is an error, so you might as well catch it
>> early.
>>
> I did not get this completely. Can you please elaborate?
Sorry, I didn't manage to use English at that point! You seem to be
using the script to download a file from a URL that ends in an explicit
file name, and to save it to a file with that name. The potential
problem is that the download script will download whatever the server
returns when there is a trailing slash on the end of a URL, such as a
directory listing, and in that case it falls back to a default file
name. You don't seem to be using this default file name, so you could
use your knowledge of how you intend to use the script to make it print
a helpful error message and quit in this case (I am assuming that it
would be easier to debug this than downloading to the default file name
and failing later).
Hope that is clear!
--
James Tunnicliffe
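For reference, James's suggested guard would slot into download_file roughly as follows. This is a sketch only: the positional url argument is assumed from the script's argparse setup (only args = parser.parse_args() is visible in the diff), and sys.exit is used in place of the bare exit:

    import argparse
    import os
    import sys
    import urlparse

    parser = argparse.ArgumentParser(description="Download a file, accepting "
                                     "any licenses required to do so.")
    # Assumed: the real add_argument call is outside the quoted diff context.
    parser.add_argument("url", nargs=1, help="URL of the file to download")
    args = parser.parse_args()

    # Derive the output name from the URL path, as download_file already does.
    file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)

    if not file_name:
        # Trailing-slash URL (e.g. a directory index): no name can be
        # derived, so fail early instead of downloading under a default
        # name and failing later.
        print >> sys.stderr, "Could not derive file name from URL - aborting"
        sys.exit(1)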
Preview Diff
=== modified file 'download_content_yes_to_lic.py'
--- download_content_yes_to_lic.py 2012-01-09 07:18:32 +0000
+++ download_content_yes_to_lic.py 2012-05-04 13:33:19 +0000
@@ -1,11 +1,14 @@
+# Changes required to address EULA for the origen hwpacks
+
 #!/usr/bin/env python
 
-# Changes required to address EULA for the origen hwpacks
-
+import argparse
 import os
 import pycurl
 import re
 import urlparse
+import html2text
+from BeautifulSoup import BeautifulSoup
 
 class LicenseProtectedFileFetcher:
     """Fetch a file from the web that may be protected by a license redirect
@@ -27,25 +30,104 @@
     downloads.
 
     """
-    def __init__(self):
+    def __init__(self, cookie_file="cookies.txt"):
         """Set up cURL"""
         self.curl = pycurl.Curl()
-        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
         self.curl.setopt(pycurl.WRITEFUNCTION, self._write_body)
         self.curl.setopt(pycurl.HEADERFUNCTION, self._write_header)
-        self.curl.setopt(pycurl.COOKIEFILE, "cookies.txt")
-        self.curl.setopt(pycurl.COOKIEJAR, "cookies.txt")
+        self.curl.setopt(pycurl.FOLLOWLOCATION, 1)
+        self.curl.setopt(pycurl.COOKIEFILE, cookie_file)
+        self.curl.setopt(pycurl.COOKIEJAR, cookie_file)
+        self.file_out = None
 
     def _get(self, url):
         """Clear out header and body storage, fetch URL, filling them in."""
-        self.curl.setopt(pycurl.URL, url)
-
-        self.body = ""
-        self.header = ""
-
-        self.curl.perform()
-
-    def get(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        if self.file_name:
+            self.file_out = open(self.file_name, 'w')
+        else:
+            self.file_out = None
+
+        self.curl.perform()
+        self._parse_headers(url)
+
+        if self.file_out:
+            self.file_out.close()
+
+    def _parse_headers(self, url):
+        header = {}
+        for line in self.header.splitlines():
+            # Header lines typically are of the form thing: value...
+            test_line = re.search("^(.*?)\s*:\s*(.*)$", line)
+
+            if test_line:
+                header[test_line.group(1)] = test_line.group(2)
+
+        # The location attribute is sometimes relative, but we would
+        # like to have it as always absolute...
+
+        if 'Location' in header.keys():
+            parsed_location = urlparse.urlparse(header["Location"])
+
+            # If not an absolute location...
+            if not parsed_location.netloc:
+                parsed_source_url = urlparse.urlparse(url)
+                new_location = ["", "", "", "", ""]
+
+                new_location[0] = parsed_source_url.scheme
+                new_location[1] = parsed_source_url.netloc
+                new_location[2] = header["Location"]
+
+                # Update location with absolute URL
+                header["Location"] = urlparse.urlunsplit(new_location)
+
+        self.header_text = self.header
+        self.header = header
+
+    def get_headers(self, url):
+        url = url.encode("ascii")
+        self.curl.setopt(pycurl.URL, url)
+
+        self.body = ""
+        self.header = ""
+
+        # Setting NOBODY causes CURL to just fetch the header.
+        self.curl.setopt(pycurl.NOBODY, True)
+        self.curl.perform()
+        self.curl.setopt(pycurl.NOBODY, False)
+
+        self._parse_headers(url)
+
+        return self.header
+
+    def get_or_return_license(self, url, file_name=None):
+        """Get file at the requested URL or, if behind a license, return that.
+
+        If the URL provided does not redirect us to a license, then return the
+        body of that file. If we are redirected to a license click through
+        then return (the license as plain text, url to accept the license).
+
+        If the user of this function accepts the license, then they should
+        call get_protected_file."""
+
+        self.file_name = file_name
+
+        # Get the license details. If this returns None, the file isn't license
+        # protected and we can just return the file we started to get in the
+        # function (self.body).
+        license_details = self._get_license(url)
+
+        if license_details:
+            return license_details
+
+        return self.body
+
+    def get(self, url, file_name=None):
         """Fetch the requested URL, accepting licenses, returns file body
 
         Fetches the file at url. If a redirect is encountered, it is
@@ -53,13 +135,34 @@
         then download the original file.
 
         """
-        self._get(url)
-
-        location = self._get_location()
-        if location:
-            # Off to the races - we have been redirected.
-            # Expect to find a link to self.location with -accepted inserted
-            # before the .html, i.e. ste.html -> ste-accepted.html
+
+        self.file_name = file_name
+        license_details = self._get_license(url)
+
+        if license_details:
+            # Found a license. Accept the license without looking at it and
+            # start fetching the file we originally wanted.
+            accept_url = license_details[1]
+            self.get_protected_file(accept_url, url)
+
+        else:
+            # If we got here, there wasn't a license protecting the file
+            # so we just fetch it.
+            self._get(url)
+
+        return self.body
+
+    def _get_license(self, url):
+        """Return (license, accept URL) if found, else return None"""
+
+        self.get_headers(url)
+
+        if "Location" in self.header and self.header["Location"] != url:
+            # We have been redirected to a new location - the license file
+            location = self.header["Location"]
+
+            # Fetch the license HTML
+            self._get(location)
 
             # Get the file from the URL (full path)
             file = urlparse.urlparse(location).path
@@ -68,50 +171,64 @@
             file = os.path.split(file)[-1]
 
             # Look for a link with accepted.html in the page name. Follow it.
-            new_file = None
             for line in self.body.splitlines():
                 link_search = re.search("""href=.*?["'](.*?-accepted.html)""",
                     line)
                 if link_search:
                     # Have found license accept URL!
                     new_file = link_search.group(1)
-
-        if new_file:
-            # Accept the license...
-            accept_url = re.sub(file, new_file, location)
-            self._get(accept_url)
-
-            # The above get *should* take us to the file requested via
-            # a redirect. If we manually need to follow that redirect,
-            # do that now.
-
-            if self._get_location():
-                # If we haven't been redirected to our original file,
-                # we should be able to just download it now.
-                self._get(url)
-
-        return self.body
-
-    def _search_header(self, field):
-        """Search header for the supplied field, return field / None"""
-        for line in self.header.splitlines():
-            search = re.search(field + ":\s+(.*?)$", line)
-            if search:
-                return search.group(1)
+                    accept_url = re.sub(file, new_file, location)
+
+            # Parse the HTML using BeautifulSoup
+            soup = BeautifulSoup(self.body)
+
+            # The license is in a div with the ID license-text, so we
+            # use this to pull just the license out of the HTML.
+            html_license = u""
+            for chunk in soup.findAll(id="license-text"):
+                # Output of chunk.prettify is UTF8, but comes back
+                # as a str, so convert it here.
+                html_license += chunk.prettify().decode("utf-8")
+
+            text_license = html2text.html2text(html_license)
+
+            return text_license, accept_url
+
         return None
 
-    def _get_location(self):
-        """Return content of Location field in header / None"""
-        return self._search_header("Location")
+    def get_protected_file(self, accept_url, url):
+        """Gets the file redirected to by the accept_url"""
+
+        self._get(accept_url)  # Accept the license
+
+        if not("Location" in self.header and self.header["Location"] == url):
+            # If we got here, we don't have the file yet (weren't redirected
+            # to it). Fetch our target file. This should work now that we have
+            # the right cookie.
+            self._get(url)  # Download the target file
+
+        return self.body
 
     def _write_body(self, buf):
         """Used by curl as a sink for body content"""
-        self.body += buf
+
+        # If we have a target file to write to, write to it
+        if self.file_out:
+            self.file_out.write(buf)
+
+        # Only buffer first 1MB of body. This should be plenty for anything
+        # we wish to parse internally.
+        if len(self.body) < 1024*1024*1024:
+            self.body += buf
 
     def _write_header(self, buf):
         """Used by curl as a sink for header content"""
         self.header += buf
 
+    def register_progress_callback(self, callback):
+        self.curl.setopt(pycurl.NOPROGRESS, 0)
+        self.curl.setopt(pycurl.PROGRESSFUNCTION, callback)
+
     def close(self):
         """Wrapper to close curl - this will allow curl to write out cookies"""
         self.curl.close()

=== modified file 'download_file'
--- download_file 2012-01-09 07:18:32 +0000
+++ download_file 2012-05-04 13:33:19 +0000
@@ -8,7 +8,7 @@
 import urlparse
 import os
 
-#Download file specified on command line
+"""Download file specified on command line"""
 parser = argparse.ArgumentParser(description="Download a file, accepting "
                                  "any licenses required to do so.")
 
@@ -18,17 +18,11 @@
 args = parser.parse_args()
 
 fetcher = LicenseProtectedFileFetcher()
-content = fetcher.get(args.url[0])
 
 # Get file name from URL
 file_name = os.path.basename(urlparse.urlparse(args.url[0]).path)
-
-# If file name can not be found (for example, we have got a directory
-# index), provide a default.
 if not file_name:
-    file_name = "unnamed.out"
+    file_name = "downloaded"
 
-out = open(file_name, 'w')
-out.write(content)
-out.close()
+fetcher.get(args.url[0], file_name)
 fetcher.close()

=== modified file 'find_latest.py'
--- find_latest.py 2012-04-30 07:40:00 +0000
+++ find_latest.py 2012-05-04 13:33:19 +0000
@@ -127,7 +127,8 @@
     :param url: The base url to search
     :param extra: The extra path needed to complete the url
     """
-    builddates = geturl(url)
+    fetcher = LicenseProtectedFileFetcher()
+    builddates = fetcher.get(url)
     dates = find_ci_builds(builddates)
     dates = sorted(dates, key=lambda x: x[1])
 
@@ -139,4 +140,5 @@
         raise StopIteration()
     except StopIteration:
         pass
+    fetcher.close()
     return filename
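As a usage note, the two-step interface added above (get_or_return_license plus get_protected_file) lets a caller show the license before accepting it, instead of the blanket acceptance that get() performs. A sketch following the docstrings above; the URL is a hypothetical placeholder:

    url = "http://example.org/hwpacks/hwpack_origen.tar.gz"
    fetcher = LicenseProtectedFileFetcher()

    # Returns either the file body (no license in the way) or a
    # (license_text, accept_url) tuple, per the docstring.
    result = fetcher.get_or_return_license(url,
                                           file_name="hwpack_origen.tar.gz")

    if isinstance(result, tuple):
        license_text, accept_url = result
        print license_text  # let the user read it before accepting
        fetcher.get_protected_file(accept_url, url)

    fetcher.close()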