Apport

Merge lp:~louis/apport/apport-unpack-extract into lp:~apport-hackers/apport/trunk

Proposed by Louis Bouchard on 2015-01-26

Status:	Merged
Merged at revision:	2896
Proposed branch:	lp:~louis/apport/apport-unpack-extract
Merge into:	lp:~apport-hackers/apport/trunk
Diff against target:	188 lines (+132/-3), 3 files modified
	bin/apport-unpack (+11/-2)
	problem_report.py (+66/-0)
	test/test_problem_report.py (+55/-1)
To merge this branch:	bzr merge lp:~louis/apport/apport-unpack-extract

Reviewer	Review Type	Date Requested	Status
Martin Pitt (community)		2015-01-26	Approve on 2015-02-05
Review via email: mp+247591@code.launchpad.net

Description of the change

The modifications to problem_report.py are required for the changes in apport-unpack to work.

The changes to apport-unpack are only meaningful for Ubuntu Precise, so they could be carried as a Debian patch and dropped from this MP, while keeping the enhancement to problem_report.

Explanation:
============
Kernel VmCore files can be very big (multiple gigabytes). ProblemReport.load() systematically loads them into RAM, which is very expensive in resources. This adds a ProblemReport.extract() method that writes a binary element directly to a file without loading it into memory first. The extract() method writes the element given as an argument to disk, in the directory passed as an argument or in /tmp by default.

apport-unpack is modified to extract the content of the report in two steps:
1) Load the report with binary=False and write all the non-binary elements to their respective files, keeping a list of the binary elements identified.
2) Use the extract() method on each element identified as binary to write it directly to disk.
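
The memory saving comes from decoding the binary value line by line instead of building it as one string. The following is only a minimal sketch of that idea, not the code in this branch: it assumes a value stored as base64 continuation lines (each prefixed by a space) that together carry a plain zlib stream, and the name stream_decode is hypothetical.

```python
import base64
import io
import zlib


def stream_decode(lines, out):
    """Decode base64 continuation lines and write the inflated bytes to out.

    Only one line is held in memory at a time. Assumes every line is a
    separately base64-encoded slice of a single zlib stream.
    """
    decomp = zlib.decompressobj()
    for line in lines:
        if not line.startswith(b' '):
            break  # next key reached, the value is complete
        out.write(decomp.decompress(base64.b64decode(line)))
    out.write(decomp.flush())


# Build a small sample value split over several base64 lines.
payload = b'A' * 100000
compressed = zlib.compress(payload)
encoded = io.BytesIO(b''.join(
    b' ' + base64.b64encode(compressed[i:i + 40]) + b'\n'
    for i in range(0, len(compressed), 40)))

result = io.BytesIO()
stream_decode(encoded, result)
```

In real use, out would be a file opened in 'wb' mode, so the decoded dump never has to fit in RAM.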

Here is a comparison of the time taken before and after the modification (test on Precise):

root@crashdump:~# time apport-unpack linux-image-3.2.0-38-generic.0.crash tmp

Current version:
real 4m54.216s
user 1m16.777s
sys 0m47.919s

With extract method:
real 0m12.574s
user 0m5.560s
sys 0m7.004s

Martin Pitt (pitti) wrote on 2015-01-27:

Hey Louis,

thanks for working on this! The general approach looks fine to me, I
just have some small things to fix.

Louis Bouchard [2015-01-26 14:42 -0000]:
> Changes to apport-unpack are only meaningful for Ubuntu Precise, so
> they could be brought in as a debian patch, hence dropped from that
> MP, while keeping the enhancement to problem_report.

I think it's fine to keep that in trunk. We might have other large
attachments which are better handled that way.

> === modified file 'bin/apport-unpack'
> --- bin/apport-unpack	2012-05-04 06:54:56 +0000
> +++ bin/apport-unpack	2015-01-26 14:41:51 +0000
> @@ -49,18 +49,28 @@
>  except OSError as e:
>      fatal(str(e))
>
> +binaries = ()

Please make that a list or a set; using a tuple in this way is
confusing. It would also be nice if you could name this "bin_keys"?

>  for k in pr:
> +    if pr[k] == u'':

No u'' please, just ''; u'' isn't compatible with earlier Python 3.x.

> +        binaries += (k,)

Following the above, bin_keys.append(k) (list) or .add(k) (set).

> +    with open(report, 'rb') as f:
> +        for element in binaries:

"key" instead of element, please? Let's consistently call the report
contents "key" and "value".

> +    def extract(self, file, item=None, directory='/tmp'):
> +        '''Extract only one element from the problem_report
> +        directly without loading the report beforehand
> +        This is required for Kernel Crash Dumps that can be
> +        very big and saturate the RAM
> +        '''

Short descriptions need to be one line. Please rename "item" to "key",
and drop the None default for it, as None will turn the whole thing
into a no-op. Also, don't specify a default for directory; such
defaults are for CLI programs, and libraries should *never* default to
/tmp unless they very carefully avoid overwriting existing files in a
race free manner (i. e. O_CREAT|O_EXCL).
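
To illustrate the O_CREAT|O_EXCL point, here is a minimal sketch (Python 3) of creating an output file without ever overwriting an existing one. The helper name create_exclusive and the 0o600 mode are assumptions made for the example, not something proposed in this branch.

```python
import os
import tempfile


def create_exclusive(path, data):
    """Create path and write data, failing if path already exists.

    O_CREAT | O_EXCL makes creation atomic: the call raises
    FileExistsError (an OSError) rather than overwriting a file
    that someone else put there first.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, 'wb') as f:
        f.write(data)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, 'CoreDump')
create_exclusive(target, b'first')

try:
    create_exclusive(target, b'second')
    overwritten = True
except FileExistsError:
    overwritten = False

with open(target, 'rb') as f:
    kept = f.read()
```

The second call fails instead of replacing the first file, which is the behaviour a library would need before it could safely default to a shared directory such as /tmp.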

Finally, to further point out that this is for a single key instead of
the whole report, could this be named extract_key()?

> +        self._assert_bin_mode(file)
> +        self.data.clear()
> +        key = None
> +        value = None
> +        b64_block = False
> +        bd = None
> +        # Make sure the report is at the beginning
> +        file.seek(0)

I'd like to avoid this. First, it's unexpected (and undocumented), you
might deliberately call this on an open file at an offset, and second
it won't work for non-seekable fds such as reading from stdin.
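
Both objections can be reproduced in a few lines. This is a sketch only: a pipe stands in for a piped stdin, and the sample report content is made up.

```python
import io
import os

# A pipe stands in for stdin here; like a piped stdin, it cannot seek.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, 'wb') as w:
    w.write(b'ProblemType: Crash\n')

with os.fdopen(read_fd, 'rb') as pipe:
    pipe_seekable = pipe.seekable()
    try:
        pipe.seek(0)
        seek_failed = False
    except OSError:
        # io.UnsupportedOperation is a subclass of OSError
        seek_failed = True
    first_line = pipe.readline()

# A seekable file deliberately positioned at an offset: without a
# seek(0), reading simply continues from where the caller left it.
buf = io.BytesIO(b'skipped\nProblemType: Crash\n')
buf.readline()
rest = buf.read()
```

Leaving the position alone keeps the method usable on pipes and lets the caller decide where reading starts.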

> +        for line in file:
> +            # continuation line
> +            if line.startswith(b' '):
> +                if not b64_block:
> +                    continue
> +                assert (key is not None and value is not None)
> +                if b64_block:
> +                    l = base64.b64decode(line)
> +                    if bd:
> +                        out.write(bd.decompress(l))
> +                    else:
> +                        # lazy initialization of bd
> +                        # skip gzip header, if present
> +                        if l.startswith(b'\037\213\010'):
> +                            bd = zlib.decompressobj(-zlib.MAX_WBITS)
> +                            out.write(bd.decompress(self._strip_gzip_header(l)))
> +                        else:
> +                            # legacy zlib-only format used default block
> +                            # size
> +                            bd = zlib.decompressobj()
> +                            out.write(bd.decompress(l))
> +                else:
> +                    if len(value) > 0:
> +                        out.write('{}'.format(b'\n'))
> +                    if line.endswith(b'\n'):
> +                        out.write('{}'.format(line[1:-1]))
> +                    else:
> +                        out.write('{}'.format(line[1:]))
> +            else:
> +                if b64_block:
> +                    if bd:
> +                        value += bd.flush()
> +                    b64_block = False
> +                    bd = None
> +                if key:
> +                    assert value is not None
> +                    self.data[key] = self._try_unicode(value)
> +                (key, value) = line.split(b':', 1)
> +                if not _python2:
> +                    key = key.decode('ASCII')
> +                if key != item:
> +                    continue
> +                value = value.strip()
> +                if value == b'base64':
> +                    value = b''
> +                    b64_block = True
> +                    try:
> +                        out=open(os.path.join(directory, item), 'wb')
> +                    except IOError as e:
> +                        fatal(str(e))

This logic can become simpler, as you know precisely what to scan for
at first: a line equal to key + b':\n'. If it's not found, this should
throw a KeyError; otherwise you can start uncompressing/writing from
there and quit once you arrive at the next key, without having to read
the remainder of the file.
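
A rough sketch of the simplified control flow suggested here, for illustration only. It is not the code that was merged. It assumes the key line reads "Key: base64" (as in the quoted diff) and that the continuation lines carry a plain zlib stream, so the gzip-header handling is left out; the helper b64_value and the sample report are made up.

```python
import base64
import io
import zlib


def extract_key(report, key, out):
    """Write the decoded value of one base64 key from report to out.

    Raises KeyError if the key is missing and ValueError if it is not
    a base64 value. Stops reading at the next key.
    """
    wanted = key.encode('ASCII') + b':'
    for line in report:
        if line.startswith(wanted):
            if line[len(wanted):].strip() != b'base64':
                raise ValueError('%s has no binary content' % key)
            break
    else:
        raise KeyError(key)

    decomp = zlib.decompressobj()
    for line in report:
        if not line.startswith(b' '):
            break  # next key: done, the rest of the file is not read
        out.write(decomp.decompress(base64.b64decode(line)))
    out.write(decomp.flush())


def b64_value(data):
    """Hypothetical helper: one continuation line holding data."""
    return b' ' + base64.b64encode(zlib.compress(data)) + b'\n'


report = io.BytesIO(
    b'ProblemType: Crash\n'
    b'Foo: base64\n' + b64_value(b'FooFoo!') +
    b'Txt: some text\n')
out = io.BytesIO()
extract_key(report, 'Foo', out)
```

The for/else raises KeyError only when the first loop runs off the end of the file without finding the key.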

> +
> +        if key is not None:
> +            self.data[key] = self._try_unicode(value)
> +
> +        self.old_keys = set(self.data.keys())

These two shouldn't happen as you don't modify the report.

> --- test/test_problem_report.py	2012-10-11 05:56:35 +0000
> +++ test/test_problem_report.py	2015-01-26 14:41:51 +0000
> @@ -258,6 +266,52 @@
>          pr.load(BytesIO(b'ProblemType: Crash'))
>          self.assertEqual(list(pr.keys()), ['ProblemType'])
>  
> +
> +    def test_extract(self):
> +        '''extract() with various binary elements.'''
> +
> +        # create a test report with binary elements
> +        large_val = b'A' * 5000000
> +
> +        pr = problem_report.ProblemReport()
> +        pr['Txt'] = 'some text'
> +        pr['MoreTxt'] = 'some more text'
> +        pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
> +        pr['Uncompressed'] = bin_data
> +        pr['Bin'] = problem_report.CompressedValue()
> +        pr['Bin'].set_value(bin_data)
> +        pr['Large'] = problem_report.CompressedValue(large_val)
> +        pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')

This is really binary, so I would just drop this bit.

> +        report = BytesIO()
> +        pr.write(report)
> +        report.seek(0)
> +
> +        #Extracts nothing if non binary
> +        pr.extract(report, 'Txt', self.workdir )
> +        self.assertEqual(os.path.exists(os.path.join(self.workdir, 'Txt')), False)

This should raise a ValueError instead. Silently not doing anything on
wrong input (i. e. an ASCII key) is very un-Pythonic.

Also, don't use assertEqual(..., True|False) -- use assertTrue() and
assertFalse(). Verifying that it does not write a file is still a good
thing, of course (after the assertRaises).
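
As a sketch of the test shape requested here: extract_text_key below is a hypothetical stand-in for the extraction call on a non-binary key; only the assertion pattern is the point.

```python
import os
import shutil
import tempfile
import unittest


def extract_text_key(directory):
    """Hypothetical stand-in for an extraction call on a non-binary key."""
    raise ValueError('Txt has no binary content')


class ExtractErrorTest(unittest.TestCase):
    def test_non_binary_key(self):
        workdir = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, workdir)
        # The call must raise rather than silently do nothing...
        self.assertRaises(ValueError, extract_text_key, workdir)
        # ...and must still not have written a file.
        self.assertFalse(os.path.exists(os.path.join(workdir, 'Txt')))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExtractErrorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```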

> +        #Check inexistant element
> +        pr.extract(report, 'Bar', self.workdir )
> +        self.assertEqual(os.path.exists(os.path.join(self.workdir, 'Bar')), False)

This should raise a KeyError.

> +        #Check valid elements
> +        pr.extract(report, 'Foo', self.workdir )
> +        element = open(os.path.join(self.workdir, 'Foo'))
> +        self.assertEqual(element.read(), b'FooFoo!' )

Please don't leave fd leaks. Use "with open(...) as f:". Same for the
tests below.
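
For example (a sketch with a made-up file, assuming Python 3), reading the extracted file back inside a with block closes the descriptor even if the comparison fails. The file is opened in binary mode because the expected value is bytes.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'Foo')
with open(path, 'wb') as f:
    f.write(b'FooFoo!')

# Binary mode, because the expected value is bytes; the with block
# closes the descriptor on exit, whether or not an exception occurs.
with open(path, 'rb') as element:
    content = element.read()

closed_after_block = element.closed
```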

Finally, there are a lot of PEP-8 errors in this. Can you please
install "pep8" and fix the issues in

  pep8 -r --ignore=E401,E501,W291,W293 test/test_problem_report.py
  pep8 -r --ignore=E401,E501,E124 problem_report.py

That's what test/run does, BTW. You can just start that, and as soon
as it starts running the module tests you are good. It will fail very
quickly on pep8/pyflakes errors.

Thanks!

Martin
-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

lp:~louis/apport/apport-unpack-extract updated on 2015-01-29

2898. By Louis Bouchard on 2015-01-27

PEP8 cleanup

2899. By Louis Bouchard on 2015-01-28

Fix pylint complains

Predefine out
Replace fatal() by returning an exception to caller

2900. By Louis Bouchard on 2015-01-28

Replace tuple binaries by list called bin_keys

2901. By Louis Bouchard on 2015-01-28

Rename method to extract_key and fix description

2902. By Louis Bouchard on 2015-01-28

Change extract_key parameters

Rename directory to dir and remove default
Rename item to bin_key

2903. By Louis Bouchard on 2015-01-28

Remove py3 incompatibility

2904. By Louis Bouchard on 2015-01-28

Remove seek(0) in extract_key method

Adapt test and fix apport-unpack accordingly

2905. By Louis Bouchard on 2015-01-28

Remove unneeded modifications to report

2906. By Louis Bouchard on 2015-01-28

Replace some assertEquals by assertFalse

2907. By Louis Bouchard on 2015-01-28

Lump similar tests in one loop

2908. By Louis Bouchard on 2015-01-29

Rework the extract_key logic to simplify

Previous logic was a straight copy from the load method.
Revisit to remove unneeded complexity

2909. By Louis Bouchard on 2015-01-29

Add assertRaises tests for exception handling in extract_key

Louis Bouchard (louis) wrote on 2015-01-29:

Hello Martin,

All the changes that you requested have been implemented except for one that I don't understand:

> +
> +    def test_extract(self):
> +        '''extract() with various binary elements.'''
> +
> +        # create a test report with binary elements
> +        large_val = b'A' * 5000000
> +
> +        pr = problem_report.ProblemReport()
> +        pr['Txt'] = 'some text'
> +        pr['MoreTxt'] = 'some more text'
> +        pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
> +        pr['Uncompressed'] = bin_data
> +        pr['Bin'] = problem_report.CompressedValue()
> +        pr['Bin'].set_value(bin_data)
> +        pr['Large'] = problem_report.CompressedValue(large_val)
> +        pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')

> This is really binary, so I would just drop this bit.

I'm not sure of what should be dropped.

The other slight change concerns the request to replace item by key: the key variable was already in use in the method, so I replaced item by bin_key to distinguish the two.

I also improved the test by avoiding repetition and testing the exceptions.

Let me know if there is anything else to be done.

Kind regards,

...Louis

Martin Pitt (pitti) wrote on 2015-01-29:

Thanks! This looks mostly good now, I'll merge this.

This bit in apport-unpack still has quite some optimization potential:

    for key in bin_keys:
        with open(report, 'rb') as f:
            pr.extract_key(f, key, dir)

This reads the report file n times (once per binary key). It would be much faster to extract all keys sequentially in one go. This would require either changing the API of load() to generate a list of binary keys in the order in which they appear in the file, or changing the API of extract_key() to accept a collection of keys and extract them all. The latter is obviously better in terms of API stability.
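
A sketch of the second option, extracting a collection of keys in a single pass over the file. This is illustrative only and is not the extract_keys() that was later merged: the write callback, the helper b64_value and the plain-zlib value format are assumptions made to keep the example self-contained.

```python
import base64
import io
import zlib


def extract_keys(report, keys, write):
    """Extract several base64 keys from report in a single pass.

    write(key, data) is called with each decoded chunk. Returns the set
    of requested keys that were found as base64 values.
    """
    wanted = set(keys)
    found = set()
    current = None   # key whose continuation lines are being decoded
    decomp = None
    for line in report:
        if line.startswith(b' '):
            if current is not None:
                write(current, decomp.decompress(base64.b64decode(line)))
            continue
        # A new key line ends the previous value.
        if current is not None:
            write(current, decomp.flush())
            current = None
        name, value = line.split(b':', 1)
        name = name.decode('ASCII')
        if name in wanted and value.strip() == b'base64':
            current, decomp = name, zlib.decompressobj()
            found.add(name)
    if current is not None:
        write(current, decomp.flush())
    return found


def b64_value(data):
    """Hypothetical helper: one continuation line holding data."""
    return b' ' + base64.b64encode(zlib.compress(data)) + b'\n'


report = io.BytesIO(
    b'Foo: base64\n' + b64_value(b'FooFoo!') +
    b'Txt: some text\n'
    b'Bin: base64\n' + b64_value(b'\x00\x01\x02'))

chunks = {}
found = extract_keys(
    report, ['Foo', 'Bin'],
    lambda key, data: chunks.setdefault(key, bytearray()).extend(data))
```

Because the keys are handled in file order, the report is read once regardless of how many binary keys are requested.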

review: Approve

lp:~louis/apport/apport-unpack-extract updated on 2015-02-02

2910. By Louis Bouchard on 2015-01-30

Remove unneeded initialization

2911. By Louis Bouchard on 2015-02-02

Implement multiple key extract : rename to extract_keys

Modify the extract_keys() method to receive keys to extract
in as a list of elements. It will sequentially extract the listed
binary elements present in the list passed as argument.

It will throw a ValueError if some of the keys are not in binary
and will issue a KeyError if some keys are not in the report.

Adapt apport-unpack accordingly and add unit tests

2912. By Louis Bouchard on 2015-02-02

Replace format() by %s % syntax

Louis Bouchard (louis) wrote on 2015-02-02:

Hello,

Following your advice, I have changed the extract_key() method into extract_keys(), which can receive a list of binary keys and extract them directly in one pass.

Error handling has been changed accordingly, and the unit tests have been improved.

apport-unpack now uses this new feature, which avoids reopening the report multiple times.

Martin Pitt (pitti) wrote on 2015-02-05:

Many thanks! I merged this with a few cleanups and a NEWS entry.

review: Approve

Preview Diff

=== modified file 'bin/apport-unpack'
--- bin/apport-unpack	2012-05-04 06:54:56 +0000
+++ bin/apport-unpack	2015-02-02 14:38:34 +0000
@@ -49,18 +49,27 @@
 except OSError as e:
     fatal(str(e))
+bin_keys = []
 pr = problem_report.ProblemReport()
 if report == '-':
-    pr.load(sys.stdin)
+    pr.load(sys.stdin, binary=False)
 else:
     try:
         with open(report, 'rb') as f:
-            pr.load(f)
+            pr.load(f, binary=False)
     except IOError as e:
         fatal(str(e))
 for k in pr:
+    if pr[k] == '':
+        bin_keys.append(k)
+        continue
     with open(os.path.join(dir, k), 'wb') as f:
         if type(pr[k]) == t_str:
             f.write(pr[k].encode('UTF-8'))
         else:
             f.write(pr[k])
+try:
+    with open(report, 'rb') as f:
+        pr.extract_keys(f, bin_keys, dir)
+except IOError as e:
+    fatal(str(e))
=== modified file 'problem_report.py'
--- problem_report.py	2013-09-19 14:26:33 +0000
+++ problem_report.py	2015-02-02 14:38:34 +0000
@@ -188,6 +188,72 @@
         self.old_keys = set(self.data.keys())
+    def extract_keys(self, file, bin_keys, dir):
+        '''Extract only one binary element from the problem_report
+
+        Binary elements can be very big. This method extracts
+        directly to a file without loading the report beforehand
+        This is required for Kernel Crash Dumps that can be
+        very big and saturate the RAM
+        '''
+        self._assert_bin_mode(file)
+        if not isinstance(bin_keys, list):
+            bin_keys = [bin_keys]
+        key = None
+        value = None
+        has_key = {key: False for key in bin_keys}
+        b64_block = {}
+        bd = None
+        out = None
+        for line in file:
+            # Identify the bin_keys we're looking for
+            while not line.startswith(b' '):
+                (key, value) = line.split(b':', 1)
+                if not _python2:
+                    key = key.decode('ASCII')
+                if key not in bin_keys:
+                    break
+                b64_block[key] = False
+                has_key[key] = True
+                value = value.strip()
+                if value == b'base64':
+                    value = b''
+                    b64_block[key] = True
+                    try:
+                        bd = None
+                        with open(os.path.join(dir, key), 'wb') as out:
+                            for line in file:
+                                # continuation line
+                                if line.startswith(b' '):
+                                    assert (key is not None and value is not None)
+                                    if b64_block[key]:
+                                        l = base64.b64decode(line)
+                                        if bd:
+                                            out.write(bd.decompress(l))
+                                        else:
+                                            # lazy initialization of bd
+                                            # skip gzip header, if present
+                                            if l.startswith(b'\037\213\010'):
+                                                bd = zlib.decompressobj(-zlib.MAX_WBITS)
+                                                out.write(bd.decompress(self._strip_gzip_header(l)))
+                                            else:
+                                                # legacy zlib-only format used default block
+                                                # size
+                                                bd = zlib.decompressobj()
+                                                out.write(bd.decompress(l))
+                                else:
+                                    break
+                    except IOError:
+                        raise IOError('unable to open %s' % (os.path.join(dir, key)))
+                else:
+                    break
+        if False in has_key.values():
+            raise KeyError('Cannot find %s in report' %
+                           [item for item, element in has_key.items() if element is False])
+        if False in b64_block.values():
+            raise ValueError('%s has no binary content' %
+                             [item for item, element in b64_block.items() if element is False])
+
     def has_removed_fields(self):
         '''Check if the report has any keys which were not loaded.
=== modified file 'test/test_problem_report.py'
--- test/test_problem_report.py	2012-10-11 05:56:35 +0000
+++ test/test_problem_report.py	2015-02-02 14:38:34 +0000
@@ -1,5 +1,5 @@
 # vim: set encoding=UTF-8 fileencoding=UTF-8 :
-import unittest, tempfile, os, email, gzip, time, sys
+import unittest, tempfile, os, shutil, email, gzip, time, sys
 from io import BytesIO
 import problem_report
@@ -12,6 +12,14 @@
 class T(unittest.TestCase):
+    @classmethod
+    def setUp(self):
+        self.workdir = tempfile.mkdtemp()
+
+    @classmethod
+    def tearDown(self):
+        shutil.rmtree(self.workdir)
+
     def test_basic_operations(self):
         '''basic creation and operation.'''
@@ -258,6 +266,52 @@
         pr.load(BytesIO(b'ProblemType: Crash'))
         self.assertEqual(list(pr.keys()), ['ProblemType'])
+    def test_extract_keys(self):
+        '''extract_keys() with various binary elements.'''
+
+        # create a test report with binary elements
+        large_val = b'A' * 5000000
+
+        pr = problem_report.ProblemReport()
+        pr['Txt'] = 'some text'
+        pr['MoreTxt'] = 'some more text'
+        pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
+        pr['Uncompressed'] = bin_data
+        pr['Bin'] = problem_report.CompressedValue()
+        pr['Bin'].set_value(bin_data)
+        pr['Large'] = problem_report.CompressedValue(large_val)
+        pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')
+
+        report = BytesIO()
+        pr.write(report)
+        report.seek(0)
+
+        self.assertRaises(IOError, pr.extract_keys, report, 'Bin', '{}/foo'.format(self.workdir))
+        # Test exception handling : Non-binary and inexistant key
+        tests = {ValueError: 'Txt', ValueError: ['Foo', 'Txt'], KeyError: 'Bar', KeyError: ['Foo', 'Bar']}
+        for test in tests.keys():
+            report.seek(0)
+            self.assertRaises(test, pr.extract_keys, report, tests[test], self.workdir)
+        # Check valid single elements
+        tests = {'Foo': b'FooFoo!', 'Uncompressed': bin_data, 'Bin': bin_data, 'Large': large_val,
+                 'Multiline': b'\1\1\1\n\2\2\n\3\3\3'}
+        for test in tests.keys():
+            report.seek(0)
+            pr.extract_keys(report, test, self.workdir)
+            with open(os.path.join(self.workdir, test), 'rb') as element:
+                self.assertEqual(element.read(), tests[test])
+                element.close()
+            # remove file for next pass
+            os.remove(os.path.join(self.workdir, test))
+        # Check element list
+        report.seek(0)
+        key_list = ['Foo', 'Uncompressed']
+        tests = {'Foo': b'FooFoo!', 'Uncompressed': bin_data}
+        pr.extract_keys(report, key_list, self.workdir)
+        for key in key_list:
+            with open(os.path.join(self.workdir, key), 'rb') as element:
+                self.assertEqual(element.read(), tests[key])
+
     def test_write_file(self):
         '''writing a report with binary file data.'''
