Merge lp:~louis/apport/apport-unpack-extract into lp:~apport-hackers/apport/trunk

Proposed by Louis Bouchard
Status: Merged
Merged at revision: 2896
Proposed branch: lp:~louis/apport/apport-unpack-extract
Merge into: lp:~apport-hackers/apport/trunk
Diff against target: 188 lines (+132/-3)
3 files modified
bin/apport-unpack (+11/-2)
problem_report.py (+66/-0)
test/test_problem_report.py (+55/-1)
To merge this branch: bzr merge lp:~louis/apport/apport-unpack-extract
Reviewer Review Type Date Requested Status
Martin Pitt (community) Approve
Review via email: mp+247591@code.launchpad.net

Description of the change

The modifications to problem_report.py are mandatory in order for the changes in apport-unpack to work.

Changes to apport-unpack are only meaningful for Ubuntu Precise, so they could be brought in as a debian patch, hence dropped from that MP, while keeping the enhancement to problem_report.

Explanation:
============
Kernel VmCore can be very big ( multi Gigabytes) files. ProblemReport.load() does a systematic load of those in RAM which is very expensive in resource. This adds a ProblemReport.extract() method that will write directly to a file the binary element without loading it into memory first. ProblemExtact() method will write the element given as an argument to disk in the directory passed as an argument or /tmp by default.

apport-unpack is modified to extract the content of the report in two steps :
1) Load the report with binary=False and write all the non-binary element to their respective file, keeping a list of the binary element identified.
2) Use the extract() method on each element identified as binary to write directly to disk

Here is a comparison of the time taken before and after the modification :

root@crashdump:~# time apport-unpack linux-image-3.2.0-38-generic.0.crash tmp

Test on Precise:
Current version :
real 4m54.216s
user 1m16.777s
sys 0m47.919s

With extract method :
============
real 0m12.574s
user 0m5.560s
sys 0m7.004s

To post a comment you must log in.
Revision history for this message
Martin Pitt (pitti) wrote :
Download full text (7.7 KiB)

Hey Louis,

thanks for working on this! The general approach looks fine to me, I
just have some small things to fix.

Louis Bouchard [2015-01-26 14:42 -0000]:
> Changes to apport-unpack are only meaningful for Ubuntu Precise, so
> they could be brought in as a debian patch, hence dropped from that
> MP, while keeping the enhancement to problem_report.

I think it's fine to keep that in trunk. We might have other large
attachments which are better handled that way.

> === modified file 'bin/apport-unpack'
> --- bin/apport-unpack 2012-05-04 06:54:56 +0000
> +++ bin/apport-unpack 2015-01-26 14:41:51 +0000
> @@ -49,18 +49,28 @@
> except OSError as e:
> fatal(str(e))
>
> +binaries = ()

Please make that a list or a set; using a tuple in this way is
confusing. It would also be nice if you could name this "bin_keys"?

> for k in pr:
> + if pr[k] == u'':

No u'' please, just ''; u'' isn't compatible with earlier Python 3.x.

> + binaries += (k,)

Following the above, bin_keys.append(k) (list) or .add(k) (set).

> + with open(report, 'rb') as f:
> + for element in binaries:

"key" instead of element, please? Let's consistently call the report
contents "key" and "value".

> + def extract(self, file, item=None, directory='/tmp'):
> + '''Extract only one element from the problem_report
> + directly without loading the report beforehand
> + This is required for Kernel Crash Dumps that can be
> + very big and saturate the RAM
> + '''

Short descriptions need to be one line. Please rename "item" to "key",
and drop the None default for it, as None will turn the whole thing
into a no-op. Also, don't specify a default for directory; such
defaults are for CLI programs, and libraries should *never* default to
/tmp unless they very carefully avoid overwriting existing files in a
race free manner (i. e. O_CREAT|O_EXCL).

Finally, to further point out that this is for a single key instead of
the whole report, could this be named extract_key()?

> + self._assert_bin_mode(file)
> + self.data.clear()
> + key = None
> + value = None
> + b64_block = False
> + bd = None
> + # Make sure the report is at the beginning
> + file.seek(0)

I'd like to avoid this. First, it's unexpected (and undocumented), you
might deliberately call this on an open file at an offset, and second
it won't work for non-seekable fds such as reading from stdin.

> + for line in file:
> + # continuation line
> + if line.startswith(b' '):
> + if not b64_block:
> + continue
> + assert (key is not None and value is not None)
> + if b64_block:
> + l = base64.b64decode(line)
> + if bd:
> + out.write(bd.decompress(l))
> + else:
> + # lazy initialization of bd
> + # skip gzip header, if present
> + if l.startswith(b'\037\213\010'):
> + bd = zlib.decompressobj(-zlib.MAX_WBITS)
> + out.write(bd.de...

Read more...

2898. By Louis Bouchard

PEP8 cleanup

2899. By Louis Bouchard

Fix pylint complains

Predefine out
Replace fatal() by returning an exception to caller

2900. By Louis Bouchard

Replace tuple binaries by list called bin_keys

2901. By Louis Bouchard

Rename method to extract_key and fix description

2902. By Louis Bouchard

Change extract_key parameters

 Rename directory to dir and remove default
 Rename item to bin_key

2903. By Louis Bouchard

Remove py3 incompatibility

2904. By Louis Bouchard

Remove seek(0) in extract_key method

Adapt test and fix apport-unpack accordingly

2905. By Louis Bouchard

Remove unneeded modifications to report

2906. By Louis Bouchard

Replace some assertEquals by assertFalse

2907. By Louis Bouchard

Lump similar tests in one loop

2908. By Louis Bouchard

Rework the extract_key logic to simplify

Previous logic was a straight copy from the load method.
Revisit to remove unneeded complexity

2909. By Louis Bouchard

Add assertRaises tests for exception handling in extract_key

Revision history for this message
Louis Bouchard (louis) wrote :

Hello Martin,

All the changes that you requested have been implemented except for one that I don't understand :

> +
> + def test_extract(self):
> + '''extract() with various binary elements.'''
> +
> + # create a test report with binary elements
> + large_val = b'A' * 5000000
> +
> + pr = problem_report.ProblemReport()
> + pr['Txt'] = 'some text'
> + pr['MoreTxt'] = 'some more text'
> + pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
> + pr['Uncompressed'] = bin_data
> + pr['Bin'] = problem_report.CompressedValue()
> + pr['Bin'].set_value(bin_data)
> + pr['Large'] = problem_report.CompressedValue(large_val)
> + pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')

> This is really binary, so I would just drop this bit.

I'm not sure of what should be dropped.

The other slight change is the request to replace item by key : the key variable was already in use in the method, so I replaced item by bin_key to discriminate.

I also improved the test by avoiding repetition and testing the exceptions.

Let me know if there is anything else to be done.

Kind regards,

...Louis

Revision history for this message
Martin Pitt (pitti) wrote :

Thanks! This looks mostly good now, I'll merge this.

This bit in apport-unpack still has quite some optimization potential:

    for key in bin_keys:
        with open(report, 'rb') as f:
            pr.extract_key(f, key, dir)

This reads the report file n times (for #number of binary keys). It would be much faster to extract all keys sequentially in one go. This would require either changing the API of load() to generate a list of binary keys which is in the order as they appear in the file, or changing the API of extract_key() to accept a collection of keys and extract them all. The latter is obviously better in terms of API stability.

review: Approve
2910. By Louis Bouchard

Remove unneeded initialization

2911. By Louis Bouchard

Implement multiple key extract : rename to extract_keys

Modify the extract_keys() method to receive keys to extract
in as a list of elements. It will sequentially extract the listed
binary elements present in the list passed as argument.

It will throw a ValueError if some of the keys are not in binary
and will issue a KeyError if some keys are not in the report.

Adapt apport-unpack accordingly and add unit tests

2912. By Louis Bouchard

Replace format() by %s % syntax

Revision history for this message
Louis Bouchard (louis) wrote :

Hello,

Following your advice, I have modified the extract_key() method into extract_keys() which can receive a list of binary keys to extract directly in one pass.

Error handling has been changed accordingly along with improvement to the unit tests.

apport-unpack now uses this new feature which avoid reopening the report multiple times.

Revision history for this message
Martin Pitt (pitti) wrote :

Many thanks! I merged this with a few cleanups and a NEWS entry.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bin/apport-unpack'
2--- bin/apport-unpack 2012-05-04 06:54:56 +0000
3+++ bin/apport-unpack 2015-02-02 14:38:34 +0000
4@@ -49,18 +49,27 @@
5 except OSError as e:
6 fatal(str(e))
7
8+bin_keys = []
9 pr = problem_report.ProblemReport()
10 if report == '-':
11- pr.load(sys.stdin)
12+ pr.load(sys.stdin, binary=False)
13 else:
14 try:
15 with open(report, 'rb') as f:
16- pr.load(f)
17+ pr.load(f, binary=False)
18 except IOError as e:
19 fatal(str(e))
20 for k in pr:
21+ if pr[k] == '':
22+ bin_keys.append(k)
23+ continue
24 with open(os.path.join(dir, k), 'wb') as f:
25 if type(pr[k]) == t_str:
26 f.write(pr[k].encode('UTF-8'))
27 else:
28 f.write(pr[k])
29+try:
30+ with open(report, 'rb') as f:
31+ pr.extract_keys(f, bin_keys, dir)
32+except IOError as e:
33+ fatal(str(e))
34
35=== modified file 'problem_report.py'
36--- problem_report.py 2013-09-19 14:26:33 +0000
37+++ problem_report.py 2015-02-02 14:38:34 +0000
38@@ -188,6 +188,72 @@
39
40 self.old_keys = set(self.data.keys())
41
42+ def extract_keys(self, file, bin_keys, dir):
43+ '''Extract only one binary element from the problem_report
44+
45+ Binary elements can be very big. This method extracts
46+ directly to a file without loading the report beforehand
47+ This is required for Kernel Crash Dumps that can be
48+ very big and saturate the RAM
49+ '''
50+ self._assert_bin_mode(file)
51+ if not isinstance(bin_keys, list):
52+ bin_keys = [bin_keys]
53+ key = None
54+ value = None
55+ has_key = {key: False for key in bin_keys}
56+ b64_block = {}
57+ bd = None
58+ out = None
59+ for line in file:
60+ # Identify the bin_keys we're looking for
61+ while not line.startswith(b' '):
62+ (key, value) = line.split(b':', 1)
63+ if not _python2:
64+ key = key.decode('ASCII')
65+ if key not in bin_keys:
66+ break
67+ b64_block[key] = False
68+ has_key[key] = True
69+ value = value.strip()
70+ if value == b'base64':
71+ value = b''
72+ b64_block[key] = True
73+ try:
74+ bd = None
75+ with open(os.path.join(dir, key), 'wb') as out:
76+ for line in file:
77+ # continuation line
78+ if line.startswith(b' '):
79+ assert (key is not None and value is not None)
80+ if b64_block[key]:
81+ l = base64.b64decode(line)
82+ if bd:
83+ out.write(bd.decompress(l))
84+ else:
85+ # lazy initialization of bd
86+ # skip gzip header, if present
87+ if l.startswith(b'\037\213\010'):
88+ bd = zlib.decompressobj(-zlib.MAX_WBITS)
89+ out.write(bd.decompress(self._strip_gzip_header(l)))
90+ else:
91+ # legacy zlib-only format used default block
92+ # size
93+ bd = zlib.decompressobj()
94+ out.write(bd.decompress(l))
95+ else:
96+ break
97+ except IOError:
98+ raise IOError('unable to open %s' % (os.path.join(dir, key)))
99+ else:
100+ break
101+ if False in has_key.values():
102+ raise KeyError('Cannot find %s in report' %
103+ [item for item, element in has_key.items() if element is False])
104+ if False in b64_block.values():
105+ raise ValueError('%s has no binary content' %
106+ [item for item, element in b64_block.items() if element is False])
107+
108 def has_removed_fields(self):
109 '''Check if the report has any keys which were not loaded.
110
111
112=== modified file 'test/test_problem_report.py'
113--- test/test_problem_report.py 2012-10-11 05:56:35 +0000
114+++ test/test_problem_report.py 2015-02-02 14:38:34 +0000
115@@ -1,5 +1,5 @@
116 # vim: set encoding=UTF-8 fileencoding=UTF-8 :
117-import unittest, tempfile, os, email, gzip, time, sys
118+import unittest, tempfile, os, shutil, email, gzip, time, sys
119
120 from io import BytesIO
121 import problem_report
122@@ -12,6 +12,14 @@
123
124
125 class T(unittest.TestCase):
126+ @classmethod
127+ def setUp(self):
128+ self.workdir = tempfile.mkdtemp()
129+
130+ @classmethod
131+ def tearDown(self):
132+ shutil.rmtree(self.workdir)
133+
134 def test_basic_operations(self):
135 '''basic creation and operation.'''
136
137@@ -258,6 +266,52 @@
138 pr.load(BytesIO(b'ProblemType: Crash'))
139 self.assertEqual(list(pr.keys()), ['ProblemType'])
140
141+ def test_extract_keys(self):
142+ '''extract_keys() with various binary elements.'''
143+
144+ # create a test report with binary elements
145+ large_val = b'A' * 5000000
146+
147+ pr = problem_report.ProblemReport()
148+ pr['Txt'] = 'some text'
149+ pr['MoreTxt'] = 'some more text'
150+ pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
151+ pr['Uncompressed'] = bin_data
152+ pr['Bin'] = problem_report.CompressedValue()
153+ pr['Bin'].set_value(bin_data)
154+ pr['Large'] = problem_report.CompressedValue(large_val)
155+ pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')
156+
157+ report = BytesIO()
158+ pr.write(report)
159+ report.seek(0)
160+
161+ self.assertRaises(IOError, pr.extract_keys, report, 'Bin', '{}/foo'.format(self.workdir))
162+ # Test exception handling : Non-binary and inexistant key
163+ tests = {ValueError: 'Txt', ValueError: ['Foo', 'Txt'], KeyError: 'Bar', KeyError: ['Foo', 'Bar']}
164+ for test in tests.keys():
165+ report.seek(0)
166+ self.assertRaises(test, pr.extract_keys, report, tests[test], self.workdir)
167+ # Check valid single elements
168+ tests = {'Foo': b'FooFoo!', 'Uncompressed': bin_data, 'Bin': bin_data, 'Large': large_val,
169+ 'Multiline': b'\1\1\1\n\2\2\n\3\3\3'}
170+ for test in tests.keys():
171+ report.seek(0)
172+ pr.extract_keys(report, test, self.workdir)
173+ with open(os.path.join(self.workdir, test), 'rb') as element:
174+ self.assertEqual(element.read(), tests[test])
175+ element.close()
176+ # remove file for next pass
177+ os.remove(os.path.join(self.workdir, test))
178+ # Check element list
179+ report.seek(0)
180+ key_list = ['Foo', 'Uncompressed']
181+ tests = {'Foo': b'FooFoo!', 'Uncompressed': bin_data}
182+ pr.extract_keys(report, key_list, self.workdir)
183+ for key in key_list:
184+ with open(os.path.join(self.workdir, key), 'rb') as element:
185+ self.assertEqual(element.read(), tests[key])
186+
187 def test_write_file(self):
188 '''writing a report with binary file data.'''
189

Subscribers

People subscribed via source and target branches