Merge lp:~louis/apport/apport-unpack-extract into lp:~apport-hackers/apport/trunk

Proposed by Louis Bouchard
Status: Merged
Merged at revision: 2896
Proposed branch: lp:~louis/apport/apport-unpack-extract
Merge into: lp:~apport-hackers/apport/trunk
Diff against target: 188 lines (+132/-3)
3 files modified
bin/apport-unpack (+11/-2)
problem_report.py (+66/-0)
test/test_problem_report.py (+55/-1)
To merge this branch: bzr merge lp:~louis/apport/apport-unpack-extract
Reviewer Review Type Date Requested Status
Martin Pitt (community) Approve
Review via email: mp+247591@code.launchpad.net

Description of the change

The modifications to problem_report.py are mandatory in order for the changes in apport-unpack to work.

Changes to apport-unpack are only meaningful for Ubuntu Precise, so they could be brought in as a debian patch, hence dropped from that MP, while keeping the enhancement to problem_report.

Explanation:
============
Kernel VmCore can be very big ( multi Gigabytes) files. ProblemReport.load() does a systematic load of those in RAM which is very expensive in resource. This adds a ProblemReport.extract() method that will write directly to a file the binary element without loading it into memory first. ProblemExtact() method will write the element given as an argument to disk in the directory passed as an argument or /tmp by default.

apport-unpack is modified to extract the content of the report in two steps :
1) Load the report with binary=False and write all the non-binary element to their respective file, keeping a list of the binary element identified.
2) Use the extract() method on each element identified as binary to write directly to disk

Here is a comparison of the time taken before and after the modification :

root@crashdump:~# time apport-unpack linux-image-3.2.0-38-generic.0.crash tmp

Test on Precise:
Current version :
real 4m54.216s
user 1m16.777s
sys 0m47.919s

With extract method :
============
real 0m12.574s
user 0m5.560s
sys 0m7.004s

To post a comment you must log in.
Revision history for this message
Martin Pitt (pitti) wrote :
Download full text (7.7 KiB)

Hey Louis,

thanks for working on this! The general approach looks fine to me, I
just have some small things to fix.

Louis Bouchard [2015-01-26 14:42 -0000]:
> Changes to apport-unpack are only meaningful for Ubuntu Precise, so
> they could be brought in as a debian patch, hence dropped from that
> MP, while keeping the enhancement to problem_report.

I think it's fine to keep that in trunk. We might have other large
attachments which are better handled that way.

> === modified file 'bin/apport-unpack'
> --- bin/apport-unpack 2012-05-04 06:54:56 +0000
> +++ bin/apport-unpack 2015-01-26 14:41:51 +0000
> @@ -49,18 +49,28 @@
> except OSError as e:
> fatal(str(e))
>
> +binaries = ()

Please make that a list or a set; using a tuple in this way is
confusing. It would also be nice if you could name this "bin_keys"?

> for k in pr:
> + if pr[k] == u'':

No u'' please, just ''; u'' isn't compatible with earlier Python 3.x.

> + binaries += (k,)

Following the above, bin_keys.append(k) (list) or .add(k) (set).

> + with open(report, 'rb') as f:
> + for element in binaries:

"key" instead of element, please? Let's consistently call the report
contents "key" and "value".

> + def extract(self, file, item=None, directory='/tmp'):
> + '''Extract only one element from the problem_report
> + directly without loading the report beforehand
> + This is required for Kernel Crash Dumps that can be
> + very big and saturate the RAM
> + '''

Short descriptions need to be one line. Please rename "item" to "key",
and drop the None default for it, as None will turn the whole thing
into a no-op. Also, don't specify a default for directory; such
defaults are for CLI programs, and libraries should *never* default to
/tmp unless they very carefully avoid overwriting existing files in a
race free manner (i. e. O_CREAT|O_EXCL).

Finally, to further point out that this is for a single key instead of
the whole report, could this be named extract_key()?

> + self._assert_bin_mode(file)
> + self.data.clear()
> + key = None
> + value = None
> + b64_block = False
> + bd = None
> + # Make sure the report is at the beginning
> + file.seek(0)

I'd like to avoid this. First, it's unexpected (and undocumented), you
might deliberately call this on an open file at an offset, and second
it won't work for non-seekable fds such as reading from stdin.

> + for line in file:
> + # continuation line
> + if line.startswith(b' '):
> + if not b64_block:
> + continue
> + assert (key is not None and value is not None)
> + if b64_block:
> + l = base64.b64decode(line)
> + if bd:
> + out.write(bd.decompress(l))
> + else:
> + # lazy initialization of bd
> + # skip gzip header, if present
> + if l.startswith(b'\037\213\010'):
> + bd = zlib.decompressobj(-zlib.MAX_WBITS)
> + out.write(bd.de...

Read more...

2898. By Louis Bouchard

PEP8 cleanup

2899. By Louis Bouchard

Fix pylint complains

Predefine out
Replace fatal() by returning an exception to caller

2900. By Louis Bouchard

Replace tuple binaries by list called bin_keys

2901. By Louis Bouchard

Rename method to extract_key and fix description

2902. By Louis Bouchard

Change extract_key parameters

 Rename directory to dir and remove default
 Rename item to bin_key

2903. By Louis Bouchard

Remove py3 incompatibility

2904. By Louis Bouchard

Remove seek(0) in extract_key method

Adapt test and fix apport-unpack accordingly

2905. By Louis Bouchard

Remove unneeded modifications to report

2906. By Louis Bouchard

Replace some assertEquals by assertFalse

2907. By Louis Bouchard

Lump similar tests in one loop

2908. By Louis Bouchard

Rework the extract_key logic to simplify

Previous logic was a straight copy from the load method.
Revisit to remove unneeded complexity

2909. By Louis Bouchard

Add assertRaises tests for exception handling in extract_key

Revision history for this message
Louis Bouchard (louis) wrote :

Hello Martin,

All the changes that you requested have been implemented except for one that I don't understand :

> +
> + def test_extract(self):
> + '''extract() with various binary elements.'''
> +
> + # create a test report with binary elements
> + large_val = b'A' * 5000000
> +
> + pr = problem_report.ProblemReport()
> + pr['Txt'] = 'some text'
> + pr['MoreTxt'] = 'some more text'
> + pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
> + pr['Uncompressed'] = bin_data
> + pr['Bin'] = problem_report.CompressedValue()
> + pr['Bin'].set_value(bin_data)
> + pr['Large'] = problem_report.CompressedValue(large_val)
> + pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')

> This is really binary, so I would just drop this bit.

I'm not sure of what should be dropped.

The other slight change is the request to replace item by key : the key variable was already in use in the method, so I replaced item by bin_key to discriminate.

I also improved the test by avoiding repetition and testing the exceptions.

Let me know if there is anything else to be done.

Kind regards,

...Louis

Revision history for this message
Martin Pitt (pitti) wrote :

Thanks! This looks mostly good now, I'll merge this.

This bit in apport-unpack still has quite some optimization potential:

    for key in bin_keys:
        with open(report, 'rb') as f:
            pr.extract_key(f, key, dir)

This reads the report file n times (for #number of binary keys). It would be much faster to extract all keys sequentially in one go. This would require either changing the API of load() to generate a list of binary keys which is in the order as they appear in the file, or changing the API of extract_key() to accept a collection of keys and extract them all. The latter is obviously better in terms of API stability.

review: Approve
2910. By Louis Bouchard

Remove unneeded initialization

2911. By Louis Bouchard

Implement multiple key extract : rename to extract_keys

Modify the extract_keys() method to receive keys to extract
in as a list of elements. It will sequentially extract the listed
binary elements present in the list passed as argument.

It will throw a ValueError if some of the keys are not in binary
and will issue a KeyError if some keys are not in the report.

Adapt apport-unpack accordingly and add unit tests

2912. By Louis Bouchard

Replace format() by %s % syntax

Revision history for this message
Louis Bouchard (louis) wrote :

Hello,

Following your advice, I have modified the extract_key() method into extract_keys() which can receive a list of binary keys to extract directly in one pass.

Error handling has been changed accordingly along with improvement to the unit tests.

apport-unpack now uses this new feature which avoid reopening the report multiple times.

Revision history for this message
Martin Pitt (pitti) wrote :

Many thanks! I merged this with a few cleanups and a NEWS entry.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'bin/apport-unpack'
--- bin/apport-unpack 2012-05-04 06:54:56 +0000
+++ bin/apport-unpack 2015-02-02 14:38:34 +0000
@@ -49,18 +49,27 @@
49except OSError as e:49except OSError as e:
50 fatal(str(e))50 fatal(str(e))
5151
52bin_keys = []
52pr = problem_report.ProblemReport()53pr = problem_report.ProblemReport()
53if report == '-':54if report == '-':
54 pr.load(sys.stdin)55 pr.load(sys.stdin, binary=False)
55else:56else:
56 try:57 try:
57 with open(report, 'rb') as f:58 with open(report, 'rb') as f:
58 pr.load(f)59 pr.load(f, binary=False)
59 except IOError as e:60 except IOError as e:
60 fatal(str(e))61 fatal(str(e))
61for k in pr:62for k in pr:
63 if pr[k] == '':
64 bin_keys.append(k)
65 continue
62 with open(os.path.join(dir, k), 'wb') as f:66 with open(os.path.join(dir, k), 'wb') as f:
63 if type(pr[k]) == t_str:67 if type(pr[k]) == t_str:
64 f.write(pr[k].encode('UTF-8'))68 f.write(pr[k].encode('UTF-8'))
65 else:69 else:
66 f.write(pr[k])70 f.write(pr[k])
71try:
72 with open(report, 'rb') as f:
73 pr.extract_keys(f, bin_keys, dir)
74except IOError as e:
75 fatal(str(e))
6776
=== modified file 'problem_report.py'
--- problem_report.py 2013-09-19 14:26:33 +0000
+++ problem_report.py 2015-02-02 14:38:34 +0000
@@ -188,6 +188,72 @@
188188
189 self.old_keys = set(self.data.keys())189 self.old_keys = set(self.data.keys())
190190
191 def extract_keys(self, file, bin_keys, dir):
192 '''Extract only one binary element from the problem_report
193
194 Binary elements can be very big. This method extracts
195 directly to a file without loading the report beforehand
196 This is required for Kernel Crash Dumps that can be
197 very big and saturate the RAM
198 '''
199 self._assert_bin_mode(file)
200 if not isinstance(bin_keys, list):
201 bin_keys = [bin_keys]
202 key = None
203 value = None
204 has_key = {key: False for key in bin_keys}
205 b64_block = {}
206 bd = None
207 out = None
208 for line in file:
209 # Identify the bin_keys we're looking for
210 while not line.startswith(b' '):
211 (key, value) = line.split(b':', 1)
212 if not _python2:
213 key = key.decode('ASCII')
214 if key not in bin_keys:
215 break
216 b64_block[key] = False
217 has_key[key] = True
218 value = value.strip()
219 if value == b'base64':
220 value = b''
221 b64_block[key] = True
222 try:
223 bd = None
224 with open(os.path.join(dir, key), 'wb') as out:
225 for line in file:
226 # continuation line
227 if line.startswith(b' '):
228 assert (key is not None and value is not None)
229 if b64_block[key]:
230 l = base64.b64decode(line)
231 if bd:
232 out.write(bd.decompress(l))
233 else:
234 # lazy initialization of bd
235 # skip gzip header, if present
236 if l.startswith(b'\037\213\010'):
237 bd = zlib.decompressobj(-zlib.MAX_WBITS)
238 out.write(bd.decompress(self._strip_gzip_header(l)))
239 else:
240 # legacy zlib-only format used default block
241 # size
242 bd = zlib.decompressobj()
243 out.write(bd.decompress(l))
244 else:
245 break
246 except IOError:
247 raise IOError('unable to open %s' % (os.path.join(dir, key)))
248 else:
249 break
250 if False in has_key.values():
251 raise KeyError('Cannot find %s in report' %
252 [item for item, element in has_key.items() if element is False])
253 if False in b64_block.values():
254 raise ValueError('%s has no binary content' %
255 [item for item, element in b64_block.items() if element is False])
256
191 def has_removed_fields(self):257 def has_removed_fields(self):
192 '''Check if the report has any keys which were not loaded.258 '''Check if the report has any keys which were not loaded.
193259
194260
=== modified file 'test/test_problem_report.py'
--- test/test_problem_report.py 2012-10-11 05:56:35 +0000
+++ test/test_problem_report.py 2015-02-02 14:38:34 +0000
@@ -1,5 +1,5 @@
1# vim: set encoding=UTF-8 fileencoding=UTF-8 :1# vim: set encoding=UTF-8 fileencoding=UTF-8 :
2import unittest, tempfile, os, email, gzip, time, sys2import unittest, tempfile, os, shutil, email, gzip, time, sys
33
4from io import BytesIO4from io import BytesIO
5import problem_report5import problem_report
@@ -12,6 +12,14 @@
1212
1313
14class T(unittest.TestCase):14class T(unittest.TestCase):
15 @classmethod
16 def setUp(self):
17 self.workdir = tempfile.mkdtemp()
18
19 @classmethod
20 def tearDown(self):
21 shutil.rmtree(self.workdir)
22
15 def test_basic_operations(self):23 def test_basic_operations(self):
16 '''basic creation and operation.'''24 '''basic creation and operation.'''
1725
@@ -258,6 +266,52 @@
258 pr.load(BytesIO(b'ProblemType: Crash'))266 pr.load(BytesIO(b'ProblemType: Crash'))
259 self.assertEqual(list(pr.keys()), ['ProblemType'])267 self.assertEqual(list(pr.keys()), ['ProblemType'])
260268
269 def test_extract_keys(self):
270 '''extract_keys() with various binary elements.'''
271
272 # create a test report with binary elements
273 large_val = b'A' * 5000000
274
275 pr = problem_report.ProblemReport()
276 pr['Txt'] = 'some text'
277 pr['MoreTxt'] = 'some more text'
278 pr['Foo'] = problem_report.CompressedValue(b'FooFoo!')
279 pr['Uncompressed'] = bin_data
280 pr['Bin'] = problem_report.CompressedValue()
281 pr['Bin'].set_value(bin_data)
282 pr['Large'] = problem_report.CompressedValue(large_val)
283 pr['Multiline'] = problem_report.CompressedValue(b'\1\1\1\n\2\2\n\3\3\3')
284
285 report = BytesIO()
286 pr.write(report)
287 report.seek(0)
288
289 self.assertRaises(IOError, pr.extract_keys, report, 'Bin', '{}/foo'.format(self.workdir))
290 # Test exception handling : Non-binary and inexistant key
291 tests = {ValueError: 'Txt', ValueError: ['Foo', 'Txt'], KeyError: 'Bar', KeyError: ['Foo', 'Bar']}
292 for test in tests.keys():
293 report.seek(0)
294 self.assertRaises(test, pr.extract_keys, report, tests[test], self.workdir)
295 # Check valid single elements
296 tests = {'Foo': b'FooFoo!', 'Uncompressed': bin_data, 'Bin': bin_data, 'Large': large_val,
297 'Multiline': b'\1\1\1\n\2\2\n\3\3\3'}
298 for test in tests.keys():
299 report.seek(0)
300 pr.extract_keys(report, test, self.workdir)
301 with open(os.path.join(self.workdir, test), 'rb') as element:
302 self.assertEqual(element.read(), tests[test])
303 element.close()
304 # remove file for next pass
305 os.remove(os.path.join(self.workdir, test))
306 # Check element list
307 report.seek(0)
308 key_list = ['Foo', 'Uncompressed']
309 tests = {'Foo': b'FooFoo!', 'Uncompressed': bin_data}
310 pr.extract_keys(report, key_list, self.workdir)
311 for key in key_list:
312 with open(os.path.join(self.workdir, key), 'rb') as element:
313 self.assertEqual(element.read(), tests[key])
314
261 def test_write_file(self):315 def test_write_file(self):
262 '''writing a report with binary file data.'''316 '''writing a report with binary file data.'''
263317

Subscribers

People subscribed via source and target branches