Merge lp:~jameinel/bzr/2.0-490228-gc-segfault into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Vincent Ladeuil
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.0-490228-gc-segfault
Merge into: lp:bzr
Diff against target: 73 lines (+20/-10)
2 files modified
NEWS (+5/-0)
bzrlib/diff-delta.c (+15/-10)
To merge this branch: bzr merge lp:~jameinel/bzr/2.0-490228-gc-segfault
Reviewer: Vincent Ladeuil
Status: Approve
Review via email: mp+16139@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

Bug fix for the groupcompress code. The actual diff here also cleans up the surrounding code to be a bit more readable, but the minimal change is just:
=== modified file 'bzrlib/diff-delta.c'
--- bzrlib/diff-delta.c 2009-08-03 16:54:36 +0000
+++ bzrlib/diff-delta.c 2009-12-14 15:52:24 +0000
@@ -804,8 +804,8 @@
             old_entry--;
         }
         old_entry++;
- if (old_entry->ptr != NULL
- || old_entry >= old_index->hash[hash_offset + 1]) {
+ if (old_entry >= old_index->hash[hash_offset + 1]
+ || old_entry->ptr != NULL) {
             /* There is no room for this entry, we have to resize */
             // char buff[128];
             // get_text(buff, entry->ptr);

I don't have a good way to test this. We adapted this code from a 3rd-party source and getting at the specific structures is pretty difficult. And replicating it with real data would be very difficult. You would need:

1) Custom text data whose RABIN_HASH, when masked with the hash map mask (RABIN_HASH & hash_map_mask), falls into the last bucket.
2) Enough of them to fill the final hash bucket.
3) A memory allocator that doesn't provide any extra bytes after the end of the entry table, so that reading "entry_table[len(entry_table)]->ptr" (one past the last valid index, len()-1) gives a segfault.

I would like to revisit the groupcompress data structures, but I don't think that is a profitable use of my time right now.

Revision history for this message
Vincent Ladeuil (vila) wrote :

I would be curious to know how long you and other people involved have spent on this bug
and compare that to the time necessary to get a test infrastructure good enough to
write a test reproducing this bug...

I'm more and more convinced that each time someone says: "I don't have time to write a test"
he is just lying to himself :-(

Given that you will likely be the one writing this test in the end anyway, I'm certainly
not throwing stones as I realize that we got this code without tests to start with.
So, let's land this patch :-D

Do you believe that this bug is present upstream or that it was introduced during
the adaptation to pyrex?

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

Vincent Ladeuil wrote:
> Review: Approve
> I would be curious to know how long you and other people involved have spent on this bug
> and compare that to the time necessary to get a test infrastructure good enough to
> write a test reproducing this bug...
>

Once I had a traceback, and did a line-by-line analysis of what was
going on, maybe 20 min?

Rewriting the code to be testable is at least a day's work for me,
depending on how much I want to cheat.

The main problems are:

1) Data injection. I'd have to figure out how to generate a structure
that is ready to expose the bug. Going further, how do you do it without
generating a brittle test?

2) Measuring the error. The specific error is that we are *reading* from
an invalid address. This hasn't been failing in *lots* of circumstances,
because reading "N+1" is often not a segfault. For example, I think most
allocators reserve a whole page for the process. So allocating a 20-byte
string will have the OS give you an 8kB page to work with, and malloc
will then use the remaining 8172 bytes for other malloc requests.

However, accessing the 21st byte will not be a segfault. You have to
access the 8193rd byte to get the OS to say "hey, you don't have access
to that page, go home."

If it was writing to an invalid address, then we could check the memory
array and say "yep, X+1 doesn't have any data in it anymore, great."

3) Alternatively, we refactor the whole thing so that memory access goes
through a callback function. And then inject our own callback function
to test that we aren't accessing a bad section of memory. However, that
means that *every* access in this inner loop needs to go through a
virtual function abstraction, which would probably hurt its performance
significantly. And the gc code is definitely one of the performance
sensitive areas of our codebase...

> I'm more and more convinced that each time someone says: "I don't have time to write a test"
> he is just lying to himself :-(
>
> Given that you will likely be the one writing this test in the end anyway, I'm certainly
> not throwing stones as I realize that we got this code without tests to start with.
> So, let's land this patch :-D

If you have any hints as to a good way to test this, I'm all ears.

>
> Do you believe that this bug is present upstream or that it was introduced during
> the adaptation to pyrex ?

It is a bug from our adaptation of the code. The upstream code doesn't ever
insert new records into the hash map. It has to do with how groupcompress blocks work.

John
=:->


Revision history for this message
Vincent Ladeuil (vila) wrote :

Let me get this straight:

I approved your patch.

I think you did very well on that bug.

I'm reacting to the "no test sorry" part and trying to debug what
I consider a weakness in the way we practice TDD *in general*.

I don't *know* the answer, but I can recognize a concern here.

I think this patch is the right trade-off, but it is a trade-off
or you wouldn't have said "I don't have a good way to test this".

>>>>> "jam" == John A Meinel <email address hidden> writes:

    jam> Vincent Ladeuil wrote:
    >> Review: Approve

    >> I would be curious to know how long you and other people
    >> involved have spent on this bug and compare that to the
    >> time necessary to get a test infrastructure good enough to
    >> write a test reproducing this bug...
    >>

    jam> Once I had a traceback, and did a line-by-line analysis
    jam> of what was going on, maybe 20 min?

Yeah, *once* you had the traceback.

The time I was referring to was the one *including* the time you
spent to get there *AND* the time spent by *others* to get there.

    jam> Rewriting the code to be testable is at least a days
    jam> work for me, depending on how much I want to cheat.

Exactly, but I'm perfectly fine with cheating when testing!

The goal is to reduce the time spent to *reproduce* the bug. I
may sound like a test freak but what I really care about is to
minimize the time spent to fix bugs and it's crystal clear to me
that the most important part of that time is spent in
*reproducing* any given bug.

    jam> The main problems are:

    jam> 1) Data injection. I'd have to figure out how to
    jam> generate a structure that is ready to expose the
    jam> bug. Going further, how do you do it without generating
    jam> a brittle test?

Sometimes a brittle test is better than no test. I agree that
this is very hard to decide and that's why I try to focus on
building robust low-level code by doing TDD to the book: never
write code without a failing test. (I don't mean that *you* don't
by the way :)

And I insist on the "try", the pain threshold often makes me
resign when working on code that doesn't have a good enough
associated test infrastructure.

    jam> 2) Measuring the error.

<snip/>

    jam> However, accessing the 21st byte will not be a
    jam> segfault. You have to access the 8193rd byte to get the
    jam> OS to say "hey, you don't have access to that page, go
    jam> home."

Pain, pain :(

<snip/>

    jam> 3) Alternatively, we refactor the whole thing so that
    jam> memory access goes through a callback function.

Or maybe use some independent object that could be tested
on its own?

    jam> And the gc code is definitely one of the performance
    jam> sensitive areas of our codebase...

Make it work, make it right, make it fast.

Untested code is broken code.

I think we all agree on these principles *even* when we don't
find a good way to respect them.

I'm fine with violating them when circumstances or judgment tell
us to do so. What I'd like is that we document these violations
more carefully so that we at least know where we did.

<snip/>

    jam> If you have any hints as to a good way to test this, I'm all ears.

...


Revision history for this message
Matthew Fuller (fullermd) wrote :

> However, accessing the 21st byte will not be a segfault. You have to
> access the 8193rd byte to get the OS to say "hey, you don't have
> access to that page, go home."

Or not. Maybe there are 4k pages, or 4M pages. Or maybe the next
page is already mapped into your process from an earlier or later
allocation anyway, and you'd have to step past that. It'll vary by
OS, OS version, architecture, run of program, phase of moon...

It's hard to define a test of undefined behavior.

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-12-14 09:31:20 +0000
+++ NEWS 2009-12-14 16:00:51 +0000
@@ -172,6 +172,11 @@
 * Content filters are now applied correctly after pull, merge and switch.
   (Ian Clatworthy, #385879)

+* Fix a potential segfault in the groupcompress hash map handling code.
+  When inserting new entries, if the final hash bucket was empty, we could
+  end up trying to access if ``(last_entry+1)->ptr == NULL``.
+  (John Arbash Meinel, #490228)
+
 * Improve "Binary files differ" hunk handling. (Aaron Bentley, #436325)

 Improvements

=== modified file 'bzrlib/diff-delta.c'
--- bzrlib/diff-delta.c 2009-08-03 16:54:36 +0000
+++ bzrlib/diff-delta.c 2009-12-14 16:00:51 +0000
@@ -688,7 +688,7 @@
     const unsigned char *data, *buffer, *top;
     unsigned char cmd;
     struct delta_index *new_index;
-    struct index_entry *entry, *entries, *old_entry;
+    struct index_entry *entry, *entries;

     if (!src->buf || !src->size)
         return NULL;
@@ -789,6 +789,7 @@
     entry = entries;
     num_inserted = 0;
     for (; num_entries > 0; --num_entries, ++entry) {
+        struct index_entry *next_bucket_entry, *cur_entry, *bucket_first_entry;
         hash_offset = (entry->val & old_index->hash_mask);
         /* The basic structure is a hash => packed_entries that fit in that
          * hash bucket. Things are structured such that the hash-pointers are
@@ -797,15 +798,19 @@
          * forward. If there are no NULL targets, then we know because
          * entry->ptr will not be NULL.
          */
-        old_entry = old_index->hash[hash_offset + 1];
-        old_entry--;
-        while (old_entry->ptr == NULL
-               && old_entry >= old_index->hash[hash_offset]) {
-            old_entry--;
+        // The start of the next bucket, this may point past the end of the
+        // entry table if hash_offset is the last bucket.
+        next_bucket_entry = old_index->hash[hash_offset + 1];
+        // First entry in this bucket
+        bucket_first_entry = old_index->hash[hash_offset];
+        cur_entry = next_bucket_entry - 1;
+        while (cur_entry->ptr == NULL && cur_entry >= bucket_first_entry) {
+            cur_entry--;
         }
-        old_entry++;
-        if (old_entry->ptr != NULL
-            || old_entry >= old_index->hash[hash_offset + 1]) {
+        // cur_entry now either points at the first NULL, or it points to
+        // next_bucket_entry if there were no blank spots.
+        cur_entry++;
+        if (cur_entry >= next_bucket_entry || cur_entry->ptr != NULL) {
             /* There is no room for this entry, we have to resize */
             // char buff[128];
             // get_text(buff, entry->ptr);
@@ -822,7 +827,7 @@
             break;
         }
         num_inserted++;
-        *old_entry = *entry;
+        *cur_entry = *entry;
         /* For entries which we *do* manage to insert into old_index, we don't
          * want them double copied into the final output.
          */