Merge lp:~jameinel/bzr/2.0-490228-gc-segfault into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Vincent Ladeuil
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.0-490228-gc-segfault
Merge into: lp:bzr
Diff against target: 73 lines (+20/-10)
2 files modified
NEWS (+5/-0)
bzrlib/diff-delta.c (+15/-10)
To merge this branch: bzr merge lp:~jameinel/bzr/2.0-490228-gc-segfault
Reviewer: Vincent Ladeuil
Status: Approve
Review via email: mp+16139@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

Bug fix for the groupcompress code. The actual diff here also cleans up the surrounding code to be a bit more readable, but the minimal change is just:
=== modified file 'bzrlib/diff-delta.c'
--- bzrlib/diff-delta.c 2009-08-03 16:54:36 +0000
+++ bzrlib/diff-delta.c 2009-12-14 15:52:24 +0000
@@ -804,8 +804,8 @@
             old_entry--;
         }
         old_entry++;
- if (old_entry->ptr != NULL
- || old_entry >= old_index->hash[hash_offset + 1]) {
+ if (old_entry >= old_index->hash[hash_offset + 1]
+ || old_entry->ptr != NULL) {
             /* There is no room for this entry, we have to resize */
             // char buff[128];
             // get_text(buff, entry->ptr);

I don't have a good way to test this. We adapted this code from a 3rd-party source and getting at the specific structures is pretty difficult. And replicating it with real data would be very difficult. You would need:

1) Custom text data whose RABIN_HASH, when masked with the hash map mask (RABIN_HASH & hash_map_mask), falls into the last bucket.
2) Enough of them to fill the final hash bucket.
3) A memory allocator that doesn't provide any extra bytes after the end of the entry table, so that reading "entry_table[len(entry_table)]->ptr" (one past the last valid index, len()-1) gives a segfault.

I would like to revisit the groupcompress data structures, but I don't think that is a profitable use of my time right now.

Revision history for this message
Vincent Ladeuil (vila) wrote :

I would be curious to know how long you and other people involved have spent on this bug
and compare that to the time necessary to get a test infrastructure good enough to
write a test reproducing this bug...

I'm more and more convinced that each time someone says: "I don't have time to write a test"
he is just lying to himself :-(

Given that you will likely be the one writing this test in the end anyway, I'm certainly
not throwing stones as I realize that we got this code without tests to start with.
So, let's land this patch :-D

Do you believe that this bug is present upstream or that it was introduced during
the adaptation to pyrex?

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

Vincent Ladeuil wrote:
> Review: Approve
> I would be curious to know how long you and other people involved have spent on this bug
> and compare that to the time necessary to get a test infrastructure good enough to
> write a test reproducing this bug...
>

Once I had a traceback, and did a line-by-line analysis of what was
going on, maybe 20 min?

Rewriting the code to be testable is at least a day's work for me,
depending on how much I want to cheat.

The main problems are:

1) Data injection. I'd have to figure out how to generate a structure
that is ready to expose the bug. Going further, how do you do it without
generating a brittle test?

2) Measuring the error. The specific error is that we are *reading* from
an invalid address. This hasn't been failing in *lots* of circumstances,
because reading "N+1" is often not a segfault. For example, I think most
allocators reserve a whole page for the process. So allocating a 20-byte
string will have the OS give you an 8kB page to work with, and malloc
will then use the remaining 8172 bytes for other malloc requests.

However, accessing the 21st byte will not be a segfault. You have to
access the 8193rd byte to get the OS to say "hey, you don't have access
to that page, go home."

If it was writing to an invalid address, then we could check the memory
array and say "yep, X+1 doesn't have any data in it anymore, great."

3) Alternatively, we refactor the whole thing so that memory access goes
through a callback function. And then inject our own callback function
to test that we aren't accessing a bad section of memory. However, that
means that *every* access in this inner loop needs to go through a
virtual function abstraction, which would probably hurt its performance
significantly. And the gc code is definitely one of the performance
sensitive areas of our codebase...

> I'm more and more convinced that each time someone says: "I don't have time to write a test"
> he is just lying to himself :-(
>
> Given that you will likely be the one writing this test in the end anyway, I'm certainly
> not throwing stones as I realize that we got this code without tests to start with.
> So, let's land this patch :-D

If you have any hints as to a good way to test this, I'm all ears.

>
> Do you believe that this bug is present upstream or that it was introduced during
> the adaptation to pyrex ?

It is a bug from our adaptation of the code. The upstream code doesn't ever
insert new records into the hash map. It has to do with how groupcompress blocks work.

John
=:->


Revision history for this message
Vincent Ladeuil (vila) wrote :

Let me get this straight:

I approved your patch.

I think you did very well on that bug.

I'm reacting to the "no test sorry" part and trying to debug what
I consider a weakness in the way we practice TDD *in general*.

I don't *know* the answer, but I can recognize a concern here.

I think this patch is the right trade-off, but it is a trade-off
or you wouldn't have said "I don't have a good way to test this".

>>>>> "jam" == John A Meinel <email address hidden> writes:

    jam> Vincent Ladeuil wrote:
    >> Review: Approve

    >> I would be curious to know how long you and other people
    >> involved have spent on this bug and compare that to the
    >> time necessary to get a test infrastructure good enough to
    >> write a test reproducing this bug...
    >>

    jam> Once I had a traceback, and did a line-by-line analysis
    jam> of what was going on, maybe 20 min?

Yeah, *once* you had the traceback.

The time I was referring to was the one *including* the time you
spent to get there *AND* the time spent by *others* to get there.

    jam> Rewriting the code to be testable is at least a days
    jam> work for me, depending on how much I want to cheat.

Exactly, but I'm perfectly fine with cheating when testing!

The goal is to reduce the time spent to *reproduce* the bug. I
may sound like a test freak but what I really care about is to
minimize the time spent to fix bugs and it's crystal clear to me
that the most important part of that time is spent in
*reproducing* any given bug.

    jam> The main problems are:

    jam> 1) Data injection. I'd have to figure out how to
    jam> generate a structure that is ready to expose the
    jam> bug. Going further, how do you do it without generating
    jam> a brittle test?

Sometimes a brittle test is better than no test. I agree that
this is very hard to decide and that's why I try to focus on
building robust low-level code by doing TDD to the book: never
write code without a failing test. (I don't mean that *you* don't
by the way :)

And I insist on the "try", the pain threshold often makes me
resign when working on code that doesn't have a good enough
associated test infrastructure.

    jam> 2) Measuring the error.

<snip/>

    jam> However, accessing the 21st byte will not be a
    jam> segfault. You have to access the 8193rd byte to get the
    jam> OS to say "hey, you don't have access to that page, go
    jam> home."

Pain, pain :(

<snip/>

    jam> 3) Alternatively, we refactor the whole thing so that
    jam> memory access goes through a callback function.

Or maybe use some independent object that could be tested
on its own?

    jam> And the gc code is definitely one of the performance
    jam> sensitive areas of our codebase...

Make it work, make it right, make it fast.

Untested code is broken code.

I think we all agree on these principles *even* when we don't
find a good way to respect them.

I'm fine with violating them when circumstances or judgment tell
us to do so. What I'd like is that we document these violations
more carefully so that we at least know where we did.

<snip/>

    jam> If you have any hints as to a good way to test this, I'm all ears.

...


Revision history for this message
Matthew Fuller (fullermd) wrote :

> However, accessing the 21st byte will not be a segfault. You have to
> access the 8193rd byte to get the OS to say "hey, you don't have
> access to that page, go home."

Or not. Maybe there are 4k pages, or 4M pages. Or maybe the next
page is already mapped into your process from an earlier or later
allocation anyway, and you'd have to step past that. It'll vary by
OS, OS version, architecture, run of program, phase of moon...

It's hard to define a test of undefined behavior.

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-12-14 09:31:20 +0000
+++ NEWS 2009-12-14 16:00:51 +0000
@@ -172,6 +172,11 @@
 * Content filters are now applied correctly after pull, merge and switch.
   (Ian Clatworthy, #385879)

+* Fix a potential segfault in the groupcompress hash map handling code.
+  When inserting new entries, if the final hash bucket was empty, we could
+  end up trying to access if ``(last_entry+1)->ptr == NULL``.
+  (John Arbash Meinel, #490228)
+
 * Improve "Binary files differ" hunk handling. (Aaron Bentley, #436325)

 Improvements

=== modified file 'bzrlib/diff-delta.c'
--- bzrlib/diff-delta.c 2009-08-03 16:54:36 +0000
+++ bzrlib/diff-delta.c 2009-12-14 16:00:51 +0000
@@ -688,7 +688,7 @@
     const unsigned char *data, *buffer, *top;
     unsigned char cmd;
     struct delta_index *new_index;
-    struct index_entry *entry, *entries, *old_entry;
+    struct index_entry *entry, *entries;

     if (!src->buf || !src->size)
         return NULL;
@@ -789,6 +789,7 @@
     entry = entries;
     num_inserted = 0;
     for (; num_entries > 0; --num_entries, ++entry) {
+        struct index_entry *next_bucket_entry, *cur_entry, *bucket_first_entry;
         hash_offset = (entry->val & old_index->hash_mask);
         /* The basic structure is a hash => packed_entries that fit in that
          * hash bucket. Things are structured such that the hash-pointers are
@@ -797,15 +798,19 @@
          * forward. If there are no NULL targets, then we know because
          * entry->ptr will not be NULL.
          */
-        old_entry = old_index->hash[hash_offset + 1];
-        old_entry--;
-        while (old_entry->ptr == NULL
-               && old_entry >= old_index->hash[hash_offset]) {
-            old_entry--;
+        // The start of the next bucket, this may point past the end of the
+        // entry table if hash_offset is the last bucket.
+        next_bucket_entry = old_index->hash[hash_offset + 1];
+        // First entry in this bucket
+        bucket_first_entry = old_index->hash[hash_offset];
+        cur_entry = next_bucket_entry - 1;
+        while (cur_entry->ptr == NULL && cur_entry >= bucket_first_entry) {
+            cur_entry--;
         }
-        old_entry++;
-        if (old_entry->ptr != NULL
-            || old_entry >= old_index->hash[hash_offset + 1]) {
+        // cur_entry now either points at the first NULL, or it points to
+        // next_bucket_entry if there were no blank spots.
+        cur_entry++;
+        if (cur_entry >= next_bucket_entry || cur_entry->ptr != NULL) {
             /* There is no room for this entry, we have to resize */
             // char buff[128];
             // get_text(buff, entry->ptr);
@@ -822,7 +827,7 @@
             break;
         }
         num_inserted++;
-        *old_entry = *entry;
+        *cur_entry = *entry;
         /* For entries which we *do* manage to insert into old_index, we don't
          * want them double copied into the final output.
          */