Merge lp:~jameinel/bzr/2.1-simple-set into lp:bzr

Proposed by John A Meinel
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.1-simple-set
Merge into: lp:bzr
Diff against target: 1146 lines
8 files modified
.bzrignore (+3/-0)
NEWS (+7/-0)
bzrlib/_simple_set_pyx.pxd (+91/-0)
bzrlib/_simple_set_pyx.pyx (+600/-0)
bzrlib/python-compat.h (+5/-0)
bzrlib/tests/__init__.py (+1/-0)
bzrlib/tests/test__simple_set.py (+371/-0)
setup.py (+1/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.1-simple-set
Reviewer: Andrew Bennetts (Approve)
Review via email: mp+13039@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

This is the first in my series of patches to lower memory overhead with StaticTuple objects.

I picked this first because it wasn't strictly dependent on the other code, but I use it in the implementation of StaticTuple's interning.

Anyway, this introduces a "SimpleSet" class. For now, I have a pyrex version but no pure-python version, as it is only currently used in StaticTuple. I may go to the point of implementing a python version, though it is fairly close in api to a set or a dict. The big thing is that we don't really have values like a dict, but a set() doesn't let you get access to the object which is stored in the internal table.

So the three primary wins for this class are:

1) Don't cache the hash of every object. For the types of object we are putting in here, the hash should be relatively cheap to compute. Even further, though, the times when you need the hash are:
  a) When inserting a new object, you need to know its hash, but you haven't cached it yet anyway.
  b) When resolving a collision, you compare the hash to the cached value as a 'cheap' comparison.
However, the number of collisions is based on the quality of your hash, your collision avoidance algorithm, and the size of your table. We can't really change the hash function. We could use 'quadratic hash', but what sets use seems pretty good anyway (it mixes in more of the upper bits, so you probably get divergence faster, but you also lose locality...)
As for the size of the table: it takes the same number of bytes to cache a 'long hash' as it does to hold a 'PyObject *'. Which means that in the same memory, you could cache hashes *or* double your addressable space and halve the number of collisions.

2) Allow lookups, so we don't need to use a Dict, which has yet another pointer per address. (So for the same number of entries, this will generally be 1/3rd the size of the equivalent dict, and 1/2 the size of the equivalent set.)

3) Have a single function for the equivalent of 'key = dict.setdefault(key, key)'. At the C api, you could use dict->lookup, which is a rather private function, or you would generally use PyDict_GetItem() followed by PyDict_SetItem().
With SimpleSet_Add(), it returns the object stored there, so you have a single lookup. (Only really important in the case of collisions, where you may have several steps that need to be repeated.)
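The memory arithmetic behind point (1) can be sketched in a few lines (back-of-the-envelope only, assuming a 64-bit build where a cached 'long' hash and a 'PyObject *' each take 8 bytes):

```python
POINTER_BYTES = 8  # assumption: 64-bit build

def set_slot_bytes():
    # a set/dict-style entry caches the hash alongside the pointer
    return POINTER_BYTES + POINTER_BYTES

def simpleset_slot_bytes():
    # SimpleSet stores only the pointer
    return POINTER_BYTES

# With the same memory budget, SimpleSet addresses twice the slots,
# which halves the expected number of collisions.
budget = 1024 * set_slot_bytes()
assert budget // simpleset_slot_bytes() == 2 * 1024
```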

Because of the memory savings, I may look into using this elsewhere in our code base. If I do, then I will certainly implement a Python version (probably just subclassing a dict and exposing a 'def add(self, key): return self.setdefault(key, key)' method, and whatever else I specifically need).
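Such a pure-Python fallback could be as small as this (a sketch of the subclass-a-dict idea floated above, not code from the branch):

```python
class SimpleSet(dict):
    """Pure-Python stand-in: a dict that maps each key to itself.

    add() returns the canonical (first-stored) object, which is the
    interning behaviour the pyrex version provides.
    """

    def add(self, key):
        return self.setdefault(key, key)

    def discard(self, key):
        self.pop(key, None)


dedup = SimpleSet()
a = tuple([1, 2])
b = tuple([1, 2])          # equal to a, but a distinct object
assert dedup.add(a) is a
assert dedup.add(b) is a   # deduplicated to the canonical copy
dedup.discard(a)
assert len(dedup) == 0
```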

As for the specific memory savings here, dicts and sets are resized to average 50% full (resizing at 67% full, etc.), as is SimpleSet [theory says hash tables fall apart at about 80% full]. The idea is to start interning the StaticTuple objects I'm creating. (For every key, you average at least 2 references, one for the key, and one for all of its children.)
When loading all of launchpad, that translates into about 500k strings, and 1.1M interned StaticTuples (there are actually more, because of how the btree code uses tuples-of...


Revision history for this message
Andrew Bennetts (spiv) wrote :

I'm very excited by this patch series!

This one is:

 review approve

Although there are some comments you should look at, especially regarding
licensing...

John A Meinel wrote:
[...]
> === added file 'bzrlib/_simple_set_pyx.pxd'
[...]
> +"""Interface definition of a class like PySet but without caching the hash.
> +
> +This is generally useful when you want to 'intern' objects, etc. Note that this
> +differs from Set in that we:
> + 1) Don't have all of the .intersection, .difference, etc functions
> + 2) Do return the object from the set via queries
> + eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
> +"""

I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
It's a slightly unusual mix of dict-like and set-like APIs. I think it's fine
after some thinking about it, I'm just noting that it seems a bit funny at
first. Although maybe .remove would be more set-like than .__delitem__?

> +cdef api object SimpleSet_Add(object self, object key)
> +cdef api SimpleSet SimpleSet_New()
> +cdef api object SimpleSet_Add(object self, object key)

That appears to be a duplicate declaration of SimpleSet_Add.

> +cdef api int SimpleSet_Contains(object self, object key) except -1
> +cdef api int SimpleSet_Discard(object self, object key) except -1
> +cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL
> +cdef api Py_ssize_t SimpleSet_Size(object self) except -1
> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key)
>
> === added file 'bzrlib/_simple_set_pyx.pyx'
[...]
> +cdef object _dummy_obj
> +cdef PyObject *_dummy
> +_dummy_obj = object()
> +_dummy = <PyObject *>_dummy_obj

It's not very clear what _dummy is used for. I guess what's missing is some
text giving an overview of the data structure, although I suppose that's what
you mean to convey by the docstrings saying it is similar to the builtin
dict/set types.

Anyway, it appears _dummy is used to avoid resizing/compacting the
table for every single discard. [Part of why it would be nice to have some text
describing the data structure is so that there's some clear terminology to
use... the code pretty clearly talks about “tables” and “slots”, which seem
fairly clear, but then what's an “entry”?]
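For readers following along, the role of _dummy can be shown with a toy open-addressing table (simple linear probing here, not the perturb-based probing the real code uses; all names are illustrative):

```python
NULL, DUMMY = object(), object()  # sentinels: empty slot vs deleted slot

class ToyTable:
    def __init__(self, size=8):        # size must be a power of two
        self.mask = size - 1
        self.slots = [NULL] * size

    def _probe(self, key):
        # walk the probe chain; DUMMY and colliding keys are skipped,
        # we stop only at NULL or at the key itself
        i = hash(key) & self.mask
        while self.slots[i] not in (NULL, key):
            i = (i + 1) & self.mask
        return i

    def add(self, key):
        self.slots[self._probe(key)] = key

    def discard(self, key):
        i = self._probe(key)
        if self.slots[i] is key:
            self.slots[i] = DUMMY  # mark deleted, keep probe chains intact

    def __contains__(self, key):
        return self.slots[self._probe(key)] is key

t = ToyTable()
t.add(1)
t.add(9)        # 9 & 7 == 1 & 7: collides with 1, lands in the next slot
t.discard(1)    # slot becomes DUMMY, not NULL
assert 9 in t   # probing walks over DUMMY, so 9 is still reachable
```

If discard had written NULL instead of DUMMY, the probe for 9 would have stopped at the emptied slot and wrongly reported 9 as absent.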

> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> + cdef long other_hash
> + cdef PyObject *res
> +
> + if this == other:
> + return 1
> + other_hash = Py_TYPE(other).tp_hash(other)

What happens if 'other' is not hashable?

> + if other_hash != this_hash:
> + return 0
> + res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)

Similarly, what if one of the richcompare calls raises an exception?

> +cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
[...]
> + cdef int _insert_clean(self, PyObject *key) except -1:
> + """Insert a key into self.table.
> +
> + This is only meant to be used during times like '_resize',
> + as it makes a lot of assumptions about keys not already being present,
> + and there being no dummy entries.
> + """
> + cdef size_t i, perturb, mask
> + cdef lon...

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
...

>> + 1) Don't have all of the .intersection, .difference, etc functions
>> + 2) Do return the object from the set via queries
>> + eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
>> +"""
>
> I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
> It's a slightly unusual mix of dict-like and set-like APIs. I think it's fine
> after some thinking about it, I'm just noting that it seems a bit funny at
> first. Although maybe .remove would be more set-like than .__delitem__?

So at the moment, the only apis I need for interning are:
  add() and discard()

I could certainly remove the __getitem__ and __delitem__ if you are more
comfortable with it.

I want to have an api that fits what we need/want to use. As such, it
will probably be driven by actual use cases. And so far, there is only 1
of those...

I considered changing the semantics, such that "__getitem__" == "add()".
The main reason for that is because it avoids the getattr() overhead, as
__getitem__ is a slot in the type struct (tp_item) while .add() requires
an attribute lookup.

Which basically means that in an interning loop you would have:

for bar in group:
  bar = dedup[bar]

rather than

  bar = dedup.add(bar)

or the old form

  bar = dict.setdefault(bar, bar)

I would guess that doing

add = dedup.add
for bar in group:
  bar = add(bar)

Is going to perform the same as the dedup[bar] form. But both should do
better than setdefault. (At a minimum, setdefault takes 2 parameters,
and thus has to create a tuple, etc.)

I don't know if the interpreter has further internal benefits to using
the __getitem__ form, since it is a known api that has to conform in
certain ways.

The big concern is that "dedup[bar]" can mutate 'dedup' and that is
potentially unexpected.
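The two candidate spellings behave identically; only the dispatch cost (and the surprise factor of a mutating __getitem__) differ. A dict subclass serves as a stand-in for the pyrex class here:

```python
class Dedup(dict):
    """Hypothetical stand-in exposing both spellings under discussion."""

    def add(self, key):
        return self.setdefault(key, key)

    # dedup[key] also inserts: the tp_item-slot variant being debated
    __getitem__ = add


dedup = Dedup()
a, b = tuple([1]), tuple([1])   # equal, but distinct objects
add = dedup.add                 # hoist the attribute lookup out of the loop
assert add(a) is a
assert dedup[b] is a            # same semantics, dispatched via the slot
```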

>
>> +cdef api object SimpleSet_Add(object self, object key)
>> +cdef api SimpleSet SimpleSet_New()
>> +cdef api object SimpleSet_Add(object self, object key)
>
> That appears to be a duplicate declaration of SimpleSet_Add.

Thanks.

...

> [...]
>> +cdef object _dummy_obj
>> +cdef PyObject *_dummy
>> +_dummy_obj = object()
>> +_dummy = <PyObject *>_dummy_obj
>
> It's not very clear what _dummy is used for. I guess what's missing is some
> text giving an overview of the data structure, although I suppose that's what
> you mean to convey by the docstrings saying it is similar to the builtin
> dict/set types.
>
> Anyway, it appears _dummy is used to avoid resizing/compacting the
> table for every single discard. [Part of why it would be nice to have some text
> describing the data structure is so that there's some clear terminology to
> use... the code pretty clearly talks about “tables” and “slots”, which seem
> fairly clear, but then what's an “entry”?]

How's this:

 # Data structure definition:
 # This is a basic hash table using open addressing.
 # http://en.wikipedia.org/wiki/Open_addressing
 # Basically that means we keep an array of pointers to Python objects
 # (called a table). Each location in the array is called a 'slot'.
 #
 # An empty slot holds a NULL pointer, a ...

Revision history for this message
Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
[...]
> I would guess that doing
>
> add = dedup.add
> for bar in group:
> bar = add(bar)
>
> Is going to perform the same as the dedup[bar] form. But both should do
> better than setdefault. (At a minimum, setdefault takes 2 parameters,
> and thus has to create a tuple, etc.)
>
> I don't know if the interpreter has further internal benefits to using
> the __getitem__ form, since it is a known api that has to conform in
> certain ways.

Right, I think you'd still see a bit of benefit in the __getitem__ form because
Python knows that that operator only takes one arg and so avoids packing and
unpacking an args tuple for it. (Like METH_O vs. METH_VARARGS.) ISTR some part
of the zope.interface C extension intentionally abuses an obscure no-arg
operator (maybe pos?) as that's the fastest way to invoke C code from Python.

> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
> potentially unexpected.

Agreed. For readability my preference is .add, but obviously sufficiently large
performance benefits may override that. I don't *think* the boost from using
__getitem__ instead would be that much greater for this use, but I haven't
measured so I may be very wrong :)

[...]
> How's this:
>
> # Data structure definition:

Great!

> >> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> >> + cdef long other_hash
> >> + cdef PyObject *res
> >> +
> >> + if this == other:
> >> + return 1
> >> + other_hash = Py_TYPE(other).tp_hash(other)
> >
> > What happens if 'other' is not hashable?
>
> Other will always be something that is already held in the internal
> structure, and thus has been hashed. 'this' has also already been hashed
> as part of inserting, and thus we've also checked that it can be hashed.
>
> I can change this to PyObject_Hash() if you feel strongly. I was
> avoiding a function call overhead, though it probably doesn't have a
> huge impact on performance.

Well, badly-behaved objects might successfully give a hash value the first time
and then get an error later. I do strongly lean towards paranoia in C code, so
I think PyObject_Hash is probably safest. You ought to be able to construct a
test that defines such a badly-behaved object to see just how bad the fallout
is. If segfaults are possible then definitely close that hole!

Also, PyObject_Hash is where Python implements the logic that an object with no
tp_hash in its type will use hash(id(obj)), so long as it doesn't have
tp_compare or tp_richcompare. I think this is pretty common, e.g. the module
type has no tp_hash. And given that you use the 'hash' global elsewhere in this
module there's definitely some chance for confusion here.

But for the case you're optimising for PyObject_Hash will just call the object's
tp_hash immediately and return the result, so it should be a minimal penalty,
just a C function call overhead.

The thing I was really looking for though was checking the result value of
tp_hash/PyObject_Hash — if it's -1 then there's an error to be dealt with.
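The badly-behaved object described here is easy to construct in pure Python (a hypothetical example, not from the branch's test suite); it is the case tp_hash's -1 return value, with an exception set, corresponds to at the C level:

```python
class FlakyHash:
    """Hashes successfully once, then raises on every later call."""

    def __init__(self):
        self.calls = 0

    def __hash__(self):
        self.calls += 1
        if self.calls > 1:
            raise RuntimeError("changed my mind")
        return 42


obj = FlakyHash()
assert hash(obj) == 42   # first call succeeds, so obj can be inserted
try:
    hash(obj)            # a later lookup re-hashes and blows up
except RuntimeError:
    pass
else:
    raise AssertionError("expected the second hash() to fail")
```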

> >> + if other_hash != this_hash:
> >> + return 0
> >> + res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
>...

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> I would guess that doing
>>
>> add = dedup.add
>> for bar in group:
>> bar = add(bar)
>>
>> Is going to perform the same as the dedup[bar] form. But both should do
>> better than setdefault. (At a minimum, setdefault takes 2 parameters,
>> and thus has to create a tuple, etc.)
>>
>> I don't know if the interpreter has further internal benefits to using
>> the __getitem__ form, since it is a known api that has to conform in
>> certain ways.
>
> Right, I think you'd still see a bit of benefit in the __getitem__ form because
> Python knows that that operator only takes one arg and so avoids packing and
> unpacking an args tuple for it. (Like METH_O vs. METH_VARARGS.) ISTR some part
> of the zope.interface C extension intentionally abuses an obscure no-arg
> operator (maybe pos?) as that's the fastest way to invoke C code from Python.
>
>> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
>> potentially unexpected.
>
> Agreed. For readability my preference is .add, but obviously sufficiently large
> performance benefits may override that. I don't *think* the boost from using
> __getitem__ instead would be that much greater for this use, but I haven't
> measured so I may be very wrong :)
>
> [...]
>> How's this:
>>
>> # Data structure definition:
>
> Great!
>
>>>> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
>>>> + cdef long other_hash
>>>> + cdef PyObject *res
>>>> +
>>>> + if this == other:
>>>> + return 1
>>>> + other_hash = Py_TYPE(other).tp_hash(other)
>>> What happens if 'other' is not hashable?
>> Other will always be something that is already held in the internal
>> structure, and thus has been hashed. 'this' has also already been hashed
>> as part of inserting, and thus we've also checked that it can be hashed.
>>
>> I can change this to PyObject_Hash() if you feel strongly. I was
>> avoiding a function call overhead, though it probably doesn't have a
>> huge impact on performance.
>
> Well, badly-behaved objects might successfully give a hash value the first time
> and then get an error later. I do strongly lean towards paranoia in C code, so
> I think PyObject_Hash is probably safest. You ought to be able to construct a
> test that defines such a badly-behaved object to see just how bad the fallout
> is. If segfaults are possible then definitely close that hole!

No segfaults, but the potential to leave an error in the pipe. Which
causes random failures IIRC. (I forget the function where '-1' is maybe
an error, so you have to check PyErr_Occurred, etc.)

>
> Also, PyObject_Hash is where Python implements the logic that an object with no
> tp_hash in its type will use hash(id(obj)), so long as it doesn't have
> tp_compare or tp_richcompare. I think this is pretty common, e.g. the module
> type has no tp_hash. And given that you use the 'hash' global elsewhere in this
> module there's definitely some chance for confusion here.

So in my "_insert" I now have the assertion:
        if (Py_TYPE(py_key).tp_richcompare == NULL
            or Py_TYPE(py_key).tp_...


Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> Again, PyObject_IsTrue tries == Py_True as the very first thing, so the cost
> should be small. If you're worried a macro along the lines of:
>
> #define PY_IS_TRUE(o) ((o == Py_True) || PyObject_IsTrue(o))
>
> might make us both happy.
>

So it turns out that this fix actually causes crazy corruption. Specifically

assert PyObject_IsTrue(Py_NotImplemented)

passes.

In other words "Py_NotImplemented" evaluates to True.
You can also see that with:

>>> bool(NotImplemented)
True

So the code was doing:

if res == NULL:
  return -1
if PyObject_IsTrue(res):
  ...
if res == Py_NotImplemented:
  # reverse the comparison.

Anyway, I've tracked this down to some crazy interning issues I've been
seeing. (You have to get a hash collision *and* have objects that return
NotImplemented.)
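The trap generalizes beyond C: an equality helper must test for NotImplemented before testing truthiness, because bool(NotImplemented) is True. An illustrative pure-Python version of the corrected ordering (hypothetical helper, mirroring CPython's fallback rules):

```python
class NeverSure:
    """An object whose __eq__ always defers."""

    def __eq__(self, other):
        return NotImplemented

    __hash__ = object.__hash__


def is_equal(a, b):
    res = type(a).__eq__(a, b)
    if res is NotImplemented:        # must be checked FIRST...
        res = type(b).__eq__(b, a)   # ...then try the reversed comparison
        if res is NotImplemented:
            return a is b            # CPython's final identity fallback
    return bool(res)                 # ...only then test truthiness


x, y = NeverSure(), NeverSure()
assert is_equal(x, x) is True
assert is_equal(x, y) is False   # truthiness-first would wrongly say True
```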

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrV72AACgkQJdeBCYSNAAPiOgCdGOnxjBvPH08fKYxa0vjXHSGu
yfsAn0Mn8b62R8wn8jHsttXoBZEILhja
=3Wq/
-----END PGP SIGNATURE-----

Revision history for this message
Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
[...]
> So it turns out that this fix actually causes crazy corruption. Specifically
>
> assert PyObject_IsTrue(Py_NotImplemented)
>
> passes.

Ah, yes, it would.

[...]
> So the code was doing:
>
> if res == NULL:
> return -1
> if PyObject_IsTrue(res):
> ...
> if res == Py_NotImplemented:
> # reverse the comparison.
>

So I guess you need to change this to:

if res == NULL:
    return -1
elif res == Py_NotImplemented:
    # reverse the comparison
    ...
elif PyObject_IsTrue(res):
    ...

And similarly for handling the res of the reversed comparison.

-Andrew.

Preview Diff

=== modified file '.bzrignore'
--- .bzrignore 2009-09-09 11:43:10 +0000
+++ .bzrignore 2009-10-12 16:51:14 +0000
@@ -58,6 +58,9 @@
 bzrlib/_known_graph_pyx.c
 bzrlib/_readdir_pyx.c
 bzrlib/_rio_pyx.c
+bzrlib/_simple_set_pyx.c
+bzrlib/_simple_set_pyx.h
+bzrlib/_simple_set_pyx_api.h
 bzrlib/_walkdirs_win32.c
 # built extension modules
 bzrlib/_*.dll
 
=== modified file 'NEWS'
--- NEWS 2009-10-08 23:44:40 +0000
+++ NEWS 2009-10-12 16:51:14 +0000
@@ -206,6 +206,13 @@
   repository or branch object is unlocked then relocked the same way.
   (Andrew Bennetts)
 
+* Added ``bzrlib._simple_set_pyx``. This is a hybrid between a Set and a
+  Dict (it only holds keys, but you can lookup the object located at a
+  given key). It has significantly reduced memory consumption versus the
+  builtin objects (1/2 the size of Set, 1/3rd the size of Dict). This will
+  be used as the interning structure for StaticTuple objects, as part of
+  an ongoing push to reduce peak memory consumption. (John Arbash Meinel)
+
 * ``BTreeLeafParser.extract_key`` has been tweaked slightly to reduce
   mallocs while parsing the index (approx 3=>1 mallocs per key read).
   This results in a 10% speedup while reading an index.
 
=== added file 'bzrlib/_simple_set_pyx.pxd'
--- bzrlib/_simple_set_pyx.pxd 1970-01-01 00:00:00 +0000
+++ bzrlib/_simple_set_pyx.pxd 2009-10-12 16:51:14 +0000
@@ -0,0 +1,91 @@
+# Copyright (C) 2009 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+"""Interface definition of a class like PySet but without caching the hash.
+
+This is generally useful when you want to 'intern' objects, etc. Note that this
+differs from Set in that we:
+ 1) Don't have all of the .intersection, .difference, etc functions
+ 2) Do return the object from the set via queries
+    eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
+"""
+
+cdef extern from "Python.h":
+    ctypedef struct PyObject:
+        pass
+
+
+cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
+    """A class similar to PySet, but with simpler implementation.
+
+    The main advantage is that this class uses only 2N memory to store N
+    objects rather than 4N memory. The main trade-off is that we do not cache
+    the hash value of saved objects. As such, it is assumed that computing the
+    hash will be cheap (such as strings or tuples of strings, etc.)
+
+    This also differs in that you can get back the objects that are stored
+    (like a dict), but we also don't implement the complete list of 'set'
+    operations (difference, intersection, etc).
+    """
+    # Data structure definition:
+    # This is a basic hash table using open addressing.
+    # http://en.wikipedia.org/wiki/Open_addressing
+    # Basically that means we keep an array of pointers to Python objects
+    # (called a table). Each location in the array is called a 'slot'.
+    #
+    # An empty slot holds a NULL pointer, a slot where there was an item
+    # which was then deleted will hold a pointer to _dummy, and a filled slot
+    # points at the actual object which fills that slot.
+    #
+    # The table is always a power of two, and the default location where an
+    # object is inserted is at hash(object) & (table_size - 1)
+    #
+    # If there is a collision, then we search for another location. The
+    # specific algorithm is in _lookup. We search until we:
+    #       find the object
+    #       find an equivalent object (by tp_richcompare(obj1, obj2, Py_EQ))
+    #       find a NULL slot
+    #
+    # When an object is deleted, we set its slot to _dummy. this way we don't
+    # have to track whether there was a collision, and find the corresponding
+    # keys. (The collision resolution algorithm makes that nearly impossible
+    # anyway, because it depends on the upper bits of the hash.)
+    # The main effect of this, is that if we find _dummy, then we can insert
+    # an object there, but we have to keep searching until we find NULL to
+    # know that the object is not present elsewhere.
+
+    cdef Py_ssize_t _used   # active
+    cdef Py_ssize_t _fill   # active + dummy
+    cdef Py_ssize_t _mask   # Table contains (mask+1) slots, a power of 2
+    cdef PyObject **_table  # Pyrex/Cython doesn't support arrays to 'object'
+                            # so we manage it manually
+
+    cdef PyObject *_get(self, object key) except? NULL
+    cdef object _add(self, key)
+    cdef int _discard(self, key) except -1
+    cdef int _insert_clean(self, PyObject *key) except -1
+    cdef Py_ssize_t _resize(self, Py_ssize_t min_unused) except -1
+
+
+# TODO: might want to export the C api here, though it is all available from
+# the class object...
+cdef api SimpleSet SimpleSet_New()
+cdef api object SimpleSet_Add(object self, object key)
+cdef api int SimpleSet_Contains(object self, object key) except -1
+cdef api int SimpleSet_Discard(object self, object key) except -1
+cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL
+cdef api Py_ssize_t SimpleSet_Size(object self) except -1
+cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key)
=== added file 'bzrlib/_simple_set_pyx.pyx'
--- bzrlib/_simple_set_pyx.pyx 1970-01-01 00:00:00 +0000
+++ bzrlib/_simple_set_pyx.pyx 2009-10-12 16:51:14 +0000
@@ -0,0 +1,600 @@
1# Copyright (C) 2009 Canonical Ltd
2#
3# This program is free software; you can redistribute it and/or modify
4# it under the terms of the GNU General Public License as published by
5# the Free Software Foundation; either version 2 of the License, or
6# (at your option) any later version.
7#
8# This program is distributed in the hope that it will be useful,
9# but WITHOUT ANY WARRANTY; without even the implied warranty of
10# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11# GNU General Public License for more details.
12#
13# You should have received a copy of the GNU General Public License
14# along with this program; if not, write to the Free Software
15# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
16
17"""Definition of a class that is similar to Set with some small changes."""
18
19cdef extern from "python-compat.h":
20 pass
21
22cdef extern from "Python.h":
23 ctypedef unsigned long size_t
24 ctypedef long (*hashfunc)(PyObject*)
25 ctypedef PyObject *(*richcmpfunc)(PyObject *, PyObject *, int)
26 ctypedef int (*visitproc)(PyObject *, void *)
27 ctypedef int (*traverseproc)(PyObject *, visitproc, void *)
28 int Py_EQ
29 PyObject *Py_True
30 PyObject *Py_NotImplemented
31 void Py_INCREF(PyObject *)
32 void Py_DECREF(PyObject *)
33 ctypedef struct PyTypeObject:
34 hashfunc tp_hash
35 richcmpfunc tp_richcompare
36 traverseproc tp_traverse
37
38 PyTypeObject *Py_TYPE(PyObject *)
39 int PyObject_IsTrue(PyObject *)
40
41 void *PyMem_Malloc(size_t nbytes)
42 void PyMem_Free(void *)
43 void memset(void *, int, size_t)
44
45
46# Dummy is an object used to mark nodes that have been deleted. Since
47# collisions require us to move a node to an alternative location, if we just
48# set an entry to NULL on delete, we won't find any relocated nodes.
49# We have to use _dummy_obj because we need to keep a refcount to it, but we
50# also use _dummy as a pointer, because it avoids having to put <PyObject*> all
51# over the code base.
52cdef object _dummy_obj
53cdef PyObject *_dummy
54_dummy_obj = object()
55_dummy = <PyObject *>_dummy_obj
56
57
58cdef int _is_equal(PyObject *this, long this_hash, PyObject *other) except -1:
59 cdef long other_hash
60 cdef PyObject *res
61
62 if this == other:
63 return 1
64 other_hash = Py_TYPE(other).tp_hash(other)
65 if other_hash == -1:
66 # Even though other successfully hashed in the past, it seems to have
67 # changed its mind, and failed this time, so propogate the failure.
68 return -1
69 if other_hash != this_hash:
70 return 0
71
72 # This implements a subset of the PyObject_RichCompareBool functionality.
73 # Namely it:
74 # 1) Doesn't try to do anything with old-style classes
75 # 2) Assumes that both objects have a tp_richcompare implementation, and
76 # that if that is not enough to compare equal, then they are not
77 # equal. (It doesn't try to cast them both to some intermediate form
78 # that would compare equal.)
79 res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
80 if res == NULL: # Exception
81 return -1
82 if PyObject_IsTrue(res):
83 Py_DECREF(res)
84 return 1
85 if res == Py_NotImplemented:
86 Py_DECREF(res)
87 res = Py_TYPE(other).tp_richcompare(other, this, Py_EQ)
88 if res == NULL:
89 return -1
90 if PyObject_IsTrue(res):
91 Py_DECREF(res)
92 return 1
93 Py_DECREF(res)
94 return 0
95
96
97cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
98 """This class can be used to track canonical forms for objects.
99
100 It is similar in function to the interned dictionary that is used by
101 strings. However:
102
103 1) It assumes that hash(obj) is cheap, so does not need to inline a copy
104 of it
105 2) It only stores one reference to the object, rather than 2 (key vs
106 key:value)
107
108 As such, it uses 1/3rd the amount of memory to store a pointer to the
109 interned object.
110 """
111 # Attributes are defined in the .pxd file
112 DEF DEFAULT_SIZE=1024
113
114 def __init__(self):
115 cdef Py_ssize_t size, n_bytes
116
117 size = DEFAULT_SIZE
118 self._mask = size - 1
119 self._used = 0
120 self._fill = 0
121 n_bytes = sizeof(PyObject*) * size;
122 self._table = <PyObject **>PyMem_Malloc(n_bytes)
123 if self._table == NULL:
124 raise MemoryError()
125 memset(self._table, 0, n_bytes)
126
127 def __dealloc__(self):
128 if self._table != NULL:
129 PyMem_Free(self._table)
130 self._table = NULL
131
132 property used:
133 def __get__(self):
134 return self._used
135
136 property fill:
137 def __get__(self):
138 return self._fill
139
140 property mask:
141 def __get__(self):
142 return self._mask
143
144 def _memory_size(self):
145 """Return the number of bytes of memory consumed by this class."""
146 return sizeof(self) + (sizeof(PyObject*)*(self._mask + 1))
147
148 def __len__(self):
149 return self._used
150
151 def _test_lookup(self, key):
152 cdef PyObject **slot
153
154 slot = _lookup(self, key)
155 if slot[0] == NULL:
156 res = '<null>'
157 elif slot[0] == _dummy:
158 res = '<dummy>'
159 else:
160 res = <object>slot[0]
161 return <int>(slot - self._table), res
162
163 def __contains__(self, key):
164 """Is key present in this SimpleSet."""
165 cdef PyObject **slot
166
167 slot = _lookup(self, key)
168 if slot[0] == NULL or slot[0] == _dummy:
169 return False
170 return True
171
172 cdef PyObject *_get(self, object key) except? NULL:
173 """Return the object (or nothing) define at the given location."""
174 cdef PyObject **slot
175
176 slot = _lookup(self, key)
177 if slot[0] == NULL or slot[0] == _dummy:
178 return NULL
179 return slot[0]
180
181 def __getitem__(self, key):
182 """Return a stored item that is equivalent to key."""
183 cdef PyObject *py_val
184
185 py_val = self._get(key)
186 if py_val == NULL:
187 raise KeyError("Key %s is not present" % key)
188 val = <object>(py_val)
189 return val
190
191 cdef int _insert_clean(self, PyObject *key) except -1:
192 """Insert a key into self.table.
193
194 This is only meant to be used during times like '_resize',
195 as it makes a lot of assuptions about keys not already being present,
196 and there being no dummy entries.
197 """
198 cdef size_t i, n_lookup
199 cdef long the_hash
200 cdef PyObject **table, **slot
201 cdef Py_ssize_t mask
202
203 mask = self._mask
204 table = self._table
205
206 the_hash = Py_TYPE(key).tp_hash(key)
207 if the_hash == -1:
208 return -1
209 i = the_hash
210 for n_lookup from 0 <= n_lookup <= <size_t>mask: # Don't loop forever
211 slot = &table[i & mask]
212 if slot[0] == NULL:
213 slot[0] = key
214 self._fill = self._fill + 1
215 self._used = self._used + 1
216 return 1
217 i = i + 1 + n_lookup
218 raise RuntimeError('ran out of slots.')
219
220 def _py_resize(self, min_used):
221 """Do not use this directly, it is only exposed for testing."""
222 return self._resize(min_used)
223
224 cdef Py_ssize_t _resize(self, Py_ssize_t min_used) except -1:
225 """Resize the internal table.
226
227 The final table will be big enough to hold at least min_used entries.
228 We will copy the data from the existing table over, leaving out dummy
229 entries.
230
231 :return: The new size of the internal table
232 """
233 cdef Py_ssize_t new_size, n_bytes, remaining
234 cdef PyObject **new_table, **old_table, **slot
235
236 new_size = DEFAULT_SIZE
237 while new_size <= min_used and new_size > 0:
238 new_size = new_size << 1
239 # We rolled over our signed size field
240 if new_size <= 0:
241 raise MemoryError()
242 # Even if min_used == self._mask + 1, and we aren't changing the actual
243 # size, we will still run the algorithm so that dummy entries are
244 # removed
245 # TODO: Test this
246 # if new_size < self._used:
247 # raise RuntimeError('cannot shrink SimpleSet to something'
248 # ' smaller than the number of used slots.')
249 n_bytes = sizeof(PyObject*) * new_size;
250 new_table = <PyObject **>PyMem_Malloc(n_bytes)
251 if new_table == NULL:
252 raise MemoryError()
253
254 old_table = self._table
255 self._table = new_table
256 memset(self._table, 0, n_bytes)
257 self._mask = new_size - 1
258 self._used = 0
259 remaining = self._fill
260 self._fill = 0
261
262 # Moving everything to the other table is refcount neutral, so we don't
263 # worry about it.
264 slot = old_table
265 while remaining > 0:
266 if slot[0] == NULL: # unused slot
267 pass
268 elif slot[0] == _dummy: # dummy slot
269 remaining = remaining - 1
270 else: # active slot
271 remaining = remaining - 1
272 self._insert_clean(slot[0])
273 slot = slot + 1
274 PyMem_Free(old_table)
275 return new_size
276
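For reference, the doubling rule in _resize can be isolated as a pure-Python sketch. This is illustrative, not bzrlib API; DEFAULT_SIZE = 1024 is an assumption, matching the 0x3ff initial mask the tests check:

```python
# Illustrative sketch of _resize's size computation (not bzrlib API).
# DEFAULT_SIZE = 1024 is an assumption, matching the 0x3ff initial mask.
DEFAULT_SIZE = 1024

def new_table_size(min_used):
    # Double from DEFAULT_SIZE until the table can hold at least
    # min_used entries; a non-positive value means the signed size
    # field rolled over, which the real code turns into MemoryError.
    new_size = DEFAULT_SIZE
    while new_size <= min_used and new_size > 0:
        new_size <<= 1
    if new_size <= 0:
        raise MemoryError()
    return new_size

assert new_table_size(500) == 1024
assert new_table_size(1024) == 2048
assert new_table_size(4095) == 4096
```

These three cases line up with the `_py_resize(500)`, `_py_resize(1024)` and `_py_resize(4095)` expectations in test__resize below.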
277 def add(self, key):
278 """Similar to set.add(), start tracking this key.
279
280 There is one small difference, which is that we return the object that
281 is stored at the given location. (which is closer to the
282 dict.setdefault() functionality.)
283 """
284 return self._add(key)
285
286 cdef object _add(self, key):
287 cdef PyObject **slot, *py_key
288 cdef int added
289
290 py_key = <PyObject *>key
291 if (Py_TYPE(py_key).tp_richcompare == NULL
292 or Py_TYPE(py_key).tp_hash == NULL):
293 raise TypeError('Types added to SimpleSet must implement'
294 ' both tp_richcompare and tp_hash')
295 added = 0
296 # We need at least one empty slot
297 assert self._used < self._mask
298 slot = _lookup(self, key)
299 if (slot[0] == NULL):
300 Py_INCREF(py_key)
301 self._fill = self._fill + 1
302 self._used = self._used + 1
303 slot[0] = py_key
304 added = 1
305 elif (slot[0] == _dummy):
306 Py_INCREF(py_key)
307 self._used = self._used + 1
308 slot[0] = py_key
309 added = 1
310 # No else: clause. If _lookup returns a pointer to
311 # a live object, then we already have a value at this location.
312 retval = <object>(slot[0])
313 # PySet and PyDict resize when 2/3rds full; we'll follow suit
314 if added and (self._fill * 3) >= ((self._mask + 1) * 2):
315 # However, we always resize to keep a 2:1 ratio of slots to entries
316 self._resize(self._used * 2)
317 # Even if we resized and ended up moving retval into a different slot,
318 # it is still the value that is held at the slot equivalent to 'key',
319 # so we can still return it
320 return retval
321
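The setdefault-like return contract of add() can be mimicked with a plain dict. This is an illustrative pure-Python model (the class name is hypothetical), not the real implementation:

```python
# Hypothetical pure-Python model of SimpleSet.add()'s return semantics.
class PySimpleSet(object):
    def __init__(self):
        self._map = {}

    def add(self, key):
        # Like dict.setdefault(key, key): store key if absent, and
        # always return the object actually held in the table.
        return self._map.setdefault(key, key)

s = PySimpleSet()
k1 = tuple(['foo'])
k2 = tuple(['foo'])          # equal to k1, but a distinct object
assert s.add(k1) is k1       # first add stores and returns k1
assert s.add(k2) is k1       # second add returns the stored k1, not k2
```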
322 def discard(self, key):
323 """Remove key from the set, whether it exists or not.
324
325 :return: False if the item did not exist, True if it did
326 """
327 if self._discard(key):
328 return True
329 return False
330
331 cdef int _discard(self, key) except -1:
332 cdef PyObject **slot, *py_key
333
334 slot = _lookup(self, key)
335 if slot[0] == NULL or slot[0] == _dummy:
336 return 0
337 self._used = self._used - 1
338 Py_DECREF(slot[0])
339 slot[0] = _dummy
340 # PySet uses the heuristic: If more than 1/5 are dummies, then resize
341 # them away
342 # if ((so->_fill - so->_used) * 5 < so->mask)
343 # However, we are planning on using this as an interning structure, in
344 # which we will be putting a lot of objects. And we expect that large
345 # groups of them are going to have the same lifetime.
346 # Dummy entries hurt a little bit because they cause the lookup to keep
347 # searching, but resizing is also rather expensive
348 # For now, we'll just use their algorithm, but we may want to revisit
349 # it
350 if ((self._fill - self._used) * 5 > self._mask):
351 self._resize(self._used * 2)
352 return 1
353
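The dummy-cleanup threshold used here can be isolated as a tiny predicate; a sketch under the same 1/5 heuristic (function name hypothetical):

```python
# Sketch of the resize trigger in _discard: fill - used counts the
# dummy slots; resize once dummies exceed roughly 1/5 of the table.
def should_resize_after_discard(fill, used, mask):
    return (fill - used) * 5 > mask

# No dummies: no resize. 205 dummies in a 1024-slot table: resize.
assert not should_resize_after_discard(fill=100, used=100, mask=0x3ff)
assert should_resize_after_discard(fill=300, used=95, mask=0x3ff)
```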
354 def __iter__(self):
355 return _SimpleSet_iterator(self)
356
357
358cdef class _SimpleSet_iterator:
359 """Iterator over the SimpleSet structure."""
360
361 cdef Py_ssize_t pos
362 cdef SimpleSet set
363 cdef Py_ssize_t _used # track if things have been mutated while iterating
364 cdef Py_ssize_t len # number of entries left
365
366 def __init__(self, obj):
367 self.set = obj
368 self.pos = 0
369 self._used = self.set._used
370 self.len = self.set._used
371
372 def __iter__(self):
373 return self
374
375 def __next__(self):
376 cdef Py_ssize_t mask, i
377 cdef PyObject *key
378
379 if self.set is None:
380 raise StopIteration
381 if self.set._used != self._used:
382 # Force this exception to continue to be raised
383 self._used = -1
384 raise RuntimeError("Set size changed during iteration")
385 if not SimpleSet_Next(self.set, &self.pos, &key):
386 self.set = None
387 raise StopIteration
388 # we found something
389 the_key = <object>key # INCREF
390 self.len = self.len - 1
391 return the_key
392
393 def __length_hint__(self):
394 if self.set is not None and self._used == self.set._used:
395 return self.len
396 return 0
397
398
399
400cdef api SimpleSet SimpleSet_New():
401 """Create a new SimpleSet object."""
402 return SimpleSet()
403
404
405cdef SimpleSet _check_self(object self):
406 """Check that the parameter is not None.
407
408 Pyrex/Cython will do type checking, but only to ensure that an object is
409 either the right type or None. You can say "object foo not None" for pure
410 python functions, but not for C functions.
411 So this is just a helper for all the apis that need to do the check.
412 """
413 cdef SimpleSet true_self
414 if self is None:
415 raise TypeError('self must not be None')
416 true_self = self
417 return true_self
418
419
420cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
421 """Find the slot where 'key' would fit.
422
423 This is the same as a dict's 'lookup' function.
424
425 :param key: An object we are looking up
427 :return: The location in self.table where key should be put.
428 location == NULL is an exception, but (*location) == NULL just
429 indicates the slot is empty and can be used.
430 """
431 # This uses Quadratic Probing:
432 # http://en.wikipedia.org/wiki/Quadratic_probing
433 # with c1 = c2 = 1/2
434 # This leads to probe locations at:
435 # h0 = hash(k1)
436 # h1 = h0 + 1
437 # h2 = h0 + 3 = h1 + 1 + 1
438 # h3 = h0 + 6 = h2 + 1 + 2
439 # h4 = h0 + 10 = h3 + 1 + 3
440 # Note that all of these are '& mask', but that is computed *after* the
441 # offset.
442 # This differs from the algorithm used by Set and Dict. Which, effectively,
443 # use double-hashing, and a step size that starts large, but dwindles to
444 # stepping one-by-one.
445 # This gives more 'locality' in that if you have a collision at offset X,
446 # the first fallback is X+1, which is fast to check. However, that means
447 # that an object w/ hash X+1 will also check there, and then X+2 next.
448 # However, for objects with differing hashes, their chains are different.
449 # The former checks X, X+1, X+3, ... the latter checks X+1, X+2, X+4, ...
450 # So different hashes diverge quickly.
451 # A bigger problem is that we *only* ever use the lowest bits of the hash
452 # So all integers (x + SIZE*N) will resolve into the same bucket, and all
453 # use the same collision resolution. We may want to try to find a way to
454 # incorporate the upper bits of the hash with quadratic probing. (For
455 # example, X, X+1, X+3+some_upper_bits, X+6+more_upper_bits, etc.)
456 cdef size_t i, n_lookup
457 cdef Py_ssize_t mask
458 cdef long key_hash
459 cdef PyObject **table, **slot, *cur, **free_slot, *py_key
460
461 # hash is a signed long; we use it as an unsigned size_t offset
462 key_hash = hash(key)
463 i = <size_t>key_hash
464 mask = self._mask
465 table = self._table
466 free_slot = NULL
467 py_key = <PyObject *>key
468 for n_lookup from 0 <= n_lookup <= <size_t>mask: # Don't loop forever
469 slot = &table[i & mask]
470 cur = slot[0]
471 if cur == NULL:
472 # Found a blank spot
473 if free_slot != NULL:
474 # Did we find an earlier _dummy entry?
475 return free_slot
476 else:
477 return slot
478 if cur == py_key:
479 # Found an exact pointer to the key
480 return slot
481 if cur == _dummy:
482 if free_slot == NULL:
483 free_slot = slot
484 elif _is_equal(py_key, key_hash, cur):
485 # Both py_key and cur belong in this slot, return it
486 return slot
487 i = i + 1 + n_lookup
488 raise AssertionError('should never get here')
489
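The probe sequence described in the comments above can be checked with a short sketch; the offsets from h0 are the triangular numbers 0, 1, 3, 6, 10 (the function name is hypothetical):

```python
# Illustrative sketch of _lookup's quadratic probing:
# i(n+1) = i(n) + 1 + n, with the mask applied after the offset.
def probe_offsets(key_hash, mask, limit=5):
    i = key_hash
    offsets = []
    for n in range(limit):
        offsets.append(i & mask)
        i = i + 1 + n
    return offsets

# With a 1024-slot table (mask 0x3ff), hash 643 probes 643, 644, 646, ...
assert probe_offsets(643, 0x3ff) == [643, 644, 646, 649, 653]
# hash 643 + 1024 masks to the same start and follows the same chain,
# which is what the collision tests below rely on.
assert probe_offsets(643 + 1024, 0x3ff) == [643, 644, 646, 649, 653]
```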
490
491cdef api PyObject **_SimpleSet_Lookup(object self, object key) except NULL:
492 """Find the slot where 'key' would fit.
493
494 This is the same as a dict's 'lookup' function. This is a private
495 API because mutating what you get without maintaining the other
496 invariants is a 'bad thing'.
497
498 :param key: An object we are looking up
500 :return: The location in self._table where key should be put
501 should never be NULL, but may reference a NULL (PyObject*)
502 """
503 return _lookup(_check_self(self), key)
504
505
506cdef api object SimpleSet_Add(object self, object key):
507 """Add a key to the SimpleSet (set).
508
509 :param self: The SimpleSet to add the key to.
510 :param key: The key to be added. If the key is already present,
511 self will not be modified
512 :return: The current key stored at the location defined by 'key'.
513 This may be the same object, or it may be an equivalent object.
514 (consider dict.setdefault(key, key))
515 """
516 return _check_self(self)._add(key)
517
518
519cdef api int SimpleSet_Contains(object self, object key) except -1:
520 """Is key present in self?"""
521 return (key in _check_self(self))
522
523
524cdef api int SimpleSet_Discard(object self, object key) except -1:
525 """Remove the object referenced at location 'key'.
526
527 :param self: The SimpleSet being modified
528 :param key: The key we are checking on
529 :return: 1 if there was an object present, 0 if there was not, and -1 on
530 error.
531 """
532 return _check_self(self)._discard(key)
533
534
535cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL:
536 """Get a pointer to the object present at location 'key'.
537
538 This returns an object which is equal to key which was previously added to
539 self. This returns a borrowed reference, as it may also return NULL if no
540 value is present at that location.
541
542 :param key: The value we are looking for
543 :return: The object present at that location
544 """
545 return _check_self(self)._get(key)
546
547
548cdef api Py_ssize_t SimpleSet_Size(object self) except -1:
549 """Get the number of active entries in 'self'"""
550 return _check_self(self)._used
551
552
553cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key):
554 """Walk over items in a SimpleSet.
555
556 :param pos: should be initialized to 0 by the caller, and will be updated
557 by this function
558 :param key: Will return a borrowed reference to key
559 :return: 0 if nothing left, 1 if we are returning a new value
560 """
561 cdef Py_ssize_t i, mask
562 cdef SimpleSet true_self
563 cdef PyObject **table
564 true_self = _check_self(self)
565 i = pos[0]
566 if (i < 0):
567 return 0
568 mask = true_self._mask
569 table = true_self._table
570 while (i <= mask and (table[i] == NULL or table[i] == _dummy)):
571 i = i + 1
572 pos[0] = i + 1
573 if (i > mask):
574 return 0 # All done
575 if (key != NULL):
576 key[0] = table[i]
577 return 1
578
579
580cdef int SimpleSet_traverse(SimpleSet self, visitproc visit, void *arg):
581 """This is an implementation of 'tp_traverse' that hits the whole table.
582
583 Cython/Pyrex don't seem to let you define a tp_traverse, and they only
584 define one for you if you have an 'object' attribute. Since they don't
585 support C arrays of objects, we access the PyObject * directly.
586 """
587 cdef Py_ssize_t pos
588 cdef PyObject *next_key
589 cdef int ret
590
591 pos = 0
592 while SimpleSet_Next(self, &pos, &next_key):
593 ret = visit(next_key, arg)
594 if ret:
595 return ret
596 return 0
597
598# It is a little bit ugly to do this, but it works, and means that Meliae can
599# dump the total memory consumed by all child objects.
600(<PyTypeObject *>SimpleSet).tp_traverse = <traverseproc>SimpleSet_traverse
0601
=== modified file 'bzrlib/python-compat.h'
--- bzrlib/python-compat.h 2009-06-10 03:56:49 +0000
+++ bzrlib/python-compat.h 2009-10-12 16:51:14 +0000
@@ -73,4 +73,9 @@
 #define snprintf _snprintf
 #endif
 
+/* Introduced in Python 2.6 */
+#ifndef Py_TYPE
+# define Py_TYPE(o) ((o)->ob_type)
+#endif
+
 #endif /* _BZR_PYTHON_COMPAT_H */
 
=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py 2009-10-08 01:50:30 +0000
+++ bzrlib/tests/__init__.py 2009-10-12 16:51:14 +0000
@@ -3688,6 +3688,7 @@
     'bzrlib.tests.test__groupcompress',
     'bzrlib.tests.test__known_graph',
     'bzrlib.tests.test__rio',
+    'bzrlib.tests.test__simple_set',
     'bzrlib.tests.test__walkdirs_win32',
     'bzrlib.tests.test_ancestry',
     'bzrlib.tests.test_annotate',
 
=== added file 'bzrlib/tests/test__simple_set.py'
--- bzrlib/tests/test__simple_set.py 1970-01-01 00:00:00 +0000
+++ bzrlib/tests/test__simple_set.py 2009-10-12 16:51:14 +0000
@@ -0,0 +1,371 @@
1# Copyright (C) 2009 Canonical Ltd
2#
3# This program is free software; you can redistribute it and/or modify
4# it under the terms of the GNU General Public License as published by
5# the Free Software Foundation; either version 2 of the License, or
6# (at your option) any later version.
7#
8# This program is distributed in the hope that it will be useful,
9# but WITHOUT ANY WARRANTY; without even the implied warranty of
10# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11# GNU General Public License for more details.
12#
13# You should have received a copy of the GNU General Public License
14# along with this program; if not, write to the Free Software
15# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
16
17"""Tests for the StaticTupleInterned type."""
18
19import sys
20
21from bzrlib import (
22 errors,
23 osutils,
24 tests,
25 )
26
27try:
28 from bzrlib import _simple_set_pyx
29except ImportError:
30 _simple_set_pyx = None
31
32
33class _Hashable(object):
34 """A simple object which has a fixed hash value.
35
36 We could have used an 'int', but it turns out that Int objects don't
37 implement tp_richcompare...
38 """
39
40 def __init__(self, the_hash):
41 self.hash = the_hash
42
43 def __hash__(self):
44 return self.hash
45
46 def __eq__(self, other):
47 if not isinstance(other, _Hashable):
48 return NotImplemented
49 return other.hash == self.hash
50
51
52class _BadSecondHash(_Hashable):
53
54 def __init__(self, the_hash):
55 _Hashable.__init__(self, the_hash)
56 self._first = True
57
58 def __hash__(self):
59 if self._first:
60 self._first = False
61 return self.hash
62 else:
63 raise ValueError('I can only be hashed once.')
64
65
66class _BadCompare(_Hashable):
67
68 def __eq__(self, other):
69 raise RuntimeError('I refuse to play nice')
70
71
72# Even though this is an extension, we don't permute the tests for a python
73# version, as the plain python version would just be a dict or set.
74
75class _CompiledSimpleSet(tests.Feature):
76
77 def _probe(self):
78 if _simple_set_pyx is None:
79 return False
80 return True
81
82 def feature_name(self):
83 return 'bzrlib._simple_set_pyx'
84
85CompiledSimpleSet = _CompiledSimpleSet()
86
87
88class TestSimpleSet(tests.TestCase):
89
90 _test_needs_features = [CompiledSimpleSet]
91 module = _simple_set_pyx
92
93 def assertIn(self, obj, container):
94 self.assertTrue(obj in container,
95 '%s not found in %s' % (obj, container))
96
97 def assertNotIn(self, obj, container):
98 self.assertTrue(obj not in container,
99 'We found %s in %s' % (obj, container))
100
101 def assertFillState(self, used, fill, mask, obj):
102 self.assertEqual((used, fill, mask), (obj.used, obj.fill, obj.mask))
103
104 def assertLookup(self, offset, value, obj, key):
105 self.assertEqual((offset, value), obj._test_lookup(key))
106
107 def assertRefcount(self, count, obj):
108 """Assert that the refcount for obj is what we expect.
109
110 Note that this automatically adjusts for the fact that calling
111 assertRefcount actually creates a new pointer, as does calling
112 sys.getrefcount. So pass the expected value *before* the call.
113 """
114 # I'm not sure why the offset is 3, but I've checked that in the caller
115 # an offset of 1 works, as expected. Not sure why assertRefcount
116 # increments/decrements the count 2 extra times
117 self.assertEqual(count, sys.getrefcount(obj)-3)
118
119 def test_initial(self):
120 obj = self.module.SimpleSet()
121 self.assertEqual(0, len(obj))
122 st = ('foo', 'bar')
123 self.assertFillState(0, 0, 0x3ff, obj)
124
125 def test__lookup(self):
126 # These are carefully chosen integers to force hash collisions in the
127 # algorithm, based on the initial set size of 1024
128 obj = self.module.SimpleSet()
129 self.assertLookup(643, '<null>', obj, _Hashable(643))
130 self.assertLookup(643, '<null>', obj, _Hashable(643 + 1024))
131 self.assertLookup(643, '<null>', obj, _Hashable(643 + 50*1024))
132
133 def test__lookup_collision(self):
134 obj = self.module.SimpleSet()
135 k1 = _Hashable(643)
136 k2 = _Hashable(643 + 1024)
137 self.assertLookup(643, '<null>', obj, k1)
138 self.assertLookup(643, '<null>', obj, k2)
139 obj.add(k1)
140 self.assertLookup(643, k1, obj, k1)
141 self.assertLookup(644, '<null>', obj, k2)
142
143 def test__lookup_after_resize(self):
144 obj = self.module.SimpleSet()
145 k1 = _Hashable(643)
146 k2 = _Hashable(643 + 1024)
147 obj.add(k1)
148 obj.add(k2)
149 self.assertLookup(643, k1, obj, k1)
150 self.assertLookup(644, k2, obj, k2)
151 obj._py_resize(2047) # resized to 2048
152 self.assertEqual(2048, obj.mask + 1)
153 self.assertLookup(643, k1, obj, k1)
154 self.assertLookup(643+1024, k2, obj, k2)
155 obj._py_resize(1023) # resized back to 1024
156 self.assertEqual(1024, obj.mask + 1)
157 self.assertLookup(643, k1, obj, k1)
158 self.assertLookup(644, k2, obj, k2)
159
160 def test_get_set_del_with_collisions(self):
161 obj = self.module.SimpleSet()
162
163 h1 = 643
164 h2 = 643 + 1024
165 h3 = 643 + 1024*50
166 h4 = 643 + 1024*25
167 h5 = 644
168 h6 = 644 + 1024
169
170 k1 = _Hashable(h1)
171 k2 = _Hashable(h2)
172 k3 = _Hashable(h3)
173 k4 = _Hashable(h4)
174 k5 = _Hashable(h5)
175 k6 = _Hashable(h6)
176 self.assertLookup(643, '<null>', obj, k1)
177 self.assertLookup(643, '<null>', obj, k2)
178 self.assertLookup(643, '<null>', obj, k3)
179 self.assertLookup(643, '<null>', obj, k4)
180 self.assertLookup(644, '<null>', obj, k5)
181 self.assertLookup(644, '<null>', obj, k6)
182 obj.add(k1)
183 self.assertIn(k1, obj)
184 self.assertNotIn(k2, obj)
185 self.assertNotIn(k3, obj)
186 self.assertNotIn(k4, obj)
187 self.assertLookup(643, k1, obj, k1)
188 self.assertLookup(644, '<null>', obj, k2)
189 self.assertLookup(644, '<null>', obj, k3)
190 self.assertLookup(644, '<null>', obj, k4)
191 self.assertLookup(644, '<null>', obj, k5)
192 self.assertLookup(644, '<null>', obj, k6)
193 self.assertIs(k1, obj[k1])
194 self.assertIs(k2, obj.add(k2))
195 self.assertIs(k2, obj[k2])
196 self.assertLookup(643, k1, obj, k1)
197 self.assertLookup(644, k2, obj, k2)
198 self.assertLookup(646, '<null>', obj, k3)
199 self.assertLookup(646, '<null>', obj, k4)
200 self.assertLookup(645, '<null>', obj, k5)
201 self.assertLookup(645, '<null>', obj, k6)
202 self.assertLookup(643, k1, obj, _Hashable(h1))
203 self.assertLookup(644, k2, obj, _Hashable(h2))
204 self.assertLookup(646, '<null>', obj, _Hashable(h3))
205 self.assertLookup(646, '<null>', obj, _Hashable(h4))
206 self.assertLookup(645, '<null>', obj, _Hashable(h5))
207 self.assertLookup(645, '<null>', obj, _Hashable(h6))
208 obj.add(k3)
209 self.assertIs(k3, obj[k3])
210 self.assertIn(k1, obj)
211 self.assertIn(k2, obj)
212 self.assertIn(k3, obj)
213 self.assertNotIn(k4, obj)
214
215 obj.discard(k1)
216 self.assertLookup(643, '<dummy>', obj, k1)
217 self.assertLookup(644, k2, obj, k2)
218 self.assertLookup(646, k3, obj, k3)
219 self.assertLookup(643, '<dummy>', obj, k4)
220 self.assertNotIn(k1, obj)
221 self.assertIn(k2, obj)
222 self.assertIn(k3, obj)
223 self.assertNotIn(k4, obj)
224
225 def test_add(self):
226 obj = self.module.SimpleSet()
227 self.assertFillState(0, 0, 0x3ff, obj)
228 # We use this clumsy notation, because otherwise the refcounts are off.
229 # I'm guessing the python compiler sees it is a static tuple, and adds
230 # it to the function variables, or somesuch
231 k1 = tuple(['foo'])
232 self.assertRefcount(1, k1)
233 self.assertIs(k1, obj.add(k1))
234 self.assertFillState(1, 1, 0x3ff, obj)
235 self.assertRefcount(2, k1)
236 ktest = obj[k1]
237 self.assertRefcount(3, k1)
238 self.assertIs(k1, ktest)
239 del ktest
240 self.assertRefcount(2, k1)
241 k2 = tuple(['foo'])
242 self.assertRefcount(1, k2)
243 self.assertIsNot(k1, k2)
244 # doesn't add anything, so the counters shouldn't be adjusted
245 self.assertIs(k1, obj.add(k2))
246 self.assertFillState(1, 1, 0x3ff, obj)
247 self.assertRefcount(2, k1) # not changed
248 self.assertRefcount(1, k2) # not incremented
249 self.assertIs(k1, obj[k1])
250 self.assertIs(k1, obj[k2])
251 self.assertRefcount(2, k1)
252 self.assertRefcount(1, k2)
253 # Deleting an entry should remove the fill, but not the used
254 obj.discard(k1)
255 self.assertFillState(0, 1, 0x3ff, obj)
256 self.assertRefcount(1, k1)
257 k3 = tuple(['bar'])
258 self.assertRefcount(1, k3)
259 self.assertIs(k3, obj.add(k3))
260 self.assertFillState(1, 2, 0x3ff, obj)
261 self.assertRefcount(2, k3)
262 self.assertIs(k2, obj.add(k2))
263 self.assertFillState(2, 2, 0x3ff, obj)
264 self.assertRefcount(1, k1)
265 self.assertRefcount(2, k2)
266 self.assertRefcount(2, k3)
267
268 def test_discard(self):
269 obj = self.module.SimpleSet()
270 k1 = tuple(['foo'])
271 k2 = tuple(['foo'])
272 k3 = tuple(['bar'])
273 self.assertRefcount(1, k1)
274 self.assertRefcount(1, k2)
275 self.assertRefcount(1, k3)
276 obj.add(k1)
277 self.assertRefcount(2, k1)
278 self.assertEqual(0, obj.discard(k3))
279 self.assertRefcount(1, k3)
280 obj.add(k3)
281 self.assertRefcount(2, k3)
282 self.assertEqual(1, obj.discard(k3))
283 self.assertRefcount(1, k3)
284
285 def test__resize(self):
286 obj = self.module.SimpleSet()
287 k1 = ('foo',)
288 k2 = ('bar',)
289 k3 = ('baz',)
290 obj.add(k1)
291 obj.add(k2)
292 obj.add(k3)
293 obj.discard(k2)
294 self.assertFillState(2, 3, 0x3ff, obj)
295 self.assertEqual(1024, obj._py_resize(500))
296 # Doesn't change the size, but does change the content
297 self.assertFillState(2, 2, 0x3ff, obj)
298 obj.add(k2)
299 obj.discard(k3)
300 self.assertFillState(2, 3, 0x3ff, obj)
301 self.assertEqual(4096, obj._py_resize(4095))
302 self.assertFillState(2, 2, 0xfff, obj)
303 self.assertIn(k1, obj)
304 self.assertIn(k2, obj)
305 self.assertNotIn(k3, obj)
306 obj.add(k2)
307 self.assertIn(k2, obj)
308 obj.discard(k2)
309 self.assertEqual((591, '<dummy>'), obj._test_lookup(k2))
310 self.assertFillState(1, 2, 0xfff, obj)
311 self.assertEqual(2048, obj._py_resize(1024))
312 self.assertFillState(1, 1, 0x7ff, obj)
313 self.assertEqual((591, '<null>'), obj._test_lookup(k2))
314
315 def test_second_hash_failure(self):
316 obj = self.module.SimpleSet()
317 k1 = _BadSecondHash(200)
318 k2 = _Hashable(200)
319 # Should only call hash() one time
320 obj.add(k1)
321 self.assertFalse(k1._first)
322 self.assertRaises(ValueError, obj.add, k2)
323
324 def test_richcompare_failure(self):
325 obj = self.module.SimpleSet()
326 k1 = _Hashable(200)
327 k2 = _BadCompare(200)
328 obj.add(k1)
329 # Tries to compare with k1, fails
330 self.assertRaises(RuntimeError, obj.add, k2)
331
332 def test_add_and_remove_lots_of_items(self):
333 obj = self.module.SimpleSet()
334 chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890'
335 for i in chars:
336 for j in chars:
337 k = (i, j)
338 obj.add(k)
339 num = len(chars)*len(chars)
340 self.assertFillState(num, num, 0x1fff, obj)
341 # Now delete all of the entries and it should shrink again
342 for i in chars:
343 for j in chars:
344 k = (i, j)
345 obj.discard(k)
346 # It should be back to 1024 wide mask, though there may still be some
347 # dummy values in there
348 self.assertFillState(0, obj.fill, 0x3ff, obj)
349 # but there should be fewer than 1/5th dummy entries
350 self.assertTrue(obj.fill < 1024 / 5)
351
352 def test__iter__(self):
353 obj = self.module.SimpleSet()
354 k1 = ('1',)
355 k2 = ('1', '2')
356 k3 = ('3', '4')
357 obj.add(k1)
358 obj.add(k2)
359 obj.add(k3)
360 all = set()
361 for key in obj:
362 all.add(key)
363 self.assertEqual(sorted([k1, k2, k3]), sorted(all))
364 iterator = iter(obj)
365 iterator.next()
366 obj.add(('foo',))
367 # Set changed size
368 self.assertRaises(RuntimeError, iterator.next)
369 # And even removing an item still causes it to fail
370 obj.discard(k2)
371 self.assertRaises(RuntimeError, iterator.next)
0372
=== modified file 'setup.py'
--- setup.py 2009-10-01 03:46:41 +0000
+++ setup.py 2009-10-12 16:51:14 +0000
@@ -300,6 +300,7 @@
 add_pyrex_extension('bzrlib._chk_map_pyx', libraries=[z_lib])
 ext_modules.append(Extension('bzrlib._patiencediff_c',
                              ['bzrlib/_patiencediff_c.c']))
+add_pyrex_extension('bzrlib._simple_set_pyx')
 
 
 if unavailable_files: