Code review comment for lp:~jameinel/bzr/2.1-simple-set

Revision history for this message
John A Meinel (jameinel) wrote :

This is the first in my series of patches to lower memory overhead with StaticTuple objects.

I picked this first because it wasn't strictly dependent on the other code, but I use it in the implementation of StaticTuple's interning.

Anyway, this introduces a "SimpleSet" class. For now, I have a Pyrex version but no pure-python version, as it is only currently used in StaticTuple. I may get to the point of implementing a python version, though it is fairly close in API to a set or a dict. The main difference is that we don't really have values like a dict does, while a set() doesn't let you get access to the object which is actually stored in its internal table.

So the primary wins for this class are:

1) Don't cache the hash of every object. For the kinds of objects we are putting in here, the hash should be relatively cheap to compute. Further, the only times you need the hash are:
  a) When inserting a new object, you need to know its hash, but you haven't cached it yet anyway.
  b) When resolving a collision, you compare the hash to the cached value as a 'cheap' comparison.
However, the number of collisions depends on the quality of your hash, your collision-avoidance algorithm, and the size of your table. We can't really change the hash function. We could use quadratic probing, but what set() uses seems pretty good anyway (it mixes in more of the upper bits, so you probably get divergence faster, but you also lose locality...)
As for the size of the table: it takes the same number of bytes to cache a 'long hash' as it takes to hold a 'PyObject *'. That means that in the same memory, you could cache hashes *or* double your addressable space and halve the number of collisions.

2) Allow lookups, so we don't need to use a dict, which costs yet another pointer per entry. (For the same number of entries, this will generally be 1/3rd the size of the equivalent dict, and 1/2 the size of the equivalent set.)

3) Have a single function equivalent to 'key = dict.setdefault(key, key)'. At the C API, you could use dict->lookup, which is a rather private function, or you would generally use PyDict_GetItem() followed by PyDict_SetItem().
SimpleSet_Add() returns the object stored there, so you only need a single lookup. (This only really matters in the case of collisions, where several probe steps would otherwise be repeated.)
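The points above can be sketched in pure Python. This is a minimal illustration of the design (not the actual Pyrex code from the branch): an open-addressed table that stores only the object pointer per slot, recomputes hash() on demand instead of caching it, and whose add() returns the canonical stored object in one lookup. The class name and probe constants here are my own for illustration.

```python
class MiniSimpleSet:
    """Open-addressed table: one slot per entry, holding only the object.

    Hashes are recomputed rather than cached, so each slot is a single
    pointer instead of the hash+key(+value) layout of set()/dict.
    """

    def __init__(self):
        self._mask = 7                  # table size 8; always a power of two
        self._table = [None] * 8
        self._used = 0

    def _lookup(self, key):
        """Return the index where key lives, or the empty slot for it."""
        h = hash(key)
        i = h & self._mask
        perturb = h
        while True:
            slot = self._table[i]
            if slot is None or slot == key:
                return i
            # CPython-style probing: fold more hash bits in each step
            perturb >>= 5
            i = (i * 5 + perturb + 1) & self._mask

    def add(self, key):
        """Insert key if absent; return the canonical stored object."""
        i = self._lookup(key)
        existing = self._table[i]
        if existing is None:
            self._table[i] = key
            self._used += 1
            if self._used * 3 >= (self._mask + 1) * 2:  # over ~2/3 full
                self._resize()
            return key
        return existing

    def _resize(self):
        old = [o for o in self._table if o is not None]
        self._mask = self._mask * 2 + 1
        self._table = [None] * (self._mask + 1)
        for o in old:
            self._table[self._lookup(o)] = o
```

Note how add() gives the setdefault(key, key) behaviour in a single probe: interning is just `canonical = table.add(candidate)`.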

Because of the memory savings, I may look into using this elsewhere in our code base. If I do, then I will certainly implement a Python version (probably just subclassing dict and exposing a 'def add(self, key): return self.setdefault(key, key)' method, plus whatever else I specifically need).
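That pure-Python fallback might look something like this (a sketch of the idea described above, not code from the branch; the discard() helper is one guess at "whatever else I specifically need"):

```python
class SimpleSetFallback(dict):
    """Pure-Python SimpleSet stand-in: a dict mapping each key to itself,
    so add() can return the canonical (interned) copy of an equal key."""

    def add(self, key):
        # On first sight, stores key as its own value; on later calls,
        # returns the previously stored (canonical) object.
        return self.setdefault(key, key)

    def discard(self, key):
        # Remove key if present; silently do nothing otherwise.
        self.pop(key, None)
```

It trades the memory win for simplicity, but keeps the same single-call interning API.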

As for the specific memory savings here: dicts and sets are resized to stay on average 50% full (they resize at 67% full, etc.), and so is SimpleSet [theory says hash tables fall apart at about 80% full]. The idea is to start interning the StaticTuple objects I'm creating. (For every key, you average at least 2 references: one for the key, and one from all of its children.)
When loading all of launchpad, that translates into about 500k strings, and 1.1M interned StaticTuples (there are actually more, because of how the btree code uses tuples-of-tuples, but not all of them are worth interning.)

Anyway, that means that 1.1M tuples require an 8MB SimpleSet, but would require a 24MB dict. (So of my memory savings in going from 325MB => 240MB [85MB], 16/85 = 19% comes from using a SimpleSet here.)
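That arithmetic can be checked with rough per-slot sizes, assuming a 64-bit build (8-byte pointers and hashes) and a round table size of 2^20 slots, which is only an illustrative figure near the 1.1M keys:

```python
# Per-slot costs on a 64-bit build (illustrative, not measured):
dict_slot = 8 + 8 + 8      # cached hash + key pointer + value pointer
set_slot = 8 + 8           # cached hash + key pointer
simpleset_slot = 8         # key pointer only

slots = 1 << 20            # ~1M slots, a round figure for 1.1M keys
MB = 1024 * 1024
print(slots * simpleset_slot // MB)  # -> 8  (MB, the SimpleSet figure)
print(slots * dict_slot // MB)       # -> 24 (MB, the dict figure)
```

The 1/3 and 1/2 ratios from point 2 fall directly out of the slot layouts.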

Especially given that the size of all StaticTuples plus their referenced strings is only 44.8MB. (This is also why intern(string) is not particularly memory efficient, as it costs 24 bytes per item in the dict, versus 24+N bytes per string.)
