txStatsD

Merge lp:~lucio.torre/txstatsd/add-redis-to-distinct-plugin into lp:txstatsd


Proposed by Lucio Torre on 2011-12-15

Status:	Rejected
Rejected by:	Lucio Torre on 2011-12-15
Proposed branch:	lp:~lucio.torre/txstatsd/add-redis-to-distinct-plugin
Merge into:	lp:txstatsd
Diff against target:	977 lines (+903/-0) (has conflicts)
13 files modified
Makefile (+26/-0)
README (+10/-0)
bin/redis.conf (+312/-0)
bin/schema.sql (+3/-0)
bin/start-database.sh (+63/-0)
bin/start-redis.sh (+3/-0)
bin/stop-database.sh (+22/-0)
bin/stop-redis.sh (+1/-0)
distinctdb/distinctmetric.py (+160/-0)
distinctdb/tests/test_distinct.py (+200/-0)
distinctdb/version.py (+1/-0)
setup.py (+64/-0)
twisted/plugins/distinctdbplugin.py (+38/-0)
Conflict adding file README. Moved existing file to README.moved.
Conflict adding file setup.py. Moved existing file to setup.py.moved.
Conflict adding file twisted. Moved existing file to twisted.moved.
To merge this branch:	bzr merge lp:~lucio.torre/txstatsd/add-redis-to-distinct-plugin

Reviewer	Review Type	Date Requested	Status
txStatsD Developers		2011-12-15	Pending
Review via email: mp+85948@code.launchpad.net

Description of the change

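Add Redis support for real-time stats: the distinct plugin keeps distinct-item counts for the last 1 minute, 5 minutes, 1 hour, 12 hours and 1 day in Redis sorted sets.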
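
To make the windowed counting easier to follow, here is a minimal in-memory sketch of the idea the branch appears to implement with Redis sorted sets: each item is stored with its last-seen timestamp as the score, stale members are trimmed on flush, and the remaining cardinality is the distinct count. The class name SlidingDistinctCounter and its methods are hypothetical, written for Python 3, and use plain dicts as a stand-in for ZADD, ZREMRANGEBYSCORE and ZCARD; it is not the code from the diff.

```python
class SlidingDistinctCounter:
    """Illustrative in-memory stand-in for one Redis sorted set per period."""

    def __init__(self, periods):
        # One mapping of item -> last-seen timestamp per period,
        # mirroring a sorted set whose score is the timestamp.
        self.windows = {period: {} for period in periods}

    def update(self, item, when):
        # Like ZADD: re-adding an existing member only refreshes its score.
        for window in self.windows.values():
            window[item] = when

    def flush(self, now):
        # Like ZREMRANGEBYSCORE key 0 (now - period): the range is
        # inclusive, so members last seen at or before the cutoff go away.
        for period, window in self.windows.items():
            cutoff = now - period
            stale = [k for k, seen in window.items() if 0 <= seen <= cutoff]
            for item in stale:
                del window[item]

    def count(self, period):
        # Like ZCARD: number of distinct members left in the window.
        return len(self.windows[period])


counter = SlidingDistinctCounter([60, 3600])
counter.update("one", when=0)
counter.update("two", when=1800)
counter.flush(now=1800)
print(counter.count(3600))  # 2: both items were seen within the last hour
print(counter.count(60))    # 1: only "two" was seen within the last minute
```

Because the member is the item itself, seeing the same item twice does not raise the count; only the trim on flush can lower it.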

lp:~lucio.torre/txstatsd/add-redis-to-distinct-plugin updated on 2011-12-15

4. By Lucio Torre on 2011-12-15: added load test + cleanup

Unmerged revisions

7. By Lucio Torre on 2012-02-07: made compatible with old versions, even more
6. By Lucio Torre on 2011-12-20: made config compatible with lucid
5. By Lucio Torre on 2011-12-16: path to startstopdaemon
4. By Lucio Torre on 2011-12-15: added load test + cleanup
3. By Lucio Torre on 2011-12-15: redis support for realtime stats
2. By Lucio Torre on 2011-12-13: bzr merge lp:~lucio.torre/txstatsd/add-distinct-db-plugin
1. By Lucio Torre on 2011-12-06: empty plugin skeleton

Preview Diff



 === added file 'Makefile'
 --- Makefile	1970-01-01 00:00:00 +0000
 +++ Makefile	2011-12-15 20:24:01 +0000
@@ -0,0 +1,26 @@
++
++trial:
++	trial distinctdb/
++
++start-database:
++	./bin/start-database.sh
++	psql -h `pwd`/tmp/db1 -d distinct < bin/schema.sql
++
++stop-database:
++	./bin/stop-database.sh
++
++start-redis:
++	./bin/start-redis.sh
++
++stop-redis:
++	./bin/stop-redis.sh
++
++start: start-database start-redis
++
++stop: stop-redis stop-database
++
++clean:
++	rm -rf ./tmp/
++test: start trial stop clean
++
++.PHONY: trial test start-database stop-database start-redis stop-redis start stop clean
 === added file 'README'
 --- README	1970-01-01 00:00:00 +0000
 +++ README	2011-12-15 20:24:01 +0000
@@ -0,0 +1,10 @@
++To test:
++./bin/start-database.sh
++psql -h `pwd`/tmp/db1 -d distinct < bin/schema.sql
++
++Then:
++trial distinctdb
++
++When done:
++./bin/stop-database.sh
++
 === renamed file 'README' => 'README.moved'
 === added directory 'bin'
 === added file 'bin/redis.conf'
 --- bin/redis.conf	1970-01-01 00:00:00 +0000
 +++ bin/redis.conf	2011-12-15 20:24:01 +0000
@@ -0,0 +1,312 @@
++# Redis configuration file example
++
++# Note on units: when memory size is needed, it is possible to specify
++# it in the usual form of 1k 5GB 4M and so forth:
++#
++# 1k => 1000 bytes
++# 1kb => 1024 bytes
++# 1m => 1000000 bytes
++# 1mb => 1024*1024 bytes
++# 1g => 1000000000 bytes
++# 1gb => 1024*1024*1024 bytes
++#
++# units are case insensitive so 1GB 1Gb 1gB are all the same.
++
++# By default Redis does not run as a daemon. Use 'yes' if you need it.
++# Note that Redis will write a pid file in /var/run/redis.pid when daemonized.
++daemonize no
++
++# When running daemonized, Redis writes a pid file in /var/run/redis.pid by
++# default. You can specify a custom pid file location here.
++# pidfile ./tmp/redis.pid
++
++# Accept connections on the specified port, default is 6379
++port 16379
++
++# If you want you can bind a single interface, if the bind option is not
++# specified all the interfaces will listen for incoming connections.
++#
++bind 127.0.0.1
++
++# Close the connection after a client is idle for N seconds (0 to disable)
++timeout 300
++
++# Set server verbosity to 'debug'
++# it can be one of:
++# debug (a lot of information, useful for development/testing)
++# verbose (many rarely useful info, but not a mess like the debug level)
++# notice (moderately verbose, what you want in production probably)
++# warning (only very important / critical messages are logged)
++loglevel verbose
++
++# Specify the log file name. Also 'stdout' can be used to force
++# Redis to log on the standard output. Note that if you use standard
++# output for logging but daemonize, logs will be sent to /dev/null
++logfile tmp/redis/redis-server.log
++
++# Set the number of databases. The default database is DB 0, you can select
++# a different one on a per-connection basis using SELECT <dbid> where
++# dbid is a number between 0 and 'databases'-1
++databases 16
++
++################################ SNAPSHOTTING  #################################
++#
++# Save the DB on disk:
++#
++#   save <seconds> <changes>
++#
++#   Will save the DB if both the given number of seconds and the given
++#   number of write operations against the DB occurred.
++#
++#   In the example below the behaviour will be to save:
++#   after 900 sec (15 min) if at least 1 key changed
++#   after 300 sec (5 min) if at least 10 keys changed
++#   after 60 sec if at least 10000 keys changed
++#
++#   Note: you can disable saving at all commenting all the "save" lines.
++
++save 900 1
++save 300 10
++save 60 10000
++
++# Compress string objects using LZF when dump .rdb databases?
++# For default that's set to 'yes' as it's almost always a win.
++# If you want to save some CPU in the saving child set it to 'no' but
++# the dataset will likely be bigger if you have compressible values or keys.
++rdbcompression yes
++
++# The filename where to dump the DB
++dbfilename dump.rdb
++
++# The working directory.
++#
++# The DB will be written inside this directory, with the filename specified
++# above using the 'dbfilename' configuration directive.
++#
++# Also the Append Only File will be created inside this directory.
++#
++# Note that you must specify a directory here, not a file name.
++dir tmp/redis
++
++################################# REPLICATION #################################
++
++# Master-Slave replication. Use slaveof to make a Redis instance a copy of
++# another Redis server. Note that the configuration is local to the slave
++# so for example it is possible to configure the slave to save the DB with a
++# different interval, or to listen to another port, and so on.
++#
++# slaveof <masterip> <masterport>
++
++# If the master is password protected (using the "requirepass" configuration
++# directive below) it is possible to tell the slave to authenticate before
++# starting the replication synchronization process, otherwise the master will
++# refuse the slave request.
++#
++# masterauth <master-password>
++
++################################## SECURITY ###################################
++
++# Require clients to issue AUTH <PASSWORD> before processing any other
++# commands.  This might be useful in environments in which you do not trust
++# others with access to the host running redis-server.
++#
++# This should stay commented out for backward compatibility and because most
++# people do not need auth (e.g. they run their own servers).
++#
++# Warning: since Redis is pretty fast an outside user can try up to
++# 150k passwords per second against a good box. This means that you should
++# use a very strong password otherwise it will be very easy to break.
++#
++# requirepass foobared
++
++################################### LIMITS ####################################
++
++# Set the max number of connected clients at the same time. By default there
++# is no limit, and it's up to the number of file descriptors the Redis process
++# is able to open. The special value '0' means no limits.
++# Once the limit is reached Redis will close all the new connections sending
++# an error 'max number of clients reached'.
++#
++# maxclients 128
++
++# Don't use more memory than the specified amount of bytes.
++# When the memory limit is reached Redis will try to remove keys with an
++# EXPIRE set. It will try to start freeing keys that are going to expire
++# in little time and preserve keys with a longer time to live.
++# Redis will also try to remove objects from free lists if possible.
++#
++# If all this fails, Redis will start to reply with errors to commands
++# that will use more memory, like SET, LPUSH, and so on, and will continue
++# to reply to most read-only commands like GET.
++#
++# WARNING: maxmemory can be a good idea mainly if you want to use Redis as a
++# 'state' server or cache, not as a real DB. When Redis is used as a real
++# database the memory usage will grow over the weeks, it will be obvious if
++# it is going to use too much memory in the long run, and you'll have the time
++# to upgrade. With maxmemory after the limit is reached you'll start to get
++# errors for write operations, and this may even lead to DB inconsistency.
++#
++# maxmemory <bytes>
++
++############################## APPEND ONLY MODE ###############################
++
++# By default Redis asynchronously dumps the dataset on disk. If you can live
++# with the idea that the latest records will be lost if something like a crash
++# happens this is the preferred way to run Redis. If instead you care a lot
++# about your data and don't want a single record to get lost you should
++# enable the append only mode: when this mode is enabled Redis will append
++# every write operation received in the file appendonly.aof. This file will
++# be read on startup in order to rebuild the full dataset in memory.
++#
++# Note that you can have both the async dumps and the append only file if you
++# like (you have to comment the "save" statements above to disable the dumps).
++# Still if append only mode is enabled Redis will load the data from the
++# log file at startup ignoring the dump.rdb file.
++#
++# IMPORTANT: Check the BGREWRITEAOF to check how to rewrite the append
++# log file in background when it gets too big.
++
++appendonly no
++
++# The name of the append only file (default: "appendonly.aof")
++# appendfilename appendonly.aof
++
++# The fsync() call tells the Operating System to actually write data on disk
++# instead to wait for more data in the output buffer. Some OS will really flush
++# data on disk, some other OS will just try to do it ASAP.
++#
++# Redis supports three different modes:
++#
++# no: don't fsync, just let the OS flush the data when it wants. Faster.
++# always: fsync after every write to the append only log . Slow, Safest.
++# everysec: fsync only if one second passed since the last fsync. Compromise.
++#
++# The default is "everysec" that's usually the right compromise between
++# speed and data safety. It's up to you to understand if you can relax this to
++# "no" that will let the operating system flush the output buffer when
++# it wants, for better performances (but if you can live with the idea of
++# some data loss consider the default persistence mode that's snapshotting),
++# or on the contrary, use "always" that's very slow but a bit safer than
++# everysec.
++#
++# If unsure, use "everysec".
++
++# appendfsync always
++appendfsync everysec
++# appendfsync no
++
++################################ VIRTUAL MEMORY ###############################
++
++# Virtual Memory allows Redis to work with datasets bigger than the actual
++# amount of RAM needed to hold the whole dataset in memory.
++# In order to do so very used keys are taken in memory while the other keys
++# are swapped into a swap file, similarly to what operating systems do
++# with memory pages.
++#
++# To enable VM just set 'vm-enabled' to yes, and set the following three
++# VM parameters accordingly to your needs.
++
++vm-enabled no
++# vm-enabled yes
++
++# This is the path of the Redis swap file. As you can guess, swap files
++# can't be shared by different Redis instances, so make sure to use a swap
++# file for every redis process you are running. Redis will complain if the
++# swap file is already in use.
++#
++# The best kind of storage for the Redis swap file (that's accessed at random)
++# is a Solid State Disk (SSD).
++#
++# *** WARNING *** if you are using a shared hosting the default of putting
++# the swap file under /tmp is not secure. Create a dir with access granted
++# only to Redis user and configure Redis to create the swap file there.
++vm-swap-file tmp/redis/redis.swap
++
++# vm-max-memory configures the VM to use at max the specified amount of
++# RAM. Everything that does not fit will be swapped on disk *if* possible, that
++# is, if there is still enough contiguous space in the swap file.
++#
++# With vm-max-memory 0 the system will swap everything it can. Not a good
++# default, just specify the max amount of RAM you can in bytes, but it's
++# better to leave some margin. For instance specify an amount of RAM
++# that's more or less between 60 and 80% of your free RAM.
++vm-max-memory 0
++
++# The Redis swap file is split into pages. An object can be saved using multiple
++# contiguous pages, but pages can't be shared between different objects.
++# So if your page is too big, small objects swapped out on disk will waste
++# a lot of space. If your page is too small, there is less space in the swap
++# file (assuming you configured the same number of total swap file pages).
++#
++# If you use a lot of small objects, use a page size of 64 or 32 bytes.
++# If you use a lot of big objects, use a bigger page size.
++# If unsure, use the default :)
++vm-page-size 32
++
++# Number of total memory pages in the swap file.
++# Given that the page table (a bitmap of free/used pages) is taken in memory,
++# every 8 pages on disk will consume 1 byte of RAM.
++#
++# The total swap size is vm-page-size * vm-pages
++#
++# With the default of 32-bytes memory pages and 134217728 pages Redis will
++# use a 4 GB swap file, that will use 16 MB of RAM for the page table.
++#
++# It's better to use the smallest acceptable value for your application,
++# but the default is large in order to work in most conditions.
++vm-pages 134217728
++
++# Max number of VM I/O threads running at the same time.
++# These threads are used to read/write data from/to swap file, since they
++# also encode and decode objects from disk to memory or the reverse, a bigger
++# number of threads can help with big objects even if they can't help with
++# I/O itself as the physical device may not be able to cope with many
++# read/write operations at the same time.
++#
++# The special value of 0 turns off threaded I/O and enables the blocking
++# Virtual Memory implementation.
++vm-max-threads 4
++
++############################### ADVANCED CONFIG ###############################
++
++# Glue small output buffers together in order to send small replies in a
++# single TCP packet. Uses a bit more CPU but most of the times it is a win
++# in terms of number of queries per second. Use 'yes' if unsure.
++glueoutputbuf yes
++
++# Hashes are encoded in a special way (much more memory efficient) when they
++# have at max a given number of elements, and the biggest element does not
++# exceed a given threshold. You can configure these limits with the following
++# configuration directives.
++hash-max-zipmap-entries 64
++hash-max-zipmap-value 512
++
++# Active rehashing uses 1 millisecond every 100 milliseconds of CPU time in
++# order to help rehashing the main Redis hash table (the one mapping top-level
++# keys to values). The hash table implementation redis uses (see dict.c)
++# performs a lazy rehashing: the more operations you run against a hash table
++# that is rehashing, the more rehashing "steps" are performed, so if the
++# server is idle the rehashing is never complete and some more memory is used
++# by the hash table.
++#
++# The default is to use this millisecond 10 times every second in order to
++# active rehashing the main dictionaries, freeing memory when possible.
++#
++# If unsure:
++# use "activerehashing no" if you have hard latency requirements and it is
++# not a good thing in your environment that Redis can reply from time to time
++# to queries with 2 milliseconds delay.
++#
++# use "activerehashing yes" if you don't have such hard requirements but
++# want to free memory asap when possible.
++activerehashing yes
++
++################################## INCLUDES ###################################
++
++# Include one or more other config files here.  This is useful if you
++# have a standard template that goes to all redis server but also need
++# to customize a few per-server settings.  Include files can include
++# other files, so use this wisely.
++#
++# include /path/to/local.conf
++# include /path/to/other.conf
 === added file 'bin/schema.sql'
 --- bin/schema.sql	1970-01-01 00:00:00 +0000
 +++ bin/schema.sql	2011-12-15 20:24:01 +0000
@@ -0,0 +1,3 @@
++CREATE TABLE paths (id SERIAL PRIMARY KEY NOT NULL, path TEXT NOT NULL UNIQUE);
++CREATE TABLE points (path_id INTEGER, bucket INTEGER, value TEXT, count INTEGER);
++CREATE INDEX points_idx ON points (path_id, bucket);
 === added file 'bin/start-database.sh'
 --- bin/start-database.sh	1970-01-01 00:00:00 +0000
 +++ bin/start-database.sh	2011-12-15 20:24:01 +0000
@@ -0,0 +1,63 @@
++#! /bin/bash
++
++ROOTDIR=${ROOTDIR:-`bzr root`}
++if [ ! -d "$ROOTDIR"  ]; then
++    echo "ROOTDIR '$ROOTDIR' doesn't exist" >&2
++    exit 1
++fi
++
++DATABASES="
++distinct
++"
++
++function setup_database() {
++    local TESTDIR=$1
++
++    echo "## Starting postgres in $TESTDIR ##"
++    mkdir -p "$TESTDIR/data"
++    chmod 700 "$TESTDIR/data"
++
++    export PGHOST="$TESTDIR"
++    export PGDATA="$TESTDIR/data"
++    if [ -d /usr/lib/postgresql/8.4 ]; then
++        export PGBINDIR=/usr/lib/postgresql/8.4/bin
++    elif [ -d /usr/lib/postgresql/8.3 ]; then
++        export PGBINDIR=/usr/lib/postgresql/8.3/bin
++    else
++        echo "Cannot find valid parent for PGBINDIR"
++    fi
++    $PGBINDIR/initdb -E UNICODE -D $PGDATA
++    # set up the database options file
++    if [ ! -e $PGDATA/postgresql.conf ]; then
++        echo "PostgreSQL data directory apparently didn't init"
++    else
++    (
++        cat <<EOF
++search_path='\$user,public,ts2'
++add_missing_from=false
++log_statement='all'
++log_line_prefix='[%m] %q%u@%d %c '
++fsync = off
++EOF
++    ) > $PGDATA/postgresql.conf
++    fi
++    $PGBINDIR/initdb -A trust &>/dev/null
++    $PGBINDIR/pg_ctl start -w -D $TESTDIR/data -l $TESTDIR/postgres.log -o "-F -k $TESTDIR -h ''"
++    for db in $DATABASES; do
++        $PGBINDIR/createdb --encoding UNICODE "$db" &>/dev/null
++        $PGBINDIR/createlang plpgsql "$db"
++    done
++    $PGBINDIR/createuser --superuser --createdb "postgres" &>/dev/null
++    # create the additional users we need via a psql script
++    $PGBINDIR/psql -U postgres template1 <<EOF
++CREATE ROLE client INHERIT;
++
++CREATE USER client IN ROLE client;
++EOF
++    echo "To set your environment so psql will connect to this DB instance type:"
++    echo "    export PGHOST=$TESTDIR"
++    echo "## Done. ##"
++    echo -n host=$TESTDIR dbname=distinct > $ROOTDIR/tmp/pg.dsn
++}
++
++setup_database $ROOTDIR/tmp/db1
 === added file 'bin/start-redis.sh'
 --- bin/start-redis.sh	1970-01-01 00:00:00 +0000
 +++ bin/start-redis.sh	2011-12-15 20:24:01 +0000
@@ -0,0 +1,3 @@
++mkdir -p tmp/redis
++start-stop-daemon --start -b -m -d . -p tmp/redis.pid --exec /usr/bin/redis-server -- `pwd`/bin/redis.conf
++
 === added file 'bin/stop-database.sh'
 --- bin/stop-database.sh	1970-01-01 00:00:00 +0000
 +++ bin/stop-database.sh	2011-12-15 20:24:01 +0000
@@ -0,0 +1,22 @@
++#! /bin/bash
++
++ROOTDIR=${ROOTDIR:-`bzr root`}
++if [ ! -d "$ROOTDIR"  ]; then
++    echo "ROOTDIR '$ROOTDIR' doesn't exist" >&2
++    exit 1
++fi
++
++if [ -d /usr/lib/postgresql/8.4 ]; then
++    PGBINDIR=/usr/lib/postgresql/8.4/bin
++elif [ -d /usr/lib/postgresql/8.3 ]; then
++    PGBINDIR=/usr/lib/postgresql/8.3/bin
++else
++    echo "Cannot find valid parent for PGBINDIR"
++fi
++
++# setting PGDATA tells pg_ctl which DB to talk to
++export PGDATA=$ROOTDIR/tmp/db1/data/
++$PGBINDIR/pg_ctl status > /dev/null
++if [ $? = 0 ]; then
++    $PGBINDIR/pg_ctl stop -t 60 -w -m fast
++fi
 === added file 'bin/stop-redis.sh'
 --- bin/stop-redis.sh	1970-01-01 00:00:00 +0000
 +++ bin/stop-redis.sh	2011-12-15 20:24:01 +0000
@@ -0,0 +1,1 @@
++start-stop-daemon --stop -p tmp/redis.pid --exec /usr/bin/redis-server -- `pwd`/bin/redis.conf
 === added directory 'distinctdb'
 === added file 'distinctdb/__init__.py'
 === added file 'distinctdb/distinctmetric.py'
 --- distinctdb/distinctmetric.py	1970-01-01 00:00:00 +0000
 +++ distinctdb/distinctmetric.py	2011-12-15 20:24:01 +0000
@@ -0,0 +1,160 @@
++import time
++import threading
++
++import psycopg2
++import redis
++
++from zope.interface import implements
++from twisted.internet import reactor
++from txstatsd.itxstatsd import IMetric
++
++ONE_MINUTE = 60
++ONE_HOUR = 60 * ONE_MINUTE
++ONE_DAY = 24 * ONE_HOUR
++
++
++class DistinctMetricReporter(object):
++    """
++    Keeps a measurement of the number of distinct items seen and the number
++    of times it has seen each one.
++    """
++    implements(IMetric)
++
++    periods = [ONE_MINUTE, 5 * ONE_MINUTE, ONE_HOUR, 12 * ONE_HOUR, ONE_DAY]
++
++    def __init__(self, name, wall_time_func=time.time, prefix="",
++            bucket_size=ONE_DAY, dsn=None, redis_host=None, redis_port=None):
++        """Construct a metric we expect to be periodically updated.
++
++        @param name: Indicates what is being instrumented.
++        @param wall_time_func: Function for obtaining wall time.
++        @param prefix: If present, a string to prepend to the message
++            composed when C{report} is called.
++        """
++        self.name = name
++        self.wall_time_func = wall_time_func
++        if prefix:
++            prefix += '.'
++        self.prefix = prefix
++        self.bucket_size = bucket_size
++        self.dsn = dsn
++        self.redis_host = redis_host
++        if redis_port is None:
++            redis_port = 6379
++        self.redis_port = redis_port
++        self.metric_id = None
++        self.build_bucket()
++        self.redis_flush_lock = threading.Lock()
++        self.redis_count = {}
++
++        if redis_host is not None:
++            self.redis = redis.client.Redis(host=redis_host, port=redis_port)
++
++    def build_bucket(self, timestamp=None):
++        self.max = 0
++        self.bucket = {}
++        self.bucket_no = self.get_bucket_no(timestamp)
++
++    def get_bucket_no(self, timestamp=None):
++        if timestamp is None:
++            timestamp = self.wall_time_func()
++        return int(timestamp / (self.bucket_size))
++
++    def process(self, fields):
++        self.update(fields[0])
++
++    def update(self, item):
++        value = self.bucket.get(item, 0) + 1
++
++        self.bucket[item] = value
++        if value > self.max:
++            self.max = value
++
++        now = self.wall_time_func()
++        if self.redis_host is not None:
++            reactor.callInThread(self._update_count, None, item, now)
++
++    def bucket_name_for(self, period):
++        return "bucket_" + str(period)
++
++    def _update_count(self, period, value, when):
++        for period in self.periods:
++            self.redis.zadd(self.bucket_name_for(period), value, when)
++
++    def _flush_redis(self, now):
++        if self.redis_flush_lock.acquire(False) is False:
++            return
++        try:
++            for period in self.periods:
++                bucket = self.bucket_name_for(period)
++                self.redis.zremrangebyscore(bucket, 0, now - period)
++                self.redis_count[bucket] = self.redis.zcard(bucket)
++        finally:
++            self.redis_flush_lock.release()
++
++    def count(self, period):
++        return self.redis_count.get(self.bucket_name_for(period), 0)
++
++    def count_1min(self):
++        return self.count(ONE_MINUTE)
++
++    def count_5min(self):
++        return self.count(5 * ONE_MINUTE)
++
++    def count_1hour(self):
++        return self.count(ONE_HOUR)
++
++    def count_12hours(self):
++        return self.count(12 * ONE_HOUR)
++
++    def count_1day(self):
++        return self.count(ONE_DAY)
++
++    def _save_bucket(self, bucket, bucket_no):
++        path = self.prefix + self.name
++        c = psycopg2.connect(self.dsn)
++        cr = c.cursor()
++        if self.metric_id is None:
++            cr.execute("SELECT * FROM paths WHERE path = %s", (path,))
++            row = cr.fetchone()
++            if row is None:
++                cr.execute("INSERT INTO paths (path) VALUES (%s) "
++                            "RETURNING (id)", (path,))
++                row = cr.fetchone()
++                cr.execute("commit")
++
++            self.metric_id = row[0]
++
++        for i, (k, v) in enumerate(bucket.iteritems()):
++            cr.execute("INSERT INTO points (path_id, bucket, value, count) "
++                "VALUES (%s, %s, %s, %s)", (self.metric_id, bucket_no,
++                                            k, v))
++            if i % 1000 == 0:
++                cr.execute("commit")
++        cr.execute("commit")
++
++    def save_bucket(self, bucket, bucket_no):
++        if self.dsn is not None:
++            reactor.callInThread(self._save_bucket, bucket, bucket_no)
++
++    def flush(self, interval, timestamp):
++        current_bucket = self.get_bucket_no(timestamp)
++        if current_bucket != self.bucket_no:
++            self.save_bucket(self.bucket, self.bucket_no)
++            self.build_bucket(timestamp)
++
++        if self.redis_host is not None:
++            reactor.callInThread(self._flush_redis, timestamp)
++
++        metrics = []
++        items = {".messages": len(self.bucket),
++                 ".max": self.max,
++                 ".count_1min": self.count_1min(),
++                 ".count_5min": self.count_5min(),
++                 ".count_1hour": self.count_1hour(),
++                 ".count_12hours": self.count_12hours(),
++                 ".count_1day": self.count_1day(),
++                 }
++        for item, value in items.iteritems():
++            metrics.append((self.prefix + self.name + item, value, timestamp))
++        return metrics
 === added directory 'distinctdb/tests'
 === added file 'distinctdb/tests/__init__.py'
 === added file 'distinctdb/tests/test_distinct.py'
 --- distinctdb/tests/test_distinct.py	1970-01-01 00:00:00 +0000
 +++ distinctdb/tests/test_distinct.py	2011-12-15 20:24:01 +0000
@@ -0,0 +1,200 @@
++# Copyright (C) 2011 Canonical
++# All Rights Reserved
++
++import ConfigParser
++from cStringIO import StringIO
++import os
++import time
++try:
++    from subprocess import check_output
++except ImportError:
++    import subprocess
++    def check_output(args):
++        return subprocess.Popen(args,
++            stdout=subprocess.PIPE).communicate()[0]
++
++import psycopg2
++import redis
++
++from twisted.trial.unittest import TestCase
++from twisted.plugin import getPlugins
++from twisted.plugins import distinctdbplugin
++from txstatsd.itxstatsd import IMetricFactory
++from txstatsd import service
++
++from distinctdb import distinctmetric as distinct
++
++
++class TestDistinctMetricReporter(TestCase):
++
++    def test_get_bucket_no(self):
++        _wall_time = [0]
++
++        def _time():
++            return _wall_time[0]
++
++        dmr = distinct.DistinctMetricReporter("test", wall_time_func=_time)
++        self.assertEquals(dmr.get_bucket_no(), 0)
++        _wall_time = [60 * 60 * 24 + 1]
++        dmr.update("one")
++        self.assertEquals(dmr.get_bucket_no(), 1)
++
++    def test_max(self):
++        _wall_time = [0]
++
++        def _time():
++            return _wall_time[0]
++
++        dmr = distinct.DistinctMetricReporter("test", wall_time_func=_time)
++        self.assertEquals(dmr.get_bucket_no(), 0)
++        self.assertEquals(dmr.max, 0)
++        dmr.update("one")
++        dmr.update("one")
++        dmr.update("two")
++        self.assertEquals(dmr.max, 2)
++        dmr.flush(1, 60 * 60 * 24 + 1)
++        dmr.update("one")
++        self.assertEquals(dmr.max, 1)
++
++    def test_reports(self):
++        _wall_time = [0]
++
++        def _time():
++            return _wall_time[0]
++
++        result = {}
++
++        dmr = distinct.DistinctMetricReporter("test", wall_time_func=_time)
++
++        def save(b, b_no):
++            result["bucket"] = b
++            result["bucket_no"] = b_no
++        dmr.save_bucket = save
++        dmr.update("one")
++        dmr.update("one")
++        dmr.update("two")
++        day = 60 * 60 * 24 + 1
++        dmr.flush(1, day)
++        dmr.update("three")
++        self.assertEquals(result,
++            {"bucket": {"one": 2, "two": 1}, "bucket_no": 0})
++        result = dmr.flush(1, day)
++
++        self.assertTrue(("test.max", 1, day) in result)
++        self.assertTrue(("test.messages", 1, day) in result)
++
++    def test_configure(self):
++        class TestOptions(service.OptionsGlue):
++            optParameters = [["test", "t", "default", "help"]]
++            config_section = "statsd"
++
++        o = TestOptions()
++        config_file = ConfigParser.RawConfigParser()
++        config_file.readfp(StringIO("[statsd]\n\n[plugin_distinctdb]\n"
++            "dsn = dbdsn\nbucket_size = 100"))
++        o.configure(config_file)
++        dmf = distinctdbplugin.DistinctMetricFactory()
++        dmf.configure(o)
++        dmr = dmf.build_metric("foo", "bar", time.time)
++        self.assertEquals(dmr.bucket_size, 100)
++        self.assertEquals(dmr.dsn, "dbdsn")
++
++
++class TestPlugin(TestCase):
++
++    def test_factory(self):
++        self.assertTrue(distinctdbplugin.distinct_metric_factory in \
++                        list(getPlugins(IMetricFactory)))
++
++
++class TestDatabase(TestCase):
++
++    def setUp(self):
++        rootdir = check_output(["bzr", "root"]).strip()
++        dsn_file = os.path.join(rootdir, "tmp", "pg.dsn")
++        self.dsn = open(dsn_file).read()
++        self.conn = psycopg2.connect(self.dsn)
++
++    def tearDown(self):
++        cr = self.conn.cursor()
++        cr.execute("rollback")
++        cr.execute("DELETE FROM paths")
++        cr.execute("DELETE FROM points")
++        cr.execute("commit")
++
++    def test_connect(self):
++        cr = self.conn.cursor()
++        cr.execute("SELECT 0")
++        result = cr.fetchall()
++        self.assertEquals(result, [(0,)])
++
++    def test_create_metric_id(self):
++        dmr = distinct.DistinctMetricReporter("test", dsn=self.dsn)
++        dmr._save_bucket({}, 0)
++        cr = self.conn.cursor()
++        cr.execute("SELECT * FROM paths WHERE path = 'test'")
++        cr.execute("SELECT * FROM paths")
++        self.assertEquals(len(cr.fetchall()), 1)
++
++    def test_find_saved_data(self):
++        dmr = distinct.DistinctMetricReporter("test", dsn=self.dsn)
++        dmr.update("one")
++        dmr.update("one")
++        dmr.update("two")
++        dmr._save_bucket(dmr.bucket, 0)
++        cr = self.conn.cursor()
++        cr.execute("SELECT * FROM points ORDER BY value")
++        rows = cr.fetchall()
++        self.assertEquals(rows, [(dmr.metric_id, 0, "one", 2),
++                                 (dmr.metric_id, 0, "two", 1)])
++
++    def test_load_metric_id(self):
++        dmr = distinct.DistinctMetricReporter("test", dsn=self.dsn)
++        dmr._save_bucket({}, 0)
++        dmr2 = distinct.DistinctMetricReporter("test", dsn=self.dsn)
++        dmr2._save_bucket({}, 0)
++        self.assertEquals(dmr.metric_id, dmr2.metric_id)
++
++
++class TestRedis(TestCase):
++
++    def tearDown(self):
++        r = redis.client.Redis(host="localhost", port=16379)
++        r.flushdb()
++
++    def test_connect(self):
++        r = redis.client.Redis(host="localhost", port=16379)
++        r.ping()
++
++    def test_configure(self):
++        class TestOptions(service.OptionsGlue):
++            optParameters = [["test", "t", "default", "help"]]
++            config_section = "statsd"
++
++        o = TestOptions()
++        config_file = ConfigParser.RawConfigParser()
++        config_file.readfp(StringIO("[statsd]\n\n[plugin_distinctdb]\n"
++            "redis_host = localhost\nredis_port = 16379"))
++        o.configure(config_file)
++        dmf = distinctdbplugin.DistinctMetricFactory()
++        dmf.configure(o)
++        dmr = dmf.build_metric("foo", "bar", time.time)
++        self.assertEquals(dmr.redis_host, "localhost")
++        self.assertEquals(dmr.redis_port, 16379)
++
++    def test_usage(self):
++        dmr = distinct.DistinctMetricReporter("test",
++            redis_host="localhost", redis_port=16379)
++
++        self.assertEquals(dmr.count_1hour(), 0)
++        dmr._update_count(distinct.ONE_HOUR, "one", 0)
++        dmr._flush_redis(1)
++        self.assertEquals(dmr.count_1hour(), 1)
++        dmr._update_count(distinct.ONE_HOUR, "one", 0)
++        dmr._flush_redis(1)
++        self.assertEquals(dmr.count_1hour(), 1)
++        dmr._update_count(distinct.ONE_HOUR, "two", 30 * distinct.ONE_MINUTE)
++        dmr._flush_redis(30 * distinct.ONE_MINUTE)
++        self.assertEquals(dmr.count_1hour(), 2)
++        dmr._flush_redis(distinct.ONE_HOUR + 10 * distinct.ONE_MINUTE)
++        self.assertEquals(dmr.count_1hour(), 1)
 === added file 'distinctdb/version.py'
 --- distinctdb/version.py	1970-01-01 00:00:00 +0000
 +++ distinctdb/version.py	2011-12-15 20:24:01 +0000
@@ -0,0 +1,1 @@
++distinctplugin = "0.0.1"
 === added file 'setup.py'
 --- setup.py	1970-01-01 00:00:00 +0000
 +++ setup.py	2011-12-15 20:24:01 +0000
@@ -0,0 +1,64 @@
++from distutils.core import setup
++from distutils.command.install import install
++import os
++
++from twisted.plugin import IPlugin, getPlugins
++
++from distinctdb import version
++
++# If setuptools is present, use it to find_packages(); otherwise fall
++# back to the compatibility wrapper defined below.
++extra_setup_args = {}
++try:
++    import setuptools
++    from setuptools import find_packages
++except ImportError:
++    def find_packages():
++        """
++        Compatibility wrapper.
++
++        Taken from storm setup.py.
++        """
++        packages = []
++        for directory, subdirectories, files in os.walk("distinctdb"):
++            if '__init__.py' in files:
++                packages.append(directory.replace(os.sep, '.'))
++        return packages
++
++long_description = """
++A plugin for txstatsd to count distinct values using redis and postgres.
++"""
++
++
++class TxPluginInstaller(install):
++    def run(self):
++        install.run(self)
++        # Make sure we refresh the plugin list when installing, so we know
++        # we have enough write permissions.
++        # see http://twistedmatrix.com/documents/current/core/howto/plugin.html
++        # "when installing or removing software which provides Twisted plugins,
++        # the site administrator should be sure the cache is regenerated"
++        list(getPlugins(IPlugin))
++
++setup(
++    cmdclass={'install': TxPluginInstaller},
++    name="distinctplugin",
++    version=version.distinctplugin,
++    description="A txstatsd plugin for distinct counts",
++    author="txStatsD Developers",
++    url="https://launchpad.net/txstatsd",
++    license="MIT",
++    packages=find_packages() + ["twisted.plugins"],
++    long_description=long_description,
++    classifiers=[
++        "Development Status :: 4 - Beta",
++        "Intended Audience :: Developers",
++        "Intended Audience :: System Administrators",
++        "Intended Audience :: Information Technology",
++        "Programming Language :: Python",
++        "Topic :: Database",
++        "Topic :: Internet :: WWW/HTTP",
++        "License :: OSI Approved :: MIT License",
++       ],
++    **extra_setup_args
++    )
 === renamed file 'setup.py' => 'setup.py.moved'
 === added directory 'twisted'
 === renamed directory 'twisted' => 'twisted.moved'
 === added directory 'twisted/plugins'
 === added file 'twisted/plugins/distinctdbplugin.py'
 --- twisted/plugins/distinctdbplugin.py	1970-01-01 00:00:00 +0000
 +++ twisted/plugins/distinctdbplugin.py	2011-12-15 20:24:01 +0000
@@ -0,0 +1,38 @@
++from zope.interface import implements
++
++from twisted.plugin import IPlugin
++from txstatsd.itxstatsd import IMetricFactory
++from distinctdb.distinctmetric import DistinctMetricReporter, ONE_DAY
++
++
++class DistinctMetricFactory(object):
++    implements(IMetricFactory, IPlugin)
++
++    name = "distinct"
++    metric_type = "d"
++
++    bucket_size = None
++    dsn = None
++    metric_ids = None
++    redis_host = None
++    redis_port = None
++
++    def build_metric(self, prefix, name, wall_time_func=None):
++        return DistinctMetricReporter(name, prefix=prefix,
++                                      wall_time_func=wall_time_func,
++                                      bucket_size=self.bucket_size,
++                                      dsn=self.dsn, redis_host=self.redis_host,
++                                      redis_port=self.redis_port)
++
++    def configure(self, options):
++        self.section = dict(options.get("plugin_distinctdb", {}))
++        try:
++            self.bucket_size = int(self.section.get("bucket_size", ONE_DAY))
++        except ValueError:
++            self.bucket_size = ONE_DAY
++
++        self.dsn = self.section.get("dsn", None)
++        self.redis_host = self.section.get("redis_host", None)
++        self.redis_port = self.section.get("redis_port", None)
++        if self.redis_port is not None:
++            self.redis_port = int(self.redis_port)
++
++distinct_metric_factory = DistinctMetricFactory()
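
A note on the behaviour that TestRedis.test_usage exercises: the one-hour distinct count appears to be a sliding window, where a repeated value is counted once and a value drops out once it is more than an hour older than the flush time. The sketch below reproduces that behaviour in memory only. It is not the code in distinctdb/distinctmetric.py (which is not shown in this part of the diff) and it does not talk to Redis; the class name WindowedDistinctCounter and its methods are hypothetical, chosen for illustration.

```python
ONE_MINUTE = 60
ONE_HOUR = 60 * ONE_MINUTE


class WindowedDistinctCounter(object):
    """Hypothetical in-memory stand-in for a one-hour distinct count."""

    def __init__(self, window=ONE_HOUR):
        self.window = window
        # Maps each value to the most recent timestamp at which it was seen.
        self.last_seen = {}

    def update(self, value, when):
        # Seeing a value again only refreshes its timestamp, so it is
        # never counted twice.
        previous = self.last_seen.get(value)
        if previous is None or when > previous:
            self.last_seen[value] = when

    def flush(self, now):
        # Drop every value last seen before the window that ends at `now`.
        cutoff = now - self.window
        expired = [v for v, ts in self.last_seen.items() if ts < cutoff]
        for value in expired:
            del self.last_seen[value]

    def count(self):
        return len(self.last_seen)


if __name__ == "__main__":
    # Replays the same sequence of updates and flushes as test_usage.
    counter = WindowedDistinctCounter()
    counter.update("one", 0)
    counter.flush(1)
    print(counter.count())
    counter.update("one", 0)
    counter.flush(1)
    print(counter.count())
    counter.update("two", 30 * ONE_MINUTE)
    counter.flush(30 * ONE_MINUTE)
    print(counter.count())
    counter.flush(ONE_HOUR + 10 * ONE_MINUTE)
    print(counter.count())
```

Keeping only the latest timestamp per value is one simple way to get these semantics; a Redis-backed version would presumably need an equivalent per-value timestamp so that expiry can be applied at flush time.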