Merge lp:~jderose/microfiber/db-dump into lp:microfiber

Proposed by Jason Gerard DeRose
Status: Merged
Merged at revision: 133
Proposed branch: lp:~jderose/microfiber/db-dump
Merge into: lp:microfiber
Diff against target: 368 lines (+260/-26)
3 files modified
doc/microfiber.rst (+29/-1)
microfiber.py (+107/-25)
test_microfiber.py (+124/-0)
To merge this branch: bzr merge lp:~jderose/microfiber/db-dump
Reviewer: microfiber dev (status: Pending)
Review via email: mp+119829@code.launchpad.net

Description of the change

Use the revised Database.dump() method like this:

>>> db.dump('foo.json')

Or gzip-compress the dump:

>>> db.dump('foo.json.gz')

As before, doc['_rev'] is deleted before dumping to the file. However, the attachments kwarg was removed, and we now always dump *without* attachments: even a stub doc['_attachments'] gets deleted when present. We'll probably add more flexibility here later, but for now this suits the needs of Novacut and Dmedia.
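
Note that this branch also removes the old Database.load() method. If you ever need to pull a dump back into CouchDB, something along these lines should work (a rough sketch only; load_dump() is a hypothetical helper, not part of microfiber):

import json
from gzip import GzipFile
from io import TextIOWrapper

def load_dump(db, filename):
    # Hypothetical helper, not part of this branch: read a dump created by
    # Database.dump() and save the docs back into *db*.
    if filename.lower().endswith('.json.gz'):
        fp = TextIOWrapper(GzipFile(filename, 'rb'))
    else:
        fp = open(filename, 'r')
    try:
        docs = json.load(fp)  # the dump is a plain JSON list of docs
    finally:
        fp.close()
    db.save_many(docs)  # docs have no '_rev', so they are saved as new docs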

Also, two big performance improvements were made (a simplified sketch of the pattern follows this list):

1) We request docs 50 at a time via Database.get_many() (roughly a 4x improvement)

2) We make the CouchDB requests in a separate thread, so json.dump() can keep working while we wait on CouchDB (roughly a further 2x improvement on top of the above)
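
The pattern is roughly the following producer/consumer arrangement (a simplified sketch only; the real implementation is the new FakeList class in the diff below):

import threading
from queue import Queue

def _fetch_worker(db, rows, queue, batch_size=50):
    # Producer: fetch docs from CouchDB in batches and hand them to the
    # consumer through a small, bounded queue.
    try:
        for i in range(0, len(rows), batch_size):
            ids = [row['id'] for row in rows[i:i + batch_size]]
            queue.put(db.get_many(ids))
        queue.put(None)  # signal completion
    except Exception as e:
        queue.put(e)

def iter_docs(db, rows):
    # Consumer: yield docs one at a time while the worker thread keeps the
    # next CouchDB request in flight.
    queue = Queue(2)  # bounded, so memory usage stays roughly constant
    thread = threading.Thread(target=_fetch_worker, args=(db, rows, queue))
    thread.daemon = True
    thread.start()
    while True:
        batch = queue.get()
        if batch is None:
            break
        if isinstance(batch, Exception):
            raise batch
        for doc in batch:
            yield doc
    thread.join()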

Revision history for this message
David Jordan (dmj726) wrote :

Approved.

Preview Diff

1=== modified file 'doc/microfiber.rst'
2--- doc/microfiber.rst 2012-08-15 22:11:15 +0000
3+++ doc/microfiber.rst 2012-08-16 04:26:20 +0000
4@@ -308,7 +308,7 @@
5
6
7 .. class:: Database(name, env='http://localhost:5984/')
8-
9+
10 Makes requests relative to a CouchDB database URL.
11
12 Create a :class:`Database` like this:
13@@ -424,6 +424,34 @@
14 *Note:* for subtle reasons that take a while to explain, you probably
15 don't want to use this method. Instead use
16 :meth:`Database.save_many()`.
17+
18+ .. method:: dump(filename)
19+
20+ Dump this database to regular JSON file *filename*.
21+
22+ For example:
23+
24+ >>> db = Database('foo') #doctest: +SKIP
25+ >>> db.dump('foo.json') #doctest: +SKIP
26+
27+ Or if *filename* ends with ``'.json.gz'``, the file will be
28+ gzip-compressed as it is written:
29+
30+ >>> db.dump('foo.json.gz') #doctest: +SKIP
31+
32+ CouchDB is a bit awkward in that its API doesn't offer a nice way to
33+ make a request whose response is suitable for writing directly to a
34+ file, without decoding/encoding. It would be nice if that dump could
35+ be loaded directly from the file as well. One of the biggest issues is
36+ that a dump really needs to have doc['_rev'] removed.
37+
38+ This method is a compromise on many fronts, but it was made with these
39+ priorities:
40+
41+ 1. Readability of the dumped JSON file
42+
43+ 2. High performance and low memory usage, despite the fact that
44+ we must encode and decode each doc
45
46
47
48
49=== modified file 'microfiber.py'
50--- microfiber.py 2012-08-15 22:11:15 +0000
51+++ microfiber.py 2012-08-16 04:26:20 +0000
52@@ -40,15 +40,17 @@
53 """
54
55 from os import urandom
56-from io import BufferedReader
57+from io import BufferedReader, TextIOWrapper
58 from base64 import b32encode, b64encode
59 import json
60+from gzip import GzipFile
61 import time
62 from hashlib import sha1
63 import hmac
64 from urllib.parse import urlparse, urlencode, quote_plus
65 from http.client import HTTPConnection, HTTPSConnection, BadStatusLine
66 import threading
67+from queue import Queue
68 import math
69
70
71@@ -413,27 +415,74 @@
72 super().__init__(msg.format(count))
73
74
75+def _start_thread(target, *args):
76+ thread = threading.Thread(target=target, args=args)
77+ thread.daemon = True
78+ thread.start()
79+ return thread
80+
81+
82+class SmartQueue(Queue):
83+ """
84+ Queue with custom get() that raises exception instances from the queue.
85+ """
86+
87+ def get(self, block=True, timeout=None):
88+ item = super().get(block, timeout)
89+ if isinstance(item, Exception):
90+ raise item
91+ return item
92+
93+
94+def _fakelist_worker(rows, db, queue):
95+ try:
96+ for doc_ids in id_slice_iter(rows, 50):
97+ queue.put(db.get_many(doc_ids))
98+ queue.put(None)
99+ except Exception as e:
100+ queue.put(e)
101+
102+
103 class FakeList(list):
104- __slots__ = ('_count', '_iterable')
105-
106- def __init__(self, count, iterable):
107+ """
108+ Trick ``json.dump()`` into doing memory-efficient incremental encoding.
109+
110+ This class is a hack to allow `Database.dump()` to dump a large database
111+ while keeping the memory usage constant.
112+
113+ It also provides two hacks to improve the performance of `Database.dump()`:
114+
115+ 1. Documents are retrieved 50 at a time using `Database.get_many()`
116+
117+ 2. The CouchDB requests are made in a separate thread so `json.dump()`
118+ can be busy doing work while we're waiting for a response
119+ """
120+
121+ __slots__ = ('_rows', '_db')
122+
123+ def __init__(self, rows, db):
124 super().__init__()
125- self._count = count
126- self._iterable = iterable
127+ self._rows = rows
128+ self._db = db
129
130 def __len__(self):
131- return self._count
132+ return len(self._rows)
133
134 def __iter__(self):
135- for doc in self._iterable:
136- yield doc
137-
138-
139-def iter_all_docs(rows, db, attachments=True):
140- for r in rows:
141- doc = db.get(r['id'], rev=r['value']['rev'], attachments=attachments)
142- del doc['_rev']
143- yield doc
144+ queue = SmartQueue(2)
145+ thread = _start_thread(_fakelist_worker, self._rows, self._db, queue)
146+ while True:
147+ docs = queue.get()
148+ if docs is None:
149+ break
150+ for doc in docs:
151+ del doc['_rev']
152+ try:
153+ del doc['_attachments']
154+ except KeyError:
155+ pass
156+ yield doc
157+ thread.join() # Make sure reader() terminates
158
159
160 class CouchBase(object):
161@@ -876,12 +925,45 @@
162 options['reduce'] = False
163 return self.get('_design', design, '_view', view, **options)
164
165- def dump(self, fp, attachments=True):
166- rows = self.get('_all_docs')['rows']
167- iterable = iter_all_docs(rows, self, attachments)
168- docs = FakeList(len(rows), iterable)
169- json.dump({'docs': docs}, fp, ensure_ascii=False, sort_keys=True, indent=4, separators=(',', ': '))
170-
171- def load(self, fp):
172- return self.post(fp, '_bulk_docs')
173-
174+ def dump(self, filename):
175+ """
176+ Dump this database to regular JSON file *filename*.
177+
178+ For example:
179+
180+ >>> db = Database('foo') #doctest: +SKIP
181+ >>> db.dump('foo.json') #doctest: +SKIP
182+
183+ Or if *filename* ends with ``'.json.gz'``, the file will be
184+ gzip-compressed as it is written:
185+
186+ >>> db.dump('foo.json.gz') #doctest: +SKIP
187+
188+ CouchDB is a bit awkward in that its API doesn't offer a nice way to
189+ make a request whose response is suitable for writing directly to a
190+ file, without decoding/encoding. It would be nice if that dump could
191+ be loaded directly from the file as well. One of the biggest issues is
192+ that a dump really needs to have doc['_rev'] removed.
193+
194+ This method is a compromise on many fronts, but it was made with these
195+ priorities:
196+
197+ 1. Readability of the dumped JSON file
198+
199+ 2. High performance and low memory usage, despite the fact that
200+ we must encode and decode each doc
201+ """
202+ if filename.lower().endswith('.json.gz'):
203+ _fp = open(filename, 'wb')
204+ fp = TextIOWrapper(GzipFile('docs.json', fileobj=_fp, mtime=1))
205+ else:
206+ fp = open(filename, 'w')
207+ rows = self.get('_all_docs', endkey='_')['rows']
208+ docs = FakeList(rows, self)
209+ json.dump(docs, fp,
210+ ensure_ascii=False,
211+ sort_keys=True,
212+ indent=4,
213+ separators=(',', ': '),
214+ )
215+
216
217=== modified file 'test_microfiber.py'
218--- test_microfiber.py 2012-08-15 22:11:15 +0000
219+++ test_microfiber.py 2012-08-16 04:26:20 +0000
220@@ -30,6 +30,7 @@
221 from base64 import b64encode, b64decode, b32encode, b32decode
222 from copy import deepcopy
223 import json
224+import gzip
225 import time
226 import io
227 import tempfile
228@@ -58,6 +59,26 @@
229 B32ALPHABET = frozenset('234567ABCDEFGHIJKLMNOPQRSTUVWXYZ')
230
231
232+# A sample view from Dmedia:
233+doc_type = """
234+function(doc) {
235+ emit(doc.type, null);
236+}
237+"""
238+doc_time = """
239+function(doc) {
240+ emit(doc.time, null);
241+}
242+"""
243+doc_design = {
244+ '_id': '_design/doc',
245+ 'views': {
246+ 'type': {'map': doc_type, 'reduce': '_count'},
247+ 'time': {'map': doc_time},
248+ },
249+}
250+
251+
252 def is_microfiber_id(_id):
253 assert isinstance(_id, str)
254 return (
255@@ -1014,6 +1035,52 @@
256 self.env = None
257
258
259+class TestFakeList(LiveTestCase):
260+ def test_init(self):
261+ db = microfiber.Database('foo', self.env)
262+ self.assertTrue(db.ensure())
263+
264+ # Test when DB is empty
265+ rows = []
266+ fake = microfiber.FakeList(rows, db)
267+ self.assertIsInstance(fake, list)
268+ self.assertIs(fake._rows, rows)
269+ self.assertIs(fake._db, db)
270+ self.assertEqual(len(fake), 0)
271+ self.assertEqual(list(fake), [])
272+
273+ # Test when there are some docs
274+ ids = sorted(test_id() for i in range(201))
275+ orig = [
276+ {'_id': _id, 'hello': 'мир', 'welcome': 'все'}
277+ for _id in ids
278+ ]
279+ docs = deepcopy(orig)
280+ db.save_many(docs)
281+ rows = db.get('_all_docs')['rows']
282+ fake = microfiber.FakeList(rows, db)
283+ self.assertIsInstance(fake, list)
284+ self.assertIs(fake._rows, rows)
285+ self.assertIs(fake._db, db)
286+ self.assertEqual(len(fake), 201)
287+ self.assertEqual(list(fake), orig)
288+
289+ # Verify that _attachments get deleted
290+ for doc in docs:
291+ db.put_att('application/octet-stream', b'foobar', doc['_id'], 'baz',
292+ rev=doc['_rev']
293+ )
294+ for _id in ids:
295+ self.assertIn('_attachments', db.get(_id))
296+ rows = db.get('_all_docs')['rows']
297+ fake = microfiber.FakeList(rows, db)
298+ self.assertIsInstance(fake, list)
299+ self.assertIs(fake._rows, rows)
300+ self.assertIs(fake._db, db)
301+ self.assertEqual(len(fake), 201)
302+ self.assertEqual(list(fake), orig)
303+
304+
305 class TestCouchBaseLive(LiveTestCase):
306 klass = microfiber.CouchBase
307
308@@ -1676,3 +1743,60 @@
309 db.get_many([ids[17], nope, ids[18]]),
310 [docs[17], None, docs[18]]
311 )
312+
313+ def test_dump(self):
314+ db = microfiber.Database('foo', self.env)
315+ self.assertTrue(db.ensure())
316+ docs = [
317+ {'_id': test_id(), 'hello': 'мир', 'welcome': 'все'}
318+ for i in range(200)
319+ ]
320+ docs_s = microfiber.dumps(
321+ sorted(docs, key=lambda d: d['_id']),
322+ pretty=True
323+ )
324+ docs.append(deepcopy(doc_design))
325+ checksum = md5(docs_s.encode('utf-8')).hexdigest()
326+ db.save_many(docs)
327+
328+ # Test with .json
329+ dst = path.join(self.tmpcouch.paths.bzr, 'foo.json')
330+ db.dump(dst)
331+ self.assertEqual(open(dst, 'r').read(), docs_s)
332+ self.assertEqual(
333+ md5(open(dst, 'rb').read()).hexdigest(),
334+ checksum
335+ )
336+
337+ # Test with .json.gz
338+ dst = path.join(self.tmpcouch.paths.bzr, 'foo.json.gz')
339+ db.dump(dst)
340+ gz_checksum = md5(open(dst, 'rb').read()).hexdigest()
341+ self.assertEqual(
342+ md5(gzip.GzipFile(dst, 'rb').read()).hexdigest(),
343+ checksum
344+ )
345+
346+ # Test that timestamp doesn't change gz_checksum
347+ time.sleep(2)
348+ db.dump(dst)
349+ self.assertEqual(
350+ md5(open(dst, 'rb').read()).hexdigest(),
351+ gz_checksum
352+ )
353+
354+ # Test that filename doesn't change gz_checksum
355+ dst = path.join(self.tmpcouch.paths.bzr, 'bar.json.gz')
356+ db.dump(dst)
357+ self.assertEqual(
358+ md5(open(dst, 'rb').read()).hexdigest(),
359+ gz_checksum
360+ )
361+
362+ # Make sure .JSON.GZ also works, that case is ignored
363+ dst = path.join(self.tmpcouch.paths.bzr, 'FOO.JSON.GZ')
364+ db.dump(dst)
365+ self.assertEqual(
366+ md5(open(dst, 'rb').read()).hexdigest(),
367+ gz_checksum
368+ )
