Merge lp:~jderose/dmedia/core-split into lp:dmedia
Status: Merged
Merged at revision: 517
Proposed branch: lp:~jderose/dmedia/core-split
Merge into: lp:dmedia
Diff against target: 1314 lines (+502/-239), 17 files modified:

- dmedia-service (+4/-4)
- dmedia/core.py (+0/-2)
- dmedia/extractor.py (+8/-20)
- dmedia/importer.py (+78/-56)
- dmedia/metastore.py (+114/-43)
- dmedia/schema.py (+33/-30)
- dmedia/server.py (+7/-3)
- dmedia/tests/test_extractor.py (+6/-18)
- dmedia/tests/test_importer.py (+126/-9)
- dmedia/tests/test_local.py (+2/-1)
- dmedia/tests/test_metastore.py (+94/-1)
- dmedia/tests/test_schema.py (+16/-39)
- dmedia/tests/test_server.py (+2/-1)
- dmedia/tests/test_transfers.py (+2/-2)
- dmedia/tests/test_verification.py (+9/-8)
- dmedia/verification.py (+1/-1)
- dmedia/views.py (+0/-1)

To merge this branch: bzr merge lp:~jderose/dmedia/core-split
Related bugs:
| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| David Jordan | Approve | | |

Review via email: mp+135401@code.launchpad.net
Commit message
Description of the change
While working on this, I found a serious bug in the CouchDB ProxyApp:
https:/
So this merge fixes that: I wasn't including the query string (when present) in the rebuilt request-line, so this request:
POST /db/docid?
was being forwarded as:
POST /db/docid
which resulted in a conflict and changes not being replicated.
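The fix (in dmedia/server.py, visible in the preview diff below) just rejoins QUERY_STRING onto PATH_INFO when rebuilding the request target. A minimal sketch; the function name here is illustrative, and body/header handling is omitted:

```python
def rebuild_request_target(environ):
    # Rebuild the request target from a WSGI environ, keeping the
    # query string; previously only PATH_INFO was forwarded.
    path = environ['PATH_INFO']
    query = environ['QUERY_STRING']
    if query:
        path = '?'.join([path, query])
    return (environ['REQUEST_METHOD'], path)
```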
The main focus, though, was this bug: trimming the dmedia/file docs in dmedia-0 down to only their essential schema:
https:/
And now when you do an import, that's exactly what you get. There is also a new dmedia/log type of doc, and each time a file is imported, one of these log docs is saved. These log docs should never be updated; they store what was unique about the specific occasion on which the file was imported: the file name and mtime, the time of the event, the import_id, batch_id, and so on. If a duplicate file is imported again, that second import gets its own log doc, preserving all the details we need for rich auditing.
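Concretely, the trimmed core doc is what the new schema.create_file() builds; this is lifted from the preview diff below (ch is a filestore ContentHash):

```python
from microfiber import Attachment, encode_attachment

def create_file(timestamp, ch, stored, origin='user'):
    # Minimal 'dmedia/file' doc for dmedia-0: intrinsic facts only,
    # no extracted metadata; leaf_hashes travel as an attachment.
    leaf_hashes = Attachment('application/octet-stream', ch.leaf_hashes)
    return {
        '_id': ch.id,
        '_attachments': {
            'leaf_hashes': encode_attachment(leaf_hashes),
        },
        'type': 'dmedia/file',
        'time': timestamp,
        'atime': int(timestamp),
        'bytes': ch.file_size,
        'origin': origin,
        'stored': stored,
    }
```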
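Likewise, the new schema.create_log() from the diff shows exactly what one of these log docs records:

```python
import os
from microfiber import random_id

def create_log(timestamp, ch, file, **kw):
    doc = {
        '_id': ch.id[:4] + random_id()[4:],  # first 4 chars of the file ID, then random
        'type': 'dmedia/log',
        'time': timestamp,
        'file_id': ch.id,
        'bytes': ch.file_size,
        'dir': os.path.dirname(file.name),
        'name': os.path.basename(file.name),
        'mtime': file.mtime,
    }
    doc.update(kw)  # import_id, batch_id, machine_id, project_id
    return doc
```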
I also fixed some small issues with how the dmedia/file doc gets updated on a duplicate import. Previously we weren't preserving the file pinning, so that's now fixed, thanks to the new importer.
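The duplicate-import update now goes through a small importer.merge_stored() helper (again from the preview diff): copies and mtime are refreshed, any stale verified timestamp is dropped, and keys like pinned are left untouched:

```python
def merge_stored(old, new):
    # Merge per-store entries from *new* into doc['stored'] (*old*).
    for (key, value) in new.items():
        assert set(value) == set(['copies', 'mtime'])
        if key in old:
            old[key].update(value)          # refresh copies and mtime
            old[key].pop('verified', None)  # stale; force re-verification
        else:
            old[key] = value
```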
After thinking about it more, I decided *not* to include any extracted metadata in dmedia-0. Instead, we'll keep that just in the project databases, which is where Novacut and Dmedia currently expect this metadata anyway. I made a few small tweaks to the schema of the dmedia/file docs saved in the project databases, but all the same info is still there. I did, however, decide to no longer store the leaf_hashes redundantly in the project databases: it's better to download the leaf_hashes from a peer only once you're actually going to put a file into your library, and therefore have a corresponding doc in dmedia-0.
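For comparison, the project-database variant now starts with empty _attachments (no redundant leaf_hashes) and a meta dict for the extracted metadata; from schema.create_project_file() in the diff:

```python
import os

def create_project_file(timestamp, ch, file, origin='user'):
    # 'dmedia/file' doc as saved in a project database; no leaf_hashes.
    return {
        '_id': ch.id,
        '_attachments': {},
        'type': 'dmedia/file',
        'time': timestamp,
        'bytes': ch.file_size,
        'origin': origin,
        'ctime': file.mtime,
        'dir': os.path.dirname(file.name),
        'name': os.path.basename(file.name),
        'tags': {},
        'meta': {},
    }
```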
Some other small changes that made sense to include in this merge:
* extractor now uses microfiber's Attachment and encode_attachment()
* Hugely improved the performance of MetaStore.scan() and MetaStore.relink() by using Database.get_many() to grab 25 docs at a time (see the batching sketch after this list)
* Because of the above performance improvement, I turned the background worker back on (it was disabled for the 12.10 release for performance reasons)
* doc['stored'][store_id]['verified'] is now an int rather than a float (matching the schema.py and verification.py changes in the diff)
* removed the "user" design from dmedia-0 (it wasn't being used anyway, and the metadata it drew on is no longer in the dmedia/file docs in dmedia-0)
* ImportWorker now does extraction in its own thread so it's less likely to stall the read/hash/write train (a sketch of the queue wiring follows this list)
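The batching behind the MetaStore speedup above is the new metastore.relink_iter() from the diff; each batch of 25 stat objects is then resolved with a single Database.get_many() call:

```python
def relink_iter(fs, count=25):
    # Yield stat objects from FileStore *fs* in batches of *count*.
    buf = []
    for st in fs:
        buf.append(st)
        if len(buf) >= count:
            yield buf
            buf = []
    if buf:
        yield buf

# Usage inside MetaStore.relink(), roughly:
#   for buf in relink_iter(fs):
#       docs = self.db.get_many([st.id for st in buf])
```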
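And the ImportWorker threading is a plain bounded-queue producer/consumer. Here is a self-contained sketch of the pattern only; the names and the do_extraction() stand-in are illustrative, not the real dmedia code:

```python
from queue import Queue
from threading import Thread

def do_extraction(item):
    print('extracting', item)  # stand-in for extract()/merge_thumbnail()

extraction_queue = Queue(10)  # bounded, as in ImportWorker

def extractor():
    while True:
        item = extraction_queue.get()
        if item is None:  # sentinel put after the last file
            break
        do_extraction(item)

thread = Thread(target=extractor)
thread.start()
for item in ['a.mov', 'b.cr2']:
    # the read/hash/write loop only blocks if extraction falls 10 items behind
    extraction_queue.put(item)
extraction_queue.put(None)
thread.join()
```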
Preview Diff
1 | === modified file 'dmedia-service' |
2 | --- dmedia-service 2012-11-01 11:06:53 +0000 |
3 | +++ dmedia-service 2012-11-21 12:55:46 +0000 |
4 | @@ -173,7 +173,7 @@ |
5 | self.core.set_default_store('shared') |
6 | self.env_s = dumps(self.core.env, pretty=True) |
7 | log.info('Finished core startup in %.3f', time.time() - start) |
8 | - GObject.timeout_add(350, self.on_idle1) |
9 | + GObject.timeout_add(250, self.on_idle1) |
10 | |
11 | def on_idle1(self): |
12 | """ |
13 | @@ -183,7 +183,7 @@ |
14 | if self.couch.pki.user.key_file is not None: |
15 | self.peer = Browser(self, self.couch) |
16 | self.udisks.monitor() |
17 | - GObject.timeout_add(350, self.on_idle2) |
18 | + GObject.timeout_add(500, self.on_idle2) |
19 | |
20 | def on_idle2(self): |
21 | """ |
22 | @@ -191,7 +191,7 @@ |
23 | """ |
24 | log.info('[idle2 at time %.3f]', time.time() - start_time) |
25 | start_thread(self.core.init_project_views) |
26 | - GObject.timeout_add(12*1000, self.on_idle3) |
27 | + GObject.timeout_add(5000, self.on_idle3) |
28 | |
29 | def on_idle3(self): |
30 | """ |
31 | @@ -203,7 +203,7 @@ |
32 | port = env['port'] |
33 | self.avahi = Avahi(self.core.env, port, ssl_config) |
34 | self.avahi.run() |
35 | - #GObject.timeout_add(60*1000, self.on_idle4) |
36 | + GObject.timeout_add(15*1000, self.on_idle4) |
37 | |
38 | def on_idle4(self): |
39 | """ |
40 | |
41 | === modified file 'dmedia/core.py' |
42 | --- dmedia/core.py 2012-11-01 11:06:53 +0000 |
43 | +++ dmedia/core.py 2012-11-21 12:55:46 +0000 |
44 | @@ -274,10 +274,8 @@ |
45 | while True: |
46 | try: |
47 | fs = self.queue.get() |
48 | - start = time.time() |
49 | self.ms.scan(fs) |
50 | self.ms.relink(fs) |
51 | - log.info('%.3f to check %r', time.time() - start, fs) |
52 | except Exception as e: |
53 | log.exception('Error in background worker:') |
54 | |
55 | |
56 | === modified file 'dmedia/extractor.py' |
57 | --- dmedia/extractor.py 2012-05-04 03:15:23 +0000 |
58 | +++ dmedia/extractor.py 2012-11-21 12:55:46 +0000 |
59 | @@ -32,15 +32,13 @@ |
60 | from base64 import b64encode |
61 | import time |
62 | import calendar |
63 | -from collections import namedtuple |
64 | |
65 | from filestore import hash_fp |
66 | +from microfiber import Attachment, encode_attachment |
67 | |
68 | import dmedia |
69 | |
70 | |
71 | -Thumbnail = namedtuple('Thumbnail', 'content_type data') |
72 | - |
73 | dmedia_extract = 'dmedia-extract' |
74 | tree = path.dirname(path.dirname(path.abspath(dmedia.__file__))) |
75 | if path.isfile(path.join(tree, 'setup.py')): |
76 | @@ -267,7 +265,7 @@ |
77 | dst, |
78 | ] |
79 | check_call(cmd) |
80 | - return Thumbnail('image/jpeg', open(dst, 'rb').read()) |
81 | + return Attachment('image/jpeg', open(dst, 'rb').read()) |
82 | |
83 | |
84 | def thumbnail_video(src, tmp): |
85 | @@ -294,11 +292,14 @@ |
86 | cmd = [ |
87 | 'ufraw-batch', |
88 | '--embedded-image', |
89 | + '--noexif', |
90 | + '--size', str(SIZE), |
91 | + '--compression', '90', |
92 | '--output', dst, |
93 | src, |
94 | ] |
95 | check_call(cmd) |
96 | - return thumbnail_image(dst, tmp) |
97 | + return Attachment('image/jpeg', open(dst, 'rb').read()) |
98 | |
99 | |
100 | thumbnailers = { |
101 | @@ -332,14 +333,6 @@ |
102 | shutil.rmtree(tmp) |
103 | |
104 | |
105 | -def to_attachment(thm): |
106 | - assert isinstance(thm, Thumbnail) |
107 | - return { |
108 | - 'content_type': thm.content_type, |
109 | - 'data': b64encode(thm.data).decode('utf-8'), |
110 | - } |
111 | - |
112 | - |
113 | def get_thumbnail_func(doc): |
114 | media = doc.get('media') |
115 | if media not in ('video', 'image'): |
116 | @@ -351,7 +344,6 @@ |
117 | return thumbnail_image |
118 | |
119 | |
120 | - |
121 | def merge_thumbnail(src, doc): |
122 | func = get_thumbnail_func(doc) |
123 | if func is None: |
124 | @@ -359,10 +351,6 @@ |
125 | thm = wrap_thumbnail_func(func, src) |
126 | if thm is None: |
127 | return False |
128 | - doc['_attachments']['thumbnail'] = to_attachment(thm) |
129 | + doc['_attachments']['thumbnail'] = encode_attachment(thm) |
130 | return True |
131 | - |
132 | - |
133 | - |
134 | - |
135 | - |
136 | + |
137 | |
138 | === modified file 'dmedia/importer.py' |
139 | --- dmedia/importer.py 2012-11-01 11:06:53 +0000 |
140 | +++ dmedia/importer.py 2012-11-21 12:55:46 +0000 |
141 | @@ -35,10 +35,12 @@ |
142 | import logging |
143 | import mimetypes |
144 | import shutil |
145 | +from queue import Queue |
146 | |
147 | -import microfiber |
148 | +from microfiber import NotFound, has_attachment, encode_attachment |
149 | from filestore import FileStore, scandir, batch_import_iter, statvfs |
150 | |
151 | +from dmedia.parallel import start_thread |
152 | from dmedia.util import get_project_db |
153 | from dmedia.units import bytes10 |
154 | from dmedia import workers, schema |
155 | @@ -159,6 +161,16 @@ |
156 | pass |
157 | |
158 | |
159 | +def merge_stored(old, new): |
160 | + for (key, value) in new.items(): |
161 | + assert set(value) == set(['copies', 'mtime']) |
162 | + if key in old: |
163 | + old[key].update(value) |
164 | + old[key].pop('verified', None) |
165 | + else: |
166 | + old[key] = value |
167 | + |
168 | + |
169 | class ImportWorker(workers.CouchWorker): |
170 | def __init__(self, env, q, key, args): |
171 | super().__init__(env, q, key, args) |
172 | @@ -169,6 +181,7 @@ |
173 | self.extract = self.env.get('extract', True) |
174 | self.project = get_project_db(self.env['project_id'], self.env) |
175 | self.project.ensure() |
176 | + self.extraction_queue = Queue(10) |
177 | |
178 | def execute(self, basedir, extra=None): |
179 | self.extra = extra |
180 | @@ -218,23 +231,38 @@ |
181 | return stores |
182 | |
183 | def import_all(self): |
184 | + self.thumbnail = None |
185 | + extractor = start_thread(self.extractor) |
186 | stores = self.get_filestores() |
187 | try: |
188 | - for (status, file, doc) in self.import_iter(*stores): |
189 | + for (status, file, ch) in self.import_iter(*stores): |
190 | self.doc['stats'][status]['count'] += 1 |
191 | self.doc['stats'][status]['bytes'] += file.size |
192 | self.doc['files'][file.name]['status'] = status |
193 | - if doc is not None: |
194 | - self.db.save(doc) |
195 | - self.doc['files'][file.name]['id'] = doc['_id'] |
196 | + if ch is not None: |
197 | + self.doc['files'][file.name]['id'] = ch.id |
198 | self.doc['time_end'] = time.time() |
199 | self.doc['rate'] = get_rate(self.doc) |
200 | finally: |
201 | self.db.save(self.doc) |
202 | + self.extraction_queue.put(None) |
203 | + extractor.join() |
204 | + if self.thumbnail: |
205 | + self.doc['_attachments'] = { |
206 | + 'thumbnail': encode_attachment(self.thumbnail) |
207 | + } |
208 | + self.db.save(self.doc) |
209 | + del self.doc['_rev'] |
210 | + self.project.post(self.doc) |
211 | self.emit('finished', self.id, self.doc['stats']) |
212 | |
213 | def import_iter(self, *filestores): |
214 | - need_thumbnail = True |
215 | + common = { |
216 | + 'import_id': self.id, |
217 | + 'batch_id': self.env.get('batch_id'), |
218 | + 'machine_id': self.env.get('machine_id'), |
219 | + 'project_id': self.env.get('project_id'), |
220 | + } |
221 | for (file, ch) in batch_import_iter(self.batch, *filestores, |
222 | callback=self.progress_callback |
223 | ): |
224 | @@ -242,49 +270,12 @@ |
225 | assert file.size == 0 |
226 | yield ('empty', file, None) |
227 | continue |
228 | - |
229 | - common = { |
230 | - 'import': { |
231 | - 'import_id': self.id, |
232 | - 'machine_id': self.env.get('machine_id'), |
233 | - 'batch_id': self.env.get('batch_id'), |
234 | - 'project_id': self.env.get('project_id'), |
235 | - 'src': file.name, |
236 | - 'mtime': file.mtime, |
237 | - }, |
238 | - 'meta': {}, |
239 | - 'ctime': file.mtime, |
240 | - 'name': path.basename(file.name), |
241 | - } |
242 | - ext = normalize_ext(file.name) |
243 | - if ext: |
244 | - common['ext'] = ext |
245 | - extract(file.name, common) |
246 | - |
247 | - # Project doc |
248 | - try: |
249 | - doc = self.project.get(ch.id) |
250 | - except microfiber.NotFound: |
251 | - doc = schema.create_project_file( |
252 | - ch.id, ch.file_size, ch.leaf_hashes |
253 | - ) |
254 | - doc.update(common) |
255 | - merge_thumbnail(file.name, doc) |
256 | - log.info('adding to %r', self.project) |
257 | - self.project.save(doc) |
258 | - if need_thumbnail and 'thumbnail' in doc['_attachments']: |
259 | - (content_type, data) = self.project.get_att(ch.id, 'thumbnail') |
260 | - self.db.save(self.doc) |
261 | - self.db.put_att(content_type, data, self.id, 'thumbnail', |
262 | - rev=self.doc['_rev'] |
263 | - ) |
264 | - self.doc = self.db.get(self.id) |
265 | - self.emit('import_thumbnail', self.id, ch.id) |
266 | - need_thumbnail = False |
267 | - |
268 | - # Core doc |
269 | + timestamp = time.time() |
270 | + self.extraction_queue.put((timestamp, file, ch)) |
271 | + log_doc = schema.create_log(timestamp, ch, file, **common) |
272 | stored = dict( |
273 | - (fs.id, |
274 | + ( |
275 | + fs.id, |
276 | { |
277 | 'copies': fs.copies, |
278 | 'mtime': fs.stat(ch.id).mtime, |
279 | @@ -294,14 +285,15 @@ |
280 | ) |
281 | try: |
282 | doc = self.db.get(ch.id) |
283 | - doc['stored'].update(stored) |
284 | - yield ('duplicate', file, doc) |
285 | - except microfiber.NotFound: |
286 | - doc = schema.create_file( |
287 | - ch.id, ch.file_size, ch.leaf_hashes, stored |
288 | - ) |
289 | - doc.update(common) |
290 | - yield ('new', file, doc) |
291 | + doc['origin'] = 'user' |
292 | + doc['atime'] = int(timestamp) |
293 | + merge_stored(doc['stored'], stored) |
294 | + self.db.save_many([log_doc, doc]) |
295 | + yield ('duplicate', file, ch) |
296 | + except NotFound: |
297 | + doc = schema.create_file(timestamp, ch, stored) |
298 | + self.db.save_many([log_doc, doc]) |
299 | + yield ('new', file, ch) |
300 | |
301 | def progress_callback(self, count, size): |
302 | self.emit('progress', self.id, |
303 | @@ -309,6 +301,36 @@ |
304 | size, self.batch.size |
305 | ) |
306 | |
307 | + def extractor(self): |
308 | + try: |
309 | + need_thumbnail = True |
310 | + common = { |
311 | + 'import_id': self.id, |
312 | + 'batch_id': self.env.get('batch_id'), |
313 | + 'machine_id': self.env.get('machine_id'), |
314 | + } |
315 | + while True: |
316 | + item = self.extraction_queue.get() |
317 | + if item is None: |
318 | + break |
319 | + (timestamp, file, ch) = item |
320 | + try: |
321 | + doc = self.project.get(ch.id) |
322 | + except NotFound: |
323 | + doc = schema.create_project_file(timestamp, ch, file) |
324 | + ext = normalize_ext(file.name) |
325 | + if ext: |
326 | + doc['ext'] = ext |
327 | + extract(file.name, doc) |
328 | + merge_thumbnail(file.name, doc) |
329 | + doc.update(common) |
330 | + self.project.save(doc) |
331 | + if need_thumbnail and has_attachment(doc, 'thumbnail'): |
332 | + need_thumbnail = False |
333 | + self.thumbnail = self.project.get_att(ch.id, 'thumbnail') |
334 | + self.emit('import_thumbnail', self.id, ch.id) |
335 | + except Exception: |
336 | + log.exception('Error in extractor thread:') |
337 | |
338 | |
339 | class ImportManager(workers.CouchManager): |
340 | |
341 | === modified file 'dmedia/metastore.py' |
342 | --- dmedia/metastore.py 2012-07-01 12:50:50 +0000 |
343 | +++ dmedia/metastore.py 2012-11-21 12:55:46 +0000 |
344 | @@ -28,7 +28,7 @@ |
345 | import logging |
346 | |
347 | from filestore import CorruptFile, FileNotFound, check_root_hash |
348 | -from microfiber import NotFound |
349 | +from microfiber import NotFound, id_slice_iter |
350 | |
351 | from .util import get_db |
352 | |
353 | @@ -108,14 +108,17 @@ |
354 | |
355 | |
356 | def mark_mismatch(doc, fs): |
357 | + """ |
358 | + Update mtime and copies, delete verified, preserve pinned. |
359 | + """ |
360 | _id = doc['_id'] |
361 | stored = get_dict(doc, 'stored') |
362 | - new = { |
363 | - 'mtime': fs.stat(_id).mtime, |
364 | - 'copies': 0, |
365 | - 'verified': 0, |
366 | - } |
367 | - update(stored, fs.id, new) |
368 | + value = get_dict(stored, fs.id) |
369 | + value.update( |
370 | + mtime=fs.stat(_id).mtime, |
371 | + copies=0, |
372 | + ) |
373 | + value.pop('verified', None) |
374 | |
375 | |
376 | class VerifyContext: |
377 | @@ -154,10 +157,13 @@ |
378 | if exc_type is None: |
379 | return |
380 | if issubclass(exc_type, FileNotFound): |
381 | + log.warning('%s is not in %r', self.doc['_id'], self.fs) |
382 | remove_from_stores(self.doc, self.fs) |
383 | elif issubclass(exc_type, CorruptFile): |
384 | + log.warning('%s has wrong size in %r', self.doc['_id'], self.fs) |
385 | mark_corrupt(self.doc, self.fs, time.time()) |
386 | elif issubclass(exc_type, MTimeMismatch): |
387 | + log.warning('%s has wrong mtime in %r', self.doc['_id'], self.fs) |
388 | mark_mismatch(self.doc, self.fs) |
389 | else: |
390 | return False |
391 | @@ -165,6 +171,17 @@ |
392 | return True |
393 | |
394 | |
395 | +def relink_iter(fs, count=25): |
396 | + buf = [] |
397 | + for st in fs: |
398 | + buf.append(st) |
399 | + if len(buf) >= count: |
400 | + yield buf |
401 | + buf = [] |
402 | + if buf: |
403 | + yield buf |
404 | + |
405 | + |
406 | class MetaStore: |
407 | def __init__(self, db): |
408 | self.db = db |
409 | @@ -172,45 +189,99 @@ |
410 | def __repr__(self): |
411 | return '{}({!r})'.format(self.__class__.__name__, self.db) |
412 | |
413 | + def scan(self, fs): |
414 | + """ |
415 | + Make sure files we expect to be in the file-store *fs* actually are. |
416 | + |
417 | + A fundamental design tenet of Dmedia is that it doesn't particularly |
418 | + trust its metadata, and instead does frequent reality checks. This |
419 | + allows Dmedia to work even though removable storage is constantly |
420 | + "offline". In other distributed file-systems, this is usually called |
421 | + being in a "network-partitioned" state. |
422 | + |
423 | + Dmedia deals with removable storage via a quickly decaying confidence |
424 | + in its metadata. If a removable drive hasn't been connected longer |
425 | + than some threshold, Dmedia will update all those copies to count for |
426 | + zero durability. |
427 | + |
428 | + And whenever a removable drive (or any drive for that matter) is |
429 | + connected, Dmedia immediately checks to see what files are actually on |
430 | + the drive, and whether they have good integrity. |
431 | + |
432 | + `MetaStore.scan()` is the most important reality check that Dmedia does |
433 | + because it's fast and can therefore be done quite often. Thousands of |
434 | + files can be scanned in a few seconds. |
435 | + |
436 | + The scan ensures that for every file expected in this file-store, the |
437 | + file exists, has the correct size, and the expected mtime. |
438 | + |
439 | + If the file doesn't exist in this file-store, its store_id is deleted |
440 | + from doc['stored'] and the doc is saved. |
441 | + |
442 | + If the file has the wrong size, it's moved into the corrupt location in |
443 | + the file-store. Then the doc is updated accordingly marking the file as |
444 | + being corrupt in this file-store, and the doc is saved. |
445 | + |
446 | + If the file doesn't have the expected mtime in this file-store, this |
447 | + copy gets downgraded to zero copies worth of durability, and the last |
448 | + verification timestamp is deleted, if present. This will put the file |
449 | + first in line for full content-hash verification. If the verification |
450 | + passes, the durability is raised back to the appropriate number of |
451 | + copies. |
452 | + |
453 | + :param fs: a `FileStore` instance |
454 | + """ |
455 | + start = time.time() |
456 | + log.info('Scanning FileStore %s at %r', fs.id, fs.parentdir) |
457 | + rows = self.db.view('file', 'stored', key=fs.id)['rows'] |
458 | + for ids in id_slice_iter(rows): |
459 | + for doc in self.db.get_many(ids): |
460 | + _id = doc['_id'] |
461 | + with ScanContext(self.db, fs, doc): |
462 | + st = fs.stat(_id) |
463 | + if st.size != doc.get('bytes'): |
464 | + src_fp = open(st.name, 'rb') |
465 | + raise fs.move_to_corrupt(src_fp, _id, |
466 | + file_size=doc['bytes'], |
467 | + bad_file_size=st.size, |
468 | + ) |
469 | + stored = get_dict(doc, 'stored') |
470 | + s = get_dict(stored, fs.id) |
471 | + if st.mtime != s['mtime']: |
472 | + raise MTimeMismatch() |
473 | + # Update the atime for the dmedia/store doc |
474 | + try: |
475 | + doc = self.db.get(fs.id) |
476 | + assert doc['type'] == 'dmedia/store' |
477 | + doc['atime'] = int(time.time()) |
478 | + self.db.save(doc) |
479 | + log.info('Updated FileStore %s atime to %r', fs.id, doc['atime']) |
480 | + except NotFound: |
481 | + log.warning('No doc for FileStore %s', fs.id) |
482 | + log.info('%.3f to scan %r', time.time() - start, fs) |
483 | + |
484 | def relink(self, fs): |
485 | + """ |
486 | + Find known files that we didn't expect in `FileStore` *fs*. |
487 | + """ |
488 | + start = time.time() |
489 | log.info('Relinking FileStore %r at %r', fs.id, fs.parentdir) |
490 | - for st in fs: |
491 | - try: |
492 | - doc = self.db.get(st.id) |
493 | - except NotFound: |
494 | - continue |
495 | - stored = get_dict(doc, 'stored') |
496 | - s = get_dict(stored, fs.id) |
497 | - if s.get('mtime') == st.mtime: |
498 | - continue |
499 | - new = { |
500 | - 'mtime': st.mtime, |
501 | - 'verified': 0, |
502 | - 'copies': (0 if 'mtime' in s else fs.copies), |
503 | - } |
504 | - s.update(new) |
505 | - self.db.save(doc) |
506 | - |
507 | - def scan(self, fs): |
508 | - log.info('Scanning FileStore %r at %r', fs.id, fs.parentdir) |
509 | - v = self.db.view('file', 'stored', key=fs.id, reduce=False) |
510 | - for row in v['rows']: |
511 | - _id = row['id'] |
512 | - doc = self.db.get(_id) |
513 | - leaf_hashes = self.db.get_att(_id, 'leaf_hashes')[1] |
514 | - check_root_hash(_id, doc['bytes'], leaf_hashes) |
515 | - with ScanContext(self.db, fs, doc): |
516 | - st = fs.stat(_id) |
517 | - if st.size != doc['bytes']: |
518 | - src_fp = open(st.name, 'rb') |
519 | - raise fs.move_to_corrupt(src_fp, _id, |
520 | - file_size=doc['bytes'], |
521 | - bad_file_size=st.size, |
522 | - ) |
523 | + for buf in relink_iter(fs): |
524 | + docs = self.db.get_many([st.id for st in buf]) |
525 | + for (st, doc) in zip(buf, docs): |
526 | + if doc is None: |
527 | + continue |
528 | stored = get_dict(doc, 'stored') |
529 | - s = get_dict(stored, fs.id) |
530 | - if st.mtime != s['mtime']: |
531 | - raise MTimeMismatch() |
532 | + value = get_dict(stored, fs.id) |
533 | + if value: |
534 | + continue |
535 | + log.info('Relinking %s in %r', st.id, fs) |
536 | + value.update( |
537 | + mtime=st.mtime, |
538 | + copies=fs.copies, |
539 | + ) |
540 | + self.db.save(doc) |
541 | + log.info('%.3f to relink %r', time.time() - start, fs) |
542 | |
543 | def remove(self, fs, _id): |
544 | doc = self.db.get(_id) |
545 | |
546 | === modified file 'dmedia/schema.py' |
547 | --- dmedia/schema.py 2012-11-14 00:37:47 +0000 |
548 | +++ dmedia/schema.py 2012-11-21 12:55:46 +0000 |
549 | @@ -245,9 +245,10 @@ |
550 | import re |
551 | import time |
552 | import socket |
553 | +import os |
554 | |
555 | from filestore import DIGEST_B32LEN, B32ALPHABET, TYPE_ERROR |
556 | -from microfiber import random_id, RANDOM_B32LEN |
557 | +from microfiber import random_id, RANDOM_B32LEN, encode_attachment, Attachment |
558 | |
559 | |
560 | # schema-compatibility version: |
561 | @@ -710,7 +711,7 @@ |
562 | _check(doc, ['stored', key, 'mtime'], (int, float), |
563 | (_at_least, 0), |
564 | ) |
565 | - _check_if_exists(doc, ['stored', key, 'verified'], (int, float), |
566 | + _check_if_exists(doc, ['stored', key, 'verified'], int, |
567 | (_at_least, 0), |
568 | ) |
569 | _check_if_exists(doc, ['stored', key, 'pinned'], bool, |
570 | @@ -739,22 +740,12 @@ |
571 | (_at_least, 0), |
572 | ) |
573 | |
574 | - # 'ext' like 'mov' |
575 | - _check_if_exists(doc, ['ext'], str, |
576 | - (_matches, EXT_PAT), |
577 | - ) |
578 | - |
579 | - # 'content_type' like 'video/quicktime' |
580 | - _check_if_exists(doc, ['content_type'], str) |
581 | - |
582 | # proxy_of |
583 | if doc['origin'] == 'proxy': |
584 | _check(doc, ['proxy_of'], str, |
585 | _intrinsic_id, |
586 | ) |
587 | |
588 | - check_file_optional(doc) |
589 | - |
590 | |
591 | def check_file_optional(doc): |
592 | |
593 | @@ -825,45 +816,57 @@ |
594 | ####################################################### |
595 | # Functions for creating specific types of dmedia docs: |
596 | |
597 | -def create_file(_id, file_size, leaf_hashes, stored, origin='user'): |
598 | + |
599 | +def create_log(timestamp, ch, file, **kw): |
600 | + doc = { |
601 | + '_id': ch.id[:4] + random_id()[4:], |
602 | + 'type': 'dmedia/log', |
603 | + 'time': timestamp, |
604 | + 'file_id': ch.id, |
605 | + 'bytes': ch.file_size, |
606 | + 'dir': os.path.dirname(file.name), |
607 | + 'name': os.path.basename(file.name), |
608 | + 'mtime': file.mtime, |
609 | + } |
610 | + doc.update(kw) |
611 | + return doc |
612 | + |
613 | + |
614 | +def create_file(timestamp, ch, stored, origin='user'): |
615 | """ |
616 | Create a minimal 'dmedia/file' document. |
617 | """ |
618 | - timestamp = time.time() |
619 | + leaf_hashes = Attachment('application/octet-stream', ch.leaf_hashes) |
620 | return { |
621 | - '_id': _id, |
622 | + '_id': ch.id, |
623 | '_attachments': { |
624 | - 'leaf_hashes': { |
625 | - 'data': b64encode(leaf_hashes).decode('utf-8'), |
626 | - 'content_type': 'application/octet-stream', |
627 | - } |
628 | + 'leaf_hashes': encode_attachment(leaf_hashes), |
629 | }, |
630 | 'type': 'dmedia/file', |
631 | 'time': timestamp, |
632 | 'atime': int(timestamp), |
633 | - 'bytes': file_size, |
634 | + 'bytes': ch.file_size, |
635 | 'origin': origin, |
636 | 'stored': stored, |
637 | } |
638 | |
639 | |
640 | -def create_project_file(_id, file_size, leaf_hashes, origin='user'): |
641 | +def create_project_file(timestamp, ch, file, origin='user'): |
642 | """ |
643 | Create a minimal 'dmedia/file' document. |
644 | """ |
645 | return { |
646 | - '_id': _id, |
647 | - '_attachments': { |
648 | - 'leaf_hashes': { |
649 | - 'data': b64encode(leaf_hashes).decode('utf-8'), |
650 | - 'content_type': 'application/octet-stream', |
651 | - } |
652 | - }, |
653 | + '_id': ch.id, |
654 | + '_attachments': {}, |
655 | 'type': 'dmedia/file', |
656 | - 'time': time.time(), |
657 | - 'bytes': file_size, |
658 | + 'time': timestamp, |
659 | + 'bytes': ch.file_size, |
660 | 'origin': origin, |
661 | + 'ctime': file.mtime, |
662 | + 'dir': os.path.dirname(file.name), |
663 | + 'name': os.path.basename(file.name), |
664 | 'tags': {}, |
665 | + 'meta': {}, |
666 | } |
667 | |
668 | |
669 | |
670 | === modified file 'dmedia/server.py' |
671 | --- dmedia/server.py 2012-10-24 23:23:30 +0000 |
672 | +++ dmedia/server.py 2012-11-21 12:55:46 +0000 |
673 | @@ -31,7 +31,7 @@ |
674 | import logging |
675 | |
676 | from filestore import DIGEST_B32LEN, B32ALPHABET, LEAF_SIZE |
677 | -from microfiber import dumps, basic_auth_header, CouchBase |
678 | +from microfiber import dumps, basic_auth_header, CouchBase, dumps |
679 | |
680 | import dmedia |
681 | from dmedia import __version__ |
682 | @@ -46,7 +46,7 @@ |
683 | |
684 | def iter_headers(environ): |
685 | for (key, value) in environ.items(): |
686 | - if key in ('CONTENT_LENGHT', 'CONTENT_TYPE'): |
687 | + if key in ('CONTENT_LENGTH', 'CONTENT_TYPE'): |
688 | yield (key.replace('_', '-').lower(), value) |
689 | elif key.startswith('HTTP_'): |
690 | yield (key[5:].replace('_', '-').lower(), value) |
691 | @@ -58,7 +58,11 @@ |
692 | body = environ['wsgi.input'].read() |
693 | else: |
694 | body = None |
695 | - return (environ['REQUEST_METHOD'], environ['PATH_INFO'], body, headers) |
696 | + path = environ['PATH_INFO'] |
697 | + query = environ['QUERY_STRING'] |
698 | + if query: |
699 | + path = '?'.join([path, query]) |
700 | + return (environ['REQUEST_METHOD'], path, body, headers) |
701 | |
702 | |
703 | def get_slice(environ): |
704 | |
705 | === modified file 'dmedia/tests/test_extractor.py' |
706 | --- dmedia/tests/test_extractor.py 2012-08-04 01:26:03 +0000 |
707 | +++ dmedia/tests/test_extractor.py 2012-11-21 12:55:46 +0000 |
708 | @@ -28,7 +28,7 @@ |
709 | from os import path |
710 | from subprocess import CalledProcessError |
711 | |
712 | -from microfiber import random_id |
713 | +from microfiber import random_id, Attachment |
714 | |
715 | from .base import TempDir, SampleFilesTestCase, MagicLanternTestCase |
716 | |
717 | @@ -612,12 +612,11 @@ |
718 | } |
719 | ) |
720 | |
721 | - |
722 | def test_thumbnail_video(self): |
723 | # Test with sample_mov from 5D Mark II: |
724 | tmp = TempDir() |
725 | t = extractor.thumbnail_video(self.mov, tmp.dir) |
726 | - self.assertIsInstance(t, extractor.Thumbnail) |
727 | + self.assertIsInstance(t, Attachment) |
728 | self.assertEqual(t.content_type, 'image/jpeg') |
729 | self.assertIsInstance(t.data, bytes) |
730 | self.assertGreater(len(t.data), 5000) |
731 | @@ -644,7 +643,7 @@ |
732 | # Test with sample_thm from 5D Mark II: |
733 | tmp = TempDir() |
734 | t = extractor.thumbnail_image(self.thm, tmp.dir) |
735 | - self.assertIsInstance(t, extractor.Thumbnail) |
736 | + self.assertIsInstance(t, Attachment) |
737 | self.assertEqual(t.content_type, 'image/jpeg') |
738 | self.assertIsInstance(t.data, bytes) |
739 | self.assertGreater(len(t.data), 5000) |
740 | @@ -667,7 +666,7 @@ |
741 | def test_create_thumbnail(self): |
742 | # Test with sample_mov from 5D Mark II: |
743 | t = extractor.create_thumbnail(self.mov, 'mov') |
744 | - self.assertIsInstance(t, extractor.Thumbnail) |
745 | + self.assertIsInstance(t, Attachment) |
746 | self.assertEqual(t.content_type, 'image/jpeg') |
747 | self.assertIsInstance(t.data, bytes) |
748 | self.assertGreater(len(t.data), 5000) |
749 | @@ -690,7 +689,7 @@ |
750 | def test_create_thumbnail(self): |
751 | # Test with sample_mov from 5D Mark II: |
752 | t = extractor.create_thumbnail(self.mov, 'mov') |
753 | - self.assertIsInstance(t, extractor.Thumbnail) |
754 | + self.assertIsInstance(t, Attachment) |
755 | self.assertEqual(t.content_type, 'image/jpeg') |
756 | self.assertIsInstance(t.data, bytes) |
757 | self.assertGreater(len(t.data), 5000) |
758 | @@ -710,15 +709,6 @@ |
759 | nope = tmp.join('nope.mov') |
760 | self.assertIsNone(extractor.create_thumbnail(nope, 'mov')) |
761 | |
762 | - def test_to_attachment(self): |
763 | - data = os.urandom(2000) |
764 | - thm = extractor.Thumbnail('image/png', data) |
765 | - d = extractor.to_attachment(thm) |
766 | - self.assertIsInstance(d, dict) |
767 | - self.assertEqual(set(d), set(['content_type', 'data'])) |
768 | - self.assertEqual(d['content_type'], 'image/png') |
769 | - self.assertEqual(d['data'], b64encode(data).decode('utf-8')) |
770 | - |
771 | def test_get_thumbnail_func(self): |
772 | f = extractor.get_thumbnail_func |
773 | self.assertIsNone(f({})) |
774 | @@ -833,6 +823,4 @@ |
775 | extensions, |
776 | set(extractor.NO_EXTRACT) |
777 | ) |
778 | - |
779 | - |
780 | - |
781 | + |
782 | |
783 | === modified file 'dmedia/tests/test_importer.py' |
784 | --- dmedia/tests/test_importer.py 2012-08-06 11:00:34 +0000 |
785 | +++ dmedia/tests/test_importer.py 2012-11-21 12:55:46 +0000 |
786 | @@ -200,6 +200,123 @@ |
787 | (6, 8, 10, 12) |
788 | ) |
789 | |
790 | + def test_merge_stored(self): |
791 | + id1 = random_id() |
792 | + id2 = random_id() |
793 | + id3 = random_id() |
794 | + ts1 = time.time() |
795 | + ts2 = time.time() - 2.5 |
796 | + ts3 = time.time() - 5 |
797 | + new = { |
798 | + id1: { |
799 | + 'copies': 2, |
800 | + 'mtime': ts1, |
801 | + }, |
802 | + id2: { |
803 | + 'copies': 1, |
804 | + 'mtime': ts2, |
805 | + }, |
806 | + } |
807 | + |
808 | + old = {} |
809 | + self.assertIsNone(importer.merge_stored(old, deepcopy(new))) |
810 | + self.assertEqual(old, new) |
811 | + |
812 | + old = { |
813 | + id3: { |
814 | + 'copies': 1, |
815 | + 'mtime': ts3, |
816 | + 'verified': int(ts3 + 100), |
817 | + } |
818 | + } |
819 | + self.assertIsNone(importer.merge_stored(old, deepcopy(new))) |
820 | + self.assertEqual(old, |
821 | + { |
822 | + id1: { |
823 | + 'copies': 2, |
824 | + 'mtime': ts1, |
825 | + }, |
826 | + id2: { |
827 | + 'copies': 1, |
828 | + 'mtime': ts2, |
829 | + }, |
830 | + id3: { |
831 | + 'copies': 1, |
832 | + 'mtime': ts3, |
833 | + 'verified': int(ts3 + 100), |
834 | + } |
835 | + } |
836 | + ) |
837 | + |
838 | + old = { |
839 | + id1: { |
840 | + 'copies': 1, |
841 | + 'mtime': ts1 - 100, |
842 | + 'verified': ts1 - 50, # Should be removed |
843 | + }, |
844 | + id2: { |
845 | + 'copies': 2, |
846 | + 'mtime': ts2 - 200, |
847 | + 'pinned': True, # Should be preserved |
848 | + }, |
849 | + } |
850 | + self.assertIsNone(importer.merge_stored(old, deepcopy(new))) |
851 | + self.assertEqual(old, |
852 | + { |
853 | + id1: { |
854 | + 'copies': 2, |
855 | + 'mtime': ts1, |
856 | + }, |
857 | + id2: { |
858 | + 'copies': 1, |
859 | + 'mtime': ts2, |
860 | + 'pinned': True, |
861 | + }, |
862 | + } |
863 | + ) |
864 | + |
865 | + old = { |
866 | + id1: { |
867 | + 'copies': 1, |
868 | + 'mtime': ts1 - 100, |
869 | + 'pinned': True, # Should be preserved |
870 | + 'verified': ts1 - 50, # Should be removed |
871 | + }, |
872 | + id2: { |
873 | + 'copies': 2, |
874 | + 'mtime': ts2 - 200, |
875 | + 'verified': ts1 - 50, # Should be removed |
876 | + 'pinned': True, # Should be preserved |
877 | + }, |
878 | + id3: { |
879 | + 'copies': 1, |
880 | + 'mtime': ts3, |
881 | + 'verified': int(ts3 + 100), |
882 | + 'pinned': True, |
883 | + }, |
884 | + } |
885 | + self.assertIsNone(importer.merge_stored(old, deepcopy(new))) |
886 | + self.assertEqual(old, |
887 | + { |
888 | + id1: { |
889 | + 'copies': 2, |
890 | + 'mtime': ts1, |
891 | + 'pinned': True, |
892 | + }, |
893 | + id2: { |
894 | + 'copies': 1, |
895 | + 'mtime': ts2, |
896 | + 'pinned': True, |
897 | + }, |
898 | + id3: { |
899 | + 'copies': 1, |
900 | + 'mtime': ts3, |
901 | + 'verified': int(ts3 + 100), |
902 | + 'pinned': True, |
903 | + }, |
904 | + } |
905 | + ) |
906 | + |
907 | |
908 | class ImportCase(CouchCase): |
909 | |
910 | @@ -341,9 +458,9 @@ |
911 | for (file, ch) in result: |
912 | doc = self.db.get(ch.id) |
913 | schema.check_file(doc) |
914 | - self.assertEqual(doc['import']['import_id'], inst.id) |
915 | - self.assertEqual(doc['import']['batch_id'], self.batch_id) |
916 | - self.assertEqual(doc['ctime'], file.mtime) |
917 | + #self.assertEqual(doc['import']['import_id'], inst.id) |
918 | + #self.assertEqual(doc['import']['batch_id'], self.batch_id) |
919 | + #self.assertEqual(doc['ctime'], file.mtime) |
920 | self.assertEqual(doc['bytes'], ch.file_size) |
921 | (content_type, leaf_hashes) = self.db.get_att(ch.id, 'leaf_hashes') |
922 | self.assertEqual(content_type, 'application/octet-stream') |
923 | @@ -813,9 +930,9 @@ |
924 | doc = self.db.get(ch.id) |
925 | schema.check_file(doc) |
926 | self.assertTrue(doc['_rev'].startswith('1-')) |
927 | - self.assertEqual(doc['import']['import_id'], import_id) |
928 | - self.assertEqual(doc['import']['batch_id'], batch_id) |
929 | - self.assertEqual(doc['ctime'], file.mtime) |
930 | + #self.assertEqual(doc['import']['import_id'], import_id) |
931 | + #self.assertEqual(doc['import']['batch_id'], batch_id) |
932 | + #self.assertEqual(doc['ctime'], file.mtime) |
933 | self.assertEqual(doc['bytes'], file.size) |
934 | (content_type, leaf_hashes) = self.db.get_att(ch.id, 'leaf_hashes') |
935 | self.assertEqual(content_type, 'application/octet-stream') |
936 | @@ -907,9 +1024,9 @@ |
937 | doc = self.db.get(ch.id) |
938 | schema.check_file(doc) |
939 | self.assertTrue(doc['_rev'].startswith('2-')) |
940 | - self.assertNotEqual(doc['import']['import_id'], import_id) |
941 | - self.assertNotEqual(doc['import']['batch_id'], batch_id) |
942 | - self.assertEqual(doc['ctime'], file.mtime) |
943 | + #self.assertNotEqual(doc['import']['import_id'], import_id) |
944 | + #self.assertNotEqual(doc['import']['batch_id'], batch_id) |
945 | + #self.assertEqual(doc['ctime'], file.mtime) |
946 | self.assertEqual(doc['bytes'], file.size) |
947 | (content_type, leaf_hashes) = self.db.get_att(ch.id, 'leaf_hashes') |
948 | self.assertEqual(content_type, 'application/octet-stream') |
949 | |
950 | === modified file 'dmedia/tests/test_local.py' |
951 | --- dmedia/tests/test_local.py 2012-07-09 18:28:44 +0000 |
952 | +++ dmedia/tests/test_local.py 2012-11-21 12:55:46 +0000 |
953 | @@ -25,6 +25,7 @@ |
954 | |
955 | from unittest import TestCase |
956 | from random import Random |
957 | +import time |
958 | |
959 | import filestore |
960 | from filestore import FileStore, DIGEST_B32LEN, DIGEST_BYTES |
961 | @@ -259,7 +260,7 @@ |
962 | self.assertEqual(cm.exception.id, ch.id) |
963 | |
964 | # When doc does exist |
965 | - doc = schema.create_file(ch.id, ch.file_size, ch.leaf_hashes, {}) |
966 | + doc = schema.create_file(time.time(), ch, {}) |
967 | inst.db.save(doc) |
968 | self.assertEqual(inst.content_hash(ch.id), unpacked) |
969 | self.assertEqual(inst.content_hash(ch.id, False), ch) |
970 | |
971 | === modified file 'dmedia/tests/test_metastore.py' |
972 | --- dmedia/tests/test_metastore.py 2012-01-28 21:43:12 +0000 |
973 | +++ dmedia/tests/test_metastore.py 2012-11-21 12:55:46 +0000 |
974 | @@ -25,11 +25,17 @@ |
975 | |
976 | from unittest import TestCase |
977 | import time |
978 | +import os |
979 | +from random import SystemRandom |
980 | |
981 | +from filestore import FileStore, DIGEST_BYTES |
982 | from microfiber import random_id |
983 | |
984 | +from dmedia.tests.base import TempDir |
985 | from dmedia import metastore |
986 | |
987 | +random = SystemRandom() |
988 | + |
989 | |
990 | class DummyStat: |
991 | def __init__(self, mtime): |
992 | @@ -255,4 +261,91 @@ |
993 | 'corrupt': {id3: 'baz', fs.id: {'time': ts}}, |
994 | } |
995 | ) |
996 | - |
997 | + |
998 | + def test_relink_iter(self): |
999 | + tmp = TempDir() |
1000 | + fs = FileStore(tmp.dir) |
1001 | + |
1002 | + def create(): |
1003 | + _id = random_id(DIGEST_BYTES) |
1004 | + data = b'N' * random.randint(1, 1776) |
1005 | + open(fs.path(_id), 'wb').write(data) |
1006 | + st = fs.stat(_id) |
1007 | + assert st.size == len(data) |
1008 | + return st |
1009 | + |
1010 | + # Test when empty |
1011 | + self.assertEqual( |
1012 | + list(metastore.relink_iter(fs)), |
1013 | + [] |
1014 | + ) |
1015 | + |
1016 | + # Test with only 1 |
1017 | + items = [create()] |
1018 | + self.assertEqual( |
1019 | + list(metastore.relink_iter(fs)), |
1020 | + [items] |
1021 | + ) |
1022 | + |
1023 | + # Test with 25 |
1024 | + items.extend(create() for i in range(24)) |
1025 | + assert len(items) == 25 |
1026 | + items.sort(key=lambda st: st.id) |
1027 | + self.assertEqual( |
1028 | + list(metastore.relink_iter(fs)), |
1029 | + [items] |
1030 | + ) |
1031 | + |
1032 | + # Test with 26 |
1033 | + items.append(create()) |
1034 | + assert len(items) == 26 |
1035 | + items.sort(key=lambda st: st.id) |
1036 | + self.assertEqual( |
1037 | + list(metastore.relink_iter(fs)), |
1038 | + [ |
1039 | + items[:25], |
1040 | + items[25:], |
1041 | + ] |
1042 | + ) |
1043 | + |
1044 | + # Test with 49 |
1045 | + items.extend(create() for i in range(23)) |
1046 | + assert len(items) == 49 |
1047 | + items.sort(key=lambda st: st.id) |
1048 | + self.assertEqual( |
1049 | + list(metastore.relink_iter(fs)), |
1050 | + [ |
1051 | + items[:25], |
1052 | + items[25:], |
1053 | + ] |
1054 | + ) |
1055 | + |
1056 | + # Test with 100 |
1057 | + items.extend(create() for i in range(51)) |
1058 | + assert len(items) == 100 |
1059 | + items.sort(key=lambda st: st.id) |
1060 | + self.assertEqual( |
1061 | + list(metastore.relink_iter(fs)), |
1062 | + [ |
1063 | + items[0:25], |
1064 | + items[25:50], |
1065 | + items[50:75], |
1066 | + items[75:100], |
1067 | + ] |
1068 | + ) |
1069 | + |
1070 | + # Test with 118 |
1071 | + items.extend(create() for i in range(18)) |
1072 | + assert len(items) == 118 |
1073 | + items.sort(key=lambda st: st.id) |
1074 | + self.assertEqual( |
1075 | + list(metastore.relink_iter(fs)), |
1076 | + [ |
1077 | + items[0:25], |
1078 | + items[25:50], |
1079 | + items[50:75], |
1080 | + items[75:100], |
1081 | + items[100:118], |
1082 | + ] |
1083 | + ) |
1084 | + |
1085 | |
1086 | === modified file 'dmedia/tests/test_schema.py' |
1087 | --- dmedia/tests/test_schema.py 2012-11-14 00:37:47 +0000 |
1088 | +++ dmedia/tests/test_schema.py 2012-11-21 12:55:46 +0000 |
1089 | @@ -30,7 +30,7 @@ |
1090 | from copy import deepcopy |
1091 | import time |
1092 | |
1093 | -from filestore import TYPE_ERROR, DIGEST_BYTES |
1094 | +from filestore import TYPE_ERROR, DIGEST_BYTES, ContentHash |
1095 | from microfiber import random_id |
1096 | |
1097 | from .base import TempDir |
1098 | @@ -331,6 +331,14 @@ |
1099 | str(cm.exception), |
1100 | "doc['stored']['MZZG2ZDSOQVSW2TEMVZG643F']['verified'] must be >= 0; got -1" |
1101 | ) |
1102 | + bad = deepcopy(good) |
1103 | + bad['stored']['MZZG2ZDSOQVSW2TEMVZG643F']['verified'] = 123.0 |
1104 | + with self.assertRaises(TypeError) as cm: |
1105 | + f(bad) |
1106 | + self.assertEqual( |
1107 | + str(cm.exception), |
1108 | + "doc['stored']['MZZG2ZDSOQVSW2TEMVZG643F']['verified']: need a <class 'int'>; got a <class 'float'>: 123.0" |
1109 | + ) |
1110 | |
1111 | # Test with invalid stored "pinned": |
1112 | bad = deepcopy(good) |
1113 | @@ -362,37 +370,6 @@ |
1114 | "doc['corrupt'] cannot be empty; got {}" |
1115 | ) |
1116 | |
1117 | - # ext |
1118 | - copy = deepcopy(good) |
1119 | - copy['ext'] = 'ogv' |
1120 | - self.assertIsNone(f(copy)) |
1121 | - copy['ext'] = 42 |
1122 | - with self.assertRaises(TypeError) as cm: |
1123 | - f(copy) |
1124 | - self.assertEqual( |
1125 | - str(cm.exception), |
1126 | - TYPE_ERROR.format("doc['ext']", str, int, 42) |
1127 | - ) |
1128 | - copy['ext'] = '.mov' |
1129 | - with self.assertRaises(ValueError) as cm: |
1130 | - f(copy) |
1131 | - self.assertEqual( |
1132 | - str(cm.exception), |
1133 | - "doc['ext']: '.mov' does not match '^[a-z0-9]+(\\\\.[a-z0-9]+)?$'" |
1134 | - ) |
1135 | - |
1136 | - # content_type |
1137 | - copy = deepcopy(good) |
1138 | - copy['content_type'] = 'video/quicktime' |
1139 | - self.assertIsNone(f(copy)) |
1140 | - copy['content_type'] = 42 |
1141 | - with self.assertRaises(TypeError) as cm: |
1142 | - f(copy) |
1143 | - self.assertEqual( |
1144 | - str(cm.exception), |
1145 | - TYPE_ERROR.format("doc['content_type']", str, int, 42) |
1146 | - ) |
1147 | - |
1148 | # proxy_of |
1149 | copy = deepcopy(good) |
1150 | copy['origin'] = 'proxy' |
1151 | @@ -571,13 +548,15 @@ |
1152 | ) |
1153 | |
1154 | def test_create_file(self): |
1155 | + timestamp = time.time() |
1156 | _id = random_id(DIGEST_BYTES) |
1157 | + file_size = 31415 |
1158 | leaf_hashes = os.urandom(DIGEST_BYTES) |
1159 | - file_size = 31415 |
1160 | + ch = ContentHash(_id, file_size, leaf_hashes) |
1161 | store_id = random_id() |
1162 | stored = {store_id: {'copies': 2, 'mtime': 1234567890}} |
1163 | |
1164 | - doc = schema.create_file(_id, file_size, leaf_hashes, stored) |
1165 | + doc = schema.create_file(timestamp, ch, stored) |
1166 | schema.check_file(doc) |
1167 | self.assertEqual( |
1168 | set(doc), |
1169 | @@ -603,8 +582,8 @@ |
1170 | } |
1171 | ) |
1172 | self.assertEqual(doc['type'], 'dmedia/file') |
1173 | - self.assertLessEqual(doc['time'], time.time()) |
1174 | - self.assertEqual(doc['atime'], int(doc['time'])) |
1175 | + self.assertEqual(doc['time'], timestamp) |
1176 | + self.assertEqual(doc['atime'], int(timestamp)) |
1177 | self.assertEqual(doc['bytes'], file_size) |
1178 | self.assertEqual(doc['origin'], 'user') |
1179 | self.assertIs(doc['stored'], stored) |
1180 | @@ -616,9 +595,7 @@ |
1181 | self.assertEqual(s[store_id]['copies'], 2) |
1182 | self.assertEqual(s[store_id]['mtime'], 1234567890) |
1183 | |
1184 | - doc = schema.create_file(_id, file_size, leaf_hashes, stored, |
1185 | - origin='proxy' |
1186 | - ) |
1187 | + doc = schema.create_file(timestamp, ch, stored, origin='proxy') |
1188 | doc['proxy_of'] = random_id(DIGEST_BYTES) |
1189 | schema.check_file(doc) |
1190 | self.assertEqual(doc['origin'], 'proxy') |
1191 | |
1192 | === modified file 'dmedia/tests/test_server.py' |
1193 | --- dmedia/tests/test_server.py 2012-10-24 23:23:30 +0000 |
1194 | +++ dmedia/tests/test_server.py 2012-11-21 12:55:46 +0000 |
1195 | @@ -271,6 +271,7 @@ |
1196 | environ = { |
1197 | 'REQUEST_METHOD': 'GET', |
1198 | 'PATH_INFO': '/_config/foo', |
1199 | + 'QUERY_STRING': '', |
1200 | 'wsgi.input': Input(None, {'REQUEST_METHOD': 'GET'}), |
1201 | } |
1202 | with self.assertRaises(WSGIError) as cm: |
1203 | @@ -603,7 +604,7 @@ |
1204 | docs = [{'_id': random_id()} for i in range(100)] |
1205 | for doc in docs: |
1206 | doc['_rev'] = s1.post(doc, name1)['rev'] |
1207 | - time.sleep(0.5) |
1208 | + time.sleep(1) |
1209 | for doc in docs: |
1210 | self.assertEqual(s2.get(name2, doc['_id']), doc) |
1211 | |
1212 | |
1213 | === modified file 'dmedia/tests/test_transfers.py' |
1214 | --- dmedia/tests/test_transfers.py 2011-10-17 22:50:22 +0000 |
1215 | +++ dmedia/tests/test_transfers.py 2012-11-21 12:55:46 +0000 |
1216 | @@ -60,10 +60,10 @@ |
1217 | assert chunk is not None |
1218 | assert self._chunk is None |
1219 | self._chunk = chunk |
1220 | - |
1221 | + |
1222 | |
1223 | def create_file_doc(ch, store_id): |
1224 | - return schema.create_file(ch.id, ch.file_size, ch.leaf_hashes, |
1225 | + return schema.create_file(time.time(), ch, |
1226 | {store_id: {'mtime': 123456789, 'copies': 1}} |
1227 | ) |
1228 | |
1229 | |
1230 | === modified file 'dmedia/tests/test_verification.py' |
1231 | --- dmedia/tests/test_verification.py 2012-08-07 00:14:58 +0000 |
1232 | +++ dmedia/tests/test_verification.py 2012-11-21 12:55:46 +0000 |
1233 | @@ -27,7 +27,7 @@ |
1234 | import time |
1235 | from os import path |
1236 | |
1237 | -from filestore import FileStore, DIGEST_BYTES |
1238 | +from filestore import FileStore, DIGEST_BYTES, ContentHash |
1239 | from microfiber import random_id |
1240 | |
1241 | from .couch import CouchCase |
1242 | @@ -68,7 +68,7 @@ |
1243 | stored = { |
1244 | fs.id: {'mtime': fs.stat(ch.id).mtime, 'copies': fs.copies} |
1245 | } |
1246 | - doc = create_file(ch.id, ch.file_size, ch.leaf_hashes, stored) |
1247 | + doc = create_file(time.time(), ch, stored) |
1248 | self.db.save(doc) |
1249 | good.append(ch.id) |
1250 | |
1251 | @@ -81,7 +81,8 @@ |
1252 | stored = { |
1253 | fs.id: {'mtime': fs.stat(_id).mtime, 'copies': fs.copies} |
1254 | } |
1255 | - doc = create_file(_id, ch.file_size, ch.leaf_hashes, stored) |
1256 | + ch = ContentHash(_id, ch.file_size, ch.leaf_hashes) |
1257 | + doc = create_file(time.time(), ch, stored) |
1258 | self.db.save(doc) |
1259 | bad.append(_id) |
1260 | |
1261 | @@ -94,7 +95,7 @@ |
1262 | stored = { |
1263 | fs.id: {'mtime': path.getmtime(fs.path(ch.id)), 'copies': fs.copies} |
1264 | } |
1265 | - doc = create_file(ch.id, ch.file_size, ch.leaf_hashes, stored) |
1266 | + doc = create_file(time.time(), ch, stored) |
1267 | self.db.save(doc) |
1268 | empty.append(ch.id) |
1269 | |
1270 | @@ -105,7 +106,7 @@ |
1271 | stored = { |
1272 | fs.id: {'mtime': path.getmtime(file.name), 'copies': fs.copies} |
1273 | } |
1274 | - doc = create_file(ch.id, ch.file_size, ch.leaf_hashes, stored) |
1275 | + doc = create_file(time.time(), ch, stored) |
1276 | self.db.save(doc) |
1277 | missing.append(ch.id) |
1278 | |
1279 | @@ -123,9 +124,9 @@ |
1280 | set(['copies', 'mtime', 'verified']) |
1281 | ) |
1282 | verified = doc['stored'][fs.id]['verified'] |
1283 | - self.assertIsInstance(verified, (int, float)) |
1284 | - self.assertLessEqual(start, verified) |
1285 | - self.assertLessEqual(verified, end) |
1286 | + self.assertIsInstance(verified, int) |
1287 | + self.assertLessEqual(int(start), verified) |
1288 | + self.assertLessEqual(verified, int(end)) |
1289 | self.assertNotIn('corrupt', doc) |
1290 | for _id in bad: |
1291 | doc = self.db.get(_id) |
1292 | |
1293 | === modified file 'dmedia/verification.py' |
1294 | --- dmedia/verification.py 2012-08-07 00:14:58 +0000 |
1295 | +++ dmedia/verification.py 2012-11-21 12:55:46 +0000 |
1296 | @@ -84,7 +84,7 @@ |
1297 | doc['stored'][fs.id] = { |
1298 | 'copies': fs.copies, |
1299 | 'mtime': fs.stat(_id).mtime, |
1300 | - 'verified': time.time(), |
1301 | + 'verified': int(time.time()), |
1302 | } |
1303 | except CorruptFile: |
1304 | mark_corrupt(doc, fs) |
1305 | |
1306 | === modified file 'dmedia/views.py' |
1307 | --- dmedia/views.py 2012-11-02 06:13:47 +0000 |
1308 | +++ dmedia/views.py 2012-11-21 12:55:46 +0000 |
1309 | @@ -500,7 +500,6 @@ |
1310 | file_design, |
1311 | project_design, |
1312 | job_design, |
1313 | - user_design, |
1314 | ) |
1315 | |
1316 |
Approved!