Merge lp:~jderose/dmedia/empty-files into lp:dmedia
Status: Merged
Approved by: Jason Gerard DeRose
Approved revision: 190
Merged at revision: 163
Proposed branch: lp:~jderose/dmedia/empty-files
Merge into: lp:dmedia
Diff against target: 1424 lines (+597/-330), 8 files modified:
  dmedia/filestore.py (+4/-1)
  dmedia/importer.py (+188/-97)
  dmedia/metastore.py (+0/-13)
  dmedia/tests/test_filestore.py (+1/-0)
  dmedia/tests/test_importer.py (+399/-181)
  dmedia/tests/test_metastore.py (+0/-36)
  dmedia/util.py (+4/-1)
  dmedia/workers.py (+1/-1)
To merge this branch: bzr merge lp:~jderose/dmedia/empty-files
Related bugs: (none)
Reviewer: Jason Gerard DeRose (review type: Approve)
Review via email: mp+50431@code.launchpad.net
Description of the change
This started out as just a change to Importer so that rather than importing empty files into the FileStore, it would simply note them in the "dmedia/import" record. But as I dug in a bit, I decided there were some important enhancements needed in Importer and ImportManager, so I just bit the bullet. Changes include:
* Highly detailed logging in the "dmedia/import" record: in addition to tracking empty files, it also tracks the files considered for import, files imported, files skipped because they're duplicates, and files for which an error occurred when attempting the import (see the first sketch after this list).
* Entirely axed the use of quick_id: too error prone, and not as super-duty as dmedia needs to be.
* Well-defined behavior when there is an inconsistency between the FileStore and the database: new files are always copied into the FileStore even if a corresponding document exists; when a document does not exist, one is always created, even if the file is a duplicate from the FileStore's perspective.
* The database is now compacted upon finishing a batch import: even if there are no old revisions, compacting at this point can dramatically reduce database size because the btree can be optimally laid out (the random IDs dmedia uses mean the btree doesn't grow in a particularly space-efficient way).
* Now that we're only targeting python-couchdb >= 0.8, I'm taking advantage of Database.save() so I can get rid of some hacky crap and make fewer CouchDB requests (bad python-couchdb, your wrapper-itis obscures and thwarts the elegant CouchDB REST API). A usage sketch of save() and compact() follows below.
* A number of improvements to what is logged in service.log, including putting the process ID in the log format to make multiprocessing issues easier to debug.
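
To make the new bookkeeping concrete, here's a minimal sketch of the record skeleton and per-file accounting. Names like new_import_record and record_outcome are illustrative only; the shapes match create_import() and Importer.import_file() in the diff below:

    import time

    BUCKETS = ('imported', 'skipped', 'empty', 'error')

    def new_import_record(_id, base):
        # Skeleton of the enriched 'dmedia/import' record built by
        # create_import() in this branch (batch_id/machine_id omitted):
        return {
            '_id': _id,
            'type': 'dmedia/import',
            'time': time.time(),
            'base': base,
            'log': dict((b, []) for b in BUCKETS),
            'stats': dict((b, {'count': 0, 'bytes': 0}) for b in BUCKETS),
        }

    def record_outcome(doc, action, entry, size):
        # Same per-file accounting as Importer.import_file(): entry is the
        # source path for 'empty' files, otherwise a small dict; every file
        # lands in exactly one bucket.
        doc['log'][action].append(entry)
        doc['stats'][action]['count'] += 1
        doc['stats'][action]['bytes'] += size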
You know what they say, "Refactor early, refactor often." Or maybe only I say that, but either way, it's awesome.
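
Since a couple of the points above hinge on python-couchdb >= 0.8, here's a quick usage sketch of the two calls involved (the database name is illustrative): Database.save() tracks the '_rev' for you, and Database.compact() is what _finish_batch() now calls:

    import time
    import couchdb

    server = couchdb.Server()    # http://localhost:5984/ by default
    db = server['dmedia']        # illustrative name; assumes the db exists

    doc = {'type': 'dmedia/import', 'time': time.time()}
    (_id, _rev) = db.save(doc)   # one request; doc gains '_id' and '_rev'
    doc['time_end'] = time.time()
    db.save(doc)                 # later saves reuse the tracked '_rev'

    db.compact()                 # kick off compaction, as _finish_batch() does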
- 190. By Jason Gerard DeRose: Small tweak to make log more readable
Preview Diff
1 | === modified file 'dmedia/filestore.py' |
2 | --- dmedia/filestore.py 2011-02-17 00:12:20 +0000 |
3 | +++ dmedia/filestore.py 2011-02-19 04:44:09 +0000 |
4 | @@ -1037,5 +1037,8 @@ |
5 | try: |
6 | self.tmp_move(tmp_fp, chash, ext) |
7 | except DuplicateFile as e: |
8 | - raise DuplicateFile(src=src_fp.name, dst=e.dst, chash=e.chash) |
9 | + log.warning('File %r is duplicate of %r', src_fp.name, e.dst) |
10 | + raise DuplicateFile(src=src_fp.name, dst=e.dst, tmp=e.src, |
11 | + chash=chash, leaves=h.leaves |
12 | + ) |
13 | return (chash, h.leaves) |
14 | |
15 | === modified file 'dmedia/importer.py' |
16 | --- dmedia/importer.py 2011-02-15 07:53:55 +0000 |
17 | +++ dmedia/importer.py 2011-02-19 04:44:09 +0000 |
18 | @@ -29,17 +29,21 @@ |
19 | import mimetypes |
20 | import time |
21 | from base64 import b64encode |
22 | +import logging |
23 | + |
24 | import couchdb |
25 | + |
26 | from .util import random_id |
27 | -from .workers import Worker, Manager, register, isregistered |
28 | +from .errors import DuplicateFile |
29 | +from .workers import Worker, Manager, register, isregistered, exception_name |
30 | from .filestore import FileStore, quick_id, safe_open, safe_ext, pack_leaves |
31 | from .metastore import MetaStore |
32 | from .extractor import merge_metadata |
33 | |
34 | + |
35 | mimetypes.init() |
36 | - |
37 | - |
38 | DOTDIR = '.dmedia' |
39 | +log = logging.getLogger() |
40 | |
41 | |
42 | def normalize_ext(name): |
43 | @@ -118,7 +122,8 @@ |
44 | error to be interpreted as there being no files on the card! |
45 | """ |
46 | if path.isfile(base): |
47 | - yield base |
48 | + s = os.stat(base) |
49 | + yield (base, s.st_size, s.st_mtime) |
50 | return |
51 | names = sorted(os.listdir(base)) |
52 | dirs = [] |
53 | @@ -127,12 +132,13 @@ |
54 | if path.islink(fullname): |
55 | continue |
56 | if path.isfile(fullname): |
57 | - yield fullname |
58 | + s = os.stat(fullname) |
59 | + yield (fullname, s.st_size, s.st_mtime) |
60 | elif path.isdir(fullname): |
61 | dirs.append(fullname) |
62 | for fullname in dirs: |
63 | - for f in files_iter(fullname): |
64 | - yield f |
65 | + for tup in files_iter(fullname): |
66 | + yield tup |
67 | |
68 | |
69 | def create_batch(machine_id=None): |
70 | @@ -145,8 +151,14 @@ |
71 | 'time': time.time(), |
72 | 'machine_id': machine_id, |
73 | 'imports': [], |
74 | - 'imported': {'count': 0, 'bytes': 0}, |
75 | - 'skipped': {'count': 0, 'bytes': 0}, |
76 | + 'errors': [], |
77 | + 'stats': { |
78 | + 'considered': {'count': 0, 'bytes': 0}, |
79 | + 'imported': {'count': 0, 'bytes': 0}, |
80 | + 'skipped': {'count': 0, 'bytes': 0}, |
81 | + 'empty': {'count': 0, 'bytes': 0}, |
82 | + 'error': {'count': 0, 'bytes': 0}, |
83 | + } |
84 | } |
85 | |
86 | |
87 | @@ -157,10 +169,22 @@ |
88 | return { |
89 | '_id': random_id(), |
90 | 'type': 'dmedia/import', |
91 | + 'time': time.time(), |
92 | 'batch_id': batch_id, |
93 | 'machine_id': machine_id, |
94 | 'base': base, |
95 | - 'time': time.time(), |
96 | + 'log': { |
97 | + 'imported': [], |
98 | + 'skipped': [], |
99 | + 'empty': [], |
100 | + 'error': [], |
101 | + }, |
102 | + 'stats': { |
103 | + 'imported': {'count': 0, 'bytes': 0}, |
104 | + 'skipped': {'count': 0, 'bytes': 0}, |
105 | + 'empty': {'count': 0, 'bytes': 0}, |
106 | + 'error': {'count': 0, 'bytes': 0}, |
107 | + } |
108 | } |
109 | |
110 | |
111 | @@ -181,59 +205,88 @@ |
112 | except couchdb.ResourceConflict: |
113 | pass |
114 | |
115 | - self.__stats = { |
116 | - 'imported': { |
117 | - 'count': 0, |
118 | - 'bytes': 0, |
119 | - }, |
120 | - 'skipped': { |
121 | - 'count': 0, |
122 | - 'bytes': 0, |
123 | - }, |
124 | - } |
125 | - self.__files = None |
126 | - self.__imported = [] |
127 | - self._import = None |
128 | - self._import_id = None |
129 | + self.filetuples = None |
130 | + self._processed = [] |
131 | + self.doc = None |
132 | + self._id = None |
133 | + |
134 | + def save(self): |
135 | + """ |
136 | + Save current 'dmedia/import' record to CouchDB. |
137 | + """ |
138 | + self.db.save(self.doc) |
139 | |
140 | def start(self): |
141 | """ |
142 | - Create the initial import record, return that record's ID. |
143 | + Create the initial 'dmedia/import' record, return that record's ID. |
144 | """ |
145 | - doc = create_import(self.base, |
146 | + assert self._id is None |
147 | + self.doc = create_import(self.base, |
148 | batch_id=self.batch_id, |
149 | machine_id=self.metastore.machine_id, |
150 | ) |
151 | - self._import_id = doc['_id'] |
152 | - assert self.metastore.db.create(doc) == self._import_id |
153 | - self._import = self.metastore.db[self._import_id] |
154 | - return self._import_id |
155 | - |
156 | - def get_stats(self): |
157 | - return dict( |
158 | - (k, dict(v)) for (k, v) in self.__stats.iteritems() |
159 | - ) |
160 | + self._id = self.doc['_id'] |
161 | + self.save() |
162 | + return self._id |
163 | |
164 | def scanfiles(self): |
165 | - if self.__files is None: |
166 | - self.__files = tuple(files_iter(self.base)) |
167 | - return self.__files |
168 | - |
169 | - def __import_file(self, src): |
170 | + """ |
171 | + Build list of files that will be considered for import. |
172 | + |
173 | + After this method has been called, the ``Importer.filetuples`` attribute |
174 | + will contain ``(filename,size,mtime)`` tuples for all files being |
175 | + considered. This information is saved into the dmedia/import record to |
176 | + provide a rich audit trail and aid in debugging. |
177 | + """ |
178 | + assert self.filetuples is None |
179 | + self.filetuples = tuple(files_iter(self.base)) |
180 | + self.doc['log']['considered'] = [ |
181 | + {'src': src, 'bytes': size, 'mtime': mtime} |
182 | + for (src, size, mtime) in self.filetuples |
183 | + ] |
184 | + total_bytes = sum(size for (src, size, mtime) in self.filetuples) |
185 | + self.doc['stats']['considered'] = { |
186 | + 'count': len(self.filetuples), 'bytes': total_bytes |
187 | + } |
188 | + self.save() |
189 | + return self.filetuples |
190 | + |
191 | + def _import_file(self, src): |
192 | + """ |
193 | + Attempt to import *src* into dmedia library. |
194 | + """ |
195 | fp = safe_open(src, 'rb') |
196 | - quickid = quick_id(fp) |
197 | - ids = list(self.metastore.by_quickid(quickid)) |
198 | - if ids: |
199 | - # FIXME: Even if this is a duplicate, we should check if the file |
200 | - # is stored on this machine, and if not copy into the FileStore. |
201 | - doc = self.metastore.db[ids[0]] |
202 | - return ('skipped', doc) |
203 | - basename = path.basename(src) |
204 | - (root, ext) = normalize_ext(basename) |
205 | - # FIXME: We need to handle the (rare) case when a DuplicateFile |
206 | - # exception is raised by FileStore.import_file() |
207 | - (chash, leaves) = self.filestore.import_file(fp, ext) |
208 | stat = os.fstat(fp.fileno()) |
209 | + if stat.st_size == 0: |
210 | + log.warning('File size is zero: %r', src) |
211 | + return ('empty', None) |
212 | + |
213 | + name = path.basename(src) |
214 | + (root, ext) = normalize_ext(name) |
215 | + try: |
216 | + (chash, leaves) = self.filestore.import_file(fp, ext) |
217 | + action = 'imported' |
218 | + except DuplicateFile as e: |
219 | + chash = e.chash |
220 | + leaves = e.leaves |
221 | + action = 'skipped' |
222 | + assert e.tmp.startswith(self.filestore.join('imports')) |
223 | + # FIXME: We should really probably move this into duplicates/ or |
224 | + # something and not delete till we verify integrity of what is |
225 | + # already in the filestore. |
226 | + os.remove(e.tmp) |
227 | + |
228 | + try: |
229 | + doc = self.db[chash] |
230 | + if self.filestore._id not in doc['stored']: |
231 | + doc['stored'][self.filestore._id] = { |
232 | + 'copies': 1, |
233 | + 'time': time.time(), |
234 | + } |
235 | + self.db.save(doc) |
236 | + return (action, doc) |
237 | + except couchdb.ResourceNotFound as e: |
238 | + pass |
239 | |
240 | ts = time.time() |
241 | doc = { |
242 | @@ -256,42 +309,66 @@ |
243 | }, |
244 | }, |
245 | |
246 | - 'qid': quickid, |
247 | - 'import_id': self._import_id, |
248 | + 'import_id': self._id, |
249 | 'mtime': stat.st_mtime, |
250 | - 'basename': basename, |
251 | - 'dirname': path.relpath(path.dirname(src), self.base), |
252 | + 'name': name, |
253 | + 'dir': path.relpath(path.dirname(src), self.base), |
254 | } |
255 | if ext: |
256 | doc['content_type'] = mimetypes.types_map.get('.' + ext) |
257 | if self.extract: |
258 | merge_metadata(src, doc) |
259 | - (_id, _rev) = self.metastore.db.save(doc) |
260 | + (_id, _rev) = self.db.save(doc) |
261 | assert _id == chash |
262 | - return ('imported', doc) |
263 | - |
264 | - def import_file(self, src): |
265 | - (action, doc) = self.__import_file(src) |
266 | - self.__imported.append(src) |
267 | - self.__stats[action]['count'] += 1 |
268 | - self.__stats[action]['bytes'] += doc['bytes'] |
269 | return (action, doc) |
270 | |
271 | + def import_file(self, src, size): |
272 | + """ |
273 | + Wraps `Importer._import_file()` with error handling and logging. |
274 | + """ |
275 | + self._processed.append(src) |
276 | + try: |
277 | + (action, doc) = self._import_file(src) |
278 | + if action == 'empty': |
279 | + entry = src |
280 | + else: |
281 | + entry = { |
282 | + 'src': src, |
283 | + 'id': doc['_id'], |
284 | + } |
285 | + except Exception as e: |
286 | + log.exception('Error importing %r', src) |
287 | + action = 'error' |
288 | + entry = { |
289 | + 'src': src, |
290 | + 'name': exception_name(e), |
291 | + 'msg': str(e), |
292 | + } |
293 | + self.doc['log'][action].append(entry) |
294 | + self.doc['stats'][action]['count'] += 1 |
295 | + self.doc['stats'][action]['bytes'] += size |
296 | + if action == 'error': |
297 | + self.save() |
298 | + return (action, entry) |
299 | + |
300 | def import_all_iter(self): |
301 | - for src in self.scanfiles(): |
302 | - (action, doc) = self.import_file(src) |
303 | - yield (src, action, doc) |
304 | + for (src, size, mtime) in self.filetuples: |
305 | + (action, entry) = self.import_file(src, size) |
306 | + yield (src, action) |
307 | |
308 | def finalize(self): |
309 | - files = self.scanfiles() |
310 | - assert len(files) == len(self.__imported) |
311 | - assert set(files) == set(self.__imported) |
312 | - s = self.get_stats() |
313 | - self._import.update(s) |
314 | - self._import['time_end'] = time.time() |
315 | - self.db[self._import_id] = self._import |
316 | - assert s['imported']['count'] + s['skipped']['count'] == len(files) |
317 | - return s |
318 | + """ |
319 | + Finalize import and save final import record to CouchDB. |
320 | + |
321 | + The method will add the ``"time_end"`` key into the import record and |
322 | + save it to CouchDB. There will likely also be changes in the |
323 | + ``"log"`` and ``"stats"`` keys, which will likewise be saved to CouchDB. |
324 | + """ |
325 | + assert len(self.filetuples) == len(self._processed) |
326 | + assert list(t[0] for t in self.filetuples) == self._processed |
327 | + self.doc['time_end'] = time.time() |
328 | + self.save() |
329 | + return self.doc['stats'] |
330 | |
331 | |
332 | class ImportWorker(Worker): |
333 | @@ -307,12 +384,11 @@ |
334 | self.emit('count', import_id, total) |
335 | |
336 | c = 1 |
337 | - for (src, action, doc) in adapter.import_all_iter(): |
338 | + for (src, action) in adapter.import_all_iter(): |
339 | self.emit('progress', import_id, c, total, |
340 | dict( |
341 | action=action, |
342 | src=src, |
343 | - _id=doc['_id'], |
344 | ) |
345 | ) |
346 | c += 1 |
347 | @@ -332,6 +408,8 @@ |
348 | |
349 | def accumulate_stats(accum, stats): |
350 | for (key, d) in stats.items(): |
351 | + if key not in accum: |
352 | + accum[key] = {'count': 0, 'bytes': 0} |
353 | for (k, v) in d.items(): |
354 | accum[key][k] += v |
355 | |
356 | @@ -342,33 +420,46 @@ |
357 | self._dbname = dbname |
358 | self.metastore = MetaStore(dbname=dbname) |
359 | self.db = self.metastore.db |
360 | - self._batch = None |
361 | + self.doc = None |
362 | self._total = 0 |
363 | self._completed = 0 |
364 | if not isregistered(ImportWorker): |
365 | register(ImportWorker) |
366 | |
367 | - def _sync(self, doc): |
368 | - _id = doc['_id'] |
369 | - self.db[_id] = doc |
370 | - return self.db[_id] |
371 | + def save(self): |
372 | + """ |
373 | + Save current 'dmedia/batch' record to CouchDB. |
374 | + """ |
375 | + self.db.save(self.doc) |
376 | |
377 | def _start_batch(self): |
378 | - assert self._batch is None |
379 | + assert self.doc is None |
380 | assert self._workers == {} |
381 | self._total = 0 |
382 | self._completed = 0 |
383 | - self._batch = self._sync(create_batch(self.metastore.machine_id)) |
384 | - self.emit('BatchStarted', self._batch['_id']) |
385 | + self.doc = create_batch(self.metastore.machine_id) |
386 | + self.save() |
387 | + self.emit('BatchStarted', self.doc['_id']) |
388 | |
389 | def _finish_batch(self): |
390 | assert self._workers == {} |
391 | - self._batch['time_end'] = time.time() |
392 | - self._batch = self._sync(self._batch) |
393 | - self.emit('BatchFinished', self._batch['_id'], |
394 | - to_dbus_stats(self._batch) |
395 | - ) |
396 | - self._batch = None |
397 | + self.doc['time_end'] = time.time() |
398 | + self.save() |
399 | + self.emit('BatchFinished', self.doc['_id'], |
400 | + to_dbus_stats(self.doc['stats']) |
401 | + ) |
402 | + self.doc = None |
403 | + log.info('Batch complete, compacting database...') |
404 | + self.db.compact() |
405 | + |
406 | + def on_error(self, key, exception, message): |
407 | + super(ImportManager, self).on_error(key, exception, message) |
408 | + if self.doc is None: |
409 | + return |
410 | + self.doc['errors'].append( |
411 | + {'key': key, 'name': exception, 'msg': message} |
412 | + ) |
413 | + self.save() |
414 | |
415 | def on_terminate(self, key): |
416 | super(ImportManager, self).on_terminate(key) |
417 | @@ -376,8 +467,8 @@ |
418 | self._finish_batch() |
419 | |
420 | def on_started(self, key, import_id): |
421 | - self._batch['imports'].append(import_id) |
422 | - self._batch = self._sync(self._batch) |
423 | + self.doc['imports'].append(import_id) |
424 | + self.save() |
425 | self.emit('ImportStarted', key, import_id) |
426 | |
427 | def on_count(self, key, import_id, total): |
428 | @@ -389,8 +480,8 @@ |
429 | self.emit('ImportProgress', key, import_id, completed, total, info) |
430 | |
431 | def on_finished(self, key, import_id, stats): |
432 | - accumulate_stats(self._batch, stats) |
433 | - self._batch = self._sync(self._batch) |
434 | + accumulate_stats(self.doc['stats'], stats) |
435 | + self.save() |
436 | self.emit('ImportFinished', key, import_id, to_dbus_stats(stats)) |
437 | |
438 | def get_batch_progress(self): |
439 | @@ -404,7 +495,7 @@ |
440 | if len(self._workers) == 0: |
441 | self._start_batch() |
442 | return self.do('ImportWorker', base, |
443 | - self._batch['_id'], base, extract, self._dbname |
444 | + self.doc['_id'], base, extract, self._dbname |
445 | ) |
446 | |
447 | def list_imports(self): |
448 | |
449 | === modified file 'dmedia/metastore.py' |
450 | --- dmedia/metastore.py 2011-02-07 09:53:28 +0000 |
451 | +++ dmedia/metastore.py 2011-02-19 04:44:09 +0000 |
452 | @@ -109,14 +109,6 @@ |
453 | } |
454 | """ |
455 | |
456 | -file_qid = """ |
457 | -function(doc) { |
458 | - if (doc.type == 'dmedia/file' && doc.qid) { |
459 | - emit(doc.qid, null); |
460 | - } |
461 | -} |
462 | -""" |
463 | - |
464 | file_import_id = """ |
465 | function(doc) { |
466 | if (doc.type == 'dmedia/file' && doc.import_id) { |
467 | @@ -181,7 +173,6 @@ |
468 | )), |
469 | |
470 | ('file', ( |
471 | - ('qid', file_qid, None), |
472 | ('import_id', file_import_id, None), |
473 | ('bytes', file_bytes, _sum), |
474 | ('ext', file_ext, _count), |
475 | @@ -257,10 +248,6 @@ |
476 | (_id, doc) = build_design_doc(name, views) |
477 | self.update(doc) |
478 | |
479 | - def by_quickid(self, qid): |
480 | - for row in self.db.view('_design/file/_view/qid', key=qid): |
481 | - yield row.id |
482 | - |
483 | def total_bytes(self): |
484 | for row in self.db.view('_design/file/_view/bytes'): |
485 | return row.value |
486 | |
487 | === modified file 'dmedia/tests/test_filestore.py' |
488 | --- dmedia/tests/test_filestore.py 2011-02-15 00:54:58 +0000 |
489 | +++ dmedia/tests/test_filestore.py 2011-02-19 04:44:09 +0000 |
490 | @@ -1188,3 +1188,4 @@ |
491 | self.assertEqual(e.chash, mov_hash) |
492 | self.assertEqual(e.src, src) |
493 | self.assertEqual(e.dst, dst) |
494 | + self.assertTrue(e.tmp.startswith(base + '/imports/')) |
495 | |
496 | === modified file 'dmedia/tests/test_importer.py' |
497 | --- dmedia/tests/test_importer.py 2011-02-07 09:46:16 +0000 |
498 | +++ dmedia/tests/test_importer.py 2011-02-19 04:44:09 +0000 |
499 | @@ -105,13 +105,14 @@ |
500 | f = importer.files_iter |
501 | tmp = TempDir() |
502 | files = [] |
503 | - for args in relpaths: |
504 | - p = tmp.touch('subdir', *args) |
505 | - files.append(p) |
506 | + for (i, args) in enumerate(relpaths): |
507 | + content = 'a' * (2 ** i) |
508 | + p = tmp.write(content, 'subdir', *args) |
509 | + files.append((p, len(content), path.getmtime(p))) |
510 | |
511 | # Test when base is a file: |
512 | - for p in files: |
513 | - self.assertEqual(list(f(p)), [p]) |
514 | + for (p, s, t) in files: |
515 | + self.assertEqual(list(f(p)), [(p, s, t)]) |
516 | |
517 | # Test importing from tmp.path: |
518 | self.assertEqual(list(f(tmp.path)), files) |
519 | @@ -143,9 +144,9 @@ |
520 | 'type', |
521 | 'time', |
522 | 'imports', |
523 | - 'imported', |
524 | - 'skipped', |
525 | + 'errors', |
526 | 'machine_id', |
527 | + 'stats', |
528 | ]) |
529 | ) |
530 | _id = doc['_id'] |
531 | @@ -155,9 +156,18 @@ |
532 | self.assertTrue(isinstance(doc['time'], (int, float))) |
533 | self.assertTrue(doc['time'] <= time.time()) |
534 | self.assertEqual(doc['imports'], []) |
535 | - self.assertEqual(doc['imported'], {'count': 0, 'bytes': 0}) |
536 | - self.assertEqual(doc['skipped'], {'count': 0, 'bytes': 0}) |
537 | + self.assertEqual(doc['errors'], []) |
538 | self.assertEqual(doc['machine_id'], machine_id) |
539 | + self.assertEqual( |
540 | + doc['stats'], |
541 | + { |
542 | + 'considered': {'count': 0, 'bytes': 0}, |
543 | + 'imported': {'count': 0, 'bytes': 0}, |
544 | + 'skipped': {'count': 0, 'bytes': 0}, |
545 | + 'empty': {'count': 0, 'bytes': 0}, |
546 | + 'error': {'count': 0, 'bytes': 0}, |
547 | + } |
548 | + ) |
549 | |
550 | def test_create_import(self): |
551 | f = importer.create_import |
552 | @@ -173,6 +183,8 @@ |
553 | 'base', |
554 | 'batch_id', |
555 | 'machine_id', |
556 | + 'log', |
557 | + 'stats', |
558 | ]) |
559 | |
560 | doc = f(base, batch_id=batch_id, machine_id=machine_id) |
561 | @@ -196,6 +208,24 @@ |
562 | self.assertEqual(set(doc), keys) |
563 | self.assertEqual(doc['batch_id'], None) |
564 | self.assertEqual(doc['machine_id'], None) |
565 | + self.assertEqual( |
566 | + doc['log'], |
567 | + { |
568 | + 'imported': [], |
569 | + 'skipped': [], |
570 | + 'empty': [], |
571 | + 'error': [], |
572 | + } |
573 | + ) |
574 | + self.assertEqual( |
575 | + doc['stats'], |
576 | + { |
577 | + 'imported': {'count': 0, 'bytes': 0}, |
578 | + 'skipped': {'count': 0, 'bytes': 0}, |
579 | + 'empty': {'count': 0, 'bytes': 0}, |
580 | + 'error': {'count': 0, 'bytes': 0}, |
581 | + } |
582 | + ) |
583 | |
584 | def test_to_dbus_stats(self): |
585 | f = importer.to_dbus_stats |
586 | @@ -262,13 +292,13 @@ |
587 | def test_start(self): |
588 | tmp = TempDir() |
589 | inst = self.new(tmp.path) |
590 | - self.assertTrue(inst._import is None) |
591 | + self.assertTrue(inst.doc is None) |
592 | _id = inst.start() |
593 | self.assertEqual(len(_id), 24) |
594 | store = MetaStore(dbname=self.dbname) |
595 | - self.assertEqual(inst._import, store.db[_id]) |
596 | + self.assertEqual(inst.doc, store.db[_id]) |
597 | self.assertEqual( |
598 | - set(inst._import), |
599 | + set(inst.doc), |
600 | set([ |
601 | '_id', |
602 | '_rev', |
603 | @@ -277,60 +307,76 @@ |
604 | 'base', |
605 | 'batch_id', |
606 | 'machine_id', |
607 | + 'log', |
608 | + 'stats', |
609 | ]) |
610 | ) |
611 | - self.assertEqual(inst._import['batch_id'], self.batch_id) |
612 | + self.assertEqual(inst.doc['batch_id'], self.batch_id) |
613 | self.assertEqual( |
614 | - inst._import['machine_id'], |
615 | + inst.doc['machine_id'], |
616 | inst.metastore.machine_id |
617 | ) |
618 | - self.assertEqual(inst._import['base'], tmp.path) |
619 | - |
620 | - def test_get_stats(self): |
621 | - tmp = TempDir() |
622 | - inst = self.new(tmp.path) |
623 | - one = inst.get_stats() |
624 | - self.assertEqual(one, |
625 | - { |
626 | - 'imported': { |
627 | - 'count': 0, |
628 | - 'bytes': 0, |
629 | - }, |
630 | - 'skipped': { |
631 | - 'count': 0, |
632 | - 'bytes': 0, |
633 | - }, |
634 | - } |
635 | - ) |
636 | - two = inst.get_stats() |
637 | - self.assertFalse(one is two) |
638 | - self.assertFalse(one['imported'] is two['imported']) |
639 | - self.assertFalse(one['skipped'] is two['skipped']) |
640 | + self.assertEqual(inst.doc['base'], tmp.path) |
641 | + self.assertEqual( |
642 | + inst.doc['log'], |
643 | + { |
644 | + 'imported': [], |
645 | + 'skipped': [], |
646 | + 'empty': [], |
647 | + 'error': [], |
648 | + } |
649 | + ) |
650 | + self.assertEqual( |
651 | + inst.doc['stats'], |
652 | + { |
653 | + 'imported': {'count': 0, 'bytes': 0}, |
654 | + 'skipped': {'count': 0, 'bytes': 0}, |
655 | + 'empty': {'count': 0, 'bytes': 0}, |
656 | + 'error': {'count': 0, 'bytes': 0}, |
657 | + } |
658 | + ) |
659 | |
660 | def test_scanfiles(self): |
661 | tmp = TempDir() |
662 | inst = self.new(tmp.path) |
663 | + inst.start() |
664 | files = [] |
665 | - for args in relpaths: |
666 | - p = tmp.touch('subdir', *args) |
667 | - files.append(p) |
668 | + for (i, args) in enumerate(relpaths): |
669 | + content = 'a' * (2 ** i) |
670 | + p = tmp.write(content, 'subdir', *args) |
671 | + files.append((p, len(content), path.getmtime(p))) |
672 | got = inst.scanfiles() |
673 | self.assertEqual(got, tuple(files)) |
674 | - self.assertTrue(inst.scanfiles() is got) |
675 | + self.assertEqual( |
676 | + inst.db[inst._id]['log']['considered'], |
677 | + [{'src': src, 'bytes': size, 'mtime': mtime} |
678 | + for (src, size, mtime) in files] |
679 | + ) |
680 | + self.assertEqual( |
681 | + inst.db[inst._id]['stats']['considered'], |
682 | + { |
683 | + 'count': len(files), |
684 | + 'bytes': sum(t[1] for t in files), |
685 | + } |
686 | + ) |
687 | |
688 | - def test_import_file(self): |
689 | + def test_import_file_private(self): |
690 | + """ |
691 | + Test the `Importer._import_file()` method. |
692 | + """ |
693 | tmp = TempDir() |
694 | inst = self.new(tmp.path) |
695 | + inst.start() |
696 | |
697 | # Test that AmbiguousPath is raised: |
698 | traversal = '/home/foo/.dmedia/../.ssh/id_rsa' |
699 | - e = raises(AmbiguousPath, inst.import_file, traversal) |
700 | + e = raises(AmbiguousPath, inst._import_file, traversal) |
701 | self.assertEqual(e.pathname, traversal) |
702 | self.assertEqual(e.abspath, '/home/foo/.ssh/id_rsa') |
703 | |
704 | # Test that IOError propagates up with missing file |
705 | nope = tmp.join('nope.mov') |
706 | - e = raises(IOError, inst.import_file, nope) |
707 | + e = raises(IOError, inst._import_file, nope) |
708 | self.assertEqual( |
709 | str(e), |
710 | '[Errno 2] No such file or directory: %r' % nope |
711 | @@ -339,7 +385,7 @@ |
712 | # Test that IOError propagates up with unreadable file |
713 | nope = tmp.touch('nope.mov') |
714 | os.chmod(nope, 0o000) |
715 | - e = raises(IOError, inst.import_file, nope) |
716 | + e = raises(IOError, inst._import_file, nope) |
717 | self.assertEqual( |
718 | str(e), |
719 | '[Errno 13] Permission denied: %r' % nope |
720 | @@ -351,7 +397,7 @@ |
721 | |
722 | # Test with new file |
723 | size = path.getsize(src1) |
724 | - (action, doc) = inst.import_file(src1) |
725 | + (action, doc) = inst._import_file(src1) |
726 | |
727 | self.assertEqual(action, 'imported') |
728 | self.assertEqual( |
729 | @@ -368,10 +414,9 @@ |
730 | 'stored', |
731 | |
732 | 'import_id', |
733 | - 'qid', |
734 | 'mtime', |
735 | - 'basename', |
736 | - 'dirname', |
737 | + 'name', |
738 | + 'dir', |
739 | 'content_type', |
740 | ]) |
741 | ) |
742 | @@ -384,42 +429,235 @@ |
743 | self.assertEqual(doc['bytes'], size) |
744 | self.assertEqual(doc['ext'], 'mov') |
745 | |
746 | - self.assertEqual(doc['import_id'], None) |
747 | - self.assertEqual(doc['qid'], mov_qid) |
748 | + self.assertEqual(doc['import_id'], inst._id) |
749 | self.assertEqual(doc['mtime'], path.getmtime(src1)) |
750 | - self.assertEqual(doc['basename'], 'MVI_5751.MOV') |
751 | - self.assertEqual(doc['dirname'], 'DCIM/100EOS5D2') |
752 | + self.assertEqual(doc['name'], 'MVI_5751.MOV') |
753 | + self.assertEqual(doc['dir'], 'DCIM/100EOS5D2') |
754 | self.assertEqual(doc['content_type'], 'video/quicktime') |
755 | |
756 | - self.assertEqual(inst.get_stats(), |
757 | - { |
758 | - 'imported': { |
759 | - 'count': 1, |
760 | - 'bytes': size, |
761 | - }, |
762 | - 'skipped': { |
763 | - 'count': 0, |
764 | - 'bytes': 0, |
765 | - }, |
766 | - } |
767 | - ) |
768 | - |
769 | # Test with duplicate |
770 | - (action, wrapper) = inst.import_file(src2) |
771 | - self.assertEqual(action, 'skipped') |
772 | - doc2 = dict(wrapper) |
773 | - doc2['_attachments'] = doc['_attachments'] |
774 | - self.assertEqual(doc2, doc) |
775 | - self.assertEqual(inst.get_stats(), |
776 | - { |
777 | - 'imported': { |
778 | - 'count': 1, |
779 | - 'bytes': size, |
780 | - }, |
781 | - 'skipped': { |
782 | - 'count': 1, |
783 | - 'bytes': size, |
784 | - }, |
785 | + (action, doc) = inst._import_file(src2) |
786 | + self.assertEqual(action, 'skipped') |
787 | + self.assertEqual(doc, inst.db[mov_hash]) |
788 | + |
789 | + # Test with duplicate with missing doc |
790 | + del inst.db[mov_hash] |
791 | + (action, doc) = inst._import_file(src2) |
792 | + self.assertEqual(action, 'skipped') |
793 | + self.assertEqual(doc['time'], inst.db[mov_hash]['time']) |
794 | + |
795 | + # Test with duplicate when doc is missing this filestore in store: |
796 | + old = inst.db[mov_hash] |
797 | + rid = random_id() |
798 | + old['stored'] = {rid: {'copies': 2, 'time': 1234567890}} |
799 | + inst.db.save(old) |
800 | + (action, doc) = inst._import_file(src2) |
801 | + fid = inst.filestore._id |
802 | + self.assertEqual(action, 'skipped') |
803 | + self.assertEqual(set(doc['stored']), set([rid, fid])) |
804 | + t = doc['stored'][fid]['time'] |
805 | + self.assertEqual( |
806 | + doc['stored'], |
807 | + { |
808 | + rid: {'copies': 2, 'time': 1234567890}, |
809 | + fid: {'copies': 1, 'time': t}, |
810 | + } |
811 | + ) |
812 | + self.assertEqual(inst.db[mov_hash]['stored'], doc['stored']) |
813 | + |
814 | + # Test with existing doc but missing file: |
815 | + old = inst.db[mov_hash] |
816 | + inst.filestore.remove(mov_hash, 'mov') |
817 | + (action, doc) = inst._import_file(src2) |
818 | + self.assertEqual(action, 'imported') |
819 | + self.assertEqual(doc['_rev'], old['_rev']) |
820 | + self.assertEqual(doc['time'], old['time']) |
821 | + self.assertEqual(inst.db[mov_hash], old) |
822 | + |
823 | + # Test with empty file: |
824 | + src3 = tmp.touch('DCIM', '100EOS5D2', 'foo.MOV') |
825 | + (action, doc) = inst._import_file(src3) |
826 | + self.assertEqual(action, 'empty') |
827 | + self.assertEqual(doc, None) |
828 | + |
829 | + def test_import_file(self): |
830 | + """ |
831 | + Test the `Importer.import_file()` method. |
832 | + """ |
833 | + tmp = TempDir() |
834 | + inst = self.new(tmp.path) |
835 | + inst.start() |
836 | + |
837 | + self.assertEqual(inst.doc['log']['error'], []) |
838 | + self.assertEqual(inst._processed, []) |
839 | + |
840 | + # Test that AmbiguousPath is raised: |
841 | + nope1 = '/home/foo/.dmedia/../.ssh/id_rsa' |
842 | + abspath = '/home/foo/.ssh/id_rsa' |
843 | + (action, error1) = inst.import_file(nope1, 17) |
844 | + self.assertEqual(action, 'error') |
845 | + self.assertEqual(error1, { |
846 | + 'src': nope1, |
847 | + 'name': 'AmbiguousPath', |
848 | + 'msg': '%r resolves to %r' % (nope1, abspath), |
849 | + }) |
850 | + self.assertEqual( |
851 | + inst.doc['log']['error'], |
852 | + [error1] |
853 | + ) |
854 | + self.assertEqual( |
855 | + inst._processed, |
856 | + [nope1] |
857 | + ) |
858 | + |
859 | + # Test that IOError propagates up with missing file |
860 | + nope2 = tmp.join('nope.mov') |
861 | + (action, error2) = inst.import_file(nope2, 18) |
862 | + self.assertEqual(action, 'error') |
863 | + self.assertEqual(error2, { |
864 | + 'src': nope2, |
865 | + 'name': 'IOError', |
866 | + 'msg': '[Errno 2] No such file or directory: %r' % nope2, |
867 | + }) |
868 | + self.assertEqual( |
869 | + inst.doc['log']['error'], |
870 | + [error1, error2] |
871 | + ) |
872 | + self.assertEqual( |
873 | + inst._processed, |
874 | + [nope1, nope2] |
875 | + ) |
876 | + |
877 | + # Test that IOError propagates up with unreadable file |
878 | + nope3 = tmp.touch('nope.mov') |
879 | + os.chmod(nope3, 0o000) |
880 | + try: |
881 | + (action, error3) = inst.import_file(nope3, 19) |
882 | + self.assertEqual(action, 'error') |
883 | + self.assertEqual(error3, { |
884 | + 'src': nope3, |
885 | + 'name': 'IOError', |
886 | + 'msg': '[Errno 13] Permission denied: %r' % nope3, |
887 | + }) |
888 | + self.assertEqual( |
889 | + inst.doc['log']['error'], |
890 | + [error1, error2, error3] |
891 | + ) |
892 | + self.assertEqual( |
893 | + inst._processed, |
894 | + [nope1, nope2, nope3] |
895 | + ) |
896 | + finally: |
897 | + os.chmod(nope3, 0o600) |
898 | + |
899 | + |
900 | + # Test with new files |
901 | + src1 = tmp.copy(sample_mov, 'DCIM', '100EOS5D2', 'MVI_5751.MOV') |
902 | + src2 = tmp.copy(sample_thm, 'DCIM', '100EOS5D2', 'MVI_5751.THM') |
903 | + self.assertEqual(inst.doc['log']['imported'], []) |
904 | + |
905 | + (action, imported1) = inst.import_file(src1, 17) |
906 | + self.assertEqual(action, 'imported') |
907 | + self.assertEqual(imported1, { |
908 | + 'src': src1, |
909 | + 'id': mov_hash, |
910 | + }) |
911 | + self.assertEqual( |
912 | + inst.doc['log']['imported'], |
913 | + [imported1] |
914 | + ) |
915 | + self.assertEqual( |
916 | + inst._processed, |
917 | + [nope1, nope2, nope3, src1] |
918 | + ) |
919 | + |
920 | + (action, imported2) = inst.import_file(src2, 17) |
921 | + self.assertEqual(action, 'imported') |
922 | + self.assertEqual(imported2, { |
923 | + 'src': src2, |
924 | + 'id': thm_hash, |
925 | + }) |
926 | + self.assertEqual( |
927 | + inst.doc['log']['imported'], |
928 | + [imported1, imported2] |
929 | + ) |
930 | + self.assertEqual( |
931 | + inst._processed, |
932 | + [nope1, nope2, nope3, src1, src2] |
933 | + ) |
934 | + |
935 | + # Test with duplicate files |
936 | + dup1 = tmp.copy(sample_mov, 'DCIM', '100EOS5D2', 'MVI_5750.MOV') |
937 | + dup2 = tmp.copy(sample_thm, 'DCIM', '100EOS5D2', 'MVI_5750.THM') |
938 | + self.assertEqual(inst.doc['log']['skipped'], []) |
939 | + |
940 | + (action, skipped1) = inst.import_file(dup1, 17) |
941 | + self.assertEqual(action, 'skipped') |
942 | + self.assertEqual(skipped1, { |
943 | + 'src': dup1, |
944 | + 'id': mov_hash, |
945 | + }) |
946 | + self.assertEqual( |
947 | + inst.doc['log']['skipped'], |
948 | + [skipped1] |
949 | + ) |
950 | + self.assertEqual( |
951 | + inst._processed, |
952 | + [nope1, nope2, nope3, src1, src2, dup1] |
953 | + ) |
954 | + |
955 | + (action, skipped2) = inst.import_file(dup2, 17) |
956 | + self.assertEqual(action, 'skipped') |
957 | + self.assertEqual(skipped2, { |
958 | + 'src': dup2, |
959 | + 'id': thm_hash, |
960 | + }) |
961 | + self.assertEqual( |
962 | + inst.doc['log']['skipped'], |
963 | + [skipped1, skipped2] |
964 | + ) |
965 | + self.assertEqual( |
966 | + inst._processed, |
967 | + [nope1, nope2, nope3, src1, src2, dup1, dup2] |
968 | + ) |
969 | + |
970 | + # Test with empty files |
971 | + emp1 = tmp.touch('DCIM', '100EOS5D2', 'MVI_5759.MOV') |
972 | + emp2 = tmp.touch('DCIM', '100EOS5D2', 'MVI_5759.THM') |
973 | + self.assertEqual(inst.doc['log']['empty'], []) |
974 | + |
975 | + (action, empty1) = inst.import_file(emp1, 17) |
976 | + self.assertEqual(action, 'empty') |
977 | + self.assertEqual(empty1, emp1) |
978 | + self.assertEqual( |
979 | + inst.doc['log']['empty'], |
980 | + [empty1] |
981 | + ) |
982 | + self.assertEqual( |
983 | + inst._processed, |
984 | + [nope1, nope2, nope3, src1, src2, dup1, dup2, emp1] |
985 | + ) |
986 | + |
987 | + (action, empty2) = inst.import_file(emp2, 17) |
988 | + self.assertEqual(action, 'empty') |
989 | + self.assertEqual(empty2, emp2) |
990 | + self.assertEqual( |
991 | + inst.doc['log']['empty'], |
992 | + [empty1, empty2] |
993 | + ) |
994 | + self.assertEqual( |
995 | + inst._processed, |
996 | + [nope1, nope2, nope3, src1, src2, dup1, dup2, emp1, emp2] |
997 | + ) |
998 | + |
999 | + # Check state of log one final time |
1000 | + self.assertEqual( |
1001 | + inst.doc['log'], |
1002 | + { |
1003 | + 'imported': [imported1, imported2], |
1004 | + 'skipped': [skipped1, skipped2], |
1005 | + 'empty': [empty1, empty2], |
1006 | + 'error': [error1, error2, error3], |
1007 | } |
1008 | ) |
1009 | |
1010 | @@ -430,87 +668,30 @@ |
1011 | src1 = tmp.copy(sample_mov, 'DCIM', '100EOS5D2', 'MVI_5751.MOV') |
1012 | dup1 = tmp.copy(sample_mov, 'DCIM', '100EOS5D2', 'MVI_5752.MOV') |
1013 | src2 = tmp.copy(sample_thm, 'DCIM', '100EOS5D2', 'MVI_5751.THM') |
1014 | + src3 = tmp.touch('DCIM', '100EOS5D2', 'Zar.MOV') |
1015 | + src4 = tmp.touch('DCIM', '100EOS5D2', 'Zoo.MOV') |
1016 | + |
1017 | |
1018 | import_id = inst.start() |
1019 | + inst.scanfiles() |
1020 | items = tuple(inst.import_all_iter()) |
1021 | - self.assertEqual(len(items), 3) |
1022 | + self.assertEqual(len(items), 5) |
1023 | self.assertEqual( |
1024 | - [t[:2] for t in items], |
1025 | - [ |
1026 | + items, |
1027 | + ( |
1028 | (src1, 'imported'), |
1029 | (src2, 'imported'), |
1030 | (dup1, 'skipped'), |
1031 | - ] |
1032 | - ) |
1033 | - |
1034 | - doc = items[0][2] |
1035 | - self.assertEqual(schema.check_dmedia_file(doc), None) |
1036 | - self.assertEqual(doc, |
1037 | - { |
1038 | - '_id': mov_hash, |
1039 | - '_rev': doc['_rev'], |
1040 | - '_attachments': { |
1041 | - 'leaves': { |
1042 | - 'data': b64encode(''.join(mov_leaves)), |
1043 | - 'content_type': 'application/octet-stream', |
1044 | - } |
1045 | - }, |
1046 | - 'type': 'dmedia/file', |
1047 | - 'time': doc['time'], |
1048 | - 'bytes': path.getsize(src1), |
1049 | - 'ext': 'mov', |
1050 | - 'origin': 'user', |
1051 | - 'stored': { |
1052 | - inst.filestore._id: { |
1053 | - 'copies': 1, |
1054 | - 'time': doc['time'], |
1055 | - }, |
1056 | - }, |
1057 | - |
1058 | - 'import_id': import_id, |
1059 | - 'qid': mov_qid, |
1060 | - 'mtime': path.getmtime(src1), |
1061 | - 'basename': 'MVI_5751.MOV', |
1062 | - 'dirname': 'DCIM/100EOS5D2', |
1063 | - 'content_type': 'video/quicktime', |
1064 | - } |
1065 | - ) |
1066 | - |
1067 | - doc = items[1][2] |
1068 | - self.assertEqual(schema.check_dmedia_file(doc), None) |
1069 | - self.assertEqual(doc, |
1070 | - { |
1071 | - '_id': thm_hash, |
1072 | - '_rev': doc['_rev'], |
1073 | - '_attachments': { |
1074 | - 'leaves': { |
1075 | - 'data': b64encode(''.join(thm_leaves)), |
1076 | - 'content_type': 'application/octet-stream', |
1077 | - } |
1078 | - }, |
1079 | - 'type': 'dmedia/file', |
1080 | - 'time': doc['time'], |
1081 | - 'bytes': path.getsize(src2), |
1082 | - 'ext': 'thm', |
1083 | - 'origin': 'user', |
1084 | - 'stored': { |
1085 | - inst.filestore._id: { |
1086 | - 'copies': 1, |
1087 | - 'time': doc['time'], |
1088 | - }, |
1089 | - }, |
1090 | - |
1091 | - 'import_id': import_id, |
1092 | - 'qid': thm_qid, |
1093 | - 'mtime': path.getmtime(src2), |
1094 | - 'basename': 'MVI_5751.THM', |
1095 | - 'dirname': 'DCIM/100EOS5D2', |
1096 | - 'content_type': None, |
1097 | - } |
1098 | - ) |
1099 | - |
1100 | + (src3, 'empty'), |
1101 | + (src4, 'empty'), |
1102 | + ) |
1103 | + ) |
1104 | self.assertEqual(inst.finalize(), |
1105 | { |
1106 | + 'considered': { |
1107 | + 'count': 5, |
1108 | + 'bytes': path.getsize(src1) * 2 + path.getsize(src2), |
1109 | + }, |
1110 | 'imported': { |
1111 | 'count': 2, |
1112 | 'bytes': path.getsize(src1) + path.getsize(src2), |
1113 | @@ -519,6 +700,8 @@ |
1114 | 'count': 1, |
1115 | 'bytes': path.getsize(dup1), |
1116 | }, |
1117 | + 'empty': {'count': 2, 'bytes': 0}, |
1118 | + 'error': {'count': 0, 'bytes': 0}, |
1119 | } |
1120 | ) |
1121 | |
1122 | @@ -569,7 +752,7 @@ |
1123 | dict( |
1124 | signal='progress', |
1125 | args=(base, _id, 1, 3, |
1126 | - dict(action='imported', src=src1, _id=mov_hash) |
1127 | + dict(action='imported', src=src1) |
1128 | ), |
1129 | worker='ImportWorker', |
1130 | pid=pid, |
1131 | @@ -579,7 +762,7 @@ |
1132 | dict( |
1133 | signal='progress', |
1134 | args=(base, _id, 2, 3, |
1135 | - dict(action='imported', src=src2, _id=thm_hash) |
1136 | + dict(action='imported', src=src2) |
1137 | ), |
1138 | worker='ImportWorker', |
1139 | pid=pid, |
1140 | @@ -589,7 +772,7 @@ |
1141 | dict( |
1142 | signal='progress', |
1143 | args=(base, _id, 3, 3, |
1144 | - dict(action='skipped', src=dup1, _id=mov_hash) |
1145 | + dict(action='skipped', src=dup1) |
1146 | ), |
1147 | worker='ImportWorker', |
1148 | pid=pid, |
1149 | @@ -601,8 +784,11 @@ |
1150 | signal='finished', |
1151 | args=(base, _id, |
1152 | dict( |
1153 | + considered={'count': 3, 'bytes': (mov_size*2 + thm_size)}, |
1154 | imported={'count': 2, 'bytes': (mov_size + thm_size)}, |
1155 | skipped={'count': 1, 'bytes': mov_size}, |
1156 | + empty={'count': 0, 'bytes': 0}, |
1157 | + error={'count': 0, 'bytes': 0}, |
1158 | ), |
1159 | ), |
1160 | worker='ImportWorker', |
1161 | @@ -629,7 +815,7 @@ |
1162 | inst._start_batch() |
1163 | self.assertEqual(inst._completed, 0) |
1164 | self.assertEqual(inst._total, 0) |
1165 | - batch = inst._batch |
1166 | + batch = inst.doc |
1167 | batch_id = batch['_id'] |
1168 | self.assertTrue(isinstance(batch, dict)) |
1169 | self.assertEqual( |
1170 | @@ -639,9 +825,9 @@ |
1171 | 'type', |
1172 | 'time', |
1173 | 'imports', |
1174 | - 'imported', |
1175 | - 'skipped', |
1176 | + 'errors', |
1177 | 'machine_id', |
1178 | + 'stats', |
1179 | ]) |
1180 | ) |
1181 | self.assertEqual(batch['type'], 'dmedia/batch') |
1182 | @@ -662,10 +848,12 @@ |
1183 | callback = DummyCallback() |
1184 | inst = self.klass(callback, self.dbname) |
1185 | batch_id = random_id() |
1186 | - inst._batch = dict( |
1187 | + inst.doc = dict( |
1188 | _id=batch_id, |
1189 | - imported={'count': 17, 'bytes': 98765}, |
1190 | - skipped={'count': 3, 'bytes': 12345}, |
1191 | + stats=dict( |
1192 | + imported={'count': 17, 'bytes': 98765}, |
1193 | + skipped={'count': 3, 'bytes': 12345}, |
1194 | + ), |
1195 | ) |
1196 | |
1197 | # Make sure it checks that workers is empty |
1198 | @@ -676,7 +864,7 @@ |
1199 | # Check that it fires signal correctly |
1200 | inst._workers.clear() |
1201 | inst._finish_batch() |
1202 | - self.assertEqual(inst._batch, None) |
1203 | + self.assertEqual(inst.doc, None) |
1204 | stats = dict( |
1205 | imported=17, |
1206 | imported_bytes=98765, |
1207 | @@ -695,20 +883,48 @@ |
1208 | set([ |
1209 | '_id', |
1210 | '_rev', |
1211 | - 'imported', |
1212 | - 'skipped', |
1213 | + 'stats', |
1214 | 'time_end', |
1215 | ]) |
1216 | ) |
1217 | cur = time.time() |
1218 | self.assertTrue(cur - 1 <= doc['time_end'] <= cur) |
1219 | |
1220 | + def test_on_error(self): |
1221 | + callback = DummyCallback() |
1222 | + inst = self.klass(callback, self.dbname) |
1223 | + |
1224 | + # Make sure it works when doc is None: |
1225 | + inst.on_error('foo', 'IOError', 'nope') |
1226 | + self.assertEqual(inst.doc, None) |
1227 | + |
1228 | + # Test normally: |
1229 | + inst._start_batch() |
1230 | + self.assertEqual(inst.doc['errors'], []) |
1231 | + inst.on_error('foo', 'IOError', 'nope') |
1232 | + doc = inst.db[inst.doc['_id']] |
1233 | + self.assertEqual( |
1234 | + doc['errors'], |
1235 | + [ |
1236 | + {'key': 'foo', 'name': 'IOError', 'msg': 'nope'}, |
1237 | + ] |
1238 | + ) |
1239 | + inst.on_error('bar', 'error!', 'no way') |
1240 | + doc = inst.db[inst.doc['_id']] |
1241 | + self.assertEqual( |
1242 | + doc['errors'], |
1243 | + [ |
1244 | + {'key': 'foo', 'name': 'IOError', 'msg': 'nope'}, |
1245 | + {'key': 'bar', 'name': 'error!', 'msg': 'no way'}, |
1246 | + ] |
1247 | + ) |
1248 | + |
1249 | def test_on_started(self): |
1250 | callback = DummyCallback() |
1251 | inst = self.klass(callback, self.dbname) |
1252 | self.assertEqual(callback.messages, []) |
1253 | inst._start_batch() |
1254 | - batch_id = inst._batch['_id'] |
1255 | + batch_id = inst.doc['_id'] |
1256 | self.assertEqual(inst.db[batch_id]['imports'], []) |
1257 | self.assertEqual( |
1258 | callback.messages, |
1259 | @@ -816,10 +1032,12 @@ |
1260 | callback = DummyCallback() |
1261 | inst = self.klass(callback, self.dbname) |
1262 | batch_id = random_id() |
1263 | - inst._batch = dict( |
1264 | + inst.doc = dict( |
1265 | _id=batch_id, |
1266 | - imported={'count': 0, 'bytes': 0}, |
1267 | - skipped={'count': 0, 'bytes': 0}, |
1268 | + stats=dict( |
1269 | + imported={'count': 0, 'bytes': 0}, |
1270 | + skipped={'count': 0, 'bytes': 0}, |
1271 | + ), |
1272 | ) |
1273 | |
1274 | # Call with first import |
1275 | @@ -843,16 +1061,16 @@ |
1276 | ] |
1277 | ) |
1278 | self.assertEqual( |
1279 | - set(inst._batch), |
1280 | - set(['_id', '_rev', 'imported', 'skipped']) |
1281 | + set(inst.doc), |
1282 | + set(['_id', '_rev', 'stats']) |
1283 | ) |
1284 | - self.assertEqual(inst._batch['_id'], batch_id) |
1285 | + self.assertEqual(inst.doc['_id'], batch_id) |
1286 | self.assertEqual( |
1287 | - inst._batch['imported'], |
1288 | + inst.doc['stats']['imported'], |
1289 | {'count': 17, 'bytes': 98765} |
1290 | ) |
1291 | self.assertEqual( |
1292 | - inst._batch['skipped'], |
1293 | + inst.doc['stats']['skipped'], |
1294 | {'count': 3, 'bytes': 12345} |
1295 | ) |
1296 | |
1297 | @@ -884,16 +1102,16 @@ |
1298 | ] |
1299 | ) |
1300 | self.assertEqual( |
1301 | - set(inst._batch), |
1302 | - set(['_id', '_rev', 'imported', 'skipped']) |
1303 | + set(inst.doc), |
1304 | + set(['_id', '_rev', 'stats']) |
1305 | ) |
1306 | - self.assertEqual(inst._batch['_id'], batch_id) |
1307 | + self.assertEqual(inst.doc['_id'], batch_id) |
1308 | self.assertEqual( |
1309 | - inst._batch['imported'], |
1310 | + inst.doc['stats']['imported'], |
1311 | {'count': 17 + 18, 'bytes': 98765 + 9876} |
1312 | ) |
1313 | self.assertEqual( |
1314 | - inst._batch['skipped'], |
1315 | + inst.doc['stats']['skipped'], |
1316 | {'count': 3 + 5, 'bytes': 12345 + 1234} |
1317 | ) |
1318 | |
1319 | @@ -948,21 +1166,21 @@ |
1320 | self.assertEqual( |
1321 | callback.messages[3], |
1322 | ('ImportProgress', (base, import_id, 1, 3, |
1323 | - dict(action='imported', src=src1, _id=mov_hash) |
1324 | + dict(action='imported', src=src1) |
1325 | ) |
1326 | ) |
1327 | ) |
1328 | self.assertEqual( |
1329 | callback.messages[4], |
1330 | ('ImportProgress', (base, import_id, 2, 3, |
1331 | - dict(action='imported', src=src2, _id=thm_hash) |
1332 | + dict(action='imported', src=src2) |
1333 | ) |
1334 | ) |
1335 | ) |
1336 | self.assertEqual( |
1337 | callback.messages[5], |
1338 | ('ImportProgress', (base, import_id, 3, 3, |
1339 | - dict(action='skipped', src=dup1, _id=mov_hash) |
1340 | + dict(action='skipped', src=dup1) |
1341 | ) |
1342 | ) |
1343 | ) |
1344 | |
1345 | === modified file 'dmedia/tests/test_metastore.py' |
1346 | --- dmedia/tests/test_metastore.py 2011-01-30 21:24:30 +0000 |
1347 | +++ dmedia/tests/test_metastore.py 2011-02-19 04:44:09 +0000 |
1348 | @@ -124,42 +124,6 @@ |
1349 | self.assertEqual(inst.machine_id, _id) |
1350 | self.assertEqual(inst._machine_id, _id) |
1351 | |
1352 | - def test_by_quickid(self): |
1353 | - mov_chash = 'OMLUWEIPEUNRGYMKAEHG3AEZPVZ5TUQE' |
1354 | - mov_qid = 'GJ4AQP3BK3DMTXYOLKDK6CW4QIJJGVMN' |
1355 | - thm_chash = 'F6ATTKI6YVWVRBQQESAZ4DSUXQ4G457A' |
1356 | - thm_qid = 'EYCDXXCNDB6OIIX5DN74J7KEXLNCQD5M' |
1357 | - inst = self.new() |
1358 | - self.assertEqual( |
1359 | - list(inst.by_quickid(mov_qid)), |
1360 | - [] |
1361 | - ) |
1362 | - inst.db.create( |
1363 | - {'_id': thm_chash, 'qid': thm_qid, 'type': 'dmedia/file'} |
1364 | - ) |
1365 | - self.assertEqual( |
1366 | - list(inst.by_quickid(mov_qid)), |
1367 | - [] |
1368 | - ) |
1369 | - inst.db.create( |
1370 | - {'_id': mov_chash, 'qid': mov_qid, 'type': 'dmedia/file'} |
1371 | - ) |
1372 | - self.assertEqual( |
1373 | - list(inst.by_quickid(mov_qid)), |
1374 | - [mov_chash] |
1375 | - ) |
1376 | - self.assertEqual( |
1377 | - list(inst.by_quickid(thm_qid)), |
1378 | - [thm_chash] |
1379 | - ) |
1380 | - inst.db.create( |
1381 | - {'_id': 'should-not-happen', 'qid': mov_qid, 'type': 'dmedia/file'} |
1382 | - ) |
1383 | - self.assertEqual( |
1384 | - list(inst.by_quickid(mov_qid)), |
1385 | - [mov_chash, 'should-not-happen'] |
1386 | - ) |
1387 | - |
1388 | def test_total_bytes(self): |
1389 | inst = self.new() |
1390 | self.assertEqual(inst.total_bytes(), 0) |
1391 | |
1392 | === modified file 'dmedia/util.py' |
1393 | --- dmedia/util.py 2011-02-15 07:53:55 +0000 |
1394 | +++ dmedia/util.py 2011-02-19 04:44:09 +0000 |
1395 | @@ -43,12 +43,15 @@ |
1396 | def configure_logging(namespace): |
1397 | format = [ |
1398 | '%(levelname)s', |
1399 | - '%(message)s' |
1400 | + '%(process)d', |
1401 | + '%(message)s', |
1402 | ] |
1403 | cache = path.join(xdg.BaseDirectory.xdg_cache_home, 'dmedia') |
1404 | if not path.exists(cache): |
1405 | os.makedirs(cache) |
1406 | filename = path.join(cache, namespace + '.log') |
1407 | + if path.exists(filename): |
1408 | + os.rename(filename, filename + '.previous') |
1409 | logging.basicConfig( |
1410 | filename=filename, |
1411 | filemode='w', |
1412 | |
1413 | === modified file 'dmedia/workers.py' |
1414 | --- dmedia/workers.py 2011-02-07 07:19:55 +0000 |
1415 | +++ dmedia/workers.py 2011-02-19 04:44:09 +0000 |
1416 | @@ -165,7 +165,7 @@ |
1417 | pass |
1418 | |
1419 | def _process_message(self, msg): |
1420 | - log.info('%(signal)s %(args)r', msg) |
1421 | + log.info('[From %(pid)d] %(signal)s %(args)r', msg) |
1422 | with self._lock: |
1423 | signal = msg['signal'] |
1424 | args = msg['args'] |
Perhaps bad form, but I'm approving this myself. I'm excited about reviews and they've already proven a great way to get more people engaged in the code, but I'm reserving the right to self-approve when needed. If there aren't takers for a review, I can't let that hold things up for too long. Velocity needs to stay as high as possible.
This is an important change and needs to get some abuse through the daily builds before 0.4 is released, so I don't want to wait any longer on this one.
"Jason, looks great! --Jason"