lp:staden-io-lib-trunk

Created by James Bonfield
Get this branch:
bzr branch lp:staden-io-lib-trunk

Branch information

Owner:
James Bonfield
Project:
staden-io-lib-trunk
Status:
Development

Import details

Import Status: Failed

This branch is an import of the Subversion branch from https://svn.code.sf.net/p/staden/code/io_lib/trunk.

The import has been suspended because it failed 5 or more times in succession.

Import started on pear and finished taking 15 seconds — see the log
Import started on russkaya and finished taking 15 seconds — see the log

Recent revisions

590. By jkbonfield

Fixed a rare renormalisation bug in the rANS codec.

The symbol frequencies need to sum to TOTFREQ (4096 currently) and are
rounded up/down accordingly. The combination of integer rounding
means the renormalised frequencies don't always total 4096 exactly, so
the remainder is added-to / subtracted-from the most frequent symbol.
In one particular data set this remainder was larger than the most
frequent symbol, causing it to become negative.

We now just do another round of renormalisation with slightly lower
products until we get it right. It's not the fastest solution, but a
very rare event.
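The retry idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the io_lib code: the function name normalise_freqs, the 256-symbol alphabet, round-to-nearest scaling and the step of lowering the target by one on each retry are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOTFREQ 4096 /* target sum quoted in the commit message */
#define NSYM    256  /* assumed byte-wide alphabet */

/*
 * Hypothetical helper: scale raw counts so that freq[] sums to exactly
 * TOTFREQ and every symbol that occurs keeps a frequency of at least 1.
 * Returns 0 on success, -1 if the input is empty.
 */
int normalise_freqs(const uint32_t count[NSYM], uint32_t freq[NSYM])
{
    uint64_t total = 0;
    int i;

    for (i = 0; i < NSYM; i++)
        total += count[i];
    if (total == 0)
        return -1;

    /* Each retry scales towards a slightly smaller target (assumed step). */
    for (uint32_t target = TOTFREQ; target >= 1; target--) {
        int64_t sum = 0;
        int max_sym = 0;

        for (i = 0; i < NSYM; i++) {
            freq[i] = 0;
            if (count[i] == 0)
                continue;
            /* Round to nearest, but never drop a symbol that occurs. */
            uint64_t f = ((uint64_t)count[i] * target + total / 2) / total;
            if (f == 0)
                f = 1;
            freq[i] = (uint32_t)f;
            sum += (int64_t)f;
            if (count[i] > count[max_sym])
                max_sym = i;
        }

        /* Push the rounding remainder onto the most frequent symbol. */
        int64_t fixed = (int64_t)freq[max_sym] + (TOTFREQ - sum);
        if (fixed >= 1) {
            assert(fixed <= TOTFREQ);
            freq[max_sym] = (uint32_t)fixed;
            return 0;
        }
        /* Remainder swallowed the symbol: try again with a lower target. */
    }
    return -1; /* not reached while NSYM <= TOTFREQ */
}
```

With 56 symbols counted 1000 times each and 200 symbols counted once, the first pass in this sketch overshoots by more than the largest frequency, so the loop retries until the adjusted value stays positive.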

589. By jkbonfield

Fix BAM bin value for placed but unmapped reads. (Reported by German
Tischler.)

This corresponds to a SAM spec change from 8th April 2014 where
unmapped data was explicitly stated to have length 1. Io_lib's
implementation assumed unmapped data to be zero length.
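To illustrate why the assumed length matters, here is a small sketch built around the reg2bin calculation published in the SAM specification. The wrapper bin_for_read is a hypothetical name for this example and is not taken from io_lib; it simply treats any read that covers no reference bases as having length 1.

```c
#include <assert.h>

/*
 * reg2bin as given in the SAM specification.
 * beg is 0-based inclusive, end is 0-based exclusive.
 */
int reg2bin(int beg, int end)
{
    --end;
    if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
    if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
    if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);
    if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);
    if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);
    return 0;
}

/*
 * Hypothetical wrapper: pos0 is the 0-based placement, ref_len the number
 * of reference bases covered (0 for a placed but unmapped read).
 */
int bin_for_read(int pos0, int ref_len)
{
    assert(pos0 >= 0);
    if (ref_len < 1)
        ref_len = 1; /* unmapped data is treated as length 1 */
    return reg2bin(pos0, pos0 + ref_len);
}
```

For a read placed at 0-based position 16384, a zero-length interval, reg2bin(16384, 16384), yields bin 585, whereas bin_for_read(16384, 0) yields bin 4682. The difference shows up when the placement sits on a 16 kb boundary.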

588. By jkbonfield

Fixed a CRAM encoder crash when no @SQ lines are present but the
sequences have reference names in use.

587. By jkbonfield

Removed a CRAM encoding crash.

When an @SQ line is present but no SN: entry exists, the name field
was NULL but dereferenced.

586. By jkbonfield

Fixed a compression inefficiency when switching to unsorted mode.

We switch from sorted to unsorted mode only after a couple of tiny
containers have been created. (Ideally we'd detect upfront.)

We also compute compression metrics on the first few containers and
then keep those stats for the next 100 or so. The combination of
these meant we computed compression metrics based on data that was not
of comparable size to the rest of the container. In one test set this
meant Z_RLE was optimal on the 1-read slices but then applied to
10,000 read slices when Z_FILTERED is preferable (due to lots of
duplicate entries).

585. By jkbonfield

Removed an uninitialised memory access, although I'm a little unsure
why this is even there! (Bad memory.)

It's in code that is executed when the cram codec fails to initialise,
so I believe this change is a no-op on valid files.

584. By jkbonfield

Merged in the cram_filter branch.

This tool should still be considered as experimental.

583. By jkbonfield

Improved multi-threaded CRAM decoding.

When given a thread pool, we now migrate the cram_to_bam calls from
within the cram_get_bam_seq function (called in the main thread) to
the cram_decode_slice function (called inside a worker thread).

This significantly improves parallelisation opportunities.

Better still would be to change the API so that the bam object
returned has an associated free function pointer to deallocate. E.g.:

    get_seq(fd, &s);
    // do stuff
    s->free(s);

Instead of just the "free(s)" we have now. Currently we have to
memcpy our cached bam structures to a new malloced location instead of
returning the address of the precomputed bam structs. Making this
change would remove another 40% or so CPU from the main thread of cram
decoding (not done, but see cram_get_bam_seq for comments).
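A minimal sketch of the proposed per-object free function follows. None of these names come from io_lib: rec_t, decoder_t, get_seq and get_seq_copy are hypothetical stand-ins, and a fixed array stands in for the precomputed bam structs. The point is only that the caller always writes s->free(s), and the object itself decides whether that means releasing a cached slot or freeing heap memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NSLOTS 4 /* assumed size of the precomputed cache */

typedef struct rec rec_t;
struct rec {
    void (*free)(rec_t *r); /* how to release this particular record */
    int  released;          /* set when a cached slot is handed back */
    char name[32];
};

/* Heap copies are really freed. */
static void free_heap(rec_t *r)   { free(r); }
/* Cached records stay owned by the decoder; just mark the slot as done. */
static void free_cached(rec_t *r) { r->released = 1; }

typedef struct {
    rec_t slots[NSLOTS]; /* stands in for records decoded by worker threads */
    int   next;
} decoder_t;

/* Copy-free path: hand back the address of the precomputed record. */
int get_seq(decoder_t *d, rec_t **out)
{
    assert(d && out);
    if (d->next >= NSLOTS)
        return -1;
    rec_t *r = &d->slots[d->next++];
    r->free = free_cached;
    r->released = 0;
    *out = r;
    return 0;
}

/* Copying path for comparison: memcpy into freshly malloced memory. */
int get_seq_copy(decoder_t *d, rec_t **out)
{
    assert(d && out);
    if (d->next >= NSLOTS)
        return -1;
    rec_t *r = malloc(sizeof *r);
    if (!r)
        return -1;
    memcpy(r, &d->slots[d->next++], sizeof *r);
    r->free = free_heap;
    *out = r;
    return 0;
}
```

In both paths the calling code is identical, so a decoder could switch between returning cached addresses and returning copies without changing its callers.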

582. By jkbonfield

Moved the block CRC32 checking from within block I/O to the block
uncompression code.

This has two outcomes:

1) We don't incur integrity checking unless we use the data (both
good and bad).

2) When multi-threading, the CRC computation is spread between cores.

This means CRAM reading is around 10% faster real-time when using -t16.
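The deferred check can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the io_lib implementation: block_t, block_use and the bitwise crc32_bytes routine are hypothetical names, the checksum here covers only the payload, and the decompression step is left as a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_bytes(const unsigned char *p, size_t n)
{
    uint32_t c = 0xFFFFFFFFu;
    while (n--) {
        c ^= *p++;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
    }
    return c ^ 0xFFFFFFFFu;
}

/* Hypothetical block: reading it only records the stored checksum. */
typedef struct {
    const unsigned char *data;
    size_t   len;
    uint32_t stored_crc;
    int      crc_checked;
} block_t;

/*
 * Stands in for the uncompression step. The CRC is verified here, on
 * first use, so a block that is never used is never checked, and with a
 * thread pool the check runs in whichever worker handles the block.
 * Returns 0 on success, -1 on a checksum mismatch.
 */
int block_use(block_t *b)
{
    assert(b && b->data);
    if (!b->crc_checked) {
        if (crc32_bytes(b->data, b->len) != b->stored_crc)
            return -1;
        b->crc_checked = 1;
    }
    /* ... decompress b->data here ... */
    return 0;
}
```

The trade-off is the one the commit message notes: a corrupt block that is skipped goes undetected, in exchange for not paying for checks on unused data.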

581. By jkbonfield

Add io_lib/bgzip.h to pkginclude_HEADERS

(Thanks to German Tischler)

Branch metadata

Branch format:
Branch format 7
Repository format:
Bazaar repository format 2a (needs bzr 1.16 or later)
This branch contains Public information 
Everyone can see this information.
