lp:staden-io-lib-trunk
- Get this branch:
- bzr branch lp:staden-io-lib-trunk
Import details
This branch is an import of the Subversion branch from https://svn.code.sf.net/p/staden/code/io_lib/trunk.
Last successful import was .
Recent revisions
- 590. By jkbonfield
-
Fixed a rare renormalisation bug in the rANS codec.
The symbol frequencies need to sum to TOTFREQ (4096 currently) and are
rounded up/down accordingly. The combination of integer rounding
means the renormalised frequencies don't always total 4096 exactly, so
the remainder is added-to / subtracted-from the most frequent symbol.
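A renormalisation pass of this kind can be sketched as below. This is a hypothetical illustration, not io_lib's actual rANS code: the function name `normalise_freqs` and its exact rounding policy are assumptions, but it shows how the remainder is folded into the most frequent symbol, with a retry at a lower target when that would drive the frequency negative.

```c
#define TOTFREQ 4096

/* Sketch: scale raw symbol counts so they sum to exactly TOTFREQ.
 * Integer rounding means the scaled frequencies rarely hit TOTFREQ
 * exactly, so the remainder is folded into the most frequent symbol.
 * If that would drive it to zero or below, the whole pass is retried
 * with a slightly lower target ("slightly lower products").
 * Assumes nsym is small relative to TOTFREQ (e.g. <= 256 symbols). */
void normalise_freqs(const int *count, int *freq, int nsym) {
    long long total = 0;
    for (int i = 0; i < nsym; i++)
        total += count[i];

    for (int target = TOTFREQ; target > 0; target--) {
        int sum = 0, max_i = 0;
        for (int i = 0; i < nsym; i++) {
            /* round to nearest under the current target */
            freq[i] = (int)((count[i] * (long long)target + total / 2) / total);
            if (count[i] && freq[i] == 0)
                freq[i] = 1;            /* used symbols must stay codable */
            if (freq[i] > freq[max_i])
                max_i = i;
            sum += freq[i];
        }
        freq[max_i] += TOTFREQ - sum;   /* fold remainder into biggest symbol */
        if (freq[max_i] > 0)
            return;                     /* frequencies now sum to TOTFREQ */
        /* remainder exceeded the biggest symbol: renormalise again */
    }
}
```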
In one particular data set this remainder was larger than the most
frequent symbol, causing it to become negative.

We now just do another round of renormalisation with slightly lower
products until we get it right. It's not the fastest solution, but it is a
very rare event.
- 589. By jkbonfield
-
Fix BAM bin value for placed but unmapped reads. (Reported by German
Tischler.)

This corresponds to a SAM spec change from 8th April 2014 where
unmapped data was explicitly stated to have length 1. Io_lib's
implementation assumed unmapped data to be zero length.
- 588. By jkbonfield
-
Fixed a CRAM encoder crash when no @SQ lines are present but the
sequences have reference names in use.
- 587. By jkbonfield
-
Removed a CRAM encoding crash.
When an @SQ line is present but no SN: entry exists, the name field
was NULL but dereferenced.
- 586. By jkbonfield
-
Fixed a compression inefficiency when switching to unsorted mode.
We switch from sorted to unsorted mode only after a couple of tiny
containers have been created. (Ideally we'd detect this upfront.)

We also compute compression metrics on the first few containers and
then keep those stats for the next 100 or so. The combination of
these meant we computed compression metrics based on data that was not
of comparable size to the rest of the container. In one test set this
meant Z_RLE was optimal on the 1-read slices but then applied to
10,000 read slices when Z_FILTERED is preferable (due to lots of
duplicate entries).
- 585. By jkbonfield
-
Removed an uninitialised memory access, although I'm a little unsure
why this is even there! (Bad memory.)

It's in code that is executed when the cram codec fails to initialise,
so I believe this change is a no-op on valid files.
- 584. By jkbonfield
-
Merged in the cram_filter branch.
This tool should still be considered experimental.
- 583. By jkbonfield
-
Improved multi-threaded CRAM decoding.
When given a thread pool, we now migrate the cram_to_bam calls from
within the cram_get_bam_seq function (called in the main thread) to
the cram_decode_slice function (called inside a worker thread).

This significantly improves parallelisation opportunities.
Better still would be to change the API so that the bam object
returned has an associated free function pointer to deallocate. E.g.:

    get_seq(fd, &s);
    // do stuff
    s->free(s);

Instead of just the "free(s)" we have now. Currently we have to
memcpy our cached bam structures to a new malloced location instead of
returning the address of the precomputed bam structs. Making this
change would remove another 40% or so of CPU from the main thread of cram
decoding (not done, but see cram_get_bam_seq for comments).
- 582. By jkbonfield
-
Moved the block CRC32 checking from within block I/O to the block
uncompression code.

This has two outcomes:
1) We don't incur integrity checking unless we use the data (both
   good and bad).
2) When multi-threading, the CRC computation is spread between cores.
This means CRAM reading is around 10% faster real-time when using -t16.
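The deferred-check pattern described in revision 582 can be sketched as below. This is a hypothetical illustration, not io_lib's actual code: the `block_t` struct and the function names are invented for the sketch, and a bitwise CRC-32 stands in for the library's real implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical block: payload plus the CRC stored in the file. */
typedef struct {
    const unsigned char *data;
    size_t size;
    uint32_t stored_crc;
    int crc_checked;   /* deferred: only set once the block is used */
} block_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), the same CRC
 * family used by zlib; slow but dependency-free for this sketch. */
static uint32_t crc32_simple(uint32_t crc, const unsigned char *buf,
                             size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Deferred integrity check: called from the uncompress step (which can
 * run in a worker thread), not from block I/O in the main thread.
 * Returns 0 on success, -1 on mismatch. */
int block_check_crc(block_t *b) {
    if (b->crc_checked)
        return 0;                                   /* already verified */
    if (crc32_simple(0, b->data, b->size) != b->stored_crc)
        return -1;                                  /* corrupt block */
    b->crc_checked = 1;
    return 0;
}
```

Because the check runs inside the per-block uncompress path, unused blocks are never checksummed and, with a thread pool, each worker verifies its own block.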
Branch metadata
- Branch format:
- Branch format 7
- Repository format:
- Bazaar repository format 2a (needs bzr 1.16 or later)