Delta Compression for NGS Data - VBI

by Lin An on 17 September 2013


Transcript of Delta Compression for NGS Data - VBI

Shortcomings
Arithmetic compression
BACKGROUND
HIGH THROUGHPUT SEQUENCING
Sequencing technologies by Michael Metzker

COMPARISON OF NGS PLATFORMS
HIGH THROUGHPUT SEQUENCING
http://www.dkfz.de/gpcf/illumina_hiseq_technology.html

BACKGROUND
the automated Sanger method is considered a first-generation technology

newer methods are referred to as next-generation sequencing (NGS)

they rely on a combination of template preparation, sequencing and imaging, and genome alignment and assembly methods

also known as high-throughput sequencing
http://ivory.idyll.org/blog/cloud-not-the-solution.html
BACKGROUND
DNA sequencing output per dollar is increasing faster than storage capacity per dollar

How can the data be stored efficiently?
http://ivory.idyll.org/blog/cloud-not-the-solution.html

BACKGROUND
NGS data output has increased at a rate that outpaces Moore’s law, more than doubling each year since it was invented
Group 4 compression for NGS
Tian Yu & Pawel Weber & Weiwei Zhang



Compression Tools
What's out There?
Prior Information
We need to know where we can exploit redundancy
Delta Compression
Previous Work
THE BOTTOM LINE
Detailed Cluster
FASTQ File
Compression Pipeline Overview
Not all files compress the same...
1.3%
60%
Highly Generic
Compression
Raw genomic data sets contain multiple coverages of the actual sequence
FASTA File
Quality File
ACGTGCCT
ACCTGGCT
ACGTGCCT
2:C 5:G
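The "2:C 5:G" notation above can be read as a delta against a centroid: only the positions where a read differs are stored. Below is a minimal Python sketch of that idea, not the presenters' pre_encode.pl. It assumes 0-based positions, equal-length sequences and substitutions only, and it borrows the "S" marker ("same as centroid sequence") that appears later in this transcript. The function names `delta_encode` and `delta_decode` are hypothetical.

```python
def delta_encode(centroid: str, read: str) -> str:
    """Encode `read` as 0-based substitutions relative to `centroid`."""
    if len(centroid) != len(read):
        raise ValueError("this sketch only handles equal-length sequences")
    diffs = [
        f"{pos}:{base}"
        for pos, (ref, base) in enumerate(zip(centroid, read))
        if ref != base
    ]
    # "S" marks a read that is identical to its centroid.
    return " ".join(diffs) if diffs else "S"


def delta_decode(centroid: str, delta: str) -> str:
    """Rebuild the original read from the centroid and its delta string."""
    if delta == "S":
        return centroid
    bases = list(centroid)
    for item in delta.split():
        pos, base = item.split(":")
        bases[int(pos)] = base
    return "".join(bases)


centroid = "ACGTGCCT"
read = "ACCTGGCT"
delta = delta_encode(centroid, read)
print(delta)  # 2:C 5:G
assert delta_decode(centroid, delta) == read
```

Insertions and deletions would need an alignment step first, which this sketch deliberately leaves out.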
Reformat
Cluster
Compress
Reformat.pl
FastQ
Sequences
Quality
FastQToFastA.py
FastA
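The reformat step splits each FASTQ record into its identifier, sequence and quality string so that each stream can be handled separately. The following Python sketch illustrates that split. It is not the Reformat.pl or FastQToFastA.py scripts named above, whose contents are not shown in this transcript. It assumes plain four-line FASTQ records without line wrapping, and the names `split_fastq` and `to_fasta` are hypothetical.

```python
def split_fastq(fastq_text: str):
    """Split four-line FASTQ records into identifiers, sequences, qualities."""
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    identifiers, sequences, qualities = [], [], []
    for start in range(0, len(lines), 4):
        header, seq, plus, qual = lines[start:start + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {start + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {start + 1}")
        identifiers.append(header[1:])
        sequences.append(seq)
        qualities.append(qual)
    return identifiers, sequences, qualities


def to_fasta(identifiers, sequences) -> str:
    """Render identifiers and sequences as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in zip(identifiers, sequences))


example = "@read1\nACGTGCCT\n+\nIIIIHHGG\n@read2\nACCTGGCT\n+\nIIHHGGFF\n"
ids, seqs, quals = split_fastq(example)
print(to_fasta(ids, seqs), end="")
print("\n".join(quals))
```

Records are read by position in groups of four rather than by searching for "@", because a quality string may itself begin with "@".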
Sequence
Cluster
Uclust
Sequence
Clustering
Multiple Sequence Alignment
Uclust.py
Consensus Sequence generation
Identifier
Singletons
Consensus sequences
pre_encode.pl
Centroid sequence
S = Same as centroid sequence
Sequence in cluster
Arithmetic Coding
Quality File
fp8
Arith_coder
Compressed binary files
ACTT
01000001 01000011 01010100 01010100
011011100
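The slide contrasts the 8-bit ASCII codes of "ACTT" with a much shorter arithmetic-coded bit string. As a rough illustration of why arithmetic coding is shorter, here is a toy Python coder that uses exact fractions and a static frequency model built from the message itself. It is not the Arith_coder tool named above, and it will not necessarily reproduce the bit string on the slide. The names `build_model`, `encode` and `decode` are hypothetical, and a real coder would use fixed-precision integers and would also have to store the model.

```python
from collections import Counter
from fractions import Fraction
from math import ceil


def build_model(message: str) -> dict:
    """Give each symbol a sub-interval of [0, 1) sized by its frequency."""
    counts = Counter(message)
    total = sum(counts.values())
    model, low = {}, Fraction(0)
    for symbol in sorted(counts):
        width = Fraction(counts[symbol], total)
        model[symbol] = (low, low + width)
        low += width
    return model


def encode(message: str, model: dict) -> str:
    """Narrow [0, 1) once per symbol, then emit the shortest bit string inside."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        s_low, s_high = model[symbol]
        low += width * s_low
        width *= s_high - s_low
    n = 0
    while True:
        n += 1
        k = ceil(low * 2**n)  # smallest k with k / 2**n >= low
        if Fraction(k, 2**n) < low + width:
            return format(k, f"0{n}b")


def decode(bits: str, model: dict, length: int) -> str:
    """Invert `encode`, given the model and the number of symbols."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    out = []
    for _ in range(length):
        for symbol, (s_low, s_high) in model.items():
            if s_low <= value < s_high:
                out.append(symbol)
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)


message = "ACTT"
model = build_model(message)
bits = encode(message, model)
print(len(message) * 8, "bits as ASCII,", len(bits), "bits arithmetic-coded")
assert decode(bits, model, len(message)) == message
```

The saving comes from frequent symbols getting wider intervals, so a skewed distribution, such as a quality file dominated by a few scores, needs fewer bits per symbol.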
Quality string
Identifier
Sequence read
http://www.7-zip.org/7ziplogo.png
http://www.gzip.org/gzip3d.png
http://www.bzip.org/images/bzip2-logo.png
http://www.free-winrar.com/wp-content/uploads/2013/04/winrar.png
Quality string
Identifier
identifier removed
order of quality string rearranged
Centroid file
UCLUST ALGORITHM
Quality file
Derivative file
Identifier file
Result
Comparison
---Among different algorithms
Time vs Identity
Clusters vs Identity
Singletons vs Identity
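The comparisons above all vary the identity threshold, which decides whether a read joins an existing cluster or starts a new one. The Python sketch below shows that effect with a greedy, centroid-based pass over a few toy reads. It is a simplified stand-in, not the UCLUST implementation: identity here is a position-by-position match fraction on equal-length reads, with no alignment or word-based search, and the names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold: float):
    """Assign each read to the first centroid it matches, else start a cluster."""
    clusters = []  # list of (centroid, members) pairs
    for read in reads:
        for centroid, members in clusters:
            if identity(centroid, read) >= threshold:
                members.append(read)
                break
        else:
            # No centroid was similar enough: this read becomes a new centroid.
            clusters.append((read, [read]))
    return clusters


reads = ["ACGTGCCT", "ACCTGGCT", "ACGTGCCT", "TTTTAAAA"]
for threshold in (0.7, 0.9):
    clusters = greedy_cluster(reads, threshold)
    singletons = sum(1 for _, members in clusters if len(members) == 1)
    print(f"identity {threshold}: {len(clusters)} clusters, {singletons} singletons")
```

In this toy, a stricter threshold produces more clusters and more singletons, which leaves fewer reads that can be stored as small deltas against a centroid.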
HIGH THROUGHPUT SEQUENCING
DATA: E. coli genome, SRR292770
5,102,041 reads, 46 bp each
Resource: 1 node, 12 cores, 48 GB memory
post_decode.pl
FASTQ File
40.4%
27.4%
31.7%
26.7%
15%
22.5%
14.2%
22%
Caveat
32-bit filesystem
4 GB memory maximum
1 GB clusters
Sequencing run: 200 GB
derivative
*centroid file contains consensus sequences and singletons
Figure from UCLUST website: www.drive5.com
Thank you
Conclusion
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcS5hA2oTG9i9T8obid6J-xq6O6lRG9rjXjYLFGtgP5lW7Qsy9PE
Reference-based compression
Given aligned reads in SAM or BAM format, and the reference sequence to which they are aligned (in FASTA format), the reads are compressed preserving all information in the SAM/BAM file, including the header, read IDs, alignment information and all optional fields allowed by the SAM format. Unaligned reads are retained and compressed using the Markov chain model.
Assembly-based compression
Requires no external sequence database and produces files which are entirely self-contained.
Comparison Between Different Methods
© The Author(s) 2012. Published by Oxford University Press.

Jones D C et al. Nucl. Acids Res. 2012;nar.gks754
One lane of sequencing data from each of six publicly available data sets
was compressed using a variety of methods.
