PARSES: A Pipeline for Analysis of RNA-Seq

BiCoB 2011 presentation and master's thesis defense.

Joseph Coco

on 26 May 2011

Transcript of PARSES: A Pipeline for Analysis of RNA-Seq

Features

Documentation: https://github.com/Lythimus/PARSES/wiki

Downloads, installs, and builds the necessary indices for the latest versions of ABySS, MEGAN, BLAST+, NT Database, HG Database, GI to TaxID Nucl Database, Bowtie, Tophat, Novoalign, SAMtools, and Parallel::Iterator.
Serializes all parameters associated with a sequence run to disk in a human-readable, human-alterable form.
Automatically determines the operating system, CPU architecture, number of CPUs, amount of memory, whether a locate database is installed, and the default shell.
Executes the pipeline to any point, automatically detecting where it last left off.
Allows arbitrary execution if the user wishes to run only a specific set of PARSES tasks.
Supports MEGAN4.
Adds BLAST results with no hits into MEGAN.
Supports a link repository so that, if automatic installations fail, updated links can be tried without updating PARSES.
Provides publication-quality graphics of data.

Motivation

Provide an accessible means to discover contamination agents and analyze possible exogenous agents that could be involved in a variety of cancer cell lines.

Acknowledgements

Collaborators
Dr. Erik Flemington & Flemington Lab (Tulane CC)
Dr. Christopher Taylor (UNO and RIC)
Dr. Dongxiao Zhu & Zhu Lab (UNO and RIC)
Ms. Qi Zhang (UNO)

Funding
National Institutes of Health
Research Institute for Children

PARSES

Contamination

>20% of non-primate DNA contaminated with human DNA
Primate-specific AluY repeat used with BLASTN/BLAT.
Alignment if >98% identity.

Approximately one half of this 920 bp sequence entry is >99% identical to human (blue and purple) while the other half is >99% identical to Pseudomonas aeruginosa (green). The Alu alignment used to identify this sequence is shown in purple.

Parameter tuning.
Basic error detection.
No input required on subsequent runs.
Human-readable log of all processes with all parameters run per sequence, results, timing information, and analysis of data flow.
Complete list of all tasks of pipeline.
Clean and Clobber procedures to automatically remove unwanted files.
Supports MacOS and Linux.
Computes ideal parallel BLAST executions.
Supports FASTA, Sanger, Solexa, Illumina 1.3, and Illumina 1.5 data types.

Results

Execution

    rake -f /rake/file/location install
    rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3
    rake -f /rake/file/location seq=NameYouGiveToYourSequence

Output

Amount of data at each portion of PARSES.
Unique comparisons of taxonomies.
Percent of reads from host organism.

Acinetobacter

More research necessary to determine etiology in tumorigenesis and/or progression
Gram-negative genus of bacteria
Often presents as a nosocomial infection
Treated with antibiotics

PARSES: A Pipeline for Analysis of RNA-Seq Exogenous Sequences
Joseph Coco

De Novo Assembly

The N50 score is computed by sorting the contigs by length, then adding them to a new set, largest first, until the set holds more than 50% of the data. The shortest contig in the new set is the N50 score.

Taxonomical Analysis

MEGAN
alignment via BLAST or similar tool.
bins the hits via a lowest common ancestor (LCA) algorithm.
computationally expensive when using NT, WGS, or HTGS.

MEGAN LCA Parameters Examples

[Table: example settings for Min Score, Min Support, Top Percent, and Win Score. Values as extracted, row and column order not preserved: 50, 20, 105, 95, 10, 30, 25, 75, 5, 10, 0, 90]

Sequence Alignment

Removing Human DNA

Novoalign 2.07.04 from Novocraft.
Using HG19 to remove all human DNA.
Must build Novoalign index.

Tophat

Tophat 1.2.0 from the Center for Bioinformatics and Computational Biology.
Using HG19 to remove reads from alternatively spliced coding regions.
Must build Bowtie index.

BLASTN 2.2.24+ by the National Center for Biotechnology Information.
Using the NT database to search a variety of taxonomies; could also use WGS, HTGS, and/or ENV-NT.
BLASTTAB format for efficiency.
E-value of 0.001. Ideally, would execute with an E-value of 100.
Soft masking enabled and DUST filter removed for short reads.

Identifying Taxonomy

Novoalign
BLASTN
GI to TaxID

Build index of all GIs of hits from BLASTN results.
Append taxonomy as final column of BLASTTAB file.

RNA-Seq vs. Microarray

identification of novel splicing isoforms
detects otherwise unannotated transcript regions
not required to tailor to experiment
not required to target organism
broader and more specific quality score range
sensitive to even low concentrations of DNA
measures absolute concentration
has little background noise
scaled to any sequencing depth desired
allele-specific expression
highly reproducible

Diffuse Large B-Cell Lymphoma

40% of all adult lymphomas in Western countries
lymph node structure replaced by sheets of large lymphoma cells
commonly presents in a single intranodal location
affects a wide range of ages (median 70)
commonly affects younger immunosuppressed patients
average untreated survival time is 17 months
potentially curable via adjuvant radiotherapy or chemotherapy
fever, night sweats, weight loss common
bone marrow and peripheral blood involvement
anemia, fever, weight loss, and skin rashes common in late stages of elderly

Diffuse Large B-Cell Lymphoma

Contamination Removal

Largest obstacle in ancient DNA analysis
Extreme physical processes not sufficient
Often due to DNA degradation over time
No standard method to remove contamination post-processing
Results usually not published if contamination exists

Neanderthal fossil Vi-80

Identification of Contamination

Mutu I

Noise Disambiguation

Data set sequence quality
Database sequence quality
Conserved regions
Coverage

5,258 50-bp Pan troglodytes reads extracted
21 Homo sapiens reads averaging 88.85 bp

Sources: De novo Assembly
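
The N50 definition given in the De Novo Assembly section can be sketched in a few lines of Python. This is only an illustration of the wording in the transcript (strictly more than 50% of the total contig length), not code taken from PARSES or ABySS; the function name `n50` and the sample lengths are made up for the example. Many tools use "at least 50%" instead, which can give a different answer on small inputs.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (hypothetical helper)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop once the accumulated set holds more than 50% of all bases;
        # the contig just added is the shortest one in that set.
        if running * 2 > total:
            return length


if __name__ == "__main__":
    # Total length is 290; 80 + 70 = 150 is the first sum above 145.
    print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

Sorting dominates the cost, so the sketch runs in O(n log n) for n contigs.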
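
The Taxonomical Analysis section says MEGAN bins hits via a lowest common ancestor algorithm. The sketch below shows the basic idea only: a read is assigned to the deepest taxon shared by the lineages of all its hits. It is not MEGAN's implementation and does not model the Min Score, Min Support, Top Percent, or Win Score parameters named in the transcript. The function names, the read IDs, and the abbreviated lineages are all hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every root-to-leaf lineage."""
    if not lineages:
        raise ValueError("need at least one lineage")
    ancestor = None
    # zip stops at the shortest lineage; compare rank by rank from the root.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        ancestor = taxa_at_rank[0]
    return ancestor


def bin_reads(read_hits):
    """Map each read ID to the LCA of the lineages of its hits."""
    return {read: lowest_common_ancestor(hits) for read, hits in read_hits.items()}


if __name__ == "__main__":
    # Abbreviated, illustrative lineages (not full NCBI taxonomy paths).
    human = ["root", "Eukaryota", "Primates", "Homo sapiens"]
    chimp = ["root", "Eukaryota", "Primates", "Pan troglodytes"]
    acinetobacter = ["root", "Bacteria", "Acinetobacter"]
    bins = bin_reads({
        "read1": [human, chimp],
        "read2": [human, acinetobacter],
        "read3": [acinetobacter],
    })
    print(bins)
```

In this example a read hitting both human and chimpanzee lands on Primates, a read hitting human and Acinetobacter falls back to the root, and a read with a single hit keeps that hit's taxon.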