Loading presentation...

Present Remotely

Send the link below via email or IM


Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.


PARSES: A Pipeline for Analysis of RNA-Seq Exogenous Sequences

Project to create a pipeline to analyze a tumor's DNA in search of foreign DNA.

Joseph Coco

on 7 April 2011

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of PARSES: A Pipeline for Analysis of RNA-Seq Exogenous Sequences

Dr. Erik Flemington & Flemington Lab (Tulane CC)
Dr. Christopher Taylor (UNO and RIC)
Dr. Dongxiao Zhu & Zhu Lab (UNO and RIC)
Ms. Qi Zhang (UNO) Collaborators Funding National Institute of Health
Research Institute for Children PARSES: A Pipeline for Analysis of RNA-Seq Exogenous Sequences Joseph Coco Need overall perspective of overall quality of data, quality of repeat reads, and of non-mapped reads.
MATLAB tools available but require Bioinformatics Toolbox.
FASTX-Toolkit 0.0.13 of Hannon Lab of CSHL capable of graphic and text statistical display.
Requires libgtextutils-0.6. /home/joseph/bin/fastq\ toolkit/fastx_quality_stats -Q33 -i /media/extra/joseph/Dropbox/Work/data/tumorNoMap/miRcntl_inf1_run1.novo.NM.fastq -o /media/extra/joseph/Dropbox/Work/data/tumorNoMap/miRcntl_inf1_run1.novo.NM.fastq.stats && /home/joseph/bin/fastq\ toolkit/fastq_quality_boxplot_graph.sh -i /media/extra/joseph/Dropbox/Work/data/tumorNoMap/miRcntl_inf1_run1.novo.NM.fastq.stats -o /media/extra/joseph/Dropbox/Work/data/tumorNoMap/miRcntl_inf1_run1.novo.NM.fastq.stats.quality.png Wrote blastContigs, a simple perl script to find the kmerOptimized ABySS FASTA files and BLAST them with aforementioned options.
Appends '.blast' to all files.
Uses glob command. /media/extra/joseph/Dropbox/Work/data/tumorNoMap/blastContigs.pl /media/extra/joseph/Dropbox/Work/data/tumorNoMap/miRcntl_inf1_run1.novo.NM.fasta.nospans.NT.*.kmer.contigs.fasta.kmerOptimized.fasta 100 5 Continue creating pipeline via rake file to download, install, configure pipeline.
Better test WinScore in MEGAN.
Presentation-quality data reporting for indepth analysis of PARSES.
Tune Parameters.
Research using Galaxy for a different version of PARSES.
Create a test unit so changes can safely be made. Future Read Quality Score by Base Pair ABySS degrades low quality reads so no need for trimming. Sequence Alignment Automate Ruby Make file.
Not part of standard installation.
Dependency resolution based on tasks or files which allows ensuring applications are installed and all of the pipeline has been run up to the requested point.

task :pancake => [:flour,:milk,:egg,:baking_powder] do
    puts "sizzle"

Logic in Ruby--MUCH easier and cleaner than make.
Easy logging with built-in logger library and human readable serialization with built-in yaml library.
Can return results from shell commands. Rake Full Automation A fair guess that the user doesn't have Ruby/Rake installed:
Find a way to compile the rake file into an executable.
Find a way to have the rake file run as a bash script if Ruby/Rake is not installed.
#!/bin/sh would be interpretted as a comment.
Backquoted commands would run in both Ruby and bash as bash commands.
Alternatively, wrap entire process in a bash or other type of frontend and have user always interact with it. Requirements Features Downloads, installs and builds necessary indices of latest versions of ABySS, MEGAN, BLAST+, NT Database, HG Database, GI to TaxID Nucl Database, Bowtie, Tophat, Novoalign, SAMtools, and Parallel::Iterator.
Serializes all parameters associated with a sequence run to disk in a human-readable, human-alterable form.
Automatically determines operating system, CPU architecture, number of CPUs, amount of memory, if locate database is installed, and default shell.
Execution of pipeline to any point with automatic detection for where the pipeline was last left off. Parameter tuning.
Basic error detection.
No input required on subsequent runs.
Human-readable log of all processes with all parameters run per sequence, results, timing information, and analysis of data flow.
Complete list of all tasks of pipeline.
Clean and Clobber procedures to automatically remove unwanted files.
Supports MacOS and Linux.
Computes ideal parallel BLAST executions. Pipeline --- !ruby/object:Sequence
abyssInputRegex: /mnt/hgfs/Dropbox/Work/tumornomap/data/miRcntl_inf1_run1.novo.fasta.NM.fasta
blastInputRegex: /mnt/hgfs/Dropbox/Work/tumornomap/data/miRcntl_inf1_run1.novo.fasta.NM.fasta.*.kmer.contigs.fasta.kmerOptimized.fasta
blastOutputFormat: 10
eValue: 100
filePath: /mnt/hgfs/Dropbox/Work/tumornomap/data/miRcntl_inf1_run1.novo.fasta
imageFileType: jpg
maxMatches: 0
meganBlastInputRegex: /mnt/hgfs/Dropbox/Work/tumornomap/data/miRcntl_inf1_run1.novo.fasta.NM.fasta.*.kmer.contigs.fasta.kmerOptimized.fasta.blast
minKmerLength: 7
minScoreByLength: 0
minSupport: 5
pipeEndRegex: /mnt/hgfs/Dropbox/Work/tumornomap/data/miRcntl_inf1_run1.novo.fasta.NM.fasta.*.kmer.contigs.fasta.kmerOptimized.fasta.blast.megan.rma
readLength: 50
topPercent: 10.0
winScore: 0.0 The Min Support item can be used to set a threshold for the minimum support that a taxon requires, that is, the number of reads that must be assigned to it so that it appears in the result. Any read that is assigned to a taxon that does not have the required support is counted as unassigned. By default, the minimum number of reads required for a taxon to appear in the result is 5.
The Min Score item can be used to set a minimum threshold for the bit score of hits. Any hit in the input data set that scores less than the given threshold is ignored. By default, the minimum score to be accepted as a hit is 35.0.
The Top Percentage item can be used to set a threshold for the maximum percentage by which the score of a hit may fall below the best score achieved for a given read. Any hit that falls below this threshold is discarded. Top Percent is 10.0 by default.
The Win Score item can be used to try and separate matches due to sequence identity and ones due to homology. If a win score is set, then, for a given read, if any match exceeds the win score, only matches exceeding the win score (“winners”) are used to place the given read. The hope is that secondary, homology-induced matches are discarded in the presence of stronger primary matches. Win Score is 0.0 by default. MEGAN +g -x "open meganfile=/media/extra/joseph/Dropbox/Work/data/tumorNoMap/allComparison.megan; expand direction=vertical; update;expand direction=vertical; update;expand direction=vertical; update;expand direction=vertical; update;expand direction=vertical; update;expand direction=vertical; update; uncollapse all; exportgraphics format=PDF file=/media/extra/joseph/Dropbox/Work/data/tumorNoMap/allComparison.pdf;quit" LCA Parameters Can likely replace Dr. Taylor's taxony scripts by increasing minSupport and topPercent and performing binning based on that.
Tune parameters until a certain percentage of the data is culled.
Record results from a range of parameters and allow human to decide which option is best, potentially utilizing some sort of slider system.
Tune parameters at end if unhappy with result by simply raising or lowering them. Metagenomic Analysis Phymm, which uses Interpolated Markov Models for classification
Requires 200 GB.
Having issues running Phymm.
PhymmBL does not support BLAST+
Ideal for short reads.
Faster and more accurate than BLAST and MEGAN. Alternatives Fully expand not documented and not an option via command line.
o Entered logical commands.
o Debugged with JDB.
o Decompiled with JD-GUI 0.3.3, ran all files against strings and grepped on desired command.
Repeated vertical expansions emulates Fully Expand.
Update command must be sent after any change.
meganContigs.pl to automate MEGAN processing. Automate Taxonomy Analysis Comparison of data sets MEGAN +g -x "import blastfile=/media/extra/Erik/miRcntl_inf1_run1.novo.NM.fasta.nospans.blast readfile=/media/extra/Erik/miRcntl_inf1_run1.novo.NM.fasta.nospans meganfile=/media/extra/Erik/miRcntl_inf1_run1.novo.NM.fasta.nospans.megan.rma maxmatches=100 minscore=35.0 toppercent=10.0 winscore=0.0 minsupport=5 summaryonly=false usecompression=true usecogs=false usegos=false useseed=false; quit;" DeNovo Assembly Novoalign Novoalign 2.07.03 from Novocraft.
Using HG19 to remove all human DNA.
Highly accurate, optimized for short sequences.
Can set maximum mismatch limit.
Must build Novoalign index.
Supports paired-end reads.
MPI and multi-treaded versions.
Needleman-Wunsch based.
Not suitable for 454 or ABI Solid reads. novoalign -c 4 -d hgChr19.ndx -F ILMFQ -f miRcntl_inf1_run1.novo.NM.fasta > miRcntl_inf1_run1.novo.NM.fasta.NM Tophat Tophat 1.1.1 from Center for Bioinformatics and Computational Biology.
Using HG19 to remove alternate splicing of coding regions.
Sliced reads mapper using Bowtie as sequence aligner.
Can set maximum mismatch limit.
Designed for Illumina reads.
Must build Bowtie index.
Supports multi-threading. tophat --solexa1.3-quals /usr/share/bowtie.ndx miRcntl_inf1_run1.novo.NM.fasta.NM Open megan file and produce PDF MEGAN +g -x "open meganfile=/media/extra/joseph/Dropbox/Work/data/tumorNoMap/allComparison.megan; export data=reads file=ebv.fa taxid=10376; quit;" Export all reads from taxonomy MEGAN +g -x "open meganfile=/media/extra/Erik/miRcntl_inf1_run1.novo.NM.fasta.nospans.megan.rma; uncollapse all; update; expand direction=vertical; update; expand direction=vertical; update; expand direction=vertical; update; expand direction=vertical; update; expand direction=vertical; update; expand direction=vertical; update; set minscore=35; update; set minsupport=5; update; exportgraphics format=PDF file=/media/extra/joseph/Dropbox/Work/tumornomap/data/minscore35minsupport5.pdf; set minsupport=30; update; exportgraphics format=PDF file=/media/extra/joseph/Dropbox/Work/tumornomap/data/minscore35minsupport30.pdf; quit" Produce two PDFs with distinct views Taxonomy Extraction Logarithmically scaling number of reads assigned to taxony versus number of qualifying reads of potential other taxonomies Data Views Other EBV extracted and assembled Medium & Low Priority High Priority Compare Velvet and VelvetOptimizer contigs with ABySS contigs using timing information and N50 scores for optimized contigs.
Execute Phymm and compare results with MEGAN.
Consider adding data to a Postgres database.
Add inclusive machine learning or human input process to tune parameters at end of process.
Add maxmimum kmer size detection to pipeline or force recompilation.
Add paired end support for Novoalign, Tophat, ABySS, BLAST, and MEGAN.
Convert to modules instead of specific applications which can be choosen specifically by user or automatically by system specifications, data quality, data type, time limitations, etc.
Figure out how to use 'extract' reads instead of 'export' reads in MEGAN.
Research GO and COG functionality of MEGAN.
Histrogram comparison of best N50 vs. worst N50.
Determine process to inject reads into MEGAN from Novoalign or Tophat.
Consider visualizing genome coverage of organism after DeNovo Assembly.
Determine if it would be beneficial to incorporate GPU versions of tools.
Bundle together results and intermediate files once a run is complete.
Benchmark PARSES for CPU, RAM, and disk usage.
Allow for either Pairwise or BLASTTAB formats.
Design an intelligent scheduler which will execute batch operations. Acknowledgements Motivation References Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215(3):403-10.
Birol I, Jackman SD, Nielsen CB, et al. De novo transcriptome assembly with ABySS. Bioinformatics (Oxford, England). 2009;25(21):2872-7.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Research. 2007:377-386.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 2009;10(3):R25.
Lefébure T. Next Generation Sequencing Workshop – De novo genome assembly –. Sciences-New York. 2010.
Novocraft. 26 Oct 2010 09:29 PM. http://www.novocraft.com/main/index.php
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England). 2009;25(9):1105-11.
Weirich J, et al. Rake. May 16 2009 10:53 PM. http://rubyforge.org/projects/rake/index.html ABySS 1.2.4 from BC Cancer Agency.
Stitched together 50 bp reads.
Low system requirements and fast run time.
Does not support supercontigs (contigs with mismatches).
Easy to use, outputs simple FASTA files.
Requires Kmer optimization for best results.
Decent documentation but not descriptive.
Able to remove low quality bps from reads which increases speed and decreases memory usage.
Developer recommends against removing low quality base pairs because low quality reads will be eroded if less than the square root of the median kmer coverage.
Does not support tracking which reads were used in constructing a contig. ABYSS -k 31 --illumina-quality --out=/media/extra/Erik/miRcntl_inf1_run1.novo.NM.fasta.nospans.NT.other.contigs.fasta /media/extra/Erik/miRcntl_inf1_run1.novo.NM.fasta.nospans.NT.other.fasta ABySS N50 score computed by sorting the contigs by length, then adding to a new data set the largest until more than 50% of the data is in it. The shortest contig in the new set is the N50 score. No documentation.
Failed to produce optimization formula.
N50 score, plots area of reads and selects kmer which encompasses 50% of the data.
Fixed ABySS-fac perl script to output N50 score.
Wrote abyssKmerOptimizer perl script to run ABySS with ideal parameters. ./media/extra/joseph/Dropbox/Work/code/tumorNoMap/abyssKmerOptimizer.pl miRcntl_inf1_run1.novo.NM.fasta.nospans.NT.*.fasta vo.NM.fasta.nospans.NT.other.fasta 7 50 Sorts results of ABySS-fac by N50 score and extracts best.
Utilitizes Parrallel::Iterator to check all kmer simultaneously without complicating or cluttering code.
Input multiple files as a glob so commands don't need to be strung together.
Renames best kmer output file to the FASTA file's original name with '.#.kmer.contigs.fasta.kmerOptimization.fasta' appended where # is the ideal kmer. Kmer Optimization Velvet 1.0.14 from EMBL-EMI
Stiched together 50 bp reads.
Easy to use, outputs simple FASTA files.
Requires Kmer optimization for best results--easily acheivable with VelvetOptimizer.
Fair documentation.
Uses De Bruijn graphs to produce contigs.
Specify MAXKMERLENGTH when building. perl VelvetOptimiser.pl -t 4 -s 7 -e 49 -f /Users/Joseph/Dropbox/Work/tumornomap/data/ebv.fasta Velvet Possible skew because optimal kmer was an even number
ABySS seems better, need to run on more datasets and getting timing information. ABySS vs Velvet BLAST 2.2.24+ by National Center for Biotechnology Information.
BLASTn for nucleotide.
E value of 0.001. Ideally, would execute with E value of 100.
Soft Masking for MEGAN processing.
Remove dust filter for short reads.
Using NT database but could also use WGS and/or ENV-NT
MPI and multi-threaded versions. blastn -db /Users/Shared/BlastDBs/nt -soft_masking true -num_threads 4 -outfmt 5 -evalue 0.001 -query /Users/Shared/Erik/Sequencemappings/MiRcntl/miRcntl_inf1_run1.Nonmapped/miRcntl_inf1_run1.novo.NM.fasta.nospans -out /Users/Shared/Erik/Sequencemappings/MiRcntl/miRcntl_inf1_run1.Nonmapped/miRcntl_inf1_run1.novo.NM.fasta.nospans.blast BLAST+ Features Rajesh Gollapudi, Kashi Vishwanath Revanna, Chris Hemmerich, Sarah Schaack, and Qunfeng Dong (2008) BOV - A Web-based BLAST Output Visualization Tool, BMC Genomics, 9(1):414. Provide an accessible means to discover contamination agents and analyze possible exogenous agents that could be involved in a variety of cancer cell lines. gcc/make
Dependent on dataset size
Solexa/Illumina data No Hits and Not Assigned Injection No Hits Not Assigned Original No Hits + Not Assigned No Hits Not Assigned MinSupport=5
WinScore=0 Overview of data No blast results Did not meet LCA parameters MinSupport=30
WinScore=0 MinSupport=150
WinScore=0 MinSupport=5
WinScore=0 MinSupport=30
WinScore=0 MinSupport=150
WinScore=0 MinSupport=5
WinScore=0 MinSupport=30
WinScore=0 MinSupport=150
WinScore=0 MinSupport=5
WinScore=0 Tuning LCA parameters MinSupport=10
WinScore=0 MinSupport=10
WinScore=120 MinSupport=10
WinScore=70 MinSupport=10
WinScore=40 Digging Deeper Data Comparisons datafile.fastq
alignSequence (Novoalign)
removeHuman (Xnovotonm)
removeSpans (Tophat)
localAlignReads (BLAST+)
metaGenomeAnalyzeReads (MEGAN)
denovoAssembleCluster (ABySS)
localAlignContigs (BLAST+)
metaGenomeAnalyzeContigs (MEGAN)
datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma Examples Output Log # Logfile created on Mon Nov 29 20:21:37 -0600 2010 by logger.rb/22285
I, [2010-12-27T14:54:16.761320 #94508] INFO -- : Begin run for seq=akata file=s_4_sequence_Akata.txt type=illumina1.3 task=-f
I, [2010-12-27T14:54:16.849927 #94508] INFO -- : Executing PARSES v0.30 in a 0 environment with 64GB of memory and 24 cores with a 64-bit architecture.
I, [2010-12-27T17:56:20.865114 #94508] INFO -- : tophat -p 24 --solexa1.3-quals --output-dir akata_tophat_out /usr/share/hgChrAll s_4_sequence_Akata.txt
I, [2010-12-27T17:56:20.865442 #94508] INFO -- : samtools view -h -o akata_tophat_out/accepted_hits.sam akata_tophat_out/accepted_hits.bam rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fasta type=illumina1.3 First Execution Run to Specified Point rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fasta type=illumina1.3 localAlignContigs Keep track of reads which were not used in contigs in a separate file
Make PARSES fully automated such that no prequistites exist.
Add support for other read types.
Add a GUI.
Add a program link repository incase one of the links to download the tools in PARSES change.
MEGAN auto-registration.
Add a system to notify the user at specific points during PARSES either through SMS, email, or Twitter.
Add support for cluster computing.
Record how many base pairs factor into the construction of a contig and duplicate the contig respectively divided by its length to simulate retention of concentration information.
Add support for targeting specific non-human DNA removal.
Add a progress bar.
Consider using ENV NT and/or WGS in addition to or as a pre-processing step to using NT.
Search for a faster method to BLASTn sequences. Supports FASTA, Sanger, Solexa, Illumina 1.3, and Illumina 1.5 data types.
Documentation: https://github.com/Lythimus/PARSES/wiki
Allow for arbitrary execution if user wishes to use only a specific set of tasks of PARSES.
Support for MEGAN4.
BLAST results with no hits added into MEGAN.
Support for a link repository in the event automatic installations fail it can try updated links without updating PARSES.
Provides publish-quality graphics of data. MEGAN 3.9 of Universität Tübingen.
Fed BLAST results into MEGAN.
Subsequently, fed BLAST ABySS results into MEGAN.
Java application, resource hog.
Give Java more memory:
export _JAVA_OPTIONS="-Xms128m -Xmx4096m -Xincgc"
Soft filtering gives better results with BLASTn.
Vertical spacing must be altered to be human readable.
Can perform comparisons, but all files must be open.
Provides several different views and scalings.
Allows individual read alignment inspections and alterations.
Well documented in manual.
Unable to change LCA parameters with injected reads or comparison data sets. MEGAN Create megan file Introduction RAIphy, based on an information theoretic measure referred as Relative Abundance Index
Required to create database.
Having issues creating database.
May not perform well with short reads.
Faster and more accurate than Phymm except with shorter reads. PARSES Lite Galaxy Supports all but DeNovo Assembly.
ABySS could likely easily be added and the developer has agreed to help porting it.
Would handle intelligent scheduling of batch data sets.
Would provide a GUI.
Modularization would already be implemented.
Could potentially add support for other data types.
Gives user much more more abilities to easily manipulate data.
Robust platform.
Could potentially implement a simpler version on standard Galaxy server.
Full transcript