
A view on bioinformatic pipelines Jun-27

WebValley 2014 Intl
by

Marco Chierici

on 27 June 2014


Transcript of A view on bioinformatic pipelines Jun-27

Conceptual steps
HOW
What is a pipeline?
A sequence of functional units ("modules") that performs a task in several steps, like an assembly line in a factory
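As a rough illustration of the definition above, a pipeline can be sketched as a list of functions applied in order, each one consuming the output of the previous one. The module names (`clean`, `drop_short`, `count_bases`) and the toy reads are hypothetical and are not taken from the slides.

```python
def clean(reads):
    """Module 1: strip whitespace and normalize case."""
    return [r.strip().upper() for r in reads]


def drop_short(reads, min_len=4):
    """Module 2: discard reads shorter than min_len."""
    return [r for r in reads if len(r) >= min_len]


def count_bases(reads):
    """Module 3: count how often each base occurs across all reads."""
    counts = {}
    for read in reads:
        for base in read:
            counts[base] = counts.get(base, 0) + 1
    return counts


def run_pipeline(data, modules):
    """Feed the output of each module into the next one."""
    for module in modules:
        data = module(data)
    return data


# The pipeline is just an ordered list of modules, so swapping or
# inserting a step does not require touching the other steps.
PIPELINE = [clean, drop_short, count_bases]
result = run_pipeline([" acgt ", "ac", "ggta"], PIPELINE)
print(result)
```

Because each module only depends on its input, a single step can be replaced or tested in isolation, which is the modularity the slides argue for.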
State of the art
tech aspects
WHY
Preprocessing
Screening
Sequence
Alignment
Postprocessing
Quantification
Data matrix:
Statistical analysis
Predictive classification
Outcomes
Predictive biomarkers
Example: NGS pipeline
Easily repeatable
Start-to-finish task automation

Modularity
Computational time
Parallelization
Memory usage
Understand the aims of your analysis
Decompose the problem into a sequence of "modules"
Assess the possible implementations of each module
Study the biological background
Evaluate the pipelines designed for similar problems
Pros and cons of the solutions adopted for each module (with respect to your data)
Open Source
Hardware infrastructure
Software environment
Modularity
Input/Output formats
Provide proper documentation
Define the workflow
Repeatability
Filtering
Trimming
Example: bad sequences
Example: good sequences
sequence length
sequence quality
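The filtering and trimming steps above can be sketched in a few lines. This is a minimal illustration, assuming FASTQ-style quality strings in Phred+33 encoding; the thresholds (`min_q=20`, `min_len=30`) and function names are hypothetical choices, not values from the slides. Dedicated tools such as those listed later in the transcript handle many more cases.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, quality, min_q=20):
    """Trimming: cut low-quality bases from the 3' end of a read."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_filter(seq, quality, min_len=30, min_mean_q=20):
    """Filtering: keep reads that are long enough and good enough on average."""
    if not seq or len(seq) < min_len:
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_mean_q


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
print(trimmed_seq, passes_filter(trimmed_seq, trimmed_qual, min_len=4))
```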
Burrows-Wheeler / Spaced seeds

ALIGNMENT

Short Reads
Reference databases (human, microbial, ...)

Fragmentation
Burrows-Wheeler transform
Spaced-seeds
Hash tables
BWA
Bowtie/Bowtie2
STAR
BFAST
SSAHA2
(Local) alignment algorithms
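To make the Burrows-Wheeler transform mentioned above concrete, here is a naive sketch that builds it by sorting all rotations of the text. It is for illustration only: aligners such as BWA and Bowtie rely on far more memory-efficient index structures built on top of the transform, and the `$` sentinel is an assumption of this sketch.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations (O(n^2 log n))."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("banana")))  # banana
```

The transform groups identical characters with similar right-hand context, which is what makes the compressed, searchable indexes used by short-read aligners possible.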
Mapping quality
Unique alignments
Properly paired alignments
PCR duplicates removal
Networks
Differential abundance/expression
normalize counts to compensate for sources of variation in data
filter out the features with normalized counts below a chosen cut off
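The two steps above (normalize, then filter) might look like the following sketch. Counts-per-million scaling is used here only as one simple example of a normalization that compensates for sequencing depth; the slides do not say which method was used, and the cutoff and gene names are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {feature: c * 1_000_000 / total for feature, c in counts.items()}


def filter_low(normalized, cutoff):
    """Drop features whose normalized count is below the chosen cutoff."""
    return {f: v for f, v in normalized.items() if v >= cutoff}


raw = {"geneA": 900, "geneB": 99, "geneC": 1}  # hypothetical raw counts
cpm = counts_per_million(raw)
kept = filter_low(cpm, cutoff=5000)
print(sorted(kept))
```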
Same data + same code = same results
Data deposited in public repositories

Raw data not available any more

All/some preprocessing steps lost

It won't be possible to start the analysis from the same point, regardless of whether the same code is used
What version of a particular software was used?

Is it still available?

Is the software open or proprietary?

Better reproducibility if pipelines are carefully designed!!
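One way to address the questions above is to write a small manifest next to every result, recording a checksum of the input data and the versions of the tools used. This is only a sketch under stated assumptions: the manifest layout and the `tool_versions` argument are hypothetical, and the tool name and version shown are placeholders rather than values from the slides.

```python
import hashlib
import json
import platform


def sha256_of_bytes(data):
    """Checksum that identifies the exact input data that was analysed."""
    return hashlib.sha256(data).hexdigest()


def run_manifest(input_bytes, tool_versions):
    """Collect what is needed to restart the analysis from the same point."""
    return {
        "input_sha256": sha256_of_bytes(input_bytes),
        "python_version": platform.python_version(),
        "tool_versions": dict(tool_versions),  # supplied by the user
    }


# Placeholder tool name and version, for illustration only.
manifest = run_manifest(b"ACGT\n", {"example-aligner": "0.0.0"})
print(json.dumps(manifest, indent=2, sort_keys=True))
```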

Parallel computing
Reads filtering
Functional profiling
Blastn: reads mapped on 126 human gut bacterial genomes
Clustering
Compare genes/pathways among groups
Classification
Random Forest: identify population- and age-specific sets of genes
Functional changes
Compare microbiome-encoded functions:
WGS data analysis
Short reads (<60 nucleotides)

Duplicates
BLASTX: reads aligned on 1,280 genomes in KEGG
KEGG KO annotations (genes)
Spearman rank correlation
Find under/overrepresented genes
Cross-validation (fivefold and leave-one-out)
ECs importance averaged over 100 rarefactions
within and between the 3 populations;
within and between families;

between children and adults.
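The cross-validation schemes named above (fivefold and leave-one-out) differ only in how the samples are split. The sketch below generates the train/test index splits in plain Python; the function name and the fixed seed are assumptions, and the classifier itself (Random Forest in the slides) is deliberately left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


fivefold = list(k_fold_indices(10, k=5))
# Leave-one-out is the special case where k equals the number of samples.
leave_one_out = list(k_fold_indices(10, k=10))
print(len(fivefold), len(leave_one_out))
```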
KEGG EC numbers (enzymes)
Hellinger distance
Removed:
Reads highly similar to human genomes
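The Hellinger distance listed above compares two abundance profiles after converting them to proportions. The sketch below uses the form normalized to the range 0 to 1 (with the factor 1/sqrt(2)); this is an assumption, since some ecology software omits that factor, and the slides do not specify which variant was used.

```python
import math


def to_proportions(counts):
    """Convert raw counts (e.g. reads per enzyme) to relative abundances."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two proportion vectors of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


sample_a = to_proportions([10, 30, 60])  # hypothetical counts
sample_b = to_proportions([20, 20, 60])
print(round(hellinger(sample_a, sample_b), 4))
```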
Big data
Provide concurrency
Save time
Coordinate multiple processing tasks to solve a single problem
We need many machines
Machines will store and process data
Use non-local resources
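The idea of coordinating several tasks to solve a single problem can be sketched on one machine with Python's standard `concurrent.futures` module: the reads are split into chunks, each chunk is filtered independently, and the results are merged. The helper names and chunking scheme are hypothetical. A thread pool is used here to keep the sketch simple; for CPU-bound work a process pool, or a cluster scheduler, would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def keep_long_reads(chunk, min_len=4):
    """Work unit: filter one chunk of reads, independently of the others."""
    return [read for read in chunk if len(read) >= min_len]


def split_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_filter(reads, n_workers=4):
    """Filter chunks concurrently, then merge results in input order."""
    chunks = split_chunks(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(keep_long_reads, chunks))
    return [read for chunk in results for read in chunk]


reads = ["ACGT", "AC", "GGTACA", "T", "TTGAC"]
print(parallel_filter(reads))
```

Because no chunk depends on any other, this kind of read filtering is the "embarrassingly parallel" case: the speed-up is limited mainly by the number of workers available.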
The FBK Kore cluster
1100 cores, 8 TB RAM,
200 TB storage, 100 users
Cloud Computing
Use of computing resources as a service

software
platform
infrastructure
automated setup of resources
Resources can be remote or local
but, in particular, you care about what kind of resource you need and when;
you typically do not care about how and where
Principles of cloud Computing
Automation
Virtualization
Pay-as-you-go
schedule IT tasks
users have the complete and automatic control over their resources
decouple software from hardware
Elasticity
scale up/down your resources based on your needs
rent software and hardware, do not buy
Step 0: validation.
Test your pipeline on synthetic/benchmark datasets.
Useful tools: FastQC, FASTX-Toolkit, seqtk, seqmagick
Embarrassingly parallelizable task!
Useful software: SAMtools (C), Picard tools (Java).
count reads mapped on genomic features (microbes, genes, transcripts, ...), e.g. to compute relative abundances
Useful tools: HTSeq (Python), featureCounts, MetaPhlAn (Python)
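Conceptually, the quantification step reduces to counting how many reads were assigned to each feature and dividing by the total. The sketch below assumes the alignment step has already produced one feature label per read, with `None` for unmapped reads; that input format and the taxon names are hypothetical, and this is not how the tools listed above are invoked.

```python
from collections import Counter


def count_features(assignments):
    """Count reads per feature, ignoring unmapped reads (None)."""
    return Counter(a for a in assignments if a is not None)


def relative_abundance(counts):
    """Divide each feature count by the total number of mapped reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: c / total for feature, c in counts.items()}


# One label per read: hypothetical output of the alignment step.
assignments = ["Bacteroides", "Prevotella", "Bacteroides", None]
counts = count_features(assignments)
abundances = relative_abundance(counts)
print(counts["Bacteroides"], round(abundances["Prevotella"], 3))
```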
[Kang et al, 2013]
Taxonomic composition
Functional profiling
Gene prediction
Pathway analysis/inference
Variant detection
Microbial community structure
Case study: 16S and WGS
OTU picking
closed-reference
Taxonomic profiling
Unweighted UniFrac distances between samples
OTU clustering
Find structure in the data
Classification
Random Forest: find most discriminatory OTUs among groups
Taxonomic changes
Find differences in gut microbiota composition:
16S dataset
Greengenes database

97% similarity threshold
alpha and beta diversity
Rarefaction analysis
PCoA of UniFrac distances
Distance-based clustering
Cross-validation (fivefold and leave-one-out)
OTUs importance averaged over 100 rarefactions
within and between the 3 populations;
within and between families;

between children and adults.
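The rarefaction step referred to above (OTU importance averaged over 100 rarefactions) amounts to repeatedly subsampling each sample to the same sequencing depth. A minimal sketch of a single rarefaction follows, drawing reads without replacement; the function name, seed handling, and OTU labels are assumptions rather than the method used in the study.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample an OTU count table to `depth` reads without replacement."""
    # Expand the table into one entry per read, then draw `depth` of them.
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    sample = random.Random(seed).sample(pool, depth)
    return dict(Counter(sample))


otu_counts = {"OTU_1": 50, "OTU_2": 30, "OTU_3": 20}  # hypothetical sample
# Repeating with different seeds gives the independent rarefactions
# over which a statistic can then be averaged.
rarefied = [rarefy(otu_counts, depth=40, seed=s) for s in range(3)]
print(rarefied[0])
```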
FBK-Trento
WebValley Intl 2014

Marco Chierici
Alessandro Zandonà

A view on bioinformatic pipelines
BLAST