Genome Assembly for Beginner

Genome Assembly for Beginner V120522

Wenbin Chen

on 10 March 2016


Transcript of Genome Assembly for Beginner

Genome Assembly
More Fun
Thank you for your attention!
And one more thing...
is here
Filter reads in which more than X percent of the bases are N.
Filter reads with many low-quality bases.
Filter adapter contamination.
(e.g. match length >= 10 bp, mismatches <= 3)
Filter PCR duplicates.
Filter read pairs with a small insert size.
(e.g. overlap >= 10 bp, mismatch rate <= 10%)
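The first two filters above can be sketched in a few lines of Python. This is only an illustration: the function names and the default thresholds (10% N, quality 20, 40% low-quality bases) are assumptions chosen for the example, not values taken from the slides.

```python
def n_fraction(seq):
    """Fraction of bases in the read that are N (1.0 for an empty read)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def low_quality_fraction(quals, min_q):
    """Fraction of bases whose quality score is below min_q."""
    return sum(q < min_q for q in quals) / len(quals) if quals else 1.0


def keep_read(seq, quals, max_n_frac=0.10, min_q=20, max_low_q_frac=0.40):
    """Return True if the read passes both filters.

    All three thresholds are hypothetical defaults for this sketch.
    """
    if n_fraction(seq) > max_n_frac:
        return False  # too many N bases
    if low_quality_fraction(quals, min_q) > max_low_q_frac:
        return False  # too many low-quality bases
    return True
```

Adapter, duplicate and insert-size filtering need pair-level information and are not covered by this sketch.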
Use a branch-and-bound (BB) tree to correct sequencing errors.
All K-mer paths added to the BB tree consist of high-frequency K-mers.
Low-frequency K-mer paths are not added to the BB tree.
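The split into high-frequency and low-frequency K-mers can be illustrated as follows. This sketch only counts K-mers and applies a cutoff; it does not implement the branch-and-bound tree itself, and the function names and the `min_freq` cutoff are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every K-mer of length k across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_frequency(counts, min_freq):
    """Split K-mers into high-frequency (>= min_freq) and low-frequency sets."""
    high = {kmer for kmer, c in counts.items() if c >= min_freq}
    low = set(counts) - high
    return high, low


# Three identical reads and one read with a single-base difference.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
high, low = split_by_frequency(count_kmers(reads, 4), min_freq=2)
```

In this toy input the K-mers that occur only in the fourth read fall into the low-frequency set, which is the kind of signal an error corrector uses.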
Data filter
Error Correction

Support Theory
Kmer Analysis
Select K-mer to traverse the entire genome

The K-mer depth frequency distribution follows a Poisson distribution.
knum: The number of K-mers
kdepth: The expected depth of K-mers
bnum: The number of bases
bdepth: The expected depth of bases
G: The genome size.
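The quantities defined above are commonly related by G = knum / kdepth = bnum / bdepth, with kdepth = bdepth × (L − K + 1) / L for reads of length L. The transcript does not preserve the formula from the slide, so treat these relations, the read length L and the function names below as assumptions of this sketch.

```python
def kmer_depth(bdepth, read_len, k):
    """Expected K-mer depth for a given base depth, read length and K."""
    return bdepth * (read_len - k + 1) / read_len


def genome_size(knum, kdepth):
    """Estimate genome size G as total K-mer count divided by K-mer depth."""
    return knum / kdepth


# Toy numbers: 3 million reads of 100 bp from a 10 Mb genome, K = 17.
n_reads, read_len, k = 3_000_000, 100, 17
bnum = n_reads * read_len                # total bases
knum = n_reads * (read_len - k + 1)      # total K-mers
bdepth = bnum / 10_000_000               # base depth, 30x in this example
estimate = genome_size(knum, kmer_depth(bdepth, read_len, k))
```

In practice kdepth is read off the main peak of the K-mer depth distribution rather than computed from a known genome size.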
☆ Sequencing error
☆ Heterozygosity rate
☆ Repeats
☆ Contamination
☆ Sequencing depth
First Glance
★ If the heterozygosity rate is high, a small peak appears at half the expected K-mer depth.

★ If the genome contains a high proportion of repeats
Kmer Analysis
Try a Case

-d Delete the low-frequency K-mers.
-R Recommended in the first step (pregraph) if the genome is repeat-rich.
-K Use a larger K if the genome is repeat-rich; use a smaller K if the genome has a high heterozygosity rate.
Note: contig1 and contig2 are the original contigs; contig1’ and contig2’ are their reverse complements.
What is Genome assembly?

Aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.

Ge+en+no+om+me --> Genome


e as

=> Genome****assembly
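The "Ge+en+no+om+me --> Genome" example can be reproduced with a small overlap-and-merge routine. It is a toy sketch that assumes the fragments are already given in the correct order; a real assembler also has to work out that order. The function names are chosen for this example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_in_order(fragments):
    """Merge fragments left to right, collapsing each pairwise overlap."""
    seq = fragments[0]
    for frag in fragments[1:]:
        n = overlap(seq, frag)
        seq += frag[n:]  # append only the part not already covered
    return seq


print(merge_in_order(["Ge", "en", "no", "om", "me"]))  # prints "Genome"
```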
Clip short tips (length less than 2 K-mers) from the graph.
Filter low-coverage nodes.
Use read path information to resolve tiny repeats.
Merge bubbles.
Command Line e.g.
grape_63mer0418 contig -g NAME -R >contig.log

-M: 0-3, default 1.
-R: Must also have been set in the first step (pregraph).
Warning: -M and -R cannot be set at the same time.
Store the K-mers of each contig in a hash table, using the K-mer as the key and the contig ID and position as the values.
Locate a read on a contig using its two bordering K-mers; this is equivalent to exact seed alignment.
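The hash lookup described above might look like the following sketch. It handles the forward strand only and ignores reverse complements and mismatches; the function names and the toy contigs are assumptions made for illustration, not the tool's actual implementation.

```python
def build_kmer_index(contigs, k):
    """Map each K-mer to the list of (contig_id, position) where it occurs."""
    index = {}
    for contig_id, seq in contigs.items():
        for pos in range(len(seq) - k + 1):
            index.setdefault(seq[pos:pos + k], []).append((contig_id, pos))
    return index


def locate_read(read, index, k):
    """Place a read using its first and last K-mers as exact seeds.

    A placement is reported only if both bordering K-mers hit the same
    contig at positions consistent with the read length.
    """
    if len(read) < k:
        return []
    span = len(read) - k  # offset between the two bordering K-mers
    last_hits = set(index.get(read[-k:], []))
    return [(cid, pos) for cid, pos in index.get(read[:k], [])
            if (cid, pos + span) in last_hits]


contigs = {"c1": "ACGTTGCAAGGCT", "c2": "TTTTCCCCGGGG"}
index = build_kmer_index(contigs, 4)
print(locate_read("GTTGCAAG", index, 4))  # prints [('c1', 2)]
```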
Command Line e.g.
grape_63mer0418 map -s Lib.lst -g NAME >map.log
Command Line e.g.
grape_63mer0418 scaff -g ospup >scaff.log
Assembly Evaluation
Preliminary analysis
Try a case
Contig N50 >= 20 kb
Scaffold N50 >= 300 kb
Single base error rate < 0.05%
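For reference, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch, with a function name and toy lengths chosen for this example:

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the total length
            return length
    return 0  # empty input


print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

The same function applies to contig lengths and scaffold lengths alike.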
Sequencing raw data
Data filter & Correction
Many widely used assembly programs adopted the overlap-layout-consensus (OLC) approach:
Arachne, Celera Assembler, CAP3, PCAP, Phrap, Phusion and Newbler …
After Illumina/Solexa sequencing technology entered the market, several short-read assemblers were developed based on the de Bruijn graph (DBG):
Euler-USR, Velvet, ABySS, ALLPATHS-LG and SOAPdenovo …
Besides the second-generation sequencing technologies, there are many other new technologies helpful for de novo sequencing such as:
Optical mapping, a physical mapping technology (OpGen)
PacBio, which produces extremely long reads but with a high error rate (>10%)
to be continued...
In total there were 59 assemblies, 41 of them independently contributed by 17 different groups using 15 different assembly programs. The overall rank of each assembly was the sum of its rankings in eight categories.
Overall: Sum of all rankings (possible range 8–160)
CPNG50: contig path NG50
SPNG50: scaffold path NG50
Struct: sum of structural errors
CC50: length for which half of any two valid columns in the assembly are correct in order and orientation
Subs: total substitution errors per correct bit
Copy num: proportion of columns with a copy number error
Cov tot: overall coverage
Cov genic: coverage within coding sequences
Hot Air
Copyright Wenbin Chen
Command Line e.g.
grape_63mer0418 pregraph -s Lib.lst -K 55 -R -o NAME >pregraph.log
e.g. Assemblathon 1 competition