Model selection

by

Rupert Collins

on 25 June 2013

Transcript of Model selection

An evaluation of nucleotide substitution models for specimen identification
Rupert A. Collins
Laura M. Boykin
Rob H. Cruickshank
Karen F. Armstrong
Is the K2P routinely underfitting or overfitting barcode data?
Choosing a best model
Bias (trends)
Variance (noise)
Number of parameters
Akaike's Information Criterion (AIC)
14 published studies:
BIRDS
Johnsen et al. (2010)
Kerr et al. (2007)
Kerr et al. (2009a)
Kerr et al. (2009b)
FISHES
Hubert et al. (2008)
Rasmussen et al. (2009)
Steinke et al. (2009b)
Steinke et al. (2009a)
Ward et al. (2005)
Wong et al. (2009)
22,000 DNA barcodes
Split into species with ≥ 5 individuals
"... the best metric when distances are low"
(Hebert et al., 2003)

"... when one is studying closely related sequences, there is no need to use complex distance measures ... it is better to use a simpler one"

(Nei & Kumar, 2000)
Pairwise genetic distances

The basis for most analyses with DNA barcode data

Sequence similarity: substitutions per site in an alignment

Raw data can underestimate degree of change

Models of DNA substitution quantify unobserved mutations

K2P (Kimura's 2 parameter) model is the most commonly used
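
To make the difference between raw and corrected distances concrete, here is a minimal Python sketch of the p-distance and the K2P distance for one pair of aligned sequences. The function names are hypothetical, and skipping gaps and ambiguity codes site by site (pairwise deletion) is an assumption, not something stated in the slides.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_differences(seq_a, seq_b):
    """Return (sites compared, transitions, transversions) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    return n, transitions, transversions

def p_distance(seq_a, seq_b):
    """Uncorrected proportion of differing sites."""
    n, ts, tv = count_differences(seq_a, seq_b)
    return (ts + tv) / n

def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance: -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)."""
    n, ts, tv = count_differences(seq_a, seq_b)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "ACGTACGTAT"  # one transition (C -> T) at the last site
    print(p_distance(a, b), k2p_distance(a, b))
```

When sequences differ at only a few sites, the two values are close, which is the situation the quotations above describe.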
Conclusions:
The K2P was a poorly fitting model at the species level

Identification success rates unaffected by model choice even when interspecific threshold values were reassessed

For barcoding, simpler metrics such as p-distance performed equally well
Thank you for listening
The K2P
A simple model

Equal base frequencies

Separate rate parameters for transitions and transversions

Is this realistic for barcode data?
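
One quick way to look at the equal base frequency assumption is to tabulate the observed base composition of an alignment. The sketch below is illustrative only: the function name is hypothetical and the sequences in the usage lines are invented. Under the K2P, each base would be expected at a frequency of 0.25.

```python
from collections import Counter

def base_frequencies(sequences):
    """Return the observed frequency of A, C, G and T across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Count only unambiguous bases; gaps and ambiguity codes are ignored.
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}

if __name__ == "__main__":
    # Invented toy sequences, not real barcode data.
    for base, freq in base_frequencies(["AACG", "AT-N"]).items():
        print(base, round(freq, 3))
```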
(1) Are genetic distances systematically biased, i.e., underestimated by the K2P?

(2) Or, can raw data serve adequately?

(3) Can identification success be improved?
Q. Is the K2P a good model at the species level?
1,500 species (15,000 barcodes)
jModelTest (Posada, 2008)
Best model for the data

Published paper available online

Rupert A. Collins
Laura M. Boykin
Rob H. Cruickshank
Karen F. Armstrong

AIC differences
Evidence ratios
AIC weights (model probabilities)
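
The three AIC summaries listed above can be sketched in a few lines of Python. This uses the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and L is the maximised likelihood. The function names are hypothetical, and the log-likelihoods and parameter counts in the usage lines are invented for illustration. They are not results from the study, which used jModelTest.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def akaike_summary(aic_scores):
    """Return (AIC differences, AIC weights, evidence ratios) per model.

    aic_scores maps a model name to its AIC score.
    """
    best = min(aic_scores.values())
    # AIC difference: distance of each model from the best (lowest) score.
    deltas = {m: score - best for m, score in aic_scores.items()}
    # AIC weight: relative likelihood of each model, normalised to sum to 1.
    relative = {m: math.exp(-0.5 * d) for m, d in deltas.items()}
    total = sum(relative.values())
    weights = {m: r / total for m, r in relative.items()}
    # Evidence ratio: weight of the best model divided by weight of this model.
    evidence = {m: math.exp(0.5 * d) for m, d in deltas.items()}
    return deltas, weights, evidence

if __name__ == "__main__":
    # Invented log-likelihoods and parameter counts, for illustration only.
    scores = {
        "JC": aic(-1250.0, 0),
        "K2P": aic(-1200.0, 1),
        "HKY": aic(-1195.0, 4),
    }
    deltas, weights, evidence = akaike_summary(scores)
    for model in scores:
        print(model, deltas[model], weights[model], evidence[model])
```

The penalty term 2k is what trades off bias against variance: extra parameters are only rewarded when they improve the likelihood enough to offset it.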
Used all 22,000 barcodes (including species with < 5 individuals)

Generated distance matrices for a range of commonly used models

Measure of identification success: "best close match"
(Meier et al., 2006)

Also optimised percent thresholds for each dataset
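
As a rough illustration of the "best close match" criterion, the sketch below classifies each query by its nearest neighbour(s) in a distance matrix. It is a simplified reading of the criterion, not the code used in the study: the function names, the outcome labels and the toy matrix are all assumptions, and the 0.01 threshold is an arbitrary example value.

```python
def best_close_match(query_species, distances, threshold):
    """Classify one query from (reference_species, distance) pairs.

    The pairs must exclude the query itself.
    """
    if not distances:
        return "no identification"
    nearest = min(d for _, d in distances)
    if nearest > threshold:
        return "no identification"  # closest match lies beyond the threshold
    # Species names of all references tied for the closest match.
    best_species = {sp for sp, d in distances if d == nearest}
    if best_species == {query_species}:
        return "correct"
    if query_species in best_species:
        return "ambiguous"  # tied between the right species and another
    return "incorrect"

def success_rate(labels, matrix, threshold):
    """Proportion of individuals identified correctly, each used as a query in turn."""
    outcomes = []
    for i, species in enumerate(labels):
        others = [(labels[j], matrix[i][j]) for j in range(len(labels)) if j != i]
        outcomes.append(best_close_match(species, others, threshold))
    return outcomes.count("correct") / len(outcomes)

if __name__ == "__main__":
    # Invented toy data: two species with two individuals each.
    labels = ["sp1", "sp1", "sp2", "sp2"]
    matrix = [
        [0.000, 0.002, 0.050, 0.060],
        [0.002, 0.000, 0.055, 0.050],
        [0.050, 0.055, 0.000, 0.004],
        [0.060, 0.050, 0.004, 0.000],
    ]
    print(success_rate(labels, matrix, threshold=0.01))
```

Optimising the threshold for a dataset then amounts to calling success_rate over a range of candidate values and keeping the one that scores best, and comparing models amounts to swapping in a matrix computed under a different model.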
Methods
Models
p-distance
JC
F81
K2P
TrN
HKY
HKY+G
GTR+G
"... only distances based on the optimal substitution model should be used"
"... the use of uncorrected distances yields higher or similar identification success rates"
[compared to K2P correction]
Reanalysed the Fregin et al. data: no difference in identification success rates

Fregin et al. had not distinguished between the disciplines of DNA barcoding and DNA taxonomy

DNA barcoding: identification success

DNA taxonomy: evolutionary scenarios

For taxonomy/evolution, we advocate quantifying uncertainty in distances
Reanalysis:
Background
Questions
Methods
Conclusions
Results
Results:
Correct ID rate: 91.8% for all models

Simpler models had negligibly higher rates

Even with optimised thresholds, success rates were not affected by model
p-distance
K2P, JC, HKY
HKY+G, GTR+G
Q. Do models change identification success rates?
Hirotugu Akaike
Motoo Kimura
LEPIDOPTERA
Dincă et al. (2011)
Hajibabaei et al. (2006)
Lukhtanov et al. (2009)
BATS
Francis et al. (2010)