Send the link below via email or IMCopy
Present to your audienceStart remote presentation
- Invited audience members will follow you as you navigate and present
- People invited to a presentation do not need a Prezi account
- This link expires 10 minutes after you close the presentation
- A maximum of 30 users can follow your presentation
- Learn more about this feature in our knowledge base article
Multiple Sequence Alignment
Transcript of Multiple Sequence Alignment
to N Dimensions Dynamite Programming! better, still optimal a representation of a set of sequences, where equivalent residues are aligned in columns complexity: O(L^n) L ...... length of sequences
n ....... number of sequences D. W. Mount - Bioinformatics (2004) compute all (n*(n-1))/2 pairwise optimal alignments to get boundaries for the search space "Multiple Alignment with SP-score, Star Alignment, and Tree Alignment (for a given phylogeny) [are NP-hard]. In addition, NP-hardness results are provided for Consensus Patterns and Substring Parsimony." Although it is widely suspected that there are no polynomial-time algorithms for NP-hard problems, this has never been proven. (Wikipedia -> http://en.wikipedia.org/wiki/Np_hard ) Isaac Elias (2007) - Settling the Intractability of Multiple Alignment.
JOURNAL OF COMPUTATIONAL BIOLOGY
( http://www.liebertonline.com/doi/pdfplus/10.1089/cmb.2006.13.1323 ) Progressive methods e.g. ClustalW iterative refinement CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice -
Julie D.Thompson, Desmond G.Higgins+ and Toby J.Gibson*
Nucleic Acids Research, 1994, Vol. 22, No. 22 4673-4680
http://nar.oxfordjournals.org/content/22/22/4673.short ".. the SP score has no rigorous theoretical foundation and, in particular, fails to exploit phylogeny or incorporate an evolutionary model."
From: Multiple sequence alignment - Robert C Edgar and Serafim Batzoglou - Current Opinion in Structural Biology 2006, 16:1–6 Recent developments in the MAFFT multiple sequence alignment program - Kazutaka Katoh and Hiroyuki Toh - BRIEFINGS IN BIOINFORMATICS. 2008; VOL 9. NO 4. 286-298 Multiple sequence alignment - Robert C Edgar and Serafim Batzoglou - Current Opinion in Structural Biology 2006, 16:1–6 "Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue." it's getting a little local... It has been shown several times in the past few years that MAFFT and ProbCons are the best performing of the programs based solely on maximizing sequence similarity, followed by T-Coffee and Muscle. However, this message does not seem to have affected phylogenetic practice, as MAFFT and ProbCons together contributed to have affected phylogenetic practice, as MAFFT and ProbCons together contributed to <2% of the papers.
There thus seems to be a contradiction here: if Clustal is being used, then similarity is being accepted as a valid criterion for alignment (which it will often not be, as similarity |= homology), and yet the best-performing similarity-based programs are not being widely used. The choice of Clustal may thus represent historical inertia rather than any form of active preferential choice. Over the last 30 years, >100 multiple sequence alignment methodshave been published, based on all kind of heuristics, including simulated annealing, genetic algorithms, Tabu search, branch and bound algorithms, Hidden Markov Modeling and countless agglomerative approaches including the progressive alignment algorithm, by far the most widely used nowadays.
The biological issue surrounding MSAs is even more complex: given a set of sequences, we do not know how to estimate similarity in a way that will guarantee the biological correctness of an alignment, whether this correctness is defined in evolutionary, structural or functional terms. Genetic Algorithm (i) accuracy improvement, achieved through the use of consistency-based methods (instead of SP-scored);
(ii) an expansion of MSA methods scope, thanks to the development of template-based approaches, a natural development of consistency-based methods that makes it possible to efficiently integrate alternative methods and alternative types of data;
(iii) large-scale alignments -> parallelization
(iv) phylogentic reconstruction. While DP- and HMM-based approaches are mostly interchangeable, the last ones make it easier to explore the parameter space using off-the-shelves statistical tools such as Baum–Welch and Viterbi training. HMM modeling also offers easy access to a wider range of scoring possibilities, thanks to posterior decoding, thus making it possible to assess complex alignment scoring schemes. HMM Consistency-based scoring consistency scoring Meta-Methods meta-methods Given a multiple sequence dataset, M-Coffee computes alternative MSAs using any selected method. Each of the alignments thus produced is then turned into a primary library and merged to the main T-Coffee library. The resulting library is used to compute an MSA consistent with the original alignments. This final MSA may be considered as some sort of average of all the considered alignments. When combining eight of the most accurate and distinct MSA packages, M-Coffee produces alignments that are on average better than any of the individual methods. The improvement is not very high (1–2-point percent) but relatively consistent since the meta-method outperforms the best individual method (ProbCons) on ∼2/3 of the 2000 considered datasets (Wallace et al., 2006). Template-based alignment refers to the notion of enriching a sequence with the information contained in a template. The template can either be a 3D structure, a profile or a prediction of any kind. Once the template is precisely mapped onto the sequence, its information content can be used to guide the sequence alignment in a sequence independent fashion. Depending on the nature of the template one refers to its usage as ‘structural extension’ (works also for RNA) or ‘homology extension’ (sequence profile). Template-based MSA methods template-based methods Large Datasets parallelization Consistency-based methods are cubic in complexity with N (number of sequences), but only quadratic with L (length of sequences). With N much smaller than L, the extra-cost incurred by consistency remains manageable, but things degrade rapidly when N becomes big. Yet, it is now clear that L is bounded, at most by the average length of a genome. N, on the other hand, has no foreseeable limit and could reflect the total number of species or the total number of individuals (past and present) in a population or even the total number of haplotypes in a system. Phylogenetic Analysis Visualization and Organisation of the Data Simple guide http://www.geneious.com/