Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences.
Mauve has been developed with the idea that a multiple genome aligner should require only modest computational resources. It employs algorithmic techniques that scale well in the amount of sequence being aligned. For example, a pair of Y. pestis genomes can be aligned in under a minute, while a group of 9 divergent Enterobacterial genomes can be aligned in a few hours.
progressiveMauve XMFA alignment visualized with the Mauve tool:
Usage | Notes |
---|---|
Align genomes | Simply select as many fasta files with one or more sequences as necessary |
Align genomes but also save the guide tree and produce a backbone file | Use the Output Guide Tree and Output Backbone options |
Align genomes, but do not detect forced alignment of unrelated sequences | Use the Disable backbone option |
Detect forced alignment of unrelated sequence in the alignment produced in previous example, use custom Homology HMM transition parameters. | Use the Apply Backbone option and specify the XMFA file produced in the previous example |
Compute ungapped local-multiple alignments among the input sequences | Use the MUMs option |
Compute an alignment of the same genomes, using previously computed local-multiple alignments | Set the Match Input to the tabular MUMs file produced in the previous example |
Set a minimum scaled breakpoint penalty to cope with the case where most genomes are aligned correctly, but manual inspection reveals that a divergent genome has too many predicted rearrangements. | Use the Min Scaled Penalty and set to a value like 5000 |
Globally align a set of collinear virus genomes, using seed families to improve anchoring sensitivity in regions below 70% sequence identity. | Use the Colinear, Seed Family options |
Comparative genomics has revealed that closely-related bacteria often have highly divergent gene content. While the original Mauve algorithm could align regions conserved among all organisms, the portion of the genome conserved among all taxa (the core genome) shrinks as more taxa are added to the analysis. As such, the original Mauve algorithm did not scale well to large numbers of taxa because it could not align regions conserved among subsets of the genomes under study. progressiveMauve employs a different algorithmic approach to scoring alignments that allows alignment of segments conserved among subsets of taxa. The progressiveMauve algorithm has been described in Aaron Darling's Ph.D. Thesis, and is also the subject of a manuscript published in PLoS ONE. A brief overview is given here.
progressiveMauve elaborates on the original algorithm for finding local multiple alignments. Instead of using a single seed pattern for match filtration, progressiveMauve uses a combination of three seed patterns for improved sensitivity. The palindromic seed patterns have been described in Darling et al. 2006 "Procrastination leads to efficient filtration for local multiple alignment"
Seed matches which represent a unique subsequence shared by two or more input genomes are subjected to ungapped extension until the seed pattern no longer matches. The result is an ungapped local multiple alignment with at most one component from each of the input genome sequences.
progressiveMauve builds up genome alignments progressively according to a guide tree. The guide tree is computed based on an estimate of the shared gene content among each pair of input genomes. For a pair of input genomes, g.x and g.y, shared gene content is estimated by counting the number of nucleotides in gx and gy aligned to each other in the initial set of local multiple alignments. The count is normalized to a similarity value between 0 and 1 by dividing by the average size of gx and gy. The similarity value is subtracted from 1 to arrive at a distance estimate. Neighbor joining is then applied to the matrix of distance estimates to yield a guide tree topology. Note that the guide tree is not intended to be a phylogeny indicative of the genealogy of input genomes. It is merely a computational crutch for progressive genome alignment. Also note that alignments are later refined independently of a single guide tree toplogy to avoid biasing later phylogenetic inference.
Prior to alignment, progressiveMauve attempts to compute a conservative estimate of the number of rearrangement breakpoints among any pair of genomes. For each pair of genomes, pairwise alignments are created from the local-multiple alignments and the pairwise alignments are subjected to greedy breakpoint elimination. The breakpoint penalty used for greedy breakpoint elimination is set high for closely related genomes and scaled downward according to the estimate of genomic content distance. Because the breakpoint penalty is high, the resulting set of locally collinear blocks represent robustly supported segmental homology, and a conservative estimate of the breakpoint distance can be made on this basis. The conservative estimate of breakpoint distance is used later during progressive alignment to scale breakpoint penalties.
A genome alignment is progressively built up according to the guide tree. At each step of the progressive genome alignment, alignment anchors are selected from the initial set of local multiple alignments. Anchors are selected so that they maximize a Sum-of-pairs scoring scheme which applies a penalty for predicting breakpoints among any pair of genomes. Because rates of genomic rearrangement are highly variable, especially in some bacterial pathogens, some genomes may be expected to exhibit greater rearrangement than others. As such, a single choice of scoring penalty is unlikely to yield accurate alignments for all genomes. To cope with this phenomenon, progressiveMauve scales the breakpoint penalty according to the expected level of sequence divergence and the number of well-supported genomic rearrangements among the pair of input genomes. These scaling values are taken from the distance matrices computed earlier in the algorithm.
Once anchors have been computed at a node in the guide tree, a global alignment is computed on the basis of the anchors. Given a set of anchors among two genomes, a genome and an alignment, or a pair of alignments, a modified MUSCLE global alignment algorithm is applied to compute an anchored profile-profile alignment. MUSCLE is then used to perform tree-independent iterative refinement on the global genome alignment.
The command line programme progressiveMauve seems to behave differently when:
--max-breakpoint-distance-scale=0.5 --conservation-distance-scale=0.5
are passed to the tool, compared to when those options are not passed. This means that if you wish to precisely replicate the results you see in Galaxy at the command line, you'll need to pass these flags with their "default" values.
@ATTRIBUTION@