
progressiveMauve (version
in fasta format
Read an existing sequence alignment in XMFA format and apply backbone statistics to it (--apply-backbone)
Alignment gaps above this size in nucleotides are considered to be islands (--island-gap-size)
Disable backbone detection (--disable-backbone)
Write out the guide tree used for alignment to a file (--output-guide-tree)
Write out the backbone to a file (--backbone-output)
Find MUMs only, do not attempt to determine locally collinear blocks (LCBs) (--mums)
Use the specified seed weight for calculating initial anchors (--seed-weight)
Use specified match file instead of searching for matches (--match-input)
Maximum number of base pairs to attempt aligning with the gapped aligner (--max-gapped-aligner-length)
A phylogenetic guide tree in Newick format that describes the order in which sequences will be aligned (--input-guide-tree)
Assume that input sequences are collinear--they have no rearrangements (--collinear)
Selects the anchoring score function. (--scoring-scheme)
Don't scale LCB weights by conservation distance and breakpoint distance (--no-weight-scaling)
Set the maximum weight scaling by breakpoint distance. (--max-breakpoint-distance-scale)
Scale conservation distances by this amount. (--conservation-distance-scale)
Do not perform iterative refinement (--skip-refinement)
Do not perform gapped alignment (--skip-gapped-alignment)
Minimum LCB score for estimating pairwise breakpoint distance (--bp-dist-estimate-min-score)
Gap open penalty (--gap-open)
Sets whether the repeat scores go negative or go to zero for highly repetitive sequences. (--repeat-penalty)
Gap extend penalty (--gap-extend)
Minimum pairwise LCB score (--weight)
Minimum breakpoint penalty after scaling the penalty by expected divergence (--min-scaled-penalty)
Probability of transitioning from the unrelated to the homologous state (--hmm-p-go-homologous)
Probability of transitioning from the homologous to the unrelated state (--hmm-p-go-unrelated)
Expected level of sequence identity among pairs of sequences (--hmm-identity)
Use a family of spaced seeds to improve sensitivity (--seed-family)
Use solid seeds. Do not permit substitutions in anchor matches. (--solid-seeds)
Use coding pattern seeds. Useful for generating matches in coding regions with 3rd codon position degeneracy. (--coding-seeds)
Disable recursive anchor search (--no-recursion)

What it does

Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences.

Mauve has been developed with the idea that a multiple genome aligner should require only modest computational resources. It employs algorithmic techniques that scale well in the amount of sequence being aligned. For example, a pair of Y. pestis genomes can be aligned in under a minute, while a group of 9 divergent Enterobacterial genomes can be aligned in a few hours.

Figure: progressiveMauve XMFA alignment visualized with the Mauve tool.


Example Usage

Usage | Notes
Align genomes | Simply select as many FASTA files, each with one or more sequences, as necessary
Align genomes but also save the guide tree and produce a backbone file | Use the Output Guide Tree and Output Backbone options
Align genomes, but do not detect forced alignment of unrelated sequences | Use the Disable backbone option
Detect forced alignment of unrelated sequence in the alignment produced in the previous example, using custom Homology HMM transition parameters | Use the Apply Backbone option and specify the XMFA file produced in the previous example
Compute ungapped local-multiple alignments among the input sequences | Use the MUMs option
Compute an alignment of the same genomes, using previously computed local-multiple alignments | Set the Match Input to the tabular MUMs file produced in the previous example
Set a minimum scaled breakpoint penalty to cope with the case where most genomes are aligned correctly, but manual inspection reveals that a divergent genome has too many predicted rearrangements | Use the Min Scaled Penalty option and set it to a value like 5000
Globally align a set of collinear virus genomes, using seed families to improve anchoring sensitivity in regions below 70% sequence identity | Use the Collinear and Seed Family options
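
For readers who want to reproduce these examples outside Galaxy, the following is a minimal sketch of how the corresponding command line might be assembled. The function name and its parameters are hypothetical, and the `--output` flag is an assumption because it is not among the options listed on this page; the other flags are the ones shown in the option list above. Nothing is executed: the function only returns an argument list.

```python
from typing import List, Optional


def build_progressive_mauve_args(
    fasta_files: List[str],
    output_xmfa: str,
    guide_tree: Optional[str] = None,
    backbone: Optional[str] = None,
    collinear: bool = False,
    seed_family: bool = False,
    min_scaled_penalty: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for progressiveMauve without running it."""
    if not fasta_files:
        raise ValueError("at least one FASTA file is required")
    # Assumption: "--output" names the XMFA output file; it is not in the list above.
    args = ["progressiveMauve", f"--output={output_xmfa}"]
    if guide_tree is not None:
        args.append(f"--output-guide-tree={guide_tree}")
    if backbone is not None:
        args.append(f"--backbone-output={backbone}")
    if collinear:
        args.append("--collinear")
    if seed_family:
        args.append("--seed-family")
    if min_scaled_penalty is not None:
        args.append(f"--min-scaled-penalty={min_scaled_penalty}")
    # Options come first, input sequence files last.
    return args + list(fasta_files)


# Last row of the table: collinear virus genomes with seed families.
print(build_progressive_mauve_args(
    ["virus1.fasta", "virus2.fasta"], "viruses.xmfa",
    collinear=True, seed_family=True,
))
```

The returned list can be passed to a process launcher of your choice once the flags have been checked against the installed progressiveMauve version.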

The progressiveMauve algorithm: addressing limitations of the original algorithm

Comparative genomics has revealed that closely-related bacteria often have highly divergent gene content. While the original Mauve algorithm could align regions conserved among all organisms, the portion of the genome conserved among all taxa (the core genome) shrinks as more taxa are added to the analysis. As such, the original Mauve algorithm did not scale well to large numbers of taxa because it could not align regions conserved among subsets of the genomes under study. progressiveMauve employs a different algorithmic approach to scoring alignments that allows alignment of segments conserved among subsets of taxa. The progressiveMauve algorithm has been described in Aaron Darling's Ph.D. Thesis, and is also the subject of a manuscript published in PLoS ONE. A brief overview is given here.

Finding initial local multiple alignments

progressiveMauve elaborates on the original algorithm for finding local multiple alignments. Instead of using a single seed pattern for match filtration, progressiveMauve uses a combination of three seed patterns for improved sensitivity. The palindromic seed patterns have been described in Darling et al. 2006, "Procrastination leads to efficient filtration for local multiple alignment".

Seed matches which represent a unique subsequence shared by two or more input genomes are subjected to ungapped extension until the seed pattern no longer matches. The result is an ungapped local multiple alignment with at most one component from each of the input genome sequences.
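
The extension step can be illustrated with a small sketch. It is a simplification under stated assumptions, not the progressiveMauve implementation: it handles only two sequences, uses a single made-up seed pattern written as a string of `1` (position must match) and `0` (position is ignored), and slides the seed one position at a time in each direction until it stops matching. The function names are hypothetical.

```python
from typing import Tuple


def seed_matches(a: str, b: str, i: int, j: int, pattern: str) -> bool:
    """True if the spaced seed matches with its left end at a[i] and b[j]."""
    if i < 0 or j < 0 or i + len(pattern) > len(a) or j + len(pattern) > len(b):
        return False
    return all(p == "0" or a[i + k] == b[j + k] for k, p in enumerate(pattern))


def extend_ungapped(a: str, b: str, i: int, j: int, pattern: str) -> Tuple[int, int, int]:
    """Extend a seed match left and right without gaps.

    Returns (start in a, start in b, length) of the ungapped match.
    """
    if not seed_matches(a, b, i, j, pattern):
        raise ValueError("no seed match at the given positions")
    left = 0
    while seed_matches(a, b, i - left - 1, j - left - 1, pattern):
        left += 1
    right = 0
    while seed_matches(a, b, i + right + 1, j + right + 1, pattern):
        right += 1
    return i - left, j - left, left + right + len(pattern)


# The two sequences share "ACGTACGT" at offset 2; the seed is found at offset 4.
print(extend_ungapped("TTACGTACGTAA", "GGACGTACGTCC", 4, 4, "1101"))
```

Because ignored positions never block the extension, the reported match can include mismatched columns, which is the price a spaced seed pays for its extra sensitivity.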

Computing a pairwise genome content distance matrix and guide tree

progressiveMauve builds up genome alignments progressively according to a guide tree. The guide tree is computed based on an estimate of the shared gene content among each pair of input genomes. For a pair of input genomes, gx and gy, shared gene content is estimated by counting the number of nucleotides in gx and gy aligned to each other in the initial set of local multiple alignments. The count is normalized to a similarity value between 0 and 1 by dividing by the average size of gx and gy. The similarity value is subtracted from 1 to arrive at a distance estimate. Neighbor joining is then applied to the matrix of distance estimates to yield a guide tree topology. Note that the guide tree is not intended to be a phylogeny indicative of the genealogy of input genomes. It is merely a computational crutch for progressive genome alignment. Also note that alignments are later refined independently of a single guide tree topology to avoid biasing later phylogenetic inference.
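
The distance estimate described above amounts to one line of arithmetic per genome pair. The sketch below shows it for illustration; the function names and the input layout (genome lengths and aligned-nucleotide counts held in dictionaries) are assumptions, and the neighbor joining step is not included.

```python
from itertools import combinations
from typing import Dict, Tuple


def content_distance(shared_nt: int, len_x: int, len_y: int) -> float:
    """Distance = 1 - (aligned nucleotides / average genome size)."""
    average_size = (len_x + len_y) / 2
    similarity = shared_nt / average_size
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("shared nucleotide count is inconsistent with genome sizes")
    return 1.0 - similarity


def distance_matrix(
    lengths: Dict[str, int],
    shared: Dict[Tuple[str, str], int],
) -> Dict[Tuple[str, str], float]:
    """Pairwise distances, keyed by alphabetically ordered genome name pairs."""
    matrix = {}
    for x, y in combinations(sorted(lengths), 2):
        # Pairs with no recorded local alignments share zero nucleotides.
        matrix[(x, y)] = content_distance(shared.get((x, y), 0), lengths[x], lengths[y])
    return matrix


lengths = {"A": 1000, "B": 3000, "C": 2000}
shared = {("A", "B"): 500, ("B", "C"): 2500}
print(distance_matrix(lengths, shared))
```

A matrix of this form is what a neighbor joining routine would take as input to produce the guide tree topology.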

Computing a pairwise breakpoint distance matrix

Prior to alignment, progressiveMauve attempts to compute a conservative estimate of the number of rearrangement breakpoints among any pair of genomes. For each pair of genomes, pairwise alignments are created from the local-multiple alignments and the pairwise alignments are subjected to greedy breakpoint elimination. The breakpoint penalty used for greedy breakpoint elimination is set high for closely related genomes and scaled downward according to the estimate of genomic content distance. Because the breakpoint penalty is high, the resulting set of locally collinear blocks represents robustly supported segmental homology, and a conservative estimate of the breakpoint distance can be made on this basis. The conservative estimate of breakpoint distance is used later during progressive alignment to scale breakpoint penalties.
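
Once a set of locally collinear blocks has been chosen for a genome pair, a breakpoint count can be read off their order and orientation. The sketch below uses the standard breakpoint definition for signed permutations and is offered only as an illustration: it assumes the blocks are numbered 1..n in the order they occur in the first genome and listed as signed integers in the order they occur in the second, with a negative sign for an inverted block. It does not implement greedy breakpoint elimination, and the function name is hypothetical.

```python
from typing import List


def breakpoint_count(blocks: List[int]) -> int:
    """Count breakpoints in a signed permutation of blocks 1..n.

    The permutation is framed with 0 and n + 1; every adjacent pair (p, q)
    with q != p + 1 counts as one breakpoint.
    """
    n = len(blocks)
    if sorted(abs(b) for b in blocks) != list(range(1, n + 1)):
        raise ValueError("blocks must be a signed permutation of 1..n")
    framed = [0] + list(blocks) + [n + 1]
    return sum(1 for p, q in zip(framed, framed[1:]) if q != p + 1)


print(breakpoint_count([1, 2, 3]))   # collinear genomes
print(breakpoint_count([1, -2, 3]))  # block 2 inverted
```

Under this definition a single inversion contributes two breakpoints, one at each end of the inverted segment.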

Progressive genome alignment

A genome alignment is progressively built up according to the guide tree. At each step of the progressive genome alignment, alignment anchors are selected from the initial set of local multiple alignments. Anchors are selected so that they maximize a Sum-of-pairs scoring scheme which applies a penalty for predicting breakpoints among any pair of genomes. Because rates of genomic rearrangement are highly variable, especially in some bacterial pathogens, some genomes may be expected to exhibit greater rearrangement than others. As such, a single choice of scoring penalty is unlikely to yield accurate alignments for all genomes. To cope with this phenomenon, progressiveMauve scales the breakpoint penalty according to the expected level of sequence divergence and the number of well-supported genomic rearrangements among the pair of input genomes. These scaling values are taken from the distance matrices computed earlier in the algorithm.

Anchored alignment

Once anchors have been computed at a node in the guide tree, a global alignment is computed on the basis of the anchors. Given a set of anchors among two genomes, a genome and an alignment, or a pair of alignments, a modified MUSCLE global alignment algorithm is applied to compute an anchored profile-profile alignment. MUSCLE is then used to perform tree-independent iterative refinement on the global genome alignment.

Rejecting alignment of unrelated sequence

Although we compute a global alignment among sequences, genomes often contain lineage-specific sequence and are thus not globally related. The global alignment will often contain forced alignment of unrelated sequence. A simple hidden Markov model structure is used to detect forced alignment of unrelated sequence, which is then removed from the alignment.
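
To make the idea concrete, here is a minimal two-state sketch decoded with the Viterbi algorithm. It is a deliberate simplification and not the model used by progressiveMauve: each alignment column is reduced to match or mismatch, gaps are ignored, and all probability values in the example are illustrative rather than the tool's defaults. The parameter names mirror the --hmm-p-go-homologous, --hmm-p-go-unrelated and --hmm-identity options only by analogy.

```python
import math
from typing import List


def viterbi_homology(
    columns: List[bool],
    p_go_homologous: float,
    p_go_unrelated: float,
    identity: float = 0.7,
    chance_match: float = 0.25,
) -> List[str]:
    """Label each column 'H' (homologous) or 'U' (unrelated).

    columns[i] is True when the two aligned nucleotides in column i match.
    """
    if not columns:
        return []
    states = ("H", "U")
    log = math.log
    trans = {
        "H": {"H": log(1 - p_go_unrelated), "U": log(p_go_unrelated)},
        "U": {"H": log(p_go_homologous), "U": log(1 - p_go_homologous)},
    }
    emit = {
        "H": {True: log(identity), False: log(1 - identity)},
        "U": {True: log(chance_match), False: log(1 - chance_match)},
    }
    # Assumption: both states are equally likely at the first column.
    score = {s: log(0.5) + emit[s][columns[0]] for s in states}
    back = []
    for obs in columns[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + trans[r][s])
            new_score[s] = score[best_prev] + trans[best_prev][s] + emit[s][obs]
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


columns = [True] * 30 + [False] * 30
labels = viterbi_homology(columns, p_go_homologous=0.01, p_go_unrelated=0.01)
print("".join(labels))
```

Columns labelled `U` correspond to the stretches that would be treated as forced alignment and removed.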

Strengths of the progressiveMauve algorithm

  • It can be applied to a much larger number of genomes than the original Mauve algorithm
  • It can align more divergent genomes than the original algorithm. Genomes with as little as 50% nucleotide identity can be aligned
  • Manual adjustment of the alignment scoring parameters is usually not necessary
  • It aligns the pan-genome, i.e. it also aligns regions conserved among only a subset of the input genomes
  • It is more accurate than the previous Mauve algorithm

Notes on Reproducibility

The command line programme progressiveMauve seems to behave differently when:

--max-breakpoint-distance-scale=0.5 --conservation-distance-scale=0.5

are passed to the tool, compared to when those options are not passed. This means that if you wish to precisely replicate the results you see in Galaxy at the command line, you'll need to pass these flags with their "default" values.
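
If you script your command-line runs, one way to avoid forgetting these flags is to add them automatically. The helper below is a hypothetical sketch: it takes an argument list whose first element is the executable name and inserts the two flags, with the values quoted above, unless they are already present.

```python
from typing import List

# Flags and values as quoted in the note above.
REPRO_FLAGS = {
    "--max-breakpoint-distance-scale": "0.5",
    "--conservation-distance-scale": "0.5",
}


def with_explicit_defaults(args: List[str]) -> List[str]:
    """Return a copy of args with the reproducibility flags made explicit."""
    present = {a.split("=", 1)[0] for a in args if a.startswith("--")}
    extra = [
        f"{flag}={value}"
        for flag, value in REPRO_FLAGS.items()
        if flag not in present
    ]
    # Insert right after the executable so input files stay at the end.
    return args[:1] + extra + args[1:]


print(with_explicit_defaults(["progressiveMauve", "--collinear", "a.fasta", "b.fasta"]))
```

A value you have already set for either flag is left untouched, so the helper never overrides a deliberate choice.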