progressiveMauve (version 2015_02_13.1)

Select sequences to align:

in fasta format

Apply Backbone:

Read an existing sequence alignment in XMFA format and apply backbone statistics to it (--apply-backbone)

Island gap size:

Alignment gaps above this size in nucleotides are considered to be islands

Disable backbone:

Disable backbone detection

Output Guide Tree:

Write out the guide tree used for alignment to a file

Output Backbone:

Write out the backbone to a file

MUMs:

Find MUMs only, do not attempt to determine locally collinear blocks (LCBs)

Seed weight:

Use the specified seed weight for calculating initial anchors

Match Input:

Use specified match file instead of searching for matches

Max gapped aligner length:

Maximum number of base pairs to attempt aligning with the gapped aligner

input-guide-tree:

A phylogenetic guide tree in Newick format that describes the order in which sequences will be aligned

Collinear inputs:

Assume that input sequences are collinear--they have no rearrangements

Scoring scheme:

Selects the anchoring score function

No weight scaling:

Don't scale LCB weights by conservation distance and breakpoint distance

max-breakpoint-distance-scale:

Set the maximum weight scaling by breakpoint distance

conservation-distance-scale:

Scale conservation distances by this amount

Skip refinement:

Do not perform iterative refinement

Skip gapped alignment:

Do not perform gapped alignment

BP dist estimate min score:

Minimum LCB score for estimating pairwise breakpoint distance

Gap open:

Gap open penalty

Repeat penalty:

Sets whether the repeat scores go negative or go to zero for highly repetitive sequences

Gap extend:

Gap extend penalty

Weight:

Minimum pairwise LCB score

Min scaled penalty:

Minimum breakpoint penalty after scaling the penalty by expected divergence

HMM p go homologous:

Probability of transitioning from the unrelated to the homologous state

HMM p go unrelated:

Probability of transitioning from the homologous to the unrelated state

HMM identity:

Expected level of sequence identity among pairs of sequences

Seed family:

Use a family of spaced seeds to improve sensitivity

Solid seeds:

Use solid seeds. Do not permit substitutions in anchor matches

Coding seeds:

Use coding pattern seeds. Useful for generating matches in coding regions with 3rd codon position degeneracy

No recursion:

Disable recursive anchor search

What it does

Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences.

Mauve has been developed with the idea that a multiple genome aligner should require only modest computational resources. It employs algorithmic techniques that scale well in the amount of sequence being aligned. For example, a pair of Y. pestis genomes can be aligned in under a minute, while a group of 9 divergent Enterobacterial genomes can be aligned in a few hours.

Figure: progressiveMauve XMFA alignment visualized with the Mauve tool (hemolysin.jpg)

Example Usage

Usage	Notes
Align genomes	Simply select as many fasta files with one or more sequences as necessary
Align genomes but also save the guide tree and produce a backbone file	Use the Output Guide Tree and Output Backbone options
Align genomes, but do not detect forced alignment of unrelated sequences	Use the Disable backbone option
Detect forced alignment of unrelated sequence in the alignment produced in the previous example, using custom Homology HMM transition parameters.	Use the Apply Backbone option and specify the XMFA file produced in the previous example
Compute ungapped local-multiple alignments among the input sequences	Use the MUMs option
Compute an alignment of the same genomes, using previously computed local-multiple alignments	Set the Match Input to the tabular MUMs file produced in the previous example
Set a minimum scaled breakpoint penalty to cope with the case where most genomes are aligned correctly, but manual inspection reveals that a divergent genome has too many predicted rearrangements.	Use the Min scaled penalty option and set it to a value such as 5000
Globally align a set of collinear virus genomes, using seed families to improve anchoring sensitivity in regions below 70% sequence identity.	Use the Collinear inputs and Seed family options

The progressiveMauve algorithm: addressing limitations of the original algorithm

Comparative genomics has revealed that closely related bacteria often have highly divergent gene content. While the original Mauve algorithm could align regions conserved among all organisms, the portion of the genome conserved among all taxa (the core genome) shrinks as more taxa are added to the analysis. As such, the original Mauve algorithm did not scale well to large numbers of taxa because it could not align regions conserved among subsets of the genomes under study. progressiveMauve employs a different algorithmic approach to scoring alignments that allows alignment of segments conserved among subsets of taxa. The progressiveMauve algorithm has been described in Aaron Darling's Ph.D. thesis, and is also the subject of a manuscript published in PLoS ONE. A brief overview is given here.

Finding initial local multiple alignments

progressiveMauve elaborates on the original algorithm for finding local multiple alignments. Instead of using a single seed pattern for match filtration, progressiveMauve uses a combination of three seed patterns for improved sensitivity. The palindromic seed patterns have been described in Darling et al. 2006, "Procrastination leads to efficient filtration for local multiple alignment".

Seed matches which represent a unique subsequence shared by two or more input genomes are subjected to ungapped extension until the seed pattern no longer matches. The result is an ungapped local multiple alignment with at most one component from each of the input genome sequences.
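The idea of a spaced seed match can be illustrated with a minimal sketch. It assumes two forward-strand sequences only, and the pattern `11011` is a hypothetical palindromic pattern chosen for illustration, not one of the patterns progressiveMauve actually uses. Positions marked `1` must match, positions marked `0` may differ, and only keys that occur exactly once in each sequence are reported. Ungapped extension is not shown.

```python
from collections import defaultdict


def spaced_key(seq, pos, pattern):
    """Return the characters of seq under the '1' positions of the pattern."""
    return "".join(seq[pos + i] for i, flag in enumerate(pattern) if flag == "1")


def index_sequence(seq, pattern):
    """Map each spaced key to the list of start positions where it occurs."""
    table = defaultdict(list)
    for pos in range(len(seq) - len(pattern) + 1):
        table[spaced_key(seq, pos, pattern)].append(pos)
    return table


def unique_seed_matches(seq_a, seq_b, pattern="11011"):
    """Return (pos_a, pos_b) pairs whose spaced key is unique in both sequences."""
    idx_a = index_sequence(seq_a, pattern)
    idx_b = index_sequence(seq_b, pattern)
    shared = idx_a.keys() & idx_b.keys()
    return sorted(
        (idx_a[key][0], idx_b[key][0])
        for key in shared
        if len(idx_a[key]) == 1 and len(idx_b[key]) == 1
    )


# The third base differs (G vs T), but the spaced pattern ignores that position.
print(unique_seed_matches("ACGTA", "ACTTA", "11011"))
# A solid pattern of the same length finds no match for the same input.
print(unique_seed_matches("ACGTA", "ACTTA", "11111"))
```

The contrast between the two calls mirrors the difference between spaced seeds and the Solid seeds option, which does not permit substitutions in anchor matches.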

Computing a pairwise genome content distance matrix and guide tree

progressiveMauve builds up genome alignments progressively according to a guide tree. The guide tree is computed based on an estimate of the shared gene content among each pair of input genomes. For a pair of input genomes, gx and gy, shared gene content is estimated by counting the number of nucleotides in gx and gy aligned to each other in the initial set of local multiple alignments. The count is normalized to a similarity value between 0 and 1 by dividing by the average size of gx and gy. The similarity value is subtracted from 1 to arrive at a distance estimate. Neighbor joining is then applied to the matrix of distance estimates to yield a guide tree topology. Note that the guide tree is not intended to be a phylogeny indicative of the genealogy of input genomes. It is merely a computational crutch for progressive genome alignment. Also note that alignments are later refined independently of a single guide tree topology to avoid biasing later phylogenetic inference.
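The distance calculation described above can be sketched in a few lines. This is an illustration of the arithmetic only: the function name, the dictionary-based inputs, and the clamp of the similarity to 1.0 are assumptions made for the example, and the neighbor joining step is not shown.

```python
from itertools import combinations


def content_distance_matrix(genome_sizes, shared_nt):
    """Build a symmetric genome content distance matrix.

    genome_sizes: dict mapping genome name -> genome length in nucleotides.
    shared_nt: dict mapping frozenset({name_x, name_y}) -> number of
        nucleotides aligned between the two genomes in the initial
        local multiple alignments (hypothetical input format).
    """
    names = sorted(genome_sizes)
    dist = {a: {b: 0.0 for b in names} for a in names}
    for gx, gy in combinations(names, 2):
        average_size = (genome_sizes[gx] + genome_sizes[gy]) / 2
        aligned = shared_nt.get(frozenset((gx, gy)), 0)
        # Clamp is a guard added for this sketch so similarity stays in [0, 1].
        similarity = min(1.0, aligned / average_size)
        dist[gx][gy] = dist[gy][gx] = 1.0 - similarity
    return dist


sizes = {"gx": 1000, "gy": 3000}
shared = {frozenset(("gx", "gy")): 500}
# Average size is 2000, similarity is 500 / 2000 = 0.25, distance is 0.75.
print(content_distance_matrix(sizes, shared)["gx"]["gy"])
```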

Computing a pairwise breakpoint distance matrix

Prior to alignment, progressiveMauve attempts to compute a conservative estimate of the number of rearrangement breakpoints among any pair of genomes. For each pair of genomes, pairwise alignments are created from the local-multiple alignments and the pairwise alignments are subjected to greedy breakpoint elimination. The breakpoint penalty used for greedy breakpoint elimination is set high for closely related genomes and scaled downward according to the estimate of genomic content distance. Because the breakpoint penalty is high, the resulting set of locally collinear blocks represents robustly supported segmental homology, and a conservative estimate of the breakpoint distance can be made on this basis. The conservative estimate of breakpoint distance is used later during progressive alignment to scale breakpoint penalties.
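Once a set of locally collinear blocks (LCBs) has been chosen, counting breakpoints between two genomes is straightforward. The sketch below does not implement greedy breakpoint elimination. It only shows one common way to count breakpoints, assuming each genome is represented as a list of signed LCB identifiers (a negative sign meaning the block is inverted) and that genomes are treated as linear. The representation and function name are assumptions for this example.

```python
def breakpoint_count(order_x, order_y):
    """Count adjacencies in order_x that are not preserved in order_y.

    An adjacency (a, b) is preserved if order_y contains a followed by b,
    or the same pair read from the opposite strand: -b followed by -a.
    """
    preserved = set()
    for left, right in zip(order_y, order_y[1:]):
        preserved.add((left, right))
        preserved.add((-right, -left))
    return sum(1 for pair in zip(order_x, order_x[1:]) if pair not in preserved)


# Blocks 2 and 3 are inverted together in the second genome,
# which breaks the adjacencies (1, 2) and (3, 4) but keeps (2, 3).
print(breakpoint_count([1, 2, 3, 4], [1, -3, -2, 4]))
```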

Progressive genome alignment

A genome alignment is progressively built up according to the guide tree. At each step of the progressive genome alignment, alignment anchors are selected from the initial set of local multiple alignments. Anchors are selected so that they maximize a sum-of-pairs scoring scheme which applies a penalty for predicting breakpoints among any pair of genomes. Because rates of genomic rearrangement are highly variable, especially in some bacterial pathogens, some genomes may be expected to exhibit greater rearrangement than others. As such, a single choice of scoring penalty is unlikely to yield accurate alignments for all genomes. To cope with this phenomenon, progressiveMauve scales the breakpoint penalty according to the expected level of sequence divergence and the number of well-supported genomic rearrangements among the pair of input genomes. These scaling values are taken from the distance matrices computed earlier in the algorithm.
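The exact scaling formula is not given here, so the following is only a sketch of the general idea under an assumed form: the penalty shrinks multiplicatively as the conservation distance and the breakpoint distance grow, and a floor plays the role of the Min scaled penalty option. The multiplicative form, the function name, and the parameter names are all assumptions for illustration and should not be read as the tool's actual computation.

```python
def scaled_breakpoint_penalty(base_penalty, conservation_distance,
                              breakpoint_distance, min_scaled_penalty=0.0):
    """Scale a breakpoint penalty for one pair of genomes (assumed form).

    Both distances are expected in [0, 1]. More divergent or more
    rearranged pairs receive a lower penalty, never below the floor.
    """
    for name, value in (("conservation_distance", conservation_distance),
                        ("breakpoint_distance", breakpoint_distance)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    scaled = base_penalty * (1.0 - conservation_distance) * (1.0 - breakpoint_distance)
    return max(scaled, min_scaled_penalty)


# A divergent, rearranged pair gets a quarter of the base penalty ...
print(scaled_breakpoint_penalty(10000, 0.5, 0.5))
# ... unless a floor such as 5000 prevents it from dropping that far.
print(scaled_breakpoint_penalty(10000, 0.5, 0.5, min_scaled_penalty=5000))
```

Under this assumed form, raising the floor reduces the number of rearrangements predicted for a divergent genome, which is the situation the Min scaled penalty example in the usage table addresses.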

Anchored alignment

Once anchors have been computed at a node in the guide tree, a global alignment is computed on the basis of the anchors. Given a set of anchors among two genomes, a genome and an alignment, or a pair of alignments, a modified MUSCLE global alignment algorithm is applied to compute an anchored profile-profile alignment. MUSCLE is then used to perform tree-independent iterative refinement on the global genome alignment.

Rejecting alignment of unrelated sequence

Although we compute a global alignment among sequences, genomes often contain lineage-specific sequence and are thus not globally related. The global alignment will often contain forced alignment of unrelated sequence. A simple hidden Markov model structure is used to detect regions of forced alignment of unrelated sequence, which are then removed from the alignment.
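A two-state model of this kind can be sketched with a small Viterbi decoder. The sketch is a simplification, not the model used by progressiveMauve: it assumes a pairwise alignment reduced to a list of 1 (match) and 0 (mismatch) columns, a fixed background match probability of 0.25 in the unrelated state, and equal starting probabilities. The parameter names echo the HMM identity, HMM p go homologous, and HMM p go unrelated options, and the values in the usage example are illustrative only.

```python
import math


def viterbi_homology(columns, identity, p_go_homologous, p_go_unrelated,
                     background_match=0.25):
    """Label each alignment column 'H' (homologous) or 'U' (unrelated)."""
    if not columns:
        return []
    states = ("H", "U")
    # Log emission probabilities for a match (1) or mismatch (0) column.
    emit = {
        "H": {1: math.log(identity), 0: math.log(1 - identity)},
        "U": {1: math.log(background_match), 0: math.log(1 - background_match)},
    }
    # Log transition probabilities keyed by (from_state, to_state).
    trans = {
        ("H", "H"): math.log(1 - p_go_unrelated),
        ("H", "U"): math.log(p_go_unrelated),
        ("U", "H"): math.log(p_go_homologous),
        ("U", "U"): math.log(1 - p_go_homologous),
    }
    score = {s: math.log(0.5) + emit[s][columns[0]] for s in states}
    backpointers = []
    for obs in columns[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + trans[(p, s)])
            pointers[s] = best_prev
            new_score[s] = score[best_prev] + trans[(best_prev, s)] + emit[s][obs]
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# 30 matching columns followed by 30 mismatching columns.
columns = [1] * 30 + [0] * 30
path = viterbi_homology(columns, identity=0.9,
                        p_go_homologous=0.01, p_go_unrelated=0.01)
print("".join(path))
```

Columns labelled `U` correspond to the regions that would be treated as forced alignment of unrelated sequence and removed.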

Strengths of the progressiveMauve algorithm

It can be applied to a much larger number of genomes than the original Mauve algorithm
It can align more divergent genomes than the original algorithm. Genomes with as little as 50% nucleotide identity can be aligned
Manual adjustment of the alignment scoring parameters is usually not necessary
It aligns the pan-genome, including regions conserved among only subsets of the input genomes
It is more accurate than the previous Mauve algorithm

Notes on Reproducibility

The command line programme progressiveMauve seems to behave differently when:

--max-breakpoint-distance-scale=0.5 --conservation-distance-scale=0.5

are passed to the tool, compared to when those options are not passed. This means that if you wish to precisely replicate the results you see in Galaxy at the command line, you'll need to pass these flags with their "default" values.
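A small helper can make sure the two flags are always present when building the command line. This is a sketch: the two scale flags and their values come from the note above, while `--output=` is written from memory of the progressiveMauve usage text and should be checked against your installed version. The function name and file names are hypothetical.

```python
import shlex


def build_progressivemauve_command(fasta_paths, output_path,
                                   max_breakpoint_distance_scale=0.5,
                                   conservation_distance_scale=0.5):
    """Return an argument list that always passes both scale flags explicitly."""
    return [
        "progressiveMauve",
        f"--max-breakpoint-distance-scale={max_breakpoint_distance_scale}",
        f"--conservation-distance-scale={conservation_distance_scale}",
        # Assumed flag name; verify with the tool's own help output.
        f"--output={output_path}",
        *fasta_paths,
    ]


command = build_progressivemauve_command(
    ["genome1.fasta", "genome2.fasta"], "alignment.xmfa"
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the flags have been confirmed.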
