GMAP Build
GMAP Build creates an index of a genomic sequence for mapping and alignment using GMAP (Genomic Mapping and Alignment Program for mRNA and EST sequences) and GSNAP (Genomic Short-read Nucleotide Alignment Program). (GMAP Build uses GMAP commands: gmap_build, iit_store, psl_splicesites, psl_introns, gtf_splicesites, gtf_introns, gff3_splicesites, gff3_introns, dbsnp_iit, snpindex, cmetindex, and atoiindex.)
You will want to read the README.
Publication citation: Thomas D. Wu and Colin K. Watanabe. Bioinformatics 2005, 21(9):1859-1875; doi:10.1093/bioinformatics/bti310
Circular chromosomes
You can tell gmap_build that certain chromosomes are circular with the -c or --circular flag. The value for this flag is a list of chromosome names, separated by commas. The gmap_build program will then allow GSNAP and GMAP to align reads across the ends of those chromosomes. For example, the human mitochondrial genome is circular.
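To see why a circular chromosome needs special handling, the sketch below searches for a read that spans the end of a circular sequence. It is only an illustration of the idea, not how GMAP or GSNAP work internally; the function name find_on_circle and the toy sequence are made up for this example.

```python
def find_on_circle(chromosome: str, read: str) -> int:
    """Return the 0-based start of read on a circular chromosome, or -1."""
    if not read or len(read) > len(chromosome):
        return -1
    # Append the first len(read)-1 bases so a match may wrap past the end.
    unrolled = chromosome + chromosome[: len(read) - 1]
    return unrolled.find(read)


chrM = "ACGTTGCA"  # toy circular sequence (hypothetical)
read = "CAAC"      # last two bases of chrM followed by its first two bases

linear_hit = chrM.find(read)               # treats the sequence as linear
circular_hit = find_on_circle(chrM, read)  # allows the read to wrap around
print(linear_hit, circular_hit)
```

A linear search misses the read, while the circular search reports a hit that starts near the end of the sequence and continues at its beginning.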
Detecting known and novel splice sites in GSNAP
GSNAP can detect splice junctions in individual reads. GSNAP allows for known splicing at two levels: at the level of known splice sites and at the level of known introns. At the site level, GSNAP finds splicing between arbitrary combinations of donor and acceptor splice sites, meaning that it can find alternative splicing events. At the intron level, GSNAP finds splicing only between the given donor-acceptor pairs, so it cannot find alternative splicing events, only the introns included in the given list. For most purposes, I would recommend using known splice sites rather than known introns, unless you are certain that all alternative splicing events are known and represented in your file.
Splice site files can be generated from a GTF file or from the refGene table from UCSC.
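The difference between the two levels can be sketched with a toy example. The coordinates below are invented, and the rule that a donor must lie upstream of its acceptor is a simplifying assumption for a plus-strand gene; this is not GSNAP's actual search procedure.

```python
from itertools import product

# Hypothetical known introns as (donor, acceptor) coordinate pairs.
known_introns = {(100, 200), (300, 400)}

# The individual splice sites implied by those introns.
donors = sorted({d for d, _ in known_introns})
acceptors = sorted({a for _, a in known_introns})


def allowed_at_site_level(donors, acceptors):
    """Any known donor may pair with any known acceptor downstream of it."""
    return {(d, a) for d, a in product(donors, acceptors) if d < a}


def allowed_at_intron_level(introns):
    """Only the listed donor-acceptor pairs are allowed."""
    return set(introns)


site_level = allowed_at_site_level(donors, acceptors)
intron_level = allowed_at_intron_level(known_introns)

# Junctions reachable only at the site level, e.g. an exon-skipping event.
print(sorted(site_level - intron_level))
```

In this toy case the site level admits one extra junction, from the first donor to the second acceptor, which the intron level would not report.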
SNP-tolerant alignment in GSNAP
GSNAP has the ability to align to a reference space of all possible major and minor alleles in a set of known SNPs provided by the user.
Process known SNP data, either from older dbSNP files or from newer files in VCF format. The older dbSNP files can be obtained from UCSC, either from the Galaxy UCSC table browser or downloaded:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz
For versions before snp132, you may also want to exclude exceptions, which will require this file:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130Exceptions.txt.gz
The dbsnp_iit option "-w weight" makes use of the dbSNP item weight, a value from 1 to 3, where a lower weight means higher confidence. An item is included if its weight is less than or equal to the given value. The default value of -w is 1, which is the criterion UCSC uses to build its ambiguous version of the genome. To allow all item weights, specify "-w 3".
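The weight criterion amounts to a simple threshold. The sketch below restates it in Python; the function name keep_by_weight and the SNP identifiers are hypothetical, and the code only mirrors the rule as described here rather than the program's own implementation.

```python
def keep_by_weight(items, max_weight=1):
    """Keep items whose dbSNP weight is max_weight or less.

    Lower weight means higher confidence, so max_weight=1 is the strictest
    setting and max_weight=3 keeps everything.
    """
    if max_weight not in (1, 2, 3):
        raise ValueError("weight must be 1, 2, or 3")
    return [name for name, weight in items if weight <= max_weight]


# Made-up (identifier, weight) pairs for illustration.
snps = [("rs_a", 1), ("rs_b", 2), ("rs_c", 3)]

print(keep_by_weight(snps))     # corresponds to the default, -w 1
print(keep_by_weight(snps, 3))  # corresponds to -w 3, all weights allowed
```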
The more recent SNP data are provided in VCF format, and can be retrieved like this:
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
The VCF file contains multiple versions of dbSNP, so if you want a particular version, such as 135, you would use the flag "-v 135". The vcf_iit program tries to pick a subset of SNPs that somewhat parallel the ones without exceptions in the UCSC dbSNP file. It keeps all SNPs that have been validated (marked in the VCF file as "VLD") or have a submitter link-out ("SLO"). Otherwise, it excludes SNPs that are individual genotypes ("GNO"). If none of these conditions hold, then the SNP is allowed. These rules might not be the best ones; I made them up by trying to compare version 135 of the VCF data with version 135 of the UCSC dbSNP data.
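The selection rules described above can be written as a short decision function. This is a sketch of the rules as stated in the text, not the source of vcf_iit; the function name keep_vcf_snp is hypothetical, and the input is assumed to be the list of flag names found in a record's INFO field.

```python
def keep_vcf_snp(info_flags):
    """Apply the stated rules in order: VLD or SLO keeps, GNO excludes, else keep."""
    flags = set(info_flags)
    if "VLD" in flags or "SLO" in flags:
        return True   # validated, or has a submitter link-out
    if "GNO" in flags:
        return False  # individual genotype without validation or link-out
    return True       # none of the conditions hold, so the SNP is allowed


for flags in (["VLD", "GNO"], ["GNO"], ["SLO"], []):
    print(flags, keep_vcf_snp(flags))
```

Note that the order matters: a SNP marked both "VLD" and "GNO" is kept, because the validation check comes first.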
Alignment of reads from bisulfite-treated DNA in GSNAP
GSNAP has the ability to align reads from bisulfite-treated DNA. Bisulfite treatment converts unmethylated cytosines to uracils, which appear as thymines in reads. GSNAP is able to identify genomic-C to read-T mismatches if a cmetindex is generated.
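The effect of bisulfite treatment on a read can be simulated in a few lines. The sketch below considers only the top strand and uses an invented sequence and methylation position; bisulfite_convert is a hypothetical helper, not part of GMAP or GSNAP.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Turn every unmethylated C into T; methylated C's are left unchanged."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


genome = "ACGTCCGA"  # toy genomic segment
read = bisulfite_convert(genome, methylated_positions={1})

# Positions where the genome has C but the read shows T.
c_to_t = [i for i, (g, r) in enumerate(zip(genome, read)) if g == "C" and r == "T"]
print(read, c_to_t)
```

The C at the methylated position survives in the read, while the unmethylated C's show up as genomic-C to read-T differences that a standard aligner would count as mismatches.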
RNA-editing tolerance in GSNAP
Just as GSNAP has a program cmetindex and a mode called "cmet" for tolerance to C-to-T changes, it can be tolerant to A-to-G changes using the program atoiindex and a mode called "atoi". This mode is designed to facilitate alignments that are tolerant to RNA editing where A's are converted to I's, which appear as G's to a sequencer.
To process reads under RNA-editing tolerance, you will first need to create the atoi index.
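What "tolerance" means for scoring can be illustrated by counting mismatches with and without an allowed substitution. This is a toy model on a pre-aligned, equal-length pair of sequences, not GSNAP's scoring code; count_mismatches and the sequences are made up for the example.

```python
def count_mismatches(genome, read, tolerated=frozenset()):
    """Count positions that differ, ignoring (genome_base, read_base) pairs in tolerated."""
    if len(genome) != len(read):
        raise ValueError("sequences must have equal length")
    return sum(
        1
        for g, r in zip(genome, read)
        if g != r and (g, r) not in tolerated
    )


# A-to-I editing: genomic A is read as G by the sequencer.
ATOI = frozenset({("A", "G")})

genome = "TTAGCAAT"
read = "TTGGCAGT"  # two edited A's appear as G's

standard = count_mismatches(genome, read)
tolerant = count_mismatches(genome, read, tolerated=ATOI)
print(standard, tolerant)
```

Under the tolerant model the edited positions no longer count against the alignment, whereas the reverse change (genomic G, read A) would still be a mismatch.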
K-mer size
You can control the k-mer size for the genomic index with the -k flag, which can range from 12 to 15. The default value for -k is 15, but this requires your machine to have 4 GB of RAM to build the indices. If you do not have 4 GB of RAM, then you will need to reduce the value of -k or find another machine. Here are the RAM requirements for building various indices:
k-mer of 12: 64 MB
k-mer of 13: 256 MB
k-mer of 14: 1 GB
k-mer of 15: 4 GB
These are the RAM requirements for building the indices, not for running the GMAP/GSNAP programs once the indices are built, because the genomic indices are compressed. For example, the genomic index for a k-mer of 15 gives a gammaptrs file of 64 MB and an offsetscomp file of about 350 MB, much smaller than the 4 GB that would otherwise be required. Therefore, you may want to build your genomic index on a computer with sufficient RAM, and distribute that index to be used by computers with less RAM.
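The build-time figures listed above are consistent with one 4-byte entry for each of the 4^k possible k-mers. That interpretation is an assumption of this sketch, not something stated in the GMAP documentation, but it reproduces the numbers exactly:

```python
def index_build_bytes(k, bytes_per_entry=4):
    """Bytes needed for one entry per possible k-mer (assumed 4 bytes each)."""
    return (4 ** k) * bytes_per_entry


MB = 2 ** 20
for k in range(12, 16):
    print(f"k-mer of {k}: {index_build_bytes(k) // MB} MB")
```

Each increase of k by one multiplies the requirement by four, which is why the step from 14 to 15 jumps from 1 GB to 4 GB.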
The amount of compression can be controlled using the -b or --basesize parameter to gmap_build. By default, the k-mer size is 15 and the basesize is 12. If you select a different k-mer size, then basesize defaults to that k-mer size.
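The default rule for basesize can be restated as a small function. This is only a paraphrase of the behavior described in the text; default_basesize is a hypothetical name, and the range check reflects the stated 12 to 15 limits for -k.

```python
def default_basesize(kmer=15):
    """Return the basesize used when -b is not given, per the rule in the text."""
    if not 12 <= kmer <= 15:
        raise ValueError("k-mer size must be between 12 and 15")
    # The default k-mer size of 15 pairs with basesize 12;
    # any other k-mer size uses basesize equal to itself.
    return 12 if kmer == 15 else kmer


for k in (12, 13, 14, 15):
    print(k, default_basesize(k))
```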
If you want to build your genomic databases with more than one k-mer size, you can re-run gmap_build with different values of -k. This will overwrite only the identical files from the previous runs. You can then choose the k-mer size at run-time by using the -k flag for either GMAP or GSNAP.
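Building with several k-mer sizes means running gmap_build once per value of -k. The sketch below only assembles the command lines and prints them; it does not run anything. The -d flag for the genome name and the positional FASTA argument are assumptions based on typical gmap_build usage, and the names "mygenome" and "genome.fa" are placeholders.

```python
import shlex


def gmap_build_commands(genome_name, fasta_path, kmer_sizes):
    """Return one gmap_build argument list per requested k-mer size."""
    commands = []
    for k in kmer_sizes:
        if not 12 <= k <= 15:
            raise ValueError("k-mer size must be between 12 and 15")
        # Assumed invocation shape: gmap_build -d <genome> -k <kmer> <fasta>
        commands.append(["gmap_build", "-d", genome_name, "-k", str(k), fasta_path])
    return commands


commands = gmap_build_commands("mygenome", "genome.fa", [12, 15])
for argv in commands:
    print(shlex.join(argv))
```

Each list could be passed to subprocess.run on a machine where gmap_build is installed; check the program's own help output for the exact options in your version.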