# HG changeset patch
# User bgruening
# Date 1376401087 14400
# Node ID d602d8b1dc4fe4632540fbc291bf833f43dd38fa
# Parent 5b919ef94655d94980f68d24bd654d84080779dd
Uploaded

diff -r 5b919ef94655 -r d602d8b1dc4f tool-data/homer_available_genomes.loc.sample
--- a/tool-data/homer_available_genomes.loc.sample Mon Aug 12 14:39:25 2013 -0400
+++ b/tool-data/homer_available_genomes.loc.sample Tue Aug 13 09:38:07 2013 -0400
@@ -2,5 +2,3 @@
 hg19
 mm9
 mm10
-
-
diff -r 5b919ef94655 -r d602d8b1dc4f tools/README
--- a/tools/README Mon Aug 12 14:39:25 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,15 +0,0 @@
-Homer wrapper for Galaxy
-
-The homer tools will need to be accessible from command line
-
-Code repo: https://bitbucket.org/gvl/homer
-
-=========================================:
-LICENSE for this wrapper:
-=========================================:
-Kevin Ying
-Garvan Institute: http://www.garvan.org.au
-GVL: https://genome.edu.au/wiki/GVL
-
-http://opensource.org/licenses/mit-license.php
-
diff -r 5b919ef94655 -r d602d8b1dc4f tools/annotatePeaks.xml
--- a/tools/annotatePeaks.xml Mon Aug 12 14:39:25 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,164 +0,0 @@
-
-
- homer
-
-
-
-
- annotatePeaks.pl $input_bed $genome_selector 1> $out_annotated
- 2> $out_log || echo "Error running annotatePeaks." >&2
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- .. class:: infomark
-
- **Homer annoatePeaks**
-
- More information on accepted formats and options
-
- http://biowhat.ucsd.edu/homer/ngs/annotation.html
-
- TIP: use homer_bed2pos and homer_pos2bed to convert between the homer peak positions and the BED format.
-
-**Parameter list**
-
-Command line options (not all of them are supported)::
-
- Usage: annotatePeaks.pl <peak file | tss> <genome version> [additional options...]
-
- Available Genomes (required argument): (name,org,directory,default promoter set)
- -- or --
- Custom: provide the path to genome FASTA files (directory or single file)
-
- User defined annotation files (default is UCSC refGene annotation):
- annotatePeaks.pl accepts GTF (gene transfer formatted) files to annotate positions relative
- to custom annotations, such as those from de novo transcript discovery or Gencode.
- -gtf <gtf format file> (-gff and -gff3 can work for those files, but GTF is better)
-
- Peak vs. tss/tts/rna mode (works with custom GTF file):
- If the first argument is "tss" (i.e. annotatePeaks.pl tss hg18 ...) then a TSS centric
- analysis will be carried out. Tag counts and motifs will be found relative to the TSS.
- (no position file needed) ["tts" now works too - e.g. 3' end of gene]
- ["rna" specifies gene bodies, will automaticall set "-size given"]
- NOTE: The default TSS peak size is 4000 bp, i.e. +/- 2kb (change with -size option)
- -list <gene id list> (subset of genes to perform analysis [unigene, gene id, accession,
- probe, etc.], default = all promoters)
- -cTSS <promoter position file i.e. peak file> (should be centered on TSS)
-
- Primary Annotation Options:
- -mask (Masked repeats, can also add 'r' to end of genome name)
- -m <motif file 1> [motif file 2] ... (list of motifs to find in peaks)
- -mscore (reports the highest log-odds score within the peak)
- -nmotifs (reports the number of motifs per peak)
- -mdist (reports distance to closest motif)
- -mfasta <filename> (reports sites in a fasta file - for building new motifs)
- -fm <motif file 1> [motif file 2] (list of motifs to filter from above)
- -rmrevopp <#> (only count sites found within <#> on both strands once, i.e. palindromic)
- -matrix <prefix> (outputs a motif co-occurrence files:
- prefix.count.matrix.txt - number of peaks with motif co-occurrence
- prefix.ratio.matrix.txt - ratio of observed vs. expected co-occurrence
- prefix.logPvalue.matrix.txt - co-occurrence enrichment
- prefix.stats.txt - table of pair-wise motif co-occurrence statistics
- additional options:
- -matrixMinDist <#> (minimum distance between motif pairs - to avoid overlap)
- -matrixMaxDist <#> (maximum distance between motif pairs)
- -mbed <filename> (Output motif positions to a BED file to load at UCSC (or -mpeak))
- -mlogic <filename> (will output stats on common motif orientations)
- -d <tag directory 1> [tag directory 2] ... (list of experiment directories to show
- tag counts for) NOTE: -dfile <file> where file is a list of directories in first column
- -bedGraph <bedGraph file 1> [bedGraph file 2] ... (read coverage counts from bedGraph files)
- -wig <wiggle file 1> [wiggle file 2] ... (read coverage counts from wiggle files)
- -p <peak file> [peak file 2] ... (to find nearest peaks)
- -pdist to report only distance (-pdist2 gives directional distance)
- -pcount to report number of peaks within region
- -vcf <VCF file> (annotate peaks with genetic variation infomation, one col per individual)
- -editDistance (Computes the # bp changes relative to reference)
- -individuals <name1> [name2] ... (restrict analysis to these individuals)
- -gene <data file> ... (Adds additional data to result based on the closest gene.
- This is useful for adding gene expression data. The file must have a header,
- and the first column must be a GeneID, Accession number, etc. If the peak
- cannot be mapped to data in the file then the entry will be left empty.
- -go <output directory> (perform GO analysis using genes near peaks)
- -genomeOntology <output directory> (perform genomeOntology analysis on peaks)
- -gsize <#> (Genome size for genomeOntology analysis, default: 2e9)
-
- Annotation vs. Histogram mode:
- -hist <bin size in bp> (i.e 1, 2, 5, 10, 20, 50, 100 etc.)
- The -hist option can be used to generate histograms of position dependent features relative
- to the center of peaks. This is primarily meant to be used with -d and -m options to map
- distribution of motifs and ChIP-Seq tags. For ChIP-Seq peaks for a Transcription factor
- you might want to use the -center option (below) to center peaks on the known motif
- ** If using "-size given", histogram will be scaled to each region (i.e. 0-100%), with
- the -hist parameter being the number of bins to divide each region into.
- Histogram Mode specific Options:
- -nuc (calculated mononucleotide frequencies at each position,
- Will report by default if extracting sequence for other purposes like motifs)
- -di (calculated dinucleotide frequencies at each position)
- -histNorm <#> (normalize the total tag count for each region to 1, where <#> is the
- minimum tag total per region - use to avoid tag spikes from low coverage
- -ghist (outputs profiles for each gene, for peak shape clustering)
- -rm <#> (remove occurrences of same motif that occur within # bp)
-
- Peak Centering: (other options are ignored)
- -center <motif file> (This will re-center peaks on the specified motif, or remove peak
- if there is no motif in the peak. ONLY recentering will be performed, and all other
- options will be ignored. This will output a new peak file that can then be reanalyzed
- to reveal fine-grain structure in peaks (It is advised to use -size < 200) with this
- to keep peaks from moving too far (-mirror flips the position)
- -multi (returns genomic positions of all sites instead of just the closest to center)
-
- Advanced Options:
- -len <#> / -fragLength <#> (Fragment length, default=auto, might want to set to 0 for RNA)
- -size <#> (Peak size[from center of peak], default=inferred from peak file)
- -size #,# (i.e. -size -10,50 count tags from -10 bp to +50 bp from center)
- -size "given" (count tags etc. using the actual regions - for variable length regions)
- -log (output tag counts as log2(x+1+rand) values - for scatter plots)
- -sqrt (output tag counts as sqrt(x+rand) values - for scatter plots)
- -strand <+|-|both> (Count tags on specific strands relative to peak, default: both)
- -pc <#> (maximum number of tags to count per bp, default=0 [no maximum])
- -cons (Retrieve conservation information for peaks/sites)
- -CpG (Calculate CpG/GC content)
- -ratio (process tag values as ratios - i.e. chip-seq, or mCpG/CpG)
- -nfr (report nuclesome free region scores instead of tag counts, also -nfrSize <#>)
- -norevopp (do not search for motifs on the opposite strand [works with -center too])
- -noadj (do not adjust the tag counts based on total tags sequenced)
- -norm <#> (normalize tags to this tag count, default=1e7, 0=average tag count in all directories)
- -pdist (only report distance to nearest peak using -p, not peak name)
- -map <mapping file> (mapping between peak IDs and promoter IDs, overrides closest assignment)
- -noann, -nogene (skip genome annotation step, skip TSS annotation)
- -homer1/-homer2 (by default, the new version of homer [-homer2] is used for finding motifs)
-
-
-
-
-
diff -r 5b919ef94655 -r d602d8b1dc4f tools/bed2pos.xml
--- a/tools/bed2pos.xml Mon Aug 12 14:39:25 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,37 +0,0 @@
-
-
- homer
-
-
-
-
- bed2pos.pl $input_bed 1> $out_pos
- 2> $out_log || echo "Error running bed2pos." >&2
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- .. class:: infomark
-
- Converts: BED -(to)-> homer peak positions
-
- **Homer bed2pos.pl**
-
- http://biowhat.ucsd.edu/homer/ngs/miscellaneous.html
-
-
-
diff -r 5b919ef94655 -r d602d8b1dc4f tools/findMotifsGenome.xml
--- a/tools/findMotifsGenome.xml Mon Aug 12 14:39:25 2013 -0400
+++ b/tools/findMotifsGenome.xml Tue Aug 13 09:38:07 2013 -0400
@@ -1,48 +1,116 @@
-
+
+
  blat
  weblogo
  ghostscript
-
-
+ #import os
  #import tempfile
- #set $tmpdir = tempfile.mkdtemp()
+
+ #set $tmpdir = os.path.abspath( tempfile.mkdtemp() )
  export PATH=\$PATH:$database.fields.path;
- findMotifsGenome $infile ${infile.metadata.dbkey} $tmpdir
+ findMotifsGenome.pl $infile ${infile.metadata.dbkey} $tmpdir
+
+ -p 4
+ $mask
+ -size $size
+ -len $motif_len
+ -mis $mismatches
+ -S $number_of_motifs
+ $noweight
+ $cpg
+ -nlen $nlen
+ -olen $olen
+ $hypergeometric
+ $norevopp
+ $rna
+
+ #if $bg_infile:
+ -bg $bg_infile
+ #end if
+
+ #if $logfile_output:
+ 2> $out_logfile
+ #else:
+ 2>&1
+ #end if
+
+ ;
+ cp $tmpdir/knownResults.txt $known_results_tabular;
+
+ #if $concat_motifs_output:
+ cp $tmpdir/homerMotifs.all.motifs $out_concat_motifs;
+ #end if
+
+ #if $html_output:
+ #set $go_path = os.path.join($tmpdir, 'geneOntology.html')
+
+ mkdir $denovo_results_html.files_path;
+ cp $tmpdir/homerResults.html $denovo_results_html;
+ cp $tmpdir/homerResults.html "$denovo_results_html.files_path";
+ cp -r $tmpdir/homerResults/ "$denovo_results_html.files_path";
- ;
- cp $tmpdir/homerResults.html $denovo_results_html;
- cp -r $tmpdir/homerResults/* "$denovo_results_html.files_path";
+ mkdir "$known_results_html.files_path";
+ cp $tmpdir/knownResults.html $known_results_html;
+ cp $tmpdir/knownResults.html "$known_results_html.files_path";
+ cp $tmpdir/homerResults.html "$known_results_html.files_path";
+ cp -r $tmpdir/knownResults/ "$known_results_html.files_path";
- cp $tmpdir/knownResults.html $known_results_html;
- cp -r $tmpdir/knownResults/* "$known_results_html.files_path";
+ #if os.path.exists( $go_path ):
+ cp $go_path "$denovo_results_html.files_path";
+ cp $go_path "$known_results_html.files_path";
+ #end if
-
+ #end if
- 2>&1
+ ##rm -rf $tmpdir
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-
-
+
+
+ html_output is True
+
+
+ html_output is True
+
+
+ concat_motifs_output is True
+
+
+ logfile_output is True
+
@@ -56,6 +124,11 @@
  **Homer findMotifsGenome**
+Autonormalization attempts to remove sequence bias from lower order oligos (1-mers, 2-mers ... up to #).
+Region level autonormalization, which is for 1/2/3 mers by default, attempts to normalize background regions by adjusting their weights.
+If this isn't getting the job done (autonormalization is not guaranteed to remove all sequence bias), you can try the more aggressive motif level autonormalization (-olen #).
+This performs the autonormalization routine on the oligo table during de novo motif discovery.
+
diff -r 5b919ef94655 -r d602d8b1dc4f tools/findPeaks.xml
--- a/tools/findPeaks.xml Mon Aug 12 14:39:25 2013 -0400
+++ b/tools/findPeaks.xml Tue Aug 13 09:38:07 2013 -0400
@@ -1,45 +1,38 @@
-
+
+
  blat
  weblogo
  ghostscript
- Homer's peakcaller. Requires tag directories (see makeTagDirectory)
  export PATH=\$PATH:$database.fields.path;
  findPeaks $affected_tag_dir.extra_files_path -o $outputPeakFile
- #if $control_tag_dir:
- -i $control_tag_dir.extra_files_path
- #end if
+ #if $control_tag_dir:
+ -i $control_tag_dir.extra_files_path
+ #end if
- 2>&1
+ #if $logfile_output:
+ 2> $out_logfile
+ #else:
+ 2>&1
+ #end if
-
-
-
-
-
-
-
+
-
+
-
-
-
-
-
@@ -49,76 +42,8 @@
  **Homer findPeaks**
- For more options, look under: "Command line options for findPeaks"
-
- http://biowhat.ucsd.edu/homer/ngs/peaks.html
-
- TIP: use homer_bed2pos and homer_pos2bed to convert between the homer peak positions and the BED format.
-
-**Parameter list**
-
-Command line options (not all of them are supported)::
-
- Usage: findPeaks <tag directory> [options]
-
- Finds peaks in the provided tag directory. By default, peak list printed to stdout
-
- General analysis options:
- -o <filename|auto> (file name for to output peaks, default: stdout)
- "-o auto" will send output to "<tag directory>/peaks.txt", ".../regions.txt",
- or ".../transcripts.txt" depending on the "-style" option
- -style <option> (Specialized options for specific analysis strategies)
- factor (transcription factor ChIP-Seq, uses -center, output: peaks.txt, default)
- histone (histone modification ChIP-Seq, region based, uses -region -size 500 -L 0, regions.txt)
- groseq (de novo transcript identification from GroSeq data, transcripts.txt)
- tss (TSS identification from 5' RNA sequencing, tss.txt)
- dnase (Hypersensitivity [crawford style (nicking)], peaks.txt)
+Requires tag directories (see makeTagDirectory)
- chipseq/histone options:
- -i <input tag directory> (Experiment to use as IgG/Input/Control)
- -size <#> (Peak size, default: auto)
- -minDist <#> (minimum distance between peaks, default: peak size x2)
- -gsize <#> (Set effective mappable genome size, default: 2e9)
- -fragLength <#|auto> (Approximate fragment length, default: auto)
- -inputFragLength <#|auto> (Approximate fragment length of input tags, default: auto)
- -tbp <#> (Maximum tags per bp to count, 0 = no limit, default: auto)
- -inputtbp <#> (Maximum tags per bp to count in input, 0 = no limit, default: auto)
- -strand <both|separate> (find peaks using tags on both strands or separate, default:both)
- -norm # (Tag count to normalize to, default 10000000)
- -region (extends start/stop coordinates to cover full region considered "enriched")
- -center (Centers peaks on maximum tag overlap and calculates focus ratios)
- -nfr (Centers peaks on most likely nucleosome free region [works best with mnase data])
- (-center and -nfr can be performed later with "getPeakTags"
-
- Peak Filtering options: (set -F/-L/-C to 0 to skip)
- -F <#> (fold enrichment over input tag count, default: 4.0)
- -P <#> (poisson p-value threshold relative to input tag count, default: 0.0001)
- -L <#> (fold enrichment over local tag count, default: 4.0)
- -LP <#> (poisson p-value threshold relative to local tag count, default: 0.0001)
- -C <#> (fold enrichment limit of expected unique tag positions, default: 2.0)
- -localSize <#> (region to check for local tag enrichment, default: 10000)
- -inputSize <#> (Size of region to search for control tags, default: 2x peak size)
- -fdr <#> (False discovery rate, default = 0.001)
- -poisson <#> (Set poisson p-value cutoff, default: uses fdr)
- -tagThreshold <#> (Set # of tags to define a peak, default: 25)
- -ntagThreshold <#> (Set # of normalized tags to define a peak, by default uses 1e7 for norm)
- -minTagThreshold <#> (Absolute minimum tags per peak, default: expected tags per peak)
-
- GroSeq Options: (Need to specify "-style groseq"):
- -tssSize <#> (size of region for initiation detection/artifact size, default: 250)
- -minBodySize <#> (size of regoin for transcript body detection, default: 1000)
- -maxBodySize <#> (size of regoin for transcript body detection, default: 10000)
- -tssFold <#> (fold enrichment for new initiation dectection, default: 4.0)
- -bodyFold <#> (fold enrichment for new transcript dectection, default: 4.0)
- -endFold <#> (end transcript when levels are this much less than the start, default: 10.0)
- -fragLength <#> (Approximate fragment length, default: 150)
- -uniqmap <directory> (directory of binary files specifying uniquely mappable locations)
- Download from http://biowhat.ucsd.edu/homer/groseq/
- -confPvalue <#> (confidence p-value: 1.00e-05)
- -minReadDepth <#> (Minimum initial read depth for transcripts, default: auto)
- -pseudoCount <#> (Pseudo tag count, default: 2.0)
- -gtf <filename> (Output de novo transcripts in GTF format)
- "-o auto" will produce <dir>/transcripts.txt and <dir>/transcripts.gtf
diff -r 5b919ef94655 -r d602d8b1dc4f tools/homer_macros.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/homer_macros.xml Tue Aug 13 09:38:07 2013 -0400
@@ -0,0 +1,11 @@
+
+
+
+
+
+
+
+
+
+
+
diff -r 5b919ef94655 -r d602d8b1dc4f tools/makeTagDirectory.py
--- a/tools/makeTagDirectory.py Mon Aug 12 14:39:25 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,94 +0,0 @@
-"""
-
-
-"""
-import re
-import os
-import sys
-import subprocess
-import optparse
-import shutil
-import tempfile
-
-def getFileString(fpath, outpath):
-    """
-    format a nice file size string
-    """
-    size = ''
-    fp = os.path.join(outpath, fpath)
-    s = '? ?'
-    if os.path.isfile(fp):
-        n = float(os.path.getsize(fp))
-        if n > 2**20:
-            size = ' (%1.1f MB)' % (n/2**20)
-        elif n > 2**10:
-            size = ' (%1.1f KB)' % (n/2**10)
-        elif n > 0:
-            size = ' (%d B)' % (int(n))
-        s = '%s %s' % (fpath, size)
-    return s
-
-class makeTagDirectory():
-    """wrapper
-    """
-
-    def __init__(self,opts=None, args=None):
-        self.opts = opts
-        self.args = args
-
-    def run_makeTagDirectory(self):
-        """
-        makeTagDirectory [options] [alignment file 2]
-
-        """
-        if self.opts.format != "bam":
-            cl = [self.opts.executable] + args + ["-format" , self.opts.format]
-        else:
-            cl = [self.opts.executable] + args
-        print cl
-        p = subprocess.Popen(cl)
-        retval = p.wait()
-
-
-        html = self.gen_html(args[0])
-        #html = self.gen_html()
-        return html,retval
-
-    def gen_html(self, dr=os.getcwd()):
-        flist = os.listdir(dr)
-        print flist
-        """ add a list of all files in the tagdirectory
-        """
-        res = ['Files created by makeTagDirectory\n']
-
-        flist.sort()
-        for i,f in enumerate(flist):
-            if not(os.path.isdir(f)):
-                fn = os.path.split(f)[-1]
-                res.append('\n' % (fn,getFileString(fn, dr)))
-
-        res.append('\n')
-
-        return res
-
-if __name__ == '__main__':
-    op = optparse.OptionParser()
-    op.add_option('-e', '--executable', default='makeTagDirectory')
-    op.add_option('-o', '--htmloutput', default=None)
-    op.add_option('-f', '--format', default="sam")
-    opts, args = op.parse_args()
-    #assert os.path.isfile(opts.executable),'## makeTagDirectory.py error - cannot find executable %s' % opts.executable
-
-    #if not os.path.exists(opts.outputdir):
-        #os.makedirs(opts.outputdir)
-    f = makeTagDirectory(opts, args)
-
-    html,retval = f.run_makeTagDirectory()
-    f = open(opts.htmloutput, 'w')
-    f.write(''.join(html))
-    f.close()
-    if retval <> 0:
-        print >> sys.stderr, serr # indicate failure
-
-
-
diff -r 5b919ef94655 -r d602d8b1dc4f tools/makeTagDirectory.xml
--- a/tools/makeTagDirectory.xml Mon Aug 12 14:39:25 2013 -0400
+++ b/tools/makeTagDirectory.xml Tue Aug 13 09:38:07 2013 -0400
@@ -1,34 +1,31 @@
-
+
  blat
  weblogo
  ghostscript
- (TagDirectory). Used by findPeaks
-
+ (TagDirectory)
- #set $HOMER_PATH = str($database.fields.path)
- export PATH=\$PATH:$database.fields.path;
+ #set $HOMER_PATH = str($database.fields.path)
+ export PATH=\$PATH:$database.fields.path;
+
+ makeTagDirectory $tag_dir.extra_files_path
+ #for $infile in $alignment_files:
+ $infile.file
+ #end for
- makeTagDirectory $tag_dir.extra_files_path
- #for $infile in $alignment_files:
- $infile.file
- #end for
-
- 2>&1
+ #if $logfile_output:
+ 2> $out_logfile
+ #else:
+ 2>&1
+ #end if
-
-
-
-
-
-
-
+
-
+
diff -r 5b919ef94655 -r d602d8b1dc4f tools/pos2bed.xml
--- a/tools/pos2bed.xml Mon Aug 12 14:39:25 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,37 +0,0 @@
-
-
- homer
-
-
-
-
- pos2bed.pl $input_peak 1> $out_bed
- 2> $out_log || echo "Error running pos2bed." >&2
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- .. class:: infomark
-
- Converts: homer peak positions -(to)-> BED format
-
- **Homer pos2bed.pl**
-
- http://biowhat.ucsd.edu/homer/ngs/miscellaneous.html
-
-
-