comparison region-motif-compare/README.md @ 17:7afdfd4f4c1b draft

Uploaded
author jeremyjliu
date Wed, 12 Nov 2014 15:21:11 -0500
16:9a84f76db861 17:7afdfd4f4c1b
1 # Region-Motif-Compare Tools
2 Version 1.1 Released 2014
3 Park Laboratory
4 Center for Biomedical Informatics
5 Harvard University
6
7 Contact
8 Jeremy Liu (jeremy.liu@yale.edu)
9 Nils Gehlenborg (nils@hms.harvard.edu)
10
11 ## Overview
12 ### Structure
13 The tool suite consists of:
14
15 1. Two R scripts: region_motif_compare.r and region_motif_intersect.r
16 2. Two XML files: region_motif_compare.xml and region_motif_intersect.xml
17 3. Motif Database Directory: region_motif_db
18 4. Dependency Library Directory: region_motif_lib
19 5. Galaxy Workflows: Files with suffix ".ga" that can be imported into the local
20 Galaxy instance after installation of the tool.
21
22 ### Description
23 1. **region_motif_intersect.r** (1 bed -> 1 tsv):
24 Takes one bed file of regions as input. Then it calculates
25 the number of intersections of the regions and the motifs. region_motif_intersect.r
26 outputs a tab separated values (tsv) file of motif names and intersection counts.
27 **Important Note:** region_motif_intersect.r makes no assumptions about the nature
28 of the input regions. For example, if overlapping regions are inputted, motifs that
29 intersect the overlap will be double counted. Thus, it is recommended that regions
30 be merged before using this tool, using the merge tool in the Galaxy toolshed.
31
32 2. **region_motif_compare.r** (2 tsv -> 2 tsv & 1 png):
33 Takes as input two tsv files of motif/region intersection
34 counts. These generally originate from running region_motif_intersect.r on two sets
35 of different regions with the same query motif database. Based on the counts,
36 region_motif_compare.r then determines the enrichment (or depletion) of certain
37 motifs across the two regions. This is done by correcting for the size and GC
38 content of the regions, and applying a Poisson test to the counts.
39 Then, region_motif_compare.r outputs the most significantly enriched and depleted
40 motifs as two tsv files. In addition, the tool outputs a diagnostic plot containing
41 graphical representations of the motif counts, GC correction curves, and significant
42 motifs that distinguish the two regions (selected via p-value).
43
44 3. **region_motif_db**: Contains motif positions as compressed, indexed tabix files.
45
46 4. **region_motif_lib**: Contains dependencies (i.e. plotting.r) for region_motif_compare.r
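
The double counting described in the note for region_motif_intersect.r can be avoided by merging regions first. The README recommends the merge tool in the Galaxy toolshed. For command line use, the following is a minimal sketch with `sort` and `awk`. The function name `merge_bed` is hypothetical and not part of the tool suite, and it assumes a tab-separated BED file without header lines whose first three columns are chr, start, end.

```shell
# Hypothetical helper (not part of region-motif-compare):
# merge overlapping or directly adjacent BED intervals.
# Assumes tab-separated input with chr, start, end in columns 1-3 and no header.
merge_bed() {
  # Sort by chromosome, then numerically by start, so overlaps become neighbours.
  sort -k1,1 -k2,2n "$1" | awk -F'\t' -v OFS='\t' '
    NR == 1 { chr = $1; start = $2 + 0; end = $3 + 0; next }
    # Same chromosome and starting at or before the current end: extend it.
    $1 == chr && $2 + 0 <= end { if ($3 + 0 > end) end = $3 + 0; next }
    # Otherwise emit the finished interval and begin a new one.
    { print chr, start, end; chr = $1; start = $2 + 0; end = $3 + 0 }
    END { if (NR > 0) print chr, start, end }'
}
```

For example, `merge_bed regions.bed > regions.merged.bed` writes one line per merged interval. Columns beyond the third are dropped in this sketch.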
47
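To see why region size has to be corrected for before counts from two region sets can be compared, the sketch below computes a size-normalized log2 ratio per motif. It is only an illustration: it does not reproduce the GC correction or the Poisson test performed by region_motif_compare.r. The function name `count_ratio` is hypothetical, the count files are assumed to be two-column tsv files (motif name, count), and the total region sizes in base pairs are assumed to be supplied by the user.

```shell
# Hypothetical helper (not the method used by region_motif_compare.r):
# size-normalized log2 ratio of motif counts between two region sets.
# Usage: count_ratio counts1.tsv size1_bp counts2.tsv size2_bp
count_ratio() {
  awk -F'\t' -v s1="$2" -v s2="$4" '
    # First file: remember the count for each motif name.
    NR == FNR { c1[$1] = $2; next }
    # Second file: keep motifs with a positive count in both files.
    ($1 in c1) && c1[$1] + 0 > 0 && $2 + 0 > 0 {
      r1 = c1[$1] / s1          # hits per base pair in region set 1
      r2 = $2 / s2              # hits per base pair in region set 2
      printf "%s\t%.3f\n", $1, log(r1 / r2) / log(2)
    }' "$1" "$3"
}
```

A positive value means the motif is denser in the first region set, a negative value means it is denser in the second. Motifs with a zero count in either file are skipped here, whereas a statistical test can still handle them.
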
48 ## Installation
49 Directions for installing the region-motif-compare tools on a personal computer
50 and into a local Galaxy instance.
51
52 1. Follow the online directions to install a local instance of Galaxy (getgalaxy.org).
53 Optionally, follow the directions to install Refinery (refinery-platform.readthedocs.org)
54
55 2. Clone the GitHub repository to your local computer
56 ````
57 git clone https://github.com/parklab/refinery-galaxy-tools.git
58 cd refinery-galaxy-tools/region-motif-compare
59 ````
60
61 3. Make a directory for the tools in the Galaxy instance. This serves as a category
62 for the tool in the tools sidebar. You can also place the tools in an existing
63 or alternatively named directory, but remember to update tool_conf.xml to reflect this.
64 ````
65 cd ~/galaxy-dist/tools/
66 mkdir my_tools
67 cd my_tools
68 ````
69
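70 4. Copy over the ".r" and ".xml" files, as well as `region_motif_db` and `region_motif_lib`, from the repository cloned in Step 2 (replace `<path_to_clone>` with the directory where you ran `git clone`)
71 ````
72 cd <path_to_clone>/refinery-galaxy-tools/region-motif-compare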
73 cp *.r ~/galaxy-dist/tools/my_tools
74 cp *.xml ~/galaxy-dist/tools/my_tools
75 cp -r region_motif_db ~/galaxy-dist/tools/my_tools
76 cp -r region_motif_lib ~/galaxy-dist/tools/my_tools
77 ````
78
79 5. Edit `~/galaxy-dist/tool_conf.xml` to reflect the addition of the new tools.
80 Add the following lines within the `<toolbox>` tags. If in Step 3 you copied
81 the tools to a different directory than `my_tools`, edit the code snippet
82 to reflect the correct path name.
83 ````
84 <section id="mTools" name="My Tools">
85 <tool file="my_tools/region_motif_intersect.xml" />
86 <tool file="my_tools/region_motif_compare.xml" />
87 </section>
88 ````
89
90 6. Download the motif databases and place them into `region_motif_db`
91 ````
92 cd ~/galaxy-dist/tools/my_tools/region_motif_db
93 wget ????/pouya_motifs.bed.bgz
94 wget ????/pouya_motifs.bed.bgz.tbi
95 wget ????/jaspar_jolma_motifs.bed.bgz
96 wget ????/jaspar_jolma_motifs.bed.bgz.tbi
97 wget ????/mm9_motifs.bed.bgz
98 wget ????/mm9_motifs.bed.bgz.tbi
99 ````
100
101 7. Install the Bioconductor R package Rsamtools for dealing with tabix files
102 ```
103 $ R
104 > source("http://bioconductor.org/biocLite.R")
105 > biocLite("Rsamtools")
106 ````
107
108 8. If in Step 3 you copied the tools to an existing directory or an alternatively
109 named directory, you must edit the following file paths.
110 In `region_motif_intersect.r` and `region_motif_compare.r` edit `commonDir`:
111 ````
112 # Replace this line
113 commonDir = concat(workingDir, "/tools/my_tools")
114 # With this edited line
115 commonDir = concat(workingDir, "<relative_path_from_galaxy_root>/<tool_directory>")
116 ````
117 In addition, edit `region_motif_intersect.xml` and `region_motif_compare.xml` to
118 reflect the path of the tools relative to the galaxy root directory.
119 ````
120 <command interpreter="bash">
121 /usr/bin/R --slave --vanilla -f $GALAXY_ROOT_DIR/<path_to_tools>/region_motif_intersect.r --args $GALAXY_ROOT_DIR $db_type $in_bed $out_tab
122 </command>
123 ````
124 ````
125 <command interpreter="bash">
126 /usr/bin/R --slave --vanilla -f $GALAXY_ROOT_DIR/<path_to_tools>/region_motif_compare.r --args $GALAXY_ROOT_DIR $db_type $in_tab_1 $in_tab_2 $out_enriched $out_depleted $out_plots
127 </command>
128 ````
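
After Step 6, it can be worth confirming that every downloaded database file has its tabix index next to it, since each `.bed.bgz` file is distributed together with a `.bed.bgz.tbi` file. The following is a minimal sketch; the function name `check_tabix_pairs` is hypothetical and not part of the tool suite.

```shell
# Hypothetical helper: verify that each *.bed.bgz file in a directory
# has a non-empty *.bed.bgz.tbi index beside it.
check_tabix_pairs() {
  db_dir=$1
  found=0
  status=0
  for f in "$db_dir"/*.bed.bgz; do
    [ -e "$f" ] || continue          # the glob matched nothing
    found=1
    if [ ! -s "$f.tbi" ]; then
      echo "missing or empty index: $f.tbi" >&2
      status=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "no .bed.bgz files found in $db_dir" >&2
    status=1
  fi
  return "$status"
}
```

For example, `check_tabix_pairs ~/galaxy-dist/tools/my_tools/region_motif_db` returns a non-zero status and names the missing index files if a download was incomplete.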
129
130 ## Running the Tools
131 ### Running from Galaxy
132 1. To run the tools as workflows, import the .ga workflows included in the GitHub
133 repository via the Galaxy workflow user interface. Then, upload and select two input BED files.
134
135 2. To run the tools individually, select the tool from the tools toolbar, provide
136 a BED file (Region Motif Intersect) or two tsv files (Region Motif Compare), and
137 select a query database from the dropdown menu.
138
139 ### Running from Refinery
140 1. Import the .ga workflows into a local Galaxy instance. These workflows have
141 already been annotated for Refinery.
142
143 2. Add the local Galaxy instance to the Refinery installation.
144 ````
145 python manage.py create_workflowengine <instance_id> "<group_name>"
146 ````
147
148 3. Import the Galaxy workflows into Refinery.
149 ````
150 python manage.py import_workflows
151 ````
152 4. Run the tools from the Refinery user interface.
153
154 ### Running as Command Line Tools
155 You can also run the tools from the command line, an example of which is shown below.
156 More information is found in the headers of the R source files.
157 ````
158 cd ~/galaxy-dist/tools/my_tools
159 R --slave --vanilla -f region_motif_intersect.r --args ~/galaxy-dist p <path_to_bed_file> <path_to_output_tsv>
160 R --slave --vanilla -f region_motif_compare.r --args ~/galaxy-dist p <path_to_region1_counts> <path_to_region2_counts> <enriched_motifs_output_tsv> <depleted_motifs_output_tsv> <plots_png>
161 ````
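
Because region_motif_compare.r takes two count files, a full command line run consists of two intersect calls followed by one compare call. The sketch below chains them using the same arguments as the example above. The function name `run_region_motif`, the `DRY_RUN` variable, and the output file names are hypothetical, and the tools are assumed to be installed under `tools/my_tools` in the Galaxy root.

```shell
# Hypothetical wrapper: intersect two BED files with the motif database,
# then compare the resulting counts.
# Usage: run_region_motif <galaxy_root> <db_type> <bed1> <bed2> <out_dir>
run_region_motif() {
  galaxy_root=$1; db_type=$2; bed1=$3; bed2=$4; out_dir=$5
  tool_dir=$galaxy_root/tools/my_tools   # assumed install location
  run=${DRY_RUN:+echo}                   # if DRY_RUN is set, print commands only

  $run mkdir -p "$out_dir" || return 1
  $run R --slave --vanilla -f "$tool_dir/region_motif_intersect.r" --args \
    "$galaxy_root" "$db_type" "$bed1" "$out_dir/region1_counts.tsv" || return 1
  $run R --slave --vanilla -f "$tool_dir/region_motif_intersect.r" --args \
    "$galaxy_root" "$db_type" "$bed2" "$out_dir/region2_counts.tsv" || return 1
  $run R --slave --vanilla -f "$tool_dir/region_motif_compare.r" --args \
    "$galaxy_root" "$db_type" \
    "$out_dir/region1_counts.tsv" "$out_dir/region2_counts.tsv" \
    "$out_dir/enriched.tsv" "$out_dir/depleted.tsv" "$out_dir/plots.png"
}
```

Setting `DRY_RUN=1` before calling the function prints the commands without running R, which is a quick way to check the paths.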
162
163 ## Interpreting Results
164 ### Motif Database and Result Notation
165 TF motif positions for hg19 and mm9 were curated from three databases:
166 1. ENCODE TF motif database "Pouya" (http://compbio.mit.edu/encode-motifs/)
167 2. JASPAR database "Jaspar" (http://jaspar.genereg.net/)
168 3. DNA binding specificities of human transcription factors "Jolma" (http://www.ncbi.nlm.nih.gov/pubmed/23332764)
169
170 For ENCODE TF motifs, the genomic locations were taken directly from the database.
171 In addition, position weight matrices (pwms) were obtained by averaging the
172 sites in the genome for a motif. These are labeled with "\_8mer\_".
173 Fake motifs were also generated by shuffling the pwms of actual motifs and
174 mapping them to the genome; these are labeled with "\_8mer\_C".
175
176 For JASPAR and Jolma motifs, MAST was run to determine genomic locations from the
177 provided pwms. The motif alignment thresholds were set to the top 5k, 20k, 100k, and
178 250k sites, and redundant maps were removed where the top 30k sites have the same score.
179 These are labeled with "\_t5000" and so on for the other thresholds.
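
When reading a results file, the labels described above can be used to tell the motif types apart, for instance to set aside the shuffled (fake) motifs. The following is a minimal sketch. The function name `motif_kind` and the category names it prints are hypothetical, and it assumes that "_8mer_C" appears at the end of a motif name while the other labels may appear anywhere in it.

```shell
# Hypothetical helper: classify a motif name by the labels in this README.
# Assumes the shuffled-motif label "_8mer_C" is a suffix of the name.
motif_kind() {
  case $1 in
    *_8mer_C)   echo shuffled ;;     # pwm shuffled and mapped to the genome
    *_8mer_*)   echo encode_pwm ;;   # pwm averaged from ENCODE motif sites
    *_t[0-9]*)  echo top_sites ;;    # JASPAR/Jolma map at a top-N threshold
    *)          echo other ;;
  esac
}
```

The order of the patterns matters: the "_8mer_C" case has to be tested before the more general "_8mer_" case.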
180
181
182 ## Motif Tabix File Creation
183 Starting with a BED file of motif positions (at minimum chr, start, end), follow
184 the steps below to generate a tabix file that can be placed in `region_motif_db` and
185 used by the tools.
186
187 1. Download Tabix (http://sourceforge.net/projects/samtools/files/tabix/) and install.
188 Add the `tabix` and `bgzip` binaries to your PATH.
189 ````
190 tar -xvjf tabix-0.2.6.tar.bz2
191 cd tabix-0.2.6
192 make
193 ````
194
195 2. Construct bgzip files and index files.
196 ````
197 cd ~/galaxy-dist/tools/my_tools/region_motif_db
198 (grep ^"#" jaspar_motifs.bed; grep -v ^"#" jaspar_motifs.bed | sort -k1,1 -k2,2n) | bgzip > jaspar_motifs.bed.bgz
199 tabix -p bed jaspar_motifs.bed.bgz # this generates jaspar_motifs.bed.bgz.tbi
200 ````
201
202 3. Add the path to `jaspar_motifs.bed.bgz` to the selection options for the variable
203 `motifDB` in `region_motif_intersect.r` and `region_motif_compare.r`. To enable
204 the new database in Galaxy, you will have to edit the xml files for both tools.
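
Tabix indexing relies on the records being sorted by chromosome and start position, which is what the `sort -k1,1 -k2,2n` call in Step 2 provides. Before compressing a BED file that is believed to be sorted already, a check along the following lines can confirm it. The function name `is_tabix_sorted` is hypothetical; the sort keys are the same as in Step 2, and lines starting with "#" are ignored.

```shell
# Hypothetical helper: succeed only if the non-header lines of a BED file
# are already in the order produced by `sort -k1,1 -k2,2n`.
is_tabix_sorted() {
  # sort -c checks the order without writing output; its exit status
  # becomes the exit status of the pipeline and of the function.
  grep -v '^#' "$1" | sort -c -k1,1 -k2,2n 2>/dev/null
}
```

For example, `is_tabix_sorted jaspar_motifs.bed || echo "sort the file first"` warns when the file still needs sorting.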