# Region-Motif-Compare Tools
Version 1.1 Released 2014
Park Laboratory
Center for Biomedical Informatics
Harvard University

Contact
Jeremy Liu (jeremy.liu@yale.edu)
Nils Gehlenborg (nils@hms.harvard.edu)

## Overview
### Structure
The tool suite consists of:

1. Two R scripts: region_motif_compare.r and region_motif_intersect.r
2. Two XML files: region_motif_compare.xml and region_motif_intersect.xml
3. Motif Database Directory: region_motif_db
4. Dependency Library Directory: region_motif_lib
5. Galaxy Workflows: Files with suffix ".ga" that can be imported into the local
Galaxy instance after installation of the tool.

### Description
1. **region_motif_intersect.r** (1 bed -> 1 tsv):
Takes one BED file of regions as input, then counts
the intersections between the regions and the motifs. region_motif_intersect.r
outputs a tab-separated values (tsv) file of motif names and intersection counts.
**Important Note:** region_motif_intersect.r makes no assumptions about the nature
of the input regions. For example, if the input contains overlapping regions, motifs that
intersect the overlap will be double counted. Thus, it is recommended that regions
be merged before using this tool, using the merge tool in the Galaxy toolshed.

2. **region_motif_compare.r** (2 tsv -> 2 tsv & 1 png):
Takes as input two tsv files of motif/region intersection
counts. These generally originate from running region_motif_intersect.r on two
different sets of regions with the same query motif database. Based on the counts,
region_motif_compare.r then determines the enrichment (or depletion) of certain
motifs across the two regions. This is done by correcting for the size and GC
content of the regions, and applying a Poisson test to the counts.
Then, region_motif_compare.r outputs the most significantly enriched and depleted
motifs as two tsv files. In addition, the tool outputs a diagnostic plot containing
graphical representations of the motif counts, GC correction curves, and significant
motifs that distinguish the two regions (selected via p-value).

3. **region_motif_db**: Contains motif positions as compressed, indexed tabix files.

4. **region_motif_lib**: Contains dependencies (e.g. plotting.r) for region_motif_compare.r

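The double-counting caveat for region_motif_intersect.r can also be handled outside Galaxy. The following is a minimal sketch, assuming a three-column BED file without header lines; `merge_bed` is a hypothetical helper name, not part of the tool suite. It sorts by chromosome and start, then collapses intervals that overlap or touch, and drops any extra columns.

```shell
#!/usr/bin/env bash
# merge_bed (hypothetical helper, not part of the tool suite):
# merge overlapping or book-ended intervals of a 3-column BED file.
# Assumes no header lines; columns beyond chr/start/end are dropped.
merge_bed() {
    LC_ALL=C sort -k1,1 -k2,2n "$1" | awk 'BEGIN { OFS = "\t" }
        NR == 1 { chr = $1; cur_start = $2; cur_end = $3; next }
        # Same chromosome and the next start lies inside (or touches) the
        # current interval: extend the interval instead of emitting it.
        $1 == chr && ($2 + 0) <= (cur_end + 0) {
            if (($3 + 0) > (cur_end + 0)) cur_end = $3
            next
        }
        # Otherwise emit the finished interval and start a new one.
        { print chr, cur_start, cur_end; chr = $1; cur_start = $2; cur_end = $3 }
        END { if (NR > 0) print chr, cur_start, cur_end }'
}
```

For example, `merge_bed regions.bed > regions.merged.bed` (file names are placeholders) would produce a file suitable as input to region_motif_intersect.r.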
## Installation
Directions for installing the region-motif-compare tools on a personal computer
and in a local Galaxy instance.

1. Follow the online directions to install a local instance of Galaxy (getgalaxy.org).
Optionally, follow the directions to install Refinery (refinery-platform.readthedocs.org).

2. Clone the GitHub repository to your local computer.
````
git clone https://github.com/parklab/refinery-galaxy-tools.git
cd refinery-galaxy-tools/region-motif-compare
````

3. Make a directory for the tools in the Galaxy instance. This serves as a category
for the tool in the tools sidebar. You can also place the tools in an existing
or alternatively named directory, but remember to update tool_conf.xml to reflect this.
````
cd ~/galaxy-dist/tools/
mkdir my_tools
cd my_tools
````

4. Copy over the ".r" and ".xml" files, as well as `region_motif_db` and `region_motif_lib`.
````
# Run from the directory where the repository was cloned in Step 2
cd refinery-galaxy-tools/region-motif-compare
cp *.r ~/galaxy-dist/tools/my_tools
cp *.xml ~/galaxy-dist/tools/my_tools
cp -r region_motif_db ~/galaxy-dist/tools/my_tools
cp -r region_motif_lib ~/galaxy-dist/tools/my_tools
````

5. Edit `~/galaxy-dist/tool_conf.xml` to reflect the addition of the new tools.
Add the following lines within the `<toolbox>` tags. If in Step 3 you copied
the tools to a different directory than `my_tools`, edit the code snippet
to reflect the correct path name.
````
<section id="mTools" name="My Tools">
  <tool file="my_tools/region_motif_intersect.xml" />
  <tool file="my_tools/region_motif_compare.xml" />
</section>
````

6. Download the motif databases and place them into `region_motif_db`.
````
cd ~/galaxy-dist/tools/my_tools/region_motif_db
wget ????/pouya_motifs.bed.bgz
wget ????/pouya_motifs.bed.bgz.tbi
wget ????/jaspar_jolma_motifs.bed.bgz
wget ????/jaspar_jolma_motifs.bed.bgz.tbi
wget ????/mm9_motifs.bed.bgz
wget ????/mm9_motifs.bed.bgz.tbi
````

7. Install the Bioconductor R package Rsamtools for dealing with tabix files.
On Bioconductor 3.8 and later, `biocLite` has been replaced by `BiocManager::install("Rsamtools")`.
````
$ R
> source("http://bioconductor.org/biocLite.R")
> biocLite("Rsamtools")
````

8. If in Step 3 you copied the tools to an existing directory or an alternatively
named directory, you must edit the following file paths.
In `region_motif_intersect.r` and `region_motif_compare.r` edit `commonDir`:
````
# Replace this line
commonDir = concat(workingDir, "/tools/my_tools")
# With this edited line
commonDir = concat(workingDir, "<relative_path_from_galaxy_root>/<tool_directory>")
````
In addition, edit `region_motif_intersect.xml` and `region_motif_compare.xml` to
reflect the path of the tools relative to the Galaxy root directory.
````
<command interpreter="bash">
/usr/bin/R --slave --vanilla -f $GALAXY_ROOT_DIR/<path_to_tools>/region_motif_intersect.r --args $GALAXY_ROOT_DIR $db_type $in_bed $out_tab
</command>
````
````
<command interpreter="bash">
/usr/bin/R --slave --vanilla -f $GALAXY_ROOT_DIR/<path_to_tools>/region_motif_compare.r --args $GALAXY_ROOT_DIR $db_type $in_tab_1 $in_tab_2 $out_enriched $out_depleted $out_plots
</command>
````

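After the installation steps, a quick check that everything landed in the tool directory can save a failed Galaxy job later. The sketch below is an illustration only: `check_motif_install` is a hypothetical helper, and the expected file names are taken from the steps above. It reports missing scripts, XML files, directories, and any `.bed.bgz` database that lacks its `.tbi` index.

```shell
#!/usr/bin/env bash
# check_motif_install (hypothetical helper, not part of the tool suite):
# report files that the installation steps are expected to have put in
# place. Prints one line per problem; returns non-zero if anything is missing.
check_motif_install() {
    local tool_dir="$1" status=0 f db
    for f in region_motif_intersect.r region_motif_compare.r \
             region_motif_intersect.xml region_motif_compare.xml; do
        [ -f "$tool_dir/$f" ] || { echo "missing file: $f"; status=1; }
    done
    for f in region_motif_db region_motif_lib; do
        [ -d "$tool_dir/$f" ] || { echo "missing directory: $f"; status=1; }
    done
    # Every bgzip-compressed database needs its tabix index next to it.
    for db in "$tool_dir"/region_motif_db/*.bed.bgz; do
        [ -e "$db" ] || continue   # the glob matched nothing
        [ -f "$db.tbi" ] || { echo "missing index: $(basename "$db").tbi"; status=1; }
    done
    return "$status"
}
```

With the default layout from Step 3, this would be called as `check_motif_install ~/galaxy-dist/tools/my_tools`.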
## Running the Tools
### Running from Galaxy
1. To run the tools as workflows, import the .ga workflows included in the GitHub
repository via the Galaxy workflow user interface. Then, upload and select two input BED files.

2. To run the tools individually, select the tool from the tools sidebar, provide
a BED file (Region Motif Intersect) or two tsv files (Region Motif Compare), and
select a query database from the dropdown menu.

### Running from Refinery
1. Import the .ga workflows into a local Galaxy instance. These workflows have
already been annotated for Refinery.

2. Add the local Galaxy instance to the Refinery installation.
````
python manage.py create_workflowengine <instance_id> "<group_name>"
````

3. Import the Galaxy workflows into Refinery.
````
python manage.py import_workflows
````

4. Run the tools from the Refinery user interface.

### Running as Command Line Tools
You can also run the tools from the command line, as in the example below.
More information can be found in the headers of the R source files.
````
cd ~/galaxy-dist/tools/my_tools
R --slave --vanilla -f region_motif_intersect.r --args ~/galaxy-dist p <path_to_bed_file> <path_to_output_tsv>
R --slave --vanilla -f region_motif_compare.r --args ~/galaxy-dist p <path_to_region1_counts> <path_to_region2_counts> <enriched_motifs_output_tsv> <depleted_motifs_output_tsv> <plots_png>
````

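A full comparison needs three invocations: one intersect run per BED file, then one compare run on the two count files. The sketch below shows how these could be chained, assuming the argument order from the example above. `print_motif_pipeline` and the output file names built from the prefix are hypothetical, and the function only prints the commands so they can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# print_motif_pipeline (hypothetical helper, not part of the tool suite):
# print, without running them, the three R invocations needed to compare
# two BED files.
# Arguments: Galaxy root, database type, BED file 1, BED file 2, output prefix.
# Assumes none of the arguments contain whitespace.
print_motif_pipeline() {
    local root="$1" db="$2" bed1="$3" bed2="$4" prefix="$5"
    local r="R --slave --vanilla -f"
    # One intersect run per set of regions.
    echo "$r region_motif_intersect.r --args $root $db $bed1 ${prefix}_counts1.tsv"
    echo "$r region_motif_intersect.r --args $root $db $bed2 ${prefix}_counts2.tsv"
    # One compare run on the two count files.
    echo "$r region_motif_compare.r --args $root $db" \
         "${prefix}_counts1.tsv ${prefix}_counts2.tsv" \
         "${prefix}_enriched.tsv ${prefix}_depleted.tsv ${prefix}_plots.png"
}
```

Once the printed commands look right, they can be executed from the tool directory by piping the output to `sh`.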
## Interpreting Results
### Motif Database and Result Notation
TF motif positions for hg19 and mm9 were curated from three databases:

1. ENCODE TF motif database "Pouya" (http://compbio.mit.edu/encode-motifs/)
2. JASPAR database "Jaspar" (http://jaspar.genereg.net/)
3. DNA binding specificities of human transcription factors "Jolma" (http://www.ncbi.nlm.nih.gov/pubmed/23332764)

For ENCODE TF motifs, the genomic locations were taken straight from the database.
In addition, position weight matrices (pwms) were obtained by averaging the
sites in the genome for a motif. These are labeled with `_8mer_`.
Fake motifs were also generated by shuffling the pwms of actual motifs and
mapping them to the genome; these are labeled with `_8mer_C`.

For JASPAR and Jolma motifs, MAST was run to determine genomic locations from the
provided pwms. The motif alignment thresholds were set to the top 5k, 20k, 100k, and
250k sites, with redundant maps removed when the top 30k sites have the same score.
These are labeled with `_t5000` and likewise.


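When scanning a results file, it can help to tag each motif name by the label convention described above. The following is a minimal sketch; `classify_motif_labels` and the category names it prints are hypothetical. It also assumes that the labels appear as plain substrings of the motif name and that `_8mer_C` always marks a shuffled motif, which should be checked against the actual database names.

```shell
#!/usr/bin/env bash
# classify_motif_labels (hypothetical helper, not part of the tool suite):
# read motif names on stdin, one per line, and print "name<TAB>category".
# Assumes the labels occur as plain substrings of the motif name.
classify_motif_labels() {
    local name kind
    while IFS= read -r name; do
        case "$name" in
            # Checked first, because "_8mer_C" also contains "_8mer_".
            *_8mer_C*) kind="shuffled_control" ;;
            *_8mer_*)  kind="encode_averaged_pwm" ;;
            *_t[0-9]*) kind="score_threshold" ;;
            *)         kind="other" ;;
        esac
        printf '%s\t%s\n' "$name" "$kind"
    done
}
```

If the motif name is the first column of an output tsv, `cut -f1 <enriched_motifs_output_tsv> | classify_motif_labels` would annotate each row.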
## Motif Tabix File Creation
Starting with a BED file of motif positions (at minimum chr, start, end), follow
the steps below to generate a tabix file that can be placed in `region_motif_db` and
used by the tools.

1. Download Tabix (http://sourceforge.net/projects/samtools/files/tabix/) and install.
Add the `tabix` and `bgzip` binaries to your `PATH`.
````
tar -xvjf tabix-0.2.6.tar.bz2
cd tabix-0.2.6
make
````

2. Construct bgzip files and index files.
````
cd ~/galaxy-dist/tools/my_tools/region_motif_db
(grep ^"#" jaspar_motifs.bed; grep -v ^"#" jaspar_motifs.bed | sort -k1,1 -k2,2n) | bgzip > jaspar_motifs.bed.bgz
tabix -p bed jaspar_motifs.bed.bgz # this generates jaspar_motifs.bed.bgz.tbi
````

3. Add the path to `jaspar_motifs.bed.bgz` to the selection options for the variable
`motifDB` in `region_motif_intersect.r` and `region_motif_compare.r`. To enable
the new database in Galaxy, you will have to edit the XML files for both tools.
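Tabix indexing requires input that is sorted by chromosome and start position, which is why the pipeline in Step 2 sorts before compressing. For a BED file that was sorted elsewhere, the sketch below checks the ordering without rewriting the file. `bed_is_sorted` is a hypothetical helper, and it relies on the `-s` option available in GNU and BSD `sort`.

```shell
#!/usr/bin/env bash
# bed_is_sorted (hypothetical helper, not part of the tool suite):
# succeed if the non-header lines of a BED file are ordered by chromosome
# (byte order) and then by numeric start position.
bed_is_sorted() {
    # -c only checks the order; -s disables the whole-line tie-break so that
    # rows with identical chr and start are accepted in any order.
    grep -v '^#' "$1" | LC_ALL=C sort -c -s -k1,1 -k2,2n 2>/dev/null
}
```

A typical use would be `bed_is_sorted jaspar_motifs.bed || echo "sort the file before running bgzip"`.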