Mercurial > repos > jeremyjliu > region_motif_enrichment
diff README.md @ 21:2c909bfdd090 draft
Uploaded
author | jeremyjliu |
---|---|
date | Wed, 12 Nov 2014 15:25:44 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README.md Wed Nov 12 15:25:44 2014 -0500 @@ -0,0 +1,204 @@ +# Region-Motif-Compare Tools +Version 1.1 Released 2014 +Park Laboratory +Center for Biomedical Informatics +Harvard University + +Contact +Jeremy Liu (jeremy.liu@yale.edu) +Nils Gehlenborg (nils@hms.harvard.edu) + +## Overview +### Structure +The tool suite consists of: + +1. Two Rscripts: region_motif_compare.r and region_motif_intersect.r +2. Two Xml Files: region_motif_compare.xml and region_motif_intersect.xml +3. Motif Database Directory: region_motif_db +4. Dependency Library Directory: region_motif_lib +5. Galaxy Workflows: Files with suffix ".ga" that can be imported into the local +Galaxy instance after installation of the tool. + +### Description +1. **region_motif_intersect.r** (1 bed -> 1 tsv): +Takes one bed file of regions as input. Then it calculates +the number of intersections of the regions and the motifs. region_motifs_intersect.r +outputs a tab separated values (tsv) file of motif names and intersection counts. +**Important Note:** region_motif_intersect.r makes no assumptions about the nature +of the input regions. For example, if overlapping regions are inputted, motifs that +intersect the overlap will be double counted. Thus, it is recommended that regions +be merged before using this tool, using the merge tool in the Galaxy toolshed. + +2. **region_motif_compare.r** (2 tsv -> 2 tsv & 1 png): +Takes as input two tsv files of motifs / regions intersection +counts. These generally originate from running region_motif_intersect.r on two sets +of different regions with the same query motif database. Based on the counts, +region_motif_compare.r then determines the enrichment (or depletion) of certain +motifs across the two regions. This is done by a correcting for the size and gc +content of the region, and applying a Poisson test to the counts. +Then, region_motif_compare.r outputs the most significant enriched or depleted +motifs as a tsv. In addition, the tool outputs a diagnostic plot containing +graphical representations of the motif counts, gc correction curves, and significant +motifs that distinguish the two regions (selected via p value). + +3. **region_motif_db**: Contains motif positions as compressed, indexed tabix files. + +4. **region_motif_lib**: Contains dependencies (i.e. plotting.r) for region_motif_compare.r + +## Installation +Directions for installing the region-motif-compare tools into a personal computer +and a local Galaxy instance. + +1. Follow the online directions to install a local instance of Galaxy (getgalaxy.org). +Optionally, follow the directions to install Refinery (refinery-platform.readthedocs.org) + +2. Clone the github repository to your local computer + ```` + git clone https://github.com/parklab/refinery-galaxy-tools.git + cd refinery-galaxy-tools/region-motif-compare + ```` + +3. Make a directory for the tools in Galaxy instance. This serves as a category +for the tool in the tools sidebar. You can also place the tools in an existing +or alternatively named directory, but remember to update tool_conf.xml to reflect this. + ```` + cd ~/galaxy-dist/tools/ + mkdir my_tools + cd my_tools + ```` + +4. Copy over ".r" and ".xml" files, as well as `region_motif_db` and `region_motif_lib` + ```` + cd refinery-galaxy-tools/region-motif-compare + cp *.r ~/galaxy-dist/tools/my_tools + cp *.xml ~/galaxy-dist/tools/my_tools + cp -r region_motif_db ~/galaxy-dist/tools/my_tools + cp -r region_motif_lib ~/galaxy-dist/tools/my_tools + ```` + +5. Edit `~/galaxy-dist/tool_conf.xml` to reflect the addition of the new tools. +Add the following lines within the `<toolbox>` tags. If in Step 3 you copied +the tools to a different directory than `my_tools`, edit the code snippet +to reflect the correct path name. + ```` + <section id="mTools" name="My Tools"> + <tool file="my_tools/region_motif_intersect.xml" /> + <tool file="my_tools/region_motif_compare.xml" /> + </section> + ```` + +6. Download the motif databases and place them into `region_motif_db` + ```` + cd ~/galaxy-dist/tools/my_tools/region_motif_db + wget ????/pouya_motifs.bed.bgz + wget ????/pouya_motifs.bed.bgz.tbi + wget ????/jaspar_jolma_motifs.bed.bgz + wget ????/jaspar_jolma_motifs.bed.bgz.tbi + wget ????/mm9_motifs.bed.bgz + wget ????/mm9_motifs.bed.bgz.tbi + ```` + +7. Install the Bioconductor R package Rsamtools for dealing with tabix files + ``` + $ R + > source("http://bioconductor.org/biocLite.R") + > biocLite("Rsamtools") + ```` + +8. If in Step 3 you copied the tools to an existing directory or an alternatively +named directory, you must edit the following file paths. + In `region_motif_intersect.r` and `region_motif_compare.r` edit `commonDir`: + ```` + # Replace this line + commonDir = concat(workingDir, "/tools/my_tools") + # With this edited line + commonDir = concat(workingDir, "<relative_path_from_galaxy_root>/<tool_directory>") + ```` + In addition, edit `region_motif_intersect.xml` and `region_motif_compare.xml` to + reflect the path of the tools relative to the galaxy root directory. + ```` + <command interpreter="bash"> + /usr/bin/R --slave --vanilla -f $GALAXY_ROOT_DIR/<path_to_tools>/region_motif_intersect.r --args $GALAXY_ROOT_DIR $db_type $in_bed $out_tab + </command> + ```` + ```` + <command interpreter="bash"> + /usr/bin/R --slave --vanilla -f $GALAXY_ROOT_DIR/<path_to_tools>/region_motif_compare.r --args $GALAXY_ROOT_DIR $db_type $in_tab_1 $in_tab_2 $out_enriched $out_depleted $out_plots + </command> + ```` + +## Running the Tools +### Running from Galaxy +1. To run the tools as workflows, import the .ga workflows included in the github +via the Galaxy workflow user interface. Then, upload and select two input BED files. + +2. To run the tools individually, select the tool from the tools toolbar, provide +a BED file (Region Motif Intersect) or two tsv files (Region Motif Compare), and +select a query database from the dropdown menu. + +### Running from Refinery +1. Import the .ga workflows into a local Galaxy instance. These workflows have +already been annotated for Refinery. + +2. Add the local Galaxy instance to the Refinery installation. + ```` + python manage.py create_workflowengine <instance_id> "<group_name>" + ```` + +3. Import the Galaxy workflows into Refinery. + ```` + python manage.py import_workflows + ```` +4. Run the tools from the Refinery user interface. + +### Running as Command Line Tools +You can also run the tools from the command line, an example of which is shown below. +More information is found in the headers of the r source files. +```` +cd ~/galaxy-dist/tools/my_tools +R --slave --vanilla -f region_motif_intersect.r --args ~/galaxy-dist p <path_to_bed_file> <path_to_output_tsv> +R --slave --vanilla -f region_motif_compare.r --args ~/galaxy-dist p <path_to_region1_counts> <path_to_region2_counts> <enriched_motifs_output_tsv> <depleted_motifs_output_tsv> <plots_png> +```` + +## Interpreting Results +### Motif Database and Result Notation +TF motif positions for hg19 and mm9 were curated from three databases: +ENCODE TF motif database "Pouya" (http://compbio.mit.edu/encode-motifs/) +JASPAR database "Jaspar" (http://jaspar.genereg.net/) +DNA binding specificities of human transciption factors "Jolma" (http://www.ncbi.nlm.nih.gov/pubmed/23332764) + +For ENCODE TF motifs, the genomic locations were taken straight from the database. +In addition, position weight matrices (pwms) were obtained by averaging the +sites in the genome for a motif. These are labeled with "\_8mer\_". +Fake motifs were also generated, by shuffling the pwms of actual motifs and +mapping to the genome and are labeled with "_8mer_C". + +For JASPAR and Jolma motifs, mast was run to determine genomic locations from the +provided pwms. The motif alignmment thresholds were set to the top 5k, 20k, 100k, and +250k sites and the redundant maps removed with the top 30k sites have the same score. +These are labeled with "_t5000" and likewise. + + +## Motif Tabix File Creation +Starting with a BED file of motif positions (minimal chr, start, end), follow +below to generate a tabix file that can be placed in `region_motif_db` and +used by the tools. + +1. Download Tabix (http://sourceforge.net/projects/samtools/files/tabix/) and install. +Add `tabix` and `bgzip` binaries to your file path. + ```` +tar -xvjf tabix-0.2.6.tar.bz2 +cd tabix-0.2.6 +make + ```` + +2. Construct bgzip files and index files. + ```` +cd ~/galaxy-dist/tools/my_tools/region_motif/db +(grep ^"#" jaspar_motifs.bed; grep -v ^"#" jaspar_motifs.bed | sort -k1,1 -k2,2n) | bgzip > jaspa_motifs.bed.bgz +tabix -p bed jaspar_motifs.bed.bgz # this generates jaspar_motifs.bed.bgz.tbi + ```` + +3. Add the path to `jaspar_motifs.bed.bgz` to the selection options for the variable +`motifDB` in `region_motif_intersect.r` and `region_motif_compare.r`. To enable +the new database in Galaxy, you will have to edit the xml files for both tools.