Mercurial > repos > jjohnson > defuse8
changeset 0:63f23d5db27c draft
planemo upload for repository https://github.com/jj-umn/galaxytools/tree/master/defuse commit 2c2fd38cb761ec57bac7a0bd376e6aa2b88265d0-dirty
author | jjohnson |
---|---|
date | Mon, 20 May 2019 15:25:03 -0400 |
parents | |
children | e0a835a2f74a |
files | README config_sub.sh create_reference_dataset.xml data_manager_conf.xml datamanager_create_reference.py datamanager_create_reference.xml datatypes_conf.xml defuse.xml defuse_bamfastq.xml defuse_results_to_vcf.py defuse_results_to_vcf.xml defuse_trinity_analysis.py defuse_trinity_analysis.xml macros.xml make_html.sh test-data/mm10_results.filtered.tsv test-data/mm10_results.filtered.vcf test-data/tophat_out2h.bam tool-data/defuse_reference.loc.sample tool_data_table_conf.xml.sample tool_dependencies.xml |
diffstat | 21 files changed, 2665 insertions(+), 0 deletions(-) [+] |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,69 @@ +The DeFuse galaxy tool is based on DeFuse_Version_0.6.2 +http://sourceforge.net/apps/mediawiki/defuse/index.php?title=Main_Page +https://bitbucket.org/dranew/defuse + +DeFuse is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion. + + +Manual: +http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.6.2 + +The included tool_dependencies.xml will download and install the defuse code. +It will set the environment variable: "DEFUSE_PATH" to the location of the defuse install. +The tool_dependencies.xml also has the download for bowtie. + + +The defuse.pl command relies on a configuration file to specifiy options, the location of reference data, and other applications that it depends upon: bowtie, bowtie-build, samtools, gmap, blat, fatotwobit, R, and Rscript. + +The DeFuse galaxy tool can either construct the config.txt file that is mentioned in the defuse manual, or select an existing config.txt file in the users history. +When constructing the config.txt file, the DeFuse tool uses the values selected in: tool-data/defuse.loc +The dictionary field in the tool-data/defuse.loc can be used to set fields in the config.txt file, including the site specific location of reference data and the paths to the other application binaries. +The "Defuse parameter settings" are used to alter options in the config.txt file. + +The DeFuse galaxy tool also generates a bash script to run defuse. +That script will attempt to edit the config.txt file to specifiy any unset paths to applications that defuse relies upon: +bowtie, bowtie-build, samtools, blat, fatotwobit, R, and Rscript +The script uses the using the shell "which" command to discover the application path, so the required applications should in PATH environment variable. + + +Generate Reference Datasets as described in the Manual: + +Reference Dataset +The reference dataset setup process has been simplified as of deFuse 0.6.0, and deFuse now automatically downloads all required files. +The create_reference_dataset.pl script will download the genome and other source files, and build any derivative files including bowtie indices, gmap indices, and 2bit files. Run the following command. Expect this step to take at least 12 hours. +create_reference_dataset.pl -c config.txt + +These datasets should be referenced in the tool-data/defuse.loc file. + +The create_reference_dataset will run the create_reference_dataset.pl script to generate deFuse genome reference data in a galaxy dataset. +This should me made available in the future as a Galaxy DataManager. + + +Galaxy will try to auto-install dependencies: + +External Tools ( http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.6.2 ) +deFuse relies on other publically available tools as part of its pipeline. Some of these tools are not included with the deFuse download. Obtain these tools as detailed below. +Download samtools +The latest version of samtools can be downloaded from sourceforge: https://sourceforge.net/projects/samtools/files/samtools. +Set the samtools_bin entry in config.txt to the fully qualified paths of the samtools binary. +Download bowtie +The latest version of bowtie can be downloaded from sourceforge: http://sourceforge.net/projects/bowtie-bio/files/bowtie/. deFuse has been tested on version 0.12.5. +Set the bowtie_bin and bowtie_build_bin entries in config.txt to the fully qualified paths of the bowtie and bowtie-build binaries. +Download blat and faToTwoBit +The latest blat tool suite can be downloaded from the ucsc website: http://hgdownload.cse.ucsc.edu/admin/exe/. Download blat and faToTwoBit and set the blat_bin and fatotwobit_bin entries in config.txt to the fully qualified paths of the blat and faToTwoBit binaries. +Download GMAP +The latest version of GMAP can be downloaded here http://research-pub.gene.com/gmap/. Build with a default configuration. Do not worry about the `--with-gmapdb` build flag, deFuse will request a specific directory for the database anyway. +Download R +The latest version of R can be downloaded from the R project website: http://www.r-project.org/. Install R and then locate the R and Rscript executables, and set the r_bin and rscript_bin entries in config.txt to the path of those executables. +Install the ada package. Run R, then at the prompt type install.packages("ada") +Reference Dataset +The reference dataset setup process has been simplified as of deFuse 0.6.0, and deFuse now automatically downloads all required files. +The create_reference_dataset.pl script will download the genome and other source files, and build any derivative files including bowtie indices, gmap indices, and 2bit files. Run the following command. Expect this step to take at least 12 hours. +create_reference_dataset.pl -c config.txt + + +defuse_trinity_analysis.py - Validating deFuse predictions using Trinity de novo assembled transcripts + +DeFuse provides a total fusion sequence of 200-500 nucleotides (nts) around the fusion breakpoint. This may be insufficient to predict the effect of the fusion on protein production. To get a view of the full transcript containing the fusion, Trinity de novo transcripts from the RNA-seq data are compared with the deFuse fusion sequences using a subsequence around the deFuse indetified fusion breakpoint. The Trinity transcriptToOrfs output provides potential proteins from the projected fusion transcript. + +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/config_sub.sh Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,29 @@ +#!/bin/bash +## substitute pathnames into config file +#!/bin/bash +SAMTOOLS_BIN=`which samtools` +BOWTIE_BIN=`which bowtie` +BOWTIE_BUILD_BIN=`which bowtie-build` +BLAT_BIN=`which blat` +FATOTWOBIT_BIN=`which faToTwoBit` +GMAP_BIN=`which gmap` +GMAP_BUILD_BIN=`which gmap_build` +GMAP_SETUP_BIN=$GMAP_BUILD_BIN +R_BIN=`which R` +RSCRIPT_BIN=`which Rscript` +DEFUSE_BIN=`which defuse_run.pl` +DEFUSE_PATH=`python -c "import os.path; print(os.path.dirname(os.path.dirname(os.path.realpath(\"$DEFUSE_BIN\"))))"` +cat $1 | sed "s#__DEFUSE_PATH__#${DEFUSE_PATH}#" | \ +sed "s#__SAMTOOLS_BIN__#${SAMTOOLS_BIN}#" | \ +sed "s#__BOWTIE_BIN__#${BOWTIE_BIN}#" | \ +sed "s#__BOWTIE_BUILD_BIN__#${BOWTIE_BUILD_BIN}#" | \ +sed "s#__BLAT_BIN__#${BLAT_BIN}#"| \ +sed "s#__FATOTWOBIT_BIN__#${FATOTWOBIT_BIN}#" | \ +sed "s#__GMAP_BIN__#${GMAP_BIN}#" | \ +sed "s#^gmap_setup_bin#gmap_build_bin#" | \ +sed "s#__GMAP_SETUP_BIN__#${GMAP_SETUP_BIN}#" | \ +sed "s#__GMAP_BUILD_BIN__#${GMAP_BUILD_BIN}#" | \ +sed "s#__R_BIN__#${R_BIN}#" | \ +sed "s#__RSCRIPT_BIN__#${RSCRIPT_BIN}#" > $2 +export DATASET_DIRECTORY=`grep '^dataset_directory' $1 | awk '{print \$NF}'` +echo "$DATASET_DIRECTORY"
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/create_reference_dataset.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,281 @@ +<tool id="create_defuse_reference" name="Create DeFuse Reference" version="@DEFUSE_VERSION@.1"> + <description>create a defuse reference from Ensembl and UCSC sources</description> + <macros> + <import>macros.xml</import> + </macros> + <requirements> + <expand macro="defuse_requirement" /> + </requirements> + <command detect_errors="aggressive"><![CDATA[ + mkdir -p $config_txt.dataset.extra_files_path && + ## Put executable paths in config file + $__tool_directory__/config_sub.sh $defuse_config $config_txt && + ## defuse_create_ref.pl + defuse_create_ref.pl -c $config_txt + ]]></command> + <configfiles> + <configfile name="defuse_config"> +# +# Configuration file for defuse +# +# Variables that desiganate the PATH to an application, e.g. __SAMTOOLS_BIN__ +# will be set by the runtime script using the ENV PATH +# + +# Directory where the defuse code was unpacked +source_directory = __DEFUSE_PATH__ + +# Organism IDs +ensembl_organism = $genome.ensembl_organism +ensembl_prefix = $genome.ensembl_prefix +ensembl_version = $genome.ensembl_version +ensembl_genome_version = $genome.ensembl_genome_version +ucsc_genome_version = $genome.ucsc_genome_version +ncbi_organism = $genome.ncbi_organism +ncbi_prefix = $genome.ncbi_prefix + +# Directory where you want your dataset +dataset_directory = $config_txt.dataset.extra_files_path + +#raw +# Input genome and gene models +gene_models = $(dataset_directory)/$(ensembl_prefix).$(ensembl_genome_version).$(ensembl_version).gtf +genome_fasta = $(dataset_directory)/$(ensembl_prefix).$(ensembl_genome_version).$(ensembl_version).dna.chromosomes.fa + +# Repeat table from ucsc genome browser +repeats_filename = $(dataset_directory)/repeats.txt + +# EST info downloaded from ucsc genome browser +est_fasta = $(dataset_directory)/est.fa +est_alignments = $(dataset_directory)/intronEst.txt + +# Unigene clusters downloaded from ncbi +unigene_fasta = $(dataset_directory)/$(ncbi_prefix).seq.uniq +#end raw + +# Paths to external tools +samtools_bin = __SAMTOOLS_BIN__ +bowtie_bin = __BOWTIE_BIN__ +bowtie_build_bin = __BOWTIE_BUILD_BIN__ +blat_bin = __BLAT_BIN__ +fatotwobit_bin = __FATOTWOBIT_BIN__ +gmap_bin = __GMAP_BIN__ +gmap_setup_bin = __GMAP_SETUP_BIN__ +r_bin = __R_BIN__ +rscript_bin = __RSCRIPT_BIN__ + +#raw +# Directory where you want your dataset +gmap_index_directory = $(dataset_directory)/gmap +#end raw + +#raw +# Dataset files +dataset_prefix = $(dataset_directory)/defuse +chromosome_prefix = $(dataset_prefix).dna.chromosomes +exons_fasta = $(dataset_prefix).exons.fa +cds_fasta = $(dataset_prefix).cds.fa +cdna_regions = $(dataset_prefix).cdna.regions +cdna_fasta = $(dataset_prefix).cdna.fa +reference_fasta = $(dataset_prefix).reference.fa +rrna_fasta = $(dataset_prefix).rrna.fa +ig_gene_list = $(dataset_prefix).ig.gene.list +repeats_regions = $(dataset_directory)/repeats.regions +est_split_fasta1 = $(dataset_directory)/est.1.fa +est_split_fasta2 = $(dataset_directory)/est.2.fa +est_split_fasta3 = $(dataset_directory)/est.3.fa +est_split_fasta4 = $(dataset_directory)/est.4.fa +est_split_fasta5 = $(dataset_directory)/est.5.fa +est_split_fasta6 = $(dataset_directory)/est.6.fa +est_split_fasta7 = $(dataset_directory)/est.7.fa +est_split_fasta8 = $(dataset_directory)/est.8.fa +est_split_fasta9 = $(dataset_directory)/est.9.fa + +# Fasta files with bowtie indices for prefiltering reads for concordantly mapping pairs +prefilter1 = $(unigene_fasta) + +# deFuse scripts and tools +scripts_directory = $(source_directory)/scripts +tools_directory = $(source_directory)/tools +data_directory = $(source_directory)/data +#end raw + +# Parameters for building the dataset +chromosomes = $genome.chromosomes +mt_chromosome = $genome.mt_chromosome +gene_sources = $genome.gene_sources +ig_gene_sources = $genome.ig_gene_sources +rrna_gene_sources = $genome.rrna_gene_sources +gene_biotypes = $genome.gene_sources +ig_gene_biotypes = $genome.ig_gene_sources +rrna_gene_biotypes = $genome.rrna_gene_sources + +#raw +# Remove temp files +remove_job_files = yes +remove_job_temp_files = yes +#end raw + </configfile> + </configfiles> + <inputs> + <conditional name="genome"> + <param name="choice" type="select" label="Select a Genome Build"> + <option value="GRCh38">Homo_sapiens GRCh38 hg38</option> + <option value="GRCh37">Homo_sapiens GRCh37 hg19</option> + <option value="NCBI36">Homo_sapiens NCBI36 hg18</option> + <option value="GRCm38">Mus_musculus GRCm38 mm10</option> + <option value="NCBIM37">Mus_musculus NCBIM37 mm9</option> + <option value="Rnor_5.0">Rattus_norvegicus Rnor_5.0 rn5</option> + <option value="user_specified">User specified</option> + </param> + <when value="GRCh38"> + <param name="ensembl_organism" type="hidden" value="homo_sapiens"/> + <param name="ensembl_prefix" type="hidden" value="Homo_sapiens"/> + <param name="ensembl_genome_version" type="hidden" value="GRCh38"/> + <param name="ensembl_version" type="hidden" value="80"/> + <param name="ncbi_organism" type="hidden" value="Homo_sapiens"/> + <param name="ncbi_prefix" type="hidden" value="Hs"/> + <param name="ucsc_genome_version" type="hidden" value="hg38"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="GRCh37"> + <param name="ensembl_organism" type="hidden" value="homo_sapiens"/> + <param name="ensembl_prefix" type="hidden" value="Homo_sapiens"/> + <param name="ensembl_genome_version" type="hidden" value="GRCh37"/> + <param name="ensembl_version" type="hidden" value="71"/> + <param name="ncbi_organism" type="hidden" value="Homo_sapiens"/> + <param name="ncbi_prefix" type="hidden" value="Hs"/> + <param name="ucsc_genome_version" type="hidden" value="hg19"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="NCBI36"> + <param name="ensembl_organism" type="hidden" value="homo_sapiens"/> + <param name="ensembl_prefix" type="hidden" value="Homo_sapiens"/> + <param name="ensembl_genome_version" type="hidden" value="NCBI36"/> + <param name="ensembl_version" type="hidden" value="54"/> + <param name="ncbi_organism" type="hidden" value="Homo_sapiens"/> + <param name="ncbi_prefix" type="hidden" value="Hs"/> + <param name="ucsc_genome_version" type="hidden" value="hg18"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="GRCm38"> + <param name="ensembl_organism" type="hidden" value="mus_musculus"/> + <param name="ensembl_prefix" type="hidden" value="Mus_musculus"/> + <param name="ensembl_genome_version" type="hidden" value="GRCm38"/> + <param name="ensembl_version" type="hidden" value="71"/> + <param name="ncbi_organism" type="hidden" value="Mus_musculus"/> + <param name="ncbi_prefix" type="hidden" value="Mm"/> + <param name="ucsc_genome_version" type="hidden" value="mm10"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="NCBIM37"> + <param name="ensembl_organism" type="hidden" value="mus_musculus"/> + <param name="ensembl_prefix" type="hidden" value="Mus_musculus"/> + <param name="ensembl_genome_version" type="hidden" value="NCBIM37"/> + <param name="ensembl_version" type="hidden" value="67"/> + <param name="ncbi_organism" type="hidden" value="Mus_musculus"/> + <param name="ncbi_prefix" type="hidden" value="Mm"/> + <param name="ucsc_genome_version" type="hidden" value="mm9"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="Rnor_5.0"> + <param name="ensembl_organism" type="hidden" value="rattus_norvegicus"/> + <param name="ensembl_prefix" type="hidden" value="Rattus_norvegicus"/> + <param name="ensembl_genome_version" type="hidden" value="Rnor_5.0"/> + <param name="ensembl_version" type="hidden" value="71"/> + <param name="ncbi_organism" type="hidden" value="Rattus_norvegicus"/> + <param name="ncbi_prefix" type="hidden" value="Rn"/> + <param name="ucsc_genome_version" type="hidden" value="rn5"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="user_specified"> + <param name="ensembl_organism" type="text" value="" label="Ensembl Organism Name"> + <help> + Examples: homo_sapiens, mus_musculus, rattus_norvegicus + ftp://ftp.ensembl.org/pub/release-$ensembl_version/fasta/$ensembl_organism/dna/$ensembl_prefix.$ensembl_genome_version.$ensembl_version.dna.chromosome.$chromosome.fa.gz + </help> + </param> + <param name="ensembl_prefix" type="text" value="" label="Ensembl Organism prefix" help="Examples: Homo_sapiens, Mus_musculus, Rattus_norvegicus"/> + <param name="ensembl_genome_version" type="text" value="" label="Ensembl Genome Version" help="Examples: GRCh37, GRCm38, Rnor_5.0"/> + <param name="ensembl_version" type="integer" value="" label="Ensembl Release Version" help="Example: 71"/> + <param name="ncbi_organism" type="text" value="" label="NCBI Organism Name" help="Examples: Homo_sapiens, Mus_musculus, Rattus_norvegicus"/> + <param name="ncbi_prefix" type="text" value="" label="NCBI Organism Unigene prefix" help="Examples: Hs, Mm, Rn"/> + <param name="ucsc_genome_version" type="text" value="" label="UCSC Genome Version" help="Examples: hg19, mm10, rn5"/> + <param name="chromosomes" type="text" value="" label="Chromosomes for Ensembl genome build" > + <help> Examples: + Homo_sapiens: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT + Mus_musculus: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT + Rattus_norvegicus: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,MT + ( ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/ ) + </help> + </param> + <param name="mt_chromosome" type="text" value="MT" label="Ensembl Mitochonrial Chromosome name" /> + <param name="gene_sources" type="text" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding" label="Gene sources" /> + <param name="ig_gene_sources" type="text" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene" label="IG Gene sources" /> + <param name="rrna_gene_sources" type="text" value="Mt_rRNA,rRNA,rRNA_pseudogene" label="Ribosomal Gene sources" /> + </when> + </conditional> + </inputs> + + <outputs> + <data format="defuse.conf" name="config_txt" label="${tool.name} on ${genome.ensembl_genome_version} : config.txt"/> + </outputs> + <tests> + </tests> + <help> +**DeFuse** + +DeFuse_ is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion. See the DeFuse_Version_0.6_ manual for details. + +DeFuse uses a Reference Dataset to search for gene fusions. The Reference Dataset is generated from the following sources in DeFuse_Version_0.6_: + - genome_fasta from Ensembl + - gene_models from Ensembl + - repeats_filename from UCSC RepeatMasker rmsk.txt + - est_fasta from UCSC + - est_alignments from UCSC intronEst.txt + - unigene_fasta from NCBI + +The create_defuse_reference Galaxy tool downloads the reference genome and other source files, and builds any derivative files including bowtie indices, gmap indices, and 2bit files. Expect this step to take at least 12 hours. + + +It will generate a config.txt file that can be input into the deFuse Galaxy tool. + +Journal reference: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001138 + +.. _DeFuse: http://sourceforge.net/apps/mediawiki/defuse/index.php?title=Main_Page + +.. _DeFuse_Version_0.6: http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.6.1 + +------ + +**Outputs** + +The galaxy history will contain: the config.txt file that provides DeFuse with the reference data paths. + + </help> + <expand macro="citations"/> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/data_manager_conf.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,25 @@ +<?xml version="1.0"?> +<data_managers> + <data_manager tool_file="datamanager_create_reference.xml" id="data_manager_defuse_reference" > + <data_table name="defuse_reference"> <!-- Defines a Data Table to be modified. --> + <output> <!-- Handle the output of the Data Manager Tool --> + <column name="value" /> <!-- columns that are going to be specified by the Data Manager Tool --> + <column name="dbkey" /> + <column name="name" /> + <column name="path" output_ref="out_file" > <!-- The value of this column will be modified based upon data in "out_file". example value "phiX.fa" --> + <move type="directory"> <!-- Moving a file from the extra files path of "out_file" --> + <!-- <source>${path}</source>--> <!-- out_file.extra_files_path is used as base by default --> <!-- if no source, eg for type=directory, then refers to base --> + <target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${value}/defuse</target> <!-- Target Location to store the file, directories are created as needed --> + </move> + <!-- datamanager_create_reference.py should have copied the defuse config file to the working directory. + so if we put the ${dbkey}.config path in this column, defuse.xml can set the data_directory to this this directory. + --> + <value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${value}/defuse/${value}.config</value_translation> <!-- Store this value in the final Data Table --> + <value_translation type="function">abspath</value_translation> + </column> + </output> + </data_table> + </data_manager> +</data_managers> + +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/datamanager_create_reference.py Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,118 @@ +#!/usr/bin/env python + +import sys +import os +import re +import tempfile +import subprocess +import fileinput +import shutil +import optparse +import urllib2 +from ftplib import FTP +import tarfile + +from galaxy.util.json import from_json_string, to_json_string + + +def stop_err(msg): + sys.stderr.write(msg) + sys.exit(1) + +def get_config_dict(config,dataset_directory=None): + keys = ['dataset_directory','ensembl_organism','ensembl_prefix','ensembl_version','ensembl_genome_version','ucsc_genome_version','ncbi_organism','ncbi_prefix','chromosomes','mt_chromosome','gene_sources','ig_gene_sources','rrna_gene_sources'] + pat = '^([^=]+?)\s*=\s*(.*)$' + config_dict = {} + try: + fh = open(config) + for i,l in enumerate(fh): + line = l.strip() + if line.startswith('#'): + continue + m = re.match(pat,line) + if m and len(m.groups()) == 2: + (k,v) = m.groups() + if k in keys: + config_dict[k] = v + except Exception, e: + stop_err( 'Error parsing %s %s\n' % (config,str( e )) ) + else: + fh.close() + if dataset_directory: + config_dict['dataset_directory'] = dataset_directory + return config_dict + +def run_defuse_script(data_manager_dict, params, target_directory, dbkey, description, config, script): + if not os.path.isdir(target_directory): + os.makedirs(target_directory) + ## Name the config consistently with data_manager_conf.xml + # copy the config file to the target_directory + # when DataManager moves files to there tool-data location, the config will get moved as well, + # and the value_translation in data_manager_conf.xml will tell us the new location + # defuse.xml will use the path to this config file to set the dataset_directory + config_name = '%s.config' % dbkey + defuse_config = os.path.join( target_directory, config_name) + shutil.copyfile(config,defuse_config) + cmd = "/bin/bash %s %s" % (script,target_directory) + # Run + try: + tmp_out = tempfile.NamedTemporaryFile().name + tmp_stdout = open( tmp_out, 'wb' ) + tmp_err = tempfile.NamedTemporaryFile().name + tmp_stderr = open( tmp_err, 'wb' ) + proc = subprocess.Popen( args=cmd, shell=True, cwd=".", stdout=tmp_stdout, stderr=tmp_stderr ) + returncode = proc.wait() + tmp_stderr.close() + # get stderr, allowing for case where it's very large + tmp_stderr = open( tmp_err, 'rb' ) + stderr = '' + buffsize = 1048576 + try: + while True: + stderr += tmp_stderr.read( buffsize ) + if not stderr or len( stderr ) % buffsize != 0: + break + except OverflowError: + pass + tmp_stdout.close() + tmp_stderr.close() + if returncode != 0: + raise Exception, stderr + + # TODO: look for errors in program output. + except Exception, e: + stop_err( 'Error creating defuse reference:\n' + str( e ) ) + config_dict = get_config_dict(config, dataset_directory=target_directory) + data_table_entry = dict(value=dbkey, dbkey=dbkey, name=description, path=config_name) + _add_data_table_entry( data_manager_dict, data_table_entry ) +def _add_data_table_entry( data_manager_dict, data_table_entry ): + data_manager_dict['data_tables'] = data_manager_dict.get( 'data_tables', {} ) + data_manager_dict['data_tables']['defuse_reference'] = data_manager_dict['data_tables'].get( 'defuse_reference', [] ) + data_manager_dict['data_tables']['defuse_reference'].append( data_table_entry ) + return data_manager_dict + +def main(): + #Parse Command Line + parser = optparse.OptionParser() + parser.add_option( '-k', '--dbkey', dest='dbkey', action='store', type="string", default=None, help='dbkey' ) + parser.add_option( '-d', '--description', dest='description', action='store', type="string", default=None, help='description' ) + parser.add_option( '-c', '--defuse_config', dest='defuse_config', action='store', type="string", default=None, help='defuse_config' ) + parser.add_option( '-s', '--defuse_script', dest='defuse_script', action='store', type="string", default=None, help='defuse_script' ) + (options, args) = parser.parse_args() + + filename = args[0] + + params = from_json_string( open( filename ).read() ) + target_directory = params[ 'output_data' ][0]['extra_files_path'] + os.mkdir( target_directory ) + data_manager_dict = {} + + + #Create Defuse Reference Data + run_defuse_script( data_manager_dict, params, target_directory, options.dbkey, options.description,options.defuse_config,options.defuse_script) + + #save info to json file + open( filename, 'wb' ).write( to_json_string( data_manager_dict ) ) + +if __name__ == "__main__": main() +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/datamanager_create_reference.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,298 @@ +<tool id="data_manager_defuse_reference" name="DeFuse Reference DataManager" version="@DEFUSE_VERSION@.1" tool_type="manage_data"> + <description>create a defuse reference from Ensembl and UCSC sources</description> + <macros> + <import>macros.xml</import> + </macros> + <requirements> + <expand macro="defuse_requirement" /> + </requirements> + <command detect_errors="exit_code"><![CDATA[ + python '$__tool_directory__/datamanager_create_reference.py' + --dbkey $genome.ensembl_genome_version + --description "$genome.ensembl_prefix $genome.ensembl_genome_version ($genome.ucsc_genome_version)" + --defuse_config $defuse_config + --defuse_script $defuse_script + $out_file + ]]></command> + <configfiles> + <configfile name="defuse_config"> +# +# Configuration file for defuse +# +# Variables that desiganate the PATH to an application, e.g. __SAMTOOLS_BIN__ +# will be set by the runtime script using the ENV PATH +# + +# Directory where the defuse code was unpacked +source_directory = __DEFUSE_PATH__ + +# Organism IDs +ensembl_organism = $genome.ensembl_organism +ensembl_prefix = $genome.ensembl_prefix +ensembl_version = $genome.ensembl_version +ensembl_genome_version = $genome.ensembl_genome_version +ucsc_genome_version = $genome.ucsc_genome_version +ncbi_organism = $genome.ncbi_organism +ncbi_prefix = $genome.ncbi_prefix + +# Directory where you want your dataset +dataset_directory = __DATASET_DIRECTORY__ + +#raw +# Input genome and gene models +gene_models = $(dataset_directory)/$(ensembl_prefix).$(ensembl_genome_version).$(ensembl_version).gtf +genome_fasta = $(dataset_directory)/$(ensembl_prefix).$(ensembl_genome_version).$(ensembl_version).dna.chromosomes.fa + +# Repeat table from ucsc genome browser +repeats_filename = $(dataset_directory)/repeats.txt + +# EST info downloaded from ucsc genome browser +est_fasta = $(dataset_directory)/est.fa +est_alignments = $(dataset_directory)/intronEst.txt + +# Unigene clusters downloaded from ncbi +unigene_fasta = $(dataset_directory)/$(ncbi_prefix).seq.uniq +#end raw + +# Paths to external tools +samtools_bin = __SAMTOOLS_BIN__ +bowtie_bin = __BOWTIE_BIN__ +bowtie_build_bin = __BOWTIE_BUILD_BIN__ +blat_bin = __BLAT_BIN__ +fatotwobit_bin = __FATOTWOBIT_BIN__ +gmap_bin = __GMAP_BIN__ +gmap_setup_bin = __GMAP_SETUP_BIN__ +gmap_build_bin = __GMAP_BUILD_BIN__ +r_bin = __R_BIN__ +rscript_bin = __RSCRIPT_BIN__ + +#raw +# Directory where you want your dataset +gmap_index_directory = $(dataset_directory)/gmap +#end raw + +#raw +# Dataset files +dataset_prefix = $(dataset_directory)/defuse +chromosome_prefix = $(dataset_prefix).dna.chromosomes +exons_fasta = $(dataset_prefix).exons.fa +cds_fasta = $(dataset_prefix).cds.fa +cdna_regions = $(dataset_prefix).cdna.regions +cdna_fasta = $(dataset_prefix).cdna.fa +reference_fasta = $(dataset_prefix).reference.fa +rrna_fasta = $(dataset_prefix).rrna.fa +ig_gene_list = $(dataset_prefix).ig.gene.list +repeats_regions = $(dataset_directory)/repeats.regions +est_split_fasta1 = $(dataset_directory)/est.1.fa +est_split_fasta2 = $(dataset_directory)/est.2.fa +est_split_fasta3 = $(dataset_directory)/est.3.fa +est_split_fasta4 = $(dataset_directory)/est.4.fa +est_split_fasta5 = $(dataset_directory)/est.5.fa +est_split_fasta6 = $(dataset_directory)/est.6.fa +est_split_fasta7 = $(dataset_directory)/est.7.fa +est_split_fasta8 = $(dataset_directory)/est.8.fa +est_split_fasta9 = $(dataset_directory)/est.9.fa + +# Fasta files with bowtie indices for prefiltering reads for concordantly mapping pairs +prefilter1 = $(unigene_fasta) + +# deFuse scripts and tools +scripts_directory = $(source_directory)/scripts +tools_directory = $(source_directory)/tools +data_directory = $(source_directory)/data +#end raw + +# Parameters for building the dataset +chromosomes = $genome.chromosomes +mt_chromosome = $genome.mt_chromosome +gene_sources = $genome.gene_sources +ig_gene_sources = $genome.ig_gene_sources +rrna_gene_sources = $genome.rrna_gene_sources +gene_biotypes = $genome.gene_sources +ig_gene_biotypes = $genome.ig_gene_sources +rrna_gene_biotypes = $genome.rrna_gene_sources + +#raw +# Remove temp files +remove_job_files = yes +remove_job_temp_files = yes +#end raw + </configfile> + <configfile name="defuse_script">#slurp +#!/bin/bash +## define some things for cheetah proccessing +#set $amp = chr(38) +#set $gt = chr(62) +## substitute pathnames into config file +if `grep __DATASET_DIRECTORY__ $defuse_config ${gt} /dev/null`;then sed -i'.tmp' "s#__DATASET_DIRECTORY__#\$1#" $defuse_config; fi +if `grep __DEFUSE_PATH__ $defuse_config ${gt} /dev/null`;then sed -i'.tmp' "s#__DEFUSE_PATH__#\${DEFUSE_PATH}#" $defuse_config; fi +if `grep __SAMTOOLS_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} SAMTOOLS_BIN=`which samtools`;then sed -i'.tmp' "s#__SAMTOOLS_BIN__#\${SAMTOOLS_BIN}#" $defuse_config; fi +if `grep __BOWTIE_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} BOWTIE_BIN=`which bowtie`;then sed -i'.tmp' "s#__BOWTIE_BIN__#\${BOWTIE_BIN}#" $defuse_config; fi +if `grep __BOWTIE_BUILD_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} BOWTIE_BUILD_BIN=`which bowtie-build`;then sed -i'.tmp' "s#__BOWTIE_BUILD_BIN__#\${BOWTIE_BUILD_BIN}#" $defuse_config; fi +if `grep __BLAT_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} BLAT_BIN=`which blat`;then sed -i'.tmp' "s#__BLAT_BIN__#\${BLAT_BIN}#" $defuse_config; fi +if `grep __FATOTWOBIT_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} FATOTWOBIT_BIN=`which faToTwoBit`;then sed -i'.tmp' "s#__FATOTWOBIT_BIN__#\${FATOTWOBIT_BIN}#" $defuse_config; fi +if `grep __GMAP_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} GMAP_BIN=`which gmap`;then sed -i'.tmp' "s#__GMAP_BIN__#\${GMAP_BIN}#" $defuse_config; fi +if `grep __GMAP_SETUP_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} GMAP_SETUP_BIN=`which gmap_setup`;then sed -i'.tmp' "s#__GMAP_SETUP_BIN__#\${GMAP_SETUP_BIN}#" $defuse_config; fi +if `grep __GMAP_BUILD_BIN__ $defuse_config ${gt} /dev/null` ${amp}${amp} GMAP_BUILD_BIN=`which gmap_build`;then sed -i'.tmp' "s#__GMAP_BUILD_BIN__#\${GMAP_BUILD_BIN}#" $defuse_config; fi +if `grep __GMAP_INDEX_DIR__ $defuse_config ${gt} /dev/null` ${amp}${amp} GMAP_INDEX_DIR=`pwd`/gmap;then sed -i'.tmp' "s#__GMAP_INDEX_DIR__#\${GMAP_INDEX_DIR}#" $defuse_config; fi +## copy config to output +cp $defuse_config \$1/defuse_config.txt +## Run the create_reference_dataset.pl +perl \${DEFUSE_PATH}/scripts/create_reference_dataset.pl -c $defuse_config + </configfile> + </configfiles> + <inputs> + <conditional name="genome"> + <param name="choice" type="select" label="Select a Genome Build"> + <option value="GRCh38">Homo_sapiens GRCh38 hg38</option> + <option value="GRCh37">Homo_sapiens GRCh37 hg19</option> + <option value="NCBI36">Homo_sapiens NCBI36 hg18</option> + <option value="GRCm38">Mus_musculus GRCm38 mm10</option> + <option value="NCBIM37">Mus_musculus NCBIM37 mm9</option> + <option value="Rnor_5.0">Rattus_norvegicus Rnor_5.0 rn5</option> + <option value="user_specified">User specified</option> + </param> + <when value="GRCh38"> + <param name="ensembl_organism" type="hidden" value="homo_sapiens"/> + <param name="ensembl_prefix" type="hidden" value="Homo_sapiens"/> + <param name="ensembl_genome_version" type="hidden" value="GRCh38"/> + <param name="ensembl_version" type="hidden" value="80"/> + <param name="ncbi_organism" type="hidden" value="Homo_sapiens"/> + <param name="ncbi_prefix" type="hidden" value="Hs"/> + <param name="ucsc_genome_version" type="hidden" value="hg38"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="GRCh37"> + <param name="ensembl_organism" type="hidden" value="homo_sapiens"/> + <param name="ensembl_prefix" type="hidden" value="Homo_sapiens"/> + <param name="ensembl_genome_version" type="hidden" value="GRCh37"/> + <param name="ensembl_version" type="hidden" value="71"/> + <param name="ncbi_organism" type="hidden" value="Homo_sapiens"/> + <param name="ncbi_prefix" type="hidden" value="Hs"/> + <param name="ucsc_genome_version" type="hidden" value="hg19"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="NCBI36"> + <param name="ensembl_organism" type="hidden" value="homo_sapiens"/> + <param name="ensembl_prefix" type="hidden" value="Homo_sapiens"/> + <param name="ensembl_genome_version" type="hidden" value="NCBI36"/> + <param name="ensembl_version" type="hidden" value="54"/> + <param name="ncbi_organism" type="hidden" value="Homo_sapiens"/> + <param name="ncbi_prefix" type="hidden" value="Hs"/> + <param name="ucsc_genome_version" type="hidden" value="hg18"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="GRCm38"> + <param name="ensembl_organism" type="hidden" value="mus_musculus"/> + <param name="ensembl_prefix" type="hidden" value="Mus_musculus"/> + <param name="ensembl_genome_version" type="hidden" value="GRCm38"/> + <param name="ensembl_version" type="hidden" value="71"/> + <param name="ncbi_organism" type="hidden" value="Mus_musculus"/> + <param name="ncbi_prefix" type="hidden" value="Mm"/> + <param name="ucsc_genome_version" type="hidden" value="mm10"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="NCBIM37"> + <param name="ensembl_organism" type="hidden" value="mus_musculus"/> + <param name="ensembl_prefix" type="hidden" value="Mus_musculus"/> + <param name="ensembl_genome_version" type="hidden" value="NCBIM37"/> + <param name="ensembl_version" type="hidden" value="67"/> + <param name="ncbi_organism" type="hidden" value="Mus_musculus"/> + <param name="ncbi_prefix" type="hidden" value="Mm"/> + <param name="ucsc_genome_version" type="hidden" value="mm9"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="Rnor_5.0"> + <param name="ensembl_organism" type="hidden" value="rattus_norvegicus"/> + <param name="ensembl_prefix" type="hidden" value="Rattus_norvegicus"/> + <param name="ensembl_genome_version" type="hidden" value="Rnor_5.0"/> + <param name="ensembl_version" type="hidden" value="71"/> + <param name="ncbi_organism" type="hidden" value="Rattus_norvegicus"/> + <param name="ncbi_prefix" type="hidden" value="Rn"/> + <param name="ucsc_genome_version" type="hidden" value="rn5"/> + <param name="chromosomes" type="hidden" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,MT"/> + <param name="mt_chromosome" type="hidden" value="MT"/> + <param name="gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding"/> + <param name="ig_gene_sources" type="hidden" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene"/> + <param name="rrna_gene_sources" type="hidden" value="Mt_rRNA,rRNA,rRNA_pseudogene"/> + </when> + <when value="user_specified"> + <param name="ensembl_organism" type="text" value="" label="Ensembl Organism Name" help="Examples: homo_sapiens, mus_musculus, rattus_norvegicus"/> + <param name="ensembl_prefix" type="text" value="" label="Ensembl Organism prefix" help="Examples: Homo_sapiens, Mus_musculus, Rattus_norvegicus"/> + <param name="ensembl_genome_version" type="text" value="" label="Ensembl Genome Version" help="Examples: GRCh38, GRCh37, GRCm38, Rnor_5.0"/> + <param name="ensembl_version" type="integer" value="" label="Ensembl Release Version" help="Example: 86"/> + <param name="ncbi_organism" type="text" value="" label="NCBI Organism Name" help="Examples: Homo_sapiens, Mus_musculus, Rattus_norvegicus"/> + <param name="ncbi_prefix" type="text" value="" label="NCBI Organism Unigene prefix" help="Examples: Hs, Mm, Rn"/> + <param name="ucsc_genome_version" type="text" value="" label="UCSC Genome Version" help="Examples: hg38, hg19, mm10, rn5"/> + <param name="chromosomes" type="text" value="" label="Chromosomes for Ensembl genome build" > + <help> Examples: + Homo_sapiens: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT + Mus_musculus: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT + Rattus_norvegicus: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,MT + ( ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/ ) + </help> + </param> + <param name="mt_chromosome" type="text" value="MT" label="Ensembl Mitochonrial Chromosome name" /> + <param name="gene_sources" type="text" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding" label="Gene sources" /> + <param name="ig_gene_sources" type="text" value="IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene" label="IG Gene sources" /> + <param name="rrna_gene_sources" type="text" value="Mt_rRNA,rRNA,rRNA_pseudogene" label="Ribosomal Gene sources" /> + </when> + </conditional> + </inputs> + <outputs> + <data name="out_file" format="data_manager_json" label="${tool.name} : ${genome.ensembl_genome_version}"/> + </outputs> + <tests> + </tests> + <help> +**DeFuse** + +DeFuse_ is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion. See the DeFuse_Version_0.6_ manual for details. + +DeFuse uses a Reference Dataset to search for gene fusions. The Reference Dataset is generated from the following sources in DeFuse_Version_0.6_: + - genome_fasta from Ensembl + - gene_models from Ensembl + - repeats_filename from UCSC RepeatMasker rmsk.txt + - est_fasta from UCSC + - est_alignments from UCSC intronEst.txt + - unigene_fasta from NCBI + +The create_defuse_reference Galaxy tool downloads the reference genome and other source files, and builds any derivative files including bowtie indices, gmap indices, and 2bit files. Expect this step to take at least 12 hours. + + +It will generate the refernce data for deFuse Galaxy tool. + +Journal reference: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001138 + +.. _DeFuse: http://sourceforge.net/apps/mediawiki/defuse/index.php?title=Main_Page + +.. _DeFuse_Version_0.6: http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.6.1 + +------ + +**Outputs** + +The galaxy history will contain: the config.txt file that provides DeFuse with the reference data paths. + + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/datatypes_conf.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,7 @@ +<?xml version="1.0"?> +<datatypes> + <registration> + <datatype extension="defuse.conf" type="galaxy.datatypes.data:Text" subclass="True" display_in_upload="true"/> + <datatype extension="defuse.results.tsv" type="galaxy.datatypes.tabular:Tabular" subclass="True" display_in_upload="true"/> + </registration> +</datatypes>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/defuse.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,701 @@ +<tool id="defuse" name="DeFuse" version="@DEFUSE_VERSION@.1"> + <description>identify fusion transcripts</description> + <macros> + <import>macros.xml</import> + </macros> + <requirements> + <expand macro="defuse_requirement" /> + </requirements> + <command detect_errors="default"><![CDATA[ + #if $defuse_out.__str__ != 'None': + ## ln to output_dir in from_work_dir + mkdir -p $defuse_out.files_path && + ln -s $defuse_out.files_path output_dir && + #else + mkdir -p output_dir && + #end if + ## Put executable paths in config file + $__tool_directory__/config_sub.sh $defuse_config output_dir/defuse.cfg && + ## copy config to output + cp output_dir/defuse.cfg $config_txt && + ## make a data_dir and ln -s the input fastq + mkdir -p data_dir && + ln -s "$left_pairendreads" data_dir/reads_1.fastq && + ln -s "$right_pairendreads" data_dir/reads_2.fastq && + ## run + DATASET_DIRECTORY=`grep '^dataset_directory' output_dir/defuse.cfg | awk '{print \$NF}'` && + defuse_run.pl --name "$library_name" --config output_dir/defuse.cfg --dataset \$DATASET_DIRECTORY -1 data_dir/reads_1.fastq -2 data_dir/reads_2.fastq -o output_dir -p \$GALAXY_SLOTS && + grep -v cluster_id output_dir/results.filtered.tsv | awk '{print $1}' > cluster_id_list && + get_fusion_fastq.pl --list cluster_id_list --output output_dir --fastq1 results.fusions_1.fq --fastq2 results.fusions_2.fq && + cp output_dir/results.* . && + cp `find -L output_dir -name defuse.log` $defuse_log + #if $defuse_out.__str__ != 'None': + && $__tool_directory__/make_html.sh $defuse_out $defuse_out.files_path + #end if + ]]></command> + <configfiles> + <configfile name="defuse_config"> +#import re +#if $refGenomeSource.genomeSource == "history": +#set config_file = $refGenomeSource.config.__str__ +#else +#set config_file = $refGenomeSource.index.value +#end if +#set pat = '^\s*([^#=][^=]*?)\s*=\s*(.*?)\s*$' +#set fh = open($config_file) +#set keys = ['dataset_directory','ensembl_organism','ensembl_prefix','ensembl_version','ensembl_genome_version','ucsc_genome_version','ncbi_organism','ncbi_prefix','chromosomes','mt_chromosome','gene_sources','ig_gene_sources','rrna_gene_sources'] +#set kv = [] +#for $line in $fh: + #set m = $re.match($pat,$line) + #if $m and len($m.groups()) == 2: + ## #echo $line + #if $m.groups()[0] in keys: + #set k = $m.groups()[0] + #if k == 'dataset_directory' and $refGenomeSource.genomeSource == "indexed": + ## The DataManager is conifgured to place the config file in the same directory as the defuse_data: dataset_directory + #set v = $os.path.dirname($config_file) + #else: + #set v = $m.groups()[1] + #end if + #set kv = $kv + [[$k, $v]] + #end if + #end if +#end for +## #echo $kv +#set ref_dict = dict($kv) +## #echo $ref_dict +## include raw $refGenomeSource.config.__str__ +# +# Configuration file for defuse +# +# At a minimum, change all values enclused by [] +# + +# Directory where the defuse code was unpacked +## Default location in the tool/defuse directory +# source_directory = ${__root_dir__}/tools/defuse +source_directory = __DEFUSE_PATH__ + +# Directory where you want your dataset +dataset_directory = #slurp +#try +$ref_dict['dataset_directory'] +#except +/project/db/genomes/Hsapiens/hg19/defuse +#end try + +# Organism IDs +ensembl_organism = #slurp +#try +$ref_dict['ensembl_organism'] +#except +homo_sapiens +#end try + +ensembl_prefix = #slurp +#try +$ref_dict['ensembl_prefix'] +#except +Homo_sapiens +#end try + +ensembl_version = #slurp +#try +$ref_dict['ensembl_version'] +#except +71 +#end try + +ensembl_genome_version = #slurp +#try +$ref_dict['ensembl_genome_version'] +#except +GRCh37 +#end try + +ucsc_genome_version = #slurp +#try +$ref_dict['ucsc_genome_version'] +#except +hg19 +#end try + +ncbi_organism = #slurp +#try +$ref_dict['ncbi_organism'] +#except +Homo_sapiens +#end try + +ncbi_prefix = #slurp +#try +$ref_dict['ncbi_prefix'] +#except +Hs +#end try + +# Input genome and gene models +gene_models = #slurp +#try +$ref_dict['gene_models'] +#except +\$(dataset_directory)/\$(ensembl_prefix).\$(ensembl_genome_version).\$(ensembl_version).gtf +#end try +genome_fasta = #slurp +#try +$ref_dict['genome_fasta'] +#except +\$(dataset_directory)/\$(ensembl_prefix).\$(ensembl_genome_version).\$(ensembl_version).dna.chromosomes.fa +#end try + +# Repeat table from ucsc genome browser +repeats_filename = #slurp +#try +$ref_dict['repeats_filename'] +#except +\$(dataset_directory)/rmsk.txt +#end try + +# EST info downloaded from ucsc genome browser +est_fasta = #slurp +#try +$ref_dict['est_fasta'] +#except +\$(dataset_directory)/est.fa +#end try +est_alignments = #slurp +#try +$ref_dict['est_alignments'] +#except +\$(dataset_directory)/intronEst.txt +#end try + +# Unigene clusters downloaded from ncbi +unigene_fasta = #slurp +#try +$ref_dict['unigene_fasta'] +#except +\$(dataset_directory)/\$(ncbi_prefix).seq.uniq +#end try + +# Paths to external tools +bowtie_bin = __BOWTIE_BIN__ +bowtie_build_bin = __BOWTIE_BUILD_BIN__ +blat_bin = __BLAT_BIN__ +fatotwobit_bin = __FATOTWOBIT_BIN__ +gmap_bin = __GMAP_BIN__ +gmap_bin = __GMAP_BIN__ +gmap_setup_bin = __GMAP_SETUP_BIN__ +r_bin = __R_BIN__ +rscript_bin = __RSCRIPT_BIN__ + +# Directory where you want your dataset +gmap_index_directory = #slurp +#try +$ref_dict['gmap_index_directory'] +#except +#raw +$(dataset_directory)/gmap +#end raw +#end try + +#raw +# Dataset files +dataset_prefix = $(dataset_directory)/defuse +chromosome_prefix = $(dataset_prefix).dna.chromosomes +exons_fasta = $(dataset_prefix).exons.fa +cds_fasta = $(dataset_prefix).cds.fa +cdna_regions = $(dataset_prefix).cdna.regions +cdna_fasta = $(dataset_prefix).cdna.fa +reference_fasta = $(dataset_prefix).reference.fa +rrna_fasta = $(dataset_prefix).rrna.fa +ig_gene_list = $(dataset_prefix).ig.gene.list +repeats_regions = $(dataset_directory)/repeats.regions +est_split_fasta1 = $(dataset_directory)/est.1.fa +est_split_fasta2 = $(dataset_directory)/est.2.fa +est_split_fasta3 = $(dataset_directory)/est.3.fa +est_split_fasta4 = $(dataset_directory)/est.4.fa +est_split_fasta5 = $(dataset_directory)/est.5.fa +est_split_fasta6 = $(dataset_directory)/est.6.fa +est_split_fasta7 = $(dataset_directory)/est.7.fa +est_split_fasta8 = $(dataset_directory)/est.8.fa +est_split_fasta9 = $(dataset_directory)/est.9.fa + +# Fasta files with bowtie indices for prefiltering reads for concordantly mapping pairs +prefilter1 = $(unigene_fasta) + +# deFuse scripts and tools +scripts_directory = $(source_directory)/scripts +tools_directory = $(source_directory)/tools +data_directory = $(source_directory)/data +#end raw + +# Path to samtools, 0.1.8 is compiled for you, use other versions at your own risk +samtools_bin = #slurp +#try +$ref_dict['samtools_bin'] +#except +\$(source_directory)/external/samtools-0.1.8/samtools +#end try + +# Bowtie parameters +bowtie_threads = #slurp +#try +$ref_dict['bowtie_threads'] +#except +4 +#end try +bowtie_quals = #slurp +#try +$ref_dict['bowtie_quals'] +#except +--phred33-quals +#end try +bowtie_params = #slurp +#try +$ref_dict['bowtie_params'] +#except +--chunkmbs 200 +#end try +max_insert_size = #slurp +#if $defuse_param.settings == "full" and $defuse_param.max_insert_size.__str__ != "": +$defuse_param.max_insert_size +#else +#try +$ref_dict['max_insert_size'] +#except +500 +#end try +#end if + +# Parameters for building the dataset +chromosomes = #slurp +#try +$ref_dict.chromosomes +#except +1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT +#end try +mt_chromosome = #slurp +#try +$ref_dict['mt_chromosome'] +#except +MT +#end try +gene_sources = #slurp +#try +$ref_dict['gene_sources'] +#except +IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,processed_transcript,protein_coding +#end try +ig_gene_sources = #slurp +#try +$ref_dict['ig_gene_sources'] +#except +IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,IG_pseudogene +#end try +rrna_gene_sources = #slurp +#try +$ref_dict['rrna_gene_sources'] +#except +Mt_rRNA,rRNA,rRNA_pseudogene +#end try + +# Blat sequences per job +num_blat_sequences = #slurp +#try +$ref_dict['num_blat_sequences'] +#except +10000 +#end try + +# Minimum gene fusion range +dna_concordant_length = #slurp +#if $defuse_param.settings == "full" and $defuse_param.dna_concordant_length.__str__ != "": +$defuse_param.dna_concordant_length +#else +#try +$ref_dict['dna_concordant_length'] +#except +2000 +#end try +#end if + +# Trim length for discordant reads (split reads are not trimmed) +discord_read_trim = #slurp +#if $defuse_param.settings == "full" and $defuse_param.discord_read_trim.__str__ != "": +$defuse_param.discord_read_trim +#else +#try +$ref_dict['discord_read_trim'] +#except +50 +#end try +#end if +# Calculate extra annotations, fusion splice index and interrupted index +calculate_extra_annotations = #slurp +#if $defuse_param.settings == "full" and $defuse_param.calculate_extra_annotations.__str__ != "": +$defuse_param.calculate_extra_annotations +#else +#try +$ref_dict['calculate_extra_annotations'] +#except +no +#end try +#end if +# Filtering parameters +clustering_precision = #slurp +#if $defuse_param.settings == "full" and $defuse_param.clustering_precision.__str__ != "" +$defuse_param.clustering_precision +#else +#try +$ref_dict['clustering_precision'] +#except +0.95 +#end try +#end if +span_count_threshold = #slurp +#if $defuse_param.settings == "full" and $defuse_param.span_count_threshold.__str__ != "" +$defuse_param.span_count_threshold +#else +#try +$ref_dict['span_count_threshold'] +#except +5 +#end try +#end if +percent_identity_threshold = #slurp +#if $defuse_param.settings == "full" and $defuse_param.percent_identity_threshold.__str__ != "" +$defuse_param.percent_identity_threshold +#else +#try +$ref_dict['percent_identity_threshold'] +#except +0.90 +#end try +#end if +split_min_anchor = #slurp +#if $defuse_param.settings == "full" and $defuse_param.split_min_anchor.__str__ != "" +$defuse_param.split_min_anchor +#else +#try +$ref_dict['split_min_anchor'] +#except +4 +#end try +#end if +splice_bias = #slurp +#if $defuse_param.settings == "full" and $defuse_param.splice_bias.__str__ != "" +$defuse_param.splice_bias +#else +#try +$ref_dict['splice_bias'] +#except +10 +#end try +#end if +denovo_assembly = #slurp +#if $defuse_param.settings == "full" and $defuse_param.denovo_assembly.__str__ != "" +$defuse_param.denovo_assembly +#else +#try +$ref_dict['denovo_assembly'] +#except +no +#end try +#end if +probability_threshold = #slurp +#if $defuse_param.settings == "full" and $defuse_param.probability_threshold.__str__ != "" +$defuse_param.probability_threshold +#else +#try +$ref_dict['probability_threshold'] +#except +0.50 +#end try +#end if +positive_controls = \$(data_directory)/controls.txt + +# Use multiple exon transcripts for stats calculations (yes/no) +# should be enabled for very small libraries +multi_exon_transcripts_stats = #slurp +#if $defuse_param.settings == "full" and $defuse_param.multi_exon_transcripts_stats.__str__ != "" +$defuse_param.multi_exon_transcripts_stats +#else +#try +$ref_dict['multi_exon_transcripts_stats'] +#except +no +#end try +#end if + +# Position density when calculating covariance +covariance_sampling_density = #slurp +#if $defuse_param.settings == "full" and $defuse_param.covariance_sampling_density.__str__ != "" +$defuse_param.covariance_sampling_density +#else +#try +$ref_dict['covariance_sampling_density'] +#except +0.01 +#end try +#end if + +# Maximum number of alignments for a read pair +# Pairs with more alignments are filtered +max_paired_alignments = #slurp +#if $defuse_param.settings == "full" and $defuse_param.max_paired_alignments.__str__ != "" +$defuse_param.max_paired_alignments +#else +#try +$ref_dict['max_paired_alignments'] +#except +10 +#end try +#end if + +# Number of reads for each job in split +reads_per_job = #slurp +#if $defuse_param.settings == "full" and $defuse_param.reads_per_job.__str__ != "" +$defuse_param.reads_per_job +#else +#try +$ref_dict['reads_per_job'] +#except +1000000 +#end try +#end if + +#raw +# If you have command line 'mail' and wish to be notified +# mailto = andrew.mcpherson@gmail.com + +# Remove temp files +remove_job_files = yes +remove_job_temp_files = yes + +qsub_params = "" + +#end raw + + </configfile> + </configfiles> + <inputs> + <param name="left_pairendreads" type="data" format="fastq" label="left part of read pairs" help="The left and right reads pairs must be in the same order, and not have any unpaired reads. (FASTQ interlacer will pair reads and remove the unpaired. FASTQ de-interlacer will separate the result into left and right reads.)"/> + <param name="right_pairendreads" type="data" format="fastq" label="right part of read pairs" help="In the same order as the left reads"/> + <param name="library_name" type="text" value="unknown" label="library name" help="Value to put in the results library_name column"> + <validator type="length" min="1"/> + </param> + <conditional name="refGenomeSource"> + <param name="genomeSource" type="select" label="Will you select a built-in DeFuse Reference Dataset, or supply a configuration from your history" help=""> + <option value="indexed">Use a built-in DeFuse Reference Dataset</option> + <option value="history">Use a configuration from your history that specifies the DeFuse Reference Dataset</option> + </param> + <when value="indexed"> + <param name="index" type="select" label="Select a Reference Dataset" help="if your genome of interest is not listed - contact Galaxy team"> + <options from_file="defuse_reference.loc"> + <column name="name" index="1"/> + <column name="value" index="3"/> + <filter type="sort_by" column="0" /> + <validator type="no_options" message="No indexes are available" /> + </options> + </param> + </when> + <when value="history"> + <param name="config" type="data" format="defuse.conf" label="Defuse Config file" help=""/> + </when> <!-- history --> + </conditional> <!-- refGenomeSource --> + <conditional name="defuse_param"> + <param name="settings" type="select" label="Defuse parameter settings" help=""> + <option value="preSet">Default settings</option> + <option value="full">Full parameter list</option> + </param> + <when value="preSet" /> + <when value="full"> + <param name="max_insert_size" type="integer" value="500" optional="true" label="Bowtie max_insert_size" /> + <param name="dna_concordant_length" type="integer" value="2000" optional="true" label="Minimum gene fusion range dna_concordant_length" /> + <param name="discord_read_trim" type="integer" value="50" optional="true" label="Trim length for discordant reads discord_read_trim" help="(split reads are not trimmed)" /> + <param name="calculate_extra_annotations" type="select" label="Calculate extra annotations, fusion splice index and interrupted index" help=""> + <option value="">Use Default</option> + <option value="no">no</option> + <option value="yes">yes</option> + </param> + <param name="clustering_precision" type="float" value=".95" optional="true" label="Filter clustering_precision"> + <validator type="in_range" message="Choose a value between .1 and 1.0" min=".1" max="1"/> + </param> + <param name="span_count_threshold" type="integer" value="5" optional="true" label="Filter span_count_threshold" /> + <param name="percent_identity_threshold" type="float" value=".90" optional="true" label="Filter percent_identity_threshold"> + <validator type="in_range" message="Choose a value between .1 and 1.0" min=".1" max="1"/> + </param> + <param name="split_min_anchor" type="integer" value="4" optional="true" label="Filter split_min_anchor" /> + <param name="splice_bias" type="integer" value="10" optional="true" label="Filter splice_bias" /> + <param name="probability_threshold" type="float" value="0.50" optional="true" label="Filter probability_threshold"> + <validator type="in_range" message="Choose a value between 0.0 and 1.0" min="0" max="1"/> + </param> + <param name="multi_exon_transcripts_stats" type="select" label="Use multiple exon transcripts for stats calculations" help="should be enabled for very small libraries"> + <option value="no" selected="true">no</option> + <option value="yes">yes</option> + </param> + <param name="covariance_sampling_density" type="float" value="0.01" optional="true" label="covariance_sampling_density"> + <help>Position density when calculating covariance</help> + <validator type="in_range" message="Choose a value between 0.0 and 1.0" min="0" max="1"/> + </param> + <param name="max_paired_alignments" type="integer" value="10" optional="true" label="max_paired_alignments"> + <help>Maximum number of alignments for a read pair, Pairs with more alignments are filtered, default is 10</help> + <validator type="in_range" message="Choose a value between 0.0 and 1.0" min="1" max="100"/> + </param> + <param name="denovo_assembly" type="select" label="denovo_assembly" help=""> + <option value="">Use Default</option> + <option value="no">no</option> + <option value="yes">yes</option> + </param> + <!-- + <param name="positive_controls" type="data" format="txt" optional=true label="Defuse positive_controls" help=""/> + --> + <param name="reads_per_job" type="integer" value="1000000" optional="true" label="Number of reads for each job in split" /> + </when> <!-- full --> + </conditional> <!-- defuse_param --> + <param name="keep_output" type="boolean" checked="true" truevalue="yes" falsevalue="no" label="Save DeFuse working directory files" + help="The defuse output working directory can be helpful for determining errors that may have occurred during the run, + but they require considerable diskspace, and should be deleted and purged when no longer needed."/> + </inputs> + <outputs> + <data format="txt" name="config_txt" label="${tool.name} on ${on_string}: config.txt"/> + <data format="txt" name="defuse_log" label="${tool.name} on ${on_string}: defuse.log" /> + <data format="html" name="defuse_out" label="${tool.name} on ${on_string}: defuse_output (purge when no longer needed)"> + <filter>keep_output == True</filter> + </data> + <data format="defuse.results.tsv" name="results_classify_tsv" label="${tool.name} on ${on_string}: results.classify.tsv" from_work_dir="results.classify.tsv"/> + <data format="defuse.results.tsv" name="results_filtered_tsv" label="${tool.name} on ${on_string}: results.filtered.tsv" from_work_dir="results.filtered.tsv"/> + <data format="fastqsanger" name="results_fusions1_fq" label="${tool.name} on ${on_string}: fusions_1.fq" from_work_dir="results.fusions_1.fq" /> + <data format="fastqsanger" name="results_fusions2_fq" label="${tool.name} on ${on_string}: fusions_2.fq" from_work_dir="results.fusions_2.fq" /> + <!-- + expression_plot + circos plot + --> + </outputs> + + <tests> + </tests> + <help> +**DeFuse** + +DeFuse_ is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion. + +Journal reference: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001138 + +.. _DeFuse: http://sourceforge.net/apps/mediawiki/defuse/index.php?title=Main_Page + +------ + +**Inputs** + +DeFuse requires 2 fastq files for paried reads, one with the left mate of the paired reads, and a second fastq with the the right mate of the paired reads (**with reads in the same order as in the first fastq dataset**). + +If your fastq files have reads in different orders or include unpaired reads, you can preprocess them with **FASTQ interlacer** to create a single interlaced fastq dataset with only the paired reads and input that to **FASTQ de-interlacer** to separate the reads into a left fastq and right fastq. + +DeFuse uses a Reference Dataset to search for gene fusions. The Reference Dataset is generated from the following sources in DeFuse_Version_0.4_: + - genome_fasta from Ensembl + - gene_models from Ensembl + - repeats_filename from UCSC RepeatMasker rmsk.txt + - est_fasta from UCSC + - est_alignments from UCSC intronEst.txt + - unigene_fasta from NCBI + +.. _DeFuse_Version_0.4: http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.4.2 + +------ + +**Outputs** + +The galaxy history will contain 5 outputs: the config.txt file that provides DeFuse with its parameters, the defuse.log which details what DeFuse has done and can be useful in determining any errors, and the 3 results files that defuse generates. + +DeFuse generates 3 results files: results.txt, results.filtered.txt, and results.classify.txt. All three files have the same format, though results.classify.txt has a probability column from the application of the classifier to results.txt, and results.filtered.txt has been filtered according to the threshold probability as set in config.txt. + +The file format is tab delimited with one prediction per line, and the following fields per prediction (not necessarily in this order): + + - **Identification** + - cluster_id : random identifier assigned to each prediction + - library_name : library name given on the command line of defuse + - gene1 : ensembl id of gene 1 + - gene2 : ensembl id of gene 2 + - gene_name1 : name of gene 1 + - gene_name2 : name of gene 2 + - **Evidence** + - break_predict : breakpoint prediction method, denovo or splitr, that is considered most reliable + - concordant_ratio : proportion of spanning reads considered concordant by blat + - denovo_min_count : minimum kmer count across denovo assembled sequence + - denovo_sequence : fusion sequence predicted by debruijn based denovo sequence assembly + - denovo_span_pvalue : p-value, lower values are evidence the prediction is a false positive + - gene_align_strand1 : alignment strand for spanning read alignments to gene 1 + - gene_align_strand2 : alignment strand for spanning read alignments to gene 2 + - min_map_count : minimum of the number of genomic mappings for each spanning read + - max_map_count : maximum of the number of genomic mappings for each spanning read + - mean_map_count : average of the number of genomic mappings for each spanning read + - num_multi_map : number of spanning reads that map to more than one genomic location + - span_count : number of spanning reads supporting the fusion + - span_coverage1 : coverage of spanning reads aligned to gene 1 as a proportion of expected coverage + - span_coverage2 : coverage of spanning reads aligned to gene 2 as a proportion of expected coverage + - span_coverage_min : minimum of span_coverage1 and span_coverage2 + - span_coverage_max : maximum of span_coverage1 and span_coverage2 + - splitr_count : number of split reads supporting the prediction + - splitr_min_pvalue : p-value, lower values are evidence the prediction is a false positive + - splitr_pos_pvalue : p-value, lower values are evidence the prediction is a false positive + - splitr_sequence : fusion sequence predicted by split reads + - splitr_span_pvalue : p-value, lower values are evidence the prediction is a false positive + - **Annotation** + - adjacent : fusion between adjacent genes + - altsplice : fusion likely the product of alternative splicing between adjacent genes + - break_adj_entropy1 : di-nucleotide entropy of the 40 nucleotides adjacent to the fusion splice in gene 1 + - break_adj_entropy2 : di-nucleotide entropy of the 40 nucleotides adjacent to the fusion splice in gene 2 + - break_adj_entropy_min : minimum of break_adj_entropy1 and break_adj_entropy2 + - breakpoint_homology : number of nucleotides at the fusion splice that align equally well to gene 1 or gene 2 + - breakseqs_estislands_percident : maximum percent identity of fusion sequence alignments to est islands + - cdna_breakseqs_percident : maximum percent identity of fusion sequence alignments to cdna + - deletion : fusion produced by a genomic deletion + - est_breakseqs_percident : maximum percent identity of fusion sequence alignments to est + - eversion : fusion produced by a genomic eversion + - exonboundaries : fusion splice at exon boundaries + - expression1 : expression of gene 1 as number of concordant pairs aligned to exons + - expression2 : expression of gene 2 as number of concordant pairs aligned to exons + - gene_chromosome1 : chromosome of gene 1 + - gene_chromosome2 : chromosome of gene 2 + - gene_end1 : end position for gene 1 + - gene_end2 : end position for gene 2 + - gene_location1 : location of breakpoint in gene 1 + - gene_location2 : location of breakpoint in gene 2 + - gene_start1 : start of gene 1 + - gene_start2 : start of gene 2 + - gene_strand1 : strand of gene 1 + - gene_strand2 : strand of gene 2 + - genome_breakseqs_percident : maximum percent identity of fusion sequence alignments to genome + - genomic_break_pos1 : genomic position in gene 1 of fusion splice / breakpoint + - genomic_break_pos2 : genomic position in gene 2 of fusion splice / breakpoint + - genomic_strand1 : genomic strand in gene 1 of fusion splice / breakpoint, retained sequence upstream on this strand, breakpoint is downstream + - genomic_strand2 : genomic strand in gene 2 of fusion splice / breakpoint, retained sequence upstream on this strand, breakpoint is downstream + - interchromosomal : fusion produced by an interchromosomal translocation + - interrupted_index1 : ratio of coverage before and after the fusion splice / breakpoint in gene 1 + - interrupted_index2 : ratio of coverage before and after the fusion splice / breakpoint in gene 2 + - inversion : fusion produced by genomic inversion + - orf : fusion combines genes in a way that preserves a reading frame + - probability : probability produced by classification using adaboost and example positives/negatives (only given in results.classified.txt) + - read_through : fusion involving adjacent potentially resulting from co-transcription rather than genome rearrangement + - repeat_proportion1 : proportion of the spanning reads in gene 1 that span a repeat region + - repeat_proportion2 : proportion of the spanning reads in gene 2 that span a repeat region + - max_repeat_proportion : max of repeat_proportion1 and repeat_proportion2 + - splice_score : number of nucleotides similar to GTAG at fusion splice + - num_splice_variants : number of potential splice variants for this gene pair + - splicing_index1 : number of concordant pairs in gene 1 spanning the fusion splice / breakpoint, divided by number of spanning reads supporting the fusion with gene 2 + - splicing_index2 : number of concordant pairs in gene 2 spanning the fusion splice / breakpoint, divided by number of spanning reads supporting the fusion with gene 1 + + +**Example** + +results.tsv:: + + cluster_id splitr_sequence splitr_count splitr_span_pvalue splitr_pos_pvalue splitr_min_pvalue adjacent altsplice break_adj_entropy1 break_adj_entropy2 break_adj_entropy_min break_predict breakpoint_homology breakseqs_estislands_percident cdna_breakseqs_percident concordant_ratio deletion est_breakseqs_percident eversion exonboundaries expression1 expression2 gene1 gene2 gene_align_strand1 gene_align_strand2 gene_chromosome1 gene_chromosome2 gene_end1 gene_end2 gene_location1 gene_location2 gene_name1 gene_name2 gene_start1 gene_start2 gene_strand1 gene_strand2 genome_breakseqs_percident genomic_break_pos1 genomic_break_pos2 genomic_strand1 genomic_strand2 interchromosomal interrupted_index1 interrupted_index2 inversion library_name max_map_count max_repeat_proportion mean_map_count min_map_count num_multi_map num_splice_variants orf read_through repeat_proportion1 repeat_proportion2 span_count span_coverage1 span_coverage2 span_coverage_max span_coverage_min splice_score splicing_index1 splicing_index2 + 1169 GCTTACTGTATGCCAGGCCCCAGAGGGGCAACCACCCTCTAAAGAGAGCGGCTCCTGCCTCCCAGAAAGCTCACAGACTGTGGGAGGGAAACAGGCAGCAGGTGAAGATGCCAAATGCCAGGATATCTGCCCTGTCCTTGCTTGATGCAGCTGCTGGCTCCCACGTTCTCCCCAGAATCCCCTCACACTCCTGCTGTTTTCTCTGCAGGTTGGCAGAGCCCCATGAGGGCAGGGCAGCCACTTTGTTCTTGGGCGGCAAACCTCCCTGGGCGGCACGGAAACCACGGTGAGAAGGGGGCAGGTCGGGCACGTGCAGGGACCACGCTGCAGG|TGTACCCAACAGCTCCGAAGAGACAGCGACCATCGAGAACGGGCCATGATGACGATGGCGGTTTTGTCGAAAAGAAAAGGGGGAAATGTGGGGAAAAGCAAGAGAGATCAGATTGTTACTGTGTCTGTGTAGAAAGAAGTAGACATGGGAGACTCCATTTTGTTCTGTACTAAGAAAAATTCTTCTGCCTTGAGATTCGGTGACCCCACCCCCAACCCCGTGCTCTCTGAAACATGTGCTGTGTCCACTCAGGGTTGAATGGATTAAGGGCGGTGCGAGACGTGCTTT 2 0.000436307890680442 0.110748295953850 0.0880671602973091 N Y 3.19872427442695 3.48337348351473 3.19872427442695 splitr 0 0 0 0 Y 0 N N 0 0 ENSG00000105549 ENSG00000213753 + - 19 19 376013 59111168 intron upstream THEG AC016629.2 361750 59084870 - + 0 375099 386594 + - N 8.34107429512245 - N output_dir 82 0.677852348993289 40.6666666666667 1 11 1 N N 0.361271676300578 0.677852348993289 12 0.758602776578432 0.569678713445872 0.758602776578432 0.569678713445872 2 0.416666666666667 - + 3596 TGGGGGTTGAGGCTTCTGTTCCCAGGTTCCATGACCTCAGAGGTGGCTGGTGAGGTTATGACCTTTGCCCTCCAGCCCTGGCTTAAAACCTCAGCCCTAGGACCTGGTTAAAGGAAGGGGAGATGGAGCTTTGCCCCGACCCCCCCCCGTTCCCCTCACCTGTCAGCCCGAGCTGGGCCAGGGCCCCTAGGTGGGGAACTGGGCCGGGGGGCGGGCACAAGCGGAGGTGGTGCCCCCAAAAGGGCTCCCGGTGGGGTCTTGCTGAGAAGGTGAGGGGTTCCCGGGGCCGCAGCAGGTGGTGGTGGAGGAGCCAAGCGGCTGTAGAGCAAGGGGTGAGCAGGTTCCAGACCGTAGAGGCGGGCAGCGGCCACGGCCCCGGGTCCAGTTAGCTCCTCACCCGCCTCATAGAAGCGGGGTGGCCTTGCCAGGCGTGGGGGTGCTGCC|TTCCTTGGATGTGGTAGCCGTTTCTCAGGCTCCCTCTCCGGAATCGAACCCTGATTCCCCGTCACCCGTGGTCACCATGGTAGGCACGGCGACTACCATCGAAAGTTGATAGGGCAGACGTTCGAATGGGTCGTCGCCGCCACGGGGGGCGTGCGATCAGCCCGAGGTTATCTAGAGTCACCAAAGCCGCCGGCGCCCGCCCCCCGGCCGGGGCCGGAGAGGGGCTGACCGGGTTGGTTTTGATCTGATAAATGCACGCATCCCCCCCGCGAAGGGGGTCAGCGCCCGTCGGCATGTATTAGCTCTAGAATTACCACAGTTATCCAAGTAGGAGAGGAGCGAGCGACCAAAGGAACCATAACTGATTTAATGAGCCATTCGCAGTTTCACTGTACCGGCCGTGCGTACTTAGACATGCATGGCTTAATCTTTGAGACAAGCATATGCTACTGGCAGG 250 7.00711162298275e-72 0.00912124762512338 0.00684237452309549 N N 3.31745197152461 3.47233119514066 3.31745197152461 splitr 7 0.0157657657657656 0 0 N 0.0135135135135136 N N 0 0 ENSG00000156860 ENSG00000212932 - + 16 21 30682131 48111157 coding upstream FBRS RPL23AP4 30670289 48110676 + + 0.0157657657657656 30680678 9827473 - + Y - - N output_dir 2 1 1.11111111111111 1 1 1 N N 0 1 9 0.325530693397641 0.296465452915709 0.325530693397641 0.296465452915709 2 - - + + </help> + <expand macro="citations"/> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/defuse_bamfastq.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,60 @@ +<?xml version="1.0"?> +<tool id="defuse_bamfastq" name="Defuse BamFastq" version="@DEFUSE_VERSION@.1"> + <description>converts a bam file to fastq files.</description> + <macros> + <import>macros.xml</import> + </macros> + <requirements> + <expand macro="defuse_requirement" /> + </requirements> + <command detect_errors="default">bamfastq + #if $pair == True : + $pair + #end if + #if $multiple == True : + $multiple + #end if + #if $rename == True : + $rename + #end if + -b $bamfile + -1 $fastq1 + -2 $fastq2 + </command> + <inputs> + <param name="bamfile" type="data" format="bam" label="Bam file"/> + <param name="pair" type="boolean" truevalue="-p" falsevalue="" checked="true" label="Name contains pair info as /1 /2."/> + <param name="multiple" type="boolean" truevalue="-m" falsevalue="" checked="true" label="Bam contains multiple mappings per read."/> + <param name="rename" type="boolean" truevalue="-r" falsevalue="" checked="true" label="Rename with integer IDs."/> + </inputs> + <outputs> + <data format="fastqsanger" name="fastq1" label="fastq1" /> + <data format="fastqsanger" name="fastq2" label="fastq2" /> + </outputs> + <tests> + <test> + <param name="bamfile" ftype="bam" value="tophat_out2h.bam" /> + <param name="pair" value="True" /> + <param name="multiple" value="True" /> + <param name="rename" value="True" /> + <output name="fastq1"> + <assert_contents> + <has_text text="@test_mRNA_36_146_27/1" /> + <not_has_text text="@test_mRNA_36_146_27/2" /> + <not_has_text text="test_mRNA_150_290_0" /> + </assert_contents> + </output> + <output name="fastq2"> + <assert_contents> + <has_text text="@test_mRNA_36_146_27/2" /> + <not_has_text text="@test_mRNA_36_146_27/1" /> + <not_has_text text="test_mRNA_150_290_0" /> + </assert_contents> + </output> + </test> + </tests> + <help> + bamfastq converts a bam file input into a pair of fastq files that can be used as input to deFuse. + </help> + <expand macro="citations"/> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/defuse_results_to_vcf.py Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,273 @@ +#!/usr/bin/env python +""" +# +#------------------------------------------------------------------------------ +# University of Minnesota +# Copyright 2012, Regents of the University of Minnesota +#------------------------------------------------------------------------------ +# Author: +# +# James E Johnson +# Jesse Erdmann +# +#------------------------------------------------------------------------------ +""" + + +""" +This tool takes the defuse results.tsv tab-delimited file as input and creates a Variant Call Format file as output. +""" + +import sys,re,os.path +import optparse +from optparse import OptionParser + +""" +http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 + +5. INFO keys used for structural variants +When the INFO keys reserved for encoding structural variants are used for imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles). +The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles). +##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation"> +##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation"> +##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> +For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate. +##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> +Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering. +##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> +One value for each ALT allele. Longer ALT alleles (e.g. insertions) have positive values, shorter ALT alleles (e.g. deletions) have negative values. +##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants"> +##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants"> +##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints"> +##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints"> +##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file"> +For precise variants, the consensus sequence the alternate allele assembly is derivable from the REF and ALT fields. However, the alternate allele assembly file may contain additional information about the characteristics of the alt allele contigs. +##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY"> +##INFO=<ID=METRANS,Number=4,Type=String,Description="Mobile element transduction info of the form CHR,START,END,POLARITY"> +##INFO=<ID=DGVID,Number=1,Type=String,Description="ID of this element in Database of Genomic Variation"> +##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in DBVAR"> +##INFO=<ID=DBRIPID,Number=1,Type=String,Description="ID of this element in DBRIP"> +##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends"> +##INFO=<ID=PARID,Number=1,Type=String,Description="ID of partner breakend"> +##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend"> +##INFO=<ID=CILEN,Number=2,Type=Integer,Description="Confidence interval around the length of the inserted material between breakends"> +##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend"> +##INFO=<ID=DPADJ,Number=.,Type=Integer,Description="Read Depth of adjacency"> +##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number of segment containing breakend"> +##INFO=<ID=CNADJ,Number=.,Type=Integer,Description="Copy number of adjacency"> +##INFO=<ID=CICN,Number=2,Type=Integer,Description="Confidence interval around copy number for the segment"> +##INFO=<ID=CICNADJ,Number=.,Type=Integer,Description="Confidence interval around copy number for the adjacency"> +6. FORMAT keys used for structural variants +##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events"> +##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events"> +##FORMAT=<ID=CNL,Number=.,Type=Float,Description="Copy number genotype likelihood for imprecise events"> +##FORMAT=<ID=NQ,Number=1,Type=Integer,Description="Phred style probability score that the variant is novel with respect to the genome's ancestor"> +##FORMAT=<ID=HAP,Number=1,Type=Integer,Description="Unique haplotype identifier"> +##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype"> +These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality -10log_10p(copy number genotype call is wrong). CNL specifies a list of log10 likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys. + +Specifying Complex Rearrangements with Breakends +An arbitrary rearrangement event can be summarized as a set of novel adjacencies. +Each adjacency ties together 2 breakends. The two breakends at either end of a novel adjacency are called mates. +There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency. A breakend record is identified with the tag SYTYPE=BND" in the INFO field. The REF field of a breakend record indicates a base or sequence s of bases beginning at position POS, as in all VCF records. The ALT field of a breakend record indicates a replacement for s. This "breakend replacement" has three parts: +the string t that replaces places s. The string t may be an extended version of s if some novel bases are inserted during the formation of the novel adjacency. +The position p of the mate breakend, indicated by a string of the form "chr:pos". This is the location of the first mapped base in the piece being joined at this novel adjacency. +The direction that the joined sequence continues in, starting from p. This is indicated by the orientation of square brackets surrounding p. +These 3 elements are combined in 4 possible ways to create the ALT. In each of the 4 cases, the assertion is that s is replaced with t, and then some piece starting at position p is joined to t. The cases are: +REF ALT Meaning +s t[p[ piece extending to the right of p is joined after t +s t]p] reverse comp piece extending left of p is joined after t +s ]p]t piece extending to the left of p is joined before t +s [p[t reverse comp piece extending right of p is joined before t + +Examples: +#CHROM POS ID REF ALT QUAL FILT INFO +2 321681 bnd_W G G]17:198982] 6 PASS SVTYPE=BND;MATEID=bnd_Y +2 321682 bnd_V T ]13:123456]T 6 PASS SVTYPE=BND;MATEID=bnd_U +13 123456 bnd_U C C[2:321682[ 6 PASS SVTYPE=BND;MATEID=bnd_V +13 123457 bnd_X A [17:198983[A 6 PASS SVTYPE=BND;MATEID=bnd_Z +17 198982 bnd_Y A A]2:321681] 6 PASS SVTYPE=BND;MATEID=bnd_W +17 198983 bnd_Z C [13:123457[C 6 PASS SVTYPE=BND;MATEID=bnd_X +""" + +vcf_header = """\ +##fileformat=VCFv4.1 +##source=defuse +##reference=%s +##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> +##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> +##INFO=<ID=MATEID,Number=1,Type=String,Description="ID of the BND mate"> +##INFO=<ID=MATELOC,Number=1,Type=String,Description="The chrom:position of the BND mate"> +##INFO=<ID=GENESTRAND,Number=2,Type=String,Description="Strands"> +##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend"> +##INFO=<ID=SPLITCNT,Number=1,Type=Integer,Description="number of split reads supporting the prediction"> +##INFO=<ID=SPANCNT,Number=1,Type=Integer,Description="number of spanning reads supporting the fusion"> +##INFO=<ID=HOMLEN,Number=1,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints"> +##INFO=<ID=SPLICESCORE,Number=1,Type=Integer,Description="number of nucleotides similar to GTAG at fusion splice"> +##INFO=<ID=GENE,Number=2,Type=String,Description="Gene Names at each breakend"> +##INFO=<ID=GENEID,Number=2,Type=String,Description="Gene IDs at each breakend"> +##INFO=<ID=GENELOC,Number=2,Type=String,Description="location of breakpoint releative to genes"> +##INFO=<ID=EXPR,Number=2,Type=Integer,Description="expression of genes as number of concordant pairs aligned to exons"> +##INFO=<ID=ORF,Number=0,Type=Flag,Description="fusion combines genes in a way that preserves a reading frame"> +##INFO=<ID=EXONBND,Number=0,Type=Flag,Description="fusion splice at exon boundaries"> +##INFO=<ID=INTERCHROM,Number=0,Type=Flag,Description="fusion produced by an interchromosomal translocation"> +##INFO=<ID=READTHROUGH,Number=0,Type=Flag,Description="fusion involving adjacent potentially resulting from co-transcription rather than genome rearrangement"> +##INFO=<ID=ADJACENT,Number=0,Type=Flag,Description="fusion between adjacent genes"> +##INFO=<ID=ALTSPLICE,Number=0,Type=Flag,Description="fusion likely the product of alternative splicing between adjacent genes"> +##INFO=<ID=DELETION,Number=0,Type=Flag,Description="fusion produced by a genomic deletion"> +##INFO=<ID=EVERSION,Number=0,Type=Flag,Description="fusion produced by a genomic eversion"> +##INFO=<ID=INVERSION,Number=0,Type=Flag,Description="fusion produced by a genomic inversion"> +#CHROM POS ID REF ALT QUAL FILTER INFO\ +""" + +def cmp_alphanumeric(s1,s2): + if s1 == s2: + return 0 + a1 = re.findall("\d+|[a-zA-Z]+",s1) + a2 = re.findall("\d+|[a-zA-Z]+",s2) + for i in range(min(len(a1),len(a2))): + if a1[i] == a2[i]: + continue + if a1[i].isdigit() and a2[i].isdigit(): + return int(a1[i]) - int(a2[i]) + return 1 if a1[i] > a2[i] else -1 + return len(a1) - len(a2) + +def __main__(): + # VCF functions + chr_dict = dict() + def add_vcf_line(chr,pos,id,line): + if chr not in chr_dict: + pos_dict = dict() + chr_dict[chr] = pos_dict + if pos not in chr_dict[chr]: + id_dict = dict() + chr_dict[chr][pos] = id_dict + chr_dict[chr][pos][id] = line + + def write_vcf(): + print >> outputFile, vcf_header % (refname) + for chr in sorted(chr_dict.keys(),cmp=cmp_alphanumeric): + for pos in sorted(chr_dict[chr].keys()): + for id in chr_dict[chr][pos]: + print >> outputFile, chr_dict[chr][pos][id] + #Parse Command Line + parser = optparse.OptionParser() + # files + parser.add_option( '-i', '--input', dest='input', help='The input defuse results.tsv file (else read from stdin)' ) + parser.add_option( '-o', '--output', dest='output', help='The output vcf file (else write to stdout)' ) + parser.add_option( '-r', '--reference', dest='reference', default=None, help='The genomic reference id' ) + (options, args) = parser.parse_args() + + # results.tsv input + if options.input != None: + try: + inputPath = os.path.abspath(options.input) + inputFile = open(inputPath, 'r') + except Exception, e: + print >> sys.stderr, "failed: %s" % e + exit(2) + else: + inputFile = sys.stdin + # vcf output + if options.output != None: + try: + outputPath = os.path.abspath(options.output) + outputFile = open(outputPath, 'w') + except Exception, e: + print >> sys.stderr, "failed: %s" % e + exit(3) + else: + outputFile = sys.stdout + + refname = options.reference if options.reference else 'unknown' + + svtype = 'SVTYPE=BND' + filt = 'PASS' + columns = [] + try: + for linenum,line in enumerate(inputFile): + ## print >> sys.stderr, "%d: %s\n" % (linenum,line) + fields = line.strip().split('\t') + if line.startswith('cluster_id'): + columns = fields + ## print >> sys.stderr, "columns: %s\n" % columns + continue + cluster_id = fields[columns.index('cluster_id')] + gene_chromosome1 = fields[columns.index('gene_chromosome1')] + gene_chromosome2 = fields[columns.index('gene_chromosome2')] + genomic_strand1 = fields[columns.index('genomic_strand1')] + genomic_strand2 = fields[columns.index('genomic_strand2')] + gene1 = fields[columns.index('gene1')] + gene2 = fields[columns.index('gene2')] + gene_info = 'GENEID=%s,%s' % (gene1,gene2) + gene_name1 = fields[columns.index('gene_name1')] + gene_name2 = fields[columns.index('gene_name2')] + gene_name_info = 'GENE=%s,%s' % (gene_name1,gene_name2) + gene_location1 = fields[columns.index('gene_location1')] + gene_location2 = fields[columns.index('gene_location2')] + gene_loc = 'GENELOC=%s,%s' % (gene_location1,gene_location2) + expression1 = int(fields[columns.index('expression1')]) + expression2 = int(fields[columns.index('expression2')]) + expr = 'EXPR=%d,%d' % (expression1,expression2) + genomic_break_pos1 = int(fields[columns.index('genomic_break_pos1')]) + genomic_break_pos2 = int(fields[columns.index('genomic_break_pos2')]) + breakpoint_homology = int(fields[columns.index('breakpoint_homology')]) + homlen = 'HOMLEN=%s' % breakpoint_homology + orf = fields[columns.index('orf')] == 'Y' + exonboundaries = fields[columns.index('exonboundaries')] == 'Y' + read_through = fields[columns.index('read_through')] == 'Y' + interchromosomal = fields[columns.index('interchromosomal')] == 'Y' + adjacent = fields[columns.index('adjacent')] == 'Y' + altsplice = fields[columns.index('altsplice')] == 'Y' + deletion = fields[columns.index('deletion')] == 'Y' + eversion = fields[columns.index('eversion')] == 'Y' + inversion = fields[columns.index('inversion')] == 'Y' + span_count = int(fields[columns.index('span_count')]) + splitr_count = int(fields[columns.index('splitr_count')]) + splice_score = int(fields[columns.index('splice_score')]) + probability = fields[columns.index('probability')] if columns.index('probability') else '.' + splitr_sequence = fields[columns.index('splitr_sequence')] + split_seqs = splitr_sequence.split('|') + mate_id1 = "bnd_%s_1" % cluster_id + mate_id2 = "bnd_%s_2" % cluster_id + ref1 = split_seqs[0][-1] + ref2 = split_seqs[1][0] + b1 = '[' if genomic_strand1 == '+' else ']' + b2 = '[' if genomic_strand2 == '+' else ']' + alt1 = "%s%s%s:%d%s" % (ref1,b2,gene_chromosome2,genomic_break_pos2,b2) + alt2 = "%s%s:%d%s%s" % (b1,gene_chromosome1,genomic_break_pos1,b1,ref2) + #TODO evaluate what should be included in the INFO field + info = ['DP=%d' % (span_count + splitr_count),'SPLITCNT=%d' % splitr_count,'SPANCNT=%d' % span_count,gene_name_info,gene_info,gene_loc,expr,homlen,'SPLICESCORE=%d' % splice_score] + if orf: + info.append('ORF') + if exonboundaries: + info.append('EXONBND') + if interchromosomal: + info.append('INTERCHROM') + if read_through: + info.append('READTHROUGH') + if adjacent: + info.append('ADJACENT') + if altsplice: + info.append('ALTSPLICE') + if deletion: + info.append('DELETION') + if eversion: + info.append('EVERSION') + if inversion: + info.append('INVERSION') + info1 = [svtype,'MATEID=%s;MATELOC=%s:%d' % (mate_id2,gene_chromosome2,genomic_break_pos2)] + info + info2 = [svtype,'MATEID=%s;MATELOC=%s:%d' % (mate_id1,gene_chromosome1,genomic_break_pos1)] + info + qual = int(float(fields[columns.index('probability')]) * 255) if columns.index('probability') else '.' + vcf1 = '%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s'% (gene_chromosome1,genomic_break_pos1, mate_id1, ref1, alt1, qual, filt, ';'.join(info1) ) + vcf2 = '%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s'% (gene_chromosome2,genomic_break_pos2, mate_id2, ref2, alt2, qual, filt, ';'.join(info2) ) + add_vcf_line(gene_chromosome1,genomic_break_pos1,mate_id1,vcf1) + add_vcf_line(gene_chromosome2,genomic_break_pos2,mate_id2,vcf2) + write_vcf() + except Exception, e: + print >> sys.stderr, "failed: %s" % e + sys.exit(1) + +if __name__ == "__main__" : __main__() +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/defuse_results_to_vcf.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,32 @@ +<?xml version="1.0"?> +<tool id="defuse_results_to_vcf" name="Defuse Results to VCF" version="@DEFUSE_VERSION@.2"> + <description>generate a VCF from a DeFuse Results file</description> + <macros> + <import>macros.xml</import> + </macros> + <command detect_errors="default"><![CDATA[ + $__tool_directory__/defuse_results_to_vcf.py --input $defuse_results --reference ${defuse_results.metadata.dbkey} --output $vcf + ]]></command> + <inputs> + <param name="defuse_results" type="data" format="defuse.results.tsv" label="Defuse Results file"/> + </inputs> + <outputs> + <data name="vcf" metadata_source="defuse_results" format="vcf"/> + </outputs> + <tests> + <test> + <param name="defuse_results" value="mm10_results.filtered.tsv" ftype="defuse.results.tsv" dbkey="mm10"/> + <output name="vcf" file="mm10_results.filtered.vcf"/> + </test> + </tests> + <help> +**Defuse Results to VCF** + +Generates a VCF_ Variant Call Format file from a DeFuse_ results.tsv file. + +This program relies on the header line of the results.tsv to determine which columns to use for genrating the VCF file. + +.. _VCF: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 +.. _DeFuse: http://sourceforge.net/apps/mediawiki/defuse + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/defuse_trinity_analysis.py Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,466 @@ +#!/usr/bin/env python +""" +# +#------------------------------------------------------------------------------ +# University of Minnesota +# Copyright 2014, Regents of the University of Minnesota +#------------------------------------------------------------------------------ +# Author: +# +# James E Johnson +# +#------------------------------------------------------------------------------ +""" + + +""" +This tool takes the defuse results.tsv tab-delimited file, trinity +and creates a tabular report + +Would it be possible to create 2 additional files from the deFuse-Trinity comparison program. +One containing all the Trinity records matched to deFuse records (with the deFuse ID number), +and the other with the ORFs records matching back to the Trinity records in the first files? + +M045_Report.csv +"","deFuse_subset.count","deFuse.gene_name1","deFuse.gene_name2","deFuse.span_count","deFuse.probability","deFuse.gene_chromosome1","deFuse.gene_location1","deFuse.gene_chromosome2","deFuse.gene_location2","deFuse_subset.type" +"1",1,"Rps6","Dennd4c",7,0.814853504,"4","coding","4","coding","TIC " + + + +OS03_Matched_Rev.csv +"count","gene1","gene2","breakpoint","fusion","Trinity_transcript_ID","Trinity_transcript","ID1","protein" + +"","deFuse.splitr_sequence","deFuse.gene_chromosome1","deFuse.gene_chromosome2","deFuse.gene_location1","deFuse.gene_location2","deFuse.gene_name1","deFuse.gene_name2","deFuse.span_count","deFuse.probability","word1","word2","fusion_part_1","fusion_part_2","fusion_point","fusion_point_rc","count","transcript" + +""" + +import sys,re,os.path,math +import textwrap +import optparse +from optparse import OptionParser + +revcompl = lambda x: ''.join([{'A':'T','C':'G','G':'C','T':'A','a':'t','c':'g','g':'c','t':'a','N':'N','n':'n'}[B] for B in x][::-1]) + +codon_map = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L", + "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S", + "UAU":"Y", "UAC":"Y", "UAA":"*", "UAG":"*", + "UGU":"C", "UGC":"C", "UGA":"*", "UGG":"W", + "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L", + "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P", + "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q", + "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R", + "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M", + "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T", + "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K", + "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R", + "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V", + "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A", + "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E", + "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",} + +def translate(seq) : + rna = seq.upper().replace('T','U') + aa = [] + for i in range(0,len(rna) - 2, 3): + codon = rna[i:i+3] + aa.append(codon_map[codon] if codon in codon_map else 'X') + return ''.join(aa) + +def get_stop_codons(seq) : + rna = seq.upper().replace('T','U') + stop_codons = [] + for i in range(0,len(rna) - 2, 3): + codon = rna[i:i+3] + aa = codon_map[codon] if codon in codon_map else 'X' + if aa == '*': + stop_codons.append(codon) + return stop_codons + +def read_fasta(fp): + name, seq = None, [] + for line in fp: + line = line.rstrip() + if line.startswith(">"): + if name: yield (name, ''.join(seq)) + name, seq = line, [] + else: + seq.append(line) + if name: yield (name, ''.join(seq)) + + +def test_rcomplement(seq, target): + try: + comp = revcompl(seq) + return comp in target + except: + pass + return False + +def test_reverse(seq,target): + return options.test_reverse and seq and seq[::-1] in target + +def cmp_alphanumeric(s1,s2): + if s1 == s2: + return 0 + a1 = re.findall("\d+|[a-zA-Z]+",s1) + a2 = re.findall("\d+|[a-zA-Z]+",s2) + for i in range(min(len(a1),len(a2))): + if a1[i] == a2[i]: + continue + if a1[i].isdigit() and a2[i].isdigit(): + return int(a1[i]) - int(a2[i]) + return 1 if a1[i] > a2[i] else -1 + return len(a1) - len(a2) + +def parse_defuse_results(inputFile): + defuse_results = [] + columns = [] + coltype_int = ['expression1', 'expression2', 'gene_start1', 'gene_start2', 'gene_end1', 'gene_end2', 'genomic_break_pos1', 'genomic_break_pos2', 'breakpoint_homology', 'span_count', 'splitr_count', 'splice_score'] + coltype_float = ['probability'] + coltype_yn = [ 'orf', 'exonboundaries', 'read_through', 'interchromosomal', 'adjacent', 'altsplice', 'deletion', 'eversion', 'inversion'] + try: + for linenum,line in enumerate(inputFile): + ## print >> sys.stderr, "%d: %s\n" % (linenum,line) + fields = line.strip().split('\t') + if line.startswith('cluster_id'): + columns = fields + ## print >> sys.stderr, "columns: %s\n" % columns + continue + elif fields and len(fields) == len(columns): + cluster_id = fields[columns.index('cluster_id')] + cluster = dict() + flags = [] + defuse_results.append(cluster) + for i,v in enumerate(columns): + if v in coltype_int: + cluster[v] = int(fields[i]) + elif v in coltype_float: + cluster[v] = float(fields[i]) + elif v in coltype_yn: + cluster[v] = fields[i] == 'Y' + if cluster[v]: + flags.append(columns[i]) + else: + cluster[v] = fields[i] + cluster['flags'] = ','.join(flags) + except Exception, e: + print >> sys.stderr, "failed to read cluster_dict: %s" % e + exit(1) + return defuse_results + +## deFuse params to the mapping application? + +def __main__(): + #Parse Command Line + parser = optparse.OptionParser() + # files + parser.add_option( '-i', '--input', dest='input', default=None, help='The input defuse results.tsv file (else read from stdin)' ) + parser.add_option( '-t', '--transcripts', dest='transcripts', default=None, help='Trinity transcripts' ) + parser.add_option( '-p', '--peptides', dest='peptides', default=None, help='Trinity ORFs' ) + parser.add_option( '-o', '--output', dest='output', default=None, help='The output report (else write to stdout)' ) + parser.add_option( '-m', '--matched', dest='matched', default=None, help='The output matched report' ) + parser.add_option( '-a', '--transcript_alignment', dest='transcript_alignment', default=None, help='The output alignment file' ) + parser.add_option( '-A', '--orf_alignment', dest='orf_alignment', default=None, help='The output ORF alignment file' ) + parser.add_option( '-N', '--nbases', dest='nbases', type='int', default=12, help='Number of bases on either side of the fusion to compare' ) + parser.add_option( '-L', '--min_pep_len', dest='min_pep_len', type='int', default=100, help='Minimum length of peptide to report' ) + parser.add_option( '-T', '--ticdist', dest='ticdist', type='int', default=1000000, help='Maximum intrachromosomal distance to be classified a Transcription-induced chimera (TIC)' ) + parser.add_option( '-P', '--prior_aa', dest='prior_aa', type='int', default=11, help='Number of protein AAs to show preceeding fusion point' ) + parser.add_option( '-I', '--incomplete_orfs', dest='incomplete_orfs', action='store_true', default=False, help='Count incomplete ORFs' ) + parser.add_option( '-O', '--orf_type', dest='orf_type', action='append', default=['complete','5prime_partial'], choices=['complete','5prime_partial','3prime_partial','internal'], help='ORF types to report' ) + parser.add_option( '-r', '--readthrough', dest='readthrough', type='int', default=3, help='Number of stop_codons to read through' ) + # min_orf_len + # split_na_len + # tic_len = 1000000 + # prior + # deFuse direction reversed + # in frame ? + # contain known protein elements + # what protein change + # trinity provides full transctipt, defuse doesn't show full + #parser.add_option( '-r', '--reference', dest='reference', default=None, help='The genomic reference fasta' ) + #parser.add_option( '-g', '--gtf', dest='gtf', default=None, help='The genomic reference gtf feature file') + (options, args) = parser.parse_args() + + # results.tsv input + if options.input != None: + try: + inputPath = os.path.abspath(options.input) + inputFile = open(inputPath, 'r') + except Exception, e: + print >> sys.stderr, "failed: %s" % e + exit(2) + else: + inputFile = sys.stdin + # vcf output + if options.output != None: + try: + outputPath = os.path.abspath(options.output) + outputFile = open(outputPath, 'w') + except Exception, e: + print >> sys.stderr, "failed: %s" % e + exit(3) + else: + outputFile = sys.stdout + outputTxFile = None + outputOrfFile = None + if options.transcript_alignment: + try: + outputTxFile = open(options.transcript_alignment,'w') + except Exception, e: + print >> sys.stderr, "failed: %s" % e + exit(3) + if options.orf_alignment: + try: + outputOrfFile = open(options.orf_alignment,'w') + except Exception, e: + print >> sys.stderr, "failed: %s" % e + exit(3) + # Add percent match after transcript + report_fields = ['gene_name1','gene_name2','span_count','probability','gene_chromosome1','gene_location1','gene_chromosome2','gene_location2','fusion_type','Transcript','coverage','Protein','flags','alignments1','alignments2'] + report_fields = ['cluster_id','gene_name1','gene_name2','span_count','probability','genomic_bkpt1','gene_location1','genomic_bkpt2','gene_location2','fusion_type','Transcript','coverage','Protein','flags','alignments1','alignments2'] + report_colnames = {'gene_name1':'Gene 1','gene_name2':'Gene 2','span_count':'Span cnt','probability':'Probability','gene_chromosome1':'From Chr','gene_location1':'Fusion point','gene_chromosome2':'To Chr','gene_location2':'Fusion point', 'cluster_id':'cluster_id', 'splitr_sequence':'splitr_sequence', 'splitr_count':'splitr_count', 'splitr_span_pvalue':'splitr_span_pvalue', 'splitr_pos_pvalue':'splitr_pos_pvalue', 'splitr_min_pvalue':'splitr_min_pvalue', 'adjacent':'adjacent', 'altsplice':'altsplice', 'break_adj_entropy1':'break_adj_entropy1', 'break_adj_entropy2':'break_adj_entropy2', 'break_adj_entropy_min':'break_adj_entropy_min', 'breakpoint_homology':'breakpoint_homology', 'breakseqs_estislands_percident':'breakseqs_estislands_percident', 'cdna_breakseqs_percident':'cdna_breakseqs_percident', 'deletion':'deletion', 'est_breakseqs_percident':'est_breakseqs_percident', 'eversion':'eversion', 'exonboundaries':'exonboundaries', 'expression1':'expression1', 'expression2':'expression2', 'gene1':'gene1', 'gene2':'gene2', 'gene_align_strand1':'gene_align_strand1', 'gene_align_strand2':'gene_align_strand2', 'gene_end1':'gene_end1', 'gene_end2':'gene_end2', 'gene_start1':'gene_start1', 'gene_start2':'gene_start2', 'gene_strand1':'gene_strand1', 'gene_strand2':'gene_strand2', 'genome_breakseqs_percident':'genome_breakseqs_percident', 'genomic_break_pos1':'genomic_break_pos1', 'genomic_break_pos2':'genomic_break_pos2', 'genomic_strand1':'genomic_strand1', 'genomic_strand2':'genomic_strand2', 'interchromosomal':'interchromosomal', 'interrupted_index1':'interrupted_index1', 'interrupted_index2':'interrupted_index2', 'inversion':'inversion', 'library_name':'library_name', 'max_map_count':'max_map_count', 'max_repeat_proportion':'max_repeat_proportion', 'mean_map_count':'mean_map_count', 'min_map_count':'min_map_count', 'num_multi_map':'num_multi_map', 'num_splice_variants':'num_splice_variants', 'orf':'orf', 'read_through':'read_through', 'repeat_proportion1':'repeat_proportion1', 'repeat_proportion2':'repeat_proportion2', 'span_coverage1':'span_coverage1', 'span_coverage2':'span_coverage2', 'span_coverage_max':'span_coverage_max', 'span_coverage_min':'span_coverage_min', 'splice_score':'splice_score', 'splicing_index1':'splicing_index1', 'splicing_index2':'splicing_index2', 'fusion_type':'Type', 'coverage':'fusion%','Transcript':'Transcript?','Protein':'Protein?','flags':'descriptions','fwd_seq':'fusion','alignments1':'alignments1','alignments2':'alignments2','genomic_bkpt1':'From Chr', 'genomic_bkpt2':'To Chr'} + + ## Read defuse results + fusions = parse_defuse_results(inputFile) + ## Create a field with the 12 nt before and after the fusion point. + ## Create a field with the reverse complement of the 24 nt fusion point field. + ## Add fusion type filed (INTER, INTRA, TIC) + for i,fusion in enumerate(fusions): + fusion['ordinal'] = i + 1 + fusion['genomic_bkpt1'] = "%s:%d" % (fusion['gene_chromosome1'], fusion['genomic_break_pos1']) + fusion['genomic_bkpt2'] = "%s:%d" % (fusion['gene_chromosome2'], fusion['genomic_break_pos2']) + fusion['alignments1'] = "%s%s%s" % (fusion['genomic_strand1'], fusion['gene_strand1'], fusion['gene_align_strand1']) + fusion['alignments2'] = "%s%s%s" % (fusion['genomic_strand2'], fusion['gene_strand2'], fusion['gene_align_strand2']) + split_seqs = fusion['splitr_sequence'].split('|') + fusion['split_seqs'] = split_seqs + fusion['split_seqs'] = split_seqs + fusion['split_seq_lens'] = [len(split_seqs[0]),len(split_seqs[1])] + fusion['split_max_lens'] = [len(split_seqs[0]),len(split_seqs[1])] + fwd_off = min(abs(options.nbases),len(split_seqs[0])) + rev_off = min(abs(options.nbases),len(split_seqs[1])) + fusion['fwd_off'] = fwd_off + fusion['rev_off'] = rev_off + fwd_seq = split_seqs[0][-fwd_off:] + split_seqs[1][:rev_off] + rev_seq = revcompl(fwd_seq) + fusion['fwd_seq'] = fwd_seq + fusion['rev_seq'] = rev_seq + fusion_type = 'inter' if fusion['gene_chromosome1'] != fusion['gene_chromosome2'] else 'intra' if abs(fusion['genomic_break_pos1'] - fusion['genomic_break_pos2']) > options.ticdist else 'TIC' + fusion['fusion_type'] = fusion_type + fusion['transcripts'] = dict() + fusion['Transcript'] = 'No' + fusion['coverage'] = 0 + fusion['Protein'] = 'No' + # print >> sys.stdout, "%4d\t%6s\t%s\t%s\t%s\t%s\t%s" % (i,fusion['cluster_id'],fwd_seq,rev_seq,fusion_type,fusion['gene_name1'],fusion['gene_name2']) + inputFile.close() + + ## Process Trinity data and compare to deFuse + matched_transcripts = dict() + matched_orfs = dict() + transcript_orfs = dict() + fusions_with_transcripts = set() + fusions_with_orfs = set() + ## fusion['transcripts'][tx_id] { revcompl:?, bkpt:n, seq1: , seq2: , match1:n, match2:n} + n = 0 + if options.transcripts: + with open(options.transcripts) as fp: + for tx_full_id, seq in read_fasta(fp): + n += 1 + for i,fusion in enumerate(fusions): + if fusion['fwd_seq'] in seq or fusion['rev_seq'] in seq: + fusions_with_transcripts.add(i) + fusion['Transcript'] = 'Yes' + tx_id = tx_full_id.lstrip('>').split()[0] + matched_transcripts[tx_full_id] = seq + fusion['transcripts'][tx_id] = dict() + fusion['transcripts'][tx_id]['seq'] = seq + fusion['transcripts'][tx_id]['full_id'] = tx_full_id + pos = seq.find(fusion['fwd_seq']) + if pos >= 0: + tx_bkpt = pos + fusion['fwd_off'] + # fusion['transcripts'][tx_full_id] = tx_bkpt + if tx_bkpt > fusion['split_max_lens'][0]: + fusion['split_max_lens'][0] = tx_bkpt + len2 = len(seq) - tx_bkpt + if len2 > fusion['split_max_lens'][1]: + fusion['split_max_lens'][1] = len2 + fusion['transcripts'][tx_id]['bkpt'] = tx_bkpt + fusion['transcripts'][tx_id]['revcompl'] = False + fusion['transcripts'][tx_id]['seq1'] = seq[:tx_bkpt] + fusion['transcripts'][tx_id]['seq2'] = seq[tx_bkpt:] + else: + pos = seq.find(fusion['rev_seq']) + tx_bkpt = pos + fusion['rev_off'] + # fusion['transcripts'][tx_full_id] = -tx_bkpt + if tx_bkpt > fusion['split_max_lens'][1]: + fusion['split_max_lens'][1] = tx_bkpt + len2 = len(seq) - tx_bkpt + if len2 > fusion['split_max_lens'][0]: + fusion['split_max_lens'][0] = len2 + rseq = revcompl(seq) + pos = rseq.find(fusion['fwd_seq']) + tx_bkpt = pos + fusion['fwd_off'] + fusion['transcripts'][tx_id]['bkpt'] = tx_bkpt + fusion['transcripts'][tx_id]['revcompl'] = True + fusion['transcripts'][tx_id]['seq1'] = rseq[:tx_bkpt] + fusion['transcripts'][tx_id]['seq2'] = rseq[tx_bkpt:] + fseq = fusion['split_seqs'][0] + tseq = fusion['transcripts'][tx_id]['seq1'] + mlen = min(len(fseq),len(tseq)) + fusion['transcripts'][tx_id]['match1'] = mlen + for j in range(1,mlen+1): + if fseq[-j] != tseq[-j]: + fusion['transcripts'][tx_id]['match1'] = j - 1 + break + fseq = fusion['split_seqs'][1] + tseq = fusion['transcripts'][tx_id]['seq2'] + mlen = min(len(fseq),len(tseq)) + fusion['transcripts'][tx_id]['match2'] = mlen + for j in range(mlen): + if fseq[j] != tseq[j]: + fusion['transcripts'][tx_id]['match2'] = j + break + # coverage = math.floor(float(fusion['transcripts'][tx_id]['match1'] + fusion['transcripts'][tx_id]['match2']) * 100. / len(fusion['split_seqs'][0]+fusion['split_seqs'][1])) + coverage = int((fusion['transcripts'][tx_id]['match1'] + fusion['transcripts'][tx_id]['match2']) * 1000. / len(fusion['split_seqs'][0]+fusion['split_seqs'][1])) * .1 + # print >> sys.stderr, "%s\t%d\t%d\t%d\%s\t\t%d\t%d\t%d\t%d" % (tx_id,fusion['transcripts'][tx_id]['match1'],fusion['transcripts'][tx_id]['match2'],len(fusion['split_seqs'][0]+fusion['split_seqs'][1]),coverage,len( fusion['split_seqs'][0]),len(fusion['transcripts'][tx_id]['seq1']),len(fusion['split_seqs'][1]),len(fusion['transcripts'][tx_id]['seq2'])) + fusion['coverage'] = max(coverage,fusion['coverage']) + print >> sys.stdout, "fusions_with_transcripts: %d %s\n matched_transcripts: %d" % (len(fusions_with_transcripts),fusions_with_transcripts,len(matched_transcripts)) + ##for i,fusion in enumerate(fusions): + ## print >> sys.stdout, "%4d\t%6s\t%s\t%s\t%s\t%s\t%s\t%s" % (i,fusion['cluster_id'],fusion['fwd_seq'],fusion['rev_seq'],fusion['fusion_type'],fusion['gene_name1'],fusion['gene_name2'], fusion['transcripts']) + ## Process ORFs and compare to matched deFuse and Trinity data. + ## Proteins must be at least 100 aa long, starting at the first "M" and must end with an "*". + if options.peptides: + with open(options.peptides) as fp: + for orf_full_id, seq in read_fasta(fp): + n += 1 + if len(seq) < options.min_pep_len: + continue + orf_type = re.match('^.* type:(\S+) .*$',orf_full_id).groups()[0] + ## if not seq[-1] == '*' and not options.incomplete_orfs: + ## if not orf_type 'complete' and not options.incomplete_orfs: + if orf_type not in options.orf_type: + continue + for i,fusion in enumerate(fusions): + if len(fusion['transcripts']) > 0: + for tx_id in fusion['transcripts']: + ## >m.196252 g.196252 ORF g.196252 m.196252 type:complete len:237 (+) comp100000_c5_seq2:315-1025(+) + ## >m.134565 g.134565 ORF g.134565 m.134565 type:5prime_partial len:126 (-) comp98702_c1_seq21:52-429(-) + if tx_id+':' not in orf_full_id: + continue + m = re.match("^.*%s:(\d+)-(\d+)[(]([+-])[)].*" % re.sub('([|.{}()$?^])','[\\1]',tx_id),orf_full_id) + if m: + if not m.groups() or len(m.groups()) < 3 or m.groups()[0] == None: + print >> sys.stderr, "Error:\n%s\n%s\n" % (tx_id,orf_full_id) + orf_id = orf_full_id.lstrip('>').split()[0] + if not tx_id in transcript_orfs: + transcript_orfs[tx_id] = [] + alignments = "%s%s%s %s%s%s" % (fusion['genomic_strand1'], fusion['gene_strand1'], fusion['gene_align_strand1'], fusion['genomic_strand2'], fusion['gene_strand2'], fusion['gene_align_strand2']) + # print >> sys.stdout, "%d %s bkpt:%d %s rc:%s (%s) %s" % (fusion['ordinal'], tx_id, int(fusion['transcripts'][tx_id]['bkpt']), str(m.groups()), str(fusion['transcripts'][tx_id]['revcompl']), alignments, orf_full_id) + start = seq.find('M') + pep_len = len(seq) + if pep_len - start < options.min_pep_len: + continue + orf_dict = dict() + transcript_orfs[tx_id].append(orf_dict) + fusions_with_orfs.add(i) + matched_orfs[orf_full_id] = seq + fusion['Protein'] = 'Yes' + tx_start = int(m.groups()[0]) + tx_end = int(m.groups()[1]) + tx_strand = m.groups()[2] + tx_bkpt = fusion['transcripts'][tx_id]['bkpt'] + orf_dict['orf_id'] = orf_id + orf_dict['tx_start'] = tx_start + orf_dict['tx_end'] = tx_end + orf_dict['tx_strand'] = tx_strand + orf_dict['tx_bkpt'] = tx_bkpt + orf_dict['seq'] = seq[:start].lower() + seq[start:] if start > 0 else seq + ## >m.208656 g.208656 ORF g.208656 m.208656 type:5prime_partial len:303 (+) comp100185_c2_seq9:2-910(+) + ## translate(tx34[1:910]) + ## translate(tx34[1:2048]) + ## comp99273_c1_seq1 len=3146 (-2772) + ## >m.158338 g.158338 ORF g.158338 m.158338 type:complete len:785 (-) comp99273_c1_seq1:404-2758(-) + ## translate(tx[-2758:-403]) + ## comp100185_c2_seq9 len=2048 (904) + ## novel protein sequence + ## find first novel AA + ## get prior n AAs + ## get novel AA seq thru n stop codons + ### tx_seq = matched_transcripts[tx_full_id] if tx_bkpt >= 0 else revcompl(tx_seq) + tx_seq = fusion['transcripts'][tx_id]['seq'] + orf_dict['tx_seq'] = tx_seq + novel_tx_seq = tx_seq[tx_start - 1:] if tx_strand == '+' else revcompl(tx_seq[:tx_end]) + read_thru_pep = translate(novel_tx_seq) + # fusion['transcripts'][tx_id]['revcompl'] = True + # tx_bkpt = fusion['transcripts'][tx_id]['bkpt'] + # bkpt_aa_pos = tx_bkpt - tx_start - 1 + # bkpt_aa_pos = (tx_bkpt - tx_start - 1) / 3 if tx_strand == '+' else tx_end + # print >> sys.stdout, "%s\n%s" % (seq,read_thru_pep) + stop_codons = get_stop_codons(novel_tx_seq) + if options.readthrough: + readthrough = options.readthrough + 1 + read_thru_pep = '*'.join(read_thru_pep.split('*')[:readthrough]) + stop_codons = stop_codons[:readthrough] + orf_dict['read_thru_pep'] = read_thru_pep + orf_dict['stop_codons'] = ','.join(stop_codons) + print >> sys.stdout, "fusions_with_orfs: %d %s\n matched_orfs: %d" % (len(fusions_with_orfs),fusions_with_orfs,len(matched_orfs)) + ## Alignments 3 columns, seq columns padded out to longest seq, UPPERCASE_match diffs lowercase + ### defuse_id pre_split_seq post_split_seq + ### trinity_id pre_split_seq post_split_seq + ## Transcripts alignment output + ## Peptide alignment output + ## Write reports + ## OS03_Matched_Rev.csv + ## "count","gene1","gene2","breakpoint","fusion","Trinity_transcript_ID","Trinity_transcript","ID1","protein" + if options.transcripts and options.matched: + #match_fields = ['ordinal','gene_name1','gene_name2','fwd_seq'] + outputMatchFile = open(options.matched,'w') + #print >> outputMatchFile, '\t'.join(["#fusion_id","cluster_id","gene1","gene2","breakpoint","fusion","Trinity_transcript_ID","Trinity_transcript","Trinity_ORF_Transcript","Trinity_ORF_ID","protein","read_through","stop_codons"]) + print >> outputMatchFile, '\t'.join(["#fusion_id","cluster_id","gene1","gene2","breakpoint","fusion","Trinity_transcript_ID","Trinity_transcript","Trinity_ORF_Transcript","Trinity_ORF_ID","protein","stop_codons"]) + for i,fusion in enumerate(fusions): + if len(fusion['transcripts']) > 0: + for tx_id in fusion['transcripts'].keys(): + if tx_id in transcript_orfs: + for orf_dict in transcript_orfs[tx_id]: + if 'tx_seq' not in orf_dict: + print >> sys.stderr, "orf_dict %s" % orf_dict + #fields = [str(fusion['ordinal']),str(fusion['cluster_id']),fusion['gene_name1'],fusion['gene_name2'],fusion['fwd_seq'],fusion['splitr_sequence'],tx_id, fusion['transcripts'][tx_id]['seq1']+'|'+fusion['transcripts'][tx_id]['seq2'],orf_dict['tx_seq'],orf_dict['orf_id'],orf_dict['seq'],orf_dict['read_thru_pep'],orf_dict['stop_codons']] + fields = [str(fusion['ordinal']),str(fusion['cluster_id']),fusion['gene_name1'],fusion['gene_name2'],fusion['fwd_seq'],fusion['splitr_sequence'],tx_id, fusion['transcripts'][tx_id]['seq1']+'|'+fusion['transcripts'][tx_id]['seq2'],orf_dict['tx_seq'],orf_dict['orf_id'],orf_dict['read_thru_pep'],orf_dict['stop_codons']] + print >> outputMatchFile, '\t'.join(fields) + outputMatchFile.close() + if options.transcripts and options.transcript_alignment: + if outputTxFile: + id_fields = ['gene_name1','alignments1','gene_name2','alignments2','span_count','probability','gene_chromosome1','gene_location1','gene_chromosome2','gene_location2','fusion_type','Transcript','Protein','flags'] + fa_width = 80 + for i,fusion in enumerate(fusions): + if len(fusion['transcripts']) > 0: + alignments1 = "%s%s%s" % (fusion['genomic_strand1'], fusion['gene_strand1'], fusion['gene_align_strand1']) + alignments2 = "%s%s%s" % (fusion['genomic_strand2'], fusion['gene_strand2'], fusion['gene_align_strand2']) + alignments = "%s%s%s %s%s%s" % (fusion['genomic_strand1'], fusion['gene_strand1'], fusion['gene_align_strand1'], fusion['genomic_strand2'], fusion['gene_strand2'], fusion['gene_align_strand2']) + fusion_id = "%s (%s) %s" % (i + 1,alignments,' '.join([str(fusion[x]) for x in report_fields])) + for tx_id in fusion['transcripts'].keys(): + m1 = fusion['transcripts'][tx_id]['match1'] + f_seq1 = fusion['split_seqs'][0][:-m1].lower() + fusion['split_seqs'][0][-m1:] + t_seq1 = fusion['transcripts'][tx_id]['seq1'][:-m1].lower() + fusion['transcripts'][tx_id]['seq1'][-m1:] + if len(f_seq1) > len(t_seq1): + t_seq1 = t_seq1.rjust(len(f_seq1),'.') + elif len(f_seq1) < len(t_seq1): + f_seq1 = f_seq1.rjust(len(t_seq1),'.') + m2 = fusion['transcripts'][tx_id]['match2'] + f_seq2 = fusion['split_seqs'][1][:m2] + fusion['split_seqs'][1][m2:].lower() + t_seq2 = fusion['transcripts'][tx_id]['seq2'][:m2] + fusion['transcripts'][tx_id]['seq2'][m2:].lower() + if len(f_seq2) > len(t_seq2): + t_seq2 = t_seq2.ljust(len(f_seq2),'.') + elif len(f_seq2) < len(t_seq2): + f_seq2 = f_seq2.ljust(len(t_seq2),'.') + print >> outputTxFile, ">%s\n%s\n%s" % (fusion_id,'\n'.join(textwrap.wrap(f_seq1,fa_width)),'\n'.join(textwrap.wrap(f_seq2,fa_width))) + print >> outputTxFile, "%s bkpt:%d rev_compl:%s\n%s\n%s" % (fusion['transcripts'][tx_id]['full_id'],fusion['transcripts'][tx_id]['bkpt'],str(fusion['transcripts'][tx_id]['revcompl']),'\n'.join(textwrap.wrap(t_seq1,fa_width)),'\n'.join(textwrap.wrap(t_seq2,fa_width))) + """ + if options.peptides and options.orf_alignment: + pass + """ + print >> outputFile,"%s\t%s" % ('#','\t'.join([report_colnames[x] for x in report_fields])) + for i,fusion in enumerate(fusions): + print >> outputFile,"%s\t%s" % (i + 1,'\t'.join([str(fusion[x]) for x in report_fields])) + +if __name__ == "__main__" : __main__() +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/defuse_trinity_analysis.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,58 @@ +<?xml version="1.0"?> +<tool id="defuse_trinity_analysis" name="Defuse Trinity" version="@DEFUSE_VERSION@.2"> + <description>verify fusions with trinity</description> + <macros> + <import>macros.xml</import> + </macros> + <command detect_errors="default"><![CDATA[ + python $__tool_directory__/defuse_trinity_analysis.py --input $defuse_results --transcripts $trinity_transcripts --peptides $trinity_orfs + --nbases $nbases --min_pep_len $min_pep_len --ticdist $ticdist --readthrough=$readthrough + #if 'matched' in str($outputs).split(','): + --matched="$matched_output" + #end if + #if 'aligned' in str($outputs).split(','): + --transcript_alignment="$aligned_output" + #end if + --output $output + ]]></command> + <inputs> + <param name="defuse_results" type="data" format="defuse.results.tsv" label="Defuse Results file"/> + <param name="trinity_transcripts" type="data" format="fasta" label="TrinityRNAseq: Assembled Transcripts"/> + <param name="trinity_orfs" type="data" format="fasta" label="transcriptsToOrfs: Candidate Peptide Sequences"/> + <param name="nbases" type="integer" value="12" min="1" label="Number of bases on either side of the fusion to compare"/> + <param name="min_pep_len" type="integer" value="100" min="0" label="Minimum length of peptide to report"/> + <param name="ticdist" type="integer" value="1000000" min="0" label="Maximum intrachromosomal distance to be classified a Transcription-induced chimera (TIC)"/> + <param name="readthrough" type="integer" value="4" min="0" label="Number of stop_codons to read through"/> + <param name="outputs" type="select" multiple="true" display="checkboxes" label="Additional outputs"> + <option value="matched">Matched Fusions Trinity Tanscripts and ORFs Tabular</option> + <option value="aligned">Aligned Fusion and Trinity Transcipts Fasta</option> + </param> + </inputs> + <outputs> + <data name="matched_output" metadata_source="defuse_results" format="tabular" label="${tool.name} on ${on_string}: Fusions Trinity Matched "> + <filter>(outputs and 'matched' in outputs)</filter> + </data> + <data name="aligned_output" metadata_source="defuse_results" format="fasta" label="${tool.name} on ${on_string}: Fusion Trinity Sequences"> + <filter>(outputs and 'aligned' in outputs)</filter> + </data> + <data name="output" metadata_source="defuse_results" format="tabular" label="${tool.name} on ${on_string}: Fusion Report"/> + </outputs> + <tests> + </tests> + <help> +**Defuse Results** + +Verifies DeFuse_ fusion predictions in results.tsv with TrinityRNAseq_ assembled transcripts and ORFs. + +DeFuse provides a total fusion sequence of 200-500 nucleotides (nts) around the fusion breakpoint. This may be insufficient to predict the effect of the fusion on protein production. To get a view of the full transcript containing the fusion, Trinity de novo transcripts from the RNA-seq data are compared with the deFuse fusion sequences using a subsequence around the deFuse indetified fusion breakpoint. The Trinity transcriptToOrfs output provides potential proteins from the projected fusion transcript. + +This program relies on the header line of the deFuse results.tsv to determine which columns to use for analysis. + +.. _DeFuse: http://sourceforge.net/apps/mediawiki/defuse +.. _TrinityRNAseq: http://trinityrnaseq.github.io/ + </help> + <expand macro="citations"> + <citation type="doi">10.1038/nbt.1883</citation> + <citation type="doi">10.1038/s41598-018-36840-z</citation> + </expand> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/macros.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,28 @@ +<macros> + <token name="@DEFUSE_VERSION@">0.8.2</token> + <xml name="defuse_requirement"> + <requirement type="package" version="@DEFUSE_VERSION@">defuse</requirement> + </xml> + <xml name="mapping_requirements"> + <requirement type="package" version="0.1.19">samtools</requirement> + <requirement type="package" version="1.0.0">bowtie</requirement> + <requirement type="package" version="2013-05-09">gmap</requirement> + <requirement type="package" version="35x1">blat</requirement> + </xml> + <xml name="r_requirements"> + <requirement type="package" version="3.1.2">R</requirement> + <requirement type="package" version="2.0.3">ada</requirement> + </xml> + <xml name="stdio"> + <stdio> + <exit_code range=":-1" level="fatal" description="Error: Cannot open file" /> + <exit_code range="1:" level="fatal" description="Error" /> + </stdio> + </xml> + <xml name="citations"> + <citations> + <citation type="doi">10.1371/journal.pcbi.1001138</citation> + <yield /> + </citations> + </xml> +</macros>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/make_html.sh Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,18 @@ +#!/bin/bash +## create html with links for output_dir +## usage: make_html.sh dataset_path dataset_extra_file_path +defuse_out=$1 +extra_files_path=$2 +if [ -e $defuse_out ] +then + echo '<html><head><title>Defuse Output</title></head><body>' > $defuse_out + echo '<h2>Defuse Output Files</h2><ul>' >> $defuse_out + pushd $extra_files_path + for f in `find -L . -maxdepth 1 -type f`; + do fn=`basename ${f}`; echo '<li><a href="'${fn}'">'${fn}'</a></li>' >> $defuse_out; + done + popd + echo '</ul>' >> $defuse_out + echo '</body></html>' >> $defuse_out +fi +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/mm10_results.filtered.tsv Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,46 @@ +cluster_id splitr_sequence splitr_count splitr_span_pvalue splitr_pos_pvalue splitr_min_pvalue adjacent altsplice break_adj_entropy1 break_adj_entropy2 break_adj_entropy_min breakpoint_homology breakseqs_estislands_percident cdna_breakseqs_percident deletion est_breakseqs_percident eversion exonboundaries expression1 expression2 gene1 gene2 gene_align_strand1 gene_align_strand2 gene_chromosome1 gene_chromosome2 gene_end1 gene_end2 gene_location1 gene_location2 gene_name1 gene_name2 gene_start1 gene_start2 gene_strand1 gene_strand2 genome_breakseqs_percident genomic_break_pos1 genomic_break_pos2 genomic_strand1 genomic_strand2 interchromosomal interrupted_index1 interrupted_index2 inversion library_name max_map_count max_repeat_proportion mean_map_count min_map_count num_multi_map num_splice_variants orf read_through repeat_proportion1 repeat_proportion2 span_count span_coverage1 span_coverage2 span_coverage_max span_coverage_min splice_score splicing_index1 splicing_index2 probability +8647 GTGCTCCTGCTGCCAGGCGCAGCTGGGCGACATTGGCACGTCCTGTTACACCAAGAGCGGCATGATCCTTTGCAGAAATGACTACATTAGGT|GAAGATTGTAAAAAATTGACATCAGAAATATTTACAGAAATAGATACCTGTTTGAATAAAGTTAGAGATGAAATTTTTGCTAAACTTCAACCGAAGCTTAGATGCACATTAGGTGACATGGAAAGTCCTGTGTTTGCACTTCCTG 4 0.849232794977309 0.875860929877954 0.794775556794258 N N 3.72551845106187 3.02448896101185 3.02448896101185 1 0.0366733649981732 0 N 0 N Y 2482 2085 ENSMUSG00000028266 ENSMUSG00000041264 + - 3 5 144205220 149215434 coding coding Lmo4 Uspl1 144188530 149184350 - + 0 144201813 149198645 - - Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 5 0.831776131589327 0.982288003019777 0.982288003019777 0.831776131589327 4 - - 0.950339354539546 +12095 CTTCCAGGGTCCCCCGAGCCTAATGGATGCCGAGACAGACGAGGGCATGGACTATACAGGCTGTAGCCCTGGAGCGGCGTCCTCAGAGTCTTCCACCATGGACCGTAGCTGTTCCAGCACCC|CTGGCCCTTGACATCTAGCACCCCTTCACCCTCTTCCTGGGGACCCAGCAGGTGGTATGTGGCCGTGGAGCCCTCCGGGCTGTGGCTGTCCTTCCCAGGAGAGGATGACGTAGACTCGTTGCTGACAGGGGAGATGTCACTGCTGC 6 0.869976758916331 0.907802910133282 0.774396849342807 Y Y 3.64400585547602 3.28189243439864 3.28189243439864 0 0.983667499542934 0 Y 0 N N 0 1453 ENSMUSG00000086606 ENSMUSG00000028975 - + 4 4 148948914 149099876 intron coding Gm13205 Pex14 148947492 148960535 - - 0 148948907 148961548 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 5 0.760481034595956 1.02189639023832 1.02189639023832 0.760481034595956 4 - - 0.976704594819493 +4068 CGACACGCCGGGGCTGGCTGAGGAAAACAAAACGAAGCCCCTGGAGACCCGGCTTATCTCAGAGACCACAGCTATTTGCAAACCAGAGCAGGTGGCCAAACAAATTGTCAAAGATGCCATA|TACCGTTCCCCAGCTGAAGAGTTCTGAATCCACGCCGGATCCTTCTCAACAGTCTGTTTTACGGGAACTTTTATTAACCACTCCTTCCCCGTGATGCAGTTCTGAATCCTCCCTGTAGCAGGGGGTCTTCACTCATGCCTGAAGATGTTTCTTTTCC 8 1 0.541521196503195 0.26082452496772 N Y 3.46128890676658 3.81976875220098 3.46128890676658 2 0 0 N 0.794128973014604 N N 775 35409 ENSMUSG00000009905 ENSMUSG00000022816 + - 1 16 106759742 37836514 coding intron Kdsr Fstl1 106720410 37776873 - + 0.00358422939068104 106734547 37799068 - - Y - - N dataset_6344_files 4 1 1.81818181818182 1 4 1 N N 0 1 11 0.974366325576069 1.17240826166877 1.17240826166877 0.974366325576069 4 - - 0.913877182950088 +12868 GCGGTCTCGGCTCCAGCGGCAGTAGCAGCGGCGCCGGTCCCGTGTGCAGGAGCTCCTTTGCGGCCCAGTTTCTTGGCCATCGCCTGCTCTCCCCACAGCGCCAGGACGAGTCCCGTGCGCGTCCGTCCGCGGAGGTCTTTCTCATCTCGCTCGGCTGCGGGAAATCGGGCTGAAGCGACTGAGTCCGCGATGGAGA|AAACTTTAGAAACTGTTCCTTTGGAGAGGAAAAAGAGAGAAAAGGAAAACTT 69 1 0.439895614069599 0.332872660216425 N Y 3.30124852419771 2.96787791690762 2.96787791690762 0 0.0997837503861599 0.00401606425702794 N 0.540315106580167 N N 41631 0 ENSMUSG00000004980 ENSMUSG00000085456 + + 6 10 51469894 73327027 coding intron Hnrnpa2b1 Gm15398 51460434 73201399 - - 0.0997837503861599 51467295 73201702 - - Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 41 1.14864322933764 0.720872647377417 1.14864322933764 0.720872647377417 1 - - 0.520098627306064 +5160 TACGGATTCATTCAGTGTTCAGAACGGCAAGCTAGACTTTTCTTCCACTGTTCACAATATAATGGCAACCTCCAAGACTTAAAAGTAGGAGATGATGTTGAATTTGAAGTATCATCTGACCGGAGGACTGGGAAACCTATTGCTATTAAATTGGTGAAGATAAAACCAGAAATACATCCTGAAGAACGAATGAACGGAC|AAGAAGTATTTTATCTGACTTACACCCCTGAAGATGTGGAAGGGAAAGTTCAGC 33 1 0.767323324767952 0.471361574015707 N Y 3.3022815939676 3.55553920790569 3.3022815939676 91 0.944663167104112 0.00393700787401585 N 0.944663167104112 N N 10681 0 ENSMUSG00000068823 ENSMUSG00000087940 + - 3 18 103058189 28309891 coding upstream Csde1 SNORA17 103020546 28309760 + + 0 103040040 28188917 + - Y - - N dataset_6344_files 2 0 1.03333333333333 1 1 1 N N 0 0 30 1.1011131646754 0.67334258271517 1.1011131646754 0.67334258271517 3 - - 0.835626866413202 +8748 GGGGTAGATCACCTTCCGAGGGTCTCCATGGGTCCAGGCTATGATGCCCACAGCCACATAGCCTACGATGGCCAGGAAGAGCAACACACAGCAAATGACATCTGTGCATCCCCTGTTGTAAATGGGTCCTTTGAAGGTGGGGTCGTATTTCTGAGGCGTCCC|ATAGACTGCGTCCTTCCGATCGTCCTCCATGGCCTCAACCGAGGAGAGCTGAGTCCGAAGCCAGCGCGACCCCAACCCAAGCGGGCGGGAGACACCGCGCGCTGCTGGCCCCGCC 51 1 0.844316325604616 0.871821260113135 Y Y 3.29447165365238 3.73336824417043 3.29447165365238 2 0 0 Y 0.991335627150454 N N 6343 2 ENSMUSG00000057193 ENSMUSG00000003309 - - 9 9 21355025 21312333 coding upstream Slc44a2 Ap1m2 21337828 21295457 + - 0 21337962 21320850 - + N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 52 1.3070767782118 0.958522970688653 1.3070767782118 0.958522970688653 2 - - 0.74403558965543 +11169 CCACATCTGACAGAACTTGCCACTGTGCCTGCAACCTTGTCTGAGAGGAACCCTTCTCTG|AGGATGGACACTTCTCACACTACAAAGTCCTGTTTGCTGATTCTTCTTGTGGCCCTACTGTGTGCAGAAAGAGCTCAGGGACTGGAGTGTTACCAGTGCTATGGAGTCCCATTTGAGACTTCTTGCCCATCAATTACCTGCCCCTACCCTGATGGAGTCTGTGTTACTCAGGAGGCAGCAGTTATTGTGGATTCTCAAACAAGGAAAGTAAAGAACAATCTTTGCTTACCC 410 1 0.87136611677351 0.978590093310325 Y Y 3.45293525254824 3.55976477294109 3.45293525254824 0 0 0.152910958904109 Y 0.966780821917808 N Y 1416 5049 ENSMUSG00000079018 ENSMUSG00000075602 + - 15 15 75048837 74997634 utr5p utr5p Ly6c1 Ly6a 75045017 74994878 - - 0 75048442 74996568 - + N - - N dataset_6344_files 2 0 1.53082191780822 1 155 2 Y Y 0 0 292 0.728794324821125 1.49719703686079 1.49719703686079 0.728794324821125 4 - - 0.93076564160702 +12600 CTAAAATCGCCAAGCCTGTCAAGTTTGAGCTTTCTGGCTGCACCAGTGTGAAGACATACAGGGCTAAGTTCTGCGGGGTGTGCACAGACGGCCGCTGCTGCACACCGCACAGAACCACCACTCTGCCAGTGGAGTTCAAATGCCCCGATGGCGAGATCATGAAAAAGAATATGATGTTCATCAAG|CCCTGCCCCTGCCATTACCACTGTTCTGGGGTCCCTGGCATGCTCCCATTAC 1 1 0.820229055260519 0.873212995139997 N N 3.28548615724827 3.17936818462789 3.17936818462789 0 0.482950872656755 0.482950872656755 N 0.482950872656755 N N 7298 2735 ENSMUSG00000019997 ENSMUSG00000015133 + + 10 7 24598683 66388350 coding intron Ctgf Lrrk1 24595442 66226912 + - 0.482950872656755 24597524 66355922 + - Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 5 1.02981806768202 0.41192722707281 1.02981806768202 0.41192722707281 3 - - 0.58275607356487 +1855 CCTGGTCACACAGCTCTCCTAGGAAGCGCGTCTCCAATGACTCGTTTGTCAACCCCAGGTGTCTTCCACGTTGTGGTTTCAACTTCATAATTCTCTGAAGTCTTATTCAT|CTCTTTGTCAAGCCCACGGCGTCAGCCTCTGACACGAGGGCGGGCGTCCTCGCCTCCGGGGTGAAAGAGGGGCGCAGCGGGCCGCCCCTCCCCCCCGCCCCCCTACGACACGCGGGGCCCTGCCTCGGGCGACCTGTTGGCAGGGCGCGTCACGTGACGCGGCGGGCCGCGCCGTCCCC 4 0.769054178023969 0.159736089157399 0.42467594723228 Y Y 3.40528677690917 3.51935717616377 3.40528677690917 0 0.981880877742947 0 Y 0.0487460815047023 N N 0 8476 ENSMUSG00000097162 ENSMUSG00000039361 - - 7 7 90129474 90209447 intron upstream AC130210.1 Picalm 90124858 90130232 - + 0 90125032 90129872 + - N - - N dataset_6344_files 1 0.23489932885906 1 1 0 1 N Y 0 0.23489932885906 9 0.887227873695282 1.18032993911247 1.18032993911247 0.887227873695282 4 - - 0.949188379753168 +1870 CTGGGTCAAGAGCCGGAGGGACAGGACCAGAGCACCCCTTACGCCAGAACTAGCTCTCCTTGTTCCTACTGGGTGACCTCATCTCGCCACGCCTCCTCAGGTGAACACCCGGGCTGGTAACGTCACTTCCTGC|CAGGGTTTCACTATGTAGACCCTGGTCGGCCTGGAACTCTATAGACCAGACTGGCCTCGAGCTCAGATCCGTCCCCCTCTGCTGTCCCAGCACGGGGATTAAGAACGCGCCACCACTACAGCTGACCGGA 2 0.788660184540219 0.550050944563454 0.182990959521483 Y Y 3.63552422018169 3.63193049733206 3.63193049733206 4 0 0 Y 0 N N 1543 697 ENSMUSG00000044786 ENSMUSG00000003444 - + 7 7 28379255 28392708 downstream intron Zfp36 Med29 28376784 28386146 - - 0 28376683 28392166 + - N - - N dataset_6344_files 1 0.869158878504673 1 1 0 1 N Y 0 0.869158878504673 8 1.08526980978798 0.847619486476743 1.08526980978798 0.847619486476743 1 - - 0.841318537051835 +3153 CAGAGCTATGTAGAAAGACCCTGTCTGGTAAGTAAATAAAAACATAGCCAGGCATGGTGGCAATCAGCAGGTAGATTGGAGTTTGAGGTCATCCTGGTCTGGAGAGTGAGTTCCAGGAGAGCCAAGATTACACAAACCC|TGTCTTTTTCTTTCTTTCTGTTGTTGTTGTTGCTGTTCCTGCTGCTGCTGCTGTTTTGCTTTTCATGACAGGGTTTCTCTGTGTCTCTGTGCAACTTTGGCTATCCTGGAATTCACACTGTAGACCAGGCTGGCCTTGAATTCACACAGATCCATCTG 16 1 0.656277930324132 0.671834466829468 Y Y 3.40740634961144 2.41484814640267 2.41484814640267 0 0.9928298971561 0 Y 0.985659794312201 N N 0 30 ENSMUSG00000097379 ENSMUSG00000047786 - - 17 17 17395303 17459387 intron intron AC154200.1 Lix1 17389943 17402672 - + 0 17395298 17411818 + - N - - N dataset_6344_files 1 0.709459459459459 1 1 0 1 N Y 0.685314685314685 0.709459459459459 25 1.13279987445023 1.17240826166877 1.17240826166877 1.13279987445023 1 - - 0.81041504882724 +11596 GCCGGAGCAGATCAGGCTGAAAGTTGGTGGTGTGGACCCAAAGCAGCTAGCCGTCTATGAAGAGTTTGCACGAAATGTGCCTGGCTTCTTACCTACAAATGACTTAAGTCAGCCTACAGGGTTTTTAGCTCAGCCCATGAA|GTTTCTGGATCAAGATGTGAGCTTTCAGCTGTTGCTTGAGCATCATGCCTGCCTGCCTGCCACCACGCTCTCGGCCAGGATAGTGATGGATTCCTTCCCTCGGACTATAAGCCCCAAATGAACCCTTCCTTCTATGAGTTGCCT 8 0.974101244860941 0.684001003266195 0.530048089286902 N Y 3.49640736351508 3.32783191248938 3.32783191248938 0 0 0 Y 0 N N 6004 544 ENSMUSG00000036550 ENSMUSG00000036564 + + 8 8 95807462 95715119 coding intron Cnot1 Ndrg4 95719451 95676980 - + 0 95739809 95682960 - + N - - N dataset_6344_files 1 1 1 1 0 1 N N 0 1 16 1.12487819700652 1.17240826166877 1.17240826166877 1.12487819700652 1 - - 0.873952352979217 +3783 CTCTGTTCGTGCATCCCTGGGATATGCAGATGGATGGACGAATGGCCAAATACTGGCTGCCTTGGCTCGTAGGTTTGTGCTGGACTAGGGTTGGGAACACAGCACCAACTCTTGGGTTTTTTCTGTTATCCATGGAATTCTGTTCT|TTTCTTTCCGAGTGACCCCCACTTTACGGCGTCTCCCAATGACAAGAGGCGCAGAGCATTGTGGTCCCGCGGGTCCGAGA 2 0.867750583125795 0.425502304916607 0.692249690893423 Y Y 3.22804973003685 3.42048008240514 3.22804973003685 0 0 0 Y 0.987555066079295 N N 0 752 ENSMUSG00000026348 ENSMUSG00000026349 + - 1 1 127767564 127808061 intron upstream Acmsd Ccnt2 127729413 127774164 + + 0 127753955 127773930 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 6 1.02189639023832 0.649577550384046 1.02189639023832 0.649577550384046 3 - - 0.711778687837589 +4912 TAACAGAACAAGAGCAATGTGCTAGAATAGAAGACCAGAAAATGAAATGGTGGAGTTTGA|GGCTGGATAGACAGTTTGAAAGGTAAGTATTGAAAAACACTTGAATTTGGGTCAGTACAAAGGGACATGCAGAGACTTTGAATCATCAAAACTCCAGCATGCATTGTCTTACGGATGTTTAGATAGGTGTGTTTTGGACAACACTCTGGGTTCTTGTAATGATGTTGATCAAATGTCTGAG 15 1 0.895469718833858 0.922600375157455 N Y 3.29868787111796 3.36447795254773 3.29868787111796 0 0.933608815426997 0 N 0.933608815426997 N N 0 0 ENSMUSG00000088422 ENSMUSG00000053332 + - 5 1 79639109 161038539 downstream intron 7SK Gas5 79638815 161034422 - + 0.933608815426997 79638354 161036831 - - Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 5 0.459457291735057 0.808011099258204 0.808011099258204 0.459457291735057 1 - - 0.592482164323654 +5141 TCTTATCTCTGTGTTGGTGTGCTTCTCTGTTTGATGACAGAGCAGGGTCTTGCTGAACTGCACCTACTGGAGTAGGCAACTGTTTACCAATCCAGTCATATTCATAATCAAACATATACCCTTTTCGATCAAACAAGTCAGTAAAAAGCTTTCTTAGGTAATCATAGTCTGGCTTTTCAAAAAAATCCAGCCTTCGTACGTAACGGAGAT|ACGTTGCCATTTCTGGGAAGTTCTCACACAACACTTCTATTGGTGTGGCTCGTTT 9 1 0.853164221793603 0.997416520640376 N Y 3.60831833727406 3.59847695253922 3.59847695253922 266 0.981886534518113 0.981886534518113 N 0.981886534518113 N N 2227 0 ENSMUSG00000073563 ENSMUSG00000083798 - + 18 X 53955684 53418967 coding intron Csnk1g3 Gm14584 53862113 53418331 + + 0 53932206 53418862 - + Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 17 1.2040949714436 0.990209680463485 1.2040949714436 0.990209680463485 2 - - 0.788432447586541 +1886 CGCGGGGCGCCCGCCGGCTCACACTCCCGCGCTCTCTCTCCGGGTTTGGCGGCCGCCAGGAGGAGGAAGAGGAGGAGGAGGAGAAGAAGAAGGAGGAGGAGTGGAGCGAGCGCAGAAG|TAGCTGCTGCTGCTGGTGGTGACAATGTCAAATAACGGCGTAGACATCCAGGACAAACCCCCAGCCCCTCCGATGAGAAACACCAGCACTATGATTGGAGCCGGCAGCAAAGACACTGGAACCCTAAACCACGGCTCCAAACCTCTGCCTCCAAACCCAGAGGAG 18 1 0.708948431161152 0.783431002546196 Y Y 2.65454686157692 3.52101274054962 2.65454686157692 0 0.983110527572213 0 Y 0.983110527572213 N N 16 2476 ENSMUSG00000042797 ENSMUSG00000030774 - - 7 7 97738247 97912381 upstream utr5p Aqp11 Pak1 97726379 97842935 - + 0 97788974 97854436 + - N - - N dataset_6344_files 1 0.366666666666667 1 1 0 1 N Y 0.366666666666667 0 18 0.950601293244945 1.29915510076809 1.29915510076809 0.950601293244945 4 - - 0.970255256454431 +9297 GTGCTGGGCAGGAAGTCCCGGGCCAGGCAGCCCATGGCCACCAGATTCTTATCAGACAGGGGGCTCTCGCAGGAGACGAGGGGGAAGACATTTGGGAAGGACTGACTCT|CATTTGCGGTGCCTGGTTTCGGAGAGGTCCAGAGTCTTTGTGTGGAATTGTTCCTTCAAAGCCACCGAGGCTGGCTGGTCCATGAGCAGCCAGGTGGATGGGTGGCAGAAGCC 20 0.318065745640931 0.64026046146514 0.56228459372928 N Y 3.05166433869594 3.43614916339873 3.05166433869594 0 0 0 Y 0.990866828485621 N N 1488 0 ENSMUSG00000076617 ENSMUSG00000092748 - + 12 12 113422730 113425227 coding upstream Ighm AC073553.3 113418950 113425150 - - 0 113422730 113426655 + - N - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 6 0.665420905271462 0.879306196251575 0.879306196251575 0.665420905271462 4 - - 0.831715391282697 +5326 GTACACTGTAGCTGTCTTCAGACACACCAGAAGAGGGAGTCAGATCTTGTTACGGATGGTTGTGAGCCACCATGTGGTTGCTGGGATTTGAACTCTGGACCTTCGGAAGAGCAGTCGGGTGCTCTTACCCACTGAGCCATCTCACCAGCCCC|GAGTTACACAGTTTTAATGACTTCCTTACTTCTCATGATCTCTATACTGTTTAACTTTCCTCGGTGTGTTTTTAAGTCTTTGCTATCTTGTTCTAGTTTTTTGCAAGAATCACGAAAATGATTCTTGGATTTCCAGACTCTTCCTTTTGAAAGCTG 7 1 0.607242467201396 0.755062684986217 N Y 3.49744043328497 3.49528100778935 3.49528100778935 9 0.960654062340317 0 N 0.967211718616931 N N 2963 0 ENSMUSG00000039671 ENSMUSG00000073647 + + 2 18 165899016 4198969 utr3p downstream Zmynd8 Gm10557 165784152 4198216 - - 0.00979390223130638 165827729 4198107 - - Y - - N dataset_6344_files 476 0.993464052287582 298.461538461538 168 13 1 N N 0.993464052287582 0 13 1.21201664888731 1.34668516543034 1.34668516543034 1.21201664888731 1 - - 0.536685629613134 +1721 CGTACCTCCCTCCCAGCAACCGGCCTGGCGGCAGCGCGGCTACAAAACTGAGGAGGCGGAGCCGAGACGGAGTCGGTACTGCGCTCTGACTCCTAGACCAGGTTTAAGTTTTTGAAGTTGAAGTAGGTCTACACAGTAGGAACCCATGTCTTTTCTTGTAAGTAAACCAGAGCGCATTA|GGGCCAATGAGGCGAGCTCAGAGTCCATAGCATTGTTCTCCAAACCA 8 0.903915350299452 0.697638501162383 0.785699237375327 N Y 3.65384724021086 3.68527868815389 3.65384724021086 142 0.936451401255975 0.343331146311745 N 0.936451401255975 N N 800 0 ENSMUSG00000069631 ENSMUSG00000087034 + + 11 5 106202168 60232198 coding upstream Strada Cbfa2t2-ps1 106163330 60231482 - + 0 106187103 60176541 - + Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 5 0.879306196251575 0.475300646622473 0.879306196251575 0.475300646622473 1 - - 0.738431070020728 +1027 GCGGCAGCCGGTTCGGGCGGGCGGCATCATGGACGAGAAGTTGTTCACCAAGGAGCTGGACCAGTGGATCGAGCAGCTGAACGAGTGCAAGCAGCTCTCCGAGTCCCAGGTCAAGAGCCTCTGCGAGAAG|TGCTGCCTTTGACAGAGATGACATGCTTATACCATGCGGGTGGCACGAAGCTGTGAAGTGGTGATGACGGGGATGAGCTTGGACATCCTACGAGAGTGGCAAAGGTGAAGCAAGCCCAGGTGCTGGCTGTGCAAGG 2 1 0.148625497007208 0.21082086276253 N N 3.33524062522047 3.43039601505348 3.33524062522047 3 0 0 N 0 N N 10063 1785 ENSMUSG00000020349 ENSMUSG00000035021 + + 11 12 52127778 55014348 coding intron Ppp2ca Baz1a 52098681 54892989 + - 0 52099150 54980379 + - Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 10 0.942679615801238 1.07734813234427 1.07734813234427 0.942679615801238 4 - - 0.83438414122479 +7746 AAAGGGAGTCGAGACTGCCTTCTGCGCGCGCCCGGCTTTGCGCGCCTCCGCCACCAGATGTGGGGGGATGGGAGGCCCCCTCCGCGGCCCCTTCCCCACCCAGCCCAGAAAGCTGAACTGGCAAG|AGGCAATTTGAAACAAGCCACTCCAACCTCTTTTTCAAAGTTATAGGAGGTTCCCTGTCTTGAAGTCTCCTGCCTTGGATTTTCTGAGGTGCTGCTATTCTGGGCAACTATCAAAATCCTACCTGTTAAAACATGATGGATTAGAGAAAAAAAACAACCCAC 26 1 0.707412664952061 0.861438628736582 Y Y 3.14574680596571 3.22961716614005 3.14574680596571 0 0 0 Y 0 N N 0 0 ENSMUSG00000087658 ENSMUSG00000087626 + - 6 6 52162119 52174059 intron intron Gm15051 Gm15050 52156902 52172881 + + 0 52158684 52173634 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 20 1.02189639023832 1.23578168121843 1.23578168121843 1.02189639023832 2 - - 0.746044848941682 +8762 GGGGGCGGGGCTGAGCGCGGCCGCAGCCATTTTGGTGGAAGAGAAACAATAGGACGGAAGCGTCGCGGGACTGGGCTGTGGCCGCAG|AGTGTCCTGGCCTGCACAAACCGAGGAGCTGAGATCAAACAGGTGGCTGTAAGGACAGACAGTGAACGGAGGGCAAGCCGGCCTCTGGACCCCGCTGCCTCCCCTTTCTCCCTGCTGCTCGTGTCCAGAGGATGAGCCCAGCCTTCAGGACCATGGACGTGGAGCCCCGCACCAAGGGCATCC 28 0.498252441818171 0.890426951441059 0.30351938174741 Y Y 3.55041790174621 3.58424216889964 3.55041790174621 0 0.977096322687365 0.0380455528693218 Y 0.977096322687365 N N 0 714 ENSMUSG00000088626 ENSMUSG00000032599 - - 9 9 108795388 108806333 downstream utr5p SNORA28 Ip6k2 108795243 108795994 - + 0 108783870 108796064 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 8 0.705029292490001 1.27539006843697 1.27539006843697 0.705029292490001 4 - - 0.953988239876003 +13800 GATTGTTAGGGATGGGTCCTTGGTCAGCTGTCCAGAATGCTAGAGCTTCGTCCTCCTGGGAGATGGTTTCAGTCCTTACCCCCAGGACTCTGATGAGATCCTGAGGGAATCCACCCTCTGCTGATGGCCCAGTGAAAGCC|AGGCTCCGTGGCTTTACAGTGGATGGGATCTGGGAGAGTAGAGGGAAGGTTCTAGAACCCTGAAACCAGACCACTCTATTAGCGAACTCACAGCTGCCTTGTGGTGTAGAC 3 0.978409273957491 0.969552944230377 0.898574822628826 N Y 3.329991337985 3.51679652308403 3.329991337985 0 0 0 Y 0 N N 575 443 ENSMUSG00000053799 ENSMUSG00000024987 + - 19 19 37683245 37701528 intron upstream Exoc6 Cyp26a1 37550418 37697808 + + 0 37678397 37696729 + - N - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 12 0.839697809033035 0.855541163920451 0.855541163920451 0.839697809033035 1 - - 0.80116231284735 +5983 CCATGTCAAACCACCATCCACGGCTGTCGTCCTCGAAGTGCAGGTAGTCCATCCTGTGAGCGCCGTGCGGGAACTGCCTCCGTGTCACTGGGGCGGCGCGCCTGTGGAAT|CCTCCGAAGAGATGGAATCCTTTCCTGCAGCTCGGCAAGGGCCACTTCGCAGAGCTGGATTTCTGAAAGCTTTGCTTGATTTTCAAATATTCTTTAGTAAAGAATGTCTTTGTGGCATTGTTC 1 1 1 0.945022483912395 Y Y 3.49015970162987 3.55334927718654 3.49015970162987 0 0.990947940947941 0.00427350427350426 Y 0.990947940947941 N N 0 60 ENSMUSG00000021718 ENSMUSG00000021720 + + 13 13 105121782 105271039 downstream intron 4933425L06Rik Rnf180 105082122 105149352 + - 0 105131562 105167527 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 5 0.78424606692708 0.974366325576069 0.974366325576069 0.78424606692708 1 - - 0.813708699275449 +12092 TGCCCAATGAGCTGCTGGCACTCAGCACAGGTGTTGGCGAAGGTGTTGTCATAGCACGGAACGCAGTAGGGGCCACTGTCTGTCTGGATGTATTTGCGGCCGTACAAGGACTCGTTGCATTTTGCACAGTCAAATGCCTCGCTCATGGTGGCGGTGCCCAGTGAGC|CCTCAAACTCAAGAAGCCCCATCTCAGTCGGTCTTCTTACTTTGCAAGAGTTTTCAAAGGACTGCGCTGGGTCTCTTCTGCCAAGCGGCTCATGGCTTCCTCTGGGTGCTGAATCATTCTCTGGTGCCTGCAACCACAAGACCTCCTTCC 2 1 0.16313785091306 0.240986946418107 Y Y 3.43784458057793 3.48805370929674 3.43784458057793 0 0.99335436382755 0.00315457413249232 Y 0 N N 2072 0 ENSMUSG00000032643 ENSMUSG00000088067 - - 4 4 124708611 124697607 utr5p downstream Fhl3 U6 124700701 124697505 + - 0 124705614 124696071 - + N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 9 1.21993832633101 1.18825161655618 1.21993832633101 1.18825161655618 1 - - 0.736022708835765 +11605 CTCCACACCACAGCGTCACCGATGTCCAGCGCTTTCGAGCCGGCACACAGCTGTTGCAGAAGACTACAGGGGTGGGATCCGAGCTGGGAATGTAGAAGGAGGAACAGCGGCCAAAACAGAGATGATTAAGGACCCGGGCACTTGTGCAACCAGGCCTGGAGATCACCTGC|GCCGTACGGACTAAAGCAGCGCGGCGCTCCTCCGCTCCCCGGCCGGAGGCCCCCGGTGTTTCCGCCGCGCAGGCAGCGCCGTAGCCAGCCCCGCTGCCGCGAGGACCCACAGCCAAG 11 0.982340211125991 0.568397069242315 0.586452494205497 Y Y 3.57167229721571 3.28292550416853 3.28292550416853 0 0 0 Y 0.770032051282051 N N 167 2428 ENSMUSG00000053226 ENSMUSG00000033751 - - 8 8 84822823 84835482 intron utr5p Dand5 Gadd45gip1 84815405 84831522 - + 0 84816537 84832174 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 12 1.26746839099326 0.990209680463485 1.26746839099326 0.990209680463485 2 - - 0.758045795414889 +7456 AAATTCTGGAGTCAGTTCTGGAGACATAGATAGCTCCCAAATTATAACCAACCCTCTTCCTCCCGTGGCCTCCCCTCCTCCTGCATCTAAAGCAAAGGAAGTTTCCGATGGGGAAAATCTCGAGCAAGATCTATGTACGTTCTTGATATCAAGAGCCTGTAAGAACTCAACACTGGCTAATTATTTATACTGG|TCACTTTTACTCTGCTATTCCCCTATCTATGCTCCGAAAATCATTCTGACTCCTGCAATGACCCCTATTACCCAGATTTTGAAGATACTCCCTGCTATCTACTCAGCTCTGAAGACTTG 9 1 0.674019218743678 0.827309111200933 N N 3.64400585547602 3.3012485241977 3.3012485241977 0 0 0 N 0 N N 1791 0 ENSMUSG00000033628 ENSMUSG00000056822 + - 18 16 30348126 19487830 coding downstream Pik3c3 Olfr166 30272747 19486793 + + 0 30311220 19493776 + - Y - - N dataset_6344_files 1 1 1 1 0 1 N N 0 1 10 1.37837187520517 0.942679615801238 1.37837187520517 0.942679615801238 1 - - 0.745013975183866 +2385 CGGGACGGGGCGGGGCCGGCGAACTTCTGTGCCTCACTGTCCCCGGACACTGAGGGACACCGGGCAGGCAGCTGGCACCATGAAGATCTGGACTTCGGAGCACGTCTTTGAC|TACATCCGGTTTTCCCAGATCTGTGCAAAAGCAGTGAGGGATGCCCTGAAGACCGAGTTCAAAGCGAACGCTGAGAAGACTTCGGGCAGCAGCATAAAAATTGTGAAAGTCTCGAAGAAGGAGTAGCTGAATCTGAAGCCTGAAGTGCTGAGTCTTGAA 7 1 0.168340481582645 0.162692865850249 Y Y 3.69784855983781 3.60831833727406 3.60831833727406 0 0 0 Y 0 N Y 3085 3617 ENSMUSG00000016257 ENSMUSG00000016252 + - 2 2 174473081 174464105 coding coding Slmo2 Atp5e 174465067 174461072 - - 0 174472832 174462629 - + N - - N dataset_6344_files 2 0.190909090909091 1.2 1 1 1 Y Y 0.190909090909091 0 5 0.871384518807867 1.29915510076809 1.29915510076809 0.871384518807867 4 - - 0.7622459420951 +13923 AGACTGTTGAGAAGGATTCAACTGCCGAATTCAGAACTCATCAGCTGGGGAACGACGGTGATAAAGGTTCCCGTAAAGCAGACTGTTGAGAAGGATTCAACTGCCGAATTCAGAACTCATCAG|CCAGAGTCGGCGCTCTCCGGCGAGCTATCCCCTTCTCACCACACTCTGAGAACGGAGCTTGGTGCCGGCTCGGCCGCCTCCGCCAATTCCGGGTCCCTCTTCA 4 0.0503617839630236 0.900541946891923 0.647988756014038 N Y 3.6988816296077 3.47080361183081 3.47080361183081 0 0.980668063812497 0 N 0.980668063812497 N N 590 367 ENSMUSG00000046079 ENSMUSG00000051671 - - 5 8 105832436 126425434 intron upstream Lrrc8d 1810063B05Rik 105699969 126422501 + + 0 105728723 126422269 - - Y - - N dataset_6344_files 10 0.968503937007874 8 4 7 1 N N 0.968503937007874 0 7 0.974366325576069 0.617890840609215 0.974366325576069 0.617890840609215 4 - - 0.812231785796686 +11177 ATCTGACAGAACTTGCCACTGTGCCTGCAACCTTGTCTGAGAGGAA|CCCTTCTCTGAGGATGGACACTTCTCACACTACAAAGTCCTGTTTGCTGATTCTTCTTGTGGCCCTACTGTGTGCAGAAAGAGCTCAGGGACTGGAGTGTTACCAGTGCTATGGAGTCCCATTTGAGACTTCTTGCCCATCAATTACCTGCCCCTACCCTGATGGAGTCTGTGTTACTCAGGAGGCAGCAGTTATTGTGGATTCTCAAACAAGGA 434 0.873359569934578 0.121885088610831 0.552181129473186 Y Y 3.55081912933034 3.43784458057793 3.43784458057793 0 0 0.00381679389312994 Y 0.956687686691006 N Y 1416 5049 ENSMUSG00000079018 ENSMUSG00000075602 + - 15 15 75048837 74997634 utr5p utr5p Ly6c1 Ly6a 75045017 74994878 - - 0 75048442 74996568 - + N - - N dataset_6344_files 2 0 1.65079365079365 1 41 2 Y Y 0 0 63 0.665420905271462 1.36252852031776 1.36252852031776 0.665420905271462 4 - - 0.566839236883877 +3839 GTCACAGCCACCAATGTGTCAGCCCATGGAAGCCAAGCTAACTCGCCCTCTACTCCCAACTCAGCGGGTGGATACCCTTCGCCATGTTATCAGCCAGACAGGAGGATACAGTGACGGACTCGCAGCCAGTCAGATGTACAGTCCGCAGGGCATCAGT|GATATGTCAAGAACCTACTGATCCTCACAAGAACCTACTGTCTCTCTTCTCTTGACTGAAAACAAAAGTCTTCTTCTACCTGACCCGTGGCCTGACTTCTGGAAGAATGCATATGGACTTTCGAAGAAGTCAGAGGATATCTGCTGGCCACTTGTATCAGACAAAACAAGGCTGGTGAGCA 13 1 0.804385830976137 0.0961220231684031 Y Y 3.42779550918039 3.49697642496855 3.42779550918039 0 0 0 Y 0.993649362117881 N N 3895 0 ENSMUSG00000052534 ENSMUSG00000093538 + + 1 1 168432270 168122465 coding intron Pbx1 Gm20711 168153527 168115244 - + 0 168183530 168121480 - + N - - N dataset_6344_files 1 0.745098039215686 1 1 0 1 N Y 0 0.745098039215686 16 1.29915510076809 1.21201664888731 1.29915510076809 1.21201664888731 4 - - 0.97679698937665 +66 CTGAGAACGAGGAGCAGGAAGAACACACCAGCATGGGCGCGTTCAACGATCCGTTCCTGGCTCAGCCCCCCGATGAAGATTCACATTCCAGTTTTCCTGATGGTGAACAAATAGACCCTGAAAATCTCCACTTCAACCCTGATGAAGGAGGTGGA|AGACTGCTTGTTCTTGGAACCCAGCAGCCATACTGTGAGGAAGTCCAAACCAGCCAACCTGGAGAGACGGCATGCACAGGGTCCCACGGATA 13 0.719595391393931 0.399734668051015 0.808813953308524 Y Y 3.39487633072086 3.67849247003875 3.39487633072086 0 0 0 Y 0.989174263674614 N N 1514 627 ENSMUSG00000018548 ENSMUSG00000046442 + + 11 11 87220683 87359023 coding intron Trim37 Ppm1e 87127077 87226906 + - 0 87218328 87250589 + - N - - N dataset_6344_files 1 1 1 1 0 1 N Y 0 1 11 0.863462841364159 0.689185937602586 0.863462841364159 0.689185937602586 1 - - 0.615916282680748 +11601 GGCACTGGGAACCCGAGCGCAGCTTGGACAAGTGGACTGCACGCAGCGCCTGCCTGGTCTTTCACACTTCTCTGGGGACCTCAGGAGAGGAAGGAAGACTTTTGGATATGG|GGTGCTTCGAGTGCTGCATTAAATGCCTGGGAGGTATTCCCTATGCTTCTCTGATTGCAACCATCCTGCTGTATGCAGGCGTTGCCCTGTTCTGTGGCTGTGGCCATGAAGCCCTTTCTGGAACAGTCAACATTCTGCAGACCTACTTTGAGTTGGCAAGGACTGCTG 45 1 0.903984782676282 0.890308208832683 Y Y 3.25993583872649 3.64553343878587 3.25993583872649 0 0 0 Y 0.991023166023166 N N 195 2438 ENSMUSG00000039375 ENSMUSG00000031517 - - 8 8 54887184 55060877 intron coding Wdr17 Gpm6a 54629055 54954728 - + 0 54779650 55037328 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 33 0.887227873695282 1.25162503610584 1.25162503610584 0.887227873695282 4 - - 0.961094875436964 +11202 CAAACACGTCCTGGAGCAAGGCCTCCACAGCCACGGGGTCCCCATCCAGGGTCCTGAGTGGTTTTGAGTCAGCCATCAGGATGAAATCTTCCAGCTCAGGAGGGCTCAGAGAAGGATCAGACGAGCTGGTGACAAGGTACACAATGGACTTTTCAGAATAGGG|CTGGACTGCTTATTCCTTTAGAAGCAAGGTCTCTTCCCCAACCTGGGGCTCTAATTTTCTCAGTTAAGGTGGAATCCAGCAAGTCCAGGGGATTCTCCAGTCTCCACCTTTGTCAGAGCAGGCTTGTTAACC 21 1 0.362314658679907 0.288186846563607 Y Y 3.57167229721571 3.6035982586987 3.57167229721571 0 0 0 Y 0.984899672399672 N N 1306 3566 ENSMUSG00000023044 ENSMUSG00000046897 - - 15 15 102189043 102215606 utr5p upstream Csad Zfp740 102176998 102203648 - + 0 102188962 102202444 + - N - - N dataset_6344_files 1 0.961240310077519 1 1 0 3 N Y 0 0.961240310077519 10 1.21993832633101 1.02189639023832 1.21993832633101 1.02189639023832 4 - - 0.973142113883806 +9292 GGAGGTATCAAAGGACTTTTCAAAGGCGGTGATATGTCTAAGAATGTGAGTCAGTCACAGATGGCAAAATTAAACCAACAAATGGCCAAAATGATGGACCCACGAGTTCTTCATCACATG|GGAGGAGGAGGAGGAAGAAGAAAATAGGATGTCAGAAGAAGCAGAAAGACAATACCAACAAAACAAGCTGCAGGCCGATTCCATTGTACAGACAGACCAACCAGAAACAGTGTCGTCCAGCTTTGTAAATATTAATTTTGAAATGGAGGAAGACTGTGAAGCAATTAAG 13 0.627587070077396 0.568913321228002 0.464690965079815 N Y 3.57233464462502 2.71992636785275 2.71992636785275 0 0 0 Y 0.368850574712644 N Y 1463 391 ENSMUSG00000073079 ENSMUSG00000094103 + - 12 12 55112891 55214076 coding intron Srp54c 1700047I17Rik2 55089202 55199533 + + 0 55111181 55213840 + - N - - N dataset_6344_files 9 0 9 9 9 1 N N 0 0 9 0.93475793835753 1.17240826166877 1.17240826166877 0.93475793835753 4 - - 0.529926984288339 +8690 CTGGAGGAAAGCACCGCAGGTCTGAGCAGCCCTGAGCCGGGCAGGGTGGGGGCAGTGGCTAAGGCCTAGCTGGGGACGATTTAAAGGTATCGCGCCACCCAGCCACACCCCACAGGCCAGGCGAGGGTGCCACCCCCGGAGATCAGAGGTCATTGCTGGCGTTCAGA|GCCTAGGAAGTGGGCTGCGTTTCAGGGGGAAGTCCATGATCACCACGTGGCAACATGCAAGCGGGTGCTG 4 0.685704457350918 0.684620189710687 0.966832507460471 N Y 3.58424216889964 3.53768019619294 3.53768019619294 39 0.98577430972389 0.260264105642257 N 0.260264105642257 N N 4781 556 ENSMUSG00000032000 ENSMUSG00000016409 + - 9 X 7873186 37150746 utr5p intron Birc3 Nkap 7848699 37126795 - + 0 7872842 37143984 - - Y - - N dataset_6344_files 1 0 1 1 0 1 N N 0 0 5 0.831776131589327 0.712950969933709 0.831776131589327 0.712950969933709 1 - - 0.520824161374586 +11497 CCTTTCCAGCGAGGTTCCAAGTTCTTAGTCTGGTGCCGGCGTACCCACACGGCGTCACCGACACGGAAGGGGTGTGGTATCACTGGCTGATCTAGCTGGTCCTGATAAGCAGCGGCCAGCGGCTTCCAGACCTCTCGTTGTACTGCTTGGAGGGCCTGTAAGTGAG|CTTCTCTTCTGGAAGTCGGACCAATTCACCTCAAGCACCAGAGCTTGAATTCATGATCATCCTGGACACAGCACTCATCAGGACCGCGCCGCGGCTGA 67 0.883499713768307 0.731092666993054 0.679758623906517 N Y 3.66336194527509 3.6508225788147 3.6508225788147 0 0.0444358875625721 0 N 0.94917212167886 N N 0 1604 ENSMUSG00000096832 ENSMUSG00000039007 - + 5 15 23711061 33594552 downstream intron SNORD93 Pgcp 23710991 33083129 - + 0 23703360 33221484 + + Y - - N dataset_6344_files 76 1 38.7730061349693 5 163 1 N N 1 0 163 1.34668516543034 0.760481034595956 1.34668516543034 0.760481034595956 4 - - 0.903135577741999 +8726 GAGCAAGCTGACAGCTGAGCAGAAGCTGAGCATGGACACCTTCAGATCCAACTCAGCGGACATCATTCTTTCTGCAGGGCGGCAAGAGCTCAAGAGCAAGCCAAGGCTGATAAGCATGAAGAGGATGGAGATGAGGAACAAGGCTCTGGAGTACTGGAGATAAT|CATGTCAGTTCGGCCGCGCAGGCGGGCTGCGTCGTCCTCGACGAGGCCTTTCGACGCTACCGTAACCTGCTCTTCGGTTCCGGCTCTTGGCCCCGACCCAGCTTCTCAAGTGAGTCACTGCCC 3 0.70117526311829 0.321035960603558 0.664396670389762 Y Y 3.42949092635959 3.29652844562235 3.29652844562235 0 0 0 Y 0 N N 44 8769 ENSMUSG00000049526 ENSMUSG00000025232 - - 9 9 59525501 59565105 coding coding Tmem202 Hexa 59518686 59539667 - + 0 59520238 59539868 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 5 1.02981806768202 0.744637679708541 1.02981806768202 0.744637679708541 4 - - 0.931447658206354 +5020 CAAACCATCATTTGTATTTTTCAAACTGTCTATCGAGCCTCAAACTCC|ACCATTTCATTTTCTGGTCTTCTATTCTAGCACCTACAACAAACAAATAGTTAAAGCTGCCCGGTTAGGCCCCTGACTCAGCACTTAAGATGGGGGAACCAGAGTCCACCAAATAGGGGGTAAATGCGTTAACGACCACTAGCTCCCACATATCAAAAACCAGCCTCCTCAGATACGCAGAAACAATGCCCTGACTTCAGAC 27 0.772639018637575 0.542772921402035 0.769655490407199 N Y 3.43152237077921 3.41592783769912 3.41592783769912 0 0.89624833997344 0 N 0.584993359893758 N N 0 0 ENSMUSG00000088422 ENSMUSG00000053332 - + 5 1 79639109 161038539 downstream intron 7SK Gas5 79638815 161034422 - + 0.89624833997344 79638362 161036581 + + Y - - N dataset_6344_files 1 0 1 1 0 3 N N 0 0 9 0.52283071128472 1.06942645490056 1.06942645490056 0.52283071128472 1 - - 0.818668251983994 +3184 CATTGGAGCTGTGGTGGCTTTTGTGATGAAGAGAAGGAGAAACACAGGTGGAAAAGGAGGGGACTATGCTCTGGCTCCAG|AGAACAGCGCCTGATGTTCCCTGTGAGCCTATGGGCTCAATGTGAAGAATTGTGGAGCCCAGCCTTCGCCTACACACCAGGACCCTGTCTCTGCATTGCCCTGTGTTCCCTTCCACCGCCAACCTTCCGGGTCTGCAGTGGAAACTAAGGGTTCTTTGGAAAGTCGG 40 1 0.843266171585359 0.786568876550746 N Y 3.49424793801946 3.684245618384 3.49424793801946 0 0 0.551814516129032 Y 0.950201612903226 N Y 23545 2145 ENSMUSG00000073411 ENSMUSG00000035929 + - 17 17 35267499 35385290 coding utr3p H2-D1 H2-Q4 35262730 35379617 + + 0 35266514 35383995 + - N - - N dataset_6344_files 4 0.22972972972973 2.57142857142857 2 7 1 N N 0 0.22972972972973 7 0.562439098503259 1.03773974512573 1.03773974512573 0.562439098503259 4 - - 0.85501289117385 +9158 CCGAATTTCAACCTCCTTATCAACAGTGGGATCTTCAAAGAGTTGTACCCTGAAGTTGCTCTTTCTCAGTGGAGTCCCACACTCAGGACAGTTTCCAGCTCCTCTTACAAACAGTAAGTCCACACAACTCTCACAC|CTTCCCCACCAGCCTGGTCCGGCTGCCCACCTCTCCCCGCCCCCCACCTCGCTTCCCTACCGGGGTGGTAGGGGGGACGACGGTGGCAACGAGCGGGCGGGGGATCCTCCC 5 1 0.770112168741176 0.964592797110167 Y Y 3.1490579347374 3.11232376639641 3.11232376639641 0 0 0 Y 0 N N 1882 677 ENSMUSG00000021103 ENSMUSG00000034460 - - 12 12 73273988 73114037 coding intron Mnat1 Six4 73123717 73099609 + - 0 73168000 73112254 - + N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 5 1.0060530353509 0.78424606692708 1.0060530353509 0.78424606692708 4 - - 0.934413835802699 +1851 GCTGCAGGTGGGCTTATTCTACCATTGCTACTGTTCTGTCTTTGAAGATGTTCTTTATAGCATACTGAACACATGCCATTTGTACGAGGGTTTCCATAAAATCCGCAGCCAGTGGAGCAAAGCATAGGTGCTTGGCTGTGATTAGTTTCTTGAGCCATGTTCTTCTGTTGCA|CACCTGGCGGCGGCCCTCTCCGCCGGACGCGCTGCCGCCGCCGCCTCTCGCCGCCGCTGAGAGTGAGGACAGGTGAGGCCGCCAAACCCCCACTCGCTCCCGGCCCGCCGCCGCCGGCCCTCCGTCCGC 21 0.826443478467522 0.547250352165575 0.543160934834641 Y Y 3.38908332958226 2.95732075160099 2.95732075160099 0 0 0 Y 0.992273730684327 N N 1944 1 ENSMUSG00000030629 ENSMUSG00000085236 - - 7 7 84679361 84776549 utr5p intron Zfand6 2610206C17Rik 84615054 84689640 - + 0 84634406 84689743 + - N - - N dataset_6344_files 2 0.159420289855072 1.04878048780488 1 2 1 N Y 0 0.159420289855072 41 1.14864322933764 1.09319148723169 1.14864322933764 1.09319148723169 1 - - 0.715005473321084 +9962 CTGCGGCCCGCCGGGTCCCGGAGCCCACTGCCCCAGCACCCCGCGCTCGGCGCCCGCAGACGGCGCGGACCTCAGCGCGCACTTATGGGCTCGTTACCAGGACATGCGGAGGCTGGTGCACG|ACCTTCTGCCCCCTGAGGTCTGCAGCCTCCTAAACCCAGCAGCTATTTATGCCAACAATGAGATCAGCCTGAGTGACGTCGAAGTCTATGGCTTTGACTACGACTACACGCTGGCCCAGTATGCGGATGCACTGCACCCTGAGATCTTCAATGCTGCCCGGGACATCTTGATAGAGC 93 0.0971360740885943 0.213119196903586 0.467950871740144 Y Y 3.71720464963688 3.27448372166755 3.27448372166755 0 0.983661202185792 0 Y 0.991830601092896 N N 211 4122 ENSMUSG00000058351 ENSMUSG00000071547 - - 14 14 31128930 31139121 upstream upstream 2010107H07Rik Nt5dc2 31088869 31134853 - + 0 31131376 31134739 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 33 1.1011131646754 0.808011099258204 1.1011131646754 0.808011099258204 4 - - 0.743419539119423 +12549 TTGGAGATGCCAGTACCATGAGATGACCACCAAGAGCAGCAGCAGCAGTGGAGTACAGGCATCTGGAGCCTAGAGGATGACACATGTGCTACAAAGGGCCTGGCTGGAGAAGTGACCCAAGCCCTTGGAGGAGCCCAGAAGATC|GTCCATCCTGATAAAAATCACCATCCCCGGGCTGAGGAGGCCTTCAAAATTTTGCGGGCAGCTTGGGACATTGTCAGCAACCCAGAGAGGCGGAAGGAATATGAGATGAAACGGAT 18 1 0.96278392593137 0.394484707065412 N Y 3.2808195118354 3.46448140203209 3.2808195118354 1 0.965649359228432 0 N 0.974237019421324 N Y 1732 2437 ENSMUSG00000039307 ENSMUSG00000025354 + - 11 10 121222655 128819446 utr5p coding Hexdc Dnajc14 121204433 128805676 + + 0 121206748 128814038 + - Y - - N dataset_6344_files 3 0.993103448275862 1.72222222222222 1 11 1 N N 0.993103448275862 0 18 1.14864322933764 0.879306196251575 1.14864322933764 0.879306196251575 4 - - 0.927135541379163 +3179 CTGGAGCAGTCCCCGTGACGCCGGGTGGCGACTGGCTCCCGGGTCTGAGGGGCTTCTGCTTGTCAGGTTCT|AGATATGTGCTGACTAGCAGGCTCACGTGCACAGTGTGGAGGATAAGCTATATCTTACAAAATGGGATTTGGGAGTGACCTGAAGAACTCACAGGAAGCTGTGTTAAAGTTGCAAGACTGGGAACTACGGTTGCTGGAGACAGTGAAGAAATTTATGGCTCTGAG 10 1 0.557528435226891 0.364482194613623 Y Y 3.15741158895574 3.54759612884129 3.15741158895574 0 0.985974921257503 0 Y 0.873774291317525 N N 10 1325 ENSMUSG00000045506 ENSMUSG00000000127 - - 17 17 63863791 64139494 upstream upstream A930002H24Rik Fer 63863300 63896018 - + 0 63864053 63896016 + - N - - N dataset_6344_files 1 0 1 1 0 1 N Y 0 0 7 0.499065678953596 1.12487819700652 1.12487819700652 0.499065678953596 3 - - 0.767822892174251
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/mm10_results.filtered.vcf Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,115 @@ +##fileformat=VCFv4.1 +##source=defuse +##reference=mm10 +##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> +##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> +##INFO=<ID=MATEID,Number=1,Type=String,Description="ID of the BND mate"> +##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend"> +##INFO=<ID=SPLITCNT,Number=1,Type=Integer,Description="number of split reads supporting the prediction"> +##INFO=<ID=SPANCNT,Number=1,Type=Integer,Description="number of spanning reads supporting the fusion"> +##INFO=<ID=HOMLEN,Number=1,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints"> +##INFO=<ID=SPLICESCORE,Number=1,Type=Integer,Description="number of nucleotides similar to GTAG at fusion splice"> +##INFO=<ID=GENE,Number=2,Type=String,Description="Gene Names at each breakend"> +##INFO=<ID=GENEID,Number=2,Type=String,Description="Gene IDs at each breakend"> +##INFO=<ID=GENELOC,Number=2,Type=String,Description="location of breakpoint releative to genes"> +##INFO=<ID=EXPR,Number=2,Type=Integer,Description="expression of genes as number of concordant pairs aligned to exons"> +##INFO=<ID=ORF,Number=0,Type=Flag,Description="fusion combines genes in a way that preserves a reading frame"> +##INFO=<ID=EXONBND,Number=0,Type=Flag,Description="fusion splice at exon boundaries"> +##INFO=<ID=INTERCHROM,Number=0,Type=Flag,Description="fusion produced by an interchromosomal translocation"> +##INFO=<ID=READTHROUGH,Number=0,Type=Flag,Description="fusion involving adjacent potentially resulting from co-transcription rather than genome rearrangement"> +##INFO=<ID=ADJACENT,Number=0,Type=Flag,Description="fusion between adjacent genes"> +##INFO=<ID=ALTSPLICE,Number=0,Type=Flag,Description="fusion likely the product of alternative splicing between adjacent genes"> +##INFO=<ID=DELETION,Number=0,Type=Flag,Description="fusion produced by a genomic deletion"> +##INFO=<ID=EVERSION,Number=0,Type=Flag,Description="fusion produced by a genomic eversion"> +##INFO=<ID=INVERSION,Number=0,Type=Flag,Description="fusion produced by a genomic inversion"> +#CHROM POS ID REF ALT QUAL FILTER INFO +1 106734547 bnd_4068_1 A A]16:37799068] 233 PASS SVTYPE=BND;MATEID=bnd_4068_2;DP=19;SPLITCNT=8;SPANCNT=11;GENE=Kdsr,Fstl1;GENEID=ENSMUSG00000009905,ENSMUSG00000022816;GENELOC=coding,intron;EXPR=775,35409;HOMLEN=2;SPLICESCORE=4;INTERCHROM;ALTSPLICE +1 127753955 bnd_3783_1 T T]1:127773930] 181 PASS SVTYPE=BND;MATEID=bnd_3783_2;DP=8;SPLITCNT=2;SPANCNT=6;GENE=Acmsd,Ccnt2;GENEID=ENSMUSG00000026348,ENSMUSG00000026349;GENELOC=intron,upstream;EXPR=0,752;HOMLEN=0;SPLICESCORE=3;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +1 127773930 bnd_3783_2 T [1:127753955[T 181 PASS SVTYPE=BND;MATEID=bnd_3783_1;DP=8;SPLITCNT=2;SPANCNT=6;GENE=Acmsd,Ccnt2;GENEID=ENSMUSG00000026348,ENSMUSG00000026349;GENELOC=intron,upstream;EXPR=0,752;HOMLEN=0;SPLICESCORE=3;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +1 161036581 bnd_5020_2 A [5:79638362[A 208 PASS SVTYPE=BND;MATEID=bnd_5020_1;DP=36;SPLITCNT=27;SPANCNT=9;GENE=7SK,Gas5;GENEID=ENSMUSG00000088422,ENSMUSG00000053332;GENELOC=downstream,intron;EXPR=0,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM;ALTSPLICE +1 161036831 bnd_4912_2 G ]5:79638354]G 151 PASS SVTYPE=BND;MATEID=bnd_4912_1;DP=20;SPLITCNT=15;SPANCNT=5;GENE=7SK,Gas5;GENEID=ENSMUSG00000088422,ENSMUSG00000053332;GENELOC=downstream,intron;EXPR=0,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM;ALTSPLICE +1 168121480 bnd_3839_2 G ]1:168183530]G 249 PASS SVTYPE=BND;MATEID=bnd_3839_1;DP=29;SPLITCNT=13;SPANCNT=16;GENE=Pbx1,Gm20711;GENEID=ENSMUSG00000052534,ENSMUSG00000093538;GENELOC=coding,intron;EXPR=3895,0;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +1 168183530 bnd_3839_1 T T[1:168121480[ 249 PASS SVTYPE=BND;MATEID=bnd_3839_2;DP=29;SPLITCNT=13;SPANCNT=16;GENE=Pbx1,Gm20711;GENEID=ENSMUSG00000052534,ENSMUSG00000093538;GENELOC=coding,intron;EXPR=3895,0;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +2 165827729 bnd_5326_1 C C]18:4198107] 136 PASS SVTYPE=BND;MATEID=bnd_5326_2;DP=20;SPLITCNT=7;SPANCNT=13;GENE=Zmynd8,Gm10557;GENEID=ENSMUSG00000039671,ENSMUSG00000073647;GENELOC=utr3p,downstream;EXPR=2963,0;HOMLEN=9;SPLICESCORE=1;INTERCHROM;ALTSPLICE +2 174462629 bnd_2385_2 T ]2:174472832]T 194 PASS SVTYPE=BND;MATEID=bnd_2385_1;DP=12;SPLITCNT=7;SPANCNT=5;GENE=Slmo2,Atp5e;GENEID=ENSMUSG00000016257,ENSMUSG00000016252;GENELOC=coding,coding;EXPR=3085,3617;HOMLEN=0;SPLICESCORE=4;ORF;EXONBND;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +2 174472832 bnd_2385_1 C C[2:174462629[ 194 PASS SVTYPE=BND;MATEID=bnd_2385_2;DP=12;SPLITCNT=7;SPANCNT=5;GENE=Slmo2,Atp5e;GENEID=ENSMUSG00000016257,ENSMUSG00000016252;GENELOC=coding,coding;EXPR=3085,3617;HOMLEN=0;SPLICESCORE=4;ORF;EXONBND;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +3 103040040 bnd_5160_1 C C]18:28188917] 213 PASS SVTYPE=BND;MATEID=bnd_5160_2;DP=63;SPLITCNT=33;SPANCNT=30;GENE=Csde1,SNORA17;GENEID=ENSMUSG00000068823,ENSMUSG00000087940;GENELOC=coding,upstream;EXPR=10681,0;HOMLEN=91;SPLICESCORE=3;INTERCHROM;ALTSPLICE +3 144201813 bnd_8647_1 T T]5:149198645] 242 PASS SVTYPE=BND;MATEID=bnd_8647_2;DP=9;SPLITCNT=4;SPANCNT=5;GENE=Lmo4,Uspl1;GENEID=ENSMUSG00000028266,ENSMUSG00000041264;GENELOC=coding,coding;EXPR=2482,2085;HOMLEN=1;SPLICESCORE=4;EXONBND;INTERCHROM +4 124696071 bnd_12092_2 C ]4:124705614]C 187 PASS SVTYPE=BND;MATEID=bnd_12092_1;DP=11;SPLITCNT=2;SPANCNT=9;GENE=Fhl3,U6;GENEID=ENSMUSG00000032643,ENSMUSG00000088067;GENELOC=utr5p,downstream;EXPR=2072,0;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +4 124705614 bnd_12092_1 C C[4:124696071[ 187 PASS SVTYPE=BND;MATEID=bnd_12092_2;DP=11;SPLITCNT=2;SPANCNT=9;GENE=Fhl3,U6;GENEID=ENSMUSG00000032643,ENSMUSG00000088067;GENELOC=utr5p,downstream;EXPR=2072,0;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +4 148948907 bnd_12095_1 C C]4:148961548] 249 PASS SVTYPE=BND;MATEID=bnd_12095_2;DP=11;SPLITCNT=6;SPANCNT=5;GENE=Gm13205,Pex14;GENEID=ENSMUSG00000086606,ENSMUSG00000028975;GENELOC=intron,coding;EXPR=0,1453;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +4 148961548 bnd_12095_2 C [4:148948907[C 249 PASS SVTYPE=BND;MATEID=bnd_12095_1;DP=11;SPLITCNT=6;SPANCNT=5;GENE=Gm13205,Pex14;GENEID=ENSMUSG00000086606,ENSMUSG00000028975;GENELOC=intron,coding;EXPR=0,1453;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +5 23703360 bnd_11497_1 G G[15:33221484[ 230 PASS SVTYPE=BND;MATEID=bnd_11497_2;DP=230;SPLITCNT=67;SPANCNT=163;GENE=SNORD93,Pgcp;GENEID=ENSMUSG00000096832,ENSMUSG00000039007;GENELOC=downstream,intron;EXPR=0,1604;HOMLEN=0;SPLICESCORE=4;INTERCHROM;ALTSPLICE +5 60176541 bnd_1721_2 G ]11:106187103]G 188 PASS SVTYPE=BND;MATEID=bnd_1721_1;DP=13;SPLITCNT=8;SPANCNT=5;GENE=Strada,Cbfa2t2-ps1;GENEID=ENSMUSG00000069631,ENSMUSG00000087034;GENELOC=coding,upstream;EXPR=800,0;HOMLEN=142;SPLICESCORE=1;INTERCHROM;ALTSPLICE +5 79638354 bnd_4912_1 A A]1:161036831] 151 PASS SVTYPE=BND;MATEID=bnd_4912_2;DP=20;SPLITCNT=15;SPANCNT=5;GENE=7SK,Gas5;GENEID=ENSMUSG00000088422,ENSMUSG00000053332;GENELOC=downstream,intron;EXPR=0,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM;ALTSPLICE +5 79638362 bnd_5020_1 C C[1:161036581[ 208 PASS SVTYPE=BND;MATEID=bnd_5020_2;DP=36;SPLITCNT=27;SPANCNT=9;GENE=7SK,Gas5;GENEID=ENSMUSG00000088422,ENSMUSG00000053332;GENELOC=downstream,intron;EXPR=0,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM;ALTSPLICE +5 105728723 bnd_13923_1 G G]8:126422269] 207 PASS SVTYPE=BND;MATEID=bnd_13923_2;DP=11;SPLITCNT=4;SPANCNT=7;GENE=Lrrc8d,1810063B05Rik;GENEID=ENSMUSG00000046079,ENSMUSG00000051671;GENELOC=intron,upstream;EXPR=590,367;HOMLEN=0;SPLICESCORE=4;INTERCHROM;ALTSPLICE +5 149198645 bnd_8647_2 G ]3:144201813]G 242 PASS SVTYPE=BND;MATEID=bnd_8647_1;DP=9;SPLITCNT=4;SPANCNT=5;GENE=Lmo4,Uspl1;GENEID=ENSMUSG00000028266,ENSMUSG00000041264;GENELOC=coding,coding;EXPR=2482,2085;HOMLEN=1;SPLICESCORE=4;EXONBND;INTERCHROM +6 51467295 bnd_12868_1 A A]10:73201702] 132 PASS SVTYPE=BND;MATEID=bnd_12868_2;DP=110;SPLITCNT=69;SPANCNT=41;GENE=Hnrnpa2b1,Gm15398;GENEID=ENSMUSG00000004980,ENSMUSG00000085456;GENELOC=coding,intron;EXPR=41631,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM;ALTSPLICE +6 52158684 bnd_7746_1 G G]6:52173634] 190 PASS SVTYPE=BND;MATEID=bnd_7746_2;DP=46;SPLITCNT=26;SPANCNT=20;GENE=Gm15051,Gm15050;GENEID=ENSMUSG00000087658,ENSMUSG00000087626;GENELOC=intron,intron;EXPR=0,0;HOMLEN=0;SPLICESCORE=2;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +6 52173634 bnd_7746_2 A [6:52158684[A 190 PASS SVTYPE=BND;MATEID=bnd_7746_1;DP=46;SPLITCNT=26;SPANCNT=20;GENE=Gm15051,Gm15050;GENEID=ENSMUSG00000087658,ENSMUSG00000087626;GENELOC=intron,intron;EXPR=0,0;HOMLEN=0;SPLICESCORE=2;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 28376683 bnd_1870_1 C C]7:28392166] 214 PASS SVTYPE=BND;MATEID=bnd_1870_2;DP=10;SPLITCNT=2;SPANCNT=8;GENE=Zfp36,Med29;GENEID=ENSMUSG00000044786,ENSMUSG00000003444;GENELOC=downstream,intron;EXPR=1543,697;HOMLEN=4;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 28392166 bnd_1870_2 C [7:28376683[C 214 PASS SVTYPE=BND;MATEID=bnd_1870_1;DP=10;SPLITCNT=2;SPANCNT=8;GENE=Zfp36,Med29;GENEID=ENSMUSG00000044786,ENSMUSG00000003444;GENELOC=downstream,intron;EXPR=1543,697;HOMLEN=4;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 66355922 bnd_12600_2 C [10:24597524[C 148 PASS SVTYPE=BND;MATEID=bnd_12600_1;DP=6;SPLITCNT=1;SPANCNT=5;GENE=Ctgf,Lrrk1;GENEID=ENSMUSG00000019997,ENSMUSG00000015133;GENELOC=coding,intron;EXPR=7298,2735;HOMLEN=0;SPLICESCORE=3;INTERCHROM +7 84634406 bnd_1851_1 A A]7:84689743] 182 PASS SVTYPE=BND;MATEID=bnd_1851_2;DP=62;SPLITCNT=21;SPANCNT=41;GENE=Zfand6,2610206C17Rik;GENEID=ENSMUSG00000030629,ENSMUSG00000085236;GENELOC=utr5p,intron;EXPR=1944,1;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 84689743 bnd_1851_2 C [7:84634406[C 182 PASS SVTYPE=BND;MATEID=bnd_1851_1;DP=62;SPLITCNT=21;SPANCNT=41;GENE=Zfand6,2610206C17Rik;GENEID=ENSMUSG00000030629,ENSMUSG00000085236;GENELOC=utr5p,intron;EXPR=1944,1;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 90125032 bnd_1855_1 T T]7:90129872] 242 PASS SVTYPE=BND;MATEID=bnd_1855_2;DP=13;SPLITCNT=4;SPANCNT=9;GENE=AC130210.1,Picalm;GENEID=ENSMUSG00000097162,ENSMUSG00000039361;GENELOC=intron,upstream;EXPR=0,8476;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 90129872 bnd_1855_2 C [7:90125032[C 242 PASS SVTYPE=BND;MATEID=bnd_1855_1;DP=13;SPLITCNT=4;SPANCNT=9;GENE=AC130210.1,Picalm;GENEID=ENSMUSG00000097162,ENSMUSG00000039361;GENELOC=intron,upstream;EXPR=0,8476;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 97788974 bnd_1886_1 G G]7:97854436] 247 PASS SVTYPE=BND;MATEID=bnd_1886_2;DP=36;SPLITCNT=18;SPANCNT=18;GENE=Aqp11,Pak1;GENEID=ENSMUSG00000042797,ENSMUSG00000030774;GENELOC=upstream,utr5p;EXPR=16,2476;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +7 97854436 bnd_1886_2 T [7:97788974[T 247 PASS SVTYPE=BND;MATEID=bnd_1886_1;DP=36;SPLITCNT=18;SPANCNT=18;GENE=Aqp11,Pak1;GENEID=ENSMUSG00000042797,ENSMUSG00000030774;GENELOC=upstream,utr5p;EXPR=16,2476;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +8 54779650 bnd_11601_1 G G]8:55037328] 245 PASS SVTYPE=BND;MATEID=bnd_11601_2;DP=78;SPLITCNT=45;SPANCNT=33;GENE=Wdr17,Gpm6a;GENEID=ENSMUSG00000039375,ENSMUSG00000031517;GENELOC=intron,coding;EXPR=195,2438;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +8 55037328 bnd_11601_2 G [8:54779650[G 245 PASS SVTYPE=BND;MATEID=bnd_11601_1;DP=78;SPLITCNT=45;SPANCNT=33;GENE=Wdr17,Gpm6a;GENEID=ENSMUSG00000039375,ENSMUSG00000031517;GENELOC=intron,coding;EXPR=195,2438;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +8 84816537 bnd_11605_1 C C]8:84832174] 193 PASS SVTYPE=BND;MATEID=bnd_11605_2;DP=23;SPLITCNT=11;SPANCNT=12;GENE=Dand5,Gadd45gip1;GENEID=ENSMUSG00000053226,ENSMUSG00000033751;GENELOC=intron,utr5p;EXPR=167,2428;HOMLEN=0;SPLICESCORE=2;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +8 84832174 bnd_11605_2 G [8:84816537[G 193 PASS SVTYPE=BND;MATEID=bnd_11605_1;DP=23;SPLITCNT=11;SPANCNT=12;GENE=Dand5,Gadd45gip1;GENEID=ENSMUSG00000053226,ENSMUSG00000033751;GENELOC=intron,utr5p;EXPR=167,2428;HOMLEN=0;SPLICESCORE=2;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +8 95682960 bnd_11596_2 G ]8:95739809]G 222 PASS SVTYPE=BND;MATEID=bnd_11596_1;DP=24;SPLITCNT=8;SPANCNT=16;GENE=Cnot1,Ndrg4;GENEID=ENSMUSG00000036550,ENSMUSG00000036564;GENELOC=coding,intron;EXPR=6004,544;HOMLEN=0;SPLICESCORE=1;ALTSPLICE;DELETION +8 95739809 bnd_11596_1 A A[8:95682960[ 222 PASS SVTYPE=BND;MATEID=bnd_11596_2;DP=24;SPLITCNT=8;SPANCNT=16;GENE=Cnot1,Ndrg4;GENEID=ENSMUSG00000036550,ENSMUSG00000036564;GENELOC=coding,intron;EXPR=6004,544;HOMLEN=0;SPLICESCORE=1;ALTSPLICE;DELETION +8 126422269 bnd_13923_2 C ]5:105728723]C 207 PASS SVTYPE=BND;MATEID=bnd_13923_1;DP=11;SPLITCNT=4;SPANCNT=7;GENE=Lrrc8d,1810063B05Rik;GENEID=ENSMUSG00000046079,ENSMUSG00000051671;GENELOC=intron,upstream;EXPR=590,367;HOMLEN=0;SPLICESCORE=4;INTERCHROM;ALTSPLICE +9 7872842 bnd_8690_1 A A]X:37143984] 132 PASS SVTYPE=BND;MATEID=bnd_8690_2;DP=9;SPLITCNT=4;SPANCNT=5;GENE=Birc3,Nkap;GENEID=ENSMUSG00000032000,ENSMUSG00000016409;GENELOC=utr5p,intron;EXPR=4781,556;HOMLEN=39;SPLICESCORE=1;INTERCHROM;ALTSPLICE +9 21320850 bnd_8748_2 A ]9:21337962]A 189 PASS SVTYPE=BND;MATEID=bnd_8748_1;DP=103;SPLITCNT=51;SPANCNT=52;GENE=Slc44a2,Ap1m2;GENEID=ENSMUSG00000057193,ENSMUSG00000003309;GENELOC=coding,upstream;EXPR=6343,2;HOMLEN=2;SPLICESCORE=2;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +9 21337962 bnd_8748_1 C C[9:21320850[ 189 PASS SVTYPE=BND;MATEID=bnd_8748_2;DP=103;SPLITCNT=51;SPANCNT=52;GENE=Slc44a2,Ap1m2;GENEID=ENSMUSG00000057193,ENSMUSG00000003309;GENELOC=coding,upstream;EXPR=6343,2;HOMLEN=2;SPLICESCORE=2;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +9 59520238 bnd_8726_1 T T]9:59539868] 237 PASS SVTYPE=BND;MATEID=bnd_8726_2;DP=8;SPLITCNT=3;SPANCNT=5;GENE=Tmem202,Hexa;GENEID=ENSMUSG00000049526,ENSMUSG00000025232;GENELOC=coding,coding;EXPR=44,8769;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +9 59539868 bnd_8726_2 C [9:59520238[C 237 PASS SVTYPE=BND;MATEID=bnd_8726_1;DP=8;SPLITCNT=3;SPANCNT=5;GENE=Tmem202,Hexa;GENEID=ENSMUSG00000049526,ENSMUSG00000025232;GENELOC=coding,coding;EXPR=44,8769;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +9 108783870 bnd_8762_1 G G]9:108796064] 243 PASS SVTYPE=BND;MATEID=bnd_8762_2;DP=36;SPLITCNT=28;SPANCNT=8;GENE=SNORA28,Ip6k2;GENEID=ENSMUSG00000088626,ENSMUSG00000032599;GENELOC=downstream,utr5p;EXPR=0,714;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +9 108796064 bnd_8762_2 A [9:108783870[A 243 PASS SVTYPE=BND;MATEID=bnd_8762_1;DP=36;SPLITCNT=28;SPANCNT=8;GENE=SNORA28,Ip6k2;GENEID=ENSMUSG00000088626,ENSMUSG00000032599;GENELOC=downstream,utr5p;EXPR=0,714;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +10 24597524 bnd_12600_1 G G]7:66355922] 148 PASS SVTYPE=BND;MATEID=bnd_12600_2;DP=6;SPLITCNT=1;SPANCNT=5;GENE=Ctgf,Lrrk1;GENEID=ENSMUSG00000019997,ENSMUSG00000015133;GENELOC=coding,intron;EXPR=7298,2735;HOMLEN=0;SPLICESCORE=3;INTERCHROM +10 73201702 bnd_12868_2 A ]6:51467295]A 132 PASS SVTYPE=BND;MATEID=bnd_12868_1;DP=110;SPLITCNT=69;SPANCNT=41;GENE=Hnrnpa2b1,Gm15398;GENEID=ENSMUSG00000004980,ENSMUSG00000085456;GENELOC=coding,intron;EXPR=41631,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM;ALTSPLICE +10 128814038 bnd_12549_2 G [11:121206748[G 236 PASS SVTYPE=BND;MATEID=bnd_12549_1;DP=36;SPLITCNT=18;SPANCNT=18;GENE=Hexdc,Dnajc14;GENEID=ENSMUSG00000039307,ENSMUSG00000025354;GENELOC=utr5p,coding;EXPR=1732,2437;HOMLEN=1;SPLICESCORE=4;EXONBND;INTERCHROM;ALTSPLICE +11 52099150 bnd_1027_1 G G]12:54980379] 212 PASS SVTYPE=BND;MATEID=bnd_1027_2;DP=12;SPLITCNT=2;SPANCNT=10;GENE=Ppp2ca,Baz1a;GENEID=ENSMUSG00000020349,ENSMUSG00000035021;GENELOC=coding,intron;EXPR=10063,1785;HOMLEN=3;SPLICESCORE=4;INTERCHROM +11 87218328 bnd_66_1 A A]11:87250589] 157 PASS SVTYPE=BND;MATEID=bnd_66_2;DP=24;SPLITCNT=13;SPANCNT=11;GENE=Trim37,Ppm1e;GENEID=ENSMUSG00000018548,ENSMUSG00000046442;GENELOC=coding,intron;EXPR=1514,627;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +11 87250589 bnd_66_2 A [11:87218328[A 157 PASS SVTYPE=BND;MATEID=bnd_66_1;DP=24;SPLITCNT=13;SPANCNT=11;GENE=Trim37,Ppm1e;GENEID=ENSMUSG00000018548,ENSMUSG00000046442;GENELOC=coding,intron;EXPR=1514,627;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +11 106187103 bnd_1721_1 A A[5:60176541[ 188 PASS SVTYPE=BND;MATEID=bnd_1721_2;DP=13;SPLITCNT=8;SPANCNT=5;GENE=Strada,Cbfa2t2-ps1;GENEID=ENSMUSG00000069631,ENSMUSG00000087034;GENELOC=coding,upstream;EXPR=800,0;HOMLEN=142;SPLICESCORE=1;INTERCHROM;ALTSPLICE +11 121206748 bnd_12549_1 C C]10:128814038] 236 PASS SVTYPE=BND;MATEID=bnd_12549_2;DP=36;SPLITCNT=18;SPANCNT=18;GENE=Hexdc,Dnajc14;GENEID=ENSMUSG00000039307,ENSMUSG00000025354;GENELOC=utr5p,coding;EXPR=1732,2437;HOMLEN=1;SPLICESCORE=4;EXONBND;INTERCHROM;ALTSPLICE +12 54980379 bnd_1027_2 T [11:52099150[T 212 PASS SVTYPE=BND;MATEID=bnd_1027_1;DP=12;SPLITCNT=2;SPANCNT=10;GENE=Ppp2ca,Baz1a;GENEID=ENSMUSG00000020349,ENSMUSG00000035021;GENELOC=coding,intron;EXPR=10063,1785;HOMLEN=3;SPLICESCORE=4;INTERCHROM +12 55111181 bnd_9292_1 G G]12:55213840] 135 PASS SVTYPE=BND;MATEID=bnd_9292_2;DP=22;SPLITCNT=13;SPANCNT=9;GENE=Srp54c,1700047I17Rik2;GENEID=ENSMUSG00000073079,ENSMUSG00000094103;GENELOC=coding,intron;EXPR=1463,391;HOMLEN=0;SPLICESCORE=4;EXONBND;ALTSPLICE;DELETION +12 55213840 bnd_9292_2 G [12:55111181[G 135 PASS SVTYPE=BND;MATEID=bnd_9292_1;DP=22;SPLITCNT=13;SPANCNT=9;GENE=Srp54c,1700047I17Rik2;GENEID=ENSMUSG00000073079,ENSMUSG00000094103;GENELOC=coding,intron;EXPR=1463,391;HOMLEN=0;SPLICESCORE=4;EXONBND;ALTSPLICE;DELETION +12 73112254 bnd_9158_2 C ]12:73168000]C 238 PASS SVTYPE=BND;MATEID=bnd_9158_1;DP=10;SPLITCNT=5;SPANCNT=5;GENE=Mnat1,Six4;GENEID=ENSMUSG00000021103,ENSMUSG00000034460;GENELOC=coding,intron;EXPR=1882,677;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +12 73168000 bnd_9158_1 C C[12:73112254[ 238 PASS SVTYPE=BND;MATEID=bnd_9158_2;DP=10;SPLITCNT=5;SPANCNT=5;GENE=Mnat1,Six4;GENEID=ENSMUSG00000021103,ENSMUSG00000034460;GENELOC=coding,intron;EXPR=1882,677;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +12 113422730 bnd_9297_1 T T]12:113426655] 212 PASS SVTYPE=BND;MATEID=bnd_9297_2;DP=26;SPLITCNT=20;SPANCNT=6;GENE=Ighm,AC073553.3;GENEID=ENSMUSG00000076617,ENSMUSG00000092748;GENELOC=coding,upstream;EXPR=1488,0;HOMLEN=0;SPLICESCORE=4;ALTSPLICE;DELETION +12 113426655 bnd_9297_2 C [12:113422730[C 212 PASS SVTYPE=BND;MATEID=bnd_9297_1;DP=26;SPLITCNT=20;SPANCNT=6;GENE=Ighm,AC073553.3;GENEID=ENSMUSG00000076617,ENSMUSG00000092748;GENELOC=coding,upstream;EXPR=1488,0;HOMLEN=0;SPLICESCORE=4;ALTSPLICE;DELETION +13 105131562 bnd_5983_1 T T]13:105167527] 207 PASS SVTYPE=BND;MATEID=bnd_5983_2;DP=6;SPLITCNT=1;SPANCNT=5;GENE=4933425L06Rik,Rnf180;GENEID=ENSMUSG00000021718,ENSMUSG00000021720;GENELOC=downstream,intron;EXPR=0,60;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +13 105167527 bnd_5983_2 C [13:105131562[C 207 PASS SVTYPE=BND;MATEID=bnd_5983_1;DP=6;SPLITCNT=1;SPANCNT=5;GENE=4933425L06Rik,Rnf180;GENEID=ENSMUSG00000021718,ENSMUSG00000021720;GENELOC=downstream,intron;EXPR=0,60;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +14 31131376 bnd_9962_1 G G]14:31134739] 189 PASS SVTYPE=BND;MATEID=bnd_9962_2;DP=126;SPLITCNT=93;SPANCNT=33;GENE=2010107H07Rik,Nt5dc2;GENEID=ENSMUSG00000058351,ENSMUSG00000071547;GENELOC=upstream,upstream;EXPR=211,4122;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +14 31134739 bnd_9962_2 A [14:31131376[A 189 PASS SVTYPE=BND;MATEID=bnd_9962_1;DP=126;SPLITCNT=93;SPANCNT=33;GENE=2010107H07Rik,Nt5dc2;GENEID=ENSMUSG00000058351,ENSMUSG00000071547;GENELOC=upstream,upstream;EXPR=211,4122;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +15 33221484 bnd_11497_2 C [5:23703360[C 230 PASS SVTYPE=BND;MATEID=bnd_11497_1;DP=230;SPLITCNT=67;SPANCNT=163;GENE=SNORD93,Pgcp;GENEID=ENSMUSG00000096832,ENSMUSG00000039007;GENELOC=downstream,intron;EXPR=0,1604;HOMLEN=0;SPLICESCORE=4;INTERCHROM;ALTSPLICE +15 74996568 bnd_11177_2 C ]15:75048442]C 144 PASS SVTYPE=BND;MATEID=bnd_11177_1;DP=497;SPLITCNT=434;SPANCNT=63;GENE=Ly6c1,Ly6a;GENEID=ENSMUSG00000079018,ENSMUSG00000075602;GENELOC=utr5p,utr5p;EXPR=1416,5049;HOMLEN=0;SPLICESCORE=4;ORF;EXONBND;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +15 74996568 bnd_11169_2 A ]15:75048442]A 237 PASS SVTYPE=BND;MATEID=bnd_11169_1;DP=702;SPLITCNT=410;SPANCNT=292;GENE=Ly6c1,Ly6a;GENEID=ENSMUSG00000079018,ENSMUSG00000075602;GENELOC=utr5p,utr5p;EXPR=1416,5049;HOMLEN=0;SPLICESCORE=4;ORF;EXONBND;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +15 75048442 bnd_11177_1 A A[15:74996568[ 144 PASS SVTYPE=BND;MATEID=bnd_11177_2;DP=497;SPLITCNT=434;SPANCNT=63;GENE=Ly6c1,Ly6a;GENEID=ENSMUSG00000079018,ENSMUSG00000075602;GENELOC=utr5p,utr5p;EXPR=1416,5049;HOMLEN=0;SPLICESCORE=4;ORF;EXONBND;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +15 75048442 bnd_11169_1 G G[15:74996568[ 237 PASS SVTYPE=BND;MATEID=bnd_11169_2;DP=702;SPLITCNT=410;SPANCNT=292;GENE=Ly6c1,Ly6a;GENEID=ENSMUSG00000079018,ENSMUSG00000075602;GENELOC=utr5p,utr5p;EXPR=1416,5049;HOMLEN=0;SPLICESCORE=4;ORF;EXONBND;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +15 102188962 bnd_11202_1 G G]15:102202444] 248 PASS SVTYPE=BND;MATEID=bnd_11202_2;DP=31;SPLITCNT=21;SPANCNT=10;GENE=Csad,Zfp740;GENEID=ENSMUSG00000023044,ENSMUSG00000046897;GENELOC=utr5p,upstream;EXPR=1306,3566;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +15 102202444 bnd_11202_2 C [15:102188962[C 248 PASS SVTYPE=BND;MATEID=bnd_11202_1;DP=31;SPLITCNT=21;SPANCNT=10;GENE=Csad,Zfp740;GENEID=ENSMUSG00000023044,ENSMUSG00000046897;GENELOC=utr5p,upstream;EXPR=1306,3566;HOMLEN=0;SPLICESCORE=4;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +16 19493776 bnd_7456_2 T [18:30311220[T 189 PASS SVTYPE=BND;MATEID=bnd_7456_1;DP=19;SPLITCNT=9;SPANCNT=10;GENE=Pik3c3,Olfr166;GENEID=ENSMUSG00000033628,ENSMUSG00000056822;GENELOC=coding,downstream;EXPR=1791,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM +16 37799068 bnd_4068_2 T ]1:106734547]T 233 PASS SVTYPE=BND;MATEID=bnd_4068_1;DP=19;SPLITCNT=8;SPANCNT=11;GENE=Kdsr,Fstl1;GENEID=ENSMUSG00000009905,ENSMUSG00000022816;GENELOC=coding,intron;EXPR=775,35409;HOMLEN=2;SPLICESCORE=4;INTERCHROM;ALTSPLICE +17 17395298 bnd_3153_1 C C]17:17411818] 206 PASS SVTYPE=BND;MATEID=bnd_3153_2;DP=41;SPLITCNT=16;SPANCNT=25;GENE=AC154200.1,Lix1;GENEID=ENSMUSG00000097379,ENSMUSG00000047786;GENELOC=intron,intron;EXPR=0,30;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +17 17411818 bnd_3153_2 T [17:17395298[T 206 PASS SVTYPE=BND;MATEID=bnd_3153_1;DP=41;SPLITCNT=16;SPANCNT=25;GENE=AC154200.1,Lix1;GENEID=ENSMUSG00000097379,ENSMUSG00000047786;GENELOC=intron,intron;EXPR=0,30;HOMLEN=0;SPLICESCORE=1;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +17 35266514 bnd_3184_1 G G]17:35383995] 218 PASS SVTYPE=BND;MATEID=bnd_3184_2;DP=47;SPLITCNT=40;SPANCNT=7;GENE=H2-D1,H2-Q4;GENEID=ENSMUSG00000073411,ENSMUSG00000035929;GENELOC=coding,utr3p;EXPR=23545,2145;HOMLEN=0;SPLICESCORE=4;EXONBND;ALTSPLICE;DELETION +17 35383995 bnd_3184_2 A [17:35266514[A 218 PASS SVTYPE=BND;MATEID=bnd_3184_1;DP=47;SPLITCNT=40;SPANCNT=7;GENE=H2-D1,H2-Q4;GENEID=ENSMUSG00000073411,ENSMUSG00000035929;GENELOC=coding,utr3p;EXPR=23545,2145;HOMLEN=0;SPLICESCORE=4;EXONBND;ALTSPLICE;DELETION +17 63864053 bnd_3179_1 T T]17:63896016] 195 PASS SVTYPE=BND;MATEID=bnd_3179_2;DP=17;SPLITCNT=10;SPANCNT=7;GENE=A930002H24Rik,Fer;GENEID=ENSMUSG00000045506,ENSMUSG00000000127;GENELOC=upstream,upstream;EXPR=10,1325;HOMLEN=0;SPLICESCORE=3;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +17 63896016 bnd_3179_2 A [17:63864053[A 195 PASS SVTYPE=BND;MATEID=bnd_3179_1;DP=17;SPLITCNT=10;SPANCNT=7;GENE=A930002H24Rik,Fer;GENEID=ENSMUSG00000045506,ENSMUSG00000000127;GENELOC=upstream,upstream;EXPR=10,1325;HOMLEN=0;SPLICESCORE=3;READTHROUGH;ADJACENT;ALTSPLICE;DELETION +18 4198107 bnd_5326_2 G ]2:165827729]G 136 PASS SVTYPE=BND;MATEID=bnd_5326_1;DP=20;SPLITCNT=7;SPANCNT=13;GENE=Zmynd8,Gm10557;GENEID=ENSMUSG00000039671,ENSMUSG00000073647;GENELOC=utr3p,downstream;EXPR=2963,0;HOMLEN=9;SPLICESCORE=1;INTERCHROM;ALTSPLICE +18 28188917 bnd_5160_2 A [3:103040040[A 213 PASS SVTYPE=BND;MATEID=bnd_5160_1;DP=63;SPLITCNT=33;SPANCNT=30;GENE=Csde1,SNORA17;GENEID=ENSMUSG00000068823,ENSMUSG00000087940;GENELOC=coding,upstream;EXPR=10681,0;HOMLEN=91;SPLICESCORE=3;INTERCHROM;ALTSPLICE +18 30311220 bnd_7456_1 G G]16:19493776] 189 PASS SVTYPE=BND;MATEID=bnd_7456_2;DP=19;SPLITCNT=9;SPANCNT=10;GENE=Pik3c3,Olfr166;GENEID=ENSMUSG00000033628,ENSMUSG00000056822;GENELOC=coding,downstream;EXPR=1791,0;HOMLEN=0;SPLICESCORE=1;INTERCHROM +18 53932206 bnd_5141_1 T T[X:53418862[ 201 PASS SVTYPE=BND;MATEID=bnd_5141_2;DP=26;SPLITCNT=9;SPANCNT=17;GENE=Csnk1g3,Gm14584;GENEID=ENSMUSG00000073563,ENSMUSG00000083798;GENELOC=coding,intron;EXPR=2227,0;HOMLEN=266;SPLICESCORE=2;INTERCHROM;ALTSPLICE +19 37678397 bnd_13800_1 C C]19:37696729] 204 PASS SVTYPE=BND;MATEID=bnd_13800_2;DP=15;SPLITCNT=3;SPANCNT=12;GENE=Exoc6,Cyp26a1;GENEID=ENSMUSG00000053799,ENSMUSG00000024987;GENELOC=intron,upstream;EXPR=575,443;HOMLEN=0;SPLICESCORE=1;ALTSPLICE;DELETION +19 37696729 bnd_13800_2 A [19:37678397[A 204 PASS SVTYPE=BND;MATEID=bnd_13800_1;DP=15;SPLITCNT=3;SPANCNT=12;GENE=Exoc6,Cyp26a1;GENEID=ENSMUSG00000053799,ENSMUSG00000024987;GENELOC=intron,upstream;EXPR=575,443;HOMLEN=0;SPLICESCORE=1;ALTSPLICE;DELETION +X 37143984 bnd_8690_2 G ]9:7872842]G 132 PASS SVTYPE=BND;MATEID=bnd_8690_1;DP=9;SPLITCNT=4;SPANCNT=5;GENE=Birc3,Nkap;GENEID=ENSMUSG00000032000,ENSMUSG00000016409;GENELOC=utr5p,intron;EXPR=4781,556;HOMLEN=39;SPLICESCORE=1;INTERCHROM;ALTSPLICE +X 53418862 bnd_5141_2 A ]18:53932206]A 201 PASS SVTYPE=BND;MATEID=bnd_5141_1;DP=26;SPLITCNT=9;SPANCNT=17;GENE=Csnk1g3,Gm14584;GENEID=ENSMUSG00000073563,ENSMUSG00000083798;GENELOC=coding,intron;EXPR=2227,0;HOMLEN=266;SPLICESCORE=2;INTERCHROM;ALTSPLICE
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool-data/defuse_reference.loc.sample Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,7 @@ +## Configurstion info for prepared data references for DeFuse +## http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.6.0 +## 4 columns separated by the TAB character +## The 4th column has the path to the defuse config.txt file, it needs to have the dataset_directory set the directory path where the defuse reference data resides. +## The defuse galaxy tool will substitute the directory path of config.txt if the dataset_directory property is not set '__DATASET_DIRECTORY__' +#<unique_build_id> <dbkey> <display_name> <file_base_path> +GRCh37 GRCh37 Human GRCh37 (hg19) /depot/GRCh37/defuse/GRCh37.config
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_data_table_conf.xml.sample Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,7 @@ +<tables> + <!-- Locations of all fasta files under genome directory --> + <table name="defuse_reference" comment_char="#"> + <columns>value, dbkey, name, path</columns> + <file path="tool-data/defuse_reference.loc" /> + </table> +</tables>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_dependencies.xml Mon May 20 15:25:03 2019 -0400 @@ -0,0 +1,27 @@ +<?xml version="1.0" ?> +<tool_dependency> + <package name="defuse" version="0.6.2"> + <repository changeset_revision="bf5636f162fe" name="package_defuse_0_6_2" owner="jjohnson" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="samtools" version="0.1.19"> + <repository changeset_revision="a0ab0fae27e5" name="package_samtools_0_1_19" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="bowtie" version="1.0.0"> + <repository changeset_revision="0931da782b5d" name="package_bowtie_1_0_0" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="gmap" version="2013-05-09"> + <repository changeset_revision="f59058e92ed7" name="package_gmap_2013_05_09" owner="jjohnson" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="blat" version="35x1"> + <repository changeset_revision="c2d5d5117413" name="package_blat_35x1" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="faToTwoBit" version="35x1"> + <repository changeset_revision="d2839a511c5a" name="package_fatotwobit_35x1" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="R" version="3.1.2"> + <repository changeset_revision="1ca39eb16186" name="package_r_3_1_2" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> + <package name="ada" version="2.0.3"> + <repository changeset_revision="50f0e0825d41" name="package_r_ada_2_0_3" owner="jjohnson" toolshed="https://testtoolshed.g2.bx.psu.edu"/> + </package> +</tool_dependency> \ No newline at end of file