changeset 1:48e4e22dab0e draft default tip

planemo upload
author mingchen0919
date Tue, 29 May 2018 11:37:10 -0400
parents 9b1ebab47c0d
children
files command-line-arguments.csv expose-outputs-to-galaxy-history.sh rmarkdown_report.Rmd rmarkdown_report.xml rmarkdown_report_render.R
diffstat 5 files changed, 76 insertions(+), 1 deletions(-) [+]
line wrap: on
line diff
--- a/command-line-arguments.csv	Tue May 29 10:38:12 2018 -0400
+++ b/command-line-arguments.csv	Tue May 29 11:37:10 2018 -0400
@@ -1,3 +1,5 @@
 short flag,argument mask,data type,variable name
 o,1,character,report
-d,1,character,report.files_path
\ No newline at end of file
+d,1,character,report.files_path
+v,1,character,vcf_input
+s,1,character,fSNPs_matrix
\ No newline at end of file
--- a/expose-outputs-to-galaxy-history.sh	Tue May 29 10:38:12 2018 -0400
+++ b/expose-outputs-to-galaxy-history.sh	Tue May 29 11:37:10 2018 -0400
@@ -9,3 +9,7 @@
 if [ -e "index.html" ]; then
   cp index.html ${REPORT}
 fi
+
+if [ -e "fSNPS.csv" ]; then
+  cp fSNPS.csv ${X_s}
+fi
\ No newline at end of file
--- a/rmarkdown_report.Rmd	Tue May 29 10:38:12 2018 -0400
+++ b/rmarkdown_report.Rmd	Tue May 29 11:37:10 2018 -0400
@@ -55,3 +55,65 @@
 
 
 
+## Load Data
+
+```{r}
+VCF <- read.vcfR(opt$X_v) # import the VCF in R
+```
+
+## Create a Genotype Matrix
+
+The VCF object contains several additional information concerning genetic variant quality that are not needed in this exercise. We want to obtain a Genotype Matrix out of the VCF object, we can do it as shown here below:  
+
+```{r}
+GT <- vcfR2loci(VCF)
+
+GT[1:10,1:10] # The GT matrix shows the genotype for each individual (row) and genetic variant (column)
+
+dim(GT) 
+```
+
+For our calculations, we prefer having the genotypes coded as 0,1,2 (homozygous 0/0, heterozygous, homozygous 1/1). The code here below performs this transformation creating the SNPs matrix. 
+
+```{r}
+GT <- as.matrix(GT)
+
+GT[GT=='0/0'] <- 0
+GT[GT=='0/1'] <- 1
+GT[GT=='1/0'] <- 1
+GT[GT=='1/1'] <- 2
+
+SNPS <- GT
+SNPS <- apply(SNPS,2,as.numeric)
+rownames(SNPS) <- rownames(GT)
+```
+
+
+The SNPS matrix has the genotype values coded as 0,1,2:
+
+```{r}
+SNPS[1:10,1:10]
+```
+
+
+## Check for genetic diversity metrics
+
+Before starting, we create a fSNPS matrix identical to the SNPS matrix. We will perform all the calculation on the fSNPS matrix keeping the SNPS matrix as backup.
+
+```{r}
+fSNPS <- SNPS
+```
+
+
+There are R packages that allow to compute this metrics directly with built-in function. In this exercise we use the hand-written formula to better show the meaning of each procedure.
+
+### Missingness by SNP
+
+For each SNP, it computes the % of missing observation across the individuals. 
+
+```{r}
+MNsnp <- apply(fSNPS, 2, function(x) { sum(is.na(x))/length(x) })
+hist(MNsnp, breaks=10) # We can observe that most of the SNPS have a Missingness below 0.01 (i.e. ~ 2 individual ot of 160).
+fSNPS <- fSNPS[,MNsnp<0.01] # We apply a 1% threshold
+write.csv(fSNPS, file = "fSNPS.csv", quote = FALSE, col.names = TRUE)
+```
--- a/rmarkdown_report.xml	Tue May 29 10:38:12 2018 -0400
+++ b/rmarkdown_report.xml	Tue May 29 11:37:10 2018 -0400
@@ -31,12 +31,17 @@
             -o $report
             -d $report.files_path
 
+            -v $vcf_input
+            -s $fSNPs_matrix
+
 
     ]]></command>
     <inputs>
+        <param type="data" name="vcf_input" format="vcf" label="VCF input"/>
     </inputs>
     <outputs>
         <data format="html" name="report" label="${tool.name} report on ${on_string}"/>
+        <data format="csv" name="fSNPs_matrix" label="fSNPs matrix on ${on_string}"/>
     </outputs>
     <citations>
         <citation type="bibtex"><![CDATA[
--- a/rmarkdown_report_render.R	Tue May 29 10:38:12 2018 -0400
+++ b/rmarkdown_report_render.R	Tue May 29 11:37:10 2018 -0400
@@ -13,6 +13,8 @@
 library('getopt')
 library('rmarkdown')
 library('htmltools')
+library('raster')
+library('vcfR')
 #------------------------------------------------------------------
 options(stringsAsFactors = FALSE)