Mercurial > repos > mingchen0919 > aurora_genetic_filtering
changeset 1:48e4e22dab0e draft default tip
planemo upload
author | mingchen0919 |
---|---|
date | Tue, 29 May 2018 11:37:10 -0400 |
parents | 9b1ebab47c0d |
children | |
files | command-line-arguments.csv expose-outputs-to-galaxy-history.sh rmarkdown_report.Rmd rmarkdown_report.xml rmarkdown_report_render.R |
diffstat | 5 files changed, 76 insertions(+), 1 deletions(-) [+] |
line wrap: on
line diff
--- a/command-line-arguments.csv Tue May 29 10:38:12 2018 -0400 +++ b/command-line-arguments.csv Tue May 29 11:37:10 2018 -0400 @@ -1,3 +1,5 @@ short flag,argument mask,data type,variable name o,1,character,report -d,1,character,report.files_path \ No newline at end of file +d,1,character,report.files_path +v,1,character,vcf_input +s,1,character,fSNPs_matrix \ No newline at end of file
--- a/expose-outputs-to-galaxy-history.sh Tue May 29 10:38:12 2018 -0400 +++ b/expose-outputs-to-galaxy-history.sh Tue May 29 11:37:10 2018 -0400 @@ -9,3 +9,7 @@ if [ -e "index.html" ]; then cp index.html ${REPORT} fi + +if [ -e "fSNPS.csv" ]; then + cp fSNPS.csv ${X_s} +fi \ No newline at end of file
--- a/rmarkdown_report.Rmd Tue May 29 10:38:12 2018 -0400 +++ b/rmarkdown_report.Rmd Tue May 29 11:37:10 2018 -0400 @@ -55,3 +55,65 @@ +## Load Data + +```{r} +VCF <- read.vcfR(opt$X_v) # import the VCF in R +``` + +## Create a Genotype Matrix + +The VCF object contains several additional information concerning genetic variant quality that are not needed in this exercise. We want to obtain a Genotype Matrix out of the VCF object, we can do it as shown here below: + +```{r} +GT <- vcfR2loci(VCF) + +GT[1:10,1:10] # The GT matrix shows the genotype for each individual (row) and genetic variant (column) + +dim(GT) +``` + +For our calculations, we prefer having the genotypes coded as 0,1,2 (homozygous 0/0, heterozygous, homozygous 1/1). The code here below performs this transformation creating the SNPs matrix. + +```{r} +GT <- as.matrix(GT) + +GT[GT=='0/0'] <- 0 +GT[GT=='0/1'] <- 1 +GT[GT=='1/0'] <- 1 +GT[GT=='1/1'] <- 2 + +SNPS <- GT +SNPS <- apply(SNPS,2,as.numeric) +rownames(SNPS) <- rownames(GT) +``` + + +The SNPS matrix has the genotype values coded as 0,1,2: + +```{r} +SNPS[1:10,1:10] +``` + + +## Check for genetic diversity metrics + +Before starting, we create a fSNPS matrix identical to the SNPS matrix. We will perform all the calculation on the fSNPS matrix keeping the SNPS matrix as backup. + +```{r} +fSNPS <- SNPS +``` + + +There are R packages that allow to compute this metrics directly with built-in function. In this exercise we use the hand-written formula to better show the meaning of each procedure. + +### Missingness by SNP + +For each SNP, it computes the % of missing observation across the individuals. + +```{r} +MNsnp <- apply(fSNPS, 2, function(x) { sum(is.na(x))/length(x) }) +hist(MNsnp, breaks=10) # We can observe that most of the SNPS have a Missingness below 0.01 (i.e. ~ 2 individual ot of 160). +fSNPS <- fSNPS[,MNsnp<0.01] # We apply a 1% threshold +write.csv(fSNPS, file = "fSNPS.csv", quote = FALSE, col.names = TRUE) +```
--- a/rmarkdown_report.xml Tue May 29 10:38:12 2018 -0400 +++ b/rmarkdown_report.xml Tue May 29 11:37:10 2018 -0400 @@ -31,12 +31,17 @@ -o $report -d $report.files_path + -v $vcf_input + -s $fSNPs_matrix + ]]></command> <inputs> + <param type="data" name="vcf_input" format="vcf" label="VCF input"/> </inputs> <outputs> <data format="html" name="report" label="${tool.name} report on ${on_string}"/> + <data format="csv" name="fSNPs_matrix" label="fSNPs matrix on ${on_string}"/> </outputs> <citations> <citation type="bibtex"><![CDATA[
--- a/rmarkdown_report_render.R Tue May 29 10:38:12 2018 -0400 +++ b/rmarkdown_report_render.R Tue May 29 11:37:10 2018 -0400 @@ -13,6 +13,8 @@ library('getopt') library('rmarkdown') library('htmltools') +library('raster') +library('vcfR') #------------------------------------------------------------------ options(stringsAsFactors = FALSE)