Mercurial > repos > spficklin > aurora_wgcna

--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/aurora_wgcna.Rmd	Tue Nov 19 23:20:48 2019 +0000
@@ -0,0 +1,246 @@
+---
+title: 'WGCNA: Gene Co-Expression Network Construction Report'
+output:
+    html_document:
+      number_sections: false
+      toc: true
+---
+
+```{r setup, include=FALSE, warning=FALSE, message=FALSE}
+knitr::opts_chunk$set(error = TRUE, echo = FALSE)
+```
+
+# Expression Data
+
+The table below shows the first 100 rows of the Gene Expression Matrix (GEM) file that you provided.
+
+```{r}
+gem = read.csv(opt$expression_data, header = TRUE, row.names = 1)
+table_data = head(gem, 100)
+datatable(table_data)
+```
+
+```{r}
+gemt = as.data.frame(t(gem))
+```
+
+The next step is to check the data for low quality samples or genes. These have too many missing values or consist of genes with zero-variance. Samples and genes are removed if they are low quality.  The `goodSamplesGenes` function of WGCNA is used to identify such cases. The following cell indicates if WGCNA identified any low quality genes or samples, and these were removed.
+
+
+```{r}
+gsg = goodSamplesGenes(gemt, verbose = 3)
+
+if (!gsg$allOK) {
+  if (sum(!gsg$goodGenes)>0) {
+    printFlush(paste("Removing genes:", paste(names(gemt)[!gsg$goodGenes], collapse = ", ")));
+  }
+  if (sum(!gsg$goodSamples)>0) {
+    printFlush(paste("Removing samples:", paste(rownames(gemt)[!gsg$goodSamples], collapse = ", ")));
+  # Remove the offending genes and samples from the data:
+  }
+  gemt = gemt[gsg$goodSamples, gsg$goodGenes]
+} else {
+  print('all genes are OK!')
+}
+```
+
+
+Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If you did not specify a height to trim outlier samples, then the `cutreeDynamic` function is used to automatically find outliers, and then they are removed.  If you did not specify a cut height and you do not approve of the automatically detected height, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.
+
+```{r fig.align='center'}
+sampleTree = hclust(dist(gemt), method = "average");
+plot(sampleTree, main = "Sample clustering prior to outlier removal", sub="", xlab="",
+     cex.axis = 1, cex.main = 1, cex = 0.5)
+```
+
+```{r}
+if (is.null(opt$height_cut)) {
+  print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")
+  clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)
+  keepSamples = (clust!=0)
+  gemt = gemt[keepSamples, ]
+} else {
+  print("You specified a height for cutting of", opt$height_cut, ". The cutreeStatic function was used.")
+  clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)
+  keepSamples = (clust==1)
+  gemt = gemt[keepSamples, ]
+}
+n_genes = ncol(gemt)
+n_samples = nrow(gemt)
+removed = length(which(keepSamples == FALSE))
+print(paste("A total of", removed, "sample(s) was/were removed"))
+
+```
+
+```{r fig.align='center'}
+sampleTree = hclust(dist(gemt), method = "average");
+plot(sampleTree, main = "Sample clustering after outlier removal", sub="", xlab="",
+     cex.axis = 1, cex.main = 1, cex = 0.5)
+```
+
+# Network Module Discovery
+
+The first step in network module discovery is calculating similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. However, the WGCNA authors suggest that raising the GEM to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to which the data should be raised is initially unknown.  This is determined using the `pickSoftThreshold` function of WGCNA which iterates through a series of power values (usually between 1 to 20) and tests how well the data approximates scale-free behavior. The following table shows the results of those tests.  The meaning of the table headers are:
+
+- Power: The power tested
+- SFT.R.sq: This is the scale free index, or the R.squared value of the undelrying regression model. It indicates how well the power-raised data appears scale free. The higher the value the more scale-free.
+- slope: The slope of the regression line used to calculate SFT.R.sq
+- trunacted.R.sq: The adjusted R.squared measure from the truncated exponential model used to calculate SFT.R.sq
+- mean.k:  The mean degree (degree is a measure of how connected a gene is to every other gene. The higher the number the more connected.)
+- median.k: The median degree
+- max.k: The largest degree.
+
+```{r}
+powers = c(1:10, seq(12, 20, 2))
+sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)
+```
+
+The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power value for the network should be the value where there is a diminishing change in both the scale-free index and mean connectivity.
+
+```{r fig.align='center'}
+par(mfrow=c(1,2))
+
+th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]
+
+# Scale-free topology fit index as a function of the soft-thresholding power.
+plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
+     xlab="Soft Threshold (power)",
+     ylab="Scale Free Topology Model Fit,signed R^2", type="n",
+     main = paste("Scale independence"), cex.lab = 0.5);
+text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
+     labels=powers,cex=0.5,col="red");
+abline(h=th, col="blue")
+
+# Mean connectivity as a function of the soft-thresholding power.
+plot(sft$fitIndices[,1], sft$fitIndices[,5],
+     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
+     main = paste("Mean connectivity"), cex.lab = 0.5)
+text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
+abline(h=th, col="blue")
+par(mfrow=c(1,1))
+```
+Using the values in the table above, WGCNA is able to predict the ideal power. This selection is indicated in the following cell and is shown as a blue line on the plots above. If you believe that the power was incorrectly chosen, you can re-run this tool with the same input files and provide the desired power.
+
+```{r}
+print("WGCNA predicted the following power:")
+print(sft$powerEstimate)
+power = sft$powerEstimate
+if (!is.null(opt$power)) {
+  print("However, you selected to override this by providing a power of:", opt$soft_threshold_power)
+  print(opt$soft_threshold_power)
+  power = opt$power
+}
+```
+
+Now that a power has been identified, modules can be discovered.  Here, the `blockwiseModule` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks is dependent on the block size you provided.
+
+```{r}
+net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
+                      TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
+                      reassignThreshold = 0, mergeCutHeight = 0.25,
+                      numericLabels = TRUE, pamRespectsDendro = FALSE,
+                      verbose = 3, saveTOMs = TRUE,
+                      saveTOMFileBase = "TOM")
+blocks = sort(unique(net$blocks))
+```
+The following table shows the list of modules that were discovered and their size (i.e. number of genes).
+
+```{r}
+module_labels = labels2colors(net$colors)
+module_labels = paste("ME", module_labels, sep="")
+module_labels2num = unique(data.frame(label = module_labels, num = net$color, row.names=NULL))
+rownames(module_labels2num) = paste0('ME', module_labels2num$num)
+modules = unique(as.data.frame(table(module_labels)))
+n_modules = length(modules) - 1
+module_size_upper = modules[2]
+module_size_lower = modules[length(modules)]
+colnames(modules) = c('Module', 'Module Size')
+datatable(modules)
+```
+
+Modules consist of a set of genes that have highly similar expression patterns.  Therefore, the similarity of genes within a module can be summarized using an "eigengene" vector. This vector is analgous to the first principal component in a PCA analysis. Once each module's eigengene is calculated, they can be compared and displayed in dendrogram to identify which modules are most similar to each other. This is visible in the following plot.
+
+```{r fig.align='center'}
+MEs = net$MEs
+colnames(MEs) = module_labels2num[colnames(MEs),]$label
+plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
+```
+
+Alternatively, we can use a heatmap to explore similarity of each module.
+```{r fig.align='center'}
+plotEigengeneNetworks(MEs, "Eigengene adjacency heatmap",
+                      marHeatmap = c(2, 3, 2, 2),
+                      plotDendrograms = FALSE)
+```
+
+We can examine gene similarity within the context of our modules.  The following dendrogram clusters genes by their similarity of expression and the modules to which each gene belongs is shown under the graph.  When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicate that genes in modules tend to have similar expression.
+
+```{r}
+# Plot the dendrogram and the module colors underneath
+for (i in blocks) {
+  plotDendroAndColors(net$dendrograms[[i]], net$colors[net$blockGenes[[i]]],
+                      "Module colors", dendroLabels = FALSE, hang = 0.03,
+                      addGuide = TRUE, guideHang = 0.05, main=paste('Cluster Dendgrogram, Block', i))
+}
+```
+
+The network is housed in a *n* x *n* similarity matrix known as the the Topological Overlap Matrix (TOM), where *n* is the number of genes and the value in each cell indicates the measure of similarity in terms of correlation of expression and interconnectedness. The following heat maps organizes the genes using the dendrograms from above and the color intensity indicates genes that tend to be more similar.  Genes with similar expression and in modules will cluster along the diagonal.
+
+```{r}
+for (i in blocks) {
+  # Load the TOM from a file.
+  load(net$TOMFiles[i])
+  TOM_size = length(which(net$blocks == i))
+  TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
+
+  # Calculate the dissimilarity and raise it to apower to make it more visible on the plot.
+  dissTOM = 1-TOM
+  plotTOM = dissTOM^7
+  diag(plotTOM) = NA
+
+  # Now plot the heatmap
+  geneTree = net$dendrograms[[i]]
+  colors = sub('ME','', module_labels[which(net$blocks == i)])
+  TOMplot(plotTOM, geneTree, colors, main = paste("Network Heatmap, Block", i))
+}
+```
+
+```{r}
+output = cbind(colnames(gemt), module_labels)
+colnames(output) = c('Gene', 'Module')
+write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
+```
+
+A file has been generated named `gene_module_file.csv` which contains the list of genes and the modules they belong to.
+
+The TOM serves as both a simialrity matrix and an adjacency matrix. The adjacency matrix is typically identical to a similarity matrix but with values above a set threshold set to 1 and values below set to 0.  This is known as hard thresholding.  However, WGCNA does not set values above a threshold to zero but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding", because the pairwise weights of all genes contributed to discover of modules.  The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
+
+Unfortunately, this "soft thresholding" approach can make creation of a graph representation of the network difficult. If we exported the TOM as a connected graph it would result in a fully complete graph and would be difficult to interpret. Therefore, we must still set a hard-threshold if we want to visualize connectivity in graph form.  Setting a hard threshold, if too high can result in genes being excluded from the graph and a threshold that is too low can result in too many false edges in the graph.
+
+```{r}
+edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
+for (i in blocks) {
+  # Load the TOM from a file.
+  load(net$TOMFiles[i])
+  TOM_size = length(which(net$blocks == i))
+  TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
+  colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
+  row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]
+
+  cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
+  edges = rbind(edges, cydata$edgeData)
+}
+
+edges$Interaction = 'co'
+output = edges[,c('fromNode','toNode','Interaction', 'weight')]
+colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
+write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
+```
+
+Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.
+
+```{r}
+# Save this image for the next step which is optional if theuser
+# provides a trait file.
+save.image(file=opt$r_data)
+```