comparison rgedgeRpaired_nocamera.xml @ 50:60aceade5350 draft

removed atlas - won't compile here at the Baker - thinks we have cpu scaling turned on. We don't afaik.
author fubar
date Mon, 23 Dec 2013 21:36:34 -0500
comparison 49:89ed2d3c529f 50:60aceade5350
1 <tool id="rgDifferentialCount" name="Differential_Count" version="0.23">
2 <description>models using BioConductor packages</description>
3 <requirements>
4 <requirement type="package" version="3.11.11">atlas</requirement>
5 <requirement type="package" version="3.0.1">r3</requirement>
6 <requirement type="package" version="1.3.18">graphicsmagick</requirement>
7 <requirement type="package" version="9.07">ghostscript</requirement>
8 <requirement type="package" version="2.12">biocbasics</requirement>
9 </requirements>
10
11 <command interpreter="python">
12 rgToolFactory.py --script_path "$runme" --interpreter "Rscript" --tool_name "DifferentialCounts"
13 --output_dir "$html_file.files_path" --output_html "$html_file" --make_HTML "yes"
14 </command>
15 <inputs>
16 <param name="input1" type="data" format="tabular" label="Select an input matrix - rows are contigs, columns are counts for each sample"
17 help="Use the HTSeq based count matrix preparation tool to create these matrices from BAM/SAM files and a GTF file of genomic features"/>
18 <param name="title" type="text" value="Differential Counts" size="80" label="Title for job outputs"
19 help="Supply a meaningful name here to remind you what the outputs contain">
20 <sanitizer invalid_char="">
21 <valid initial="string.letters,string.digits"><add value="_" /> </valid>
22 </sanitizer>
23 </param>
24 <param name="treatment_name" type="text" value="Treatment" size="50" label="Treatment Name"/>
25 <param name="Treat_cols" label="Select columns containing treatment." type="data_column" data_ref="input1" numerical="True"
26 multiple="true" use_header_names="true" size="120" display="checkboxes">
27 <validator type="no_options" message="Please select at least one column."/>
28 <sanitizer invalid_char="">
29 <valid initial="string.letters,string.digits"><add value="_" /> </valid>
30 </sanitizer>
31 </param>
32 <param name="control_name" type="text" value="Control" size="50" label="Control Name"/>
33 <param name="Control_cols" label="Select columns containing control." type="data_column" data_ref="input1" numerical="True"
34 multiple="true" use_header_names="true" size="120" display="checkboxes" optional="true">
35 <validator type="no_options" message="Please select at least one column."/>
36 <sanitizer invalid_char="">
37 <valid initial="string.letters,string.digits"><add value="_" /> </valid>
38 </sanitizer>
42
43 </param>
44 <param name="subjectids" type="text" optional="true" size="120" value = ""
45 label="IF SUBJECTS NOT ALL INDEPENDENT! Enter comma separated strings to indicate sample labels for (eg) pairing - must be one for every column in input"
46 help="Leave blank if no pairing, but eg if data from sample id A99 is in columns 2,4 and id C21 is in 3,5 then enter 'A99,C21,A99,C21'">
47 <sanitizer>
48 <valid initial="string.letters,string.digits"><add value="," /> </valid>
49 </sanitizer>
50 </param>
51 <param name="fQ" type="float" value="0.3" size="5" label="Non-differential contig count quantile threshold - zero to analyze all non-zero read count contigs"
52 help="May be a good or a bad idea depending on the biology and the question. EG 0.3 = sparsest 30% of contigs with at least one read are removed before analysis"/>
53 <param name="useNDF" type="boolean" truevalue="T" falsevalue="F" checked="false" size="1"
54 label="Non differential filter - remove contigs below a threshold (1 per million) for half or more samples"
55 help="May be a good or a bad idea depending on the biology and the question. This was the old default. Quantile based is available as an alternative"/>
56
57 <conditional name="edgeR">
58 <param name="doedgeR" type="select"
59 label="Run this model using edgeR"
60 help="edgeR uses a negative binomial model and seems to be powerful, even with few replicates">
61 <option value="F">Do not run edgeR</option>
62 <option value="T" selected="true">Run edgeR</option>
63 </param>
64 <when value="T">
65 <param name="edgeR_priordf" type="integer" value="20" size="3"
66 label="prior.df for tagwise dispersion - lower value = more emphasis on each tag's variance. Replaces prior.n and prior.df = prior.n * residual.df"
67 help="0 = Use edgeR default. Use a small value to 'smooth' small samples. See edgeR docs and note below"/>
68 </when>
69 <when value="F"></when>
70 </conditional>
71 <conditional name="DESeq2">
72 <param name="doDESeq2" type="select"
73 label="Run the same model with DESeq2 and compare findings"
74 help="DESeq2 is an update to the DESeq package. It uses different assumptions and methods to edgeR">
75 <option value="F" selected="true">Do not run DESeq2</option>
76 <option value="T">Run DESeq2</option>
77 </param>
78 <when value="T">
79 <param name="DESeq_fitType" type="select">
80 <option value="parametric" selected="true">Parametric (default) fit for dispersions</option>
81 <option value="local">Local fit - this will automagically be used if parametric fit fails</option>
82 <option value="mean">Mean dispersion fit- use this if you really understand what you're doing - read the fine manual linked below in the documentation</option>
83 </param>
84 </when>
85 <when value="F"> </when>
86 </conditional>
87 <param name="doVoom" type="select"
88 label="Run the same model with Voom/limma and compare findings"
89 help="Voom uses counts per million and a precise transformation of variance so count data can be analysed using limma">
90 <option value="F" selected="true">Do not run VOOM</option>
91 <option value="T">Run VOOM</option>
92 </param>
93 <!--
94 <conditional name="camera">
95 <param name="doCamera" type="select" label="Run the edgeR implementation of Camera GSEA for up/down gene sets"
96 help="If yes, you can choose a set of genesets to test and/or supply a gmt format geneset collection from your history">
97 <option value="F" selected="true">Do not run GSEA tests with the Camera algorithm</option>
98 <option value="T">Run GSEA tests with the Camera algorithm</option>
99 </param>
100 <when value="T">
101 <conditional name="gmtSource">
102 <param name="refgmtSource" type="select"
103 label="Use a gene set (.gmt) from your history and/or use a built-in (MSigDB etc) gene set">
104 <option value="indexed" selected="true">Use a built-in gene set</option>
105 <option value="history">Use a gene set from my history</option>
106 <option value="both">Add a gene set from my history to a built in gene set</option>
107 </param>
108 <when value="indexed">
109 <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis">
110 <options from_data_table="gseaGMT_3.1">
111 <filter type="sort_by" column="2" />
112 <validator type="no_options" message="No GMT v3.1 files are available - please install them"/>
113 </options>
114 </param>
115 </when>
116 <when value="history">
117 <param name="ownGMT" type="data" format="gmt" label="Select a Gene Set from your history" />
118 </when>
119 <when value="both">
120 <param name="ownGMT" type="data" format="gseagmt" label="Select a Gene Set from your history" />
121 <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis">
122 <options from_data_table="gseaGMT_4">
123 <filter type="sort_by" column="2" />
124 <validator type="no_options" message="No GMT v4 files are available - please fix tool_data_table and loc files"/>
125 </options>
126 </param>
127 </when>
128 </conditional>
129 </when>
130 <when value="F">
131 </when>
132 </conditional>
133 -->
134 <param name="fdrthresh" type="float" value="0.05" size="5" label="P value threshold for FDR filtering for amily wise error rate control"
135 help="Conventional default value of 0.05 recommended"/>
136 <param name="fdrtype" type="select" label="FDR (Type II error) control method"
137 help="Use fdr or bh typically to control for the number of tests in a reliable way">
138 <option value="fdr" selected="true">fdr</option>
139 <option value="BH">Benjamini Hochberg</option>
140 <option value="BY">Benjamini Yukateli</option>
141 <option value="bonferroni">Bonferroni</option>
142 <option value="hochberg">Hochberg</option>
143 <option value="holm">Holm</option>
144 <option value="hommel">Hommel</option>
145 <option value="none">no control for multiple tests</option>
146 </param>
147 </inputs>
148 <outputs>
149 <data format="tabular" name="out_edgeR" label="${title}_topTable_edgeR.xls">
150 <filter>edgeR['doedgeR'] == "T"</filter>
151 </data>
152 <data format="tabular" name="out_DESeq2" label="${title}_topTable_DESeq2.xls">
153 <filter>DESeq2['doDESeq2'] == "T"</filter>
154 </data>
155 <data format="tabular" name="out_VOOM" label="${title}_topTable_VOOM.xls">
156 <filter>doVoom == "T"</filter>
157 </data>
158 <data format="html" name="html_file" label="${title}.html"/>
159 </outputs>
160 <stdio>
161 <exit_code range="4" level="fatal" description="Number of subject ids must match total number of samples in the input matrix" />
162 </stdio>
163 <tests>
164 <test>
165 <param name='input1' value='test_bams2mx.xls' ftype='tabular' />
166 <param name='treatment_name' value='liver' />
167 <param name='title' value='edgeRtest' />
168 <param name='useNDF' value='' />
169 <param name='doedgeR' value='T' />
170 <param name='doVoom' value='T' />
171 <param name='doDESeq2' value='T' />
172 <param name='fdrtype' value='fdr' />
173 <param name='edgeR_priordf' value="8" />
174 <param name='fdrthresh' value="0.05" />
175 <param name='control_name' value='heart' />
176 <param name='subjectids' value='' />
177 <param name='Control_cols' value='3,4,5,9' />
178 <param name='Treat_cols' value='2,6,7,8' />
179 <output name='out_edgeR' file='edgeRtest1out.xls' compare='diff' />
180 <output name='html_file' file='edgeRtest1out.html' compare='diff' lines_diff='20' />
181 </test>
182 </tests>
183
184 <configfiles>
185 <configfile name="runme">
186 <![CDATA[
187 #
188 # edgeR.Rscript
189 # updated npv 2011 for R 2.14.0 and edgeR 2.4.0 by ross
190 # Performs DGE on a count table containing n replicates of two conditions
191 #
192 # Parameters
193 #
194 # 1 - Output Dir
195
196 # Original edgeR code by: S.Lunke and A.Kaspi
197 reallybig = log10(.Machine\$double.xmax)
198 reallysmall = log10(.Machine\$double.xmin)
199 library('stringr')
200 library('gplots')
201 library('edgeR')
202 hmap2 = function(cmat,nsamp=100,outpdfname='heatmap2.pdf', TName='Treatment',group=NA,myTitle='title goes here')
203 {
204 # Perform clustering for significant pvalues after controlling FWER
205 samples = colnames(cmat)
206 gu = unique(group)
207 gn = rownames(cmat)
208 if (length(gu) == 2) {
209 col.map = function(g) {if (g==gu[1]) "#FF0000" else "#0000FF"}
210 pcols = unlist(lapply(group,col.map))
211 } else {
212 colours = rainbow(length(gu),start=0,end=4/6)
213 pcols = colours[match(group,gu)] }
214 dm = cmat[(! is.na(gn)),]
215 # remove unlabelled hm rows
216 nprobes = nrow(dm)
217 # sub = paste('Showing',nprobes,'contigs ranked for evidence of differential abundance')
218 if (nprobes > nsamp) {
219 dm =dm[1:nsamp,]
220 #sub = paste('Showing',nsamp,'contigs ranked for evidence for differential abundance out of',nprobes,'total')
221 }
222 newcolnames = substr(colnames(dm),1,20)
223 colnames(dm) = newcolnames
224 pdf(outpdfname)
225 heatmap.2(dm,main=myTitle,ColSideColors=pcols,col=topo.colors(100),dendrogram="col",key=T,density.info='none',
226 Rowv=F,scale='row',trace='none',margins=c(8,8),cexRow=0.4,cexCol=0.5)
227 dev.off()
228 }
229
230 hmap = function(cmat,nmeans=4,outpdfname="heatMap.pdf",nsamp=250,TName='Treatment',group=NA,myTitle="Title goes here")
231 {
232 # for 2 groups only was
233 #col.map = function(g) {if (g==TName) "#FF0000" else "#0000FF"}
234 #pcols = unlist(lapply(group,col.map))
235 gu = unique(group)
236 colours = rainbow(length(gu),start=0.3,end=0.6)
237 pcols = colours[match(group,gu)]
238 nrows = nrow(cmat)
239 mtitle = paste(myTitle,'Heatmap: n contigs =',nrows)
240 if (nrows > nsamp) {
241 cmat = cmat[c(1:nsamp),]
242 mtitle = paste('Heatmap: Top ',nsamp,' DE contigs (of ',nrows,')',sep='')
243 }
244 newcolnames = substr(colnames(cmat),1,20)
245 colnames(cmat) = newcolnames
246 pdf(outpdfname)
247 heatmap(cmat,scale='row',main=mtitle,cexRow=0.3,cexCol=0.4,Rowv=NA,ColSideColors=pcols)
248 dev.off()
249 }
250
251 qqPlot = function(descr='qqplot',pvector, outpdf='qqplot.pdf',...)
252 # stolen from https://gist.github.com/703512
253 {
254 o = -log10(sort(pvector,decreasing=F))
255 e = -log10( 1:length(o)/length(o) )
256 o[o==-Inf] = reallysmall
257 o[o==Inf] = reallybig
258 maint = descr
259 pdf(outpdf)
260 plot(e,o,pch=19,cex=1, main=maint, ...,
261 xlab=expression(Expected~~-log[10](italic(p))),
262 ylab=expression(Observed~~-log[10](italic(p))),
263 xlim=c(0,max(e)), ylim=c(0,max(o)))
264 lines(e,e,col="red")
265 grid(col = "lightgray", lty = "dotted")
266 dev.off()
267 }
268
269 smearPlot = function(DGEList,deTags, outSmear, outMain)
270 {
271 pdf(outSmear)
272 plotSmear(DGEList,de.tags=deTags,main=outMain)
273 grid(col="lightgray", lty="dotted")
274 dev.off()
275 }
276
277 boxPlot = function(rawrs,cleanrs,maint,myTitle,pdfname)
278 { #
279 nc = ncol(rawrs)
280 #### for (i in c(1:nc)) {rawrs[(rawrs[,i] < 0),i] = NA}
281 fullnames = colnames(rawrs)
282 newcolnames = substr(colnames(rawrs),1,20)
283 colnames(rawrs) = newcolnames
284 newcolnames = substr(colnames(cleanrs),1,20)
285 colnames(cleanrs) = newcolnames
286 defpar = par(no.readonly=T)
287 print.noquote('raw contig counts by sample:')
288 print.noquote(summary(rawrs))
289 print.noquote('normalised contig counts by sample:')
290 print.noquote(summary(cleanrs))
291 pdf(pdfname)
292 par(mfrow=c(1,2))
293 boxplot(rawrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('Raw:',maint))
294 grid(col="lightgray",lty="dotted")
295 boxplot(cleanrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('After ',maint))
296 grid(col="lightgray",lty="dotted")
297 dev.off()
298 pdfname = "sample_counts_histogram.pdf"
299 nc = ncol(rawrs)
300 print.noquote(paste('Using ncol rawrs=',nc))
301 ncroot = round(sqrt(nc))
302 if (ncroot*ncroot < nc) { ncroot = ncroot + 1 }
303 m = c()
304 for (i in c(1:nc)) {
305 rhist = hist(rawrs[,i],breaks=100,plot=F)
306 m = append(m,max(rhist\$counts))
307 }
308 ymax = max(m)
309 ncols = length(fullnames)
310 if (ncols > 20)
311 {
312 scale = 7*ncols/20
313 pdf(pdfname,width=scale,height=scale)
314 } else {
315 pdf(pdfname)
316 }
317 par(mfrow=c(ncroot,ncroot))
318 for (i in c(1:nc)) {
319 hist(rawrs[,i], main=paste("Contig logcount",i), xlab='log raw count', col="maroon",
320 breaks=100,sub=fullnames[i],cex=0.8,ylim=c(0,ymax))
321 }
322 dev.off()
323 par(defpar)
324
325 }
326
327 cumPlot = function(rawrs,cleanrs,maint,myTitle)
328 { # updated to use ecdf
329 pdfname = "Filtering_rowsum_bar_charts.pdf"
330 defpar = par(no.readonly=T)
331 lrs = log(rawrs,10)
332 lim = max(lrs)
333 pdf(pdfname)
334 par(mfrow=c(2,1))
335 hist(lrs,breaks=100,main=paste('Before:',maint),xlab="# Reads (log)",
336 ylab="Count",col="maroon",sub=myTitle, xlim=c(0,lim),las=1)
337 grid(col="lightgray", lty="dotted")
338 lrs = log(cleanrs,10)
339 hist(lrs,breaks=100,main=paste('After:',maint),xlab="# Reads (log)",
340 ylab="Count",col="maroon",sub=myTitle,xlim=c(0,lim),las=1)
341 grid(col="lightgray", lty="dotted")
342 dev.off()
343 par(defpar)
344 }
345
346 cumPlot1 = function(rawrs,cleanrs,maint,myTitle)
347 { # updated to use ecdf
348 pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_')
349 pdf(pdfname)
350 par(mfrow=c(2,1))
351 lastx = max(rawrs)
352 rawe = knots(ecdf(rawrs))
353 cleane = knots(ecdf(cleanrs))
354 cy = 1:length(cleane)/length(cleane)
355 ry = 1:length(rawe)/length(rawe)
356 plot(rawe,ry,type='l',main=paste('Before',maint),xlab="Log Contig Total Reads",
357 ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle)
358 grid(col="blue")
359 plot(cleane,cy,type='l',main=paste('After',maint),xlab="Log Contig Total Reads",
360 ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle)
361 grid(col="blue")
362 dev.off()
363 }
364
365
366
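# doGSEA: descriptive note - this helper runs the camera competitive gene set test from edgeR/limma
# over each gene set read from the supplied gmt file(s), adjusts the per-set p values with p.adjust,
# and writes separate tables of up- and down-regulated gene sets. It is only reached if doCamera is
# set to TRUE by hand, since the camera user interface is commented out in this "_nocamera" version.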
367 doGSEA = function(y=NULL,design=NULL,histgmt="",
368 bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
369 ntest=0, myTitle="myTitle", outfname="GSEA.xls", minnin=5, maxnin=2000,fdrthresh=0.05,fdrtype="BH")
370 {
371 sink('Camera.log')
372 genesets = c()
373 if (bigmt > "")
374 {
375 bigenesets = readLines(bigmt)
376 genesets = bigenesets
377 }
378 if (histgmt > "")
379 {
380 hgenesets = readLines(histgmt)
381 if (bigmt > "") {
382 genesets = rbind(genesets,hgenesets)
383 } else {
384 genesets = hgenesets
385 } # use only history if no bigmt supplied
386 }
387 print.noquote(paste("@@@read",length(genesets), 'genesets from',histgmt,bigmt))
388 genesets = strsplit(genesets,'\t') # tabular. genesetid\tURLorwhatever\tgene_1\t..\tgene_n
389 outf = outfname
390 head=paste(myTitle,'edgeR GSEA')
391 write(head,file=outfname,append=F)
392 ntest=length(genesets)
393 urownames = toupper(rownames(y))
394 upcam = c()
395 downcam = c()
396 for (i in 1:ntest) {
397 gs = unlist(genesets[i])
398 g = gs[1] # geneset_id
399 u = gs[2]
400 if (u > "") { u = paste("<a href=\'",u,"\'>",u,"</a>",sep="") }
401 glist = gs[3:length(gs)] # member gene symbols
402 glist = toupper(glist)
403 inglist = urownames %in% glist
404 nin = sum(inglist)
405 if ((nin > minnin) && (nin < maxnin)) {
406 ### print(paste('@@found',sum(inglist),'genes in glist'))
407 camres = camera(y=y,index=inglist,design=design)
408 if (! is.null(camres)) {
409 rownames(camres) = g # gene set name
410 camres = cbind(GeneSet=g,URL=u,camres)
411 if (camres\$Direction == "Up")
412 {
413 upcam = rbind(upcam,camres) } else {
414 downcam = rbind(downcam,camres)
415 }
416 }
417 }
418 }
419 uscam = upcam[order(upcam\$PValue),]
420 unadjp = uscam\$PValue
421 uscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
422 nup = max(10,sum((uscam\$adjPValue < fdrthresh)))
423 dscam = downcam[order(downcam\$PValue),]
424 unadjp = dscam\$PValue
425 dscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
426 ndown = max(10,sum((dscam\$adjPValue < fdrthresh)))
427 write.table(uscam,file=paste('camera_up',outfname,sep='_'),quote=F,sep='\t',row.names=F)
428 write.table(dscam,file=paste('camera_down',outfname,sep='_'),quote=F,sep='\t',row.names=F)
429 print.noquote(paste('@@@@@ Camera up top',nup,'gene sets:'))
430 write.table(head(uscam,nup),file="",quote=F,sep='\t',row.names=F)
431 print.noquote(paste('@@@@@ Camera down top',ndown,'gene sets:'))
432 write.table(head(dscam,ndown),file="",quote=F,sep='\t',row.names=F)
433 sink()
434 }
435
436
437
438
439 doGSEAatonce = function(y=NULL,design=NULL,histgmt="",
440 bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
441 ntest=0, myTitle="myTitle", outfname="GSEA.xls", minnin=5, maxnin=2000,fdrthresh=0.05,fdrtype="BH")
442 {
443 sink('Camera.log')
444 genesets = c()
445 if (bigmt > "")
446 {
447 bigenesets = readLines(bigmt)
448 genesets = bigenesets
449 }
450 if (histgmt > "")
451 {
452 hgenesets = readLines(histgmt)
453 if (bigmt > "") {
454 genesets = rbind(genesets,hgenesets)
455 } else {
456 genesets = hgenesets
457 } # use only history if no bigmt supplied
458 }
459 print.noquote(paste("@@@read",length(genesets), 'genesets from',histgmt,bigmt))
460 genesets = strsplit(genesets,'\t') # tabular. genesetid\tURLorwhatever\tgene_1\t..\tgene_n
461 outf = outfname
462 head=paste(myTitle,'edgeR GSEA')
463 write(head,file=outfname,append=F)
464 ntest=length(genesets)
465 urownames = toupper(rownames(y))
466 upcam = c()
467 downcam = c()
468 incam = c()
469 urls = c()
470 gsids = c()
471 for (i in 1:ntest) {
472 gs = unlist(genesets[i])
473 gsid = gs[1] # geneset_id
474 url = gs[2]
475 if (url > "") { url = paste("<a href=\'",url,"\'>",url,"</a>",sep="") }
476 glist = gs[3:length(gs)] # member gene symbols
477 glist = toupper(glist)
478 inglist = urownames %in% glist
479 nin = sum(inglist)
480 if ((nin > minnin) && (nin < maxnin)) {
481 incam = c(incam,inglist)
482 gsids = c(gsids,gsid)
483 urls = c(urls,url)
484 }
485 }
486 incam = as.list(incam)
487 names(incam) = gsids
488 allcam = camera(y=y,index=incam,design=design)
489 allcamres = cbind(geneset=gsids,allcam,URL=urls)
490 for (i in 1:ntest) {
491 camres = allcamres[i]
492 res = try((camres\$Direction == "Up"))
493 if ("try-error" %in% class(res)) {
494 cat("test failed, camres = :")
495 print.noquote(camres)
496 } else { if (camres\$Direction == "Up")
497 { upcam = rbind(upcam,camres)
498 } else { downcam = rbind(downcam,camres)
499 }
500
501 }
502 }
503 uscam = upcam[order(upcam\$PValue),]
504 unadjp = uscam\$PValue
505 uscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
506 nup = max(10,sum((uscam\$adjPValue < fdrthresh)))
507 dscam = downcam[order(downcam\$PValue),]
508 unadjp = dscam\$PValue
509 dscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
510 ndown = max(10,sum((dscam\$adjPValue < fdrthresh)))
511 write.table(uscam,file=paste('camera_up',outfname,sep='_'),quote=F,sep='\t',row.names=F)
512 write.table(dscam,file=paste('camera_down',outfname,sep='_'),quote=F,sep='\t',row.names=F)
513 print.noquote(paste('@@@@@ Camera up top',nup,'gene sets:'))
514 write.table(head(uscam,nup),file="",quote=F,sep='\t',row.names=F)
515 print.noquote(paste('@@@@@ Camera down top',ndown,'gene sets:'))
516 write.table(head(dscam,ndown),file="",quote=F,sep='\t',row.names=F)
517 sink()
518 }
519
520
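# edgeIt: descriptive note - this is the main driver. It filters the count matrix (quantile or
# counts-per-million filter), builds the design matrix (optionally blocking on subject), then runs
# whichever of edgeR, DESeq2 and voom/limma were requested, writing sorted top tables, QC plots and,
# when more than one method is run, a Venn diagram of the overlapping significant contigs.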
521 edgeIt = function (Count_Matrix=c(),group=c(),out_edgeR=F,out_VOOM=F,out_DESeq2=F,fdrtype='fdr',priordf=5,
522 fdrthresh=0.05,outputdir='.', myTitle='Differential Counts',libSize=c(),useNDF=F,
523 filterquantile=0.2, subjects=c(),mydesign=NULL,
524 doDESeq2=T,doVoom=T,doCamera=T,doedgeR=T,org='hg19',
525 histgmt="", bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
526 doCook=F,DESeq_fitType="parametric")
527 {
528 # Error handling
529 if (length(unique(group))!=2){
530 print("Number of conditions identified in experiment does not equal 2")
531 q()
532 }
533 require(edgeR)
534 options(width = 512)
535 mt = paste(unlist(strsplit(myTitle,'_')),collapse=" ")
536 allN = nrow(Count_Matrix)
537 nscut = round(ncol(Count_Matrix)/2)
538 colTotmillionreads = colSums(Count_Matrix)/1e6
539 counts.dataframe = as.data.frame(c())
540 rawrs = rowSums(Count_Matrix)
541 nonzerod = Count_Matrix[(rawrs > 0),] # remove all zero count genes
542 nzN = nrow(nonzerod)
543 nzrs = rowSums(nonzerod)
544 zN = allN - nzN
545 print('# Quantiles for non-zero row counts:',quote=F)
546 print(quantile(nzrs,probs=seq(0,1,0.1)),quote=F)
547 if (useNDF == T)
548 {
549 gt1rpin3 = rowSums(Count_Matrix/expandAsMatrix(colTotmillionreads,dim(Count_Matrix)) >= 1) >= nscut
550 lo = colSums(Count_Matrix[!gt1rpin3,])
551 workCM = Count_Matrix[gt1rpin3,]
552 cleanrs = rowSums(workCM)
553 cleanN = length(cleanrs)
554 meth = paste( "After removing",length(lo),"contigs with fewer than ",nscut," sample read counts >= 1 per million, there are",sep="")
555 print(paste("Read",allN,"contigs. Removed",zN,"contigs with no reads.",meth,cleanN,"contigs"),quote=F)
556 maint = paste('Filter >=1/million reads in >=',nscut,'samples')
557 } else {
558 useme = (nzrs > quantile(nzrs,filterquantile))
559 workCM = nonzerod[useme,]
560 lo = colSums(nonzerod[!useme,])
561 cleanrs = rowSums(workCM)
562 cleanN = length(cleanrs)
563 meth = paste("After filtering at count quantile =",filterquantile,", there are",sep="")
564 print(paste('Read',allN,"contigs. Removed",zN,"with no reads.",meth,cleanN,"contigs"),quote=F)
565 maint = paste('Filter below',filterquantile,'quantile')
566 }
567 cumPlot(rawrs=rawrs,cleanrs=cleanrs,maint=maint,myTitle=myTitle)
568 allgenes = rownames(workCM)
569 reg = "^chr([0-9]+):([0-9]+)-([0-9]+)"
570 genecards="<a href=\'http://www.genecards.org/index.php?path=/Search/keyword/"
571 ucsc = paste("<a href=\'http://genome.ucsc.edu/cgi-bin/hgTracks?db=",org,sep='')
572 testreg = str_match(allgenes,reg)
573 if (sum(!is.na(testreg[,1]))/length(testreg[,1]) > 0.8) # is ucsc style string
574 {
575 print("@@ using ucsc substitution for urls")
576 contigurls = paste0(ucsc,"&amp;position=chr",testreg[,2],":",testreg[,3],"-",testreg[,4],"\'>",allgenes,"</a>")
577 } else {
578 print.noquote("@@ using genecards substitution for urls")
579 contigurls = paste0(genecards,allgenes,"\'>",allgenes,"</a>")
580 }
581 print(paste("# Total low count contigs per sample = ",paste(lo,collapse=',')),quote=F)
582 cmrowsums = rowSums(workCM)
583 TName=unique(group)[1]
584 CName=unique(group)[2]
585 if (is.null(mydesign)) {
586 if (length(subjects) == 0)
587 {
588 mydesign = model.matrix(~group)
589 }
590 else {
591 subjf = factor(subjects)
592 mydesign = model.matrix(~subjf+group) # we block on subject so make group last to simplify finding it
593 }
594 }
595 print.noquote(paste('Using samples:',paste(colnames(workCM),collapse=',')))
596 print.noquote('Using design matrix:')
597 print.noquote(mydesign)
598 if (doedgeR) {
599 sink('edgeR.log')
600 #### Setup DGEList object
601 DGEList = DGEList(counts=workCM, group = group)
602 DGEList = calcNormFactors(DGEList)
603
604 DGEList = estimateGLMCommonDisp(DGEList,mydesign)
605 comdisp = DGEList\$common.dispersion
606 DGEList = estimateGLMTrendedDisp(DGEList,mydesign)
607 if (priordf > 0) {
608 print.noquote(paste("prior.df =",priordf))
609 DGEList = estimateGLMTagwiseDisp(DGEList,mydesign,prior.df = priordf)
610 } else {
611 DGEList = estimateGLMTagwiseDisp(DGEList,mydesign)
612 }
613 DGLM = glmFit(DGEList,design=mydesign)
614 DE = glmLRT(DGLM,coef=ncol(DGLM\$design)) # always last one - subject is first if needed
615 efflib = DGEList\$samples\$lib.size*DGEList\$samples\$norm.factors
616 normData = (1e+06*DGEList\$counts/efflib)
617 uoutput = cbind(
618 Name=as.character(rownames(DGEList\$counts)),
619 DE\$table,
620 adj.p.value=p.adjust(DE\$table\$PValue, method=fdrtype),
621 Dispersion=DGEList\$tagwise.dispersion,totreads=cmrowsums,normData,
622 DGEList\$counts
623 )
624 soutput = uoutput[order(DE\$table\$PValue),] # sorted into p value order - for quick toptable
625 goodness = gof(DGLM, pcutoff=fdrthresh)
626 if (sum(goodness\$outlier) > 0) {
627 print.noquote('GLM outliers:')
628 print(paste(rownames(DGLM)[(goodness\$outlier)],collapse=','),quote=F)
629 } else {
630 print('No GLM fit outlier genes found\n')
631 }
632 z = limma::zscoreGamma(goodness\$gof.statistic, shape=goodness\$df/2, scale=2)
633 pdf("edgeR_GoodnessofFit.pdf")
634 qq = qqnorm(z, panel.first=grid(), main="tagwise dispersion")
635 abline(0,1,lwd=3)
636 points(qq\$x[goodness\$outlier],qq\$y[goodness\$outlier], pch=16, col="maroon")
637 dev.off()
638 estpriorn = getPriorN(DGEList)
639 print(paste("Common Dispersion =",comdisp,"CV = ",sqrt(comdisp),"getPriorN = ",estpriorn),quote=F)
640 efflib = DGEList\$samples\$lib.size*DGEList\$samples\$norm.factors
641 normData = (1e+06*DGEList\$counts)/efflib
642 lnormData = log(normData + 1e-6,10)
643 uniqueg = unique(group)
644 #### Plot MDS
645 sample_colors = match(group,levels(group))
646 sampleTypes = levels(factor(group))
647 print.noquote(sampleTypes)
648 pdf("edgeR_MDSplot.pdf")
649 plotMDS.DGEList(DGEList,main=paste("edgeR MDS for",myTitle),cex=0.5,col=sample_colors,pch=sample_colors)
650 legend(x="topleft", legend = sampleTypes,col=c(1:length(sampleTypes)), pch=19)
651 grid(col="blue")
652 dev.off()
653 colnames(normData) = paste( colnames(normData),'N',sep="_")
654 print(paste('Raw sample read totals',paste(colSums(nonzerod,na.rm=T),collapse=',')))
655 nzd = data.frame(log(nonzerod + 1e-2,10))
656 try( boxPlot(rawrs=nzd,cleanrs=lnormData,maint='TMM Normalisation',myTitle=myTitle,pdfname="edgeR_raw_norm_counts_box.pdf") )
657 write.table(soutput,file=out_edgeR, quote=FALSE, sep="\t",row.names=F)
658 tt = cbind(
659 Name=as.character(rownames(DGEList\$counts)),
660 DE\$table,
661 adj.p.value=p.adjust(DE\$table\$PValue, method=fdrtype),
662 Dispersion=DGEList\$tagwise.dispersion,totreads=cmrowsums
663 )
664 print.noquote("# edgeR Top tags\n")
665 tt = cbind(tt,URL=contigurls) # add to end so table isn't laid out strangely
666 tt = tt[order(DE\$table\$PValue),]
667 print.noquote(tt[1:50,])
668 deTags = rownames(uoutput[uoutput\$adj.p.value < fdrthresh,])
669 nsig = length(deTags)
670 print(paste('#',nsig,'tags significant at adj p=',fdrthresh),quote=F)
671 deColours = ifelse(rownames(DGEList\$counts) %in% deTags,'red','black')
672 pdf("edgeR_BCV_vs_abundance.pdf")
673 plotBCV(DGEList, cex=0.3, main="Biological CV vs abundance",col.tagwise=deColours)
674 dev.off()
675 dg = DGEList[order(DE\$table\$PValue),]
676 #normData = (1e+06 * dg\$counts/expandAsMatrix(dg\$samples\$lib.size, dim(dg)))
677 efflib = dg\$samples\$lib.size*dg\$samples\$norm.factors
678 normData = (1e+06*dg\$counts/efflib)
679 outpdfname="edgeR_top_100_heatmap.pdf"
680 hmap2(normData,nsamp=100,TName=TName,group=group,outpdfname=outpdfname,myTitle=paste('edgeR Heatmap',myTitle))
681 outSmear = "edgeR_smearplot.pdf"
682 outMain = paste("Smear Plot for ",TName,' Vs ',CName,' (FDR@',fdrthresh,' N = ',nsig,')',sep='')
683 smearPlot(DGEList=DGEList,deTags=deTags, outSmear=outSmear, outMain = outMain)
684 qqPlot(descr=paste(myTitle,'edgeR adj p QQ plot'),pvector=tt\$adj.p.value,outpdf='edgeR_qqplot.pdf')
685 norm.factor = DGEList\$samples\$norm.factors
686 topresults.edgeR = soutput[which(soutput\$adj.p.value < fdrthresh), ]
687 edgeRcountsindex = which(allgenes %in% rownames(topresults.edgeR))
688 edgeRcounts = rep(0, length(allgenes))
689 edgeRcounts[edgeRcountsindex] = 1 # Create venn diagram of hits
690 sink()
691 } ### doedgeR
692 if (doDESeq2 == T)
693 {
694 sink("DESeq2.log")
695 # DESeq2
696 require('DESeq2')
697 library('RColorBrewer')
698 if (length(subjects) == 0)
699 {
700 pdata = data.frame(Name=colnames(workCM),Rx=group,row.names=colnames(workCM))
701 deSEQds = DESeqDataSetFromMatrix(countData = workCM, colData = pdata, design = formula(~ Rx))
702 } else {
703 pdata = data.frame(Name=colnames(workCM),Rx=group,subjects=subjects,row.names=colnames(workCM))
704 deSEQds = DESeqDataSetFromMatrix(countData = workCM, colData = pdata, design = formula(~ subjects + Rx))
705 }
706 #DESeq2 = DESeq(deSEQds,fitType='local',pAdjustMethod=fdrtype)
707 #rDESeq = results(DESeq2)
708 #newCountDataSet(workCM, group)
709 deSeqDatsizefac = estimateSizeFactors(deSEQds)
710 deSeqDatdisp = estimateDispersions(deSeqDatsizefac,fitType=DESeq_fitType)
711 resDESeq = nbinomWaldTest(deSeqDatdisp, pAdjustMethod=fdrtype)
712 rDESeq = as.data.frame(results(resDESeq))
713 rDESeq = cbind(Contig=rownames(workCM),rDESeq,NReads=cmrowsums)
714 srDESeq = rDESeq[order(rDESeq\$pvalue),]
715 write.table(srDESeq,file=out_DESeq2, quote=FALSE, sep="\t",row.names=F)
716 qqPlot(descr=paste(myTitle,'DESeq2 adj p qq plot'),pvector=rDESeq\$padj,outpdf='DESeq2_qqplot.pdf')
717 cat("# DESeq top 50\n")
718 rDESeq = cbind(Contig=rownames(workCM),rDESeq,NReads=cmrowsums,URL=contigurls)
719 srDESeq = rDESeq[order(rDESeq\$pvalue),]
720 print.noquote(srDESeq[1:50,])
721 topresults.DESeq = rDESeq[which(rDESeq\$padj < fdrthresh), ]
722 DESeqcountsindex = which(allgenes %in% rownames(topresults.DESeq))
723 DESeqcounts = rep(0, length(allgenes))
724 DESeqcounts[DESeqcountsindex] = 1
725 pdf("DESeq2_dispersion_estimates.pdf")
726 plotDispEsts(resDESeq)
727 dev.off()
728 ysmall = abs(min(rDESeq\$log2FoldChange))
729 ybig = abs(max(rDESeq\$log2FoldChange))
730 ylimit = min(4,ysmall,ybig)
731 pdf("DESeq2_MA_plot.pdf")
732 plotMA(resDESeq,main=paste(myTitle,"DESeq2 MA plot"),ylim=c(-ylimit,ylimit))
733 dev.off()
734 rlogres = rlogTransformation(resDESeq)
735 sampledists = dist( t( assay(rlogres) ) )
736 sdmat = as.matrix(sampledists)
737 pdf("DESeq2_sample_distance_plot.pdf")
738 heatmap.2(sdmat,trace="none",main=paste(myTitle,"DESeq2 sample distances"),
739 col = colorRampPalette( rev(brewer.pal(9, "RdBu")) )(255))
740 dev.off()
741 ###outpdfname="DESeq2_top50_heatmap.pdf"
742 ###hmap2(sresDESeq,nsamp=50,TName=TName,group=group,outpdfname=outpdfname,myTitle=paste('DESeq2 vst rlog Heatmap',myTitle))
743 sink()
744 result = try( (ppca = plotPCA( varianceStabilizingTransformation(deSeqDatdisp,blind=T), intgroup=c("Rx","Name")) ) )
745 if ("try-error" %in% class(result)) {
746 print.noquote('DESeq2 plotPCA failed.')
747 } else {
748 pdf("DESeq2_PCA_plot.pdf")
749 #### wtf - print? Seems needed to get this to work
750 print(ppca)
751 dev.off()
752 }
753 }
754
755 if (doVoom == T) {
756 sink('Voom.log')
757 if (doedgeR == F) {
758 #### Setup DGEList object
759 DGEList = DGEList(counts=workCM, group = group)
760 DGEList = calcNormFactors(DGEList)
761 DGEList = estimateGLMCommonDisp(DGEList,mydesign)
762 DGEList = estimateGLMTrendedDisp(DGEList,mydesign)
763 DGEList = estimateGLMTagwiseDisp(DGEList,mydesign)
765 norm.factor = DGEList\$samples\$norm.factors
766 }
767 pdf("Voom_mean_variance_plot.pdf")
768 dat.voomed = voom(DGEList, mydesign, plot = TRUE, lib.size = colSums(workCM) * norm.factor)
769 dev.off()
770 # Use limma to fit data
771 fit = lmFit(dat.voomed, mydesign)
772 fit = eBayes(fit)
773 rvoom = topTable(fit, coef = length(colnames(mydesign)), adj = fdrtype, n = Inf, sort="none")
774 qqPlot(descr=paste(myTitle,'Voom-limma adj p QQ plot'),pvector=rvoom\$adj.P.Val,outpdf='Voom_qqplot.pdf')
775 rownames(rvoom) = rownames(workCM)
776 rvoom = cbind(rvoom,NReads=cmrowsums)
777 srvoom = rvoom[order(rvoom\$P.Value),]
778 write.table(srvoom,file=out_VOOM, quote=FALSE, sep="\t",row.names=F)
779 rvoom = cbind(rvoom,URL=contigurls)
780 deTags = rownames(rvoom[rvoom\$adj.P.Val < fdrthresh,])
781 nsig = length(deTags)
782 cat("# Voom top 50\n")
783 print(srvoom[1:50,])
784 normData = dat.voomed\$E[order(rvoom\$P.Value),]
785 outpdfname="Voom_top_100_heatmap.pdf"
786 hmap2(normData,nsamp=100,TName=TName,group=group,outpdfname=outpdfname,myTitle=paste('VOOM Heatmap',myTitle))
787 outSmear = "Voom_smearplot.pdf"
788 outMain = paste("Smear Plot for ",TName,' Vs ',CName,' (FDR@',fdrthresh,' N = ',nsig,')',sep='')
789 smearPlot(DGEList=DGEList,deTags=deTags, outSmear=outSmear, outMain = outMain)
790 qqPlot(descr=paste(myTitle,'VOOM adj p QQ plot'),pvector=srvoom\$adj.P.Val,outpdf='Voom_qqplot.pdf')
791 # Use an FDR cutoff to find interesting samples for edgeR, DESeq and voom/limma
792 topresults.voom = rvoom[which(rvoom\$adj.P.Val < fdrthresh), ]
793 voomcountsindex = which(allgenes %in% rownames(topresults.voom))
794 voomcounts = rep(0, length(allgenes))
795 voomcounts[voomcountsindex] = 1
796 sink()
797 }
798
799 if (doCamera) {
800 doGSEA(y=DGEList,design=mydesign,histgmt=histgmt,bigmt=bigmt,ntest=20,myTitle=myTitle,
801 outfname=paste(mt,"GSEA.xls",sep="_"),fdrthresh=fdrthresh,fdrtype=fdrtype)
802 }
803
804 if ((doDESeq2==T) || (doVoom==T) || (doedgeR==T)) {
805 if ((doVoom==T) && (doDESeq2==T) && (doedgeR==T)) {
806 vennmain = paste(mt,'Voom,edgeR and DESeq2 overlap at FDR=',fdrthresh)
807 counts.dataframe = data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts,
808 VOOM_limma = voomcounts, row.names = allgenes)
809 } else if ((doDESeq2==T) && (doedgeR==T)) {
810 vennmain = paste(mt,'DESeq2 and edgeR overlap at FDR=',fdrthresh)
811 counts.dataframe = data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts, row.names = allgenes)
812 } else if ((doVoom==T) && (doedgeR==T)) {
813 vennmain = paste(mt,'Voom and edgeR overlap at FDR=',fdrthresh)
814 counts.dataframe = data.frame(edgeR = edgeRcounts, VOOM_limma = voomcounts, row.names = allgenes)
815 }
816
817 if (nrow(counts.dataframe) > 1) {
818 counts.venn = vennCounts(counts.dataframe)
819 vennf = "Venn_significant_genes_overlap.pdf"
820 pdf(vennf)
821 vennDiagram(counts.venn,main=vennmain,col="maroon")
822 dev.off()
823 }
824 } #### doDESeq2 or doVoom
825
826 }
827 #### Done
828
829 ###sink(stdout(),append=T,type="message")
830 builtin_gmt = ""
831 history_gmt = ""
832 history_gmt_name = ""
833 out_edgeR = F
834 out_DESeq2 = F
835 out_VOOM = "$out_VOOM"
836 doDESeq2 = $DESeq2.doDESeq2 # make these T or F
837 doVoom = $doVoom
838 doCamera = F
839 doedgeR = $edgeR.doedgeR
840 edgeR_priordf = 0
841
842
843 #if $doVoom == "T":
844 out_VOOM = "$out_VOOM"
845 #end if
846
847 #if $DESeq2.doDESeq2 == "T":
848 out_DESeq2 = "$out_DESeq2"
849 DESeq_fitType = "$DESeq2.DESeq_fitType"
850 #end if
851
852 #if $edgeR.doedgeR == "T":
853 out_edgeR = "$out_edgeR"
854 edgeR_priordf = $edgeR.edgeR_priordf
855 #end if
856
857
858 if (sum(c(doedgeR,doVoom,doDESeq2)) == 0)
859 {
860 write("No methods chosen - nothing to do! Please try again after choosing one or more methods", stderr())
861 quit(save="no",status=2)
862 }
863
864 Out_Dir = "$html_file.files_path"
865 Input = "$input1"
866 TreatmentName = "$treatment_name"
867 TreatmentCols = "$Treat_cols"
868 ControlName = "$control_name"
869 ControlCols= "$Control_cols"
870 org = "$input1.dbkey"
871 if (org == "") { org = "hg19"}
872 fdrtype = "$fdrtype"
873 fdrthresh = $fdrthresh
874 useNDF = $useNDF
875 fQ = $fQ # non-differential centile cutoff
876 myTitle = "$title"
877 sids = strsplit("$subjectids",',')
878 subjects = unlist(sids)
879 nsubj = length(subjects)
880 TCols = as.numeric(strsplit(TreatmentCols,",")[[1]])-1
881 CCols = as.numeric(strsplit(ControlCols,",")[[1]])-1
882 cat('Got TCols=')
883 cat(TCols)
884 cat('; CCols=')
885 cat(CCols)
886 cat('\n')
887 useCols = c(TCols,CCols)
888 if (file.exists(Out_Dir) == F) dir.create(Out_Dir)
889 Count_Matrix = read.table(Input,header=T,row.names=1,sep='\t') #Load tab file assume header
890 snames = colnames(Count_Matrix)
891 nsamples = length(snames)
892 if (nsubj > 0 & nsubj != nsamples) {
893 options("show.error.messages"=T)
894 mess = paste('Fatal error: Supplied subject id list',paste(subjects,collapse=','),
895 'has length',nsubj,'but there are',nsamples,'samples',paste(snames,collapse=','))
896 write(mess, stderr())
897 quit(save="no",status=4)
898 }
899 if (length(subjects) != 0) {subjects = subjects[useCols]}
900 Count_Matrix = Count_Matrix[,useCols] ### reorder columns
901 rn = rownames(Count_Matrix)
902 islib = rn %in% c('librarySize','NotInBedRegions')
903 LibSizes = Count_Matrix[subset(rn,islib),][1] # take first
904 Count_Matrix = Count_Matrix[subset(rn,! islib),]
905 group = c(rep(TreatmentName,length(TCols)), rep(ControlName,length(CCols)) ) #Build a group descriptor
906 group = factor(group, levels=c(ControlName,TreatmentName))
907 colnames(Count_Matrix) = paste(group,colnames(Count_Matrix),sep="_") #Relabel columns
908 results = edgeIt(Count_Matrix=Count_Matrix,group=group, out_edgeR=out_edgeR, out_VOOM=out_VOOM, out_DESeq2=out_DESeq2,
909 fdrtype=fdrtype,mydesign=NULL,priordf=edgeR_priordf,fdrthresh=fdrthresh,outputdir='.',
910 myTitle=myTitle,useNDF=useNDF,libSize=c(),filterquantile=fQ,subjects=subjects,
911 doDESeq2=doDESeq2,doVoom=doVoom,doCamera=doCamera,doedgeR=doedgeR,org=org,
912 histgmt=history_gmt,bigmt=builtin_gmt,DESeq_fitType=DESeq_fitType)
913 sessionInfo()
914 ]]>
915 </configfile>
916 </configfiles>
917 <help>
918
919 **What it does**
920
921 Allows short read sequence counts from controlled experiments to be analysed for differentially expressed genes.
922 Optionally adds a term for subject if not all samples are independent or if some other factor needs to be blocked in the design.
923
924 **Input**
925
926 Requires a count matrix as a tabular file. These are best made using the companion HTSeq_ based counter Galaxy wrapper
927 and your fave gene model to generate inputs. Each row is a genomic feature (gene or exon eg) and each column the
928 non-negative integer count of reads from one sample overlapping the feature.
929 The matrix must have a header row uniquely identifying the source samples, and unique row names in
930 the first column. Typically the row names are gene symbols or probe ids for downstream use in GSEA and other methods.
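
For orientation, here is a minimal sketch of the expected layout and of how the embedded script reads it (the contig ids, sample names, counts and the filename are invented for illustration)::

    # Contig            heart_1  heart_2  liver_1  liver_2
    # chr1:4910-5100         11        9      412      385
    # chr2:880-1200           0        2        7        3
    Count_Matrix = read.table("counts.tab", header=TRUE, row.names=1, sep="\t")  # "counts.tab" is a made-up name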
931
932 **Specifying comparisons**
933
934 This is basically dumbed down for two factors - case vs control.
935
936 More complex interfaces are possible but painful at present.
937 Probably need to specify a phenotype file to do this better.
938 Work in progress. Send code.
939
940 If you have (eg) paired samples and wish to include a term in the GLM to account for some other factor (subject, in the case of paired samples),
941 supply a comma separated list of labels, one for every sample column (whether used in the analysis or not!), indicating (eg) which subject each sample came from.
942 Leave the field empty if all samples are independent.
943 If not empty, there must be exactly as many labels in the supplied list as there are columns (samples) in the count matrix.
944 Labels for samples that are not in the analysis *must* still be present in the string as filler, even though they are not used.
945
946 So if you have 2 pairs out of 6 samples, you need to give unique labels to the unpaired samples -
947 eg if you had 6 samples with the first two independent and the remaining two pairs each coming from a single subject, you might use
948 8,9,1,1,2,2
949 as subject IDs to indicate two paired samples from the same subject in columns 3/4 and 5/6, as shown in the sketch below.
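
To make the bookkeeping concrete, the sketch below shows roughly what the embedded script does with that string (the group assignments here are invented for illustration; the real tool takes them from the treatment and control column selections)::

    subjectids = "8,9,1,1,2,2"                                  # one label per column of the count matrix
    group = factor(c("Treatment","Control","Treatment","Control","Treatment","Control"))
    subjects = unlist(strsplit(subjectids, ","))
    stopifnot(length(subjects) == length(group))                # the tool exits with status 4 if these differ
    subjf = factor(subjects)
    mydesign = model.matrix(~subjf + group)                     # block on subject; group stays last in the design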
950
951 **Methods available**
952
953 You can run 3 popular Bioconductor packages available for count data.
954
955 edgeR - see edgeR_ for details
956
957 VOOM/limma - see limma_VOOM_ for details
958
959 DESeq2 - see DESeq2_ for details
960
961 and, optionally, the camera gene set test in edgeR (not exposed in this version of the tool), which works better if MSigDB is installed.
962
963 **Outputs**
964
965 Some helpful plots and analysis results. Note that most of these are produced using R code
966 suggested by the excellent documentation and vignettes for the Bioconductor
967 packages invoked. The Tool Factory is used to automatically lay these out for you to enjoy.
968
969 **Note on Voom**
970
971 The help for voom in limma version 3.16.6 includes the following from the authors - but you should read the paper to interpret this method.
972
973 This function is intended to process RNA-Seq or ChIP-Seq data prior to linear modelling in limma.
974
975 voom is an acronym for mean-variance modelling at the observational level.
976 The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation.
977 Count data almost always show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend.
978 This function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance.
979 The weights are then used in the linear modelling process to adjust for heteroscedasticity.
980
981 In an experiment, a count value is observed for each tag in each sample. A tag-wise mean-variance trend is computed using lowess.
982 The tag-wise mean is the mean log2 count with an offset of 0.5, across samples for a given tag.
983 The tag-wise variance is the quarter-root-variance of normalized log2 counts per million values with an offset of 0.5, across samples for a given tag.
984 Tags with zero counts across all samples are not included in the lowess fit. Optional normalization is performed using normalizeBetweenArrays.
985 Using fitted values of log2 counts from a linear model fit by lmFit, variances from the mean-variance trend were interpolated for each observation.
986 This was carried out by approxfun. Inverse variance weights can be used to correct for mean-variance trend in the count data.
987
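For orientation, the Voom branch of this tool boils down to the standard limma pipeline, roughly as sketched below (a sketch only, assuming counts is the filtered count matrix and design the model matrix built as described above)::

    library(edgeR)
    library(limma)
    dge = DGEList(counts=counts)
    dge = calcNormFactors(dge)
    v = voom(dge, design, plot=TRUE)                   # estimate the mean-variance trend and per-observation weights
    fit = eBayes(lmFit(v, design))                     # weighted linear model plus moderated t statistics
    tt = topTable(fit, coef=ncol(design), number=Inf)  # top table for the last (group) coefficient
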
988
989 Author(s)
990
991 Charity Law and Gordon Smyth
992
993 References
994
995 Law, CW (2013). Precision weights for gene expression analysis. PhD Thesis. University of Melbourne, Australia.
996
997 Law, CW, Chen, Y, Shi, W, Smyth, GK (2013). Voom! Precision weights unlock linear model analysis tools for RNA-seq read counts.
998 Technical Report 1 May 2013, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.
999 http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
1000
1001 See Also
1002
1003 A voom case study is given in the edgeR User's Guide.
1004
1005 vooma is a similar function but for microarrays instead of RNA-seq.
1006
1007
1008 ***old rant on changes to Bioconductor package variable names between versions***
1009
1010 The edgeR authors made a small cosmetic change in the name of one important variable (from p.value to PValue)
1011 breaking this and all other code that assumed the old name for this variable,
1012 between edgeR2.4.4 and 2.4.6 (the version for R 2.14 as at the time of writing).
1013 This means that all code using edgeR is sensitive to the version. I think this was a very unwise thing
1014 to do because it wasted hours of my time to track down and will similarly cost other edgeR users dearly
1015 when their old scripts break. This tool currently works with 2.4.6.
1016
1017 **Note on prior.N**
1018
1019 http://seqanswers.com/forums/showthread.php?t=5591 says:
1020
1021 *prior.n*
1022
1023 The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion.
1024 You can think of it as like a "weight" for the common value. (It is actually the weight for the common likelihood
1025 in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your
1026 tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the
1027 common likelihood the weight of one observation.
1028
1029 In answer to your question, it is a good thing to squeeze the tagwise dispersions towards a common value,
1030 or else you will be using very unreliable estimates of the dispersion. I would not recommend using the value that
1031 you obtained from estimateSmoothing()---this is far too small and would result in virtually no moderation
1032 (squeezing) of the tagwise dispersions. How many samples do you have in your experiment?
1033 What is the experimental design? If you have few samples (less than 6) then I would suggest a prior.n of at least 10.
1034 If you have more samples, then the tagwise dispersion estimates will be more reliable,
1035 so you could consider using a smaller prior.n, although I would hesitate to use a prior.n less than 5.
1036
1037
1038 From Bioconductor Digest, Vol 118, Issue 5, Gordon writes:
1039
1040 Dear Dorota,
1041
1042 The important settings are prior.df and trend.
1043
1044 prior.n and prior.df are related through prior.df = prior.n * residual.df,
1045 and your experiment has residual.df = 36 - 12 = 24. So the old setting of
1046 prior.n=10 is equivalent for your data to prior.df = 240, a very large
1047 value. Going the other way, the new setting of prior.df=10 is equivalent
1048 to prior.n=10/24.
1049
1050 To recover old results with the current software you would use
1051
1052 estimateTagwiseDisp(object, prior.df=240, trend="none")
1053
1054 To get the new default from old software you would use
1055
1056 estimateTagwiseDisp(object, prior.n=10/24, trend=TRUE)
1057
1058 Actually the old trend method is equivalent to trend="loess" in the new
1059 software. You should use plotBCV(object) to see whether a trend is
1060 required.
1061
1062 Note you could also use
1063
1064 prior.n = getPriorN(object, prior.df=10)
1065
1066 to map between prior.df and prior.n.
1067
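Put as a couple of lines of R arithmetic (the 36 samples and 12 fitted parameters come from Gordon's example above, not from this tool)::

    residual.df = 36 - 12                 # 24 residual degrees of freedom
    old.prior.df = 10 * residual.df       # old default prior.n = 10  is equivalent to prior.df = 240
    new.prior.n  = 10 / residual.df       # new default prior.df = 10 is equivalent to prior.n = 10/24
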
1068 ----
1069
1070 **Attributions**
1071
1072 edgeR - edgeR_
1073
1074 VOOM/limma - limma_VOOM_
1075
1076 DESeq2 - DESeq2_
1077
1078 See above for the Bioconductor documentation for the packages exposed in Galaxy by this tool and its Tool Shed dependency package.
1079
1080 Galaxy_ (that's what you are using right now!) for gluing everything together
1081
1082 Otherwise, all code and documentation comprising this tool was written by Ross Lazarus and is
1083 licensed to you under the LGPL_ like other rgenetics artefacts
1084
1085 .. _LGPL: http://www.gnu.org/copyleft/lesser.html
1086 .. _HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
1087 .. _edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
1088 .. _DESeq2: http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
1089 .. _limma_VOOM: http://www.bioconductor.org/packages/release/bioc/html/limma.html
1090 .. _Galaxy: http://getgalaxy.org
1091 </help>
1092
1093 </tool>
1094
1095