Mercurial > repos > melissacline > ucsc_cancer_browser_stats
comparison ttest/stats.xml @ 12:fd8529cd1564 default tip
better t-test
author | jingchunzhu |
---|---|
date | Mon, 28 Sep 2015 12:36:12 -0700 |
parents | cd4c13ae11ce |
children |
11:cd4c13ae11ce | 12:fd8529cd1564 |
---|---|
1 <tool id="ucscCancerBrowserStats" description="t-tests of Difference in genomic data" name="Difference between categories (t-test)" version="0.0.1"> | 1 <tool id="ucscCancerBrowserStats" description="t-tests of difference in genomic data" name="Difference between categories (t-test)" version="0.0.1"> |
2 <command interpreter="python"> | 2 <command interpreter="python"> |
3 stats.py $genomicMatrix $clinicalFeatures $outFile -a="${category1}" -b="${category2}" | 3 stats.py $genomicMatrix $clinicalFeatures $outFile -a="${category1}" -b="${category2}" |
4 </command> | 4 </command> |
5 <inputs> | 5 <inputs> |
6 <param format="tabular" name="genomicMatrix" type="data" label="Genomic Matrix"/> | 6 <param format="tabular" name="genomicMatrix" type="data" label="Genomic Matrix"/> |
21 <param name="category2" value="B"/> | 21 <param name="category2" value="B"/> |
22 <output name="outFile" value="sample.stats.output.txt"/> | 22 <output name="outFile" value="sample.stats.output.txt"/> |
23 </tests> | 23 </tests> |
24 <help> | 24 <help> |
25 | 25 |
26 This tool performs statistical tests found in the UCSC Cancer Genomics | 26 This tool performs a t-test on genomic data between two groups of samples, which can be used to identify, for example, differentially expressed genes or probes. The genomic data is in the UCSC Xena genomic matrix format (a tab-delimited matrix), with rows representing genes or probes and columns representing samples. The phenotype matrix assigns samples to groups. The tool compares the two groups of samples and computes the t-statistic, p-value, and difference of medians for each probe/gene. The result can be opened in programs such as Excel for sorting by t-statistic. |
27 Browser. The input data is a genomic matrix (containing genomic data, | |
28 with rows representing genes or probes and columns representing | |
29 samples or patients), a clinical matrix of two (or more) columns | |
30 assigning categorical values to the samples, and two categorical | |
31 values of interest. The tool identifies the samples corresponding to | |
32 each categorical value, then identifies the columns in the genomic | |
33 matrix corresponding to those sets of samples, which identifies two | |
34 groups of columns. For each row in the genomic matrix, it extracts | |
35 the value for those two sets of columns, performs a t-test on the two | |
36 sets of values, and returns the result for the row. Any values for | |
37 any columns NOT pertaining to one of the categorical values of | |
38 interest are ignored. | |
39 | 27 |
40 The user runs this tool with th following steps: | 28 The user runs this tool with the following steps: |
29 | |
30 1. Specify a genomic matrix. The expected format is with rows representing genes and columns representing samples, and the first line contains sample names. Matrix can be obtained from UCSC Xena bulk download. See below for an example. | |
41 | 31 |
42 | 32 |
43 1. Specify a genomic matrix. The expected format is with rows representing | 33 2. Specify a phenotype matrix. Here, rows indicate samples, columns indicate phenotypes or annotations. Matrix can be obtained from UCSC Xena heatmap download. See below for an example. |
44 genes and columns representing samples, and the first line contains sample | |
45 names. | |
46 | |
47 2. Specify a clinical matrix. Here, rows indicate samples, columns | |
48 indicate clinical features, and the header row contains feature names. | |
49 The first column MUST indicate the sample names, and MUST correspond | |
50 to the column names of the genomic matrix. The clinical feature of | |
51 interest MUST be in the second column. Any other columns will be | |
52 ignored. | |
53 | 34 |
54 | 35 |
55 3. Indicate two clinical values that you want to use for defining the | 36 3. Specify the two categorical values that you want to use for defining the two groups. For example, the two groups could be A and B, 0 and 1, etc. |
56 two groups. For example, the two groups could be "Red group" and | |
57 "Green group", 0 and 1, or whatever. | |
58 | 37 |
59 The output indicates, for each row, the t-statistic reporting on the | |
60 difference between the two groups of columns (as specified by the two | |
61 clinical values), the p-value corresponding to that t-statistic, the | |
62 median value for each group, and the difference between the medians. If it | |
63 cannot calculate these values, it returns a vector of NAs. | |
64 | 38 |
65 For example, given the following genomic matrix for (1):: | 39 4. The output is, for each probe/gene (in each row), the t-statistics, the p-value, the median value for each group, and the difference between the medians. If it cannot calculate these values, it returns a vector of NAs. |
66 | 40 |
67 Gene 1 2 3 4 5 6 7 8 9 10 | 41 |
42 **Input genomic matrix**:: | |
43 | |
44 Gene s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 | |
68 G1 2.0 2.2 3.2 1.1 5.1 8.1 3.2 1.1 8.1 0.2 | 45 G1 2.0 2.2 3.2 1.1 5.1 8.1 3.2 1.1 8.1 0.2 |
69 G2 0.1 8.2 9.1 4.2 6.1 4.9 3.9 2.3 1.1 0.2 | 46 G2 0.1 8.2 9.1 4.2 6.1 4.9 3.9 2.3 1.1 0.2 |
70 | 47 |
71 and given the following clinical matrix for (2):: | 48 **Input phenotype matrix**:: |
72 | 49 |
73 sample_id Value | 50 sample_id Value |
74 1 A | 51 s1 A |
75 2 A | 52 s2 A |
76 3 B | 53 s3 B |
77 4 C | 54 s4 C |
78 5 B | 55 s5 B |
79 6 B | 56 s6 B |
80 7 A | 57 s7 A |
81 8 A | 58 s8 A |
82 9 B | 59 s9 B |
83 10 A | 60 s10 A |
84 | 61 |
85 and given A for Category 1 and B for Category 2 | 62 **Category 1 : A** |
86 | 63 |
87 the tool will assemble the following two groups of values:: | 64 **Category 2 : B** |
88 | 65 |
89 G1 A:(2.0, 2.2, 3.2, 1.1, 0.2) B:(3.2, 5.1, 8.1, 8.1) | 66 **Output**:: |
90 G2 A:(0.1, 8.2, 3.9, 2.3, 0.2) B:(9.1, 6.1, 4.9, 1.1) | |
91 | |
92 Note that the values for sample_id 4 do not appear, because it has a Value | |
93 of C in the second column, which is neither A nor B. | |
94 | |
95 And it will return the output:: | |
96 | 67 |
97 Gene Statistic pValue Median1 Median2 Delta | 68 Gene Statistic pValue Median1 Median2 Delta |
98 G1 -4.168999 0.004194 2.000000 6.600000 -4.600000 | 69 G1 -4.168999 0.004194 2.000000 6.600000 -4.600000 |
99 G2 -1.198486 0.269724 2.300000 5.500000 -3.200000 | 70 G2 -1.198486 0.269724 2.300000 5.500000 -3.200000 |
100 | 71 |
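
The grouping step described in the help text can be illustrated with a minimal Python sketch. This is not the code in `stats.py`, which is not shown in this changeset; the function names `read_groups` and `median_delta` are hypothetical, and the sketch assumes tab-delimited input with sample names in the first column of the phenotype matrix and the category in the second. It covers only the sample grouping and the Median1, Median2, and Delta columns; the t-statistic and p-value are left out.

```python
import csv
import io
from statistics import median

# Example inputs copied from the help text, written as tab-delimited strings.
GENOMIC_TSV = (
    "Gene\ts1\ts2\ts3\ts4\ts5\ts6\ts7\ts8\ts9\ts10\n"
    "G1\t2.0\t2.2\t3.2\t1.1\t5.1\t8.1\t3.2\t1.1\t8.1\t0.2\n"
    "G2\t0.1\t8.2\t9.1\t4.2\t6.1\t4.9\t3.9\t2.3\t1.1\t0.2\n"
)
PHENOTYPE_TSV = (
    "sample_id\tValue\n"
    "s1\tA\ns2\tA\ns3\tB\ns4\tC\ns5\tB\n"
    "s6\tB\ns7\tA\ns8\tA\ns9\tB\ns10\tA\n"
)


def read_groups(phenotype_tsv, cat1, cat2):
    """Map each category of interest to the set of samples assigned to it.

    Hypothetical helper: samples with any other category are ignored.
    """
    rows = csv.reader(io.StringIO(phenotype_tsv), delimiter="\t")
    next(rows)  # skip the header row
    groups = {cat1: set(), cat2: set()}
    for row in rows:
        sample, value = row[0], row[1]
        if value in groups:
            groups[value].add(sample)
    return groups


def median_delta(genomic_tsv, groups, cat1, cat2):
    """Return {gene: (median1, median2, delta)} for the two sample groups.

    Hypothetical helper: a gene maps to None when either group has no values.
    """
    rows = csv.reader(io.StringIO(genomic_tsv), delimiter="\t")
    samples = next(rows)[1:]  # header row holds the sample names
    results = {}
    for row in rows:
        gene = row[0]
        values = dict(zip(samples, (float(x) for x in row[1:])))
        first = [v for s, v in values.items() if s in groups[cat1]]
        second = [v for s, v in values.items() if s in groups[cat2]]
        if not first or not second:
            results[gene] = None
            continue
        m1, m2 = median(first), median(second)
        results[gene] = (m1, m2, m1 - m2)
    return results


if __name__ == "__main__":
    groups = read_groups(PHENOTYPE_TSV, "A", "B")
    for gene, stats in median_delta(GENOMIC_TSV, groups, "A", "B").items():
        print(gene, stats)
```

With the example inputs, sample s4 (category C) falls into neither group, and the medians and deltas agree with the Median1, Median2, and Delta columns of the example output up to floating-point rounding.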