annotate cor.xml @ 93:7ae2e4b1ff1f draft

Uploaded
author bernhardlutz
date Sun, 26 Jan 2014 08:35:17 -0500
parents b061185bcb83
children 6ef11b60940a
<tool id="cor2" name="Correlation" version="1.1.0">
    <description>for numeric columns</description>
    <expand macro="requirements" />
    <macros>
        <import>statistic_tools_macros.xml</import>
    </macros>
    <command interpreter="python">cor.py $input1 $out_file1 $numeric_columns $method</command>
    <inputs>
        <param format="tabular" name="input1" type="data" label="Dataset" help="Dataset missing? See TIP below"/>
        <param name="numeric_columns" label="Numerical columns" type="data_column" numerical="True" multiple="True" data_ref="input1" help="Multi-select list - hold the appropriate key while clicking to select multiple columns" />
        <param name="method" type="select" label="Method">
            <option value="pearson">Pearson</option>
            <option value="kendall">Kendall rank</option>
            <option value="spearman">Spearman rank</option>
        </param>
    </inputs>
    <expand macro="environment" />
    <outputs>
        <data format="txt" name="out_file1" />
    </outputs>
    <tests>
        <!--
            Test a tabular input with the first line being a comment without a # character to start
        -->
        <test>
            <param name="input1" value="cor.tabular" />
            <param name="numeric_columns" value="2,3" />
            <param name="method" value="pearson" />
            <output name="out_file1" file="cor_out.txt" />
        </test>
    </tests>
    <help>

.. class:: infomark

**TIP:** If your data is not TAB delimited, use *Text Manipulation-&gt;Convert*

.. class:: warningmark

Missing data ("nan") is removed from each pairwise comparison.

-----

**Syntax**

This tool computes the matrix of correlation coefficients between numeric columns.

- All invalid, blank and comment lines are skipped when performing computations. The number of skipped lines is displayed in the resulting history item.

- **Pearson's Correlation** reflects the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect positive linear relationship between variables. The formula for Pearson's correlation is:

    .. image:: $PATH_TO_IMAGES/pearson.png

    where n is the number of items.

- **Kendall's rank correlation** is used to measure the degree of correspondence between two rankings and to assess the significance of this correspondence. The formula for Kendall's rank correlation is:

    .. image:: $PATH_TO_IMAGES/kendall.png

    where n is the number of items, and P is the sum.

- **Spearman's rank correlation** assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. The formula for Spearman's rank correlation is:

    .. image:: $PATH_TO_IMAGES/spearman.png

    where D is the difference between the ranks of corresponding values of X and Y, and N is the number of pairs of values.

-----

**Example**

- Input file::

    #Person  Height  Self Esteem
    1        68      4.1
    2        71      4.6
    3        62      3.8
    4        75      4.4
    5        58      3.2
    6        60      3.1
    7        67      3.8
    8        68      4.1
    9        71      4.3
    10       69      3.7
    11       68      3.5
    12       67      3.2
    13       63      3.7
    14       62      3.3
    15       60      3.4
    16       63      4.0
    17       65      4.1
    18       67      3.8
    19       63      3.4
    20       61      3.6

- Computing the correlation coefficients between columns 2 and 3 of the above file (using Pearson's Correlation), the output is::

    1.0             0.730635686279
    0.730635686279  1.0

  So the correlation for our twenty cases is 0.73, which is a fairly strong positive relationship.

    </help>
</tool>
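The Pearson example in the help text can be checked by hand. The following is a minimal Python sketch, not the contents of cor.py (which is not shown on this page). It assumes the common computational form of Pearson's r with n as the number of items, and it models the "nan" handling described in the help as dropping any pair in which either value is missing. The names `pearson`, `correlation_matrix`, `HEIGHT`, and `SELF_ESTEEM` are hypothetical and chosen only for illustration.

```python
import math

# Columns 2 and 3 of the example input from the help text.
HEIGHT = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
SELF_ESTEEM = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
               3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]


def pearson(xs, ys):
    """Pearson's r for two equal-length sequences of numbers.

    Assumption: missing data is represented as float("nan"), and a pair
    is dropped when either of its two values is nan.
    """
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_xx = sum(x * x for x, _ in pairs)
    sum_yy = sum(y * y for _, y in pairs)
    # Computational form: covariance term over the product of spreads.
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) *
                            (n * sum_yy - sum_y ** 2))
    return numerator / denominator


def correlation_matrix(columns):
    """Square matrix of pairwise Pearson coefficients between columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


if __name__ == "__main__":
    for row in correlation_matrix([HEIGHT, SELF_ESTEEM]):
        print("\t".join(str(value) for value in row))
```

Run as a script, this prints a 2x2 matrix whose off-diagonal entries should agree with the 0.7306 reported in the help example. The sketch raises ZeroDivisionError for a constant column; how the actual tool handles that case is not stated in the source.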