annotate cor.xml @ 90:b061185bcb83 draft

Uploaded
author bernhardlutz
date Thu, 23 Jan 2014 14:53:46 -0500
parents
children 7ae2e4b1ff1f
<tool id="cor2" name="Correlation" version="1.1.0">
  <description>for numeric columns</description>
  <expand macro="requirements" />
  <macros>
    <import>statistic_tools_macros.xml</import>
  </macros>
  <command interpreter="python">cor.py $input1 $out_file1 $numeric_columns $method</command>
  <inputs>
    <param format="tabular" name="input1" type="data" label="Dataset" help="Dataset missing? See TIP below"/>
    <param name="numeric_columns" label="Numerical columns" type="data_column" numerical="True" multiple="True" data_ref="input1" help="Multi-select list - hold the appropriate key while clicking to select multiple columns" />
    <param name="method" type="select" label="Method">
      <option value="pearson">Pearson</option>
      <option value="kendall">Kendall rank</option>
      <option value="spearman">Spearman rank</option>
    </param>
  </inputs>
  <outputs>
    <data format="txt" name="out_file1" />
  </outputs>
  <tests>
    <!--
      Test a tabular input with the first line being a comment without a # character to start
    -->
    <test>
      <param name="input1" value="cor.tabular" />
      <param name="numeric_columns" value="2,3" />
      <param name="method" value="pearson" />
      <output name="out_file1" file="cor_out.txt" />
    </test>
  </tests>
  <help>

.. class:: infomark

**TIP:** If your data is not TAB delimited, use *Text Manipulation-&gt;Convert*

.. class:: warningmark

Missing data ("nan") is removed from each pairwise comparison.

-----

**Syntax**

This tool computes the matrix of correlation coefficients between numeric columns.

- All invalid, blank and comment lines are skipped when performing computations. The number of skipped lines is displayed in the resulting history item.

- **Pearson's Correlation** reflects the degree of linear relationship between two variables. It ranges from -1 to +1. A correlation of +1 means a perfect positive linear relationship between the variables, -1 means a perfect negative linear relationship, and 0 means no linear relationship. The formula for Pearson's correlation is:

    .. image:: $PATH_TO_IMAGES/pearson.png

    where n is the number of items.

- **Kendall's rank correlation** measures the degree of correspondence between two rankings and is used to assess the significance of this correspondence. The formula for Kendall's rank correlation is:

    .. image:: $PATH_TO_IMAGES/kendall.png

    where n is the number of items, and P is the sum, over all items, of the number of items ranked after the given item by both rankings.

- **Spearman's rank correlation** assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. The formula for Spearman's rank correlation is:

    .. image:: $PATH_TO_IMAGES/spearman.png

    where D is the difference between the ranks of corresponding values of X and Y, and N is the number of pairs of values.

-----

**Example**

- Input file::

    #Person  Height  Self Esteem
    1        68      4.1
    2        71      4.6
    3        62      3.8
    4        75      4.4
    5        58      3.2
    6        60      3.1
    7        67      3.8
    8        68      4.1
    9        71      4.3
    10       69      3.7
    11       68      3.5
    12       67      3.2
    13       63      3.7
    14       62      3.3
    15       60      3.4
    16       63      4.0
    17       65      4.1
    18       67      3.8
    19       63      3.4
    20       61      3.6

- Computing the correlation coefficients between columns 2 and 3 of the above file (using Pearson's Correlation), the output is::

    1.0             0.730635686279
    0.730635686279  1.0

So the correlation for our twenty cases is 0.73, which is a fairly strong positive relationship.
  </help>
</tool>
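
The three methods described in the help text can be sketched in plain Python. This is only an illustration of the formulas, not the contents of cor.py, which is not shown here. The function names (`pearson`, `average_ranks`, `spearman`, `kendall_tau_a`) are hypothetical. Two assumptions are made: Spearman's coefficient is computed as Pearson's correlation on average ranks (which equals the D-based formula only when there are no ties), and Kendall's coefficient is the simple tau-a variant without tie correction, which may differ from what the tool reports on tied data.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's correlation of the ranks (assumption)."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (assumption)."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                score += 1      # pair ordered the same way in both columns
            elif s < 0:
                score -= 1      # pair ordered in opposite ways
    return score / (n * (n - 1) / 2)


# Columns 2 and 3 of the example input file in the help text.
HEIGHT = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
ESTEEM = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
          3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

print(round(pearson(HEIGHT, ESTEEM), 4))  # 0.7306, the off-diagonal value in the example output
print(spearman(HEIGHT, ESTEEM))
print(kendall_tau_a(HEIGHT, ESTEEM))
```

The Pearson value reproduces the off-diagonal entry of the example output matrix. The diagonal entries are 1.0 because every column correlates perfectly with itself.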