annotate logistic_regression_vif.xml @ 96:36b4b11ec126 draft

Uploaded
author bernhardlutz
date Thu, 06 Feb 2014 13:22:48 -0500
parents c4a3a8999945
children
rev   line source
c4a3a8999945 Uploaded
bernhardlutz
parents:
diff changeset
1 <tool id="LogisticRegression" name="Perform Logistic Regression with vif" version="1.1.0">
2 <description> </description>
3 <expand macro="requirements" />
4 <macros>
5 <import>statistic_tools_macros.xml</import>
6 </macros>
7 <command interpreter="python">
8 logistic_regression_vif.py
9 $input1
10 $response_col
11 $predictor_cols
12 $out_file1
13 1>/dev/null
14 </command>
15 <inputs>
16 <param format="tabular" name="input1" type="data" label="Select data" help="Dataset missing? See TIP below."/>
17 <param name="response_col" label="Response column (Y)" type="data_column" data_ref="input1" numerical="True"/>
18 <param name="predictor_cols" label="Predictor columns (X)" type="data_column" data_ref="input1" numerical="True" multiple="true" >
19 <validator type="no_options" message="Please select at least one column."/>
20 </param>
21 </inputs>
22 <outputs>
23 <data format="input" name="out_file1" metadata_source="input1" />
24
25 </outputs>
26 <tests>
27 <test>
28 <param name="input1" value="logreg_inp.tabular"/>
29 <param name="response_col" value="4"/>
30 <param name="predictor_cols" value="1,2,3"/>
31 <output name="out_file1" file="logreg_out2.tabular"/>
32
33 </test>
34 </tests>
35 <help>
36
37
38 .. class:: infomark
39
40 **TIP:** If your data is not TAB delimited, use *Edit Datasets-&gt;Convert characters*
41
42 -----
43
44 .. class:: infomark
45
46 **What it does**
47
48 This tool uses the **'glm'** function from the R statistical package to perform logistic regression on the input data. It outputs one file containing the summary statistics of the fitted regression. It also calculates the VIF (Variance Inflation Factor) with the **'vif'** function from the R library car.
49
50
51 *R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.*
52
53 -----
54
55 .. class:: warningmark
56
57 **Note**
58
59 - This tool currently treats all predictor variables as continuous numeric variables and the response variable as a categorical variable. Currently, the response variable can have only two classes, namely 0 and 1. The program takes 0 as the base class.
60
61 - Rows containing non-numeric (or missing) data in any of the chosen columns will be excluded from the analysis.
62
63 - The summary statistics in the output are described below:
64
65 - Pseudo R-squared: the proportion of improvement of the fitted model over the null model.
66 - p-value: p-value for the z-test of the null hypothesis that the corresponding slope is equal to zero against the two-sided alternative.
67 - Coefficient: the change in the log odds, log(probability of class 1 / probability of class 0), for a one-unit increase in the corresponding predictor.
68
69 - This tool also reports the **Variance Inflation Factor (VIF)**, which quantifies the level of multicollinearity. The tool automatically generates VIFs if the model has more than one predictor. The higher the VIF, the stronger the multicollinearity. Multicollinearity inflates the standard errors and makes the affected predictors appear less significant. In the worst case, it can reverse the direction of the slope for highly correlated predictors if one of them is significant. A general rule of thumb is to use only predictors with a VIF lower than 10 or, more strictly, 5.
70 - **VIF** is calculated in two steps:
71 - First, regress each predictor on all other predictors and record the R-squared of each regression.
72 - Second, compute the VIF of each predictor as 1 / (1 - R_squared).
73
74 </help>
75 </tool>
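
The two-step VIF recipe in the help text above can be sketched in plain Python. This is only an illustration of that recipe, not the code the tool runs: the tool delegates to the **'vif'** function of the R library car, whose internals may differ. The helper names `solve`, `r_squared` and `vif` are hypothetical, and the sketch assumes predictors are passed as equal-length lists of floats with no constant or perfectly collinear columns.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, columns):
    """R-squared of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Return one VIF per predictor column."""
    result = []
    for j, target in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        # Step 1: regress this predictor on all other predictors.
        r2 = r_squared(target, others)
        # Step 2: VIF = 1 / (1 - R_squared).
        result.append(1.0 / (1.0 - r2))
    return result
```

For example, `vif([[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 11.0]])` gives a VIF of about 122 for both columns, far above the rule-of-thumb limit of 10, while mutually uncorrelated columns give values of 1.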