annotate README.rst @ 4:ef365d71514e draft

planemo upload for repository https://github.com/bgruening/galaxytools/tree/master/tools/sklearn commit 64158f357e708f0b60d2669d92d614f7aee34c0e
author bgruening
date Wed, 06 Jun 2018 17:39:49 -0400
parents ca05b5889dfc
children 1530d05d19b4
ca05b5889dfc planemo upload for repository https://github.com/bgruening/galaxytools/tree/master/tools/sklearn commit 2e1e78576b38110cf5b1f2ed83b08b9c3a6cbfee
bgruening
parents:
diff changeset

***************************************
Galaxy wrapper for scikit-learn library
***************************************

Contents
========

- `What is scikit-learn?`_

  - `Scikit-learn main package groups`_
  - `Tools offered by this wrapper`_

- `Machine learning workflows`_

  - `Supervised learning workflows`_
  - `Unsupervised learning workflows`_

____________________________

.. _What is scikit-learn?:

What is scikit-learn?
=====================

Scikit-learn is an open-source machine learning library for the Python programming language. It offers various algorithms for supervised and unsupervised learning, as well as data preprocessing and transformation, model selection and evaluation, and dataset utilities. It is built on the SciPy (Scientific Python) library.

The scikit-learn source code is available at https://github.com/scikit-learn/scikit-learn.
Detailed installation instructions can be found at http://scikit-learn.org/stable/install.html

.. _Scikit-learn main package groups:

================================
Scikit-learn main package groups
================================

Scikit-learn provides users with several main groups of related operations.
These are:

- Classification

  - Identifying to which category an object belongs.

- Regression

  - Predicting a continuous-valued attribute associated with an object.

- Clustering

  - Automatic grouping of similar objects into sets.

- Preprocessing

  - Feature extraction and normalization.

- Model selection and evaluation

  - Comparing, validating and choosing parameters and models.

- Dimensionality reduction

  - Reducing the number of random variables to consider.

Each group contains a number of well-known algorithms from its category. For example, the ``sklearn.cluster`` package provides hierarchical, spectral, k-means, and other clustering methods.
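
As a rough illustration of one such group, the sketch below runs two methods from ``sklearn.cluster`` on the same data and checks how well their groupings agree. The synthetic dataset, the number of clusters, and all parameter values are assumptions chosen for this example; they are not taken from the wrapper.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Illustrative data: 60 points drawn around three well-separated centers.
X, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# k-means clustering (n_init is set explicitly because its default differs between versions).
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering on the same data.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# The adjusted Rand index compares two groupings regardless of how the labels are numbered.
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
print("Agreement between k-means and hierarchical clustering:", agreement)
```

On data this cleanly separated both methods should recover the same groups; on real data they can disagree, which is one reason the package offers several methods.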

.. _Tools offered by this wrapper:

======================================
Available tools in the current wrapper
======================================

The current release of the wrapper offers a subset of the packages from the scikit-learn library. You can find:

- A subset of classification metric functions
- Linear and quadratic discriminant classifiers
- Random forest and AdaBoost classifiers and regressors
- All the clustering methods
- All support vector machine classifiers
- A subset of data preprocessing estimator classes
- Pairwise metric measurement functions

In addition, several tools for performing matrix operations, generating problem-specific datasets, encoding text, and extracting features have been prepared to help the user with more advanced operations.
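
For orientation, here is a minimal sketch of the kind of scikit-learn calls that dataset generation and pairwise metric tools are built around. It uses the library directly rather than the Galaxy tool interface. Which functions each wrapper tool actually calls is an assumption here, as are the dataset size and the choice of metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances

# Generate a small synthetic classification dataset (sizes are illustrative).
X, y = make_classification(
    n_samples=20,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Pairwise Euclidean distances between all samples: a 20 x 20 matrix.
D = pairwise_distances(X, metric="euclidean")

print("Feature matrix shape:", X.shape)
print("Distance matrix shape:", D.shape)
print("Largest pairwise distance:", np.max(D))
```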

.. _Machine learning workflows:

Machine learning workflows
==========================

Machine learning is about processes. No matter what machine learning algorithm we use, we can apply typical workflows and dataflows to produce more robust models and better predictions.
Here we discuss supervised and unsupervised learning workflows.

.. _Supervised learning workflows:

=====================================
Supervised machine learning workflows
=====================================

**What is supervised learning?**

In this machine learning task, given labeled sample data, the aim is to build a model that can predict the labels of new observations.
In practice, five steps take us from raw input data to reasonable predictions for new samples:

1. Preprocess the data::

       * Change the collected data into the proper format and datatype.
       * Adjust the data quality by filling the missing values, performing
         required scaling and normalizations, etc.
       * Extract features which are the most meaningful for the learning task.
       * Split the ready dataset into training and test samples.

2. Choose an algorithm::

       * These factors help one to choose a learning algorithm:
           - Nature of the data (e.g. linear vs. nonlinear data)
           - Structure of the predicted output (e.g. binary vs. multilabel classification)
           - Memory and time usage of the training
           - Predictive accuracy on new data
           - Interpretability of the predictions

3. Choose a validation method

   Every machine learning model should be evaluated before being put into practical use.
   There are numerous performance metrics to evaluate machine learning models.
   For supervised learning, classification or regression metrics are usually used.

   A validation method helps to evaluate the performance metrics of a trained model in order
   to optimize its performance or ultimately switch to a more efficient model.
   Cross-validation is a well-known validation method.

4. Fit a model

   Given the learning algorithm, validation method, and performance metric(s),
   repeat the following steps::

       * Train the model.
       * Evaluate based on metrics.
       * Optimize until satisfied.

5. Use the fitted model for prediction

   This is a final evaluation in which the optimized model is used to make predictions
   on unseen (here, test) samples. After this, the model is put into production.
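
The five steps above can be sketched with scikit-learn itself. This is a minimal example under stated assumptions, not the wrapper's own implementation: the iris dataset, the random forest classifier, the 5-fold cross-validation, the accuracy metric, and the small parameter grid are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Preprocess the data: load a labeled dataset and split it into training and test samples.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Choose an algorithm: feature scaling followed by a random forest classifier.
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# 3. Choose a validation method: 5-fold cross-validation scored by accuracy.
# 4. Fit a model: the grid search repeats train / evaluate / optimize
#    for every candidate parameter value, using the training samples only.
search = GridSearchCV(
    pipeline,
    param_grid={"randomforestclassifier__n_estimators": [10, 50]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# 5. Use the fitted model for prediction on the held-out test samples.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation training fold, so the validation folds do not leak into preprocessing.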

.. _Unsupervised learning workflows:

=======================================
Unsupervised machine learning workflows
=======================================

**What is unsupervised learning?**

Unlike in supervised learning, and as is more likely in real life, here the initial data is not labeled.
The task is to extract the structure from the data and group the samples based on their similarities.
Clustering and dimensionality reduction are two well-known examples of unsupervised learning tasks.

In this case, the workflow is as follows::

    * Preprocess the data (without splitting to train and test).
    * Train a model.
    * Evaluate and tune parameters.
    * Analyse the model and test on real data.
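
The unsupervised workflow can be sketched in the same way. The example below is an illustration only: the synthetic data, the use of k-means, the silhouette score as the evaluation metric, and the candidate range of cluster counts are assumptions made for this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Preprocess the data (no train/test split): generate unlabeled samples and scale them.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.6,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Train a model, then evaluate and tune: try several cluster counts
# and keep the one with the highest silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)

# Analyse the model: refit with the selected number of clusters and inspect the centers.
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_scaled)

print("Silhouette score per cluster count:", scores)
print("Selected number of clusters:", best_k)
print("Cluster centers (scaled feature space):")
print(final_model.cluster_centers_)
```

Because there are no labels, the evaluation relies on an internal measure of cluster cohesion and separation rather than on prediction accuracy.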