comparison notes.txt @ 0:56d5be6c03b9 draft

planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty
author matthias
date Fri, 08 Mar 2019 06:30:11 -0500
parents
children
-1:000000000000 0:56d5be6c03b9
1 TODO
2 ====
3
4
5
6 If we make a monolithic tool:
7
8 * implement sanity checks between the important compute-intensive steps (user-definable criteria, abort if violated)
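A minimal sketch of such a check, in Python for illustration only. The function name `check_retained`, the exception class, and the "fraction of reads retained" criterion are all hypothetical; the notes do not specify which criteria the user should be able to set.

```python
class SanityCheckError(RuntimeError):
    """Raised when a user-defined criterion is violated (hypothetical name)."""


def check_retained(step, reads_in, reads_out, min_fraction):
    """Abort if a step kept less than `min_fraction` of its input reads.

    `step` is only used for the error message; `min_fraction` stands in
    for a user-definable criterion.
    """
    if reads_in <= 0:
        raise SanityCheckError(f"{step}: no input reads")
    fraction = reads_out / reads_in
    if fraction < min_fraction:
        raise SanityCheckError(
            f"{step}: only {fraction:.1%} of reads retained "
            f"(minimum {min_fraction:.1%})"
        )
    return fraction
```

Calling this between two expensive steps would stop the run before the next step starts, instead of computing on data that is already unusable.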
9
10 If we keep separate tools:
11
12 - make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
13 * alternatively, the data set types could be derived from tabular, with the Rdata attached via
14 `.extra_files_path`; this way the user would have some intermediate output to look at.
15
16
17 In both cases:
18
19 * allow input of single-end data, single pair, single pair in separate data sets, ...
20 * add mergePairsByID functionality to mergePairs tool
21
22
23 Datatypes:
24 ==========
25
26 **derep-class**: list with 3 members
27 - uniques: Named integer vector. Named by the unique sequence, valued by abundance.
28 - quals: Numeric matrix of average quality scores by position for each unique. Uniques are
29 rows, positions are cols.
30 - map: Integer vector of length the number of reads, and value the index (in uniques) of the
31 unique to which that read was assigned.
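The three members can be illustrated with a small Python sketch. This is not dada2's implementation, only a toy model of the structure described above; the function name `dereplicate` and the input format (a list of sequence/quality tuples) are assumptions made for the example.

```python
from collections import Counter


def dereplicate(reads):
    """Toy dereplication of `reads`, a list of (sequence, qualities) tuples.

    Returns a dict with the three members described above:
    uniques, quals, and map.
    """
    counts = Counter(seq for seq, _ in reads)
    # Uniques ordered by decreasing abundance (ties keep first-seen order).
    ordered = [seq for seq, _ in counts.most_common()]
    uniques = {seq: counts[seq] for seq in ordered}

    # 1-based index of each unique, mirroring R's indexing convention.
    index = {seq: i for i, seq in enumerate(ordered, start=1)}
    read_map = [index[seq] for seq, _ in reads]

    # Average quality per position for each unique (rows = uniques).
    # Rows are left ragged here if the uniques differ in length.
    quals = []
    for seq in ordered:
        rows = [q for s, q in reads if s == seq]
        quals.append([sum(col) / len(col) for col in zip(*rows)])

    return {"uniques": uniques, "quals": quals, "map": read_map}
```

For three reads `ACGT`, `AAGT`, `ACGT`, the sketch yields two uniques with abundances 2 and 1, and a map with one entry per read.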
32
33 **learnErrorsOutput**: A named list with three entries
34 - err_out: A numeric matrix with the learned error rates.
35 - err_in: The initialization error rates (unimportant).
36 - trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
37
38 **dada-class**: A multi-item List with the following named values...
39 • denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences.
40 • clustering: An informative data.frame containing information on each cluster.
41 • sequence: A character vector of each denoised sequence. Identical to names(denoised).
42 • quality: The average quality scores for each cluster (row) by position (col).
43 • map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised).
44 • birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
45 • trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col)
46 observed in the final output of the dada algorithm.
47 • err_in: The err matrix used for this invocation of dada.
48 • err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
49 • opts: A list of the dada_opts used for this invocation of dada.
50 • call: The function call used for this invocation of dada.
51
52 **uniques**: derep, dada, mergepairs (or a data frame with sequence and abundance columns)
53
54 **mergepairs**:
55
56 The data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences, and the following columns:
57 • abundance: Number of reads corresponding to this forward/reverse combination.
58 • sequence: The merged sequence.
59 • forward: The index of the forward denoised sequence.
60 • reverse: The index of the reverse denoised sequence.
61 • nmatch: Number of matching nts in the overlap region.
62 • nmismatch: Number of mismatches in the overlap region.
63 • nindel: Number of indels in the overlap region.
64 • prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
65 • accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
66 • ...: Additional columns specified in propagateCol
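The `accept` column can be sketched as a predicate over the other columns. This Python sketch is an illustration, not dada2 code: it assumes the overlap length is measured by `nmatch` and that "differences" means mismatches plus indels, and the default thresholds of 12 and 0 are chosen for the example. The function name `accept_pair` is hypothetical.

```python
def accept_pair(row, min_overlap=12, max_mismatch=0):
    """Return True if a merged pair passes the overlap criteria.

    `row` is a dict with the columns nmatch, nmismatch and nindel.
    Assumption: overlap is measured by nmatch, and differences are
    nmismatch + nindel.
    """
    differences = row["nmismatch"] + row["nindel"]
    return row["nmatch"] >= min_overlap and differences <= max_mismatch


def accepted_only(rows, **criteria):
    """Keep only the rows that pass `accept_pair`."""
    return [row for row in rows if accept_pair(row, **criteria)]
```

Filtering on this predicate is what a downstream tool would do if it only wants successfully merged pairs.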
67
68
69
70 Tools:
71 ======
72
73 • Quality filtering
74
75 filterAndTrim IO=(fastq -> fastq)
76
77 • Dereplication
78
79 derepFastq (fastq -> derep-class object)
80
81 • Learn error rates
82
83 learnErrors + plotErrors
84 - in: list, or vector, of file names (or a list of derep-class objects; why? learning should be done on the full data)
85 - out: named list with entries
86 - \$err\_out: A numeric matrix with the learned error rates.
87 - \$err\_in: The initialization error rates (unimportant).
88 - \$trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
89
90 • Sample Inference (dada)
91 in: (list of) derep-class object
92 out: (list of) dada-class object
93
94 • Chimera Removal
95
96 removeBimeraDenovo
97
98 in: A uniques-vector or any object that can be coerced into one with getUniques.
99 out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided
100
101 • Merging of Paired Reads
102
103 mergePairs
104 in: 2x dada-class object(s), 2x derep-class object(s)
105 out: A data.frame, or a list of data.frames.
106 - The returned data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences.
107 - columns:
108 - \$abundance: Number of reads corresponding to this forward/reverse combination.
109 - \$sequence: The merged sequence.
110 - \$forward: The index of the forward denoised sequence.
111 - \$reverse: The index of the reverse denoised sequence.
112 - \$nmatch: Number of matching nts in the overlap region.
113 - \$nmismatch: Number of mismatches in the overlap region.
114 - \$nindel: Number of indels in the overlap region.
115 - \$prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
116 - \$accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
117 - \$...: Additional columns specified in propagateCol.
118
119 • Taxonomic Classification (assignTaxonomy, assignSpecies)
120
121 • Other
122
123 makeSequenceTable
124 in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques.
125 out: Named integer matrix (row for each sample, column for each unique sequence)
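The shape of that output can be illustrated with a short Python sketch. It is a toy model rather than dada2's implementation: the function name `make_sequence_table` and the input format (sample name mapped to a sequence/abundance dict) are assumptions, and columns are simply ordered by first appearance.

```python
def make_sequence_table(samples):
    """Build a sample-by-sequence abundance table.

    `samples` maps a sample name to a {sequence: abundance} dict.
    Returns (columns, rows): the column sequences, and one list of
    abundances per sample, with 0 where a sequence is absent.
    """
    columns = []
    seen = set()
    for uniques in samples.values():
        for seq in uniques:
            if seq not in seen:
                seen.add(seq)
                columns.append(seq)
    rows = {
        name: [uniques.get(seq, 0) for seq in columns]
        for name, uniques in samples.items()
    }
    return columns, rows
```

Each row has the same length, so the result maps directly onto a tabular data set with one line per sample.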
126
127 mergeSequenceTables
128
129 uniquesToFasta
130 in: A uniques-vector or any object that can be coerced into one with getUniques.
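As an illustration of the output side, the following Python sketch renders a uniques mapping as FASTA text. It is not the dada2 function: the name `uniques_to_fasta` and the header format `sq<index>;size=<abundance>;` are assumptions made for this example.

```python
def uniques_to_fasta(uniques):
    """Render a {sequence: abundance} mapping as FASTA text.

    Headers follow an assumed "sq<index>;size=<abundance>;" pattern,
    numbered in the mapping's iteration order.
    """
    lines = []
    for i, (seq, abundance) in enumerate(uniques.items(), start=1):
        lines.append(f">sq{i};size={abundance};")
        lines.append(seq)
    return "\n".join(lines) + "\n"
```

Keeping the abundance in the header means the FASTA file alone is enough to reconstruct the uniques-vector.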
131
132 getSequences
133
134 extracts the sequences from several different data objects, including dada-class
135 and derep-class objects, as well as data.frame objects that have both \$sequence and \$abundance
136 columns.
137
138 getUniques
139
140 extracts the uniques-vector from several different data objects, including dada-class
141 and derep-class objects, as well as data.frame objects that have both \$sequence and \$abundance
142 columns
143
144 plotQualityProfile
145
146 seqComplexity
147
148 setDadaOpt(...)