comparison notes.txt @ 0:44230c777694 draft
planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty
author | matthias |
---|---|
date | Fri, 08 Mar 2019 06:39:56 -0500 |
parents | |
children |
-1:000000000000 | 0:44230c777694 |
---|---|
1 TODO | |
2 ==== | |
3 | |
4 | |
5 | |
6 If we make a monolithic tool: | |
7 | |
8 * implement sanity checks between important compute intensive steps (user definable criteria, abort if violated) | |
9 | |
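A sanity check between two compute-intensive steps could look roughly like the sketch below. It is written in Python only for illustration; the class `SanityCheckError`, the function `check_read_retention`, and the retention criterion itself are hypothetical and not part of dada2 or of the planned tool.

```python
class SanityCheckError(RuntimeError):
    """Raised when a user-defined criterion is violated between steps."""


def check_read_retention(step, reads_in, reads_out, min_fraction):
    """Abort if a step kept less than min_fraction of its input reads.

    min_fraction is the user-definable criterion (assumed here to be a
    fraction between 0 and 1).
    """
    if reads_in <= 0:
        raise SanityCheckError(f"{step}: no input reads")
    fraction = reads_out / reads_in
    if fraction < min_fraction:
        raise SanityCheckError(
            f"{step}: kept {fraction:.1%} of reads, "
            f"below the required {min_fraction:.1%}"
        )
    return fraction


# Hypothetical read counts before and after quality filtering.
kept = check_read_retention("filterAndTrim", 10000, 8200, min_fraction=0.5)
```

Raising an exception stops a monolithic pipeline before the next expensive step starts, which is the "abort if violated" behaviour described above.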
10 If we keep separate tools: | |
11 | |
12 - make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes) | |
13 * alternatively, the data set types could be derived from tabular and the Rdata attached via | |
14 `.extra_files_path`; this way the user would have some intermediate output to look at. | |
15 | |
16 | |
17 In both cases: | |
18 | |
19 * allow input of single end data, single pair, single pair in separate data sets, ... | |
20 * add mergePairsByID functionality to mergePairs tool | |
21 | |
22 | |
23 Datatypes: | |
24 ========== | |
25 | |
26 **derep-class**: list with 3 members | |
27 • uniques: Named integer vector. Named by the unique sequence, valued by abundance. | |
28 • quals: Numeric matrix of average quality scores by position for each unique. Uniques are | |
29 rows, positions are cols. | |
30 • map: Integer vector whose length is the number of reads and whose values are the index (in uniques) of the | |
31 unique to which that read was assigned. | |
32 | |
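The three members of a derep-class object can be illustrated with a small Python sketch. This is not dada2's implementation: `dereplicate` is a hypothetical helper, indices are 0-based (R uses 1-based indices), and ordering the uniques by decreasing abundance is an assumption of the sketch.

```python
def dereplicate(reads):
    """Build a derep-like dict from (sequence, quality list) tuples."""
    counts = {}
    for seq, _ in reads:
        counts[seq] = counts.get(seq, 0) + 1

    # Order uniques by decreasing abundance; ties keep first-seen order.
    ordered = sorted(counts, key=lambda s: -counts[s])
    index = {seq: i for i, seq in enumerate(ordered)}
    uniques = {seq: counts[seq] for seq in ordered}

    # Sum the qualities per unique and position, and record for every
    # read which unique it belongs to.
    sums = [[0.0] * len(seq) for seq in ordered]
    read_map = []
    for seq, qual in reads:
        i = index[seq]
        read_map.append(i)
        for pos, q in enumerate(qual):
            sums[i][pos] += q

    # Divide by abundance to get the average quality per position.
    quals = [
        [total / uniques[seq] for total in row]
        for seq, row in zip(ordered, sums)
    ]
    return {"uniques": uniques, "quals": quals, "map": read_map}


reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("TTGA", [10, 10, 10, 10]),
    ("ACGT", [20, 20, 20, 20]),
]
derep = dereplicate(reads)
```

Here `derep["map"]` has one entry per read, while `derep["uniques"]` and `derep["quals"]` have one entry per unique sequence.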
33 **learnErrorsOutput**: A named list with three entries | |
34 - err_out: A numeric matrix with the learned error rates. | |
35 - err_in: The initialization error rates (unimportant). | |
36 - trans: A feature table of observed transitions for each type (e.g. A->C) and quality score. | |
37 | |
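To make the relation between a transition table and an error-rate matrix concrete, the Python sketch below turns counts into plain relative frequencies. This is a deliberately naive stand-in and not how learnErrors fits its rates; the function name `naive_error_rates` and the dict-of-lists layout are assumptions, and the row names (A2A, A2C, ...) follow the naming used for the trans matrix in these notes.

```python
NUCLEOTIDES = "ACGT"


def naive_error_rates(trans):
    """Convert transition counts into per-quality relative frequencies.

    trans maps names such as "A2C" to a list of counts, one per quality
    score column. All 16 transition types must be present.
    """
    n_quals = len(next(iter(trans.values())))
    rates = {key: [0.0] * n_quals for key in trans}
    for origin in NUCLEOTIDES:
        keys = [f"{origin}2{target}" for target in NUCLEOTIDES]
        for q in range(n_quals):
            total = sum(trans[key][q] for key in keys)
            if total == 0:
                # Nothing observed for this origin and quality score;
                # the rate stays at 0.0 in this sketch.
                continue
            for key in keys:
                rates[key][q] = trans[key][q] / total
    return rates
```

For each original nucleotide and quality score, the four resulting rates sum to 1 whenever at least one transition was observed.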
38 **dada-class**: A multi-item List with the following named values... | |
39 • denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences. | |
40 • clustering: An informative data.frame containing information on each cluster. | |
41 • sequence: A character vector of each denoised sequence. Identical to names(denoised). | |
42 • quality: The average quality scores for each cluster (row) by position (col). | |
43 • map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised). | |
44 • birth_subs: A data.frame containing the substitutions at the birth of each new cluster. | |
45 • trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col) | |
46 observed in the final output of the dada algorithm. | |
47 • err_in: The err matrix used for this invocation of dada. | |
48 • err_out: The err matrix estimated from the output of dada. NULL if err_function not provided. | |
49 • opts: A list of the dada_opts used for this invocation of dada. | |
50 • call: The function call used for this invocation of dada. | |
51 | |
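The two map vectors chain together: the derep map takes a read to its unique, and the dada map takes a unique to its denoised sequence. A minimal Python sketch of that chaining follows; `reads_per_denoised` is a hypothetical helper, indices are 0-based here, and `None` stands in for a unique that has no denoised sequence assigned.

```python
def reads_per_denoised(derep_map, dada_map, n_denoised):
    """Count reads per denoised sequence by chaining the two maps.

    derep_map[r] is the unique index of read r.
    dada_map[u] is the denoised index of unique u, or None.
    """
    counts = [0] * n_denoised
    for unique_index in derep_map:
        denoised_index = dada_map[unique_index]
        if denoised_index is not None:
            counts[denoised_index] += 1
    return counts


# Five reads, three uniques, two denoised sequences (made-up values).
counts = reads_per_denoised([0, 0, 1, 2, 1], [0, 1, 0], n_denoised=2)
```

This is the kind of lookup a tool needs when it receives both a derep-class and a dada-class object for the same sample.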
52 **uniques**: derep, dada, mergepairs (or a data frame with sequence and abundance columns) | |
53 | |
54 **mergepairs**: | |
55 | |
56 Each data.frame has a row for each unique pairing of forward/reverse denoised sequences, and the following columns: | |
57 • abundance: Number of reads corresponding to this forward/reverse combination. | |
58 • sequence: The merged sequence. | |
59 • forward: The index of the forward denoised sequence. | |
60 • reverse: The index of the reverse denoised sequence. | |
61 • nmatch: Number of matching nts in the overlap region. | |
62 • nmismatch: Number of mismatches in the overlap region. | |
63 • nindel: Number of indels in the overlap region. | |
64 • prefer: The sequence used for the overlap region. 1=forward; 2=reverse. | |
65 • accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise. | |
66 • ...: Additional columns specified in propagateCol | |
67 | |
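The accept column can be illustrated with a simplified, gap-free merge in Python. This is a sketch under stated assumptions and not the alignment mergePairs performs: `merge_ungapped` is a hypothetical function, the reverse read is assumed to be reverse-complemented already, nindel is always 0, the forward read is always preferred in the overlap, and the overlap length is measured by nmatch. The defaults mirror mergePairs (minOverlap = 12, maxMismatch = 0).

```python
def merge_ungapped(forward, reverse_rc, min_overlap=12, max_mismatch=0):
    """Merge two non-empty reads on their best ungapped overlap."""
    best = None
    longest = min(len(forward), len(reverse_rc))
    for overlap in range(longest, 0, -1):
        tail = forward[len(forward) - overlap:]
        head = reverse_rc[:overlap]
        nmismatch = sum(a != b for a, b in zip(tail, head))
        nmatch = overlap - nmismatch
        # Keep the overlap with the most matches; longer wins on ties.
        if best is None or nmatch > best["nmatch"]:
            best = {"overlap": overlap, "nmatch": nmatch,
                    "nmismatch": nmismatch}

    nindel = 0  # this sketch never opens gaps
    accept = (best["nmatch"] >= min_overlap
              and best["nmismatch"] + nindel <= max_mismatch)
    return {
        # Forward read is used for the overlap region (prefer = 1).
        "sequence": forward + reverse_rc[best["overlap"]:],
        "nmatch": best["nmatch"],
        "nmismatch": best["nmismatch"],
        "nindel": nindel,
        "prefer": 1,
        "accept": accept,
    }


merged = merge_ungapped("AAAACCCCGGGGTTTTACGTACGT",
                        "GGGGTTTTACGTACGTTTGGCCAA")
```

In this example the two reads share a 16 nt overlap without mismatches, so the pair is accepted.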
68 | |
69 | |
70 Tools: | |
71 ====== | |
72 | |
73 • Quality filtering | |
74 | |
75 filterAndTrim (fastq -> fastq) | |
76 | |
77 • Dereplication | |
78 | |
79 derepFastq (fastq -> derep-class object) | |
80 | |
81 • Learn error rates | |
82 | |
83 learnErrors + plotErrors | |
84 - in: list or vector of file names (or a list of derep-class objects; why? Learning should be done on the full data) | |
85 - out: named list with entries | |
86 - $err_out: A numeric matrix with the learned error rates. | |
87 - $err_in: The initialization error rates (unimportant). | |
88 - $trans: A feature table of observed transitions for each type (e.g. A->C) and quality score | |
89 | |
90 • Sample Inference (dada) | |
91 in: (list of) derep-class object | |
92 out: (list of) dada-class object | |
93 | |
94 • Chimera Removal | |
95 | |
96 removeBimeraDenovo | |
97 | |
98 in: A uniques-vector or any object that can be coerced into one with getUniques. | |
99 out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided | |
100 | |
101 • Merging of Paired Reads | |
102 | |
103 mergePairs | |
104 in: 2x dada-class object(s), 2x derep-class object(s) | |
105 out: A data.frame, or a list of data.frames. | |
106 - Each returned data.frame has a row for each unique pairing of forward/reverse denoised sequences | |
107 - columns: | |
108 - $abundance: Number of reads corresponding to this forward/reverse combination. | |
109 - $sequence: The merged sequence. | |
110 - $forward: The index of the forward denoised sequence. | |
111 - $reverse: The index of the reverse denoised sequence. | |
112 - $nmatch: Number of matching nts in the overlap region. | |
113 - $nmismatch: Number of mismatches in the overlap region. | |
114 - $nindel: Number of indels in the overlap region. | |
115 - $prefer: The sequence used for the overlap region. 1=forward; 2=reverse. | |
116 - $accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise. | |
117 - $...: Additional columns specified in propagateCol. | |
118 | |
119 • Taxonomic Classification (assignTaxonomy, assignSpecies) | |
120 | |
121 • Other | |
122 | |
123 makeSequenceTable | |
124 in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques | |
125 out: Named integer matrix (row for each sample, column for each unique sequence) | |
126 | |
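The shape of the makeSequenceTable output (one row per sample, one column per unique sequence) can be sketched in Python as follows. `make_sequence_table` is a hypothetical helper; columns appear in first-seen order here, which is a simplification and not necessarily the ordering dada2 uses.

```python
def make_sequence_table(samples):
    """Build a sample-by-sequence abundance table.

    samples maps a sample name to a uniques mapping {sequence: abundance}.
    Returns the column sequences and one row of counts per sample.
    """
    columns = []
    for uniques in samples.values():
        for seq in uniques:
            if seq not in columns:
                columns.append(seq)

    # Sequences missing from a sample get an abundance of 0.
    rows = {
        name: [uniques.get(seq, 0) for seq in columns]
        for name, uniques in samples.items()
    }
    return columns, rows


columns, rows = make_sequence_table({
    "sampleA": {"ACGT": 5, "TTGA": 2},
    "sampleB": {"TTGA": 7, "GGCC": 1},
})
```

Because every sample is first reduced to a uniques mapping, the same function works for anything that can be coerced into one.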
127 mergeSequenceTables | |
128 | |
129 uniquesToFasta | |
130 in: A uniques-vector or any object that can be coerced into one with getUniques. | |
131 | |
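Writing a uniques mapping as FASTA is straightforward, as the Python sketch below shows. `uniques_to_fasta` is a hypothetical stand-in for the R function, and the header layout `sq<N>;size=<abundance>;` is an assumption about the default identifiers and should be checked against the dada2 documentation.

```python
def uniques_to_fasta(uniques):
    """Return FASTA text for a {sequence: abundance} mapping."""
    lines = []
    for i, (seq, abundance) in enumerate(uniques.items(), start=1):
        # Assumed header layout: running id plus abundance annotation.
        lines.append(f">sq{i};size={abundance};")
        lines.append(seq)
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


fasta_text = uniques_to_fasta({"ACGT": 5, "TTGA": 2})
```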
132 getSequences | |
133 | |
134 extracts the sequences from several different data objects, including dada-class | |
135 and derep-class objects, as well as data.frame objects that have both $sequence and $abundance | |
136 columns. | |
137 | |
138 getUniques | |
139 | |
140 extracts the uniques-vector from several different data objects, including dada-class | |
141 and derep-class objects, as well as data.frame objects that have both $sequence and $abundance | |
142 columns. | |
143 | |
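The coercion getUniques performs can be mimicked in Python to show why one "uniques" input type could cover several upstream tools. The sketch below is an illustration only: `get_uniques` is a hypothetical function, derep-like and dada-like objects are modelled as dicts with a `uniques` or `denoised` entry, and a data.frame is modelled as a list of row dicts.

```python
def get_uniques(obj):
    """Coerce several object shapes into a {sequence: abundance} dict."""
    if isinstance(obj, dict):
        if "uniques" in obj:       # derep-like object
            return dict(obj["uniques"])
        if "denoised" in obj:      # dada-like object
            return dict(obj["denoised"])
        # Otherwise assume it already is a uniques mapping.
        return dict(obj)

    # Data-frame-like input: rows with sequence and abundance columns.
    uniques = {}
    for row in obj:
        seq = row["sequence"]
        uniques[seq] = uniques.get(seq, 0) + row["abundance"]
    return uniques


from_derep = get_uniques({"uniques": {"ACGT": 5}, "map": [0] * 5})
from_rows = get_uniques([{"sequence": "ACGT", "abundance": 3},
                         {"sequence": "TTGA", "abundance": 1}])
```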
144 plotQualityProfile | |
145 | |
146 seqComplexity | |
147 | |
148 setDadaOpt(...) |