diff notes.txt @ 0:44230c777694 draft
planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty
author: matthias
date: Fri, 08 Mar 2019 06:39:56 -0500
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notes.txt	Fri Mar 08 06:39:56 2019 -0500
@@ -0,0 +1,148 @@
+TODO
+====
+
+
+
+If we make a monolithic tool: 
+
+* implement sanity checks between important compute-intensive steps (user-definable criteria, abort if violated)
+
+If we keep separate tools: 
+
+* make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
+* alternatively, the data set types could be derived from tabular and the Rdata could be attached via
+  `.extra_files_path`; this way the user would have some intermediate output to inspect.
+
+
+In both cases: 
+
+* allow input of single-end data, single pair, single pair in separate data sets, ...
+* add mergePairsByID functionality to mergePairs tool
+
+
+Datatypes:
+==========
+
+**derep-class**: list with 3 members
+• uniques: Named integer vector. Named by the unique sequence, valued by abundance.
+• quals: Numeric matrix of average quality scores by position for each unique. Uniques are
+rows, positions are cols.
+• map: Integer vector with one entry per read; each value is the index (in uniques) of the
+unique to which that read was assigned.
+
+**learnErrorsOutput**: A named list with three entries
+• err_out: A numeric matrix with the learned error rates. 
+• err_in: The initialization error rates (unimportant). 
+• trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
+
+**dada-class**: A multi-item List with the following named values...
+• denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences.
+• clustering: An informative data.frame containing information on each cluster.
+• sequence: A character vector of each denoised sequence. Identical to names(denoised).
+• quality: The average quality scores for each cluster (row) by position (col).
+• map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised).
+• birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
+• trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col)
+observed in the final output of the dada algorithm.
+• err_in: The err matrix used for this invocation of dada.
+• err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
+• opts: A list of the dada_opts used for this invocation of dada.
+• call: The function call used for this invocation of dada.
+
+**uniques**: derep, dada, mergepairs (or data frame with sequence and abundance columns)
+
+**mergepairs**:
+
+The data.frame(s) have one row for each unique pairing of forward/reverse denoised sequences, and the following columns:
+• abundance: Number of reads corresponding to this forward/reverse combination.
+• sequence: The merged sequence.
+• forward: The index of the forward denoised sequence.
+• reverse: The index of the reverse denoised sequence.
+• nmatch: Number of matching nucleotides in the overlap region.
+• nmismatch: Number of mismatches in the overlap region.
+• nindel: Number of indels in the overlap region.
+• prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
+• accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
+• ...: Additional columns specified in propagateCol
+
+
+
+Tools: 
+======
+
+• Quality filtering 
+  
+  filterAndTrim (fastq -> fastq) 
+
+• Dereplication 
+
+  derepFastq (fastq -> derep-class object)
+
+• Learn error rates 
+
+  learnErrors + plotErrors
+    - in: list or vector of file names (or a list of derep-class objects; WHY? learning should be done on the full data) 
+    - out: named list with entries 
+      - \$err\_out: A numeric matrix with the learned error rates. 
+      - \$err\_in: The initialization error rates (unimportant). 
+      - \$trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
+
+• Sample Inference (dada)
+   in: (list of) derep-class object
+   out: (list of) dada-class object 
+
+• Chimera Removal 
+
+  removeBimeraDenovo
+
+  in: A uniques-vector or any object that can be coerced into one with getUniques.
+  out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided
+
+• Merging of Paired Reads 
+
+  mergePairs
+   in: 2x dada-class object(s), 2x derep-class object(s)
+   out: A data.frame, or a list of data.frames.
+     - Each returned data.frame has one row for each unique pairing of forward/reverse denoised sequences.
+     - columns:
+       - \$abundance: Number of reads corresponding to this forward/reverse combination.
+       - \$sequence: The merged sequence.
+       - \$forward: The index of the forward denoised sequence.
+       - \$reverse: The index of the reverse denoised sequence.
+       - \$nmatch: Number of matching nucleotides in the overlap region.
+       - \$nmismatch: Number of mismatches in the overlap region.
+       - \$nindel: Number of indels in the overlap region.
+       - \$prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
+       - \$accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
+       - \$...: Additional columns specified in propagateCol.
+
+• Taxonomic Classification (assignTaxonomy, assignSpecies)
+
+• Other 
+
+  makeSequenceTable
+   in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques.
+   out: Named integer matrix (row for each sample, column for each unique sequence).
+
+  mergeSequenceTables
+
+  uniquesToFasta
+  in: A uniques-vector or any object that can be coerced into one with getUniques.
+  
+  getSequences
+
+  extracts the sequences from several different data objects, including dada-class
+  and derep-class objects, as well as data.frame objects that have both \$sequence and \$abundance
+  columns.
+
+  getUniques
+
+  extracts the uniques-vector from several different data objects, including dada-class
+  and derep-class objects, as well as data.frame objects that have both \$sequence and \$abundance
+  columns
+
+  plotQualityProfile
+
+  seqComplexity
+
+  setDadaOpt(...)
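
The derep-class layout described under Datatypes (uniques, quals, map) can be illustrated with a small sketch. The Python below is not dada2 code (dada2 is an R package, and derepFastq is the function that builds these objects there). The function name `dereplicate`, the 0-based `map` indices, the ordering by descending abundance, and the per-row lists standing in for the quality matrix are all assumptions made for illustration only.

```python
from collections import defaultdict


def dereplicate(reads):
    """Collapse identical reads into a derep-like structure.

    `reads` is a list of (sequence, qualities) pairs, where `qualities`
    holds one numeric score per base. Hypothetical helper, not dada2 API.
    """
    counts = defaultdict(int)   # sequence -> abundance
    qual_sums = {}              # sequence -> per-position quality sums
    for seq, quals in reads:
        counts[seq] += 1
        if seq not in qual_sums:
            qual_sums[seq] = [0.0] * len(seq)
        for i, q in enumerate(quals):
            qual_sums[seq][i] += q

    # Assumption: most abundant unique first; ties keep first-seen order
    # (sorted() is stable and dicts preserve insertion order).
    ordered = sorted(counts, key=lambda s: -counts[s])
    index = {s: i for i, s in enumerate(ordered)}

    return {
        # uniques: named by sequence, valued by abundance
        "uniques": {s: counts[s] for s in ordered},
        # quals: one row per unique, average quality per position
        "quals": [[t / counts[s] for t in qual_sums[s]] for s in ordered],
        # map: for each read, the (0-based) index of its unique
        "map": [index[seq] for seq, _ in reads],
    }


reads = [
    ("ACGT", [30, 30, 20, 10]),
    ("ACGA", [40, 40, 40, 40]),
    ("ACGT", [40, 30, 30, 20]),
]
derep = dereplicate(reads)
```

Here the first and third reads collapse into one unique with abundance 2, so `map` points both of them at row 0 of `quals`, and that row holds their position-wise average.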