view notes.txt @ 0:de5c51e1c190 draft

planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty
author matthias
date Fri, 08 Mar 2019 06:35:24 -0500

TODO
====



If we make a monolithic tool: 

* implement sanity checks between important compute-intensive steps (user-definable criteria, abort if violated)
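
A minimal sketch of such a check, in plain Python rather than the tool's own R code. Everything here is hypothetical and only illustrates the idea: `Criterion`, `check_step`, `SanityCheckError`, and the statistics keys `reads_in` / `reads_out` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class SanityCheckError(RuntimeError):
    """Raised when a user-defined criterion is violated."""


@dataclass
class Criterion:
    # Hypothetical structure: a label plus a predicate over step statistics.
    name: str
    predicate: Callable[[Dict[str, float]], bool]


def check_step(step: str, stats: Dict[str, float], criteria: List[Criterion]) -> None:
    """Abort (raise) if any criterion fails for the statistics of a finished step."""
    failed = [c.name for c in criteria if not c.predicate(stats)]
    if failed:
        raise SanityCheckError(f"{step}: violated {', '.join(failed)}")


# Example criterion: filtering must keep at least half of the reads.
keep_half = Criterion(
    name="reads_out / reads_in >= 0.5",
    predicate=lambda s: s["reads_out"] / s["reads_in"] >= 0.5,
)
```

Calling `check_step("filterAndTrim", {"reads_in": 1000, "reads_out": 100}, [keep_half])` would raise before the next expensive step starts.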

If we keep separate tools: 

* make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
* alternatively, the data set types could be derived from tabular and the Rdata attached via
  `.extra_files_path`; this way the user would have some intermediate output to look at.


In both cases: 

* allow input of single-end data, single pair, single pair in separate data sets, ...
* add mergePairsByID functionality to mergePairs tool


Datatypes:
==========

**derep-class**: list with 3 members
• uniques: Named integer vector. Named by the unique sequence, valued by abundance.
• quals: Numeric matrix of average quality scores by position for each unique. Uniques are
rows, positions are cols.
• map: Integer vector of length the number of reads, and value the index (in uniques) of the
unique to which that read was assigned.
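
To make the three members concrete, here is a pure-Python sketch of dereplication. It is not the dada2 implementation: `dereplicate` is a hypothetical helper, indices are 0-based (R would use 1-based), ordering uniques by decreasing abundance is an assumption of this sketch, and the quality "matrix" is a plain list of rows.

```python
from typing import Dict, List, Tuple


def dereplicate(reads: List[Tuple[str, List[int]]]) -> dict:
    """Collapse identical reads into uniques, mean qualities, and a read-to-unique map.

    `reads` is a list of (sequence, per-position quality scores) pairs.
    """
    counts: Dict[str, int] = {}
    qual_sums: Dict[str, List[int]] = {}
    for seq, scores in reads:
        counts[seq] = counts.get(seq, 0) + 1
        sums = qual_sums.setdefault(seq, [0] * len(seq))
        for i, q in enumerate(scores):
            sums[i] += q

    # Order uniques by decreasing abundance; ties keep first-seen order.
    order = sorted(counts, key=lambda s: -counts[s])
    index = {seq: i for i, seq in enumerate(order)}

    uniques = {seq: counts[seq] for seq in order}            # sequence -> abundance
    quals = [[total / counts[seq] for total in qual_sums[seq]] for seq in order]
    read_map = [index[seq] for seq, _ in reads]              # read -> index in uniques
    return {"uniques": uniques, "quals": quals, "map": read_map}


reads = [
    ("ACGT", [30, 30, 20, 10]),
    ("ACGA", [40, 40, 40, 40]),
    ("ACGT", [10, 30, 40, 30]),
]
derep = dereplicate(reads)
```

Here `derep["uniques"]` holds two entries (ACGT seen twice, ACGA once) and `derep["map"]` has one entry per input read.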

**learnErrorsOutput**: A named list with three entries
• err_out: A numeric matrix with the learned error rates.
• err_in: The initialization error rates (unimportant).
• trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.

**dada-class**: A multi-item List with the following named values...
• denoised: Integer vector, named by sequence and valued by abundance, of the denoised sequences.
• clustering: An informative data.frame containing information on each cluster.
• sequence: A character vector of each denoised sequence. Identical to names(denoised).
• quality: The average quality scores for each cluster (row) by position (col).
• map: Integer vector that maps each unique (index into derep$uniques) to its denoised sequence (index into dada$denoised).
• birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
• trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col)
observed in the final output of the dada algorithm.
• err_in: The err matrix used for this invocation of dada.
• err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
• opts: A list of the dada_opts used for this invocation of dada.
• call: The function call used for this invocation of dada.
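
The two `map` vectors chain together: the derep map sends each read to a unique, and the dada map sends each unique to a denoised sequence. A small pure-Python sketch of that composition follows; the function names are hypothetical, indices are 0-based, and every unique is assumed to have an assignment.

```python
from typing import List


def reads_to_denoised(derep_map: List[int], dada_map: List[int]) -> List[int]:
    """Compose read -> unique with unique -> denoised sequence."""
    return [dada_map[unique] for unique in derep_map]


def denoised_abundance(derep_map: List[int], dada_map: List[int], n_denoised: int) -> List[int]:
    """Count how many reads end up in each denoised sequence."""
    counts = [0] * n_denoised
    for denoised in reads_to_denoised(derep_map, dada_map):
        counts[denoised] += 1
    return counts


derep_map = [0, 1, 0, 2, 1]   # five reads, three uniques
dada_map = [0, 0, 1]          # uniques 0 and 1 collapse into denoised sequence 0
per_read = reads_to_denoised(derep_map, dada_map)
abundance = denoised_abundance(derep_map, dada_map, n_denoised=2)
```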

**uniques**: derep, dada, or mergepairs object (or a data frame with sequence and abundance columns)

**mergepairs**:

The data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences, and the following columns:
• abundance: Number of reads corresponding to this forward/reverse combination.
• sequence: The merged sequence.
• forward: The index of the forward denoised sequence.
• reverse: The index of the reverse denoised sequence.
• nmatch: Number of matching nucleotides in the overlap region.
• nmismatch: Number of mismatches in the overlap region.
• nindel: Number of indels in the overlap region.
• prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
• accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
• ...: Additional columns specified in propagateCol
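
The `accept` column can be read as a simple predicate over the overlap counts. The sketch below is pure Python, not dada2 code. It assumes that overlap length is measured by `nmatch` and that "differences" means `nmismatch + nindel`; the helper names and the default thresholds are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    nmatch: int
    nmismatch: int
    nindel: int


def summarize_overlap(fwd: str, rev: str) -> Overlap:
    """Count matches, mismatches and indels over two aligned overlap strings.

    Both strings must have equal length; '-' marks a gap.
    """
    if len(fwd) != len(rev):
        raise ValueError("aligned overlap strings must have equal length")
    nmatch = nmismatch = nindel = 0
    for a, b in zip(fwd, rev):
        if a == "-" or b == "-":
            nindel += 1
        elif a == b:
            nmatch += 1
        else:
            nmismatch += 1
    return Overlap(nmatch, nmismatch, nindel)


def accept(o: Overlap, min_overlap: int = 12, max_mismatch: int = 0) -> bool:
    """Assumed rule: enough matching positions and few enough differences."""
    return o.nmatch >= min_overlap and (o.nmismatch + o.nindel) <= max_mismatch


perfect = summarize_overlap("ACGTACGTACGTAC", "ACGTACGTACGTAC")
one_off = summarize_overlap("ACGTACGTACGTAC", "ACGTACGTACGTAA")
```

With these inputs `accept(perfect)` is true, while `accept(one_off)` only becomes true once `max_mismatch` is raised to 1.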



Tools: 
======

• Quality filtering

  filterAndTrim (fastq -> fastq)

• Dereplication 

  derepFastq (fastq -> derep-class object)

• Learn error rates 

  learnErrors + plotErrors
    - in: list or vector of file names (or a list of derep-class objects; why? learning should be done on the full data)
    - out: named list with entries
      - $err_out: A numeric matrix with the learned error rates.
      - $err_in: The initialization error rates (unimportant).
      - $trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.

• Sample Inference (dada)
   in: (list of) derep-class object
   out: (list of) dada-class object 

• Chimera Removal 

  removeBimeraDenovo

  in: A uniques-vector or any object that can be coerced into one with getUniques.
  out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided

• Merging of Paired Reads 

  mergePairs
   in: 2x dada-class object(s), 2x derep-class object(s)
   out: A data.frame, or a list of data.frames.
     - The returned data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences.
     - cols
       - $abundance: Number of reads corresponding to this forward/reverse combination.
       - $sequence: The merged sequence.
       - $forward: The index of the forward denoised sequence.
       - $reverse: The index of the reverse denoised sequence.
       - $nmatch: Number of matching nucleotides in the overlap region.
       - $nmismatch: Number of mismatches in the overlap region.
       - $nindel: Number of indels in the overlap region.
       - $prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
       - $accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
       - $...: Additional columns specified in propagateCol.

• Taxonomic Classification (assignTaxonomy, assignSpecies)

• Other

  makeSequenceTable
   in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques.
   out: Named integer matrix (row for each sample, column for each unique sequence).

  mergeSequenceTables

  uniquesToFasta
  in: A uniques-vector or any object that can be coerced into one with getUniques.
  
  getSequences

  extracts the sequences from several different data objects, including dada-class
  and derep-class objects, as well as data.frame objects that have both $sequence and
  $abundance columns.

  getUniques

  extracts the uniques-vector from several different data objects, including dada-class
  and derep-class objects, as well as data.frame objects that have both $sequence and $abundance
  columns.

  plotQualityProfile

  seqComplexity

  setDadaOpt(...)
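

As an illustration of the makeSequenceTable output shape noted above (row for each sample, column for each unique sequence), here is a pure-Python sketch. It is not the dada2 function: `make_sequence_table` is a hypothetical helper, each sample is assumed to be already reduced to a sequence -> abundance mapping (what getUniques would provide), and the column order is simply first-seen order.

```python
from typing import Dict, List, Tuple


def make_sequence_table(
    samples: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a sample-by-sequence count matrix from per-sample uniques."""
    sample_names = list(samples)
    # Collect every sequence seen in any sample, keeping first-seen order.
    sequences = list(dict.fromkeys(seq for uniq in samples.values() for seq in uniq))
    # Sequences absent from a sample get a count of 0.
    table = [[samples[name].get(seq, 0) for seq in sequences] for name in sample_names]
    return sample_names, sequences, table


samples = {
    "s1": {"ACGT": 5, "ACGA": 2},
    "s2": {"ACGA": 3, "TTTT": 1},
}
rows, cols, table = make_sequence_table(samples)
```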