
TODO
====


If we make a monolithic tool:

* implement sanity checks between important compute-intensive steps (user-definable criteria, abort if violated)
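
A minimal sketch of such a check in base R, assuming the criterion is the fraction of reads retained by a step. The helper name `check_retained`, its `min_fraction` parameter, and the read counts are hypothetical; none of them is part of dada2 or of this repository.

```r
# Hypothetical helper: abort when a step retains too few reads.
# `min_fraction` stands in for a user-definable criterion.
check_retained <- function(reads_in, reads_out, min_fraction = 0.5, step = "step") {
  stopifnot(is.numeric(reads_in), is.numeric(reads_out), reads_in > 0)
  fraction <- reads_out / reads_in
  if (fraction < min_fraction) {
    stop(sprintf("%s: only %.1f%% of reads retained (minimum %.1f%%)",
                 step, 100 * fraction, 100 * min_fraction), call. = FALSE)
  }
  invisible(fraction)
}

# Made-up read counts: the first check passes silently.
check_retained(10000, 8200, min_fraction = 0.5, step = "filterAndTrim")

# The second check violates the criterion; the error is caught here only
# so that the example keeps running and the message can be inspected.
result <- tryCatch(
  check_retained(10000, 1200, min_fraction = 0.5, step = "dada"),
  error = function(e) conditionMessage(e)
)
print(result)
```

In a monolithic tool, a call like this could sit between two expensive steps so that a failed criterion stops the run before the next step starts.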

If we keep separate tools:

* make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
* alternatively, the data set types could be derived from tabular and the Rdata could be attached via
  `.extra_files_path`; this way the user would have some intermediate output to look at.


In both cases:

* allow input of single end data, single pair, single pair in separate data sets, ...
* add mergePairsByID functionality to mergePairs tool


Datatypes:
==========

**derep-class**: list with 3 members
- uniques: Named integer vector. Named by the unique sequence, valued by abundance.
- quals: Numeric matrix of average quality scores by position for each unique. Uniques are
  rows, positions are cols.
- map: Integer vector of length the number of reads, and value the index (in uniques) of the
  unique to which that read was assigned.
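
The layout of the three members can be illustrated with a mock built in base R. This is only a sketch: the reads and quality scores are made up, the result is a plain list rather than a real derep-class object, and in practice the object would come from `derepFastq`.

```r
# Made-up reads and per-read quality scores (one row per read, one column per position).
reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")
read_quals <- matrix(rep(c(38, 36, 30, 25), times = length(reads)),
                     nrow = length(reads), byrow = TRUE)

# uniques: named integer vector, named by sequence, valued by abundance.
counts <- sort(table(reads), decreasing = TRUE)
uniques <- setNames(as.integer(counts), names(counts))

# map: for each read, the index (in uniques) of its unique sequence.
map <- match(reads, names(uniques))

# quals: average quality per unique (rows) and position (cols).
quals <- rowsum(read_quals, group = map) / as.vector(uniques)
rownames(quals) <- names(uniques)  # row names added for readability in this mock

derep_like <- list(uniques = uniques, quals = quals, map = map)
str(derep_like)
```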

**learnErrorsOutput**: A named list with three entries
- err_out: A numeric matrix with the learned error rates.
- err_in: The initialization error rates (unimportant).
- trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.

**dada-class**: A multi-item List with the following named values...
• denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences.
• clustering: An informative data.frame containing information on each cluster.
• sequence: A character vector of each denoised sequence. Identical to names(denoised).
• quality: The average quality scores for each cluster (row) by position (col).
• map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised).
• birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
• trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col)
observed in the final output of the dada algorithm.
• err_in: The err matrix used for this invocation of dada.
• err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
• opts: A list of the dada_opts used for this invocation of dada.
• call: The function call used for this invocation of dada.

**uniques**: can be extracted from derep, dada, and mergepairs objects (or from a data frame with sequence and abundance columns)
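
For the data frame case, the coercion amounts to naming the abundances by their sequences. The sketch below shows this in base R; `to_uniques` is a hypothetical helper for illustration and the input values are made up. In dada2 itself, `getUniques` is the function that performs this extraction.

```r
# Hypothetical helper: build a uniques-vector (named integer vector,
# named by sequence, valued by abundance) from a data frame.
to_uniques <- function(df) {
  stopifnot(is.data.frame(df),
            all(c("sequence", "abundance") %in% names(df)))
  setNames(as.integer(df$abundance), as.character(df$sequence))
}

# Made-up input with the two required columns.
merged <- data.frame(sequence = c("ACGTAC", "TTGACC"),
                     abundance = c(120, 45),
                     stringsAsFactors = FALSE)

uniques <- to_uniques(merged)
print(uniques)
```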

**mergepairs**:

The data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences, and the following columns:
• abundance: Number of reads corresponding to this forward/reverse combination.
• sequence: The merged sequence.
• forward: The index of the forward denoised sequence.
• reverse: The index of the reverse denoised sequence.
• nmatch: Number of matching nts in the overlap region.
• nmismatch: Number of mismatches in the overlap region.
• nindel: Number of indels in the overlap region.
• prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
• accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
• ...: Additional columns specified in propagateCol
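
A mock of this table in base R can make the column layout concrete. All values below are made up for illustration and were not produced by `mergePairs`; the last lines sketch one possible use, keeping only accepted pairings and reporting the share of reads they represent.

```r
# Made-up mergepairs-like table with the columns listed above.
mergers <- data.frame(
  abundance = c(120L, 45L, 7L),
  sequence  = c("ACGTACGT", "TTGACCAA", "GGCCGGCC"),
  forward   = c(1L, 2L, 3L),
  reverse   = c(1L, 2L, 1L),
  nmatch    = c(20L, 19L, 5L),
  nmismatch = c(0L, 0L, 3L),
  nindel    = c(0L, 0L, 0L),
  prefer    = c(1L, 2L, 1L),
  accept    = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the pairings flagged as accepted.
accepted <- mergers[mergers$accept, ]

# Fraction of reads that belong to accepted pairings.
accepted_fraction <- sum(accepted$abundance) / sum(mergers$abundance)
print(accepted_fraction)
```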


Tools:
======

• Quality filtering

  filterAndTrim (fastq -> fastq)

• Dereplication

  derepFastq (fastq -> derep-class object)

• Learn error rates

  learnErrors + plotErrors
    - in: list or vector of file names (or a list of derep-class objects; WHY? Learning should be done on the full data)
    - out: named list w entries
      - \$err\_out: A numeric matrix with the learned error rates.
      - \$err\_in: The initialization error rates (unimportant).
      - \$trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.

• Sample Inference (dada)
   in: (list of) derep-class object
   out: (list of) dada-class object

• Chimera Removal

  removeBimeraDenovo

  in: A uniques-vector or any object that can be coerced into one with getUniques.
  out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided

• Merging of Paired Reads

  mergePairs
   in: 2x dada-class object(s), 2x derep-class object(s)
   out: A data.frame, or a list of data.frames.
     - The returned data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences.
     - cols
       - \$abundance: Number of reads corresponding to this forward/reverse combination.
       - \$sequence: The merged sequence.
       - \$forward: The index of the forward denoised sequence.
       - \$reverse: The index of the reverse denoised sequence.
       - \$nmatch: Number of matching nts in the overlap region.
       - \$nmismatch: Number of mismatches in the overlap region.
       - \$nindel: Number of indels in the overlap region.
       - \$prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
       - \$accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
       - \$...: Additional columns specified in propagateCol.

• Taxonomic Classification (assignTaxonomy, assignSpecies)

• Other

  makeSequenceTable
   in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques.
   out: Named integer matrix (row for each sample, column for each unique sequence)

  mergeSequenceTables

  uniquesToFasta
  in: A uniques-vector or any object that can be coerced into one with getUniques.

  getSequences

  extracts the sequences from several different data objects, including dada-class
  and derep-class objects, as well as data.frame objects that have both \$sequence and \$abundance
  columns.

  getUniques

  extracts the uniques-vector from several different data objects, including dada-class
  and derep-class objects, as well as data.frame objects that have both \$sequence and \$abundance
  columns

  plotQualityProfile

  seqComplexity

  setDadaOpt(...)
author	matthias
date	Tue, 09 Apr 2019 07:09:26 -0400
parents	11993afc394e
children