dada2_learnerrors: notes.txt comparison

comparison notes.txt @ 0:56d5be6c03b9 draft

planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty

author	matthias
date	Fri, 08 Mar 2019 06:30:11 -0500


--1:000000000000
+:56d5be6c03b9
+TODO
+====
+If we make a monolithic tool:
+* implement sanity checks between important compute intensive steps (user definable criteria, abort if violated)
+If we keep separate tools:
+* make the Rdata datatypes specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
+* alternatively, the dataset types could be derived from tabular, with the Rdata attached via
+`.extra_files_path`; this way the user would have some intermediate output to look at.
+In both cases:
+* allow input of single end data, single pair, single pair in separate data sets, ...
+* add mergePairsByID functionality to mergePairs tool
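The sanity-check idea for a monolithic tool could look roughly like the following. This is a minimal sketch in Python rather than R, assuming a single user-defined criterion (a minimum fraction of reads retained by a step). The names `check_step`, `SanityCheckError` and `min_retained_fraction` are hypothetical and are not part of dada2 or Galaxy.

```python
class SanityCheckError(RuntimeError):
    """Raised when a user-defined criterion is violated (hypothetical name)."""


def check_step(step_name, reads_in, reads_out, min_retained_fraction):
    """Abort the pipeline if a step kept too few reads.

    min_retained_fraction is the user-definable criterion (assumption:
    one threshold per step, expressed as a fraction between 0 and 1).
    """
    retained = reads_out / reads_in if reads_in else 0.0
    if retained < min_retained_fraction:
        raise SanityCheckError(
            f"{step_name}: only {retained:.1%} of reads retained, "
            f"required at least {min_retained_fraction:.1%}"
        )
    return retained


# Example: 900 of 1000 reads survive filtering, threshold is 50 %.
retained = check_step("filterAndTrim", reads_in=1000, reads_out=900,
                      min_retained_fraction=0.5)
```

Such a check would run between the compute-intensive steps, so that a violated criterion stops the run before the next expensive step starts.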
+Datatypes:
+==========
+**derep-class**: list with 3 members
+• uniques: Named integer vector. Named by the unique sequence, valued by abundance.
+• quals: Numeric matrix of average quality scores by position for each unique. Uniques are
+rows, positions are cols.
+• map: Integer vector with one entry per read, whose value is the index (in uniques) of the
+unique to which that read was assigned.
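To make the three members of the derep-class concrete, here is a small illustration in Python (not the dada2 implementation). It assumes reads arrive as (sequence, quality scores) pairs, orders uniques by decreasing abundance, and uses 0-based indices in `map`, whereas R indices are 1-based. The function name `dereplicate` is hypothetical.

```python
from collections import Counter


def dereplicate(reads):
    """reads: list of (sequence, [quality score per position]) tuples."""
    counts = Counter(seq for seq, _ in reads)
    # Order uniques by decreasing abundance (assumption), ties by sequence.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    uniques = {seq: counts[seq] for seq in ordered}
    index = {seq: i for i, seq in enumerate(ordered)}

    # map: for every read, the index of the unique it was assigned to.
    read_map = [index[seq] for seq, _ in reads]

    # quals: per unique (row), the average quality at each position (col).
    sums = [[0.0] * len(seq) for seq in ordered]
    for seq, quality in reads:
        row = sums[index[seq]]
        for pos, score in enumerate(quality):
            row[pos] += score
    quals = [[total / uniques[seq] for total in sums[index[seq]]]
             for seq in ordered]

    return {"uniques": uniques, "quals": quals, "map": read_map}


reads = [
    ("ACGT", [30, 30, 20, 10]),
    ("ACGA", [40, 40, 40, 40]),
    ("ACGT", [20, 30, 40, 30]),
]
derep = dereplicate(reads)
```

In this sketch the rows of `quals` may differ in length when sequences differ in length; a real matrix would need padding for the shorter rows.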
+**learnErrorsOutput**: A named list with three entries
+- err_out: A numeric matrix with the learned error rates.
+- err_in: The initialization error rates (unimportant).
+- trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
+**dada-class**: A multi-item List with the following named values...
+• denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences.
+• clustering: An informative data.frame containing information on each cluster.
+• sequence: A character vector of each denoised sequence. Identical to names(denoised).
+• quality: The average quality scores for each cluster (row) by position (col).
+• map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised).
+• birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
+• trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col)
+observed in the final output of the dada algorithm.
+• err_in: The err matrix used for this invocation of dada.
+• err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
+• opts: A list of the dada_opts used for this invocation of dada.
+• call: The function call used for this invocation of dada.
+**uniques**: derep, dada, mergepairs (or data frame with sequence and abundance columns)
+**mergepairs**:
+The data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences, and the following columns:
+• abundance: Number of reads corresponding to this forward/reverse combination.
+• sequence: The merged sequence.
+• forward: The index of the forward denoised sequence.
+• reverse: The index of the reverse denoised sequence.
+• nmatch: Number of matching nts in the overlap region.
+• nmismatch: Number of mismatches in the overlap region.
+• nindel: Number of indels in the overlap region.
+• prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
+• accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
+• ...: Additional columns specified in propagateCol
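The `accept` column described above can be read as a simple predicate over the other columns. The sketch below is in Python and rests on two assumptions not stated in these notes: the overlap is measured by `nmatch`, and "differences" means `nmismatch + nindel`. The function name `accept_pair` and the default values are illustrative.

```python
def accept_pair(nmatch, nmismatch, nindel, min_overlap=12, max_mismatch=0):
    """Return True if a forward/reverse pairing would be accepted.

    Assumptions: overlap is measured by nmatch, and differences are
    counted as mismatches plus indels.
    """
    long_enough = nmatch >= min_overlap
    few_differences = (nmismatch + nindel) <= max_mismatch
    return long_enough and few_differences


# One row of a mergepairs-like table, using the column names listed above.
row = {"abundance": 42, "nmatch": 20, "nmismatch": 0, "nindel": 0}
row["accept"] = accept_pair(row["nmatch"], row["nmismatch"], row["nindel"])
```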
+Tools:
+======
+• Quality filtering
+filterAndTrim (fastq -> fastq)
+• Dereplication
+derepFastq (fastq -> derep-class object)
+• Learn error rates
+learnErrors + plotErrors
+- in: list, or vector, of file names (or a list of derep-class objects; WHY? learning should be done on the full data)
+- out: named list with entries
+- $err_out: A numeric matrix with the learned error rates.
+- $err_in: The initialization error rates (unimportant).
+- $trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
+• Sample Inference (dada)
+in: (list of) derep-class object
+out: (list of) dada-class object
+• Chimera Removal
+removeBimeraDenovo
+in: A uniques-vector or any object that can be coerced into one with getUniques.
+out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided
+• Merging of Paired Reads
+mergePairs
+in: 2x dada-class object(s), 2x derep-class object(s)
+out: A data.frame, or a list of data.frames.
+- The returned data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences,
+- columns:
+- $abundance: Number of reads corresponding to this forward/reverse combination.
+- $sequence: The merged sequence.
+- $forward: The index of the forward denoised sequence.
+- $reverse: The index of the reverse denoised sequence.
+- $nmatch: Number of matching nts in the overlap region.
+- $nmismatch: Number of mismatches in the overlap region.
+- $nindel: Number of indels in the overlap region.
+- $prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
+- $accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
+- $...: Additional columns specified in propagateCol.
+• Taxonomic Classification (assignTaxonomy, assignSpecies)
+• Other
+makeSequenceTable
+in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques
+out: Named integer matrix (row for each sample, column for each unique sequence)
+mergeSequenceTables
+uniquesToFasta
+in: A uniques-vector or any object that can be coerced into one with getUniques.
+getSequences
+extracts the sequences from several different data objects, including dada-class
+and derep-class objects, as well as data.frame objects that have both $sequence and $abundance
+columns.
+getUniques
+extracts the uniques-vector from several different data objects, including dada-class
+and derep-class objects, as well as data.frame objects that have both $sequence and $abundance
+columns.
+plotQualityProfile
+seqComplexity
+setDadaOpt(...)
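The relationship between getUniques and makeSequenceTable noted above can be sketched as follows. This is a Python illustration of the data shapes only, not the dada2 code. It assumes dict-based stand-ins for dada-class, derep-class and data.frame objects, and orders columns by total abundance. The names `get_uniques` and `make_sequence_table` are hypothetical.

```python
def get_uniques(obj):
    """Coerce a supported object into a {sequence: abundance} mapping."""
    if isinstance(obj, dict) and "denoised" in obj:   # dada-class stand-in
        return dict(obj["denoised"])
    if isinstance(obj, dict) and "uniques" in obj:    # derep-class stand-in
        return dict(obj["uniques"])
    if isinstance(obj, list):                         # data.frame stand-in
        totals = {}
        for row in obj:
            seq = row["sequence"]
            totals[seq] = totals.get(seq, 0) + row["abundance"]
        return totals
    if isinstance(obj, dict):                         # already a uniques vector
        return dict(obj)
    raise TypeError(f"cannot derive uniques from {type(obj).__name__}")


def make_sequence_table(samples):
    """samples: {sample name: object}. Returns (row names, column names, matrix)."""
    uniques = {name: get_uniques(obj) for name, obj in samples.items()}
    totals = {}
    for per_sample in uniques.values():
        for seq, count in per_sample.items():
            totals[seq] = totals.get(seq, 0) + count
    # Columns ordered by decreasing total abundance (assumption), ties by sequence.
    columns = sorted(totals, key=lambda s: (-totals[s], s))
    rows = list(uniques)
    matrix = [[uniques[name].get(seq, 0) for seq in columns] for name in rows]
    return rows, columns, matrix


samples = {
    "s1": {"uniques": {"ACGT": 5, "ACGA": 1}},
    "s2": [{"sequence": "ACGA", "abundance": 7},
           {"sequence": "TTTT", "abundance": 2}],
}
rows, columns, matrix = make_sequence_table(samples)
```

Each row of `matrix` is one sample and each column one unique sequence, matching the "row for each sample, column for each unique sequence" layout of the makeSequenceTable output.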
