diff notes.txt @ 0:44230c777694 draft
planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty
| field | value |
|---|---|
| author | matthias |
| date | Fri, 08 Mar 2019 06:39:56 -0500 |
| parents | |
| children | |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/notes.txt Fri Mar 08 06:39:56 2019 -0500
@@ -0,0 +1,148 @@
+TODO
+====
+
+
+
+If we make a monolithic tool:
+
+* implement sanity checks between important compute-intensive steps (user-definable criteria, abort if violated)
+
+If we keep separate tools:
+
+* make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
+* alternatively, the data set types could be derived from tabular and the Rdata could be attached via
+  `.extra_files_path`; this way the user would have some intermediate output to look at.
+
+
+In both cases:
+
+* allow input of single end data, single pair, single pair in separate data sets, ...
+* add mergePairsByID functionality to mergePairs tool
+
+
+Datatypes:
+==========
+
+**derep-class**: list with 3 members
+- uniques: Named integer vector. Named by the unique sequence, valued by abundance.
+- quals: Numeric matrix of average quality scores by position for each unique. Uniques are
+  rows, positions are cols.
+- map: Integer vector of length the number of reads, and value the index (in uniques) of the
+  unique to which that read was assigned.
+
+**learnErrorsOutput**: A named list with three entries
+- err_out: A numeric matrix with the learned error rates.
+- err_in: The initialization error rates (unimportant).
+- trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
+
+**dada-class**: A multi-item list with the following named values...
+• denoised: Integer vector, named by sequence, valued by abundance, of the denoised sequences.
+• clustering: An informative data.frame containing information on each cluster.
+• sequence: A character vector of each denoised sequence. Identical to names(denoised).
+• quality: The average quality scores for each cluster (row) by position (col).
+• map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised).
+• birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
+• trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col)
+  observed in the final output of the dada algorithm.
+• err_in: The err matrix used for this invocation of dada.
+• err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
+• opts: A list of the dada_opts used for this invocation of dada.
+• call: The function call used for this invocation of dada.
+
+**uniques**: derep, dada, mergepairs (or data frame with sequence and abundance columns)
+
+**mergepairs**:
+
+data.frame(s) with a row for each unique pairing of forward/reverse denoised sequences, and the following columns:
+• abundance: Number of reads corresponding to this forward/reverse combination.
+• sequence: The merged sequence.
+• forward: The index of the forward denoised sequence.
+• reverse: The index of the reverse denoised sequence.
+• nmatch: Number of matching nucleotides in the overlap region.
+• nmismatch: Number of mismatches in the overlap region.
+• nindel: Number of indels in the overlap region.
+• prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
+• accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
+• ...: Additional columns specified in propagateCol.
+
+
+
+Tools:
+======
+
+• Quality filtering
+
+    filterAndTrim (fastq -> fastq)
+
+• Dereplication
+
+    derepFastq (fastq -> derep-class object)
+
+• Learn error rates
+
+    learnErrors
+    plotErrors
+    - in: input list, or vector, of file names (or a list of derep-class objects; WHY? Learning should be done on the full data)
+    - out: named list with entries
+      - $err_out: A numeric matrix with the learned error rates.
+      - $err_in: The initialization error rates (unimportant).
+      - $trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
+
+• Sample Inference (dada)
+    in: (list of) derep-class object
+    out: (list of) dada-class object
+
+• Chimera Removal
+
+    removeBimeraDenovo
+
+    in: A uniques-vector or any object that can be coerced into one with getUniques.
+    out: A uniques-vector, or an object of matching class if a data.frame or sequence table is provided.
+
+• Merging of Paired Reads
+
+    mergePairs
+    in: 2x dada-class object(s), 2x derep-class object(s)
+    out: A data.frame, or a list of data.frames.
+      - The returned data.frame(s) have a row for each unique pairing of forward/reverse denoised sequences.
+      - cols
+        - $abundance: Number of reads corresponding to this forward/reverse combination.
+        - $sequence: The merged sequence.
+        - $forward: The index of the forward denoised sequence.
+        - $reverse: The index of the reverse denoised sequence.
+        - $nmatch: Number of matching nucleotides in the overlap region.
+        - $nmismatch: Number of mismatches in the overlap region.
+        - $nindel: Number of indels in the overlap region.
+        - $prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
+        - $accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
+        - $...: Additional columns specified in propagateCol.
+
+• Taxonomic Classification (assignTaxonomy, assignSpecies)
+
+• Other
+
+    makeSequenceTable
+    in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques.
+    out: Named integer matrix (row for each sample, column for each unique sequence).
+
+    mergeSequenceTables
+
+    uniquesToFasta
+    in: A uniques-vector or any object that can be coerced into one with getUniques.
+
+    getSequences
+
+    extracts the sequences from several different data objects, including dada-class
+    and derep-class objects, as well as data.frame objects that have both $sequence and
+    $abundance columns.
+
+    getUniques
+
+    extracts the uniques-vector from several different data objects, including dada-class
+    and derep-class objects, as well as data.frame objects that have both $sequence and $abundance
+    columns.
+
+    plotQualityProfile
+
+    seqComplexity
+
+    setDadaOpt(...)
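
The notes describe the output of makeSequenceTable as a named integer matrix with one row per sample and one column per unique sequence, built from per-sample uniques-vectors. A minimal sketch of that shape follows, written in Python purely as a language-neutral illustration. It is not dada2's implementation. The function name `make_sequence_table`, the dict-based stand-in for a uniques-vector, and the column ordering (by total abundance) are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumption: a "uniques-vector" is modeled as a dict mapping
# sequence -> abundance (the R object is a named integer vector).
Uniques = Dict[str, int]


def make_sequence_table(
    samples: Dict[str, Uniques],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a sample-by-sequence abundance matrix.

    Returns (row names, column names, matrix). Columns are ordered by
    total abundance across samples, highest first, with ties broken
    alphabetically. This ordering is an assumption of the sketch.
    """
    # Sum the abundance of every sequence over all samples.
    totals: Dict[str, int] = {}
    for uniques in samples.values():
        for seq, abundance in uniques.items():
            totals[seq] = totals.get(seq, 0) + abundance

    columns = sorted(totals, key=lambda seq: (-totals[seq], seq))
    rows = list(samples)
    # Sequences absent from a sample get an abundance of 0.
    matrix = [[samples[name].get(seq, 0) for seq in columns] for name in rows]
    return rows, columns, matrix


if __name__ == "__main__":
    # Hypothetical toy data, not real reads.
    toy_samples = {
        "sampleA": {"ACGT": 10, "ACGA": 3},
        "sampleB": {"ACGT": 4, "TTGA": 7},
    }
    row_names, col_names, table = make_sequence_table(toy_samples)
    print(col_names)
    for name, row in zip(row_names, table):
        print(name, row)
```

A tabular layout like this is also what a Galaxy datatype derived from tabular could expose to the user, with the Rdata kept alongside it as suggested in the TODO section.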