view notes.txt @ 2:e089fb4ee28b draft
planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit 5b1603bbcd3f139cad5c876be83fcb39697b5613-dirty
author | matthias |
---|---|
date | Tue, 09 Apr 2019 07:09:26 -0400 |
parents | 11993afc394e |
children |
TODO
====

If we make a monolithic tool:

* implement sanity checks between important compute-intensive steps (user-definable criteria, abort if violated)

If we keep separate tools:

* make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes)
* alternatively, the data set types could be derived from tabular, with the Rdata attached via `.extra_files_path`; this way the user would have some intermediate output to look at

In both cases:

* allow input of single end data, single pair, single pair in separate data sets, ...
* add mergePairsByID functionality to mergePairs tool

Datatypes:
==========

**derep-class**: list with 3 members

- uniques: Named integer vector. Named by the unique sequence, valued by abundance.
- quals: Numeric matrix of average quality scores by position for each unique. Uniques are rows, positions are cols.
- map: Integer vector of length the number of reads, and value the index (in uniques) of the unique to which that read was assigned.

**learnErrorsOutput**: A named list with three entries

- err_out: A numeric matrix with the learned error rates.
- err_in: The initialization error rates (unimportant).
- trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.

**dada-class**: A multi-item list with the following named values

- denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences.
- clustering: An informative data.frame containing information on each cluster.
- sequence: A character vector of each denoised sequence. Identical to names(denoised).
- quality: The average quality scores for each cluster (row) by position (col).
- map: Integer vector that maps the unique (index of derep$uniques) to the denoised sequence (index of dada$denoised).
- birth_subs: A data.frame containing the substitutions at the birth of each new cluster.
- trans: The matrix of transitions by type (row), e.g. A2A, A2C..., and quality score (col) observed in the final output of the dada algorithm.
- err_in: The err matrix used for this invocation of dada.
- err_out: The err matrix estimated from the output of dada. NULL if err_function not provided.
- opts: A list of the dada_opts used for this invocation of dada.
- call: The function call used for this invocation of dada.

**uniques**: derep, dada, mergepairs (or data frame with sequence and abundance columns)

**mergepairs**: data.frame(s) with a row for each unique pairing of forward/reverse denoised sequences, and the following columns:

- abundance: Number of reads corresponding to this forward/reverse combination.
- sequence: The merged sequence.
- forward: The index of the forward denoised sequence.
- reverse: The index of the reverse denoised sequence.
- nmatch: Number of matching nucleotides in the overlap region.
- nmismatch: Number of mismatches in the overlap region.
- nindel: Number of indels in the overlap region.
- prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
- accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
- ...: Additional columns specified in propagateCol.

Tools:
======

* Quality filtering: filterAndTrim
  IO = (fastq -> fastq)
* Dereplication: derepFastq
  IO = (fastq -> derep-class object)
* Learn error rates: learnErrors + plotErrors
  - in: input list, or vector, of file names (or a list of derep-class objects; WHY? learning should be done on the full data)
  - out: named list with entries
    - $err_out: A numeric matrix with the learned error rates.
    - $err_in: The initialization error rates (unimportant).
    - $trans: A feature table of observed transitions for each type (e.g. A->C) and quality score.
* Sample Inference: dada
  - in: (list of) derep-class object
  - out: (list of) dada-class object
* Chimera Removal: removeBimeraDenovo
  - in: A uniques-vector or any object that can be coerced into one with getUniques.
  - out: A uniques-vector, or an object of matching class if a data.frame or sequence table is provided.
* Merging of Paired Reads: mergePairs
  - in: 2x dada-class object(s), 2x derep-class object(s)
  - out: A data.frame, or a list of data.frames. The returned data.frame(s) has a row for each unique pairing of forward/reverse denoised sequences, with columns
    - $abundance: Number of reads corresponding to this forward/reverse combination.
    - $sequence: The merged sequence.
    - $forward: The index of the forward denoised sequence.
    - $reverse: The index of the reverse denoised sequence.
    - $nmatch: Number of matching nucleotides in the overlap region.
    - $nmismatch: Number of mismatches in the overlap region.
    - $nindel: Number of indels in the overlap region.
    - $prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
    - $accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
    - $...: Additional columns specified in propagateCol.
* Taxonomic Classification: assignTaxonomy, assignSpecies
* Other
  - makeSequenceTable
    - in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques.
    - out: Named integer matrix (row for each sample, column for each unique sequence).
  - mergeSequenceTables
  - uniquesToFasta
    - in: A uniques-vector or any object that can be coerced into one with getUniques.
  - getSequences: extracts the sequences from several different data objects, including dada-class and derep-class objects, as well as data.frame objects that have both $sequence and $abundance columns.
  - getUniques: extracts the uniques-vector from several different data objects, including dada-class and derep-class objects, as well as data.frame objects that have both $sequence and $abundance columns.
  - plotQualityProfile
  - seqComplexity
  - setDadaOpt(...)
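
To make the derep-class layout under Datatypes more concrete, here is a minimal Python sketch of how its three members (uniques, quals, map) relate to each other. It is an illustration only and not dada2's derepFastq implementation: the function name `dereplicate`, the input format (a list of sequence/quality-score pairs), the ordering by decreasing abundance with alphabetical tie-breaking, and the 0-based indices in `map` (R would use 1-based) are all assumptions made for the example.

```python
from collections import defaultdict


def dereplicate(reads):
    """Collapse (sequence, quality-scores) reads into a derep-like dict.

    Hypothetical helper for illustration; not part of dada2.
    Indices stored in "map" are 0-based here.
    """
    counts = defaultdict(int)  # sequence -> abundance
    qual_sums = {}             # sequence -> per-position sums of quality scores
    for seq, quals in reads:
        if len(seq) != len(quals):
            raise ValueError("sequence and quality lengths differ")
        counts[seq] += 1
        sums = qual_sums.setdefault(seq, [0] * len(seq))
        for pos, q in enumerate(quals):
            sums[pos] += q

    # Assumed ordering: most abundant unique first, ties broken alphabetically.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    index = {seq: i for i, seq in enumerate(ordered)}

    return {
        # uniques: named by sequence, valued by abundance
        "uniques": {seq: counts[seq] for seq in ordered},
        # quals: one row per unique, one column per position (average quality)
        "quals": [[total / counts[seq] for total in qual_sums[seq]]
                  for seq in ordered],
        # map: for every read, the index of the unique it was assigned to
        "map": [index[seq] for seq, _ in reads],
    }


reads = [
    ("ACGT", [30, 30, 20, 10]),
    ("ACGA", [30, 28, 20, 12]),
    ("ACGT", [40, 30, 30, 20]),
]
derep = dereplicate(reads)
```

The point of the sketch is that `map` has one entry per read while `uniques` and `quals` have one entry per distinct sequence, so the three members of a derep-class object have to be stored together.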