# HG changeset patch # User devteam # Date 1390498302 18000 # Node ID 0017fa63af6c7e8fde9d7735c13eea04bc2da293 Imported from capsule None diff -r 000000000000 -r 0017fa63af6c fastq_filter.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fastq_filter.py Thu Jan 23 12:31:42 2014 -0500 @@ -0,0 +1,36 @@ +#Dan Blankenberg +import sys, os, shutil +from galaxy_utils.sequence.fastq import fastqReader, fastqWriter + +def main(): + #Read command line arguments + input_filename = sys.argv[1] + script_filename = sys.argv[2] + output_filename = sys.argv[3] + additional_files_path = sys.argv[4] + input_type = sys.argv[5] or 'sanger' + + #Save script file for debuging/verification info later + os.mkdir( additional_files_path ) + shutil.copy( script_filename, os.path.join( additional_files_path, 'debug.txt' ) ) + + ## Dan, Others: Can we simply drop the "format=input_type" here since it is specified in reader. + ## This optimization would cut runtime roughly in half (for my test case anyway). -John + out = fastqWriter( open( output_filename, 'wb' ), format = input_type ) + + i = None + reads_kept = 0 + execfile(script_filename, globals()) + for i, fastq_read in enumerate( fastqReader( open( input_filename ), format = input_type ) ): + ret_val = fastq_read_pass_filter( fastq_read ) ## fastq_read_pass_filter defined in script_filename + if ret_val: + out.write( fastq_read ) + reads_kept += 1 + out.close() + if i is None: + print "Your file contains no valid fastq reads." + else: + print 'Kept %s of %s reads (%.2f%%).' 
% ( reads_kept, i + 1, float( reads_kept ) / float( i + 1 ) * 100.0 ) + +if __name__ == "__main__": + main() diff -r 000000000000 -r 0017fa63af6c fastq_filter.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fastq_filter.xml Thu Jan 23 12:31:42 2014 -0500 @@ -0,0 +1,320 @@ + + reads by quality score and length + + galaxy_sequence_utils + + fastq_filter.py $input_file $fastq_filter_file $output_file $output_file.files_path '${input_file.extension[len( 'fastq' ):]}' + + + + + + + + + + + + + + + + + + + + + + + int( float( value ) ) == float( value ) + + + + int( float( value ) ) == float( value ) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +def fastq_read_pass_filter( fastq_read ): + def mean( score_list ): + return float( sum( score_list ) ) / float( len( score_list ) ) + if len( fastq_read ) < $min_size: + return False + if $max_size > 0 and len( fastq_read ) > $max_size: + return False + num_deviates = $max_num_deviants + qual_scores = fastq_read.get_decimal_quality_scores() + for qual_score in qual_scores: + if qual_score < $min_quality or ( $max_quality > 0 and qual_score > $max_quality ): + if num_deviates == 0: + return False + else: + num_deviates -= 1 +#if not $paired_end: + qual_scores_split = [ qual_scores ] +#else: + qual_scores_split = [ qual_scores[ 0:int( len( qual_scores ) / 2 ) ], qual_scores[ int( len( qual_scores ) / 2 ): ] ] +#end if +#for $fastq_filter in $fastq_filters: + for split_scores in qual_scores_split: + left_column_offset = $fastq_filter[ 'offset_type' ][ 'left_column_offset' ] + right_column_offset = $fastq_filter[ 'offset_type' ][ 'right_column_offset' ] +#if $fastq_filter[ 'offset_type' ]['base_offset_type'] == 'offsets_percent': + left_column_offset = int( round( float( left_column_offset ) / 100.0 * float( len( split_scores ) ) ) ) + right_column_offset = int( round( float( right_column_offset ) / 100.0 * float( len( split_scores ) ) ) ) +#end if + if right_column_offset > 0: + split_scores = split_scores[ 
left_column_offset:-right_column_offset] + else: + split_scores = split_scores[ left_column_offset:] + if split_scores: ##if a read doesn't have enough columns, it passes by default + if not ( ${fastq_filter[ 'score_operation' ]}( split_scores ) $fastq_filter[ 'score_comparison' ] $fastq_filter[ 'score' ] ): + return False +#end for + return True + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +This tool allows you to build complex filters to be applied to each read in a FASTQ file. + +**Basic Options:** + * You can specify a minimum and maximum read lengths. + * You can specify minimum and maximum per base quality scores, with optionally specifying the number of bases that are allowed to deviate from this range (default of 0 deviant bases). + * If your data is paired-end, select the proper checkbox; this will cause each read to be internally split down the middle and filters applied to each half using the offsets specified. + +**Advance Options:** + * You can specify any number of advanced filters. + * 5' and 3' offsets are defined, starting at zero, increasing from the respective end of the reads. For example, a quality string of "ABCDEFG", with 5' and 3' offsets of 1 and 1, respectively, specified will yield "BCDEF". + * You can specify either absolute offset values, or percentage offset values. *Absolute Values* based offsets are useful for fixed length reads (e.g. Illumina or SOLiD data). *Percentage of Read Length* based offsets are useful for variable length reads (e.g. 454 data). When using the percent-based method, offsets are rounded to the nearest integer. 
+ * The user specifies the aggregating action (min, max, sum, mean) to perform on the quality score values found between the specified offsets to be used with the user defined comparison operation and comparison value. + * If a set of offsets is specified that causes the remaining quality score list to be of length zero, then the read will **pass** the quality filter unless the size range filter is used to remove these reads. + +----- + +.. class:: warningmark + +Adapter bases in color space reads are excluded from filtering. + +------ + +**Citation** + +If you use this tool, please cite `Blankenberg D, Gordon A, Von Kuster G, Coraor N, Taylor J, Nekrutenko A; Galaxy Team. Manipulation of FASTQ data with Galaxy. Bioinformatics. 2010 Jul 15;26(14):1783-5. <http://www.ncbi.nlm.nih.gov/pubmed/20562416>`_ + + + + diff -r 000000000000 -r 0017fa63af6c test-data/empty_file.dat diff -r 000000000000 -r 0017fa63af6c test-data/sanger_full_range_as_cssanger.fastqcssanger --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/sanger_full_range_as_cssanger.fastqcssanger Thu Jan 23 12:31:42 2014 -0500 @@ -0,0 +1,8 @@ +@FAKE0001 Original version has PHRED scores from 0 to 93 inclusive (in that order) +G2131313131313131313131313131313131313131313131313131313131313131313131313131313131313131313131 ++ +!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ +@FAKE0002 Original version has PHRED scores from 93 to 0 inclusive (in that order) +G3131313131313131313131313131313131313131313131313131313131313131313131313131313131313131313131 ++ +~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! 
diff -r 000000000000 -r 0017fa63af6c test-data/sanger_full_range_original_sanger.fastqsanger --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/sanger_full_range_original_sanger.fastqsanger Thu Jan 23 12:31:42 2014 -0500 @@ -0,0 +1,8 @@ +@FAKE0001 Original version has PHRED scores from 0 to 93 inclusive (in that order) +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC ++ +!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ +@FAKE0002 Original version has PHRED scores from 93 to 0 inclusive (in that order) +CATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA ++ +~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! diff -r 000000000000 -r 0017fa63af6c test-data/solexa_full_range_original_solexa.fastqsolexa --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/solexa_full_range_original_solexa.fastqsolexa Thu Jan 23 12:31:42 2014 -0500 @@ -0,0 +1,8 @@ +@FAKE0003 Original version has Solexa scores from -5 to 62 inclusive (in that order) +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ++ +;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ +@FAKE0004 Original version has Solexa scores from 62 to -5 inclusive (in that order) +TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA ++ +~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; diff -r 000000000000 -r 0017fa63af6c tool_dependencies.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_dependencies.xml Thu Jan 23 12:31:42 2014 -0500 @@ -0,0 +1,6 @@ + + + + + +
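
Note on the offset filters described in the fastq_filter.xml help text: the trimming and aggregate comparison can be sketched in standalone Python 3. This is an illustration only and is not part of the changeset. The names `trim_by_offsets` and `passes_range_filter` are hypothetical, scores are plain numbers rather than values read through galaxy_utils, and the comparison is passed in as a function from the standard `operator` module instead of being substituted into a template.

```python
import operator


def trim_by_offsets(values, left_offset, right_offset, percent=False):
    """Drop left_offset items from the 5' end and right_offset from the 3' end."""
    if percent:
        # Percentage offsets are scaled by the read length, then rounded.
        # Caveat: Python 3's round() rounds halves to even, unlike the
        # Python 2 round() that the patched script runs under.
        left_offset = int(round(left_offset / 100.0 * len(values)))
        right_offset = int(round(right_offset / 100.0 * len(values)))
    if right_offset > 0:
        return values[left_offset:-right_offset]
    # A right offset of 0 must not become values[left:-0], which is empty.
    return values[left_offset:]


def mean(scores):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(scores) / float(len(scores))


def passes_range_filter(scores, left_offset, right_offset,
                        aggregate, compare, threshold, percent=False):
    """Return True when aggregate(window) compares favorably with threshold."""
    window = trim_by_offsets(scores, left_offset, right_offset, percent)
    if not window:
        # Nothing left between the offsets: the read passes by default.
        return True
    return compare(aggregate(window), threshold)


if __name__ == "__main__":
    print(trim_by_offsets("ABCDEFG", 1, 1))                                  # BCDEF
    print(passes_range_filter([30, 30, 2, 30], 0, 0, min, operator.ge, 20))  # False
    print(passes_range_filter([30, 30], 5, 5, mean, operator.ge, 20))        # True
```

The last call shows the pass-by-default case from the help text: offsets that consume the whole read leave an empty window, so only the size range filter would remove such a read.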