diff docs/scripts/txt/ExtractFromTextFiles.txt @ 0:4816e4a8ae95 draft default tip
Uploaded
author | deepakjadmin |
---|---|
date | Wed, 20 Jan 2016 09:23:18 -0500 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/scripts/txt/ExtractFromTextFiles.txt Wed Jan 20 09:23:18 2016 -0500
@@ -0,0 +1,208 @@
+NAME
+    ExtractFromTextFiles.pl - Extract specific data from TextFile(s)
+
+SYNOPSIS
+    ExtractFromTextFiles.pl TextFile(s)...
+
+    ExtractFromTextFiles.pl [-c, --colmode colnum | collabel] [--categorycol
+    number | string] [--columns "colnum,[colnum]..." |
+    "collabel,[collabel]..."] [-h, --help] [--indelim *comma | semicolon*]
+    [-m, --mode *columns | rows | categories*] [-o, --overwrite] [--outdelim
+    *comma | tab | semicolon*] [-q, --quote *yes | no*] [--rows
+    "colid,value,criteria..." | "colid,value..." |
+    "colid,mincolvalue,maxcolvalue" | "rownum,rownum,..." | colid |
+    "minrownum,maxrownum"] [--rowsmode rowsbycolvalue | rowsbycolvaluelist
+    | rowsbycolvaluerange | rowbymincolvalue | rowbymaxcolvalue | rownums |
+    rownumrange] [-r, --root *rootname*] [-w, --workingdir *dirname*]
+    TextFile(s)...
+
+DESCRIPTION
+    Extract column(s)/row(s) data from *TextFile(s)* identified by column
+    numbers or labels. Or categorize data using a specified column category.
+    During categorization, a summary text file is generated containing
+    category name and count; an additional text file, containing data
+    for each category, is also generated. The file names are separated by
+    spaces. The valid file extensions are *.csv* and *.tsv* for
+    comma/semicolon and tab delimited text files respectively. All other
+    file names are ignored. All the text files in the current directory can
+    be specified by **.csv*, **.tsv*, or the current directory name. The
+    --indelim option determines the format of *TextFile(s)*. Any file which
+    doesn't correspond to the format indicated by --indelim option is
+    ignored.
+
+OPTIONS
+    -c, --colmode *colnum | collabel*
+        Specify how columns are identified in *TextFile(s)*: using column
+        number or column label. Possible values: *colnum or collabel*.
+        Default value: *colnum*.
+
+    --categorycol *number | string*
+        Column used to categorize data. Default value: First column.
+
+        For *colnum* value of -c, --colmode option, input value is a column
+        number. Example: *1*.
+
+        For *collabel* value of -c, --colmode option, input value is a
+        column label. Example: *Mol_ID*.
+
+    --columns *"colnum,[colnum]..." | "collabel,[collabel]..."*
+        List of comma delimited columns to extract. Default value: First
+        column.
+
+        For *colnum* value of -c, --colmode option, input values format is:
+        *colnum,colnum,...*. Example: *1,3,5*
+
+        For *collabel* value of -c, --colmode option, input values format
+        is: *collabel,collabel,...*. Example: *Mol_ID,MolWeight*
+
+    -h, --help
+        Print this help message.
+
+    --indelim *comma | semicolon*
+        Input delimiter for CSV *TextFile(s)*. Possible values: *comma or
+        semicolon*. Default value: *comma*. For TSV files, this option is
+        ignored and *tab* is used as a delimiter.
+
+    -m, --mode *columns | rows | categories*
+        Specify what to extract from *TextFile(s)*. Possible values:
+        *columns, rows, or categories*. Default value: *columns*.
+
+        For *columns* mode, data for appropriate columns specified by
+        --columns option is extracted from *TextFile(s)* and placed into new
+        text files.
+
+        For *rows* mode, appropriate rows specified in conjunction with
+        --rowsmode and --rows options are extracted from *TextFile(s)* and
+        placed into new text files.
+
+        For *categories* mode, column specified by --categorycol is used to
+        categorize data, and a summary text file is generated containing
+        category name and count; an additional text file, containing data
+        for each category, is also generated.
+
+    -o, --overwrite
+        Overwrite existing files.
+
+    --outdelim *comma | tab | semicolon*
+        Output text file delimiter. Possible values: *comma, tab, or
+        semicolon*. Default value: *comma*.
+
+    -q, --quote *yes | no*
+        Put quotes around column values in output text file. Possible
+        values: *yes or no*. Default value: *yes*.
+
+    -r, --root *rootname*
+        New file name is generated using the root: <Root>.<Ext>. Default for
+        new file names: <TextFile>CategoriesSummary.<Ext>,
+        <TextFile>ExtractedColumns.<Ext>, and <TextFile>ExtractedRows.<Ext>
+        for *categories*, *columns*, and *rows* mode respectively. And
+        <TextFile>Category<CategoryName>.<Ext> for each category retrieved
+        from each text file. The output file type determines <Ext> value:
+        csv and tsv for CSV and TSV files respectively.
+
+        This option is ignored for multiple input files.
+
+    --rows *"colid,value,criteria..." | "colid,value..." |
+    "colid,mincolvalue,maxcolvalue" | "rownum,rownum,..." | colid |
+    "minrownum,maxrownum"*
+        This value is --rowsmode specific. In general, it's a list of comma
+        separated column ids and associated mode specific values. Whether
+        column ids are specified as labels or numbers is controlled by the
+        -c, --colmode option.
+
+        The first line, containing column labels, is always written out.
+        Value comparisons assume numerical column data.
+
+        For *rowsbycolvalue* mode, input value format contains these
+        triplets: *colid,value,criteria...*. Possible values for criteria:
+        *le, ge or eq*. Examples:
+
+            MolWt,450,le
+            MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le
+
+        For *rowsbycolvaluelist* mode, input value format is:
+        *colid,value...*. Examples:
+
+            Mol_ID,20
+            Mol_ID,20,1002,1115
+
+        For *rowsbycolvaluerange* mode, input value format is:
+        *colid,mincolvalue,maxcolvalue*. Example:
+
+            MolWt,100,450
+
+        For *rowbymincolvalue, rowbymaxcolvalue* modes, input value format
+        is: *colid*.
+
+        For *rownums* mode, input value format is: *rownum,rownum,...*.
+        Default value: *2*.
+
+        For *rownumrange* mode, input value format is:
+        *minrownum,maxrownum*. Example:
+
+            10,40
+
+    --rowsmode *rowsbycolvalue | rowsbycolvaluelist | rowsbycolvaluerange |
+    rowbymincolvalue | rowbymaxcolvalue | rownums | rownumrange*
+        Specify how to extract rows from *TextFile(s)*. Possible values:
+        *rowsbycolvalue, rowsbycolvaluelist, rowsbycolvaluerange,
+        rowbymincolvalue, rowbymaxcolvalue, rownums, rownumrange*. Default
+        value: *rownums*.
+
+        Use the --rows option to specify the criteria used to extract rows
+        from *TextFile(s)*.
+
+    -w, --workingdir *dirname*
+        Location of working directory. Default: current directory.
+
+EXAMPLES
+    To extract the first column from a text file and generate a new CSV
+    text file NewSample1.csv, type:
+
+        % ExtractFromTextFiles.pl -r NewSample1 -o Sample1.csv
+
+    To extract columns Mol_ID, MolWeight, and NAME from Sample1.csv and
+    generate a new text file NewSample1.tsv with no quotes, type:
+
+        % ExtractFromTextFiles.pl -m columns -c collabel --columns "Mol_ID,
+          MolWeight,NAME" --outdelim tab --quote no -r NewSample1
+          -o Sample1.csv
+
+    To extract rows with MolWeight column values less than or equal to 450
+    from Sample1.csv and generate a new text file NewSample1.csv, type:
+
+        % ExtractFromTextFiles.pl -m rows --rowsmode rowsbycolvalue
+          -c collabel --rows MolWeight,450,le -r NewSample1
+          -o Sample1.csv
+
+    To extract rows containing values for MolWeight column between 450 and
+    500 from Sample1.csv and generate a new text file NewSample1.csv, type:
+
+        % ExtractFromTextFiles.pl -m rows --rowsmode rowsbycolvaluerange
+          -c collabel --rows MolWeight,450,500 -r NewSample1
+          -o Sample1.csv
+
+    To extract a row containing minimum value for column MolWeight from
+    Sample1.csv and generate a new text file NewSample1.csv, type:
+
+        % ExtractFromTextFiles.pl -m rows --rowsmode rowbymincolvalue
+          -c collabel --rows MolWeight -r NewSample1
+          -o Sample1.csv
+
+AUTHOR
+    Manish Sud <msud@san.rr.com>
+
+SEE ALSO
+    JoinTextFiles.pl, MergeTextFilesWithSD.pl, ModifyTextFilesFormat.pl,
+    SplitTextFiles.pl
+
+COPYRIGHT
+    Copyright (C) 2015 Manish Sud. All rights reserved.
+
+    This file is part of MayaChemTools.
+
+    MayaChemTools is free software; you can redistribute it and/or modify it
+    under the terms of the GNU Lesser General Public License as published by
+    the Free Software Foundation; either version 3 of the License, or (at
+    your option) any later version.
+