comparison readme.rst @ 0:ec66f9d90ef0 draft
initial uploaded
author | bgruening |
---|---|
date | Thu, 05 Sep 2013 04:58:21 -0400 |
parents | |
children | a4ad586d1403 |
-1:000000000000 | 0:ec66f9d90ef0 |
---|---|
1 These are Galaxy wrappers for common unix text-processing tools | |
2 =============================================================== | |
3 | |
4 The initial work was done by Assaf Gordon and Greg Hannon's lab ( http://hannonlab.cshl.edu ) | |
5 in Cold Spring Harbor Laboratory ( http://www.cshl.edu ). | |
6 | |
7 | |
8 The tools are: | |
9 | |
10 * awk - The AWK programming language ( http://www.gnu.org/software/gawk/ ) | |
11 * sed - Stream Editor ( http://sed.sf.net ) | |
12 * grep - Search files ( http://www.gnu.org/software/grep/ ) | |
13 * sort_columns - Sort every line according to its columns | |
14 * GNU Coreutils programs ( http://www.gnu.org/software/coreutils/ ): | |
15 * sort - sort files | |
16 * join - join two files, based on a common key field. | |
17 * cut - keep/discard fields from a file | |
18 * unsorted_uniq - keep unique/duplicated lines in a file | |
19 * sorted_uniq - keep unique/duplicated lines in a file | |
20 * head - keep the first X lines in a file. | |
21 * tail - keep the last X lines in a file. | |
22 | |
23 A few improvements over the standard tools: | |
24 | |
25 * EasyJoin - A join tool that does not require pre-sorted files ( https://github.com/agordon/filo/blob/scripts/src/scripts/easyjoin ) | |
26 * Multi-Join - Join multiple (>2) files ( https://github.com/agordon/filo/blob/scripts/src/scripts/multijoin ) | |
27 * Find_and_Replace - Find/Replace text in a line or specific column. | |
28 * Grep with Perl syntax - uses grep with Perl-Compatible regular expressions. | |
29 * HTML'd Grep - grep text in a file and produce highlighted HTML output, for easier viewing ( uses https://github.com/agordon/filo/blob/scripts/src/scripts/sort-header ) | |
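
To illustrate what "does not require pre-sorted files" means for a join, here is a minimal Python sketch of a hash-based inner join. It is not the EasyJoin script itself and makes no claim about how that script works; the function name `easy_join` and its parameters are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def easy_join(left_rows, right_rows, left_key=0, right_key=0):
    """Inner-join two lists of rows on a key column.

    Hypothetical helper for illustration only. Neither input has to be
    sorted, because the right-hand rows are indexed by key first.
    """
    # Index the right-hand rows by their key field.
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_key]].append(row)

    joined = []
    for row in left_rows:
        # Look up matches without inserting empty entries into the index.
        for match in index.get(row[left_key], []):
            # Append the right-hand fields, minus the duplicated key.
            rest = match[:right_key] + match[right_key + 1:]
            joined.append(row + rest)
    return joined


# Neither input is sorted by its key column.
left = [["b", "2"], ["a", "1"], ["c", "3"]]
right = [["c", "z"], ["a", "x"]]
result = easy_join(left, right)
```

Output rows follow the order of the left-hand input, and keys present in only one input are dropped, as in an ordinary inner join.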
30 | |
31 | |
32 Requirements | |
33 ------------ | |
34 | |
35 1. Coreutils version 8.19 or later. | |
36 2. AWK version 4.0.1 or later. | |
37 3. SED version 4.2 *with* a special patch | |
38 4. Grep with PCRE support | |
39 | |
40 These will be installed automatically with the Galaxy Tool Shed. | |
41 | |
42 | |
43 ------------------- | |
44 NOTE About Security | |
45 ------------------- | |
46 | |
47 The included tools are secure (barring unintentional bugs): | |
48 The main concern might be executing system commands with awk's "system" and sed's "e" commands, | |
49 or reading/writing arbitrary files with awk's redirection and sed's "r/w" commands. | |
50 These commands are DISABLED using the "--sandbox" parameter to awk and sed. | |
51 | |
52 A user trying to run an awk program similar to: | |
53 BEGIN { system("ls") } | |
54 will get an error (in Galaxy) saying: | |
55 fatal: 'system' function not allowed in sandbox mode. | |
56 | |
57 A user trying to run a sed program similar to: | |
58 1els | |
59 will get an error (in Galaxy) saying: | |
60 sed: -e expression #1, char 2: e/r/w commands disabled in sandbox mode | |
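
The awk case above can be tried outside Galaxy with a short Python sketch. It assumes that GNU awk is installed under the name `gawk` and that it supports `--sandbox`; the helper names `build_sandboxed_awk_command` and `run_sandboxed_awk` are hypothetical and are not part of these wrappers.

```python
import shutil
import subprocess


def build_sandboxed_awk_command(program, input_path=None):
    """Build an argument list that runs `program` under gawk --sandbox.

    Hypothetical helper; the Galaxy wrappers may assemble their command
    line differently.
    """
    cmd = ["gawk", "--sandbox", program]
    if input_path is not None:
        cmd.append(input_path)
    return cmd


def run_sandboxed_awk(program, input_path=None):
    """Run the program, or return None if gawk is not on PATH (assumption)."""
    if shutil.which("gawk") is None:
        return None
    return subprocess.run(
        build_sandboxed_awk_command(program, input_path),
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,
    )


# The program from the example above; it should be rejected in sandbox mode.
outcome = run_sandboxed_awk('BEGIN { system("ls") }')
if outcome is not None:
    print(outcome.returncode, outcome.stderr.strip())
```

If gawk is available, the call should exit with a non-zero status and print a message on standard error like the one quoted above.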
61 | |
62 That being said, if you do find a vulnerability in these tools, please let me know and I'll try to fix it. | |
63 | |
64 ------------ | |
65 Installation | |
66 ------------ | |
67 | |
68 Should be done with the Galaxy `Tool Shed`_. | |
69 | |
70 .. _`Tool Shed`: http://wiki.galaxyproject.org/Tool%20Shed | |
71 | |
72 | |
73 ---- | |
74 TODO | |
75 ---- | |
76 | |
77 - unit-tests | |
78 - uniq will get a new --group function with the 8.22 release; it is currently commented out | |
79 - shuf will also get a major performance improvement with large files http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff | |
80 we can then remove the random feature from sort and use shuf instead | |
81 - move some advanced settings under a conditional; for example, the cut tool offers to cut bytes | |
82 | |
83 | |
84 | |
85 | |
86 |