<tool id="tp_find_and_replace" name="Replace" version="@BASE_VERSION@.3">
    <description>parts of text</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <requirements>
        <requirement type="package" version="5.22.0.1">perl</requirement>
    </requirements>
    <command>
<![CDATA[
        perl '$__tool_directory__/find_and_replace'
            #if $searchwhere.searchwhere_select == "column":
                -c $searchwhere.column
            #end if
            -o '$outfile'
            $global
            $caseinsensitive
            $wholewords
            $skip_first_line
            $is_regex
            '$find_pattern'
            '$replace_pattern'
            '$infile'
]]>
    </command>
    <inputs>
        <param name="infile" format="txt" type="data" label="File to process" />
        <param name="find_pattern" type="text" label="Find pattern" help="Use simple text, or a valid regular expression (without the enclosing slashes / /)">
            <sanitizer>
                <valid initial="string.printable">
                    <remove value="&apos;"/>
                </valid>
            </sanitizer>
        </param>
        <param name="replace_pattern" type="text" label="Replace with"
            help="Use simple text, or $&amp; (dollar-ampersand) and $1 $2 $3 to refer to matched text. See examples below." >
            <sanitizer>
                <valid initial="string.printable">
                    <remove value="&apos;"/>
                </valid>
            </sanitizer>
        </param>
        <param name="is_regex" type="boolean" checked="false" truevalue="-r" falsevalue=""
            label="Find-Pattern is a regular expression" help="see help section for details." />

        <param name="global" type="boolean" checked="false" truevalue="-g" falsevalue=""
            label="Replace all occurrences of the pattern" />

        <param name="caseinsensitive" type="boolean" checked="false" truevalue="-i" falsevalue=""
            label="Case-Insensitive search" help="" />

        <param name="wholewords" type="boolean" checked="false" truevalue="-w" falsevalue=""
            label="Find whole-words" help="ignore partial matches (e.g. 'apple' will not match 'snapple')" />

        <param name="skip_first_line" type="boolean" checked="false" truevalue="-s" falsevalue=""
            label="Ignore first line" help="Select this option if the first line contains column headers. Text in the line will not be replaced. " />

        <conditional name="searchwhere">
            <param name="searchwhere_select" type="select" label="Find and Replace text in">
                <option value="line" selected="true">entire line</option>
                <option value="column">specific column</option>
            </param>
            <when value="line" />
            <when value="column">
                <param name="column" label="in column" type="data_column" data_ref="infile" accept_default="true" />
            </when>
        </conditional>
    </inputs>
    <outputs>
        <data format_source="infile" name="outfile" metadata_source="infile" />
    </outputs>
    <tests>
        <test>
            <param name="infile" value="find_and_replace1.txt" />
            <param name="find_pattern" value="day" />
            <param name="replace_pattern" value="great day" />
            <param name="is_regex" value="False" />
            <param name="global" value="true" />
            <param name="caseinsensitive" value="False" />
            <param name="wholewords" value="True" />
            <output name="outfile" file="find_and_replace_results1.txt" />
        </test>
        <!-- Test that columns are split by tabs, not spaces. The input has no tabs but many spaces,
             so the ftype needs to be set explicitly.
             The result should be the same as in test 1, which works on the whole line. -->
        <test>
            <param name="infile" value="find_and_replace1.txt" ftype="tabular" />
            <param name="find_pattern" value="day" />
            <param name="replace_pattern" value="great day" />
            <param name="is_regex" value="False" />
            <param name="global" value="true" />
            <param name="caseinsensitive" value="False" />
            <param name="wholewords" value="True" />
            <conditional name="searchwhere">
                <param name="searchwhere_select" value="column"/>
                <param name="column" value="1" />
            </conditional>
            <output name="outfile" file="find_and_replace_results1.txt" />
        </test>
        <test>
            <param name="infile" value="find_and_replace2.txt" />
            <param name="find_pattern" value="^chr" />
            <param name="replace_pattern" value="" />
            <param name="is_regex" value="True" />
            <param name="caseinsensitive" value="False" />
            <param name="wholewords" value="False" />
            <conditional name="searchwhere">
                <param name="searchwhere_select" value="column" />
                <param name="column" value="3" />
            </conditional>
            <output name="outfile" file="find_and_replace_results2.txt" />
        </test>
    </tests>
    <help>
<![CDATA[
**What it does**

This tool finds and replaces text in an input dataset.

.. class:: infomark

The **pattern to find** can be a simple text string or a Perl **regular expression** (depending on the *Find-Pattern is a regular expression* check-box).

.. class:: infomark

When using regular expressions, the **replace pattern** can contain back-references (e.g. \\1 or $1).

.. class:: infomark

This tool uses Perl regular expression syntax.

-----

**Examples of *regular-expression* Find Patterns**

- **HELLO**     The word 'HELLO' (case sensitive).
- **AG.T**      The letters A and G, followed by any single character, followed by the letter T.
- **A{4,}**     Four or more consecutive A's.
- **chr2[012]\\t**       The words 'chr20' or 'chr21' or 'chr22' followed by a tab character.
- **hsa-mir-([^ ]+)**        The text 'hsa-mir-' followed by one or more non-space characters. When using parentheses, the text matched inside them can be accessed with **\\1** in the **replace** pattern.


**Examples of Replace Patterns**

- **WORLD**  The word 'WORLD' will be placed wherever the find pattern was found.
- **FOO-$&-BAR**  Each time the find pattern is found, it will be surrounded with 'FOO-' at the beginning and '-BAR' at the end. **$&** (dollar-ampersand) represents the matched find pattern.
- **$1**   The text that matched the first parenthesized group in the Find Pattern.
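
The replace patterns above behave like the right-hand side of a Perl substitution. A minimal sketch in plain Perl, assuming the tool applies an ordinary ``s/find/replace/`` to each line (the sample line and the replacement text ``miRNA:`` are made up for illustration)::

    my $line = "hsa-mir-21 is up-regulated";

    # $1 holds the text captured by the first parenthesized group.
    (my $with_group = $line) =~ s/hsa-mir-([^ ]+)/miRNA:$1/;
    # $with_group is now "miRNA:21 is up-regulated"

    # $& holds the whole text matched by the find pattern.
    (my $with_match = $line) =~ s/hsa-mir-([^ ]+)/FOO-$&-BAR/;
    # $with_match is now "FOO-hsa-mir-21-BAR is up-regulated"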


-----

**Example 1**

| **Find Pattern:** HELLO
| **Replace Pattern:** WORLD
| **Regular Expression:** no
| **Replace what:** entire line

Every time the word HELLO is found, it will be replaced with the word WORLD.

-----

**Example 2**

| **Find Pattern:** ^chr
| **Replace Pattern:** (empty)
| **Regular Expression:** yes
| **Replace what:** column 11

If column 11 (of every line) begins with the letters 'chr', they will be removed. Effectively, it will turn "chr4" into "4" and "chrXHet" into "XHet".
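
A column-restricted replacement like this one can be pictured as follows. This is only a sketch in plain Perl, assuming columns are separated by tabs; it uses a made-up three-column line and column 3 instead of column 11 to keep the example short::

    my $line = "gene1\t100\tchrXHet";

    # Split on tabs; the -1 limit keeps trailing empty columns.
    my @fields = split /\t/, $line, -1;

    # Column 3 is index 2; only this column is changed.
    $fields[2] =~ s/^chr//;

    # Prints the three columns again, tab-separated, ending in "XHet".
    print join("\t", @fields), "\n";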


-----

**Perl's Regular Expression Syntax**

The Find & Replace tool searches each line (or the selected column) for text matching the given pattern and replaces it. A Regular Expression is a pattern describing a certain amount of text.

- **( ) { } [ ] . * ? + \\ ^ $** are all special characters. **\\** can be used to "escape" a special character, allowing that special character to be searched for.
- **^** matches the beginning of a string (but not an internal line).
- **(** .. **)** groups a particular pattern.
- **{** n or n, or n,m **}** specifies an expected number of repetitions of the preceding pattern.

  - **{n}** The preceding item is matched exactly n times.
  - **{n,}** The preceding item is matched n or more times.
  - **{n,m}** The preceding item is matched at least n times but not more than m times.

- **[** ... **]** creates a character class. Within the brackets, single characters can be placed. A dash (-) may be used to indicate a range such as **a-z**.
- **.** Matches any single character except a newline.
- ***** The preceding item will be matched zero or more times.
- **?** The preceding item is optional and matched at most once.
- **+** The preceding item will be matched one or more times.
- **^** has two meanings:

  - matches the beginning of a line or string.
  - indicates negation in a character class. For example, [^...] matches every character except the ones inside the brackets.

- **$** matches the end of a line or string.
- **|** separates alternate possibilities.
- **\\d** matches a single digit.
- **\\w** matches a single letter, digit or underscore.
- **\\s** matches a single whitespace character (space or tab).
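
The elements above can be tried out directly in Perl. A minimal sketch with made-up sample strings (nothing here is taken from the tool's own code)::

    # {n,}: four or more consecutive A's
    print "run of A's\n" if "GGAAAAAT" =~ /A{4,}/;

    # character class plus \t: chr20, chr21 or chr22 followed by a tab
    print "chr2x column\n" if "chr21\t1000" =~ /chr2[012]\t/;

    # ^ and $ anchors with \d and +: the whole string must be digits
    print "all digits\n" if "12345" =~ /^\d+$/;

    # [^...] negation and \s: no match here, because the string contains a space
    print "no whitespace\n" if "a b" =~ /^[^\s]+$/;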

@REFERENCES@
]]>
    </help>
    <expand macro="citations" />
</tool>
author	bgruening
date	Sun, 15 Mar 2020 22:58:18 +0000
parents	1aa30b2c73c9
children	cd83b5644eab