annotate mayachemtools/docs/modules/txt/SequenceFileUtil.txt @ 9:ab29fa5c8c1f draft default tip

Uploaded
author deepakjadmin
date Thu, 15 Dec 2016 14:18:03 -0500
parents 73ae111cf86f
children
NAME
    SequenceFileUtil

SYNOPSIS
    use SequenceFileUtil;

    use SequenceFileUtil qw(:all);

DESCRIPTION
    The SequenceFileUtil module provides the following functions:

    AreSequenceLengthsIdentical, CalcuatePercentSequenceIdentity,
    CalculatePercentSequenceIdentityMatrix, GetLongestSequence,
    GetSequenceLength, GetShortestSequence, IsClustalWSequenceFile,
    IsGapResidue, IsMSFSequenceFile, IsPIRFastaSequenceFile,
    IsPearsonFastaSequenceFile, IsSupportedSequenceFile,
    ReadClustalWSequenceFile, ReadMSFSequenceFile, ReadPIRFastaSequenceFile,
    ReadPearsonFastaSequenceFile, ReadSequenceFile,
    RemoveSequenceAlignmentGapColumns, RemoveSequenceGaps,
    WritePearsonFastaSequenceFile

    These functions process sequence files and retrieve information from
    them.

FUNCTIONS
    AreSequenceLengthsIdentical
            $Status = AreSequenceLengthsIdentical($SequencesDataRef);

        Checks the lengths of all the sequences available in
        *SequencesDataRef* and returns 1 or 0 based on whether all the
        sequences have the same length.

    CalcuatePercentSequenceIdentity
            $PercentIdentity = CalcuatePercentSequenceIdentity(
                               $Sequence1, $Sequence2,
                               [$IgnoreGaps, $Precision]);

        Returns percent identity between *Sequence1* and *Sequence2*.
        Optional arguments *IgnoreGaps* and *Precision* control handling of
        gaps in sequences and precision of the returned value. By default,
        gaps are ignored and precision is set to 1 decimal place.

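        As an illustration of the calculation described above, here is a
        minimal standalone sketch. It is not the module's implementation:
        the helper name sketch_percent_identity is hypothetical, and the
        sketch assumes that both sequences are already aligned to the same
        length, that a position is skipped when either residue is a gap,
        and that the percentage is taken over the positions actually
        compared.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SequenceFileUtil.
sub sketch_percent_identity {
    my ($seq1, $seq2, $ignore_gaps, $precision) = @_;
    $ignore_gaps = 1 unless defined $ignore_gaps;   # documented default
    $precision   = 1 unless defined $precision;     # documented default

    die "sequences must have equal length\n"
        unless length($seq1) == length($seq2);

    my ($matches, $compared) = (0, 0);
    for my $i (0 .. length($seq1) - 1) {
        my $r1 = substr($seq1, $i, 1);
        my $r2 = substr($seq2, $i, 1);
        # Assumption: anything other than A-Z counts as a gap, and a
        # position is skipped when either residue is a gap.
        next if $ignore_gaps && ($r1 !~ /^[A-Z]\z/ || $r2 !~ /^[A-Z]\z/);
        $compared++;
        $matches++ if $r1 eq $r2;
    }
    return 0 unless $compared;
    return sprintf("%.${precision}f", 100 * $matches / $compared);
}

# Five positions are compared (the gap column is skipped); four are equal.
print sketch_percent_identity("ACD-FG", "ACE-FG"), "\n";
```

        Passing 0 as the third argument makes the sketch compare gap
        columns too, which changes both the match count and the
        denominator.
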
    CalculatePercentSequenceIdentityMatrix
            $IdentityMatrixDataRef = CalculatePercentSequenceIdentityMatrix(
                                     $SequencesDataRef, [$IgnoreGaps,
                                     $Precision]);

        Calculates pairwise percent identity between all the sequences
        available in *SequencesDataRef* and returns a reference to an
        identity matrix hash. Optional arguments *IgnoreGaps* and
        *Precision* control handling of gaps in sequences and precision of
        the returned value. By default, gaps are ignored and precision is
        set to 1 decimal place.

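        The pairwise loop can be sketched as follows. The layout of the
        hash the module returns is not documented here, so the nested
        $matrix->{$ID1}{$ID2} layout below is an assumption, as are both
        helper names. The sketch also assumes aligned sequences of equal
        length and always skips gap positions.

```perl
use strict;
use warnings;

# Hypothetical pairwise comparison; positions holding a non A-Z
# character in either sequence are skipped.
sub sketch_identity {
    my ($s1, $s2) = @_;
    my ($same, $total) = (0, 0);
    for my $i (0 .. length($s1) - 1) {
        my ($r1, $r2) = (substr($s1, $i, 1), substr($s2, $i, 1));
        next unless $r1 =~ /^[A-Z]\z/ && $r2 =~ /^[A-Z]\z/;
        $total++;
        $same++ if $r1 eq $r2;
    }
    return $total ? sprintf("%.1f", 100 * $same / $total) : "0.0";
}

# Hypothetical matrix builder: one entry per ordered pair of IDs.
sub sketch_identity_matrix {
    my ($data) = @_;
    my %matrix;
    for my $id1 (@{ $data->{IDs} }) {
        for my $id2 (@{ $data->{IDs} }) {
            $matrix{$id1}{$id2} = sketch_identity(
                $data->{Sequence}{$id1}, $data->{Sequence}{$id2});
        }
    }
    return \%matrix;
}

my $data = {
    IDs      => [ 'SeqA', 'SeqB' ],
    Count    => 2,
    Sequence => { SeqA => 'ACDEFG', SeqB => 'ACDEYG' },
};
my $matrix = sketch_identity_matrix($data);
print "SeqA vs SeqB: $matrix->{SeqA}{SeqB}\n";
```
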
    GetSequenceLength
            $SequenceLength = GetSequenceLength($Sequence, [$IgnoreGaps]);

        Returns length of the specified sequence. Optional argument
        *IgnoreGaps* controls handling of gaps. By default, gaps are
        ignored.

    GetShortestSequence
            ($ID, $Sequence, $SeqLen, $Description) = GetShortestSequence(
                   $SequencesDataRef, [$IgnoreGaps]);

        Checks the lengths of all the sequences available in
        *SequencesDataRef* and returns ID, Sequence, SeqLen, and Description
        values for the shortest sequence. Optional argument *IgnoreGaps*
        controls handling of gaps in sequences. By default, gaps are
        ignored.

    GetLongestSequence
            ($ID, $Sequence, $SeqLen, $Description) = GetLongestSequence(
                   $SequencesDataRef, [$IgnoreGaps]);

        Checks the lengths of all the sequences available in
        *SequencesDataRef* and returns ID, Sequence, SeqLen, and Description
        values for the longest sequence. Optional argument *IgnoreGaps*
        controls handling of gaps in sequences. By default, gaps are
        ignored.

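        The selection performed by GetShortestSequence and
        GetLongestSequence can be sketched with one standalone helper. The
        name sketch_extreme_sequence and its second argument are
        hypothetical, the input hash is assumed to have the shape described
        for ReadSequenceFile, and resolving ties in favor of the first ID
        is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SequenceFileUtil.
sub sketch_extreme_sequence {
    my ($data, $want_longest, $ignore_gaps) = @_;
    $ignore_gaps = 1 unless defined $ignore_gaps;   # documented default

    my ($best_id, $best_len);
    for my $id (@{ $data->{IDs} }) {
        my $seq = $data->{Sequence}{$id};
        # Count only A-Z residues when gaps are ignored.
        my $len = $ignore_gaps ? ($seq =~ tr/A-Z//) : length $seq;
        my $better = !defined $best_len
            || ($want_longest ? $len > $best_len : $len < $best_len);
        ($best_id, $best_len) = ($id, $len) if $better;
    }
    return unless defined $best_id;   # no sequences at all
    return ($best_id, $data->{Sequence}{$best_id}, $best_len,
            $data->{Description}{$best_id});
}

my $data = {
    IDs         => [ 'S1', 'S2', 'S3' ],
    Count       => 3,
    Description => { S1 => 'one', S2 => 'two', S3 => 'three' },
    Sequence    => { S1 => 'AC--DE', S2 => 'ACDEF', S3 => 'A-C' },
};

my ($ID, $Sequence, $SeqLen, $Description) = sketch_extreme_sequence($data, 1);
print "Longest: $ID ($SeqLen residues, $Description)\n";
```

        With gaps counted instead of ignored, S1 becomes the longest entry
        because its two gap characters contribute to its length.
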
    IsGapResidue
            $Status = IsGapResidue($Residue);

        Returns 1 or 0 based on whether *Residue* corresponds to a gap. Any
        character other than A to Z is considered a gap residue.

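        The rule stated above amounts to a one-character class test. The
        helper below is a hypothetical stand-in, not the module's code, and
        it reads "A to Z" literally as uppercase letters only; how the
        module treats lowercase residues is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the single character is not an uppercase
# letter A-Z, 0 otherwise.
sub sketch_is_gap_residue {
    my ($residue) = @_;
    return $residue =~ /^[A-Z]\z/ ? 0 : 1;
}

# Count the gap characters in an aligned sequence.
my $gap_count = grep { sketch_is_gap_residue($_) } split //, 'AC--DE.F';
print "Gap residues: $gap_count\n";
```
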
    IsSupportedSequenceFile
            $Status = IsSupportedSequenceFile($SequenceFile);

        Returns 1 or 0 based on whether *SequenceFile* corresponds to a
        supported sequence format.

    IsClustalWSequenceFile
            $Status = IsClustalWSequenceFile($SequenceFile);

        Returns 1 or 0 based on whether *SequenceFile* corresponds to
        Clustal sequence alignment format.

    IsPearsonFastaSequenceFile
            $Status = IsPearsonFastaSequenceFile($SequenceFile);

        Returns 1 or 0 based on whether *SequenceFile* corresponds to
        Pearson FASTA sequence format.

    IsPIRFastaSequenceFile
            $Status = IsPIRFastaSequenceFile($SequenceFile);

        Returns 1 or 0 based on whether *SequenceFile* corresponds to PIR
        FASTA sequence format.

    IsMSFSequenceFile
            $Status = IsMSFSequenceFile($SequenceFile);

        Returns 1 or 0 based on whether *SequenceFile* corresponds to MSF
        sequence alignment format.

    ReadSequenceFile
            $SequenceDataMapRef = ReadSequenceFile($SequenceFile);

        Reads *SequenceFile* and returns a reference to a hash containing
        the following key/value pairs:

            $SequenceDataMapRef->{IDs} - Array of sequence IDs
            $SequenceDataMapRef->{Count} - Number of sequences
            $SequenceDataMapRef->{Description}{$ID} - Sequence description
            $SequenceDataMapRef->{Sequence}{$ID} - Sequence for a specific ID
            $SequenceDataMapRef->{Sequence}{InputFileType} - File format

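        To show how a caller might walk this structure, the sketch below
        builds a hash of the same shape by hand instead of reading a file,
        so it runs without the module. The IDs, descriptions, and sequences
        are made-up sample values, and the file format entry is left out.

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash described above (sample data only).
my $SequenceDataMapRef = {
    IDs         => [ 'Seq1', 'Seq2' ],
    Count       => 2,
    Description => { Seq1 => 'first example', Seq2 => 'second example' },
    Sequence    => { Seq1 => 'ACDEFG', Seq2 => 'AC-EFG' },
};

# Walk the entries in file order using the IDs array.
my @report;
for my $ID (@{ $SequenceDataMapRef->{IDs} }) {
    push @report, sprintf("%s (%s): %s", $ID,
        $SequenceDataMapRef->{Description}{$ID},
        $SequenceDataMapRef->{Sequence}{$ID});
}
print "$_\n" for @report;
```

        Iterating over the IDs array rather than over the keys of the
        Sequence hash keeps the entries in the order stored in that array,
        since Perl hash keys have no defined order.
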
    ReadClustalWSequenceFile
            $SequenceDataMapRef = ReadClustalWSequenceFile($SequenceFile);

        Reads ClustalW *SequenceFile* and returns a reference to a hash
        containing the key/value pairs described under the ReadSequenceFile
        method.

    ReadMSFSequenceFile
            $SequenceDataMapRef = ReadMSFSequenceFile($SequenceFile);

        Reads MSF *SequenceFile* and returns a reference to a hash
        containing the key/value pairs described under the ReadSequenceFile
        method.

    ReadPIRFastaSequenceFile
            $SequenceDataMapRef = ReadPIRFastaSequenceFile($SequenceFile);

        Reads PIR FASTA *SequenceFile* and returns a reference to a hash
        containing the key/value pairs described under the ReadSequenceFile
        method.

    ReadPearsonFastaSequenceFile
            $SequenceDataMapRef = ReadPearsonFastaSequenceFile($SequenceFile);

        Reads Pearson FASTA *SequenceFile* and returns a reference to a hash
        containing the key/value pairs described under the ReadSequenceFile
        method.

    RemoveSequenceGaps
            $SeqWithoutGaps = RemoveSequenceGaps($Sequence);

        Removes gaps from *Sequence* and returns a sequence without any
        gaps.

    RemoveSequenceAlignmentGapColumns
            $NewAlignmentDataMapRef = RemoveSequenceAlignmentGapColumns(
                                      $AlignmentDataMapRef);

        Takes a reference to an alignment data map containing the keys
        listed below and generates a new hash with the same set of keys, in
        which residue columns containing only gaps have been removed:

            {IDs} : Array of IDs in the order they appear in the file
            {Count} : ID count
            {Description}{$ID} : Description data
            {Sequence}{$ID} : Sequence data

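        A minimal sketch of the column filtering, under stated assumptions:
        all sequences are aligned to the same width, a gap is any character
        other than A-Z, and a column is dropped only when every sequence
        has a gap in it. The helper name sketch_remove_gap_columns is
        hypothetical and this is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SequenceFileUtil.
sub sketch_remove_gap_columns {
    my ($in) = @_;
    my @ids = @{ $in->{IDs} };
    my %out = (
        IDs         => [@ids],
        Count       => $in->{Count},
        Description => { %{ $in->{Description} || {} } },
        Sequence    => {},
    );
    return \%out unless @ids;
    $out{Sequence}{$_} = '' for @ids;

    my $width = length $in->{Sequence}{ $ids[0] };   # assumes aligned input
    for my $col (0 .. $width - 1) {
        my @residues = map { substr($in->{Sequence}{$_}, $col, 1) } @ids;
        # Keep the column if at least one sequence has a real residue.
        next unless grep { /^[A-Z]\z/ } @residues;
        $out{Sequence}{ $ids[$_] } .= $residues[$_] for 0 .. $#ids;
    }
    return \%out;
}

my $alignment = {
    IDs         => [ 'S1', 'S2' ],
    Count       => 2,
    Description => { S1 => 'one', S2 => 'two' },
    Sequence    => { S1 => 'AC--DE', S2 => 'A---DF' },
};
my $trimmed = sketch_remove_gap_columns($alignment);
print "$_: $trimmed->{Sequence}{$_}\n" for @{ $trimmed->{IDs} };
```

        Columns 3 and 4 hold gaps in both sequences and are dropped. The
        gap that S2 has in column 2 is kept because S1 has a residue there.
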
    WritePearsonFastaSequenceFile
            WritePearsonFastaSequenceFile($SequenceFileName, $SequenceDataRef,
                                          [$MaxLength]);

        Using sequence data specified via *SequenceDataRef*, writes out a
        Pearson FASTA sequence file. Optional argument *MaxLength* controls
        the maximum number of sequence characters on each line; default is
        80.

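        The line wrapping controlled by *MaxLength* can be sketched as
        below. The helper sketch_fasta_text is hypothetical and returns the
        text instead of writing a file. The exact header line the module
        writes is not documented here, so the ">ID description" form is an
        assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SequenceFileUtil.
sub sketch_fasta_text {
    my ($data, $max_length) = @_;
    $max_length = 80 unless defined $max_length;   # documented default

    my $text = '';
    for my $id (@{ $data->{IDs} }) {
        # Assumption: header is ">ID", followed by the description if any.
        my $header = ">$id";
        my $desc   = $data->{Description}{$id};
        $header .= " $desc" if defined $desc && length $desc;

        # Split the sequence into chunks of at most $max_length characters.
        my @lines = $data->{Sequence}{$id} =~ /.{1,$max_length}/g;
        $text .= join("\n", $header, @lines) . "\n";
    }
    return $text;
}

my $fasta_data = {
    IDs         => ['Seq1'],
    Count       => 1,
    Description => { Seq1 => 'demo' },
    Sequence    => { Seq1 => 'ACDEFGHIKL' },
};
print sketch_fasta_text($fasta_data, 4);
```

        With a maximum of 4, the ten-residue sample sequence is written as
        two full lines followed by a final line of two residues.
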
AUTHOR
    Manish Sud <msud@san.rr.com>

SEE ALSO
    PDBFileUtil.pm

COPYRIGHT
    Copyright (C) 2015 Manish Sud. All rights reserved.

    This file is part of MayaChemTools.

    MayaChemTools is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 3 of the License, or (at
    your option) any later version.