Mercurial > repos > melissacline > ucsc_cancer_utilities
annotate mergeGenomicMatrixFiles.py @ 17:0b0a6f326dad
Cleaned up the output dataset names for Merge Genomic Datasets
author | melissacline |
---|---|
date | Fri, 20 Mar 2015 14:22:02 -0700 |
parents | 30aab34424a9 |
children | 1d83dbbee373 |
rev | line source |
---|---|
6 | 1 #!/usr/bin/env python |
2 | |
7
1d150e860c4d
Expanded the functionality of the merge genomic datasets tool, to generate an output dataset with the file (or label) indicating where each column came from
melissacline
parents:
6
diff
changeset
|
3 import argparse |
6 | 4 import string,os,sys |
5 | |
7 | 6 def header (samples, sourceFiles, infile, labelThisFile): |
7 if labelThisFile == None: | |
8 labelToUse = infile | |
9 else: | |
10 labelToUse = labelThisFile | |
8
5d4538cb38db
When opening files for reading, changed the open() mode from 'r' to 'U' to accommodate non-unix systems
melissacline
parents:
7
diff
changeset
|
11 fin= open(infile, 'U') |
6 | 12 #header, samples |
13 newSamples = string.split(string.strip(fin.readline()),'\t')[1:] | |
14 for sample in newSamples: | |
15 if sample not in samples: | |
16 samples[sample]= len(samples) | |
7 | 17 sourceFiles[sample] = labelToUse |
6 | 18 fin.close() |
19 return | |
20 | |
21 def process(genes, samples, dataMatrix, infile): | |
22 maxLength= len(samples) | |
23 | |
8 | 24 fin= open(infile,'U') |
6 | 25 #header |
26 newSamples = string.split(string.strip(fin.readline()),'\t') | |
27 | |
28 while 1: | |
29 line = fin.readline()[:-1] | |
30 if line =="": | |
31 break | |
32 data = string.split(line,"\t") | |
33 gene = data[0] | |
34 if gene not in genes: | |
35 genes[gene]= len(genes) | |
36 l=[] | |
37 for i in range (0, maxLength): | |
38 l.append("") | |
39 dataMatrix.append(l) | |
40 | |
41 x = genes[gene] | |
42 for i in range (1, len(data)): | |
43 sample = newSamples[i] | |
44 y = samples[sample] | |
45 dataMatrix[x][y]= data[i] | |
46 | |
47 fin.close() | |
48 return | |
49 | |
7 | 50 |
51 def outputSourceMatrix(sourceData, outputFileName): | |
52 fout = open(outputFileName, "w") | |
53 fout.write("Sample\tSource\n") | |
54 for thisSample in sourceData.keys(): | |
55 fout.write("%s\t%s\n" % (thisSample, sourceData[thisSample])) | |
56 fout.close() | |
57 return | |
58 | |
59 | |
60 def outputMergedMatrix(dataMatrix, samples, genes, outfile): | |
6 | 61 fout = open(outfile,"w") |
62 maxLength= len(samples) | |
63 sList=[] | |
64 for i in range (0, maxLength): | |
65 sList.append("") | |
66 for sample in samples: | |
67 pos =samples[sample] | |
68 sList[pos] = sample | |
69 | |
70 fout.write("sample") | |
71 for sample in sList: | |
72 fout.write("\t"+sample) | |
73 fout.write("\n") | |
74 | |
75 for gene in genes: | |
76 fout.write(gene) | |
77 for sample in sList: | |
78 value = dataMatrix[genes[gene]][samples[sample]] | |
79 fout.write("\t"+value) | |
80 fout.write("\n") | |
81 fout.close() | |
82 return | |
83 | |
84 if __name__ == '__main__' : | |
7 | 85 # |
86 # The input files to this script are two or more matrices, in which | |
87 # columns represent samples and rows represent genes or measurements. | |
88 # There are two output files: outMergedData contains the input data merged | |
89 # into a single matrix, and outSourceMatrix is a two-column matrix | |
90 # indicating which file each sample (or column label) came from. This | |
91 # assumes that each sample came from at most one file. | |
92 # | |
93 parser = argparse.ArgumentParser() | |
94 parser.add_argument("inFileA", type=str, help="First input file") | |
95 parser.add_argument("inFileB", type=str, help="Second input file") | |
96 parser.add_argument("outMergedData", type=str, | |
97 help="Filename for the merged dataset") | |
98 parser.add_argument("outSourceMatrix", type=str, | |
99 help="""Filename for a Nx2 matrix that indicates | |
100 the source file of each column""") | |
101 parser.add_argument("--aLabel", type=str, default=None, | |
102 help="User-friendly label for the first input file") | |
103 parser.add_argument("--bLabel", type=str, default=None, | |
104 help="User-friendly label for the second input file") | |
105 args = parser.parse_args() | |
6 | 106 |
107 genes={} | |
108 samples={} | |
7 | 109 sourceFiles = {} |
6 | 110 dataMatrix=[] |
111 | |
7 | 112 header(samples, sourceFiles, args.inFileA, args.aLabel) |
113 header(samples, sourceFiles, args.inFileB, args.bLabel) | |
6 | 114 |
7 | 115 process(genes, samples, dataMatrix, args.inFileA) |
116 process(genes, samples, dataMatrix, args.inFileB) | |
6 | 117 |
7 | 118 outputSourceMatrix(sourceFiles, args.outSourceMatrix) |
119 outputMergedMatrix(dataMatrix, samples, genes, args.outMergedData) |
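
The merge described in the script's comment block (samples as columns, genes as rows, plus a second table recording where each sample came from) can be sketched compactly in Python 3. This is an illustrative sketch, not the code in this changeset: the function name `merge_matrices`, its `(label, text)` input pairs, and the choice to record only the first input that supplies a sample are all assumptions made for the example. It works on in-memory strings rather than files so that it can be run on its own.

```python
import csv
import io


def merge_matrices(inputs):
    """Merge tab-delimited matrices given as (label, text) pairs.

    Hypothetical helper: each text has sample names in its header row
    and a gene (or measurement) name in the first column of every row.
    Returns (table, sources): the merged rows, and a dict mapping each
    sample to the label of the first input that supplied it.
    """
    samples = {}  # sample name -> column position in the merged matrix
    sources = {}  # sample name -> label of the input it first came from
    merged = {}   # gene name -> {sample name: value}

    for label, text in inputs:
        rows = csv.reader(io.StringIO(text), delimiter="\t")
        header = next(rows)[1:]  # skip the corner cell
        for sample in header:
            if sample not in samples:
                samples[sample] = len(samples)
                sources[sample] = label
        for row in rows:
            if not row:  # tolerate blank lines instead of stopping
                continue
            gene_values = merged.setdefault(row[0], {})
            for sample, value in zip(header, row[1:]):
                gene_values[sample] = value

    ordered = sorted(samples, key=samples.get)
    table = [["sample"] + ordered]
    for gene, values in merged.items():
        # Cells with no measurement in any input are left empty.
        table.append([gene] + [values.get(s, "") for s in ordered])
    return table, sources


if __name__ == "__main__":
    file_a = "sample\tS1\tS2\nTP53\t1\t2\nBRCA1\t3\t4\n"
    file_b = "sample\tS3\nTP53\t5\nEGFR\t6\n"
    table, sources = merge_matrices([("A", file_a), ("B", file_b)])
    for row in table:
        print("\t".join(row))
    for sample, label in sources.items():
        print("%s\t%s" % (sample, label))
```

Like the script, the sketch leaves a cell empty when a gene was not measured for a sample, and it relies on the stated assumption that each sample appears in at most one input.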