Mercurial > repos > melissacline > ucsc_cancer_utilities
annotate mergeGenomicMatrixFiles.py @ 54:59dbe857f5d4
introduce normal_CNV parameter
author | jingchunzhu |
---|---|
date | Thu, 17 Sep 2015 22:03:04 -0700 |
parents | eb5acf81e609 |
children |
rev | line source |
---|---|
4 | 1 #!/usr/bin/env python |
2 | |
7
1d150e860c4d
Expanded the functionality of the merge genomic datasets tool, to generate an output dataset with the file (or label) indicating where each column came from
melissacline
parents:
6
diff
changeset
|
3 import argparse |
43 | 4 import string,os,sys,json |
3 | 5 |
7 | 6 def header (samples, sourceFiles, infile, labelThisFile): |
7 if labelThisFile == None: | |
8 labelToUse = infile | |
9 else: | |
10 labelToUse = labelThisFile | |
8
5d4538cb38db
When opening files for reading, changed the open() mode from 'r' to 'U' to accommodate non-unix systems
melissacline
parents:
7
diff
changeset
|
11 fin= open(infile, 'U') |
3 | 12 #header, samples |
13 newSamples = string.split(string.strip(fin.readline()),'\t')[1:] | |
14 for sample in newSamples: | |
15 if sample not in samples: | |
16 samples[sample]= len(samples) | |
7 | 17 sourceFiles[sample] = labelToUse |
3 | 18 fin.close() |
19 return | |
20 | |
21 def process(genes, samples, dataMatrix, infile): | |
22 maxLength= len(samples) | |
23 | |
8 | 24 fin= open(infile,'U') |
3 | 25 #header |
26 newSamples = string.split(string.strip(fin.readline()),'\t') | |
27 | |
28 while 1: | |
29 line = fin.readline()[:-1] | |
30 if line =="": | |
31 break | |
32 data = string.split(line,"\t") | |
33 gene = data[0] | |
34 if gene not in genes: | |
35 genes[gene]= len(genes) | |
36 l=[] | |
37 for i in range (0, maxLength): | |
38 l.append("") | |
39 dataMatrix.append(l) | |
40 | |
41 x = genes[gene] | |
42 for i in range (1, len(data)): | |
43 sample = newSamples[i] | |
44 y = samples[sample] | |
45 dataMatrix[x][y]= data[i] | |
46 | |
47 fin.close() | |
48 return | |
49 | |
7 | 50 |
51 def outputSourceMatrix(sourceData, outputFileName): | |
52 fout = open(outputFileName, "w") | |
53 fout.write("Sample\tSource\n") | |
54 for thisSample in sourceData.keys(): | |
55 fout.write("%s\t%s\n" % (thisSample, sourceData[thisSample])) | |
56 fout.close() | |
57 return | |
58 | |
59 | |
60 def outputMergedMatrix(dataMatrix, samples, genes, outfile): | |
3 | 61 fout = open(outfile,"w") |
62 maxLength= len(samples) | |
63 sList=[] | |
64 for i in range (0, maxLength): | |
65 sList.append("") | |
66 for sample in samples: | |
67 pos =samples[sample] | |
68 sList[pos] = sample | |
69 | |
70 fout.write("sample") | |
71 for sample in sList: | |
72 fout.write("\t"+sample) | |
73 fout.write("\n") | |
74 | |
75 for gene in genes: | |
76 fout.write(gene) | |
77 for sample in sList: | |
78 value = dataMatrix[genes[gene]][samples[sample]] | |
79 fout.write("\t"+value) | |
80 fout.write("\n") | |
81 fout.close() | |
82 return | |
83 | |
43 | 84 def outputMergedMatrixJson(output): |
85 fout = open(output,'w') | |
86 j={} | |
87 j["type"]="genomicMatrix" | |
88 json.dump(j, fout) | |
89 fout.close() | |
90 | |
3 | 91 if __name__ == '__main__' : |
7 | 92 # |
93 # The input files to this script are two or more matrices, in which | |
94 # columns represent samples and rows represent genes or measurements. | |
95 # There are two output files: outMergedData contains the input data merged | |
96 # into a single matrix, and outSourceMatrix is a two-column matrix | |
97 # indicating which file each sample (or column label) came from. This | |
98 # assumes that each sample came from at most one file. | |
99 # | |
100 parser = argparse.ArgumentParser() | |
101 parser.add_argument("inFileA", type=str, help="First input file") | |
102 parser.add_argument("inFileB", type=str, help="Second input file") | |
103 parser.add_argument("outMergedData", type=str, | |
104 help="Filename for the merged dataset") | |
105 parser.add_argument("outSourceMatrix", type=str, | |
106 help="""Filename for a Nx2 matrix that indicates | |
107 the source file of each column""") | |
108 parser.add_argument("--aLabel", type=str, default=None, | |
109 help="User-friendly label for the first input file") | |
110 parser.add_argument("--bLabel", type=str, default=None, | |
111 help="User-friendly label for the second input file") | |
112 args = parser.parse_args() | |
3 | 113 |
114 genes={} | |
115 samples={} | |
7 | 116 sourceFiles = {} |
3 | 117 dataMatrix=[] |
118 | |
7 | 119 header(samples, sourceFiles, args.inFileA, args.aLabel) |
120 header(samples, sourceFiles, args.inFileB, args.bLabel) | |
3 | 121 |
7 | 122 process(genes, samples, dataMatrix, args.inFileA) |
123 process(genes, samples, dataMatrix, args.inFileB) | |
3 | 124 |
7 | 125 outputSourceMatrix(sourceFiles, args.outSourceMatrix) |
126 outputMergedMatrix(dataMatrix, samples, genes, args.outMergedData) | |
43 | 127 |
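
The listing above is Python 2 and its indentation is lost in the annotate view, so the two-pass merge is easier to follow in a compact form. The following is a minimal Python 3 sketch of the same idea, not the repository's code: `merge_matrices` and its `named_inputs` argument are hypothetical names, the inputs are in-memory strings rather than files, and it assumes (as the script's comment does) that each sample is attributed to the first input that contains it.

```python
import csv
import io


def merge_matrices(named_inputs):
    """Merge tab-separated matrices given as (label, text) pairs.

    Returns (header, rows, sources): the merged header row, the merged data
    rows, and a dict mapping each sample to the label of the first input
    that contained it.
    """
    samples = {}   # sample name -> column index in the merged matrix
    genes = {}     # gene name -> row index in the merged matrix
    sources = {}   # sample name -> label of the first input containing it
    parsed = []

    # First pass: collect every sample name so the final width is known.
    for label, text in named_inputs:
        rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
        parsed.append(rows)
        for sample in rows[0][1:]:
            if sample not in samples:
                samples[sample] = len(samples)
                sources[sample] = label

    # Second pass: place each value at (gene row, sample column);
    # cells that no input supplies stay as empty strings.
    matrix = []
    for rows in parsed:
        file_samples = rows[0]
        for row in rows[1:]:
            gene = row[0]
            if gene not in genes:
                genes[gene] = len(genes)
                matrix.append([""] * len(samples))
            for i in range(1, len(row)):
                matrix[genes[gene]][samples[file_samples[i]]] = row[i]

    header = ["sample"] + sorted(samples, key=samples.get)
    merged = [[gene] + matrix[idx] for gene, idx in genes.items()]
    return header, merged, sources


# Two small made-up matrices: they share the gene TP53 but no samples.
a = "sample\tS1\tS2\nTP53\t0.1\t0.2\nEGFR\t1.0\t1.1\n"
b = "sample\tS3\nTP53\t0.3\nKRAS\t2.0\n"
header, merged, sources = merge_matrices([("A", a), ("B", b)])
```

Collecting all sample names before reading any data rows is what lets each new gene row be allocated at its final width, which is why the script calls `header` on both files before calling `process` on either.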