Mercurial > repos > melissacline > ucsc_cancer_utilities
annotate mergeGenomicMatrixFiles.py @ 56:2a240b005731
better instructions on browser
| author | jingchunzhu |
|---|---|
| date | Fri, 18 Sep 2015 11:03:59 -0700 |
| parents | eb5acf81e609 |
| children |
| rev | line source |
|---|---|
| 4 | 1 #!/usr/bin/env python |
| 4 | 2 |
Changeset 7:1d150e860c4d (melissacline; parents: 6): Expanded the functionality of the merge genomic datasets tool, to generate an output dataset with the file (or label) indicating where each column came from
| 7 | 3 import argparse |
| 43 | 4 import string,os,sys,json |
| 3 | 5 |
| 7 | 6 def header (samples, sourceFiles, infile, labelThisFile): |
| 7 | 7     if labelThisFile == None: |
| 7 | 8         labelToUse = infile |
| 7 | 9     else: |
| 7 | 10         labelToUse = labelThisFile |
Changeset 8:5d4538cb38db (melissacline; parents: 7): When opening files for reading, changed the open() mode from 'r' to 'U' to accommodate non-unix systems
| 8 | 11     fin= open(infile, 'U') |
| 3 | 12     #header, samples |
| 3 | 13     newSamples = string.split(string.strip(fin.readline()),'\t')[1:] |
| 3 | 14     for sample in newSamples: |
| 3 | 15         if sample not in samples: |
| 3 | 16             samples[sample]= len(samples) |
| 7 | 17             sourceFiles[sample] = labelToUse |
| 3 | 18     fin.close() |
| 3 | 19     return |
| 3 | 20 |
| 3 | 21 def process(genes, samples, dataMatrix, infile): |
| 3 | 22     maxLength= len(samples) |
| 3 | 23 |
| 8 | 24     fin= open(infile,'U') |
| 3 | 25     #header |
| 3 | 26     newSamples = string.split(string.strip(fin.readline()),'\t') |
| 3 | 27 |
| 3 | 28     while 1: |
| 3 | 29         line = fin.readline()[:-1] |
| 3 | 30         if line =="": |
| 3 | 31             break |
| 3 | 32         data = string.split(line,"\t") |
| 3 | 33         gene = data[0] |
| 3 | 34         if gene not in genes: |
| 3 | 35             genes[gene]= len(genes) |
| 3 | 36             l=[] |
| 3 | 37             for i in range (0, maxLength): |
| 3 | 38                 l.append("") |
| 3 | 39             dataMatrix.append(l) |
| 3 | 40 |
| 3 | 41         x = genes[gene] |
| 3 | 42         for i in range (1, len(data)): |
| 3 | 43             sample = newSamples[i] |
| 3 | 44             y = samples[sample] |
| 3 | 45             dataMatrix[x][y]= data[i] |
| 3 | 46 |
| 3 | 47     fin.close() |
| 3 | 48     return |
| 3 | 49 |
| 7 | 50 |
| 7 | 51 def outputSourceMatrix(sourceData, outputFileName): |
| 7 | 52     fout = open(outputFileName, "w") |
| 7 | 53     fout.write("Sample\tSource\n") |
| 7 | 54     for thisSample in sourceData.keys(): |
| 7 | 55         fout.write("%s\t%s\n" % (thisSample, sourceData[thisSample])) |
| 7 | 56     fout.close() |
| 7 | 57     return |
| 7 | 58 |
| 7 | 59 |
| 7 | 60 def outputMergedMatrix(dataMatrix, samples, genes, outfile): |
| 3 | 61     fout = open(outfile,"w") |
| 3 | 62     maxLength= len(samples) |
| 3 | 63     sList=[] |
| 3 | 64     for i in range (0, maxLength): |
| 3 | 65         sList.append("") |
| 3 | 66     for sample in samples: |
| 3 | 67         pos =samples[sample] |
| 3 | 68         sList[pos] = sample |
| 3 | 69 |
| 3 | 70     fout.write("sample") |
| 3 | 71     for sample in sList: |
| 3 | 72         fout.write("\t"+sample) |
| 3 | 73     fout.write("\n") |
| 3 | 74 |
| 3 | 75     for gene in genes: |
| 3 | 76         fout.write(gene) |
| 3 | 77         for sample in sList: |
| 3 | 78             value = dataMatrix[genes[gene]][samples[sample]] |
| 3 | 79             fout.write("\t"+value) |
| 3 | 80         fout.write("\n") |
| 3 | 81     fout.close() |
| 3 | 82     return |
| 3 | 83 |
| 43 | 84 def outputMergedMatrixJson(output): |
| 43 | 85     fout = open(output,'w') |
| 43 | 86     j={} |
| 43 | 87     j["type"]="genomicMatrix" |
| 43 | 88     json.dump(j, fout) |
| 43 | 89     fout.close() |
| 43 | 90 |
| 3 | 91 if __name__ == '__main__' : |
| 7 | 92     # |
| 7 | 93     # The input files to this script are two or more matrices, in which |
| 7 | 94     # columns represent samples and rows represent genes or measurements. |
| 7 | 95     # There are two output files: outMergedData contains the input data merged |
| 7 | 96     # into a single matrix, and outSourceMatrix is a two-column matrix |
| 7 | 97     # indicating which file each sample (or column label) came from. This |
| 7 | 98     # assumes that each sample came from at most one file. |
| 7 | 99     # |
| 7 | 100     parser = argparse.ArgumentParser() |
| 7 | 101     parser.add_argument("inFileA", type=str, help="First input file") |
| 7 | 102     parser.add_argument("inFileB", type=str, help="Second input file") |
| 7 | 103     parser.add_argument("outMergedData", type=str, |
| 7 | 104                         help="Filename for the merged dataset") |
| 7 | 105     parser.add_argument("outSourceMatrix", type=str, |
| 7 | 106                         help="""Filename for a Nx2 matrix that indicates |
| 7 | 107                         the source file of each column""") |
| 7 | 108     parser.add_argument("--aLabel", type=str, default=None, |
| 7 | 109                         help="User-friendly label for the first input file") |
| 7 | 110     parser.add_argument("--bLabel", type=str, default=None, |
| 7 | 111                         help="User-friendly label for the second input file") |
| 7 | 112     args = parser.parse_args() |
| 3 | 113 |
| 3 | 114     genes={} |
| 3 | 115     samples={} |
| 7 | 116     sourceFiles = {} |
| 3 | 117     dataMatrix=[] |
| 3 | 118 |
| 7 | 119     header(samples, sourceFiles, args.inFileA, args.aLabel) |
| 7 | 120     header(samples, sourceFiles, args.inFileB, args.bLabel) |
| 3 | 121 |
| 7 | 122     process(genes, samples, dataMatrix, args.inFileA) |
| 7 | 123     process(genes, samples, dataMatrix, args.inFileB) |
| 3 | 124 |
| 7 | 125     outputSourceMatrix(sourceFiles, args.outSourceMatrix) |
| 7 | 126     outputMergedMatrix(dataMatrix, samples, genes, args.outMergedData) |
| 43 | 127 |
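
The annotated source above is Python 2 (`string.split` and the `'U'` open mode are not available in current Python 3). As a rough illustration of the same two-pass merge (collect all sample columns first, then fill a gene-by-sample matrix and record where each sample came from), here is a minimal Python 3 sketch. It is not part of the repository: the function name `merge_genomic_matrices`, its `(label, stream)` input form, and its return tuple are assumptions made for this example, and unlike the script it skips blank lines rather than stopping at the first one.

```python
import io


def merge_genomic_matrices(inputs):
    """Merge tab-delimited matrices (rows = genes, columns = samples).

    `inputs` is a list of (label, text_stream) pairs (hypothetical interface).
    Returns (sample_names, gene_names, matrix, sources); matrix cells are
    strings and missing values are "".
    """
    samples = {}   # sample name -> column index
    genes = {}     # gene name -> row index
    sources = {}   # sample name -> label of the first input containing it
    parsed = []

    # Pass 1: read every input and register its sample columns, so the
    # final matrix width is known before any row is created.
    for label, stream in inputs:
        rows = [line.rstrip("\r\n").split("\t") for line in stream if line.strip()]
        if not rows:
            continue
        parsed.append(rows)
        for sample in rows[0][1:]:
            if sample not in samples:
                samples[sample] = len(samples)
                sources[sample] = label

    # Pass 2: place each value at (gene row, sample column).
    matrix = []
    for rows in parsed:
        header = rows[0]
        for fields in rows[1:]:
            gene = fields[0]
            if gene not in genes:
                genes[gene] = len(genes)
                matrix.append([""] * len(samples))
            for name, value in zip(header[1:], fields[1:]):
                matrix[genes[gene]][samples[name]] = value

    # Dicts keep insertion order, which matches the assigned indices.
    return list(samples), list(genes), matrix, sources


# Example with two small in-memory inputs standing in for inFileA and inFileB.
file_a = io.StringIO("sample\tS1\tS2\ngeneA\t1\t2\ngeneB\t3\t4\n")
file_b = io.StringIO("sample\tS3\ngeneB\t5\ngeneC\t6\n")
sample_names, gene_names, merged, sample_sources = merge_genomic_matrices(
    [("A", file_a), ("B", file_b)]
)
```

As in the script, a sample that appears in more than one input keeps a single column, and its source is the first input that listed it.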
