Genes by any other name

From genomewiki

Genes are known by many names under many naming schemes. The UCSC genes track comes with a number of additional database tables for finding alternative names for genes. The table browser can select columns from any of the UCSC genes database tables and combine them with the alternative gene names, which is useful for obtaining UCSC gene data listed under other gene names.

Unix command line tools are an alternative to the table browser and give finer control over the exact output. The example here uses the Unix join command to combine two files that share a corresponding column. The critical requirement is that both files be sorted on that column, in the same collation order that join itself uses; running both sort and join with LC_ALL=C is one way to guarantee this. See also: the usage message from the join command.

Example: combining the UCSC gene data with the geneSymbol column from the kgXref table.

1. See the available column names in kgXref:
   $ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e "desc kgXref" hg19
2. Select the UCSC gene id and geneSymbol from the kgXref table and sort on the UCSC gene id
    (LC_ALL=C fixes the collation order so that sort and join agree):
   $ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e "select kgID,geneSymbol from kgXref" hg19 \
      | LC_ALL=C sort > hg19.kgXref.geneSymbol.tab
3. Download all the UCSC gene data and sort on the UCSC gene id:
   $ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e "select * from knownGene" hg19 \
      | LC_ALL=C sort > hg19.knownGene.tab
4. Join the two files on their first column, the UCSC gene id.
    The -o option specifies the output format, here placing the geneSymbol in column 1 of the output:
   $ LC_ALL=C join -o 2.2,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12 hg19.knownGene.tab hg19.kgXref.geneSymbol.tab \
      > hg19.geneSymbol.knownGene.txt
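
Before relying on the joined output, it can help to confirm that both inputs really are sorted and to see which gene ids found no partner. The following is a minimal sketch on toy data: the ids, symbols, and file names are invented stand-ins for the two files produced in steps 2 and 3.

```shell
# Toy stand-ins for the two sorted files; ids and symbols are invented.
tmpdir=$(mktemp -d)
printf 'uc001aaa.3\tchr1\t+\nuc001aab.3\tchr1\t-\nuc001aac.4\tchr1\t-\n' > "$tmpdir/knownGene.tab"
printf 'uc001aaa.3\tSYMA\nuc001aac.4\tSYMC\n' > "$tmpdir/kgXref.tab"

# sort -c only checks the order; it exits non-zero if a file is not sorted on the key.
sorted=no
if LC_ALL=C sort -c -k1,1 "$tmpdir/knownGene.tab" &&
   LC_ALL=C sort -c -k1,1 "$tmpdir/kgXref.tab"; then
    sorted=yes
fi

# join -v 1 prints the lines of the first file that have no match in the second.
unmatched=$(LC_ALL=C join -v 1 "$tmpdir/knownGene.tab" "$tmpdir/kgXref.tab" | cut -d' ' -f1)

echo "sorted: $sorted"
echo "ids without a symbol: $unmatched"
rm -rf "$tmpdir"
```

A plain join silently drops unmatched lines, so comparing line counts with wc -l before and after the join is another quick check.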

The resulting file hg19.geneSymbol.knownGene.txt is a copy of the UCSC gene data in which the UCSC identifier in the first column has been replaced by the geneSymbol from kgXref. The output format option -o can rearrange the columns from the two inputs into any combination on output. Its argument is a comma-separated list of fileNumber.columnNumber entries; see the usage message from join.
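
To make the fileNumber.columnNumber notation concrete, here is a small sketch on two invented files (genes.txt and names.txt are hypothetical names, not part of the example above). It also uses the POSIX entry 0, which stands for the join field itself.

```shell
# Two tiny space-separated files, already sorted on column 1 (invented content).
tmpdir=$(mktemp -d)
printf 'id1 chrA +\nid2 chrB -\n' > "$tmpdir/genes.txt"
printf 'id1 NAME1\nid2 NAME2\n'   > "$tmpdir/names.txt"

# 2.2 = column 2 of the second file; 1.2 and 1.3 = columns 2 and 3 of the first file.
reordered=$(join -o 2.2,1.2,1.3 "$tmpdir/genes.txt" "$tmpdir/names.txt")

# 0 = the join field, so the id can be kept alongside the name.
with_id=$(join -o 0,2.2,1.3 "$tmpdir/genes.txt" "$tmpdir/names.txt")

echo "$reordered"
echo "$with_id"
rm -rf "$tmpdir"
```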

The example here is kept simple by selecting only two columns out of the kgXref table. You could use the entire kgXref table and instead specify in the output format of the join command which column of the kgXref table to use in the output.
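
A sketch of that variant follows, on one invented row per table. It assumes geneSymbol is the fifth column of kgXref (confirm the position with the desc kgXref query from step 1) and uses a tab separator, because a full kgXref row may contain empty fields and a free-text description. The accession and description are made up.

```shell
tab=$(printf '\t')
tmpdir=$(mktemp -d)

# Toy knownGene row: id, chrom, strand.
printf 'uc001aaa.3\tchr1\t+\n' > "$tmpdir/knownGene.tab"

# Toy kgXref row with an ASSUMED column order:
# kgID, mRNA, spID, spDisplayID, geneSymbol, refseq, protAcc, description
printf 'uc001aaa.3\tNR_000001\t\t\tSYMA\tNR_000001\t\ttoy description with spaces\n' \
    > "$tmpdir/kgXref.tab"

# 2.5 picks the fifth column of the second file (geneSymbol under the assumption above).
joined=$(LC_ALL=C join -t "$tab" -o 2.5,1.2,1.3 "$tmpdir/knownGene.tab" "$tmpdir/kgXref.tab")

echo "$joined"
rm -rf "$tmpdir"
```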

The resulting file hg19.geneSymbol.knownGene.txt has its fields separated by a single space. To convert the spaces to tabs:

1. with sed (the \t escape is understood by GNU sed; other versions may need a literal tab): sed -e "s/ /\t/g" hg19.geneSymbol.knownGene.txt > hg19.geneSymbol.knownGene.tab
2. with tr: tr ' ' '\t' < hg19.geneSymbol.knownGene.txt > hg19.geneSymbol.knownGene.tab
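
A quick way to check the conversion is to count tab-separated fields per line afterwards; every line should report the same number. The sketch below runs on two invented lines rather than the real file.

```shell
tmpdir=$(mktemp -d)
printf 'SYMA chr1 + 100 200\nSYMB chr2 - 300 400\n' > "$tmpdir/in.txt"

# Convert every space to a tab.
tr ' ' '\t' < "$tmpdir/in.txt" > "$tmpdir/out.tab"

# Distinct field counts per line; a single value means the columns line up.
field_counts=$(awk -F'\t' '{ print NF }' "$tmpdir/out.tab" | sort -u)

echo "fields per line: $field_counts"
rm -rf "$tmpdir"
```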

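Alternatively, the -t argument to join sets the field separator for both input and output, so passing it a tab character avoids the conversion step and keeps empty fields in place. On the command line, a literal tab is typed with two keystrokes: Control-v followed by Tab (or Control-v Control-i). In bash, $'\t' can be written instead.

 $ join -t '<Ctrl-v><Tab>' ...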
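
The difference matters when a column can be empty. The sketch below, on invented rows, compares the default whitespace splitting with a tab separator; it builds the tab with printf so that no special keystrokes are needed.

```shell
tab=$(printf '\t')
tmpdir=$(mktemp -d)

# Toy row whose second column is empty (two tabs in a row).
printf 'id1\t\tALIGN1\n' > "$tmpdir/a.tab"
printf 'id1\tSYMA\n'     > "$tmpdir/b.tab"

# Default splitting treats the run of tabs as one separator, so ALIGN1 shifts into column 2.
default_out=$(join -o 2.2,1.2,1.3 "$tmpdir/a.tab" "$tmpdir/b.tab")

# With -t, each tab separates a field, so the empty column stays in place.
tab_out=$(join -t "$tab" -o 2.2,1.2,1.3 "$tmpdir/a.tab" "$tmpdir/b.tab")

echo "default: $default_out"
echo "with -t: $tab_out"
rm -rf "$tmpdir"
```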