Chromosome name conversion

From genomewiki

Hiram's wisdom on conversion of chromosomes names in data files

Max made a command to change names in our standard file types:

   chromToUcsc

Brian is in the process of fixing our system so that any name in a big* file that can be found in our alias system just works without any translation.

All assembly hubs have IGV-compatible alias name files, for example:

   https://hgdownload.soe.ucsc.edu/hubs/GCF/000/001/405/GCF_000001405.39/GCF_000001405.39.chromAlias.txt

All of our database browsers have the same type of alias name file, for example:

   https://hgdownload.soe.ucsc.edu/goldenPath/xenTro9/bigZips/xenTro9.chromAlias.txt

as well as an equivalent database table:

  hgsql -e 'desc chromAlias;' xenTro9
  +--------+--------------+------+-----+---------+-------+
  | Field  | Type         | Null | Key | Default | Extra |
  +--------+--------------+------+-----+---------+-------+
  | alias  | varchar(255) | NO   | PRI | NULL    |       |
  | chrom  | varchar(255) | NO   | MUL | NULL    |       |
  | source | varchar(255) | NO   | MUL | NULL    |       |
  +--------+--------------+------+-----+---------+-------+
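As a rough illustration of how such an alias file can be used by hand, here is a minimal sketch. It assumes the chromAlias.txt file is tab-separated, starts with a '#' header line, and has the browser's own sequence name in the first column with the aliases in the remaining columns. The function name alias_to_native is made up for this example and is not one of our tools.

```shell
# alias_to_native ALIAS_FILE NAME
# Print the first-column name of the first row in which NAME appears
# in any column. Prints nothing when NAME is not found.
# Assumed layout: tab-separated, '#' header line, native name in column 1.
alias_to_native() {
  awk -F '\t' -v name="$2" '
    /^#/ { next }                          # skip the header line
    {
      for (i = 1; i <= NF; i++)
        if ($i == name) { print $1; exit } # first hit wins
    }
  ' "$1"
}
```

Usage would look like: alias_to_native xenTro9.chromAlias.txt someAliasName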

BED custom tracks submitted to the browser already understand how to take advantage of the alias system.


If your contig names are not known to the alias system, you may need to make your own translation set from the genome sequence itself. I probably don't have those contig names anywhere.

Put the 2bit file of the genome to translate in a working directory, then run this from that directory:

  doIdKeys.pl -buildDir=`pwd` -twoBit=`pwd`/yourGenome.2bit yourGenome

This will create the files:

  yourGenome.idKeys.txt
  yourGenome.keySignature.txt

That idKeys.txt file can be joined to an existing database idKeys file to get the name-to-name correspondence:

   /hive/data/genomes/<db>/bed/idKeys/<db>.idKeys.txt

I've probably already done this if I came anywhere near these genomes. It is practically the first step I have to do anytime I encounter any of that one-off work.

For example, for the cactus241 alignment, I checked all those genomes to find out which ones matched any of our database genomes and did the name translation from that. hg38 to 'Homo_sapiens' happens to come out with the same names since Joel used our sequence, but it also proved the Homo_sapiens sequence was identical to hg38:

  join \
    /hive/data/genomes/hg38/bed/idKeys.p13/hg38.p13.idKeys.txt \
    /hive/data/genomes/hg38/bed/cactus242way/idKeys/Homo_sapiens/Homo_sapiens.idKeys.txt
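To turn a join like that into a plain two-column name translation table, something along these lines should work. This is only a sketch: it assumes each idKeys.txt file has two tab-separated columns (the MD5 key, then the sequence name) and is already sorted on the key. The function name id_keys_name_map is hypothetical.

```shell
# id_keys_name_map A.idKeys.txt B.idKeys.txt
# Join two idKeys files on the key column and print
# "nameInA<TAB>nameInB" for every sequence present in both genomes.
# Assumed layout: key<TAB>name, sorted on the key.
id_keys_name_map() {
  # LC_ALL=C so that join agrees with a plain byte-order sort of the keys
  LC_ALL=C join -t "$(printf '\t')" "$1" "$2" | cut -f 2,3
}
```

Sequences whose key occurs in only one of the two files are silently left out, so compare line counts against the inputs if you expect a complete match.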

The keySignature.txt file is a single MD5 sum representing the entire genome. If you expect the entire genome to be completely identical, just match it against any of the existing keySignature files.

And of course, such idKey files also exist for all assembly hubs so you can find a matching genome there too if it isn't one of our database genomes. I have a specialized 'find' command that can rapidly locate all the assembly hub idKeys. For example, here are all the keySignatures for all the UCSC database genomes, refseq, genbank assembly hubs, and Ensembl genomes:

  /hive/data/inside/assemblyEquivalence/2021-05-11/ucsc/ucsc.keySignatures.txt
  /hive/data/inside/assemblyEquivalence/2021-05-11/refseq/refseq.keySignatures.txt
  /hive/data/inside/assemblyEquivalence/2021-05-11/genbank/genbank.keySignatures.txt
  /hive/data/inside/assemblyEquivalence/2021-05-11/ensembl/ensembl.keySignatures.txt

A simple join on any of those lists immediately identifies identical genomes. This business creates the hgFixed table asmEquivalent:

  hgsql -e 'desc asmEquivalent;' hgFixed
  +----------------------+-------------------------------------------+------+-----+---------+
  | Field                | Type                                      | Null | Key | Default |
  +----------------------+-------------------------------------------+------+-----+---------+
  | source               | varchar(255)                              | NO   | MUL | NULL    |
  | destination          | varchar(255)                              | NO   | MUL | NULL    |
  | sourceAuthority      | enum('ensembl','ucsc','genbank','refseq') | NO   |     | NULL    |
  | destinationAuthority | enum('ensembl','ucsc','genbank','refseq') | NO   |     | NULL    |
  | matchCount           | bigint(20)                                | NO   |     | NULL    |
  | sourceCount          | bigint(20)                                | NO   |     | NULL    |
  | destinationCount     | bigint(20)                                | NO   |     | NULL    |
  +----------------------+-------------------------------------------+------+-----+---------+
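The whole-genome check against one of the keySignatures lists can be sketched like this. It assumes your keySignature.txt holds just the one MD5 sum and that each list file has two tab-separated columns, the signature and then the assembly name. That layout is an assumption, so check the real files before relying on it. The function name find_identical_genomes is made up for the example.

```shell
# find_identical_genomes yourGenome.keySignature.txt LIST.keySignatures.txt
# Print the name of every assembly in the list whose signature equals
# the signature of your genome.
# Assumed layout of the list: signature<TAB>assemblyName.
find_identical_genomes() {
  awk -F '\t' '
    NR == FNR { sig = $1; next }   # first file: remember our signature
    $1 == sig { print $2 }         # second file: report exact matches
  ' "$1" "$2"
}
```

No output means no assembly in that list is completely identical to yours.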

The idKeys.txt files are used to find nearly identical genomes.
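A rough way to judge "nearly identical" is to count how many sequence keys two genomes share. The sketch below prints three numbers that presumably correspond to the matchCount, sourceCount and destinationCount columns above. That correspondence is my reading of the column names, not something taken from the code that builds the table. It assumes key<TAB>name idKeys files with a non-empty first file, and the function name id_keys_counts is hypothetical.

```shell
# id_keys_counts SOURCE.idKeys.txt DEST.idKeys.txt
# Print "matches<TAB>sourceSequences<TAB>destSequences".
# A destination sequence counts as a match when its key occurs
# anywhere in the source file.
id_keys_counts() {
  awk -F '\t' '
    NR == FNR { src[$1] = 1; n_src++; next }      # load the source keys
    { n_dst++; if ($1 in src) n_match++ }         # scan the destination
    END { printf "%d\t%d\t%d\n", n_match, n_src, n_dst }
  ' "$1" "$2"
}
```

When all three numbers are equal, every sequence matches. When the match count is only a little lower than the other two, the genomes differ in just a few sequences.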

Use the chromToUcsc command to rename sequence names in BED or other format files.
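For a one-off where you built your own translation table, the renaming idea itself is simple enough to sketch with awk. This is not how chromToUcsc is implemented, only an illustration. It assumes a two-column tab-separated mapping file (old name, new name) and a tab-separated BED file, and the function name rename_bed_chroms is made up.

```shell
# rename_bed_chroms MAP.txt IN.bed
# Rewrite column 1 of a BED file using an oldName<TAB>newName table.
# track/browser/comment lines pass through unchanged; data lines with
# no mapping are dropped with a warning on stderr.
rename_bed_chroms() {
  awk -F '\t' -v OFS='\t' '
    NR == FNR { to[$1] = $2; next }            # load the mapping table
    /^(track|browser|#)/ { print; next }       # keep header lines as-is
    ($1 in to) { $1 = to[$1]; print; next }    # translate the name
    { print "no mapping for " $1 > "/dev/stderr" }
  ' "$1" "$2"
}
```

Usage would look like: rename_bed_chroms nameMap.txt in.bed > out.bed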