KnownGene build

From Genecats

Revision as of 22:24, 18 August 2023

Build UniProt and Protein databases

I haven't been doing this recently. We need to look into whether the work Max has done with UniProt should replace this step.

Consider updating underlying databases

  • If it's been a while since we updated BioCyc versions for this species, consider doing so (we download the public BioCyc flat file .tar.gz distribution for mouse and human)

Initialize work directory

  • Set version variable
 export GENCODE_VERSION=V39
  • Start a screen.
 screen -S knownGene$GENCODE_VERSION
  • Create and cd into work directory of the form /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
   export db=hg38
   mkdir -p /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
   cd /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
  • Set PATH to include $HOME/kent/src/hg/utils/otto/knownGene
 export PATH="$HOME/kent/src/hg/utils/otto/knownGene:$PATH"
  • Copy buildEnv.sh from previous build on this db
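  # gencodeV* matches both human (V*) and mouse (VM*) build directories;
  # the newest directory is the one just created, so take the second newest
  olddir=`ls -trd /hive/data/genomes/$db/bed/gencodeV*/build | tail -n 2 | head -1`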
  cp $olddir/buildEnv.sh  buildEnv.sh
  # edit buildEnv.sh to have correct values
  . buildEnv.sh
  • Copy the table and file lists from the previous build


  cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.files.txt  .
  cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.tables.txt  .

  • Confirm existing assembly tables are in a knownGene* database (the <(...) process substitution below is a bashism; if using tcsh, sort each list into a file first and diff the sorted files)
  hgsql ${oldKnownDb} -Ne "show tables" > ${oldKnownDb}.tables.txt
  diff <(sort ${PREV_GENCODE_VERSION}.tables.txt) <(sort ${oldKnownDb}.tables.txt)
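
For shells without process substitution, the same comparison can be sketched with temporary files. This is a minimal sketch, not part of the kent tree: diff_sorted is a hypothetical helper name, and it assumes only that sort, diff and mktemp are available.

```shell
# Compare two table lists regardless of line order, without <(...) syntax.
# Usage: diff_sorted FILE_A FILE_B
# Exit status follows diff: 0 = identical after sorting, 1 = different.
diff_sorted() {
    tmpdir=$(mktemp -d) || return 2
    sort "$1" > "$tmpdir/a.sorted"
    sort "$2" > "$tmpdir/b.sorted"
    diff "$tmpdir/a.sorted" "$tmpdir/b.sorted"
    status=$?
    rm -rf "$tmpdir"
    return $status
}

# Example call for this step (file names as used above):
# diff_sorted ${PREV_GENCODE_VERSION}.tables.txt ${oldKnownDb}.tables.txt
```

No output means the two lists contain the same tables.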

Setting environment variables

The environment variables used in the build are set in the script buildEnv.sh. All the other scripts assume that this script has been sourced in the current shell. You have to edit it by hand for each build. Most of the variables don't change between builds; the hairiest ones are those naming the other assemblies used for the blast tables.
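
Because every later step depends on these variables, a quick check after sourcing buildEnv.sh can catch a missed edit. The sketch below is an illustration only: require_vars is a hypothetical helper, and the variable names in the example call are just the ones that appear on this page (buildEnv.sh defines others that are not listed here).

```shell
# Report any variables that are unset or empty.
# Usage: require_vars NAME...
# Returns 1 if at least one named variable is missing, 0 otherwise.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (works in POSIX sh).
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return $missing
}

# Example call, after ". buildEnv.sh":
# require_vars db GENCODE_VERSION PREV_GENCODE_VERSION oldGeneDir oldKnownDb
```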

Running the build

To run the build, execute hg/utils/otto/knownGene/buildKnown.sh.

  buildKnown.sh &
  tail -f doKnown.log

It builds into the knownGene${GENCODE_VERSION} database. It does the following steps:

  • Extracting Gencode data
  • Building initial knownGene table
  • Adding primary reference tables
  • Building final knownGene core tables
  • Building bigGenePred
  • Building GTF file

Copying over tables

Drop chromInfo and history from the knownGene database, then write the table list (knownGene and kgXref tables first, everything else after)

  hgsql knownGene${GENCODE_VERSION} -Ne "drop table if exists chromInfo, history"
  hgsql knownGene${GENCODE_VERSION} -Ne "show tables" | egrep "knownGene|kgXref" > ${GENCODE_VERSION}.tables.txt
  hgsql knownGene${GENCODE_VERSION} -Ne "show tables" | egrep -v  "knownGene|kgXref" >> ${GENCODE_VERSION}.tables.txt

Look for unexpected differences between this release and the last one

  diff ${PREV_GENCODE_VERSION}.tables.txt ${GENCODE_VERSION}.tables.txt

Drop old tables

  hgsql $db -Ne "drop table knownGene, kgXref;"
  grep -v "ToKg" ${PREV_GENCODE_VERSION}.tables.txt | egrep -vw "knownGene|kgXref"  | awk '{printf "drop table %s;\n", $1}' > toDrop.lst
  cat toDrop.lst | hgsql $db
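
Since the drop statements are destructive, it can help to see what the filter produces before piping toDrop.lst into hgsql. The sketch below wraps the same grep/awk pipeline in a hypothetical function, make_drop_list, and runs it on a made-up table list; the table names in the sample are illustrative only. grep -E is used in place of egrep, which recent GNU grep releases mark as obsolescent.

```shell
# Print a DROP statement for every table in a list, skipping the *ToKg*
# tables and the two core tables (knownGene, kgXref), which are dropped
# by a separate command.
# Usage: make_drop_list TABLES_FILE
make_drop_list() {
    grep -v "ToKg" "$1" | grep -Evw "knownGene|kgXref" |
        awk '{printf "drop table %s;\n", $1}'
}

# Dry run on an illustrative list (not a real build's table list).
sample=$(mktemp)
printf '%s\n' knownGene kgXref knownCanonical exampleToKgTable kgAlias > "$sample"
make_drop_list "$sample"
rm -f "$sample"
```

On the sample list this prints drop statements for knownCanonical and kgAlias only.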

Check for orphans (known* tables still left in the assembly database) and drop them (or build them) if appropriate

 hgsql $db -Ne "show tables like 'known%'"  > orphan.lst


Copy tables from knownGene database to assembly database

 copyFilesToAssembly.sh ${GENCODE_VERSION}.tables.txt knownGene${GENCODE_VERSION} > copyScript.txt
 cat copyScript.txt | hgsql $db

Edit trackDb to add new trackDb

 cd $HOME/kent/src/hg/makeDb/trackDb/*/$db
 vi trackDb.ra
     # make sure trackDb.ra contains these two include lines:
     include knownGene.ra beta,public
     include knownGene.alpha.ra alpha
 sed "s/$PREV_GENCODE_VERSION/$GENCODE_VERSION/g" knownGene.ra > knownGene.alpha.ra
 cp knownGene$PREV_GENCODE_VERSION.html knownGene$GENCODE_VERSION.html
 git add knownGene.alpha.ra knownGene$GENCODE_VERSION.html trackDb.ra
 git commit -m "$GENCODE_VERSION knownGene trackDb"
 git push
 cd ../..
 make DBS=$db alpha
 cd $dir
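
The sed step above replaces every occurrence of the previous version string, so it is worth previewing which lines change before committing. A minimal sketch, assuming a hypothetical helper named preview_bump and a made-up two-line stanza (the real knownGene.ra contents are not shown on this page):

```shell
# Show which lines of a file would change when the version string is bumped.
# Usage: preview_bump FILE PREV_VERSION NEW_VERSION
# Exit status follows diff: 0 = nothing would change, 1 = some lines change.
preview_bump() {
    sed "s/$2/$3/g" "$1" | diff "$1" -
}

# Illustrative input only.
ra=$(mktemp)
printf 'track knownGene\nshortLabel GENCODE V38\n' > "$ra"
# diff exits 1 when lines differ, which is the expected case here.
preview_bump "$ra" V38 V39 || true
rm -f "$ra"
```

In the build directory the equivalent call would be preview_bump knownGene.ra $PREV_GENCODE_VERSION $GENCODE_VERSION.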

Adding IsPcr server

On hgwdev, drop old records in blatServers and targetDb.

 hgsql hgcentraltest -Ne "delete from blatServers where db like '${db}Kg%'"
 hgsql hgcentraltest -Ne "delete from targetDb where name like '${db}Kg%'"


Ask cluster-admin to start an untranslated, -stepSize=5 gfServer on /gbdb/$db/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit

  genIspcrMail.sh

send to cluster-admin

cluster-admin will say something like this:

 Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921

Here blat1b is the serverName and 17921 is the port.


Add this info to blatServers and targetDb tables in hgcentral.

 addIspcrToCentral.sh serverName port
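
To avoid retyping the host and port, the two values can be pulled out of the reply text. This is a sketch under the assumption that the reply looks like the sample line quoted above; cluster-admin's wording may vary, so check the extracted values by eye. parse_gfserver_reply is a hypothetical helper name.

```shell
# Extract "serverName port" from a reply of the form
#   "... on host SERVER port PORT"
# Prints nothing if the line does not match that shape.
parse_gfserver_reply() {
    printf '%s\n' "$1" |
        sed -n 's/.* on host \([^ ]*\) port \([0-9][0-9]*\).*/\1 \2/p'
}

reply="Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921"
parse_gfserver_reply "$reply"    # prints: blat1b 17921
```

The two fields can then be passed to addIspcrToCentral.sh as its serverName and port arguments.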

all.joiner changes

I haven't added anything to this recently.

The relevant identifier is:

knownGeneId

  joinerCheck all.joiner -identifier=knownGeneId -keys  -database=${db}

Bundle up logs and check them in

Redmine ticket files and tables

Post release push "other species" blast tables

Load the other species blastTab tables.

   buildLoadOther.sh