KnownGene build: Difference between revisions

Revision as of 23:50, 1 February 2022

Build UniProt and Protein databases

I haven't been doing this step recently. We need to look into whether the work Max has done with UniProt should replace it.

Initialize work directory

Set version variable

 export GENCODE_VERSION=V39

Start a screen.

 screen -S knownGene$GENCODE_VERSION

Create and cd into work directory of the form /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build

   export db=hg38
   mkdir -p /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
   cd /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
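
The directory convention above can be wrapped in a small helper so the path is spelled the same way every time. This is only a sketch: gencode_build_dir is a made-up name and is not part of the kent tree.

```shell
# Hypothetical helper (not in the kent tree): print the build directory
# for an assembly and a GENCODE version, following the form given above.
gencode_build_dir() {
    # $1 = assembly database (e.g. hg38), $2 = GENCODE version (e.g. V39)
    printf '/hive/data/genomes/%s/bed/gencode%s/build\n' "$1" "$2"
}

# Example: show the directory that the steps above create.
gencode_build_dir hg38 V39
```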

Set PATH to include $HOME/kent/src/hg/utils/otto/knownGene

 PATH="$HOME/kent/src/hg/utils/otto/knownGene:$PATH"

Copy buildEnv.sh from previous build on this db

  cp /hive/data/genomes/$db/bed/gencodeVM27/build/buildEnv.sh  buildEnv.sh
  # edit buildEnv.sh by hand so it has the correct values, then source it
  . buildEnv.sh

Find Table and File list from previous build

  cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.files.txt  .
  cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.tables.txt  .

Confirm existing assembly tables are in a knownGene* database

  hgsql ${oldKnownDb} -Ne "show tables" > ${oldKnownDb}.tables.txt
  diff ${PREV_GENCODE_VERSION}.tables.txt ${oldKnownDb}.tables.txt

Setting environment variables

The environment variables used in the build are set in the script buildEnv.sh. All the other scripts assume that this script has been sourced in the current shell. You have to edit this by hand. Most of the variables don't change. The hairiest ones are the other assemblies for the blast tables.
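
Since every other script assumes buildEnv.sh has been sourced, it may be worth checking the shell before starting. The sketch below is an assumption on my part, not an existing script: check_build_env is a made-up name, and it only checks the variables that appear on this page (buildEnv.sh may well define more).

```shell
# Hypothetical check: report any build variable that is unset or empty.
# Only the variable names used on this page are checked.
check_build_env() {
    missing=0
    for name in db GENCODE_VERSION PREV_GENCODE_VERSION oldGeneDir oldKnownDb; do
        # Read the value of the variable whose name is in $name (POSIX sh).
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return $missing
}
```

Run check_build_env right after ". buildEnv.sh"; a non-zero exit status means at least one of the variables still needs to be filled in.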

Running the build

To run the build, execute hg/utils/otto/knownGene/buildKnown.sh.

  buildKnown.sh &
  tail -f doKnown.log

It builds into the knownGene${GENCODE_VERSION} database. It does the following steps:

Extracting Gencode data
Building initial knownGene table
Adding primary reference tables
Building final knownGene core tables
Building bigGenePred
Building GTF file

Copying over tables

Drop chromInfo and history from the knownGene database

  hgsql knownGene${GENCODE_VERSION} -Ne "drop table chromInfo, history"
  hgsql knownGene${GENCODE_VERSION} -Ne "show tables" > ${GENCODE_VERSION}.tables.txt

Look for unexpected differences between this release and the last one

  diff ${PREV_GENCODE_VERSION}.tables.txt ${GENCODE_VERSION}.tables.txt
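
If the raw diff is hard to read, the two lists can also be split into "only in the old release" and "only in the new release". A minimal sketch, assuming one table name per line; table_changes is a made-up helper name.

```shell
# Hypothetical helper: list tables present in only one of two table lists.
# Lines starting with "-" exist only in the old list, "+" only in the new one.
table_changes() {
    old_sorted=$(mktemp) && new_sorted=$(mktemp) || return 1
    # comm needs sorted input, so sort both lists first.
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    comm -23 "$old_sorted" "$new_sorted" | sed 's/^/- /'
    comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+ /'
    rm -f "$old_sorted" "$new_sorted"
}
```

Called as table_changes ${PREV_GENCODE_VERSION}.tables.txt ${GENCODE_VERSION}.tables.txt, it prints nothing when the two releases have the same set of tables.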

Drop old tables

  hgsql $db -Ne "drop table knownGene, kgXref;"
  grep -v "ToKg" ${PREV_GENCODE_VERSION}.tables.txt | egrep -vw "knownGene|kgXref"  | awk '{printf "drop table %s;\n", $1}' > toDrop.lst
  cat toDrop.lst | hgsql $db
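
Because the drop statements go straight into the assembly database, it can help to try the filter on a list first and read the result before piping it to hgsql. The sketch below wraps the same filter in a function; make_drop_list is a made-up name, and grep -E stands in for egrep.

```shell
# Sketch: the same filter as above, wrapped so it can be tried on any table list.
# Tables containing "ToKg" and the knownGene/kgXref tables are left out.
make_drop_list() {
    grep -v "ToKg" "$1" | grep -E -vw "knownGene|kgXref" |
        awk '{printf "drop table %s;\n", $1}'
}
```

For example, make_drop_list ${PREV_GENCODE_VERSION}.tables.txt prints one "drop table" statement per remaining table and does not touch the database.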

Check for orphans and drop them (or build them) if appropriate

 hgsql $db -Ne "show tables like 'known%'"  > orphan.lst

Copy tables from the knownGene database to the assembly database

 copyFilesToAssembly.sh ${GENCODE_VERSION}.tables.txt knownGene${GENCODE_VERSION} > copyScript.txt

Edit trackDb to add new trackDb

 include knownGene.ra beta,public
 include knownGene.alpha.ra alpha

Look for the previous trackDb.ra file, normally hg/makeDb/trackDb/<org>/<assembly>/knownGene.ra.

Adding IsPcr server

After building /gbdb/$db/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit, which happens in the buildCore.sh script run at the beginning of the process, ask cluster-admin to start an untranslated, -stepSize=5 gfServer on /gbdb/$db/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit

 to cluster-admin

 Hey my friends,
 
 Could you please start an untranslated -stepSize=5 production gfserver
 with this 2bit file?
 
 hgwdev:/gbdb/mm39/targetDb/mm39KgSeq13.2bit
 
 thanks!
 brian

On hgwdev, drop old records in blatServers and targetDb. Identify the blatServer by the keyword "$db"Kg with the version number appended.

 hgsql hgcentraltest -Ne "delete from blatServers where db like '${db}Kg%'"
 hgsql hgcentraltest -Ne "delete from targetDb where name like '${db}Kg%'"

On hgwdev, insert new records into blatServers and targetDb, using the host (field 2) and port (field 3) specified by cluster-admin. Identify the blatServer by the keyword "$db"Kg with the version number appended.

cluster-admin will say something like this:

 Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921
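
To avoid retyping the host and port by hand, they can be pulled out of the reply. This is just a sketch that assumes the reply has the "on host <host> port <port>" wording shown above; gfserver_host_port is a made-up helper name.

```shell
# Hypothetical helper: print "host port" from a cluster-admin reply of the form
#   Starting untrans gfServer for <name> on host <host> port <port>
gfserver_host_port() {
    awk '{
        host = ""; port = ""
        for (i = 1; i < NF; i++) {
            if ($i == "host") host = $(i + 1)
            if ($i == "port") port = $(i + 1)
        }
        if (host != "" && port != "") print host, port
    }'
}

echo "Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921" | gfserver_host_port
```

Lines that do not contain both keywords produce no output, so the whole email can be piped through it.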

Add this info to blatServers and targetDb tables in hgcentral.

  hgsql hgcentraltest -e \
     "INSERT into blatServers values ('${db}KgSeq${GENCODE_VERSION}', 'blat1b', 17921, 0, 1, '');"
  hgsql hgcentraltest -e \
     "INSERT into targetDb values ('${db}KgSeq${GENCODE_VERSION}', 'GENCODE Genes', \
         '$db', 'kgTargetAli', '', '', \
         '/gbdb/${db}/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit', 1, now(), '');"

all.joiner changes

I haven't added anything to this recently.

The relevant identifier is:

knownGeneId

  joinerCheck all.joiner -identifier=knownGeneId -keys  -database=${db}

Bundle up logs and check them in

Redmine ticket files and tables

Post release push "other species" blast tables

Load the other species blastTab tables.

   buildLoadOther.sh
