KnownGene build
Revision as of 23:50, 1 February 2022
Build UniProt and Protein databases
I haven't been doing this step recently. We need to look into whether the work Max has done with UniProt should replace it.
Initialize work directory
- Set version variable
export GENCODE_VERSION=V39
- Start a screen.
screen -S knownGene$GENCODE_VERSION
- Create and cd into work directory of the form /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
export db=hg38
mkdir /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
cd /hive/data/genomes/$db/bed/gencode$GENCODE_VERSION/build
- Set PATH to include $HOME/kent/src/hg/utils/otto/knownGene
PATH=$HOME/kent/src/hg/utils/otto/knownGene":$PATH"
- Copy buildEnv.sh from previous build on this db
cp /hive/data/genomes/$db/bed/gencodeVM27/build/buildEnv.sh buildEnv.sh
# edit buildEnv.sh to have correct values
. buildEnv.sh
- Find Table and File list from previous build
cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.files.txt .
cp ${oldGeneDir}/${PREV_GENCODE_VERSION}.tables.txt .
- Confirm existing assembly tables are in a knownGene* database
hgsql ${oldKnownDb} -Ne "show tables" > ${oldKnownDb}.tables.txt
diff ${PREV_GENCODE_VERSION}.tables.txt ${oldKnownDb}.tables.txt
Setting environment variables
The environment variables used in the build are set in the script buildEnv.sh. All the other scripts assume that this script has been sourced in the current shell. You have to edit this by hand. Most of the variables don't change. The hairiest ones are the other assemblies for the blast tables.
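Because every later script assumes buildEnv.sh has been sourced, a quick check that the expected variables are non-empty can catch a forgotten edit before the build starts. The following is a minimal sketch, not part of the otto scripts: `require_vars` is a hypothetical helper, and the variable names in the usage comment are only the ones that appear on this page, assumed here to be set by buildEnv.sh or exported earlier.

```shell
# Hypothetical helper (not part of the knownGene scripts):
# return non-zero if any of the named variables is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "buildEnv check: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, after sourcing buildEnv.sh (variable list is an assumption):
#   . buildEnv.sh
#   require_vars db GENCODE_VERSION PREV_GENCODE_VERSION oldGeneDir oldKnownDb || exit 1
```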
Running the build
To run the build execute hg/utils/otto/knownGene/buildKnown.sh.
buildKnown.sh &
tail -f doKnown.log
It builds into the knownGene${GENCODE_VERSION} database. It does the following steps:
- Extracting Gencode data
- Building initial knownGene table
- Adding primary reference tables
- Building final knownGene core tables
- Building bigGenePred
- Building GTF file
Copying over tables
drop chromInfo and history from knownGene database
hgsql knownGene${GENCODE_VERSION} -Ne "drop table chromInfo, history"
hgsql knownGene${GENCODE_VERSION} -Ne "show tables" > ${GENCODE_VERSION}.tables.txt
look for unexpected differences between this release and the last one
diff ${PREV_GENCODE_VERSION}.tables.txt ${GENCODE_VERSION}.tables.txt
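When the two lists are long, it can be easier to see only the table names that were added or removed. This is a sketch of one way to do that with `sort` and `comm`; `table_changes` is a hypothetical helper, not one of the build scripts.

```shell
# Hypothetical helper: print "- name" for tables only in the old list
# and "+ name" for tables only in the new list.
table_changes() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$old_sorted"
    LC_ALL=C sort -u "$2" > "$new_sorted"
    LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | sed 's/^/- /'
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+ /'
    rm -f "$old_sorted" "$new_sorted"
}

# Usage:
#   table_changes ${PREV_GENCODE_VERSION}.tables.txt ${GENCODE_VERSION}.tables.txt
```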
drop old tables
hgsql $db -Ne "drop table knownGene, kgXref;"
grep -v "ToKg" ${PREV_GENCODE_VERSION}.tables.txt | egrep -vw "knownGene|kgXref" | awk '{printf "drop table %s;\n", $1}' > toDrop.lst
cat toDrop.lst | hgsql $db
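Since the drop list is piped straight into hgsql, it may be worth seeing what the filter produces on a small sample first. The sketch below wraps the same grep/awk pipeline in a hypothetical function, `make_drop_list`, that reads table names on stdin; the table names in the usage comment are made up for illustration.

```shell
# Hypothetical wrapper around the filter used to build toDrop.lst.
# Reads table names on stdin, writes "drop table ...;" statements on stdout.
make_drop_list() {
    # Skip any table with "ToKg" in its name, skip knownGene and kgXref
    # (whole-word match only), and turn each remaining name into a statement.
    grep -v "ToKg" \
        | grep -Evw "knownGene|kgXref" \
        | awk '{printf "drop table %s;\n", $1}'
}

# Dry run on made-up names:
#   printf '%s\n' knownGene kgXref knownExampleA exampleToKgB knownGeneExample | make_drop_list
```

Because of the `-w` flag, a name such as `knownGeneExample` is not treated as a match for `knownGene` and so still ends up in the drop list; in the dry run above only `knownExampleA` and `knownGeneExample` produce statements.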
check for orphans and drop them (or build them) if appropriate
hgsql $db -Ne "show tables like 'known%'" > orphan.lst
copy tables from knownGene database to assembly database
copyFilesToAssembly.sh VM28.tables.txt knownGene${GENCODE_VERSION} > copyScript.txt
Edit trackDb to add new trackDb
include knownGene.ra beta,public
include knownGene.alpha.ra alpha
Look for the previous trackDb.ra file, normally hg/makeDb/trackDb/<org>/<assembly>/knownGene.ra.
Adding IsPcr server
After building /gbdb/$db/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit, which happens in the buildCore.sh script run at the beginning of the process, ask cluster-admin to start an untranslated, -stepSize=5 gfServer on /gbdb/$db/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit
to cluster-admin
Hey my friends,
Could you please start an untranslated -stepSize=5 production gfserver with this 2bit file?
hgwdev:/gbdb/mm39/targetDb/mm39KgSeq13.2bit
thanks!
brian
On hgwdev, drop old records in blatServers and targetDb. Identify the
blatServer by the keyword "$db"Kg with the version number appended.
hgsql hgcentraltest -Ne "delete from blatServers where db like '${db}Kg%'"
hgsql hgcentraltest -Ne "delete from targetDb where name like '${db}Kg%'"
On hgwdev, insert new records into blatServers and targetDb, using the
host (field 2) and port (field 3) specified by cluster-admin. Identify the
blatServer by the keyword "$db"Kg with the version number appended.
Cluster-admin will say something like this:
Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921
Add this info to blatServers and targetDb tables in hgcentral.
hgsql hgcentraltest -e \
"INSERT into blatServers values ('${db}KgSeq${GENCODE_VERSION}', 'blat1c', 17921, 0, 1, '');"
hgsql hgcentraltest -e \
"INSERT into targetDb values('${db}KgSeq${GENCODE_VERSION}', 'GENCODE Genes', \
'$db', 'kgTargetAli', '', '', \
'/gbdb/${db}/targetDb/${db}KgSeq${GENCODE_VERSION}.2bit', 1, now(), '');"
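To avoid retyping the host and port by hand, they can be pulled out of the cluster-admin reply. This is only a sketch: it assumes the reply contains the phrase "on host NAME port NUMBER" as in the sample message above, and `parse_gfserver_reply`, `blatHost`, and `blatPort` are hypothetical names.

```shell
# Hypothetical helper: read a cluster-admin reply on stdin and print "host port".
# Assumes the reply contains "... on host <name> port <number>".
parse_gfserver_reply() {
    sed -n 's/.* on host \([^ ]*\) port \([0-9][0-9]*\).*/\1 \2/p'
}

reply="Starting untrans gfServer for mm39KgSeqV38 on host blat1b port 17921"
hostPort=$(printf '%s\n' "$reply" | parse_gfserver_reply)
blatHost=${hostPort% *}   # text before the space
blatPort=${hostPort#* }   # text after the space
echo "host=$blatHost port=$blatPort"
```

The two variables can then stand in for the literal host and port in the INSERT statements. If the reply does not match the assumed wording, `hostPort` is empty, so check it before using the values.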
all.joiner changes
I haven't added anything to this recently.
The relevant identifier is:
knownGeneId
joinerCheck all.joiner -identifier=knownGeneId -keys -database=${db}
Bundle up logs and check them in
Redmine ticket files and tables
Post release push "other species" blast tables
Load the other species blastTab tables.
buildLoadOther.sh