Building GenArk genomes

From Genecats
Jump to navigationJump to search

GenArk

The Genome Archive (GenArk) are assembly hubs that come pre-loaded with several annotation tracks, gene models, and the ability to align genomic sequence to the reference assembly using the BLAT alignment tool. The genomes in the GenArk are sourced from NCBI RefSeq, the Vertebrate Genomes Project (VGP), and other projects.


Steps from Hiram's Talk (by Gerardo) 2/2/2022

See Recorded Demo here: https://genome-test.gi.ucsc.edu/~hiram/zoomRecordings/2022-02-02.GenArkBuild/video1137793487.mp4

Before starting

Check to see if you have the goto and gotos command. If not, add the following to your .bashrc file:


function goto() {
export asmId=$1

export gcX=${asmId:0:3}
export d0=${asmId:4:3}
export d1=${asmId:7:3}
export d2=${asmId:10:3}

export destDir=""

if [ "${gcX}" = "GCF" ]; then
   destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}"
else
   destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}"
fi

export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`

if [ -d "${cdDir}" ]; then
   printf "cd $cdDir\n" 1>&2
   cd "$cdDir"
else
   printf "# can not find ${destDir}/${asmId}*\n" 1>&2
fi
}


function gotos() {
export asmId=$1

export gcX=${asmId:0:3}
export d0=${asmId:4:3}
export d1=${asmId:7:3}
export d2=${asmId:10:3}

export destDir="/hive/data/outside/ncbi/genomes/${gcX}/${d0}/${d1}/${d2}"

export dirCount=`ls -d ${destDir}/${asmId}* 2> /dev/null | wc -l`
if [ "${dirCount}" -ne 1 ]; then
   printf "# can not find ${destDir}/${asmId}*\n" 1>&2
else
   export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null`
   if [ -d "${cdDir}" ]; then
     printf "cd $cdDir\n" 1>&2
     cd "$cdDir"
   else
     printf "# can not find ${destDir}/${asmId}*\n" 1>&2
   fi
fi
}


Directory that everything came from ncbi, accumulating via weekly rsync and this is the genome we want to work from:

gotos GCA_001708105.1
pwd
/hive/data/outside/ncbi/genomes/GCA/001/708/105/GCA_001708105.1_ASM170810v1

How big and long its going to take to run:

faSize GCA_001708105.1_ASM170810v1_rna_from_genomic.fna.gz


The command to run

The runBuild command template (scientific name should have underscores instead of blanks):

time (./runBuild assemblyName Scientific_Name clade)

Genark request

When a user sends in a request with an accession ID, e.g.: GCF_002776525.5 the assembly may already exist in some version. To check if something already exists, use just the number part of the ID: 002776525 and check the existing listings in the source tree:

grep 002776525 ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv

They may be asking for a newer version, or they may be asking for a GenBank version when a RefSeq version already exists. Can decide if what we have is better than what they ask for. True, sometimes they may want a specific older version, negotiate that with the user in email to see if they would accept a newer version.

When not there, check to see if has been recognized as something to build. Again, just the number, for RefSeq assemblies:

   grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*

for Genbank:

   grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.*

decide which one is best or most up to date, RefSeq is always first choice. Those are the pre-ready to go build commands to be run in the allBuild directory:

   cd /hive/data/genomes/asmHubs/allBuild
   time (./runBuild GCA_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles) >> GCA_002776525.5.log 2>&1

If it doesn't show up there, check the 'master' listings from NCBI:

  /hive/data/outside/ncbi/genomes/reports/assembly_summary*.txt
assembly_summary_genbank.txt             assembly_summary_refseq.txt
assembly_summary_genbank_historical.txt  assembly_summary_refseq_historical.txt

These asssembly_summary*.txt files can also be scanned for scientific names if that is all the user supplied. When you decide what to do, please notify the user that you have started the build for what they asked for, or if you are making a substitute or pointing to an existing browser, ask them if that satisfies their request. And if you want to copy the Gmail group, use the genark-request-group@ucsc.edu address in the Bcc line so the user doesn't see it. They don't need to know about our recording device.

Regarding "betterName:" from the request, this will come in when you add adding an entry to the [animal group].orderList.tsv file. See the Post runBuild steps below for more info. There is a script in makeDb/doc/asmHubs/commonNames.pl which will help you construct the common name for the orderList.tsv file. But, in some cases it might not help:

# place the name in a temporary file, in this case called '1'
echo GCA_900660155.1_ASM90066015v1 > 1
./commonNames.pl 1
GCA_900660155.1_ASM90066015v1   animals (2019)

That isn't a good name. If it really does have some kind of common name, use that in place of 'animals'. And leave the (2019) in the name, that is the assembly date. You will see the pattern in the orderList.tsv file. Keep that file in order by common name, case insensitive.

Finding the info for runBuild

History of every command used to build the browser:

head kent/src/hg/makeDb/doc/asmHubs/master.run.list

Full assembly name here:

less *report.txt

Example report.txt:


# Assembly name:  ASM170810v1
# Organism name:  Komagataella pastoris (budding yeasts)
# Infraspecific name:  strain=ATCC 28485

Going to need the scientific name, ex. Komagataella pastoris

Find what the clade. See what others are called:

cd ~kent/src/hg/makeDb/doc/
grep -i budding yeasts */*.tsv

fungiAsmHub/fungi.orderList.tsv:GCF_001661345.1_Ascru1	budding yeast A.rubescens DSM 1968
fungiAsmHub/fungi.orderList.tsv:GCF_011074885.1_ASM1107488v2	budding yeast B.bruxellensis UCD 2041
fungiAsmHub/fungi.orderList.tsv:GCF_001661335.1_Babin1	budding yeast B.inositovora NRRL Y-12698
fungiAsmHub/fungi.orderList.tsv:GCF_011074865.1_ASM1107486v2	budding yeast B.nanus
fungiAsmHub/fungi.orderList.tsv:GCF_000182965.3_ASM18296v3	budding yeast C.albicans SC5314
fungiAsmHub/fungi.orderList.tsv:GCF_002775015.1_Cand_auris_B11221_V1	budding yeast C.auris
fungiAsmHub/fungi.orderList.tsv:GCF_000026945.1_ASM2694v1	budding yeast C.dubliniensis CD36
...


Running the runBuild command

The command can be run anywhere but preferably accumulated into one directory:

cd /hive/data/genomes/asmHubs/allBuild

Saved the command in a file:

cd /hive/data/genomes/asmHubs/allBuild/history
echo './runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris' > GCA_001708105.1.list


Move to a screen:

screen -S screenName

You can exit the screen by doing Ctrl-a then Ctrl-d

You can reattach to the screen by the following command:

screen -r -d screenName

Run the command:

cd /hive/data/genomes/asmHubs/allBuild
time (./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris) > GCA_001708105.1.log 2>&1 &

Even before building the genome, it allows to compare the genome with the other ones to find out the equivalence between all the different genomes. Everything has an idKeys directory where this work has started before


Checking progress

Look at progress:

goto GCA_001708105.1

Check what track is getting built (can get stuck here):

og trackData/

Can see each step happening in the build log:

tail -f build.log

You can confirm the run finished when it makes a trackDb:

GCA_001708105.1_ASM170810v1.trackDb.txt


Post runBuild steps

Add the command:

./runBuild GCA_001708105.1 GCA_001708105.1_ASM170810v1 fungi Komagataella_pastoris

In the master run list in the source tree, record that this has been done and sorted (ordered by GC identifier):

kent/src/hg/makeDb/doc/asmHubs/master.run.list

Commit changes, ex.

git commit -m 'adding fungi per user request, refs #28821'

Add an entry with a common name and strain in orderList.tsv file. In alphabetical order by the second column, and case insensitive. The 'better name' from the genArk request can be used here. You can also look at the *report.txt and use the nomenclature in the clade orderList.tsv file. Here is an example of an entry:

GCA_001708105.1_ASM170810v1     budding yeast K.pastoris ATCC 28485

to:

 makeDb/doc/fungiAsmHub/fungi.orderList.tsv

The second column would show up in the pull down menu.

Commit changes, ex.

git commit -m 'adding yeast K.pastoris per user request refs #28821'

Pushing to the RR

Run the following make in makeDb/doc/[animalGroup]AsmHub to get the symlinks constructed and make this assembly appear in the outside world:

time (make) > dbg 2>&1 &

Check for errors:

grep -i error dbg
tail dbg

This make sets up a number of file links to get pushed out.

Same directory, verify that it worked:

time (make verifyTestDownload) >> test.down.log 2>&1 &

Its going through each assembly in that assembly hub and making sure it has enough tracks to be a legitimate assembly hub. The test.down.log file should end in the line something like:

  tail test.down.log
  # checked 598 hubs, 598 success,   0 fail, total tracks: 10288, 2023-06-17 12:40:15

And if you accumulate that test.down.log file, you can see the changes over time via:

  grep check test.down.log

for each of those 'checked' lines.

You should be able to see for yourself the assembly in the hgDownload-test listing and file directory. Example below:

https://hgdownload-test.gi.ucsc.edu/hubs/GCA/011/100/615/GCA_011100615.1/
https://hgdownload-test.gi.ucsc.edu/hubs/primates/index.html


Verify you can login to hgdownload and dynablat. For the first time, you may need to update your keys. If it doesn't work, you may already have a key and you may need to delete it. The command with "date" afterward is a way to check your access without actually staying in that server.

ssh qateam@hgdownload.soe.ucsc.edu date
ssh qateam@dynablat-01.soe.ucsc.edu date


Run RR push from ~/kent/src/hg/makeDb/doc/[animalGroup]AsmHub:

time (make sendDownload ) >> send.down.log 2>&1 &

check for errors:

  grep -i error send.down.log
  tail send.down.log

Then final check that it is on hgdownload:

time (make verifyDownload) > verify.down.log 2>&1 &

Should have a valid last 'checked' line

tail verify.down.log 2>&1