The Pubs pipeline

From Genecats

This is a description of the pipeline that builds the publications track.

Data travels through these directories:

  1. /hive/data/outside/pubs: Text is downloaded from publisher systems in various formats, mostly XML, but also PDF, .docx and almost any other format imaginable. There is one directory per publishing (= download) system. The tools that fill this directory are named pubGetXXX, e.g. pubGetSpringer downloads into /hive/data/outside/pubs/springer. The directory "crawler" is filled by pubCrawl2, a tool that is as finicky as the rest of the pipeline put together.
  2. /hive/data/inside/pubs/text: The files in /hive/data/outside/pubs are converted to ASCII text and written to this directory. The tools that fill this directory are named pubConvXXX, e.g. pubConvSpringer converts Springer XML files to plain text. The file format in this directory is somewhat special: two tab-separated tables (articles and files), each split over many smaller files for cluster processing.
  3. /hive/data/inside/pubs/map: This directory is used by pubMap, the main tool that produces the track. It contains BLAT results and metadata extracted from the text.
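
The naming convention in the list above can be summarized in a small Python sketch. It is only an illustration: the function describe_source and the dictionary keys are hypothetical, and the rule "prefix plus capitalized system name" is an assumption generalized from the pubGetSpringer / pubConvSpringer example. It may not hold for every download system.

```python
from pathlib import PurePosixPath

# The three directories named in the list above.
DOWNLOAD_ROOT = PurePosixPath("/hive/data/outside/pubs")
TEXT_ROOT = PurePosixPath("/hive/data/inside/pubs/text")
MAP_ROOT = PurePosixPath("/hive/data/inside/pubs/map")


def describe_source(system):
    """Return tool names and directories for one download system.

    Assumption: tool names are the prefix plus the capitalized system
    name, as in the pubGetSpringer / pubConvSpringer example.
    """
    suffix = system[:1].upper() + system[1:]
    return {
        "download_tool": "pubGet" + suffix,
        "download_dir": str(DOWNLOAD_ROOT / system),
        "convert_tool": "pubConv" + suffix,
        "text_root": str(TEXT_ROOT),
        "map_root": str(MAP_ROOT),
    }


info = describe_source("springer")
print(info["download_tool"], "->", info["download_dir"])
print(info["convert_tool"], "->", info["text_root"])
```

Note that the "crawler" directory does not follow this pattern, since it is filled by pubCrawl2 rather than by a pubGetXXX tool.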

Cronjobs:

  1. Every night, Medline, PMC, Springer and Elsevier are downloaded. They are all converted to text right away.
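
The order of the nightly work can be sketched as follows. This is a rough illustration, not the actual cron setup: the function nightly_plan and the lowercase source names are hypothetical, and it assumes "right away" means each source is converted immediately after its own download.

```python
# Sources handled by the nightly cronjob, as listed above.
NIGHTLY_SOURCES = ["medline", "pmc", "springer", "elsevier"]


def nightly_plan(sources=NIGHTLY_SOURCES):
    """Return the ordered (step, source) pairs for one nightly run.

    Assumption: conversion of a source follows directly after its download.
    """
    plan = []
    for source in sources:
        plan.append(("download", source))
        plan.append(("convert", source))
    return plan


for step, source in nightly_plan():
    print(step, source)
```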