The Pubs pipeline

From Genecats
Revision as of 18:03, 5 July 2016 by Max

This is a description of the pipeline that builds the publications track.

Data travels through these directories:

  1. /hive/data/outside/pubs: Data is downloaded from the publishers' systems into this directory. The contents come in various formats: mostly XML, but also PDF, .docx and almost any other format imaginable. There is one directory per publishing (= download) system. The tools that fill this directory are named pubGetXXX, e.g. pubGetSpringer downloads into /hive/data/outside/pubs/springer. The directory "crawler" is filled by pubCrawl2, a tool that is as finicky as the rest of the pipeline combined.
  2. /hive/data/inside/pubs/text: Data from /hive/data/outside/pubs is converted to text and written to this directory. The tools that produce this directory are named pubConvXXX, e.g. pubConvSpringer converts Springer XML files to plain text. The file format in this directory is somewhat unusual: it consists of two tab-separated tables (articles and files), split across many files for cluster processing.
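
The naming convention described above can be sketched in a few lines of Python. This is only an illustration, not part of the pipeline: the helper pipeline_names is hypothetical, and it assumes that the download directory is the lower-cased publisher name, which is what the pubGetSpringer example suggests. The "crawler" directory does not follow this pattern, since it is filled by pubCrawl2. How the text directory is laid out per publisher is not stated here, so the sketch returns only the base path.

```python
from pathlib import PurePosixPath

# Base directories named in the description above.
DOWNLOAD_BASE = PurePosixPath("/hive/data/outside/pubs")
TEXT_BASE = PurePosixPath("/hive/data/inside/pubs/text")


def pipeline_names(publisher: str) -> dict:
    """Derive tool names and directories for one publisher.

    Hypothetical helper for illustration; it is not a pipeline tool.
    """
    name = publisher.strip()
    if not name:
        raise ValueError("publisher name must not be empty")
    suffix = name[0].upper() + name[1:]  # "springer" -> "Springer"
    return {
        "download_tool": "pubGet" + suffix,
        # Assumption: directory name is the lower-cased publisher name.
        "download_dir": str(DOWNLOAD_BASE / name.lower()),
        "convert_tool": "pubConv" + suffix,
        # Only the base text directory is documented; any per-publisher
        # layout below it is deliberately not guessed here.
        "text_base_dir": str(TEXT_BASE),
    }


if __name__ == "__main__":
    for key, value in pipeline_names("springer").items():
        print(f"{key}: {value}")
```

For "springer" this yields pubGetSpringer, /hive/data/outside/pubs/springer and pubConvSpringer, matching the examples in the list above.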