The Pubs pipeline

This is a description of the pipeline that builds the publications track.

Data travels through these directories:

  1. /hive/data/outside/pubs: Text is downloaded from publisher systems in various formats, mostly XML, but also PDF, .docx, almost any format imaginable. There is one directory per publishing (=download) system. The tools that fill this directory are named pubGetXXX, e.g. pubGetSpringer downloads into /hive/data/outside/pubs/springer. The directory "crawler" is filled by pubCrawl2, a tool that is as finicky as the rest of the pipeline combined.
  2. /hive/data/inside/pubs/text: Text in various formats is converted from /hive/data/outside/pubs into this directory as ASCII text. Tools that fill this directory are named pubConvXXX, e.g. pubConvSpringer converts Springer XML files to plain text. The format of the files in this directory is somewhat special: each chunk consists of two tab-separated files (articles and files), split over many smaller files for cluster processing (see the sketch after this list).
  3. /hive/data/inside/pubs/map: This directory is used by pubMap, the main tool that produces the track. It contains BLAT results and metadata extracted from the text.
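
As a concrete illustration, here is a minimal Python sketch of reading one text chunk from /hive/data/inside/pubs/text. The chunk file names (0_00000.articles.gz and 0_00000.files.gz), the gzip compression and the presence of a header line are assumptions for this example; check the actual files before relying on them.

 # Minimal sketch: read one text chunk as two tab-separated tables.
 # File names, gzip compression and the header line are assumed here,
 # not taken from the pipeline code.
 import gzip

 def read_tab_file(path):
     """Yield one dict per data row of a tab-separated file with a header line."""
     with gzip.open(path, "rt", encoding="utf8") as fh:
         headers = fh.readline().rstrip("\n").split("\t")
         for line in fh:
             fields = line.rstrip("\n").split("\t")
             yield dict(zip(headers, fields))

 # articles: one row of metadata per article
 for article in read_tab_file("/hive/data/inside/pubs/text/elsevier/0_00000.articles.gz"):
     print(article)

 # files: one row per file that belongs to an article (main text, supplements, ...)
 for fileRow in read_tab_file("/hive/data/inside/pubs/text/elsevier/0_00000.files.gz"):
     print(fileRow)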

Cronjobs:

  1. Every night, Medline, PMC, Springer and Elsevier are downloaded. They are all converted to text right away (see the sketch below).
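
The following Python sketch shows what such a nightly job could look like. The pubGetXXX/pubConvXXX tool names follow the naming convention described above, but the command-line arguments, the exact source names and the crontab entry are assumptions, not the real setup.

 # Hypothetical nightly driver for the cron job described above.
 # The tool arguments and source names are assumptions; only the
 # pubGetXXX/pubConvXXX naming convention comes from this page.
 import subprocess

 SOURCES = ["Medline", "Pmc", "Springer", "Elsevier"]

 def update_source(name):
     """Download new data for one publisher system, then convert it to text."""
     outside = "/hive/data/outside/pubs/" + name.lower()
     textDir = "/hive/data/inside/pubs/text/" + name.lower()
     # pubGetXXX fills /hive/data/outside/pubs/<system> (assumed invocation)
     subprocess.check_call(["pubGet" + name, outside])
     # pubConvXXX converts the downloads to tab-separated text chunks (assumed invocation)
     subprocess.check_call(["pubConv" + name, outside, textDir])

 if __name__ == "__main__":
     for source in SOURCES:
         update_source(source)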