The Pubs pipeline

This is a description of the pipeline that builds the publications track.

Data travels through these directories:

/hive/data/outside/pubs: Data is downloaded from publisher systems here. The contents come in various formats, mostly XML, but also PDF, .docx and almost any other format imaginable. There is one directory per publishing (= download) system. The tools that fill this directory are named pubGetXXX; for example, pubGetSpringer downloads into /hive/data/outside/pubs/springer. The directory "crawler" is filled by pubCrawl2, a tool that is as finicky as the rest of the pipeline put together.
/hive/data/inside/pubs/text: Data from /hive/data/outside/pubs is converted to text and written to this directory. The tools that produce it are named pubConvXXX; for example, pubConvSpringer converts Springer XML files to plain text. The file format in this directory is somewhat special: two tab-separated tables (articles and files), split over many files for cluster processing.
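
The naming convention described above can be sketched in a few lines of Python. This is only an illustration: the function stage_names is hypothetical and not part of the pipeline, and it assumes that every download system follows the pubGetXXX / pubConvXXX pattern shown for Springer. The "crawler" directory is an exception, since it is filled by pubCrawl2. The layout below /hive/data/inside/pubs/text is not described here, so the sketch returns only the base directory.

```python
from pathlib import PurePosixPath

# Base directories as named in this document.
DOWNLOAD_BASE = PurePosixPath("/hive/data/outside/pubs")
TEXT_BASE = PurePosixPath("/hive/data/inside/pubs/text")


def stage_names(system: str) -> dict:
    """Return the tool and directory names for one download system.

    Hypothetical helper: it assumes the pubGetXXX / pubConvXXX pattern
    holds for the given system, as it does for "springer".
    """
    # Capitalize only the first letter: "springer" -> "Springer".
    suffix = system[:1].upper() + system[1:]
    return {
        "download_tool": "pubGet" + suffix,
        "download_dir": str(DOWNLOAD_BASE / system),
        "convert_tool": "pubConv" + suffix,
        # Sub-layout of the text directory is not specified in this document.
        "text_base": str(TEXT_BASE),
    }


if __name__ == "__main__":
    for key, value in stage_names("springer").items():
        print(f"{key}: {value}")
```

For "springer" this gives pubGetSpringer, /hive/data/outside/pubs/springer and pubConvSpringer, matching the examples in the list above.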
