Text and Data Mining

PLOS provides access to articles and article meta-data in multiple ways. The preferred method of access depends on the use case.

Text and Data Mining (Bulk Downloads)

Text and data mining (TDM) practitioners generally want a copy of the entire corpus and write specialized software to process it. Bulk downloading is the most efficient way to obtain that copy. PubMed Central (PMC) has made this easy by packaging the Open Access Subset of research articles from multiple journals into single files and making them available via the PMC OA Bulk Download FTP site. PMC also publishes a description of the files and what they contain.
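
As a rough illustration of what processing a bulk download can look like, the sketch below walks a gzip-compressed tar archive and yields the name and raw bytes of each XML file it contains. It assumes the archive has already been downloaded to local disk and that article files end in .xml or .nxml; both the function name iter_article_xml and the suffix check are assumptions made for this example, not a description of the PMC file layout.

```python
import tarfile
from typing import Iterator, Tuple


def iter_article_xml(archive_path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, raw XML bytes) for each XML file in a .tar.gz archive."""
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and other non-regular entries.
            if not member.isfile():
                continue
            # Assumption: article files use a .xml or .nxml suffix.
            if not member.name.lower().endswith((".xml", ".nxml")):
                continue
            extracted = archive.extractfile(member)
            if extracted is not None:
                yield member.name, extracted.read()
```

Reading one member at a time keeps memory use low, which matters when an archive holds many thousands of articles.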

PMC is an invaluable resource for TDM. Writing specialized software takes time and effort, and writing software to download data from hundreds or thousands of journals is a huge barrier for TDM. Open Access (OA) journals remove this barrier in two important ways.

First, OA article text and meta-data are provided in a single XML file format: the Journal Archive and Interchange Tag Set (JATS). Writing software to process JATS XML requires a larger upfront investment, but the reward is the ability to process articles from many journals in addition to PLOS.
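
To give a sense of that upfront investment, here is a minimal sketch that pulls the DOI, title, and abstract out of a JATS document with Python's standard library. The element names (article-meta, article-id, title-group, article-title, abstract) come from the JATS tag set; the function name extract_metadata and the choice of fields are assumptions for illustration. Real articles may carry a DOCTYPE, entities, or several abstracts that this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def extract_metadata(jats_xml: str) -> Dict[str, Optional[str]]:
    """Return the DOI, title, and abstract text from a JATS XML string."""
    root = ET.fromstring(jats_xml)

    def text_of(path: str) -> Optional[str]:
        node = root.find(path)
        if node is None:
            return None
        # itertext() also collects text inside inline tags such as <italic>;
        # split/join collapses line breaks and repeated spaces.
        return " ".join("".join(node.itertext()).split())

    return {
        "doi": text_of(".//article-meta/article-id[@pub-id-type='doi']"),
        "title": text_of(".//article-meta/title-group/article-title"),
        "abstract": text_of(".//article-meta/abstract"),
    }
```

Because the paths are anchored at article-meta, titles that appear inside the reference list are not picked up by mistake.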

Second, OA articles are freely available to download and use for TDM. TDM software does not have to be written for non-standard publisher APIs that change frequently or that do not exist at all. OA publishers syndicate articles to PMC, which provides this data as an ongoing service that is updated on a regular basis. Closed access publishers simply have no incentive to increase the value of research by making it broadly available for TDM.

PLOS API (Non-Bulk Downloads)

PLOS provides three ways to access PLOS articles or data about them. These methods are not as useful for bulk downloads, but they give anyone with a specific interest in PLOS articles and data a way to access them.

JATS XML

The Journal Archive and Interchange Tag Set (JATS) is the standard used to archive scientific articles. JATS XML is the most convenient format for TDM because the data is structured: article text and meta-data can be accessed in a single file and in a standard way. Downloading individual article XML from the PLOS website is simple if the DOI of the article is known. Bulk downloading of XML is discouraged, but this method is useful in conjunction with the PLOS Solr API to identify specific articles of interest.

Example:
http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170929&type=manuscript

Article PDF

Each PLOS article is also available as a PDF. Article PDFs have limited utility for TDM but are useful for printing or reading the article offline. Bulk downloading of article PDFs is discouraged.

Example:
http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170929&type=printable

HTML Article Page

Article HTML is the primary method used to view PLOS articles online. Scraping the article HTML is a technique used by search engines to index articles, and it can also be used for TDM. It is generally less useful for TDM because the article pages change over time, the data is not structured, and meta-data is not easily identified. Bulk downloading of article HTML is discouraged.

Example:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170929
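
The three example URLs follow one pattern, so they can be generated from a DOI. The sketch below is one way to do that; the function name article_urls is made up for this example, and the journal argument defaults to the plosone path segment seen in the examples. Path segments for other PLOS journals are an assumption and should be checked against the site.

```python
from urllib.parse import urlencode

BASE_URL = "http://journals.plos.org"


def article_urls(doi: str, journal: str = "plosone") -> dict:
    """Return the XML, PDF, and HTML URLs for one article DOI."""
    article = f"{BASE_URL}/{journal}/article"

    def query(**params: str) -> str:
        # safe="/" leaves the slash in the DOI unescaped, as in the examples.
        return urlencode(params, safe="/")

    return {
        "xml": f"{article}/file?{query(id=doi, type='manuscript')}",
        "pdf": f"{article}/file?{query(id=doi, type='printable')}",
        "html": f"{article}?{query(id=doi)}",
    }
```

Calling article_urls("10.1371/journal.pone.0170929") reproduces the three example URLs for that article.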

Conclusion

Virtually any scripting or compiled programming language can be used to access PLOS articles via HTTP. Bulk downloading using these URLs is discouraged because it can interfere with access to the site if not done with care. Downloading a small subset of articles based on Solr search results is acceptable for periodic updates of a previously bulk-downloaded corpus.
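
One way to download "with care" is to fetch a short list of URLs one at a time with a pause between requests. The sketch below illustrates the idea; the function names and the one-second default delay are assumptions chosen for this example, not a limit published by PLOS.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str) -> bytes:
    """Download one URL and return the response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_politely(
    urls: Iterable[str],
    delay_seconds: float = 1.0,  # assumed pause, not a PLOS-specified value
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch URLs sequentially, pausing between consecutive requests."""
    results: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        # Pause before every request except the first one.
        if index > 0:
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

The fetch and sleep arguments can be swapped out, which makes it possible to test the loop without touching the network.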