Text and Data Mining

PLOS provides access to its article corpus and article meta-data (data about the article) in multiple ways. The preferred method of access depends on the use case.

DOI (Digital Object Identifier)

All PLOS articles are assigned a unique DOI. All DOIs can be resolved at doi.org. For example, the PLOS Medicine article with the DOI 10.1371/journal.pmed.0020124 can be resolved at http://doi.org/10.1371/journal.pmed.0020124. Every PLOS article is indexed by DOI in our Solr search API. The search API can be used to download PLOS article metadata, to identify a subset of articles of interest, or to get the DOI of every published PLOS article.
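As a rough illustration of the two lookups described above, the following Python sketch builds a DOI resolver URL and a Solr-style search URL. The search endpoint (http://api.plos.org/search) and the parameter names q, fl, rows, start and wt are assumptions based on common Solr conventions and are not stated on this page; check the search API documentation before relying on them. The sketch only builds URLs and makes no network requests.

```python
from urllib.parse import urlencode

DOI_RESOLVER = "http://doi.org/"
# Assumption: endpoint of the PLOS Solr search API (not given in this document).
SEARCH_ENDPOINT = "http://api.plos.org/search"


def doi_url(doi):
    """Return the doi.org URL that resolves the given DOI."""
    return DOI_RESOLVER + doi


def search_url(query, fields=("id",), rows=10, start=0):
    """Build a search URL that asks only for the listed fields.

    The parameter names follow standard Solr usage and are assumptions here.
    """
    params = {
        "q": query,               # Solr query string
        "fl": ",".join(fields),   # fields to return; "id" is assumed to hold the DOI
        "rows": rows,             # page size
        "start": start,           # offset, for paging through results
        "wt": "json",             # response format
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(doi_url("10.1371/journal.pmed.0020124"))
    print(search_url('title:"open access"'))
```

Paging through results by increasing start in steps of rows would be one way to collect DOIs for a larger subset of articles.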

Text and Data Mining (Bulk downloads)

Text and Data Miners (TDM) generally want a copy of the entire corpus and write specialized software to process the data. Bulk downloading is the most efficient method for obtaining a copy of the entire corpus. PubMed Central (PMC) has made this extremely easy by packaging the Open Access Subset of research articles from multiple journals into single files and making them available via the PMC OA Bulk Download FTP site. A description of the files and what they contain can be found here.

PMC is an invaluable resource for TDM. Writing specialized software takes time and effort, and writing software to download data from hundreds or thousands of journals is a huge barrier for TDM. Open Access (OA) journals remove this barrier in two important ways.

First, OA article text and meta-data are provided in a single XML format: the Journal Archive and Interchange Tag Set (JATS). Writing software to process JATS XML requires a larger upfront investment, but the reward is the ability to process articles from many other journals in addition to PLOS.
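To give a sense of what processing JATS XML involves, here is a minimal Python sketch that pulls the DOI, title and body paragraphs out of a JATS-like document using only the standard library. The sample document is a tiny hand-written fragment, not a real PLOS article, and real JATS files contain many more elements (and sometimes a DOCTYPE declaration) that a production parser would need to handle.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment; not taken from a real article.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.pone.0170929</article-id>
      <title-group>
        <article-title>A <italic>sample</italic> title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <p>First paragraph.</p>
    <sec><p>Second paragraph, inside a section.</p></sec>
  </body>
</article>"""


def text_of(element):
    """Return all text inside an element, including text in inline markup."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def parse_jats(xml_text):
    """Extract the DOI, title and body paragraphs from JATS-like XML."""
    root = ET.fromstring(xml_text)
    doi = root.find(".//article-id[@pub-id-type='doi']")
    title = root.find(".//article-title")
    body = root.find("body")
    paragraphs = [text_of(p) for p in body.iter("p")] if body is not None else []
    return {"doi": text_of(doi), "title": text_of(title), "paragraphs": paragraphs}


if __name__ == "__main__":
    print(parse_jats(SAMPLE_JATS))
```

Because the same element names are used across JATS files, a function like this should in principle work on articles from other journals as well, which is the payoff described above.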

Second, OA articles are freely available to download and use for TDM as part of our CC-BY license standard. Individual publisher APIs change frequently or do not exist. OA publishers syndicate articles to PMC, which provides this data as an ongoing service that is updated on a regular basis. Closed access publishers often do not make their text available for TDM, or do so only under certain restrictions.

Download PLOS Corpus as JATS XML
Download PLOS Corpus as Text

Note: These files also contain articles from journals 
      other than PLOS. PLOS journal articles can be found 
      in directories with PLOS in the name.
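The note above suggests a simple filter once an archive has been unpacked. The Python sketch below keeps only files that sit in a directory whose name contains "PLOS". The matching is case-insensitive, which is an assumption about how the directories are spelled, and the example paths are hypothetical rather than copied from the actual archives.

```python
from pathlib import PurePosixPath


def in_plos_directory(path):
    """True if any directory component of the path contains 'plos'.

    Case-insensitive matching is an assumption about directory naming.
    The file name itself is ignored; only its parent directories count.
    """
    directories = PurePosixPath(path).parts[:-1]
    return any("plos" in part.lower() for part in directories)


def plos_only(paths):
    """Keep only the paths that sit inside a PLOS directory."""
    return [p for p in paths if in_plos_directory(p)]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example = ["PLoS_One/article_a.nxml", "Other_Journal/article_b.nxml"]
    print(plos_only(example))
```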

PLOS API (Non-Bulk Downloads)

PLOS provides three ways to access PLOS articles or data about them. These methods are not as useful for bulk downloads, but they give anyone with a specific interest in PLOS articles and data a way to access them.

JATS XML

The Journal Archive and Interchange Tag Set (JATS) is the standard used to archive scientific articles. JATS XML is the most convenient format for TDM because the data is structured. Article text and meta-data can be accessed in a single file and in a standard way. Downloading individual article XML from the PLOS website is simple if the DOI of the article is known. Bulk downloading of XML is discouraged, but this method is useful in conjunction with the PLOS Solr API to identify specific articles of interest.

Example: 
http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170929&type=manuscript

Article PDF

Each PLOS article is also available as a PDF. Article PDFs have limited utility for TDM but are useful for printing or reading the article offline. Bulk downloading of article PDFs is discouraged.

Example:
http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170929&type=printable

HTML Article Page

Article HTML is the primary method used to view PLOS articles online. Scraping the article HTML is a technique used by search engines to index articles and can be used for TDM. It is generally less useful for TDM because the article pages change over time, the data is not structured and meta-data is not easily identified. Bulk downloading of article HTML is discouraged.

Example:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170929
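The three example URLs above share one pattern, so they can be generated from a DOI. The Python sketch below does this under the assumption that the journal path segment ("plosone" in the examples) is known for the article in question; how that segment maps to other PLOS journals is not covered here.

```python
BASE = "http://journals.plos.org"


def article_urls(doi, journal="plosone"):
    """Return the XML, PDF and HTML URLs for one article.

    The journal argument is the path segment seen in the examples above;
    values for other journals are an assumption to verify.
    """
    article = f"{BASE}/{journal}/article"
    return {
        "xml": f"{article}/file?id={doi}&type=manuscript",
        "pdf": f"{article}/file?id={doi}&type=printable",
        "html": f"{article}?id={doi}",
    }


if __name__ == "__main__":
    for kind, url in article_urls("10.1371/journal.pone.0170929").items():
        print(kind, url)
```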

Conclusion

Virtually any scripting or compiled programming language can be used to access PLOS articles via HTTP. Bulk downloading using these URLs is discouraged because it can interfere with access to the site if not done with care. Downloading a small subset of articles based on Solr search results would be acceptable for periodic updates of a previously bulk downloaded corpus.
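One way to download a small subset "with care" is to fetch the articles one at a time with a pause between requests. The Python sketch below shows the idea. The one-second default delay is an arbitrary assumption, not a limit published by PLOS, and the fetch and sleep parameters exist only so the loop can be exercised without touching the network.

```python
import time
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_politely(urls, delay_seconds=1.0, fetch=fetch_bytes, sleep=time.sleep):
    """Fetch URLs sequentially, pausing between consecutive requests.

    delay_seconds is an assumed, conservative default; adjust it to
    whatever the site operator asks for.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    # Dry run with a stand-in fetch function, so nothing is downloaded.
    demo = fetch_politely(["url-1", "url-2"], delay_seconds=0,
                          fetch=lambda u: u.encode())
    print(demo)
```

In practice the list of URLs would come from a Solr search for articles published since the last update of the local corpus.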