Text and Data Mining

PLOS provides access to its article corpus and article metadata (data about the article) in multiple ways. The preferred method of access depends on the use case.

DOI (Digital Object Identifier)

All PLOS articles are assigned a unique DOI. All DOIs can be resolved at doi.org. For example, the PLOS Medicine article with the DOI 10.1371/journal.pmed.0020124 can be resolved at http://doi.org/10.1371/journal.pmed.0020124. Every PLOS article is indexed by DOI in our Solr search API. The search API can be used to download PLOS article metadata, to identify a subset of articles of interest, or to get the DOI of every published PLOS article.
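As a hedged illustration of the two lookups described above, the sketch below builds a doi.org resolver URL and a Solr-style query URL. The search endpoint http://api.plos.org/search and the field name id are assumptions that this page does not state; q, fl, rows, start, and wt are standard Solr query parameters. No request is sent, so check the resulting URLs against the current API documentation before relying on them.

```python
from urllib.parse import urlencode

DOI_RESOLVER = "http://doi.org/"
# Assumption: base URL of the PLOS Solr search API (not given on this page).
SEARCH_ENDPOINT = "http://api.plos.org/search"


def doi_url(doi: str) -> str:
    """Return the doi.org URL that resolves the given DOI."""
    return DOI_RESOLVER + doi.strip()


def search_url(query: str, fields=("id",), rows: int = 100, start: int = 0) -> str:
    """Build a Solr query URL; the 'id' field is assumed to hold the DOI."""
    params = {
        "q": query,               # Solr query string
        "fl": ",".join(fields),   # fields to return
        "rows": rows,             # page size
        "start": start,           # offset of the first result
        "wt": "json",             # response format
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(doi_url("10.1371/journal.pmed.0020124"))
    print(search_url("title:mining", rows=10))
```

Paging through a large result set would mean calling search_url repeatedly with an increasing start value.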

Text and Data Mining (Bulk downloads)

Text and data miners generally want a copy of the entire corpus and write specialized software to process it. Bulk downloading is the most efficient method for obtaining a copy of the entire corpus.

Our approach to TDM is simple: PLOS articles may be mined, reused, and shared by anyone, anywhere, for any purpose. One can easily download our entire text corpus.

Another option is our internal project for downloading, updating, and maintaining a repository of all PLOS XML article files. It provides a local copy of the PLOS text corpus for further analysis. Use this program to download all PLOS XML article files instead of web scraping.

PubMed Central (PMC) is an invaluable resource for TDM. Writing specialized software takes time and effort, and writing software to download data from hundreds or thousands of journals is a huge barrier for TDM. Open Access (OA) journals remove this barrier in two important ways.

OA article text and metadata are provided in a single XML file format: the Journal Article Tag Suite (JATS). Writing software to process JATS XML requires a larger upfront investment, but the reward is the ability to process articles from multiple journals in addition to PLOS.
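To make the idea of one parser for many journals concrete, here is a minimal sketch that reads a few common JATS elements with Python's standard library. The sample document is hand-written for illustration and its DOI is a placeholder; real articles carry far more markup, and a production parser would need to handle missing or repeated elements.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; the DOI is a placeholder.
SAMPLE_JATS = """<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example.1</article-id>
      <title-group>
        <article-title>A <italic>sample</italic> title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body><sec><p>First paragraph.</p><p>Second paragraph.</p></sec></body>
</article>"""


def text_of(element) -> str:
    """Join all text inside an element, including nested inline tags."""
    return "".join(element.itertext()).strip() if element is not None else ""


def summarize(jats_xml: str) -> dict:
    """Extract the DOI, journal title, article title, and body paragraphs."""
    root = ET.fromstring(jats_xml)
    return {
        "doi": text_of(root.find(".//article-id[@pub-id-type='doi']")),
        "journal": text_of(root.find(".//journal-title")),
        "title": text_of(root.find(".//article-title")),
        "paragraphs": [text_of(p) for p in root.findall("./body//p")],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_JATS))
```

Because the element names come from the tag suite rather than from a particular publisher, the same summarize function should apply to JATS files from other OA journals, subject to each publisher's tagging choices.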

PLOS API (Non-Bulk Downloads)

PLOS provides three ways to access data about PLOS articles or the articles themselves. These methods are not as useful for bulk downloads, but they give anyone with a specific interest in PLOS articles and data a way to access them.

JATS XML

The Journal Article Tag Suite (JATS) is the standard used to archive scientific articles. JATS XML is the most convenient format for TDM because the data is structured. Article text and metadata can be accessed in a single file and in a standard way. Downloading individual article XML from the PLOS website is simple if the DOI of the article is known. Bulk downloading of XML is discouraged, but this method is useful in conjunction with the PLOS Solr API to identify specific articles of interest.

Example:

http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170929&type=manuscript
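The URL above follows a simple pattern, which the sketch below reproduces from a DOI. The function name and its parameters are illustrative; the assumption that other PLOS journals use an analogous path segment in place of plosone is not confirmed on this page. Nothing is downloaded here.

```python
from urllib.parse import urlencode

BASE = "http://journals.plos.org"
# File types that appear in the example URLs on this page.
FILE_TYPES = ("manuscript", "printable")


def article_file_url(doi: str, journal: str = "plosone",
                     file_type: str = "manuscript") -> str:
    """Build an article file URL; 'manuscript' is the XML form."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    # Keep '/' unescaped so the DOI appears as it does in the example URL.
    query = urlencode({"id": doi, "type": file_type}, safe="/")
    return f"{BASE}/{journal}/article/file?{query}"


if __name__ == "__main__":
    print(article_file_url("10.1371/journal.pone.0170929"))
```

Feeding this function DOIs returned by a search query gives the list of XML files for a specific subset of articles.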

Article PDF

Each PLOS article is also available as a PDF. Article PDFs have limited utility for TDM but are useful for printing or reading the article offline. Bulk downloading of article PDFs is discouraged.

Example:

http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170929&type=printable

HTML Article Page

Article HTML is the primary method used to view PLOS articles online. Scraping the article HTML is a technique used by search engines to index articles and can be used for TDM. It is generally less useful for TDM because the article pages change over time, the data is not structured, and metadata is not easily identified. Bulk downloading of article HTML is discouraged.

Example:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170929
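All three example URLs on this page carry the DOI in the id query parameter. The sketch below recovers the DOI and the kind of resource from such a URL, which can help when reconciling a list of links with search results. The labels xml, pdf, and html are illustrative names chosen here, not PLOS terminology.

```python
from urllib.parse import urlparse, parse_qs


def describe(url: str):
    """Return (doi, kind) for a PLOS article URL of the forms shown here."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    doi = query.get("id", [None])[0]
    if parts.path.endswith("/article/file"):
        # The 'type' parameter distinguishes the XML file from the PDF.
        file_type = query.get("type", [""])[0]
        kind = {"manuscript": "xml", "printable": "pdf"}.get(file_type, "unknown")
    elif parts.path.endswith("/article"):
        kind = "html"
    else:
        kind = "unknown"
    return doi, kind


if __name__ == "__main__":
    print(describe("http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170929"))
```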

Conclusion

Virtually any scripting or compiled programming language can be used to access PLOS articles via HTTP. Bulk downloading using these URLs is discouraged because it can interfere with access to the site if not done with care. Downloading a small subset of articles based on Solr search results would be acceptable for periodic updates of a previously bulk downloaded corpus.
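One way to download a small subset with care is to fetch URLs one at a time with a pause between requests. The sketch below does this with the standard library. The one-second default delay is an arbitrary assumption, since this page states no rate limit, and the fetch and sleep parameters are hypothetical hooks that let the loop be tested without network access.

```python
import time
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download one URL over HTTP and return the response body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_politely(urls, delay_seconds=1.0, fetch=fetch, sleep=time.sleep):
    """Fetch URLs sequentially, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results
```

A typical use would pass the article XML URLs for DOIs returned by a recent Solr query, so that only new articles are added to an existing local corpus.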