The Search API is a great tool for finding interesting PLoS articles, and the ALM API can then collect metrics about these articles. Using the R statistical programming language is one of the easiest ways to look at these metrics. Below are a few example visualizations; the source code for all of them can be found in the plosOpenR GitHub repository.
The first visualization looks at the results of a search for the words tuberculosis and treatment in the abstract. The PLoS Search API found 266 articles, and the most common words in the abstracts can be nicely summarized in a word cloud.
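If you want to try something similar yourself, here is a minimal sketch using the rOpenSci rplos package together with the tm and wordcloud packages. The query syntax and the shape of the `searchplos()` return value have changed between rplos versions, so treat this as a starting point rather than the exact code behind the figure.

```r
library(rplos)      # PLoS Search API client from rOpenSci
library(tm)         # text mining: corpus cleanup
library(wordcloud)  # word cloud plotting

# Search for articles with "tuberculosis" and "treatment" in the abstract
res <- searchplos(q = 'abstract:"tuberculosis" AND abstract:"treatment"',
                  fl = c("id", "abstract"), limit = 300)

# In recent rplos versions the results live in res$data, one row per article
corpus <- VCorpus(VectorSource(res$data$abstract))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Count word frequencies across all abstracts and draw the cloud
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```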
To better understand how well the PLoS Article-Level Metrics cover usage data, citations and the social web, I looked at the numbers for all 47,430 PLoS articles in the latest data dump from April (available here; retrieving this many articles through the API would not be practical).
This dataset of course also includes recently published articles, which have had little time to pick up citations, so it is fairer to look at older papers. About 90% of the articles older than two years have been cited at least once.
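A sketch of this kind of coverage calculation, assuming the data dump has been read into a data frame with a publication date and a CrossRef citation count column (the column names below are my assumptions, not necessarily those in the dump):

```r
# Column names "publication_date" and "crossref" are hypothetical
alm <- read.csv("alm_report.csv", stringsAsFactors = FALSE)
alm$publication_date <- as.Date(alm$publication_date)

# Articles need time to accumulate citations, so restrict to older papers
older <- subset(alm, publication_date < Sys.Date() - 2 * 365)

# Share of these articles with at least one CrossRef citation
mean(older$crossref > 0, na.rm = TRUE)
```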
Among the social web tools, Mendeley clearly has the best coverage of PLoS articles. We are of course also interested in finding out how many Mendeley readers our articles have (whether it is 1 or 100), and a density plot is a good way to show this. For this plot I used all PLoS Biology research articles from 2009.
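A minimal sketch of such a density plot with ggplot2, assuming a data frame `biology_2009` with a hypothetical `mendeley` column of reader counts:

```r
library(ggplot2)

# Drop articles with zero readers, which a log axis cannot display
with_readers <- subset(biology_2009, mendeley > 0)

ggplot(with_readers, aes(x = mendeley)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  scale_x_log10() +  # reader counts are heavily skewed, so use a log axis
  labs(x = "Mendeley readers", y = "Density")
```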
Article-Level Metrics produce a lot of data, and this can be overwhelming. We therefore try to find patterns in the data. One interesting question is how HTML pageviews correlate with PDF downloads. One would assume that HTML pageviews reflect readers taking a quick look at the abstract or at specific information in the paper, whereas PDF downloads better reflect readers of the whole paper. For this analysis I looked at a fairly homogeneous set of articles, the 181 research articles published by PLoS Biology in 2009.
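A sketch of this correlation plot with ggplot2; again, the data frame and the column names `html_views` and `pdf_views` are my assumptions:

```r
library(ggplot2)

ggplot(biology_2009, aes(x = html_views, y = pdf_views)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +  # linear fit on the log-log scale
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "HTML pageviews", y = "PDF downloads")

# Correlation of the log-transformed counts
cor(log10(biology_2009$html_views), log10(biology_2009$pdf_views))
```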
This is a very nice correlation, and the outliers also tell an interesting story: the paper with 5,000 PDF downloads, for example, is also the most highly cited in this set of articles.
Article-Level Metrics only make sense in context, and the most important pieces of context are probably article age, subject area and journal. Article age (as days since publication) can be plotted on the x axis, and bubble charts allow visualization of two quantitative parameters (total views and CrossRef citations in this case). For this visualization we look at a set of articles funded by the European Union 7th Framework Programme. They were picked up by a search for fp7 in the financial disclosure section. This is of course only a first step toward finding every FP7-funded paper, and we are working with the EU OpenAIRE project on a more complete search strategy.
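A sketch of such a bubble chart with ggplot2, assuming a data frame `fp7` with the hypothetical columns `days_since_publication`, `total_views`, `crossref` and `journal`:

```r
library(ggplot2)

ggplot(fp7, aes(x = days_since_publication, y = total_views,
                size = crossref, colour = journal)) +
  geom_point(alpha = 0.6) +
  scale_size_area(max_size = 15) +  # bubble area scales with citation count
  labs(x = "Days since publication", y = "Total views",
       size = "CrossRef citations", colour = "Journal")
```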
There are probably a bit too many bubbles on the chart, but you can clearly see how papers grow from tiny dots at the far lower left to nice big bubbles as they become older.
Although it is possible to add information about individual papers to bubble charts (e.g. using the Google Visualization API), it is sometimes helpful to see all paper titles at once. Dot charts are one way to provide this information, although they of course only work for a smaller number of papers.
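Base R already provides a dot chart function; here is a minimal sketch, assuming a small data frame `papers` with hypothetical `title` and `total_views` columns:

```r
# Sort so the dots line up from lowest to highest
papers <- papers[order(papers$total_views), ]

dotchart(papers$total_views,
         labels = strtrim(papers$title, 50),  # truncate long titles
         xlab = "Total views", pch = 19)
```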
Not surprisingly, most people seem to be interested in how to publish and how to present their research. I wouldn't have thought that the article about Wikipedia would be so popular.
Dot charts are limited in the number of categories you can display at once. Heatmaps are better for a more complete analysis. They provide context by comparing the metrics for an article to those of other articles in the same set. Darker colors mean higher numbers, and the articles are sorted by age (oldest papers at the bottom). Twitter (the rightmost column) was only recently added as an ALM source, and we can't retrieve older tweets retroactively.
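One way to sketch such a heatmap in ggplot2, assuming the metrics sit in a wide data frame `alm_wide` with one row per article (all column names here are hypothetical, and the `article` factor should be ordered by publication date beforehand so the oldest papers sit at the bottom):

```r
library(ggplot2)
library(reshape2)

# Reshape to long format: one row per article/source pair
alm_long <- melt(alm_wide, id.vars = "article",
                 variable.name = "source", value.name = "events")

ggplot(alm_long, aes(x = source, y = article, fill = log1p(events))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkblue") +  # darker = higher
  labs(x = "ALM source", y = NULL, fill = "log(events + 1)")
```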
The heatmap shows that the different ALM do not necessarily correlate with each other (with the exception of the different citation metrics).
A variation is the calendar heatmap, which places ALM events on a calendar. We can do this only for sources that provide dates for every single event (currently only CiteULike and Twitter). Here I want to focus on Twitter, looking at all PLoS papers published in 2012 (125 as of today) with an author from Stanford University.
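One way to sketch a calendar heatmap directly in ggplot2 (the `tweets` data frame with one row per tweet and a `date` column of class Date is an assumption, and the weekday names assume an English locale):

```r
library(ggplot2)

# Count tweets per day
daily <- as.data.frame(table(tweets$date))
names(daily) <- c("date", "tweets")
daily$date <- as.Date(daily$date)

# Week of the year on the x axis, weekday on the y axis
daily$week <- as.integer(format(daily$date, "%U"))
daily$wday <- factor(weekdays(daily$date),
                     levels = c("Monday", "Tuesday", "Wednesday",
                                "Thursday", "Friday", "Saturday", "Sunday"))

ggplot(daily, aes(x = week, y = wday, fill = tweets)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "red") +
  labs(x = "Week of 2012", y = NULL, fill = "Tweets per day")
```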
Remember that PLoS started collecting tweets in May. You can clearly see that July 6 was a busy day, and this is because of all the tweets about a paper published the day before.
This work would not have been possible without the help of Najko Jahn, Jochen Schirrwagen and Harry Dimitropoulos from the OpenAIRE project, and Scott Chamberlain from the rOpenSci project. All calls to the PLoS APIs were made with the rOpenSci rplos library.
A lot of this is obviously work in progress. Feel free to make suggestions in the comments, in the PLoS API Developers Google Group, via the @PLoSALM Twitter account, or in the plosOpenR GitHub repository.