Example Visualizations using the PLOS Search and ALM APIs

The Search API is a great tool to find interesting PLOS articles, and the ALM API can then collect metrics about these articles. Using the R statistical programming language is one of the easiest ways to look at these metrics. Below are a few example visualizations; the source code for all of them can be found in the plosOpenR GitHub repository.

The first visualization looks at the results of a search for the words tuberculosis and treatment in the abstract. The PLOS Search API found 266 articles, and the most common words in the abstracts can be nicely summarized in a word cloud.
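The counting step behind a word cloud is simple to sketch. The plots in this post were made with R; the Python below is only an illustration under stated assumptions. The stop-word list and the sample abstracts are made up, and the function name `word_frequencies` is hypothetical, not part of plosOpenR or rplos.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; a real word cloud
# would use a much fuller one.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "for", "with", "was", "were", "is"}


def word_frequencies(abstracts, top_n=5):
    """Return the top_n most common non-stop-words across the abstracts."""
    counts = Counter()
    for text in abstracts:
        # Lowercase the text and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up abstracts for illustration, not real search results.
sample_abstracts = [
    "Treatment of tuberculosis in patients with HIV.",
    "Tuberculosis treatment outcomes were poor in drug-resistant patients.",
]
print(word_frequencies(sample_abstracts, top_n=3))
```

The resulting (word, count) pairs are what a word cloud renderer would turn into font sizes.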

To better understand how well the PLOS Article-Level Metrics cover usage data, citations and the social web, I looked at the numbers for all 47,430 PLOS articles in the latest data dump from April (available here; using the API would not work for this).

This dataset of course also includes recently published articles. About 90% of the articles older than two years have been cited at least once.
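A coverage figure like this one comes down to filtering by article age and counting the articles with at least one citation. Here is a minimal Python sketch, assuming each article is a record with a publication date and a citation count. The field names, the 730-day cutoff and the sample records are assumptions for illustration, not the layout of the actual data dump.

```python
from datetime import date


def cited_share(articles, as_of, min_age_days=730):
    """Fraction of articles older than min_age_days that have >= 1 citation.

    Returns None when no article is old enough.
    """
    old = [a for a in articles if (as_of - a["published"]).days > min_age_days]
    if not old:
        return None
    cited = sum(1 for a in old if a["citations"] > 0)
    return cited / len(old)


# Made-up records for illustration only.
sample_articles = [
    {"published": date(2009, 1, 15), "citations": 12},
    {"published": date(2009, 6, 1), "citations": 0},
    {"published": date(2012, 3, 1), "citations": 0},  # too recent, ignored
]
print(cited_share(sample_articles, as_of=date(2012, 4, 1)))
```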

Among the social web tools, Mendeley clearly has the best coverage of PLOS articles. We are of course also interested to find out how many Mendeley readers our articles have (whether it is 1 or 100), and a density plot is a good way to show this. For this plot I used all PLOS Biology research articles from 2009.
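For readers unfamiliar with density plots: the curve is usually a kernel density estimate of the values. The sketch below shows a plain Gaussian kernel estimate in Python. The bandwidth and the reader counts are made up, and the original plot may well have used a different kernel or bandwidth rule.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Made-up Mendeley reader counts for illustration only.
reader_counts = [1, 3, 5, 8, 13, 40]
density = gaussian_kde(reader_counts, bandwidth=3.0)
print(density(5.0))
```

Evaluating `density` on a grid of x values gives the points of the curve to draw.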

Article-Level Metrics produces a lot of data, and this can be overwhelming. We therefore try to find patterns in the data. One interesting question is how HTML pageviews correlate with PDF downloads. One would assume that HTML pageviews correlate with readers taking a quick look at the abstract or specific information in the paper, whereas the PDF downloads correlate better with readers of the whole paper. For this analysis I looked at a fairly homogeneous set of articles, the 181 research articles published by PLOS Biology in 2009.

This is a very nice correlation, and the outliers also tell an interesting story – the paper with 5000 PDF downloads for example is also the most highly-cited in this set of articles.
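To put a number on a relationship like this, one common choice is the Pearson correlation coefficient. The post does not say which statistic, if any, was computed, so treat the following Python sketch as an assumption. The view and download counts are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up counts for five articles, for illustration only.
html_views = [1200, 3400, 5600, 8000, 20000]
pdf_downloads = [300, 800, 1500, 1900, 5000]
print(pearson_r(html_views, pdf_downloads))
```

A value close to 1 means the two metrics rise together almost linearly. Outliers such as a heavily downloaded, highly cited paper are easier to spot in the scatterplot than in the single coefficient.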

Article-Level Metrics only make sense in context, and the most important ones are probably article age, subject area and journal. Article age (as days since publication) can be plotted on the x axis, and bubble charts allow visualization of two quantitative parameters (total views and CrossRef citations in this case). For this visualization we look at a set of articles funded by the European Union 7th Framework Programme. They were picked up by a search for fp7 in the financial disclosure section. This is of course only a first step to find every FP7-funded paper, and we are working with the EU OpenAIRE project on a more complete search strategy.

There are probably a bit too many bubbles on the chart, but you can clearly see how papers grow from tiny dots at the far lower left to nice big bubbles as they become older.
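The data preparation for such a bubble chart can be sketched in a few lines of Python. This is an illustration only. It assumes total views go on the y axis and CrossRef citations set the bubble size, which is one possible mapping. It also scales the bubble area rather than the radius with the citation count, a common convention so that large values are not visually exaggerated. Field names and sample records are made up.

```python
import math
from datetime import date


def bubble_points(articles, as_of, max_radius=20.0):
    """Return (age_in_days, views, radius) tuples, with area ~ citations."""
    # Fall back to 1 so that a set with no citations does not divide by zero.
    max_citations = max(a["citations"] for a in articles) or 1
    points = []
    for a in articles:
        age_days = (as_of - a["published"]).days
        # Radius grows with the square root, so area grows linearly.
        radius = max_radius * math.sqrt(a["citations"] / max_citations)
        points.append((age_days, a["views"], radius))
    return points


# Made-up records for illustration only.
sample_articles = [
    {"published": date(2012, 6, 1), "views": 150, "citations": 0},
    {"published": date(2011, 1, 1), "views": 2400, "citations": 4},
    {"published": date(2009, 3, 1), "views": 9000, "citations": 16},
]
for point in bubble_points(sample_articles, as_of=date(2012, 7, 1)):
    print(point)
```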

Although it is possible to add information about individual papers to bubble charts (e.g. using the Google Visualisation API), it sometimes is helpful to see all paper titles at once. Dot charts are one way to provide this information, and this of course only works for a smaller number of papers.

Not surprisingly, most people seem to be interested in how to publish and how to present your research. I wouldn’t have thought that the article about Wikipedia would be so popular.

Dot charts are limited in the number of categories you can display at once. Heatmaps are better for a more complete analysis. They provide context by comparing the metrics for an article to other articles in the same set. Darker colors mean higher numbers and the articles are sorted by age (oldest papers at the bottom). Twitter (the rightmost column) was only recently added as an ALM source and we can’t retrieve older tweets in retrospect.

The heatmap shows that the different ALM do not necessarily correlate with each other (with the exception of the different citation metrics).
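Because the sources have very different ranges (thousands of views versus a handful of citations), a heatmap like this needs each column scaled separately before colors are assigned. Below is a minimal Python sketch of one such scaling, dividing each column by its maximum. This is an assumption; the original heatmap may have used a different normalization. Source names and numbers are made up.

```python
def scale_columns(metrics):
    """Scale each source's values to the 0..1 range by its column maximum."""
    scaled = {}
    for source, values in metrics.items():
        peak = max(values)
        # A column with no events at all stays at zero.
        scaled[source] = [v / peak if peak > 0 else 0.0 for v in values]
    return scaled


# Made-up metrics for three articles, oldest last; for illustration only.
sample_metrics = {
    "views": [500, 1000, 2000],
    "citations": [0, 5, 10],
    "twitter": [0, 0, 0],
}
print(scale_columns(sample_metrics))
```

Each scaled value can then be mapped to a color shade, with 1.0 as the darkest.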

A variation is the calendar heatmap which looks at ALM events on a calendar. We can do this only for sources that provide dates for every single event (currently only CiteULike and Twitter), and I want to focus on Twitter, looking at all PLOS papers published in 2012 (125 as of today) with an author from Stanford University.

Remember that PLOS started collecting tweets in May. You can clearly see that July 6 was a busy day, and this is because of all the tweets about a paper published the day before.
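The input to a calendar heatmap is just a count of events per day. A small Python sketch, assuming the events arrive as ISO 8601 timestamps. The timestamp format and the sample values are assumptions, not the actual output of the ALM API.

```python
from collections import Counter
from datetime import datetime


def events_per_day(timestamps):
    """Count events per calendar day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date() for ts in timestamps)


# Made-up tweet timestamps for illustration only.
sample_timestamps = [
    "2012-07-05T18:02:00",
    "2012-07-06T09:15:00",
    "2012-07-06T10:40:00",
    "2012-07-06T22:05:00",
    "2012-07-09T08:30:00",
]
daily_counts = events_per_day(sample_timestamps)
busiest_day, busiest_count = daily_counts.most_common(1)[0]
print(busiest_day, busiest_count)
```

Days without any events are simply missing from the counter and would be drawn as empty cells.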

This work would not have been possible without the help from Najko Jahn, Jochen Schirrwagen and Harry Dimitropoulos from the OpenAIRE project, and Scott Chamberlain from the rOpenSci project. All calls to the PLOS APIs were made with the rOpenSci rplos library.

A lot of this is obviously work in progress. Feel free to make suggestions in the comments, in the PLOS API Developers Google Group, via the @PLOSALM Twitter account, or in the plosOpenR GitHub repository.



9 responses to “Example Visualizations using the PLOS Search and ALM APIs”

  1. Pingback: More fun with Visualizations | Gobbledygook

  2. Thanks for the excellent post!

    There’s a lot to take in. My first question is about the HTML/PDF ratio. Have you looked at this ratio in PLoS ONE papers? According to your plot of PLoS Bio papers, my lab’s recent PLoS ONE paper would be an outlier due to the fact that my paper has 3-4 times fewer PDF downloads.

    My explanation for my paper is that there was a lot of casual readership relative to “expert” or “informed” readership, aka other academics. So I can understand why I’d be an outlier. After all, given the “high” IF of PLoS Bio, there’s an incentive for other academics to read the darned paper. Whereas the floodgates have been opened at PLoS ONE, but the academic community hasn’t fully bought in.

    Looking forward to more of these updates!

  3. Martin Fenner

    Thanks Ethan. I will certainly do a more systematic analysis of the HTML/PDF ratio, using the whole dataset of now over 50,000 PLoS articles. One question would be whether there is a difference in the ratio between the different journals. I’m happy to report the results here.

    There are still several visualizations missing that I want to do. One of them is “micro histograms”, and I could try this with your April PLoS ONE paper as an example.

  4. Martin Fenner

    Ethan, here is a scatterplot looking at all 127 PLoS papers in the subject categories Pharmacology and Mental Health; all but 4 are from PLoS ONE. The correlation looks similar, but there are more outliers. There is your paper at the lower right corner, and a few others with many more HTML views compared to PDF downloads. The bubble size correlates with Facebook Likes/Shares/Comments, and it looks like these papers are also popular on Facebook.

  5. Fascinating, Martin!

    Thanks for putting that Pharma data set together. Seems to me that Facebook popularity is consistent with a larger fraction of casual/non-expert readers.

    Also, it’s consistent with my observation that when my paper was reviewed by Derek Lowe on his blog “In the Pipeline,” I got a mini-surge of HTML views (~500) coinciding with over 100 PDF downloads. In other words, the HTML/PDF ratio got much better. I assume that the readers of Lowe’s blog are more informed than the average person arriving at my paper via Facebook or Twitter, and so more likely to actually take the time to read my paper. (Though I wonder how many academics actually only read papers as PDFs vs in browsers?)

    What happens if you track the HTML/PDF ratio of many individual papers over time? You should see spikes that correspond to swings in readership engagement levels, no?

  6. Ethan, your observations make a lot of sense. Up until now the usage stats have only been available as monthly aggregates, but this will soon change. You can then much better see the spikes and patterns. For that I want to do the “micro histogram” visualization which looks like this (events in the first 30 days after publication), but is currently only possible with Twitter and CiteULike data.

  7. So I’ve been thinking more about the HTML/PDF ratio. Seems to me there should always be a nice correlation, but the slope will vary as a function of the size of the field/specialty to which a paper belongs. And not absolute size but effective size, assuming that the rate-limiting step is not access but perception of the journal, i.e., IF.

    That could explain why PDF downloads reach a natural ceiling — before a paper is cited, there is a maximum amount of “interest” in the field. Ultimately time elapsed is dominant because once a paper is no longer “new,” the interest drops exponentially. But growth on the HTML axis is pretty much just a function of social media amplification, though it too will succumb to the drag of “staleness.”

    It takes a citation for a paper to “level up” in terms of PDF downloads, creating the outliers above the regression line. If enough citations or the right citations accrue, then positive feedback would seem to assure a constant rate of growth.

    I really like the microhistogram approach, and I have selfish reasons why you might want to examine my paper’s HTML/PDF ratio history. But I also have practical reasons, namely I can account for spikes in daily HTML pageviews because I’ve annotated the trajectory, e.g., http://perlsteinlab.com/round-table/publishing-in-the-era-of-open-science

  8. Martin Fenner

    Ethan, thank you for the link to your blog. The second chart looks very similar to the micro histogram I have in mind. I would just get rid of the axis labels, and would add the other PLoS sources as separate histograms below.

    If we assume that the HTML/PDF ratio is fairly constant (which I have to test with the whole set of 50,000 PLoS papers, but it has held up well for the datasets I have looked at so far), then we should take a closer look at the papers that fall outside that pattern. In a simple world this would be a) high HTML views, popular on Facebook and b) high PDF downloads, highly cited. And I’m pretty sure this is also subject area specific. All 384 PLoS papers in the subject category cancer genetics are nicely correlated, and there is no difference between PLoS ONE and PLoS Genetics.

  9. Pingback: Predicting the growth of PLoS ONE « LIBREAS.Library Ideas
