Get started! A guide to MDPIexploreR

Background

Ever-changing scientific publishing strategies shape academic communication.

To date, MDPI is the largest publisher of Open Access articles in the world and the third-largest publisher overall (after Elsevier and Springer Nature). “The Strain on Scientific Publishing” highlights MDPI as a frequent outlier on several metrics, but also as one of the most transparent major publishers.

This R package helps users obtain factual data on MDPI’s journals, special issues and articles directly from the publisher’s website (via web scraping). Detailed information on functions and datasets can be found in the Reference section.

The following section aims to provide a brief and approachable tutorial introducing users to the functionalities of the R package MDPIexploreR.

Installing MDPIexploreR

devtools::install_github("pgomba/MDPI_explorer")
library(MDPIexploreR)

Exploring journal articles

Obtaining a list of articles from a journal is easy thanks to the function article_find(), which returns a vector of article URLs. We just need to supply the journal code as a text string.

urls<-article_find("agriculture")

print(paste("Articles found:", length(urls)))

[1] "Articles found: 7328"

The journal code usually coincides with the journal title, but this is not always the case, especially when the journal name is long. To find the code for your journal of interest, check the dataset MDPI_journals, included in the package:

MDPIexploreR::MDPI_journals|>
  head(10)

                               name            code
1                         Acoustics       acoustics
2                         Actuators       actuators
3           Administrative Sciences          admsci
4                       Adolescents     adolescents
5  Advances in Respiratory Medicine             arm
6                       Aerobiology     aerobiology
7                         Aerospace       aerospace
8                       Agriculture     agriculture
9                   AgriEngineering agriengineering
10                    Agrochemicals   agrochemicals

Note

Note the code for the journal “Acoustics” matches the title of the journal, but the code for “Advances in Respiratory Medicine” is just the text string “arm”.
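If you only remember part of a journal title, a pattern match on the name column is a quick way to recover its code. The sketch below rebuilds a few rows of the table above as a toy data frame so it runs without the package. find_code() is a hypothetical helper, not a function of MDPIexploreR, and it assumes the dataset has the columns name and code as printed above.

```r
# Toy copy of a few rows of MDPI_journals as printed above
journals <- data.frame(
  name = c("Acoustics", "Actuators", "Administrative Sciences",
           "Adolescents", "Advances in Respiratory Medicine"),
  code = c("acoustics", "actuators", "admsci", "adolescents", "arm"),
  stringsAsFactors = FALSE
)

# find_code() is a hypothetical helper (not part of MDPIexploreR):
# case-insensitive search of the journal name, returning name and code
find_code <- function(journals, pattern) {
  hits <- grepl(pattern, journals$name, ignore.case = TRUE)
  journals[hits, c("name", "code")]
}

find_code(journals, "respiratory")
```

With the real dataset, the same helper could presumably be called as find_code(MDPIexploreR::MDPI_journals, "respiratory").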

The vector returned by article_find() (or any vector of scientific paper URLs) can then be passed to the function article_info(). For every article in the list, this function obtains the received and accepted dates (to calculate turnaround times), the article type (e.g. editorial, review) and whether the article belongs to a special issue. Let's find information on 10 random articles from the journal “Covid”, leaving 2 seconds between scraping iterations.

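# Find all articles from the journal Covid and scrape a random sample of 10,
# waiting 2 seconds between iterations
urls<-article_find("covid")
info<-article_info(urls, sample_size=10, sleep=2)

# Show article type, turnaround time and if article is included in special issue
info|>
  dplyr::mutate(doi=gsub("https://www.mdpi.com/","",i))|> #To reduce output width
  dplyr::select(doi,article_type,tat,issue_type)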

                  doi      article_type      tat    issue_type
1  2673-8112/3/12/121            Review  40 days            No
2    2673-8112/1/3/52      Brief Report 122 days            No
3  2673-8112/2/12/120           Article  73 days            No
4    2673-8112/2/9/94 Systematic Review  46 days Special Issue
5  2673-8112/3/11/112      Brief Report  78 days            No
6    2673-8112/3/6/63      Brief Report  52 days            No
7  2673-8112/3/11/115           Article  10 days            No
8    2673-8112/2/2/15           Article 125 days            No
9    2673-8112/3/9/95      Brief Report  35 days         Topic
10   2673-8112/2/7/63           Article  42 days            No
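The tat column above is the time elapsed between the received and accepted dates. The sketch below shows one way to compute that difference in base R. It uses made-up dates, and the column names received and accepted are assumptions for illustration, not necessarily the names used by article_info().

```r
# Hypothetical received/accepted dates for three articles (not real data)
dates <- data.frame(
  received = as.Date(c("2023-01-10", "2023-02-01", "2023-03-15")),
  accepted = as.Date(c("2023-02-19", "2023-02-11", "2023-05-01"))
)

# Subtracting two Date vectors gives a difftime, printed as "40 days" etc.
dates$tat <- dates$accepted - dates$received

# A plain numeric version is handier for summaries and plots
dates$tat_days <- as.numeric(dates$tat, units = "days")

dates
median(dates$tat_days)
```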

By default, sleep is two seconds. Reducing this number might cause the server to block your requests, especially when scraping large numbers of articles. If sample_size is left blank, the function iterates through the whole vector of articles.

Important

A stable internet connection is recommended, especially when web scraping large numbers of papers.

Tip

Web scraping large numbers of URLs can be time-consuming (about 2 seconds per paper, depending on the delay) and many things can go wrong during the process (problematic URLs, being kicked off the server…). My advice is to split large URL vectors into smaller ones.
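One possible way to follow this advice is to split the URL vector into fixed-size chunks and process them one at a time, so that a failure only costs a single chunk. The sketch below uses placeholder URLs and a stand-in function, scrape_chunk(), which is hypothetical. In practice it would wrap the real scraping call, for example article_info().

```r
# Placeholder URLs; in practice this vector would come from article_find()
urls <- paste0("https://example.org/article/", seq_len(25))

# Split into chunks of at most 10 URLs each
chunk_size <- 10
chunks <- split(urls, ceiling(seq_along(urls) / chunk_size))

# scrape_chunk() is a hypothetical stand-in for the real scraping call
scrape_chunk <- function(chunk) {
  data.frame(url = chunk, stringsAsFactors = FALSE)
}

# Process each chunk separately; a failing chunk returns NULL instead of
# stopping the whole run
results <- lapply(chunks, function(chunk) {
  tryCatch(
    scrape_chunk(chunk),
    error = function(e) {
      message("Chunk failed: ", conditionMessage(e))
      NULL
    }
  )
})

# Keep the chunks that succeeded and bind them into one data frame
combined <- do.call(rbind, Filter(Negate(is.null), results))
nrow(combined)
```

Saving each chunk's result to disk as soon as it finishes (for instance with saveRDS()) would make it possible to resume after an interruption.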

Plotting article_info()

MDPIexploreR provides four functions to plot the results from article_info(). Let's load one of the data frames provided by the package first:

agriculture_info<-MDPIexploreR::agriculture

nrow(agriculture_info)

[1] 7160

summary_graph() plots publications over time. The title of the journal must be provided:

summary_graph(agriculture_info, journal="Agriculture")

average_graph() plots average monthly turnaround times for the time period included in the dataset:

average_graph(agriculture_info, journal="Agriculture")
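To make the idea of an average monthly turnaround time concrete, the sketch below computes it by hand in base R on a small made-up data frame. The column names accepted and tat_days are assumptions for illustration, and this is not necessarily how average_graph() calculates its values internally.

```r
# Hypothetical article-level data: acceptance date and turnaround time in days
articles <- data.frame(
  accepted = as.Date(c("2023-01-05", "2023-01-20", "2023-02-03",
                       "2023-02-17", "2023-02-28")),
  tat_days = c(40, 50, 30, 36, 42)
)

# Label each article with its year-month, then average within each month
articles$month <- format(articles$accepted, "%Y-%m")
monthly <- aggregate(tat_days ~ month, data = articles, FUN = mean)

monthly
```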

issues_graph() classifies articles depending on where they were published, including special issues:

issues_graph(agriculture_info, journal="Agriculture")

Lastly, types_graph() plots a classification of articles by type (editorial, review, etc.):

types_graph(agriculture_info, journal="Agriculture")

All plots can be saved via ggsave().
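A minimal sketch of saving a plot with ggsave() follows. It assumes the plotting functions return ordinary ggplot objects, which this guide implies but does not state. So that it runs without the package data, it uses a toy ggplot built from made-up numbers in place of the output of summary_graph().

```r
library(ggplot2)

# Toy plot standing in for the output of one of the plotting functions
toy <- data.frame(year = 2019:2023, articles = c(5, 9, 14, 20, 26))
p <- ggplot(toy, aes(x = year, y = articles)) +
  geom_col() +
  labs(title = "Agriculture", x = "Year", y = "Articles")

# Write the plot to a temporary PDF; width and height are in inches
out_file <- file.path(tempdir(), "agriculture_summary.pdf")
ggsave(out_file, plot = p, width = 8, height = 5)
```

Changing the file extension (for example to .png) changes the output format, since ggsave() picks the graphics device from the file name.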

Exploring special issues and guest editors

Similar to article_find(), the function special_issue_find() outputs a vector with all special issues available in the target journal. By default, it retrieves all CLOSED special issues, but this can be adjusted with the parameter type.

# Creates a vector with all CLOSED special issues from the journal Covid
URLs<-special_issue_find("covid")
print(paste("Closed Special Issues:",length(URLs)))

[1] "Closed Special Issues: 6"

# Creates a vector with all special issues from the journal Covid
URLs<-special_issue_find("covid", type="all")
print(paste("All Special Issues:",length(URLs)))

[1] "All Special Issues: 11"

# Creates a vector with all OPEN special issues from the journal Covid
URLs<-special_issue_find("covid", type="open")
print(paste("Open Special Issues:",length(URLs)))

[1] "Open Special Issues: 5"

guest_editor_info() then uses the vector produced by special_issue_find() to obtain, for each special issue, the proportion of articles in which the guest editors were involved and the difference between the special issue closing date and the date the last article was submitted. This function is inspired by M. A. Oviedo-García’s work on MDPI’s special issues. Like article_info(), it allows you to select only a sample of special issues and to set a delay between scraping iterations.

URLs<-special_issue_find("covid")

# Extract data from all URLs, iterating every 3 seconds
guest_editor_info(URLs, sleep=3)

# Extract data from 2 URLs, iterating every 2 seconds (default)
guest_editor_info(URLs, sample_size=2)
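To illustrate the two quantities described above, the sketch below computes them by hand for a single made-up special issue. All names, dates and object names here are hypothetical, and the calculation is only an illustration of the idea, not the package's actual implementation.

```r
# Hypothetical special issue: guest editors and the authors of each article
guest_editors <- c("A. Editor", "B. Editor")
article_authors <- list(
  c("A. Editor", "C. Author"),
  c("D. Author", "E. Author"),
  c("B. Editor"),
  c("F. Author", "G. Author")
)

# TRUE for every article with at least one guest editor among its authors
with_editor <- vapply(
  article_authors,
  function(authors) any(authors %in% guest_editors),
  logical(1)
)

# Proportion of articles in which a guest editor was involved
prop_with_editor <- mean(with_editor)

# Days between the special issue deadline and the last submission
deadline <- as.Date("2023-06-30")
last_submission <- as.Date("2023-07-12")
days_after_deadline <- as.numeric(last_submission - deadline, units = "days")

prop_with_editor
days_after_deadline
```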