Get started! A guide to MDPIexploreR
Background
Ever-changing scientific publishing strategies shape academic communication.
To date, MDPI is the largest publisher of Open Access articles in the world and the third-largest publisher overall (right after Elsevier and Springer Nature). “The Strain on Scientific Publishing” highlights MDPI as a frequent outlier for several metrics, but also as one of the most transparent major publishers.
This R package aims to help users obtain factual data on MDPI’s journals, special issues and articles directly from the publisher’s website (via web scraping). Detailed information on functions and datasets can be found in the Reference section.
The following sections provide a brief and approachable tutorial introducing users to the functionality of the R package MDPIexploreR.
Installing MDPIexploreR
devtools::install_github("pgomba/MDPI_explorer")
library(MDPIexploreR)
Exploring journal articles
Obtaining a list of articles from a journal is easy thanks to the function article_find(). This function returns a vector of article URLs. To do so, we just need to submit the journal code as a text string.
urls <- article_find("agriculture")
print(paste("Articles found:", length(urls)))
[1] "Articles found: 7328"
The journal code name usually coincides with the journal title, but this is not always the case if the journal name is too long. To find the code name for your journal of interest, check the dataset MDPI_journals, included in the package:
MDPIexploreR::MDPI_journals |>
  head(10)
name code
1 Acoustics acoustics
2 Actuators actuators
3 Administrative Sciences admsci
4 Adolescents adolescents
5 Advances in Respiratory Medicine arm
6 Aerobiology aerobiology
7 Aerospace aerospace
8 Agriculture agriculture
9 AgriEngineering agriengineering
10 Agrochemicals agrochemicals
Note the code for the journal “Acoustics” matches the title of the journal, but the code for “Advances in Respiratory Medicine” is just the text string “arm”.
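As a minimal sketch of how such a lookup could be done programmatically, the snippet below searches journal titles for a pattern and returns the matching codes. The data frame journals is a hypothetical three-row copy of rows printed above, used so the example runs without the package, and find_code() is an illustrative helper, not a package function. With the package installed, MDPIexploreR::MDPI_journals could presumably be passed as data instead, assuming it keeps the name and code columns shown.

```r
# Stand-in for MDPIexploreR::MDPI_journals (hypothetical copy of three rows shown above)
journals <- data.frame(
  name = c("Acoustics", "Advances in Respiratory Medicine", "Agriculture"),
  code = c("acoustics", "arm", "agriculture"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of the package): return the codes of all
# journals whose title matches the pattern, ignoring case
find_code <- function(pattern, data = journals) {
  data$code[grepl(pattern, data$name, ignore.case = TRUE)]
}

find_code("respiratory")
```

A partial, case-insensitive match is used here because long journal titles are exactly the ones whose codes are hard to guess.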
The vector returned by article_find() (or any vector of scientific paper URLs) can then be passed to the function article_info(). For every article in the vector, this function obtains the received and accepted dates (to calculate turnaround times), the article type (e.g. editorial, review) and whether the article belongs to a special issue. Let's find information on 10 random articles from the journal “Covid”, leaving 2 seconds between scraping iterations.
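# 10 random "Covid" articles, 2 seconds between requests
# (call reconstructed from the description above; exact arguments assumed)
urls <- article_find("covid")
info <- article_info(urls, sample_size = 10, sleep = 2)

# Show article type, turnaround time and whether the article is in a special issue
info |>
  dplyr::mutate(doi = gsub("https://www.mdpi.com/", "", i)) |> # To reduce output width
  dplyr::select(doi, article_type, tat, issue_type)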
doi article_type tat issue_type
1 2673-8112/3/12/121 Review 40 days No
2 2673-8112/1/3/52 Brief Report 122 days No
3 2673-8112/2/12/120 Article 73 days No
4 2673-8112/2/9/94 Systematic Review 46 days Special Issue
5 2673-8112/3/11/112 Brief Report 78 days No
6 2673-8112/3/6/63 Brief Report 52 days No
7 2673-8112/3/11/115 Article 10 days No
8 2673-8112/2/2/15 Article 125 days No
9 2673-8112/3/9/95 Brief Report 35 days Topic
10 2673-8112/2/7/63 Article 42 days No
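The tat column reports turnaround time in days. Below is a minimal sketch of the arithmetic, assuming turnaround time is simply the accepted date minus the received date (the package's exact definition is not stated here). The dates are made up for illustration.

```r
# Made-up received/accepted dates, for illustration only
received <- as.Date(c("2023-01-10", "2023-02-01"))
accepted <- as.Date(c("2023-02-19", "2023-02-11"))

# Subtracting dates with difftime() gives a time difference; ask for days explicitly
tat <- difftime(accepted, received, units = "days")
tat

# Plain numbers are handier for averages or plotting
as.numeric(tat)
```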
By default, sleep is two seconds. Reducing this number might cause the server to kick you out, especially when scraping large numbers of articles. If sample_size is left blank, the function iterates through the whole vector of articles.
A stable internet connection is recommended, especially when web scraping large numbers of papers.
Web scraping large numbers of URLs can be time-consuming (about 2 seconds per paper, depending on the delay) and many things can go wrong during the process (problematic URLs, being kicked out of the server…). My advice is to split large URL vectors into smaller ones.
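One way to split a large URL vector, sketched in base R. The URLs are made up, and scrape_chunk() is a hypothetical stand-in for article_info() so that the example runs without any scraping. In real use you would presumably call article_info() on each chunk instead.

```r
# Hypothetical URL vector standing in for the output of article_find()
urls <- paste0("https://www.mdpi.com/example/", 1:25)

# Split into chunks of at most 10 URLs each
chunk_size <- 10
chunks <- split(urls, ceiling(seq_along(urls) / chunk_size))

# Stand-in for article_info(): returns one row per URL (no scraping is done here)
scrape_chunk <- function(x) data.frame(i = x, stringsAsFactors = FALSE)

# Process chunk by chunk; a failing chunk is skipped with a warning
# instead of aborting the whole run
results <- lapply(chunks, function(chunk) {
  tryCatch(scrape_chunk(chunk), error = function(e) {
    warning("Chunk failed: ", conditionMessage(e))
    NULL
  })
})

# Combine the chunks that succeeded into one data frame
info <- do.call(rbind, Filter(Negate(is.null), results))
nrow(info)
```

Working in chunks means a dropped connection costs at most one chunk of work, and failed chunks can be retried on their own.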
Plotting article_info()
MDPIexploreR provides four functions to plot the results from article_info(). Let's load one of the data frames provided by the package first:
agriculture_info <- MDPIexploreR::agriculture
nrow(agriculture_info)
[1] 7160
summary_graph() plots publications over time. The title of the journal must be provided:
summary_graph(agriculture_info, journal="Agriculture")
average_graph() plots average monthly turnaround times for the time period included in the dataset:
average_graph(agriculture_info, journal="Agriculture")
issues_graph() classifies articles depending on where they were published, including special issues:
issues_graph(agriculture_info, journal="Agriculture")
Lastly, types_graph() plots a classification of articles depending on their type (editorial, review, etc.):
types_graph(agriculture_info, journal="Agriculture")
All plots can be saved via ggsave().
Exploring special issues and guest editors
Similar to article_find(), the function special_issue_find() outputs a vector with all special issues available in the target journal. By default, it retrieves all CLOSED special issues, but this can be adjusted with the parameter type.
# Creates a vector with all CLOSED special issues from the journal Covid
URLs <- special_issue_find("covid")
print(paste("Closed Special Issues:", length(URLs)))
[1] "Closed Special Issues: 6"
# Creates a vector with all special issues (open and closed) from the journal Covid
URLs <- special_issue_find("covid", type = "all")
print(paste("All Special Issues:", length(URLs)))
[1] "All Special Issues: 11"
# Creates a vector with all OPEN special issues from the journal Covid
URLs <- special_issue_find("covid", type = "open")
print(paste("Open Special Issues:", length(URLs)))
[1] "Open Special Issues: 5"
guest_editor_info() then uses the vector produced by special_issue_find() to look for the proportion of articles in each special issue in which the guest editors were involved, and for the difference between the special issue's closing time and the date its last article was submitted. This function is inspired by MA Oviedo-García's work on MDPI’s special issues. Similar to article_info(), it lets you select only a sample of special issues and set a delay between scraping iterations.
URLs <- special_issue_find("covid")
# Extract data from all URLs, iterating every 3 seconds
guest_editor_info(URLs, sleep = 3)
# Extract data from 2 URLs, iterating every 2 seconds (default)
guest_editor_info(URLs, sample_size = 2)