5 February 2024 | Lin Zhang, Zhe Cao, Yuanyuan Shang, Gunnar Sivertsen, Ying Huang
OpenAlex is a fully open platform launched in January 2022 that offers a free alternative to subscription-based databases such as Web of Science and Scopus. It integrates multiple data sources, including Microsoft Academic Graph, ORCID, Crossref, and Unpaywall, and provides an API for data retrieval. It covers more than 200 million publications, and its coverage is broader than that of some traditional bibliometric databases. OpenAlex is widely used in science-of-science research and has been adopted as a data source for the Leiden University ranking. However, it suffers from a significant data quality issue: institutional information is missing from the metadata of many journal articles. This study investigates the causes, implications, and potential solutions for this problem. Three categories are defined: full institutional information (FII), partially missing institutional information (PMII), and completely missing institutional information (CMII). The results show that more than 60% of journal articles in OpenAlex lack complete institutional information, and the problem is particularly severe for earlier publication years and in the social sciences and humanities. The study explores possible reasons for the problem, the risk that it distorts analytical results, and potential solutions. The aim is to highlight the importance of improving data quality in open resources and to support their responsible use in quantitative science studies and broader contexts. Despite the advantages of OpenAlex, missing institutional information remains a critical data quality challenge for open science.
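The three categories can be illustrated with a minimal Python sketch. It is not the authors' method. It assumes each record is a dictionary shaped like an OpenAlex work, with an `authorships` list whose entries carry an `institutions` list. It also assumes that FII means every authorship has at least one institution, CMII means none does, and PMII covers everything in between; the abstract does not state these rules explicitly. The function names and the sample records are hypothetical.

```python
from collections import Counter


def classify_work(work: dict) -> str:
    """Label a work as FII, PMII, or CMII (assumed definitions, see lead-in)."""
    authorships = work.get("authorships") or []
    # Count authorships that list at least one institution.
    with_institution = sum(1 for a in authorships if a.get("institutions"))
    if authorships and with_institution == len(authorships):
        return "FII"
    if with_institution == 0:
        # Assumption: a work with no authorships at all is also treated as CMII.
        return "CMII"
    return "PMII"


def incomplete_share(works: list) -> float:
    """Share of works lacking complete institutional information (PMII + CMII)."""
    counts = Counter(classify_work(w) for w in works)
    total = sum(counts.values())
    return (counts["PMII"] + counts["CMII"]) / total if total else 0.0


# Hypothetical records for illustration only; not real OpenAlex data.
sample_works = [
    {"authorships": [{"institutions": [{"id": "I1"}]},
                     {"institutions": [{"id": "I2"}]}]},
    {"authorships": [{"institutions": [{"id": "I1"}]},
                     {"institutions": []}]},
    {"authorships": [{"institutions": []}]},
]

labels = [classify_work(w) for w in sample_works]
share = incomplete_share(sample_works)
```

Applied to a real sample of journal articles, `incomplete_share` would give a figure comparable to the share of incomplete records reported above, provided the assumed definitions match those used in the study.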