HARNESSING MACHINE LEARNING ALGORITHMS TO STUDY SCIENTIFIC GRANT PEER REVIEW
University of Wisconsin-Madison, Madison, WI
Investigators
Abstract
Armed with a $30 billion annual budget, the U.S. National Institutes of Health (NIH) leads the world in funding research to advance human health and treatments for disease. Like most funding agencies, NIH uses peer review to evaluate the merit of grant applications. Approximately three reviewers from a given review group (called a "study section") assign preliminary impact scores and write critiques of each application before attending the study section meeting, where all members contribute to a final priority score. Although NIH's review process is considered one of the best in the world, reports and self-studies show that racial/ethnic minorities and women have lower award rates for first-time and renewal applications, respectively, for NIH's largest funding mechanism, the R01 grant. This is problematic because R01s are critical for career advancement, and research conducted by racial/ethnic minorities and women is linked to technological innovation and is known to address costly education, economic, and health disparities. As a leader in efforts to diversify the science and medical workforce, NIH has called for studies to test whether bias may operate in its peer review process. This call brings to light the broad need for research on the effectiveness of peer review, which is used across all science and technology fields, and for more scientists to engage in such research. If factors unrelated to the quality of the proposed science negatively affect the outcome of a grant review, this runs counter to funding agencies' goals of selecting the best science, blocks expensive downstream federal efforts to broaden participation in science, and undermines the competitiveness of the U.S. scientific enterprise.
Our group was the first to show that, when combined with traditional analyses of scores and award rates, linguistic analysis of NIH peer reviewers' narrative critiques of R01 applications can reveal evidence of potential stereotype-based bias in reviewers' decision making. Although such bias is generally unintentional and affects reviewers' judgment regardless of their own sex or race, it can lead reviewers to enforce evaluation criteria differentially. Controlled experiments show, for instance, that cultural stereotypes that racial/ethnic minorities and women lack intrinsic ability for fields like science can lead reviewers to unconsciously require more proof to confirm their competence. Over the past decade, machine learning technologies have made data, text, and video mining into state-of-the-art analytic techniques which, if applied to scientific peer review, could revolutionize the field. Long Short-Term Memory (LSTM) neural networks, algorithms loosely inspired by the human brain that identify complex patterns in sequential data, have in particular catapulted the application of computer science to the study of social and psychological phenomena. Using a large, demographically diverse set of NIH R01 application critiques, scores, and video of constructed study section discussions, this project is producing analytical tools that use LSTMs to capture evidence of stereotype-based bias in both written and oral discussion of grant applications. The resulting technologies are open access and available for applied use across scientific funding agencies. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.