EAGER: SSMCDAT2023: Natural Language Processing and Large Language Models for Automated Extraction of Materials Chemistry Data from Scientific Literature

$200,000 | FY2023 | MPS | NSF

University of Utah, Salt Lake City, UT

Investigators

Abstract

NONTECHNICAL SUMMARY

This award is made on an EAGER proposal. It supports progress on a project advanced at the SSMCDAT 2023 Datathon held at Lehigh University. This project addresses the problem that materials data in scientific journal papers are hard to access and use in modern computer-assisted research environments because the papers are published as portable document format (PDF) files. Previous attempts to encourage better formats have not worked well. To tackle this problem, the team will use advanced artificial intelligence and natural language processing, which enables computers to “understand” text, to automatically extract materials data from scientific research papers. A crucial element of the team's approach involves leveraging valuable commercial resources, such as the Pauling File, which provides well-curated examples of materials data. The team will use this training data to enhance the performance of the large language models employed in the work. Large language models can “understand” text and generate text in a seemingly human way. Access to organized materials data has the potential to transform solid-state materials chemistry and enable faster progress in the field. The project also seeks to engage with materials scientists, authors, and editors from diverse subdisciplines of materials science to integrate this technology seamlessly into academic publishing workflows. This project also supports training graduate and undergraduate students, and creating outreach materials such as podcast episodes and YouTube courses to promote a wider understanding of artificial intelligence and materials science.

TECHNICAL SUMMARY

This award is made on an EAGER proposal. It supports progress on a project advanced at the SSMCDAT 2023 Datathon held at Lehigh University. This activity addresses the challenge of limited machine-readable materials data in academic literature, mainly due to the prevalence of PDF formats. Prior attempts to encourage machine-readable formats have been unsuccessful. The result has been the emergence of inaccurate and labor-intensive information extraction tools. The team aims to capitalize on very recent advances in natural language processing and large language models, and combine them with the Pauling File's hand-labeled data. This approach eliminates the need for manual labeling, empowering materials chemists to write papers as they always have, while using artificial intelligence to extract and organize materials data into machine-readable formats accurately and automatically. The approach includes steps for machine-learned versus rules-based token size reduction, comparison of open-source versus commercial large language models, expert analysis of errors and incompletions, and expansion to the extraction of materials property data in addition to synthesis data. Success in this endeavor would be potentially transformative to solid-state materials chemistry by leveraging progress that has been made in materials informatics. The activity aims to transform the materials data landscape, enabling widespread materials informatics progress by automating data extraction from research articles. The project's broader impacts extend to other academic domains, with potential applications in different scientific fields. It also promotes bilingual outreach and education, including unique social media content delivered through YouTube and podcast formats. Finally, the activity will bring authors, editors, data practitioners, and publishers together to assess data extraction performance.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
