The Cost of AI: A Comparative Study of Machine Learning Training Methods

$399,634 | FY2024 | SBE | NSF

Georgia Tech Research Corporation, Atlanta GA

Investigators

Elizabeth Disalvo (contact), Alan Ritter, Wei Xu

Abstract

In recent years, Artificial Intelligence (AI) research has made rapid advances that have led to numerous real-world applications. While some researchers explore fairness and bias in AI systems, few address how researchers navigate conflicting ethical issues in how these AI systems are trained. This study will identify and articulate the ethical questions regarding Machine Learning (ML) training methods, emphasizing environmental cost, labor practices, financial cost, and data quality trade-offs when choosing ML training methods in research settings. These research findings will contribute a model for AI researchers to weigh data training methods for responsible AI research. Focusing on these ethical and financial considerations in a research setting will provide tools to evaluate research approaches that serve our nation best and will contribute to training students to consider these trade-offs as they move into industry roles.

This study focuses on large language models (LLMs), such as ChatGPT, Llama, and Claude. While many approaches to AI and Natural Language Processing (NLP) rely on supervised ML through training data, LLMs employ pre-training of large-scale models on vast amounts of data collected from the internet. To enable LLMs to grasp the intricacies of language and align their outputs with human preferences, they undergo pre-training on extensive datasets and fine-tuning on labels generated by digital piecework workers. Our study centers on the production of this fine-tuning data. NLP research is a fast-growing field, with over 5,000 papers published in 2021 and a more than 50% increase in research production between 2017 and 2021. LLMs offer an excellent case study because various methods are used to train them, broadly categorized into supervised, unsupervised, and reinforcement learning approaches. These methods are often used in combination, and the choice depends on the specific goals of the model and the available data.

The study consists of three primary objectives: 1) identify current practices among researchers who use large-scale datasets to train LLMs by conducting interviews and surveys; 2) examine the trade-offs among multiple types of data training methods using comparative studies of (a) LLMs to generate training datasets with or without a human-in-the-loop, (b) digital piecework or paid crowd work (such as Mechanical Turk), and (c) dedicated data workers employed by research groups; and 3) develop ethical models and resources that communicate how to weigh data training methods based on cost, time, quality, diversity, environmental impact, and labor practices for responsible AI training.

This project is funded through the ER2 program by the Directorate for Social, Behavioral and Economic Sciences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
