Statistical Methods and Sensitivity Analyses: How the Choice of Outcome Measures Affects the Conclusions of STEM Education Research Studies
University of North Carolina at Chapel Hill, Chapel Hill, NC
Investigators
Abstract
Research studies in Science, Technology, Engineering, and Math (STEM) education often seek to evaluate the impact of educational interventions (e.g., new curricula, new instructional practices) on student outcomes (e.g., STEM learning, persistence in STEM fields). Systematic reviews of the STEM education research literature have repeatedly shown that the results of impact evaluations can vary widely depending on how student outcomes are measured. For example, when student outcomes are measured using assessments developed by the primary researchers involved with the intervention or its evaluation, the results tend to be more favorable (i.e., show a larger impact) than when the outcomes are measured using independently developed assessments (e.g., standardized tests). This discrepancy between researcher-developed and independently developed assessments has fueled wider concerns about transparency and replicability in STEM education research, and in the social sciences more generally. The current project addresses this issue by developing statistical procedures for ensuring that the results of impact evaluations are not unduly dependent on the specific outcome measures used. In addition to improving the internal validity of STEM education impact evaluations, the proposed statistical methods can also be used to investigate sources of heterogeneity that may explain the discrepant results, thereby informing STEM education theory.
In this project, the choice of outcome measure is conceptualized as a pervasive but understudied source of treatment effect heterogeneity. This heterogeneity is quantified in terms of the variability of treatment effects over the individual questions or “items” that make up an assessment. Intuitively, when treatment effects vary over items, different assessments of the same construct (i.e., assessments with different items) will lead to different research conclusions.
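The intuition above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the project's methodology: the function names, the pool of 40 items, the 10-item assessments, the mean effect of 0.2, and the item-level standard deviation of 0.3 are all hypothetical values chosen for illustration. It builds many assessments by sampling items from a common pool and compares how much the assessment-level treatment effect varies when item-level effects are identical versus when they differ.

```python
import random
import statistics


def simulate_item_effects(n_items, mean_effect, sd_effect, seed):
    """Draw hypothetical item-level treatment effects around a common mean."""
    rng = random.Random(seed)
    return [rng.gauss(mean_effect, sd_effect) for _ in range(n_items)]


def assessment_effect(item_effects, item_ids):
    """Treatment effect of one assessment: the mean effect over its items."""
    return statistics.fmean(item_effects[i] for i in item_ids)


def effects_across_assessments(item_effects, items_per_test, n_tests, seed):
    """Build many assessments by sampling items from the pool without replacement."""
    rng = random.Random(seed)
    pool = range(len(item_effects))
    return [
        assessment_effect(item_effects, rng.sample(pool, items_per_test))
        for _ in range(n_tests)
    ]


# Assumed values, for illustration only (not taken from the project).
N_ITEMS, ITEMS_PER_TEST, N_TESTS, MEAN_EFFECT = 40, 10, 200, 0.2

# Pool A: every item shows the same treatment effect (no heterogeneity).
homogeneous_pool = simulate_item_effects(N_ITEMS, MEAN_EFFECT, 0.0, seed=1)
# Pool B: treatment effects vary over items.
heterogeneous_pool = simulate_item_effects(N_ITEMS, MEAN_EFFECT, 0.3, seed=1)

# Spread of the assessment-level effect over different item selections.
spread_homogeneous = statistics.pstdev(
    effects_across_assessments(homogeneous_pool, ITEMS_PER_TEST, N_TESTS, seed=2)
)
spread_heterogeneous = statistics.pstdev(
    effects_across_assessments(heterogeneous_pool, ITEMS_PER_TEST, N_TESTS, seed=2)
)

print(f"SD of assessment-level effects, homogeneous items:   {spread_homogeneous:.3f}")
print(f"SD of assessment-level effects, heterogeneous items: {spread_heterogeneous:.3f}")
```

In this sketch, the conclusion drawn from the homogeneous pool does not depend on which items are selected, whereas the heterogeneous pool yields assessment-level effects that shift with item selection. The simulation treats item-level effects as known quantities; in practice they would have to be estimated from response data.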
This project builds on this intuitive idea by developing statistical methodology for evaluating the extent to which observed treatment effects depend on item-level treatment effect heterogeneity that would not be expected to generalize to other assessments of the same construct. This is achieved by combining methods from robust statistics with item response theory to develop new procedures for evaluating differential item functioning with respect to treatment status. These new procedures (a) provide treatment effect estimates that are highly robust to item-level heterogeneity, (b) lead to a general-purpose specification test of whether naive estimates of treatment effects are biased by item-level heterogeneity, and (c) can be used to partition the bias in naive estimates into sources explained by item-level covariates and idiosyncratic sources specific to a given outcome measure. The performance of the proposed methodology will be studied using sample sizes and research designs typical of STEM education research, and empirical results will be synthesized over a sample of STEM education impact evaluations.
This project is supported by NSF's EDU Core Research (ECR) program. The ECR program emphasizes fundamental STEM education research that generates foundational knowledge in the field. Investments are made in critical areas that are essential, broad, and enduring: STEM learning and STEM learning environments, broadening participation in STEM, and STEM workforce development. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.