Statistical methods for air-pollution studies using low-cost monitors
Johns Hopkins University, Baltimore MD
Abstract
Air pollution research is increasingly adopting emergent cost-effective technologies to measure pollutant levels at spatial and temporal scales finer than those delivered by the geographically sparse network of regulatory monitors. Low-cost air-pollution monitors, while promising, introduce a series of data features, such as the need for field co-location and calibration to eliminate noise, massive spatio-temporally correlated datasets, and repeated measures on exposures. Current statistical methodology for more traditional air-pollution data collection schemes is not optimized to properly exploit the noisy, high-throughput, and spatio-temporally dependent low-cost data. This proposal pursues multi-faceted statistical methods development motivated by the unique features of the low-cost monitoring data to improve the rigor and widen the breadth of scientific findings based on such data.

Our first innovation is a spatial-filtering method for calibration of the noisy low-cost data. Regression calibration of low-cost networks using field co-location with regulatory monitors leads to underestimation of air-pollution peaks, a critical flaw from a health perspective. The current practice also fails to exploit the spatial correlation among exposure levels in the network. Our proposed filtering approach mitigates both issues and will be used to produce network-wide calibrated and smooth high-resolution spatio-temporal maps of pollutants.

Our next set of innovations concerns proper utilization of the high-throughput data from low-cost networks. The large low-cost datasets have increased uptake of data-intensive machine-learning (ML) methods like random forests (RF) for exposure prediction modeling. However, exposure data are spatio-temporally correlated, and RF encounters numerous issues with dependent data, leading to loss of accuracy.
We proposed RF-GLS, a novel extension of RF that explicitly accounts for spatio-temporal correlation to improve predictions. We will develop extensions of RF-GLS for use in spatial filtering, for predicting categorical exposure data (such as Air Quality Index category), and for estimating exposure effects after accounting for confounders. We will use RF-GLS to predict personal exposures using the low-cost ambient and wearable network data in Baltimore.

We recognize that the rich repeated-measures data on exposures from low-cost monitors can be used directly in association studies between health and air pollution, without any ad hoc and lossy data reduction such as using the mean exposure. We propose a scalar-on-distribution analysis (SoDA) that uses the entire sample of exposures as a distribution-valued covariate in association studies. SoDA is tailored to repeated-measures covariates and will be more efficient than the general-purpose scalar-on-function regression (SoFR). SoDA will be used to directly assess which aspects of an individual's exposure distribution correlate most with their health, which in turn can help re-evaluate and update current air quality standards.

The statistical methods proposed here will be applied to analyze low-cost ambient and personal exposure networks in Baltimore. We will also implement the proposed methods in publicly available, user-friendly software.
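The peak-underestimation problem attributed to regression calibration above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the project's method or data: the gamma-distributed "true" concentrations, the linear sensor bias, the noise level, and the function name simulate_calibration are all hypothetical choices made for illustration. It fits an ordinary least-squares line of reference values on raw low-cost readings, as would be done at a co-location site, and compares calibrated and true values on the highest 5% of true concentrations.

```python
import numpy as np


def simulate_calibration(n=5000, noise_sd=6.0, seed=0):
    """Simulate co-located data and apply simple regression calibration.

    All distributions and coefficients here are hypothetical.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical "true" pollutant levels from a reference monitor
    # (right-skewed, as concentrations often are).
    truth = rng.gamma(shape=2.0, scale=6.0, size=n)

    # Hypothetical low-cost readings: offset, scaled, and noisy.
    raw = 3.0 + 1.4 * truth + rng.normal(0.0, noise_sd, size=n)

    # Regression calibration: least-squares fit of reference on raw readings.
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(raw, truth, deg=1)
    calibrated = intercept + slope * raw

    # Compare behavior on the top 5% of true concentrations (the "peaks").
    peak = truth >= np.quantile(truth, 0.95)
    return {
        "var_truth": float(truth.var()),
        "var_calibrated": float(calibrated.var()),
        "peak_mean_truth": float(truth[peak].mean()),
        "peak_mean_calibrated": float(calibrated[peak].mean()),
    }


result = simulate_calibration()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Because least-squares fitted values have variance equal to R-squared times the variance of the response, the calibrated series is less variable than the truth whenever the sensor is noisy, so calibrated values at true peaks are pulled toward the overall mean. The sketch ignores spatial and temporal correlation entirely, which is the additional information the proposed spatial-filtering approach is meant to exploit.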