Semiparametric Efficient Estimation of Models of Measurement Errors and Missing Data
Duke University, Durham NC
Abstract
Many empirical studies in economics are complicated by relevant variables that are not observed, either because they are available only in incomplete or corrupted form or because they are inherently unobservable. Important examples include attrition in panel data analysis and the ubiquitous presence of measurement error, which can be correlated with the true unobserved variables. The program evaluation literature faces the related problem that one never observes an individual's outcomes both with and without treatment. In these circumstances, identifying assumptions become necessary to overcome the lack of identification that results from the missing information in what will be referred to as the primary data set. One common solution to this identification problem is to assume that the missing information can be recovered from auxiliary data sources under a conditional independence assumption. The key element of this identification strategy is that the auxiliary data set must provide information about the conditional distribution of the true variables of interest given a set of proxy variables, where the proxy variables are observed in both the primary sample and the auxiliary sample.
This project derives semiparametric efficiency variance bounds for the estimation of parameters defined through nonlinear generalized method of moments (GMM) models, where the sampling information consists of a primary sample and an auxiliary sample. The variables of interest in the moment conditions are not directly observable in the primary data set, which instead contains proxy variables that are correlated with the variables of interest. The auxiliary data set, in turn, contains information about the conditional distribution of the variables of interest given the proxy variables. Identification is achieved by assuming that this conditional distribution is the same in the primary and auxiliary data sets.
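The identification argument above can be written compactly. The notation here is an assumption introduced for illustration and does not come from the award record: $Y^*$ denotes the variables of interest, $X$ the proxy variables, $m(\cdot,\theta)$ the moment function, and the subscripts $p$ and $a$ the primary and auxiliary populations.

```latex
% Moment condition defining the parameter in the primary population
E_p\left[ m(Y^*, \theta_0) \right] = 0 .

% Identifying assumption: the conditional distribution of Y^* given X
% is the same in the primary and auxiliary populations
f_p(y^* \mid x) = f_a(y^* \mid x) .

% Consequence: by iterated expectations, the primary-population moment
% depends only on objects that the two samples identify
E_p\left[ m(Y^*, \theta) \right]
  = \int E_a\left[ m(Y^*, \theta) \mid X = x \right] \, dF_p(x) .
```

The inner conditional expectation is recoverable from the auxiliary sample, where both $Y^*$ and $X$ are observed, and the outer integral is taken over the distribution of the proxies in the primary sample.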
The results derived in this project apply to both the "verify-out-of-sample" case, where the two samples are independent, and the "verify-in-sample" case, where the auxiliary sample is a subset of the primary sample. Sieve-based semiparametric estimators are developed that achieve the semiparametric efficiency bounds when the propensity score is unknown, when it is known, or when it is assumed to belong to a correctly specified parametric family. These estimators use only one nonparametric estimate, that of a conditional expectation, and do not require nonparametric estimates of both the conditional expectation of the moment functions and the propensity score. They require weaker regularity conditions than existing estimators in the literature. They also allow for unbounded support of the conditioning variables and for nonsmooth moment conditions, and they do not require the strong assumption that the propensity score function is uniformly bounded away from zero and one. These results will be extended to conditional moment models in which either the dependent variables or the conditioning variables are measured with error. In these cases, only a subset of the variables suspected of being measured with error is observable in the auxiliary data set. The estimators currently available require knowledge of the semiparametric efficiency variance bounds. A second extension will consider estimators based on nonparametric maximum likelihood principles that achieve the semiparametric efficiency bound without knowledge of its particular form. Extensive Monte Carlo simulations and an empirical illustration will be used to evaluate the finite-sample efficiency implications of competing estimators. The proposed project involves joint work with Professor Xiaohong Chen of New York University and Professor Alessandro Tarozzi of Duke University.
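The idea of relying on a single nonparametric estimate of a conditional expectation can be illustrated with a minimal sketch for the verify-out-of-sample case. This is not the investigators' estimator: the function names `poly_basis` and `cep_mean`, the polynomial basis used as a stand-in for a sieve, the fixed degree, and the simulated data are all assumptions made for illustration, and the target is a simple mean rather than a general GMM parameter.

```python
import numpy as np

def poly_basis(x, degree):
    """Hypothetical sieve basis: columns 1, x, ..., x**degree."""
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

def cep_mean(x_primary, x_aux, y_aux, degree=3):
    """Estimate E[Y*] in the primary population.

    Step 1: regress Y* on a basis in X using the auxiliary sample,
            which estimates E[Y* | X] by least squares.
    Step 2: average the fitted values over the primary-sample proxies.
    """
    coef, *_ = np.linalg.lstsq(poly_basis(x_aux, degree), y_aux, rcond=None)
    return float(np.mean(poly_basis(x_primary, degree) @ coef))

# Simulated data (an assumption): both samples share E[Y* | X] = 1 + 2X,
# but the proxy X has mean 0 in the auxiliary sample and mean 1 in the
# primary sample, so the target is E_p[Y*] = 1 + 2 * 1 = 3.
rng = np.random.default_rng(0)
x_aux = rng.normal(0.0, 1.0, 2000)
y_aux = 1.0 + 2.0 * x_aux + rng.normal(0.0, 1.0, 2000)
x_primary = rng.normal(1.0, 1.0, 5000)  # Y* is never observed here

theta_hat = cep_mean(x_primary, x_aux, y_aux)
naive = float(y_aux.mean())  # ignores the difference in proxy distributions

print(f"projection estimate: {theta_hat:.3f}")
print(f"naive auxiliary mean: {naive:.3f}")
```

In this simulated design the projection estimate should land near 3, while the naive auxiliary mean should land near 1, because it ignores the shift in the proxy distribution between the two samples. In an actual sieve estimator the number of basis terms would grow with the sample size; the fixed degree here is a simplification.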
Broader Impact: The results developed in this project will be applicable to a wide variety of models, including nonclassical measurement error models, missing data models, and nonlinear treatment effect models. This proposal is part of a larger research agenda in the econometrics profession to develop methods for estimating models with latent variables. Many empirical studies in economics are complicated by relevant variables that are not observed, usually because they are available only in incomplete or corrupted form. Important examples include attrition in panel data analysis and measurement error that can be correlated with the true unobserved variables. Another example is the program evaluation literature, where the estimation of treatment effects must overcome the fact that one never observes an individual's outcomes both with and without treatment. In such circumstances, identifying assumptions based on conditional independence relations become necessary to overcome the lack of identification that results from the missing information in the primary data set. The proposed project will also provide useful guidance for the design of survey data sets, which supply the crucial data input for the analysis of econometric models.