Statistical Methods for Dependent Data

$319,999 · FY2008 · MPS · NSF

University of Pittsburgh, Pittsburgh, PA

Investigators

Abstract

This proposal concentrates on various topics relating to the statistical analysis of dependent data. The first project extends the spectral envelope concept for analyzing DNA sequences. A common problem in analyzing long DNA sequence data is identifying protein-coding sequences that are dispersed throughout the sequence and separated by noncoding regions. DNA sequences are heterogeneous, so the methodology must be expanded to capture the local behavior of such sequences. To address this problem, a local spectral envelope with estimation via mixtures of smoothing splines will be explored. The hope is that this methodology will help emphasize any periodic feature that exists in a categorical sequence of virtually any length in a quick and automated fashion. Projects such as the Human Genome Project have produced large amounts of data, and the methods established in this project will prove useful in analyzing the vast quantities of data being produced by various genome projects.
In another project, the focus is on the analysis of longitudinal data and the development of a practical nonparametric procedure for estimating the within-subject correlation structure. This technique is used to develop a data-driven functional principal components analysis (FPCA) procedure. Because observations made within a subject are often correlated in longitudinal data, an effective analysis of these data must account for this within-subject correlation. When the parametric form of the covariance structure is unknown, using a misspecified structure can result in biased and inefficient estimates. This project focuses on the analysis of longitudinal data that can be modeled as noisy observations, taken at discrete time points, of smooth subject trajectories that are realizations of a stochastic process.
The high dimensionality and complexity of longitudinal data have made FPCA a popular tool for data reduction and visualization, because it captures the primary modes of variation of the stochastic process generating the data. Scientists are often interested in using longitudinal data to determine the effect that a set of possibly time-varying covariates has on a given response over time. Functional linear models, and in particular the varying-coefficient model, provide a framework for analyzing such data. In many of these data sets, the functional coefficients have shapes that cannot be modeled parametrically. An effective analysis of these data must both account for the within-subject correlation and allow for the flexible shapes of the coefficients. Because a parametric form for the within-subject covariance is not always known, a third project focuses on creating an iterative, data-driven, spline-based procedure for fitting varying-coefficient models.
This proposal concentrates on solving problems involved in the analysis of dependent data. The first project will develop a method for detecting genes in long DNA sequences. Projects such as the Human Genome Project have produced large amounts of data, and the methods established in this project will prove useful in analyzing the vast quantities of data being produced by various genome projects. A second proposed project focuses on the analysis of complex data collected over time. This project is also motivated by the analysis of DNA, and in particular, the analysis of gene expression data. In a third project, the investigators will focus on a class of techniques called functional linear models. For example, techniques will be developed for studying the effect that a growth factor should have on the decision to supplement chemotherapy with antiangiogenic therapy when treating ovarian cancer.
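As a rough illustration of the FPCA idea mentioned in the abstract (noisy observations of smooth subject trajectories, summarized by their primary modes of variation), the following sketch applies the textbook dense-grid version of FPCA to simulated curves. It is not the nonparametric covariance procedure proposed in the award. The function name `fpca`, the simulated eigenfunctions, the noise level, and the assumption of a common, equally spaced time grid are all hypothetical choices made for illustration.

```python
import numpy as np

def fpca(Y, t, n_components=2):
    """Textbook FPCA for curves observed on a common, equally spaced grid.

    Y has shape (n_subjects, n_timepoints); t is the shared time grid.
    This is an illustrative sketch, not the procedure proposed in the award.
    """
    dt = t[1] - t[0]                      # assumes equal spacing
    mean = Y.mean(axis=0)                 # pointwise mean curve
    centered = Y - mean
    # Sample covariance surface evaluated on the grid.
    cov = centered.T @ centered / (Y.shape[0] - 1)
    # Discretized covariance operator: multiply by the grid spacing.
    evals, evecs = np.linalg.eigh(cov * dt)
    order = np.argsort(evals)[::-1][:n_components]
    eigenvalues = evals[order]
    # Rescale so each eigenfunction has unit L2 norm under the Riemann sum.
    eigenfunctions = evecs[:, order].T / np.sqrt(dt)
    # Scores: numerical inner product of each centered curve with each mode.
    scores = centered @ eigenfunctions.T * dt
    return mean, eigenvalues, eigenfunctions, scores

# Simulated data: two smooth modes of variation plus measurement noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
n = 200
phi1 = np.sqrt(2) * np.sin(2 * np.pi * t)
phi2 = np.sqrt(2) * np.cos(2 * np.pi * t)
xi = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])   # subject-level scores
noise = rng.normal(scale=0.1, size=(n, t.size))
Y = xi[:, [0]] * phi1 + xi[:, [1]] * phi2 + noise

mean, eigenvalues, eigenfunctions, scores = fpca(Y, t)
explained = eigenvalues / eigenvalues.sum()           # share among kept modes
```

In this simulation the first estimated eigenfunction should closely track the dominant sine mode (up to sign). With sparse or irregularly timed observations per subject, which is common in longitudinal studies, the raw sample covariance used here is not available and a smoothed covariance estimate would be needed instead.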
