Collaborative Research: Small-Sample Error Estimation for Classification with Application to Genomic Signal Processing

$220,192 | FY2007 | CSE | NSF

Texas A&M Engineering Experiment Station, College Station TX

Investigators

Abstract

The availability of DNA microarray chips and other high-throughput technologies for measuring genomic variables fosters the hope that engineering can successfully address a key problem of translational genomics: using genomic signals to classify disease. Classification can serve to diagnose the existence or category of a particular pathology, or it can be used to prognosticate the effect of a treatment. In cancer, diagnosis can distinguish between different stages of tumor development, and prognosis can predict the toxicity or benefit of a drug relative to the particular genetic make-up of an individual, an example of personalized medicine. Error estimation is critical because the error of a classifier determines its worth. Gene-based classification typically involves small samples (numbers of microarrays), so that the same data must be used to train and test a classifier. Error estimators that work on the training data tend to be low-biased, that is, to underestimate the true error, or to have high variance. This research improves the performance and widens the range of applicability of three recently proposed small-sample estimation paradigms: (1) bolstered error estimators place a kernel at each sample point and then apply the designed classifier to the distribution formed from the bolstering kernels to estimate its error; (2) convex error estimators are formed by an optimal weighted average of low- and high-biased estimators; and (3) calibrated error estimators are formed by using the data to optimally calibrate standard error estimators. This research generalizes the estimation rules, provides methods to obtain estimator parameters, and applies the estimators to genomic diagnosis and prognosis. Properties are mainly studied via simulation; however, analytic results are derived in cases where the error-estimator distributions are known.
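The first two paradigms can be illustrated with a minimal Python sketch. It is not the project's implementation: the nearest-mean classifier, the spherical Gaussian kernel with a fixed `sigma`, the Monte Carlo approximation, the choice of leave-one-out as the second estimator in the convex combination, and the weight `w=0.5` are all assumptions made for illustration. The abstract states only that kernels are placed at the sample points and that the convex weights are chosen optimally.

```python
import numpy as np

def fit_nearest_mean(X, y):
    """Train a nearest-mean classifier (a hypothetical stand-in for any designed classifier)."""
    m0 = X[y == 0].mean(axis=0)
    m1 = X[y == 1].mean(axis=0)

    def predict(Z):
        d0 = np.linalg.norm(Z - m0, axis=1)
        d1 = np.linalg.norm(Z - m1, axis=1)
        return (d1 < d0).astype(int)

    return predict

def resubstitution_error(X, y):
    """Fraction of training points misclassified by the classifier trained on them."""
    predict = fit_nearest_mean(X, y)
    return float(np.mean(predict(X) != y))

def bolstered_resubstitution_error(X, y, sigma=0.5, n_mc=200, seed=0):
    """Monte Carlo bolstered resubstitution.

    Assumption: each sample point is bolstered by a spherical Gaussian
    kernel with the same standard deviation `sigma`.
    """
    rng = np.random.default_rng(seed)
    predict = fit_nearest_mean(X, y)
    n, d = X.shape
    # Draw n_mc points from the kernel centred at each sample point.
    noise = rng.normal(scale=sigma, size=(n, n_mc, d))
    points = (X[:, None, :] + noise).reshape(n * n_mc, d)
    labels = np.repeat(y, n_mc)  # each drawn point keeps its centre's label
    return float(np.mean(predict(points) != labels))

def leave_one_out_error(X, y):
    """Leave-one-out cross-validation error (needs at least two points per class)."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        predict = fit_nearest_mean(X[mask], y[mask])
        errors += int(predict(X[i:i + 1])[0] != y[i])
    return errors / n

def convex_error(low_biased, high_biased, w=0.5):
    """Weighted average of two estimators; w=0.5 is a placeholder, not an optimal weight."""
    return w * low_biased + (1.0 - w) * high_biased

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic small sample: 10 points per class in 2 dimensions.
    X = np.vstack([rng.normal(-1.0, 1.0, size=(10, 2)),
                   rng.normal(1.0, 1.0, size=(10, 2))])
    y = np.array([0] * 10 + [1] * 10)
    resub = resubstitution_error(X, y)
    loo = leave_one_out_error(X, y)
    print("resubstitution:", resub)
    print("bolstered:", bolstered_resubstitution_error(X, y))
    print("convex:", convex_error(resub, loo))
```

In this sketch a larger `sigma` spreads more kernel mass across the decision boundary, which tends to raise the estimate relative to plain resubstitution. Choosing that spread, and the convex weight, from the data is the kind of parameter selection the abstract refers to.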
