Scalable Biomedical Pattern Recognition Via Deep Learning
Vanderbilt University, Nashville TN
Investigators
Linked publications & trials
Abstract
DESCRIPTION (provided by applicant): Patterns extracted from Electronic Medical Records (EMRs) and other biomedical datasets can provide valuable feedback to a learning healthcare system, but our ability to find them is limited by certain manual steps. The dominant approach to finding the patterns uses supervised learning, where a computational algorithm searches for patterns among input variables (or features) that model an outcome variable (or label). This usually requires an expert to specify the learning task, construct input features, and prepare the outcome labels. This workflow has served us well for decades, but the dependence on human effort prevents it from scaling and it misses the most informative patterns, which are almost by definition the ones that nobody anticipates. It is poorly suited to the emerging era of population-scale data, in which we can conceive of massive new undertakings such as surveiling for all emerging diseases, detecting all unanticipated medication effects, or inferring the complete clinical phenotype of all genetic variants. The approach of unsupervised feature learning overcomes these limitations by identifying meaningful patterns in massive, unlabeled datasets with little or no human involvement. While there is a large literature on feature creation, a new surge of interest in unsupervised methods is being driven by the recent development of deep learning, in which a compact hierarchy of expressive features is learned from large unlabeled datasets. In the domains of image and speech recognition, deep learning has produced features that meet or exceed (by as much as 70%) the previous state of the art on difficult standardized tasks. Unfortunately, the noisy, sparse, and irregular data typically found in an EMR is a poor substrate for deep learning. Our approach uses Gaussian process regression to convert such an irregular sequence of observations into a longitudinal probability density that is suitable for use with a deep architecture. With this approach, we can learn continuous unsupervised features that capture the longitudinal structure of sparse and irregular observations. In our preliminary results unsupervised features were as powerful (0.96 AUC) in an unanticipated classification task as gold-standard features engineered by an expert with full knowledge of the domain, the classification task, and the class labels. In this project we will learn unsupervised features for records of all individuals in our deidentifed EMR image, for each of 100 laboratory tests and 200 medications of relevance to type 1 or type 2 diabetes. We will evaluate the features using three pattern recognition tasks that were unknown to the feature-learning algorithm: 1) an easy supervised classification task of distinguishing diabetics vs. nondiabetics, 2) a much more difficult task of distinguishing type 1 vs. type 2 diabetics, and 3) a genetic association task that considers the features as micro-phenotypes and measures their association with 29 different single nucleotide polymorphisms with known associations to type 1 or type 2 diabetes.
View original record on NIH RePORTER →