III: Small: Representation, Modeling and Inference for Large Biological and Information Networks

$513,780 | FY2010 | CSE | NSF

Harvard University, Cambridge MA

Investigators

Abstract

Modern technology has completely transformed the concept of data in the biological and information sciences. Data collections about the flow of information on the web, for instance, or about the regulatory and metabolic dynamics that drive cellular functionality, are extremely large and heterogeneous. These collections are often characterized as networks of websites or proteins, where directed edges denote information flow or chemical reactions, and where node information is described in terms of web pages or chains of amino acids. Knowledge discovery and management are key. The goal of this proposal is to create novel computational and statistical approaches to store, search, and quantify patterns in large networks efficiently, and to explore the extent to which these new tools help address a number of important open problems and computational issues. The research plan includes theoretical, methodological, data analysis, and dissemination aspects. The approach is to develop new models, methods and algorithms for analyzing large biological and information networks with rich node information. New tools will be developed: to assess the complexity of networks; to compare the fit of alternative network models; to store information about both connectivity and nodes in a network efficiently; to calibrate informative priors for empirical Bayesian analyses of networks, so that the priors reflect the reality of signaling both in metabolic networks and in the spread of news on the web; to estimate the effects of node information on the local connectivity in a network; and to infer influence potentials and diffusion channels in online information networks.
The proposed research is focused on three specific technical tasks: (1) establishing a new representation of valued, multivariate networks based on statistical models; (2) developing a flexible family of probabilistic graphical models to link local connectivity in the network to high-dimensional node attributes; and (3) developing scalable algorithms to infer a non-observable network structure from multiple trails of informational artifacts on the network itself. In addition, two in-depth case studies will be developed to illustrate the potential of the proposed methodology. The first is an analysis of the effects of local influence patterns among online newspapers, news collectors and blogs on the diffusion of news and information items. The second is an analysis of the effects of local perturbations of signaling in regulatory networks on global cellular responses, for many known functions, from bacteria to humans. Insights gained in tackling the case studies will in turn generalize and foster the development of the next wave of core methodology and theory in machine learning. The proposed work meets an urgent need for the development of new and principled methods for analyzing massive amounts of network data, as well as the creation of large-scale data sets for testing and benchmarking, to the benefit of the community at large. The research plan is tightly integrated with an interdisciplinary educational program and with the development of a statistical machine learning curriculum, which will attract many undergraduates to research at the intersection of machine learning and the sciences, and will provide opportunities to actively encourage students from underrepresented groups to pursue careers in computer science and statistics. The team will distribute open source software and set up websites to enable the community to use and build upon the tools.
