III-CTX-Small: Adaptive Integration of Structured and Unstructured Data from Many Sources in a Biological Domain

$443,128 | FY2008 | CSE | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

This project involves the design and building of a knowledge base (KB) system that learns to more accurately integrate the many heterogeneous sources of information that are relevant to a single scientist's research needs. The system works by loosely integrating data of many sorts (including unstructured text) into a single typed directed graph, and then querying the graph using a query language that allows "schema-free similarity queries". These queries specify a set of query terms (e.g., keywords or entities in the KB) and constraints on the desired output (e.g., a target data type). The result of a query is a ranked list of KB entities, ordered by similarity to the query terms. After a query, a user can optionally label any subset of the ranked list of suggested answers as "relevant" or "non-relevant". These labels drive a learning phase, the goal of which is to produce a better ranking. Types of learning currently being investigated include EM-based parameter tuning, learning to discriminatively re-rank, and learning to restructure the graph (by adding or deleting edges or vertices). Queries collected in the laboratory of working biologists are used to evaluate these learning methods. The broadest impact of this project is on the problem of learning to integrate heterogeneous data sources (including free text and structured data). However, if successful, the KB system will also have broad impact in the biological research community; in particular, we believe that adaptive personal KB systems of this sort will be a valuable complement to existing biological KBs. Further information on this project can be found at http://www.cs.cmu.edu/~wcohen/querendipity/
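The query model described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure the system uses, so this sketch assumes a random walk with restart from the query terms. The names `TypedGraph`, `similarity_query`, and `build_demo_graph`, the `restart` and `steps` parameters, and the toy data are all hypothetical and are not taken from the project.

```python
from collections import defaultdict


class TypedGraph:
    """Toy typed directed graph: each node has a type, each edge a label."""

    def __init__(self):
        self.node_type = {}                  # node name -> type name
        self.out_edges = defaultdict(list)   # node name -> [(label, target)]

    def add_node(self, name, node_type):
        self.node_type[name] = node_type

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))


def similarity_query(graph, query_terms, target_type, restart=0.5, steps=20):
    """Rank nodes of `target_type` by random-walk-with-restart score.

    ASSUMPTION: the similarity measure is illustrative only; the source
    does not specify the one used by the actual system.
    """
    seeds = [t for t in query_terms if t in graph.node_type]
    if not seeds:
        return []
    start = {s: 1.0 / len(seeds) for s in seeds}
    scores = dict(start)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            edges = graph.out_edges.get(node, [])
            if not edges:
                # Dangling node: send all of its mass back to the seeds.
                for s, p in start.items():
                    nxt[s] += mass * p
                continue
            # With probability `restart`, jump back to a seed node.
            for s, p in start.items():
                nxt[s] += restart * mass * p
            # Otherwise follow an outgoing edge chosen uniformly at random.
            share = (1.0 - restart) * mass / len(edges)
            for _label, dst in edges:
                nxt[dst] += share
        scores = dict(nxt)
    # Apply the output constraint: keep only non-seed nodes of the target type.
    ranked = [
        (n, s) for n, s in scores.items()
        if graph.node_type.get(n) == target_type and n not in start
    ]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


def build_demo_graph():
    """Hypothetical data: one keyword, two papers, two genes."""
    g = TypedGraph()
    g.add_node("kinase", "term")
    g.add_node("paper1", "paper")
    g.add_node("paper2", "paper")
    g.add_node("geneA", "gene")
    g.add_node("geneB", "gene")
    g.add_edge("kinase", "appears_in", "paper1")
    g.add_edge("kinase", "appears_in", "paper2")
    g.add_edge("paper1", "mentions", "geneA")
    g.add_edge("paper1", "mentions", "geneB")
    g.add_edge("paper2", "mentions", "geneB")
    return g


results = similarity_query(build_demo_graph(), ["kinase"], target_type="gene")
print([name for name, _score in results])
```

In this toy graph the query names no schema paths; it gives only a keyword and a target type, and the gene reachable through both papers is ranked above the gene reachable through one. The user feedback step described above would then adjust the ranking, for example by reweighting edge labels, which this sketch does not attempt.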
