ITR: Automated Structuring of Text Information
New York University, New York NY
Investigators
Abstract
At present, access to the information in large-scale text collections is largely limited to keyword-based searches which retrieve entire documents or passages. While such tools are often satisfactory in retrieving information on general topics, they provide little support for accessing information involving specific relationships, events, or facts. Information extraction technology offers the possibility of creating structured, tabular representations of selected relations from large text collections, representations which can support more detailed document querying. Until now, however, developing extraction systems for a broad range of relations has been too expensive and time-consuming for extraction to be considered for this purpose. Recent developments in extraction system customization offer the promise of substantially easing this task, and so making this approach to document indexing feasible. This research project will: 1) use corpus-based techniques to automatically identify the most common relationships within a sublanguage (the set of texts concerning a particular subject matter), and the different ways in which these relations are expressed in the text; 2) construct systems to extract information about these relationships from new text, building tabular summaries; and 3) provide a user interface for querying these relationships and accessing the underlying documents. Taken together, these tools should offer significant new capabilities for accessing the information in large text collections.
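To make the contrast between keyword search and relation-level querying concrete, the following is a minimal Python sketch, not the project's actual system. Everything in it is an assumption made for illustration: the three-sentence corpus is invented, the two hand-written regular expressions stand in for patterns that the project proposes to discover automatically from a corpus, and the names `extract_table`, `query`, and `keyword_search` are hypothetical.

```python
import re

# Invented mini-corpus (assumption: not data from the project).
CORPUS = {
    "doc1": "Acme Corp acquired Widget Inc in 1999. Shares rose afterward.",
    "doc2": "Globex acquired Initech last year.",
    "doc3": "Initech hired Jane Doe as chief executive.",
}

# A capitalized name: one or more capitalized words separated by spaces.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Hand-written patterns standing in for automatically discovered ones.
PATTERNS = {
    "acquire": re.compile(rf"({NAME}) acquired ({NAME})"),
    "hire": re.compile(rf"({NAME}) hired ({NAME})"),
}


def extract_table(corpus):
    """Build a tabular summary: one row per relation instance found."""
    rows = []
    for doc_id, text in corpus.items():
        for relation, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                rows.append({
                    "doc": doc_id,
                    "relation": relation,
                    "agent": match.group(1),
                    "target": match.group(2),
                })
    return rows


def query(rows, **constraints):
    """Return the rows whose fields equal every given constraint."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in constraints.items())
    ]


def keyword_search(corpus, keyword):
    """Baseline: return ids of documents that merely mention the keyword."""
    return [doc_id for doc_id, text in corpus.items() if keyword in text]


if __name__ == "__main__":
    table = extract_table(CORPUS)
    # Keyword search cannot tell which role "Initech" plays in each document.
    print(keyword_search(CORPUS, "Initech"))
    # The relation table can answer "who acquired Initech?" directly.
    for row in query(table, relation="acquire", target="Initech"):
        print(row["agent"], "->", row["target"], "in", row["doc"])
```

In this sketch each row keeps the identifier of its source document, so a query over the table can lead back to the underlying text, mirroring the third goal listed in the abstract.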