SBIR Phase II: Units-based numeric data extraction with knowledge of scientific context

$900,112 | FY2010 | TIP | NSF

Quantifind Inc., Palo Alto CA

Investigators

Abstract

This Small Business Innovation Research (SBIR) Phase II project aims to establish that a units-based approach to retrieving quantitative data from scientific and technical documents is a powerful alternative to keyword- and document-based search models. Keyword approaches to data extraction are limited by poor semantic contextualization and by the wide variety of numeric and unit formats in which quantities are written. The proposed approach to reliable numeric data extraction begins with quantity-intelligent indexing that recognizes many numeric formats and converts quantities to standardized base-unit tokens, significantly enhancing search recall over keyword approaches. The resulting number-unit pairs will anchor the index to enable efficient scientific exploratory search with high semantic precision, without relying heavily on sophisticated, imposed semantic ontologies. Research will focus on a proprietary search-time data-scoring algorithm that uses context-sensitive numeric spectra to score otherwise ambiguous results by probabilistic methods. This approach is expected to improve both the precision and the recall of contextual numeric data extraction. In turn, the resulting search engine will enable instant visualization and analysis of collective technology landscapes and trends, which will guide researchers in any area of technology represented by the indexed documents.

The broader impact of this project will be to enable reliable and efficient extraction of numeric data from diverse sources such as scientific literature and patent databases. These unstructured document sets contain a wealth of latent quantitative data which, if properly extracted and aggregated, can enable powerful modes of data exploration. The units-based index and data-scoring algorithm are customized for an exploratory search model that will allow non-expert users to rapidly aggregate thousands of relevant data points from simple keyword inputs, without laboriously opening and parsing individual documents. Researchers and students may thus explore data sets that were previously inaccessible or known only to experts in a field. This will also contribute to knowledge discovery within large unstructured databases, since patterns and correlations between seemingly disparate variables can be visualized immediately. The platform will provide the capability to efficiently generate technology landscapes, anticipate emerging trends, and recognize competitive technical outliers. If successful, this will be valuable for high-tech industrial innovation, including for engineers involved in R&D as well as for business development executives and intellectual asset managers who focus on asset allocation, new technology ventures, prior art, and patent infringement within a technical parameter space.
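The indexing step described in the abstract, recognizing several numeric formats and converting each quantity to a standardized base-unit token, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `UNIT_TABLE`, `QUANTITY_RE`, and `quantity_tokens`, the small unit table, and the `value|unit` token format do not come from the award record and are not the awardee's implementation. The proprietary scoring algorithm is not sketched.

```python
import re

# Hypothetical conversion table: unit symbol -> (base unit, multiplier).
# A real index would cover far more units; this is illustrative only.
UNIT_TABLE = {
    "nm": ("m", 1e-9),
    "um": ("m", 1e-6),
    "mm": ("m", 1e-3),
    "cm": ("m", 1e-2),
    "m": ("m", 1.0),
    "km": ("m", 1e3),
    "mg": ("kg", 1e-6),
    "g": ("kg", 1e-3),
    "kg": ("kg", 1.0),
    "ms": ("s", 1e-3),
    "s": ("s", 1.0),
}

# Recognizes "1,500", "0.0015", "1.5e3", and "1.5 x 10^-6", followed by a unit symbol.
QUANTITY_RE = re.compile(
    r"(?P<mantissa>\d+(?:,\d{3})*(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    r"(?:\s*[x×]\s*10\^?(?P<exp>[-+]?\d+))?"
    r"\s*(?P<unit>[A-Za-z]+)\b"
)


def quantity_tokens(text):
    """Return base-unit tokens such as '1.5e-06|m' for quantities found in text."""
    tokens = []
    for match in QUANTITY_RE.finditer(text):
        unit = match.group("unit")
        if unit not in UNIT_TABLE:
            continue  # unknown unit symbol: not treated as a quantity
        value = float(match.group("mantissa").replace(",", ""))
        if match.group("exp") is not None:
            value *= 10.0 ** int(match.group("exp"))
        base_unit, factor = UNIT_TABLE[unit]
        # Six significant digits keep differently written quantities on one token.
        tokens.append(f"{value * factor:.6g}|{base_unit}")
    return tokens


if __name__ == "__main__":
    for sample in ["1,500 nm", "1.5 um", "1.5 x 10^-6 m", "0.0015 mm"]:
        print(sample, "->", quantity_tokens(sample))
```

Under these assumptions, the four sample strings all map to the same token, which is the sense in which base-unit normalization can raise recall: a keyword search for "1.5 um" would miss the other three spellings.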
