SBIR Phase I: Maximum Entropy Data De-duplication

$99,984 | FY2001 | TIP | NSF

Choicemaker Technologies, Inc., New York NY

Abstract

This Small Business Innovation Research (SBIR) Phase I project will investigate the feasibility of high-risk, high-return research toward creating general-purpose de-duplication software. De-duplication software identifies multiple database records that refer to one entity (such as a person), thereby enabling the merger of fragmented data. ChoiceMaker markets a research-derived de-duplication system called MEDD. Many fundamental social services, including child immunization, require accurate de-duplication. New York City currently uses MEDD to de-duplicate its immunization records, thereby successfully improving children's public health. However, smaller public health organizations cannot benefit from MEDD because they cannot afford the six weeks of computer consulting required to customize MEDD for their data. ChoiceMaker's proposed research would decrease the adaptation time by an order of magnitude, making de-duplication affordable for most public health organizations and nearly every business with mission-critical databases. MEDD employs an important emerging information-theoretic statistical technique (called maximum entropy) to mimic the decisions made by people evaluating whether to merge similar records. Maximum entropy technology supports software that can 'understand' each individual database's idiosyncratic information semantics and structure. In the proposed research, ChoiceMaker will investigate significant, innovative extensions to maximum entropy technology that will dramatically increase MEDD's convenience and flexibility. This research has applications to enhancing the data quality of any database that might contain multiple entries for the same entity due to the lack of a reliable identifying key. Specifically, there are applications to the management of master patient indices by health care providers and lists of clients and vendors at large institutions.
The system is equally useful for matching and linking records in two different databases, such as for merging mailing lists for direct marketing, linking medical records for epidemiological research, and matching buy and sell orders for securities transa
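
The abstract does not describe MEDD's internals, so the following is only a minimal sketch of the general idea it names: a two-class maximum entropy model (mathematically the same as logistic regression) that learns from human merge decisions whether a pair of records refers to the same entity. The record fields, the `features` clues, the training pairs, and the hyperparameters are all hypothetical and chosen for illustration; none of them come from the award record.

```python
import math

def features(a, b):
    """Binary clues comparing two records (hypothetical feature set)."""
    return [
        1.0,                                                      # bias term
        1.0 if a["last"].lower() == b["last"].lower() else 0.0,   # same last name
        1.0 if a["first"].lower() == b["first"].lower() else 0.0, # same first name
        1.0 if a["dob"] == b["dob"] else 0.0,                     # same date of birth
    ]

def prob_match(weights, x):
    """Two-class maximum entropy model: P(match | x) = sigmoid(w . x)."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, steps=2000, lr=0.5, l2=0.01):
    """Fit weights by gradient ascent on the L2-regularized log-likelihood."""
    xs = [features(a, b) for a, b in pairs]
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, labels):
            err = y - prob_match(w, x)  # observed minus predicted
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * (g / len(xs) - l2 * wj) for wj, g in zip(w, grad)]
    return w

# Hypothetical human-labelled pairs: 1 = reviewer merged them, 0 = kept apart.
base = {"first": "Ana", "last": "Diaz", "dob": "1998-03-02"}
pairs = [
    (base, {"first": "ana", "last": "DIAZ", "dob": "1998-03-02"}),
    (base, {"first": "Ann", "last": "Diaz", "dob": "1998-03-02"}),
    (base, {"first": "Ana", "last": "Diaz", "dob": "2001-07-15"}),
    (base, {"first": "Luis", "last": "Diaz", "dob": "1999-11-30"}),
    (base, {"first": "Ben", "last": "Okoro", "dob": "1998-03-02"}),
]
labels = [1, 1, 0, 0, 0]
weights = train(pairs, labels)

# Score two unseen pairs with the learned weights.
m1 = {"first": "Maria", "last": "Lopez", "dob": "2000-01-01"}
m2 = {"first": "maria", "last": "lopez", "dob": "2000-01-01"}
m3 = {"first": "Juan", "last": "Perez", "dob": "1990-05-05"}
p_same = prob_match(weights, features(m1, m2))
p_diff = prob_match(weights, features(m1, m3))
print(f"likely duplicate: {p_same:.2f}, likely distinct: {p_diff:.2f}")
```

In a sketch like this, adapting the matcher to a new database would amount to choosing new clues and labelling new example pairs, which is presumably the kind of customization effort the proposed research aims to shorten.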
