SBIR Phase I: Representation and Deep Learning for Free Text Applications

$165,000 · FY2014 · TIP · NSF

Textician, LLC, Cambridge, MA

Investigators

Abstract

This Small Business Innovation Research (SBIR) Phase I project seeks to validate a new way to process textual material, so that computers can better learn applications related to natural language. A goal is to enable computer learning methods to take better account of parse structure in sentences, which is currently difficult. Two experimental prototypes will be constructed. One will focus upon automatically determining whether online posts about a company, brand or institution are positive or negative (sentiment analysis). This is difficult because sentence structure is important, and also because online posts can be informal, contain slang, etc. The second prototype finds the most important paragraph during a search, also taking sentence structure into account. Such paragraph-level information retrieval helps when extracting factual information from running text or web pages. The new method represents words simultaneously with parse structure using a single high-dimensional vector (for example, a list of 1,000 numbers). A successful SBIR project will ultimately improve machine learning applications for a wide range of tasks, including document retrieval, summarization, and automated translation. Moreover, the same techniques can be applied to represent any structured collection prior to machine learning, including images and genomic information.

The broader impact/commercial potential of this project is to enhance capabilities for automated processing of free text and other structured data. Motivation for this approach comes from neural networks and, in turn, it has applications to neural modeling and our understanding of how the brain processes information. In the software industry, commercial innovation continues to revolve around automated processing of web pages, which plays a key role in creating many new companies. Therefore, the ability to handle free text is increasing in importance. A better way to represent text for use with machine learning will open new capabilities wherever the structure of sentences must be taken into account. This will lead to new startups, and provide consumers with new products and services. A successful project will validate new technology that can give a huge competitive edge to companies that take advantage of it, provide new and better capabilities for consumers, and advance those countries, such as our own, whose economies benefit from new technological innovations.
