BIGDATA: Collaborative Research: F: Streaming Architecture for Continuous Entity Linking in Social Media

$799,339 | FY2016 | CSE | NSF

Temple University, Philadelphia PA

Investigators

Abstract

A large fraction of ever-growing Internet content is found in social media such as (micro)blogs. Users access it both to form and to share their opinions about events and people, election preferences, and product and brand recommendations. This situation provides opportunities to create added layers of data mining and analysis regarding users' views on developing events, products, services, or government actions; at the same time, it raises challenges for Entity Linking (EL) in social media. EL is the task of linking an extracted mention to a specific definition of the entity. The definition of an entity is usually a pointer to a Web page that defines the entity. Information extraction from social media generally faces many challenges due to message volume, message speed (Twitter alone generates over 500 million messages per day), variety, free-form language, lack of context, large reference variation, and language diversity. Hashtags are an essential part of the ethos of social networks. They are used to denote brands, events, people, social rallies, etc. The hashtag disambiguation problem is to detect synonymous hashtags and recognize the polysemic ones. For example, the hashtag '#BHaram' refers to the entity 'Boko Haram', defined at the Wikipedia page en.wikipedia.org/wiki/Boko_Haram or at the National Counterterrorism Center Web page www.nctc.gov/site/groups/boko_haram.html. The purpose of this project is to perform EL in social media. This work will benefit multiple segments of society that rely on applications using data from microblog systems, such as targeted monitoring of Twitter and Facebook to collect and understand users' opinions about a recent product or a world event; data aggregation (e.g., reviews about products and services); and data mining for early crisis detection and response as well as national security. This project is one more step towards addressing the government's latest initiative of fighting crime using big data.
The goals of this project are to research algorithms that detect, in near real time, the pieces of text in messages that reference entities and the Web pages that describe those entities, and that link entity references to Web pages and across microblog systems so that together a broader, more complete characterization of each entity can be automatically generated. The proposed approaches are based on innovative techniques that include: incremental, iterative message analysis; smart indexing techniques with live updates to support fast incremental entity reference detection; computationally light soft-clustering of messages to improve entity reference detection; and fast incremental K-partite graph clustering. The resulting artifacts (e.g., software tools) will be made available to benefit researchers in academia and industry. Distribution of free, open-source software for implementing the techniques developed will enhance existing research infrastructure. The project will support and train at least three PhD students, as well as involve undergraduate students in research at Temple University and Binghamton University. The project web site (http://cis.temple.edu/~edragut/projects/nimel.htm) includes more information on the project, software, datasets, educational materials, and publications.
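
To make the entity linking task concrete, the following is a minimal illustrative sketch, not the project's actual method. It assumes a toy in-memory alias index in which new aliases become searchable as soon as they are added; the names `normalize`, `AliasIndex`, `add_alias`, and `link` are hypothetical and chosen only for this example. It uses the '#BHaram' hashtag and the Wikipedia page from the abstract.

```python
from collections import defaultdict


def normalize(mention: str) -> str:
    """Lowercase a mention and drop '#', spaces, and other non-alphanumerics."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


class AliasIndex:
    """Toy index from normalized aliases to entity page URLs (hypothetical)."""

    def __init__(self) -> None:
        self._pages = defaultdict(set)

    def add_alias(self, alias: str, page_url: str) -> None:
        # Live update: the alias is searchable immediately after this call.
        self._pages[normalize(alias)].add(page_url)

    def link(self, mention: str) -> set:
        # Return every candidate page for the mention. An empty set means the
        # mention is unknown; more than one page means it is ambiguous.
        return set(self._pages.get(normalize(mention), set()))


index = AliasIndex()
index.add_alias("Boko Haram", "en.wikipedia.org/wiki/Boko_Haram")
index.add_alias("#BHaram", "en.wikipedia.org/wiki/Boko_Haram")

print(index.link("#bharam"))     # synonymous hashtag, same entity page
print(index.link("#BokoHaram"))  # matches the normalized entity name
```

A real system would also need context from the message to choose among several candidate pages for a polysemic hashtag; this sketch only returns the candidate set.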
