CAREER: Multilingual Learning for Event Structures from Text

$460,105 · FY2023 · CSE · NSF

University of Oregon Eugene, Eugene, OR

Abstract

Natural language text is replete with important events in different areas (e.g., cybersecurity breaches, disease outbreaks, and business transactions). Identifying events that describe who did what to whom, together with the relations between them (causal, subevent, and coreferential), from large amounts of text can provide valuable data to support intelligent applications and data-driven decisions across various domains. However, current event structure extraction systems work only on text in a few popular languages such as English, Chinese, Spanish, and Arabic. Text in many other languages thus cannot be processed by current event extraction systems. This limitation has restricted the coverage of data sources for these systems, introduced language errors in the extracted events, and delayed updates with the latest events in local reports. As a result, the event data collected with current techniques cannot comprehensively represent the latest dynamics around the world to effectively support decision making on important problems of national interest. To address these multilingual challenges, this project will develop event extraction and event-event relation extraction systems that are effective for data in multiple languages, improving the coverage of extracted data and promoting the democratization of technologies. In information retrieval, multilingual event structure data from the developed technologies can enable data management systems to quickly obtain answers and create summaries for broader user queries in many more languages. In cybersecurity, databases of cyber attack events extracted from multilingual sources can be used to generate more fine-grained and comprehensive reports to inform resource allocation decisions and better protect online activities. In socio-political science, coded conflict and mediation events from more languages can increase the scope of the data and reduce data errors to support better decisions.
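To make the notion of an event structure concrete, the following is a minimal Python sketch of how events ("who did what to whom") and the three relation types named above might be stored. It is an illustration only, not the project's schema: the class names, field names, event type labels, and the example sentence content are all hypothetical.

```python
from dataclasses import dataclass, field

# The three event-event relation types named in the abstract.
RELATION_TYPES = {"causal", "subevent", "coreferential"}


@dataclass
class Event:
    """One extracted event (hypothetical layout, not the project's schema)."""
    event_id: str
    trigger: str      # word or phrase that evokes the event
    event_type: str   # illustrative label, e.g. "CyberAttack"
    language: str     # language code of the source text
    arguments: dict = field(default_factory=dict)  # role name -> text span


@dataclass
class EventRelation:
    """A directed relation between two events, identified by their ids."""
    head: str
    tail: str
    relation: str

    def __post_init__(self):
        # Reject relation labels outside the three types listed above.
        if self.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation}")


def related_events(event_id, relations, relation):
    """Return ids of events linked from `event_id` by the given relation."""
    return [r.tail for r in relations
            if r.head == event_id and r.relation == relation]


# Made-up example: a breach (e1) that causes an outage (e2).
attack = Event("e1", "breached", "CyberAttack", "en",
               {"attacker": "hackers", "target": "the hospital network"})
outage = Event("e2", "outage", "ServiceDisruption", "en",
               {"affected": "patient records system"})
relations = [EventRelation("e1", "e2", "causal")]

caused_by_attack = related_events("e1", relations, "causal")
```

In this sketch the arguments dictionary carries the "who" and "to whom" roles, while relations are kept separate from events so that the same event can take part in several causal, subevent, or coreferential links.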
This project will address three fundamental limitations of existing multilingual learning research for event structure extraction: (i) the lack of multilingual datasets with annotations in enough languages to support evaluating how well models generalize across different language families, (ii) the limitations of current multilingual representation learning methods in aligning representations between languages to induce language-general features, and (iii) the scarcity of labeled data in different languages for training multilingual models. First, the project will annotate documents for all event extraction and event-event relation extraction tasks in many more languages using consistent schemas. The languages selected for annotation will be typologically diverse, understudied, and low-resource to provide reliable multilingual evaluation data for the developed methods. Second, to boost cross-lingual performance for event structure extraction, this project will devise multilingual representation learning methods that enable effective knowledge transfer, so that models trained on labeled data in high-resource languages can be directly applied to data in other languages. The project will develop novel representation alignment methods for different languages using representation matching, augmentation, and language-general structure induction for text. Third, to address the limited training data for multilingual learning, this project will develop novel methods to automatically generate labeled data in different languages. The project will introduce techniques to mitigate noise in the generated data and optimize generation procedures to boost multilingual learning and performance. The research activities in this project will be closely integrated with education and outreach missions to broaden their impact.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
