Collaborative Research: CI-P: Creation of an annotated repository of multilingual and multigenre code switched data for several language pairs
Columbia University, New York NY
Investigators
Abstract
Code switching (CS) is the practice, common among bilingual speakers of a given language pair, of switching back and forth between their shared languages. CS occurs in all genres of communication and at different levels of linguistic representation. Computational algorithms trained on a single language fail when the input contains other languages, that is, when the data exhibit CS. One major barrier to research on processing CS is the lack of large, accurately annotated corpora of CS data. This planning proposal aims to create the framework for a large, consistently annotated data repository that will target 7 different languages, annotated with features at different levels of granularity. In the course of the planning grant, we plan to hold a community workshop to ensure that the repository addresses the needs of the research community. We will work with the community to prepare the full CRI proposal. This data will be transformative for computational linguistics research: it will provide a testbed for adaptive learning algorithms, lead to significant robustness in handling very diverse data sources, and create a framework for genuine multilingual processing. Moreover, it will have a direct impact on the way sociolinguists account for CS, leading to more robust and replicable generalizations. Research on CS will help acknowledge the creativity of bilinguals in exploiting their verbal repertoire. The CS repository will enable new research in many interconnected fields, and this research will contribute to raising general awareness of bilingualism and multilingualism.