Collaborative Research: OLAC: Accessing the World's Language Resources

$147,001 · FY2007 · SBE · NSF

University of Pennsylvania, Philadelphia, PA

Abstract

Language resources are the bread and butter of language documentation and linguistic investigation. They include the primary objects of study such as texts and recordings, the outputs of research such as dictionaries and grammars, and the enabling technologies such as software tools and interchange standards. Increasingly, these resources are maintained and distributed in digital form. Although language resources have begun to proliferate on the web, they are often difficult or impossible to locate and reuse. In this collaborative research project, Drs. Mark Liberman and Steven Bird of the University of Pennsylvania and Dr. Gary Simons of the Graduate Institute of Applied Linguistics will address this problem through new research to enhance the digital infrastructure of the Open Language Archives Community (OLAC). OLAC provides a standard set of language resource descriptors and a portal that permits users to query dozens of language archives simultaneously using a single search. However, the current coverage of OLAC is only the tip of the iceberg. The aim of the project is to greatly improve access to language resources for linguists and the broader communities of interest, by achieving an order-of-magnitude increase in the coverage of the OLAC catalog and in the use of OLAC search services. The project will do so through two main areas of activity: developing guidelines and services that encourage language archives to follow best common practices that will facilitate language resource discovery through OLAC, and developing services to bridge from the resource catalogs of the library and web domains to the OLAC catalog. The project should have a broad impact across the field of linguistics by developing an online service that gives linguists access to resources for the thousands of languages in the world. But the impact will extend well beyond the linguistics community. 
Access to these language resources will assist technologists who are endeavoring to make information technologies work with every language, not just a select few. It will also permit educators, students, and members of society at large to access a wealth of materials that demonstrate the full range of linguistic diversity in the world. Yet another audience for these resources is the speakers of the world's languages themselves. In the case of endangered languages, access to language resources is a critical asset in the process of language revitalization. The project will also advocate the widespread use of ISO 639-3, a newly adopted standard that provides codes for precisely identifying the 7,500 known human languages, past and present. This will encourage reform in current cataloging practice, which is based on an earlier ISO standard that recognizes fewer than 400 languages, and will begin the process of helping the major storehouses of knowledge around the world to deal appropriately with linguistic diversity.
