ABI Development: Global Names Discovery, Indexing and Reconciliation Services
University of Illinois at Urbana-Champaign, Urbana, IL
Investigators
Abstract
In the mid-18th century, biology as a science was revolutionized by Carl Linnaeus and the use of binomial scientific names. Such names consist of two Latin words, a genus name and a specific epithet: for example, Homo sapiens for humans. For higher classifications, Linnaeus adopted names consisting of a single Latin word. This nomenclatural system dramatically simplified the handling of species information and gave scientists from different countries a standard way to communicate. The system has been so successful that it has persisted for over 250 years and is still in very active use today. An enormous amount of scientific and popular science literature uses scientific names. In the electronic age, scientific names enable the effective organization of biodiversity information from various sources, as demonstrated by the Encyclopedia of Life (http://eol.org) and the Biodiversity Heritage Library (http://bhl.org). However, the naming system introduced by Linnaeus is far from perfect. If scholars decide to move a particular species to a different genus, its name changes, and finding information about the species becomes much more difficult. For example, Drosophila melanogaster (the fruit fly), a very prominent model organism of modern science, is now recommended to be placed in the genus Sophophora, which changes its name to Sophophora melanogaster. Such changes are inevitable in Linnaean nomenclature, and they create a significant amount of confusion. The project aims to create a system that allows a user to enter a name and find out whether it is a known scientific name, whether another name should be used instead, whether there are other historical synonyms, and whether the name is a misspelling of a known scientific name. The system will also display which data sources contain information related to a name and will provide a list of relevant web pages (URLs). This project aims to make significant strides in removing ambiguity and confusion from biodiversity data.
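The lookup described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the tiny in-memory index, the `resolve` function, the status labels, and the 0.85 similarity cutoff are all hypothetical, and treating Sophophora melanogaster as the accepted name simply follows the recommendation mentioned in the abstract. The standard-library `difflib` module stands in for whatever matching the real system uses.

```python
from difflib import get_close_matches

# Hypothetical toy index, for illustration only.
ACCEPTED = {"Sophophora melanogaster", "Homo sapiens"}
# Synonym -> name that should be used instead (assumed mapping).
SYNONYMS = {"Drosophila melanogaster": "Sophophora melanogaster"}


def resolve(query):
    """Classify a name as accepted, synonym, misspelling, or unknown."""
    name = " ".join(query.split())  # normalize whitespace
    if name in ACCEPTED:
        return {"status": "accepted", "name": name}
    if name in SYNONYMS:
        return {"status": "synonym", "name": name,
                "use_instead": SYNONYMS[name]}
    # Not an exact hit: look for a close spelling among all known names.
    known = sorted(ACCEPTED | set(SYNONYMS))
    close = get_close_matches(name, known, n=1, cutoff=0.85)
    if close:
        return {"status": "misspelling", "name": name,
                "did_you_mean": close[0]}
    return {"status": "unknown", "name": name}


if __name__ == "__main__":
    for q in ("Homo sapiens", "Drosophila melanogaster",
              "Homo sapeins", "Quercus alba"):
        print(resolve(q))
```

A production index would also need to record which data sources and URLs mention each name, which this sketch omits.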
Teaching activities are planned with college students at Arizona State, and there are opportunities for participation in Google Summer of Code and in biodiversity informatics training courses developed at MBL. The first stage of the research was also funded by NSF; it allowed the development of a prototype and 'proof of concept' algorithms for finding and verifying scientific names in texts, images, and species lists. This second phase aims to produce production versions of the algorithms, make them precise and fast, and scale them up to serve the global biological community. The task of scientific name disambiguation is not a simple one, and it requires the cooperation of many researchers. Discovery of scientific names in texts uses natural language processing algorithms, which we plan to improve. The project aims to collect all known spellings and renderings of scientific names (20 million are currently collected) and organize them into groups belonging to the same sets of organisms. It also catalogues where each spelling was used and stores this information as a global name index. The database is powered by a search engine that uses fuzzy matching and name parsing algorithms to determine that names like 'Carek scirpoidea Michx. var. convoluta Kük.' and 'Carex scirpoidea subsp. convoluta (Kukenth) D. A. Dunlop' refer to the same subspecies. The project's URL is: http://globalnames.org/
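A rough Python sketch shows how name parsing and fuzzy matching can combine on the two name strings quoted above. It is an illustration under stated assumptions, not the project's algorithm: `canonical_form`, `same_taxon`, the short list of rank markers, and the 0.9 threshold are hypothetical, and the standard-library `difflib.SequenceMatcher` is swapped in for the unspecified fuzzy matcher. The toy parser assumes the first two tokens are the genus and the specific epithet, and that an infraspecific epithet directly follows a rank marker.

```python
from difflib import SequenceMatcher

# Assumed rank markers; real name strings use many more.
RANK_MARKERS = {"var.", "subsp.", "ssp."}


def canonical_form(name_string):
    """Strip authorship and rank markers, keeping only the epithets."""
    tokens = name_string.split()
    if len(tokens) < 2:
        raise ValueError("expected at least a genus and a specific epithet")
    parts = [tokens[0], tokens[1]]
    # The token right after the first rank marker is taken as the
    # infraspecific epithet; everything else is treated as authorship.
    for i, tok in enumerate(tokens[2:-1], start=2):
        if tok in RANK_MARKERS:
            parts.append(tokens[i + 1])
            break
    return " ".join(parts)


def same_taxon(a, b, threshold=0.9):
    """Fuzzy-compare canonical forms so a one-letter typo still matches."""
    ca, cb = canonical_form(a), canonical_form(b)
    return SequenceMatcher(None, ca, cb).ratio() >= threshold


if __name__ == "__main__":
    a = "Carek scirpoidea Michx. var. convoluta Kük."
    b = "Carex scirpoidea subsp. convoluta (Kukenth) D. A. Dunlop"
    print(canonical_form(a))
    print(canonical_form(b))
    print(same_taxon(a, b))
```

Parsing first matters here: the raw strings differ in rank marker and authorship as well as spelling, so comparing them directly would understate their similarity, while the canonical forms differ by a single character in the genus.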