CompCog: RI: Small: Human-like semantic grammar induction through knowledge distillation from pre-trained language models

$484,509 · FY2023 · CSE · NSF

Ohio State University, The, Columbus OH

Investigators

Abstract

Human languages are thought to allow for an unbounded set of possible meanings to be expressed using bounded sets of rules, called grammars. These grammars assign meanings to words and compose meanings of words and phrases into larger phrases and clauses. Humans can communicate extremely precise descriptions of goals and world behaviors, and linguists know much about the logical structure of language involved in this kind of precise communication, but one of the central open questions in linguistics is how humans acquire these mechanisms. Computational models of how these grammars are learned from recorded or transcribed utterances can provide evidence that this learning can be accomplished by children without substantial innate biological biases, and these models can also provide automated tools for analysis and documentation of endangered languages, including many Indigenous American languages. Existing statistical and neural grammar learning methods can induce grammars from sentences in text corpora that predict about half of the phrases and clauses annotated by linguists; however, this level of performance is nowhere near the accuracy of human language learners, and attempts to support this learning using image and video data have not substantially improved induction accuracy. The proposed work will instead extract statistics about logical predicates from large commercially available neural language models as a surrogate for human world knowledge so as to improve the accuracy of grammar induction.

The proposed work will develop the first broad-coverage semantic grammar induction model that integrates world knowledge into the acquisition process by distilling it from large pre-trained neural language models. The world knowledge implicit in the large language models will be distilled into a matrix of predicate co-occurrence statistics using argument-specific prompts. The resulting predicate co-occurrence statistics will make no distinction between, for example, active and passive sentences, topicalized and non-topicalized sentences, or declarative and subject-auxiliary inverted sentences. This model will be used to evaluate claims about the statistical learnability of grammar. The proposed work will also continue work on developing resources for evaluating these structural models. The model and corpora collected as part of this project will be freely distributed on both university and external websites.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
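
To make the idea of a predicate co-occurrence matrix more concrete, the following is a minimal sketch and not the project's actual method. It assumes that language-model responses have already been reduced to (predicate, argument slot, argument predicate) triples; the triples, the slot numbering, and the function names `cooccurrence_counts` and `to_matrix` are all hypothetical, and no language model is queried. It only illustrates why an active sentence and its passive counterpart would contribute identical counts once surface word order is abstracted away.

```python
from collections import Counter

# Hypothetical, hand-written triples standing in for distilled model output.
# Each tuple is (predicate, argument_slot, argument_predicate).
# Assumption: slot 1 is the logical subject, slot 2 the logical object.
TRIPLES = [
    ("chase", 1, "dog"),   # "The dog chased the cat."
    ("chase", 2, "cat"),
    ("chase", 1, "dog"),   # "The cat was chased by the dog." (passive, same triples)
    ("chase", 2, "cat"),
    ("eat", 1, "cat"),     # "The cat ate the fish."
    ("eat", 2, "fish"),
]


def cooccurrence_counts(triples):
    """Count how often each argument predicate fills each (predicate, slot)."""
    return Counter(((pred, slot), arg) for pred, slot, arg in triples)


def to_matrix(counts):
    """Lay the counts out densely: rows are (predicate, slot) pairs,
    columns are argument predicates, and missing pairs count as 0."""
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    matrix = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, matrix


counts = cooccurrence_counts(TRIPLES)
rows, cols, matrix = to_matrix(counts)
for row, values in zip(rows, matrix):
    print(row, dict(zip(cols, values)))
```

In this toy data the active and passive "chase" sentences each add one count to the same two cells, so the matrix cannot tell them apart, which is the property the abstract attributes to the distilled statistics.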
