Assessing the Readability of Documents and Statistical Tools for Non-Euclidean Data

$112,277 · FY2006 · MPS · NSF

Purdue University, West Lafayette IN

Investigators

Abstract

Documents are written with a specific audience in mind, and that audience varies along several dimensions. One such dimension is the readability level, which may range from elementary-school to adult. The investigator developed statistical models for readability prediction and experimented with different alternatives. As most standard representations of documents are not well described by Euclidean geometry, the investigator directed his research at non-Euclidean modeling of the word histogram, or term-frequency, representation. Specifically, the task is one of non-linear regression in which the covariates are points in the simplex that do not obey Euclidean geometry. Predicting the readability of documents is an important task. A likely implication of advances in this area is improved matching between readability level and the documents retrieved by search systems. This in turn will positively affect children and non-native speakers of English in their internet searches and other automated textual efforts. As the research is interdisciplinary, it is expected to bring together, and foster future collaboration between, the communities of statistics, machine learning, and information retrieval.
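To make the regression setting concrete, the following is a minimal sketch and not the investigator's method. It assumes one common non-Euclidean choice for the simplex, the Fisher geodesic distance d(p, q) = 2 arccos(Σ √(pᵢ qᵢ)), and substitutes a simple Nadaraya-Watson kernel smoother for the unspecified non-linear regression model. The function names, the smoothing constant, and the bandwidth are all hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text, vocabulary, smoothing=1.0):
    """Map a document to a point in the simplex over `vocabulary`.

    Additive smoothing (an assumption of this sketch) keeps every
    coordinate strictly positive.
    """
    counts = Counter(text.lower().split())
    raw = [counts[word] + smoothing for word in vocabulary]
    total = sum(raw)
    return [value / total for value in raw]


def fisher_distance(p, q):
    """Fisher geodesic distance between two points in the simplex."""
    affinity = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard against floating-point rounding.
    affinity = min(1.0, max(0.0, affinity))
    return 2.0 * math.acos(affinity)


def predict_readability(query, train_points, train_levels, bandwidth=0.5):
    """Kernel-weighted average of training readability levels.

    Weights decay with squared Fisher distance rather than Euclidean
    distance; `bandwidth` is a hypothetical tuning parameter.
    """
    weights = [
        math.exp(-fisher_distance(query, point) ** 2 / (2.0 * bandwidth ** 2))
        for point in train_points
    ]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_levels)) / total
```

In use, each training document would be passed through `term_frequencies` and paired with a known readability level (for example, a school grade), and `predict_readability` would then estimate the level of a new document from its neighbors under the Fisher distance.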
