III: Small: Collection Construction Methodologies for Learning-to-Rank
Northeastern University, Boston MA
Investigators
Abstract
Modern search engines, especially those designed for the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted query and the underlying documents (e.g., web pages) in order to assess the relevance of each document to a given query and thus rank the underlying collection. The sheer size of this problem has led to the development of learning-to-rank algorithms that automate the construction of such ranking functions: given a training set of (feature vector, relevance) pairs, a machine learning procedure learns how to combine the query and document features so as to assess the relevance of any document to any query and thus rank a collection in response to user input. Much thought and research have gone into feature extraction and the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on the choice of documents and queries for learning-to-rank data sets, or on the effect of these choices on the ability of a learning-to-rank algorithm to learn effectively and efficiently. The proposed work investigates the effect of query, document, and feature selection on the ability of learning-to-rank algorithms to learn ranking functions efficiently and effectively. In preliminary results on document selection, a pilot study has already determined that training sets as small as 2% to 5% of the size of those typically used are just as effective for learning-to-rank purposes. Thus, one can train more efficiently over a much smaller (though effectively equivalent) data set or, at equal cost, train over a far "larger" and more representative data set.
In addition to formally characterizing this phenomenon for document selection, the proposed work investigates it for query and feature selection as well, with the end goals of (1) understanding the effect of document, query, and feature selection on learning-to-rank algorithms and (2) developing collection construction methodologies that are efficient and effective for learning-to-rank purposes. Beyond characterizing and developing these methodologies, the project plan includes the development and release of new, efficient, and effective learning-to-rank data sets for use by academia and industry. To foster this effort, the project team maintains close ties with the National Institute of Standards and Technology (NIST) and Microsoft Research, two of the premier organizations that develop and release Information Retrieval data sets. All research results and data sets developed as part of this project will be made available at the project website (http://www.ccs.neu.edu/home/jaa/IIS-1017903/). The project also provides an educational and training experience for students.
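The training setup described in the abstract, learning a ranking function from (feature vector, relevance) pairs and then repeating the exercise on a much smaller selection of documents, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the project: the data are synthetic and noise-free, the model is a linear pointwise scorer fit by stochastic gradient descent, documents are selected uniformly at random per query, and the names `TRUE_WEIGHTS`, `make_synthetic`, `subsample`, `train_pointwise`, and `rank` are hypothetical.

```python
import random

# Hypothetical "ideal" ranking function used only to generate synthetic labels.
TRUE_WEIGHTS = [0.6, 0.3, 0.1]


def make_synthetic(n_queries, docs_per_query, seed=0):
    """Build {query_id: [(feature_vector, relevance), ...]} with noise-free labels."""
    rng = random.Random(seed)
    data = {}
    for query_id in range(n_queries):
        docs = []
        for _ in range(docs_per_query):
            features = [rng.random() for _ in TRUE_WEIGHTS]
            relevance = sum(w * x for w, x in zip(TRUE_WEIGHTS, features))
            docs.append((features, relevance))
        data[query_id] = docs
    return data


def subsample(training_set, fraction, seed=0):
    """Keep a uniformly random fraction of each query's documents (at least one).

    Uniform sampling is an assumption; the abstract does not say how
    documents are selected.
    """
    rng = random.Random(seed)
    reduced = {}
    for query_id, docs in training_set.items():
        k = max(1, round(fraction * len(docs)))
        reduced[query_id] = rng.sample(docs, k)
    return reduced


def train_pointwise(training_set, n_features, epochs=200, lr=0.05):
    """Fit linear weights by stochastic gradient descent on squared error."""
    weights = [0.0] * n_features
    pairs = [pair for docs in training_set.values() for pair in docs]
    for _ in range(epochs):
        for features, relevance in pairs:
            error = sum(w * x for w, x in zip(weights, features)) - relevance
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights


def rank(weights, docs):
    """Return document indices sorted by descending predicted score."""
    scores = [sum(w * x for w, x in zip(weights, f)) for f, _ in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])


full = make_synthetic(n_queries=5, docs_per_query=100)
small = subsample(full, fraction=0.05)  # 5% of the documents per query

w_full = train_pointwise(full, n_features=len(TRUE_WEIGHTS))
w_small = train_pointwise(small, n_features=len(TRUE_WEIGHTS))

print("weights from full set:  ", [round(w, 3) for w in w_full])
print("weights from 5% sample: ", [round(w, 3) for w in w_small])
print("top 5 for query 0 (5% model):", rank(w_small, full[0])[:5])
```

Because the synthetic labels are an exact linear function of the features, both training runs are expected to recover nearly the same weights. The sketch shows only the mechanics of training on a reduced document set; it is not evidence for the pilot study's finding, which concerns real collections and real learning-to-rank algorithms.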