SI2-SSE: STAMLA: Scalable Tree Algorithms for Machine Learning Applications

$499,992FY2017CSENSF

University Of Illinois At Urbana-Champaign, Urbana IL

Investigators

Abstract

The voluminous growth in data together with the burgeoning field of data science have greatly increased the demand for machine learning, a field of computer science that focuses on the development of programs that can learn and change in response to new data. With increasing access to large volumes of data, practitioners often resort to machine learning to construct more precise models of nature, or to learn fundamentally new concepts. For example, machine learning can help improve the accuracy of weather and climate predictions, model the efficacy of drugs and their interactions, and identify specific features, such as a face, from a large set of images or videos. But with the growth of data volumes, the speed with which machine can learn from data is decreasing. As a result, new techniques are required to accelerate learning algorithms so that they can be applied to larger and more complex data sets. This work will develop new approaches to improve the performance of a wide class of machine learning algorithms. Specifically, this work will leverage the C++ programming language and recent research into fundamental bit operations to make fast tree-like data structures that underlie many of the most commonly used implementations of machine learning algorithms. In particular, algorithms in the scikit-learn library, the most widely used machine learning library written in the Python programming language, will be accelerated by using our these new tree data structures. Given the widespread adoption of the scikit learn library, this work will impact diverse fields from Astronomy to Biology to Geoscience to Physics. The scikit learn library is also one of the more popular libraries for teaching (and understanding) machine learning. With an explosion of books, blogs, and tutorials that use scikit learn algorithms and pipelines to demonstrate specific types of machine learning such as classification, regression, clustering, and feature extraction, this project will immediately impact a wide range of people from seasoned practitioners, engineers gaining additional training, and students at universities and colleges across the nation. In addition, these tree data structures will be submitted for inclusion in the C++ standard, which would impact millions of developers world-wide. Finally, the algorithms will be implemented under an open source license in a public forum. The STAMLA project aims at developing efficient and scalable tree algorithms inspired from high performance simulation codes for machine learning. Over the last few years, machine learning has become a popular technique in data mining to extract information from data sets, build models and make predictions across a wide range of application areas. However, current tools have been built in high-level languages with more focus on functionalities than on pure performance. But as scientific experiments are accumulating more and more data, and as complex models are requiring larger and larger training sets, scalability issues are emerging. At the same time, in high performance computing, petascale simulations have shown that fundamental data structure optimizations can have a significant impact on overall code performance. In particular, by replacing straightforward tree implementations with implicit trees based on hash tables, simulation codes are able to make the most of modern architecture, leveraging cache and vectorization. This research project will apply this knowledge to machine learning algorithms in order to overcome the limitations of existing libraries and make analyses of extremely large data sets possible. The proposed research includes the development of three library layers, built on top of each other. The first layer is a library of fast bit manipulation tools for tree indexing. It extends existing research that has already demonstrated two to three orders of magnitude improvements compared to standard solutions provided by compilers. The second layer is a tree building blocks library developed using generative programming techniques. This layer will provide developers with generic tools to build efficient implicit trees for specific domains and optimized at compile time to make the most of the targeted architecture. Finally, the third layer consists in a contribution package to the scikit-learn library to leverage the data structures introduced in the second layer. Together, these three layers form a consistent set that propagates low level optimizations based on high performance computing practices to one of the most widely used high level machine learning library. As machine learning is domain independent, the results of this project have the potential to impact all data intensive applications relying on machine learning algorithms based on tree data structures. Moreover, in addition to being developed in an open source framework via a public repository, the three library layers will be released through different channels: 1) the bit manipulation tools will aim at standardization in the C++ language through a collaboration with the ISO C++ Standards Committee 2) the tree building blocks will be proposed for inclusion in the Boost C++ libraries and 3) the machine learning algorithms will be published as a contribution package of the scikit-learn library. These channels will ensure a large adoption of the tools developed throughout this project, and their long-term support by well established communities.

View original record on NSF Award Search →