Collaborative Research: Phylanx: Python based Array Processing in HPX

$93,300 · FY2017 · CSE · NSF

University Of Oregon Eugene, Eugene OR

Investigators

Abstract

The availability and size of data sets have increased significantly over the past decade. Analyzing large data sets on High Performance Computing (HPC) resources while minimizing time- and energy-to-solution requires combining static and runtime information to determine the best possible layout of an application's large data arrays and thereby minimize data movement. The goal of this proposal is to deliver Phylanx, a general-purpose framework supporting a variety of data science, machine learning, and statistically oriented applications. Phylanx is designed so that a user's code will perform efficiently on current and future architectures as long as the runtime system is maintained. This greatly reduces the maintenance burden and will increase the productivity of domain scientists. Phylanx lays a solid foundation for technology transfer from academia to industry and fills the gap between academic innovation and commercial application by creating a software layer that industrial partners can feel confident relying upon. Phylanx is a scalable, array-based, distributed framework targeting HPC systems using HPX, a dynamic, asynchronous, task-based parallel runtime system. The dataflow-style capabilities exposed by HPX guarantee the preservation of all data dependencies, even for complex distributed workflows. This project overcomes some of the limitations of existing Big Data solutions such as Hadoop, Spark, and Flink by enabling users to: implement NumPy-style expression graphs using Python or C/C++; optimize these graphs for data layout, distribution, tiling, and minimal communication overhead; and evaluate those graphs with high efficiency on a runtime interpreter targeting distributed HPC systems.
Additionally, Phylanx uses greedy submodular techniques on the expression tree to provide a mathematically provable guarantee of optimal performance in machine learning domains and in data placement problems. The platform will provide implementations of six benchmarks selected for their domain specificity in text, image, and graph applications.
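
The implement, optimize, and evaluate workflow described in the abstract can be illustrated with a minimal sketch. This is not Phylanx's API: the names `Node`, `var`, `simplify`, and `evaluate` are hypothetical, the "optimization" is a toy identity-folding pass rather than the layout and tiling optimizations the project targets, and evaluation is local over plain Python lists rather than distributed over HPX.

```python
import operator
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """One node of a lazily built expression graph (hypothetical, not Phylanx)."""
    op: str                          # "var", "const", "add", or "mul"
    args: Tuple["Node", ...] = ()    # child nodes for "add" and "mul"
    name: str = ""                   # used only by "var" nodes
    value: float = 0.0               # used only by "const" nodes

    # Operators record the computation instead of performing it.
    def __add__(self, other):
        return Node("add", (self, _wrap(other)))

    def __mul__(self, other):
        return Node("mul", (self, _wrap(other)))


def _wrap(x):
    """Turn a plain number into a constant node; pass nodes through."""
    return x if isinstance(x, Node) else Node("const", value=float(x))


def var(name):
    """Create a named placeholder for an array supplied at evaluation time."""
    return Node("var", name=name)


def simplify(node):
    """Toy optimization pass: rewrite x * 1 and x + 0 to x."""
    if node.op in ("var", "const"):
        return node
    left, right = (simplify(a) for a in node.args)
    if node.op == "mul" and right.op == "const" and right.value == 1.0:
        return left
    if node.op == "add" and right.op == "const" and right.value == 0.0:
        return left
    return Node(node.op, (left, right))


_OPS = {"add": operator.add, "mul": operator.mul}


def _elementwise(fn, a, b):
    """Apply fn element by element, broadcasting scalars over lists."""
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            raise ValueError("array lengths differ")
        return [fn(x, y) for x, y in zip(a, b)]
    if isinstance(a, list):
        return [fn(x, b) for x in a]
    if isinstance(b, list):
        return [fn(a, y) for y in b]
    return fn(a, b)


def evaluate(node, env):
    """Walk the graph bottom-up, looking up variables in env."""
    if node.op == "var":
        return list(env[node.name])
    if node.op == "const":
        return node.value
    left, right = (evaluate(a, env) for a in node.args)
    return _elementwise(_OPS[node.op], left, right)


x, y = var("x"), var("y")
expr = (x * 2 + y) * 1           # builds a graph; nothing is computed yet
graph = simplify(expr)           # the redundant "* 1" node is removed
result = evaluate(graph, {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
print(result)                    # [12.0, 24.0, 36.0]
```

Separating graph construction from evaluation is what gives a framework room to rewrite the graph, or choose a data layout, before any array is touched.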
