Collaborative Research: SHF: Medium: Bringing Python Up to Speed

$437,999FY2020CSENSF

University Of Pennsylvania, Philadelphia PA

Investigators

Abstract

The Python programming language is among today's most popular computer programming languages and is used to write software in a wide variety of domains, from web services to data analysis to machine learning. Unfortunately, Python’s lightweight and flexible nature -- a major source of its appeal -- can cause significant performance and correctness problems - Python programs can suffer slowdowns as high as 60,000x over optimized code written in traditional programming languages like C and C++, and can require an order-of-magnitude more memory. Python's flexible, “dynamic” features also make its programs error-prone, with many coding errors only being discovered late in development or after deployment. Python’s frequent use as a "glue language" -- to integrate and interact with different components written in C or C++ -- exposes many Python programs to the unique dangers of those languages, including susceptibility to memory corruption-based security vulnerabilities. This project aims to remedy these problems by developing new technology for Python in the form of novel performance analysis tools, memory-reduction and speed-improving optimizations (including support for multi-core execution), automated software testing frameworks, and common benchmarks to drive their evaluation. This project will develop (1) performance analysis tools that help Python programmers accurately identify the sources of slowdowns; (2) techniques for automatically identifying code that can be replaced by calls to C/C++ libraries; (3) an approach to unlocking parallelism in Python threads, which currently must execute sequentially due to a global interpreter lock; and (4) automatic techniques to drastically reduce the memory footprints of Python applications. To improve the correctness of Python applications, the project will develop novel automated testing techniques that (1) augment property-based random testing with coverage-guided fuzzing; (2) employ concolic execution for smarter test generation and input minimization; (3) synthesize property-specific generator functions; (4) leverage statistical clustering techniques to reduce duplicated failure-inducing inputs; and (5) leverage parallelism and adaptive scheduling algorithms to increase testing throughput. The project will develop a set of "bug benchmarks" -- indeed, a novel benchmark-producing methodology -- to evaluate these techniques. The twin threads of performance and correctness are synergistic and complementary: automatic testing drives performance analysis, while performance optimizations (like parallelism) speed automatic testing. This award is co-funded by the Software & Hardware Foundations Program in the Division of Computer & Computing Foundations, and the NSF Office of Advanced Cyberinfrastructure. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →