SI2-SSI: Pegasus: Automating Compute and Data Intensive Science

$2,500,000 | FY2017 | CSE | NSF

University of Southern California, Los Angeles, CA

Abstract

This project addresses the ever-growing gap between the capabilities offered by on-campus and off-campus cyberinfrastructures (CI) and the ability of researchers to effectively harness these capabilities to advance scientific discovery. Faculty and students on campuses struggle to extract knowledge from data that does not fit on their laptops or cannot be processed by an Excel spreadsheet, and they find it difficult to manage their computations efficiently. The project sustains and enhances the Pegasus Workflow Management System, which enables scientists to orchestrate and run data- and compute-intensive computations on diverse distributed computational resources. Enhancements focus on the automation capabilities that Pegasus provides to support workflows handling large data sets, as well as on usability improvements that lower the barrier to adoption. This effort expands the reach of the advanced capabilities provided by Pegasus to researchers from a broader spectrum of disciplines that range from gravitational-wave physics to bioinformatics, and from earth science to materials science. For more than 15 years the Pegasus Workflow Management System has been designed, implemented, and supported to provide abstractions that enable scientists to focus on structuring their computations without worrying about the details of the target CI. To support these workflow abstractions, Pegasus provides automation capabilities that seamlessly map workflows onto target resources, sparing scientists the overhead of managing the data flow, job scheduling, fault recovery, and adaptation of their applications. Automation enables the delivery of services that consider criteria such as time-to-solution and that take into account efficient use of resources, task throughput, and data transfer requests.
The power of these abstractions was demonstrated in 2015 when Pegasus was used by an international collaboration to harness a diverse set of resources and to manage compute- and data-intensive workflows that confirmed the existence of gravitational waves, as predicted by Einstein's general theory of relativity. Experience from working with diverse scientific domains (astronomy, bioinformatics, climate modeling, earthquake science, gravitational-wave physics, and materials science) uncovers opportunities for further automation of scientific workflows. This project addresses these opportunities through innovation in the following areas: automation methods that include resource provisioning ahead of and during workflow execution, data-aware job scheduling algorithms, and data sharing mechanisms in high-throughput environments. To support a broader group of "long-tail" scientists, effort is devoted to usability improvements as well as outreach, education, and training activities. The proposed work includes the implementation and evaluation of advanced frameworks, algorithms, and methods that enhance the power of automation in support of data-intensive science. These enhancements are delivered as dependable software tools integrated with Pegasus so that they can be evaluated in the context of real-life applications and computing environments. The data-aware focus targets new classes of applications executing in high-throughput and high-performance environments. Pegasus has been adopted by researchers from a broad spectrum of disciplines that range from gravitational-wave physics to bioinformatics, and from earth science to materials science. It provides and enhances access to national CI such as OSG and XSEDE, and as part of this work it will be deployed within Chameleon and Jetstream to provide broader access to NSF's CI investments.
Through usability improvements, engagement with CI and community platform providers such as HUBzero and CyVerse, and educational, training, and tutorial activities, this project broadens the set of researchers who leverage automation for their work. Collaboration with the Gateways Institute ensures that Pegasus interfaces are suitable for vertical integration within science gateways and seamlessly support new scientific communities.