SI2-SSE: An Ecosystem of Reusable Image Analytics Pipelines

$500,000 · FY2017 · CSE · NSF

University of Washington, Seattle, WA

Investigators

Abstract

Astronomy has entered an era of massive data streams generated by telescopes and surveys that can scan tens of thousands of square degrees of the sky across many decades of the electromagnetic spectrum. The promise of these new experiments (characterizing the nature of dark energy and the composition of dark matter, discovering the most energetic events in the universe, tracking asteroids whose orbits may intersect that of the Earth) will only be realized if we can address the challenge of processing and analyzing the tens of petabytes of images that these astronomical surveys will generate per year. With the increasing capacity for scientists to collect ever larger sets of data, often in the form of images, our potential for scientific discovery will soon be limited not by how we collect or store data, but by how we extract the knowledge that these data contain (e.g., how we account for noise inherent in the data and recognize when we have detected fundamentally new and interesting classes of events or physical phenomena).

This project will develop an open-source, scalable framework for the analysis of large imaging data sets. It is designed to operate as a cloud service, seamlessly incorporate new or legacy image processing algorithms, support and optimize complex analysis workflows, and scale analyses to thousands of processors without requiring individual users to develop custom solutions for a specific computing platform or architecture. This framework will be integrated with state-of-the-art image analysis algorithms developed for astronomical surveys to provide an image analytics platform that can be used by future telescopes and cameras and by the astronomical community as a whole. Beyond astronomy, the framework will be extended to enable scientists from the physical and life sciences who make use of imaging data (e.g., neuroscience, oceanography, biology, seismology) to focus their work on developing scientific algorithms and analyses rather than on the infrastructure required to process massive data sets.

Over the last decade, there have been many advances in astronomical image analysis algorithms and techniques, driven by new surveys and experiments. The complexity of these techniques and of the systems that run them has, however, meant that the number of users who make use of these advances is small, typically restricted to the experiments themselves or to a small group of expert users. Because of this, the community as a whole does not benefit from the significant investment in image analytics for astronomy. In this project, the PIs address these issues by developing and deploying a scalable framework for the analysis of small and large imaging datasets. This cloud-based system will be able to incorporate new and legacy image processing algorithms, support and optimize complex analysis workflows, scale applications to thousands of processors without users needing to develop custom code for specific platforms, and support efficient sharing of algorithms and analysis results among users. It will enable state-of-the-art image analysis algorithms (e.g., those developed for surveys such as the Large Synoptic Survey Telescope, LSST) to be used by the broad astronomical community and in so doing will leverage the tens of thousands of hours that have been invested in the development of these techniques.

To accomplish this, the team will extract key data analysis functions from the LSST data analysis pipeline into a standalone library, independent of the LSST software stack and data access mechanisms. They will integrate this library with the Myria big data management system. Myria is an elastically scalable big data management system that the team developed at the University of Washington and that operates as a service in the Amazon cloud.
Compared with other big data systems, Myria is especially attractive because it integrates PostgreSQL database instances within its storage layer and thus provides access to PostgreSQL's rich libraries of spatial functions, which are frequently used in astronomical data analysis pipelines. At the same time, it has rich support for new and legacy Python code and for complex analytics. Integrating the library of LSST image analytics functions with Myria will make new image analytics pipelines significantly easier to write. The skeleton of the analysis pipeline will be expressed in the MyriaL declarative query language (i.e., SQL extended with constructs such as iteration). The core data processing functions will map directly to Python functions, enabling the reuse of legacy code and the easy addition of new functions. The resulting code will be amenable to optimization and efficient execution using the Myria service. In this way, the PIs intend to reduce barriers to adoption: users will be able to express their analysis in Python without worrying about how data and computation are distributed across a cluster.

The image analysis framework developed as part of this project will be made publicly available as open-source software. The PIs will use neuroscience as a use case to demonstrate how their system, developed for astronomy, can be deployed across multiple domains.

This project is supported by the Office of Advanced Cyberinfrastructure in the Directorate for Computer & Information Science and Engineering, and by the Division of Astronomical Sciences and the Office of Multidisciplinary Activities in the Directorate for Mathematical and Physical Sciences.
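
The division of labor described in the abstract, a declarative pipeline skeleton whose steps map to ordinary Python functions, can be illustrated with a minimal sketch. This is not the Myria, MyriaL, or LSST API: every name below (`REGISTRY`, `register`, `run_pipeline`, the two toy image functions) is hypothetical, images are plain nested lists, and the built-in `map` stands in for whatever distributed executor the real service would provide.

```python
from statistics import median
from typing import Callable, Dict, Iterable, List

Image = List[List[float]]  # toy stand-in for a 2-D pixel array

# Hypothetical registry of per-image functions. In the system the abstract
# describes, these would be new or legacy Python analysis routines.
REGISTRY: Dict[str, Callable[[Image], Image]] = {}


def register(name: str):
    """Decorator that exposes a Python function to the pipeline by name."""
    def wrap(fn: Callable[[Image], Image]) -> Callable[[Image], Image]:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("subtract_background")
def subtract_background(img: Image) -> Image:
    # Toy background estimate: the median of all pixel values.
    background = median(p for row in img for p in row)
    return [[p - background for p in row] for row in img]


@register("clip_negative")
def clip_negative(img: Image) -> Image:
    # Replace negative pixel values with zero.
    return [[max(p, 0.0) for p in row] for row in img]


def run_pipeline(steps: List[str], images: Iterable[Image], mapper=map) -> List[Image]:
    """Apply the named steps, in order, to every image.

    `steps` plays the role of the declarative skeleton; `mapper` is the only
    place that decides how work is spread out (assumption: a local map here).
    """
    def apply_all(img: Image) -> Image:
        for name in steps:
            img = REGISTRY[name](img)
        return img

    return list(mapper(apply_all, images))


if __name__ == "__main__":
    frames = [[[1.0, 2.0], [3.0, 10.0]]]
    print(run_pipeline(["subtract_background", "clip_negative"], frames))
```

Because the analysis functions never reference the executor, swapping `mapper` for a parallel or cluster-backed equivalent would leave the user's Python code unchanged, which is the separation of concerns the abstract aims for.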

View original record on NSF Award Search →