CSR: Small: Empower Data-Intensive Computing: the integrated data management approach

$416,000 | FY2015 | CSE | NSF

Illinois Institute of Technology, Chicago, IL

Investigators

Abstract

From the computer system point of view, there are two types of digital data: observational data, which is collected from sources such as sensors, monitors, cameras, and text; and simulation data, which is generated by computation. The former represents newly emerged, data-driven internet applications, such as social media and data analytics; the latter represents conventional computing-driven applications, such as climate modeling and computational fluid dynamics. In general, the latter requires strong consistency for correctness and the former does not. This difference in consistency requirements has led to two kinds of file systems: data-intensive distributed file systems, represented by the Hadoop Distributed File System (HDFS) used with MapReduce; and computing-intensive file systems, represented by high-performance parallel file systems (PFS) such as the IBM General Parallel File System (GPFS). These two kinds of file systems are designed with different philosophies, for different applications, and do not interoperate. Yet understanding huge amounts of collected data depends on powerful computation, and large-scale computation requires the management of large data. Big data applications therefore demand an integrated solution. The integrated data access system (IDAS) developed under this research is designed to bridge this data management gap. Consistent with the CAP theorem of distributed system design, IDAS is not designed as a new standalone system but as a software layer that provides an integrated interface for cross-platform data access, from HDFS to PFS or from PFS to HDFS, for reads and writes, effectively and interchangeably, without changing users' applications.
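The idea of a software layer with one interface over two file systems can be illustrated with a minimal sketch. This is not the IDAS design or its API: the class names (`Backend`, `InMemoryBackend`, `AccessLayer`), the `hdfs://` and `pfs://` URI schemes, and the in-memory stand-ins for real file system clients are all assumptions made here for illustration only.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Backend(ABC):
    """Hypothetical minimal storage interface (an assumption, not an IDAS API)."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryBackend(Backend):
    """Stand-in for a real client such as an HDFS client or a PFS mount."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


class AccessLayer:
    """Routes each request to the backend registered for the URI scheme."""

    def __init__(self, backends: dict[str, Backend]) -> None:
        self._backends = backends

    def _resolve(self, uri: str) -> tuple[Backend, str]:
        parsed = urlparse(uri)
        if parsed.scheme not in self._backends:
            raise ValueError(f"no backend registered for scheme {parsed.scheme!r}")
        return self._backends[parsed.scheme], parsed.path

    def read(self, uri: str) -> bytes:
        backend, path = self._resolve(uri)
        return backend.read(path)

    def write(self, uri: str, data: bytes) -> None:
        backend, path = self._resolve(uri)
        backend.write(path, data)

    def copy(self, src_uri: str, dst_uri: str) -> None:
        """Cross-platform transfer: read from one backend, write to the other."""
        self.write(dst_uri, self.read(src_uri))


# Usage: the application code is the same whichever file system holds the data.
layer = AccessLayer({"hdfs": InMemoryBackend(), "pfs": InMemoryBackend()})
layer.write("pfs:///scratch/sim_output.dat", b"simulation results")
layer.copy("pfs:///scratch/sim_output.dat", "hdfs:///data/sim_output.dat")
copied = layer.read("hdfs:///data/sim_output.dat")
```

The application only ever calls `read`, `write`, and `copy` on the layer, so moving data between the two systems does not require changing application code. A real layer would also have to reconcile the differing consistency semantics of the two systems, which this sketch does not attempt.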
The development plan for IDAS has three components: 1) establish communication channels so that data can be accessed between HDFS and PFS; 2) design an extended semantic interface so that different file systems can be accessed under different computing systems; 3) develop techniques to optimize I/O operations under HDFS, PFS, and IDAS. Big data requires a joint effort of the data-driven internet computing community and the compute-driven scientific computing community. IDAS provides a sustainable, cost-effective infrastructure for cross-platform, cross-community data storage, access, and sharing services. This research will create solutions and technologies that directly improve the efficiency of data access and management at scale. Because big data is a national strategic infrastructure for science, engineering, and industry, the proposed investigations will advance a broad range of fields. If successful, this research will make significant progress on a timely, important, highly challenging, and high-impact problem: integrated data access.
