EAGER: Privacy Preserving Synthetic Graph Generation for System Provenance

$250,003 | FY2023 | CSE | NSF

University Of Texas At Dallas, Richardson TX

Investigators

Kangkook Jee (contact), Bhavani Thuraisingham, Murat Kantarcioglu

Abstract

System provenance has emerged as an area of prominent research in recent years, garnering attention from both academia and industry. The escalating proliferation of Advanced Persistent Threat (APT) campaigns has been a key driver behind this interest. Compounding this issue is the growing dependency on open-source software for supply chain components. The origins and potential threats associated with these components are often unclear and difficult to trace, thereby highlighting the importance of a dynamic security defense built on system provenance data collection. However, the lack of robust and reliable datasets significantly hinders the progress of system provenance research. While data collection itself is costly due to the overhead of deployment and maintenance, the public sharing of existing datasets is also obstructed by potential privacy risks. Because system provenance is granular and comprehensive, the data captured at runtime carries significant sources of privacy leakage. System provenance graphs often carry privacy-sensitive information, not only in their explicit textual and numerical attributes but also in their implicit structural relationships, which can easily span high-order interactions. Distinct in their privacy implications from traditional security datasets, system provenance datasets offer a new perspective for privacy research.

Composed of four separate research thrusts, the project aims to address the discrepancy between the privacy risks associated with system provenance and the high demand for publicly available system provenance datasets. Firstly, to accurately assess and understand the privacy implications of system provenance graphs, a systematic study will be conducted to identify potential privacy risks associated with these graphs. Secondly, various approaches will be explored to construct models for synthetic graph generation, leveraging paths extracted from system provenance graphs. To restructure the synthetically generated paths and reconstruct realistic synthetic graphs, a series of post-processing techniques will be applied. Thirdly, a set of metrics will be designed to measure both the forensic plausibility and the privacy protections provided by the generated synthetic data. Finally, the project will develop synthetic provenance graph generation techniques to encompass a wider range of graph-structured datasets for system and security applications. The project is founded on extensive system provenance datasets from real-world deployments, collected with end-user consent and under the university's IRB review.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
