CRII: SaTC: Automatic Generation of API to Natural Language Data Type Mappings for Developer and End User Privacy Risk Mitigation
University of Texas at San Antonio, San Antonio, TX
Abstract
Since the advent of the smartphone, an increasing share of the population has gained access to Internet-connected software applications (apps). This, coupled with the various sensors available on mobile devices, makes the general public highly susceptible to privacy risks, as sensitive information (e.g., location, camera images, biometrics) may be leaked to the Internet. To help users make informed decisions about the potential privacy risks of using apps, regulators increasingly require app developers to include privacy policies communicating what information is collected or shared and how that information is used. However, even when such privacy policies are present, users must trust app developers to adhere to the promises therein. Furthermore, developers are accountable for adhering to their policies and must be confident that their privacy policies accurately represent their practices. This project aims to assist both developers and general app users in verifying the alignment of privacy policies and the apps they represent by producing an automated process for linking the semantics of the language used in privacy policies with the code used to produce the apps themselves. Furthermore, the project will use this framework to build tools from which end users and developers can benefit directly. The research project aims to produce an automated process for generating mappings between code-level APIs and natural language data types using machine learning. The resulting mappings will be used in developer and end user tools to identify and help mitigate potential privacy leakage during development and app usage. Currently, detecting misalignment between privacy policies and app code requires manually generating mappings from code-level Application Program Interface (API) methods to privacy-oriented natural language data types.
Even for small app categories, this process can require a human to review thousands of methods and hundreds of annotations, creating the potential for inaccuracies due to fatigue and incomplete domain knowledge. APIs also change as methods are introduced and deprecated, resulting in outdated mappings. These problems make it difficult to apply the framework in practice as the environment continually evolves. This project will address these challenges through two contributions. First, machine learning will be applied to the mapping generation process to produce an automated, scalable method for generating code-phrase mappings for APIs as needed. This will allow misalignment detection for API levels, methods, and app categories beyond those built in previous contributions. This automated approach will make use of state-of-the-art pre-trained language models to detect semantic similarity between API documentation and the natural language data types used in privacy policies. Second, the resulting mappings from the automated model will be applied to practical developer and end user tools to enable informed decisions for privacy risk mitigation. The PoliDroid tool suite will be developed, including a developer-oriented integrated development environment plugin that detects potential unintended privacy leaks based on a privacy policy, and a real-time misalignment detection tool for end users. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
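
The mapping step described in the abstract, which scores each API method's documentation against each natural language data type and keeps the closest match, can be illustrated with a minimal sketch. This is not the project's method: the abstract proposes pre-trained language models, whereas the sketch below substitutes a plain bag-of-words cosine similarity so that it runs with the Python standard library alone. The function names, the documentation strings, and the data type descriptions are all hypothetical and written only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split camelCase boundaries, lowercase, and return alphabetic word tokens."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return re.findall(r"[a-z]+", spaced.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_apis_to_data_types(api_docs, data_types):
    """Map each API method to its most similar data type and the score.

    Stand-in assumption: word overlap replaces the semantic similarity
    that a pre-trained language model would provide.
    """
    type_vectors = {
        name: Counter(tokenize(name + " " + description))
        for name, description in data_types.items()
    }
    mappings = {}
    for method, doc in api_docs.items():
        doc_vector = Counter(tokenize(doc))
        scores = {name: cosine(doc_vector, vec) for name, vec in type_vectors.items()}
        best = max(scores, key=scores.get)
        mappings[method] = (best, scores[best])
    return mappings


# Hypothetical inputs: the descriptions are illustrative, not quoted documentation.
api_docs = {
    "LocationManager.getLastKnownLocation":
        "Gets the last known location fix from the given provider",
    "TelephonyManager.getDeviceId":
        "Returns the unique device identifier such as the IMEI",
}
data_types = {
    "location": "geographic location of the device",
    "device identifier": "unique identifier of the device",
}

mappings = map_apis_to_data_types(api_docs, data_types)
for method, (data_type, score) in mappings.items():
    print(f"{method} -> {data_type} ({score:.2f})")
```

Word overlap misses synonyms and paraphrases (for example, "GPS coordinates" versus "location"), which is the kind of gap a learned semantic similarity model would be expected to close.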