🤖 AI Summary
This study addresses the lack of actionable, code-level guidance for data minimization in Android development, which hinders compliance with privacy regulations. Through empirical analysis of 1,114 open-source applications and 9,875 APKs, the authors identify ten recurring data minimization scenarios across five data-handling stages and, for the first time, translate the principle of data minimization into 31 concrete coding guidelines. Combining large-scale static analysis, empirical software engineering methods, and evaluation of code generated by large language models (LLMs), the research shows that LLM-generated Android code consistently reproduces practices that put data minimization at risk. Incorporating the proposed guidelines eliminates these issues across all evaluated models, offering a practical pathway toward regulatory compliance for both human developers and AI-assisted programming tools.
📝 Abstract
Modern mobile applications consume large amounts of data to function, raising significant privacy concerns and regulatory challenges. While prior work has primarily focused on detecting compliance gaps through policy analysis, developers still lack actionable guidance for implementing privacy principles at the code level. In this paper, we focus on data minimization as a principle that developers can operationalize, and we investigate how it is realized in Android applications. We conduct a formative study of 1,114 open-source Android apps and identify ten recurring data minimization scenarios across five data-handling stages. Building on this, we perform a large-scale analysis of 9,875 real-world APKs and distill 31 actionable coding guidelines to support privacy-compliant development. We further examine LLM-based code generation in Android development and find that state-of-the-art models consistently reproduce practices that put data minimization at risk, indicating that they inherit and amplify patterns from real-world code. Encouragingly, incorporating our guidelines eliminates these issues across all evaluated models. Our work advocates a shift toward addressing privacy regulatory requirements at their code-level root causes, enabling better compliance in both human and AI-assisted programming.