🤖 AI Summary
This study addresses fine-grained modeling of user trust and privacy concerns in mobile health applications. To advance research at the intersection of health informatics and natural language processing (NLP), we construct and publicly release HARPT, a large-scale annotated corpus of over 480,000 mobile health app store reviews labeled with seven categories covering trust in applications, trust in providers, and privacy concerns. Methodologically, the annotation strategy combines rule-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision with Transformer-based classifiers, so that labeling scales beyond what manual annotation alone could cover. We benchmark a range of classification models using a carefully curated, manually annotated subset of 7,000 reviews, establishing strong baselines (up to 92.3% F1-score). HARPT provides both a foundational dataset and a methodological template for research on trust and privacy in mobile health.
📝 Abstract
We present HARPT, a large-scale annotated corpus of mobile health app store reviews aimed at advancing research in user privacy and trust. The dataset comprises over 480,000 user reviews labeled into seven categories that capture critical aspects of trust in applications, trust in providers and privacy concerns. Creating HARPT required addressing multiple complexities, such as defining a nuanced label schema, isolating relevant content from large volumes of noisy data, and designing an annotation strategy that balanced scalability with accuracy. This strategy integrated rule-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision using transformer-based classifiers to accelerate coverage. In parallel, a carefully curated subset of 7,000 reviews was manually annotated to support model development and evaluation. We benchmark a broad range of classification models, demonstrating that strong performance is achievable and providing a baseline for future research. HARPT is released as a public resource to support work in health informatics, cybersecurity, and natural language processing.
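The two automated stages named in the abstract, rule-based filtering and weak supervision, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the keyword list, the 0.9 confidence threshold, the label names, and `toy_classifier` (a stand-in for a fine-tuned transformer classifier) are hypothetical and are not taken from HARPT.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical keyword rule; the actual HARPT filtering rules are not given in the abstract.
KEYWORDS = re.compile(r"\b(privacy|trust|data|secure|security|scam)\b", re.IGNORECASE)


def is_relevant(review: str) -> bool:
    """Rule-based filter: keep a review only if it mentions a keyword."""
    return KEYWORDS.search(review) is not None


def weak_label(
    reviews: List[str],
    classify: Callable[[str], Dict[str, float]],
    threshold: float = 0.9,  # assumed cutoff, not from the paper
) -> List[Tuple[str, Optional[str]]]:
    """Assign a pseudo-label to each relevant review.

    A review gets the classifier's top label only when its probability
    reaches the threshold; otherwise the label is None, marking the
    review as a candidate for manual annotation.
    """
    labeled = []
    for text in reviews:
        if not is_relevant(text):
            continue  # dropped by the rule-based filter
        probs = classify(text)
        label, prob = max(probs.items(), key=lambda kv: kv[1])
        labeled.append((text, label if prob >= threshold else None))
    return labeled


def toy_classifier(text: str) -> Dict[str, float]:
    """Stand-in for a transformer classifier; returns label probabilities."""
    if "privacy" in text.lower():
        return {"privacy_concern": 0.95, "other": 0.05}
    return {"privacy_concern": 0.40, "other": 0.60}


reviews = [
    "Great app, love the colors",
    "I worry about my privacy here",
    "Is my data safe?",
]
result = weak_label(reviews, toy_classifier)
```

In this sketch the first review is discarded by the filter, the second receives a confident pseudo-label, and the third passes the filter but stays unlabeled because the classifier is not confident enough. Routing such low-confidence items back to human annotators is one plausible way to combine weak supervision with iterative manual labeling.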