Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high cost of manually annotating daily physical activities in free-living images captured by wearable cameras. It presents the first systematic evaluation of open-source vision-language models (VLMs) for activity recognition under cross-regional, real-world conditions. Methodologically, the authors use VLMs for zero-shot and few-shot inference and benchmark them against fine-tuned discriminative models, quantifying performance with the F1 score and Cohen's kappa. Results show that the best-performing VLM achieves a median F1 of 0.89 for sedentary behaviour on the Oxfordshire dataset, demonstrating strong discriminative capability. However, performance degrades substantially on cross-domain data from Sichuan, revealing a significant domain-shift challenge. The key contributions are: (1) establishing VLMs as a viable, low-cost alternative to manual annotation for identifying sedentary behaviour, and (2) providing the first empirical evidence of their geographical generalisation bottleneck, thereby establishing a critical baseline and motivation for future domain adaptation research.
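As a rough illustration of the per-participant evaluation protocol summarised above (median F1 per intensity category and Cohen's kappa across held-out participants), here is a minimal sketch using scikit-learn and NumPy. The function name, data layout, and percentile-based intervals are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of per-participant evaluation: median F1 per intensity
# category and Cohen's kappa across held-out participants. The data layout
# and interval choice are assumptions for illustration, not the paper's code.
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

CLASSES = ["sedentary", "light", "moderate-to-vigorous"]

def summarise(per_participant):
    """per_participant: list of (y_true, y_pred) label arrays, one pair per participant."""
    f1s = {c: [] for c in CLASSES}
    kappas = []
    for y_true, y_pred in per_participant:
        for c in CLASSES:
            # One-vs-rest F1 for this intensity category.
            f1s[c].append(f1_score(y_true, y_pred, labels=[c], average="macro"))
        # Chance-corrected agreement across all categories for this participant.
        kappas.append(cohen_kappa_score(y_true, y_pred, labels=CLASSES))
    summary = {c: np.percentile(f1s[c], [50, 25, 75]) for c in CLASSES}
    summary["kappa"] = np.percentile(kappas, [50, 25, 75])
    return summary  # median with lower/upper percentiles per metric
```

The intervals reported alongside the medians in the abstract (e.g. 0.89 (0.84, 0.92)) summarise spread across participants; the exact interval definition is not stated in this card, so the percentiles above are only a placeholder.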

📝 Abstract
Introduction: Data from wearable devices collected in free-living settings, and labelled with physical activity behaviours compatible with health research, are essential for both validating existing wearable-based measurement approaches and developing novel machine learning approaches. One common way of obtaining these labels relies on laborious annotation of sequences of images captured by cameras worn by participants through the course of a day.
Methods: We compare the performance of three vision-language models and two discriminative models on two free-living validation studies with 161 and 111 participants, collected in Oxfordshire, United Kingdom and Sichuan, China, respectively, using the Autographer (OMG Life, defunct) wearable camera.
Results: We found that the best open-source vision-language model (VLM) and fine-tuned discriminative model (DM) achieved comparable performance when predicting sedentary behaviour from single images on unseen participants in the Oxfordshire study; median F1 scores: VLM = 0.89 (0.84, 0.92), DM = 0.91 (0.86, 0.95). Performance declined for light (VLM = 0.60 (0.56, 0.67), DM = 0.70 (0.63, 0.79)) and moderate-to-vigorous intensity physical activity (VLM = 0.66 (0.53, 0.85); DM = 0.72 (0.58, 0.84)). When applied to the external Sichuan study, performance fell across all intensity categories, with median Cohen's kappa scores falling from 0.54 (0.49, 0.64) to 0.26 (0.15, 0.37) for the VLM, and from 0.67 (0.60, 0.74) to 0.19 (0.10, 0.30) for the DM.
Conclusion: Freely available computer vision models could help annotate sedentary behaviour, typically the most prevalent activity of daily living, from wearable camera images within similar populations to seen data, reducing the annotation burden.
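To make the zero-shot setup concrete, below is a minimal sketch that labels a single wearable-camera image with one of the three intensity categories using Hugging Face's zero-shot image-classification pipeline. The checkpoint, image path, and candidate label phrasings are illustrative assumptions; the actual open-source VLMs and prompts used in the paper are not specified in this card.

```python
# Hedged sketch: zero-shot labelling of one wearable-camera image into
# activity-intensity categories. Checkpoint, file path, and label phrasings
# are illustrative assumptions, not the models or prompts used in the paper.
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",  # assumed open checkpoint for illustration
)

candidate_labels = [
    "a person who is sitting or lying down (sedentary behaviour)",
    "a person doing light activity such as standing or slow walking",
    "a person doing moderate-to-vigorous physical activity",
]

scores = classifier("camera_frame.jpg", candidate_labels=candidate_labels)
best = max(scores, key=lambda s: s["score"])
print(best["label"], round(best["score"], 3))
```

A fine-tuned discriminative baseline, by contrast, would replace the text prompts with a fixed classification head trained on annotated images, which is the main design difference the paper evaluates.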
Problem

Research questions and friction points this paper is trying to address.

Reducing manual annotation of wearable camera images
Comparing vision-language models for activity recognition
Validating models across diverse populations and activities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models for activity annotation
Compares VLMs and discriminative models performance
Reduces annotation burden in free-living studies
Abram Schonfeldt
Department of Population Health, University of Oxford, Oxford, United Kingdom
Benjamin Maylor
Department of Population Health, University of Oxford, Oxford, United Kingdom
Xiaofang Chen
School of Epidemiology and Health Statistics, Chengdu Medical College, Sichuan, China
Ronald Clark
University of Oxford
Computer Vision, Robotics, Machine Learning, Optimisation
Aiden Doherty
University of Oxford
wearable sensors, health data science