🤖 AI Summary
UI snapshot testing suffers from high false-positive rates because interfaces change frequently, and manually distinguishing genuine regression defects from intentional design iterations is labor-intensive and costly. To address this, we propose the first automated framework for semantic-level snapshot difference analysis. Our method leverages vision-language models (VLMs), notably Gemma3, to perform hierarchical classification of visual changes, enabling fine-grained root-cause identification (e.g., layout adjustments, styling updates, functional bugs). A configurable feature-flag mechanism provides precise ground-truth labels for constructing a high-quality, semantically annotated difference dataset. Experimental results show that a 12B-parameter VLM achieves 84.3% recall in root-cause classification, while a 4B variant meets the latency requirements of CI/CD pipelines. This work substantially reduces manual review effort and advances intelligent UI testing toward semantic awareness.
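The feature-flag idea can be sketched in a few lines: because each dataset example is produced by toggling exactly one flag against a baseline render, the toggled flag itself serves as the ground-truth label for the resulting snapshot difference. This is a minimal illustrative sketch, not the paper's implementation; the names (`SnapshotPair`, `label_pairs`) and the stubbed renderer are assumptions.

```python
# Illustrative sketch: pair baseline/variant snapshots by the single feature
# flag that was toggled, so each diff carries an unambiguous ground-truth label.
from dataclasses import dataclass

@dataclass(frozen=True)
class SnapshotPair:
    baseline: str      # path to the reference snapshot
    variant: str       # path to the snapshot taken with one flag toggled
    toggled_flag: str  # ground-truth label for this difference

def label_pairs(flags, render):
    """Render the UI once per toggled flag; the flag itself is the label."""
    baseline = render(enabled=set())
    return [
        SnapshotPair(baseline, render(enabled={flag}), toggled_flag=flag)
        for flag in flags
    ]

# Stubbed renderer that just names the output image after the enabled flags.
pairs = label_pairs(
    ["dark_mode", "compact_layout"],
    render=lambda enabled: f"snap_{'_'.join(sorted(enabled)) or 'base'}.png",
)
print([p.toggled_flag for p in pairs])  # ['dark_mode', 'compact_layout']
```

Toggling one flag at a time is what makes the labels precise: a snapshot diff can never be attributed to more than one cause.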
📝 Abstract
Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it carries substantial maintenance overhead: frequent UI changes cause test failures that require manual inspection to distinguish genuine regressions from intentional design changes. This manual triage becomes increasingly burdensome as applications evolve, creating a need for automated analysis. This paper introduces LLMShot, a novel framework that leverages vision-based Large Language Models to automatically analyze snapshot test failures through hierarchical classification of UI changes. To evaluate LLMShot's effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance: the 12B variant achieves over 84% recall in identifying failure root causes, while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches to controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.