🤖 AI Summary
This work addresses key limitations in existing GUI grounding methods: inefficient use of training data, the trade-off between retaining context and suppressing redundancy when cropping, and decision procedures that are too complex for compact models and inflate inference time. To overcome these challenges, the authors propose GUI-C², a framework that uses the GUI-D data mining pipeline to identify high-value training samples, and that combines difficulty-aware training weights with an area-gating mechanism to enable an adaptive coarse-to-fine localization strategy. By using model-internal uncertainty signals to dynamically rescale visual regions, and by pairing refined stage-wise rewards with a lightweight decision process, the method improves grounding accuracy while substantially reducing inference cost. Experimental results show that GUI-C² achieves state-of-the-art performance across multiple benchmarks with markedly lower additional inference time.
📝 Abstract
Existing agentic reinforcement learning methods for GUI grounding are limited at two levels. At the data level, current approaches typically treat all training samples equally, even though a sample's training value to the baseline model varies with its difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance cropping larger regions for sufficient context against cropping smaller ones to reduce redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for models with few parameters and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies training-worthy samples through appropriate testing and assigns difficulty scores that guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field using model-internal uncertainty signals, adaptively preserving context for large targets while amplifying precision for small ones. The mechanism is reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. We also simplify the decision-making process to greatly reduce additional inference time. Extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.
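To make the two ideas in the abstract more concrete, the following Python sketch illustrates one way a difficulty weight and a single area-gated refinement step could look. It is not the authors' implementation. The abstract does not specify any formulas, so everything here is an assumption made for illustration: the names `Box`, `difficulty_weight`, and `next_view`, the use of a baseline pass rate as the difficulty signal, the choice to give always-solved and never-solved samples zero weight, the gate threshold `area_gate`, and the rule that crop padding grows linearly with an uncertainty value in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (left, top, right, bottom)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def area(self) -> float:
        return max(self.width, 0.0) * max(self.height, 0.0)


def difficulty_weight(num_correct: int, num_trials: int) -> float:
    """Hypothetical difficulty score: failure rate of the baseline model.

    Assumption of this sketch: samples the baseline always solves or never
    solves get weight 0, i.e. they are treated as not training-worthy.
    """
    if num_trials <= 0:
        raise ValueError("num_trials must be positive")
    pass_rate = num_correct / num_trials
    if pass_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - pass_rate


def next_view(view: Box, coarse: Box, uncertainty: float,
              area_gate: float = 0.05,
              min_pad: float = 0.5, max_pad: float = 2.0) -> Box:
    """One hypothetical area-gated coarse-to-fine step.

    view        -- current visual field
    coarse      -- coarse prediction inside the view
    uncertainty -- assumed model-internal signal, scaled to [0, 1]
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    if view.area <= 0.0:
        raise ValueError("view must have positive area")

    # Gate: a target that already fills enough of the view keeps its context.
    if coarse.area / view.area >= area_gate:
        return view

    # Small target: zoom in. Padding (as a multiple of the target size)
    # grows with uncertainty, so less confident predictions keep more context.
    pad = min_pad + (max_pad - min_pad) * uncertainty
    pad_x = coarse.width * pad
    pad_y = coarse.height * pad
    return Box(
        max(view.x0, coarse.x0 - pad_x),
        max(view.y0, coarse.y0 - pad_y),
        min(view.x1, coarse.x1 + pad_x),
        min(view.y1, coarse.y1 + pad_y),
    )
```

In this sketch, a small icon predicted inside a full screenshot triggers a crop whose margin depends on the uncertainty value, while a large panel passes through the gate unchanged. Repeating `next_view` on the cropped region would give a multi-stage refinement; the stage rewards mentioned in the abstract are a training-time component and are not modeled here.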