K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

2026-06-01

AI Summary
This study addresses the lack of a Korean-language benchmark for evaluating the compositional capabilities of web-browsing agents in realistic settings. It introduces K-BrowseComp, the first Korean web-browsing evaluation benchmark, comprising 400 tasks: 300 manually constructed and validated by native Korean speakers, and 100 generated through adversarial synthesis to stress-test models. The construction combines human validation with adversarial data generation, using hard few-shot exemplars, failure-mode-targeted generation, and adversarial filtering. The results show substantial performance gaps: frontier models reach only 30.00–45.67% accuracy on the human-curated subset, while Korean LLMs score just 0.00–10.33%; on the adversarial subset, the best model reaches only 26.00%, pointing to significant limitations in language-specific adaptation.
Abstract
Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
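To make the reported percentages concrete, the sketch below converts them into implied counts of correctly answered problems. It assumes that accuracy is the plain fraction of problems answered correctly over the split size (300 for K-BrowseComp-Verified, 100 for the synthetic split); the abstract does not state the scoring rule, and the function names here are hypothetical, not taken from the released code.

```python
def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Nearest whole number of correct answers consistent with a reported accuracy.

    Assumption: accuracy = 100 * correct / n_problems (simple fraction correct).
    """
    return round(accuracy_pct / 100 * n_problems)


def accuracy_pct(correct: int, n_problems: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / n_problems, 2)


# Split sizes are from the abstract; the counts are derived, not reported.
VERIFIED_SIZE = 300
SYNTHETIC_SIZE = 100

best_frontier = implied_correct(45.67, VERIFIED_SIZE)   # strongest frontier result
worst_frontier = implied_correct(30.00, VERIFIED_SIZE)  # weakest listed frontier result
best_korean = implied_correct(10.33, VERIFIED_SIZE)     # strongest Korean LLM result
best_synthetic = implied_correct(26.00, SYNTHETIC_SIZE) # strongest result on synthetic split

print(best_frontier, worst_frontier, best_korean, best_synthetic)
```

Under that assumption, the frontier range of 30.00–45.67% corresponds to roughly 90 to 137 of the 300 verified problems, and the best Korean LLM result of 10.33% to about 31 problems.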
Problem

Research questions and friction points this paper is trying to address.

web browsing agent
Korean context
agent benchmark
compositional reasoning
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

web browsing agent
Korean benchmark
compositional agentic evaluation
adversarial synthetic generation
human-verified dataset