AI Summary
This study addresses the absence of a Korean-language benchmark for evaluating web-browsing agents' compositional intelligence in realistic settings. We introduce K-BrowseComp, the first Korean web-browsing evaluation benchmark, comprising 400 tasks: 300 manually constructed and validated by native speakers, and 100 generated via adversarial synthesis to stress-test model capabilities. The approach combines human validation with adversarial data construction, using few-shot hard-example prompting, failure-mode-targeted design, and adversarial filtering. Experimental results reveal substantial performance gaps: state-of-the-art models achieve only 30.00–45.67% accuracy on the human-curated subset, while Korean-native models score only 0.00–10.33%. On the adversarial subset, the best performance is merely 26.00%, highlighting significant limitations in language-specific adaptation.
Abstract
Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
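The abstract does not spell out how the adversarial filtering step works. As a minimal sketch, assuming it means discarding generated problems that reference solver models already answer correctly, the idea could look like the following. The record layout (`question`, `answer`), the `normalize` helper, the `max_solved` threshold, and the exact-match scoring are all hypothetical choices made for illustration, not the authors' released pipeline.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical record layout: {"question": str, "answer": str}
Problem = Dict[str, str]
# Hypothetical solver interface: maps a question to a predicted answer string
Solver = Callable[[str], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def adversarial_filter(
    candidates: Iterable[Problem],
    solvers: List[Solver],
    max_solved: int = 0,
) -> List[Problem]:
    """Keep candidates that at most `max_solved` reference solvers answer correctly."""
    kept = []
    for problem in candidates:
        gold = normalize(problem["answer"])
        solved = sum(
            normalize(solver(problem["question"])) == gold for solver in solvers
        )
        if solved <= max_solved:
            kept.append(problem)
    return kept


def accuracy(problems: Iterable[Problem], solver: Solver) -> float:
    """Exact-match accuracy in percent, rounded to two decimals."""
    problems = list(problems)
    if not problems:
        return 0.0
    correct = sum(
        normalize(solver(p["question"])) == normalize(p["answer"]) for p in problems
    )
    return round(100.0 * correct / len(problems), 2)
```

Under this assumption, a split that survives filtering is hard by construction for the solvers used to filter it, which is one reason to report its accuracy separately from the human-verified subset, as the abstract does.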