🤖 AI Summary
This study addresses a critical gap in the evaluation of autonomous AI systems in real-world interactions, where inadequate assessment methods can expose users to cross-domain risks such as sensitive data leakage, fraud, and cybersecurity breaches. To tackle this challenge, the work presents the first unified agent safety evaluation framework that explicitly incorporates cultural and linguistic diversity, developed through international collaboration. Using tasks drawn from publicly available agentic benchmarks, the framework systematically evaluates the risk-handling capabilities of a range of open- and closed-weight large language models in realistic scenarios. The research identifies fundamental methodological shortcomings in current agent evaluation practices and argues for a shift from comparative performance metrics toward scientifically rigorous, standardized, and reproducible assessment protocols, laying the groundwork for a robust, cross-domain AI agent safety evaluation ecosystem.
📝 Abstract
The rapid rise of autonomous AI systems and advances in agent capabilities are introducing new risks, as real-world interactions receive less direct oversight. Yet agent testing remains a nascent and developing science. As AI agents begin to be deployed globally, it is important that they handle different languages and cultures accurately and securely. To address this, participants from The International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the European Commission, France, Kenya, South Korea, and the United Kingdom, have come together to align approaches to agentic evaluations. This is the third such exercise, building on insights from two earlier joint testing exercises conducted by the Network in November 2024 and February 2025, and its objective is to further refine best practices for testing advanced AI systems. The exercise was split into two strands: (1) common risks, including leakage of sensitive information and fraud, led by Singapore AISI; and (2) cybersecurity, led by UK AISI. A mix of open- and closed-weight models was evaluated on tasks drawn from various public agentic benchmarks. Because agentic testing is still nascent, our primary focus was on understanding the methodological issues in conducting such tests rather than on examining test results or model capabilities. This collaboration marks an important step forward as participants work together to advance the science of agentic evaluations.