A2Perf: Real-World Autonomous Agents Benchmark

📅 2025-03-04

🤖 AI Summary

Research on autonomous agents lacks a unified, real-world-oriented benchmark for fairly evaluating generalization, reliability, and hardware resource efficiency. To address this, we introduce A2Perf, a high-fidelity, multi-task benchmark covering chip floorplanning, web navigation, and quadruped robot locomotion. We propose a system-level, multidimensional evaluation framework that includes a data-cost metric for comparing imitation learning and hybrid algorithms on equal terms, and that integrates inference latency profiling, energy consumption monitoring, out-of-distribution (OOD) generalization testing, and reliability evaluation. The framework is open source, extensible, and designed to stay up to date over the long term. Experiments show that (1) web navigation agents achieve response latencies comparable to human reaction times (~200 ms) on consumer-grade hardware; (2) algorithms for quadruped locomotion exhibit reliability trade-offs; and (3) the energy costs of different learning approaches for chip design can be quantified and compared.

📝 Abstract

Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and inference, among other requirements. Several methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is a lack of benchmarking suites that define the environments, datasets, and metrics the community needs to meaningfully compare progress on applying these methods to real-world problems. We introduce A2Perf, a benchmark with three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. Using A2Perf, we demonstrate that web navigation agents can achieve latencies comparable to human reaction times on consumer hardware, reveal reliability trade-offs between algorithms for quadruped locomotion, and quantify the energy costs of different learning approaches for computer chip design. In addition, we propose a data cost metric to account for the cost incurred in acquiring offline data for imitation learning and hybrid algorithms, which allows us to better compare these approaches. A2Perf also contains several standard baselines, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy. As an open-source benchmark, A2Perf is designed to remain accessible, up-to-date, and useful to the research community over the long term.
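
The abstract names inference latency as one of the system metrics. As a rough illustration of what such a measurement involves, the Python sketch below times repeated calls to a policy and summarizes the results in milliseconds. It is not A2Perf's implementation or API: `measure_inference_latency`, `toy_policy`, and the synthetic observations are hypothetical names invented for this example, and only the Python standard library is used.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_inference_latency(
    policy: Callable[[Sequence[float]], int],
    observations: Sequence[Sequence[float]],
    warmup: int = 5,
) -> dict:
    """Time one policy call per observation and summarize in milliseconds."""
    # Warm-up calls are not timed, so one-time setup cost does not skew results.
    for obs in observations[:warmup]:
        policy(obs)

    samples_ms = []
    for obs in observations:
        start = time.perf_counter()
        policy(obs)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        "n": len(samples_ms),
    }


def toy_policy(obs: Sequence[float]) -> int:
    # Hypothetical stand-in for a trained agent: index of the largest feature.
    return max(range(len(obs)), key=lambda i: obs[i])


# Synthetic observations, assumed purely for illustration.
observations = [[0.1 * i, 0.5, 1.0 - 0.1 * i] for i in range(20)]
stats = measure_inference_latency(toy_policy, observations)
print(stats)
```

Reporting the median alongside the mean and maximum is a deliberate choice here: a single slow call (for example, from a garbage-collection pause) can inflate the mean, while the maximum exposes exactly those worst cases.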

Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarking suites for real-world autonomous agents.

Need for metrics to evaluate task performance and resource efficiency.

No common basis for comparing methods on real-world applications, which A2Perf is introduced to provide.

Innovation

Methods, ideas, or system contributions that make the work stand out.

A2Perf benchmark for real-world autonomous agents

Metrics for performance, generalization, and efficiency

Includes environments for chip floorplanning, web navigation, and quadruped locomotion
