SCOPE: Real-Time Natural Language Camera Agent at the Edge

📅 2026-06-01

🤖 AI Summary

This work addresses the challenges of deploying language-driven PTZ camera agents at the edge: latency, limited accuracy, and the lack of reproducible evaluation benchmarks. The authors propose SCOPE, a modular edge agent that pairs a small language model planner (Qwen3) with a vision-language model (Moondream or Qwen) to carry out natural-language instructions, perception, and PTZ control in a closed loop, entirely on local edge compute. They release a 536-task benchmark for natural-language PTZ control in a Blender-based simulation and score execution traces with an LM-as-Judge. Across 19 planner-perception combinations, stronger planners substantially reduce hallucinations and improve tool routing; once the planner is sufficiently capable, perception becomes the dominant bottleneck. Mixture-of-Experts models match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks, and quantization adds efficiency with minimal accuracy loss, yielding a practical design point for edge deployment.

📝 Abstract

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.
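The abstract describes a planner that routes instructions to callable control tools and an evaluation built on execution traces. The sketch below illustrates that general pattern only and is not SCOPE's implementation. The tool names (`pan`, `tilt`, `zoom`), the `PTZState` limits, and the `TraceStep` fields are all hypothetical, and the plan is hard-coded where a real agent would obtain it from the SLM.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PTZState:
    """Hypothetical camera state; units and limits are illustrative only."""
    pan: float = 0.0   # degrees
    tilt: float = 0.0  # degrees
    zoom: float = 1.0  # magnification factor


@dataclass
class TraceStep:
    """One executed tool call, kept for later scoring (assumed schema)."""
    tool: str
    args: dict
    ok: bool
    latency_s: float
    result: str


def clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))


def make_tools(state: PTZState) -> Dict[str, Callable[..., str]]:
    """Build a registry of callable control tools bound to one camera state."""
    def pan(degrees: float) -> str:
        state.pan = clamp(state.pan + degrees, -170.0, 170.0)
        return f"pan={state.pan:.1f}"

    def tilt(degrees: float) -> str:
        state.tilt = clamp(state.tilt + degrees, -30.0, 90.0)
        return f"tilt={state.tilt:.1f}"

    def zoom(factor: float) -> str:
        state.zoom = clamp(factor, 1.0, 20.0)
        return f"zoom={state.zoom:.1f}"

    return {"pan": pan, "tilt": tilt, "zoom": zoom}


def run_plan(plan: List[Tuple[str, dict]],
             tools: Dict[str, Callable[..., str]]) -> List[TraceStep]:
    """Execute planner-proposed tool calls and record a trace of each step."""
    trace = []
    for name, args in plan:
        start = time.perf_counter()
        tool = tools.get(name)
        if tool is None:
            # The planner named a tool that does not exist: log it as an error.
            ok, result = False, f"unknown tool: {name}"
        else:
            try:
                ok, result = True, tool(**args)
            except TypeError as exc:
                # Wrong or missing arguments are a separate error mode.
                ok, result = False, f"bad arguments: {exc}"
        trace.append(TraceStep(name, args, ok, time.perf_counter() - start, result))
    return trace


state = PTZState()
tools = make_tools(state)
# Stand-in for planner output; the last call is deliberately invalid.
plan = [("pan", {"degrees": 30.0}), ("zoom", {"factor": 2.0}), ("fly", {})]
trace = run_plan(plan, tools)
```

Recording per-step latency and failure reasons alongside each call is one way to obtain the kind of trace that a judge model or a simple script could later score for latency, accuracy, and error modes.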

Problem

Research questions and friction points this paper is trying to address.

edge deployment

natural language camera control

PTZ camera

real-time perception

language-driven agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

edge AI

natural language camera control

mixture-of-experts