AI Summary
This work addresses a critical limitation in current intent-based networking (IBN) systems, which support only textual input and struggle to interpret optimization intents expressed through the structured network sketches engineers commonly use. To bridge this gap, the authors introduce IntentOpt, a benchmark that explores, for the first time, the feasibility of leveraging vision-language models (such as GPT-5-Mini and Claude-Haiku-4.5) within a Program-of-Thought prompting paradigm and the Model Context Protocol framework to generate executable optimization code end-to-end from annotated network sketches. Experimental results show that visual input reduces success rates by 12-21 percentage points, and that closed-source models significantly outperform open-source counterparts: GPT-5-Mini achieves a 72% success rate and is successfully deployed on a real-world network testbed, establishing a foundation for multimodal intent understanding in IBN.
Abstract
Intent-Based Networking (IBN) allows operators to specify high-level network goals rather than low-level configurations. While recent work demonstrates that large language models can automate configuration tasks, a distinct class of intents requires generating optimization code to compute provably optimal solutions for traffic engineering, routing, and resource allocation. Current systems assume text-based intent expression, requiring operators to enumerate topologies and parameters in prose. Network practitioners naturally reason about structure through diagrams, yet whether Vision-Language Models (VLMs) can process annotated network sketches into correct optimization code remains unexplored. We present IntentOpt, a benchmark of 85 optimization problems across 17 categories, evaluating four VLMs (GPT-5-Mini, Claude-Haiku-4.5, Gemini-2.5-Flash, Llama-3.2-11B-Vision) under three prompting strategies on multimodal versus text-only inputs. Our evaluation shows that visual parameter extraction reduces execution success by 12-21 percentage points (pp), with GPT-5-Mini dropping from 93% to 72%. Program-of-thought prompting decreases performance by up to 13 pp, and open-source models lag behind closed-source ones, with Llama-3.2-11B-Vision reaching 18% compared to 75% for GPT-5-Mini. These results establish baseline capabilities and limitations of current VLMs for optimization code generation within an IBN system. We also demonstrate practical feasibility through a case study that deploys VLM-generated code to network testbed infrastructure using the Model Context Protocol.
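The abstract does not include code, but the kind of "optimization intent" it describes can be sketched with a toy example: given a small annotated topology (link capacities and latencies, invented here for illustration and not taken from the paper), select the path that minimizes total latency while satisfying a bandwidth demand. A real IntentOpt task would have a VLM generate code like this from a network sketch; this hand-written sketch only illustrates the shape of such output.

```python
import itertools

# Hypothetical annotated topology: (u, v) -> (capacity_mbps, latency_ms).
# Numbers are invented for illustration, not drawn from the benchmark.
links = {
    ("A", "B"): (100, 5), ("B", "D"): (100, 5),
    ("A", "C"): (50, 2),  ("C", "D"): (50, 2),
    ("B", "C"): (80, 1),
}

def link(u, v):
    """Undirected lookup of a link's (capacity, latency)."""
    return links.get((u, v)) or links.get((v, u))

def optimal_path(src, dst, demand_mbps, nodes=("A", "B", "C", "D")):
    """Brute-force the minimum-latency path with enough capacity.

    Exhaustive search is exact (provably optimal) on a tiny topology;
    code generated for realistic instances would instead formulate an
    LP/ILP and call a solver.
    """
    best = None
    inner = [n for n in nodes if n not in (src, dst)]
    for r in range(len(inner) + 1):
        for mid in itertools.permutations(inner, r):
            path = (src, *mid, dst)
            hops = list(zip(path, path[1:]))
            # Skip paths with missing links or insufficient capacity.
            if any(link(u, v) is None for u, v in hops):
                continue
            if min(link(u, v)[0] for u, v in hops) < demand_mbps:
                continue
            latency = sum(link(u, v)[1] for u, v in hops)
            if best is None or latency < best[0]:
                best = (latency, path)
    return best

print(optimal_path("A", "D", 60))  # → (10, ('A', 'B', 'D'))
```

For a 60 Mbps demand the direct low-latency route A-C-D is capacity-infeasible (50 Mbps links), so the optimizer falls back to A-B-D; dropping the demand to 40 Mbps makes A-C-D optimal at 4 ms.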