ConvFill: Model Collaboration for Responsive Conversational Voice Agents

📅 2025-11-10
🤖 AI Summary
To address the trade-off between the high latency of cloud-based large language models (LLMs) and the limited capability of on-device small models, this paper introduces the conversational-infill (dialogue completion) task, which decouples response latency from model capability. It proposes a streaming knowledge fusion mechanism that enables a lightweight on-device model (360M parameters) to integrate cloud-generated reasoning outputs in real time, achieving end-to-end latency under 200 ms while preserving semantic depth. The model is trained on synthetically generated multi-domain dialogue data and incorporates streaming input processing, dynamic context modeling, and parameter-efficient generation. Experiments across multiple backend LLMs show that this approach improves accuracy by 36–42% over comparably sized on-device baselines while keeping end-to-end latency stably below 200 ms, reconciling real-time interactivity with rich linguistic understanding.
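The streaming fusion loop described above can be sketched in a few lines: the on-device model must emit a token on every step to stay inside its latency budget, while cloud tokens that have streamed in so far are folded into its context. All names and the echo-style "model" below are illustrative stand-ins, not the paper's actual architecture or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class InfillAgent:
    """Toy sketch of a conversational-infill loop (names are hypothetical)."""
    context: List[str] = field(default_factory=list)

    def local_step(self) -> str:
        # Stand-in for the small on-device model: surface the newest cloud
        # knowledge token if any has arrived, otherwise emit a filler token
        # so the agent never stalls waiting on the backend.
        return self.context[-1] if self.context else "<filler>"

    def run(self, cloud_stream: Iterator[List[str]], steps: int) -> List[str]:
        out = []
        for _ in range(steps):
            # Non-blocking read: take whatever the cloud produced this step.
            arrived = next(cloud_stream, [])
            self.context.extend(arrived)   # dynamic context update
            out.append(self.local_step())  # respond within the latency budget
        return out

def cloud_tokens() -> Iterator[List[str]]:
    # Simulated high-latency backend: silent for two steps, then knowledge.
    yield []
    yield []
    yield ["Paris"]

agent = InfillAgent()
print(agent.run(cloud_tokens(), steps=4))
# ['<filler>', '<filler>', 'Paris', 'Paris']
```

The point of the sketch is the control flow, not the models: the local generator never blocks on the cloud, so response latency is set by the small model while later tokens still benefit from backend knowledge.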

📝 Abstract
Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36-42% over standalone small models of the same size while consistently retaining sub-200ms response latencies. Our results demonstrate the promise of this approach for building on-device conversational agents that are both immediately responsive and knowledgeable.
Problem

Research questions and friction points this paper is trying to address.

Cloud-based models cause disruptive latency in conversations
On-device models lack sophisticated reasoning capabilities
Balancing responsiveness with knowledge access remains challenging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining on-device and cloud models for responsive agents
Using conversational infill to decouple latency from capability
Training lightweight models with synthetic multi-domain dialogues
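The decoupling claim in the bullets above can be made concrete with a toy latency calculation. Only the sub-200 ms budget comes from the paper; the specific timings are invented for illustration.

```python
# Illustrative first-token timings in milliseconds (hypothetical values);
# only the 200 ms budget is taken from the paper.
ON_DEVICE_FIRST_TOKEN_MS = 150
CLOUD_FIRST_TOKEN_MS = 900
LATENCY_BUDGET_MS = 200

def perceived_latency(on_device_ms: int, cloud_ms: int) -> int:
    # With conversational infill the user hears the on-device model first,
    # so perceived latency is bounded by the local model, not the cloud.
    return min(on_device_ms, cloud_ms)

print(perceived_latency(ON_DEVICE_FIRST_TOKEN_MS, CLOUD_FIRST_TOKEN_MS))
# 150  -- under the 200 ms budget despite a 900 ms backend
```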