Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges in long-form video editing: maintaining narrative coherence across multiple stages, localizing errors precisely, and correcting them locally. The authors propose Crayotter, an open-source multimodal multi-agent system that structures editing into three phases: material preparation, editing research, and timeline execution. Each phase emits inspectable artifacts (coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders), so failed segments can be diagnosed and selectively re-edited without re-running the full pipeline. The authors also describe a replayable trajectory schema and a verifiable reward design intended for future policy optimization. In a human evaluation across 23 editing themes, the system achieves an average rating of 3.40 out of 5, compared with 2.44 and 1.70 for the two baselines, CapCut-Mate and CutClaw, with consistent gains in theme alignment, narrative coherence, and editing smoothness.

📝 Abstract

Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.
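The idea of a traceable run with selective revision can be illustrated with a small sketch. This is not Crayotter's actual schema or API: the `Step` and `Trajectory` classes, their field names, the phase labels, and the tool names (`place_clip`, `add_transition`) are all hypothetical, chosen only to show how logged tool calls could let a failed segment be re-run without restarting the whole edit.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical phase labels, loosely following the three phases in the abstract.
PHASES = ("material_preparation", "editing_research", "timeline_execution")


@dataclass
class Step:
    """One logged tool call (illustrative fields, not the paper's schema)."""
    segment_id: str
    phase: str
    tool: str
    args: dict
    ok: bool = True  # False when a check on the intermediate result failed


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def log(self, step: Step) -> None:
        if step.phase not in PHASES:
            raise ValueError(f"unknown phase: {step.phase}")
        self.steps.append(step)

    def failed_segments(self) -> list:
        """Segment ids with at least one failed step, in first-seen order."""
        seen = []
        for s in self.steps:
            if not s.ok and s.segment_id not in seen:
                seen.append(s.segment_id)
        return seen

    def replay_failed(self, run_tool: Callable[[Step], bool]) -> list:
        """Re-run every logged step of each failed segment, in logged order."""
        targets = set(self.failed_segments())
        rerun = []
        for s in self.steps:
            if s.segment_id in targets:
                s.ok = run_tool(s)  # run_tool returns True on success
                rerun.append(f"{s.segment_id}:{s.tool}")
        return rerun


traj = Trajectory()
traj.log(Step("intro", "timeline_execution", "place_clip", {"clip": "a.mp4"}))
traj.log(Step("middle", "timeline_execution", "place_clip", {"clip": "b.mp4"}, ok=False))
traj.log(Step("middle", "timeline_execution", "add_transition", {"kind": "fade"}))

# A stand-in executor that always succeeds; a real one would invoke the tool.
rerun = traj.replay_failed(lambda step: True)
print(rerun)  # only the "middle" segment is replayed; "intro" is left untouched
```

In this sketch the log is the single source of truth: diagnosis is a query over logged steps, and revision replays only the steps belonging to the affected segment. Whether Crayotter scopes replay per segment, per phase, or per artifact is not stated in the abstract.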

Problem

Research questions and friction points this paper is trying to address.

long-form video editing

narrative coherence

traceable workflow

multi-agent system

failure diagnosis

Innovation

Methods, ideas, or system contributions that make the work stand out.

traceable multi-agent workflow

multimodal video editing

artifact-based editing