Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

📅 2026-06-08

🤖 AI Summary

Large code-generation models commonly produce visual defects, such as layout corruption and text truncation, because they write code without seeing the rendered output. This work proposes Visual-SDPO, a self-distillation framework that supplies post-render visual feedback as privileged context to a weight-sharing teacher and distills it into the coding student; when execution fails, the execution error serves as the teacher's privileged context instead. A Visual-Grounded Code Credit Weighting mechanism traces each visual defect back to the source statements responsible for it and amplifies the distillation signal on those statements. A sequence-level GRPO reward additionally favors executable, visually high-quality outputs. Built on a unified Qwen3-VL-8B-Instruct backbone, the method improves on the zero-shot base model by more than 10 absolute points in the primary metric on ChartMimic, Design2Code, and AeSlides, and outperforms GRPO by at least 2.4 points, with fewer training steps and no additional inference-time cost.
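The credit-weighting idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Defect` schema, the `element_to_statements` mapping, the `boost` factor, and the assumption that per-token distillation losses are already computed are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    """A detected visual defect (hypothetical schema, not from the paper)."""
    kind: str           # e.g. "text_clipped" or "overlap"
    element_ids: list   # ids of the rendered elements the defect affects


def statement_weights(num_statements, defects, element_to_statements, boost=2.0):
    """Return one weight per code statement.

    Statements start at 1.0; any statement that produced an element
    involved in a defect is raised to `boost` (an assumed constant).
    """
    weights = [1.0] * num_statements
    for defect in defects:
        for element_id in defect.element_ids:
            for idx in element_to_statements.get(element_id, []):
                weights[idx] = boost
    return weights


def weighted_distillation_loss(token_losses, token_statement, weights):
    """Weighted mean of per-token distillation losses.

    token_losses[i]    : teacher/student divergence at token i (assumed given)
    token_statement[i] : index of the statement that token i belongs to
    """
    token_weights = [weights[s] for s in token_statement]
    total = sum(w * loss for w, loss in zip(token_weights, token_losses))
    return total / sum(token_weights)


# Toy example: three statements, the second one draws a clipped title.
defects = [Defect("text_clipped", ["title"])]
weights = statement_weights(3, defects, {"title": [1]})
loss = weighted_distillation_loss([0.1, 0.4, 0.1], [0, 1, 2], weights)
```

In this toy case the tokens of the statement that draws the clipped title count twice as much as the others, so the loss is pulled toward the part of the program that caused the defect rather than spread uniformly over the sequence.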

📝 Abstract

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that non-differentiable renderers execute, so the model commits to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points on the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.
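The sequence-level term and the handling of failed executions can be sketched as follows. This is a hedged illustration rather than the paper's code: the group-relative advantage follows the usual GRPO normalization (reward minus group mean, divided by group standard deviation), while the rollout dictionary fields, the use of the population standard deviation, and the prompt prefixes are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its rollout group.

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def teacher_context(rollout):
    """Pick the privileged context shown to the teacher for one rollout.

    `rollout` is a hypothetical dict with keys "error" and "visual_feedback".
    A failed execution contributes its error message; a successful one
    contributes the feedback derived from the rendered artifact.
    """
    if rollout["error"] is not None:
        return "Execution error:\n" + rollout["error"]
    return "Visual feedback:\n" + rollout["visual_feedback"]


# Toy group of three rollouts for the same prompt (reward values are made up).
rewards = [0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)

failed = {"error": "NameError: name 'ax' is not defined", "visual_feedback": None}
context = teacher_context(failed)
```

A rollout that crashes would typically receive a low sequence-level reward, yet it still yields a dense training signal because the teacher sees the error text, which matches the abstract's point that failed executions remain learnable through the self-distillation path.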

Problem

Research questions and friction points this paper is trying to address.

visual artifacts

code generation

visual defects

non-differentiable renderers

LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Distillation

Visual Feedback

Policy Optimization