ShotCrop³: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

📅 2026-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image cropping methods produce only a single composition, so they cannot support multi-shot visual storytelling from one human-centric image. This work introduces the Triple-Shot Compositions (TSC) task, which generates establishing, medium, and close-up shots from a single input image, each paired with a brief shot description. The proposed model, ShotCrop³, is trained in three stages: Chain-of-Thought supervised fine-tuning; semi-supervised fine-tuning on high-confidence pseudo-labels filtered by MLLM-based scoring, CLIP similarity, and aesthetic assessment; and Group Relative Policy Optimization for ShotCrop (GRPO-S) with a composite reward. On TSC-Bench, a new benchmark of 1.2k expert-annotated test cases, ShotCrop³ achieves an average 2.82× improvement over GPT-5 in shot localization accuracy.
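As a rough illustration of the pseudo-label filtering step, the sketch below keeps a candidate only when all three signals named above clear a threshold. The `Candidate` fields, the [0, 1] score scaling, the threshold values, and the "all must pass" rule are assumptions made for illustration; this page does not state how the signals are actually combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One pseudo-labeled triple-shot candidate (hypothetical structure)."""
    image_id: str
    mllm_score: float       # assumed: MLLM judge rating rescaled to [0, 1]
    aesthetic_score: float  # assumed: aesthetic predictor output in [0, 1]
    clip_similarity: float  # assumed: CLIP similarity of crop vs. description


def keep_high_confidence(cands, min_mllm=0.8, min_aesthetic=0.7, min_clip=0.3):
    """Return candidates whose three scores all meet their thresholds.

    The thresholds are placeholders, not values from the paper.
    """
    return [
        c for c in cands
        if c.mllm_score >= min_mllm
        and c.aesthetic_score >= min_aesthetic
        and c.clip_similarity >= min_clip
    ]


if __name__ == "__main__":
    pool = [
        Candidate("img_001", 0.92, 0.81, 0.35),
        Candidate("img_002", 0.90, 0.55, 0.36),  # fails the aesthetic threshold
    ]
    print([c.image_id for c in keep_high_confidence(pool)])
```

Requiring every signal to pass is the most conservative choice; a weighted sum of the scores would be an equally plausible reading of "high-confidence".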
📝 Abstract
Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set (establishing, medium, and close-up) from a single human-centric image, with each shot paired with a brief description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which is trained in three stages: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo-labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored to the task. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
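For readers unfamiliar with Group Relative Policy Optimization, the sketch below shows the group-relative advantage at its core: several responses are sampled for the same input, and each reward is normalized by the group's mean and standard deviation. The composite reward used by GRPO-S is not described on this page, so the rewards here are placeholder numbers, and the function name and `eps` parameter are illustrative choices.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four sampled triple-shot outputs of one image.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Responses that score above their group's average receive positive advantages and are reinforced; no separate value network is needed to provide the baseline.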
Problem

Research questions and friction points this paper is trying to address.

multi-shot composition
cinematic cropping
visual narration
triple-shot compositions
human-centric images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Triple-Shot Composition
ShotCrop³
Group Relative Policy Optimization
pseudo-labeling
TSC-Bench
Dehong Kong
Huawei Noah’s Ark Lab, Sun Yat-sen University
Lina Lei
Huawei Noah’s Ark Lab
Lingtao Zheng
Huawei Noah’s Ark Lab
Chenyang Wu
Ph.D. Candidate, LAMDA, Nanjing University
reinforcement learning, artificial intelligence
Ailing Zhang
Huawei Noah’s Ark Lab
Xinran Qin
Huawei Noah’s Ark Lab
Teng Ma
Sun Yat-sen University
Jiaqi Xu
Huawei Noah’s Ark Lab
Zhixin Wang
Zhejiang University
RL systems
Zhikai Chen
Huawei Noah’s Ark Lab
Xuecheng Qi
Huawei Noah’s Ark Lab
Renjing Pei
Huawei Noah’s Ark Lab
Fan Li
Huawei Noah’s Ark Lab