🤖 AI Summary
Existing content moderation systems lack spatial grounding and interpretability: they cannot identify what sensitive behavior occurs, who is involved, or where it takes place. This work introduces SenBen, the first large-scale scene graph benchmark for sensitive content, and distills a frontier VLM into a compact 241M student model using suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss. The recipe improves SenBen Recall by 6.4 percentage points over standard cross-entropy training. On grounded scene graph metrics, the student outperforms all commercial safety APIs and all evaluated VLMs except the Gemini models, and it achieves the highest object detection and captioning scores among all models, with 7.6× faster inference and 16× less GPU memory.
📝 Abstract
Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes; 28 attributes, including affective states such as pain, fear, aggression, and distress; and 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all commercial safety APIs and all evaluated VLMs except the Gemini models, while achieving the highest object detection and captioning scores across all models, with 7.6× faster inference and 16× less GPU memory.
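The abstract names an asymmetric loss for the tag head but does not define it. As a rough illustration only, the sketch below follows the commonly used asymmetric loss formulation for multi-label classification (separate focusing exponents for positives and negatives, plus a probability margin that zeroes out easy negatives). The function name `asymmetric_loss`, the default hyperparameters, and the plain-Python form are assumptions made for this example; they are not taken from the paper and may differ from what the authors use.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Mean asymmetric loss over per-tag probabilities (illustrative sketch).

    probs   -- predicted probabilities in [0, 1], one per tag
    targets -- 0/1 labels, one per tag
    The defaults are assumed values, not the paper's settings.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive tag: focal-style term, down-weighted by (1 - p)^gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative tag: shift the probability down by the margin so that
            # easy negatives (p <= margin) contribute no loss at all.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total / len(probs)


# Hypothetical scores for four of the sensitivity tags; only the first is present.
loss = asymmetric_loss([0.7, 0.02, 0.4, 0.9], [1, 0, 0, 0])
print(f"asymmetric loss: {loss:.4f}")
```

With a larger exponent on the negative branch than on the positive one, the many absent tags in each frame are down-weighted relative to the few present ones, which is the usual motivation for this kind of loss when labels are sparse.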