MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines whether the genre of a Korean folk painting (minhwa) can be read from an inventory of the symbols it contains, and finds that it cannot. A model given only the symbol list predicts genre far worse than a model that fuses the image with curatorial text, and forcing the genre representation to be object-grounded reduces accuracy. The evidence behind the genre prediction is nonetheless localized: an object evidence map projected from a part-level detector agrees spatially with the objects that curators isolated. The authors call this a “faithful-but-insufficient” dissociation, since genre depends on how symbols are arranged rather than on which ones appear. They also report that a content label (genre) survives transfer to held-out source institutions, whereas a style label (era) does not. The release comprises the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions for long-tailed heritage collections.

📝 Abstract

Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols (a tiger for protection, a pair of birds for marital harmony, a peony for wealth) that recur across many of its painted genres. This suggests an obvious computational approach: identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful both to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation: the part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions (genre) from a style label that does not (era), a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.
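To make the inventory-versus-arrangement point concrete, here is a minimal Python sketch. It is not the authors' model or data: the three-symbol vocabulary is taken from the examples in the abstract, while the two toy paintings, their coordinates, and the helper names `inventory_vector` and `arrangement_vector` are hypothetical. It only shows that a symbol-list encoding cannot distinguish two paintings that contain the same symbols in different layouts.

```python
from typing import Dict, List, Tuple

# Symbol names come from the abstract's examples; the actual corpus
# vocabulary is not stated there, so this list is an assumption.
VOCAB = ["tiger", "bird", "peony"]

# A toy "painting": symbol name -> normalized (x, y) centre of the symbol.
# These coordinates are invented purely for illustration.
Painting = Dict[str, Tuple[float, float]]


def inventory_vector(painting: Painting) -> List[int]:
    """Multi-hot encoding: 1 if the symbol appears anywhere, else 0."""
    return [1 if symbol in painting else 0 for symbol in VOCAB]


def arrangement_vector(painting: Painting) -> List[float]:
    """Presence flag plus centre per symbol; absent symbols get (-1.0, -1.0)."""
    features: List[float] = []
    for symbol in VOCAB:
        x, y = painting.get(symbol, (-1.0, -1.0))
        features.extend([1.0 if symbol in painting else 0.0, x, y])
    return features


# Two hypothetical paintings with the same symbols in different positions.
painting_a: Painting = {"tiger": (0.5, 0.7), "bird": (0.6, 0.2)}
painting_b: Painting = {"tiger": (0.2, 0.3), "bird": (0.8, 0.8)}

# The symbol list is identical, so an inventory-only model sees one input.
same_inventory = inventory_vector(painting_a) == inventory_vector(painting_b)

# Once positions are included, the two paintings become distinguishable.
same_arrangement = arrangement_vector(painting_a) == arrangement_vector(painting_b)

print("same inventory:", same_inventory)
print("same arrangement:", same_arrangement)
```

If genre depends on layout, as the abstract argues, any classifier restricted to `inventory_vector` must assign both toy paintings the same genre, however accurately the underlying symbols were detected.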

Problem

Research questions and friction points this paper is trying to address.

object grounding

Korean folk painting

genre classification

symbolic representation

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

faithful-but-insufficient dissociation

object grounding

multimodal fusion