🤖 AI Summary
This study tests the intuitive symbol-based approach to classifying the genre of Korean folk paintings (minhwa), in which the genre is read off from the list of auspicious symbols a painting contains, and finds that it performs poorly. Working with a public corpus of whole paintings, bilingual curatorial captions, and expert object crops, the authors show that a model fusing the image with the curatorial text predicts genre far better than a model given only the symbol inventory, and that forcing the genre representation to be object-grounded reduces accuracy. A leakage-safe object evidence map projected from a part-level detector is nonetheless spatially faithful to the regions curators isolated, a pattern the authors call a “faithful-but-insufficient” dissociation: the part-level explanation is accurate, but genre depends on how symbols are arranged rather than on which ones appear. The same analysis separates a content label (genre), which survives transfer to held-out source institutions, from a style label (era), which does not. The authors release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions for long-tailed heritage collections.
📝 Abstract
Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols (a tiger for protection, a pair of birds for marital harmony, a peony for wealth) that recur across many of its painted genres. This suggests an obvious computational approach: identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful both to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation: the part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions (genre) from a style label that does not (era), a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.
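The claim that genre turns on arrangement rather than inventory can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the symbol vocabulary, the two paintings, and the normalized coordinates are invented and are not drawn from the corpus or from the authors' system. The sketch only shows that a presence-only symbol list cannot distinguish two compositions that use the same symbols in different layouts.

```python
# Hypothetical symbol vocabulary (illustrative names, not the corpus label set).
VOCAB = ["tiger", "magpie", "pine", "peony", "bird_pair", "rock"]

def inventory(symbols):
    """Multi-hot presence vector: 1 if the symbol appears anywhere in the painting."""
    present = set(symbols)
    return tuple(1 if s in present else 0 for s in VOCAB)

# Two invented paintings, each a list of (symbol, x, y) placements
# with coordinates normalized to [0, 1].
painting_a = [("peony", 0.5, 0.5), ("rock", 0.5, 0.8), ("bird_pair", 0.3, 0.2)]
painting_b = [("bird_pair", 0.5, 0.5), ("peony", 0.9, 0.9), ("rock", 0.1, 0.9)]

inv_a = inventory(s for s, _, _ in painting_a)
inv_b = inventory(s for s, _, _ in painting_b)

# The inventories coincide even though the layouts differ, so any classifier
# that sees only the inventory must assign both paintings the same genre.
same_inventory = inv_a == inv_b
same_layout = sorted(painting_a) == sorted(painting_b)

print("same inventory:", same_inventory)
print("same layout:", same_layout)
```

A model that also receives the image or the curatorial text has access to the layout information that the inventory discards, which is consistent with the gap the abstract reports between the symbol-only and fused models.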