🤖 AI Summary
Bongard Problems (BPs) serve as a critical benchmark for abstract visual reasoning (AVR), yet existing datasets exhibit key limitations: early synthetic variants lack real-world grounding; later real-image datasets employ overly coarse-grained concepts; and the recent Bongard-RWR—though focused on fine-grained real-world scenes—contains only 60 manually constructed instances, compromising statistical robustness. To address this, we introduce Bongard-RWR+, the first large-scale (5,400 instances), fine-grained, real-world-oriented BP dataset. Our methodology leverages a vision-language model (VLM)-driven pipeline: Pixtral-12B generates precise conceptual descriptions, Flux.1-dev synthesizes semantically consistent images, and human validation ensures strict concept alignment. Empirical evaluation reveals that state-of-the-art VLMs perform reasonably on coarse-grained recognition but remain substantially limited in fine-grained abstract reasoning—highlighting Bongard-RWR+’s pivotal role in advancing rigorous AVR evaluation and model development.
📝 Abstract
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, although the represented concepts are identifiable from high-level image features, reducing the task complexity. In contrast, the recently released Bongard-RWR dataset aimed to represent abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle to discern fine-grained concepts, highlighting limitations in their reasoning capabilities.
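The describe → regenerate → synthesize → verify pipeline sketched in the abstract can be outlined as follows. This is a minimal illustrative sketch, not the authors' implementation: every function here is a hypothetical stub standing in for a model call (Pixtral-12B for the two description steps, Flux.1-dev for image synthesis) or for the manual human check.

```python
# Hedged sketch of the Bongard-RWR+ generation pipeline described in the
# abstract. All function names and returned strings are illustrative
# placeholders; the real pipeline invokes Pixtral-12B and Flux.1-dev.

def describe_image(image: str) -> str:
    # Placeholder for Pixtral-12B captioning a manually curated image.
    return f"description of {image}"

def generate_aligned_descriptions(desc: str, concept: str, n: int = 3) -> list[str]:
    # Placeholder for Pixtral-12B producing new descriptions that stay
    # aligned with the underlying BP concept.
    return [f"{desc}, variant {i}, expressing '{concept}'" for i in range(n)]

def synthesize_image(desc: str) -> str:
    # Placeholder for Flux.1-dev text-to-image synthesis.
    return f"image<{desc}>"

def human_verified(image: str, concept: str) -> bool:
    # Placeholder for the manual check that the generated image faithfully
    # reflects the intended concept; here it accepts everything.
    return True

def build_instances(seed_images: list[str], concept: str) -> list[str]:
    """Expand a few curated seed images into many concept-aligned ones."""
    instances = []
    for img in seed_images:
        base_desc = describe_image(img)
        for new_desc in generate_aligned_descriptions(base_desc, concept):
            candidate = synthesize_image(new_desc)
            if human_verified(candidate, concept):
                instances.append(candidate)
    return instances

images = build_instances(["seed_a.png", "seed_b.png"], "large vs. small figures")
print(len(images))  # 2 seeds x 3 variants = 6 candidate images
```

With real model backends, the same loop structure would let a handful of hand-curated Bongard-RWR images be expanded to the thousands of instances in Bongard-RWR+, with the human-verification step filtering out off-concept generations.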