Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of pretrained language models to human reporting bias in textual data, which leads to distorted commonsense reasoning in zero-shot settings. To mitigate this limitation, the authors propose a novel “machine imagination” mechanism that integrates an image generator into the reasoning pipeline. By synthesizing visual signals corresponding to input text, the framework constructs a synthetic visual question answering dataset and establishes an end-to-end multimodal zero-shot reasoning architecture named Imagine. This approach effectively compensates for gaps in textual knowledge and alleviates reporting bias. Experimental results demonstrate that Imagine significantly outperforms existing zero-shot methods across multiple commonsense reasoning benchmarks and even surpasses several advanced large language models, thereby validating the efficacy of machine imagination in enhancing model generalization.
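The summary describes a pipeline that pairs textual reasoning with "imagined" visual signals: the input text is rendered into a synthetic image, and answer options are scored against both modalities. A minimal sketch of that fusion idea follows; every component here is a toy stand-in (the real system would use an actual text-to-image generator and a vision-language encoder, and the function names `embed_text`, `imagine_image`, and `score_options` are illustrative, not from the paper).

```python
# Hedged sketch of an Imagine-style scoring step. All embeddings are toy
# bag-of-letters vectors standing in for PLM / image-generator outputs.
from math import sqrt

def embed_text(text):
    # Placeholder for a PLM text encoder: character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def imagine_image(question):
    # Placeholder for the image generator: in the paper's framing this
    # would synthesize a picture from the question; here we simply reuse
    # the text embedding as a fake "visual" signal.
    return embed_text(question)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def score_options(question, options, alpha=0.5):
    # Fuse textual and "imagined" visual evidence per answer option,
    # returning the index of the best-scoring option.
    q_text = embed_text(question)
    q_img = imagine_image(question)
    scores = [
        alpha * cosine(q_text, embed_text(opt))
        + (1 - alpha) * cosine(q_img, embed_text(opt))
        for opt in options
    ]
    return max(range(len(options)), key=scores.__getitem__)

best = score_options("Where would you store leftover soup?",
                     ["refrigerator", "volcano"])
print(best)  # index of the chosen answer option
```

The design point being illustrated is the fusion weight `alpha`: when the textual evidence is distorted by reporting bias, the visual branch contributes an independent signal, which is the mechanism the paper credits for its zero-shot gains.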

📝 Abstract
Recent advancements in zero-shot commonsense reasoning have empowered Pre-trained Language Models (PLMs) to acquire extensive commonsense knowledge without requiring task-specific fine-tuning. Despite this progress, these models frequently suffer from limitations caused by human reporting biases inherent in textual knowledge, leading to understanding discrepancies between machines and humans. To bridge this gap, we introduce an additional modality to enrich the reasoning capabilities of PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Specifically, we enhance PLMs with the ability to imagine by embedding an image generator directly into the reasoning pipeline. To facilitate effective utilization of this imagined visual context, we construct synthetic datasets designed to emulate visual question-answering scenarios. Through comprehensive evaluations on multiple commonsense reasoning benchmarks, we demonstrate that Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models. These results underscore the capability of machine imagination to mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models.
Problem

Research questions and friction points this paper is trying to address.

zero-shot commonsense reasoning
reporting bias
pre-trained language models
machine imagination
visual knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

machine imagination
zero-shot commonsense reasoning
visual knowledge integration
pre-trained language models
reporting bias mitigation