🤖 AI Summary
This work investigates the feasibility of extracting referring expressions from vision-grounded dialogues using autoregressive large language models (LLMs) alone, i.e., identifying visually referable objects solely from linguistic context, without image input. Methodologically, we propose a next-token-prediction-based span labeling mechanism, coupled with parameter-efficient fine-tuning (e.g., LoRA), to formulate referring expression detection as a purely textual sequence labeling task. Experiments establish, for the first time, that medium-scale LLMs can effectively perform this task in few-shot settings, demonstrating that linguistic cues alone suffice for coarse-grained referent localization. However, our analysis further reveals the inherently multimodal nature of the task, exposing fundamental limitations of unimodal (text-only) approaches. This study introduces the first purely text-based baseline for visual referring expression understanding and stimulates theoretical reflection on the relationship between linguistic and visual representations.
📄 Abstract
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
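The span-demarcation idea described above can be sketched as follows: the model's training target is the dialogue text with boundary markers inserted around each mention, so that span labeling reduces to ordinary next-token prediction over marked text. The bracket markers, helper name, and example dialogue below are illustrative assumptions, not the paper's actual tokenization scheme.

```python
# Minimal sketch (assumed formulation): turn mention-span annotation into a
# text-generation target by inserting boundary markers around each span.
# "[" and "]" are placeholder boundary tokens chosen for illustration.

def mark_spans(tokens, spans):
    """Insert boundary markers around each mention span.

    `spans` is a list of (start, end) pairs with inclusive token indices;
    spans are assumed to be non-overlapping.
    """
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append("[")   # open a mention span
        out.append(tok)
        if i in ends:
            out.append("]")   # close the mention span
    return out

# Hypothetical dialogue turn with two visually referable mentions:
# "the red mug" and "the lamp".
turn = ["put", "the", "red", "mug", "next", "to", "the", "lamp"]
mentions = [(1, 3), (6, 7)]
print(" ".join(mark_spans(turn, mentions)))
# → put [ the red mug ] next to [ the lamp ]
```

A causal LLM fine-tuned (e.g., with LoRA adapters) on pairs of plain and marker-annotated text can then emit the markers token by token at inference time, which is what makes the task a purely textual sequence labeling problem.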