Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

📅 2025-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of modeling visual memorability without large-scale human annotations by proposing an unsupervised learning paradigm grounded in natural-language “tip-of-the-tongue” (ToT) recall descriptions. Methodologically, the authors construct the first large-scale ToT dataset, comprising 82K videos paired with open-ended recall texts scraped from platforms such as Reddit, and formulate a multimodal ToT retrieval task. Leveraging vision-language foundation models, they tackle recall generation by fine-tuning on the ToT queries and cross-modal retrieval via contrastive learning. The contributions are threefold: (1) the first large-scale, unsupervised visual memorability dataset; (2) the first formalization and modeling of the fine-grained memory signals embedded in natural-language recall; and (3) the first model capable of multimodal ToT retrieval. Models fine-tuned on the dataset surpass strong baselines, including GPT-4o, in recall generation and significantly improve memorability prediction performance.
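
For the recall-generation side, a standard way to fine-tune a vision-language model on open-ended recall text is next-token cross-entropy over the recall tokens, conditioned on the video. The sketch below illustrates only that loss term; the function name, tensor shapes, and padding convention are assumptions, not the paper's stated recipe.

```python
# Hedged sketch: next-token cross-entropy for recall generation.
# The shapes and pad-token handling are assumed, not taken from the paper.
import torch
import torch.nn.functional as F

def recall_generation_loss(logits: torch.Tensor,
                           recall_token_ids: torch.Tensor,
                           pad_id: int = 0) -> torch.Tensor:
    """logits: (batch, seq_len, vocab) from a VLM conditioned on the video.
    recall_token_ids: (batch, seq_len) tokenized ToT recall description."""
    # Shift so each position predicts the next recall token.
    shifted_logits = logits[:, :-1, :].contiguous()
    targets = recall_token_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        targets.view(-1),
        ignore_index=pad_id,  # don't penalize padding positions
    )
```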

📝 Abstract
Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
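
The abstract names a contrastive training strategy for multimodal ToT retrieval but gives no details here. Below is a minimal sketch of the symmetric InfoNCE-style objective commonly used for text-video retrieval; the tensor names, normalization, and temperature value are assumptions, not the authors' exact formulation.

```python
# Hedged sketch: symmetric contrastive (InfoNCE-style) loss over paired
# ToT query and video embeddings. Diagonal pairs are the positives.
import torch
import torch.nn.functional as F

def tot_contrastive_loss(query_emb: torch.Tensor,
                         video_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """query_emb, video_emb: (batch, dim) embeddings of paired
    ToT recall queries and their target videos."""
    # L2-normalize so dot products are cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    # (batch, batch) similarity matrix; entry (i, j) scores query i vs video j.
    logits = q @ v.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: query->video and video->query directions.
    loss_q2v = F.cross_entropy(logits, targets)
    loss_v2q = F.cross_entropy(logits.t(), targets)
    return (loss_q2v + loss_v2q) / 2
```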
Problem

Research questions and friction points this paper is trying to address.

Modeling visual memorability without expensive human annotations
Capturing nuanced memorability signals from natural recall descriptions
Creating scalable datasets for visual content memorability research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging tip-of-the-tongue queries for unsupervised dataset
Fine-tuning vision-language models for memorability description generation
Using contrastive training for multimodal tip-of-the-tongue retrieval (see the retrieval sketch below)
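
At inference time, a contrastively trained retrieval model would embed a free-form ToT query and rank a precomputed index of video embeddings by cosine similarity. The sketch below shows only this ranking step; the index layout and `top_k` default are hypothetical, as the paper excerpt does not specify them.

```python
# Hedged sketch: rank an indexed video collection against one ToT query.
# Assumes embeddings were produced by the trained query/video encoders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_videos(query_emb: torch.Tensor,
                video_index: torch.Tensor,
                top_k: int = 5):
    """query_emb: (dim,) embedding of one ToT recall query.
    video_index: (num_videos, dim) precomputed video embeddings."""
    q = F.normalize(query_emb, dim=-1)
    idx = F.normalize(video_index, dim=-1)
    scores = idx @ q                      # cosine similarity per video
    top = torch.topk(scores, k=top_k)     # highest-scoring candidates
    return top.indices.tolist(), top.values.tolist()
```
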
Sree Bhattacharyya
The Pennsylvania State University
Multimodal AI · Vision-Language · Affective Computing
Yaman Kumar Singla
Adobe
Machine Learning · Behavioral Science · Computational Marketing · Large Language Models
Sudhir Yarram
Adobe Media and Data Science Research
Somesh Kumar Singh
Adobe Media and Data Science Research
Harini S I
Adobe Media and Data Science Research
James Z. Wang
The Pennsylvania State University