Vector embedding of multi-modal texts: a tool for discovery?

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the cross-modal retrieval challenge for multimodal educational content—particularly computer science textbooks containing interleaved text and figures. We propose a multi-vector representation method that jointly encodes textual and visual semantics using a vision-language model (VLM), generating fine-grained multimodal embeddings and indexing them in a vector database to enable efficient cross-modal retrieval. On a corpus of over 3,600 pages of textbook material, we systematically compare four similarity metrics and find that cosine similarity significantly outperforms the alternatives. Benchmarking against 75 natural-language queries confirms substantial improvements in retrieval precision and practical utility within digital library settings. Our approach delivers a reproducible, scalable technical framework for intelligent discovery of multimodal educational resources, advancing the state of cross-modal semantic search in academic and pedagogical contexts.

📝 Abstract
Computer science texts are particularly rich in both narrative content and illustrative material: charts, algorithms, images, annotated diagrams, and more. This study explores the extent to which vector-based multimodal retrieval, powered by vision-language models (VLMs), can improve discovery across multi-modal (text and image) content. Using a VLM over more than 3,600 digitized textbook pages, drawn largely from computer science textbooks, we generate multi-vector representations capturing both textual and visual semantics. These embeddings are stored in a vector database. We issue a benchmark of 75 natural-language queries and compare retrieval performance against ground truth and across four similarity (distance) measures. The study is intended to expose both the strengths and weaknesses of such an approach. We find that cosine similarity most effectively retrieves semantically and visually relevant pages. We further discuss the practicality of using a vector database and multi-modal embeddings for operational information retrieval. Our paper is intended to offer design insights for discovery over digital libraries.

Keywords: vector embedding, multi-modal document retrieval, vector database benchmark, digital library discovery
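The abstract describes scoring pages by their multi-vector embeddings using cosine similarity. The paper does not disclose its exact scoring function; a minimal sketch of one common way to rank pages from multi-vector representations is late-interaction ("MaxSim"-style) scoring, shown here with toy random vectors in place of real VLM embeddings:

```python
import numpy as np

def normalize(m):
    # L2-normalize rows so that dot products become cosine similarities
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def maxsim_score(query_vecs, page_vecs):
    # Late-interaction scoring: for each query vector, take its best
    # cosine match among the page's vectors, then sum those maxima.
    sims = normalize(query_vecs) @ normalize(page_vecs).T
    return float(sims.max(axis=1).sum())

# Toy data: a query with 2 vectors and two candidate pages with 3
# vectors each (real systems would use VLM-produced embeddings).
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 8))
pages = {"page_a": rng.normal(size=(3, 8)),
         "page_b": rng.normal(size=(3, 8))}

ranked = sorted(pages, key=lambda p: maxsim_score(query, pages[p]),
                reverse=True)
print(ranked)
```

In practice the per-page vectors would be indexed in a vector database and the max/sum aggregation delegated to its multi-vector search support rather than computed in NumPy.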
Problem

Research questions and friction points this paper is trying to address.

Improving discovery in multi-modal content using vector embeddings
Benchmarking retrieval performance across different similarity measures
Exploring strengths and weaknesses of vision-language models for retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models for multimodal retrieval
Generates multi-vector representations for text and images
Compares four similarity measures using vector database
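The paper benchmarks four similarity (distance) measures but names only cosine similarity, the winner, in the abstract. As an illustrative sketch, the four measures most commonly offered by vector databases are cosine, inner product, Euclidean, and Manhattan (the last three are assumptions, not taken from the paper):

```python
import numpy as np

def cosine(a, b):
    # Angle-based: invariant to vector magnitude
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def inner_product(a, b):
    # Magnitude-sensitive dot product
    return float(a @ b)

def euclidean(a, b):
    # Straight-line (L2) distance
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    # Coordinate-wise (L1) distance
    return float(np.abs(a - b).sum())

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # a scaled copy of a

# Cosine sees the two vectors as identical (similarity ~1.0, up to
# floating point), while the distance measures separate them.
print(cosine(a, b))
print(euclidean(a, b), manhattan(a, b))
```

This magnitude invariance is one reason cosine similarity often works well for embedding retrieval: two pages embedded in the same direction score as equally relevant regardless of vector norm.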
Beth Plale
Burns McRobbie Professor, Indiana University
open science, data engineering, smart and connected communities, AI in HPC, provenance
Sai Navya Jyesta
Computer Science Engineering Dept, Indiana University
Sachith Withana
Intelligent Systems Engineering Dept, Indiana University