🤖 AI Summary
This study investigates how multimodality and conversationality jointly shape learning in visually intensive STEM contexts, such as biology, where empirical evidence remains scarce. Through a randomized controlled experiment, it compares a multimodal dialogue system with interleaved text and images (MuDoC), a text-only conversational AI (TexDoC), and a semantic document search interface (DocSearch) on learning outcomes and user experience, analyzed through the lens of Cognitive Load Theory. The findings show that while conversational interaction alone enhances perceived usability and engagement, it can impair actual learning performance: TexDoC was rated favorably yet produced the lowest post-test scores. In contrast, MuDoC fostered deeper learning, yielding the highest post-test scores and the most positive user experience, highlighting a dissociation between perceived understanding and actual learning gains. The work integrates document-grounded multimodal large language models, semantic retrieval, and a rigorous experimental design, offering empirical insights for generative AI in education.
📝 Abstract
Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while the visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.