LLMs in the Classroom: Outcomes and Perceptions of Questions Written with the Aid of AI

📅 2025-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how well classroom questions generated with large language models (LLMs) work pedagogically and whether students can tell them apart from human-written questions. In a randomized controlled experiment, students were presented with questions authored either by humans or with ChatGPT; the analysis covered answer accuracy, source attribution, and content consistency. Results show that (1) students could not reliably distinguish question origin (p = 0.309); (2) responses to AI-generated questions scored significantly lower, by 8.7% on average (p < 0.01); and (3) an evaluation of instructional alignment based on SBERT embeddings and cosine similarity found significantly lower semantic consistency between AI-generated questions and the corresponding textbook content (p < 0.001). The work offers empirical evidence on the educational suitability of AI-assisted item writing and a reproducible, embedding-based method for assessing content quality in automated assessment design.

📝 Abstract

We randomly deploy questions constructed with and without the aid of an LLM tool (ChatGPT) and gauge students' ability to answer them correctly, as well as their ability to tell human-authored questions from LLM-authored ones. To determine whether the questions written with the aid of ChatGPT were consistent with the instructor's questions and the source text, we computed representative vectors of both the human and ChatGPT questions using SBERT and compared their cosine similarity to the course textbook. A non-significant Mann-Whitney U test (z = 1.018, p = .309) suggests that students were unable to perceive whether questions were written with or without the aid of ChatGPT. However, student scores on LLM-authored questions were almost 9% lower (z = 2.702, p < .01). This result may indicate either that the AI questions were more difficult or that the students were more familiar with the instructor's style of questions. Overall, the study suggests that while there is potential for using LLM tools to aid in the construction of assessments, care must be taken to ensure that the questions are fair, well-composed, and relevant to the course material.
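The Mann-Whitney U comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the reported z values come from the usual normal approximation with a two-sided p-value, and it uses average ranks with a tie-corrected variance and no continuity correction. The function names and the toy samples are hypothetical.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def mann_whitney_u(x, y):
    """Mann-Whitney U test via the normal approximation.

    Returns (u, z, p) for sample x versus sample y. Ties receive
    average ranks and the variance is tie-corrected. No continuity
    correction is applied (an assumption of this sketch).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y],
                    key=lambda t: t[0])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 0.0, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, z, two_sided_p_from_z(z)


# Toy, made-up scores purely for illustration (not data from the study).
human_scores = [1, 2, 3]
llm_scores = [4, 5, 6]
u, z, p = mann_whitney_u(human_scores, llm_scores)
```

Under these assumptions, `two_sided_p_from_z(1.018)` gives about 0.309, which agrees with the p-value reported for the perception test, and `two_sided_p_from_z(2.702)` falls below .01, consistent with the score comparison.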

Problem

Research questions and friction points this paper is trying to address.

Assess student performance on LLM-authored vs human-authored questions

Evaluate student perception of LLM vs human question origins

Compare question relevance using SBERT and textbook similarity
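The relevance comparison listed above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's pipeline: it treats the "representative vector" of a question set as the mean of its embeddings, and it uses tiny made-up 3-dimensional vectors in place of real SBERT embeddings (which in practice would come from an SBERT model and have several hundred dimensions). The helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings (assumed values, not from the study).
textbook_vec = [1.0, 0.0, 0.0]
human_question_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
llm_question_vecs = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0]]

# Representative vector per question set, then similarity to the textbook.
human_sim = cosine_similarity(mean_vector(human_question_vecs), textbook_vec)
llm_sim = cosine_similarity(mean_vector(llm_question_vecs), textbook_vec)
```

A higher similarity here would suggest that a question set sits closer to the textbook content in embedding space; whether that difference is meaningful still requires a significance test on real embeddings.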

Innovation

Methods, ideas, or system contributions that make the work stand out.

Used SBERT for text vector similarity comparison

Deployed randomized human and AI-authored questions

Applied Mann-Whitney U test for perception analysis
