Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

📅 2026-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In the context of widespread large language model (LLM) deployment, traditional assessments may become invalid because humans and AI systems respond to test items in systematically different ways. This study extends differential item functioning (DIF) analysis, a psychometric technique, to the comparison of human and AI ability. By combining DIF with negative control analysis and item-total correlation-based discrimination analysis, the work establishes an assessment design paradigm tailored to the AI era. The approach is validated empirically on data from a high school chemistry diagnostic test and a university entrance exam. Expert analyses further identify key task dimensions that influence AI performance, offering both theoretical grounding and practical pathways for developing next-generation educational assessments that are robust against AI misuse while ensuring fairness and validity.
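The paper does not specify which DIF detector it uses; the sketch below is a minimal, illustrative Python implementation of one common choice, the Mantel-Haenszel procedure, applied to a human (reference) versus chatbot (focal) split. It is not the authors' code, and the function and variable names are placeholders.

```python
import numpy as np

def mantel_haenszel_dif(responses, is_chatbot, item):
    """Illustrative Mantel-Haenszel DIF check for one dichotomous item.

    responses  : (n_respondents, n_items) 0/1 score matrix
    is_chatbot : boolean array; False = human (reference), True = chatbot (focal)
    item       : column index of the studied item

    Returns the common odds ratio and the ETS delta-scale MH D-DIF,
    where |MH D-DIF| >= 1.5 is conventionally flagged as large DIF.
    """
    responses = np.asarray(responses)
    is_chatbot = np.asarray(is_chatbot, dtype=bool)

    # Matching criterion: rest score (total on all other items), so the
    # studied item does not contaminate the ability proxy.
    rest = np.delete(np.arange(responses.shape[1]), item)
    rest_score = responses[:, rest].sum(axis=1)

    num = den = 0.0
    for s in np.unique(rest_score):            # one stratum per matched score level
        in_stratum = rest_score == s
        ref = in_stratum & ~is_chatbot
        foc = in_stratum & is_chatbot
        a = responses[ref, item].sum()         # humans answering correctly
        b = ref.sum() - a                      # humans answering incorrectly
        c = responses[foc, item].sum()         # chatbots answering correctly
        d = foc.sum() - c                      # chatbots answering incorrectly
        n = in_stratum.sum()
        num += a * d / n
        den += b * c / n

    odds_ratio = num / den if den > 0 else np.nan
    mh_d_dif = -2.35 * np.log(odds_ratio)      # ETS delta metric
    return odds_ratio, mh_d_dif
```

In this setup, the negative control analysis mentioned in the summary would be expected to show little or no DIF when two comparable human cohorts are contrasted, guarding against items being flagged spuriously.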

📝 Abstract
The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis -- traditionally used to detect bias across demographic groups -- together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.
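The abstract also names item-total correlation discrimination analysis. A common implementation is the corrected point-biserial correlation sketched below, given here as an illustration under the assumption of dichotomously scored items rather than as the instrument-specific code used in the study; computing it separately for the human and chatbot groups would indicate where an item stops discriminating for one population.

```python
import numpy as np

def corrected_item_total_correlations(responses):
    """Discrimination as the corrected item-total (point-biserial) correlation.

    responses : (n_respondents, n_items) 0/1 score matrix.
    Each item is correlated with the summed score of the *other* items,
    so an item cannot inflate its own discrimination estimate.
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest_score = np.delete(responses, j, axis=1).sum(axis=1)
        discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]
    return discrimination

# Hypothetical usage: contrast discrimination profiles of the two groups.
# human_disc   = corrected_item_total_correlations(human_responses)
# chatbot_disc = corrected_item_total_correlations(chatbot_responses)
```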
Problem

Research questions and friction points this paper is trying to address.

assessment design
large language models
differential item functioning
AI misuse
educational evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Item Functioning
Large Language Models
Assessment Design
Educational Data Mining
Psychometric Theory
🔎 Similar Papers
No similar papers found.
Licol Zeinfeld
Weizmann Institute of Science
Alona Strugatski
Weizmann Institute of Science
Ziva Bar-Dov
Weizmann Institute of Science
Ron Blonder
Weizmann Institute of Science
Shelley Rap
Weizmann Institute of Science
Giora Alexandron
Associate Professor, Weizmann Institute of Science
AI in Education
Learning Analytics
Educational Data Mining
AI Education