Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

📅 2026-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In the context of widespread large language model (LLM) deployment, traditional assessments may become invalid because humans and AI systems respond to test items in systematically different ways. This study extends differential item functioning (DIF) analysis, a psychometric technique, to the comparison of human and AI ability. By combining DIF with negative control analysis and item-total correlation-based discrimination analysis, the work establishes an assessment design paradigm tailored to the AI era. The approach is validated empirically on data from a high school chemistry diagnostic test and a university entrance exam. Expert analyses further identify key task dimensions that influence AI performance, offering both theoretical grounding and practical pathways for developing next-generation educational assessments that are robust against AI misuse while ensuring fairness and validity.
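The paper does not specify which DIF detector it uses; the sketch below is a minimal, illustrative Python implementation of one common choice, the Mantel-Haenszel procedure, applied to a human (reference) versus chatbot (focal) split. It is not the authors' code, and the function and variable names are placeholders.

```python
import numpy as np

def mantel_haenszel_dif(responses, is_chatbot, item):
    """Illustrative Mantel-Haenszel DIF check for one dichotomous item.

    responses  : (n_respondents, n_items) 0/1 score matrix
    is_chatbot : boolean array; False = human (reference), True = chatbot (focal)
    item       : column index of the studied item

    Returns the common odds ratio and the ETS delta-scale MH D-DIF,
    where |MH D-DIF| >= 1.5 is conventionally flagged as large DIF.
    """
    responses = np.asarray(responses)
    is_chatbot = np.asarray(is_chatbot, dtype=bool)

    # Matching criterion: rest score (total on all other items), so the
    # studied item does not contaminate the ability proxy.
    rest = np.delete(np.arange(responses.shape[1]), item)
    rest_score = responses[:, rest].sum(axis=1)

    num = den = 0.0
    for s in np.unique(rest_score):            # one stratum per matched score level
        in_stratum = rest_score == s
        ref = in_stratum & ~is_chatbot
        foc = in_stratum & is_chatbot
        a = responses[ref, item].sum()         # humans answering correctly
        b = ref.sum() - a                      # humans answering incorrectly
        c = responses[foc, item].sum()         # chatbots answering correctly
        d = foc.sum() - c                      # chatbots answering incorrectly
        n = in_stratum.sum()
        num += a * d / n
        den += b * c / n

    odds_ratio = num / den if den > 0 else np.nan
    mh_d_dif = -2.35 * np.log(odds_ratio)      # ETS delta metric
    return odds_ratio, mh_d_dif
```

In this setup, the negative control analysis mentioned in the summary would be expected to show little or no DIF when two comparable human cohorts are contrasted, guarding against items being flagged spuriously.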

📝 Abstract
The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis -- traditionally used to detect bias across demographic groups -- together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.
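The abstract also names item-total correlation discrimination analysis. A common implementation is the corrected point-biserial correlation sketched below, given here as an illustration under the assumption of dichotomously scored items rather than as the instrument-specific code used in the study; computing it separately for the human and chatbot groups would indicate where an item stops discriminating for one population.

```python
import numpy as np

def corrected_item_total_correlations(responses):
    """Discrimination as the corrected item-total (point-biserial) correlation.

    responses : (n_respondents, n_items) 0/1 score matrix.
    Each item is correlated with the summed score of the *other* items,
    so an item cannot inflate its own discrimination estimate.
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest_score = np.delete(responses, j, axis=1).sum(axis=1)
        discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]
    return discrimination

# Hypothetical usage: contrast discrimination profiles of the two groups.
# human_disc   = corrected_item_total_correlations(human_responses)
# chatbot_disc = corrected_item_total_correlations(chatbot_responses)
```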
Problem

Research questions and friction points this paper is trying to address.

assessment design
large language models
differential item functioning
AI misuse
educational evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Item Functioning
Large Language Models
Assessment Design
Educational Data Mining
Psychometric Theory
🔎 Similar Papers
No similar papers found.
Licol Zeinfeld
Weizmann Institute of Science
Alona Strugatski
Weizmann Institute of Science
Ziva Bar-Dov
Weizmann Institute of Science
Ron Blonder
Weizmann Institute of Science
Shelley Rap
Weizmann Institute of Science
Giora Alexandron
Associate Professor, Weizmann Institute of Science
AI in Education
Learning Analytics
Educational Data Mining
AI Education