🤖 AI Summary
This study addresses the high cost, long duration, and limited sustainability of traditional item pretesting in high-stakes computerized adaptive testing (CAT), which constrains the continuous supply of calibrated items. The authors propose the S2A3 framework, which combines Thompson sampling with soft scoring to calibrate item parameters and estimate examinee ability simultaneously in an online setting. The method selects items by sampling provisional parameters from each item's posterior distribution and choosing the item with the highest expected Fisher information, accounts for parameter uncertainty in ability estimates through soft scoring, and uses a temperature-controlled stochastic variant of Sympson-Hetter exposure control to manage item exposure. Evaluated on two vocabulary tasks from the Duolingo English Test, S2A3 shows rapid calibration and reliable scoring even when cold-start items make up a large share of the active pool, balancing measurement efficiency against item bank security.
📝 Abstract
High-stakes computerized adaptive tests (CATs) require a continuous supply of calibrated items, yet traditional item piloting is slow, expensive, and operationally hazardous. We introduce the S2A3 framework -- Soft Scoring (S2) and Adaptive Adaptive Administration (A3) -- which unifies item calibration and test administration into a single online process. Thompson sampling enhances item selection by drawing provisional parameters from each item's posterior distribution and selecting the item maximizing expected Fisher information, naturally routing uncertain items to informative test-takers while maintaining measurement precision. Soft scoring integrates over parameter uncertainty so that incompletely calibrated items exert appropriately attenuated influence on ability estimates. A stochastic variant of Sympson-Hetter exposure control balances measurement efficiency against bank security via a tunable temperature parameter and item-specific weights. We validate S2A3 on Yes/No Vocabulary and Vocabulary-in-Context tasks from the Duolingo English Test, demonstrating rapid item calibration and preserved scoring reliability even when cold-start items constitute a significant fraction of the active pool.
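The selection and scoring steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-parameter logistic (2PL) response model and independent Gaussian posteriors over each item's discrimination `a` and difficulty `b`, neither of which the abstract specifies, and the function names (`thompson_select`, `soft_likelihood`) are hypothetical. Exposure control is left out.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def thompson_select(theta_hat, post_mean, post_sd, rng):
    """Draw one (a, b) pair per item from its posterior, then pick the
    item with the highest information under the drawn parameters.

    post_mean, post_sd: arrays of shape (n_items, 2), columns (a, b).
    """
    draws = rng.normal(post_mean, post_sd)
    a = np.abs(draws[:, 0])  # keep discrimination positive (simplification)
    b = draws[:, 1]
    return int(np.argmax(fisher_info(theta_hat, a, b)))

def soft_likelihood(theta, response, mean, sd, rng, n_draws=2000):
    """Monte Carlo average of the response likelihood over posterior
    draws of (a, b), so uncertain items give a flatter likelihood."""
    draws = rng.normal(mean, sd, size=(n_draws, 2))
    p = p_correct(theta, np.abs(draws[:, 0]), draws[:, 1])
    return float(np.mean(p if response == 1 else 1.0 - p))
```

In this sketch a well-calibrated item (small posterior standard deviation) is selected almost greedily, while a cold-start item can win the draw for test-takers whose ability falls where its sampled parameters look informative. The averaged likelihood of a poorly calibrated item is pulled toward 0.5, which is one way an item can exert attenuated influence on the ability estimate.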