AI Summary
This study addresses key challenges in the automatic evaluation and improvement of behavioral interview responses, including insufficiently structured scoring, lack of realistic interviewer behavior simulation, and limited training utility. To overcome these limitations, the authors propose a human-in-the-loop approach that integrates chain-of-thought prompting with an adversarial "Bar Raiser" mechanism grounded in a negativity bias model, effectively emulating authentic interviewer feedback. Experimental results demonstrate statistically significant improvements (p < 0.001): response confidence scores increased from 3.16 to 4.16, authenticity rose from 2.94 to 4.53, the number of required iterations decreased fivefold, and the success rate for improving weak responses reached 100%. These outcomes substantially outperform purely automated alternatives.
Abstract
Behavioral interview evaluation using large language models presents unique challenges: it requires structured assessment, realistic simulation of interviewer behavior, and pedagogical value for candidate training. We investigate chain-of-thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question-and-answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human-in-the-loop and automated chain-of-thought improvement. Using a within-subject paired design (n = 50), both approaches show positive rating improvements. The human-in-the-loop approach provides significant training benefits: confidence improves from 3.16 to 4.16 (p < 0.001) and authenticity improves from 2.94 to 4.53 (p < 0.001, Cohen's d = 3.21). The human-in-the-loop method also requires five times fewer iterations (1.0 versus 5.0, p < 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human-in-the-loop approach achieving a 100% success rate compared to 84% for automated approaches among initially weak answers (Cohen's h = 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named "bar raiser", to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain-of-thought prompting provides a useful foundation for interview evaluation, domain-specific enhancements and context-aware approach selection are essential for realistic and pedagogically valuable results.
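The effect size reported for the success-rate comparison can be checked from the two proportions alone. The sketch below is a minimal illustration, assuming the standard definition of Cohen's h (the difference between arcsine-transformed proportions); the function name `cohens_h` is hypothetical and is not taken from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions, using the standard arcsine transform.

    Hypothetical helper for illustration; not the authors' code.
    """
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Success rates among initially weak answers, as stated in the abstract:
# 100% for human-in-the-loop versus 84% for the automated approach.
h = cohens_h(1.00, 0.84)
print(round(h, 2))  # 0.82
```

By Cohen's conventional thresholds, values of h around 0.8 or above are read as a large effect, which is consistent with the label given in the abstract.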