🤖 AI Summary
Existing single-turn benchmarks inadequately assess the stability of large language models’ adherence to animal welfare values and their moral sensitivity under multi-turn adversarial pressure. This work proposes MANTA—the first multi-turn adversarial benchmark specifically designed for animal welfare evaluation—comprising 1,088 five-turn dialogues that systematically introduce explicit prompts while applying five categories of adversarial pressure: social, cultural, economic, utilitarian, and cognitive. We introduce, for the first time, a dual-dimensional evaluation framework encompassing Animal Welfare Value Stability (AWVS) and Animal Welfare Moral Sensitivity (AWMS), which we apply via human annotation to assess seven state-of-the-art models. Results reveal that multi-turn evaluation substantially alters model rankings; AWVS and AWMS are positively correlated yet distinct constructs; and models demonstrate significantly stronger moral recognition toward companion animals compared to wildlife, farmed animals, and invertebrates.
📝 Abstract
Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale. We present preliminary results from evaluations of claude-sonnet-4-20250514 and openai/gpt-4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence-based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first-order practical scenarios. We additionally present STYLEJUDGE, a controlled four-judge study demonstrating systematic format bias in LLM-as-judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium-tools/manta.