MANTA: Multi-turn Assessment for Nonhuman Thinking&Alignment

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

Existing single-turn benchmarks inadequately assess the stability of large language models’ adherence to animal welfare values and their moral sensitivity under multi-turn adversarial pressure. This work proposes MANTA—the first multi-turn adversarial benchmark specifically designed for animal welfare evaluation—comprising 1,088 five-turn dialogues that systematically introduce explicit prompts while applying five categories of adversarial pressure: social, cultural, economic, utilitarian, and cognitive. We introduce, for the first time, a dual-dimensional evaluation framework encompassing Animal Welfare Value Stability (AWVS) and Animal Welfare Moral Sensitivity (AWMS), which we apply via human annotation to assess seven state-of-the-art models. Results reveal that multi-turn evaluation substantially alters model rankings; AWVS and AWMS are positively correlated yet distinct constructs; and models demonstrate significantly stronger moral recognition toward companion animals compared to wildlife, farmed animals, and invertebrates.

📝 Abstract

Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale. We present preliminary results from evaluations of claude-sonnet-4-20250514 and openai/gpt-4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence-based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first-order practical scenarios. We additionally present STYLEJUDGE, a controlled four-judge study demonstrating systematic format bias in LLM-as-judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium-tools/manta.

Problem

Research questions and friction points this paper is trying to address.

animal welfare

large language models

adversarial benchmarking

value alignment

moral sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn adversarial benchmark

animal welfare reasoning

value stability