Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

📅 2026-06-03

🤖 AI Summary

This study investigates whether large language models (LLMs) perform genuine structured causal reasoning or merely exploit lexical cues on causal inference tasks. To disentangle lexical signals from the underlying causal structure, the authors propose Caliper, a controlled perturbation that replaces semantic variable names in causal queries with placeholder tokens while preserving the causal graph and probabilistic dependencies. Across nine instruction-tuned LLMs and three established benchmarks (CLadder, CRASS, and e-CARE), anonymization produces substantial performance drops, with accuracy decreasing by 7.6 to 29.6 percentage points. Structured scaffolding and few-shot prompting narrow the gap, but mainly by lowering accuracy on the original questions for smaller models rather than by recovering accuracy on the anonymized ones. These findings indicate that current instruction-tuned models, evaluated zero-shot, show little evidence of robust, structure-based causal reasoning and depend heavily on surface-level lexical patterns.
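To make the anonymization step concrete, here is a minimal Python sketch of replacing semantic variable names with placeholder tokens while leaving the rest of the question untouched. It is an illustration only, not the authors' implementation: the function name `anonymize`, the `X1, X2, ...` token format, and the example question are all assumptions introduced here.

```python
import re


def anonymize(question: str, variables: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each named variable with a placeholder token (hypothetical X1, X2, ...).

    The causal relations stated in the text are left untouched; only the
    surface names change. Returns the rewritten question and the name-to-token map.
    """
    # Assumed token scheme: the i-th listed variable becomes "X{i}".
    mapping = {name: f"X{i}" for i, name in enumerate(variables, start=1)}
    # Try longer names first so a multi-word name is matched before any shorter name.
    ordered = sorted(variables, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    lowered = {name.lower(): token for name, token in mapping.items()}
    text = pattern.sub(lambda m: lowered[m.group(1).lower()], question)
    return text, mapping


# Toy example (not taken from any benchmark).
question = (
    "Smoking causes tar deposits, and tar deposits cause lung cancer. "
    "Does smoking affect lung cancer?"
)
anonymized, mapping = anonymize(question, ["smoking", "tar deposits", "lung cancer"])
print(anonymized)
```

Because the same name always maps to the same token, the graph structure implied by the sentence is preserved; only the lexical anchors are removed.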

📝 Abstract

Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 (original-question) accuracy on smaller models rather than recovering P1 (anonymized-question) accuracy. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
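The gap figures above read as the difference between accuracy on the original questions and accuracy on their anonymized counterparts, expressed in percentage points. The following Python sketch shows that arithmetic under this reading; the function names and the toy predictions are hypothetical and are not data from the paper.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def lexical_gap_pp(p0_preds: list[str], p1_preds: list[str], gold: list[str]) -> float:
    """Accuracy on original questions minus accuracy on anonymized ones, in pp.

    A positive value means the model did worse once variable names were anonymized.
    """
    return 100.0 * (accuracy(p0_preds, gold) - accuracy(p1_preds, gold))


# Made-up toy labels for illustration only.
gold = ["yes", "no", "yes", "no"]
p0_preds = ["yes", "no", "yes", "yes"]  # 3 of 4 correct on original questions
p1_preds = ["yes", "yes", "no", "yes"]  # 1 of 4 correct on anonymized questions
gap = lexical_gap_pp(p0_preds, p1_preds, gold)
print(f"{gap:+.1f} pp")
```

Under this definition a gap can shrink either because anonymized accuracy rises or because original accuracy falls, which is why the two accuracies are worth reporting separately.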

Problem

Research questions and friction points this paper is trying to address.

causal reasoning

lexical anchors

structural reasoning

large language models

causal benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

lexical anonymization

large language models