🤖 AI Summary
This study addresses the limitations of traditional corpus linguistics, which relies on labor-intensive manual hypothesis formulation and query construction, resulting in low efficiency and high entry barriers. It introduces large language models as autonomous agents into this domain for the first time, leveraging the Model Context Protocol (MCP) framework to interface with CQP-indexed corpora such as Gutenberg and CLMET. This approach enables AI-driven hypothesis generation, query execution, result interpretation, and iterative analysis; the human scholar need only supply a high-level research direction and evaluate the outputs. The method successfully replicates two existing studies with high quantitative consistency and uncovers a diachronic replacement chain and semantic evolution pathways among English intensifiers. In doing so, it substantially enhances the falsifiability, precision, and evidential grounding of linguistic discovery.
📝 Abstract
Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results, a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining the analysis across multiple rounds. The human researcher sets the direction and evaluates the final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only "investigate English intensifiers," the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens), Claridge (2025) and De Smet (2013), with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.
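The abstract's investigative cycle can be sketched in miniature. The following Python is purely illustrative, not the authors' implementation: `run_cqp_query` is a hypothetical stand-in for the MCP tool call that would dispatch a real CQP query to an indexed corpus, and the toy "periods" stand in for diachronic sub-corpora of the Gutenberg data. The point is the shape of the loop, query per hypothesis, then normalized frequencies the agent can compare across periods.

```python
def run_cqp_query(corpus, token):
    """Hypothetical stand-in for an MCP-mediated CQP query.

    A real implementation would send a CQP pattern (e.g. [word="so"] [pos="JJ"])
    to the query engine; here we just count exact token matches in a list.
    """
    return sum(1 for t in corpus if t == token)

def per_million(hits, corpus_size):
    """Normalized frequency per million tokens, the standard corpus measure."""
    return hits / corpus_size * 1_000_000

# Toy diachronic sub-corpora (invented counts, for illustration only).
periods = {
    "1800-1850": ["so"] * 50 + ["very"] * 30 + ["really"] * 5 + ["the"] * 915,
    "1850-1900": ["so"] * 30 + ["very"] * 45 + ["really"] * 15 + ["the"] * 910,
}

# One round of the agent's cycle: query each candidate intensifier per period,
# then compare normalized frequencies to probe the so > very > really relay.
for period, corpus in periods.items():
    for intensifier in ("so", "very", "really"):
        hits = run_cqp_query(corpus, intensifier)
        print(f"{period} {intensifier}: {hits} hits, "
              f"{per_million(hits, len(corpus)):.0f} per million")
```

In the actual framework the loop is driven by the LLM itself: it chooses the next query from the previous round's results, which is what distinguishes agent-driven inquiry from a fixed query script.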