Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

📅 2025-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates how pre-pretraining on formal languages imparts structural inductive biases that aid natural language acquisition in Transformer models. Addressing the open question of which formal language features transfer effectively, the authors posit that effective transfer requires both structural fidelity (capturing the dependency structures of natural language) and computational compatibility (staying within the computational limits of the Transformer architecture). They test this hypothesis by pre-pretraining on structured formal languages, including Dyck and counter languages, and evaluating via syntactic benchmarks, attention-head interpretability analysis, and loss-trajectory comparisons. Experiments on a 1B-parameter model show that formal-language pre-pretraining reaches the same training loss using only 67% of the natural language token budget while substantially improving syntactic generalization. Mechanistic analysis further shows that attention heads acquired during formal-language pretraining remain crucial for grammatical structure recognition in downstream natural language tasks.

📝 Abstract
Pretraining language models on formal languages can improve their acquisition of natural language, but it is unclear which features of the formal language impart an inductive bias that leads to effective transfer. Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when the formal language both captures dependency structures in natural language and remains within the computational limitations of the model architecture. Focusing on transformers, we find that formal languages with both these properties enable language models to achieve lower loss on natural language and better linguistic generalization compared to other languages. In fact, pre-pretraining, or training on formal-then-natural language, reduces loss more efficiently than the same amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. We also give mechanistic evidence of cross-task transfer from formal to natural language: attention heads acquired during formal language pretraining remain crucial for the model's performance on syntactic evaluations.
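To make the formal languages in the abstract concrete, the sketch below samples and recognizes strings of a Dyck-k language (well-nested brackets of k types), one of the language families mentioned above. This is an illustrative toy, not the paper's actual data-generation pipeline; the function names and sampling parameters (`p_open`, `max_len`) are assumptions.

```python
import random

# Bracket types available for Dyck-k (k <= 4 in this sketch).
PAIRS = [("(", ")"), ("[", "]"), ("{", "}"), ("<", ">")]

def sample_dyck(k: int, max_len: int, p_open: float = 0.5) -> str:
    """Sample a well-nested bracket string from Dyck-k, of length <= max_len."""
    out, stack = [], []
    while len(out) + len(stack) + 2 <= max_len:
        if stack and random.random() > p_open:
            out.append(stack.pop())            # close the most recent open bracket
        else:
            opener, closer = random.choice(PAIRS[:k])
            out.append(opener)
            stack.append(closer)               # remember the matching closer
    out.extend(reversed(stack))                # close everything still open
    return "".join(out)

def is_dyck(s: str) -> bool:
    """Check well-nestedness with an explicit stack."""
    close = dict(PAIRS)
    stack = []
    for ch in s:
        if ch in close:
            stack.append(close[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack
```

A stack suffices to recognize Dyck languages, which is one reason they are a natural probe of whether an architecture can track hierarchical, long-range dependencies.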
Problem

Research questions and friction points this paper is trying to address.

Improve natural language acquisition via formal-language pretraining.
Identify which formal language features enable effective transfer.
Make transformer pretraining more token-efficient via formal-language pre-pretraining.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-pretraining on formal languages.
Focus on the Transformer architecture and its computational limits.
Mechanistic evidence of cross-task transfer.
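The mechanistic evidence above, that attention heads acquired during formal-language pretraining remain crucial downstream, is typically obtained by ablating individual heads and measuring the resulting performance drop. The NumPy sketch below zero-ablates a chosen head in a minimal multi-head self-attention layer; all names, shapes, and the ablation strategy are assumptions for illustration, not the paper's actual probing code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, ablate=()):
    """Toy multi-head self-attention.

    Heads whose index appears in `ablate` have their output zeroed,
    mimicking a zero-ablation probe of head importance.
    x:  (T, d_model) token representations
    Wq, Wk, Wv: (n_heads, d_model, d_head) projection weights
    Wo: (d_model, d_model) output projection
    """
    n_heads, _, d_head = Wq.shape
    heads = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))  # (T, T) attention weights
        out = attn @ v                             # (T, d_head) head output
        if h in ablate:
            out = np.zeros_like(out)               # zero-ablate this head
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ Wo
```

Comparing the model's syntactic-evaluation score with and without a head ablated estimates that head's causal contribution; a large drop for heads learned during formal-language pre-pretraining is the kind of evidence the paper reports.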