From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
This work investigates how alignment of large language models (LLMs) alters output distributions, whether it reduces response diversity, and whether it introduces information that cannot be recovered from the base model via in-context learning (ICL). Method: Building on the "Superficial Alignment Hypothesis," which posits that current alignment techniques primarily filter, rather than expand, the base model's intrinsic capabilities, the authors use distributional analysis, diversity metrics, cross-model similarity assessment, and semantic prompt engineering to test whether aligned behavior can be reconstructed from the base model using few-shot ICL and low-resolution semantic hints. Contribution/Results: Empirical findings show that alignment neither suppresses useful information nor adds capabilities absent from the base model; aligned behavior is recoverable from the base model with ICL alone, without fine-tuning. Responses elicited this way are as similar to aligned-model responses as aligned models' responses are to each other. The core contribution is characterizing alignment as capability filtering and information aggregation rather than capability augmentation.
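The few-shot ICL setup the summary describes can be sketched as simple prompt assembly: prepend example instruction/response pairs and a brief semantic hint, then let the base (non-aligned) model continue. The function name, example pairs, and prompt formatting below are hypothetical illustrations, not the paper's actual prompts.

```python
# Sketch of in-context alignment prompt assembly for a base LLM.
# Formatting and examples are illustrative, not the paper's prompts.

def build_icl_prompt(examples, instruction, hint=None):
    """Assemble a few-shot prompt from (instruction, response) pairs,
    optionally adding a low-resolution semantic hint about content."""
    parts = []
    for ex_instruction, ex_response in examples:
        parts.append(f"Instruction: {ex_instruction}\nResponse: {ex_response}\n")
    if hint:  # coarse hint about what the response should cover
        parts.append(f"Hint: {hint}\n")
    parts.append(f"Instruction: {instruction}\nResponse:")
    return "\n".join(parts)

examples = [
    ("Name a primary color.", "Red is one of the primary colors."),
]
prompt = build_icl_prompt(examples, "Name a planet.", hint="mention Mars")
```

The resulting string ends at `Response:`, so a base model's continuation plays the role of an aligned model's answer.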

📝 Abstract
The alignment process changes several properties of a large language model's (LLM's) output distribution. We analyze two aspects of post-alignment distributional shift of LLM responses. First, we re-examine previously reported reductions in response diversity post-alignment. Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation. Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response. Finding little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models? Our second investigation shows this is not the case and the behavior of aligned models is recoverable from base models without fine-tuning. A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other. Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis. They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning. Our code and data are available at https://github.com/thomlake/investigating-alignment.
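The abstract's key comparison (ICL responses are as similar to aligned-model responses as aligned models are to each other) requires some pairwise response-similarity measure. As a minimal sketch, the snippet below uses token-level Jaccard overlap; this metric choice is an illustrative stand-in, not necessarily the one the paper uses.

```python
# Sketch of cross-model response similarity: token-set Jaccard overlap,
# averaged over all response pairs. An illustrative stand-in metric.
from itertools import combinations


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(responses) -> float:
    """Average Jaccard similarity over all unordered response pairs."""
    pairs = list(combinations(range(len(responses)), 2))
    return sum(jaccard_similarity(responses[i], responses[j])
               for i, j in pairs) / len(pairs)
```

Comparing `mean_pairwise_similarity` within a pool of aligned-model responses against a mixed pool of aligned and ICL-elicited responses mirrors the abstract's comparison.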
Problem

Research questions and friction points this paper is trying to address.

Analyzes post-alignment distributional shift in LLM responses
Examines whether alignment suppresses useful information or merely reduces apparent diversity
Tests recoverability of aligned model behavior from base models
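The diversity question in the bullets above presupposes a quantitative diversity metric over sampled responses. A common, simple choice is distinct-n, the fraction of unique n-grams across a response set; the sketch below uses it as an illustrative example, not necessarily the paper's exact metric.

```python
# Sketch of a response-diversity measure: distinct-n, the ratio of
# unique n-grams to total n-grams across a set of sampled responses.
# A post-alignment drop in distinct-n is the kind of apparent
# diversity reduction being re-examined.

def distinct_n(responses, n=2) -> float:
    """Fraction of unique n-grams across all responses (0 if empty)."""
    ngrams, total = set(), 0
    for r in responses:
        toks = r.lower().split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        ngrams.update(grams)
        total += len(grams)
    return len(ngrams) / total if total else 0.0
```

Per the paper's framing, a lower score for aligned-model samples can reflect aggregation of content into fewer, longer responses rather than lost information.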
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alignment shifts output to longer, diverse responses
Base models recover aligned behavior without fine-tuning
In-context examples imitate aligned models effectively