🤖 AI Summary
This study addresses the challenge of balancing controllability with fluency when conditioning the output of large language models (LLMs), a trade-off that existing evaluations often overlook by focusing on conditioning effectiveness and neglecting output quality. The authors systematically evaluate representative techniques, including activation steering, prompt engineering, and supervised fine-tuning, on concept injection and removal tasks, employing both automated text metrics and LLM-as-a-judge assessments. Their analysis reveals that activation steering is far less effective on instruction-tuned models than on their base counterparts, and that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Prompting and fine-tuning are viable for injecting concepts but less effective at removing them. Notably, low-cost automatic text metrics correlate strongly with LLM-as-a-judge scores, suggesting they may serve as cost-efficient alternatives to expensive LLM-based evaluators.
📝 Abstract
Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the trade-offs involved remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are less effective at concept removal. Finally, cheaply computed textual metrics correlate strongly with costly LLM-as-judge scores and provide insights into the behavior of conditioning methods.