🤖 AI Summary
Fine-tuning large language models (LLMs) often yields unreliable responses to out-of-distribution (OoD) queries. To address this, we propose LoRA-BAM, a lightweight, interpretable framework that embeds a box-based abstract monitor into LoRA adaptation layers to assess, at inference time, whether an input lies within the model's competence. Our key contributions are: (1) an OoD detection mechanism that combines feature-space clustering with axis-aligned bounding boxes (AABBs); (2) a semantic-consistency regularization that improves robustness to paraphrased inputs; and (3) an adaptive boundary-expansion strategy that uses intra-cluster variance to jointly optimize coverage and precision. Evaluated across multiple benchmarks, LoRA-BAM achieves over 92% OoD identification accuracy, under 3% false rejection rate, negligible inference overhead (<0.5% latency increase), and no degradation in downstream task performance, significantly outperforming state-of-the-art approaches.
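The clustering-plus-AABB mechanism above can be sketched in a few lines. This is a minimal illustration, assuming feature vectors have already been extracted and assigned cluster labels (e.g., by k-means); the function names and the `margin_scale` parameter are illustrative assumptions, not the paper's API.

```python
import numpy as np

def build_boxes(features, labels, margin_scale=1.0):
    """Enclose each feature cluster in an axis-aligned bounding box (AABB),
    enlarged per dimension by a variance-aware margin (illustrative)."""
    boxes = []
    for k in np.unique(labels):
        pts = features[labels == k]
        # Intra-cluster standard deviation drives the boundary expansion.
        margin = margin_scale * pts.std(axis=0)
        boxes.append((pts.min(axis=0) - margin, pts.max(axis=0) + margin))
    return boxes

def is_ood(x, boxes):
    """A query is flagged OoD when its feature vector lies outside every box."""
    return not any(np.all(x >= lo) and np.all(x <= hi) for lo, hi in boxes)
```

Because each box is just a pair of per-dimension bounds, a rejection is directly interpretable: one can report which dimensions of the query's feature vector fell outside every cluster's range.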
📝 Abstract
Fine-tuning large language models (LLMs) improves performance on domain-specific tasks but can lead to overfitting, making the models unreliable on out-of-distribution (OoD) queries. We propose LoRA-BAM, a method that adds an OoD-detection monitor to the LoRA layer using boxed abstraction to filter out questions beyond the model's competence. Feature vectors extracted by the LLM from the fine-tuning data are clustered, and each cluster is enclosed in a box; a question is flagged as OoD if its feature vector falls outside all boxes. To improve interpretability and robustness, we introduce a regularization loss during fine-tuning that encourages paraphrased questions to stay close in feature space, and we enlarge each box's decision boundary based on the feature variance within its cluster. Our method complements existing defenses by providing lightweight and interpretable OoD detection.
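The paraphrase-consistency regularizer described above can be sketched as a penalty on the feature-space distance between each question and its paraphrase. This is a simplified NumPy illustration; the function names and the weighting coefficient `lam` are hypothetical, and in practice the term would be added to the fine-tuning objective over the model's differentiable features.

```python
import numpy as np

def paraphrase_consistency_loss(feat_q, feat_para):
    """Mean squared feature-space distance between each question and its
    paraphrase; minimizing this pulls paraphrases together (illustrative)."""
    return float(np.mean(np.sum((feat_q - feat_para) ** 2, axis=-1)))

def total_loss(task_loss, feat_q, feat_para, lam=0.1):
    """Hypothetical combined objective: task loss plus the weighted
    consistency term, with lam as an assumed trade-off coefficient."""
    return task_loss + lam * paraphrase_consistency_loss(feat_q, feat_para)
```

Keeping paraphrases close in feature space matters for the monitor: if a rephrased in-domain question drifted far from its cluster, it would spuriously land outside all boxes and be rejected.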