🤖 AI Summary
This work addresses the instability in search and recommendation systems caused by time-varying input features, which adversely affects downstream decision-making and user experience. The authors propose a feature pruning mechanism that jointly optimizes predictive performance and temporal stability: leveraging historical snapshot data to identify unstable samples, analyzing feature importance, and removing features that induce prediction fluctuations, followed by retraining the model using the remaining stable features. Evaluated on a large-scale query–app relevance task in an app marketplace, the approach significantly enhances prediction stability—evidenced by a reduced coefficient of variation—and improves classification performance, as measured by higher PR-AUC. This effectively mitigates the trade-off between the instability of interaction features and the insufficient coverage of semantic features.
📝 Abstract
In search and recommendation systems, predictive models often suffer from temporal instability when certain input features introduce volatility in output scores. This instability can degrade model reliability and user experience, especially in multi-stage systems where consistent predictions are critical for downstream decision making. We introduce Fortress, a general framework for enhancing model stability and accuracy by identifying and pruning features that contribute to inconsistent prediction scores over time. Fortress leverages historical snapshots (temporally partitioned datasets capturing score fluctuations for the same entity across periods) and follows a four-step process: (1) collect historical snapshots, (2) identify samples with unstable predictions, (3) isolate and remove instability-inducing features, and (4) retrain models using only stable features. While semantic features from LLMs and BERT-based models improve generalization, they often lack full query or entity coverage. Engagement-based features offer strong predictive power but tend to introduce temporal instability. Fortress mitigates this trade-off by suppressing the volatility of engagement signals while retaining their predictive value, leading to more stable and accurate models. We validate Fortress on a query-to-app relevance model in a large-scale app marketplace. Offline experiments demonstrate notable improvements in prediction stability (measured by Coefficient of Variation) and classification performance (measured by PR-AUC).
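As an illustration of steps (1) and (2), the sketch below computes the Coefficient of Variation (standard deviation divided by mean) of each entity's score across snapshots and flags entities whose CV exceeds a cutoff. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `(query, app)` keys, the example scores, the use of the population standard deviation, and the `threshold=0.1` cutoff are all hypothetical, since the abstract does not specify them.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Return std / |mean| for a sequence of scores (population std; an assumption)."""
    mu = mean(scores)
    if mu == 0:
        # CV is undefined for a zero mean; treat all-zero scores as perfectly stable.
        return 0.0 if all(s == 0 for s in scores) else float("inf")
    return pstdev(scores) / abs(mu)


def find_unstable(snapshots, threshold=0.1):
    """Flag entities whose score CV across snapshots exceeds `threshold`.

    `snapshots` is a list of dicts, one per period, mapping an entity key
    to the model score observed in that period. Only entities present in
    every snapshot are compared.
    """
    common = set(snapshots[0]).intersection(*snapshots[1:])
    cvs = {
        key: coefficient_of_variation([snap[key] for snap in snapshots])
        for key in common
    }
    return {key: cv for key, cv in cvs.items() if cv > threshold}


# Hypothetical scores for two (query, app) pairs over three periods.
snapshots = [
    {("weather", "app_a"): 0.80, ("weather", "app_b"): 0.50},
    {("weather", "app_a"): 0.82, ("weather", "app_b"): 0.20},
    {("weather", "app_a"): 0.78, ("weather", "app_b"): 0.80},
]

unstable = find_unstable(snapshots)
print(unstable)  # only ("weather", "app_b") is flagged in this toy data
```

In this toy data, `app_a` varies by about 2% of its mean and is kept, while `app_b` swings between 0.20 and 0.80 and is flagged. Steps (3) and (4) would then examine which features drive the flagged samples and retrain without them; how that attribution is done is not described in the abstract and is left out of the sketch.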