LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders

📅 2025-05-07
🤖 AI Summary
To address challenges in modeling ultra-long user behavior sequences in industrial recommendation systems—including difficulty in jointly capturing long- and short-term preferences, inconsistency between upstream and downstream modules, and low computational efficiency—this paper proposes an end-to-end, GPU-optimized long-sequence Transformer architecture. Key contributions include: (1) a global token mechanism enabling stable long-range attention; (2) hierarchical token compression via a lightweight InnerTransformer coupled with hybrid attention; and (3) a fully synchronized GPU training and inference framework supporting unified dense/sparse parameter updates. The method integrates mixed-precision training, activation recomputation, and KV cache optimization. Evaluated on ByteDance’s advertising and e-commerce platforms, it achieves significant offline metric improvements and yields an average +2.1% CTR gain in online A/B tests. The system has been deployed across 10+ core business services, serving over one billion users.
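The summary above describes a hybrid attention scheme in which a few global tokens see the whole sequence while ordinary tokens attend only locally. The paper's exact formulation is not given here; the following is a minimal sketch of that masking pattern, where the function name, window size, and token layout are all illustrative assumptions:

```python
# Sketch of a hybrid attention mask: a few "global" tokens attend to (and are
# attended by) every position, while ordinary tokens only see a local window
# plus the global tokens. Names and parameters are assumptions, not the
# paper's API.

def hybrid_attention_mask(seq_len, num_global, window):
    """Return mask[i][j] == True iff token i may attend to token j.

    Tokens 0..num_global-1 are global; all others attend within a
    sliding window and to the global tokens.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            if i < num_global or j < num_global:
                mask[i][j] = True   # global rows/columns: full attention
            elif abs(i - j) <= window:
                mask[i][j] = True   # local sliding window
    return mask

mask = hybrid_attention_mask(seq_len=8, num_global=2, window=1)
# Every token can reach the global tokens, giving a stable long-range path...
assert all(row[0] and row[1] for row in mask)
# ...while two distant non-global tokens are masked out from each other.
assert not mask[3][7]
```

With such a mask, per-token attention cost is bounded by the window size plus the (small, fixed) number of global tokens, rather than the full sequence length.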

📝 Abstract
Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incurring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and a hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including mixed-precision training with activation recomputation, KV cache serving, and a fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing across advertising and e-commerce services at ByteDance, validating its effectiveness and industrial-level scaling laws. LONGER has been fully deployed in more than 10 influential scenarios at ByteDance, serving over one billion users.
Problem

Research questions and friction points this paper is trying to address.

Modeling ultra-long user behavior sequences in industrial recommenders
Resolving the inconsistency and inefficiency of existing two-stage solutions
Building a GPU-efficient long-sequence Transformer that scales to production deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global token mechanism stabilizes long-context attention
Token merge module reduces quadratic complexity
Engineering optimizations enable efficient GPU-based training
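To make the second point concrete, here is a back-of-the-envelope cost model (all numbers and function names are illustrative assumptions, not figures from the paper): merging each group of k adjacent tokens into one summary token via a small InnerTransformer shrinks the outer attention's quadratic term from S² to (S/k)², at the price of a cheap linear-in-S local pass.

```python
# Rough cost model (illustrative only): full self-attention over S tokens
# costs O(S^2). Merging each group of k tokens with a small InnerTransformer
# costs (S/k) * k^2 = S*k locally, after which the outer attention runs over
# S/k merged tokens at (S/k)^2.

def attention_cost(seq_len):
    # Pairwise score count for vanilla self-attention.
    return seq_len ** 2

def merged_cost(seq_len, k):
    inner = (seq_len // k) * k ** 2   # InnerTransformer within each group
    outer = (seq_len // k) ** 2       # outer attention over merged tokens
    return inner + outer

S, k = 10_000, 8
full = attention_cost(S)      # 100,000,000 pairwise scores
merged = merged_cost(S, k)    # 80,000 + 1,562,500 = 1,642,500
print(f"speedup ~{full / merged:.0f}x")  # prints "speedup ~61x"
```

The quadratic term dominates, so even a modest merge factor k yields a large reduction; this is the kind of complexity win that makes end-to-end attention over ultra-long sequences feasible on GPU.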