🤖 AI Summary
To address the computational and memory bottlenecks of standard Transformers in real-time long-dialogue understanding, which stem from the quadratic complexity of self-attention, this work systematically evaluates efficient Transformer variants (e.g., Performer, Reformer) alongside a lightweight CNN-based encoder. The experiments show that CNN-based architectures deliver superior efficiency (roughly 2.6× faster training, 80% faster inference, and 72% lower memory consumption) while remaining competitive in accuracy on both real-world customer service dialogues and the Long Range Arena (LRA) benchmark. This challenges the prevailing reliance on Transformers for long-sequence modeling and positions CNNs as an efficient, scalable alternative for real-time semantic understanding under resource constraints.
📝 Abstract
Analyzing long text data such as customer call transcripts is a cost-intensive and tedious task. Machine learning methods, notably Transformers, are leveraged to model agent-customer interactions. Unfortunately, Transformers adhere to fixed-length architectures, and their self-attention mechanism scales quadratically with input length. These limitations make it challenging to apply traditional Transformers to long-sequence tasks, such as conversational understanding, especially in real-time use cases. In this paper, we explore and evaluate recently proposed efficient Transformer variants (e.g., Performer, Reformer) and a CNN-based architecture for real-time and near-real-time long conversational understanding tasks. We show that CNN-based models are dynamic, ~2.6x faster to train, ~80% faster at inference, and ~72% more memory efficient than Transformers on average. Additionally, we evaluate the CNN model on the Long Range Arena benchmark to demonstrate its competitiveness in general long-document analysis.
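The quadratic bottleneck the abstract refers to can be illustrated with a toy cost model (a minimal sketch; the function names and constants below are illustrative assumptions, not from the paper): self-attention materializes an n × n score matrix per head, so cost grows quadratically with sequence length n, whereas a 1D convolution with a fixed kernel width grows only linearly in n.

```python
def attention_score_elems(n: int) -> int:
    """Elements in one head's attention score matrix.

    Self-attention compares every token with every other token,
    so memory and compute grow as O(n^2) in sequence length n.
    """
    return n * n


def conv1d_mac_ops(n: int, kernel: int, channels: int) -> int:
    """Multiply-accumulate ops for a 1D convolution over the sequence.

    A width-`kernel` filter slides across the n positions, so cost
    grows as O(n) for fixed kernel width and channel count.
    """
    return n * kernel * channels * channels


# Doubling the sequence length quadruples attention cost
# but only doubles convolution cost.
for n in (1_000, 4_000, 16_000):
    print(n, attention_score_elems(n), conv1d_mac_ops(n, kernel=7, channels=64))
```

Going from n = 1,000 to n = 4,000 tokens multiplies the attention score matrix by 16× while the convolution cost grows only 4×, which is the scaling gap that motivates both the efficient-attention variants and the CNN encoder evaluated in the paper.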