AI Summary
To address the prohibitive computational cost and scalability limitations of large-parameter classical Transformer models, this paper proposes SASQuaTCh, a variational quantum Transformer architecture that leverages kernel methods and multidimensional quantum Fourier transforms. It implements the self-attention mechanism via parameterized quantum circuits, introducing for the first time learnable quantum self-attention gates embedded within a quantum kernel framework. Theoretically, SASQuaTCh achieves exponential compression of parameter complexity relative to classical Transformers. Experimentally, it attains high-accuracy embedding and classification of grayscale handwritten-digit images using only nine qubits. The approach is validated both on classical simulators of quantum circuits and on real quantum hardware, demonstrating significant reductions in parameter count and runtime complexity. This work establishes a novel paradigm for lightweight, scalable, quantum-enhanced sequence modeling.
Abstract
The recent explosive growth in the size of state-of-the-art machine learning models highlights a well-known issue: exponential parameter growth, which has reached the trillions in the case of the Generative Pre-trained Transformer (GPT), leads to training-time and memory requirements that limit advancement in the near term. The predominant models use the so-called transformer network and have a wide field of applicability, including predicting text and images, classification, and even predicting solutions to the dynamics of physical systems. Here we present a variational quantum circuit architecture named Self-Attention Sequential Quantum Transformer Channel (SASQuaTCh), which builds networks of qubits that perform operations analogous to those of the transformer network, namely the keystone self-attention operation, and leads to an exponential improvement in parameter complexity and run-time complexity over its classical counterpart. Our approach leverages recent insights from kernel-based operator learning in the context of predicting spatiotemporal systems to represent deep layers of a vision transformer network using simple gate operations and a set of multidimensional quantum Fourier transforms. To validate our approach, we consider image classification tasks in simulation and on hardware, where with only 9 qubits and a handful of parameters we are able to simultaneously embed and classify grayscale images of handwritten digits with high accuracy.
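The kernel-based operator-learning idea referenced above has a simple classical analogue: instead of computing pairwise attention scores, mix the token sequence with a learnable pointwise filter applied in Fourier space (as in Fourier neural operators and FNet-style token mixing). A minimal NumPy sketch of that classical analogue follows; the function name and shapes are illustrative, not from the paper, which realizes the Fourier transforms and the learnable filter with quantum gates.

```python
import numpy as np

def fourier_kernel_layer(x, weights):
    """Classical sketch of Fourier-space token mixing: IFFT(W * FFT(x)).

    x       : (seq_len, d) array of token features.
    weights : (seq_len, d) learnable diagonal kernel, one weight per mode.
    """
    x_hat = np.fft.fft(x, axis=0)   # transform along the sequence axis
    y_hat = weights * x_hat         # learnable pointwise (diagonal) kernel
    y = np.fft.ifft(y_hat, axis=0)  # return to the sequence domain
    return y.real                   # keep the real part of the mixed tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))         # 8 tokens, 4 features each
w = rng.normal(size=(8, 4))         # one filter weight per Fourier mode
y = fourier_kernel_layer(x, w)
print(y.shape)                      # (8, 4)
```

Because the diagonal multiply in Fourier space replaces the quadratic query-key interaction, the parameter count scales with the number of retained modes rather than with pairwise token interactions, which is the compression the quantum circuit pushes further.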