Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty Transformer models have with cross-task compositional reasoning in continual learning, where they often rely on shortcut solutions that impair generalization and forward transfer. The authors extend the LEGO compositional reasoning benchmark to a continual learning setting, termed Continual LEGO, and systematically evaluate the continual learning performance of BERT (a feedforward architecture) and ALBERT (a recurrent architecture). Their analysis suggests that recurrent-structured Transformers like ALBERT have inductive biases better suited to continual compositional learning: they tend to acquire general-purpose, loop-like strategies that outperform BERT in cross-task generalization. The study also examines training strategies that combine data across experiences (a form of experience replay); these mitigate shortcut learning for ALBERT, which recovers and sustains high performance, but not for BERT.
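As a rough illustration of what "combining data across experiences" can look like in practice, the sketch below mixes samples from earlier experiences into each batch of the current one. It is a generic experience-replay sketch, not the authors' strategy; the function name `mixed_batches` and the parameters `replay_fraction` and `replay_buffer` are assumptions introduced here for illustration.

```python
import random


def mixed_batches(current, replay_buffer, batch_size, replay_fraction, rng):
    """Yield batches that combine current-experience samples with replayed ones.

    Hypothetical helper: a generic replay scheme, not the paper's method.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Number of replayed samples per batch, capped by what the buffer holds.
    n_replay = min(int(batch_size * replay_fraction), len(replay_buffer))
    # Always keep at least one current-experience sample per batch.
    n_current = max(1, batch_size - n_replay)

    order = list(current)
    rng.shuffle(order)
    for start in range(0, len(order), n_current):
        batch = order[start:start + n_current]
        # Draw replayed samples without replacement within a batch.
        batch += rng.sample(replay_buffer, n_replay)
        rng.shuffle(batch)
        yield batch


rng = random.Random(0)
experience_1 = [("exp1", i) for i in range(5)]   # earlier experience (buffer)
experience_2 = [("exp2", i) for i in range(8)]   # experience being trained on
batches = list(mixed_batches(experience_2, experience_1, 4, 0.5, rng))
```

Each current-experience sample appears exactly once per pass, while replayed samples may recur across batches.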

📝 Abstract

Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO"). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a canonical feedforward Transformer model, learns shortcut solutions that limit its ability to generalize and prevent strong forward transfer to new experiences. In contrast, we find evidence supporting the hypothesis that ALBERT, a recurrent version of BERT, learns a for-loop-like solution, which leads to better CL performance. When applying BERT and ALBERT models to a CL setting that requires composition across experiences, we find that both model families fail. Our investigation suggests that the performance drop of ALBERT models can be rescued by training strategies that combine data across experiences, but this is not true for BERT models, where a detrimental shortcut solution becomes entrenched during initial training. Our results demonstrate that the recurrent ALBERT model may have an inductive bias better suited for CL and motivate future investigation of the interplay between Transformer architecture and the computational solutions that emerge in modern models and tasks.
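To make the task and the "for-loop-like" solution concrete, here is a minimal sketch. It assumes the LEGO setup as a chain of variable assignments over the sign group {+1, -1} (for example `a = +1; b = -a; c = +b`), presented in shuffled order. The function names and the tuple encoding of clauses are hypothetical, and the resolver is a hand-written analogue of an iterative strategy, not a description of what the trained models actually compute.

```python
import random


def make_lego_chain(n_vars, rng):
    """Build one LEGO-style sample: a chain of sign assignments, then shuffle it.

    Each clause is (name, sign, parent); parent is None for the root clause.
    Assumes n_vars <= 26 so single letters can serve as variable names.
    """
    names = [chr(ord("a") + i) for i in range(n_vars)]
    rng.shuffle(names)  # randomize which variable sits where in the chain
    values, clauses, prev = {}, [], None
    for name in names:
        sign = rng.choice([+1, -1])
        if prev is None:
            values[name] = sign                 # root: name = sign * 1
        else:
            values[name] = sign * values[prev]  # name = sign * parent
        clauses.append((name, sign, prev))
        prev = name
    shuffled = clauses[:]
    rng.shuffle(shuffled)  # the model sees clauses out of order
    return shuffled, values


def resolve_iteratively(clauses):
    """For-loop-like solution: every pass applies the same update rule,
    resolving any variable whose parent is already known."""
    known = {}
    for _ in range(len(clauses)):  # chain depth is at most len(clauses)
        for name, sign, parent in clauses:
            if parent is None:
                known[name] = sign
            elif parent in known:
                known[name] = sign * known[parent]
    return known


rng = random.Random(0)
clauses, truth = make_lego_chain(6, rng)
resolved = resolve_iteratively(clauses)
```

Because the same update is reused on every pass, this procedure works for any chain length, which is the kind of generality a weight-shared (recurrent) model could in principle exploit. A shortcut solution, by contrast, would be tied to the chain lengths or patterns seen during training.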

Problem

Research questions and friction points this paper is trying to address.

continual learning

compositional reasoning

shortcut solutions

Transformer models

forward transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

continual learning

compositional reasoning

shortcut solutions