Automatic Identification of Parallelizable Loops Using Transformer-Based Source Code Representations

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately identifying parallelizable loops in irregular or dynamically generated code, where traditional static analysis often falls short. To overcome this limitation, the authors propose a lightweight Transformer-based approach that directly processes raw source code sequences. By employing subword tokenization and leveraging DistilBERT, the model automatically learns contextual syntactic and semantic features without relying on handcrafted representations. Evaluated on a balanced dataset combining synthetic and real-world code with 10-fold cross-validation, the method achieves an average accuracy exceeding 99% with a low false positive rate. It significantly outperforms conventional dependence analysis and existing token-based techniques, demonstrating superior generalization capability and reliability in detecting parallelizable loops.
📝 Abstract
Automatic parallelization remains a challenging problem in software engineering, particularly in identifying code regions where loops can be safely executed in parallel on modern multi-core architectures. Traditional static analysis techniques, such as dependence analysis and polyhedral models, often struggle with irregular or dynamically structured code. In this work, we propose a Transformer-based approach to classify the parallelization potential of source code, focusing on distinguishing independent (parallelizable) loops from undefined ones. We adopt DistilBERT to process source code sequences using subword tokenization, enabling the model to capture contextual syntactic and semantic patterns without handcrafted features. The approach is evaluated on a balanced dataset combining synthetically generated loops and manually annotated real-world code, using 10-fold cross-validation and multiple performance metrics. Results show consistently high performance, with mean accuracy above 99% and low false positive rates, demonstrating robustness and reliability. Compared to prior token-based methods, the proposed approach simplifies preprocessing while improving generalization and maintaining computational efficiency. These findings highlight the potential of lightweight Transformer models for practical identification of parallelization opportunities at the loop level.
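The evaluation protocol described in the abstract (10-fold cross-validation with accuracy and false positive rate over a binary parallelizable/undefined label) can be sketched schematically. The snippet below is a minimal stdlib-only illustration, not the authors' pipeline: `train_fn` stands in for fine-tuning the DistilBERT classifier on the training folds, and the returned `model` is any callable mapping a source-code string to 1 (parallelizable) or 0 (undefined).

```python
import random

def cross_validate(samples, labels, train_fn, k=10, seed=0):
    """Schematic k-fold cross-validation: split the data into k folds,
    train on k-1 folds, evaluate on the held-out fold, and report the
    mean accuracy and mean false positive rate across folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    accuracies, fprs = [], []
    for i in range(k):
        held_out = set(folds[i])
        train_idx = [j for j in indices if j not in held_out]
        # train_fn is a placeholder for fitting the classifier
        # (e.g. fine-tuning a DistilBERT sequence classifier).
        model = train_fn([samples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        tp = tn = fp = fn = 0
        for j in folds[i]:
            pred = model(samples[j])
            if pred == 1 and labels[j] == 1:   tp += 1
            elif pred == 0 and labels[j] == 0: tn += 1
            elif pred == 1 and labels[j] == 0: fp += 1
            else:                              fn += 1
        accuracies.append((tp + tn) / (tp + tn + fp + fn))
        fprs.append(fp / (fp + tn) if (fp + tn) else 0.0)
    return sum(accuracies) / k, sum(fprs) / k
```

Reporting the false positive rate alongside accuracy matters here: a false positive marks a loop with a real dependence as parallelizable, which would produce an incorrect parallel transformation rather than merely a missed optimization.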
Problem

Research questions and friction points this paper is trying to address.

automatic parallelization
parallelizable loops
source code analysis
loop dependence
multi-core architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based code representation
automatic parallelization
loop parallelizability
DistilBERT
subword tokenization
Izavan dos S. Correia
Graduate Program in Applied Informatics (PPGIA), Federal Rural University of Pernambuco (UFRPE), Recife, Pernambuco, Brazil.
Henrique C. T. Santos
Undergraduate Program in Analysis and Systems Development (TADS), Federal Institute of Pernambuco (IFPE), Recife, Pernambuco, Brazil.
Tiago A. E. Ferreira
Full Professor, Department of Statistics and Informatics, Federal Rural University of Pernambuco.
Research interests: Intelligent Computation, Time Series Analysis and Forecasting, Quantum Computation, Computational