Categorising SME Bank Transactions with Machine Learning and Synthetic Data Generation

📅 2025-08-07
🤖 AI Summary
SME banking transaction texts exhibit severe abbreviation, contextual ambiguity, and extreme label imbalance, which makes them difficult to classify and leaves annotations sparse and noisy. To address these challenges, we propose an end-to-end transaction classification framework: (1) a context-aware, semantics-preserving synthetic data generation method to mitigate data sparsity and distributional shift; and (2) a calibration mechanism that aligns model outputs with real-world label distributions to improve generalization across SME types. Evaluated on held-out data from a real-world SME dataset, our model achieves 73.49% overall accuracy and over 90% accuracy on high-confidence predictions, substantially outperforming baseline methods. This work is the first to jointly integrate semantics-controllable synthetic augmentation with calibration-guided distribution alignment, delivering a deployable, robust solution for risk assessment in cash-flow-based lending.

📝 Abstract
Despite their significant economic contributions, Small and Medium Enterprises (SMEs) face persistent barriers to securing traditional financing due to information asymmetries. Cash flow lending has emerged as a promising alternative, but its effectiveness depends on accurate modelling of transaction-level data. The main challenge in SME transaction analysis lies in the unstructured nature of textual descriptions, characterised by extreme abbreviations, limited context, and imbalanced label distributions. While consumer transaction descriptions often show significant commonalities across individuals, SME transaction descriptions are typically nonstandard and inconsistent across businesses and industries. To address some of these challenges, we propose a bank transaction categorisation pipeline that leverages synthetic data generation to augment existing transaction datasets. Our approach comprises three core components: (1) a synthetic data generation module that replicates transaction properties while preserving context and semantic meaning; (2) a fine-tuned classification model trained on this enriched dataset; and (3) a calibration methodology that aligns model outputs with real-world label distributions. Experimental results demonstrate that our approach achieves 73.49% (±5.09) standard accuracy on held-out data, with high-confidence predictions reaching 90.36% (±6.52) accuracy. The model exhibits robust generalisation across different types of SMEs and transactions, which makes it suitable for practical deployment in cash-flow lending applications. By addressing core data challenges, namely scarcity, noise, and imbalance, our framework provides a practical solution for building robust classification systems in data-sparse SME lending contexts.
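The abstract reports two figures: accuracy over all held-out predictions and accuracy restricted to high-confidence predictions. The sketch below shows one common way such a pair of numbers can be computed. It is an illustration only: the function name `selective_accuracy`, the 0.9 confidence threshold, and the toy probabilities are assumptions, and the abstract does not state how the authors define "high-confidence".

```python
from typing import Optional, Sequence, Tuple


def selective_accuracy(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[Optional[float], float]:
    """Accuracy on predictions whose top probability is >= threshold.

    Returns (accuracy, coverage), where coverage is the fraction of
    examples that pass the threshold. Accuracy is None if nothing passes.
    """
    kept = 0
    correct = 0
    for row, label in zip(probs, labels):
        confidence = max(row)
        if confidence >= threshold:
            kept += 1
            predicted = list(row).index(confidence)
            correct += int(predicted == label)
    coverage = kept / len(labels) if labels else 0.0
    accuracy = correct / kept if kept else None
    return accuracy, coverage


# Made-up class probabilities for 4 transactions over 3 categories.
probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.92, 0.03],
    [0.10, 0.30, 0.60],
]
labels = [0, 1, 1, 2]

# Threshold 0.0 keeps every prediction, giving standard accuracy.
overall_acc, _ = selective_accuracy(probs, labels, threshold=0.0)
# Hypothetical threshold of 0.9 for "high-confidence" predictions.
high_acc, coverage = selective_accuracy(probs, labels, threshold=0.9)
```

Reporting coverage next to high-confidence accuracy matters in practice, because a very strict threshold can yield high accuracy on only a small share of transactions.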

Problem

Research questions and friction points this paper is trying to address.

Classifying SME bank transactions with unstructured text descriptions
Overcoming data scarcity and imbalance in SME transaction analysis
Enhancing cash-flow lending via accurate transaction categorization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data generation for transaction augmentation
Fine-tuned classification model on enriched data
Calibration aligning outputs with label distributions
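To make the third item concrete, here is a minimal sketch of one standard way to align classifier outputs with a target label distribution: rescaling predicted probabilities by the ratio of target to training class priors and renormalising. This is a generic prior-shift correction, not necessarily the calibration method used in the paper, and the function name `adjust_for_label_prior` and all numbers are hypothetical.

```python
from typing import List, Sequence


def adjust_for_label_prior(
    probs: Sequence[float],
    train_prior: Sequence[float],
    target_prior: Sequence[float],
) -> List[float]:
    """Reweight class probabilities from a training prior to a target prior.

    Each probability is multiplied by target_prior / train_prior for its
    class, then the vector is renormalised to sum to 1.
    """
    if not (len(probs) == len(train_prior) == len(target_prior)):
        raise ValueError("all inputs must have one entry per class")
    if any(s <= 0 for s in train_prior):
        raise ValueError("training priors must be strictly positive")
    scaled = [p * t / s for p, s, t in zip(probs, train_prior, target_prior)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("adjusted probabilities sum to zero")
    return [x / total for x in scaled]


# Hypothetical case: the (synthetically balanced) training set has equal
# class frequencies, while real transactions are 90% class 0.
model_output = [0.5, 0.5]
adjusted = adjust_for_label_prior(
    model_output,
    train_prior=[0.5, 0.5],
    target_prior=[0.9, 0.1],
)
```

In this toy case an uninformative prediction is pulled towards the real-world class frequencies, which is the kind of effect a calibration step is meant to have after training on augmented, rebalanced data.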
Pietro Alessandro Aluffi
University of Warwick
Brandi Jess
Navrisk
Marya Bazzi
The University of Warwick
Kate Kennedy
SME Capital
Matt Arderne
SME Capital, sea.dev
Daniel Rodrigues
SME Capital
Martin Lotz
Mathematical Institute, University of Warwick