Categorising SME Bank Transactions with Machine Learning and Synthetic Data Generation

📅 2025-08-07

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

SME banking transaction texts exhibit severe abbreviation, contextual ambiguity, and extreme label imbalance, resulting in classification difficulty, sparse and noisy annotations. To address these challenges, we propose an end-to-end transaction classification framework: (1) a context-aware, semantics-preserving synthetic data generation method to mitigate data sparsity and distributional shift; and (2) a distribution alignment–driven model calibration mechanism to enhance cross-domain generalization. Evaluated on a real-world SME dataset, our model achieves 73.49% overall accuracy and over 90% accuracy on high-confidence predictions—substantially outperforming baseline methods. This work is the first to jointly integrate semantics-controllable synthetic augmentation with calibration-guided distribution alignment, delivering a deployable, robust solution for risk assessment in cash-flow-based lending.

Technology Category

Application Category

📝 Abstract

Despite their significant economic contributions, Small and Medium Enterprises (SMEs) face persistent barriers to securing traditional financing due to information asymmetries. Cash flow lending has emerged as a promising alternative, but its effectiveness depends on accurate modelling of transaction-level data. The main challenge in SME transaction analysis lies in the unstructured nature of textual descriptions, characterised by extreme abbreviations, limited context, and imbalanced label distributions. While consumer transaction descriptions often show significant commonalities across individuals, SME transaction descriptions are typically nonstandard and inconsistent across businesses and industries. To address some of these challenges, we propose a bank categorisation pipeline that leverages synthetic data generation to augment existing transaction data sets. Our approach comprises three core components: (1) a synthetic data generation module that replicates transaction properties while preserving context and semantic meaning; (2) a fine-tuned classification model trained on this enriched dataset; and (3) a calibration methodology that aligns model outputs with real-world label distributions. Experimental results demonstrate that our approach achieves 73.49% (+-5.09) standard accuracy on held-out data, with high-confidence predictions reaching 90.36% (+-6.52) accuracy. The model exhibits robust generalisation across different types of SMEs and transactions, which makes it suitable for practical deployment in cash-flow lending applications. By addressing core data challenges, namely, scarcity, noise, and imbalance, our framework provides a practical solution to build robust classification systems in data-sparse SME lending contexts.

Problem

Research questions and friction points this paper is trying to address.

Classifying SME bank transactions with unstructured text descriptions

Overcoming data scarcity and imbalance in SME transaction analysis

Enhancing cash-flow lending via accurate transaction categorization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data generation for transaction augmentation

Fine-tuned classification model on enriched data

Calibration aligning outputs with label distributions

🔎 Similar Papers

Learning Transactions Representations for Information Management in Banks: Mastering Local, Global, and External Knowledge