Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies

📅 2025-03-26
🤖 AI Summary
Problem: Regex composition frequently introduces defects, performance degradation, and security vulnerabilities, and whether reusing existing regexes can match dedicated synthesis had not been evaluated systematically. Method: We introduce a large-scale, real-world regex composition dataset comprising 55,137 unique tasks mined from GitHub and RegExLib, and propose reuse-by-example, a Programming by Example approach that draws on a curated database of production-ready regexes. We systematically compare reuse-by-example against formal regex synthesizers and large language models (LLMs) across accuracy, efficiency, readability, and security. Contribution/Results: All approaches can solve the composition tasks accurately, but reuse-by-example and LLMs both do far better than formal synthesizers over the range of metrics applied. The findings suggest that, for regex composition, reusing high-quality existing regexes may be the cheaper and more practical choice.

📝 Abstract
Composing regular expressions (regexes) is a common but challenging engineering activity. Software engineers struggle with regex complexity, leading to defects, performance issues, and security vulnerabilities. Researchers have proposed tools to synthesize regexes automatically, and recent generative AI techniques are also promising. Meanwhile, developers commonly reuse existing regexes from Internet sources and codebases. In this study, we ask a simple question: are regex composition tasks unique enough to merit dedicated machinery, or is reuse all we need? We answer this question through a systematic evaluation of state-of-the-art regex reuse and synthesis strategies. We begin by collecting a novel dataset of regex composition tasks mined from GitHub and RegExLib (55,137 unique tasks with solution regexes). To address the absence of an automated regex reuse formulation, we introduce reuse-by-example, a Programming by Example (PbE) approach that leverages a curated database of production-ready regexes. Our evaluation then uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches: formal regex synthesizers and generative AI (LLMs). Although all approaches can solve these composition tasks accurately, reuse-by-example and LLMs both do far better over the range of metrics we applied. Ceteris paribus, prefer the cheaper solution: for regex composition, perhaps reuse is all you need. Our findings provide actionable insights for developers selecting regex composition strategies and inform the design of future tools to improve regex reliability in software systems.
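
The reuse-by-example idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny in-memory candidate list, the function name `reuse_by_example`, and the full-match acceptance rule are all assumptions made for illustration. The sketch takes positive and negative example strings and returns the existing regexes that accept every positive and reject every negative.

```python
import re
from typing import Iterable, List

# Hypothetical stand-in for a curated database of production-ready regexes.
CANDIDATE_REGEXES = [
    r"\d{3}-\d{4}",
    r"[A-Za-z]+",
    r"\d{4}-\d{2}-\d{2}",
    r"[\w.+-]+@[\w-]+\.[\w.]+",
]


def reuse_by_example(
    candidates: Iterable[str],
    positives: Iterable[str],
    negatives: Iterable[str],
) -> List[str]:
    """Return candidates that fully match all positives and no negatives.

    Assumption: full-string matching is the acceptance rule; a real
    system may use a different matching or ranking criterion.
    """
    positives = list(positives)
    negatives = list(negatives)
    hits = []
    for pattern in candidates:
        try:
            compiled = re.compile(pattern)
        except re.error:
            # Skip candidates that are not valid in Python's regex dialect.
            continue
        accepts_all = all(compiled.fullmatch(s) for s in positives)
        rejects_all = not any(compiled.fullmatch(s) for s in negatives)
        if accepts_all and rejects_all:
            hits.append(pattern)
    return hits


matches = reuse_by_example(
    CANDIDATE_REGEXES,
    positives=["2025-03-26", "1999-12-31"],
    negatives=["03/26/2025", "2025-3-26"],
)
print(matches)
```

Here only the ISO-style date pattern survives the filter. The design choice worth noting is that nothing is synthesized: the cost is one pass over the candidate set, which is the sense in which reuse is the cheaper option.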

Problem

Research questions and friction points this paper is trying to address.

Evaluates regex composition strategies for software engineering
Compares regex reuse versus synthesis techniques
Assesses performance and reliability of regex solutions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces reuse-by-example for regex composition
Compares regex reuse with synthesis and LLMs
Uses novel dataset from GitHub and RegExLib

Authors

Berk Çakar
ECE PhD Student, Purdue University
Software Engineering, Cybersecurity
Charles M. Sale
Computer Science, Purdue University, West Lafayette, IN, USA
Sophie Chen
Computer Science, University of Michigan, Ann Arbor, MI, USA
Ethan H. Burmane
Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
Dongyoon Lee
Associate Professor of Computer Science, Stony Brook University
Computer Systems, Software Reliability, Program Analysis, Concurrency, Computer Architecture
James C. Davis
Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA