Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies

📅 2025-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Regex composition frequently introduces defects, performance degradation, and security vulnerabilities, yet whether reusing existing regexes is sufficient, or whether dedicated synthesis machinery is needed, has not been tested empirically. Method: We collect a large, real-world dataset of 55,137 unique regex composition tasks mined from GitHub and RegExLib, and introduce reuse-by-example, a Programming by Example (PbE) approach that draws on a curated database of production-ready regexes. We systematically compare reuse-by-example against formal regex synthesizers and large language models (LLMs) across accuracy, efficiency, readability, and security. Contribution/Results: All approaches solve the composition tasks accurately, but reuse-by-example and LLMs do far better than formal synthesizers over the range of metrics applied. Other things being equal, the cheaper solution is preferable, so for most regex composition tasks high-quality reuse may be all that is needed.

📝 Abstract

Composing regular expressions (regexes) is a common but challenging engineering activity. Software engineers struggle with regex complexity, leading to defects, performance issues, and security vulnerabilities. Researchers have proposed tools to synthesize regexes automatically, and recent generative AI techniques are also promising. Meanwhile, developers commonly reuse existing regexes from Internet sources and codebases. In this study, we ask a simple question: are regex composition tasks unique enough to merit dedicated machinery, or is reuse all we need? We answer this question through a systematic evaluation of state-of-the-art regex reuse and synthesis strategies. We begin by collecting a novel dataset of regex composition tasks mined from GitHub and RegExLib (55,137 unique tasks with solution regexes). To address the absence of an automated regex reuse formulation, we introduce reuse-by-example, a Programming by Example (PbE) approach that leverages a curated database of production-ready regexes. Our evaluation then uses multiple dimensions, including a novel metric, to compare reuse-by-example against two synthesis approaches: formal regex synthesizers and generative AI (LLMs). Although all approaches can solve these composition tasks accurately, reuse-by-example and LLMs both do far better over the range of metrics we applied. Ceteris paribus, prefer the cheaper solution: for regex composition, perhaps reuse is all you need. Our findings provide actionable insights for developers selecting regex composition strategies and inform the design of future tools to improve regex reliability in software systems.
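
The abstract describes reuse-by-example only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a task is given as positive and negative example strings and that "reuse" means returning stored regexes that fully match every positive and no negative. The `CANDIDATES` list and the `reuse_by_example` function are hypothetical names introduced for illustration; the patterns are not taken from the paper's dataset.

```python
import re
from typing import Iterable, List

# Hypothetical stand-in for a curated database of production-ready regexes.
# These patterns are illustrative only.
CANDIDATES: List[str] = [
    r"\d{4}-\d{2}-\d{2}",          # ISO-like date
    r"[A-Za-z]+",                  # alphabetic word
    r"\d+",                        # unsigned integer
    r"[\w.+-]+@[\w-]+\.[\w.]+",    # rough email shape
]


def reuse_by_example(
    candidates: Iterable[str],
    positives: Iterable[str],
    negatives: Iterable[str],
) -> List[str]:
    """Return the candidates that fully match every positive example
    and reject every negative example (assumed selection criterion)."""
    positives = list(positives)
    negatives = list(negatives)
    hits: List[str] = []
    for pattern in candidates:
        try:
            compiled = re.compile(pattern)
        except re.error:
            # Skip entries that are not valid in Python's regex dialect.
            continue
        accepts_all = all(compiled.fullmatch(s) for s in positives)
        rejects_all = not any(compiled.fullmatch(s) for s in negatives)
        if accepts_all and rejects_all:
            hits.append(pattern)
    return hits


matches = reuse_by_example(
    CANDIDATES,
    positives=["2025-03-26", "1999-12-31"],
    negatives=["26/03/2025", "2025"],
)
print(matches)
```

A linear scan like this is the simplest possible retrieval strategy. A real system would likely need indexing, ranking among multiple matching candidates, and handling of regex dialect differences, none of which the abstract specifies.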

Problem

Research questions and friction points this paper is trying to address.

Evaluates regex composition strategies for software engineering

Compares regex reuse versus synthesis techniques

Assesses performance and reliability of regex solutions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces reuse-by-example for regex composition

Compares regex reuse with synthesis and LLMs

Uses novel dataset from GitHub and RegExLib

🔎 Similar Papers

SoK: A Literature and Engineering Review of Regular Expression Denial of Service