Classifier-Augmented Generation for Structured Workflow Prediction

📅 2025-10-10
🤖 AI Summary
Automating the translation of natural language specifications into executable ETL workflows, particularly for industrial tools like IBM DataStage, remains challenging due to complex, nonlinear workflow topologies and fine-grained operator configuration requirements.

Method: We propose the first end-to-end generative framework that jointly models workflow stages, dataflow edges, and operator attributes. Our approach integrates sentence decomposition, few-shot prompting enhanced by a stage classifier, context-aware attribute inference, and an explicit edge prediction mechanism. We further introduce a classifier-augmented generation paradigm and embed a robustness verification module.

Contribution/Results: Experiments demonstrate significant improvements over single-prompt and agent-based baselines: +18.3% in structural accuracy and +22.7% in attribute accuracy, with a 42% reduction in token consumption. The framework enables high-fidelity, interpretable, and low-code workflow generation and evaluation for enterprise data integration.

📝 Abstract
ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and the detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
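
The stage-prediction step described in the abstract (decompose the utterance, classify each sub-utterance, then prompt with examples for the predicted stage type) might look roughly like the sketch below. It is only an illustration under stated assumptions: the stage labels, keyword lists, few-shot examples, and the function names `decompose`, `classify`, and `build_prompt` are all hypothetical, and a keyword counter stands in for the trained classifier that the paper uses. No language model is called; the sketch stops at building the prompt.

```python
import re

# Hypothetical stage labels with keyword cues; a stand-in for a trained classifier.
KEYWORDS = {
    "source": {"read", "extract", "import"},
    "filter": {"keep", "filter", "only", "where"},
    "target": {"write", "store", "save"},
}

# Hypothetical few-shot examples, grouped by stage type.
FEW_SHOT = {
    "source": ["Description: read orders from the sales table -> table=sales"],
    "filter": ["Description: keep rows where amount > 100 -> condition=amount > 100"],
    "target": ["Description: write the output to the archive -> table=archive"],
}


def decompose(utterance: str) -> list[str]:
    """Split a description into sub-utterances at '.', ';' and 'then'."""
    parts = re.split(r"[.;]|,?\s+then\s+", utterance, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def classify(sub_utterance: str) -> str:
    """Pick the stage label whose keywords overlap most with the text."""
    words = set(re.findall(r"[a-z]+", sub_utterance.lower()))
    scores = {label: len(words & cues) for label, cues in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def build_prompt(sub_utterance: str, label: str) -> str:
    """Assemble a prompt that only contains examples for the predicted stage."""
    lines = [f"Stage type: {label}", *FEW_SHOT.get(label, [])]
    lines += [f"Description: {sub_utterance}", "Configuration:"]
    return "\n".join(lines)


text = "Read customers from the CRM table, then keep only active customers. Then write the result to the warehouse."
subs = decompose(text)
labels = [classify(s) for s in subs]
prompts = [build_prompt(s, label) for s, label in zip(subs, labels)]
```

Because each prompt carries only the examples for one stage type, it stays short, which is one plausible reading of how a classifier in front of the generator could reduce token usage.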

Problem

Research questions and friction points this paper is trying to address.

Automating ETL workflow creation from natural language descriptions
Predicting workflow structure and stage configurations automatically
Reducing manual configuration time and tool expertise requirements

Innovation

Methods, ideas, or system contributions that make the work stand out.

Classifier-Augmented Generation combines decomposition with classification
Edge prediction connects stages into non-linear workflows
Stage properties inferred from sub-utterance context
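
To make the edge-prediction point concrete, a non-linear workflow can be represented as a set of stages plus directed edges, and a basic structural check can be run over the predicted edges. The sketch below is an assumption, not the paper's validation module: the source does not say what its validation steps check, so acyclicity and "every edge refers to a known stage" are chosen here purely for illustration, and the function name `topological_order` and the stage names are hypothetical.

```python
from collections import deque


def topological_order(stages: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Return stages in dataflow order; raise ValueError for an invalid graph."""
    indegree = {stage: 0 for stage in stages}
    successors = {stage: [] for stage in stages}
    for src, dst in edges:
        if src not in indegree or dst not in indegree:
            raise ValueError(f"edge ({src}, {dst}) references an unknown stage")
        successors[src].append(dst)
        indegree[dst] += 1

    # Kahn's algorithm: repeatedly emit stages whose inputs are all resolved.
    ready = deque(stage for stage in stages if indegree[stage] == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in successors[stage]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(stages):
        raise ValueError("workflow contains a cycle")
    return order


# A non-linear flow: two sources feed one join, which feeds one target.
stages = ["orders", "customers", "join", "warehouse"]
edges = [("orders", "join"), ("customers", "join"), ("join", "warehouse")]
order = topological_order(stages, edges)
```

The join stage has two incoming edges, which is the kind of topology a purely sequential (linear) prediction could not express.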

Thomas Gschwind
IBM Research
Shramona Chakraborty
IBM Research
Nitin Gupta
IBM Research
Sameep Mehta
IBM Research India