Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow

📅 2023-06-12
🏛️ arXiv.org
📈 Citations: 50
Influential: 1
🤖 AI Summary
Existing LLM-driven workflows exhibit significant limitations in complex numerical computation, tabular data manipulation, and long-context processing—critical for real-time analytics over massive domain-specific datasets in finance, meteorology, and energy. To address this, we propose an autonomous analytical agent framework tailored for industry-scale data, introducing a novel “code-centric + pre-exploratory interface” dual paradigm. By abstracting interfaces through proactive data probing and compilation-aware validation, the framework enables end-to-end, zero-code execution—from natural-language queries to robust data processing and structured visualization. It tightly integrates large language models, automated code generation, and a structured visualization engine, substantially reducing analytical error rates while improving response latency and interpretability. The framework is open-sourced and empirically validated on large-scale Chinese financial data—including stocks, mutual funds, and news—demonstrating strong efficacy and generalizability across domain-specific analytical tasks.
📝 Abstract
Industries such as finance, meteorology, and energy generate vast amounts of data daily. Efficiently managing, processing, and displaying this data requires specialized expertise and is often tedious and repetitive. Leveraging large language models (LLMs) to develop an automated workflow presents a highly promising solution. However, LLMs are not adept at handling complex numerical computations and table manipulations and are also constrained by a limited context budget. To address this, we propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests. The advancements are twofold: First, it is a code-centric agent that receives human requests and generates code as an intermediary to handle massive data, which makes it highly flexible for large-scale data processing tasks. Second, Data-Copilot involves a data exploration phase in advance, which explores how to design more universal and error-free interfaces for real-time response. Specifically, it actively explores data sources, discovers numerous common requests, and abstracts them into many universal interfaces for daily invocation. When deployed on real-time requests, Data-Copilot only needs to invoke these pre-designed interfaces, transforming raw data into visualized outputs (e.g., charts, tables) that best match the user's intent. Compared to generating code from scratch, invoking these pre-designed and compiler-validated interfaces significantly reduces errors during real-time requests. Additionally, interface workflows are more efficient and offer greater interpretability than raw code. We open-sourced Data-Copilot with massive Chinese financial data, such as stocks, funds, and news, demonstrating promising application prospects.
Problem

Research questions and friction points this paper is trying to address.

How to autonomously manage and process vast amounts of industry data efficiently
How to overcome LLM limitations in numerical computation and table manipulation
How to transform raw data into user-tailored visual outputs via pre-designed interfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to automate data workflows
Generates code for large-scale data processing
Pre-designs universal interfaces for real-time response
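The two-phase design above, pre-designing validated interfaces offline, then composing them at request time instead of generating fresh code, can be sketched as follows. This is a minimal illustration of the idea only; all function and variable names are hypothetical and do not come from the released Data-Copilot codebase.

```python
# Hypothetical sketch of the "pre-designed interface" paradigm:
# interfaces are registered once (design phase), then an LLM-planned
# workflow invokes them in sequence (deployment phase). Names here
# are illustrative, not from the actual implementation.

INTERFACES = {}

def register(name, description):
    """Design phase: store a reusable, pre-validated interface."""
    def wrap(fn):
        INTERFACES[name] = {"fn": fn, "doc": description}
        return fn
    return wrap

@register("query_stock", "Fetch daily closing prices for a ticker")
def query_stock(ticker, days=5):
    # Stand-in for a real data-source call (e.g. a market-data API).
    return [{"ticker": ticker, "day": d, "close": 100.0 + d}
            for d in range(days)]

@register("compute_return", "Cumulative return over a price series")
def compute_return(rows):
    first, last = rows[0]["close"], rows[-1]["close"]
    return (last - first) / first

def run_workflow(steps):
    """Deployment phase: invoke pre-designed interfaces in order,
    piping each intermediate result into the next step."""
    result = None
    for name, kwargs in steps:
        fn = INTERFACES[name]["fn"]
        result = fn(result, **kwargs) if result is not None else fn(**kwargs)
    return result

# A workflow the planner might emit for "what was AAPL's 5-day return?"
workflow = [("query_stock", {"ticker": "AAPL", "days": 5}),
            ("compute_return", {})]
print(run_workflow(workflow))  # 0.04
```

Because every interface in the registry has already been exercised and validated before deployment, real-time requests only need a correct *plan* (the `workflow` list) rather than correct freshly generated code, which is where the paper's claimed error reduction comes from.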
Wenqi Zhang
Zhejiang University
Language Model · Multimodal Learning · Embodied Agents
Yongliang Shen
College of Computer Science and Technology, Zhejiang University
Weiming Lu
Zhejiang University
Natural Language Processing · Large Language Models · AGI
Y. Zhuang
College of Computer Science and Technology, Zhejiang University