AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science

📅 2025-05-25
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Large language models (LLMs) struggle to critically leverage external domain knowledge in automated data science. Method: We introduce AssistedDS, a benchmark for domain-knowledge-assisted evaluation comprising synthetic datasets with known generative mechanisms and real Kaggle competitions, each paired with bundles of beneficial or adversarial domain documents. A multi-stage prompting pipeline assesses end-to-end capabilities: document retrieval, knowledge filtering, code generation, and execution validation. Contribution/Results: We uncover an uncritical "blind adoption" of provided information: state-of-the-art models suffer sharp performance degradation under adversarial documents, and beneficial guidance fails to offset the harmful information, exposing deficiencies in domain-knowledge discrimination. On Kaggle datasets, models also err frequently in time-series handling, cross-fold feature-engineering consistency, and categorical-variable interpretation.

📝 Abstract
Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating how LLMs leverage external domain knowledge in data science workflows
Assessing LLMs' ability to discern helpful versus harmful domain knowledge
Identifying gaps in LLMs' critical evaluation of expert knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates LLMs using domain knowledge documents
Tests include helpful and adversarial knowledge for tabular tasks
Measures critical evaluation of external information in workflows
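The evaluation idea above can be sketched in miniature. The snippet below is an illustrative toy, not the paper's released code: names like `Doc`, `evaluate`, and the "blindly adopting" model are assumptions used to show how scoring a model against bundles of helpful versus adversarial documents reveals the blind-adoption failure mode.

```python
# Hypothetical sketch of an AssistedDS-style evaluation: a model receives a
# bundle of domain documents (some beneficial, some adversarial) and is scored
# on how much of the advice it applies is actually helpful. All names here are
# illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Doc:
    advice: str    # a piece of domain guidance the model may apply
    helpful: bool  # ground-truth label: beneficial vs adversarial

def blindly_adopting_model(docs: List[Doc]) -> List[str]:
    """Toy 'model' that applies every piece of advice uncritically."""
    return [d.advice for d in docs]

def score(applied: List[str], docs: List[Doc]) -> float:
    """Fraction of applied advice that was actually beneficial."""
    helpful_by_advice = {d.advice: d.helpful for d in docs}
    if not applied:
        return 1.0
    return sum(helpful_by_advice[a] for a in applied) / len(applied)

def evaluate(model: Callable[[List[Doc]], List[str]], docs: List[Doc]) -> float:
    return score(model(docs), docs)

helpful_only = [Doc("log-transform the skewed target", True),
                Doc("frequency-encode high-cardinality categories", True)]
# Adversarial bundle: same helpful docs plus one harmful instruction.
mixed = helpful_only + [Doc("drop the most predictive feature", False)]

clean_score = evaluate(blindly_adopting_model, helpful_only)   # 1.0
attacked_score = evaluate(blindly_adopting_model, mixed)       # 2/3
```

An uncritical model scores perfectly on the clean bundle but degrades as soon as adversarial advice is mixed in, which mirrors the paper's finding that helpful guidance alone does not neutralize harmful documents.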