🤖 AI Summary
Existing public code datasets lack verifiable quality assurance: static dataset cards are neither auditable nor backed by statistical guarantees, forcing teams to build ad-hoc cleaning pipelines independently, which raises costs and undermines trust. Method: We propose SIEVE, a framework that turns community-driven attribute checks into machine-readable, anytime-valid Confidence Cards. SIEVE replaces narrative descriptions with statistically bounded quality assertions and combines distributed auditing with standardized metadata schemas. Contribution/Results: SIEVE is the first framework to enable dynamic, auditable, low-cost certification of code dataset quality, improving transparency and reproducibility while advancing standardization of dataset quality certification. Empirical evaluation demonstrates more than an 80% reduction in redundant cleaning overhead across diverse code-data curation workflows.
📝 Abstract
Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' are informative, but they are neither auditable nor backed by statistical guarantees, making it difficult to attest to dataset quality. As a result, teams build isolated, ad-hoc cleaning pipelines, fragmenting effort and raising costs. We present SIEVE, a community-driven framework that turns per-property checks into Confidence Cards: machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code datasets.
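The abstract does not specify how a Confidence Card is computed, so the following is only an illustrative sketch. It assumes each community audit of a property (here, a hypothetical `license_compliant` check) is a Bernoulli pass/fail, and derives an anytime-valid interval from Hoeffding's inequality made time-uniform via a union bound over sample sizes (per-n error budget α·6/(π²n²), so the total error over all n stays below α). The card structure, field names, and property name are all assumptions, not SIEVE's actual schema.

```python
import json
import math
import random

def anytime_bound(n: int, alpha: float = 0.05) -> float:
    """Half-width of a time-uniform Hoeffding interval.

    Hoeffding: P(|phat_n - p| > eps_n) <= 2*exp(-2*n*eps_n^2).
    Setting that tail to alpha*6/(pi^2 * n^2) and summing over all n
    keeps the total error probability below alpha, so the interval is
    valid simultaneously for every sample size (anytime-valid).
    """
    return math.sqrt(math.log(math.pi**2 * n**2 / (3 * alpha)) / (2 * n))

def confidence_card(prop: str, checks: list[int], alpha: float = 0.05) -> dict:
    """Turn a stream of pass/fail audit checks into a machine-readable card."""
    n = len(checks)
    phat = sum(checks) / n          # observed pass rate
    eps = anytime_bound(n, alpha)   # anytime-valid half-width
    return {
        "property": prop,
        "n_audited": n,
        "estimate": round(phat, 4),
        "lower": round(max(0.0, phat - eps), 4),
        "upper": round(min(1.0, phat + eps), 4),
        "alpha": alpha,
        "anytime_valid": True,
    }

# Simulated community audits: ~90% of sampled files pass the check.
random.seed(0)
checks = [1 if random.random() < 0.9 else 0 for _ in range(2000)]
card = confidence_card("license_compliant", checks)
print(json.dumps(card, indent=2))
```

Because the bound holds uniformly over n, auditors can append new checks and re-emit the card at any time without invalidating the stated coverage, which is the behavior "anytime-valid" refers to; the interval simply tightens as more audits accumulate.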