Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

📅 2026-06-10

🤖 AI Summary

This work addresses implicit, fragmented, and recursive dependencies in large language model (LLM) development. Such dependencies are difficult to trace manually and can lead to licensing non-compliance, evaluation bias, and documentation inconsistencies. The authors propose ModSleuth, an agentic system built on a formalization of LLM dependencies that represents pipeline roles through operation-centered relationships. ModSleuth recursively extracts dependencies from public artifacts and grounds each one in source evidence, distinguishing direct from indirect dependencies and resolving artifact identities across names, versions, and repositories. Applied to four LLM releases, the system recovers 1,060 source-verified dependencies and surfaces multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies. The system and its dependency graphs are publicly released.

📝 Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
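
To make the direct/indirect distinction in the abstract concrete, here is a minimal sketch of an operation-labelled dependency graph. It is an illustration under stated assumptions, not ModSleuth's actual data model or API: the `Dependency` record, the operation labels, and every artifact name (`model-a`, `dataset-x`, and so on) are hypothetical. Direct dependencies are one hop upstream; indirect dependencies are everything reachable beyond that first hop.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One edge: `downstream` relied on `upstream` via `operation` (hypothetical schema)."""
    downstream: str
    upstream: str
    operation: str  # pipeline role, e.g. "training_data", "data_generation", "judging"


def direct_dependencies(edges, artifact):
    """Artifacts exactly one hop upstream of `artifact`."""
    return {e.upstream for e in edges if e.downstream == artifact}


def indirect_dependencies(edges, artifact):
    """Artifacts reachable upstream only through at least one intermediate artifact.

    Anything that is already a direct dependency is reported as direct only.
    """
    direct = direct_dependencies(edges, artifact)
    seen = set(direct)
    queue = deque(direct)
    indirect = set()
    while queue:
        current = queue.popleft()
        for upstream in direct_dependencies(edges, current):
            # Skip visited nodes and guard against cycles back to the root.
            if upstream not in seen and upstream != artifact:
                seen.add(upstream)
                indirect.add(upstream)
                queue.append(upstream)
    return indirect


# Hypothetical example graph; none of these names come from the paper.
EDGES = [
    Dependency("model-a", "dataset-x", "training_data"),
    Dependency("model-a", "model-c", "judging"),
    Dependency("dataset-x", "model-b", "data_generation"),
    Dependency("model-b", "dataset-y", "training_data"),
]

if __name__ == "__main__":
    print(sorted(direct_dependencies(EDGES, "model-a")))
    print(sorted(indirect_dependencies(EDGES, "model-a")))
```

In this toy graph, `model-a` depends directly on `dataset-x` and `model-c`, and indirectly on `model-b` and `dataset-y`. A multi-hop license obligation would correspond to a license attached to `model-b` or `dataset-y` that still reaches `model-a` through the intermediate dataset.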

Problem

Research questions and friction points this paper is trying to address.

LLM dependencies

model provenance

dependency auditing

invisible dependencies

artifact tracing

Innovation

Methods, ideas, or system contributions that make the work stand out.

dependency auditing

LLM provenance

ModSleuth