LinkML: An Open Data Modeling Framework

📅 2025-11-20
🤖 AI Summary
Scientific data are frequently stored in loosely structured formats, such as lab notebooks and non-standard spreadsheets, which hinders interoperability and compliance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles. To address this, LinkML provides an open modeling framework that standardizes data at the source through explicit, shared schemas. The framework supports optional ontology alignment, compound inheritance, and schema imports, which improve model reusability and cross-disciplinary compatibility. It is not tied to any one technical architecture and integrates with many existing frameworks. LinkML has seen increasing adoption in fields including biology, chemistry, and finance, where it supports data integration, validation, and sharing. The approach offers a way to standardize scientific data that advances both semantic interoperability and FAIR implementation.

📝 Abstract
Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult. LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics. LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
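
The class and inheritance features described in the abstract can be illustrated with a small sketch. This is not the LinkML library or its API: it is a plain Python dictionary laid out like a LinkML YAML schema (the keys `classes`, `slots`, `is_a`, `mixins`, `range`, and `required` follow LinkML's schema vocabulary), plus two hypothetical helper functions, `resolve_slots` and `missing_required`, written only for this example. The schema identifier, class names, and slot names are invented.

```python
# Hypothetical schema, written as a Python dict that mirrors the layout of a
# LinkML YAML file. All names and the identifier below are illustrative.
SCHEMA = {
    "id": "https://example.org/lab-samples",  # placeholder identifier
    "name": "lab_samples",
    "imports": ["linkml:types"],  # schemas may import other schemas
    "slots": {
        "id": {"range": "string", "required": True},
        "name": {"range": "string"},
        "collected_on": {"range": "date"},
        "ph": {"range": "float"},
    },
    "classes": {
        "NamedThing": {"slots": ["id", "name"]},
        "Dated": {"mixin": True, "slots": ["collected_on"]},
        # Compound inheritance: one is_a parent plus a mixin.
        "Sample": {"is_a": "NamedThing", "mixins": ["Dated"]},
        "SoilSample": {"is_a": "Sample", "slots": ["ph"]},
    },
}


def resolve_slots(schema, class_name):
    """Return the slot names of a class, inherited slots first (helper for this sketch)."""
    cls = schema["classes"][class_name]
    parents = ([cls["is_a"]] if "is_a" in cls else []) + cls.get("mixins", [])
    ordered = []
    for parent in parents:
        for slot in resolve_slots(schema, parent):
            if slot not in ordered:
                ordered.append(slot)
    for slot in cls.get("slots", []):
        if slot not in ordered:
            ordered.append(slot)
    return ordered


def missing_required(schema, class_name, record):
    """List required slots that a record (a dict) leaves unset."""
    return [
        slot
        for slot in resolve_slots(schema, class_name)
        if schema["slots"][slot].get("required") and record.get(slot) is None
    ]


if __name__ == "__main__":
    print(resolve_slots(SCHEMA, "SoilSample"))
    # ['id', 'name', 'collected_on', 'ph']
    print(missing_required(SCHEMA, "SoilSample", {"name": "plot 7", "ph": 6.4}))
    # ['id']
```

In practice the schema would live in a YAML file and be processed with the LinkML tooling documented at linkml.io; the sketch only shows why an explicit schema makes inherited structure and required fields checkable by a program.
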

Problem

Research questions and friction points this paper is trying to address.

Addresses the lack of standardized structure in how scientific data is stored
Eases data interoperability, integration, validation, and reuse
Reduces heterogeneity, complexity, and the proliferation of single-use data models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open framework for authoring, validating, and sharing data
Describes models ranging from flat lists to complex, interrelated, normalized structures
Provides a standard syntax for schemas, classes, and relationships
Sierra A. T. Moxon
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Harold Solbrig
School of Medicine, Johns Hopkins University, Baltimore, MD 21287, USA
Nomi L. Harris
Lawrence Berkeley National Laboratory
project management, bioinformatics, medical and translational informatics, ontologies
Patrick Kalita
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Mark A. Miller
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Sujay Patil
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Kevin Schaper
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Chris Bizon
Director of Analytics and Data Science, RENCI, University of North Carolina
Informatics, Next Generation Sequencing, Drug Discovery, Fluid Dynamics
J. Harry Caufield
Lawrence Berkeley National Laboratory
knowledge graphs, biomedical informatics, artificial intelligence, large language models, standards
Silvano Cirujano Cuesta
Siemens AG, Munich, 80333, Germany
Corey Cox
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Frank Dekervel
Kapernikov, Leuven, 3010, Belgium
Damion M. Dooley
Faculty of Health Sciences, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
William D. Duncan
Community Dentistry and Behavioral Science, University of Florida College of Dentistry, Gainesville, FL 32610, USA
Tim Fliss
Data and Technology, Allen Institute, Seattle, WA 98109, USA
Sarah Gehrke
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Adam S. L. Graefe
Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin 10999, Germany
Harshad Hegde
GSK, San Francisco, CA 94080, USA
AJ Ireland
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Julius O. B. Jacobsen
School of Medicine, Johns Hopkins University, Baltimore, MD 21287, USA
Madan Krishnamurthy
Data Scientist, UNC-CH
Carlo Kroll
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
David Linke
Reaction Engineering & Catalyst Development, Leibniz Institute for Catalysis (LIKAT), Rostock, 18059, Germany
Ryan Ly
Scientific Data Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Nicolas Matentzoglu
Semanticly (Independent Consultant)
Ontologies, Semantic Technologies, Graph Databases, Data Engineering