BioCube: A Multimodal Dataset for Biodiversity Research

📅 2025-05-16
🤖 AI Summary
Current biodiversity research is hindered by the absence of large-scale, spatiotemporally aligned, and standardized multimodal datasets, limiting data-driven ecological modeling. To address this, we introduce BioCube—the first globally comprehensive (2000–2020), WGS84-georeferenced multimodal biodiversity dataset—integrating seven heterogeneous modalities: Sentinel satellite imagery, ERA5 meteorological data, environmental DNA (eDNA), soundscapes, photographic images, textual species descriptions, and land-use maps. We propose a geospatially driven multi-source fusion framework that enables fine-grained spatiotemporal alignment and semantic standardization across modalities for the first time. The dataset and end-to-end processing pipeline are fully open-sourced (Hugging Face / GitHub). Empirical evaluation demonstrates substantial improvements in generalizability and ecological interpretability for species distribution modeling and community dynamics prediction, establishing a foundational resource for scalable, multimodal biodiversity science.

📝 Abstract
Biodiversity research requires complete and detailed information to study ecosystem dynamics at different scales. Data-driven methods such as machine learning are gaining traction in ecology, and more specifically in biodiversity research, offering alternative modelling pathways. For these methods to deliver accurate results, they need large, curated, multimodal datasets with fine spatial and temporal resolution. In this work, we introduce BioCube, a multimodal, fine-grained global dataset for ecology and biodiversity research. BioCube incorporates species observations through images, audio recordings and descriptions, environmental DNA, vegetation indices, agricultural, forest and land indicators, and high-resolution climate variables. All observations are geospatially aligned under the WGS84 geodetic system and span 2000 to 2020. The dataset will become available at https://huggingface.co/datasets/BioDT/BioCube, and the acquisition and processing code base at https://github.com/BioDT/bfm-data.
Problem

Research questions and friction points this paper is trying to address.

Lack of large multimodal datasets for biodiversity research
Need for granular spatial-temporal biodiversity data
Absence of curated global species-environment datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset with images, audio, and eDNA
Geospatially aligned global biodiversity observations
High-resolution climate and environmental variables
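
To make "geospatially aligned" concrete, here is a minimal sketch of one way observations from different modalities could be placed on a shared WGS84 grid and grouped by month. It is an illustration under stated assumptions, not the BioCube pipeline: the 0.25-degree cell size, the monthly time bin, the record fields (`lat`, `lon`, `date`, `modality`, `value`), and the function names `grid_cell` and `align` are all hypothetical. Only the WGS84 coordinates and the 2000 to 2020 span come from the abstract.

```python
import math
from collections import defaultdict
from datetime import date

CELL_DEG = 0.25  # assumed grid spacing in degrees; not stated in the text above


def grid_cell(lat, lon, cell_deg=CELL_DEG):
    """Snap a WGS84 (lat, lon) pair to the south-west corner of its grid cell."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinate out of WGS84 range: {lat}, {lon}")
    lat0 = math.floor(lat / cell_deg) * cell_deg
    lon0 = math.floor(lon / cell_deg) * cell_deg
    return (round(lat0, 6), round(lon0, 6))


def align(records, cell_deg=CELL_DEG):
    """Group records by (grid cell, year-month), then by modality.

    Records dated outside 2000-2020 (the span given in the abstract) are skipped.
    """
    cube = defaultdict(lambda: defaultdict(list))
    for rec in records:
        if not 2000 <= rec["date"].year <= 2020:
            continue
        key = (grid_cell(rec["lat"], rec["lon"], cell_deg),
               rec["date"].strftime("%Y-%m"))
        cube[key][rec["modality"]].append(rec["value"])
    return {key: dict(by_modality) for key, by_modality in cube.items()}


# Hypothetical example records; file names and values are made up for illustration.
records = [
    {"lat": 52.08, "lon": 4.31, "date": date(2015, 6, 3),
     "modality": "image", "value": "img_001.jpg"},
    {"lat": 52.20, "lon": 4.40, "date": date(2015, 6, 21),
     "modality": "audio", "value": "rec_017.wav"},
    {"lat": 52.00, "lon": 4.25, "date": date(2015, 6, 1),
     "modality": "climate", "value": 288.4},
    {"lat": -33.90, "lon": 18.40, "date": date(2015, 6, 3),
     "modality": "image", "value": "img_002.jpg"},
    {"lat": 52.08, "lon": 4.31, "date": date(1999, 6, 3),
     "modality": "image", "value": "img_000.jpg"},  # outside 2000-2020, skipped
]

cube = align(records)
for key in sorted(cube):
    print(key, sorted(cube[key]))
```

Keying on the cell corner and month means an image, a sound recording, and a climate value taken near each other in space and time end up in the same entry, which is the property a model needs to consume the modalities jointly. A real pipeline would more likely use array libraries and a proper geospatial index; plain dictionaries are used here only to keep the sketch self-contained.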
Stylianos Stasinos
TNO ICT, Strategy & Policy, Anna van Buerenplein 1, 2595 DA, Den Haag, The Netherlands
Martino Mensio
TNO
Misinformation, NLP, LLM, RAG, RL
Elena Lazovik
Senior Specialist Scientist, TNO
Big Data, Cloud Computing, Sensors, Software Architecture, NoSQL
Athanasios Trantas
TNO ICT, Strategy & Policy, Anna van Buerenplein 1, 2595 DA, Den Haag, The Netherlands