🤖 AI Summary
Current biodiversity research is hindered by the absence of large-scale, spatiotemporally aligned, and standardized multimodal datasets, which limits data-driven ecological modeling. To address this, we introduce BioCube, the first globally comprehensive, WGS84-georeferenced multimodal biodiversity dataset (2000–2020), which integrates seven heterogeneous modalities: Sentinel satellite imagery, ERA5 meteorological data, environmental DNA (eDNA), soundscapes, photographic images, textual species descriptions, and land-use maps. We propose a geospatially driven multi-source fusion framework that, for the first time, enables fine-grained spatiotemporal alignment and semantic standardization across modalities. The dataset and end-to-end processing pipeline are fully open-sourced (Hugging Face / GitHub). Empirical evaluation demonstrates substantial improvements in generalizability and ecological interpretability for species distribution modeling and community dynamics prediction, establishing a foundational resource for scalable, multimodal biodiversity science.
📝 Abstract
Biodiversity research requires complete and detailed information to study ecosystem dynamics at different scales. Data-driven methods such as machine learning are gaining traction in ecology, and in biodiversity research more specifically, offering alternative modelling pathways. For these methods to deliver accurate results, they need large, curated, multimodal datasets with granular spatial and temporal resolution. In this work, we introduce BioCube, a multimodal, fine-grained global dataset for ecology and biodiversity research. BioCube incorporates species observations in the form of images, audio recordings and descriptions, together with environmental DNA, vegetation indices, agricultural, forest and land indicators, and high-resolution climate variables. All observations are geospatially aligned under the WGS84 geodetic system and span the period from 2000 to 2020. The dataset will become available at https://huggingface.co/datasets/BioDT/BioCube, and the acquisition and processing code base at https://github.com/BioDT/bfm-data.
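To make the idea of geospatial alignment concrete, the following is a minimal sketch of how observations from different modalities could be binned onto a shared WGS84 grid and a monthly time axis. It is an illustration only, not the BioCube pipeline: the 0.25-degree cell size, the monthly binning, the record layout, and the names `cell_key` and `align` are all assumptions introduced here and are not taken from the abstract or the code base.

```python
from collections import defaultdict
from datetime import date
import math

GRID_DEG = 0.25  # assumed cell size in degrees; not stated in the abstract
START, END = date(2000, 1, 1), date(2020, 12, 31)  # period named in the abstract


def cell_key(lat, lon, day, grid_deg=GRID_DEG):
    """Map a WGS84 latitude/longitude and a date to a (lat_idx, lon_idx, year, month) key."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of WGS84 range: {lat}, {lon}")
    if not (START <= day <= END):
        raise ValueError(f"date outside 2000-2020: {day}")
    # Shift to non-negative values, then take the index of the enclosing cell.
    lat_idx = math.floor((lat + 90.0) / grid_deg)
    lon_idx = math.floor((lon + 180.0) / grid_deg)
    return (lat_idx, lon_idx, day.year, day.month)


def align(records, grid_deg=GRID_DEG):
    """Group (modality, lat, lon, day, payload) records by grid cell and month.

    Returns a dict: cell key -> {modality name -> list of payloads}.
    """
    cube = defaultdict(lambda: defaultdict(list))
    for modality, lat, lon, day, payload in records:
        cube[cell_key(lat, lon, day, grid_deg)][modality].append(payload)
    return cube


# Hypothetical records; the file names are placeholders, not dataset contents.
records = [
    ("image", 52.37, 4.90, date(2015, 6, 3), "img_001.jpg"),
    ("audio", 52.30, 4.80, date(2015, 6, 20), "rec_017.wav"),
    ("edna", 10.00, 10.00, date(2010, 1, 1), "sample_42"),
]
cube = align(records)
for key, modalities in sorted(cube.items()):
    print(key, {name: len(items) for name, items in modalities.items()})
```

In this sketch the image and the audio recording fall into the same cell and month, so a model could consume them as two views of one location and time, while the eDNA sample lands in a separate cell. A coarser or finer `grid_deg` trades spatial detail against how often modalities co-occur in a cell.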