🤖 AI Summary
Spectrogram interpretation in marine mammal vocalization analysis remains heavily reliant on manual annotation, while existing vision-language models (VLMs) lack domain-specific adaptation to bioacoustics. Method: This paper proposes a fine-tuning-free, annotation-free VLM–LLM collaborative framework. It leverages a pre-trained VLM to extract visual features directly from acoustic spectrograms and integrates a large language model (LLM) for semantic interpretation, domain-knowledge infusion, and cross-modal reasoning, enabling autonomous construction and validation of underwater bioacoustic knowledge. Contribution/Results: Experiments demonstrate effective zero-shot identification of vocalization patterns, with significant improvements in both classification accuracy and explanatory fidelity. The framework establishes a novel paradigm for automated, knowledge-enhanced, expert-level analysis of acoustic spectrograms, bridging the gap between visual representation learning and bioacoustic domain reasoning.
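The paper's exact prompts, models, and validation criteria are not reproduced here; as a rough illustration of the two-stage collaboration, the sketch below chains a VLM description pass to an LLM validation pass, assuming an OpenAI-compatible endpoint (the model name, prompt wording, and the `describe_spectrogram` / `validate_with_llm` helpers are all hypothetical, not the authors' implementation):

```python
import base64
from openai import OpenAI  # assumption: any OpenAI-compatible VLM/LLM endpoint

client = OpenAI()

def describe_spectrogram(png_path: str) -> str:
    """VLM stage: ask a general-purpose VLM to describe visual structure only."""
    with open(png_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; the paper's VLM is not specified here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe the time-frequency structure in this spectrogram: "
                    "contour shapes, frequency range, duration, and repetition."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def validate_with_llm(description: str) -> str:
    """LLM stage: check the description against bioacoustic domain knowledge
    and propose a candidate vocalization label (hypothetical prompt)."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the paper's LLM is not specified here
        messages=[
            {"role": "system", "content": (
                "You are a marine-bioacoustics expert. Given a visual description "
                "of a spectrogram, judge whether it is internally consistent and "
                "name the most plausible vocalization type (e.g. whistle, click "
                "train, moan)."
            )},
            {"role": "user", "content": description},
        ],
    )
    return resp.choices[0].message.content

description = describe_spectrogram("spectrogram.png")
print(validate_with_llm(description))
```

The division of labor mirrors the summary above: the VLM only reads visual structure, while the LLM supplies domain semantics and validates the reading before committing to a label.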
📝 Abstract
Marine mammal vocalization analysis depends on interpreting bioacoustic spectrograms, yet vision-language models (VLMs) are not trained on these domain-specific visualizations. We investigate whether VLMs can nonetheless extract meaningful patterns from spectrograms visually. Our framework integrates VLM interpretation with LLM-based validation to build domain knowledge, enabling adaptation to acoustic data without manual annotation or model retraining.
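For context on what the VLM actually sees, here is a minimal sketch of rendering a recording as the spectrogram image consumed by the pipeline sketch above, using librosa and matplotlib (the file name and STFT parameters are illustrative assumptions, not the paper's settings):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a recording and render a log-magnitude spectrogram as an ordinary image,
# i.e. the visual input a VLM receives. STFT settings are illustrative defaults.
y, sr = librosa.load("humpback_call.wav", sr=None)  # hypothetical file name
S_db = librosa.amplitude_to_db(
    np.abs(librosa.stft(y, n_fft=2048, hop_length=512)), ref=np.max
)

fig, ax = plt.subplots(figsize=(8, 4))
librosa.display.specshow(S_db, sr=sr, hop_length=512,
                         x_axis="time", y_axis="hz", ax=ax)
ax.set(title="Spectrogram (dB)")
fig.savefig("spectrogram.png", dpi=150, bbox_inches="tight")
```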