This work addresses cross-modal bird species retrieval when no paired audio-image data are available, a setting in which the two representations cannot be aligned directly. The authors propose using text as an intermediary: they distill the text embedding space of a pretrained image-text model (BioCLIP-2) into an audio-text model (BioLingual), then fine-tune the audio encoder with a contrastive objective. Because the two models then share a text embedding space, audio and image embeddings become implicitly aligned without any direct audio-image supervision. The method significantly outperforms zero-shot model ensembles and text-mapping baselines on bioacoustic benchmarks such as SSW60, while preserving audio discriminability and improving audio-text alignment.
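To make the two-stage idea concrete, the following is a minimal NumPy sketch of what such a pipeline could look like. It is not the authors' implementation. The summary does not specify the loss functions, so the cosine-distance distillation loss, the symmetric InfoNCE contrastive loss, the temperature of 0.07, and all function names (`distillation_loss`, `contrastive_loss`, `retrieve_images`) are assumptions chosen for illustration. The embeddings are plain arrays standing in for the outputs of the BioCLIP-2 and BioLingual encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def distillation_loss(student_text, teacher_text):
    """Assumed stage 1 objective: mean cosine distance between the student
    (audio-text model) and teacher (image-text model) text embeddings of the
    same captions. Row i of both arrays must describe the same caption."""
    s = l2_normalize(student_text)
    t = l2_normalize(teacher_text)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Assumed stage 2 objective: symmetric InfoNCE over a batch in which
    row i of audio_emb is paired with row i of text_emb."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return float(0.5 * (loss_a2t + loss_t2a))


def retrieve_images(audio_query, image_gallery, k=1):
    """Rank gallery images by cosine similarity to a single audio query.
    Only meaningful once both embeddings live in a shared space."""
    q = l2_normalize(audio_query)
    g = l2_normalize(image_gallery)
    scores = g @ q                           # (num_images,)
    return np.argsort(-scores)[:k]
```

In this sketch, stage 1 would minimize `distillation_loss` over the student text encoder, stage 2 would minimize `contrastive_loss` over the audio encoder against the distilled text embeddings, and `retrieve_images` shows how audio-to-image retrieval could then run without any audio-image pairs having been seen in training.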