🤖 AI Summary
Aesthetic evaluation of generated music remains challenging due to the complexity of perceptual dimensions. To address this, we propose a multi-scale hierarchical evaluation framework: (1) a cross-paragraph attention mechanism jointly models local musical details and global structural coherence, integrated with a multi-scale convolutional network; (2) a semantics-preserving C-Mixup audio augmentation strategy enhances data diversity and model robustness; and (3) a regression-ranking joint optimization objective enables consistent learning across segment-level score prediction and full-track ranking. Evaluated on the ICASSP 2026 SongEval benchmark, our method significantly outperforms existing baselines, achieving a 12.3% improvement in Pearson correlation coefficient and a 9.7% gain in Top-10 high-quality song identification accuracy. To our knowledge, this is the first approach to effectively balance multidimensional aesthetic consistency with end-to-end trainability.
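To make the C-Mixup idea in point (2) concrete, here is a minimal sketch of label-aware mixing for a regression target, written in NumPy. It assumes the usual C-Mixup recipe: each sample picks a mixing partner with probability given by a Gaussian kernel on the distance between aesthetic scores, and features and scores are then interpolated with a Beta-distributed coefficient. The function name `c_mixup_batch`, the `bandwidth` and `alpha` defaults, and the choice to mix precomputed feature vectors rather than raw audio are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def c_mixup_batch(features, scores, bandwidth=0.5, alpha=2.0, rng=None):
    """Mix each sample with a partner whose score is close to its own.

    Hypothetical helper: names and default values are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = len(scores)

    # Pairwise squared distances between labels, shape (n, n).
    sq_dist = (scores[:, None] - scores[None, :]) ** 2
    # Gaussian kernel: partners with similar scores get higher probability.
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    probs = weights / weights.sum(axis=1, keepdims=True)

    # Sample one partner index per sample (a sample may pick itself).
    partners = np.array([rng.choice(n, p=probs[i]) for i in range(n)])

    # One interpolation coefficient per sample, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha, size=n)
    lam_x = lam.reshape(-1, *([1] * (features.ndim - 1)))

    mixed_x = lam_x * features + (1.0 - lam_x) * features[partners]
    mixed_y = lam * scores + (1.0 - lam) * scores[partners]
    return mixed_x, mixed_y
```

Because partners are drawn preferentially from samples with nearby scores, the interpolated label stays close to both originals, which is the sense in which the augmentation is described as semantics-preserving.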
📝 Abstract
Evaluating the aesthetic quality of generated songs is challenging due to the multi-dimensional nature of musical perception. We propose a robust music aesthetic evaluation framework that combines (1) multi-source multi-scale feature extraction to obtain complementary segment- and track-level representations, (2) a hierarchical audio augmentation strategy to enrich training data, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-song identification. Experiments on the ICASSP 2026 SongEval benchmark demonstrate that our approach consistently outperforms baseline methods across correlation and top-tier metrics.
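One way to read the hybrid objective in point (3) is as a weighted sum of a regression term and a pairwise ranking term. The sketch below is an assumption about its general shape, not the authors' exact formulation: it uses mean squared error for regression and a margin-based hinge loss over all pairs whose ground-truth scores differ. The names `hybrid_loss`, `rank_weight`, and `margin` are hypothetical.

```python
import numpy as np


def hybrid_loss(pred, target, rank_weight=0.5, margin=0.1):
    """MSE plus a pairwise hinge ranking penalty (illustrative sketch)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # Regression term: how far each predicted score is from its label.
    mse = np.mean((pred - target) ** 2)

    # Ranking term: for every pair (i, j) with target[i] > target[j],
    # penalise predictions that do not rank i above j by at least `margin`.
    diff_target = target[:, None] - target[None, :]
    diff_pred = pred[:, None] - pred[None, :]
    mask = diff_target > 0
    if mask.any():
        rank = np.mean(np.maximum(0.0, margin - diff_pred[mask]))
    else:
        rank = 0.0  # no ordered pairs in this batch

    return mse + rank_weight * rank
```

The regression term drives accurate absolute scores, while the ranking term only cares about relative order, which is what matters for identifying the top songs; `rank_weight` trades the two off.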