🤖 AI Summary
Evaluation metrics in machine learning are numerous and predominantly relative, depending on baseline models or data distributions, which hinders comparability across models and tasks.
Method: We conduct a systematic literature review to identify, categorize, and formalize *absolute evaluation metrics*: those with fixed semantic scales, independent of benchmarks or underlying data distributions, for classification, clustering, regression, and ranking tasks. Organized by learning task, the resulting unified framework delineates each metric's applicability boundaries and provides principled selection criteria.
Contribution/Results: This work fills a gap in cross-task evaluation guidance by providing practitioners with a standardized, transferable metric selection protocol. By unifying interpretation and usage conventions, it improves the consistency, comparability, and interpretability of model evaluation across diverse learning paradigms.
📝 Abstract
Machine Learning is a diverse field applied across various domains such as computer science, social sciences, medicine, chemistry, and finance. This diversity results in varied evaluation approaches, making it difficult to compare models effectively. Absolute evaluation measures offer a practical solution by assessing a model's performance on a fixed scale, independent of reference models and data ranges, enabling explicit comparisons. However, many commonly used measures are not universally applicable, leading to a lack of comprehensive guidance on their appropriate use. This survey addresses this gap by providing an overview of absolute evaluation metrics in ML, organized by the type of learning problem. While classification metrics have been extensively studied, this work also covers clustering, regression, and ranking metrics. By grouping these measures according to the specific ML challenges they address, this survey aims to equip practitioners with the tools necessary to select appropriate metrics for their models. The provided overview thus improves individual model evaluation and facilitates meaningful comparisons across different models and applications.
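The distinction between absolute and relative measures can be illustrated with a minimal sketch. The specific metrics below (accuracy and mean absolute error) are chosen as assumed examples for illustration, not taken from the survey itself: accuracy lives on a fixed [0, 1] scale regardless of the data, while MAE inherits the scale of the target values and is therefore only interpretable relative to the data range.

```python
# Illustrative sketch (assumed example metrics, not the survey's definitions):
# an absolute measure has a fixed, data-independent scale; a relative measure
# changes meaning with the data's range.

def accuracy(y_true, y_pred):
    """Absolute measure: always in [0, 1], interpretable without a baseline."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Relative measure: its magnitude depends on the targets' value range."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification: an accuracy of 0.75 means the same thing on any dataset.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75

# Regression: the same relative error gives wildly different MAE values
# once the targets are rescaled (e.g., metres vs. millimetres).
y, y_hat = [1.0, 2.0, 3.0], [1.1, 2.1, 3.1]
print(mae(y, y_hat))                                           # ≈ 0.1
print(mae([v * 1000 for v in y], [v * 1000 for v in y_hat]))   # ≈ 100.0
```

In practice, this is why an MAE of 0.1 is meaningless without knowing the data's units and spread, whereas an accuracy of 0.75 (or any metric on a fixed semantic scale) supports the explicit cross-model comparisons the survey is concerned with.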