🤖 AI Summary
Problem: In forensic applications, dental age estimation models suffer from low interpretability due to their "black-box" nature, and automated staging shows a substantial performance disparity between the mandibular second molar (tooth #37) and third molar (tooth #38), with a baseline ViT reaching only 0.462 accuracy on #38. Method: We propose an interpretable framework that integrates a convolutional autoencoder (AE) with a Vision Transformer (ViT): the AE compresses features into a latent space, enables reconstruction analysis, and quantifies morphological variability; the ViT performs the staging classification; together they yield multi-dimensional explanations beyond conventional attention maps. Contribution/Results: We attribute the remaining model uncertainty to the intrinsically high intra-class morphological variability of the third molar. Experiments show accuracy improving from 0.712 to 0.815 for #37 and from 0.462 to 0.543 for #38, and the analysis points to a data-centric bottleneck for #38, improving both predictive performance and decision transparency.
📝 Abstract
The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the "black box" nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE's latent-space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting that high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation.
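The abstract does not say which latent-space metrics the authors compute. As one illustration of how intra-class variability could be quantified from AE latent codes, the minimal sketch below measures the mean Euclidean distance of each latent vector to its class centroid. The function name `intra_class_spread`, the stage labels, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import defaultdict


def intra_class_spread(latents, labels):
    """Return {label: mean Euclidean distance to that label's centroid}.

    Hypothetical helper: one possible variability metric, not the paper's.
    """
    # Group latent vectors by their class (developmental stage) label.
    groups = defaultdict(list)
    for z, y in zip(latents, labels):
        groups[y].append(z)

    spread = {}
    for y, zs in groups.items():
        dim = len(zs[0])
        # Centroid = per-dimension mean of the class's latent vectors.
        centroid = [sum(z[d] for z in zs) / len(zs) for d in range(dim)]
        # Average distance of class members to their centroid.
        spread[y] = sum(math.dist(z, centroid) for z in zs) / len(zs)
    return spread


# Toy 2-D latent codes, made up for illustration only.
latents = [[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 5.0]]
labels = ["stage_A", "stage_A", "stage_B", "stage_B"]
spread = intra_class_spread(latents, labels)
print(spread)
```

Under this assumed metric, a class whose samples sit far from their centroid (here `stage_A`) has higher intra-class variability than a tightly clustered one (`stage_B`). Comparing such values between the tooth 37 and tooth 38 datasets is the kind of evidence the abstract describes as pointing to a data-centric limitation.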