An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars

📅 2025-09-11
🤖 AI Summary
Problem: In forensic applications, deep learning models for dental age estimation are hard to interpret because of their “black-box” nature, and automated staging shows a substantial performance disparity between the mandibular second molar (tooth #37) and third molar (tooth #38): a baseline ViT reaches 0.712 accuracy on #37 but only 0.462 on #38. Method: We propose an interpretable framework that integrates a convolutional autoencoder (AE) with a Vision Transformer (ViT). The AE compresses the input into a latent space, which supports reconstruction analysis and latent-space metrics of morphological variability; the ViT performs the staging classification. Together they yield explanations that go beyond conventional attention maps. Contribution/Results: Accuracy improves over the baseline ViT from 0.712 to 0.815 for #37 and from 0.462 to 0.543 for #38. The AE analysis indicates that the remaining gap is data-centric, with high intra-class morphological variability in the tooth #38 dataset as a primary limiting factor, so the framework improves predictive performance while also giving evidence for why the model is uncertain.

📝 Abstract
The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the 'black box' nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE's latent space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation.
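The accuracy figures quoted in the abstract can be put side by side with a few lines of arithmetic. The sketch below uses only the four reported accuracies; the dictionary layout and the `gains` helper are illustrative names, not anything taken from the paper.

```python
# Accuracies as reported in the abstract (baseline ViT vs. AE + ViT framework).
reported = {
    "tooth 37": {"baseline": 0.712, "framework": 0.815},
    "tooth 38": {"baseline": 0.462, "framework": 0.543},
}


def gains(baseline, framework):
    """Return (absolute gain, relative gain in percent) for one tooth."""
    absolute = framework - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 3), round(relative, 1)


summary = {tooth: gains(**acc) for tooth, acc in reported.items()}

# Accuracy gap between the two teeth, before and after applying the framework.
gap_baseline = round(
    reported["tooth 37"]["baseline"] - reported["tooth 38"]["baseline"], 3
)
gap_framework = round(
    reported["tooth 37"]["framework"] - reported["tooth 38"]["framework"], 3
)

for tooth, (abs_gain, rel_gain) in summary.items():
    print(f"{tooth}: +{abs_gain:.3f} absolute, +{rel_gain:.1f}% relative")
print(f"gap between teeth: {gap_baseline:.3f} (baseline) -> {gap_framework:.3f} (framework)")
```

Both teeth gain in absolute terms (0.103 for tooth 37, 0.081 for tooth 38), yet the gap between them stays at roughly a quarter of the accuracy scale (0.250 at baseline, 0.272 with the framework), which is consistent with the abstract's point that a performance gap remains.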
Problem

Research questions and friction points this paper is trying to address.

Enhancing deep learning transparency in dental age estimation
Addressing performance disparity in automated molar staging
Identifying data-centric limitations in tooth morphological variability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining convolutional autoencoder with Vision Transformer
Improving classification accuracy for molar staging
Providing multi-faceted diagnostic insights beyond attention maps
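This page does not say which latent-space metrics the authors compute. As a rough illustration of how intra-class variability could be quantified from autoencoder latent vectors, the sketch below measures the mean Euclidean distance of each sample to its class centroid. The function name `intra_class_spread`, the toy 2-D latent vectors, and the stage labels are all assumptions made for this example, not the paper's method or data.

```python
import math
from collections import defaultdict


def intra_class_spread(latents, labels):
    """Mean Euclidean distance of each latent vector to its class centroid.

    A larger value means samples sharing a label are more scattered in
    latent space, i.e. higher intra-class variability.
    """
    by_class = defaultdict(list)
    for z, y in zip(latents, labels):
        by_class[y].append(z)

    spread = {}
    for y, vectors in by_class.items():
        dim = len(vectors[0])
        centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        spread[y] = sum(math.dist(v, centroid) for v in vectors) / len(vectors)
    return spread


# Hypothetical 2-D latent vectors for two developmental stages (toy data).
latents = [(0.0, 0.0), (0.0, 2.0), (0.0, 0.0), (6.0, 8.0)]
labels = ["stage_A", "stage_A", "stage_B", "stage_B"]

spread = intra_class_spread(latents, labels)
for stage, value in spread.items():
    print(f"{stage}: mean distance to centroid = {value:.2f}")
```

In this toy data the `stage_B` samples lie five times farther from their centroid than the `stage_A` samples. A classifier trained on widely scattered classes of this kind would be expected to separate them less reliably, which is the type of data-centric limitation the abstract describes for tooth 38.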
Barkin Buyukcakir
KU Leuven, Department of Electrical Engineering (ESAT) - Processing Speech and Images (PSI), Leuven, 3000, Belgium
Jannick De Tobel
Ghent University, Department of Diagnostic Sciences, Ghent, 9000, Belgium
Patrick Thevissen
Imaging and Pathology - Forensic Odontology Department, KU Leuven, Leuven, 3000, Belgium
Dirk Vandermeulen
KU Leuven, Department of Electrical Engineering (ESAT) - Processing Speech and Images (PSI), Leuven, 3000, Belgium
Peter Claes
KU Leuven, ESAT-PSI, Dept. Human Genetics
shape analysis, (medical) image analysis, imaging genetics, facial genetics