🤖 AI Summary
This study addresses the limitations of current methods for identifying individual animals, which rely predominantly on physical tagging and lack efficient, non-invasive approaches based on remote facial recognition; progress is further hindered by the scarcity of large annotated datasets for training species-specific models. The work evaluates how well a pretrained human face model (FaceNet) and a general-purpose vision model (Vision Transformer, ViT, pretrained on ImageNet) transfer to multiple species (dogs, cattle, and primates), as a way to avoid the need for extensive animal-specific labeled data. Experimental results show that ViT achieves a mean verification accuracy of 96.85% and a Rank-1 identification rate of 84.34% on a dog dataset and outperforms state-of-the-art methods on cattle. Results on primates are encouraging but do not always exceed the state of the art. Overall, the findings support cross-species transfer learning as a viable approach to animal facial recognition.
📝 Abstract
Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, which are often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face. If accurate enough, face recognition provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, which matters, for instance, when sick animals are substituted for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We instead investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with state-of-the-art (SOTA) deep networks trained ad hoc. The capture conditions differ among the three datasets: image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.
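To make the two reported metrics concrete, the following is a minimal, dependency-free Python sketch of how a Rank-1 identification rate and a verification accuracy could be computed from face embeddings. It is an illustration only, not the authors' evaluation protocol: the cosine-similarity matching, the fixed threshold of 0.8, the function names, and the toy 3-D vectors (stand-ins for embeddings produced by a pretrained backbone such as FaceNet or ViT) are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank1_identification_rate(gallery, probes):
    """Fraction of probes whose most similar gallery entry has the same identity.

    gallery, probes: lists of (identity, embedding) tuples.
    """
    hits = 0
    for probe_id, probe_emb in probes:
        best_id, _ = max(
            gallery, key=lambda item: cosine_similarity(probe_emb, item[1])
        )
        hits += best_id == probe_id
    return hits / len(probes)


def verification_accuracy(pairs, threshold):
    """Fraction of pairs correctly classified as same or different identity.

    pairs: list of (embedding_a, embedding_b, is_same) tuples.
    A pair is predicted "same" when its similarity reaches the threshold.
    """
    correct = 0
    for emb_a, emb_b, is_same in pairs:
        predicted_same = cosine_similarity(emb_a, emb_b) >= threshold
        correct += predicted_same == is_same
    return correct / len(pairs)


# Made-up 3-D vectors standing in for backbone embeddings (hypothetical data).
gallery = [
    ("rex", [1.0, 0.1, 0.0]),
    ("luna", [0.0, 1.0, 0.1]),
    ("milo", [0.1, 0.0, 1.0]),
]
probes = [
    ("rex", [0.9, 0.2, 0.1]),
    ("luna", [0.1, 0.8, 0.2]),
    ("milo", [0.7, 0.1, 0.6]),  # ambiguous image: closer to "rex" than "milo"
]
pairs = [
    (gallery[0][1], probes[0][1], True),
    (gallery[1][1], probes[1][1], True),
    (gallery[0][1], probes[1][1], False),
    (gallery[2][1], probes[0][1], False),
]

rank1 = rank1_identification_rate(gallery, probes)
accuracy = verification_accuracy(pairs, threshold=0.8)  # threshold is an assumption
print(f"Rank-1 identification rate: {rank1:.2f}")
print(f"Verification accuracy: {accuracy:.2f}")
```

Identification is a one-to-many search (each probe is matched against the whole gallery), whereas verification is a one-to-one decision on a pair, which is one reason the two metrics can differ noticeably for the same model, as they do in the dog results above.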