Comparison of the Interpretability of ResNet50 and ViT-224 Models in the Classification Task on Scanning Electron Microscope Images
https://doi.org/10.15514/ISPRAS-2025-37(6)-15
Abstract
The paper studies the interpretability of two popular deep learning architectures, ResNet50 and Vision Transformer (ViT-224), applied to the classification of pathogenic microorganisms in images obtained with a scanning electron microscope after preliminary sample preparation using lanthanide contrast. In addition to standard quality metrics such as precision, recall, and F1 score, a key aspect was the study of the built-in attention maps of the Vision Transformer and the post-hoc interpretation of the trained ResNet50 model's predictions using the Grad-CAM method. The experiments were performed on the original dataset as well as on three of its modifications: with the background zeroed by thresholding, with image regions modified by inpainting, and with the background completely removed by zeroing background areas. To evaluate the generality of the attention mechanism in the Vision Transformer, a test was also conducted on the classic MNIST handwritten digit recognition task. The results showed that the Vision Transformer architecture exhibits more localized and biologically plausible attention heat maps, as well as greater robustness to changes in background noise.
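The Grad-CAM method mentioned in the abstract can be sketched in a few lines: channel weights are obtained by global-average-pooling the gradients of the class score with respect to the last convolutional layer's activations, and the heat map is the ReLU of the weighted sum of those activation maps. The following is a minimal NumPy illustration of that computation only, not of the paper's ResNet50 pipeline; the toy arrays are invented for the example:

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heat map from last-conv activations (C, H, W) and the
    gradients of the target class score w.r.t. those activations."""
    # Channel weights: global-average-pool the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))                       # shape (C,)
    # Weighted sum of activation maps; ReLU keeps only class-positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for visualization (guard against an all-zero map).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 3 channels on a 4x4 grid with two activated locations.
acts = np.zeros((3, 4, 4))
acts[0, 1, 1] = 2.0
acts[1, 2, 2] = 1.0
grads = np.ones((3, 4, 4))          # uniform positive gradients
heatmap = grad_cam(acts, grads)     # strongest response at (1, 1)
```

In a real pipeline the `activations` and `gradients` would come from forward and backward hooks on the network's last convolutional block; here they are supplied directly to keep the sketch self-contained.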
About the Authors
Vladimir Nikolaevich GRIDIN
Russian Federation
Dr. Sci. (Tech.), Prof., Scientific Director of the Design Information Technologies Center Russian Academy of Sciences. Research interests: information technology, artificial intelligence, and CAD systems, including analytical methods.
Ivan Aleksandrovich NOVIKOV
Russian Federation
Senior researcher at the Design Information Technologies Center Russian Academy of Sciences. Research interests: information technology, data mining.
Basim Raed SALEM
Russian Federation
Researcher at the Design Information Technologies Center Russian Academy of Sciences. Research interests: artificial intelligence, decision support systems.
Vladimir Igorevich SOLODOVNIKOV
Russian Federation
Cand. Sci. (Tech.). Director of the Design Information Technologies Center Russian Academy of Sciences. Research interests: information technology, methods of machine learning and artificial intelligence as applied to self-organizing decision support systems and automated data analysis tools.
For citations:
GRIDIN V.N., NOVIKOV I.A., SALEM B.R., SOLODOVNIKOV V.I. Comparison of the Interpretability of ResNet50 and ViT-224 Models in the Classification Task on Scanning Electron Microscope Images. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(6):233-242. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(6)-15