Comparison of the Interpretability of ResNet50 and ViT-224 Models in the Classification Task on Scanning Electron Microscope Images
https://doi.org/10.15514/ISPRAS-2025-37(6)-15
Abstract
The paper studies the interpretability of two popular deep learning architectures, ResNet50 and Vision Transformer (ViT-224), applied to the classification of pathogenic microorganisms in images obtained with a scanning electron microscope after preliminary sample preparation using lanthanide contrast. In addition to standard quality metrics such as precision, recall, and F1 score, a key aspect was the study of the built-in attention maps of the Vision Transformer and the post-hoc interpretation of the trained ResNet50 model's predictions using the Grad-CAM method. The experiments were performed on the original dataset as well as on three of its modifications: with the background zeroed by thresholding, with image regions modified by inpainting, and with the background completely removed by zeroing background areas. To evaluate the generality of the attention mechanism in the Vision Transformer, a test was also conducted on the classic MNIST handwritten digit recognition task. The results showed that the Vision Transformer architecture exhibits more localized and biologically plausible attention heat maps, as well as greater robustness to changes in background noise.
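The Grad-CAM method mentioned in the abstract can be sketched in a few lines: channel weights are obtained by global-average-pooling the gradients of the class score with respect to the last convolutional layer's activations, and the heat map is the ReLU of the weighted sum of those activation maps. The following is a minimal NumPy illustration of that computation only, not of the paper's ResNet50 pipeline; the toy arrays are invented for the example:

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heat map from last-conv activations (C, H, W) and the
    gradients of the target class score w.r.t. those activations."""
    # Channel weights: global-average-pool the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))                       # shape (C,)
    # Weighted sum of activation maps; ReLU keeps only class-positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for visualization (guard against an all-zero map).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 3 channels on a 4x4 grid with two activated locations.
acts = np.zeros((3, 4, 4))
acts[0, 1, 1] = 2.0
acts[1, 2, 2] = 1.0
grads = np.ones((3, 4, 4))          # uniform positive gradients
heatmap = grad_cam(acts, grads)     # strongest response at (1, 1)
```

In a real pipeline the `activations` and `gradients` would come from forward and backward hooks on the network's last convolutional block; here they are supplied directly to keep the sketch self-contained.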
About the Authors
Vladimir Nikolaevich GRIDIN
Russian Federation
Dr. Sci. (Tech.), Prof., Scientific Director of the Design Information Technologies Center Russian Academy of Sciences. Research interests: information technology, artificial intelligence, and CAD systems, including analytical methods.
Ivan Aleksandrovich NOVIKOV
Russian Federation
Senior researcher at the Design Information Technologies Center Russian Academy of Sciences. Research interests: information technology, data mining.
Basim Raed SALEM
Russian Federation
Researcher at the Design Information Technologies Center Russian Academy of Sciences. Research interests: artificial intelligence, decision support systems.
Vladimir Igorevich SOLODOVNIKOV
Russian Federation
Cand. Sci. (Tech.). Director of the Design Information Technologies Center Russian Academy of Sciences. Research interests: information technology, methods of machine learning and artificial intelligence as applied to self-organizing decision support systems and automated data analysis tools.
For citations:
GRIDIN V.N., NOVIKOV I.A., SALEM B.R., SOLODOVNIKOV V.I. Comparison of the Interpretability of ResNet50 and ViT-224 Models in the Classification Task on Scanning Electron Microscope Images. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(6):233-242. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(6)-15