Temporally Coherent Person Matting Trained on Fake-Motion Dataset
https://doi.org/10.15514/ISPRAS-2025-37(3)-6
Abstract
We propose a novel neural-network-based method for matting videos of people that requires no additional user input such as trimaps. Our architecture achieves temporal stability of the resulting alpha mattes through motion-estimation-based smoothing of image-segmentation outputs, combined with convolutional-LSTM modules on the U-Net skip connections. We also propose a fake-motion algorithm that generates training clips for the video-matting network from photos with ground-truth alpha mattes and background videos. We apply random motion to the photos and their mattes to simulate the movement found in real videos and composite the results with the background clips. This approach lets us train a deep neural network that operates on videos in the absence of a large annotated video dataset, and it provides ground-truth foreground optical flow for the training clips for use in loss functions.
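To make the fake-motion idea concrete, below is a minimal sketch of how such a training clip could be generated. The function name make_fake_motion_clip, the affine motion model, and all parameters are illustrative assumptions; the abstract does not specify the paper's actual motion model or compositing pipeline.

```python
import numpy as np
import cv2

def make_fake_motion_clip(photo, alpha, bg_frames,
                          max_step=4.0, max_scale=0.02):
    """Hypothetical sketch: animate a still photo and its matte over a background video.

    photo:     HxWx3 float32 foreground image in [0, 1]
    alpha:     HxW   float32 ground-truth matte in [0, 1]
    bg_frames: iterable of HxWx3 float32 background frames
    Returns composited frames, warped mattes, and ground-truth
    foreground flow fields (relative to the source photo).
    """
    h, w = alpha.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    M = np.float32([[1, 0, 0], [0, 1, 0]])  # identity affine warp
    frames, mattes, flows = [], [], []
    for bg in bg_frames:
        # Random-walk translation plus mild scale jitter simulates smooth
        # foreground motion (scale is taken about the image origin for
        # brevity; a real pipeline would scale about the person's center).
        M[:, 2] += np.random.uniform(-max_step, max_step, size=2)
        s = 1.0 + np.random.uniform(-max_scale, max_scale)
        M[0, 0] = M[1, 1] = s
        fg_t = cv2.warpAffine(photo, M, (w, h))
        a_t = cv2.warpAffine(alpha, M, (w, h))
        # Alpha compositing: I = alpha * F + (1 - alpha) * B.
        frames.append(a_t[..., None] * fg_t + (1 - a_t[..., None]) * bg)
        mattes.append(a_t)
        # The warp is known, so foreground flow is exact, not estimated:
        # flow(x, y) = M @ [x, y, 1]^T - [x, y]^T at every pixel.
        flow_x = M[0, 0] * xs + M[0, 1] * ys + M[0, 2] - xs
        flow_y = M[1, 0] * xs + M[1, 1] * ys + M[1, 2] - ys
        flows.append(np.stack([flow_x, flow_y], axis=-1))
    return frames, mattes, flows
```

Because every frame is produced by a known warp, flow-based loss terms can use exact ground truth; frame-to-frame flow, if a temporal loss needs it, follows by composing consecutive affine transforms. Similarly, a convolutional-LSTM module (Shi et al.) placed on a U-Net skip connection filters encoder features across time before they reach the decoder. Below is a minimal PyTorch sketch of such a cell, assuming a zero-initialized recurrent state; the paper's exact placement and channel sizes are not specified here.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell (Shi et al.): LSTM gates computed by a convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        if state is None:  # zero-initialize hidden and cell state
            b, _, h, w = x.shape
            state = (x.new_zeros(b, self.hid_ch, h, w),
                     x.new_zeros(b, self.hid_ch, h, w))
        h_t, c_t = state
        i, f, o, g = self.gates(torch.cat([x, h_t], dim=1)).chunk(4, dim=1)
        c_t = f.sigmoid() * c_t + i.sigmoid() * g.tanh()
        h_t = o.sigmoid() * c_t.tanh()
        return h_t, (h_t, c_t)  # h_t replaces the skip feature for the decoder
```

Stepped once per frame on a skip connection, the cell carries information from earlier frames into the current matte prediction through its recurrent state.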
About the Authors
Ivan Andreevich MOLODETSKIKH
Russian Federation
Received his master's degree in computer science from Moscow State University in 2020. He is currently a postgraduate student at the MSU Graphics & Media Lab. His research interests include super-resolution, semantic video matting, and machine learning. Ivan supervised the development of the MSU video-upscaler and SR+codec benchmarks and was one of the organizers of the ECCV AIM 2024 video super-resolution quality assessment challenge.
Mikhail Viktorovich EROFEEV
Russian Federation
Cand. Sci. (Phys.-Math.). He graduated from the Computer Science department of Lomonosov Moscow State University and is currently a researcher at the MSU Graphics & Media Lab. His research interests include video and image matting and machine learning. Mikhail is one of the major contributors to the video-matting methods benchmark.
Andrey Viktorovich MOSKALENKO
Russian Federation
Received his master's degree in computer science from Moscow State University in 2023. He is currently a postgraduate student at the MSU Graphics & Media Lab. His research interests include visual attention modeling, interactive segmentation, image inpainting, and video quality assessment. Andrey is one of the core contributors to the subjective media quality assessment platform.
Dmitriy Sergeevich VATOLIN
Russian Federation
Cand. Sci. (Phys.-Math.) from Moscow State University, head of the MSU Graphics & Media Lab and the MSU AI Institute Video Analysis Lab, and a researcher at the Research Centre for Trusted AI of ISP RAS. His research interests include compression methods, video processing, 3D video techniques (depth from motion, focus, and other cues; video matting; background restoration; high-quality stereo generation), as well as video quality assessment and the robustness of modern video quality metrics. He is a key cofounder of the 3D video quality measurement project; his best-known project is the annual MSU Video Codecs Comparison, which covers up to 25 modern codecs compared subjectively and objectively in several categories, with 20,000+ detailed charts.
For citations:
MOLODETSKIKH I.A., EROFEEV M.V., MOSKALENKO A.V., VATOLIN D.S. Temporally Coherent Person Matting Trained on Fake-Motion Dataset. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(3):85-106. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(3)-6