Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes resolution enhancement while giving less consideration to related tasks such as facial colorization and inpainting.
In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization, three tasks that we empirically show benefit one another. We present a unified framework, termed Stable Video Face Restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification, while a novel Unified Latent Regularization (ULR) encourages shared feature representation learning across the subtasks. To further improve restoration quality and temporal stability, we introduce facial prior learning and self-referred refinement as auxiliary strategies for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state of the art in video FR and establishes a new paradigm for generalized video face restoration.
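To make the two core ingredients concrete, the following is a minimal PyTorch sketch of how a learnable task embedding and a latent-alignment regularizer in the spirit of ULR might be wired up. The module structure, the cosine form of the loss, and all names (TaskConditionedDenoiser, unified_latent_regularization) are our own illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedDenoiser(nn.Module):
    # Toy stand-in for the SVD U-Net; only the task-embedding wiring
    # is the point here (hypothetical module, not the paper's code).
    def __init__(self, num_tasks=3, embed_dim=320):
        super().__init__()
        # One learnable vector per subtask: BFR / inpainting / colorization.
        self.task_embed = nn.Embedding(num_tasks, embed_dim)
        self.encode = nn.Conv2d(4, embed_dim, 3, padding=1)
        self.decode = nn.Conv2d(embed_dim, 4, 3, padding=1)

    def forward(self, latents, task_id):
        emb = self.task_embed(task_id)                        # (B, C)
        feat = self.encode(latents) + emb[:, :, None, None]   # inject task signal
        feat = F.silu(feat)
        return self.decode(feat), feat                        # prediction + latent feature

def unified_latent_regularization(feat_a, feat_b):
    # Assumed alignment loss: pull intermediate features produced under
    # different task conditions toward a shared representation. The
    # cosine form is our guess at how "shared features" are encouraged.
    a = F.normalize(feat_a.flatten(1), dim=1)
    b = F.normalize(feat_b.flatten(1), dim=1)
    return 1.0 - (a * b).sum(dim=1).mean()

In this reading, the task embedding plays the same role as a class condition in conditional diffusion models, while the regularizer ties the per-task branches to one latent space so that supervision from one subtask can transfer to the others.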
Pipeline of our proposed SVFR. (a) Training pipeline of SVFR: we incorporate task information into training via a learnable task embedding to enhance unified face restoration, and propose Unified Latent Regularization to aggregate task-specific features and improve performance across the subtasks. Additionally, we introduce facial landmark information as auxiliary guidance to help the model learn structural priors of the human face. Finally, we propose a self-referred refinement method that ensures temporal stability: during training, a reference frame is randomly provided so that the model learns to produce consistent results whenever a reference is available. (b) Inference pipeline of SVFR: at inference, we first generate the initial video clip without a reference frame, then select one restored frame as the reference image for subsequent video clips.
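The self-referred inference loop in (b) can be summarized in a few lines. This is a hedged sketch only: model.restore and its signature are hypothetical placeholders for whatever denoising call SVFR actually exposes, and picking the last frame as the reference is our assumption (the caption says only that a result frame is selected).

import torch

@torch.no_grad()
def self_referred_inference(model, degraded_clips, task_id):
    # Restore a long video clip-by-clip, reusing a previously restored
    # frame as the reference for every clip after the first.
    restored = []
    reference = None  # first clip is generated without a reference frame
    for clip in degraded_clips:
        out = model.restore(clip, task_id, reference=reference)  # hypothetical API
        restored.append(out)
        # Select one result frame (here: the last one) as the reference
        # image for the next clip, keeping identity and tone consistent.
        reference = out[-1]
    return torch.cat(restored, dim=0)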
@misc{wang2025svfrunifiedframeworkgeneralized,
      title={SVFR: A Unified Framework for Generalized Video Face Restoration},
      author={Zhiyao Wang and Xu Chen and Chengming Xu and Junwei Zhu and Xiaobin Hu and Jiangning Zhang and Chengjie Wang and Yuqi Liu and Yiyi Zhou and Rongrong Ji},
      year={2025},
      eprint={2501.01235},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01235},
}