Video diffusion models, an advanced branch of generative models, are fundamental to synthesizing video from text descriptions. Despite remarkable progress in neighboring domains, such as ChatGPT for text and Midjourney for images, video generation models often struggle with temporal consistency and natural dynamics. To address this challenge, researchers from the S-Lab at Nanyang Technological University developed FreeInit, a pioneering technique designed to bridge the gap between the training and inference phases of video diffusion models, thereby greatly improving video quality.
FreeInit works by regulating the noise initialization process, a crucial step in video generation. Conventional models sample Gaussian noise at both the training and inference stages. However, the frequency distribution of the initial noise at inference differs from what the model sees during training, and this mismatch leads to temporally incoherent videos. FreeInit addresses the problem by iteratively refining the spatiotemporal low-frequency components of the initial noise. The method requires no additional training or learnable parameters and integrates seamlessly into existing video diffusion models at inference time.
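The operation underlying this refinement is a spatiotemporal low-pass filter applied in the Fourier domain. The following is a minimal PyTorch sketch of such a filter, assuming video latents of shape (B, C, T, H, W); the Gaussian mask shape and the cutoff parameters d_s and d_t are illustrative assumptions, not FreeInit's exact configuration.

```python
import torch
import torch.fft as fft

def low_pass_filter_3d(latents: torch.Tensor, d_s: float = 0.25, d_t: float = 0.25) -> torch.Tensor:
    """Keep only the spatiotemporal low-frequency band of a video latent.

    latents: (B, C, T, H, W) tensor.
    d_s, d_t: illustrative normalized cutoffs for space and time.
    """
    _, _, T, H, W = latents.shape
    # Transform the temporal and spatial axes to the frequency domain,
    # shifting the zero frequency to the center.
    freq = fft.fftshift(fft.fftn(latents, dim=(-3, -2, -1)), dim=(-3, -2, -1))

    # Build a Gaussian low-pass mask centered at zero frequency.
    t = torch.linspace(-1, 1, T).view(T, 1, 1)
    h = torch.linspace(-1, 1, H).view(1, H, 1)
    w = torch.linspace(-1, 1, W).view(1, 1, W)
    dist2 = (t / d_t) ** 2 + (h / d_s) ** 2 + (w / d_s) ** 2
    mask = torch.exp(-0.5 * dist2).to(latents.device)

    # Suppress high frequencies and transform back.
    out = fft.ifftn(fft.ifftshift(freq * mask, dim=(-3, -2, -1)), dim=(-3, -2, -1))
    return out.real
```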
FreeInit’s core technique lies in re-initializing the noise to narrow the gap between training and inference. It starts from independent Gaussian noise, which passes through the full denoising process to produce a clean video latent. This generated latent is then forward-diffused, yielding noisy latents whose low-frequency components carry improved temporal consistency. These low-frequency components are combined with the high-frequency components of fresh Gaussian noise to create a re-initialized noise that serves as the starting point for the next sampling iteration. Repeating this process greatly improves the temporal consistency and visual appearance of the generated videos.
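Under the same assumptions as the filter sketch above, the re-initialization step amounts to mixing the two frequency bands. The sketch below reuses low_pass_filter_3d, and the commented loop shows where it sits in sampling; denoise and forward_diffuse are hypothetical stand-ins for the model's reverse and forward diffusion processes, and the iteration count is illustrative.

```python
import torch

def reinitialize_noise(noisy_latents: torch.Tensor, d_s: float = 0.25, d_t: float = 0.25) -> torch.Tensor:
    """Mix the low-frequency band of the forward-diffused latents with the
    high-frequency band of fresh Gaussian noise (sketch; cutoffs illustrative).
    Requires low_pass_filter_3d from the previous sketch.
    """
    fresh = torch.randn_like(noisy_latents)
    low = low_pass_filter_3d(noisy_latents, d_s, d_t)
    # High-frequency band = fresh noise minus its own low-frequency band.
    high = fresh - low_pass_filter_3d(fresh, d_s, d_t)
    return low + high

# Illustrative refinement loop; `denoise` and `forward_diffuse` are
# hypothetical stand-ins for the model's reverse and forward processes.
#
# noise = torch.randn(1, 4, 16, 64, 64)
# for _ in range(num_refine_iters):
#     clean = denoise(noise, prompt)        # full reverse diffusion
#     noisy = forward_diffuse(clean)        # diffuse back to the last timestep
#     noise = reinitialize_noise(noisy)     # mixed-band noise for the next pass
```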
Extensive experiments were conducted to validate the efficacy of FreeInit, applying it to various text-to-video models such as AnimateDiff, ModelScope, and VideoCrafter. The results were remarkable, with temporal consistency scores improving by 2.92 to 8.62. Qualitative and quantitative gains were evident across a variety of text prompts, demonstrating FreeInit's flexibility and effectiveness in improving video generation models.
The researchers have made FreeInit openly available, encouraging its widespread use and further development. Integrating FreeInit into current video generation models promises to significantly advance video generation, bridging a crucial gap that has long challenged the field.