CONTROLVIDEO: TRAINING-FREE CONTROLLABLE TEXT-TO-VIDEO GENERATION

  • Yabo Zhang
  • Yuxiang Wei
  • Dongsheng Jiang
  • Xiaopeng Zhang
  • Wangmeng Zuo
  • Qi Tian
  • Harbin Institute of Technology
  • Huawei Cloud

Research output: Contribution to conference › Paper › peer-review

Abstract

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts lag behind due to the excessive training cost. To avert the training burden, we propose a training-free ControlVideo to produce high-quality videos based on the provided text prompts and motion sequences. Specifically, ControlVideo adapts a pre-trained text-to-image model (i.e., ControlNet) for controllable text-to-video generation. To generate continuous videos without flicker effects, we propose an interleaved-frame smoother to smooth the intermediate frames. In particular, the interleaved-frame smoother splits the whole video into successive three-frame clips, and stabilizes each clip by updating the middle frame with an interpolation between the other two frames in latent space. Furthermore, a fully cross-frame interaction mechanism is exploited to further enhance the frame consistency, while a hierarchical sampler is employed to produce long videos efficiently. Extensive experiments demonstrate that ControlVideo outperforms state-of-the-art methods both quantitatively and qualitatively. It is worth noting that, thanks to the efficient designs, ControlVideo can generate both short and long videos within several minutes using one NVIDIA 2080Ti. Code and videos are available at this link.
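
The three-frame smoothing step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `interleaved_frame_smooth`, the `offset` parameter, and the use of a plain average as the interpolation are all hypothetical choices made for the example. The abstract does not specify the interpolation method or the denoising steps at which smoothing is applied.

```python
import numpy as np


def interleaved_frame_smooth(latents, offset=0):
    """Replace the middle frame of each three-frame clip with an
    interpolation of its two neighbours.

    latents: array of shape (num_frames, ...), one latent per frame.
    offset:  0 smooths frames 1, 3, 5, ...; 1 smooths frames 2, 4, 6, ...
             (hypothetical parameter used to pick the interleaved clip set).
    """
    smoothed = latents.copy()
    num_frames = latents.shape[0]
    # Successive clips share their boundary frames: (0,1,2), (2,3,4), ...
    for mid in range(offset + 1, num_frames - 1, 2):
        # Assumption: a simple average stands in for the interpolation.
        smoothed[mid] = 0.5 * (latents[mid - 1] + latents[mid + 1])
    return smoothed


# Toy example: five one-dimensional "latents" with flicker on odd frames.
frames = np.array([0.0, 10.0, 2.0, 10.0, 4.0]).reshape(5, 1)
print(interleaved_frame_smooth(frames, offset=0).ravel())
```

In this toy input the odd-indexed frames are the outliers, so smoothing with `offset=0` replaces them with the midpoint of their neighbours, while the boundary frames of each clip are left unchanged. Alternating `offset` between calls would cover every interior frame across two passes.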

Original language: English
State: Published - 2024
Event: 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria
Duration: 7 May 2024 to 11 May 2024

Conference

Conference: 12th International Conference on Learning Representations, ICLR 2024
Country/Territory: Austria
City: Hybrid, Vienna
Period: 7/05/24 to 11/05/24
