Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

1University of Rochester, 2Meta AI, 3University of Maryland, College Park

Text-to-Video Generation Results


Overhead view from a drone that is flying above a patch of woods with a meadow of flowers of different colors in the middle.
A beautiful deer scatting in the snow.
Birds fly in the sky.
Leaves falling from a tree.
A massive tower in the distance surrounded by a forest; the camera moves towards it like a bird in flight.
A waterfall over a sparkling lake; the camera flies down, riding the waterfall from the top to the bottom.
A cartoon show with robots.
Standing on top of a mountainside watching the sunset, with vivid pinks, reds, and oranges showing from the fire-colored sky.
A black and white scene with trees.
There is a coffee cup on a piece of wood.
Someone is showing a small helicopter.
A cartoon about a cat.
A live concert with a band on stage.
An animation for the song "Twinkle Twinkle Little Star".
A horse in its stable.
Water being poured into a bowl with a powdered substance.
An animation of a brain.
A video tour of a rustic campsite.
A video shows a drone while music plays.
Cartoon aliens are dancing in space.

Abstract

We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space: pixel-space methods are often limited to first generating a low-resolution video and then applying a cascade of frame-interpolation and super-resolution models, which makes the entire pipeline complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes adding modules such as 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature-map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can still generate images despite being finetuned for T2V generation.

Method Overview

An illustration of our framework. From left to right: (a) An autoencoder is trained on images to learn a latent representation. (b) The pretrained autoencoder is adapted to encode and decode video frames independently. (c) During training, a temporal shift U-Net $\epsilon_\theta$ learns to denoise the latent video representation at a uniformly sampled diffusion step $t\in [1,T]$. During inference, the U-Net gradually denoises a sample from a normal distribution from step $\widehat{T}-1$ to $0$, where $\widehat{T}$ is the number of resampled diffusion steps used in inference. (d) The U-Net $\epsilon_\theta$ is composed of two key building blocks: the $2$D ResNet blocks with convolutional layers, highlighted in violet, and the transformer blocks with spatial attention layers, colored in gray. The temporal shift module, highlighted in red, shifts the feature maps along the temporal dimension. It is inserted into the residual branch of each $2$D ResNet block. The text condition is applied to the transformer blocks via cross-attention. The channel dimension $c$ of the latent representations $\mathbf{z}$ and $\mathbf{u}$ is omitted for clarity.
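The training and inference procedure in (c) follows standard latent diffusion. As a minimal sketch of one training step, the shapes, the linear noise schedule, and the placeholder `eps_theta` below are assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: F frames, c latent channels, h x w spatial latents,
# T diffusion steps (none of these values are taken from the paper).
F, c, h, w = 16, 4, 32, 32
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule (assumption)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step

def eps_theta(z_t, t):
    """Placeholder for the temporal-shift U-Net's noise prediction."""
    return np.zeros_like(z_t)

# One training step: noise the encoded video latent z at a uniformly
# sampled step t, then regress the added noise (epsilon-prediction loss).
z = rng.standard_normal((F, c, h, w))  # per-frame latents from the autoencoder
t = int(rng.integers(0, T))            # uniformly sampled diffusion step
eps = rng.standard_normal(z.shape)
z_t = np.sqrt(alphas_bar[t]) * z + np.sqrt(1.0 - alphas_bar[t]) * eps
loss = float(np.mean((eps - eps_theta(z_t, t)) ** 2))
```

At inference, the same network is applied iteratively, starting from pure Gaussian noise and denoising over the $\widehat{T}$ resampled steps.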

Temporal Shift

The temporal shift module enables each frame's feature $Z_i$ to contain channels from the adjacent frames $Z_{i-1}$ and $Z_{i+1}$, and thus enlarges the temporal receptive field by $2$. The $2$D convolutions after the temporal shift, which operate independently on each frame, can then capture and model both spatial and temporal information, as if an additional $1$D convolution with a kernel size of $3$ were run along the temporal dimension.

Text-to-Video Generation Comparison


Methods compared: Video Diffusion Models [1], CogVideo [2], Latent-Shift (Ours).
Prompts: Abstract background. Berlin - Brandenburg Gate at night. Fire. Forest in Autumn. Path in a tropical forest. Snowfall in city. Traffic jam on 23 de Maio avenue, both directions, south of Sao Paulo.

Conditional Video Generation Comparison on UCF-101


Methods compared: CogVideo [2], Latent-Shift (Ours).
Class labels: typing, fencing, knitting, bench press, handstand pushups, front crawl, kayaking, playing piano.

BibTeX


@article{an2023latentshift,
  title={Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation},
  author={Jie An and Songyang Zhang and Harry Yang and Sonal Gupta and Jia-Bin Huang and Jiebo Luo and Xi Yin},
  journal={arXiv preprint arXiv:2304.08477},
  year={2023}
}