
Advancing Multimodal Video Generation with Responsible AI and Stylization

by Teleplay Technology January 13th, 2025

Too Long; Didn't Read

The research examines video generation, fairness, model scaling, super-resolution, and zero-shot evaluation, with a focus on Responsible AI and stylization.

Authors:

(1) Dan Kondratyuk, Google Research (equal contribution);

(2) Lijun Yu, Google Research and Carnegie Mellon University (equal contribution);

(3) Xiuye Gu, Google Research (equal contribution);

(4) Jose Lezama, Google Research (equal contribution);

(5) Jonathan Huang, Google Research (equal contribution);

(6) Grant Schindler, Google Research;

(7) Rachel Hornung, Google Research;

(8) Vighnesh Birodkar, Google Research;

(9) Jimmy Yan, Google Research;

(10) Krishna Somandepalli, Google Research;

(11) Hassan Akbari, Google Research;

(12) Yair Alon, Google Research;

(13) Yong Cheng, Google DeepMind;

(14) Josh Dillon, Google Research;

(15) Agrim Gupta, Google Research;

(16) Meera Hahn, Google Research;

(17) Anja Hauth, Google Research;

(18) David Hendon, Google Research;

(19) Alonso Martinez, Google Research;

(20) David Minnen, Google Research;

(21) Mikhail Sirotenko, Google Research;

(22) Kihyuk Sohn, Google Research;

(23) Xuan Yang, Google Research;

(24) Hartwig Adam, Google Research;

(25) Ming-Hsuan Yang, Google Research;

(26) Irfan Essa, Google Research;

(27) Huisheng Wang, Google Research;

(28) David A. Ross, Google Research;

(29) Bryan Seybold, Google Research (equal contribution);

(30) Lu Jiang, Google Research (equal contribution).

Abstract and 1 Introduction

2. Related Work

3. Model Overview and 3.1. Tokenization

3.2. Language Model Backbone and 3.3. Super-Resolution

4. LLM Pretraining for Generation

4.1. Task Prompt Design

4.2. Training Strategy

5. Experiments

5.1. Experimental Setup

5.2. Pretraining Task Analysis

5.3. Comparison with the State-of-the-Art

5.4. LLM’s Diverse Capabilities in Video Generation and 5.5. Limitations

6. Conclusion, Acknowledgements, and References

A. Appendix

A.1. Responsible AI and Fairness Analysis

We evaluate whether the generated outputs of our model are fair with respect to protected attributes: (1) perceived age, (2) perceived gender expression, and (3) perceived skin tone. We construct 306 prompts from the template "a {profession or people descriptor} looking {adverb} at the camera", where professions are crawled from the US Bureau of Labor Statistics and people descriptors include emotional state, socioeconomic class, etc. The adverb (e.g., "straightly" or "directly") is varied to produce prompt variants with unchanged semantics. We generate 8 videos per prompt, and for each generated video we infer an approximation of the expressed value of each of the 3 protected attributes. Across 10 prompts that share the same semantic meaning but use different adverbs, we observe that our outputs generally shift the distribution toward "Young Adults" (age 18-35), "Male", and "Light Skin Tone". However, changing the adverb in the prompt template can significantly alter the output distributions. Therefore, even though the prompts are semantically unchanged, our model can be prompted to produce outputs with non-uniform distributions across these groups, but it can also be prompted to enhance uniformity. While research has been conducted in the image generation and recognition domain (Zhang et al., 2023c; Schumann et al., 2021; 2023; Chiu et al., 2023), this finding highlights the importance of continued research to develop strategies that mitigate these issues and improve fairness for video generation.
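As an illustration of how such a prompt set can be assembled, the sketch below builds prompts from the template described above; the specific descriptor and adverb lists are placeholders, not the ones used in this analysis.

```python
from itertools import product

# Placeholder lists; the actual study crawled professions from the
# US Bureau of Labor Statistics and used its own descriptor/adverb sets.
descriptors = ["doctor", "teacher", "wealthy person", "cheerful person"]
adverbs = ["directly", "straightly", "intently"]

TEMPLATE = "a {descriptor} looking {adverb} at the camera"

prompts = [
    TEMPLATE.format(descriptor=d, adverb=a)
    for d, a in product(descriptors, adverbs)
]

# Each prompt is then used to sample 8 videos, and the perceived
# age / gender expression / skin tone of each output is estimated
# to compare distributions across semantically equivalent prompts.
for p in prompts[:3]:
    print(p)
```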

A.2. Model Scale and Performance

To analyze model performance versus model scale, we use a subset of the training set without text-paired data and a slightly different task prompt design. We evaluate video generation quality using FVD (Unterthiner et al., 2018) and audio generation quality using the Fréchet Audio Distance (FAD), which uses the VGGish model as the embedding function (Hershey et al., 2017). Both FVD and FAD metrics are calculated on a held-out subset of 25,000 videos.
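Both metrics compare Gaussian fits of embedding distributions from real and generated samples. A minimal sketch of the underlying Fréchet distance computation, assuming the embeddings have already been extracted (I3D features for FVD, VGGish features for FAD), is shown below.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of embeddings.

    real_feats, gen_feats: arrays of shape (num_samples, embedding_dim),
    e.g. I3D features for FVD or VGGish features for FAD.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; keep the real part to
    # drop small imaginary components caused by numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g).real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```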


Fig. 8 shows that as the model size grows and the amount of training data increases, performance improves across visual and audiovisual tasks. After obtaining the above results, we retrain our 1B and 8B models using the task design and text-paired training data discussed in Section 3. Appendix A.2.1 shows a qualitative comparison of the 1B and 8B pretrained models. Increasing the model size improved temporal consistency, prompt fidelity, and motion dynamics while adding capabilities for limited text rendering, spatial understanding, and counting.


Figure 8: Effects of model and data scale on video and audio generation quality. The performance, depicted on a log-log scale, improves significantly when we scale up the model and training data. Language models with 300 million, 1 billion, and 8 billion parameters are trained on datasets comprising 10, 37, and 58 billion visual and audio tokens, respectively.


A.2.1. QUALITATIVE COMPARISON OF 1B AND 8B MODELS


In Figure 9, we show outputs of the 1B and 8B parameter models on the same prompts. Four frames from the best video output of each model in a batch of four text-to-video samples were selected to represent the model. In the first row, the 1B model is unstable, with large changes to the subject over time, and misses elements from the complex prompt. This prompt was originally used for scaling comparisons in (Yu et al., 2022); compared to a dedicated image-only model, our model does not preserve text as well given the training data used. In the second row, we use a simpler text task and show that the 8B model can render a single letter clearly, whereas the 1B model still produces artifacts. In the third row, we show that the 8B model learns spatial positioning, such as the river being in front of the astronaut and horse. In the fourth row, we show that the 8B model learned a stop-motion style to have items disappear "one by one" and can follow a complicated layout from a long prompt. In contrast, the 1B model includes all of the nouns but is unstable over time and does not follow the layout indicated in the prompt. In the bottom row, we show that the 8B model understands object counts in that it displays a full bouquet (though 12 roses are not explicitly in frame) with smooth, consistent motion, as opposed to the 5 roses and distorting objects produced by the 1B model. Overall, scaling the model improved temporal consistency, prompt fidelity, and motion dynamics while adding capabilities for limited text rendering, spatial understanding, and counting.


Figure 9: A comparison of the 1B (left) and 8B (right) parameter models on the same prompt and settings.


Figure 10: Example of zero-shot video editing via task chaining (outpainting and stylization) – the original video is first outpainted and then stylized via a text prompt.

A.3. Additional Generated Examples

In addition to Fig. 10 and Fig. 11, we include most generated videos in the supplementary materials for better visualization of motion and visual quality.

A.4. Video Stylization

To perform video stylization, we follow an approach inspired by (Zhang et al., 2023b; Chen et al., 2023b; Esser et al., 2023) to predict videos from the combination of text, optical flow, and depth signals. On a subset of steps, we also condition on the first video frame. As described in (Esser et al., 2023), the text generally defines the "content" or appearance of the output, while the optical flow and depth control the "structure." In contrast to diffusion-based approaches, which usually use external cross-attention networks (Zhang et al., 2023b) or latent blending (Meng et al., 2021) for stylization, our approach is more closely related to machine translation with large language models in that we only need to provide the structure and text as a prefix to the language model.
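A minimal sketch of this prefix-based conditioning follows; the token names and sequence layout are illustrative assumptions, not the exact special tokens used by the model.

```python
from typing import List

# Hypothetical token ids; the real model uses its own special-token vocabulary.
TASK_STYLIZATION = 7      # task prompt token (assumed)
BOT_I, EOT_I = 10, 11     # begin/end of the text-conditioning span (assumed)
BOV_I = 12                # begin of the visual span (assumed)

def build_stylization_prefix(text_tokens: List[int],
                             structure_tokens: List[int]) -> List[int]:
    """Assemble the LM prefix: task token, text, then flow+depth (structure) tokens.

    The model is then asked to continue the sequence with RGB video tokens,
    so stylization reduces to ordinary prefix-conditioned decoding.
    """
    return ([TASK_STYLIZATION]
            + [BOT_I] + text_tokens + [EOT_I]
            + [BOV_I] + structure_tokens)  # video tokens are generated after this

# Usage sketch: prefix = build_stylization_prefix(text_ids, flow_depth_token_ids)
```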


To perform the task, we estimate optical flow with RAFT (Sun et al., 2022) and produce monocular depth maps with MiDaS (Ranftl et al., 2020), and then normalize and concatenate them on the channel dimension. This conveniently produces the same number of channels as the RGB ground truth, so the result can be tokenized in the same fashion as RGB videos with the MAGVIT-v2 tokenizer without retraining the tokenizer. The task of stylization is to reconstruct the ground-truth video from the given optical flow, depth, and text information. During inference, we apply optical flow and depth estimation to an input video but then vary the text prompt to generate a new style, e.g., "cartoon".


Figure 11: Examples of directed camera movement from the same initial frame.


Figure 12: Human side-by-side evaluations comparing VideoPoet with the video stylization model Control-A-Video (Chen et al., 2023b). Raters prefer VideoPoet on both text fidelity and video quality. Green and pink bars represent the proportion of trials where VideoPoet was preferred over an alternative, or preferred less than an alternative, respectively.
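As a rough illustration of the flow-and-depth conditioning described above (not the exact preprocessing code), the signal can be assembled as follows, assuming a 2-channel flow field and a 1-channel depth map so that the result matches the 3 RGB channels expected by the tokenizer.

```python
import numpy as np

def build_structure_video(flow: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Combine optical flow and depth into a 3-channel, RGB-shaped video.

    flow:  (T, H, W, 2) optical flow field
    depth: (T, H, W, 1) monocular depth map
    Returns an array of shape (T, H, W, 3) normalized to [0, 1], which can be
    fed to the video tokenizer exactly like an RGB clip.
    """
    def normalize(x: np.ndarray) -> np.ndarray:
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo + 1e-8)

    structure = np.concatenate([normalize(flow), normalize(depth)], axis=-1)
    assert structure.shape[-1] == 3  # same channel count as the RGB ground truth
    return structure
```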


Table 3: Comparison on video stylization. VideoPoet outperforms Control-A-Video by a large margin.


To evaluate stylization capabilities, we choose 20 videos from the public DAVIS 2016[2] (Perazzi et al., 2016) dataset and provide 2 style prompts for each video. For more details, please refer to Appendix A.5.6. Following (Esser et al., 2023), we evaluate the CLIP-embedding consistency between each frame and the text prompt to determine whether the stylization results match the text. As shown in Table 3, VideoPoet outperforms Control-A-Video conditioned on depth by a large margin. We also conduct the human evaluations discussed above, comparing with Control-A-Video (Chen et al., 2023b); human raters consistently prefer our text fidelity and video quality, as shown in Fig. 12.
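A minimal sketch of this per-frame CLIP consistency score is given below; it assumes the public CLIP ViT-B/16 checkpoint from Hugging Face rather than the exact evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the evaluation uses a CLIP ViT-B/16 backbone.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def clip_text_consistency(frames, prompt: str) -> float:
    """Average cosine similarity between each frame and the style prompt.

    frames: list of PIL.Image frames from one stylized output video.
    """
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```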


Table 4: List of representative special tokens used in training and inference.

A.5. Additional Implementation and Evaluation Details

A.5.1. ADDITIONAL IMPLEMENTATION DETAILS


The unified vocabulary is constructed as follows: the initial 256 codes are reserved for special tokens and task prompts. Table 4 lists some examples of special tokens. Subsequently, the next 262,144 codes are allocated for image and video tokenization. This is followed by 4,096 audio codes. We also include a small text vocabulary of English words. Overall, this produces a total vocabulary size of approximately 300,000.
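The layout can be summarized as contiguous id ranges; the sketch below illustrates this with offsets derived from the counts above (the English text vocabulary size is a placeholder assumption).

```python
# Illustrative layout of the unified vocabulary (sizes from the text above;
# the English text vocabulary size is a placeholder assumption).
NUM_SPECIAL = 256          # special tokens and task prompts
NUM_VISUAL = 262_144       # image and video codes
NUM_AUDIO = 4_096          # audio codes
NUM_TEXT = 32_000          # placeholder for the small English word vocabulary

VISUAL_OFFSET = NUM_SPECIAL
AUDIO_OFFSET = VISUAL_OFFSET + NUM_VISUAL
TEXT_OFFSET = AUDIO_OFFSET + NUM_AUDIO

def visual_code_to_token_id(code: int) -> int:
    """Map a raw visual code to its id in the unified vocabulary."""
    assert 0 <= code < NUM_VISUAL
    return VISUAL_OFFSET + code

TOTAL_VOCAB = TEXT_OFFSET + NUM_TEXT  # roughly 300K, consistent with the text
```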


Since the first frame is tokenized separately, MAGVIT-v2 allows images to be represented in the same vocabulary as video. In addition to being more compact, images provide many learnable characteristics that are not typically represented in videos, such as strong visual styles (e.g., art paintings), objects that are infrequently seen in video, rich captions, and significantly more text-image paired training data. When training on images, we resize them to 128×128, which are then tokenized to a latent shape of (1, 16, 16), or 256 tokens. We scale up the MAGVIT-v2 model and train it on the datasets discussed in Section 5.1. The training follows two steps: image training, then inflation (Yu et al., 2023c) and video training. Because images require fewer tokens, we can include roughly 5× more images per batch than videos, i.e., 256 image tokens vs. 1280 video tokens. We use up to a maximum of 64 text tokens for all of our experiments. For the <res> token, the resolution is only specified for 128 × 224 output; 128 × 128 resolution is assumed otherwise.
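For intuition, the token budgets quoted above follow directly from the latent shapes; the sketch below assumes the 1280-token videos correspond to a (5, 16, 16) latent, which is an inference from the numbers rather than a stated fact.

```python
from math import prod

IMAGE_LATENT = (1, 16, 16)   # stated: 256 tokens per image
VIDEO_LATENT = (5, 16, 16)   # assumed: 1280 tokens per video (matches the text)

image_tokens = prod(IMAGE_LATENT)   # 256
video_tokens = prod(VIDEO_LATENT)   # 1280

# Roughly 5x more images than videos fit in the same visual-token budget.
print(video_tokens // image_tokens)  # -> 5
```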


The video-to-video tasks use the COMMIT encoding (Yu et al., 2023a) to obtain the tokens for tasks such as inpainting and outpainting. Text is encoded as T5 XL embeddings (Raffel et al., 2020), which are inserted into reserved sequence positions right after the <bot_i> token, as shown in Fig. 2.


A.5.2. SUPER-RESOLUTION IMPLEMENTATION DETAILS


We use a 1B model for the first 2× spatial super-resolution stage and a 500M model for the second 2× stage. The first super-resolution stage models videos of 17×448×256 pixels with a token sequence of shape (5, 56, 32). The second stage models videos of 17×896×512 pixels with a token sequence of shape (5, 112, 64). The token sequences are obtained with the same MAGVIT-v2 (Yu et al., 2023c) tokenizer used for the base language model. The custom super-resolution transformer has local self-attention windows for the vertical, horizontal, and temporal layers of shapes (1, 56, 4), (1, 8, 32), and (5, 8, 8) in the first stage and (1, 112, 2), (1, 4, 64), and (5, 8, 8) in the second stage, respectively (Fig. 3). The cross-attention layers attend to local windows in the low-resolution sequence isomorphic to the self-attention windows but with half the spatial size.
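The sketch below simply encodes these shapes as a configuration and checks that each attention window tiles its token grid; the dataclass name is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SRStageConfig:
    token_shape: Tuple[int, int, int]            # (T, H, W) of the token grid
    windows: Tuple[Tuple[int, int, int], ...]    # vertical, horizontal, temporal

STAGE_1 = SRStageConfig(token_shape=(5, 56, 32),
                        windows=((1, 56, 4), (1, 8, 32), (5, 8, 8)))
STAGE_2 = SRStageConfig(token_shape=(5, 112, 64),
                        windows=((1, 112, 2), (1, 4, 64), (5, 8, 8)))

def check_tiling(cfg: SRStageConfig) -> None:
    """Each local self-attention window should evenly tile the token grid."""
    for win in cfg.windows:
        assert all(dim % w == 0 for dim, w in zip(cfg.token_shape, win)), (cfg, win)

check_tiling(STAGE_1)
check_tiling(STAGE_2)
```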


We train the super-resolution stages on a dataset of 64M high-quality text-video pairs using the masked modeling objective of MAGVIT (Yu et al., 2023a), with token factorization into k = 2 groups (Yu et al., 2023c). During inference, we use the sampling algorithm of MAGVIT-v2 (Yu et al., 2023c) with 24 sampling steps for each stage and classifier-free guidance scale (Ho & Salimans, 2022; Brooks et al., 2023) of 4.0/8.0 for the text condition and 1.0/2.0 for the low-resolution condition, in the first/second stage.
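As an illustrative sketch only (the model interface and the exact composition order are assumptions, not details given here), two-condition guidance in the general form popularized by Brooks et al. (2023) can be written as:

```python
import torch

def guided_logits(model, tokens,
                  text_cond, lowres_cond,
                  text_scale: float, lowres_scale: float) -> torch.Tensor:
    """Classifier-free guidance over two conditions (illustrative sketch).

    `model(tokens, text=..., lowres=...)` is a hypothetical interface returning
    token logits; passing None drops that condition (the unconditional branch).
    """
    uncond = model(tokens, text=None, lowres=None)
    lowres_only = model(tokens, text=None, lowres=lowres_cond)
    full = model(tokens, text=text_cond, lowres=lowres_cond)

    return (uncond
            + lowres_scale * (lowres_only - uncond)
            + text_scale * (full - lowres_only))

# e.g. text_scale=4.0, lowres_scale=1.0 in the first stage,
#      text_scale=8.0, lowres_scale=2.0 in the second.
```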


A.5.3. ADDITIONAL EVALUATION DETAILS


We measure CLIP similarity scores (Wu et al., 2021) following the implementation given by Villegas et al. (2022), measure FVD (Unterthiner et al., 2018) following Yu et al. (2023a) on the UCF-101 dataset and following Zhang et al. (2023a) on MSR-VTT, and measure the Inception Score (IS) (Saito et al., 2020). When the evaluation protocol calls for 16 frames, we discard the last generated frame to obtain a 16-frame video.


A.5.4. ZERO-SHOT TEXT-TO-VIDEO EVALUATION SETTINGS


We report the details of our zero-shot text-to-video settings here. Some details are missing from previous papers, and different papers use different settings; hence, we provide all the details in the hope that this evaluation setting can serve as a standard text-to-video generation benchmark. Our results are reported with the 8B model, and we adopt classifier-free guidance (Ho & Salimans, 2022).


All metrics are evaluated on generated videos containing 16 frames at a resolution of 256×256. We first generate videos at 128×128 resolution and then resize them to 256×256 via bicubic upsampling.


Zero-shot MSR-VTT. For the CLIP score, we use all 59,794 captions from the MSR-VTT test set. We use the CLIP ViT-B/16 model following Phenaki (Villegas et al., 2022). We note that some papers use other CLIP models, e.g., VideoLDM (Blattmann et al., 2023b) uses ViT-B/32; our CLIP score evaluated with the ViT-B/32 backbone on MSR-VTT is 30.01. For the FVD metric, to evaluate on a wide range of captions and to remain comparable with previous papers that evaluate on 2,048 videos, we evaluate on the first 40,960 captions of the MSR-VTT test set. More specifically, we report the FVD metrics on 2,048 videos with 20 repeats. The FVD real features are extracted from 2,048 videos sampled from the MSR-VTT test set. We sample the central 16 frames of each real video without any temporal downsampling, i.e., we use the original fps of the MSR-VTT dataset (30 fps as reported in Xu et al. (2016)). FVD is evaluated with an I3D model trained on Kinetics-400.
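The real-feature protocol amounts to taking the central 16 frames of each reference clip at its native frame rate; a small sketch of that sampling step (function name assumed) is:

```python
import numpy as np

def central_16_frames(video: np.ndarray) -> np.ndarray:
    """Take the central 16 frames of a (T, H, W, C) clip, no temporal downsampling."""
    t = video.shape[0]
    assert t >= 16, "clip shorter than 16 frames"
    start = (t - 16) // 2
    return video[start:start + 16]
```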


Zero-shot UCF-101. Following VDM (Ho et al., 2022b), we sample 10,000 videos from the UCF-101 test set and use their categories as the text prompts to generate 10,000 videos. We use the class text prompts provided in PYoCo (Ge et al., 2023) to represent the 101 categories. To compute the FVD real features, we sample 10,000 videos from the training set, following TGAN2 (Saito et al., 2020). We sample the central 16 frames of each real video without any temporal downsampling, i.e., we use the original fps of the UCF-101 dataset (25 fps as reported in (Soomro et al., 2012)). The FVD metric is evaluated with an I3D model trained on Kinetics-400, and the IS metric is evaluated with a C3D model trained on UCF-101.


A.5.5. SELF-SUPERVISED TASKS EVALUATION SETTINGS


Self-supervised learning tasks include frame prediction on K600 conditioned on 5 frames, as well as inpainting and outpainting on SSv2. FVD (Unterthiner et al., 2018) is used as the primary metric, calculated with 16 frames at 128×128 resolution. We follow MAGVIT (Yu et al., 2023a) in evaluating these tasks against the respective real distributions, using 50,000×4 samples for K600 and 50,000 samples for SSv2.


A.5.6. STYLIZATION EVALUATION ON DAVIS


To evaluate the CLIP similarity score and human preference on video stylization, we use the following set of videos and prompts. We select 20 videos from DAVIS 2016 (Perazzi et al., 2016); for each video, we take 16 frames starting from the initial frame specified below (Table 5) and evaluate stylization with the two text prompts specified there. For easy reproducibility, we use a central square crop at the height of the video and evaluate the output videos at 256×256 resolution. We use CLIP-B/16 for the similarity score. Several prompts below are used in or inspired by previous work (Esser et al., 2023; Chen et al., 2023b; Liew et al., 2023).
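A minimal sketch of the crop-and-resize step described above (assuming landscape frames, so the square side equals the frame height) is:

```python
from PIL import Image

def center_square_crop_256(frame: Image.Image) -> Image.Image:
    """Central square crop at the height of the video, resized to 256×256."""
    w, h = frame.size
    side = min(w, h)            # equals the height for landscape DAVIS frames
    left = (w - side) // 2
    top = (h - side) // 2
    crop = frame.crop((left, top, left + side, top + side))
    return crop.resize((256, 256), Image.Resampling.BICUBIC)
```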


Table 5: DAVIS stylization evaluation settings.


This paper is available on arXiv under the CC BY 4.0 DEED license.


[2] DAVIS license: https://creativecommons.org/licenses/by-nc/4.0/deed.en