Update README.md
README.md CHANGED
@@ -7,7 +7,7 @@ license_link: LICENSE
 <!-- ## **HunyuanVideo** -->
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/logo.png"  height=100>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/logo.png"  height=100>
 </p>
 
 # HunyuanVideo: A Systematic Framework For Large Video Generation Model Training
@@ -71,7 +71,7 @@ using a large language model, and used as the condition. Gaussian noise and cond
 input, our generation model generates an output latent, which is decoded to images or videos through
 the 3D VAE decoder.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/overall.png"  height=300>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/overall.png"  height=300>
 </p>
 
 ## 🎉 **HunyuanVideo Key Features**
@@ -83,7 +83,7 @@ tokens and feed them into subsequent Transformer blocks for effective multimodal
 This design captures complex interactions between visual and semantic information, enhancing
 overall model performance.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/backbone.png"  height=350>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/backbone.png"  height=350>
 </p>
 
 ### **MLLM Text Encoder**
@@ -91,13 +91,13 @@ Some previous text-to-video models typically use pretrained CLIP and T5-XXL as te
 Compared with CLIP, MLLM has demonstrated superior ability in image detail description
 and complex reasoning; (iii) MLLM can play as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention while T5-XXL utilizes bidirectional attention that produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner for enhancing text features.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/text_encoder.png"  height=275>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/text_encoder.png"  height=275>
 </p>
 
 ### **3D VAE**
 HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space and channel to 4, 8 and 16 respectively. This can significantly reduce the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at the original resolution and frame rate.
 <p align="center">
-  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/3dvae.png"  height=150>
+  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/refs/heads/main/assets/3dvae.png"  height=150>
 </p>
 
 ### **Prompt Rewrite**
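
The 3D VAE paragraph in the last hunk above states compression factors of 4 (video length), 8 (space) and 16 (channel). A minimal back-of-the-envelope sketch of what that implies for latent and token counts is below; it is not HunyuanVideo code, and the causal "first frame maps to one latent" convention and the 2x2 patchification are assumptions for illustration only.

```python
# Sketch only: estimate latent shape from the compression ratios quoted in the
# README diff context (4x temporal, 8x spatial, 16 latent channels).
def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int, int]:
    t_ratio, s_ratio, latent_channels = 4, 8, 16
    # Assumption: causal 3D VAE keeps the first frame, so T frames -> 1 + (T-1)/4 latents.
    latent_frames = 1 + (frames - 1) // t_ratio
    return (latent_channels, latent_frames, height // s_ratio, width // s_ratio)

if __name__ == "__main__":
    c, t, h, w = latent_shape(129, 720, 1280)       # e.g. a 129-frame 720p clip
    print(f"latent: {c} x {t} x {h} x {w}")          # 16 x 33 x 90 x 160
    # Illustrative token count if the diffusion transformer patchifies 2x2 spatially.
    print("approx. tokens:", t * (h // 2) * (w // 2))
```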
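The context in the second and fourth hunks describes the generation flow: an MLLM text encoder (with an extra bidirectional token refiner) conditions a diffusion transformer that denoises Gaussian noise into a latent, which the 3D VAE decoder turns into video. The sketch below is schematic only; every name (`text_encoder`, `token_refiner`, `dit`, `vae_decoder`, `generate`) is a placeholder, not an identifier from the HunyuanVideo codebase.

```python
# Schematic sketch of the flow described in the README diff context, not the released API.
import torch

def generate(prompt: str, text_encoder, token_refiner, dit, vae_decoder,
             latent_shape=(1, 16, 33, 90, 160), steps: int = 50):
    text_feats = text_encoder(prompt)        # causal MLLM text features
    text_feats = token_refiner(text_feats)   # extra bidirectional refinement of the features
    latent = torch.randn(latent_shape)       # Gaussian noise in the VAE latent space
    for t in reversed(range(steps)):         # iterative denoising (scheduler details omitted)
        latent = dit(latent, text_feats, t)
    return vae_decoder(latent)               # 3D VAE decoder -> pixel-space video
```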