caT text to video

A conditionally augmented text-to-video model. It uses pre-trained weights from the ModelScope text-to-video model, augmented with temporal conditioning transformers, to extend generated clips and create smooth transitions between them. It also supports prompt interpolation to change scenes during clip extension.
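
A minimal sketch of the clip-extension idea, assuming a hypothetical generate_clip sampling call (not the repository's actual API): the last 8 frames of the existing video condition the next 8 generated frames, so successive clips chain into one longer, smoothly connected video.

from typing import Optional
import torch

def generate_clip(prompt: str, conditioning_frames: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Stand-in: a real implementation would run the diffusion sampler conditioned
    # on the prompt embedding and, when given, on the trailing frames of the
    # previous clip.
    return torch.randn(8, 3, 320, 320)  # (frames, channels, height, width)

# First clip is generated from the prompt alone.
video = generate_clip("A lion is looking around")

# The extension is conditioned on the last 8 frames of what exists so far,
# which is what produces a smooth transition into the new clip.
extension = generate_clip("A lion is running", conditioning_frames=video[-8:])
video = torch.cat([video, extension], dim=0)

print(video.shape)  # torch.Size([16, 3, 320, 320])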

The model was trained on two RTX 6000 Ada GPUs for 5 million steps on the WebVid-10M dataset, with a batch size of 1 and a learning rate of 1e-6 at a resolution of 320x320. Each training sample used 8 conditioning frames and 8 noisy frames, with a frame stride of 6.
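
For reference, the same hyperparameters gathered into a single Python dict (key names are illustrative and do not necessarily match the repository's training config):

train_config = {
    "dataset": "WebVid-10M",
    "gpus": "2x RTX 6000 Ada",
    "steps": 5_000_000,
    "batch_size": 1,
    "learning_rate": 1e-6,
    "resolution": (320, 320),
    "conditioning_frames": 8,  # frames from the previous clip used as conditioning
    "noisy_frames": 8,         # frames denoised for the extension
    "frame_stride": 6,
}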

Installation

Clone the Repository and Install Dependencies

git clone https://github.com/motexture/caT-text-to-video-2.3b/
cd caT-text-to-video-2.3b

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

# Install the requirements and the CUDA 12.1 builds of PyTorch
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Launch the web interface
python run.py

Visit the provided URL in your browser to interact with the interface and start generating videos.

Examples:

A guy is riding a bike -> A guy is riding a motorcycle

Will Smith is eating a hamburger -> Will Smith is eating an ice cream

A lion is looking around -> A lion is running

Darth Vader is surfing on the ocean

A beautiful anime girl with pink hair -> Anime girl laughing
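
The transitions above rely on prompt interpolation. A minimal sketch, assuming the interpolation is a simple linear blend of the two prompts' text-encoder embeddings across the frames of the extension (shapes and values are stand-ins, not the model's actual API):

import torch

def interpolate_prompt_embeddings(emb_a: torch.Tensor, emb_b: torch.Tensor, num_frames: int) -> torch.Tensor:
    # Linearly blend two prompt embeddings across num_frames, producing one
    # conditioning tensor per frame of the extension clip.
    weights = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1)
    return (1.0 - weights) * emb_a.unsqueeze(0) + weights * emb_b.unsqueeze(0)

# Stand-in embeddings for "A guy is riding a bike" -> "A guy is riding a motorcycle";
# a real pipeline would take these from the text encoder.
emb_bike = torch.randn(77, 1024)
emb_motorcycle = torch.randn(77, 1024)
per_frame = interpolate_prompt_embeddings(emb_bike, emb_motorcycle, num_frames=8)
print(per_frame.shape)  # torch.Size([8, 77, 1024])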
