BianYx commited on
Commit
f54b613
·
verified ·
1 Parent(s): 64f9939

Update README.md

  1. README.md +426 -1
README.md CHANGED
@@ -10,4 +10,429 @@ tags:
10
  - video
11
  - video inpainting
12
  - video editing
13
- ---
10
  - video
11
  - video inpainting
12
  - video editing
13
+ ---
14
+
15
+ # VideoPainter
16
+
17
+ This repository contains the implementation of the paper "VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control"
18
+
19
+ Keywords: Video Inpainting, Video Editing, Video Generation
20
+
21
+ > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2✉</sup><br>
22
+ > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‡</sup>Project Lead <sup>✉</sup>Corresponding Author
23
+
24
+
25
+
26
+ <p align="center">
27
+ <a href="https://yxbian23.github.io/project/video-painter">🌐Project Page</a> |
28
+ <a href="https://arxiv.org/abs/xxx">📜Arxiv</a> |
29
+ <a href="https://huggingface.co/datasets/TencentARC/VPBench">🗄️Data</a> |
30
+ <a href="https://youtu.be/HYzNfsD3A0s">📹Video</a> |
31
+ <a href="https://huggingface.co/TencentARC/VideoPainter">🤗Hugging Face Model</a> |
32
+ </p>
33
+
34
+
35
+ **📖 Table of Contents**
36
+
37
+
38
+ - [VideoPainter](#videopainter)
39
+ - [🔥 Update Log](#-update-log)
40
+ - [📌 TODO](#todo)
41
+ - [🛠️ Method Overview](#️-method-overview)
42
+ - [🚀 Getting Started](#-getting-started)
43
+ - [Environment Requirement 🌍](#environment-requirement-)
44
+ - [Data Download ⬇️](#data-download-️)
45
+ - [🏃🏼 Running Scripts](#-running-scripts)
46
+ - [Training 🤯](#training-)
47
+ - [Inference 📜](#inference-)
48
+ - [Evaluation 📏](#evaluation-)
49
+ - [🤝🏼 Cite Us](#-cite-us)
50
+ - [💖 Acknowledgement](#-acknowledgement)
51
+
52
+
53
+
54
+ ## 🔥 Update Log
55
+ - [2025/3/09] 📢 📢 [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) is released: an efficient, any-length video inpainting & editing framework with plug-and-play context control.
56
+ - [2025/3/09] 📢 📢 [VPData and VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released: the largest video inpainting dataset with precise segmentation masks and dense video captions (>390K clips).
57
+
58
+ ## TODO
59
+
60
+ - [x] Release training and inference code
61
+ - [x] Release evaluation code
62
+ - [x] Release [VideoPainter checkpoints](https://huggingface.co/TencentARC/VideoPainter) (based on CogVideoX-5B)
63
+ - [x] Release [VPData and VPBench](https://huggingface.co/datasets/TencentARC/VPBench) for large-scale training and evaluation.
64
+ - [x] Release gradio demo
65
+ - [ ] Data preprocessing code
66
+ ## 🛠️ Method Overview
67
+
68
+ We propose VideoPainter, a novel dual-stream paradigm that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing its practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
69
+ ![](assets/method.jpg)
70
+
71
+
72
+
73
+ ## 🚀 Getting Started
74
+
75
+ ### Environment Requirement 🌍
76
+
77
+
78
+ Clone the repo:
79
+
80
+ ```
81
+ git clone https://github.com/TencentARC/VideoPainter.git
82
+ ```
83
+
84
+ We recommend first using `conda` to create a virtual environment and install the required libraries. For example:
85
+
86
+
87
+ ```
88
+ conda create -n videopainter python=3.10 -y
89
+ conda activate videopainter
90
+ pip install -r requirements.txt
91
+ ```
92
+
93
+ Then, you can install diffusers (the version included in this repo) with:
94
+
95
+ ```
96
+ cd ./diffusers
97
+ pip install -e .
98
+ ```
99
+
100
+ After that, you can install the required ffmpeg through:
101
+
102
+ ```
103
+ conda install -c conda-forge ffmpeg -y
104
+ ```
105
+
106
+ Optionally, you can install sam2 for the gradio demo through:
107
+
108
+ ```
109
+ cd ./app
110
+ pip install -e .
111
+ ```
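
After the steps above, a quick sanity check of the environment can be sketched in Python. This is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and it relies only on the standard library to look for a Python 3.10 interpreter, an `ffmpeg` binary on `PATH`, and an importable `diffusers` package.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report whether the pieces installed above are visible from this interpreter."""
    return {
        # The conda environment above is created with python=3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # ffmpeg is installed from conda-forge and should be on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # diffusers is installed in editable mode from ./diffusers.
        "diffusers_importable": importlib.util.find_spec("diffusers") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it inside the activated `videopainter` environment; any `MISSING` entry points to the install step to repeat.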
112
+
113
+ ### Data Download ⬇️
114
+
115
+
116
+ **VPBench and VPData**
117
+
118
+ You can download VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench) (as well as the DAVIS data we re-processed), which are used for training and testing VideoPainter. By downloading the data, you are agreeing to the terms and conditions of the license. The data should be structured as follows:
119
+
120
+ ```
121
+ |-- data
122
+ |-- davis
123
+ |-- JPEGImages_432_240
124
+ |-- test_masks
125
+ |-- davis_caption
126
+ |-- test.json
127
+ |-- train.json
128
+ |-- videovo/raw_video
129
+ |-- 000005000
130
+ |-- 000005000000.0.mp4
131
+ |-- 000005000001.0.mp4
132
+ |-- ...
133
+ |-- 000005001
134
+ |-- ...
135
+ |-- pexels/pexels/raw_video
136
+ |-- 000000000
137
+ |-- 000000000000_852038.mp4
138
+ |-- 000000000001_852057.mp4
139
+ |-- ...
140
+ |-- 000000001
141
+ |-- ...
142
+ |-- video_inpainting
143
+ |-- videovo
144
+ |-- 000005000000/all_masks.npz
145
+ |-- 000005000001/all_masks.npz
146
+ |-- ...
147
+ |-- pexels
148
+ |-- ...
149
+ |-- pexels_videovo_train_dataset.csv
150
+ |-- pexels_videovo_val_dataset.csv
151
+ |-- pexels_videovo_test_dataset.csv
152
+ |-- our_video_inpaint.csv
153
+ |-- our_video_inpaint_long.csv
154
+ |-- our_video_edit.csv
155
+ |-- our_video_edit_long.csv
156
+ |-- pexels.csv
157
+ |-- videovo.csv
158
+
159
+ ```
160
+
161
+ You can download VPBench and place the benchmark in the `data` folder by:
162
+ ```
163
+ git lfs install
164
+ git clone https://huggingface.co/datasets/TencentARC/VPBench
165
+ mv VPBench data
166
+ cd data
167
+ unzip pexels.zip
168
+ unzip videovo.zip
169
+ unzip davis.zip
170
+ unzip video_inpainting.zip
171
+ ```
172
+
173
+ Note: *Due to space limits, you need to run the following script to download the complete VPData. The format is consistent with VPBench above: after you download VPBench, the script automatically places the video and mask sequences from VPData into the corresponding dataset directories created by VPBench.*
174
+
175
+ ```
176
+ cd data_utils
177
+ python VPData_download.py
178
+ ```
179
+
180
+
181
+ **Checkpoints**
182
+
183
+ Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains:
184
+
185
+ - VideoPainter pretrained checkpoints for CogVideoX-5b-I2V
186
+ - VideoPainter IP Adapter pretrained checkpoints for CogVideoX-5b-I2V
187
+ - pretrained CogVideoX-5b-I2V checkpoint from [HuggingFace](https://huggingface.co/THUDM/CogVideoX-5b-I2V).
188
+
189
+ You can download the checkpoints and place them in the `ckpt` folder by:
190
+ ```
191
+ git lfs install
192
+ git clone https://huggingface.co/TencentARC/VideoPainter
193
+ mv VideoPainter ckpt
194
+ ```
195
+
196
+ You also need to download the base model [CogVideoX-5B-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) by:
197
+ ```
198
+ git lfs install
199
+ cd ckpt
200
+ git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V
201
+ ```
202
+
203
+
204
+ The `ckpt` folder should be structured as follows:
205
+
206
+ ```
207
+ |-- ckpt
208
+ |-- VideoPainter/checkpoints
209
+ |-- branch
210
+ |-- config.json
211
+ |-- diffusion_pytorch_model.safetensors
212
+ |-- VideoPainterID/checkpoints
213
+ |-- pytorch_lora_weights.safetensors
214
+ |-- CogVideoX-5b-I2V
215
+ |-- scheduler
216
+ |-- transformer
217
+ |-- vae
218
+ |-- ...
219
+ ```
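
Before launching training or the demo, it can help to confirm that the checkpoint folders referenced by the commands in this README are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the directory list is taken from the tree above and from the `--model_path`, `--inpainting_branch`, and `--id_adapter` arguments used later.

```python
from pathlib import Path

# Directories referenced by the tree above and by the commands in this README.
EXPECTED_DIRS = [
    "CogVideoX-5b-I2V",
    "VideoPainter/checkpoints/branch",
    "VideoPainterID/checkpoints",
]


def missing_checkpoints(ckpt_root: str) -> list:
    """Return the expected sub-directories that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "ckpt" assumes the script is run from the repository root.
    missing = missing_checkpoints("ckpt")
    if missing:
        print("Missing checkpoint folders:", ", ".join(missing))
    else:
        print("All expected checkpoint folders are present.")
```

The check only looks at directory names; it does not verify that the weight files inside them were fully downloaded.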
220
+
221
+
222
+ ## 🏃🏼 Running Scripts
223
+
224
+
225
+ ### Training 🤯
226
+
227
+ You can train the VideoPainter using the script:
228
+
229
+ ```
230
+ # cd train
231
+ # bash VideoPainter.sh
232
+
233
+ export MODEL_PATH="../ckpt/CogVideoX-5b-I2V"
234
+ export CACHE_PATH="$HOME/.cache"
235
+ export DATASET_PATH="../data/videovo/raw_video"
236
+ export PROJECT_NAME="pexels_videovo-inpainting"
237
+ export RUNS_NAME="VideoPainter"
238
+ export OUTPUT_PATH="./${PROJECT_NAME}/${RUNS_NAME}"
239
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
240
+ export TOKENIZERS_PARALLELISM=false
241
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
242
+
243
+ accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine_rank 0 \
244
+ train_cogvideox_inpainting_i2v_video.py \
245
+ --pretrained_model_name_or_path $MODEL_PATH \
246
+ --cache_dir $CACHE_PATH \
247
+ --meta_file_path ../data/pexels_videovo_train_dataset.csv \
248
+ --val_meta_file_path ../data/pexels_videovo_val_dataset.csv \
249
+ --instance_data_root $DATASET_PATH \
250
+ --dataloader_num_workers 1 \
251
+ --num_validation_videos 1 \
252
+ --validation_epochs 1 \
253
+ --seed 42 \
254
+ --mixed_precision bf16 \
255
+ --output_dir $OUTPUT_PATH \
256
+ --height 480 \
257
+ --width 720 \
258
+ --fps 8 \
259
+ --max_num_frames 49 \
260
+ --video_reshape_mode "resize" \
261
+ --skip_frames_start 0 \
262
+ --skip_frames_end 0 \
263
+ --max_text_seq_length 226 \
264
+ --branch_layer_num 2 \
265
+ --train_batch_size 1 \
266
+ --num_train_epochs 10 \
267
+ --checkpointing_steps 1024 \
268
+ --validating_steps 256 \
269
+ --gradient_accumulation_steps 1 \
270
+ --learning_rate 1e-5 \
271
+ --lr_scheduler cosine_with_restarts \
272
+ --lr_warmup_steps 1000 \
273
+ --lr_num_cycles 1 \
274
+ --enable_slicing \
275
+ --enable_tiling \
276
+ --noised_image_dropout 0.05 \
277
+ --gradient_checkpointing \
278
+ --optimizer AdamW \
279
+ --adam_beta1 0.9 \
280
+ --adam_beta2 0.95 \
281
+ --max_grad_norm 1.0 \
282
+ --allow_tf32 \
283
+ --report_to wandb \
284
+ --tracker_name $PROJECT_NAME \
285
+ --runs_name $RUNS_NAME \
286
+ --inpainting_loss_weight 1.0 \
287
+ --mix_train_ratio 0 \
288
+ --first_frame_gt \
289
+ --mask_add \
290
+ --mask_transform_prob 0.3 \
291
+ --p_brush 0.4 \
292
+ --p_rect 0.1 \
293
+ --p_ellipse 0.1 \
294
+ --p_circle 0.1 \
295
+ --p_random_brush 0.3
296
+
297
+ # cd train
298
+ # bash VideoPainterID.sh
299
+ export MODEL_PATH="../ckpt/CogVideoX-5b-I2V"
300
+ export BRANCH_MODEL_PATH="../ckpt/VideoPainter/checkpoints/branch"
301
+ export CACHE_PATH="$HOME/.cache"
302
+ export DATASET_PATH="../data/videovo/raw_video"
303
+ export PROJECT_NAME="pexels_videovo-inpainting"
304
+ export RUNS_NAME="VideoPainterID"
305
+ export OUTPUT_PATH="./${PROJECT_NAME}/${RUNS_NAME}"
306
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
307
+ export TOKENIZERS_PARALLELISM=false
308
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
309
+
310
+ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine_rank 0 \
311
+ train_cogvideox_inpainting_i2v_video_resample.py \
312
+ --pretrained_model_name_or_path $MODEL_PATH \
313
+ --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \
314
+ --cache_dir $CACHE_PATH \
315
+ --meta_file_path ../data/pexels_videovo_train_dataset.csv \
316
+ --val_meta_file_path ../data/pexels_videovo_val_dataset.csv \
317
+ --instance_data_root $DATASET_PATH \
318
+ --dataloader_num_workers 1 \
319
+ --num_validation_videos 1 \
320
+ --validation_epochs 1 \
321
+ --seed 42 \
322
+ --rank 256 \
323
+ --lora_alpha 128 \
324
+ --mixed_precision bf16 \
325
+ --output_dir $OUTPUT_PATH \
326
+ --height 480 \
327
+ --width 720 \
328
+ --fps 8 \
329
+ --max_num_frames 49 \
330
+ --video_reshape_mode "resize" \
331
+ --skip_frames_start 0 \
332
+ --skip_frames_end 0 \
333
+ --max_text_seq_length 226 \
334
+ --branch_layer_num 2 \
335
+ --train_batch_size 1 \
336
+ --num_train_epochs 10 \
337
+ --checkpointing_steps 256 \
338
+ --validating_steps 128 \
339
+ --gradient_accumulation_steps 1 \
340
+ --learning_rate 5e-5 \
341
+ --lr_scheduler cosine_with_restarts \
342
+ --lr_warmup_steps 200 \
343
+ --lr_num_cycles 1 \
344
+ --enable_slicing \
345
+ --enable_tiling \
346
+ --noised_image_dropout 0.05 \
347
+ --gradient_checkpointing \
348
+ --optimizer AdamW \
349
+ --adam_beta1 0.9 \
350
+ --adam_beta2 0.95 \
351
+ --max_grad_norm 1.0 \
352
+ --allow_tf32 \
353
+ --report_to wandb \
354
+ --tracker_name $PROJECT_NAME \
355
+ --runs_name $RUNS_NAME \
356
+ --inpainting_loss_weight 1.0 \
357
+ --mix_train_ratio 0 \
358
+ --first_frame_gt \
359
+ --mask_add \
360
+ --mask_transform_prob 0.3 \
361
+ --p_brush 0.4 \
362
+ --p_rect 0.1 \
363
+ --p_ellipse 0.1 \
364
+ --p_circle 0.1 \
365
+ --p_random_brush 0.3 \
366
+ --id_pool_resample_learnable
367
+ ```
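
A few quantities implied by the flags above can be worked out directly. The sketch below is illustrative only and makes two assumptions that the training scripts, not this README, are the authority on: that the accelerate config launches one data-parallel process per GPU listed in `CUDA_VISIBLE_DEVICES`, and that the five `--p_*` values form a probability distribution over mask shapes (they sum to 1.0 in both commands).

```python
import math

# Values copied from the training commands above.
cuda_visible_devices = "0,1,2,3,4,5,6,7"
train_batch_size = 1
gradient_accumulation_steps = 1
max_num_frames = 49
fps = 8
mask_probs = {
    "brush": 0.4,
    "rect": 0.1,
    "ellipse": 0.1,
    "circle": 0.1,
    "random_brush": 0.3,
}

# Assumption: one data-parallel process per visible GPU.
num_gpus = len(cuda_visible_devices.split(","))
effective_batch_size = num_gpus * train_batch_size * gradient_accumulation_steps

# Length in seconds of one training clip at the configured frame rate.
clip_seconds = max_num_frames / fps

# Assumption: the p_* flags are a categorical distribution over mask shapes.
probs_sum = sum(mask_probs.values())

print(f"samples per optimizer step: {effective_batch_size}")
print(f"clip length: {clip_seconds} s")
print(f"mask probabilities sum to 1: {math.isclose(probs_sum, 1.0)}")
```

If you change the number of GPUs or the batch size, recompute the samples per optimizer step before reusing the learning rates above.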
368
+
369
+
370
+
371
+
372
+ ### Inference 📜
373
+
374
+ You can run inference for video inpainting or editing with the script:
375
+
376
+ ```
377
+ cd infer
378
+ # video inpainting
379
+ bash inpaint.sh
380
+ # video inpainting with ID resampling
381
+ bash inpaint_id_resample.sh
382
+ # video editing
383
+ bash edit.sh
384
+ ```
385
+
386
+ VideoPainter can also function as a video editing pair data generator. You can run inference with the script:
387
+ ```
388
+ bash edit_bench.sh
389
+ ```
390
+
391
+ Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
392
+
393
+
394
+ You can also run inference through the gradio demo:
395
+
396
+ ```
397
+ # cd app
398
+ CUDA_VISIBLE_DEVICES=0 python app.py \
399
+ --model_path ../ckpt/CogVideoX-5b-I2V \
400
+ --inpainting_branch ../ckpt/VideoPainter/checkpoints/branch \
401
+ --id_adapter ../ckpt/VideoPainterID/checkpoints \
402
+ --img_inpainting_model ../ckpt/flux_inp
403
+ ```
404
+
405
+
406
+ ### Evaluation 📏
407
+
408
+ You can evaluate using the script:
409
+
410
+ ```
411
+ cd evaluate
412
+ # video inpainting
413
+ bash eval_inpainting.sh
414
+ # video inpainting with ID resampling
415
+ bash eval_inpainting_id_resample.sh
416
+ # video editing
417
+ bash eval_edit.sh
418
+ # video editing with ID resampling
419
+ bash eval_editing_id_resample.sh
420
+ ```
421
+
422
+
423
+ ## 🤝🏼 Cite Us
424
+
425
+ ```
426
+ @article{bian2025videopainter,
427
+ title={VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control},
428
+ author={Bian, Yuxuan and Zhang, Zhaoyang and Ju, Xuan and Cao, Mingdeng and Xie, Liangbin and Shan, Ying and Xu, Qiang},
429
+ journal={arXiv preprint arXiv:xxx},
430
+ year={2025}
431
+ }
432
+ ```
433
+
434
+
435
+ ## 💖 Acknowledgement
436
+ <span id="acknowledgement"></span>
437
+
438
+ Our code is built on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo). Thanks to all the contributors!