---
license: other
pipeline_tag: image-to-3d
library_name: stream3r
---

# STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

**STream3R** presents a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. It introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail.

<div align="center">
<p>
<span style="font-variant: small-caps;"><strong>STream3R</strong></span> reformulates dense 3D reconstruction into a sequential registration task with causal attention.
<br>
<i>⭐ Now supports <b>FlashAttention</b>, <b>KV Cache</b>, <b>Causal Attention</b>, <b>Sliding Window Attention</b>, and <b>Full Attention</b>!</i>
</p>
<img width="820" alt="pipeline" src="https://github.com/NIRVANALAN/STream3R/raw/main/assets/teaser_dynamic.gif">
:open_book: See more visual results on our <a href="https://nirvanalan.github.io/projects/stream3r" target="_blank">project page</a>
</div>

**Paper:** [STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer](https://huggingface.co/papers/2508.10893)

**Project Page:** [https://nirvanalan.github.io/projects/stream3r](https://nirvanalan.github.io/projects/stream3r)

**Code:** [https://github.com/NIRVANALAN/STream3R](https://github.com/NIRVANALAN/STream3R)

## Abstract

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.

## Installation

1. Clone the repo
```bash
git clone https://github.com/NIRVANALAN/STream3R
cd STream3R
```

2. Create a Conda environment
```bash
conda create -n stream3r python=3.11 cmake=3.14.0 -y
conda activate stream3r
```

3. Install Python dependencies

**Important:** Install [Torch](https://pytorch.org/get-started/locally/) based on your CUDA version. For example, for *Torch 2.8.0 + CUDA 12.6*:

```bash
# Install Torch
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# Install other dependencies
pip install -r requirements.txt

# Install STream3R as a package
pip install -e .
```
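
Before moving on, it can help to confirm that the installed Torch build actually sees your GPU. A quick sanity check (not part of the official setup steps):

```python
import torch

# On the CUDA 12.6 wheel above, the version string should end in "+cu126"
print(torch.__version__)
# Should print True on a machine with a working CUDA setup
print(torch.cuda.is_available())
```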

## Inference

You can now try STream3R with the following code. The checkpoint will be downloaded automatically from [Hugging Face](https://huggingface.co/yslan/STream3R).

You can set the inference mode to `causal` for causal attention, `window` for sliding window attention (with a default window size of 5), or `full` for bidirectional attention.

```python
import os
import torch

from stream3r.models.stream3r import STream3R
from stream3r.models.components.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

model = STream3R.from_pretrained("yslan/STream3R").to(device)

example_dir = "examples/static_room"
image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    # Use one of "causal", "window", or "full" in a single forward pass
    predictions = model(images, mode="causal")
```
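
For reference, the other two documented modes are invoked the same way. A minimal sketch, reusing `model` and `images` from the snippet above (the variable names `predictions_window` and `predictions_full` are just for illustration):

```python
with torch.no_grad():
    # Sliding window attention; this card documents a default window size of 5
    predictions_window = model(images, mode="window")

    # Full bidirectional attention over the whole sequence
    predictions_full = model(images, mode="full")
```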

We also support a KV cache version to enable streaming input using `StreamSession`. The `StreamSession` takes sequential inputs and processes them one by one, making it suitable for real-time or low-latency applications. This streaming 3D reconstruction pipeline can be applied in scenarios such as real-time robotics, autonomous navigation, online 3D understanding, and SLAM. An example usage is shown below:

```python
import os
import torch

from stream3r.models.stream3r import STream3R
from stream3r.stream_session import StreamSession
from stream3r.models.components.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

model = STream3R.from_pretrained("yslan/STream3R").to(device)

example_dir = "examples/static_room"
image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
images = load_and_preprocess_images(image_names).to(device)

# StreamSession supports KV cache management for both "causal" and "window" modes.
session = StreamSession(model, mode="causal")

with torch.no_grad():
    # Process images one by one to simulate streaming inference
    for i in range(images.shape[0]):
        image = images[i : i + 1]
        predictions = session.forward_stream(image)

session.clear()
```
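
In a real streaming application you would typically keep each frame's result as it arrives. The sketch below reuses `model` and `images` from above and simply collects the per-frame outputs in a list; treating each `predictions` object as an opaque per-frame result is an assumption here, not a documented contract:

```python
# Collect per-frame outputs as the stream runs; "window" mode keeps the
# KV cache (and thus memory) bounded, as the tables below show.
per_frame_results = []
session = StreamSession(model, mode="window")

with torch.no_grad():
    for i in range(images.shape[0]):
        # The structure of each result is an assumption; check the repo
        # for the actual output format.
        per_frame_results.append(session.forward_stream(images[i : i + 1]))

session.clear()  # release the KV cache once the stream ends
```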

## Demo

You can run the demo built on [VGG-T's code](https://github.com/facebookresearch/vggt) using the script [`app.py`](https://github.com/NIRVANALAN/STream3R/blob/main/app.py) with the following command:

```bash
python app.py
```

## Quantitative Results

*3D Reconstruction Comparison on NRGBD.*

| Method | Type | Acc Mean ↓ | Acc Med. ↓ | Comp Mean ↓ | Comp Med. ↓ | NC Mean ↑ | NC Med. ↑ |
|---------------------|----------|------------|------------|-------------|-------------|-----------|-----------|
| VGG-T | FA | 0.073 | 0.018 | 0.077 | 0.021 | 0.910 | 0.990 |
| DUSt3R | Optim | 0.144 | 0.019 | 0.154 | 0.018 | 0.870 | 0.982 |
| MASt3R | Optim | 0.085 | 0.033 | 0.063 | 0.028 | 0.794 | 0.928 |
| MonST3R | Optim | 0.272 | 0.114 | 0.287 | 0.110 | 0.758 | 0.843 |
| Spann3R | Stream | 0.416 | 0.323 | 0.417 | 0.285 | 0.684 | 0.789 |
| CUT3R | Stream | 0.099 | 0.031 | 0.076 | 0.026 | 0.837 | 0.971 |
| StreamVGGT | Stream | 0.084 | 0.044 | 0.074 | 0.041 | 0.861 | 0.986 |
| Ours | Stream | **0.057** | **0.014** | **0.028** | **0.013** | **0.910** | **0.993** |

Read our [full paper](https://huggingface.co/papers/2508.10893) for more insights.

## GPU Memory Usage and Runtime

We report the peak GPU memory usage (VRAM) and runtime of our full model for processing each streaming input using the `StreamSession` implementation. All experiments were conducted at a common resolution of 518 × 384 on a single H200 GPU. The benchmark includes both *Causal* for causal attention and *Window* for sliding window attention with a window size of 5.

*Run Time (s).*

| Num of Frames | 1 | 20 | 40 | 80 | 100 | 120 | 140 | 180 | 200 |
|-----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Causal | 0.1164 | 0.2034 | 0.3060 | 0.4986 | 0.5945 | 0.6947 | 0.7916 | 0.9911 | 1.1703 |
| Window | 0.1167 | 0.1528 | 0.1523 | 0.1517 | 0.1515 | 0.1512 | 0.1482 | 0.1443 | 0.1463 |

*VRAM (GB).*

| Num of Frames | 1 | 20 | 40 | 80 | 100 | 120 | 140 | 180 | 200 |
|-----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Causal | 5.49 | 9.02 | 12.92 | 21.00 | 25.03 | 29.10 | 33.21 | 41.31 | 45.41 |
| Window | 5.49 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 |
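
The two tables make the trade-off concrete: causal attention caches keys and values for every past frame, so runtime and VRAM grow roughly linearly with the frame count, while the sliding window evicts the oldest frames and stays flat. A conceptual sketch of that eviction behavior (illustrative only, not STream3R's actual cache code):

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy sliding-window KV cache: memory is bounded by `window` frames."""

    def __init__(self, window: int = 5):
        # deque(maxlen=...) silently drops the oldest entry on overflow,
        # which is why the Window rows above plateau after ~window frames
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def size(self) -> int:
        # Never exceeds `window`, no matter how many frames were streamed
        return len(self.keys)
```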

## Datasets

We follow [CUT3R](https://github.com/CUT3R/CUT3R/blob/main/docs/preprocess.md) to preprocess the datasets for training. The training configuration can be found at `configs/experiment/stream3r/stream3r.yaml`.
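
If you want to inspect that configuration programmatically, one option is OmegaConf; this is an assumption about tooling (the `configs/experiment/` layout suggests a Hydra/OmegaConf-style setup), so check the training code for the supported entry point:

```python
from omegaconf import OmegaConf

# Load and pretty-print the experiment config referenced above
cfg = OmegaConf.load("configs/experiment/stream3r/stream3r.yaml")
print(OmegaConf.to_yaml(cfg))
```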

## TODO

- [ ] Release evaluation code.
- [ ] Release training code.
- [ ] Release the metric-scale version.

## License

This project is licensed under the [NTU S-Lab License 1.0](https://github.com/NIRVANALAN/STream3R/blob/main/LICENSE). Redistribution and use must comply with this license.

## Citation

If you find our code or paper helpful, please consider citing:

```bibtex
@article{stream3r2025,
    title={STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer},
    author={Lan, Yushi and Luo, Yihang and Hong, Fangzhou and Zhou, Shangchen and Chen, Honghua and Lyu, Zhaoyang and Yang, Shuai and Dai, Bo and Loy, Chen Change and Pan, Xingang},
    journal={arXiv preprint arXiv:2508.10893},
    year={2025}
}
```

## Acknowledgments

We recognize several concurrent works on streaming methods and encourage you to check them out:

[StreamVGGT](https://github.com/wzzheng/StreamVGGT) | [CUT3R](https://github.com/CUT3R/CUT3R) | [SLAM3R](https://github.com/PKU-VCL-3DV/SLAM3R) | [Spann3R](https://github.com/HengyiWang/spann3r)

STream3R is built on the shoulders of several outstanding open-source projects. Many thanks to:

[VGG-T](https://github.com/facebookresearch/vggt) | [Fast3R](https://github.com/facebookresearch/fast3r) | [DUSt3R](https://github.com/naver/dust3r) | [MonST3R](https://github.com/Junyi42/monst3r) | [Viser](https://github.com/nerfstudio-project/viser)

## Contact

If you have any questions, please feel free to contact us via `[email protected]` or GitHub issues.