bfshi-nvidia committed on
Commit b0b86b4 · verified · 1 Parent(s): 67a9b0d

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/example_selection_maps/bottom_up_selection_prob.png filter=lfs diff=lfs merge=lfs -text
+assets/example_selection_maps/top_down_selection_prob_1.png filter=lfs diff=lfs merge=lfs -text
+assets/example_selection_maps/top_down_selection_prob_2.png filter=lfs diff=lfs merge=lfs -text
+assets/test_images/dock.jpg filter=lfs diff=lfs merge=lfs -text
LICENSE.md ADDED
@@ -0,0 +1,36 @@
NVIDIA License

1. Definitions

“Licensor” means any person or entity that distributes its Work.
“Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.
The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.

2. License Grant

2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.

3. Limitations

3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.

3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.

3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for non-commercial research activities or non-commercial research publications only.

3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.

3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.

3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.

4. Disclaimer of Warranty.

THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.

5. Limitation of Liability.

EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
README.md ADDED
@@ -0,0 +1,380 @@
---
language:
- en
base_model:
- google/siglip2-so400m-patch14-384
pipeline_tag: image-feature-extraction
---

## Description: <br>

PS3-1.5K-SigLIP2 is a vision encoder that extracts visual features from images of up to 1.5K resolution.

This model is for research and development only.

### License/Terms of Use: <br>

[NVIDIA license](https://huggingface.co/nvidia/PS3-1.5K-SigLIP2/blob/main/LICENSE.md). Additional Information: [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/) for [siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384). The reference to the NVIDIA License means the attached custom NSCLv1 license, under which users may use it for purposes of conducting non-commercial research activities and non-commercial research publications.

### Deployment Geography:

Global

### Use Case: <br>

The model is used for extracting visual features from high-resolution images.

### Release Date: <br>

Hugging Face [07/26/2025] via [https://huggingface.co/nvidia/PS3-1.5K-SigLIP2](https://huggingface.co/nvidia/PS3-1.5K-SigLIP2) <br>

## Reference(s):

The model is from the paper [Scaling Vision Pre-Training to 4K Resolution](https://arxiv.org/abs/2503.19903). Useful links:

[![website](https://img.shields.io/badge/website-76b900?style=for-the-badge&logo=safari&labelColor=555555)](https://nvlabs.github.io/PS3/)
[![Arxiv](https://img.shields.io/badge/Arxiv-b31b1b?style=for-the-badge&logo=arxiv&labelColor=555555)](https://arxiv.org/abs/2503.19903)
[![VILA-HD Demo](https://img.shields.io/badge/-VILA--HD_Demo-brightgreen?style=for-the-badge&logo=huggingface&labelColor=555555&color=ff6e00)](https://huggingface.co/spaces/bfshi/VILA-HD-demo)
[![PS3 Models](https://img.shields.io/badge/PS3%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
[![VILA-HD Models](https://img.shields.io/badge/VILA--HD%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
[![PS3 Code](https://img.shields.io/badge/PS3%20Code%20-181717?style=for-the-badge&logo=github&labelColor=555555)](https://github.com/NVlabs/PS3)

## Model Architecture:
**Architecture Type:** Neural Network

**Network Architecture:** Vision Transformer designed for high-resolution images

This model was developed based on [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384). Please see the training designs in the paper.

## Input: <br>
**Input Type(s):** Image <br>
**Input Format:** Red, Green, Blue (RGB) <br>
**Input Parameters:** Two-Dimensional (2D) <br>
**Other Properties Related to Input:** Image resolutions up to 1512 * 1512. <br>

## Output: <br>
**Output Type(s):** Embeddings <br>
**Output Format:** Tensor <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** Downstream model required to leverage image features <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):**
Not Applicable (N/A) <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Jetson <br>
NVIDIA Hopper <br>

**Preferred/Supported Operating System(s):**
Linux <br>
Linux 4 Tegra <br>
QNX <br>
Windows <br>

## Model Version(s):

v1.0 - Initial release

## Pre-Trained Models

### PS3 models

| Vision Model | Max Resolution | Pre-Trained Weights |
|-----------------|----------------|-------------------------------------------------------------------------|
| PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/PS3-1.5K-SigLIP](https://huggingface.co/nvidia/PS3-1.5K-SigLIP) |
| PS3-4K-SigLIP | 3780 * 3780 | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) |
| PS3-1.5K-C-RADIOv2 | 1536 * 1536 | [nvidia/PS3-1.5K-C-RADIOv2](https://huggingface.co/nvidia/PS3-1.5K-C-RADIOv2) |
| PS3-4K-C-RADIOv2 | 3840 * 3840 | [nvidia/PS3-4K-C-RADIOv2](https://huggingface.co/nvidia/PS3-4K-C-RADIOv2) |
| PS3-1.5K-SigLIP2 | 1512 * 1512 | [nvidia/PS3-1.5K-SigLIP2](https://huggingface.co/nvidia/PS3-1.5K-SigLIP2) |
| PS3-4K-SigLIP2 | 3780 * 3780 | [nvidia/PS3-4K-SigLIP2](https://huggingface.co/nvidia/PS3-4K-SigLIP2) |
| PS3_Lang-1.5K-SigLIP2 | 1512 * 1512 | [nvidia/PS3_Lang-1.5K-SigLIP2](https://huggingface.co/nvidia/PS3_Lang-1.5K-SigLIP2) |
| PS3_Lang-4K-SigLIP2 | 3780 * 3780 | [nvidia/PS3_Lang-4K-SigLIP2](https://huggingface.co/nvidia/PS3_Lang-4K-SigLIP2) |

## Training Datasets: <br>

75M images <br>

One dataset built from:
- SA-1B (https://ai.meta.com/datasets/segment-anything/)
- IDL (https://huggingface.co/datasets/pixparse/idl-wds)

Training: 100% <br>

## Training Dataset:

**Link:**
We used the following datasets while developing PS3:
- SA-1B (https://ai.meta.com/datasets/segment-anything/)
- IDL (https://huggingface.co/datasets/pixparse/idl-wds)

**Data Collection Method by dataset:** <br>
Automated

**Labeling Method by dataset:** <br>
Automated

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
75M images with resolutions up to 4K x 4K.

## Testing & Evaluation Datasets:
* None <br>

## Performance

### Performance of PS3 models

See Table 1 in the paper for full results.

## Inference:
**Acceleration Engine:** Not Applicable (N/A) <br>
**Test Hardware:** <br>
The model is tested on an NVIDIA A100 GPU.

## Installation

Install through pip to use PS3 out of the box.
```bash
pip install ps3-torch
```

If you would like to make changes to the PS3 code, go to the [PS3 repository](https://github.com/NVlabs/PS3), clone the repo, and install it in editable mode.
```bash
cd PS3
pip install -e .
```
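
To verify the installation, a quick import check (the class names below are the same ones used in the Quick Start examples that follow):
```python
# Minimal smoke test: these imports should succeed after `pip install ps3-torch`.
from ps3 import PS3VisionModel, PS3ImageProcessor, PS3Tokenizer, PS3TextModel

print("PS3 package imported successfully")
```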

## Inference - Quick Start

Here we show example usage, including:
- loading the model
- selectively encoding a high-res image based on image saliency (bottom-up selection) and visualizing the selection probabilities
- selectively encoding a high-res image based on text prompts (top-down selection) and visualizing the selection probabilities
- formatting the encoded features into (masked) feature maps

### 1. Load Model and Image
```python
from PIL import Image
from ps3 import PS3VisionModel, PS3ImageProcessor

# Load the PS3 model and processor.
vision_model = PS3VisionModel.from_pretrained("nvidia/PS3-4K-SigLIP2")
processor = PS3ImageProcessor.from_pretrained("nvidia/PS3-4K-SigLIP2")
vision_model.cuda().eval()

# You can replace it with your own image.
image = Image.open("assets/test_images/dock.jpg")

# Preprocess the image.
x = processor(image)["pixel_values"][0].unsqueeze(0).cuda()
```

### 2. Encode High-Res Image with Bottom-Up Selection

PS3 can select important high-res patches based on visual saliency and encode those patches.

**You can encode the whole high-res image using PS3.**
```python
outs = vision_model(x, num_look_close="all")
features = outs.last_hidden_state
print(features.shape)  # (1, 88209, 1152)
```
Note that the PS3-4K model processes the image at multiple scales: 378 (low-res), 756, 1512, and 3780, with a patch size of 14.

The number of tokens at each scale is therefore (378/14)^2 = 729, (756/14)^2 = 2916, (1512/14)^2 = 11664, and (3780/14)^2 = 72900.

The output hidden state concatenates all the tokens along the sequence dimension.
That gives 729 + 2916 + 11664 + 72900 = 88209 tokens in total.
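
As a quick sanity check of that arithmetic (plain Python, no PS3 dependency needed):
```python
# Token counts per scale for the PS3-4K model (patch size 14).
scales = [378, 756, 1512, 3780]
tokens_per_scale = [(s // 14) ** 2 for s in scales]
print(tokens_per_scale)       # [729, 2916, 11664, 72900]
print(sum(tokens_per_scale))  # 88209, matching the shape printed above
```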

**You can encode parts of the high-res image by setting `num_look_close`, i.e., how many times to run the high-res selection and encoding.**
```python
outs = vision_model(x, num_look_close=2)
features = outs.last_hidden_state
print(features.shape)  # (1, 5849, 1152)
```
In this example, it only runs the high-res selection and encoding twice.

Note that PS3 processes at most 2560 high-res patches at a time, so running high-res selection and encoding twice gives 2560 * 2 = 5120 high-res tokens. There are also 729 low-res tokens. That gives 729 + 5120 = 5849 tokens in total.

**You can also decide how many high-res tokens to process by setting `num_token_look_close`.**
```python
outs = vision_model(x, num_token_look_close=3000)
features = outs.last_hidden_state
print(features.shape)  # (1, 3729, 1152)
```
In this example, it only processes 3000 high-res tokens. Note that PS3 only processes 2560 high-res patches at a time. This means it needs to run the high-res selection and encoding twice, processing 2560 high-res tokens the first time and 440 tokens the second time. In the end it outputs 3729 tokens (3000 high-res + 729 low-res).

**Visualize the bottom-up patch selection probabilities.**
```python
############## Helper functions for visualization ##############

import os

# Install cv2, matplotlib, and scipy for visualization purposes.
os.system("pip install opencv-python matplotlib scipy")

import cv2
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import gaussian_filter
from torchvision import transforms

def create_heatmap_overlay(image, heatmap, alpha=0.4, colormap=plt.cm.jet, sigma=10.0):
    if len(image.shape) == 2:
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)

    # Smooth and normalize the heatmap to [0, 1], then colorize it.
    smoothed_heatmap = gaussian_filter(heatmap.astype(np.float32), sigma=sigma)
    smoothed_heatmap = (smoothed_heatmap - smoothed_heatmap.min()) / \
                       (smoothed_heatmap.max() - smoothed_heatmap.min())
    colored_heatmap = (colormap(smoothed_heatmap) * 255).astype(np.uint8)

    # Drop the alpha channel if the colormap returns RGBA.
    if colored_heatmap.shape[-1] == 4:
        colored_heatmap = colored_heatmap[:, :, :3]

    overlay = cv2.addWeighted(image, 1 - alpha, colored_heatmap, alpha, 0)
    return Image.fromarray(overlay)

def save_visualization(selection_probs, image, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    resize_transform = transforms.Resize(image.size[::-1])
    for i, prob in enumerate(selection_probs):
        prob = (prob - prob.min()) / (prob.max() - prob.min() + 1e-6)
        prob = resize_transform(prob)
        prob = prob.squeeze(0).detach().cpu().numpy()
        # Overlay the selection probability map on the original image.
        overlay = create_heatmap_overlay(np.array(image), prob)
        overlay.save(os.path.join(output_dir, f"selection_prob_scale_{i}.png"))
    image.save(os.path.join(output_dir, "image.png"))

#################### End of helper functions ####################

selection_probs = outs.selection_probs
print([p.shape for p in selection_probs])  # [(1, 54, 54), (1, 108, 108), (1, 270, 270)]
save_visualization(selection_probs, image, "save_path/bottom_up_selection_probs")
```
`selection_probs` contains the selection probability map for each scale. In this case, the selection maps of the three high-res scales have shapes of 54x54, 108x108, and 270x270. The selection probability reflects how salient/important each patch is, and patches with higher probability are selected first. You can visit the demo for more visualizations.

![Bottom-Up Selection Probabilities](assets/example_selection_maps/bottom_up_selection_prob.png)
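
If you prefer to inspect the most salient patches programmatically rather than visually, here is a small sketch (it reuses `selection_probs` from the code above; the coordinate bookkeeping is illustrative and not part of the PS3 API):
```python
import torch

# Take the finest-scale probability map (shape (1, 270, 270) for the PS3-4K model).
prob_map = selection_probs[-1][0]
h, w = prob_map.shape

# Indices of the 5 highest-probability patches, converted to (row, col) grid coordinates.
topk = torch.topk(prob_map.flatten(), k=5)
rows, cols = topk.indices // w, topk.indices % w
for r, c, p in zip(rows.tolist(), cols.tolist(), topk.values.tolist()):
    print(f"patch (row={r}, col={c}) selection probability {p:.3f}")
```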


### 3. Encode High-Res Image with Top-Down Selection

PS3 can also select important high-res patches based on any text prompt.

First of all, load the text model and encode the text prompt.
```python
from ps3 import PS3Tokenizer, PS3TextModel

tokenizer = PS3Tokenizer.from_pretrained("nvidia/PS3-4K-SigLIP2")
text_model = PS3TextModel.from_pretrained("nvidia/PS3-4K-SigLIP2")
text_model.cuda().eval()

text = ["A tall spire with a cross at the top of the building."]
text = tokenizer(text).cuda()
prompt = text_model(text).prompt
```

Then PS3 can select important high-res patches based on the text prompt and encode those patches.
```python
outs = vision_model(x, num_look_close=2, prompt=prompt)
features = outs.last_hidden_state
print(features.shape)  # (1, 5849, 1152)
```

You can visualize the top-down selection probabilities. Usually the regions related to the text prompt have higher selection probabilities.
```python
selection_probs = outs.selection_probs
save_visualization(selection_probs, image, "save_path/top_down_selection_probs_1")
```

![Top-Down Selection Probabilities](assets/example_selection_maps/top_down_selection_prob_1.png)

You can change to another text prompt and see different selection probabilities.
```python
text = ["A green rope on the green and red boat."]
text = tokenizer(text).cuda()
prompt = text_model(text).prompt
outs = vision_model(x, num_look_close=2, prompt=prompt)
selection_probs = outs.selection_probs
save_visualization(selection_probs, image, "save_path/top_down_selection_probs_2")
```

![Top-Down Selection Probabilities](assets/example_selection_maps/top_down_selection_prob_2.png)

### 4. Format the Encoded Features into (Masked) Feature Maps

The features returned above are the concatenation of all the low-res and high-res features.

You can format the features into masked feature maps for each scale.
```python
feature_maps = vision_model.vision_model.format_features_into_feature_maps(outs.last_hidden_state, outs.selection_maps)
print([x.shape for x in feature_maps])  # [(1, 1152, 27, 27), (1, 1152, 54, 54), (1, 1152, 108, 108), (1, 1152, 270, 270)]
```
This creates `feature_maps`, a list of masked feature maps (B * C * H * W), one for each scale. Each feature map contains the actual features for the patches selected at that scale and zero vectors for the unselected patches.
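
Since unselected positions hold zero vectors, you can recover a boolean selection mask per scale directly from these feature maps. A small illustrative sketch (plain tensor ops, not a PS3 API call):
```python
# For each scale, mark positions whose feature vector is non-zero (i.e., was actually encoded).
for i, fmap in enumerate(feature_maps):
    selected = fmap.abs().sum(dim=1) > 0  # (B, H, W) boolean mask
    print(f"scale {i}: {int(selected.sum())} of {selected.numel()} positions encoded")
```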


## Inference instructions

[Quick Start](#inference---quick-start) gives some examples of how to use PS3 to encode an image. Below are more detailed explanations of the arguments of model inference.

```python
class PS3VisionModel(PS3PreTrainedModel):
    ...
    def forward(
        self,
        pixel_values,
        num_look_close,
        num_token_look_close=None,
        prompt=None,
        gt_selection_maps=None,
        smooth_selection_prob=False,
        only_select_first_n_scale=None,
        is_global_text=None,
        pool_gt_token_only=False,
    ):
        ...
```
`pixel_values`: the input images with shape (B, C, H, W).

`num_look_close`: how many times to run high-res selection and encoding. PS3 selects and processes 2560 patches each time. If set to `all`, it selects all the high-res patches. If set to `0`, PS3 only returns the low-res features. If set to a larger number than is needed to encode all the high-res patches, PS3 clamps it to the maximum number needed.

`num_token_look_close`: (optional) how many high-res patches to select and process. Similar to `num_look_close`, but `num_token_look_close` directly specifies the number of high-res tokens instead of the number of high-res encoding passes.

`prompt`: (optional) the prompt embedding used to select high-res patches. The prompt embedding can be the embedding of some text, or an embedding output by an LLM (see the paper). The shape of the prompt embedding is (B, C), where B is the batch size (same as `pixel_values`) and C is the embedding dimension (same as the PS3 token embedding dimension). If `prompt=None`, PS3 selects high-res patches based on visual saliency (bottom-up selection).

`gt_selection_maps`: (optional) the ground-truth selection maps for the image. It should be a tensor of 0/1 values with shape (B, h, w). Regions with value 1 should be selected. When selecting high-res patches, PS3 will interpolate `gt_selection_maps` to the same size as the feature map at each scale, prioritize selecting the tokens where the value is 1, and, if there is still budget for selecting more tokens, select the rest based on the original selection probability.

`smooth_selection_prob`: (optional) smooth the selection probability map so that the selected patches are not distributed too sparsely each time high-res selection is run. It occasionally improves performance slightly when selecting all the patches, but usually hurts when selecting only part of the patches.

`only_select_first_n_scale`: (optional) only select the first n high-res scales. For example, for the PS3-4K model, if `only_select_first_n_scale=2`, it only selects and processes the 756 and 1512 scales, and ignores the 3780 scale.

`is_global_text`: (optional) only return the pooled low-res features. *It is only used during pre-training.*

`pool_gt_token_only`: (optional) only pool the tokens inside the ground-truth selection regions. *It is only used during pre-training.*
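
To make the argument combinations concrete, a few illustrative calls (a sketch reusing `x`, `prompt`, and `vision_model` from the Quick Start; the specific values are arbitrary):
```python
# Low-res features only: no high-res selection pass is run.
outs_lowres = vision_model(x, num_look_close=0)

# Exhaustively encode every high-res patch (most expensive).
outs_full = vision_model(x, num_look_close="all")

# Prompt-guided (top-down) selection, at most 2560 high-res tokens (one selection pass).
outs_prompted = vision_model(x, num_look_close=1, prompt=prompt)

# Bottom-up selection restricted to the first two high-res scales (756 and 1512 for PS3-4K).
outs_coarse = vision_model(x, num_look_close=2, only_select_first_n_scale=2)
```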


### Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).


## More Details
Please refer to the [PS3 codebase](https://github.com/NVlabs/PS3) for more details.


## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@article{shi2025scaling,
  title={Scaling Vision Pre-Training to 4K Resolution},
  author={Shi, Baifeng and Li, Boyi and Cai, Han and Lu, Yao and Liu, Sifei and Pavone, Marco and Kautz, Jan and Han, Song and Darrell, Trevor and Molchanov, Pavlo and others},
  journal={arXiv preprint arXiv:2503.19903},
  year={2025}
}
```
assets/example_selection_maps/bottom_up_selection_prob.png ADDED

Git LFS Details

  • SHA256: 497792f28e133233b02988881b2cd4600ebc9fab29fd74120697c8f527b2ed5a
  • Pointer size: 132 Bytes
  • Size of remote file: 1.21 MB
assets/example_selection_maps/top_down_selection_prob_1.png ADDED

Git LFS Details

  • SHA256: d8f193d0a1a8065070b86cefa41dcefd27389cf88693a7b272547b8a98a2001e
  • Pointer size: 132 Bytes
  • Size of remote file: 1.2 MB
assets/example_selection_maps/top_down_selection_prob_2.png ADDED

Git LFS Details

  • SHA256: 0e191dc01285341f6e353e147ed88f492a74cd282135a2e6bc3af6ee2741c54f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.18 MB
assets/test_images/dock.jpg ADDED

Git LFS Details

  • SHA256: 2c35ed6357e5eed620bbcfda31315dead1e9ad8b6a4f324131705f61f489d99d
  • Pointer size: 132 Bytes
  • Size of remote file: 1.71 MB
config.json ADDED
@@ -0,0 +1,67 @@
{
  "architectures": [
    "PS3Model"
  ],
  "model_type": "ps3",
  "vision_config": {
    "architectures": [
      "PS3VisionModel"
    ],
    "model_type": "ps3_vision_model",
    "model_name": "vit_so400m_patch14_siglip_378",
    "hidden_size": 1152,
    "pool": "map",
    "ps3_scales": [
      378,
      756,
      1512
    ],
    "select_based_on_layer": [
      0,
      9,
      18,
      26
    ],
    "min_select_num": 1,
    "max_select_num": 2560,
    "seperate_pos_emb": true,
    "highres_selection_feature": true,
    "radio": false,
    "radio_adapter_mlp_version": null,
    "radio_adapter_mlp_input_dim": null,
    "radio_adapter_mlp_hidden_dim": null,
    "radio_adapter_mlp_output_dim": null,
    "radio_adapter_mlp_num_inner": null,
    "img_size": null,
    "drop": 0.0,
    "class_token": null,
    "final_norm": false
  },
  "text_config": {
    "context_length": 64,
    "vocab_size": 256000,
    "hf_tokenizer_name": "timm/ViT-SO400M-14-SigLIP2-378",
    "tokenizer_kwargs": {
      "clean": "canonicalize"
    },
    "width": 1152,
    "heads": 16,
    "layers": 27,
    "mlp_ratio": 3.7362,
    "no_causal_mask": true,
    "proj_bias": true,
    "pool_type": "last",
    "norm_kwargs": {
      "eps": 1e-06
    },
    "act_kwargs": {
      "approximate": "tanh"
    },
    "architectures": [
      "PS3TextModel"
    ],
    "model_type": "ps3_text_model",
    "output_dim": 1152,
    "prompt_proj_dim": 1152
  }
}
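
The key PS3 settings (scales, per-pass selection budget, hidden size) can be read straight from this file. A small sketch, assuming the file has been downloaded locally as `config.json`:
```python
import json

with open("config.json") as f:
    cfg = json.load(f)

vision_cfg = cfg["vision_config"]
print(vision_cfg["ps3_scales"])      # [378, 756, 1512] -> the 1.5K variant's scales
print(vision_cfg["max_select_num"])  # 2560 high-res patches selected per pass
print(vision_cfg["hidden_size"])     # 1152-dimensional token embeddings
```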
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:21f9dffe2b006107e535d5d7b17d8220d39035fbb7b5d36616f921db65e5aabd
size 4785535552
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
{
  "image_size": [
    1512,
    1512
  ],
  "mean": [
    0.5,
    0.5,
    0.5
  ],
  "std": [
    0.5,
    0.5,
    0.5
  ],
  "interpolation": "bicubic",
  "resize_mode": "squash"
}
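
These settings correspond to resizing the image to 1512 x 1512 with bicubic resampling and normalizing with mean/std 0.5. `PS3ImageProcessor` handles this for you; the sketch below is only a rough torchvision equivalent for readers who want the transform spelled out, not the exact processor implementation (it assumes "squash" means resizing both sides to the target size without preserving aspect ratio):
```python
from torchvision import transforms

# Approximate equivalent of the preprocessor config above.
preprocess = transforms.Compose([
    transforms.Resize((1512, 1512), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),  # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```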
tokenizer_config.json ADDED
@@ -0,0 +1,5 @@
{
  "tokenizer_name": "timm/ViT-SO400M-14-SigLIP2-378",
  "context_length": 64,
  "clean": "canonicalize"
}
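
The `context_length` of 64 is the fixed token length produced for each prompt. A small sketch, assuming `PS3Tokenizer` pads or truncates to that length as SigLIP-style tokenizers typically do:
```python
from ps3 import PS3Tokenizer

tokenizer = PS3Tokenizer.from_pretrained("nvidia/PS3-1.5K-SigLIP2")
tokens = tokenizer(["A tall spire with a cross at the top of the building."])
print(tokens.shape)  # expected (1, 64): one prompt, padded/truncated to context_length
```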