---
license: mit
language:
- en
base_model:
- facebook/dinov2-base
- facebook/dinov2-small
tags:
- computer_vision
---

# Near, far: Patch-ordering enhances vision foundation models' scene understanding

Welcome to the Hugging Face repository for **NeCo**, an adapted vision encoder that captures the fine-grained details and structural information essential for key-point matching, semantic segmentation, and more. This repository hosts pretrained NeCo checkpoints, enabling easy integration into your projects.

The paper describing this work:
**"Near, far: Patch-ordering enhances vision foundation models' scene understanding"**
*[Valentinos Pariza](https://vpariza.github.io), [Mohammadreza Salehi](https://smsd75.github.io), [Gertjan J. Burghouts](https://gertjanburghouts.github.io), [Francesco Locatello](https://www.francescolocatello.com/), [Yuki M. Asano](https://yukimasano.github.io)*

🌐 **[Project Page](https://vpariza.github.io/NeCo/)**
⌨️ **[GitHub Repository](https://github.com/vpariza/NeCo)**
📄 **[Read the Paper on arXiv](https://arxiv.org/abs/2408.11054)**

## Model Details

### Model Description

NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for downstream dense tasks such as semantic segmentation.

- **Model type:** Vision encoder (DINO, DINOv2, ...)
- **Implementation language:** Python (PyTorch)
- **License:** MIT
- **Finetuned from:** DINOv2, DINOv2R, DINO, ...

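For intuition only, the snippet below sketches a *toy* patch-neighbor consistency objective: student and teacher patch features are compared against a shared set of reference patches, and the student is trained to reproduce the teacher's neighbor similarities. This is a rough illustration of the general idea, not NeCo's actual patch-ordering loss; see the paper and the GitHub repository for the real objective.

```python
import torch
import torch.nn.functional as F

def toy_patch_neighbor_consistency(student_patches, teacher_patches, reference_patches, tau=0.1):
    """Toy illustration (not the NeCo loss): align student/teacher patch-to-reference similarities.

    student_patches:   (N, D) patch features from the trainable encoder
    teacher_patches:   (N, D) patch features from a frozen/EMA encoder
    reference_patches: (M, D) features of reference patches (e.g. from other images)
    """
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1)
    r = F.normalize(reference_patches, dim=-1)

    # Cosine similarities of every patch to every reference patch.
    sim_s = s @ r.T / tau   # (N, M)
    sim_t = t @ r.T / tau   # (N, M)

    # The student should agree with the teacher on which reference patches
    # are "near" and which are "far" for each patch.
    return F.kl_div(sim_s.log_softmax(dim=-1), sim_t.softmax(dim=-1), reduction="batchmean")
```
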
## How to Get Started with the Model

To use NeCo models on downstream dense prediction tasks, you only need `torch` and `timm` installed. Depending on which checkpoint you use, load it as follows.

The checkpoints can be downloaded from our [NeCo Hugging Face repo](https://huggingface.co/FunAILab/NeCo/tree/main).

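If you prefer not to download the files manually, `huggingface_hub` can fetch a checkpoint for you. A minimal sketch (the filename is a placeholder; pick an actual checkpoint file from the repo listing):

```python
from huggingface_hub import hf_hub_download

# Download a NeCo checkpoint from the Hub; replace the filename with the
# checkpoint you chose from https://huggingface.co/FunAILab/NeCo/tree/main
path_to_checkpoint = hf_hub_download(
    repo_id="FunAILab/NeCo",
    filename="<checkpoint filename from the repo>",
)
```
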
#### Models post-trained from DINOv2 (DINOv2 architecture)

##### NeCo on DINOv2
```python
import torch

# Load the DINOv2 ViT-S/14 backbone; change to 'dinov2_vitb14' for the base model,
# as described in https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
# Overwrite the backbone weights with the NeCo checkpoint
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
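
Once loaded, the backbone behaves like a standard DINOv2 model; for dense downstream tasks it is usually the per-patch tokens you want. A minimal sketch that continues from the block above, assuming the usual DINOv2 hub interface and a dummy input:

```python
model.eval()
with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)            # dummy batch; 224 is divisible by the 14px patch size
    feats = model.forward_features(img)
    patch_tokens = feats["x_norm_patchtokens"]   # (1, 256, C): one token per 14x14 patch
    cls_token = feats["x_norm_clstoken"]         # (1, C): global image embedding
```
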
##### NeCo on DINOv2 with Registers
```python
import torch

# Load the DINOv2 ViT-S/14 backbone with registers; change to 'dinov2_vitb14_reg'
# for the base model, as described in https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
# Overwrite the backbone weights with the NeCo checkpoint
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
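
Because the checkpoints are loaded with `strict=False`, keys that do not match are silently skipped. It can be worth inspecting the value returned by `load_state_dict` to confirm that the backbone weights were actually loaded:

```python
# load_state_dict reports the keys that did not match; with a compatible
# checkpoint both lists should be empty (or nearly so) for the backbone.
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```
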
#### Models post-trained from DINO or similar (DINO architecture)
##### NeCo on DINO
```python
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch8_224

# Use vit_base_patch8_224() instead if you want our larger model
model = vit_small_patch16_224()
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
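
For the timm-based backbones, dense features can be extracted in much the same way. A minimal sketch, assuming a recent timm version in which `forward_features` returns the full token sequence (class token first):

```python
model.eval()
with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)        # dummy input at the pretraining resolution
    tokens = model.forward_features(img)     # (1, 1 + 196, C) for ViT-S/16 at 224x224
    patch_tokens = tokens[:, 1:, :]          # drop the class token, keep one token per 16x16 patch
```
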

**Note:** If you want to load the model weights directly from a Hugging Face URL, run:
```python
import torch

state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>")
```
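
The downloaded state dict can then be loaded into the chosen backbone exactly as with a local file, for example:

```python
state_dict = torch.hub.load_state_dict_from_url(
    "<url to the hugging face checkpoint>", map_location="cpu"
)
model.load_state_dict(state_dict, strict=False)
```
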

## Training Details

### Training Data

* We post-trained our models on the **COCO** dataset.

### Training Procedure

Please see our [repository](https://github.com/vpariza/NeCo) and our paper for details.

## Environmental Impact
- **Hardware Type:** NVIDIA A100 GPU
- **Hours used:** 18 (per model)
- **Cloud Provider:** Helma (NHR FAU, Germany) and Snellius (the Netherlands)
- **Compute Region:** Europe (Germany & the Netherlands)

## Citation

**BibTeX:**
```bibtex
@inproceedings{pariza2025near,
  title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
  author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=Qro97zWC29}
}
```