---
license: mit
language:
- en
base_model:
- facebook/dinov2-base
- facebook/dinov2-small
tags:
- computer_vision
---

# Near, far: Patch-ordering enhances vision foundation models' scene understanding

Welcome to the Hugging Face repository for **NeCo**, an adapted vision encoder that captures the fine-grained details and structural information essential for key-point matching, semantic segmentation, and other dense tasks. This repository hosts pretrained NeCo checkpoints for easy integration into your projects.

This model is presented in our paper:
**"Near, far: Patch-ordering enhances vision foundation models' scene understanding"**  
*[Valentinos Pariza](https://vpariza.github.io), [Mohammadreza Salehi](https://smsd75.github.io), [Gertjan J. Burghouts](https://gertjanburghouts.github.io), [Francesco Locatello](https://www.francescolocatello.com/), [Yuki M. Asano](https://yukimasano.github.io)*

🌐 **[Project Page](https://vpariza.github.io/NeCo/)**
⌨️ **[GitHub Repository](https://github.com/vpariza/NeCo)**
📄 **[Read the Paper on arXiv](https://arxiv.org/abs/2408.11054)**

## Model Details

### Model Description

NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.

- **Model type:** Vision encoder (Dino, Dinov2, ... backbones)
- **Language(s):** Python (PyTorch)
- **License:** MIT
- **Finetuned from model:** Dinov2, Dinov2 with registers (Dinov2R), Dino, ...
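
As a rough, purely illustrative sketch of the patch-neighbor-consistency idea (not the NeCo training code), the snippet below compares how two feature extractors, e.g. a student and a teacher view, rank a shared reference set of patch features. A soft KL term stands in for the paper's differentiable sorting, and `student_feats`, `teacher_feats`, and `reference_bank` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def neighbor_consistency_sketch(student_feats, teacher_feats, reference_bank, temperature=0.1):
    """Illustrative only: encourage both networks to order a shared reference set
    of patch features consistently for every patch.
    student_feats / teacher_feats: [num_patches, dim]; reference_bank: [num_refs, dim]."""
    s_sim = F.normalize(student_feats, dim=-1) @ F.normalize(reference_bank, dim=-1).T
    t_sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(reference_bank, dim=-1).T
    # Soft stand-in for neighbor ordering: match per-patch similarity distributions.
    log_p = F.log_softmax(s_sim / temperature, dim=-1)
    q = F.softmax(t_sim / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")
```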


## How to Get Started with the Model

To use NeCo models on downstream dense prediction tasks, you only need to install `timm` and `torch`. Depending on which checkpoint you use, you can load it as follows.

The models can be downloaded from our [NeCo Hugging Face repo](https://huggingface.co/FunAILab/NeCo/tree/main).

#### Models post-trained from Dinov2 (following the Dinov2 architecture)

##### NeCo on Dinov2 
```python
import torch
# change to dinov2_vitb14 for base as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
##### NeCo on Dinov2 with Registers
```python
import torch
# change to dinov2_vitb14_reg for base as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
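
Once either of the Dinov2-based checkpoints above is loaded, dense patch features for downstream dense prediction can be extracted through the standard Dinov2 interface. The snippet below is a minimal sketch: it reuses `model` from above, feeds a random 224×224 tensor in place of a real image, and relies on `get_intermediate_layers`, which the Dinov2 hub models expose.

```python
import torch

model.eval()
image = torch.randn(1, 3, 224, 224)  # placeholder input; side length must be a multiple of 14
with torch.no_grad():
    # reshape=True yields a spatial feature map of shape [B, dim, H/14, W/14]
    feats = model.get_intermediate_layers(image, n=1, reshape=True)[0]
print(feats.shape)  # e.g. torch.Size([1, 384, 16, 16]) for ViT-S/14
```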
#### Models post-trained from Dino or similar (following the Dino architecture)
##### timm vit-small and vit-base architectures
```python
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224
# Change to vit_base_patch16_224() (imported above) if you want to use our base model
model = vit_small_patch16_224()
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
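
For the timm backbones, patch-level features can be obtained from `forward_features`, which returns the full token sequence (a class token followed by the patch tokens). The snippet below is a sketch that reuses `model` from above with a random input instead of a real image; the shapes assume ViT-S/16 at 224×224 resolution.

```python
import torch

model.eval()
image = torch.randn(1, 3, 224, 224)         # placeholder input
with torch.no_grad():
    tokens = model.forward_features(image)  # [1, 197, 384] for ViT-S/16: 1 [CLS] + 14*14 patches
patch_tokens = tokens[:, 1:]                # drop the [CLS] token
patch_map = patch_tokens.reshape(1, 14, 14, -1).permute(0, 3, 1, 2)  # [1, 384, 14, 14]
```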

**Note:** If you want to load the model weights directly from a Hugging Face URL, you can run:
```python
import torch
state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>", map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # model created as in the examples above
```

## Training Details

### Training Data

* We have post-trained our models on the **COCO Dataset**.

### Training Procedure

Please see our [repository](https://github.com/vpariza/NeCo) and read our paper for more details.

## Environmental Impact
- **Hardware Type:** NVIDIA A100 GPU
- **Hours used:** 18 (per model)
- **Cloud Provider:** Helma (NHR@FAU, Germany) and Snellius (The Netherlands)
- **Compute Region:** Europe (Germany & The Netherlands)

## Citation

**BibTeX:**
```bibtex
@inproceedings{
   pariza2025near,
   title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
   author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
   booktitle={The Thirteenth International Conference on Learning Representations},
   year={2025},
   url={https://openreview.net/forum?id=Qro97zWC29}
}

```
