---
license: mit
language:
- en
base_model:
- facebook/dinov2-base
- facebook/dinov2-small
tags:
- computer_vision
---
# Near, far: Patch-ordering enhances vision foundation models' scene understanding
Welcome to the Hugging Face repository for **NeCo**, an adapted vision encoder that captures the fine-grained details and structural information essential for key-point matching, semantic segmentation, and more. This repository hosts pretrained NeCo checkpoints, enabling easy integration into your projects.
The paper describing this work:
**"Near, far: Patch-ordering enhances vision foundation models' scene understanding"**
*[Valentinos Pariza](https://vpariza.github.io), [Mohammadreza Salehi](https://smsd75.github.io), [Gertjan J. Burghouts](https://gertjanburghouts.github.io), [Francesco Locatello](https://www.francescolocatello.com/), [Yuki M. Asano](https://yukimasano.github.io)*
🌐 **[Project Page](https://vpariza.github.io/NeCo/)**
⌨️ **[GitHub Repository](https://github.com/vpariza/NeCo)**
📄 **[Read the Paper on arXiv](https://arxiv.org/abs/2408.11054)**
## Model Details
### Model Description
NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.
- **Model type:** Vision encoder (DINO, DINOv2, ...)
- **Language(s):** Python (PyTorch)
- **License:** MIT
- **Finetuned from:** DINOv2, DINOv2R, DINO, ...
## How to Get Started with the Model
To use NeCo models on downstream dense prediction tasks, you only need `torch` and `timm` installed. Depending on which checkpoint you use, load it as follows.
The checkpoints can be downloaded from our [NeCo Hugging Face repo](https://huggingface.co/FunAILab/NeCo/tree/main).
#### Models post-trained from DINOv2 (following the DINOv2 architecture)
##### NeCo on DINOv2
```python
import torch
# change to dinov2_vitb14 for base as described in:
# https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
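Once loaded, the model's dense patch features can be used for downstream dense prediction. Below is a minimal sketch, assuming the standard DINOv2 hub interface (`forward_features` returning a dict with `x_norm_patchtokens`) and a 224×224 input:
```python
import torch

model.eval()
with torch.no_grad():
    # Replace with a normalized RGB image tensor; a 224x224 input with patch size 14
    # yields a 16x16 grid of patch tokens.
    img = torch.randn(1, 3, 224, 224)
    feats = model.forward_features(img)
    patch_tokens = feats["x_norm_patchtokens"]  # (1, 256, embed_dim)
print(patch_tokens.shape)
```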
##### NeCo on DINOv2 with Registers
```python
import torch
# change to dinov2_vitb14_reg for base as described in:
# https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
#### Models post-trained from DINO or similar (following the DINO architecture)
##### timm ViT-Small and ViT-Base architectures
```python
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224
# Change to vit_base_patch16_224() if you want to use our larger model
model = vit_small_patch16_224()
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
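To get patch-level features from the timm backbone for dense tasks, here is a minimal sketch, assuming a recent timm version in which `forward_features` returns the full (unpooled) token sequence with the `[CLS]` token first:
```python
import torch

model.eval()
with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)     # replace with a normalized RGB image tensor
    tokens = model.forward_features(img)   # (1, 1 + 14*14, embed_dim) for patch16 @ 224
    patch_tokens = tokens[:, 1:, :]        # drop [CLS] -> (1, 196, embed_dim)
print(patch_tokens.shape)
```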
**Note:** If you want to load the model weights directly from a Hugging Face URL, you can run:
```python
import torch
state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>")
```
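For example, a minimal sketch combining this with the loading snippets above; the checkpoint file name below is a placeholder and should be replaced with the actual file from the [NeCo Hugging Face repo](https://huggingface.co/FunAILab/NeCo/tree/main):
```python
import torch

# Placeholder URL: substitute the actual checkpoint file name from the NeCo Hugging Face repo.
url = "https://huggingface.co/FunAILab/NeCo/resolve/main/<checkpoint file>"
state_dict = torch.hub.load_state_dict_from_url(url, map_location="cpu")
model.load_state_dict(state_dict, strict=False)
```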
## Training Details
### Training Data
* We have post-trained our models on the **COCO Dataset**.
### Training Procedure
Please see our [GitHub repository](https://github.com/vpariza/NeCo) and our paper for more details.
## Environmental Impact
- **Hardware Type:** NVIDIA A100 GPU
- **Hours used:** 18 (per model)
- **Cloud Provider:** Helma NHR FAU (Germany) and Snellius (The Netherlands)
- **Compute Region:** Europe (Germany & The Netherlands)
## Citation
**BibTeX:**
```bibtex
@inproceedings{pariza2025near,
  title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
  author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=Qro97zWC29}
}
```