---
datasets:
- yali30/findingdory
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- habitat
- embodied-ai
- memory
---
<center>
<a href="https://arxiv.org/abs/2506.15635" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-FindingDory-red?logo=arxiv" height="20" />
</a>
<a href="https://findingdory-benchmark.github.io/" target="_blank">
<img alt="Website" src="https://img.shields.io/badge/🌎_Website-FindingDory-blue.svg" height="20" />
</a>
<a href="https://github.com/findingdory-benchmark/findingdory-trl" target="_blank">
<img alt="GitHub Code" src="https://img.shields.io/badge/Code-FindingDory--TRL-white?&logo=github&logoColor=white" />
</a>
<a href="https://huggingface.co/datasets/yali30/findingdory/" target="_blank">
<img alt="Huggingface" src="https://img.shields.io/badge/Dataset-FindingDory-yellow?logo=huggingface" />
</a>
</center>
<center><h1>FindingDory: A Benchmark to Evaluate Memory in Embodied Agents</h1>
<a href="https://www.karmeshyadav.com/">Karmesh Yadav*</a>,
<a href="https://yusufali98.github.io/">Yusuf Ali*</a>,
<a href="https://gunshigupta.netlify.app/">Gunshi Gupta</a>,
<a href="https://www.cs.ox.ac.uk/people/yarin.gal/website/">Yarin Gal</a>,
<a href="https://faculty.cc.gatech.edu/~zk15/">Zsolt Kira</a>
</center>
Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce **FindingDory**, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.
In this repo, we release a **Qwen2.5-VL-3B-Instruct** checkpoint fine-tuned on the training split of **FindingDory**. The model takes as input image frames from a video that the agent collected previously, subsampled to 96 frames. It outputs a **frame index** (or several indices) pointing to the image in the agent’s history that satisfies the task instruction (e.g. “navigate to the object you interacted with _immediately after_ the mug”).
At deployment, the image corresponding to the predicted index is fed into a low-level navigation policy to complete the embodied task.
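The frame handling described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: it assumes uniform (evenly spaced) subsampling and 0-based frame indices, neither of which is stated here, and the helper names `subsample_indices` and `to_original_frame` are hypothetical.

```python
def subsample_indices(num_frames: int, target: int = 96) -> list[int]:
    """Pick up to `target` evenly spaced frame indices from a video.

    Assumption: uniform spacing. The actual sampling strategy may differ.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_frames <= target:
        # Short videos are kept whole.
        return list(range(num_frames))
    # Spread `target` positions evenly from the first frame to the last.
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


def to_original_frame(predicted_index: int, kept: list[int]) -> int:
    """Map an index in the subsampled sequence back to the original video.

    Assumption: the model's predicted index is 0-based.
    """
    if not 0 <= predicted_index < len(kept):
        raise IndexError(f"predicted index {predicted_index} is out of range")
    return kept[predicted_index]


if __name__ == "__main__":
    kept = subsample_indices(1000)           # indices of the 96 frames shown to the model
    goal_frame = to_original_frame(42, kept)  # frame to hand to the navigation policy
    print(len(kept), goal_frame)
```

Keeping the list of retained indices around makes it possible to recover the original frame for the navigation policy even though the model only ever sees the subsampled sequence.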
🏋️ Training details
| Property | Value |
| -------- | ----- |
| Epochs | 5 (12,840 total training steps) |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 × 420 |
| Compute | 8 × A40 48 GB for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | Step 8800 |
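To make the learning-rate schedule in the table concrete, here is a small sketch of a cosine schedule with warmup using the listed values (peak LR 5e-6, warmup ratio 0.1, 12,840 total steps). It assumes linear warmup from zero and decay to zero, which is a common convention but is not confirmed here; the function name `lr_at` is hypothetical.

```python
import math

TOTAL_STEPS = 12840   # from the training table
PEAK_LR = 5e-6        # from the training table
WARMUP_RATIO = 0.1    # from the training table


def lr_at(step: int,
          total_steps: int = TOTAL_STEPS,
          peak_lr: float = PEAK_LR,
          warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay.

    Assumptions: warmup is linear from 0, and the cosine decays to 0.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 1284, 8800, 12840):
        print(step, lr_at(step))
```

Under these assumptions warmup ends at step 1,284, so the best checkpoint (step 8800) falls well inside the decay phase.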
📊 Evaluation
We compare the performance of our finetuned `FindingDory-Qwen2.5-VL-3B-SFT` checkpoint against other models below:
| Model | High-level Success Rate | Notes |
| ----- | ----------------------- | ----- |
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |
Check out Fig. 2 in the paper for more details.
📄 Citation
```
@article{yadav2025findingdory,
title = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
author = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
journal = {arXiv preprint arXiv:2506.15635},
year = {2025}
}
```