---
datasets:
- yali30/findingdory
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- habitat
- embodied-ai
- memory
---
<center>
<a href="https://arxiv.org/abs/2506.15635" target="_blank">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-FindingDory-red?logo=arxiv" height="20" />
</a>
<a href="https://findingdory-benchmark.github.io/" target="_blank">
    <img alt="Website" src="https://img.shields.io/badge/🌎_Website-FindingDory-blue.svg" height="20" />
</a>
<a href="https://github.com/findingdory-benchmark/findingdory-trl" target="_blank">
    <img alt="GitHub Code" src="https://img.shields.io/badge/Code-FindingDory--TRL-white?&logo=github&logoColor=white" />
</a>
<a href="https://huggingface.co/datasets/yali30/findingdory/" target="_blank"">
    <img alt="Huggingface" src="https://img.shields.io/badge/Dataset-FindingDory-yellow?logo=huggingface" />
</a>
</center>

<center><h1>FindingDory: A Benchmark to Evaluate Memory in Embodied Agents</h1>
  <a href="https://www.karmeshyadav.com/">Karmesh Yadav*</a>,
  <a href="https://yusufali98.github.io/">Yusuf Ali*</a>,
  <a href="https://gunshigupta.netlify.app/">Gunshi Gupta</a>,
  <a href="https://www.cs.ox.ac.uk/people/yarin.gal/website/">Yarin Gal</a>,
  <a href="https://faculty.cc.gatech.edu/~zk15/">Zsolt Kira</a>
</center>

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce **FindingDory**, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks. 

In this repo, we release a **Qwen2.5-VL-3B-Instruct** checkpoint fine-tuned on the training split of **FindingDory**. The model takes as input the image frames of a video previously collected by the agent, subsampled to 96 frames, and outputs a **frame index** (or a set of indices) pointing to the image in the agent’s history that satisfies the task instruction (e.g. “navigate to the object you interacted with _immediately after_ the mug”).
At deployment, the image corresponding to the predicted index is fed to a low-level navigation policy to complete the embodied task.
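
Below is a minimal inference sketch using 🤗 Transformers. The repo id, prompt wording, dummy frames, and generation settings are placeholders/assumptions (the exact chat format used during fine-tuning may differ); it only illustrates how the subsampled frames and a task instruction can be passed to the checkpoint.

```python
# Minimal inference sketch. Assumptions: MODEL_ID is a placeholder for this
# repo's Hub id, the dummy frames stand in for the agent's video, and the
# prompt wording may differ from the exact SFT chat format.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "path/to/this-checkpoint"  # replace with this repo's Hub id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
# max_pixels mirrors the 360 x 420 cap listed in the training table below.
processor = AutoProcessor.from_pretrained(MODEL_ID, max_pixels=360 * 420)

# In practice, `frames` are 96 images subsampled from the agent's previously
# collected video; dummy images are used here to keep the sketch runnable.
frames = [Image.new("RGB", (420, 360)) for _ in range(96)]
instruction = "Navigate to the object you interacted with immediately after the mug."

messages = [{
    "role": "user",
    "content": [{"type": "image", "image": f} for f in frames]
               + [{"type": "text", "text": instruction}],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=frames, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected to contain the predicted frame index (or indices)
```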

🏋️ Training details
| Property | Value |
| -------- | ----- |
| Epochs   | 5 (12,840 total training steps) |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 × 420 |
| Compute  | 8 × A40 (48 GB) for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | step 8,800 |
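
For reference, the hyperparameters above roughly map onto a TRL `SFTConfig` as sketched below. This is a hedged illustration, not the exact script from the `findingdory-trl` repo: the output directory name and the per-device batch / gradient-accumulation split that yields the effective batch of 32 on 8 GPUs are assumptions.

```python
# Hedged sketch of how the table above could map to a TRL SFTConfig.
# The per-device batch / grad-accumulation split (1 x 4 x 8 GPUs = 32 effective)
# and the output_dir are assumptions; see findingdory-trl for the actual script.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="findingdory-qwen2.5-vl-3b-sft",
    num_train_epochs=5,
    per_device_train_batch_size=1,   # x 4 grad-accum x 8 GPUs = 32 effective
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,
)
```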


📊 Evaluation
We compare the performance of our fine-tuned `FindingDory-Qwen2.5-VL-3B-SFT` checkpoint against other models below:
| Model | High-level Success Rate | Notes |
| ----- | ----------------------- | ----- |
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |

Check out Fig. 2 in the paper for more details.

📄 Citation
```bibtex
@article{yadav2025findingdory,
  title     = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author    = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal   = {arXiv preprint arXiv:2506.15635},
  year      = {2025}
}
```