Avada11 and nielsr (HF Staff) committed
Commit bb74fe2 · verified · 1 Parent(s): 2d94b32

Improve model card: Add pipeline tag, update license, paper, code, usage, and citation (#1)


- Improve model card: Add pipeline tag, update license, paper, code, usage, and citation (530167572e8304ad58dbe7b0646762c83c43b256)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +47 -4
README.md CHANGED
@@ -1,9 +1,52 @@
  ---
- license: cc-by-nc-4.0
  ---
 
- This repository contains the camera depth model of the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots.
 
- Model inference guide: https://github.com/ByteDance-Seed/manip-as-in-sim-suite/tree/main/cdm
 
- Project page: https://manipulation-as-in-simulation.github.io
  ---
+ license: apache-2.0
+ pipeline_tag: depth-estimation
  ---
 
+ # Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots
 
+ This repository contains the Camera Depth Models (CDMs) presented in the paper [Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots](https://huggingface.co/papers/2509.02530).
 
+ **Project Page:** [https://manipulation-as-in-simulation.github.io/](https://manipulation-as-in-simulation.github.io/)
+ **Code:** [https://github.com/ByteDance-Seed/manip-as-in-sim-suite](https://github.com/ByteDance-Seed/manip-as-in-sim-suite)
+
+ ## Abstract
+ Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.
+
+ ## Key Features
+ * **Sim-to-Real Depth Transfer**: Clean, metric depth estimation that matches simulation quality
+ * **Multi-Camera Support**: Pre-trained models for various depth sensors (RealSense, ZED, Kinect)
+ * **Automated Data Generation**: Scalable demonstration generation using enhanced MimicGen
+ * **Whole-Body Control**: Unified control for mobile manipulators used in MimicGen data generation
+ * **Multi-GPU Parallelization**: Distributed simulation for faster data collection
+ * **VR Teleoperation**: Intuitive demonstration recording using Meta Quest controllers
+
+ ## Usage
+
+ This section provides an example of how to run depth inference using the Camera Depth Model (CDM). For more details, refer to the [Model inference guide](https://github.com/ByteDance-Seed/manip-as-in-sim-suite/tree/main/cdm) in the GitHub repository.
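+
+ The `--model-path` argument of the inference command below expects a local checkpoint file. A minimal sketch for fetching a checkpoint from the Hugging Face Hub is shown here; the `repo_id` and `filename` values are placeholders, not confirmed names from this repository.
+
+ ```python
+ # Minimal sketch: download a CDM checkpoint from the Hugging Face Hub.
+ # NOTE: repo_id and filename are placeholders -- replace them with the
+ # actual repository ID and the sensor-specific checkpoint filename.
+ from huggingface_hub import hf_hub_download
+
+ model_path = hf_hub_download(
+     repo_id="<org>/<this-model-repo>",  # placeholder repo ID
+     filename="model.pth",               # assumed checkpoint filename
+ )
+ print(model_path)  # pass this path to infer.py via --model-path
+ ```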
+
+ To run depth inference on RGB-D camera data, use the following command:
+
+ ```bash
+ cd cdm
+ python infer.py \
+     --encoder vitl \
+     --model-path /path/to/model.pth \
+     --rgb-image /path/to/rgb.jpg \
+     --depth-image /path/to/depth.png \
+     --output result.png
+ ```
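+
+ Because CDMs predict metric depth, the prediction can be used directly for 3D geometry, for example by back-projecting it into a point cloud with known camera intrinsics. The sketch below is illustrative only: the 16-bit millimeter PNG encoding, the output file name, and the intrinsics are assumptions, not part of the official pipeline.
+
+ ```python
+ # Illustrative sketch: back-project a metric depth map into a point cloud.
+ # Assumptions (not from the official CDM pipeline): the depth map is stored
+ # as a 16-bit PNG in millimeters, and fx, fy, cx, cy are known intrinsics.
+ import numpy as np
+ import cv2  # opencv-python
+
+ depth_mm = cv2.imread("result.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
+ depth_m = depth_mm / 1000.0  # millimeters -> meters
+
+ fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0  # placeholder intrinsics
+ h, w = depth_m.shape
+ u, v = np.meshgrid(np.arange(w), np.arange(h))
+
+ # Pinhole back-projection: pixel (u, v) with depth z -> camera-frame (x, y, z)
+ z = depth_m
+ x = (u - cx) * z / fx
+ y = (v - cy) * z / fy
+ points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
+ points = points[points[:, 2] > 0]  # drop invalid / zero-depth pixels
+ print(points.shape)
+ ```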
+
+ ## Citation
+ If you use this work in your research, please cite:
+
+ ```bibtex
+ @article{liu2025manipulation,
+   title={Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots},
+   author={Liu, Minghuan and Zhu, Zhengbang and Han, Xiaoshen and Hu, Peng and Lin, Haotong and
+           Li, Xinyao and Chen, Jingxiao and Xu, Jiafeng and Yang, Yichu and Lin, Yunfeng and
+           Li, Xinghang and Yu, Yong and Zhang, Weinan and Kong, Tao and Kang, Bingyi},
+   journal={arXiv preprint arXiv:2509.02530},
+   year={2025}
+ }
+ ```