Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots
Abstract
Camera Depth Models (CDMs) enhance depth camera accuracy by denoising and improving metric depth prediction, enabling better generalization of robotic manipulation policies from simulation to real-world tasks.
Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin for daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without added noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.
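To make the data-engine idea above concrete, here is a minimal, heavily simplified sketch of how simulation can yield paired training data for a CDM: render clean depth, corrupt it with a camera-specific noise model, and supervise the model to recover the clean depth from RGB plus the corrupted depth. All names here (`CameraNoiseModel`, `make_training_pair`) are hypothetical placeholders, and the hand-written Gaussian-plus-dropout corruption merely stands in for the paper's learned noise model.

```python
import numpy as np

class CameraNoiseModel:
    """Stand-in for a learned, camera-specific noise model (here: Gaussian
    noise plus random dropout holes, just to illustrate the data flow)."""
    def __init__(self, sigma_m: float = 0.01, dropout_p: float = 0.05):
        self.sigma_m = sigma_m        # depth noise scale in meters
        self.dropout_p = dropout_p    # fraction of pixels with missing returns

    def corrupt(self, clean_depth: np.ndarray) -> np.ndarray:
        noisy = clean_depth + np.random.normal(0.0, self.sigma_m, clean_depth.shape)
        holes = np.random.rand(*clean_depth.shape) < self.dropout_p
        noisy[holes] = 0.0            # zeros mimic invalid pixels from real sensors
        return noisy

def make_training_pair(rgb: np.ndarray, clean_depth: np.ndarray,
                       noise: CameraNoiseModel):
    """One supervised pair for CDM training: the model's input is (RGB,
    camera-like noisy depth) and its target is the clean simulated depth."""
    noisy_depth = noise.corrupt(clean_depth)
    return (rgb, noisy_depth), clean_depth

# Example: one 480x640 simulated frame (placeholder values for illustration)
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
clean_depth = np.full((480, 640), 1.5)   # 1.5 m everywhere
inputs, target = make_training_pair(rgb, clean_depth, CameraNoiseModel())
```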
Community
🚀 Want to build a 3D-aware manipulation policy, but troubled by noisy depth perception? Want to train your manipulation policy in simulation, but tired of bridging the sim2real gap by degrading geometric perception, e.g., by adding noise? These notorious problems are gone with our Camera Depth Models! CDMs are plug-in modules for a real-robot pipeline that transform noisy depth into high-quality perception, enabling seamless sim-to-real transfer and making real-robot manipulation work as it does in simulation!
🎯 Why it matters: Accurate geometry from CDMs lifts a sim-data-driven policy from 0% to 85%+ success on a set of complex, long-horizon tasks! You can now train in simulation and deploy on real robots WITHOUT further domain adaptation; just plug our CDMs into your existing pipeline (see the sketch below)!
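As a rough illustration of what "plug in" means here, the only change to an existing real-robot control loop is that the raw depth frame passes through a CDM before the policy sees it. The `camera`, `cdm`, `policy`, and `robot` objects below are hypothetical placeholders for your own stack, not the released API.

```python
# Hypothetical sketch: the policy is trained on clean simulated depth, so on
# the real robot we refine the sensor's raw depth with a CDM before the policy
# consumes it.
def control_step(camera, cdm, policy, robot):
    rgb, raw_depth = camera.read()         # noisy depth straight from the sensor
    depth = cdm.predict(rgb, raw_depth)    # denoised, metric depth (near sim quality)
    action = policy.act({"rgb": rgb, "depth": depth})
    robot.send(action)                     # same policy as in simulation, no adaptation
```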
✨ Highlight:
• Zero-shot sim-to-real transfer with 73%+ success (vs 0% baseline)
• Depth-only imitation learning achieves 85%+ success
• Works with RealSense D435/L515, Kinect, ZED2i & more
🛠️ Everything is open:
• Open-source CDMs for 5 distinct cameras
• Open-source ByteCamDepth, a comprehensive real-world depth dataset with 170K+ RGB-depth pairs across 7 cameras & 10 configurations.
• Open-source code for sim-to-real transfer and camera depth model inference. We also share our modular real-robot control framework for manipulation, which provides a unified interface for controlling various robot arms, integrating sensors, and executing policies in real time (see the sketch after this list)!
• A clean sim-to-real tutorial based on our framework!
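For a sense of what the unified control interface mentioned above might look like, here is a hedged sketch under assumed names (`RobotArm`, `ExampleArm`); the released framework's actual classes and methods may differ.

```python
# Hypothetical sketch of a unified arm interface, not the framework's real API.
from abc import ABC, abstractmethod
import numpy as np

class RobotArm(ABC):
    """Common interface so policies and sensors can be wired to any arm."""

    @abstractmethod
    def get_joint_positions(self) -> np.ndarray:
        """Return the current joint configuration."""

    @abstractmethod
    def send_joint_command(self, q: np.ndarray) -> None:
        """Command the arm to move toward joint configuration `q`."""

class ExampleArm(RobotArm):
    """Placeholder backend; a real one would talk to the vendor's driver."""
    def __init__(self, dof: int = 7):
        self._q = np.zeros(dof)

    def get_joint_positions(self) -> np.ndarray:
        return self._q.copy()

    def send_joint_command(self, q: np.ndarray) -> None:
        self._q = np.asarray(q, dtype=float)
```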
Check out everything, including interactive demos, at https://manipulation-as-in-simulation.github.io/
We hope CDMs become a foundation for your daily research!