Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
By: Shaocong Xu, Songlin Wei, Qizhe Wei, and more
Potential Business Impact:
Helps robots perceive glass and reflective objects.
Transparent objects remain notoriously hard for perception systems: refraction, reflection, and transmission break the assumptions behind stereo, ToF, and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the underlying optical rules.

We build TransPhy3D, a synthetic video corpus of transparent and reflective scenes comprising 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets, paired with glass, plastic, and metal materials, and rendered to RGB, depth, and normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training, we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for input videos of arbitrary length.

The resulting model, DKT, achieves zero-shot state-of-the-art results on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image and video baselines, and a normal-estimation variant sets the best video normal estimation results on ClearPose. A compact 1.3B-parameter version runs at ~0.17 s per frame. Integrated into a grasping stack, DKT's depth estimates boost success rates across translucent, reflective, and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
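The central training mechanism described above, conditioning on clean RGB latents while denoising depth latents concatenated channel-wise before the DiT blocks, with only low-rank (LoRA) adapters trained, can be illustrated with a minimal PyTorch sketch. All module names, tensor shapes, and the toy transformer block below are illustrative assumptions for exposition, not the authors' released architecture.

```python
# Minimal sketch of channel-wise RGB/depth latent concatenation with a
# LoRA-adapted block. Everything here is a simplified, assumed stand-in.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # backbone weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class ToyDiTBlock(nn.Module):
    """Toy stand-in for a DiT block: attention omitted, LoRA-adapted MLP only."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = LoRALinear(nn.Linear(dim, dim))

    def forward(self, tokens):
        return tokens + self.proj(self.norm(tokens))

# Latents shaped (batch, frames, channels, height, width). RGB latents act as
# clean conditioning; depth latents carry diffusion noise at some timestep t.
B, T, C, H, W = 1, 8, 4, 16, 16
rgb_latents = torch.randn(B, T, C, H, W)
noisy_depth_latents = torch.randn(B, T, C, H, W)

# Channel-wise concatenation before tokenization, as described in the abstract.
x = torch.cat([rgb_latents, noisy_depth_latents], dim=2)  # (B, T, 2C, H, W)
tokens = x.flatten(3).transpose(2, 3).reshape(B, T * H * W, 2 * C)

block = ToyDiTBlock(dim=2 * C)
out = block(tokens)        # in the real model, this would predict depth noise
print(out.shape)           # torch.Size([1, 2048, 8])
```

At inference, under the same assumptions, the depth latents would start from pure noise and be iteratively denoised while conditioned on the clean RGB latents; freezing the backbone and training only the adapters is what lets the model retain its pretrained video prior while learning the RGB-to-depth translation.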
Similar Papers
DepthFocus: Controllable Depth Estimation for See-Through Scenes
CV and Pattern Recognition
Lets computers see through glass like humans.
TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image
CV and Pattern Recognition
Helps robots grab see-through objects accurately.
Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
CV and Pattern Recognition
Creates 3D scenes from one picture.