CVPR 2026

Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors

1KAIST, 2Microsoft Research Asia 3USTC
Teaser figure

Zero-shot generalization of our model across different dToF sensing conditions. Top: sparse depth inputs from three representative settings—autonomous driving LiDAR (rotating dToF), lightweight mobile dToF, and extremely sparse random sampling. Bottom: our predicted dense metric depth for each case, demonstrating strong robustness under diverse sparsity, noise, and sensor patterns. Right: comparison of accuracy versus inference time (bubble size indicates model parameters). Our method achieves the most favorable balance of low error and fast runtime, outperforming state-of-the-art depth completion and enhancement approaches.

Abstract

Direct Time-of-Flight (dToF) sensors provide highly accurate metric depth and are more robust than indirect ToF systems in challenging real-world conditions. However, their high manufacturing cost and limited photodiode array size produce depth maps that are extremely sparse, low-resolution, and noisy, making them unsuitable for VR/XR, robotics, and 3D perception tasks that require dense metric depth. Existing monocular and depth completion methods struggle to handle the unique sampling patterns and hardware artifacts of dToF devices, and their performance often deteriorates significantly under severe sparsity or noise. We present a generalizable framework for dense metric depth completion from sparse dToF measurements, capable of operating across diverse sensor types, sparsity levels, and noise conditions. Our model employs a depth-guided dual-branch Vision Transformer encoder that processes RGB images and sparse dToF measurements separately, while a masked joint attention module allows depth tokens to reliably guide image features without being overwritten by them. A lightweight decoder reconstructs dense metric depth efficiently, without diffusion-based or refinement-heavy post-processing. To address the scarcity of paired training data, we introduce a comprehensive dToF simulation pipeline that reproduces the characteristics of flash, sub-VGA flash, and rotating sensors, including hardware-induced degradation, irregular sparsity, and realistic noise distributions. Trained entirely on synthetic data, our model achieves strong zero-shot generalization across 6 datasets and 3 real dToF devices, outperforming state-of-the-art approaches in both accuracy and computational efficiency. This establishes a robust and practical solution for dense metric depth completion from sparse direct ToF sensors.

Results

Pick a scene, then drag the two handles to reveal the RGB image, the sparse depth input, and our completed depth within a single frame.

RGB Image
Predicted Depth
Sparse Depth
RGB Predicted Depth Sparse Depth

Comparison

Interactive 3D point clouds. The sparse dToF input is shown alongside each method’s completed result — all views share one synced camera, so dragging to orbit (or scrolling to zoom) in any panel moves them all together.

Input image
Input color image
Input sensor depth
Input depth map

Drag = rotate · Scroll = zoom · Right-drag = pan · all panels share one camera

Supplement Video

BibTeX


      @InProceedings{Kim_2026_CVPR,
        author = {Hakyeong Kim and Ruicheng Wang and 
        Chengtang Yao and Jiaolong Yang and Min H. Kim},
        title = {Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors},
        booktitle = {IEEE Conference on Computer Vision and 
            Pattern Recognition (CVPR)},
        month = {June},
        year = {2026}
      }