Human3R: Everyone Everywhere All at Once

Yue Chen1     Xingyu Chen1*     Yuxuan Xue2     Anpei Chen1     Yuliang Xiu1†     Gerard Pons-Moll2,3    
1Westlake University      2University of Tübingen, Tübingen AI Center      3Max Planck Institute for Informatics
*Project Lead      †Corresponding Author

Inference with One model, One stage; Training in One day using One GPU

Abstract

We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning, striving to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with the 3D scene, in one stage, at real-time speed (15 FPS) and with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline that can be easily extended for downstream applications.
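To make the abstract concrete, below is a minimal sketch (our illustration, not the released code) of the per-frame quantities a single forward pass is described as producing: SMPL-X parameters for every detected person ("everyone"), a dense scene point map ("everywhere"), and the camera pose ("all-at-once"). Field names and shapes are assumptions.

# Minimal container for one frame's unified outputs; shapes are illustrative.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class SMPLXPerson:
    betas: np.ndarray          # (10,)    body shape coefficients
    body_pose: np.ndarray      # (21, 3)  axis-angle joint rotations
    global_orient: np.ndarray  # (3,)     root orientation in the world frame
    transl: np.ndarray         # (3,)     root translation in the world frame


@dataclass
class FrameOutput:
    people: List[SMPLXPerson]  # "everyone": all detected humans
    point_map: np.ndarray      # "everywhere": (H, W, 3) dense scene points
    cam_to_world: np.ndarray   # (4, 4) camera pose ("all-at-once")


# Example: a 256x256 frame result with one neutral person.
frame = FrameOutput(
    people=[SMPLXPerson(np.zeros(10), np.zeros((21, 3)), np.zeros(3), np.zeros(3))],
    point_map=np.zeros((256, 256, 3), dtype=np.float32),
    cam_to_world=np.eye(4, dtype=np.float32),
)
print(len(frame.people), frame.point_map.shape)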

Online Human-Scene Reconstruction

Given a stream of RGB images, Human3R continuously reconstructs the 4D human-scene in real time, estimating multi-person meshes, camera parameters, and dense 3D geometry online for each frame.
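The loop below is a minimal sketch, under assumed interfaces, of such an online setup: each incoming frame is processed once, the predicted camera-to-world pose lifts the camera-frame point map into a shared world frame, and the camera trajectory is accumulated. The model call signature and the recurrent state are placeholders, not the actual Human3R API.

import numpy as np

def to_world(points_cam: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Transform an (N, 3) array of camera-frame points into the world frame."""
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t

def run_stream(frames, model):
    world_points, trajectory = [], []
    state = None  # recurrent scene state carried across frames (CUT3R-style)
    for frame in frames:
        pred, state = model(frame, state)           # hypothetical per-frame call
        pts_cam = pred["point_map"].reshape(-1, 3)  # dense scene geometry
        world_points.append(to_world(pts_cam, pred["cam_to_world"]))
        trajectory.append(pred["cam_to_world"][:3, 3])  # camera path
    return np.concatenate(world_points, axis=0), np.stack(trajectory)

# Dummy model illustrating only the assumed return format.
def dummy_model(frame, state):
    return {"point_map": np.zeros((4, 4, 3)), "cam_to_world": np.eye(4)}, state

pts, traj = run_stream([None] * 3, dummy_model)
print(pts.shape, traj.shape)  # (48, 3) (3, 3)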

Method Overview

Reconstruction

Human3R enables online human-scene reconstruction from video streams. Each frame is encoded into image tokens with patch-level human detection. Each detected head token, concatenated with a human prior token from the Multi-HMR ViT-DINO features, is projected into a human prompt. The human prompts serve as discriminative human-ID queries for the decoder: they self-attend with the image tokens to aggregate spatial whole-body information and cross-attend with the scene state to retrieve temporally consistent human tokens within the 3D scene context. Only the human-related layers are fine-tuned; all other parameters are initialized from CUT3R and kept frozen.
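The following PyTorch sketch is one possible reading of this readout; the dimensions, module layout, and names are our assumptions, not the released implementation. Prompts are formed by projecting the concatenated head and prior tokens, self-attend jointly with the image tokens, and then cross-attend to the scene state; only such human-related layers would be trainable while the CUT3R backbone stays frozen.

import torch
import torch.nn as nn


class HumanPromptReadout(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # concat(head token, prior token) -> human prompt
        self.to_prompt = nn.Linear(2 * dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, head_tok, prior_tok, image_tok, scene_state):
        # head_tok, prior_tok: (B, N_humans, D); image_tok: (B, N_patches, D);
        # scene_state: (B, N_state, D)
        prompts = self.to_prompt(torch.cat([head_tok, prior_tok], dim=-1))
        # Self-attention over prompts + image tokens gathers whole-body cues.
        joint = torch.cat([prompts, image_tok], dim=1)
        joint, _ = self.self_attn(joint, joint, joint)
        prompts = joint[:, : prompts.shape[1]]
        # Cross-attention to the scene state retrieves temporally consistent
        # human tokens within the 3D scene context.
        prompts, _ = self.cross_attn(prompts, scene_state, scene_state)
        return prompts  # human tokens fed to SMPL-X regression heads


readout = HumanPromptReadout()
# In the full pipeline the CUT3R backbone would be frozen, e.g.:
#   for p in backbone.parameters():
#       p.requires_grad_(False)
out = readout(torch.randn(1, 2, 768), torch.randn(1, 2, 768),
              torch.randn(1, 196, 768), torch.randn(1, 64, 768))
print(out.shape)  # torch.Size([1, 2, 768]), i.e. two detected humans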

Comparison with Ground-truth

We compare our model's predictions with the ground-truth global human motion and camera poses.
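As an illustration of how such a comparison is commonly scored (the exact evaluation protocol is not restated on this page, so this is an assumption), the sketch below aligns the predicted global joint trajectory to the ground truth with a similarity transform and reports a world-frame mean per-joint position error.

import numpy as np

def umeyama_align(src: np.ndarray, dst: np.ndarray):
    """Similarity transform (s, R, t) that maps (N, 3) src onto dst."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    U, D, Vt = np.linalg.svd(xd.T @ xs / len(src))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (xs ** 2).sum() * len(src)
    t = mu_d - s * R @ mu_s
    return s, R, t

def world_mpjpe(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """pred_joints, gt_joints: (T, J, 3) global joint trajectories in meters."""
    pred, gt = pred_joints.reshape(-1, 3), gt_joints.reshape(-1, 3)
    s, R, t = umeyama_align(pred, gt)
    aligned = s * pred @ R.T + t
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())

gt = np.random.rand(10, 24, 3)
print(world_mpjpe(gt + 0.05, gt))  # ~0: a constant offset is removed by the alignment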

Failure Cases

Failure case 1: Human3R produces only coarse human-scene interactions that may exhibit penetration; these could be refined through contact-aware iterative optimization. Failure case 2: handling human-object interactions remains open and leaves room for more expressive designs.
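As a toy illustration of the contact-aware refinement mentioned in failure case 1 (one plausible post-process, not part of Human3R), the sketch below penalizes body vertices that penetrate a known ground plane and refines the body translation by gradient descent; the ground-plane assumption and loss weights are arbitrary.

import torch

def refine_translation(vertices: torch.Tensor, steps: int = 50, lr: float = 0.01):
    """vertices: (V, 3) posed body vertices in a z-up world frame."""
    transl = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([transl], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        v = vertices + transl
        penetration = torch.relu(-v[:, 2])  # depth below the ground plane z = 0
        loss = penetration.mean() + 1e-3 * transl.norm()  # stay near the prediction
        loss.backward()
        opt.step()
    return transl.detach()

# Example: a body whose lowest vertices sit 5 cm below the ground gets pushed up.
verts = torch.rand(100, 3)
verts[:, 2] -= 0.05
print(refine_translation(verts))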

Acknowledgements

We thank all members of Endless AI, Inception3D, and the RVH Group for their help and discussions, and Yiru for creating the fantastic logo (love it!). Yue and Xingyu are funded by the Westlake Education Foundation. Gerard and Yuxuan are funded by the Carl Zeiss Foundation, the DFG - 409792180 (Emmy Noether Programme, project: Real Virtual Humans), and the BMBF: Tübingen AI Center, FKZ: 01IS18039A. Gerard is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645.