We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction in the world frame from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies (e.g., human detection, depth estimation, and SLAM pre-processing), Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), the dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning, aiming to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is thus a unified model that eliminates heavy dependencies and iterative refinement. Trained on the relatively small-scale synthetic dataset BEDLAM for just one day on a single GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, together with the 3D scene, in a single stage, at real-time speed (15 FPS) and with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline that can be easily extended for downstream applications.
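To make the "direct readout" concrete, the following is a minimal PyTorch sketch of how per-human tokens could be mapped to SMPL-X parameters while the CUT3R-initialized backbone stays frozen. The layer sizes, 6D rotation parameterization, and loader name are our assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn

class SMPLXReadout(nn.Module):
    """Hypothetical readout head: maps one decoded human token to SMPL-X
    pose (6D rotation per joint), shape coefficients, and translation.
    Dimensions and layer sizes are illustrative, not taken from the paper."""
    def __init__(self, token_dim=768, n_joints=55, n_betas=10, hidden=512):
        super().__init__()
        self.n_pose = n_joints * 6          # 6D rotation per joint
        self.n_betas = n_betas
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, hidden), nn.GELU(),
            nn.Linear(hidden, self.n_pose + n_betas + 3),
        )

    def forward(self, human_tokens):        # (N_humans, token_dim)
        out = self.mlp(human_tokens)
        pose6d, betas, transl = out.split([self.n_pose, self.n_betas, 3], dim=-1)
        return pose6d, betas, transl

# Parameter-efficient tuning: freeze the CUT3R-initialized backbone and
# train only the human-related readout (and prompt) layers.
# backbone = load_cut3r(...)               # hypothetical loader
# for p in backbone.parameters():
#     p.requires_grad_(False)
# readout = SMPLXReadout()
# optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-4)
```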
Given a stream of RGB images, Human3R continuously reconstructs 4D humans and scenes in real time, estimating multi-person meshes, camera parameters, and dense 3D geometry online for each frame.
Human3R enables online human-scene reconstruction from video streams. Each frame is encoded into image tokens, with patch-level human detection. Each detected head token, concatenated with a human prior token derived from Multi-HMR's ViT-DINO features, is projected into a human prompt. The human prompts serve as discriminative human-ID queries for the decoder: they self-attend with the image tokens to aggregate spatial whole-body information and cross-attend with the scene state to retrieve temporally consistent human tokens within the 3D scene context. Only human-related layers are fine-tuned; all other parameters are initialized from CUT3R and remain frozen.
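As a reading aid, here is a minimal sketch of the prompt flow described above: each head token is concatenated with a prior token, projected into a human prompt, self-attends with the image tokens, and cross-attends with the scene state. The module names, dimensions, and attention layout are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class HumanPromptDecoder(nn.Module):
    """Illustrative decoder step for human prompts (all sizes assumed)."""
    def __init__(self, dim=768, n_heads=8):
        super().__init__()
        self.to_prompt = nn.Linear(2 * dim, dim)   # [head token ; prior token] -> human prompt
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, head_tokens, prior_tokens, image_tokens, scene_state):
        # head_tokens, prior_tokens: (N_humans, dim); image_tokens: (N_patches, dim)
        # scene_state: (N_state, dim)
        prompts = self.to_prompt(torch.cat([head_tokens, prior_tokens], dim=-1))
        # Self-attention over [human prompts ; image tokens] gathers whole-body cues.
        seq = torch.cat([prompts, image_tokens], dim=0).unsqueeze(0)
        seq, _ = self.self_attn(seq, seq, seq)
        human_tokens = seq[0, : prompts.shape[0]]
        # Cross-attention into the recurrent scene state for temporal consistency.
        human_tokens, _ = self.cross_attn(
            human_tokens.unsqueeze(0), scene_state.unsqueeze(0), scene_state.unsqueeze(0)
        )
        return human_tokens.squeeze(0)             # (N_humans, dim), one token per person
```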
We compare our model's predictions with the ground-truth global human motion and camera poses.
Failure case 1: Human3R produces only coarse human-scene interactions that may exhibit penetration; these can be refined through contact-aware iterative optimization. Failure case 2: handling human-object interactions remains an opportunity for more expressive designs.