Jiatao Gu
Talk @ EmbodiedAIinLife Workshop · Jun 3, 2026
— a recent and influential argument
3D is on trial.
An influential framing — domain by domain.

I basically agree. A sharp framing — and very reasonable.
The row this talk wants to look at more carefully is Robotics — given today's data picture.
— from the same essay
Bars are illustrative orders of magnitude, not exact counts.
End-to-end is appealing. For now, the data is the constraint.
An honest steelman before we get to 3D.
Video is wonderful. It is also not what a body needs.
3D becomes the natural interface — the missing dimension that lets video supervise action.
3D as an interface — three roles for the same primitive.


Building on prior work — WVD (CVPR'25, Highlight): 3D as a supervision signal for video models.

Embodiment-agnostic point trajectories as the interface between video prediction and robot control.

PointAction extends this from static images to dynamic point trajectories — and connects them to action.

One DiT + flexible inpainting: three classically separate tasks fall out.
Same model, same training — the unification is the contribution, not just any one task.
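One way to picture "flexible inpainting" is as a choice of which entries are given as conditioning and which are generated. The sketch below is illustrative only: the two-modality layout, the task names, and the helpers `observed_mask` and `inpaint_step` are assumptions made for this example, not WVD's actual interface, and a real sampler would also noise the conditioning to the current step.

```python
import numpy as np

# Hypothetical layout: every frame carries two modalities, RGB and XYZ.
MODALITIES = ("rgb", "xyz")

def observed_mask(task: str, num_frames: int) -> np.ndarray:
    """Return a (num_frames, 2) boolean mask; True means 'given as conditioning'."""
    mask = np.zeros((num_frames, len(MODALITIES)), dtype=bool)
    if task == "mono_depth":          # one RGB frame in, its XYZ out
        mask[0, 0] = True
    elif task == "video_depth":       # all RGB frames in, all XYZ out
        mask[:, 0] = True
    elif task == "joint_generation":  # first RGB frame in, everything else out
        mask[0, 0] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

def inpaint_step(x_t: np.ndarray, known: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Overwrite observed entries of the current sample with the conditioning values.

    x_t and known have shape (num_frames, 2, feature_dim).
    """
    return np.where(mask[..., None], known, x_t)
```

Under this reading the network and the training loop stay the same across tasks; only the mask changes.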
Monocular depth · AbsRel ↓
| Method | NYU-v2 | BONN |
|---|---|---|
| DUSt3R-512 | 6.5 | 8.1 |
| WVD (Ours) | 9.7 | 7.0 |
Competitive despite lower training resolution.
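For reference, AbsRel is commonly defined as the mean of |predicted depth − ground-truth depth| / ground-truth depth over valid pixels. The sketch below follows that common definition; the validity threshold `eps` is an assumption, and the tables here appear to report the value multiplied by 100.

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > eps  # skip pixels that have no ground-truth depth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Usage: the third pixel has no ground truth and is ignored.
score = abs_rel([1.1, 2.0, 0.0], [1.0, 2.0, 0.0])
```

Lower is better, which is why the table headers carry a down arrow.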
Video depth · ScanNet++ AbsRel ↓
| Method | AbsRel |
|---|---|
| DUSt3R-512 | 4.9 |
| WVD (Ours) | 5.0 |
Robust generalisation; XYZ joint training pays for itself in temporal stability.
Take-away: 3D as a training signal unifies these tasks in one model and stays competitive with DUSt3R-512 on each (ahead on BONN, behind on NYU-v2, near parity on ScanNet++).
Web video pretraining + tiny per-arm finetune = the data argument of this whole talk, made concrete.
The world model is trained once; the decoder is swapped per embodiment.


A small amount of paired robot data is needed only to train this decoder; the world model is unchanged across embodiments.
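The "train once, swap the decoder" split can be sketched with a toy stand-in. Everything below is hypothetical: `world_model` is a fixed random feature map, not the actual model; the decoder is a plain ridge regression, not the talk's decoder; and the action dimensions and the 50-sample datasets are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen weights standing in for the pretrained world model.
W_FROZEN = rng.normal(size=(8, 16))

def world_model(obs: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen world model: observation -> trajectory features.

    Nothing below ever updates W_FROZEN.
    """
    return np.tanh(obs @ W_FROZEN)

def fit_decoder(obs: np.ndarray, actions: np.ndarray, ridge: float = 1e-3) -> np.ndarray:
    """Fit a per-arm linear decoder (features -> actions) by ridge regression."""
    feats = world_model(obs)
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ actions)

# One small paired dataset per arm; the action dimensions are placeholders.
decoders = {}
for arm, action_dim in {"xarm7": 7, "yam": 6}.items():
    obs = rng.normal(size=(50, 8))
    actions = rng.normal(size=(50, action_dim))
    decoders[arm] = fit_decoder(obs, actions)
```

The point of the structure is that adding a new arm adds one entry to `decoders` and touches nothing else.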
RoboCasa365 sim, avg success (%), 100 rollouts/cell.
| Setting | ID | OOD-Env | OOD-Task |
|---|---|---|---|
| RGB-only (no XYZ) | 25.1 | 20.3 | 5.8 |
| Full-scene points (robot + everything) | 40.3 | 36.4 | 12.0 |
| Robot-centric XYZ (Ours) | 47.7 | 44.1 | 17.0 |
3D is the load-bearing supervision — not a cosmetic addition.
4D rollouts: held-out trajectories from BridgeData V2 (left) and DROID (right). Interactive version: oriontmt.github.io/pointaction
RoboCasa365 sim · avg success (%) · 100 rollouts/cell
| Setting | GR00T N1.7 | $\pi_{0.5}$ | VPP | Cosmos | Ours |
|---|---|---|---|---|---|
| ID (10 seen) | 44.5 | 39.8 | 34.5 | 45.2 | 47.7 |
| OOD-Env | 37.6 | 35.2 | 32.2 | 42.9 | 44.1 |
| OOD-Task | 8.6 | 6.9 | 7.4 | 14.0 | 17.0 |
4D scene generation · 300 held-out DROID + BridgeData V2 trajectories
| Method | PSNR ↑ | SSIM ↑ | FVD ↓ | Chamfer ↓ |
|---|---|---|---|---|
| TesserAct | 12.23 | 0.487 | 746 | 0.389 |
| 4DNeX | 13.86 | 0.542 | 818 | 0.370 |
| LVP | 19.61 | 0.816 | 330 | — |
| Wan 2.1 14B | 14.53 | 0.674 | 671 | — |
| Ours | 19.63 | 0.821 | 320 | 0.122 |
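Two of these metrics are simple enough to sketch. The versions below follow common definitions and may differ from the exact evaluation protocol used here: PSNR assumes values scaled to [0, `max_val`], and Chamfer is taken as the symmetric sum of mean nearest-neighbour Euclidean distances (other conventions square the distances or average the two directions). FVD needs a pretrained video feature network and is omitted.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (N, M).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(dist.min(axis=1).mean() + dist.min(axis=0).mean())
```

The pairwise matrix makes this O(N·M) in memory, which is fine for a sketch but would need a nearest-neighbour index for dense point maps.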
YAM and xArm7 — neither seen during 4D-video pretraining. Only the small per-arm decoder is finetuned (20–50 expert trajectories per task).
(a) xArm7 — 100 rollouts / task
| % | P&P | St.Cu | St.Cup | Avg. |
|---|---|---|---|---|
| GR00T N1.7 | 30.0 | 7.0 | 7.0 | 14.7 |
| $\pi_{0.5}$ | 42.0 | 12.0 | 14.0 | 22.7 |
| Ours | 67.0 | 28.0 | 34.0 | 43.0 |
(b) YAM — 20 evals / task
| % | St.Cu | PickPen | InsertCup |
|---|---|---|---|
| GR00T N1.5 | 0 | 20 | 15 |
| $\pi_0$ | 0 | 10 | 15 |
| Ours | 20 | 60 | 50 |
Roughly 2× the best VLA on average — same world model, swapped decoder.
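The ratio can be checked directly from the two tables. The snippet below copies the success rates as shown and computes per-method averages and the ratio of ours to the best baseline; the helper name `summarize` is just for this example.

```python
# Success rates (%) copied from tables (a) and (b).
xarm7 = {
    "GR00T N1.7": [30.0, 7.0, 7.0],
    "pi_0.5": [42.0, 12.0, 14.0],
    "Ours": [67.0, 28.0, 34.0],
}
yam = {
    "GR00T N1.5": [0, 20, 15],
    "pi_0": [0, 10, 15],
    "Ours": [20, 60, 50],
}

def summarize(table):
    """Return per-method task averages and the ratio of ours to the best baseline."""
    avg = {name: sum(vals) / len(vals) for name, vals in table.items()}
    best_baseline = max(v for name, v in avg.items() if name != "Ours")
    return avg, avg["Ours"] / best_baseline

for label, table in (("xArm7", xarm7), ("YAM", yam)):
    avg, ratio = summarize(table)
    print(label, {k: round(v, 1) for k, v in avg.items()}, f"ratio={ratio:.2f}")
```

On these numbers the ratio is about 1.9× on xArm7 (43.0 vs 22.7) and about 3.7× on YAM (43.3 vs 11.7).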

Test-time scaffolding. Steer any diffusion-based VLA with task-specific 3D attractors and repellers.
No extra robotic data. No retraining. No new VLA expert.
Each guidance source is just an energy function over Cartesian poses — composable, additive, swappable.

Three guidance sources composed at inference (3D foundation models, VLMs, hand-tracking) — applied at the noise prior and every intermediate denoising step.
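A minimal sketch of the idea, under loud simplifications: poses are reduced to 3D positions, the base policy's denoiser is a caller-supplied stand-in, and the quadratic attractor, the spherical repeller, and the names `compose` and `guided_denoise` are all assumptions for illustration, not OmniGuide's implementation.

```python
import numpy as np

def attractor(goal, weight=1.0):
    """Quadratic well pulling a position toward `goal`. Returns (energy, grad)."""
    goal = np.asarray(goal, dtype=float)
    energy = lambda p: 0.5 * weight * np.sum((p - goal) ** 2)
    grad = lambda p: weight * (p - goal)
    return energy, grad

def repeller(center, radius, weight=1.0):
    """Penalty that is positive inside a sphere and zero outside it."""
    center = np.asarray(center, dtype=float)

    def energy(p):
        dist = np.linalg.norm(p - center)
        return 0.5 * weight * max(radius - dist, 0.0) ** 2

    def grad(p):
        diff = p - center
        dist = np.linalg.norm(diff)
        if dist >= radius or dist == 0.0:
            return np.zeros_like(p)
        return -weight * (radius - dist) * diff / dist

    return energy, grad

def compose(*sources):
    """Energies add, so their gradients add too."""
    energy = lambda p: sum(e(p) for e, _ in sources)
    grad = lambda p: sum(g(p) for _, g in sources)
    return energy, grad

def guided_denoise(denoise_step, grad, x, num_steps=20, scale=0.1):
    """Apply one guidance step at the noise prior, then one after every denoising step."""
    x = x - scale * grad(x)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)   # stand-in for the base policy's update
        x = x - scale * grad(x)  # descend the composed energy
    return x
```

Because each source exposes only an energy and its gradient, swapping or adding a source changes the arguments to `compose` and leaves the base policy untouched.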
| Metric | Base VLA | + OmniGuide |
|---|---|---|
| Success rate | 24.2% | 92.4% |
| Collision-avoidance | 7.0% | 93.5% |
Base VLAs: $\pi_{0.5}$ and GR00T N1.6; both improve. These are headline numbers; see the PDF for the full breakdown across guidance sources.


All three roles ride on the same primitive — and all are supervisable, in different ways, from web video plus a small amount of physics.
Should embodied intelligence care about 3D?
— a representation that turns unpaired video into supervision the policy can actually use.
All four wins ride on the same 3D point primitive.
Strong enough for this decade. Soft enough to dissolve as the data fills in.
Two threads — keep extending the interface, and look beyond it.
Should embodied intelligence care about 3D? — maybe — at least as an interface, for now.
WVD (CVPR'25) · PointAction (arXiv 2606.03943) · OmniGuide (arXiv'26) · PhysCtrl (NeurIPS'25)
