Should Embodied Intelligence
Care About 3D?

Jiatao Gu

Talk @ EmbodiedAIinLife Workshop · Jun 3, 2026

"Just as we no longer hand-craft features for detection, we will soon stop using 3D as part of embodied intelligence."

— a recent and influential argument

3D is on trial.

"Should you care about 3D?" — domain by domain

An influential framing — domain by domain.

Should you care about 3D? Robotics ✗, Digital Media ✗, Interactive Media ✓, Making Actual Stuff ✓+

I basically agree. A sharp framing — and very reasonable.

The row this talk wants to look at more carefully is Robotics — given today's data picture.

A tension in the same view

"The core challenge of embodied intelligence is the lack of paired perception–action data at scale."

— from the same essay

Bitter-lesson scaling assumes the data is there.
Internet text: ~$10^{13}$ tokens
Internet video: ~$10^{9}$ hours
Image–caption pairs: ~$10^{10}$
Robot teleop episodes: ~$10^{6}$

Bars are illustrative orders of magnitude, not exact counts.

Embodied AI doesn't yet have it at that scale. Paired observation–action data is teleoperated, expensive, embodiment-bound — orders of magnitude smaller than the corpora that powered the rest of the bitter-lesson story.

End-to-end is appealing. For now, the data is the constraint.

Then — why not just use video?

An honest steelman before we get to 3D.

Web video is the largest paired modality we have. Hours of footage of humans interacting with the world — already at internet scale.
Generative video models scale. They synthesize counterfactuals, predict futures, render unseen scenes — better every year.
The bitter lesson would say: just keep scaling this. Don't add intermediate structure on top.
If video already does so much — why bother with 3D at all?

Because video ≠ action

Video is wonderful. It is also not what a body needs.

Video is rendered, not interacted. No contact, no force, no consequence — pixels can be fluent about physics that does not actually hold.
A frame is a plan-of-pixels, not a plan-of-action. Trajectories on screen do not decompose into commands a particular body can execute.
Generative video can hallucinate geometry. Texture-consistent does not mean shape-consistent — depth and contact are easy to get wrong, and you cannot tell from the pixels.
And, deepest: a frame is a 2D projection. The body acts in 3D space.
Pixels are a view of the world. Action happens in the world.

3D becomes the natural interface — the missing dimension that lets video supervise action.

Three works, one arc

3D as an interface — three roles for the same primitive.

PointAction
3D as action
PointAction — dynamic point trajectories as a universal action interface. arXiv 2606.03943
OmniGuide
3D as guidance
OmniGuide — differentiable 3D energy fields steer any VLA at test-time. arXiv 2026
PhysCtrl
3D as physical prior
PhysCtrl — point-trajectory physics-grounded video on 4 materials. NeurIPS 2025

Building on prior work — WVD (CVPR'25, Highlight): 3D as a supervision signal for video models.

PointAction — 3D points as a universal action representation

[Figure: PointAction teaser]

Embodiment-agnostic point trajectories as the interface between video prediction and robot control.

PointAction — Tong, Jiang, …, Gu et al., arXiv 2606.03943 · oriontmt.github.io/pointaction

Prior work — WVD: 3D as supervision, not output

[Figure: WVD teaser]
XYZ images. Per-pixel global 3D coordinates — pixel-aligned, exactly the shape a video model already speaks. RGB and XYZ are stacked into 6-channel frames.
One DiT, three tasks. Joint RGB + XYZ training + flexible inpainting unifies single-image-to-3D, multi-view stereo, and camera-controlled video — without per-task heads.
The point: 3D enters as a training signal, not as a hand-crafted intermediate the policy must trust.

PointAction extends this from static images to dynamic point trajectories — and connects them to action.

WVD — Zhang et al., CVPR 2025 (Highlight)

The trick: pixel-aligned 3D coordinates

[Figure: WVD XYZ frames and DiT block]
XYZ image. Per-pixel global 3D coordinate transformed into the camera frame. Same H×W as the RGB image — exactly the shape a video model already speaks.
Texture-free consistency. Two pixels with the same XYZ across views correspond to the same 3D point — direct supervision for multi-view geometry, no scene-specific tricks.
One DiT, 6-channel frames. Stack RGB + XYZ into a single diffusion target. No new architecture; the joint distribution is learnt by an off-the-shelf video DiT.
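To make the XYZ-image idea concrete, here is a minimal sketch of lifting a depth map to a pixel-aligned XYZ image and stacking it with RGB. It assumes a pinhole camera and a known camera-to-world pose; the function names and the choice of frame are illustrative, not WVD's actual preprocessing.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an H x W depth map to camera-frame XYZ, one 3D point per pixel.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); depth[v][u] is the z-distance at pixel (u, v).
    """
    xyz = []
    for v, row in enumerate(depth):
        xyz_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            xyz_row.append((x, y, z))
        xyz.append(xyz_row)
    return xyz


def to_world(xyz, rotation, translation):
    """Apply a rigid camera-to-world transform p_w = R p_c + t to every pixel."""
    return [
        [
            tuple(
                sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
                for i in range(3)
            )
            for p in row
        ]
        for row in xyz
    ]


def stack_rgb_xyz(rgb, xyz):
    """Concatenate per-pixel RGB and XYZ into 6-channel pixels (same H x W)."""
    return [
        [tuple(color) + tuple(point) for color, point in zip(rgb_row, xyz_row)]
        for rgb_row, xyz_row in zip(rgb, xyz)
    ]
```

Because the XYZ image has the same H×W grid as the RGB image, the two can share one tensor layout, which is what lets an unmodified video model consume them together.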

Unifying single-to-3D, multi-view, camera control

One DiT + flexible inpainting — three classically-separate tasks fall out.

Single-image → 3D. Inpaint XYZ given RGB. Reconstructed point cloud beats CameraCtrl / MotionCtrl baselines, no depth head needed.
Multi-view stereo. Diffuse XYZ from unposed RGB images; refine cameras + depth via PnP post-optimisation. Competitive with DUSt3R at 512.
Camera-controlled video. Re-project XYZ along a target trajectory, then inpaint RGB. Synthesised video respects the requested camera path — no per-task conditioning.

Same model, same training — the unification is the contribution, not just any one task.


Results: state-of-the-art on depth, MVS, video

Monocular depth · AbsRel ↓

| Method | NYU-v2 | BONN |
|---|---|---|
| DUSt3R-512 | 6.5 | 8.1 |
| WVD (Ours) | 9.7 | 7.0 |

Competitive despite lower training resolution.
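For readers unfamiliar with the metric: AbsRel is commonly defined as the mean absolute depth error relative to ground truth. A minimal sketch of that common definition follows; the masking of invalid pixels and the function name are assumptions, and the values in these tables look like the ratio multiplied by 100, which is my reading rather than something the slides state.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth.

    pred and gt are flat sequences of per-pixel depths of equal length.
    Lower is better.
    """
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no pixels with valid ground-truth depth")
    return sum(errors) / len(errors)
```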

Video depth · ScanNet++ AbsRel ↓

| Method | AbsRel |
|---|---|
| DUSt3R-512 | 4.9 |
| WVD (Ours) | 5.0 |

Robust generalisation; XYZ joint training pays for itself in temporal stability.

Ablation: XYZ is essential. Single-image-to-3D FID worsens from 13.3 to 18.3 when XYZ supervision is removed. Pure RGB training loses multi-view consistency.

Take-away: 3D as a training signal doesn't just unify tasks; the single model stays competitive with specialised baselines on each.


The representation behind PointAction

Why this scales: geometry + motion are recoverable from any video. Paired action is teleoperated and tiny; video is web-scale.
Carrying WVD's idea forward: 3D as a supervision signal — but now dynamic, lifted into video time.
unpaired web video → dynamic 3D pointmaps → policy
VLAs are bound to action labels → embodiment-specific, brittle on contact, geometry, long-horizon.
Video-action models use rollouts as a reasoning trace — RGB-dominant; 3D motion stays implicit.
PointAction: dynamic 3D pointmaps — supervisable from web video AND actionable for control.

Data recipe: 75K robot videos → embodiment-agnostic 4DVM

Geometry from video alone. 50K DROID (binocular → FoundationStereo) + 25K BridgeData V2 (monocular → Depth-Anything-V3). Stage 1 (4DVM) trained on these — no paired action labels needed.
Action data stays tiny. Stage 2 decoder needs only 20–50 expert trajectories per embodiment. The 4DVM stays frozen across arms.
Flow matching + diffusion forcing. Random clean-context fraction (50%) teaches robust conditioning; history split enables autoregressive sampling.
Why it scales. Geometry supervision is free from any video; the only bottleneck (action) is small and per-arm — exactly inverting the usual VLA cost structure.
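A minimal sketch of how a flow-matching target with per-frame noise levels might be built. It assumes the linear (rectified-flow) interpolation and reads the "clean-context fraction (50%)" as the probability that a random history prefix is kept clean; both are my assumptions, and `make_training_pair` is a hypothetical name, not the paper's code.

```python
import random


def make_training_pair(clean_frames, rng, p_clean_context=0.5):
    """Build one flow-matching example with an independent time per frame.

    clean_frames: list of frames, each a flat list of floats (a latent).
    Returns (noisy_frames, target_velocities, times), with t = 1 meaning clean.
    """
    n = len(clean_frames)
    if n < 2:
        raise ValueError("need at least two frames")
    # Diffusion forcing: every frame draws its own time instead of sharing one.
    times = [rng.random() for _ in range(n)]
    # Assumption: with probability p_clean_context, a random-length history
    # prefix is left fully clean so the model learns to condition on it.
    if rng.random() < p_clean_context:
        for i in range(rng.randint(1, n - 1)):
            times[i] = 1.0
    noisy, target = [], []
    for frame, t in zip(clean_frames, times):
        noise = [rng.gauss(0.0, 1.0) for _ in frame]
        # Linear path from noise (t = 0) to data (t = 1).
        noisy.append([(1.0 - t) * e + t * x for e, x in zip(noise, frame)])
        # Flow-matching regression target: constant velocity along that path.
        target.append([x - e for e, x in zip(noise, frame)])
    return noisy, target, times
```

At sampling time the same conditioning pattern is what allows autoregressive use: previously generated frames are fed back as the clean prefix and only the new frames are denoised. In training, the loss on clean context frames would typically be masked out.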

Web video pretraining + tiny per-arm finetune = the data argument of this whole talk, made concrete.


Two stages, one factorization

$\pi_\theta(\tilde o,\,\tilde u,\,\tilde a \mid s_t,\,o_t,\,l) \;=\; \underbrace{\pi_\theta^{\text{4DVM}}(\tilde o,\,\tilde u \mid o_t,\,l)}_{\text{embodiment-agnostic, web-trainable}} \cdot \underbrace{\pi_\theta^{\text{DEC}}(\tilde a \mid \tilde u,\,s_t)}_{\text{embodiment-specific, small}}$
4DVM — Universal Video-Action Model. Foundation video diffusion lifted to jointly generate RGB frames $\tilde o$ and dynamic XYZ pointmaps $\tilde u$ from an image + instruction. Supervisable from any video.
DEC — Point-to-Action Decoder. A thin conditional DiT that denoises the action sequence $\tilde a$ in parallel from the predicted point trajectory and current robot state $s_t$. Small, swappable per arm.

The world model is trained once; the decoder is swapped per embodiment.
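The factorization can be read as an interface contract. The sketch below uses toy stand-ins for both stages; every name in it (`act`, `toy_world_model`, `make_delta_decoder`) is hypothetical, and the stubs only illustrate which component sees which input.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
PointTrajectory = List[List[Point]]  # indexed [frame][point]
Action = List[float]
WorldModel = Callable[[object, str], PointTrajectory]
Decoder = Callable[[PointTrajectory, Sequence[float]], List[Action]]


def act(world_model: WorldModel, decoders: Dict[str, Decoder], arm: str,
        observation: object, instruction: str,
        robot_state: Sequence[float]) -> List[Action]:
    # Stage 1 (shared): sees only the observation and the instruction.
    points = world_model(observation, instruction)
    # Stage 2 (per arm): sees only the predicted points and the robot state.
    return decoders[arm](points, robot_state)


def toy_world_model(observation: object, instruction: str) -> PointTrajectory:
    # Stand-in for the 4DVM: one point drifting along x over three frames.
    return [[(0.1 * k, 0.0, 0.0)] for k in range(3)]


def make_delta_decoder(gain: float) -> Decoder:
    # Stand-in for DEC: frame-to-frame displacement of the first point,
    # scaled by an arm-specific gain. It ignores robot_state for brevity.
    def decode(points: PointTrajectory, robot_state: Sequence[float]) -> List[Action]:
        return [
            [gain * (c - p) for p, c in zip(prev[0], cur[0])]
            for prev, cur in zip(points, points[1:])
        ]
    return decode


decoders = {"xarm7": make_delta_decoder(1.0), "yam": make_delta_decoder(2.0)}
```

Calling `act(toy_world_model, decoders, "yam", None, "stack cubes", [0.0])` reuses the same rollout as the `"xarm7"` call and changes only the decoder, which is the property the slide is pointing at.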


Stage 1: joint RGB + XYZ rollout

[Figure: PointAction Stage 1, joint RGB+XYZ DiT]
Width-wise modality fusion. RGB and XYZ pointmaps are independently encoded into the same VAE latent and concatenated along width — geometric tokens sit next to their visual counterparts inside DiT self-attention. (Inspired by 4DNeX.)
Frozen DiT + LoRA. Foundation video DiT stays frozen; a small LoRA adapter learns the joint RGB+XYZ rollout. Trained with flow matching under diffusion forcing on cross-arm robot videos (DROID + BridgeData V2).
Output: a 4D scene. Predicted RGB frames $\tilde o_{1:H}$ + pixel-aligned XYZ pointmaps $\tilde u_{1:H}$ — temporally and geometrically consistent. A separate video segmentation step extracts robot-centric 3D points for the decoder.
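Width-wise fusion is easy to state in terms of shapes. The sketch below uses nested lists as stand-ins for latents of shape [C][H][W]; the helper names are illustrative, and a real implementation would use tensor concatenation along the width axis.

```python
def concat_width(rgb_latent, xyz_latent):
    """Concatenate two [C][H][W] latents along W, giving [C][H][W_rgb + W_xyz].

    The channel count is unchanged, so the backbone needs no new input channels.
    """
    if len(rgb_latent) != len(xyz_latent) or any(
        len(r_ch) != len(x_ch) for r_ch, x_ch in zip(rgb_latent, xyz_latent)
    ):
        raise ValueError("latents must agree on channels and height")
    return [
        [list(r_row) + list(x_row) for r_row, x_row in zip(r_ch, x_ch)]
        for r_ch, x_ch in zip(rgb_latent, xyz_latent)
    ]


def split_width(fused, rgb_width):
    """Undo concat_width: recover the RGB and XYZ halves after denoising."""
    rgb = [[row[:rgb_width] for row in ch] for ch in fused]
    xyz = [[row[rgb_width:] for row in ch] for ch in fused]
    return rgb, xyz
```

Compared with stacking RGB and XYZ as extra channels, this doubles the token count instead of the channel count, so self-attention can relate each geometric token to its visual counterpart.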

Stage 2: point trajectories → low-level actions

[Figure: PointAction decoder]
FPS sampling. Robot-centric pointmap → farthest-point sampling per frame.
PointNet-style encoder. Sampled points → token-aligned point features.
DiT denoiser. Conditional DiT denoises the entire action chunk in parallel; AdaLN injects diffusion step + initial robot state.
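Farthest-point sampling, the first step above, is a standard greedy procedure. A minimal O(N·k) sketch follows; the fixed start index and the function name are choices made here for determinism, not details from the paper.

```python
def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices, each maximising distance to those chosen.

    points: sequence of coordinate tuples. Returns indices in selection order.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start]
    # min_d[i] is the squared distance from point i to its nearest selected point.
    min_d = [sqdist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = sqdist(p, points[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return selected
```

Running it per frame gives a fixed-size, well-spread subset of the robot-centric pointmap, which is what a PointNet-style encoder expects as input.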

Paired robot data is needed only to train this decoder, and only in small amounts; the world model is unchanged across embodiments.


Why geometry matters — the ablation story

RoboCasa365 sim, avg success (%), 100 rollouts/cell.

| Setting | ID | OOD-Env | OOD-Task |
|---|---|---|---|
| RGB-only (no XYZ) | 25.1 | 20.3 | 5.8 |
| Full-scene points (robot + everything) | 40.3 | 36.4 | 12.0 |
| Robot-centric XYZ (Ours) | 47.7 | 44.1 | 17.0 |
RGB-only baseline fails hard. With the same architecture, pixels alone drop 22.6 points in ID (47.7 → 25.1). Visual artifacts dominate when the policy has no geometry to anchor on.
Robot-centric masking is critical. Scene-wide point clouds inject ambiguity; restricting XYZ to the manipulator and contact region is what carries the signal.
Width-wise RGB+XYZ fusion. Concatenated along spatial width (4DNeX-style) so attention sees geometry paired with its visual context — no new channels.

3D is the load-bearing supervision — not a cosmetic addition.


Results: SOTA 4D + simulation

4D rollouts on held-out trajectories from BridgeData V2 and DROID. Interactive viewer: oriontmt.github.io/pointaction

RoboCasa365 sim · avg success (%) · 100 rollouts/cell

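| Setting | GR00T N1.7 | $\pi_{0.5}$ | VPP | Cosmos | Ours |
|---|---|---|---|---|---|
| ID (10 seen) | 44.5 | 39.8 | 34.5 | 45.2 | 47.7 |
| OOD-Env | 37.6 | 35.2 | 32.2 | 42.9 | 44.1 |
| OOD-Task | 8.6 | 6.9 | 7.4 | 14.0 | 17.0 |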

4D scene generation · 300 held-out DROID + Bridge V2

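| Method | PSNR ↑ | SSIM ↑ | FVD ↓ | Chamfer ↓ |
|---|---|---|---|---|
| TesserAct | 12.23 | 0.487 | 746 | 0.389 |
| 4DNeX | 13.86 | 0.542 | 818 | 0.370 |
| LVP | 19.61 | 0.816 | 330 | - |
| Wan 2.1 14B | 14.53 | 0.674 | 671 | - |
| Ours | 19.63 | 0.821 | 320 | 0.122 |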
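For reference, a minimal sketch of a symmetric Chamfer distance between two point sets. Conventions vary (squared vs. unsquared distances, sum vs. mean of the two directions); this version uses unsquared Euclidean distance and sums the two directional means, which may differ from the variant behind the numbers above.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    For each point, take the distance to its nearest neighbour in the other
    set; average per direction, then add the two directions. Brute force,
    O(len(a) * len(b)).
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)
```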

Cross-embodiment, on two unseen arms

YAM and xArm7 — neither seen during 4D-video pretraining. Only the small per-arm decoder is finetuned (20–50 expert trajectories per task).

[Videos: xArm7 (pineapple, stack cubes, stack cups); YAM (stack cubes, pick pens, insert cups)]

(a) xArm7 — 100 rollouts / task

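| Success (%) | Pick & place | Stack cubes | Stack cups | Avg. |
|---|---|---|---|---|
| GR00T N1.7 | 30.0 | 7.0 | 7.0 | 14.7 |
| $\pi_{0.5}$ | 42.0 | 12.0 | 14.0 | 22.7 |
| Ours | 67.0 | 28.0 | 34.0 | 43.0 |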

(b) YAM — 20 evals / task

| Success (%) | Stack cubes | Pick pens | Insert cups |
|---|---|---|---|
| GR00T N1.5 | 0 | 20 | 15 |
| $\pi_0$ | 0 | 10 | 15 |
| Ours | 20 | 60 | 50 |

Roughly 2× the best VLA on average — same world model, swapped decoder.


OmniGuide — 3D guidance fields for any VLA

[Figure: OmniGuide teaser]

Test-time scaffolding. Steer any diffusion-based VLA with task-specific 3D attractors and repellers — no extra robot data, no retraining.

OmniGuide — Song, Le, …, Gu, Eaton, Jayaraman, Daniilidis, arXiv 2026 · omniguide.github.io

Composing heterogeneous guidance without interference

Collision avoidance. Repellers from 3D geometry (VGGT point clouds) — energy $\mathcal{L}_C$.
Semantic grounding. Attractors from VLM (CLIP / vision-language) — energy $\mathcal{L}_S$.
Human imitation. Attractors from demo trajectories — energy $\mathcal{L}_H$.
$\mathcal{L}_y \;=\; \lambda_C\mathcal{L}_C \;+\; \lambda_S\mathcal{L}_S \;+\; \lambda_H\mathcal{L}_H$   →  additive on one energy surface
Empirical synergy. Combined collision + semantic guidance: 95% success, 56% safety — vs. 70% / 40% with semantic alone (Fig. 8). Energies compose; gradients don't fight.
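A toy illustration of what "additive on one energy surface" means mechanically. The quadratic attractor and hinge-style repeller are stand-ins chosen here for simplicity; the paper's actual energies come from VGGT point clouds, VLM outputs, and demonstrations, and all function names are hypothetical.

```python
import math


def attractor(target):
    """Quadratic pull toward target: E(x) = 0.5 * ||x - target||^2."""
    def energy(x):
        return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))

    def grad(x):
        return [xi - ti for xi, ti in zip(x, target)]

    return energy, grad


def repeller(obstacle, radius):
    """Penalty active only inside radius: E(x) = 0.5 * max(0, radius - d)^2."""
    def energy(x):
        return 0.5 * max(0.0, radius - math.dist(x, obstacle)) ** 2

    def grad(x):
        d = math.dist(x, obstacle)
        if d >= radius or d == 0.0:
            return [0.0] * len(x)
        scale = -(radius - d) / d  # points away from the obstacle under descent
        return [scale * (xi - oi) for xi, oi in zip(x, obstacle)]

    return energy, grad


def compose(terms):
    """terms: list of (weight, (energy, grad)). Returns the weighted sum."""
    def energy(x):
        return sum(w * e(x) for w, (e, _) in terms)

    def grad(x):
        total = [0.0] * len(x)
        for w, (_, g) in terms:
            total = [t + w * gi for t, gi in zip(total, g(x))]
        return total

    return energy, grad
```

Because the energies add, so do their gradients, and adding or dropping a guidance source is a one-line change to `terms`. This shows only the additivity; whether the sources cooperate in practice is the empirical claim reported above.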

The "last mile" of generalist VLAs

VLAs ($\pi_{0.5}$, GR00T N1.6, …) are jacks-of-all-trades. Broad behavior, weak on last-mile tasks: 3D collision avoidance, precise grounding, articulated objects.
The standard fix is more data. Post-train + fine-tune on more high-quality robot demos — expensive, slow, scarce, embodiment-bound.
Alternative: leave the VLA alone, and help it at test time with foundation-model knowledge it lacks.

No extra robotic data. No retraining. No new VLA expert.


Anything 3D-grounded becomes a field

3D foundation models.
geometry / collision avoidance → repellers around obstacles.
VLMs.
semantic targets ("the purple bowl") → attractors at task-relevant regions.
Hand / human pose.
one-shot demonstrations → attractors along the demonstrated trajectory.
$\mathcal{L}_y(\mathbf{X})$ — a differentiable energy in 3D space; gradient $\nabla_{\mathbf{A}^\tau}\mathcal{L}_y$ steers the action chunk.

Each guidance source is just an energy function over Cartesian poses — composable, additive, swappable.


How the gradient gets in

[Figure: OmniGuide method overview]
Velocity + score. Any diffusion / flow-matching VLA already produces a velocity $\mathbf{v}_\theta$. Bayes ties it to a posterior score: $\nabla_{\!\mathbf{A}^\tau}\log p(\mathbf{A}^\tau\mid\mathbf{y}) = \nabla_{\!\mathbf{A}^\tau}\log p(\mathbf{A}^\tau) + \nabla_{\!\mathbf{A}^\tau}\log p(\mathbf{y}\mid\mathbf{A}^\tau)$.
External gradient injects $\mathcal{L}_y$. Decode noisy actions to Cartesian poses $\mathbf{X}$, evaluate $\mathcal{L}_y(\mathbf{X})$, backprop through the robot kinematics — guidance vector lands back on the noisy action chunk.
Effect. The external field narrows the VLA's multimodal posterior toward task-effective + safe + physically-grounded actions.
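A toy version of the gradient path described above, assuming the likelihood is read as $p(\mathbf{y}\mid\mathbf{A}) \propto \exp(-\mathcal{L}_y)$ so that the guidance term is $-\nabla\mathcal{L}_y$. The action is a pair of joint angles of a planar 2-link arm, the energy is a quadratic attractor on the end-effector, and the chain rule through the kinematics is written out by hand. `guidance_scale`, the step convention, and treating the current action as directly decodable are all simplifying assumptions, not the paper's settings.

```python
import math


def forward_kinematics(q, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar 2-link arm with joint angles q."""
    q1, q2 = q
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))


def energy(x, target):
    """Attractor energy in Cartesian space: 0.5 * ||x - target||^2."""
    return 0.5 * ((x[0] - target[0]) ** 2 + (x[1] - target[1]) ** 2)


def energy_grad_wrt_joints(q, target, l1=1.0, l2=1.0):
    """Chain rule: grad_q L = J(q)^T grad_X L, with J the FK Jacobian."""
    q1, q2 = q
    x = forward_kinematics(q, l1, l2)
    gx = (x[0] - target[0], x[1] - target[1])
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    jac = ((-l1 * s1 - l2 * s12, -l2 * s12),   # d x / d (q1, q2)
           (l1 * c1 + l2 * c12, l2 * c12))     # d y / d (q1, q2)
    return (jac[0][0] * gx[0] + jac[1][0] * gx[1],
            jac[0][1] * gx[0] + jac[1][1] * gx[1])


def guided_step(q, base_velocity, target, dt=0.1, guidance_scale=1.0):
    """One Euler step of the policy's velocity plus the energy-descent term."""
    g = energy_grad_wrt_joints(q, target)
    v = [bv - guidance_scale * gi for bv, gi in zip(base_velocity, g)]
    return [qi + dt * vi for qi, vi in zip(q, v)]
```

In a real system `base_velocity` would be the VLA's $\mathbf{v}_\theta$ and the Jacobian would come from automatic differentiation through the robot's kinematic model rather than a hand-derived formula.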
OmniGuide, arXiv 2026 — Method overview, Fig 2

Results: large gains, no retraining

Three guidance sources composed at inference (3D foundation models, VLMs, hand-tracking) — applied at the noise prior and every intermediate denoising step.

| Metric | Base VLA | + OmniGuide |
|---|---|---|
| Success rate | 24.2% | 92.4% |
| Collision-avoidance | 7.0% | 93.5% |

Base VLAs: $\pi_{0.5}$, GR00T N1.6; both improve. These are headline numbers; see the PDF for the full breakdown across guidance sources.

Matches or beats prior methods designed for a single guidance source — with a single framework.
No significant latency hit. Real-time gradient evaluation through the robot kinematics + foundation model.
No retraining of the VLA. Steering happens entirely at inference; SafeFlow / Inference-Time Policy Steering become special cases.

PhysCtrl — 3D as physical prior

[Figure: force-controlled video, four materials]
Physics-grounded I2V. A 3D point trajectory becomes the physics simulator's natural state — the model learns material dynamics for elastic, sand, plasticine, and rigid bodies.
Force-controllable. Apply forces to selected points; the rollout obeys the resulting deformation. 550K synthetic animations underwrite training.
Same primitive, new role. The 3D point trajectory now expresses physical causation, not action.
PhysCtrl — Wang et al., NeurIPS 2025 · cwchenwang.github.io/physctrl

How PhysCtrl works — point cloud → physics → video

[Figure: PhysCtrl architecture]
Point cloud lifting. Single image → segment → multi-view novel-view generation → 2048-point cloud reconstruction.
Physics-grounded trajectory model. Diffusion model conditioned on material, force, drag points; trained on 550K MPM + rigid-body simulations across 4 material families.
Track maps drive video. Projected 2D trajectories condition a pretrained video generator: physics guides pixels, not the other way around.
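The last step, turning 3D point trajectories into 2D tracks, amounts to a per-frame camera projection. A minimal sketch under a pinhole model with points already in the camera frame; the function names and the `None` convention for points behind the camera are assumptions, and how PhysCtrl rasterises tracks into conditioning maps is not shown.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame (x, y, z) point to pixel (u, v).

    Returns None for points at or behind the camera plane.
    """
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_trajectories(trajectory, fx, fy, cx, cy):
    """Convert [frame][point] 3D positions into per-point 2D tracks.

    Returns tracks indexed [point][frame], each entry (u, v) or None.
    """
    n_points = len(trajectory[0])
    return [
        [project_point(frame[i], fx, fy, cx, cy) for frame in trajectory]
        for i in range(n_points)
    ]
```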

Four materials — elastic, sand, plasticine, rigid

[Figure: per-material qualitative rollouts]
Numbers. Trajectory generation: vIoU 77.6% · Chamfer 0.0028 (Motion2VecSets 24.9%, MDM 53.8%). Spatiotemporal attention + physics loss is the difference.
Ablation. Removing the physics loss collapses vIoU to 33.8%. The physics supervision is load-bearing — not a regulariser, the contribution.

One primitive, three roles

PointAction — action
Point trajectories drive a policy. 3D shapes control.
OmniGuide — guidance
3D fields steer any VLA at test-time. 3D shapes conditioning.
PhysCtrl — physical prior
Point trajectories obey forces. 3D shapes causation.
The 3D point trajectory becomes the interlingua between perception, prediction, and action.

All three roles ride on the same primitive — and all are supervisable, in different ways, from web video plus a small amount of physics.

Back to the question

Should embodied intelligence care about 3D?

Maybe — at least as an interface, for now.

— a representation that turns unpaired video into supervision the policy can actually use.

What 3D-as-interface buys us

Web-scale data. Geometry + motion are recoverable from any video. No paired action labels needed for the world model. (WVD, PointAction)
Embodiment transfer. One world model + a small per-arm decoder generalises to unseen arms. (PointAction)
Physics consistency. Point trajectories carry contact, deformation, force. (PhysCtrl)
Test-time controllability. Differentiable 3D fields steer any VLA without retraining. (OmniGuide)

All four wins ride on the same 3D point primitive.

Counterpoints worth holding

Many companies are now collecting action data at scale. The data wall is moving, not standing still — end-to-end policies could catch up faster than we think.
3D doesn't cover everything an action needs. Force, fine dexterity, taste, social cues — much of what bodies do is hard to reduce to point trajectories.
So — is 3D just a distraction?
I don't think so. A useful bridge that pays its rent now, even if the river changes course later.

Strong enough for this decade. Soft enough to dissolve as the data fills in.

Future directions

Two threads — keep extending the interface, and look beyond it.

Extending the interface

Scale 4D world models. Internet-video pretraining with depth + point supervision; few-step samplers for closed-loop control.
More embodiments. Humanoids, dexterous hands, soft robots: same world model, swap the per-embodiment decoder.
Language-grounded fields. OmniGuide-style guidance for goals specified in natural language, not just geometry.

Beyond the interface

3D inside the world model. A structural prior the world model carries and reasons over — not just an output the policy reads.
Interactive world modeling & fast memory. Queryable 3D state for simulating, intervening, conditioning — read/write keyed by location, not token order.
Long-horizon physical reasoning. Multi-step physical inference grounded in 3D — closer to how people plan in space.

Thank you

Should embodied intelligence care about 3D? — maybe — at least as an interface, for now.

WVD (CVPR'25) · PointAction (arXiv 2606.03943) · OmniGuide (arXiv'26) · PhysCtrl (NeurIPS'25)

https://jiataogu.me
Jiatao Gu · GMLR · Penn