Are Normalizing Flows
Good Candidates for
Interactive World Models?

Jiatao Gu

Talk @ EDGE Workshop · Jun 3, 2026

Acting in the world needs a model of it

Choice 1 — react. Look at the image, directly predict which block to pull.

Choice 2 — simulate. Model the tower's state → predict the outcome of each pull → pick the move that keeps it standing.

Interactive decisions require rolling out the consequences of actions — i.e. a world model you can query.

World model & the interactive loop

A world model is an agent's internal model of the world, used to predict, plan, and act.

$s_t \;\xrightarrow{\,a_t\,}\; p_\theta(s_{t+1}\mid s_t,\,a_t) \;\rightarrow\; s_{t+1} \;\xrightarrow{\,a_{t+1}\,}\; p_\theta(s_{t+2}\mid s_{t+1},\,a_{t+1}) \;\rightarrow\; \cdots$

Interactive = the loop runs online: you act, it responds, you act again — fast, controllable, and stable over long horizons.
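
A minimal sketch of the loop above, assuming a toy linear-Gaussian transition in place of a learned $p_\theta(s_{t+1}\mid s_t, a_t)$. The names `toy_world_model`, `rollout`, and `toward_origin` are hypothetical and only illustrate the act / respond / act-again structure.

```python
import numpy as np

def toy_world_model(state, action, rng):
    """Hypothetical stand-in for p_theta(s_{t+1} | s_t, a_t): linear drift plus small Gaussian noise."""
    mean = 0.9 * state + action
    return mean + 0.01 * rng.standard_normal(state.shape)

def rollout(s0, policy, steps, seed=0):
    """Run the interactive loop: pick an action, sample the next state, repeat."""
    rng = np.random.default_rng(seed)
    states = [s0]
    for _ in range(steps):
        action = policy(states[-1])                       # the agent acts
        states.append(toy_world_model(states[-1], action, rng))  # the model responds
    return np.stack(states)

def toward_origin(state):
    """Hypothetical policy: push the state toward zero."""
    return -0.5 * state

trajectory = rollout(np.array([1.0, -2.0]), toward_origin, steps=20)
print(trajectory[-1])
```

Each step feeds the sampled state back in as the next condition, which is why drift over long horizons matters.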

What an interactive world model needs

Causal. Streamable autoregressive roll-out: generate the future frame by frame, conditioned on the past.
Controllable. Responds to conditioning (text, images, actions, edits) without retraining a new model each time.
Robust. Stable over long horizons: no drift or error accumulation far beyond the training window.
Real-time. Few-step, low-latency generation so the loop can run interactively.
Uncertainty. Native density / likelihood, to score, plan, and detect the implausible.

We will revisit this scorecard at the end.

Today's video generators are mostly diffusion models

Powerful — but a few properties remain hard for the interactive setting (each is an active research area):

Error accumulation
Autoregressive roll-out tends to drift — blur, color shift, identity loss can creep in over long horizons.

No native likelihood
Most variants give no tractable density to score futures or quantify uncertainty.

Many steps
Iterative denoising per frame is costly — though distillation is closing this gap fast.

From diffusion to Normalizing Flows

$x \;\;\underset{\textstyle f^{-1}}{\overset{\textstyle f}{\rightleftarrows}}\;\; z \sim \mathcal{N}(0, I)$

A Normalizing Flow is a single invertible network $f$ mapping data $x$ to simple noise $z$.

Run $f^{-1}$ → generate. Draw $z$, push it back to a sample $x$.

Run $f$ → score. Map $x$ to $z$ and read off its exact likelihood — same model, one objective.

Could Normalizing Flows — exact-likelihood, invertible, end-to-end — be a better backbone for interactive world models?

Normalizing Flows in one slide

$p(x) = p_0\!\big(f(x)\big)\,\Big|\det \tfrac{\partial f(x)}{\partial x}\Big|,\qquad z=f(x)\ \text{invertible}$

Exact likelihood
trained by exact MLE — a single objective.

Invertible
$x \leftrightarrow z$ is lossless — no information discarded.

End-to-end
no noise schedule, no discretization.
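
A minimal worked instance of the change-of-variables formula, assuming the simplest possible flow: an elementwise affine map $z = (x - \text{mean})/\text{std}$. The function names are hypothetical; real flows stack many learned invertible layers, but the bookkeeping is the same.

```python
import numpy as np

def affine_flow(x, mean, std):
    """f: data -> noise, plus log|det df/dx| for the elementwise scaling."""
    z = (x - mean) / std
    log_det = -np.sum(np.log(std))
    return z, log_det

def standard_normal_logpdf(z):
    """log p_0(z) for z ~ N(0, I)."""
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)

def log_likelihood(x, mean, std):
    """log p(x) = log p_0(f(x)) + log|det df/dx|."""
    z, log_det = affine_flow(x, mean, std)
    return standard_normal_logpdf(z) + log_det

def sample(mean, std, rng):
    """f^{-1}: draw z, push it back to data space."""
    z = rng.standard_normal(mean.shape)
    return mean + std * z

mean, std = np.array([1.0, -2.0]), np.array([0.5, 3.0])
x = sample(mean, std, np.random.default_rng(0))
print(log_likelihood(x, mean, std))
```

The same parameters serve both directions: `sample` generates and `log_likelihood` scores.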

TARFlow — NFs are capable generative models

Transformer AR Flow. Stacked autoregressive Transformer blocks over patches, alternating scan direction.

Simple recipe — noise augmentation + post-hoc denoising + guidance → diffusion-level samples and SOTA likelihoods, from a stand-alone NF.
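
A toy sketch of an autoregressive affine flow with alternating scan direction. It assumes a strictly lower-triangular linear map `W` as a stand-in for the causal Transformer and scalar tokens in place of patches; all names are hypothetical and this is not the TARFlow implementation.

```python
import numpy as np

def ar_affine_forward(x, W):
    """Data -> latent in one parallel pass. Token i is shifted/scaled using only tokens < i."""
    h = W @ x                      # W is strictly lower-triangular, so h[i] depends on x[:i]
    log_sigma = 0.5 * np.tanh(h)   # bounded log-scale keeps the map well-conditioned
    z = (x - h) * np.exp(-log_sigma)
    return z, -np.sum(log_sigma)   # latent and log|det dz/dx|

def ar_affine_inverse(z, W):
    """Latent -> data, one token at a time (the sequential cost of AR sampling)."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        h_i = W[i] @ x             # only x[:i] contributes; later entries are still zero
        x[i] = z[i] * np.exp(0.5 * np.tanh(h_i)) + h_i
    return x

def stack_forward(x, weights):
    """Apply the blocks in order, reversing the token order after each one."""
    total_log_det = 0.0
    for W in weights:
        x, log_det = ar_affine_forward(x, W)
        total_log_det += log_det
        x = x[::-1]                # alternate scan direction (a permutation, |det| = 1)
    return x, total_log_det

def stack_inverse(z, weights):
    """Undo the blocks in reverse order."""
    for W in reversed(weights):
        z = ar_affine_inverse(z[::-1], W)
    return z

rng = np.random.default_rng(0)
n = 8
weights = [0.3 * np.tril(rng.standard_normal((n, n)), k=-1) for _ in range(2)]
x = rng.standard_normal(n)
z, log_det = stack_forward(x, weights)
x_rec = stack_inverse(z, weights)
```

The forward (training) direction is parallel over tokens; only the inverse (sampling) direction is sequential.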

Normalizing Flows are Capable Generative Models · S. Zhai, R. Zhang, P. Nakkiran, D. Berthelot, J. Gu, H. Zheng, T. Chen, M. Bautista, N. Jaitly, J. Susskind · ICML 2025 (Oral)

STARFlow — scaling latent NFs to high resolution

Deep–shallow design. One deep Transformer block carries most capacity + a few efficient shallow blocks.

Latent space. Model in a pretrained autoencoder's latent, not pixels — far more effective at high resolution.

+ new guidance, staying an end-to-end MLE flow. First NF at this scale/resolution, approaching diffusion quality.

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis · J. Gu, T. Chen, D. Berthelot, H. Zheng, Y. Wang, R. Zhang, L. Dinh, M. Bautista, J. Susskind, S. Zhai · NeurIPS 2025 (Spotlight)

STARFlow-V — the first NF video generator

Built on STARFlow, operating in a spatiotemporal latent space. It checks four of the five boxes on the interactive-WM scorecard (real-time is the exception):

Causal. Strict left-to-right autoregressive roll-out: streamable, frame by frame.
Controllable. T2V / I2V / V2V from one model, no fine-tuning (invertibility).
Robust. Stable to ~30s, about 6× its 5s training window.
Uncertainty. Native exact log-likelihood over video.

STARFlow-V — text-to-video samples

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows · J. Gu, Y. Shen, T. Chen, L. Dinh, Y. Wang, M. Bautista, D. Berthelot, J. Susskind, S. Zhai · CVPR 2026 (Highlight)

Architecture: global–local

Deep AR block $f_D$ — causal Transformer carrying global long-range temporal context (left-to-right across frames). Causality lives here.

Shallow blocks $f_S$ — restricted within each frame, alternating masks → rich local interactions. $f_S^{-1}$ decodes frames independently.

Why it may resist error accumulation

$p_\theta(x)=\prod_{n=1}^{N} p_D(u_n \mid u_{<n})\,\big|\det J_{f_S}(x_n)\big|,\qquad u_n=f_S(x_n)$

Conditions on latents, not pixels
at sampling, $f_D^{-1}$ conditions on previously generated latents → data-space errors don't propagate.

Invertible = lossless
unlike diffusion's noise-conditioning, no information is traded away for robustness.

Unimodal next-token
the per-step latent $u$ is easy to regress and error-tolerant.
$p_D(u_{n,i}\mid u_{<n},\,u_{n,<i})=\mathcal{N}\!\left(\mu_\theta,\,\sigma_\theta^2\right)$

One more trick

Efficiency: video-aware Jacobi iteration

Naive AR inversion decodes tokens one-by-one. Recast each flow block as a fixed-point system and sweep all tokens in parallel — converging in $k \ll N$ iterations, without breaking causality.

Sequential AR: x₁ → x₂ → x₃ → x₄. N steps, one token at a time.

Jacobi (ours): x₁, x₂, x₃, x₄ updated together, repeated ×k. All tokens updated in parallel each sweep.

Block-wise
parallel within a block, sequential across; completed blocks cached as KV.

Video-aware init
warm-start each frame from the previous converged frame.

~15× lower latency
vs standard AR decoding, preserving fidelity.

Jacobi iteration already achieves ~15× speedup over sequential AR decoding — but that still falls short of real-time interactive latency. Closing this gap is precisely the motivation for the next two works.
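
A toy sketch of the fixed-point view, assuming a single autoregressive affine block whose conditioner is a strictly lower-triangular linear map `W` (a hypothetical stand-in for the causal Transformer). Block-wise scheduling and KV caching are not modeled, and the function names are illustrative only.

```python
import numpy as np

def shift_and_scale(x, W):
    """Causal conditioner: token i sees only tokens < i because W is strictly lower-triangular."""
    h = W @ x
    return h, np.exp(0.5 * np.tanh(h))

def inverse_sequential(z, W):
    """Baseline: decode tokens one by one (N dependent steps)."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        h_i = W[i] @ x
        x[i] = z[i] * np.exp(0.5 * np.tanh(h_i)) + h_i
    return x

def inverse_jacobi(z, W, x_init=None, tol=1e-12):
    """Solve x = z * sigma(x) + mu(x) by sweeping all tokens in parallel.

    After sweep k the first k tokens are exact, so at most N sweeps are needed.
    """
    n = len(z)
    x = np.zeros_like(z) if x_init is None else x_init.copy()
    for k in range(1, n + 1):
        mu, sigma = shift_and_scale(x, W)
        x_new = z * sigma + mu
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, k
        x = x_new
    return x, n

rng = np.random.default_rng(0)
n = 64
W = 0.05 * np.tril(rng.standard_normal((n, n)), k=-1)
z = rng.standard_normal(n)

x_seq = inverse_sequential(z, W)
x_jac, sweeps_cold = inverse_jacobi(z, W)

# Warm start for a nearby latent, mimicking initialization from the previous frame.
z_next = z + 0.01 * rng.standard_normal(n)
x_next, sweeps_warm = inverse_jacobi(z_next, W, x_init=x_seq)
print(sweeps_cold, sweeps_warm)
```

Causality is preserved because each sweep still uses the same causal conditioner; only the update schedule changes.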

One model, three tasks — no fine-tuning

T2V

text → video, generated from scratch.

I2V

flow-encode the first frame into the KV cache; roll out the rest. No separate encoder.

V2V

flow-encode the source clip, roll out edits. Invertibility reuses the decoder as encoder.

Quantitative: closing the gap to diffusion

VBench (T2V)	Total	Causal?
Wan2.1 (diffusion)	83.69	no
CogVideoX	80.91	no
STARFlow-V	78.67	yes
STARFlow-V (+ GPT-rewriter)	79.70	yes

Causality is almost free. Enforcing causal roll-out costs only ~0.5 pt vs the non-causal variant.

First NF-based T2V model to reach this level — substantially narrowing the historical NF↔diffusion gap.

Checkpoint — how far does STARFlow-V get us?

✓ Causal. Streamable autoregressive roll-out.
✓ Controllable. T2V / I2V / V2V from one model.
✓ Robust. Stable far past the training horizon.
✓ Uncertainty. Native exact likelihood.
○ Real-time. Jacobi iteration gives ~15× speedup, but still not enough for interactive latency.

The missing piece is few-step, real-time generation.
Next: two complementary approaches — both keep NFs in the loop.

NFM — train as NF, test as flow matching

Like diffusion distillation — but using an AR-NF as the coupling oracle. A clean train / test separation:

Train phase — as an NF. Autoregressive normalizing flow: exact MLE, fully parallel over patches, tractable likelihood.

Test phase — as flow matching. Use the trained NF's deterministic noise↔data bijection as the coupling; train a flow-matching student that runs in far fewer function evaluations.

The Coupling Within: Normalized Flow Matching · D. Berthelot, T. Chen, J. Gu, M. Cuturi, L. Dinh, B. Chandna, M. Klein, J. Susskind, S. Zhai · arXiv 2026

NFM: inherit the coupling from an NF

Don't compute a coupling — read one off. A pretrained autoregressive NF already encodes a deterministic, per-sample, class-aware noise↔data bijection. Use its latent $z_{\epsilon'}=f_{\text{NF}}(x)/\sigma_f$ to replace the independent Gaussian noise; everything else in flow matching is unchanged.
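
A minimal sketch of how the training pairs change, under several assumptions: the NF is replaced by a toy elementwise affine map (`toy_nf_forward`, hypothetical), $\sigma_f$ is taken to be the empirical standard deviation of the NF latents, and flow matching uses the linear path with noise at $t=0$ and data at $t=1$. The spread of the velocity target over the batch is reported only as a rough proxy for $\mathrm{Var}(v_t\mid x_t,t)$.

```python
import numpy as np

def toy_nf_forward(x, mean, std):
    """Hypothetical stand-in for f_NF: a deterministic data -> latent bijection."""
    return (x - mean) / std

def fm_training_pair(x, z, t):
    """Linear-path flow matching: interpolant x_t and its velocity target."""
    x_t = (1.0 - t) * z + t * x
    v = x - z
    return x_t, v

rng = np.random.default_rng(0)
mean, std = np.array([1.0, -1.0]), np.array([2.0, 0.5])
x = mean + std * rng.standard_normal((512, 2))     # toy "data"

# Standard flow matching: noise drawn independently of the data.
z_indep = rng.standard_normal(x.shape)

# NFM-style: noise read off the NF, then rescaled by sigma_f (assumed: latent std).
latent = toy_nf_forward(x, mean, std)
sigma_f = latent.std()
z_nf = latent / sigma_f

t = 0.5
_, v_indep = fm_training_pair(x, z_indep, t)
x_t, v_nf = fm_training_pair(x, z_nf, t)

spread_indep = v_indep.var(axis=0).sum()
spread_nf = v_nf.var(axis=0).sum()
print(spread_indep, spread_nf)
```

Only the source of `z` differs between the two variants; the interpolant and the regression target are built the same way.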

Why it works — and beats OT

Lower target variance. A deterministic per-sample coupling reduces $\mathrm{Var}(v_t\mid x_t,t)$ → straighter trajectories → better few-step FID.

Straightest paths. Trajectory curvature lowest of all couplings (e.g. Euler κ: FM 0.086 / OT 0.077 / NFM 0.044).

ImageNet64 · FID	NFE 31	NFE 15	NFE 7
FM (independent)	2.66	4.94	13.21
SD-FM (OT)	2.66	3.12	6.28
NFM	1.80	2.18	3.27
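
A small illustration of why straightness matters at low NFE, assuming two hand-written velocity fields rather than trained models: one whose trajectories are straight lines (induced by a hypothetical deterministic coupling $x_1 = A z + B$) and one that rotates the state. The Euler sampler and both fields are illustrative only.

```python
import numpy as np

def euler_sample(velocity, x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `nfe` Euler steps."""
    x, h = np.asarray(x0, dtype=float).copy(), 1.0 / nfe
    for k in range(nfe):
        x = x + h * velocity(x, k * h)
    return x

A, B = 2.0, 1.0   # hypothetical coupling x1 = A*z + B

def straight_velocity(x, t):
    """Field whose trajectories are the straight lines x_t = (1-t) z + t (A z + B)."""
    z = (x - t * B) / (1.0 - t + t * A)
    return (A - 1.0) * z + B

OMEGA = np.pi / 2

def curved_velocity(x, t):
    """Rotation field: trajectories are circular arcs (a quarter turn over t in [0, 1])."""
    return OMEGA * np.array([-x[1], x[0]])

z0 = np.array([0.7, -1.3])
straight_error = np.max(np.abs(euler_sample(straight_velocity, z0, nfe=1) - (A * z0 + B)))

start, exact_end = np.array([1.0, 0.0]), np.array([0.0, 1.0])
curved_errors = {nfe: np.linalg.norm(euler_sample(curved_velocity, start, nfe) - exact_end)
                 for nfe in (7, 15, 31)}
print(straight_error, curved_errors)
```

On the straight field a single Euler step already lands on the endpoint, while on the curved field the error shrinks only as NFE grows.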

NFM: faster than the NF, and better

Beats even the teacher. The NF does one direct inverse pass; the student inherits multi-step ODE refinement. FID on ImageNet64: NFM 1.80 vs. NF teacher 1.98 (lower is better).

Massively faster. NF inverse 10.8 s/sample → NFM 0.34 / 0.16 / 0.07 s → 32× / 68× / 145× speedup.

An NF can serve as a one-time "coupling oracle" that buys parallel, few-step, near-real-time generation from an otherwise sequential model.

Cost 1: test-time likelihood gone. The flow-matching student has no tractable density — you gave that up for speed.

Cost 2: two-stage. Train NF → distill → two separate models, two training runs. Is there a cleaner way?

Can we do better? Enter NTM

NTM — text-to-image samples at 4 denoising steps (Fig 1)

Normalizing Trajectory Models (NTM) keep exact likelihood throughout — end-to-end, single training, and still sharp in as few as 4 steps.

The problem: Gaussian steps fail when coarse

Diffusion assumes each reverse step is a single small Gaussian. Compress to a few coarse steps and the true reverse $p(x_s\mid x_t)$ becomes a multimodal mixture — the Gaussian assumption breaks, so 4-step flow matching stays blurry.

Each step = an expressive conditional NF. The Gaussian assumption is replaced with a full NF transporter — model the true multimodal reverse exactly.
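
A toy calculation of the multimodality claim, assuming two-point data $x_0 \in \{-1, +1\}$ with equal prior mass and a single noising step $x_t = x_0 + \sigma\,\epsilon$. Bayes' rule gives $P(x_0 = +1 \mid x_t) = 1/(1 + e^{-2x_t/\sigma^2})$; the function name is hypothetical.

```python
import numpy as np

def posterior_plus(x_t, sigma):
    """P(x_0 = +1 | x_t) for x_0 uniform on {-1, +1} and x_t = x_0 + sigma * eps."""
    return 1.0 / (1.0 + np.exp(-2.0 * x_t / sigma ** 2))

x_t = 0.3
fine = posterior_plus(x_t, sigma=0.1)    # small step: posterior concentrates on one mode
coarse = posterior_plus(x_t, sigma=2.0)  # coarse step: both modes keep substantial mass
print(fine, coarse)
```

With the small step the reverse conditional is effectively a single point; with the coarse step it is a near-even two-mode mixture that no single Gaussian can represent.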

Normalizing Trajectory Models · J. Gu, T. Chen, Y. Shen, D. Berthelot, S. Zhai, J. Susskind · arXiv 2026

Where NTM sits: NF ↔ diffusion

TARFlow
all depth in a single invertible pass.

NTM
a few expressive invertible steps — the interpolation.

Diffusion
many small Gaussian steps.

NTM: a normalizing flow at each denoising step

Transporter $f_\mathcal{T}$: shallow TARFlow-style NVP coupling blocks, applied within a single denoising step. Maps both $x_s$ and $x_t$ into a $u$-space where the conditional is Gaussian-simple.

Predictor $f_\mathcal{P}$: deep full-attention Transformer, operates across the whole trajectory in parallel. Predicts the Gaussian mean and scale in $u$-space.

$\mathcal{L} = -\log p_\mathcal{P}(u_s \mid u_t) - \log\!\big|\det J_{f_\mathcal{T}}\big|$

Because $f_\mathcal{T}$ is invertible (not just a compressive encoder), this is the exact NLL of $p(x_s \mid x_t)$ — not a surrogate. Trained from scratch or initialized from any pretrained flow-matching model (set $f_\mathcal{T} = \mathrm{id}$).
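
A toy check that the two-term loss is an exact NLL, assuming an elementwise affine transporter and a linear predictor (both hypothetical stand-ins for the learned networks). For this choice $p(x_s\mid x_t)$ has a closed form, so the loss can be compared against it directly.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Sum of independent Gaussian log-densities."""
    return np.sum(-0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi))

def transporter(x, a, b):
    """Toy f_T: u = a * x + b, returned with log|det J_{f_T}|."""
    return a * x + b, np.sum(np.log(np.abs(a)))

def predictor(u_t, w, sigma):
    """Toy f_P: Gaussian mean and scale for u_s given u_t."""
    return w * u_t, sigma * np.ones_like(u_t)

def ntm_step_nll(x_s, x_t, a, b, w, sigma):
    """L = -log p_P(u_s | u_t) - log|det J_{f_T}|, with the Jacobian taken at x_s."""
    u_s, log_det = transporter(x_s, a, b)
    u_t, _ = transporter(x_t, a, b)
    mu, scale = predictor(u_t, w, sigma)
    return -gaussian_logpdf(u_s, mu, scale) - log_det

rng = np.random.default_rng(0)
x_s, x_t = rng.standard_normal(4), rng.standard_normal(4)
a, b = np.array([2.0, -0.5, 1.5, 3.0]), np.array([0.1, 0.2, -0.3, 0.0])
w, sigma = 0.8, 0.7

loss = ntm_step_nll(x_s, x_t, a, b, w, sigma)

# Closed form for this toy: x_s | x_t is Gaussian with mean (mu - b)/a and std sigma/|a|.
mu, scale = predictor(a * x_t + b, w, sigma)
reference = -gaussian_logpdf(x_s, (mu - b) / a, scale / np.abs(a))
print(loss, reference)
```

Dropping the log-determinant term would leave a density over $u_s$ rather than over $x_s$, which is the surrogate the slide contrasts against.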

Fast inference: trajectory denoising + learned denoiser

Step 1 — trajectory score denoising. The exact NTM likelihood gives a joint score over all timesteps. Its gradient (times the trajectory covariance) denoises the whole generated trajectory at once — no extra data needed.

Step 2 — post-train a lightweight denoiser $g_\phi$. Takes the cleanest predictor output $u_{t_0}$ → directly outputs $\hat x_0$. Trained with MSE against the score-denoised targets from the frozen NTM. Single forward pass at inference — no AR decoding, no backprop.

~9× speedup (0.20 → 1.88 img/s), LPIPS 0.121 vs score denoising. The base NTM's exact likelihood is what makes this distillation data-free and self-supervised.
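
The idea behind Step 1 can be illustrated with the standard single-variable Tweedie formula, $\hat x = y + \sigma^2\,\nabla_y \log p(y)$. This is an assumed analogue, not the joint trajectory version from the slide: the data is a 1-D Gaussian $\mathcal{N}(m, s^2)$, the density is known in closed form, and the score is taken by finite differences of the log-likelihood.

```python
import numpy as np

def noisy_log_density(y, m, s, sigma):
    """log p(y) for y = x + sigma * eps with x ~ N(m, s^2)."""
    var = s ** 2 + sigma ** 2
    return -0.5 * (y - m) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def score(y, m, s, sigma, h=1e-4):
    """Gradient of the log-likelihood w.r.t. y, by central differences."""
    return (noisy_log_density(y + h, m, s, sigma) - noisy_log_density(y - h, m, s, sigma)) / (2 * h)

def tweedie_denoise(y, m, s, sigma):
    """Denoised estimate: noisy sample plus noise variance times the score."""
    return y + sigma ** 2 * score(y, m, s, sigma)

m, s, sigma = 1.0, 2.0, 0.5
y = 3.2
denoised = tweedie_denoise(y, m, s, sigma)
posterior_mean = (s ** 2 * y + sigma ** 2 * m) / (s ** 2 + sigma ** 2)
print(denoised, posterior_mean)
```

No clean targets are used anywhere: the likelihood alone supplies the denoising direction, which is the sense in which such a procedure is data-free.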

NTM matches 50-step diffusion at 4 steps

Type	Model	GenEval↑	DPG↑
DM	SD3-Medium	0.62	84.08
DM	FLUX.1-dev	0.66	83.84
DM	Janus-Pro-7B	0.80	84.19
NF	STARFlow	0.56	—
NF	NTM (scratch, 256²)	0.82	79.64
NF	NTM (finetune FLUX, 512²)	0.76	83.38

On GenEval, from-scratch NTM at 4 steps (0.82) outperforms FLUX.1-dev (0.66) and edges past Janus-Pro-7B (0.80), while retaining exact likelihood; it still trails on DPG (79.64). Finetuned NTM (from FLUX.2-klein 4B) narrows the DPG gap to 83.38 at 512².

Back to the question

✓ Causal. STARFlow-V: strict streamable autoregressive roll-out.
✓ Controllable. T2V / I2V / V2V from one invertible model, no fine-tuning.
✓ Robust. Stable to ~30s, about 6× the 5s training horizon.
✓ Real-time. NTM (4 steps) and NFM (up to 145×) attack latency; demonstrated on images so far, with video as the next step.
✓ Uncertainty. Exact density throughout, uniquely retained.

So far, the answer looks like Maybe.

Scaling unverified. NF world models haven't been tested at the scale where diffusion shines — compute, data, and model size all need more exploration.

Absolute gap remains. As a pure video generator, NFs still trail state-of-the-art diffusion on perceptual quality benchmarks.

But — is video gen the right bar? Diffusion is a great generator; it's less clear it's the right backbone for a world model where likelihood, controllability, and causal roll-out matter most.

From images to the video world model

STARFlow-V
causal · controllable · robust — not yet real-time.

NTM
few steps, exact likelihood retained.

NFM
NF coupling → fast parallel flow student.

NTM & NFM are demonstrated on images today — the few-step + coupling machinery transfers directly to the video world model. That is the path to a real-time interactive NF world model.

Open questions & future directions

Real-time interactive WM. NTM/NFM few-step machinery proven on images — port to STARFlow-V's video AR loop.

Action-conditioning. Video prediction → true world model: condition roll-out on actions/controls for planning & embodied agents.

Uncertainty & calibration. NFs give tractable likelihoods — use them to detect OOD states, calibrate planning costs, and score roll-out quality. Something diffusion cannot easily offer.

Closed-loop interaction. Perceive → predict → act in the loop; tight feedback between the world model and a policy.
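
A minimal sketch of likelihood-based OOD detection, assuming a toy affine flow fit by maximum likelihood (per-dimension mean and standard deviation) and a threshold at the 1st percentile of training log-likelihoods. The names and the threshold choice are hypothetical; a real world model would score states or roll-outs with its own exact density.

```python
import numpy as np

def fit_affine_flow(data):
    """MLE for an elementwise affine flow onto N(0, I): per-dimension mean and std."""
    return data.mean(axis=0), data.std(axis=0)

def log_likelihood(x, mean, std):
    """Exact log p(x) under the fitted flow (change of variables)."""
    z = (x - mean) / std
    return np.sum(-0.5 * z ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi), axis=-1)

rng = np.random.default_rng(0)
train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(2000, 2))
mean, std = fit_affine_flow(train)

# Assumed rule: flag anything less likely than 99% of the training states.
threshold = np.quantile(log_likelihood(train, mean, std), 0.01)

def is_ood(x):
    """True when the state's log-likelihood falls below the training threshold."""
    return log_likelihood(x, mean, std) < threshold

print(is_ood(np.array([0.2, 4.5])), is_ood(np.array([10.0, -10.0])))
```

The same score could also weight planning costs or rank candidate roll-outs, since it is a single scalar per state.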

Other work in this space

iTARFlow · ICML 2026
End-to-end NF with iterative diffusion-style denoising at sampling time — competitive ImageNet generation with exact likelihood training.

TARFlow-LM · NeurIPS 2025
Flexible patch-level AR flows for language — unified text generation via normalizing flows at the token-embedding level.

STARFlow2 · arXiv 2026
Pretzel 🥨 — a unified multimodal AR model bridging language models and NFs for image understanding, reasoning, and generation in one stream.

NF-CoT · coming soon
Latent reasoning with normalizing flows — continuous chain-of-thought inside an LLM backbone, with tractable likelihood enabling policy-gradient in thought space.

Thank you

Are Normalizing Flows good candidates for interactive world models? — so far, maybe.

TARFlow (ICML'25) · STARFlow (NeurIPS'25) · STARFlow-V (CVPR'26) · NTM (arXiv'26) · NFM (arXiv'26)

jiataogu.me

Jiatao Gu · GMLR · Penn
