Jiatao Gu
Talk @ EDGE Workshop · Jun 3, 2026

Interactive decisions require rolling out the consequences of actions — i.e. a world model you can query.
Interactive = the loop runs online: you act, it responds, you act again — fast, controllable, and stable over long horizons.
We will revisit this scorecard at the end.
Powerful — but a few properties remain hard for the interactive setting (each is an active research area):
A Normalizing Flow is a single invertible network $f$ mapping data $x$ to simple noise $z$.
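For reference, invertibility is what makes the likelihood exact: the standard change-of-variables identity gives the log-density in closed form. Here $p_Z$ denotes the density of the noise $z$ (commonly a standard Gaussian; the symbol is our notation, not taken from the slides):

$$\log p(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|$$

Sampling runs the map in reverse: draw $z \sim p_Z$ and compute $x = f^{-1}(z)$.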



STARFlow-V is built on STARFlow and operates in a spatiotemporal latent space. It checks every box of the interactive world-model scorecard:

Naive AR inversion decodes the $N$ tokens one by one. Recasting each flow block as a fixed-point system instead lets a Jacobi sweep update all tokens in parallel; the iteration converges in $k \ll N$ sweeps without breaking causality.
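The fixed-point view above can be illustrated on a toy problem. The sketch below is not STARFlow-V's architecture: it assumes scalar tokens and a hypothetical hand-written `conditioner` that stands in for the causal network of one affine autoregressive flow block. It compares token-by-token inversion with Jacobi sweeps that update every position from the previous iterate.

```python
import math
import random


def conditioner(prefix):
    """Hypothetical causal conditioner: (shift, log_scale) from earlier tokens only."""
    if not prefix:
        return 0.0, 0.0
    m = sum(prefix) / len(prefix)
    return 0.5 * math.tanh(m), 0.1 * math.tanh(m)


def forward(x):
    """Data -> noise. Every x[:i] is known, so a real model does this in one pass."""
    z = []
    for i in range(len(x)):
        mu, log_s = conditioner(x[:i])
        z.append((x[i] - mu) * math.exp(-log_s))
    return z


def invert_sequential(z):
    """Noise -> data, one token at a time (N dependent steps)."""
    x = []
    for z_i in z:
        mu, log_s = conditioner(x)  # x currently holds the decoded prefix
        x.append(z_i * math.exp(log_s) + mu)
    return x


def invert_jacobi(z, tol=1e-10, max_iters=None):
    """Noise -> data by fixed-point sweeps; returns (x, number_of_sweeps).

    Each sweep recomputes every token from the previous iterate, so the
    inner loop has no dependence across positions and could run in parallel.
    Token i is exact after at most i + 1 sweeps, hence N sweeps always suffice.
    """
    n = len(z)
    max_iters = n if max_iters is None else max_iters
    x = list(z)  # initial guess: the noise itself
    for k in range(1, max_iters + 1):
        new_x = []
        for i in range(n):
            mu, log_s = conditioner(x[:i])  # still causal: only positions < i
            new_x.append(z[i] * math.exp(log_s) + mu)
        delta = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            return x, k
    return x, max_iters


rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(256)]
x_seq = invert_sequential(z)
x_jac, sweeps = invert_jacobi(z)
print(f"Jacobi sweeps: {sweeps} for N = {len(z)} tokens")
```

In this toy setting the two inverses agree to numerical tolerance and the sweep count stays well below $N$ because each token depends only weakly on its prefix; how small $k$ is in a real model depends on the learned conditioner.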
Jacobi iteration already achieves ~15× speedup over sequential AR decoding — but that still falls short of real-time interactive latency. Closing this gap is precisely the motivation for the next two works.
| Model | VBench total (T2V) | Causal? |
|---|---|---|
| Wan2.1 (diffusion) | 83.69 | no |
| CogVideoX | 80.91 | no |
| STARFlow-V | 78.67 | yes |
| STARFlow-V (+ GPT-rewriter) | 79.70 | yes |
Like diffusion distillation — but using an AR-NF as the coupling oracle. A clean train / test separation:

| Method (ImageNet64, FID↓) | NFE 31 | NFE 15 | NFE 7 |
|---|---|---|---|
| FM (independent) | 2.66 | 4.94 | 13.21 |
| SD-FM (OT) | 2.66 | 3.12 | 6.28 |
| NFM | 1.80 | 2.18 | 3.27 |



Because $f_\mathcal{T}$ is invertible (not just a compressive encoder), this is the exact NLL of $p(x_s \mid x_t)$, not a surrogate. The model can be trained from scratch or initialized from any pretrained flow-matching model by setting $f_\mathcal{T} = \mathrm{id}$.

| Type | Model | GenEval↑ | DPG↑ |
|---|---|---|---|
| DM | SD3-Medium | 0.62 | 84.08 |
| DM | FLUX.1-dev | 0.66 | 83.84 |
| DM | Janus-Pro-7B | 0.80 | 84.19 |
| NF | STARFlow | 0.56 | — |
| NF | NTM (scratch, 256²) | 0.82 | 79.64 |
| NF | NTM (finetune FLUX, 512²) | 0.76 | 83.38 |





Are Normalizing Flows good candidates for interactive world models? So far, maybe.
TARFlow (ICML'25) · STARFlow (NeurIPS'25) · STARFlow-V (CVPR'26) · NTM (arXiv'26) · NFM (arXiv'26)
