Scalable Normalizing Flows for Visual and Multimodal Generation

Jiatao Gu

Talk @ T4V Workshop · Jun 3, 2026

Generative AI in 2026 — two paradigms working amazingly well

Vision & audio · diffusion models

Iterative denoising. Many forward passes per sample. Continuous outputs in $\mathbb{R}^d$.

Image: DALL·E 3 · Stable Diffusion 3 · Imagen 3 · FLUX
Video: Sora · Movie Gen · Veo · Kling · CogVideoX
3D / 4D: DiT-3D · GAIA · Genie
Audio: Stable Audio · AudioLDM

Language & code · autoregressive models

One forward pass per token. Discrete softmax. Causal mask, KV-cache, streaming.

Chat: GPT-4 / 4o · Claude 3.7 · Gemini 2 · Llama 3
Code: Codex · Claude Code · Cursor
Reasoning: o1 · DeepSeek-R1
Speech tokens: AudioLM · VALL-E
Two stacks. Two losses. Two inference regimes. Both have to live in one model if we want unified multimodal AI.

Best of both worlds in one model?

$\underbrace{\text{pixels}}_{\text{continuous}\,\in\,\mathbb{R}^d}\quad\text{vs.}\quad\underbrace{\text{tokens}}_{\text{discrete}\,\in\,\{1,\dots,V\}}$

A unified multimodal model has to understand images, generate images, and reason across text and images — under one set of weights, one training objective, one inference loop.

Today's best answer: Transfusion-style

Transfusion-style mixed-modal sequence
What it gets right. One backbone, mixed-modal sequence, end-to-end training. The de facto SOTA — Transfusion, Show-o, BAGEL, MoT, PixArt-Σ-style hybrids all sit here.
What it leaves on the table. Two heads, two losses, two regimes. Image gen leaves the AR cache for a 50-step diffusion loop. No shared inference, no exact likelihood for pixels.
Can one AR head handle both modalities — natively, in one stream, with one cache?
Transfusion (Zhou et al., Meta) · Show-o (Xie et al., NUS) · BAGEL (ByteDance) — 2024 – 2025

What such a model needs

  • Density: Exact likelihood — clean training signal, principled evaluation, on-policy RL.
  • AR-native: Streamable, KV-cacheable generation — same machinery LLMs already run at scale.
  • Continuous: No quantization — keep pixels in $\mathbb{R}^d$; no codebook bottleneck.
  • Scalable: High-resolution images at the quality of modern diffusion (and beyond, into video).
  • Shared: One architecture with LMs — same causal mask, same attention, one training stack.

Each ✓ will be cashed in by a paper in the STARFlow family — we revisit this scorecard at the end.

Normalizing Flows in one slide

change of variables
$p(x) = p_0\!\big(f(x)\big)\,\Big|\det \tfrac{\partial f(x)}{\partial x}\Big|,\qquad z=f(x)\ \text{invertible}$
Exact likelihood
trained by exact MLE — one clean objective, no ELBO, no schedule.
Invertible
$x \leftrightarrow z$ is lossless — encoding and generation share the same network.
Continuous
$x \in \mathbb{R}^d$ throughout — no codebook, no quantization, no discretization.
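
The change-of-variables rule above can be checked numerically. The following is a minimal sketch, assuming a toy element-wise affine map as the flow $f$ and a standard normal base density $p_0$; the names `affine_flow`, `flow_log_prob`, and `inverse` are illustrative and do not come from any of the papers.

```python
import numpy as np

def affine_flow(x, scale, shift):
    """Invertible map z = f(x) = (x - shift) / scale, applied element-wise."""
    z = (x - shift) / scale
    # The Jacobian is diagonal with entries 1/scale, so log|det| = -sum(log|scale|).
    log_det = -np.sum(np.log(np.abs(scale))) * np.ones(x.shape[0])
    return z, log_det

def standard_normal_logpdf(z):
    """log p_0(z) for a standard normal base density, summed over dimensions."""
    d = z.shape[-1]
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def flow_log_prob(x, scale, shift):
    """log p(x) = log p_0(f(x)) + log|det df/dx|."""
    z, log_det = affine_flow(x, scale, shift)
    return standard_normal_logpdf(z) + log_det

def inverse(z, scale, shift):
    """Generation direction: x = f^{-1}(z), using the same parameters."""
    return z * scale + shift

rng = np.random.default_rng(0)
scale = np.array([2.0, 0.5, 1.5])
shift = np.array([1.0, -1.0, 0.0])
x = inverse(rng.normal(size=(4, 3)), scale, shift)

log_px = flow_log_prob(x, scale, shift)
# Closed-form reference: this toy flow models independent Gaussians N(shift, scale^2).
ref = np.sum(
    -0.5 * ((x - shift) / scale) ** 2 - np.log(scale) - 0.5 * np.log(2 * np.pi),
    axis=-1,
)
```

For this toy flow the model density is a diagonal Gaussian, so the exact log-likelihood can be compared against a closed form, and `inverse` shows that encoding and generation share one set of parameters.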

NFs were always there — but didn't quite scale

RealNVP
Dinh et al. · 2017
Real-valued non-volume-preserving — coupling layers, the workhorse design.
Glow
Kingma & Dhariwal · 2018
Invertible 1×1 convs; first photorealistic NF on faces.
MAF / IAF
Papamakarios et al. · 2017 / Kingma et al. · 2016
Masked autoregressive flows — density estimation, rich likelihoods.
Flow++
Ho et al. · 2019
Mixture-of-logistics CDF coupling + variational dequantization — narrowed the likelihood gap to autoregressive models on CIFAR-10.
TARFlow
Zhai, Gu et al. · ICML 2025
Transformer-based masked AR flow — finally diffusion-level samples from a stand-alone NF.

Throughout, NFs kept exact likelihood — but lost ground to GANs & diffusion on sample quality. The Transformer revival changes the verdict.

TARFlow — NFs are capable generative models

TARFlow
Transformer Autoregressive Flow. A stack of autoregressive Transformer blocks over image patches, alternating scan direction layer-to-layer — a Transformer-based Masked AR Flow.
Three sample-quality tricks. Gaussian noise augmentation in training, a small post-hoc denoiser, and guidance — together close the gap to diffusion samples.
Stand-alone NF. No GAN, no diffusion. Sets new SOTA image likelihoods and diffusion-level samples from a single MLE objective.
Normalizing Flows are Capable Generative Models · S. Zhai, R. Zhang, P. Nakkiran, D. Berthelot, J. Gu, H. Zheng, T. Chen, M. Bautista, N. Jaitly, J. Susskind · ICML 2025 (Oral)

STARFlow — scaling latent NFs to high resolution

STARFlow samples
The question. Can a normalizing flow match modern diffusion at 512² & 1024², text-conditional, no quantization?
The answer. Yes — with a deep–shallow latent design, a new guidance recipe, and 3.8B params trained by exact MLE.
First NF at this scale & resolution to approach diffusion sample quality, while keeping exact log-likelihood end-to-end.
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis · J. Gu, T. Chen, D. Berthelot, H. Zheng, Y. Wang, R. Zhang, L. Dinh, M. Bautista, J. Susskind, S. Zhai · NeurIPS 2025 (Spotlight)

Architecture: deep + shallow

STARFlow architecture
One deep block ($f_D$, 18 layers) — a causal Transformer carrying most of the capacity. Acts like a language model over latent tokens; this is where guidance is applied.
A few shallow blocks ($f_S$, 2 layers each) — alternating scan direction layer-by-layer. Refine local detail; cheap, parallelizable in inverse.

3.8B params (T2I) · 1.4B (class-cond) · DiT-VAE latent at $p=1$ · 1024 tokens for 256², up to 16384 for 1024².
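
The token counts above follow from simple arithmetic. A small sketch, assuming the autoencoder downsamples each spatial side by 8× (an assumption that is consistent with 1024 tokens at 256² and $p=1$, but is not stated on the slide); `num_latent_tokens` is a hypothetical helper.

```python
def num_latent_tokens(image_size, vae_downsample=8, patch_size=1):
    """Sequence length for a square image after VAE encoding and patchifying.

    vae_downsample=8 is an assumed spatial compression factor, not a value
    taken from the slide.
    """
    side = image_size // (vae_downsample * patch_size)
    return side * side

for res in (256, 512, 1024):
    print(res, num_latent_tokens(res))
# 256 -> 1024 tokens, 512 -> 4096 tokens, 1024 -> 16384 tokens
```

Sequence length grows quadratically with resolution, which is why the latent compression matters for an autoregressive model.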


Why latent space — and why continuous latents

Each AR step is a Gaussian. An AR-flow predicts $\mathcal{N}(\mu_\theta, \sigma_\theta)$ per token. On raw pixels you'd want big patches to keep sequences short — but a big-patch pixel distribution is highly multimodal, and a Gaussian can't fit it well (TARFlow saw this directly).
Latents make tokens easy. A pretrained AE compresses each patch into a smoother, more Gaussian-like vector — perfect food for the per-step Gaussian head.
Continuous, not discrete. VQ-tokenizers throw information away at the codebook bottleneck. STARFlow uses a continuous DiT-VAE latent — invertibility carries through.
End-to-end MLE. Likelihood factors as $\log p_\theta(x)=\log p_\theta(z)+\log p_\psi(x\mid z)$. Train flow + AE jointly, one objective, no schedule.
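
The "each AR step is a Gaussian" picture can be made concrete with a toy autoregressive affine flow. This is a minimal sketch under stated assumptions: a strictly lower-triangular linear map stands in for the causal Transformer, two blocks alternate scan direction, and all names (`ToyARFlowBlock`, `log_prob`, `sample`) are illustrative rather than taken from the TARFlow or STARFlow code.

```python
import numpy as np

class ToyARFlowBlock:
    """One autoregressive affine block over a sequence of continuous tokens.

    A strictly lower-triangular linear map replaces the causal Transformer
    (an assumption for illustration): token n only sees tokens < n.
    """

    def __init__(self, num_tokens, rng, reverse=False):
        shape = (num_tokens, num_tokens)
        self.A_mu = np.tril(rng.normal(scale=0.3, size=shape), k=-1)
        self.A_s = np.tril(rng.normal(scale=0.3, size=shape), k=-1)
        self.reverse = reverse  # scan right-to-left instead of left-to-right

    def forward(self, x):
        """x -> z in one parallel pass; returns z and log|det dz/dx|."""
        if self.reverse:
            x = x[::-1]
        mu = self.A_mu @ x                  # per-token Gaussian mean
        log_sigma = np.tanh(self.A_s @ x)   # per-token log std, bounded
        z = (x - mu) * np.exp(-log_sigma)
        log_det = -np.sum(log_sigma)        # triangular Jacobian
        return (z[::-1] if self.reverse else z), log_det

    def inverse(self, z):
        """z -> x token by token; each step is x_n = mu_n + sigma_n * z_n."""
        if self.reverse:
            z = z[::-1]
        x = np.zeros_like(z)
        for n in range(z.shape[0]):
            mu_n = self.A_mu[n, :n] @ x[:n]
            log_sigma_n = np.tanh(self.A_s[n, :n] @ x[:n])
            x[n] = z[n] * np.exp(log_sigma_n) + mu_n
        return x[::-1] if self.reverse else x

def log_prob(x, blocks):
    """Exact log-likelihood: base log-density plus summed log-determinants."""
    total_log_det = 0.0
    for block in blocks:
        x, log_det = block.forward(x)
        total_log_det += log_det
    log_p0 = -0.5 * np.sum(x**2) - 0.5 * x.size * np.log(2 * np.pi)
    return log_p0 + total_log_det

def sample(blocks, num_tokens, dim, rng):
    """Draw noise and invert the blocks in reverse order."""
    x = rng.normal(size=(num_tokens, dim))
    for block in reversed(blocks):
        x = block.inverse(x)
    return x

rng = np.random.default_rng(0)
N, D = 8, 4
blocks = [ToyARFlowBlock(N, rng), ToyARFlowBlock(N, rng, reverse=True)]
x = sample(blocks, N, D, rng)

# Round trip: encode to noise, then decode back to the same tokens.
z = x
for block in blocks:
    z, _ = block.forward(z)
x_rec = z
for block in reversed(blocks):
    x_rec = block.inverse(x_rec)
```

The forward (likelihood) direction is a single parallel pass, while the inverse (sampling) direction is sequential, which is where a KV-cache would be used in a real Transformer.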

FAE — adapt pretrained features for generation

VAE / VA-VAE / RAE / FAE comparison
The mismatch. Pretrained encoders (DINOv2, SigLIP) want high-dim latents to model the masked-prediction posterior. Generative models want low-dim latents — small Gaussians, smooth trajectories.
FAE's bet. A Feature AutoEncoder: keep the frozen pretrained encoder, compress its features into a 32-dim generation-friendly code with a single attention layer + linear projection.

Channel dim: SD-VAE 4 · VA-VAE 32 · RAE 1536 · FAE 32 — same compactness as VA-VAE, but built from a pretrained understanding model.
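
To make the "single attention layer + linear projection" idea concrete, here is a rough sketch. It assumes single-head self-attention with a residual connection and random weights; the actual FAE layer layout (heads, normalization, residuals, decoder) is not given on the slide, so `ToyFeatureAdapter` and its attributes are hypothetical, and the feature width of 128 is a small stand-in for a real encoder's width.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

class ToyFeatureAdapter:
    """Compress frozen-encoder patch features into a low-dimensional code.

    Hypothetical layout: one self-attention layer followed by one linear
    projection. Weights are random here; in practice they would be trained.
    """

    def __init__(self, feat_dim, code_dim, rng):
        s = 1.0 / np.sqrt(feat_dim)
        self.W_q = rng.normal(scale=s, size=(feat_dim, feat_dim))
        self.W_k = rng.normal(scale=s, size=(feat_dim, feat_dim))
        self.W_v = rng.normal(scale=s, size=(feat_dim, feat_dim))
        self.W_proj = rng.normal(scale=s, size=(feat_dim, code_dim))

    def encode(self, feats):
        """feats: (num_patches, feat_dim) -> codes: (num_patches, code_dim)."""
        q, k, v = feats @ self.W_q, feats @ self.W_k, feats @ self.W_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        h = feats + attn @ v        # residual connection (an assumption)
        return h @ self.W_proj      # linear projection down to the code width

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 128))   # stand-in for frozen encoder outputs
adapter = ToyFeatureAdapter(feat_dim=128, code_dim=32, rng=rng)
codes = adapter.encode(feats)         # shape (256, 32)
```

The point of the sketch is the shape change: the generator only ever sees the 32-channel code, while the pretrained encoder stays untouched.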

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation · Y. Gao, C. Chen, T. Chen, J. Gu · CVPR 2026

SOTA on ImageNet, 7–13× faster convergence

FAE convergence vs RAE / SD-VAE baselines
ImageNet 256² · FID-50K
FAE-DINOv2-G + LightningDiT-XL: 2.08 (80 ep) · 1.48 (800 ep)
↳ + CFG: 1.70 (80 ep) · 1.29 (800 ep)
SOTA without CFG. 1.48 FID at 800 epochs — best reported for ImageNet 256² without classifier-free guidance.
7–13× faster than RAE-DINOv2-S/B at matched FID — the minimal-design adapter beats heavier ones.

FAE plugs into Normalizing Flows too

FAE vs SD-VAE on STARFlow (with CFG)
Same recipe, NF generator. Train STARFlow (1.4B params) on FAE-DINOv2-G latents — same patch size, same sequence length as SD-VAE baseline for fair comparison.
STARFlow · ImageNet 256² · FID @ 400 ep
SD-VAE baseline: 4.51
FAE (DINOv2-g/14): 2.67
Universal. The same compact pretrained-feature latent that won on diffusion also speeds up & sharpens NF generation — no NF-specific changes.

Classifier-free guidance — re-derived for AR flows

Diffusion CFG mixes scores; in an AR flow, every step is an explicit Gaussian — so we can mix the conditional/unconditional Gaussians directly:

$\tilde\mu_c = \mu_c + \dfrac{\omega s}{\,1 + \omega - \omega s\,}\,(\mu_c - \mu_u),\; \tilde\sigma_c = \dfrac{1}{\sqrt{\,1 + \omega - \omega s\,}}\;\sigma_c, \; s = \dfrac{\sigma_c^{\,2}}{\sigma_u^{\,2}}\ \text{(clipped)}$
Closed-form. Exact, not a heuristic — $\omega$ is the guidance scale, $s$ the variance ratio $\sigma_c^2/\sigma_u^2$, clipped to $[0,1]$ for stability.
Deep block only. Apply guidance where the model is most "language-model-like"; skip the shallow refiner.
Stable across $\omega$. Doesn't blow up at high guidance — a long-standing failure mode of naive flow CFG.
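
The guidance formula above is easy to evaluate directly. Below is a minimal sketch with a hypothetical function `guided_gaussian`. When the clip on $s$ is inactive, the result coincides with the Gaussian proportional to $p_c^{1+\omega}/p_u^{\omega}$, which the example checks numerically; treat that reading as an interpretation of the formula rather than a statement of the paper's derivation.

```python
import numpy as np

def guided_gaussian(mu_c, sigma_c, mu_u, sigma_u, omega):
    """Mix conditional and unconditional per-token Gaussians.

    Implements the closed-form guided mean and std from the slide, with
    s = sigma_c^2 / sigma_u^2 clipped to [0, 1].
    """
    s = np.clip(sigma_c**2 / sigma_u**2, 0.0, 1.0)
    denom = 1.0 + omega - omega * s      # >= 1 whenever omega >= 0 and s <= 1
    mu = mu_c + (omega * s / denom) * (mu_c - mu_u)
    sigma = sigma_c / np.sqrt(denom)
    return mu, sigma

mu_c, sigma_c = 1.0, 0.5   # conditional prediction
mu_u, sigma_u = 0.0, 1.0   # unconditional prediction
omega = 2.0

mu_g, sigma_g = guided_gaussian(mu_c, sigma_c, mu_u, sigma_u, omega)

# Product-of-Gaussians check (valid here because s = 0.25 is not clipped):
# precision and precision-weighted mean of p_c^(1+omega) / p_u^omega.
precision = (1 + omega) / sigma_c**2 - omega / sigma_u**2
mean_ref = ((1 + omega) * mu_c / sigma_c**2 - omega * mu_u / sigma_u**2) / precision
sigma_ref = 1.0 / np.sqrt(precision)
```

With $\omega = 0$ the function returns the conditional Gaussian unchanged. Because the clipped $s$ never exceeds 1, the denominator stays at or above 1 for any $\omega \ge 0$, so the guided standard deviation never exceeds $\sigma_c$.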

Closing the gap to diffusion

ImageNet · FID-50K
DiT-XL (diffusion): 3.60
STARFlow (1.4B): 2.40 (256²) · 3.00 (512²)
Text-to-image (CC12M)
STARFlow (3.8B): GenEval 0.56 · COCO FID-30K 9.1
First NF to outperform a strong diffusion peer on class-conditional ImageNet — and the first to scale text-conditional NFs to large data, with exact likelihood throughout.
STARFlow text-to-image samples

Three scorecard ✓s, with one model

  • Density: Exact log-likelihood retained at 1024² — no other large-scale image generator offers this.
  • Continuous: End-to-end on continuous latents — no codebook, no quantization.
  • Scalable: 3.8B params, 1024² text-to-image, FID and GenEval competitive with strong diffusion baselines.
NFs scale on images. What about time? — and what about other modalities?

STARFlow-V — NFs across time

STARFlow-V pipeline
Global deep + local shallow over a spatiotemporal latent — same recipe as STARFlow, now causal across frames.
One model, three tasks. T2V / I2V / V2V from a single set of weights, no fine-tuning — invertibility for free.
Likelihood & long horizons. Exact log-likelihood over video; stable to 30 s, 6× the 5 s training window.
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows · J. Gu, Y. Shen, T. Chen, L. Dinh, Y. Wang, M. Bautista, D. Berthelot, J. Susskind, S. Zhai · CVPR 2026 (Highlight)

One model, three video tasks — no fine-tuning

T2V
text → video, generated from scratch.
I2V
flow-encode the first frame into the KV cache; roll out the rest. No separate encoder.
V2V
flow-encode the source clip, roll out edits. Invertibility reuses the decoder as encoder.

A single set of weights — controllability for free, because the network is its own encoder.
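
The "network is its own encoder" idea can be illustrated with a toy causal flow over frames. This is a sketch under strong simplifications: each frame is a single latent vector, a lower-triangular linear map stands in for the causal Transformer, and `encode`, `decode`, and `continue_video` are hypothetical names. Observed frames are flow-encoded to latents, fresh noise is drawn for the remaining frames, and one decode pass reproduces the observed prefix exactly while rolling out the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4  # number of frames and latent size per frame (toy values)
A_mu = np.tril(rng.normal(scale=0.3, size=(T, T)), k=-1)
A_s = np.tril(rng.normal(scale=0.3, size=(T, T)), k=-1)

def step_params(past, n):
    """Gaussian parameters for frame n given frames 0..n-1 (toy conditioner)."""
    mu = A_mu[n, :n] @ past
    log_sigma = np.tanh(A_s[n, :n] @ past)
    return mu, log_sigma

def encode(frames):
    """Data -> noise for however many frames are given."""
    z = np.zeros_like(frames)
    for n in range(frames.shape[0]):
        mu, log_sigma = step_params(frames[:n], n)
        z[n] = (frames[n] - mu) * np.exp(-log_sigma)
    return z

def decode(z):
    """Noise -> data, one frame at a time, conditioning on decoded frames."""
    x = np.zeros_like(z)
    for n in range(z.shape[0]):
        mu, log_sigma = step_params(x[:n], n)
        x[n] = z[n] * np.exp(log_sigma) + mu
    return x

def continue_video(observed, rng):
    """Encode the observed prefix, sample noise for the rest, decode all T frames."""
    k = observed.shape[0]
    z = np.concatenate([encode(observed), rng.normal(size=(T - k, D))])
    return decode(z), z

observed = rng.normal(size=(2, D))                    # I2V / V2V: given frames
video, z = continue_video(observed, rng)
t2v, _ = continue_video(np.zeros((0, D)), rng)        # T2V: nothing observed
```

The same two functions cover all three cases; only the length of the observed prefix changes.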


Long-horizon robustness, native likelihood

"a corgi…" — 30 s roll-out
"a golden doodle tilting its…" — 30 s roll-out
5 s → 30 s. Stays sharp and identity-consistent at 6× the training horizon, where AR-diffusion baselines blur, drift, and distort.
Exact log-likelihood over the entire video — something diffusion video models structurally cannot offer.

Where the field is — unified multimodal in 2024–25

Chameleon
Meta · 2024
Early-fusion AR over VQ image tokens. Quantization → detail loss.
Transfusion
Meta · 2024
AR text + diffusion image, mixed loss. Two losses, two regimes.
EMU3
BAAI · 2024
Pure next-token over discrete tokens — text, image, video.
Show-o
NUS · 2024
AR text + masked-diffusion image in one Transformer.
Janus
DeepSeek · 2024
Decoupled encoders — separate paths for understand & generate.
MoT / BAGEL
2024–25
Mixture-of-Transformers — horizontal routing per modality.
STARFlow2 🥨
Apple · 2026
AR-NF stream vertically interleaved with a frozen VLM. Continuous, cache-shared.
Common compromise. Either quantize images (lose detail) or bolt on a diffusion head (lose KV-cache and the LM training stack).
What's missing. A model that's continuous, cache-friendly, single-stream, and preserves the pretrained LM — all four.

STARFlow2 — bridging LMs and NFs

STARFlow2 capabilities overview
The setup. Build a single autoregressive multimodal model that (D1) preserves a pretrained VLM's understanding, (D2) generates continuous high-fidelity images, (D3) stays one causal stream with shared KV-cache.
The key observation. An autoregressive normalizing flow is an autoregressive Transformer. Same causal mask, same cache, same L→R. That equivalence is the architecture.
The answer. Pretzel 🥨 — vertically interleave a frozen VLM stream with a trainable TARFlow stream under a shared causal mask.
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation · Y. Shen, T. Chen, Y. Gao, Y. Zhang, Y. Wang, M. Bautista, S. Zhai, J. Susskind, J. Gu · arXiv 2026

AR Normalizing Flows are AR Transformers

AR Transformer (LM)

$p(t_n \mid t_{1:n-1}) = \mathrm{softmax}\big(W \cdot \mathrm{Trf}(t_{1:n-1})\big)$
Discrete next-token. Causal mask · KV-cache · single forward pass per token.

AR Normalizing Flow

$p(x_n \mid x_{1:n-1}) = \mathcal{N}\big(\mu_\theta(x_{1:n-1}),\,\sigma_\theta(x_{1:n-1})\big)$
Continuous next-token. Same causal mask · same KV-cache · same single forward pass.
  • Same mask: Both decode strictly left-to-right under one causal attention pattern.
  • Same KV-cache: Visual latents enter the cache as ordinary tokens — reused across positions, no re-encoding.
  • Same stack: Same training stack, optimizer, attention kernels — only the head differs (softmax vs Gaussian).
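
The "only the head differs" point can be shown with two small functions that consume the same hidden state. This is an illustrative sketch: `h` stands for the causal Transformer output at one position, and the weight names and shapes are assumptions, not the STARFlow implementation.

```python
import numpy as np

def softmax_head_nll(h, W, target_id):
    """Discrete next-token: negative log-likelihood under softmax(W h)."""
    logits = W @ h
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[target_id]

def gaussian_head_nll(h, W_mu, W_log_sigma, target_x):
    """Continuous next-token: negative log-likelihood under N(mu, sigma^2)."""
    mu = W_mu @ h
    log_sigma = W_log_sigma @ h
    z = (target_x - mu) * np.exp(-log_sigma)           # also the flow's noise
    return np.sum(0.5 * z**2 + log_sigma + 0.5 * np.log(2 * np.pi))

rng = np.random.default_rng(0)
hidden, vocab, latent = 16, 10, 4
h = rng.normal(size=hidden)                            # shared hidden state

text_nll = softmax_head_nll(h, rng.normal(size=(vocab, hidden)), target_id=3)
image_nll = gaussian_head_nll(
    h,
    rng.normal(size=(latent, hidden)),
    rng.normal(size=(latent, hidden)),
    target_x=rng.normal(size=latent),
)
```

Both losses are per-position negative log-likelihoods, so they can be summed along one causal sequence without changing the backbone.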
Concretely, in STARFlow: the deep block is the AR-Transformer (carries all the language-model-shaped capacity); the shallow blocks just refine within-token detail. Pretzel reuses this deep block as its visual stream — wired into the VLM through one shared causal mask.
"There is no structural gap between AR flows and language models." — pixels are just continuous tokens.

Pretzel 🥨 — vertical interleaving

Pretzel architecture
VLM stream ❄️
Frozen Qwen2.5-VL-7B-Instruct. Pretrained understanding kept intact (D1).
TARFlow stream 🔥
Trainable AR-flow over visual latents. High-fidelity continuous generation (D2).
Vertical residual skips
Cross-modal fusion at every position under one shared causal mask (D3).
$\mathcal{L}_\text{NLL} = \mathcal{L}_\text{NF}\ (\text{visual})\ +\ \lambda\,\mathcal{L}_\text{NTP}\ (\text{text})$
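
As a rough illustration of the objective above, the sketch below combines per-position losses from a mixed text and image sequence. The per-modality averaging and the names (`joint_nll`, `is_visual`, `lam`) are assumptions; the slide only specifies the weighted sum.

```python
import numpy as np

def joint_nll(nf_nll_per_pos, ntp_nll_per_pos, is_visual, lam=1.0):
    """L_NF over visual positions plus lam * L_NTP over text positions.

    Each loss is averaged over its own positions (an assumed normalization).
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    loss_nf = nf_nll_per_pos[is_visual].mean() if is_visual.any() else 0.0
    loss_ntp = ntp_nll_per_pos[~is_visual].mean() if (~is_visual).any() else 0.0
    return loss_nf + lam * loss_ntp

# Toy sequence: text, image, image, text.
nf_nll = np.array([0.0, 2.0, 4.0, 0.0])    # only meaningful at visual positions
ntp_nll = np.array([1.0, 0.0, 0.0, 3.0])   # only meaningful at text positions
is_visual = [False, True, True, False]

loss = joint_nll(nf_nll, ntp_nll, is_visual, lam=0.5)   # 3.0 + 0.5 * 2.0 = 4.0
```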

Why not MoT / BAGEL? — a negative result

Mode A: freeze VLM, train TARFlow branch only.
Generation collapses (left). TARFlow can't enter the VLM's KV-cache as reusable context — horizontal routing keeps streams blind to each other.
Mode B: fine-tune VLM jointly.
Understanding degrades sharply — MME drops to ~30. Pretrained VLM capabilities are washed out by the new visual loss.
Pretzel fixes both. Vertical residuals: VLM stays frozen (no understanding loss), TARFlow lives inside the same causal mask (true cross-modal context).
MoT failure samples
MoT-style generations under mode A — degenerate, mode-collapsed.

Three-stage curriculum

Three-stage training pipeline
Stage 1 · T2I
Train TARFlow stream from scratch on ~800M text→image pairs. VLM, FAE frozen. Establish visual generation backbone.
Stage 2 · I2T adapter
Train only the visual adapter (zero-init) on ~200M image→text. Align FAE latents with the VLM's representation space.
Stage 3 · interleaved joint
Activate vertical skips, train all on ~80M interleaved examples (gen + edit + understand). Joint $\mathcal{L}_\text{NF}+\lambda\mathcal{L}_\text{NTP}$.

3.6B trainable params · Qwen2.5-VL-7B + FAE frozen throughout · zero-init visual adapter for safe cross-modal coupling.
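
The curriculum can be summarized as a small configuration table. This is a sketch only: the module names (`tarflow_stream`, `visual_adapter`, `vertical_skips`, `vlm`, `fae`) are hypothetical labels, and the per-stage losses for stages 1 and 2 are inferred from the task direction rather than stated on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    approx_examples: str
    trainable: tuple
    frozen: tuple
    loss: str

CURRICULUM = (
    Stage(
        name="stage1_t2i",
        data="text -> image pairs",
        approx_examples="~800M",
        trainable=("tarflow_stream",),
        frozen=("vlm", "fae"),
        loss="L_NF",                      # assumed: generation-only stage
    ),
    Stage(
        name="stage2_i2t_adapter",
        data="image -> text",
        approx_examples="~200M",
        trainable=("visual_adapter",),    # zero-initialized
        frozen=("vlm", "fae", "tarflow_stream"),
        loss="L_NTP",                     # assumed: text prediction stage
    ),
    Stage(
        name="stage3_interleaved",
        data="interleaved generation + editing + understanding",
        approx_examples="~80M",
        trainable=("tarflow_stream", "visual_adapter", "vertical_skips"),
        frozen=("vlm", "fae"),
        loss="L_NF + lambda * L_NTP",
    ),
)

for stage in CURRICULUM:
    print(f"{stage.name}: train {stage.trainable}, freeze {stage.frozen}")
```

Writing it this way makes the invariant easy to see: the VLM and the FAE are in the frozen set at every stage.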


One causal stream — text and pixels in the same KV-cache

Interleaved text + image generation example
No re-encoding. Generated images are projected into the VLM embedding space and reused as ordinary tokens by later steps — text and visual positions share the cache.
No diffusion loop. Each visual position is a single AR Gaussian, not a 50-step ODE. Same per-token cost as language.
Mix freely. Text → image → text → image, in one rollout, in one cache, in one model. The dream from slide 2.

Understand and generate — same model

Understanding (Table 1)

STARFlow2: MME-P 1607.1 · GQA 63.9 · SEED 74.7 · MMBench 82.2 · MMMU 58.3 · AI2D 79.3

Competitive with Janus / Show-o2 / TUNA — without losing the pretrained VLM.

Generation (Tables 2–3)

STARFlow2: GenEval 0.82 · DPG-Bench 84.94 · Resolution 256² · Trainable params 3.6B
First unified model to hit D1 + D2 + D3 simultaneously — preserve VLM, continuous high-fidelity gen, single causal stream with shared KV-cache.

From prompts to edits — one model

STARFlow2 text-to-image samples
Text-to-image · same weights, no separate generation head
STARFlow2 image editing samples
Multi-turn editing · invertibility = encode & decode share the network

Back to the scorecard

  • Density: Exact log-likelihood throughout — TARFlow → STARFlow → STARFlow-V → STARFlow2.
  • AR-native: Same causal mask & KV-cache as language models — STARFlow2 makes this explicit.
  • Continuous: Pixels stay in $\mathbb{R}^d$ — no codebook bottleneck, no quantization tax.
  • Scalable: STARFlow at 1024², STARFlow-V to 30 s video — competitive with strong diffusion peers.
  • Shared: One architecture, one training stack — language & vision under one mask.
TARFlow is an AR Transformer with a Gaussian head — same causal mask, same KV-cache, just a continuous next-token instead of a discrete one.

The same architecture extends to everything continuous

Audio
Speech & music: continuous, sequential, AR-native — a Pretzel-style audio stream interleaves naturally with text.
3D / 4D
Point clouds, dynamic point-maps, mesh latents — all continuous; the AR-flow head is modality-agnostic.
Action / control
Continuous robot trajectories & world-model states — same machinery, exact density for planning.
Reasoning
Continuous "thoughts" inside an LLM (NF-CoT) — a teaser for the deck-4 talk.
If the next-token can be a continuous Gaussian, the architecture stays the same — and a single AR Transformer covers everything from text to pixels to actions.

Future directions

On-policy RL in latent space
Tractable likelihood enables PPO-style training over continuous tokens — what was off-limits with diffusion.
Higher resolution & longer video
Push Pretzel + STARFlow-V toward 1024² + minutes-long generation, still single-stream.
Continuous reasoning
NF-based latent "thoughts" inside the LLM — bridge to the KnowledgeMR talk (deck 4).
End-to-end multimodal training
Drop the curriculum — single-stage joint training of VLM + flow, with cleaner scaling laws.

Other works in the NF line

NTM
Each reverse step = a conditional NF. 4-step generation with exact likelihood across the whole trajectory.
arXiv 2026
NFM
Distill an AR-NF's deterministic noise↔data coupling into a flow-matching student. Beats independent & OT couplings.
arXiv 2026
iTARFlow
End-to-end NF training + diffusion-style iterative denoising at sampling. Competitive on ImageNet 64 / 128 / 256.
ICML 2026
TarFlowLM
Continuous AR over text with a TARFlow-style Gaussian head — language modeling without the softmax bottleneck.
NeurIPS 2025
NF-CoT
Continuous chain-of-thought via AR-NF — latent reasoning. Bridge to the KnowledgeMR talk.
in prep.

Thank you

Scalable Normalizing Flows for Visual & Multimodal Generation — one AR stream, exact likelihood, no quantization.

TARFlow (ICML'25) · STARFlow (NeurIPS'25) · STARFlow-V (CVPR'26) · FAE (CVPR'26) · STARFlow2 (arXiv'26)

https://jiataogu.me
Jiatao Gu · GMLR · Penn