Scalable Normalizing Flows
for Visual and
Multimodal Generation

Jiatao Gu

Talk @ T4V Workshop · Jun 3, 2026

Generative AI in 2026 — two paradigms working amazingly well

Vision & audio · diffusion models

Iterative denoising. Many forward passes per sample. Continuous outputs in $\mathbb{R}^d$.

Image: DALL·E 3 · Stable Diffusion 3 · Imagen 3 · FLUX
Video: Sora · Movie Gen · Veo · Kling · CogVideoX
3D / 4D: DiT-3D · GAIA · Genie
Audio: Stable Audio · AudioLDM

Language & code · autoregressive models

One forward pass per token. Discrete softmax. Causal mask, KV-cache, streaming.

Chat: GPT-4 / 4o · Claude 3.7 · Gemini 2 · Llama 3
Code: Codex · Claude Code · Cursor
Reasoning: o1 · DeepSeek-R1
Speech tokens: AudioLM · VALL-E

Two stacks. Two losses. Two inference regimes. Both have to live in one model if we want unified multimodal AI.

Best of both worlds in one model?

$\underbrace{\text{pixels}}_{\text{continuous}\,\in\,\mathbb{R}^d}\quad\text{vs.}\quad\underbrace{\text{tokens}}_{\text{discrete}\,\in\,\{1,\dots,V\}}$

A unified multimodal model has to understand images, generate images, and reason across text and images — under one set of weights, one training objective, one inference loop.

Today's best answer: Transfusion-style

What it gets right. One backbone, mixed-modal sequence, end-to-end training. The de-facto SOTA — Transfusion, Show-o, BAGEL, MoT, Pixart-Σ-style hybrids all sit here.

What it leaves on the table. Two heads, two losses, two regimes. Image gen leaves the AR cache for a 50-step diffusion loop. No shared inference, no exact likelihood for pixels.

Can one AR head handle both modalities — natively, in one stream, with one cache?

Transfusion (Zhou et al., Meta) · Show-o (Xie et al., NUS) · BAGEL (ByteDance) — 2024 – 2025

What such a model needs

Density: Exact likelihood for a clean training signal, principled evaluation, and on-policy RL.
AR-native: Streamable, KV-cacheable generation, the same machinery LLMs already run at scale.
Continuous: No quantization; keep pixels in $\mathbb{R}^d$ with no codebook bottleneck.
Scalable: High-resolution images at the quality of modern diffusion (and beyond, into video).
Shared: One architecture with LMs, with the same causal mask, same attention, and one training stack.

Each ✓ will be cashed in by a paper in the STARFlow family — we revisit this scorecard at the end.

Normalizing Flows in one slide

$p(x) = p_0\!\big(f(x)\big)\,\Big|\det \tfrac{\partial f(x)}{\partial x}\Big|,\qquad z=f(x)\ \text{invertible}$

Exact likelihood
trained by exact MLE — one clean objective, no ELBO, no schedule.

Invertible
$x \leftrightarrow z$ is lossless — encoding and generation share the same network.

Continuous
$x \in \mathbb{R}^d$ throughout — no codebook, no quantization, no discretization.
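
A minimal sketch of the change-of-variables formula on this slide, assuming a toy elementwise affine map $f(x) = (x - b)/a$ with a standard-normal base $p_0$. The names `a`, `b`, and `log_prob` are illustrative and do not come from the talk; the point is only that the exact log-likelihood is the base log-density of $z$ plus the log-determinant of the Jacobian, and that the same parameters map $z$ back to $x$ without loss.

```python
import numpy as np

def std_normal_logpdf(z):
    """Log-density of a standard normal, summed over the last axis."""
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)

def affine_flow_forward(x, a, b):
    """Toy invertible map z = f(x) = (x - b) / a, elementwise, with a > 0."""
    z = (x - b) / a
    # The Jacobian is diagonal with entries 1/a, so log|det| = -sum(log a).
    log_det = -np.sum(np.log(a)) * np.ones(x.shape[:-1])
    return z, log_det

def affine_flow_inverse(z, a, b):
    """Exact inverse x = a * z + b: generation reuses the same parameters."""
    return a * z + b

def log_prob(x, a, b):
    """log p(x) = log p0(f(x)) + log|det df/dx|."""
    z, log_det = affine_flow_forward(x, a, b)
    return std_normal_logpdf(z) + log_det

a = np.array([2.0, 0.5, 1.5])      # illustrative scales
b = np.array([1.0, -1.0, 0.0])     # illustrative shifts
x = np.array([[0.3, -0.7, 2.0]])

lp = log_prob(x, a, b)
z, _ = affine_flow_forward(x, a, b)
x_rec = affine_flow_inverse(z, a, b)
print(lp, np.allclose(x, x_rec))
```

For this particular map the result can be checked against the closed-form density of $\mathcal{N}(b, a^2)$, which is a convenient sanity test before moving to learned, non-trivial $f$.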

NFs were always there — but didn't quite scale

RealNVP

Dinh et al. · 2017

Real-valued non-volume-preserving — coupling layers, the workhorse design.

Glow

Kingma & Dhariwal · 2018

Invertible 1×1 convs; first photorealistic NF on faces.

MAF / IAF

Papamakarios et al. · 2017; Kingma et al. · 2016

Masked autoregressive flows — density estimation, rich likelihoods.

Flow++

Ho et al. · 2019

Mixture-of-CDFs coupling + continuous noise — closed the likelihood gap on CIFAR.

TARFlow

Zhai, Gu et al. · ICML 2025

Transformer-based masked AR flow — finally diffusion-level samples from a stand-alone NF.

Throughout, NFs kept exact likelihood — but lost ground to GANs & diffusion on sample quality. The Transformer revival changes the verdict.

TARFlow — NFs are capable generative models

Transformer Autoregressive Flow. A stack of autoregressive Transformer blocks over image patches, alternating scan direction layer-to-layer — a Transformer-based Masked AR Flow.

Three sample-quality tricks. Gaussian noise augmentation in training, a small post-hoc denoiser, and guidance — together close the gap to diffusion samples.

Stand-alone NF. No GAN, no diffusion. Sets new SOTA image likelihoods and produces diffusion-level samples from a single MLE objective.
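
To make the "autoregressive flow with alternating scan direction" idea concrete, here is a toy sketch, not the TARFlow implementation. It assumes an affine autoregressive transform $z_t = (x_t - \mu_t)\,e^{-\alpha_t}$ in which $\mu_t, \alpha_t$ depend only on earlier tokens. A fixed strictly lower-triangular mixing matrix stands in for the causal Transformer, and the two blocks share parameters purely to keep the example short. All names (`conditioner`, `forward_stack`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4                                      # toy sizes: tokens, channels per token
A = np.tril(rng.normal(size=(T, T)), k=-1) / T   # strictly causal: row t sees only s < t
W_mu = rng.normal(size=(D, D)) * 0.3
W_alpha = rng.normal(size=(D, D)) * 0.3

def conditioner(x):
    """Stand-in for a causal Transformer: output at t depends on inputs at < t."""
    ctx = A @ x
    return ctx @ W_mu, np.tanh(ctx @ W_alpha)

def forward(x):
    """x -> z in one parallel pass (the training direction)."""
    mu, alpha = conditioner(x)
    z = (x - mu) * np.exp(-alpha)
    return z, -alpha.sum()           # triangular Jacobian: log|det| = -sum(alpha)

def inverse(z):
    """z -> x one token at a time (the sampling direction)."""
    x = np.zeros_like(z)
    for t in range(T):
        mu, alpha = conditioner(x)   # row t only uses the already-filled rows < t
        x[t] = z[t] * np.exp(alpha[t]) + mu[t]
    return x

def forward_stack(x, n_blocks=2):
    h, total = x, 0.0
    for k in range(n_blocks):
        if k % 2 == 1:
            h = h[::-1]              # alternate scan direction on odd blocks
        h, log_det = forward(h)
        total += log_det
    return h, total

def inverse_stack(z, n_blocks=2):
    h = z
    for k in reversed(range(n_blocks)):
        h = inverse(h)
        if k % 2 == 1:
            h = h[::-1]              # undo the flip applied in the forward pass
    return h

x = rng.normal(size=(T, D))
z, log_det = forward_stack(x)
nll = 0.5 * np.sum(z**2 + np.log(2 * np.pi)) - log_det   # exact negative log-likelihood
print(nll, np.allclose(x, inverse_stack(z)))
```

The asymmetry is the design point: density evaluation is a single parallel pass, while sampling is sequential per token, which is what makes a KV-cache useful in a real Transformer conditioner.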

Normalizing Flows are Capable Generative Models · S. Zhai, R. Zhang, P. Nakkiran, D. Berthelot, J. Gu, H. Zheng, T. Chen, M. Bautista, N. Jaitly, J. Susskind · ICML 2025 (Oral)

STARFlow — scaling latent NFs to high resolution

The question. Can a normalizing flow match modern diffusion at 512² & 1024², text-conditional, no quantization?

The answer. Yes — with a deep–shallow latent design, a new guidance recipe, and 3.8B params trained by exact MLE.

First NF at this scale & resolution to approach diffusion sample quality, while keeping exact log-likelihood end-to-end.

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis · J. Gu, T. Chen, D. Berthelot, H. Zheng, Y. Wang, R. Zhang, L. Dinh, M. Bautista, J. Susskind, S. Zhai · NeurIPS 2025 (Spotlight)

Architecture: deep + shallow

One deep block ($f_D$, 18 layers) — a causal Transformer carrying most of the capacity. Acts like a language model over latent tokens; this is where guidance is applied.

A few shallow blocks ($f_S$, 2 layers each) — alternating scan direction layer-by-layer. Refine local detail; cheap, parallelizable in inverse.

3.8B params (T2I) · 1.4B (class-cond) · DiT-VAE latent at $p=1$ · 1024 tokens for 256², up to 16384 for 1024².
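
The sequence lengths quoted above follow from simple arithmetic. The sketch below assumes an 8× spatial downsampling in the autoencoder, which is an assumption made here because it reproduces the slide's numbers at $p=1$; the function name and parameters are illustrative.

```python
def num_tokens(resolution, vae_downsample=8, patch=1):
    """Tokens per image: one token per (patch x patch) cell of the latent grid.

    vae_downsample=8 is an assumed autoencoder stride, not stated on the slide.
    """
    side = resolution // (vae_downsample * patch)
    return side * side

for res in (256, 512, 1024):
    print(res, num_tokens(res))
```

Quadrupling the side length from 256 to 1024 multiplies the sequence length by 16, which is why the deep block's causal attention cost dominates at high resolution.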

Why latent space — and why continuous latents

Each AR step is a Gaussian. An AR-flow predicts $\mathcal{N}(\mu_\theta, \sigma_\theta)$ per token. On raw pixels you'd want big patches to keep sequences short — but a big-patch pixel distribution is highly multimodal, and a Gaussian can't fit it well (TARFlow saw this directly).

Latents make tokens easy. A pretrained AE compresses each patch into a smoother, more Gaussian-like vector — perfect food for the per-step Gaussian head.

Continuous, not discrete. VQ-tokenizers throw information away at the codebook bottleneck. STARFlow uses a continuous DiT-VAE latent — invertibility carries through.

End-to-end MLE. Likelihood factors as $\log p_\theta(x)=\log p_\theta(z)+\log p_\psi(x\mid z)$. Train flow + AE jointly, one objective, no schedule.

FAE — adapt pretrained features for generation

The mismatch. Pretrained encoders (DINOv2, SigLIP) want high-dim latents to model the masked-prediction posterior. Generative models want low-dim latents — small Gaussians, smooth trajectories.

FAE's bet. A Feature AutoEncoder: keep the frozen pretrained encoder, compress its features into a 32-dim generation-friendly code with a single attention layer + linear projection.

Channel dim: SD-VAE 4 · VA-VAE 32 · RAE 1536 · FAE 32 — same compactness as VA-VAE, but built from a pretrained understanding model.
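
A rough sketch of the shape of such an adapter, under stated assumptions: single-head self-attention with a residual connection, followed by a linear projection to the 32-dim code. The residual, the single head, the random initialization, and every name here (`FeatureCompressor`, `W_out`) are illustrative choices, not details taken from the FAE paper. The frozen encoder is represented only by its output features.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

class FeatureCompressor:
    """One attention layer + one linear projection: (N, d_in) -> (N, d_code)."""

    def __init__(self, d_in=1536, d_code=32, seed=0):
        rng = np.random.default_rng(seed)
        scale = d_in ** -0.5
        self.Wq, self.Wk, self.Wv = (rng.normal(size=(d_in, d_in)) * scale for _ in range(3))
        self.W_out = rng.normal(size=(d_in, d_code)) * scale

    def __call__(self, feats):
        q, k, v = feats @ self.Wq, feats @ self.Wk, feats @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, N) patch-to-patch weights
        h = feats + attn @ v                             # assumed residual connection
        return h @ self.W_out                            # compact generation-friendly code

# Small stand-in for frozen-encoder patch features (real ones would be far wider).
feats = np.random.default_rng(1).normal(size=(16, 64))
code = FeatureCompressor(d_in=64, d_code=32)(feats)
print(code.shape)
```

The generator then only ever sees the low-dimensional code, while the encoder that produced `feats` stays untouched.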

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation · Y. Gao, C. Chen, T. Chen, J. Gu · CVPR 2026

SOTA on ImageNet, 7–13× faster convergence

FAE convergence vs RAE / SD-VAE baselines

ImageNet 256² · FID-50K	80 ep	800 ep
FAE-DINOv2-G + LightningDiT-XL	2.08	1.48
↳ + CFG	1.70	1.29

SOTA without CFG. 1.48 FID at 800 epochs — best reported for ImageNet 256² without classifier-free guidance.

7–13× faster than RAE-DINOv2-S/B at matched FID — the minimal-design adapter beats heavier ones.

FAE plugs into Normalizing Flows too

Same recipe, NF generator. Train STARFlow (1.4B params) on FAE-DINOv2-G latents, with the same patch size and sequence length as the SD-VAE baseline for a fair comparison.

STARFlow · ImageNet 256²	FID @ 400 ep
SD-VAE baseline	4.51
FAE (DINOv2-g/14)	2.67

Universal. The same compact pretrained-feature latent that won on diffusion also speeds up & sharpens NF generation — no NF-specific changes.

Classifier-free guidance — re-derived for AR flows

Diffusion CFG mixes scores; in an AR flow, every step is an explicit Gaussian — so we can mix the conditional/unconditional Gaussians directly:

$\tilde\mu_c = \mu_c + \dfrac{\omega s}{\,1 + \omega - \omega s\,}\,(\mu_c - \mu_u),\; \tilde\sigma_c = \dfrac{1}{\sqrt{\,1 + \omega - \omega s\,}}\;\sigma_c, \; s = \dfrac{\sigma_c^{\,2}}{\sigma_u^{\,2}}\ \text{(clipped)}$

Closed-form. Exact, not a heuristic: $\omega$ is the guidance scale and $s$ is the variance ratio $\sigma_c^2/\sigma_u^2$, clipped to $[0,1]$ for stability.

Deep block only. Apply guidance where the model is most "language-model-like"; skip the shallow refiner.

Stable across $\omega$. Doesn't blow up at high guidance — a long-standing failure mode of naive flow CFG.
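
The guidance rule above is short enough to write out directly. This is a sketch of the formula as printed on the slide, applied elementwise; the function name and the example values are illustrative. One can check by hand that, when the clip is inactive, the result matches the Gaussian proportional to $p_c^{1+\omega}/p_u^{\omega}$, whose precision is $(1+\omega)/\sigma_c^2 - \omega/\sigma_u^2$.

```python
import numpy as np

def guided_gaussian(mu_c, sigma_c, mu_u, sigma_u, omega):
    """Mix conditional and unconditional per-token Gaussians.

    s is the variance ratio clipped to [0, 1], so the denominator
    1 + omega - omega * s stays >= 1 for any omega >= 0.
    """
    s = np.clip(sigma_c**2 / sigma_u**2, 0.0, 1.0)
    denom = 1.0 + omega - omega * s
    mu = mu_c + (omega * s / denom) * (mu_c - mu_u)
    sigma = sigma_c / np.sqrt(denom)
    return mu, sigma

mu_c, sigma_c = np.array([1.0]), np.array([0.5])
mu_u, sigma_u = np.array([0.0]), np.array([1.0])

print(guided_gaussian(mu_c, sigma_c, mu_u, sigma_u, omega=0.0))  # no guidance
print(guided_gaussian(mu_c, sigma_c, mu_u, sigma_u, omega=2.0))  # sharper, pushed away from mu_u
```

Two limiting cases are worth noting. With $\omega = 0$ the conditional Gaussian is returned unchanged. With equal variances ($s = 1$) the mean update reduces to the familiar $\mu_c + \omega(\mu_c - \mu_u)$ and the variance is untouched.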

Closing the gap to diffusion

ImageNet · FID-50K	256²	512²
DiT-XL (diffusion)	3.60	—
STARFlow (1.4B)	2.40	3.00

Text-to-image (CC12M)	GenEval	COCO FID-30K
STARFlow (3.8B)	0.56	9.1

First NF to outperform a strong diffusion peer on class-conditional ImageNet — and the first to scale text-conditional NFs to large data, with exact likelihood throughout.

Three scorecard ✓s, with one model

✓ Density: Exact log-likelihood retained at 1024²; no other large-scale image generator offers this.
✓ Continuous: End-to-end on continuous latents, with no codebook and no quantization.
✓ Scalable: 3.8B params, 1024² text-to-image, FID and GenEval competitive with strong diffusion baselines.

NFs scale on images. What about time? — and what about other modalities?

STARFlow-V — NFs across time

Global deep + local shallow over a spatiotemporal latent — same recipe as STARFlow, now causal across frames.

One model, three tasks. T2V / I2V / V2V from a single set of weights, with no fine-tuning; invertibility gives this for free.

Likelihood & long horizons. Exact log-likelihood over video; stable to 30 s, which is 6× the 5 s training window.

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows · J. Gu, Y. Shen, T. Chen, L. Dinh, Y. Wang, M. Bautista, D. Berthelot, J. Susskind, S. Zhai · CVPR 2026 (Highlight)

One model, three video tasks — no fine-tuning

T2V

text → video, generated from scratch.

I2V

flow-encode the first frame into the KV cache; roll out the rest. No separate encoder.

V2V

flow-encode the source clip, roll out edits. Invertibility reuses the decoder as encoder.

A single set of weights — controllability for free, because the network is its own encoder.

Long-horizon robustness, native likelihood

"a corgi…" — 30 s roll-out

"a golden doodle tilting its…" — 30 s roll-out

5 s → 30 s. Stays sharp and identity-consistent at 6× the training horizon, where AR-diffusion baselines blur, drift, and distort.

Exact log-likelihood over the entire video — something diffusion video models structurally cannot offer.

Where the field is — unified multimodal in 2024–25

Chameleon

Meta · 2024

Early-fusion AR over VQ image tokens. Quantization → detail loss.

Transfusion

Meta · 2024

AR text + diffusion image, mixed loss. Two losses, two regimes.

EMU3

BAAI · 2024

Pure next-token over discrete tokens — text, image, video.

Show-o

NUS · 2024

AR text + masked-diffusion image in one Transformer.

Janus

DeepSeek · 2024

Decoupled encoders — separate paths for understand & generate.

MoT / BAGEL

2024–25

Mixture-of-Transformers — horizontal routing per modality.

STARFlow2 🥨

Apple · 2026

AR-NF stream vertically interleaved with a frozen VLM. Continuous, cache-shared.

Common compromise. Either quantize images (lose detail) or bolt on a diffusion head (lose KV-cache and the LM training stack).

What's missing. A model that's continuous, cache-friendly, single-stream, and preserves the pretrained LM — all four.

STARFlow2 — bridging LMs and NFs

The setup. Build a single autoregressive multimodal model that (D1) preserves a pretrained VLM's understanding, (D2) generates continuous high-fidelity images, (D3) stays one causal stream with shared KV-cache.

The key observation. An autoregressive normalizing flow is an autoregressive Transformer. Same causal mask, same cache, same L→R. That equivalence is the architecture.

The answer. Pretzel 🥨 — vertically interleave a frozen VLM stream with a trainable TARFlow stream under a shared causal mask.

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation · Y. Shen, T. Chen, Y. Gao, Y. Zhang, Y. Wang, M. Bautista, S. Zhai, J. Susskind, J. Gu · arXiv 2026

AR Normalizing Flows are AR Transformers

AR Transformer (LM)

$p(t_n \mid t_{1:n-1}) = \mathrm{softmax}\big(W \cdot \mathrm{Trf}(t_{1:n-1})\big)$

Discrete next-token. Causal mask · KV-cache · single forward pass per token.

AR Normalizing Flow

$p(x_n \mid x_{1:n-1}) = \mathcal{N}\big(\mu_\theta(x_{1:n-1}),\,\sigma_\theta(x_{1:n-1})\big)$

Continuous next-token. Same causal mask · same KV-cache · same single forward pass.

✓ Same mask: Both decode strictly left-to-right under one causal attention pattern.
✓ Same KV-cache: Visual latents enter the cache as ordinary tokens, reused across positions with no re-encoding.
✓ Same stack: Same training stack, optimizer, and attention kernels; only the head differs (softmax vs Gaussian).

Concretely, in STARFlow: the deep block is the AR-Transformer (carries all the language-model-shaped capacity); the shallow blocks just refine within-token detail. Pretzel reuses this deep block as its visual stream — wired into the VLM through one shared causal mask.
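
The "only the head differs" claim can be illustrated with two small loss functions that consume the same hidden state. This is a sketch under assumptions: `h` stands for the backbone's output at one position, the weight matrices are random placeholders, and the Gaussian head is diagonal and parameterized by log-sigma. None of these names come from the talk.

```python
import numpy as np

def softmax_head_nll(h, W, target_id):
    """Discrete next-token: -log softmax(h @ W)[target_id]."""
    logits = h @ W
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

def gaussian_head_nll(h, W_mu, W_log_sigma, target_vec):
    """Continuous next-token: -log N(target; mu(h), diag(sigma(h)^2))."""
    mu = h @ W_mu
    log_sigma = h @ W_log_sigma
    z = (target_vec - mu) * np.exp(-log_sigma)        # the flow's forward map for this token
    return 0.5 * np.sum(z**2 + np.log(2 * np.pi)) + np.sum(log_sigma)

rng = np.random.default_rng(0)
hidden, vocab, latent = 16, 100, 8                    # toy sizes
h = rng.normal(size=hidden)                           # same backbone output feeds either head

text_loss = softmax_head_nll(h, rng.normal(size=(hidden, vocab)) * 0.1, target_id=3)
image_loss = gaussian_head_nll(
    h,
    rng.normal(size=(hidden, latent)) * 0.1,
    rng.normal(size=(hidden, latent)) * 0.1,
    target_vec=rng.normal(size=latent),
)
print(text_loss, image_loss)
```

Both functions return a negative log-likelihood for one position, so they can be summed over a mixed sequence without changing the attention mask or the cache logic.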

"There is no structural gap between AR flows and language models." — pixels are just continuous tokens.

Pretzel 🥨 — vertical interleaving

VLM stream ❄️
Frozen Qwen2.5-VL-7B-Instruct. Pretrained understanding kept intact (D1).

TARFlow stream 🔥
Trainable AR-flow over visual latents. High-fidelity continuous generation (D2).

Vertical residual skips
Cross-modal fusion at every position under one shared causal mask (D3).

$\mathcal{L}_\text{NLL} = \mathcal{L}_\text{NF}\ (\text{visual})\ +\ \lambda\,\mathcal{L}_\text{NTP}\ (\text{text})$

Why not MoT / BAGEL? — a negative result

Mode A: freeze VLM, train TARFlow branch only.
Generation collapses (left). TARFlow can't enter the VLM's KV-cache as reusable context — horizontal routing keeps streams blind to each other.

Mode B: fine-tune VLM jointly.
Understanding degrades sharply — MME drops to ~30. Pretrained VLM capabilities are washed out by the new visual loss.

Pretzel fixes both. Vertical residuals: VLM stays frozen (no understanding loss), TARFlow lives inside the same causal mask (true cross-modal context).

MoT-style generations under mode A — degenerate, mode-collapsed.

Three-stage curriculum

Stage 1 · T2I
Train TARFlow stream from scratch on ~800M text→image pairs. VLM, FAE frozen. Establish visual generation backbone.

Stage 2 · I2T adapter
Train only the visual adapter (zero-init) on ~200M image→text. Align FAE latents with the VLM's representation space.

Stage 3 · interleaved joint
Activate vertical skips and train all non-frozen components on ~80M interleaved examples (gen + edit + understand). Joint $\mathcal{L}_\text{NF}+\lambda\mathcal{L}_\text{NTP}$.

3.6B trainable params · Qwen2.5-VL-7B + FAE frozen throughout · zero-init visual adapter for safe cross-modal coupling.
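
The curriculum can be written down as a small configuration table. This is one reading of the slide, not the released training config: the module names (`tarflow_stream`, `visual_adapter`, `vertical_skips`) are hypothetical, the example counts are the approximate figures quoted above, and treating the skips as inactive before Stage 3 is an inference from "activate vertical skips".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    approx_examples: int
    trainable: tuple
    frozen: tuple
    vertical_skips: bool   # assumed off until Stage 3

CURRICULUM = (
    Stage("T2I", "text->image pairs", 800_000_000,
          trainable=("tarflow_stream",),
          frozen=("vlm", "fae"),
          vertical_skips=False),
    Stage("I2T adapter", "image->text pairs", 200_000_000,
          trainable=("visual_adapter",),
          frozen=("vlm", "fae", "tarflow_stream"),
          vertical_skips=False),
    Stage("interleaved joint", "gen + edit + understand", 80_000_000,
          trainable=("tarflow_stream", "visual_adapter", "vertical_skips"),
          frozen=("vlm", "fae"),
          vertical_skips=True),
)

for stage in CURRICULUM:
    print(f"{stage.name}: train {stage.trainable}, freeze {stage.frozen}")
```

Written this way, the invariant from the summary line is easy to check: the VLM and FAE appear in the frozen set of every stage.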

One causal stream — text and pixels in the same KV-cache

Interleaved text + image generation example

No re-encoding. Generated images are projected into the VLM embedding space and reused as ordinary tokens by later steps — text and visual positions share the cache.

No diffusion loop. Each visual position is a single AR Gaussian, not a 50-step ODE. Same per-token cost as language.

Mix freely. Text → image → text → image, in one rollout, in one cache, in one model. The dream from slide 2.

Understand and generate — same model

Understanding (Table 1)

Benchmark	STARFlow2
MME-P	1607.1
GQA	63.9
SEED	74.7
MMBench	82.2
MMMU	58.3
AI2D	79.3

Competitive with Janus / Show-o2 / TUNA — without losing the pretrained VLM.

Generation (Tables 2–3)

Benchmark	STARFlow2
GenEval	0.82
DPG-Bench	84.94
Resolution	256²
Trainable params	3.6 B

First unified model to hit D1 + D2 + D3 simultaneously — preserve VLM, continuous high-fidelity gen, single causal stream with shared KV-cache.

From prompts to edits — one model

Text-to-image · same weights, no separate generation head

Multi-turn editing · invertibility = encode & decode share the network

Back to the scorecard

✓ Density: Exact log-likelihood throughout (TARFlow → STARFlow → STARFlow-V → STARFlow2).
✓ AR-native: Same causal mask & KV-cache as language models; STARFlow2 makes this explicit.
✓ Continuous: Pixels stay in $\mathbb{R}^d$, with no codebook bottleneck and no quantization tax.
✓ Scalable: STARFlow at 1024², STARFlow-V to 30 s video, competitive with strong diffusion peers.
✓ Shared: One architecture, one training stack; language & vision under one mask.

TARFlow is an AR Transformer with a Gaussian head — same causal mask, same KV-cache, just a continuous next-token instead of a discrete one.

The same architecture extends to everything continuous

Audio
Speech & music: continuous, sequential, AR-native — a Pretzel-style audio stream interleaves naturally with text.

3D / 4D
Point clouds, dynamic point-maps, mesh latents — all continuous; the AR-flow head is modality-agnostic.

Action / control
Continuous robot trajectories & world-model states — same machinery, exact density for planning.

Reasoning
Continuous "thoughts" inside an LLM (NF-CoT) — a teaser for the deck-4 talk.

If the next-token can be a continuous Gaussian, the architecture stays the same — and a single AR Transformer covers everything from text to pixels to actions.

Future directions

On-policy RL in latent space
Tractable likelihood enables PPO-style training over continuous tokens — what was off-limits with diffusion.
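
As a hedged illustration of why a tractable likelihood matters here: the standard PPO clipped surrogate needs the ratio $p_\text{new}(x)/p_\text{old}(x)$ for each sampled action, and with a per-token Gaussian head that ratio is available exactly from log-probabilities. The sketch below uses a toy diagonal Gaussian policy over continuous tokens; the setup and names are illustrative and say nothing about how such training would actually be run.

```python
import numpy as np

def gaussian_log_prob(x, mu, sigma):
    """Exact log-density of a diagonal Gaussian, summed over the token dimension."""
    return -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=-1)

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

rng = np.random.default_rng(0)
batch, dim = 32, 8
mu_old, sigma = np.zeros(dim), np.ones(dim)
mu_new = mu_old + 0.05                                # slightly updated policy

tokens = mu_old + sigma * rng.normal(size=(batch, dim))   # on-policy samples from the old policy
advantage = rng.normal(size=batch)                        # placeholder advantages

objective = ppo_clip_objective(
    gaussian_log_prob(tokens, mu_new, sigma),
    gaussian_log_prob(tokens, mu_old, sigma),
    advantage,
)
print(objective)
```

When the new and old policies coincide the ratio is 1 everywhere and the objective reduces to the mean advantage, which is a quick way to test an implementation.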

Higher resolution & longer video
Push Pretzel + STARFlow-V toward 1024² + minutes-long generation, still single-stream.

Continuous reasoning
NF-based latent "thoughts" inside the LLM — bridge to the KnowledgeMR talk (deck 4).

End-to-end multimodal training
Drop the curriculum — single-stage joint training of VLM + flow, with cleaner scaling laws.

Other works in the NF line

NTM

Each reverse step = a conditional NF. 4-step generation with exact likelihood across the whole trajectory.

arXiv 2026

NFM

Distill an AR-NF's deterministic noise↔data coupling into a flow-matching student. Beats independent & OT couplings.

arXiv 2026

iTARFlow

End-to-end NF training + diffusion-style iterative denoising at sampling. Competitive on ImageNet 64 / 128 / 256.

ICML 2026

TarFlowLM

Continuous AR over text with a TARFlow-style Gaussian head — language modeling without the softmax bottleneck.

NeurIPS 2025

coming soon

NF-CoT

Continuous chain-of-thought via AR-NF — latent reasoning. Bridge to the KnowledgeMR talk.

in prep.

Thank you

Scalable Normalizing Flows for Visual & Multimodal Generation — one AR stream, exact likelihood, no quantization.

TARFlow (ICML'25) · STARFlow (NeurIPS'25) · STARFlow-V (CVPR'26) · FAE (CVPR'26) · STARFlow2 (arXiv'26)

scan for more

jiataogu.me

Jiatao Gu · GMLR · Penn
