Jiatao Gu
Talk @ T4V Workshop · Jun 3, 2026
Diffusion: iterative denoising. Many forward passes per sample. Continuous outputs in $\mathbb{R}^d$.
Autoregressive LLM: one forward pass per token. Discrete softmax. Causal mask, KV-cache, streaming.
A unified multimodal model has to understand images, generate images, and reason across text and images — under one set of weights, one training objective, one inference loop.

Each ✓ will be cashed in by a paper in the STARFlow family — we revisit this scorecard at the end.

Throughout, NFs kept exact likelihood — but lost ground to GANs & diffusion on sample quality. The Transformer revival changes the verdict.
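To make "exact likelihood" concrete: an affine autoregressive flow has a triangular Jacobian, so $\log p(x) = \sum_t \log \mathcal{N}(z_t; 0, 1) - \sum_t \alpha_t$ with $z_t = (x_t - \mu_t)\,e^{-\alpha_t}$. The following is a minimal sketch, assuming a toy hand-written `conditioner` (hypothetical) in place of a causal Transformer; it is not the TARFlow or STARFlow architecture.

```python
import math


def conditioner(prefix):
    """Toy stand-in (assumption) for a causal network: (mu_t, alpha_t) from x_<t."""
    if not prefix:
        return 0.0, 0.0
    mean = sum(prefix) / len(prefix)
    return 0.5 * mean, 0.1 * math.tanh(sum(prefix))


def forward(x):
    """Map data x to noise z and return (z, exact log-likelihood of x)."""
    z, log_det = [], 0.0
    for t in range(len(x)):
        mu, alpha = conditioner(x[:t])
        z.append((x[t] - mu) * math.exp(-alpha))
        # dz_t/dx_t = exp(-alpha_t); the Jacobian is triangular, so the
        # log-determinant is just the sum of these diagonal terms.
        log_det -= alpha
    log_prior = sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in z)
    return z, log_prior + log_det


def inverse(z):
    """Sample path: invert the flow one position at a time."""
    x = []
    for t in range(len(z)):
        mu, alpha = conditioner(x[:t])
        x.append(z[t] * math.exp(alpha) + mu)
    return x


x = [0.3, -1.2, 0.7, 2.0]
z, log_px = forward(x)
x_rec = inverse(z)
```

Here `inverse(forward(x))` recovers `x` up to floating-point error, and `log_px` is a true log-density rather than a bound.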



3.8B params (T2I) · 1.4B (class-cond) · DiT-VAE latent at $p=1$ · 1024 tokens for 256², up to 16384 for 1024².
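The token counts follow from simple arithmetic. A small sketch, assuming the VAE downsamples each spatial side by 8 (an assumption; the factor is not stated here, but it reproduces both quoted numbers at $p=1$):

```python
def num_tokens(image_size, vae_downsample=8, patch_size=1):
    """Sequence length for a square image.

    vae_downsample=8 is an assumed value, not taken from the slides.
    """
    stride = vae_downsample * patch_size
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by vae_downsample * patch_size")
    side = image_size // stride
    return side * side


tokens_256 = num_tokens(256)    # 32 x 32 latent grid
tokens_1024 = num_tokens(1024)  # 128 x 128 latent grid
```

Sequence length grows quadratically with resolution: a 4x larger side means 16x more tokens.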

Channel dim: SD-VAE 4 · VA-VAE 32 · RAE 1536 · FAE 32 — same compactness as VA-VAE, but built from a pretrained understanding model.

| ImageNet 256² · FID-50K | 80 ep | 800 ep |
|---|---|---|
| FAE-DINOv2-G + LightningDiT-XL | 2.08 | 1.48 |
| ↳ + CFG | 1.70 | 1.29 |

| STARFlow · ImageNet 256² | FID @ 400 ep |
|---|---|
| SD-VAE baseline | 4.51 |
| FAE (DINOv2-g/14) | 2.67 |

Diffusion CFG mixes scores; in an AR flow, every step is an explicit Gaussian, so we can mix the conditional and unconditional Gaussians directly:
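One way to make this mixing concrete, assuming the guided density is taken proportional to $p_c^{1+w} / p_u^{w}$ with $p_c = \mathcal{N}(\mu_c, \sigma_c^2)$ and $p_u = \mathcal{N}(\mu_u, \sigma_u^2)$. The product is again Gaussian, with $s = \sigma_c^2 / \sigma_u^2$, $\tilde\sigma = \sigma_c / \sqrt{1 + w - ws}$ and $\tilde\mu = \mu_c + \frac{ws}{1 + w - ws}(\mu_c - \mu_u)$. This is a sketch under that assumption; the function name and the per-dimension scalar form are illustrative.

```python
import math


def guided_gaussian(mu_c, sigma_c, mu_u, sigma_u, w):
    """Closed-form Gaussian proportional to p_c^(1+w) / p_u^w (assumed guidance rule)."""
    s = (sigma_c / sigma_u) ** 2
    denom = 1.0 + w - w * s
    if denom <= 0:
        # The guided precision would be non-positive: not a valid Gaussian.
        raise ValueError("guidance weight too large for these variances")
    mu = mu_c + (w * s / denom) * (mu_c - mu_u)
    sigma = sigma_c / math.sqrt(denom)
    return mu, sigma


# Equal variances reduce to the familiar mean extrapolation mu_c + w (mu_c - mu_u).
mu_eq, sigma_eq = guided_gaussian(1.0, 2.0, 0.0, 2.0, w=1.5)
# Unequal variances also rescale the standard deviation.
mu_ne, sigma_ne = guided_gaussian(1.0, 1.0, 0.0, 2.0, w=2.0)
```

With $w = 0$ the conditional Gaussian is returned unchanged; the check on `denom` guards the case where the guided density stops being normalizable.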
| ImageNet · FID-50K | 256² | 512² |
|---|---|---|
| DiT-XL (diffusion) | 3.60 | — |
| STARFlow (1.4B) | 2.40 | 3.00 |

| Text-to-image (CC12M) | GenEval | COCO FID-30K |
|---|---|---|
| STARFlow (3.8B) | 0.56 | 9.1 |


A single set of weights — controllability for free, because the network is its own encoder.



3.6B trainable params · Qwen2.5-VL-7B + FAE frozen throughout · zero-init visual adapter for safe cross-modal coupling.
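Why zero-init makes the coupling "safe": if the adapter's output projection starts at zero, the frozen VLM's hidden states pass through unchanged at step 0, and the visual branch only takes effect as that projection is trained. A minimal sketch, assuming a residual two-layer adapter; the class name, layer shapes, and ReLU are hypothetical and not taken from the STARFlow2 design.

```python
import random


def matvec(weights, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ZeroInitAdapter:
    """Hypothetical residual adapter: hidden + W_out @ relu(W_in @ visual)."""

    def __init__(self, visual_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Input projection: small random init.
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(visual_dim)]
                     for _ in range(hidden_dim)]
        # Output projection: all zeros, so the adapter adds nothing at init.
        self.w_out = [[0.0] * hidden_dim for _ in range(hidden_dim)]

    def __call__(self, hidden, visual):
        act = [max(0.0, v) for v in matvec(self.w_in, visual)]
        delta = matvec(self.w_out, act)
        return [h + d for h, d in zip(hidden, delta)]


adapter = ZeroInitAdapter(visual_dim=3, hidden_dim=4)
hidden = [0.1, -0.2, 0.3, 0.4]   # stand-in for a frozen VLM hidden state
visual = [1.0, 2.0, 3.0]         # stand-in for a visual latent
fused = adapter(hidden, visual)
```

At initialization `fused` equals `hidden` exactly, so the pretrained model's behavior is preserved until training moves `w_out` away from zero.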

| Understanding benchmark | STARFlow2 |
|---|---|
| MME-P | 1607.1 |
| GQA | 63.9 |
| SEED | 74.7 |
| MMBench | 82.2 |
| MMMU | 58.3 |
| AI2D | 79.3 |

Competitive with Janus / Show-o2 / TUNA, without losing the pretrained VLM.

| Generation benchmark / setting | STARFlow2 |
|---|---|
| GenEval | 0.82 |
| DPG-Bench | 84.94 |
| Resolution | 256² |
| Trainable params | 3.6B |






Scalable Normalizing Flows for Visual & Multimodal Generation — one AR stream, exact likelihood, no quantization.
TARFlow (ICML'25) · STARFlow (NeurIPS'25) · STARFlow-V (CVPR'26) · FAE (CVPR'26) · STARFlow2 (arXiv'26)
