Probabilistic Continuous Reasoning

Latent thoughts with a real density — and on-policy RL in latent space

Jiatao Gu

KnowledgeMR Workshop · Jun 4, 2026 · 11:00–11:40

Reasoning ≈ allocating compute between the question and the answer

Large language models often improve reasoning by generating explicit chain-of-thought (CoT) — and recent reasoning models (o1, R1, GRPO-trained CoT) confirm one thing: intermediate computation matters. Quality scales with the amount of work the model does between prompt and final answer.

prompt $\;\to\;$ $d_1, d_2, d_3, \ldots, d_l$ $\;\to\;$ answer

A typical CoT trace: dozens to hundreds of natural-language tokens that verbalize each intermediate step. The longer and richer this stream, the better the answer — up to a point.

If more intermediate computation helps, the question becomes: can we do that computation in a higher-bandwidth medium than text?

Latent reasoning — a higher-bandwidth alternative

Textual CoT forces this intermediate computation through a discrete, serial, communication-oriented token stream — each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed.

Discrete. Reasoning is forced through a vocabulary of words. Updates that are semantic — no single word fits — must still pick a token.
Serial. One verbalized step at a time. Updates that are uncertain — multiple paths plausible — must collapse to one before continuing.
Communication-oriented. The token stream is shaped to be read by humans / other agents. Updates that are only partially formed must still be communicated before the model proceeds.
Latent reasoning performs intermediate computation in compact continuous states before committing to text — same compute, higher bandwidth per step.

What we want from a latent reasoning trace

Three asks (one per constraint above):

Parallel / superposed (for uncertain updates). Multiple hypotheses carried inside one step.
Truly continuous (for semantic updates). Not pinned to a discrete vocabulary.
Internal-first, multi-modal (for partially formed updates). Can carry shape that isn't yet a sentence.

Five must-keeps from explicit CoT:

  • AR-native. Generate left-to-right; no diffusion loops.
  • Probabilistic. A real density on the latent trace.
  • RL-friendly. Policy gradient in latent space.
  • KV-cache. Same inference machinery as a discrete LLM.
  • Expressive. Richer than a one-token-per-step scalar.

We will revisit this scorecard at the end.

Coconut — hidden states as continuous thoughts

$h_i \;=\; \text{LLM-hidden}\!\big(q,\; h_{<i}\big)$ — pass the LLM's own last hidden state forward as the next "thought"
Figure: Q → h₁ → h₂ → h₃ → … → h_l → A. Deterministic: each h_i is a function of the past, not a sample.
Idea. Skip the verbalization step. Use the LLM's own last hidden state $h_t$ as the embedding for the next forward pass — the model "thinks" in its own representation space.
Why this is appealing. Continuous, not vocab-bound. Drop-in for any LLM — no architecture change. Trained by distilling from explicit CoT (or by manually scheduling "thought" tokens).
The catch. $h_t$ is a deterministic function of $(q,\,h_{<t})$. There is no distribution over thoughts — no alternative sample to draw, no entropy to optimize, no policy gradient to take.

The model can think between words — but cannot explore alternative thoughts.

Training Large Language Models to Reason in a Continuous Latent Space · S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, Y. Tian · arXiv 2412.06769

Soft Thinking — probability-weighted soft tokens

$s_t \;=\; \sum_{v \in V} \pi_\theta\!\big(v \,\big|\, q,\, s_{<t}\big)\,E[v]$ — pass the average embedding instead of a sampled token
Figure: π_θ(v) over vocab V (v₁ … v₆) is averaged into the soft token s_t, which is fed forward to produce s_{t+1}, …, s_T and then the answer A.
Idea. At each step, don't sample a discrete token — pass the full distribution's average embedding forward. Continuous in the embedding space; preserves uncertainty in the merge.
Why this is appealing. Drop-in for an existing LLM, no quantization, no schedule. Improves calibrated uncertainty.
The catch. $s_t$ is a deterministic function of the logits. The per-step policy distribution is collapsed to its mean — there is no sampling to do RL with.
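The determinism in "the catch" can be seen in a few lines. This is a minimal sketch in plain Python, assuming a toy 4-word vocabulary with 2-d embeddings; the names `softmax`, `soft_token`, `E`, and `logits` are illustrative and not taken from the Soft Thinking paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_token(logits, embeddings):
    """Probability-weighted average embedding: s_t = sum_v pi(v) * E[v]."""
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [sum(p * row[d] for p, row in zip(probs, embeddings)) for d in range(dim)]

# Toy values, chosen only for illustration (assumption, not from the source).
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
logits = [2.0, 1.0, 0.0, -1.0]

s_a = soft_token(logits, E)
s_b = soft_token(logits, E)
print(s_a == s_b)  # True: the same logits give the same soft token on every call
```

Nothing in `soft_token` draws a random number, so there is no sample whose log-probability a policy gradient could weight.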

Stochastic variants (Wu et al.; Butt et al., 2025) inject Gaussian / Gumbel noise into logits or hidden states — but the marginal density over the trace remains intractable.

Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space · Z. Zhang et al. · arXiv 2025

The remaining ingredient: a tractable distribution on the trace

Coconut. The hidden state $h_t$ is a function of $(q,\,h_{<t})$ — the trace is a single point, not a distribution.
Soft Thinking. The soft token $s_t$ is the mean of the vocab distribution — the per-step policy is integrated out, not sampled.
Stochastic variants. Noise injected into hiddens / logits introduces randomness, but the marginal density on the trace stays intractable.
$\nabla_\theta \mathbb{E}_{u \sim \pi_\theta}\!\big[r(u)\big] \;=\; \mathbb{E}\!\big[r(u)\,\nabla_\theta \log \pi_\theta(u\,|\,q)\big]$
A tractable $\log \pi_\theta(u\,|\,q)$ is exactly what on-policy RL on the latent trace needs — and it's the piece prior continuous-reasoning approaches haven't yet supplied.
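To make the role of $\log \pi_\theta$ concrete, here is a toy score-function estimator in plain Python. It is a sketch under stated assumptions: a 1-d Gaussian policy $u \sim \mathcal{N}(\theta, 1)$ and a made-up reward $r(u) = -(u-3)^2$, neither of which comes from the talk. The estimator only works because $\nabla_\theta \log \pi_\theta(u)$ can be evaluated for each sample.

```python
import random

def reward(u):
    """Toy reward peaked at u = 3 (hypothetical, for illustration only)."""
    return -(u - 3.0) ** 2

def grad_log_gaussian(u, theta):
    """d/dtheta of log N(u; theta, 1), which equals (u - theta)."""
    return u - theta

def score_function_gradient(theta, n_samples, rng):
    """Monte Carlo estimate of grad_theta E[r(u)] = E[r(u) * grad_theta log pi(u)]."""
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(theta, 1.0)  # on-policy sample from the current policy
        total += reward(u) * grad_log_gaussian(u, theta)
    return total / n_samples

rng = random.Random(0)
estimate = score_function_gradient(theta=0.0, n_samples=100_000, rng=rng)
# Analytic check: E[r] = -((theta - 3)^2 + 1), so the gradient at theta = 0 is 6.
exact = -2.0 * (0.0 - 3.0)
print(estimate, exact)
```

With a deterministic trace (Coconut, Soft Thinking) there is no `grad_log_gaussian` analogue to call, which is the gap the rest of the talk fills.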

We need probabilistic continuous reasoning

If reasoning is the place where compute is allocated, the latent trace has to be the thing we train with RL. That requires:

A real distribution on the trace. $\pi_\theta(u_{1:K}\,|\,q)$ is well-defined, samplable, and evaluable.
Tractable likelihood. $\log \pi_\theta(u\,|\,q)$ is computable, differentiable end-to-end.
AR-native sampling. Left-to-right, KV-cache compatible — same machinery as a discrete LLM.
Two complementary recipes for this — and a multimodal extension. Talk roadmap:
Multiplex Thinking: factorized discrete sampling + continuous merge
NF-CoT: truly continuous thoughts via normalizing flow
STARFlow2: multimodal reasoning in the same substrate

Multiplex Thinking — branch and merge per token

Multiplex branch and merge
Branch. At each step, sample K independent discrete tokens from the LM's logits.
Merge. Average their vocabulary embeddings into one continuous "multiplex token".
Feed forward. The merged token enters the next step. Sequence length unchanged; one continuous step ≈ K discrete paths.

Continuous in the embedding space — but each branch is a real discrete sample, so the per-step policy distribution is preserved. This is what makes Multiplex probabilistic, unlike Soft Thinking.

Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge · Y. Tang, L. Dong, Y. Hao, Q. Dong, F. Wei, J. Gu · arXiv 2026

The math: averaging in embedding space

$\quad k_{i,1},\,\ldots,\,k_{i,K} \;\sim\; \pi_\theta(\cdot\,|\,q,\,c_{<i})\quad\text{(K i.i.d. samples)}$
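$s_i \;=\; \dfrac{1}{K}\sum_{j=1}^{K} \mathbf{z}_{i,j},\qquad c_i \;=\; E^{\!\top}\!\big(s_i \odot w_i\big)$, where $\mathbf{z}_{i,j}$ is the one-hot vocabulary vector of sample $k_{i,j}$ and $w_i$ is a per-token weight vector.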
Reuse the vocab embedding $E$. Multiplex tokens live in the same space as discrete tokens — no separate decoder, no representation drift.
Optional LM-head reweighting. $w_i[v] = \mathbf{1}[\log \pi_\theta(v) > \tau]$ — upweight confident tokens, suppress noise.

Because the K samples are honest draws (not a deterministic merge à la Soft Thinking), the LM-head policy distribution $\pi_\theta(\cdot\,|\,q,c_{<i})$ is unchanged from a standard LLM.
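The branch-and-merge step can be sketched in plain Python. This is an illustration only: it assumes a toy 4-word vocabulary, 2-d embeddings, and $w_i$ fixed to all ones (the optional reweighting is left out); `sample_token`, `multiplex_token`, and the toy values are hypothetical names, not the authors' code.

```python
import random

def sample_token(probs, rng):
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def multiplex_token(probs, embeddings, K, rng):
    """Branch: K i.i.d. samples. Merge: average their one-hots, then map through E."""
    vocab = len(probs)
    samples = [sample_token(probs, rng) for _ in range(K)]
    s = [0.0] * vocab                      # s_i: mean of the K one-hot vectors
    for k in samples:
        s[k] += 1.0 / K
    dim = len(embeddings[0])
    c = [sum(s[v] * embeddings[v][d] for v in range(vocab)) for d in range(dim)]
    return samples, s, c                   # c_i = E^T s_i (w_i assumed all ones)

# Toy values, chosen only for illustration.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

rng = random.Random(0)
print(multiplex_token(peaked, E, K=3, rng=rng))
print(multiplex_token(flat, E, K=3, rng=rng))
```

The sampled ids are returned alongside the merged vector because their log-probabilities are what the rollout density is built from.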

Tractable rollout density $\;\to\;$ on-policy RL

$P(c) \;=\; \prod_{i=1}^{L}\,\prod_{j=1}^{K} P\!\big(k_{i,j} \,\big|\, q,\,c_{<i}\big)$ — multiplex rollout factorizes into independent token samples.
Standard policy gradient (GRPO). $J_{\text{RL}}(\theta) = \mathbb{E}\!\big[\big(\log \pi_\theta(c\,|\,q) + \log \pi_\theta(y\,|\,q,\,c)\big)\,v(y, y^\star)\big]$, where $\pi_\theta(c\,|\,q)$ is the rollout density $P(c)$ above. No combinatorial blow-up.
Self-adaptive. Peaked logits → all K samples agree → near-discrete CoT. Flat logits → diverse samples → multi-hypothesis branching. No knob to tune.

This is the first form of probabilistic continuous reasoning we will see — tractability via discrete-sample factorization. Entropy per multiplex token: $H(K_i) = K \cdot H(\pi_\theta)$.
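A small numeric check of the two formulas on this slide, written as a hedged sketch: the per-step distributions are fixed toy lists (in a real model they would depend on the earlier multiplex tokens), and the function names are illustrative.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rollout_log_prob(step_probs, step_samples):
    """log P(c) = sum over steps i and branches j of log pi(k_ij | q, c_<i)."""
    total = 0.0
    for probs, samples in zip(step_probs, step_samples):
        total += sum(math.log(probs[k]) for k in samples)
    return total

def joint_entropy_of_k_draws(probs, K):
    """Brute-force entropy of the K-tuple of i.i.d. draws, to compare with K * H."""
    h = 0.0
    for combo in product(range(len(probs)), repeat=K):
        p = math.prod(probs[v] for v in combo)
        if p > 0.0:
            h -= p * math.log(p)
    return h

# Toy rollout: 2 steps, K = 3 samples per step (values are assumptions).
step_probs = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
step_samples = [[0, 0, 1], [1, 0, 1]]

log_p = rollout_log_prob(step_probs, step_samples)
print(log_p)
print(joint_entropy_of_k_draws(step_probs[0], 3), 3 * entropy(step_probs[0]))
```

The cost of `rollout_log_prob` is linear in the number of steps times K, which is the sense in which the density stays tractable.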

Consistent gains over Discrete RL on math reasoning

Multiplex Table 1
+6.0 pp on AMC'23 (7B)
vs same-recipe Discrete RL (44.7 → 50.7).
MATH-500 (7B): 78.0
vs 74.1 Discrete RL — same GRPO training, only the substrate changes.
AIME 2025 (7B): 19.7
vs 17.1 Discrete RL: +2.6 pp, the largest relative gain of the three (about 15%).

Widening gap on hard tasks (pass@1 → pass@1024)

Pass@k scaling
AIME 2025 (7B): Multiplex ≈ 55% at k=1024 vs Discrete RL ≈ 40%.
OlympiadBench (7B): Multiplex pulls ahead as k grows.
MATH-500: both methods saturate near the ceiling.

Branch-and-merge unlocks trajectories with negligible discrete probability — the bigger the sample budget, the bigger the advantage.
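For readers who want to reproduce pass@k curves, a common choice is the unbiased estimator from the Codex evaluation paper (Chen et al., 2021). The slides do not say which estimator was used, so treating it as the one behind these plots is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 1024 samples, only 3 of them correct.
for k in (1, 16, 128, 1024):
    print(k, pass_at_k(n=1024, c=3, k=k))
```

A problem solved by only a handful of samples contributes almost nothing at k=1 but approaches 1 at large k, which is why rare-but-correct trajectories show up as a widening gap.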

Shorter trajectories at equal accuracy

Token efficiency
Multiplex-I-4k beats Discrete CoT-5k. 4096-token Multiplex output averages 40.5% across 6 benchmarks vs 35.8% for a 5120-token discrete baseline.
Why? Each multiplex token carries up to K=3 paths' worth of information. The model commits less prematurely — the trace covers more without growing in length.

Continuous reasoning isn't just "as good as text" at fixed compute — it goes further.

One free parameter: branch width K

K ablation
K=1 → K=2: big jump. AMC'23: 44.7 → 49.6 (+4.9 pp) — breaking the single-token bottleneck.
K=2 → K=3: small gain. Most value already captured.
K=3 sweet spot. Beyond, diminishing returns; 3 independent samples ≈ enough to encode soft branching.

Adaptive without a schedule: the model commits when confident, branches when uncertain — a single K=3 setting handles both regimes.
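The "commits when confident, branches when uncertain" behaviour follows directly from i.i.d. sampling, and can be checked in closed form. The sketch below assumes toy 4-way distributions; the helper names are illustrative.

```python
def prob_all_agree(probs, K):
    """Probability that K i.i.d. draws all pick the same token: sum_v p_v^K."""
    return sum(p ** K for p in probs)

def expected_distinct(probs, K):
    """Expected number of distinct tokens among K i.i.d. draws."""
    return sum(1.0 - (1.0 - p) ** K for p in probs)

# Toy distributions, chosen only for illustration.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

for name, probs in (("peaked", peaked), ("flat", flat)):
    print(name, prob_all_agree(probs, K=3), expected_distinct(probs, K=3))
```

With peaked logits the three samples almost always coincide, so the merged token is effectively a discrete token; with flat logits they almost never do, so the merged token mixes several candidates.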

What does a multiplex trajectory look like?

Multiplex trajectory
Plain text spans — when all K=3 samples agree (peaked logits): the trajectory reads like standard CoT.
Colored boxes — at decision points (flat logits): K=3 distinct continuations get aggregated into one continuous step.
Adaptive without a schedule — the model commits when confident, branches when uncertain. Nothing in the prompt or sampler forces this behavior.

Multiplex is probabilistic — but still vocab-bound

What works. Discrete sampling + continuous merge gives us a tractable rollout density. The "uncertain" diagnosis (the serial constraint of textual CoT) is addressed.
What's left. Each branch is still a discrete word with a fixed embedding $E[k]$. The continuous step is a mixture over the discrete vocabulary — it can't represent meanings between vocabulary words.
The "discrete" diagnosis is still open. Semantic updates that no word fits still have no native representation.
Can we model the reasoning latent as a truly continuous random variable on $\mathbb{R}^d$ — with its own density, AR sampling, and tractable likelihood — instead of a soft mixture over vocabulary?

One natural answer: LaDiR — diffusion over latent thoughts

Idea. Model the continuous reasoning trace $u_{1:K} \in \mathbb{R}^{K \times d}$ with a diffusion process. The marginal $p_\theta(u\,|\,q)$ is well-defined and continuous on $\mathbb{R}^d$.
What it solves. Truly continuous (not vocab-bound), distributional (samples differ across runs), expressive — checks several boxes that prior approaches missed.
Cost: iterative denoising. ~30 denoising steps at inference, ~10 at training. Slow when reasoning lives in the inner loop of an LLM.
Cost: implicit likelihood. Diffusion gives marginal density only via expensive ELBO bounds — not the direct $\log \pi_\theta(u\,|\,q)$ that on-policy RL wants.
Cost: not AR-native. Diffusion is non-causal — no KV-cache, no left-to-right, no drop-in fit with discrete LLM decoding.
Continuous and probabilistic — but at the wrong cost. We want the same continuous-distribution structure with tractable likelihood and AR-native sampling.
LaDiR: Latent Diffusion Reasoning · arXiv 2025

Normalizing flows in one slide

$x \;\;\underset{\textstyle f^{-1}}{\overset{\textstyle f}{\rightleftarrows}}\;\; z \sim \mathcal{N}(0,\,I)$

A normalizing flow is a single invertible network $f$ mapping data $x$ to standard noise $z$.

$\log p(x) \;=\; \log p_0\!\big(f(x)\big) \;+\; \log\!\Big|\det \tfrac{\partial f(x)}{\partial x}\Big|$ — change of variables
Exact likelihood. Train by MLE — same model gives both samples and density.
Invertible. $x \leftrightarrow z$ is lossless; no information discarded, no schedule.
End-to-end. No noise schedule, no iterative denoiser, no separate decoder.

For reasoning, $x$ will be a continuous thought $u$ — and the same machinery scores, samples, and trains it.
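The change-of-variables formula can be checked on the simplest possible flow. The sketch below assumes a 1-d affine map $f(x) = (x - b)/a$ with $a > 0$; the class name `AffineFlow` and its parameters are illustrative. Its exact log-density should coincide with the closed-form Gaussian $\mathcal{N}(b, a^2)$.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def standard_normal_logpdf(z):
    """log N(z; 0, 1)."""
    return -0.5 * (z * z + LOG_2PI)

class AffineFlow:
    """1-d invertible map f(x) = (x - shift) / scale with scale > 0."""

    def __init__(self, shift, scale):
        if scale <= 0.0:
            raise ValueError("scale must be positive for the map to be invertible")
        self.shift = shift
        self.scale = scale

    def forward(self, x):
        return (x - self.shift) / self.scale

    def inverse(self, z):
        return self.shift + self.scale * z

    def log_prob(self, x):
        """log p(x) = log p0(f(x)) + log|df/dx|, where df/dx = 1 / scale."""
        return standard_normal_logpdf(self.forward(x)) - math.log(self.scale)

    def sample(self, rng):
        return self.inverse(rng.gauss(0.0, 1.0))

flow = AffineFlow(shift=2.0, scale=0.5)
x = flow.sample(random.Random(0))
print(x, flow.log_prob(x))
```

The same object both samples (`inverse` of noise) and scores (`log_prob`), which is the property the later slides rely on.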

TARFlow — AR normalizing flows are AR Transformers

TARFlow architecture
$z_n \;=\; \dfrac{x_n - \mu_\theta(x_{<n})}{\sigma_\theta(x_{<n})},\qquad x_n = \mu_\theta(x_{<n}) + \sigma_\theta(x_{<n})\,z_n$
Same causal mask. Left-to-right, no future leakage.
Same KV-cache. Token-by-token decoding, single forward pass.
Only the head differs. Discrete: token logits. Continuous: $(\mu, \sigma)$ — Gaussian per step.

TARFlow showed that a stand-alone normalizing flow, run autoregressively over continuous image tokens, can match diffusion quality.
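The affine autoregressive step above can be sketched in plain Python on scalar tokens. The `conditioner` below is a hand-written, hypothetical stand-in for the causal Transformer that would produce $(\mu_\theta, \sigma_\theta)$ from $x_{<n}$; only the encode/decode structure mirrors the equations.

```python
import math

def conditioner(prefix):
    """Hypothetical stand-in for the causal network: (mu, sigma) from x_<n."""
    if not prefix:
        return 0.0, 1.0
    mu = 0.5 * sum(prefix) / len(prefix)
    sigma = math.exp(0.3 * math.tanh(prefix[-1]))  # always positive
    return mu, sigma

def encode(x):
    """x -> z and log|det J|; each z_n depends only on x_<=n."""
    z, log_det = [], 0.0
    for n, x_n in enumerate(x):
        mu, sigma = conditioner(x[:n])
        z.append((x_n - mu) / sigma)
        log_det -= math.log(sigma)  # triangular Jacobian: diagonal entries are 1 / sigma
    return z, log_det

def decode(z):
    """z -> x, left to right: x_n = mu(x_<n) + sigma(x_<n) * z_n."""
    x = []
    for z_n in z:
        mu, sigma = conditioner(x)
        x.append(mu + sigma * z_n)
    return x

def log_prob(x):
    """Exact log-likelihood: standard-normal log-density of z plus the log-determinant."""
    z, log_det = encode(x)
    base = sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)
    return base + log_det

x = [0.3, -1.2, 0.8, 2.0]
z, _ = encode(x)
print(decode(z))
print(log_prob(x))
```

Because the Jacobian is triangular, its log-determinant is just the sum of $-\log \sigma$ terms, so the likelihood costs no more than one pass over the sequence.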

TARFlow: Normalizing Flows are Capable Generative Models · S. Zhai, R. Zhang, P. Nakkiran, D. Berthelot, J. Gu, et al. · ICML 2025 (Oral)

STARFlow — scaling AR-NFs to high-resolution images

STARFlow architecture
Deep–shallow design. One deep block ($f_D$, 18 layers) carries the global AR capacity; a few shallow blocks ($f_S$, 2 layers each) refine local detail.
Latent space, not pixels. Continuous DiT-VAE latents — invertibility carries through, no quantization.
First NF at 512² / 1024² text-conditional, approaching diffusion sample quality, with exact log-likelihood end-to-end.

For reasoning we don't need 1024² — but the deep–shallow recipe is what NF-CoT borrows to slot the NF head into a regular LLM.

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis · J. Gu, T. Chen, D. Berthelot, H. Zheng, Y. Wang, R. Zhang, L. Dinh, M. Bautista, J. Susskind, S. Zhai · NeurIPS 2025 (Spotlight)

NF-CoT — slot a TARFlow head inside the LLM

Continuous thoughts $u_{1:K}$ are generated autoregressively by an NF head; answer tokens $x_{1:N}$ are generated by the LM head. Both share the same causal stream and KV-cache.

$[\,\text{prompt};\;\langle\text{BOT}\rangle;\;u_1, u_2, \ldots, u_K;\;x_1, x_2, \ldots, x_N\,]$
One forward pass. No iterative denoising, no two-stage decoder, no separate model for the latents.
  • Truly continuous — the latent $u_t$ lives in $\mathbb{R}^d$, not in a discrete-vocab simplex.
  • Tractable likelihood — change-of-variables gives us $\log p(u\,|\,q)$ exactly.
  • RL-friendly — policy gradient on continuous latents, single update.
  • KV-cache compatible — drop-in for existing LLMs.

All five scorecard items satisfied — no longer vocab-bound, no longer iterative.

Latent Reasoning with Normalizing Flows · G. Tu, X. Fu, S. Yu, Y. Tang, H. Kang, L. Qin, Y. Zhang, J. Gu · In Submission · UPenn / GMLR-Penn

Architecture: VAE encoder + shallow flow + unified causal stream

NF-CoT architecture
1. Frozen VAE encoder. Compresses an explicit-CoT trace $d_{1:l}$ into K continuous codes $e_{1:K}$ — a compact rationale.
2. Shallow flow blocks $F_\theta$. 5 MetaBlocks map $e \to u$ with tractable $\log\det J$. Stage 1: backbone frozen.
3. Unified LLM stream. Same backbone reads prompt + $u_{1:K}$ + answer; NF head for thoughts, LM head for tokens.

Joint loss: $\mathcal{L} = \lambda_{\text{flow}}\,\mathcal{L}_{\text{flow}} + \lambda_{\text{text}}\,\mathcal{L}_{\text{text}}$ with both $\lambda = 1.0$.

The three wins from a continuous flow

$\log p_\theta(e_{1:K}\,|\,q) \;=\; \log p_\theta(u_{1:K}\,|\,q) \;+\; \log\!\big|\det J_{F_\theta}(e\,|\,q)\big|$
1. Tractable likelihood on $\mathbb{R}^d$. Change-of-variables — both terms exact, differentiable. The continuous thought has a real density.
2. AR-native sampling. Causal Gaussian per step, parameterized by LLM hidden states — no diffusion loops, single forward pass.
3. RL on continuous latents. Policy gradient over $\log p(u) + \log p(\text{ans})$ jointly — single GRPO update, latent + token aligned.

Multiplex got tractable RL via discrete-sample factorization; NF-CoT gets it via change-of-variables. Different mechanisms, both honestly probabilistic.

Inference: left-to-right, switch heads at <BOT>

  1. Read prompt through the LLM, build initial KV-cache.
  2. Sample continuous thoughts. NF head emits $\tilde{u}_t = \mu_\theta + \sigma_\theta \cdot z_t$ left-to-right; each $\tilde{u}_t$ enters the cache.
  3. Switch to LM head at end-of-thoughts; decode answer tokens $\hat{x}_n$ reusing the cache from steps 1–2.
No second pass. Continuous thoughts and answer share one decoder run — exactly the cost of a discrete CoT trace, but compressed.
Shallow blocks unused at inference. They served only to align $u$ with explicit-CoT distributions during training.
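The control flow of the three inference steps can be sketched with toy stand-ins. Everything below is hypothetical scaffolding: `ToyBackbone`, `nf_head`, and `lm_head` are placeholders for the real LLM and heads, the "cache" is just a list, and the number of thoughts is fixed in advance rather than decided by an end-of-thoughts signal.

```python
import random

class ToyBackbone:
    """Hypothetical stand-in for the LLM; the 'KV-cache' is a list of consumed inputs."""

    def __init__(self):
        self.cache = []

    def step(self, item):
        """Consume one input (token or continuous thought) and return a fake hidden state."""
        self.cache.append(item)
        return 0.1 * len(self.cache)

def nf_head(hidden):
    """Hypothetical NF head: per-step Gaussian parameters (mu, sigma)."""
    return hidden, 0.5

def lm_head(hidden):
    """Hypothetical LM head: a deterministic token id from the hidden state."""
    return int(round(10 * hidden)) % 7

def generate(prompt, num_thoughts, num_answer_tokens, rng):
    model = ToyBackbone()
    hidden = 0.0
    for token in list(prompt) + ["<BOT>"]:   # step 1: read prompt, build the cache
        hidden = model.step(token)
    thoughts = []
    for _ in range(num_thoughts):            # step 2: sample u_t = mu + sigma * z_t
        mu, sigma = nf_head(hidden)
        u = mu + sigma * rng.gauss(0.0, 1.0)
        thoughts.append(u)
        hidden = model.step(u)               # each thought enters the same cache
    answer = []
    for _ in range(num_answer_tokens):       # step 3: switch to the LM head
        token = lm_head(hidden)
        answer.append(token)
        hidden = model.step(token)
    return thoughts, answer, model.cache

thoughts, answer, cache = generate(["what", "is", "2+2"], 4, 3, random.Random(0))
print(len(cache))  # 11 = 3 prompt tokens + <BOT> + 4 thoughts + 3 answer tokens
```

Every position is consumed exactly once by `step`, which is the toy analogue of "no second pass".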

2.70× faster latent generation, 1.92× faster end-to-end vs LaDiR diffusion.

+20.4 pp pass@1 over the base, across 5 code benchmarks

NF-CoT Table 1
Average pass@1: 59.6 → 80.0 (Qwen3-8B-Base).
vs LaDiR (diffusion): +7.1 pp at half the cost.
Biggest jumps: MBPP+ 53.8 → 85.0; LiveCodeBench v6 42.0 → 66.8.

Same 8B backbone for all methods. NF-CoT adds the NF head + 5 shallow flow blocks; rest of LLM is the same.

Pass@k scaling — NF-CoT dominates across the budget

Pass@k scaling on MBPP+ and HumanEval+
MBPP+ (k=128): NF-CoT 87.5 vs LaDiR 80.0 vs Base 72.1.
HumanEval+ (k=128): NF-CoT 97.5 vs LaDiR 90.2 vs Base 86.0.
Continues to climb. The gap to LaDiR widens with k — continuous probabilistic latents have more room to explore than diffusion's iterative denoiser.

A real density on the trace ⇒ honest exploration ⇒ scaling with sample budget.

Faster than diffusion-based latent reasoning

NF-CoT efficiency vs LaDiR
2.70× faster latent generation: 173.5s vs 468.2s on HumanEval (16 candidates / problem).
2.48× cheaper per sample: 19.9T vs 49.3T FLOPs.
1.92× faster end-to-end: 325.6s vs 625.3s, including answer decoding.

Single AR forward pass replaces ~30 denoising steps of LaDiR — exactly what dropping the iterative loop buys.
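The three headline ratios follow directly from the raw measurements quoted on this slide; the short check below recomputes them (the dictionary layout is just an illustrative way to hold the numbers).

```python
# (NF-CoT, LaDiR) measurements as quoted on the slide.
measurements = {
    "latent generation (s)": (173.5, 468.2),
    "FLOPs per sample (T)": (19.9, 49.3),
    "end-to-end (s)": (325.6, 625.3),
}

# Ratio of LaDiR cost to NF-CoT cost, rounded to two decimals.
ratios = {name: round(ladir / nf_cot, 2) for name, (nf_cot, ladir) in measurements.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The end-to-end ratio is smaller than the latent-generation ratio because answer decoding costs the same for both methods and dilutes the saving.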

From thoughts to multimodal reasoning

NF-CoT slotted a TARFlow head into an LLM and reasoned with continuous thoughts. The same AR-NF machinery — same causal mask, same KV-cache — has already been shown to model continuous images (TARFlow, STARFlow).

Same substrate, two roles. A continuous AR token can be a thought or a pixel. Both have tractable density. Both train end-to-end.
Multimodal reasoning. A single AR-NF stream can interleave thoughts and images — the model "thinks out loud" with text and pictures, in one cache, in one forward pass.
If continuous tokens can be thoughts and pixels, the natural next question is: can a single AR-NF model understand, reason, and generate across modalities — under one probabilistic objective?

STARFlow2 / Pretzel.

STARFlow2 / Pretzel 🥨 — one AR-NF stream for text + pixels

Pretzel architecture

A single causal model that understands, reasons over, and generates continuous images via the same AR mechanism — same mask, same KV-cache, only the output head changes.

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation · Y. Shen, T. Chen, Y. Gao, Y. Zhang, Y. Wang, M. Bautista, S. Zhai, J. Susskind, J. Gu · arXiv 2026

Vertical interleaving: VLM stream + TARFlow stream, one mask

$\hat{c}_t \;=\; u_t + W_{\text{vlm}} \cdot y^{\text{vlm},\,t}$ — visual skip

The frozen VLM contributes contextual semantics; the trainable TARFlow contributes high-fidelity continuous prediction. Residual skips at every position; both streams share one left-to-right mask.

Cache-friendly. Text tokens and continuous visual latents flow through the same KV-cache. No diffusion loops, no re-encoding.
"AR normalizing flows are AR Transformers — sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs. The only difference is the output head."
— STARFlow2 paper, p. 13

Three-stage curriculum: T2I → I2T → interleaved

3-stage pipeline
Stage 1: T2I pretrain. Train TARFlow + 2 shallow blocks on 800M text–image pairs. VLM frozen.
Stage 2: I2T adapter align. Train visual adapter only on 200M image-to-text. Shallow blocks + VLM frozen.
Stage 3: interleaved joint. Activate vertical skips on 80M mixed data. Joint loss $\mathcal{L}_{\text{NF}} + \lambda \mathcal{L}_{\text{NTP}}$.

Stage 1 → Stage 3: GenEval 0.51 → 0.82 (+60.8% relative). Joint training doesn't degrade T2I; it improves it.

Unified — and competitive on both axes

Multimodal understanding (Table 1)

Understanding

Text-to-image generation (Table 2, GenEval)

GenEval
GenEval 0.82 — matches FLUX, BAGEL.
MMBench 71.5 vs 83.8 for the baseline: understanding is retained but 12.3 points lower, given the 256×256 FAE constraint.
Honest line. Not SOTA on every benchmark — but unified, in one causal stream, with one objective.

Edit, understand, reason, generate — one stream

STARFlow2 capabilities teaser
Image reasoning (right panel). The model is asked a visual MCQ, "thinks out loud" with text and a re-inspected image inside the same trace, then concludes — continuous thoughts and continuous pixels woven together.
Same machinery, four capabilities. Editing, understanding, generation, and reasoning all share one causal mask, one KV-cache, one objective. Visual reasoning is the most novel piece — but not a separate model.

Reasoning that mixes modalities before it is communicated: exactly the "internal-first / partially formed" property we asked for at the start.

Probabilistic continuous reasoning — scorecard

Two complementary recipes for tractable continuous reasoning, plus the same machinery extended to multimodal reasoning (and next, actions).

Open questions & future directions

Beyond verifiable rewards. From math / code (unit tests) to scientific reasoning, planning, and open-ended dialogue.
Thoughts + pixels + actions. Same probabilistic substrate, extended to 3D and physical decision-making.
End-to-end latent reasoning. Today the latent space is fixed by a frozen VAE distilled from explicit CoT. Learn the latent space and the reasoning policy jointly — let the geometry evolve to whatever helps reasoning most.
Self-improving loops. Tractable density on the trace ⇒ score and reweight your own thoughts. Bootstrap better policies without external reward — the model trains itself.

The team — GMLR @ Penn

GMLR-Penn team members

Always recruiting — PhD students, interns, and visiting collaborators. multipath.github.io

Thank you

Probabilistic continuous reasoning — a real density, end-to-end RL.

Multiplex (Tang et al., arXiv'26) · STARFlow2 (Shen et al., arXiv'26) · TARFlow (ICML'25 Oral) · STARFlow (NeurIPS'25 Spotlight)
NF-CoT (Tu, Fu et al., UPenn / GMLR-Penn) — coming soon

https://multipath.github.io
Jiatao Gu · GMLR · Penn