Latent thoughts with a real density — and on-policy RL in latent space
Jiatao Gu
KnowledgeMR Workshop · Jun 4, 2026 · 11:00–11:40
Large language models often improve reasoning by generating explicit chain-of-thought (CoT) — and recent reasoning models (o1, R1, GRPO-trained CoT) confirm one thing: intermediate computation matters. Quality scales with the amount of work the model does between prompt and final answer.
A typical CoT trace: dozens to hundreds of natural-language tokens that verbalize each intermediate step. The longer and richer this stream, the better the answer — up to a point.
Textual CoT forces this intermediate computation through a discrete, serial, communication-oriented token stream — each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed.
Three asks (one per constraint above):
Five must-keeps from explicit CoT:
We will revisit this scorecard at the end.
The model can think between words — but cannot explore alternative thoughts.
Stochastic variants (Wu et al.; Butt et al., 2025) inject Gaussian / Gumbel noise into logits or hidden states — but the marginal density over the trace remains intractable.
If reasoning is the place where compute is allocated, the latent trace has to be the thing we train with RL. That requires:

Continuous in the embedding space — but each branch is a real discrete sample, so the per-step policy distribution is preserved. This is what makes Multiplex probabilistic, unlike Soft Thinking.
Because the K samples are honest draws (not a deterministic merge à la Soft Thinking), the LM-head policy distribution $\pi_\theta(\cdot\,|\,q,c_{<i})$ is unchanged from a standard LLM.
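This is the first form of probabilistic continuous reasoning we will see: tractability via discrete-sample factorization. Because the K draws are independent samples from the same per-step policy, the entropy per multiplex token is $H(K_i) = K \cdot H(\pi_\theta(\cdot\,|\,q,c_{<i}))$.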
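The discrete-sample factorization can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the five-token vocabulary, and in particular the merge rule (a plain average of the sampled embeddings), which may differ from what Multiplex actually uses. The sketch shows only that K independent draws give a log-probability that is a sum of K per-sample terms, and that the entropy of the ordered K-tuple is K times the per-step entropy.

```python
import numpy as np

def sample_multiplex_token(probs, embeddings, k, rng):
    """Draw k i.i.d. token ids from `probs` and merge their embeddings.

    The merge (a plain average) is an illustrative assumption.
    Returns (merged_embedding, token_ids, log_prob), where log_prob is the
    sum of the k per-sample log-probabilities.
    """
    token_ids = rng.choice(len(probs), size=k, replace=True, p=probs)
    merged = embeddings[token_ids].mean(axis=0)
    log_prob = float(np.log(probs[token_ids]).sum())
    return merged, token_ids, log_prob

def entropy(probs):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
vocab, dim, k = 5, 4, 3
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # toy per-step policy
embeddings = rng.normal(size=(vocab, dim))       # toy embedding table

merged, ids, logp = sample_multiplex_token(probs, embeddings, k, rng)

# Joint distribution over ordered k-tuples of independent draws.
joint = probs
for _ in range(k - 1):
    joint = np.multiply.outer(joint, probs)

print(entropy(joint.ravel()), k * entropy(probs))  # the two values agree
```

The summed log-probability is the quantity an on-policy RL objective would need for each multiplex step.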


Branch-and-merge unlocks trajectories with negligible discrete probability — the bigger the sample budget, the bigger the advantage.

Continuous reasoning isn't just "as good as text" at fixed compute — it goes further.

Adaptive without a schedule: the model commits when confident, branches when uncertain — a single K=3 setting handles both regimes.

A normalizing flow is a single invertible network $f$ mapping data $x$ to standard noise $z$.
For reasoning, $x$ will be a continuous thought $u$ — and the same machinery scores, samples, and trains it.
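As a minimal sketch of how one invertible map can both score and sample, consider a single elementwise affine flow. This is the smallest possible instance, not the architecture of any model named in the talk; the function names and the toy parameters are assumptions. The density comes from the change-of-variables rule, $\log p(x) = \log \mathcal{N}(f(x); 0, I) + \log|\det \partial f / \partial x|$.

```python
import numpy as np

def flow_forward(x, shift, log_scale):
    """Map data x to noise z; also return log|det dz/dx|."""
    z = (x - shift) * np.exp(-log_scale)
    log_det = -float(np.sum(log_scale))
    return z, log_det

def flow_inverse(z, shift, log_scale):
    """Map noise z back to data x (used for sampling)."""
    return z * np.exp(log_scale) + shift

def log_density(x, shift, log_scale):
    """Exact log p(x) via change of variables under a standard normal prior."""
    z, log_det = flow_forward(x, shift, log_scale)
    log_pz = -0.5 * float(np.sum(z ** 2)) - 0.5 * z.size * np.log(2 * np.pi)
    return log_pz + log_det

# Toy parameters (hypothetical values).
shift = np.array([0.5, -1.0, 2.0])
log_scale = np.array([0.1, -0.3, 0.7])

rng = np.random.default_rng(0)
z_sample = rng.normal(size=3)
u = flow_inverse(z_sample, shift, log_scale)   # sample a "thought" u
score = log_density(u, shift, log_scale)       # score the same u exactly
```

Sampling uses the inverse map and scoring uses the forward map, so both come from the same parameters.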

TARFlow showed that a stand-alone autoregressive NF on continuous images can match diffusion quality.

For reasoning we don't need 1024² — but the deep–shallow recipe is what NF-CoT borrows to slot the NF head into a regular LLM.
Continuous thoughts $u_{1:K}$ are generated autoregressively by an NF head; answer tokens $x_{1:N}$ are generated by the LM head. Both share the same causal stream and KV-cache.
All five scorecard items satisfied — no longer vocab-bound, no longer iterative.

Joint loss: $\mathcal{L} = \lambda_{\text{flow}}\,\mathcal{L}_{\text{flow}} + \lambda_{\text{text}}\,\mathcal{L}_{\text{text}}$ with $\lambda_{\text{flow}} = \lambda_{\text{text}} = 1.0$.
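In code, the weighting amounts to one line. The sketch below assumes both terms are already scalar negative log-likelihoods (the flow term over the continuous thoughts, the text term over the answer tokens); the function name and the example values are hypothetical.

```python
def joint_loss(flow_nll, text_nll, lambda_flow=1.0, lambda_text=1.0):
    """Weighted sum of the flow and text losses; both weights default to 1.0."""
    return lambda_flow * flow_nll + lambda_text * text_nll

# Hypothetical per-batch values, for illustration only.
loss = joint_loss(flow_nll=2.5, text_nll=1.5)
```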
Multiplex got tractable RL via discrete-sample factorization; NF-CoT gets it via change-of-variables. Different mechanisms, both honestly probabilistic.
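Either mechanism yields one scalar log-probability per latent trace, which is what a score-function policy-gradient objective consumes. The sketch below is a generic group-normalized surrogate in the spirit of GRPO, not the training recipe of either method; the function names and toy numbers are assumptions, and in practice the log-probabilities would live in an autodiff framework so that gradients flow through them.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled traces."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def policy_gradient_surrogate(trace_log_probs, rewards):
    """Surrogate loss: minus the mean of advantage times trace log-probability."""
    adv = group_normalized_advantages(rewards)
    return float(-(adv * np.asarray(trace_log_probs, dtype=float)).mean())

# Toy group of four traces for one prompt (hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0]
trace_log_probs = [-3.2, -4.1, -2.7, -5.0]
adv = group_normalized_advantages(rewards)
surrogate = policy_gradient_surrogate(trace_log_probs, rewards)
```

Minimizing the surrogate raises the log-probability of traces with above-average reward and lowers it for the rest.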
2.70× faster latent generation, 1.92× faster end-to-end vs LaDiR diffusion.

Same 8B backbone for all methods. NF-CoT adds the NF head + 5 shallow flow blocks; the rest of the LLM is unchanged.

A real density on the trace ⇒ honest exploration ⇒ scaling with sample budget.

Single AR forward pass replaces ~30 denoising steps of LaDiR — exactly what dropping the iterative loop buys.
NF-CoT slotted a TARFlow head into an LLM and reasoned with continuous thoughts. The same AR-NF machinery — same causal mask, same KV-cache — has already been shown to model continuous images (TARFlow, STARFlow).
STARFlow2 / Pretzel.

A single causal model that understands, reasons over, and generates continuous images via the same AR mechanism — same mask, same KV-cache, only the output head changes.
The frozen VLM contributes contextual semantics; the trainable TARFlow contributes high-fidelity continuous prediction. Residual skips at every position; both streams share one left-to-right mask.

Stage 1 → Stage 3: GenEval 0.51 → 0.82 (+60.8% relative). Joint training doesn't degrade T2I; it improves it.
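The relative figure follows directly from the two GenEval scores; a quick arithmetic check (the helper name is hypothetical):

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

gain = relative_gain_percent(0.51, 0.82)
print(f"{gain:.1f}%")  # prints 60.8%
```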
Multimodal understanding (Table 1)

Text-to-image generation (Table 2, GenEval)


Reasoning that mixes modalities before it is communicated — exactly the "internal-first / partially formed" property we asked for in ds3.

Always recruiting — PhD students, interns, and visiting collaborators. multipath.github.io
Probabilistic continuous reasoning — a real density, end-to-end RL.
Multiplex (Tang et al., arXiv'26) · STARFlow2 (Shen et al., arXiv'26) · TARFlow (ICML'25 Oral) · STARFlow (NeurIPS'25 Spotlight)
NF-CoT (Tu, Fu et al., UPenn / GMLR-Penn) — coming soon
