Reaching Beyond the Mode:
RL for Distributional Reasoning in Language Models




Figure 1. Standard RL collapses to a single mode; Multi-Answer RL trains the model to output a distribution of plausible answers.


Intro

Recent post-training methods - particularly RL with verifiable rewards (RLVR) - have dramatically improved the reasoning capabilities of large language models. These approaches train models to produce a single, high-confidence answer, pushing them to commit to the most likely response at every turn.

This works well when there is one right answer. But real-world data is messier than that.

Consider a clinician seeing a patient with fever, hemoptysis, and chest pain. The right response isn't one diagnosis - it's a differential: a ranked set of plausible conditions, each with calibrated likelihood. Or consider a coding problem with multiple valid implementations, or a question where missing context leaves genuine ambiguity. In all these cases, the right output is a distribution over answers, not a single point estimate.

Standard RLVR is fundamentally mismatched to these settings. Because it rewards only the single highest-probability answer, models trained with RLVR suffer from mode collapse - they converge toward one dominant answer even when many valid alternatives exist, and they struggle to surface those alternatives even when sampled repeatedly.

The naive fix - just sample the model multiple times - is both expensive and behaviorally misaligned. A model trained to commit to one answer will regenerate the same reasoning scaffold over and over, wasting compute and missing valid alternatives.

We introduce Multi-Answer RL, a training framework that directly optimizes language models to generate sets of diverse, calibrated answers in a single forward pass - reasoning jointly over multiple hypotheses rather than collapsing to one. The result is a model that is more accurate ✅, more diverse 🎲, better calibrated 🎯, and more compute-efficient 🚀 - all at once.


Examples

Below, you can see Multi-Answer RL in action. Click on the tabs to see each model's answers for a given question and compare across RLVR Single, RLVR Multi, and RLCR Multi outputs!



These examples are from Multi-Answer RL models trained on DDXPlus (Medical), HotPotQA-Modified (QA), and MBPP (Coding).

Method

Multi-Answer RL makes one key shift: instead of training models to output a best answer, we train them to output the distribution over best answers.

  • 💡 Reward sets, not singletons. The reward signal checks how many generated answers are correct - incentivizing coverage of the full answer space rather than mode-seeking.
  • 💡 One generation, many hypotheses. The model reasons jointly over multiple candidates in a single chain of thought, internalizing what inference-time search externalizes.
  • 💡 Optionally add calibration. By pairing Multi-Answer RL with a proper scoring rule, models learn to attach meaningful confidence estimates to each candidate - turning the output into a full verbalized distribution.

During training, the model reasons jointly about the task and multiple hypotheses at once, producing both a structured set of candidate answers and (in the RLCR variant) a confidence estimate for each.
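To make this concrete, here is a minimal sketch of how a structured multi-answer completion might be parsed into (answer, confidence) pairs. The `answer :: confidence` line format is a stand-in invented for illustration - the actual structured format used in training may differ.

```python
import re

def parse_candidates(text: str) -> list[tuple[str, float]]:
    """Parse a multi-answer completion into (answer, confidence) pairs.

    Assumes a hypothetical line format 'answer :: confidence',
    e.g. 'Pneumonia :: 0.55'. Lines that don't match are ignored.
    """
    pairs = []
    for line in text.strip().splitlines():
        m = re.match(r"^\s*(.+?)\s*::\s*([01](?:\.\d+)?)\s*$", line)
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs

completion = """Pneumonia :: 0.55
Tuberculosis :: 0.30
Pulmonary embolism :: 0.15"""
print(parse_candidates(completion))
```

The parser simply skips free-form reasoning lines, so the chain of thought and the structured answer set can live in the same completion.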

Standard RLVR reward:

$$R_{\text{RLVR}} = \text{Correctness} = \mathbb{1}[a \in Y^*]$$

Multi-Answer RLVR reward:

$$R^{\text{multi}}_{\text{RLVR}} = \sum_{i=1}^{K} \mathbb{1}[a_i \in Y^*]$$

Sum the correctness signal over all K candidate answers - the model is rewarded for each correct answer it surfaces.

Multi-Answer RLCR reward (adds calibration):

$$R^{\text{multi}}_{\text{RLCR}} = \underbrace{\sum_{i=1}^{K} \mathbb{1}[a_i \in Y^*]}_{\text{Coverage}} - \underbrace{\frac{1}{K}\sum_{i=1}^{K}\left(q_i - \mathbb{1}[a_i \in Y^*]\right)^2}_{\text{Miscalibration penalty}}$$

where \(A = \{a_1, \ldots, a_K\}\) are the K candidate answers, \(Q = \{q_1, \ldots, q_K\}\) are associated confidence scores, and \(Y^*\) is the set of valid ground-truth answers.
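The three rewards above translate almost line-for-line into code. The sketch below is a direct transcription of the formulas with a hypothetical medical example; answer matching is exact set membership here, whereas a real verifier may be more forgiving.

```python
def rlvr_reward(a, valid):
    """Standard RLVR: R = 1[a in Y*]."""
    return float(a in valid)

def multi_rlvr_reward(answers, valid):
    """Multi-Answer RLVR: one unit of reward per correct candidate."""
    return float(sum(a in valid for a in answers))

def multi_rlcr_reward(answers, confidences, valid):
    """Multi-Answer RLCR: coverage minus a Brier-style miscalibration penalty."""
    coverage = sum(a in valid for a in answers)
    k = len(answers)
    penalty = sum((q - (a in valid)) ** 2
                  for a, q in zip(answers, confidences)) / k
    return coverage - penalty

valid = {"influenza", "covid-19"}  # Y*: hypothetical ground-truth set
answers = ["influenza", "common cold", "covid-19"]
confidences = [0.9, 0.1, 0.8]
# coverage = 2; penalty = ((0.9-1)^2 + (0.1-0)^2 + (0.8-1)^2)/3 = 0.02
print(round(multi_rlcr_reward(answers, confidences, valid), 4))  # 1.98
```

Note that with a single correct answer and K = 1 candidate, `multi_rlvr_reward` collapses to `rlvr_reward` exactly.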


This objective generalizes naturally across settings:

| Setting | N (correct answers) | K (generated answers) | Equivalent objective |
| --- | --- | --- | --- |
| Standard RLVR | 1 | 1 | Binary correctness reward - exact reduction to vanilla RLVR |
| Pass@K training | 1 | >1 | Rewards any correct answer in the generated set |
| Partial set recovery | >1 | ≤ N | Maximizes coverage of distinct valid answers |
| Full set recovery | >1 | ≥ N | Optimal policy recovers the entire ground-truth set |

What We Prove:

✅ Multi-Answer RLVR strictly subsumes single-answer RLVR as a special case (N=K=1).

✅ Multi-Answer RLCR provably incentivizes both correct and calibrated answer sets using any bounded proper scoring rule.

✅ The model's output (A, Q) naturally defines a verbalized probability distribution - categorical in single-label settings, multivariate Bernoulli in multi-label settings.
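For intuition on the last point: in the single-label case, renormalizing the confidences over the K candidates yields a categorical distribution over the generated answer set. A sketch (the renormalization step is an illustrative choice here, not necessarily the paper's exact construction):

```python
def to_categorical(answers, confidences):
    """Renormalize per-candidate confidences into a categorical
    distribution over the generated answer set (single-label setting).

    Assumes nonnegative confidences; falls back to uniform if all are zero.
    """
    total = sum(confidences)
    if total == 0:
        return {a: 1.0 / len(answers) for a in answers}
    return {a: q / total for a, q in zip(answers, confidences)}

print(to_categorical(["pneumonia", "bronchitis", "URTI"], [0.5, 0.25, 0.25]))
```

In the multi-label setting, the confidences are instead read as independent marginals, i.e. a multivariate Bernoulli over the candidates.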


Results 📊

Across medical diagnosis (DDXPlus), ambiguous QA (HotPotQA-Modified), and coding (MBPP) benchmarks:


  • Multi-Answer RL substantially improves coverage over single-answer baselines - recovering correct alternatives that standard RL never surfaces, even with equal total samples.
  • Simply prompting single-answer models to produce multiple answers doesn't work. Explicit multi-answer training is required.
  • ✨ Significantly more token-efficient than sampling multiple times from standard RL. For example: on coding (MBPP), Multi-Answer RL boosts top-1 accuracy by over 50% while cutting token usage by more than half.
  • RLCR Multi achieves markedly better calibration than RLVR Multi, with RLVR Multi exhibiting systematic overconfidence across confidence bins.

DDXPlus (Medical): Average coverage as a function of total answers k.


MBPP (Coding): Average coverage as a function of total answers k.


Does Multi-Answer RL Actually Improve Diversity? 🎲

Standard RL is a mode-seeker. Even when you sample it 30 times, it often returns only 3–4 distinct answers - retracing the same reasoning scaffold in slightly different words. Multi-Answer RL breaks this pattern.


We compared RLVR-Single and RLVR-Multi under a matched compute budget: 30 independent samples from RLVR-Single vs. 10 generations of 3 answers from RLVR-Multi (30 total answers each). RLVR-Multi produced nearly twice as many unique answers on average (8 vs. 4 on DDXPlus).
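The unique-answer counts in this comparison reduce to a simple set computation. A sketch with made-up diagnoses (the lowercase/strip normalization is an assumption; the paper may normalize answers differently):

```python
def unique_answers(samples):
    """Number of distinct answers in a flat list, after light normalization."""
    return len({s.strip().lower() for s in samples})

# Matched budget, illustrative data: repeated single-answer samples
# vs. flattened multi-answer sets covering more distinct hypotheses.
single = ["Pneumonia"] * 7 + ["Bronchitis", "URTI"]       # 9 samples, 3 unique
multi_sets = [["Pneumonia", "Bronchitis", "URTI"],
              ["Pneumonia", "Tuberculosis", "Pleurisy"],
              ["Bronchitis", "Pulmonary embolism", "Asthma"]]
multi = [a for gen in multi_sets for a in gen]             # 9 answers, 7 unique
print(unique_answers(single), unique_answers(multi))       # 3 7
```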


Figure 4: Distribution of unique diagnoses per question across 5,000 DDXPlus test examples. RLVR-Multi produces more distinct diagnoses than RLVR-Single, explaining the coverage gains under multi-answer training.

Critically, this isn't just more variation for its own sake. The increased diversity translates directly into more correct answers being surfaced - the model is exploring genuinely different regions of hypothesis space, not adding noise. Multi-answer training increases diversity without compromising correctness, addressing a key failure mode of single-answer RL.


Figure 8: Word clouds for a single medical question. RLVR-Single produces 3 unique answers across 30 samples. RLVR-Multi produces significantly more distinct diagnoses across 10 sets of 3.


Is Multi-Answer RL More Compute-Efficient? 🚀

When a single-answer model generates multiple independent responses, each one recapitulates nearly the same reasoning chain. The n-gram overlap between independent RLVR-Single samples is remarkably high - even when the final answers differ, the reasoning tokens are largely the same. This means inference-time sampling often wastes a large fraction of compute re-thinking the same thoughts.
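One simple way to quantify this redundancy is the Jaccard overlap between the n-gram sets of two responses - a proxy for the overlap measurement described above; the exact metric used in the paper may differ.

```python
def ngram_overlap(text_a: str, text_b: str, n: int = 4) -> float:
    """Jaccard similarity between the n-gram sets of two responses.

    A value near 1.0 means the two responses retrace nearly the same
    token sequences, even if their final answers differ.
    """
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    a, b = ngrams(text_a), ngrams(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Applied over many pairs of independent samples, high average overlap is exactly the "re-thinking the same thoughts" signature: shared reasoning tokens paid for again on every sample.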


Figure 5: Significant subsequence overlap between independently sampled RLVR-Single responses, even those that yield different final answers. Multi-Answer RL mitigates this by reasoning jointly over all candidates in one pass.

Multi-Answer RL mitigates this by sharing the reasoning work across candidates in a single chain-of-thought, only diverging where the hypotheses actually differ.


Figure 7: Average token usage per set on DDXPlus. Multi-Answer training significantly reduces redundant computation.

On DDXPlus, the average token length of an RLVR-Multi response is only ~56% of the total token length required by RLVR-Single to produce the same number of answers. On MBPP, Multi-Answer RL cuts tokens by more than half while also improving top-1 accuracy by over 50%. This positions Multi-Answer RL as a principled, compute-efficient alternative to best-of-k sampling.


Calibrating Distributions of Answers 🎯

Adding the calibration reward (RLCR) to multi-answer training produces models that attach meaningful confidence estimates to each candidate - not just a ranked list, but a genuine distribution. Models are penalized for being confidently wrong and for being unnecessarily uncertain about answers they get right. RLCR Multi gives you confidence + coverage + efficiency, all in one generation.



Figure 3: Calibration curves on DDXPlus. RLCR-Multi tracks the diagonal much more closely across confidence bins than RLVR-Multi, which is systematically overconfident at all confidence levels.

For example, on DDXPlus, RLCR-Multi tracks the identity line across confidence bins. RLVR-Multi, by contrast, consistently over-reports confidence - a direct consequence of the binary reward signal leaving no incentive to be honest about uncertainty.
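Calibration like this is typically summarized with expected calibration error (ECE): bin candidates by stated confidence, then compare average confidence to average accuracy within each bin. A standard sketch (the equal-width binning here is a common choice, not necessarily the paper's exact setup):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over per-candidate (confidence, correctness) pairs.

    Uses equal-width bins; each bin contributes the gap between its
    average confidence and average accuracy, weighted by its size.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, q in enumerate(confidences)
               if lo < q <= hi or (b == 0 and q == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - avg_acc)
    return ece
```

A well-calibrated model like RLCR-Multi scores near zero; systematic overconfidence shows up as a large positive gap in every occupied bin.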



Does Training Remain Stable as K Increases? 📈

We trained Multi-Answer RLVR with K ∈ {2, 3, 4, 5} to understand how performance scales with the number of generated answers.



Figure 6: As K increases, Multi-Answer RLVR stably recovers more unique correct diagnoses per set. Training curves remain stable across all values of K.

Training remains stable across all values of K. As K increases, models recover more unique correct hypotheses per set - the gains come from genuinely surfacing new alternatives, not from repeating dominant answers (outputs are constrained to be distinct). Coverage increases monotonically with K, suggesting the approach scales gracefully with additional capacity.


BibTeX

@article{puri2025reachingbeyondmode,
  title={Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models},
  author={Isha Puri and Mehul Damani and Idan Shenfeld and Marzyeh Ghassemi and Jacob Andreas and Yoon Kim},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.24844},
}



Code



For all code, models, and data, check out the Multi-Answer RL GitHub Repo. We provide a detailed README with instructions for setting up, training, and evaluating the model.