The Hidden Problem with "Sampling Proportionally to Quality"

Esther Derman, Senior AI Researcher

April 17, 2026

In theory, sampling candidates in proportion to their estimated quality should prioritize high-performing options: promising candidates have higher estimated quality, so they are selected more often. In extremely large search spaces, however, this advantage is neutralized by the sheer volume of sub-optimal candidates. Despite their high individual scores, elite candidates are statistically overwhelmed by the cumulative probability mass of average alternatives. Ultimately, the algorithm favors a breadth of acceptable results, and diversity supersedes quality.


This structural problem arises when:

  • The search space is combinatorial
  • The scores assigned to candidates are distributed fairly uniformly, without “sharp” peaks in the reward landscape
  • The algorithm encourages sampling to remain diverse, penalizing the “exceptional” minority in favor of the “acceptable” majority
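This imbalance is easy to reproduce with a toy simulation. The numbers below (ten elite candidates among a million average ones, with illustrative scores) are assumptions chosen to make the effect visible, not figures from any real system:

```python
# Toy model: a few elite candidates vs. a huge mass of average ones.
n_elite, n_average = 10, 1_000_000
qualities = [0.99] * n_elite + [0.60] * n_average

# Under quality-proportional sampling, a candidate's selection
# probability is its quality divided by the total quality mass.
total = sum(qualities)
p_elite = sum(qualities[:n_elite]) / total

print(f"P(sample is elite) = {p_elite:.2e}")  # tiny, despite the elite scores
```

Even though each elite candidate scores far higher than any individual average one, the chance that a proportional sample is elite is on the order of one in tens of thousands.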

Why Does This Matter Beyond Academic Research?

Beyond academia, every enterprise AI system is confronted with this trade-off. The use of proxy objectives is ubiquitous:

  • Confidence scores generated by machine learning models
  • Automated evaluation of model quality
  • Heuristic methods for evaluating outputs
  • Learned reward models

When these proxy objectives drive generation, optimization, or filtering while also promoting diversity, the result is reduced real-world utility. Examples include:

  • Sending too many poor candidates for human evaluation
  • Models that continue to explore indefinitely and fail to converge
  • Systems that “play it safe” instead of presenting meaningful signals

Ultimately, this is not merely an academic issue. At scale, this represents an operational problem.

An Alternative Perspective: Robust Reinforcement Learning

One alternative perspective on this issue is through reinforcement learning (RL). From this viewpoint:

  • Each decision creates a component of the final product
  • The final product gets evaluated (i.e., the proxy reward)
  • The AI learns a policy for producing high-quality products

The key realization here is that:

  • The proxy objective is NOT the true objective.
  • The actual objective (e.g., laboratory results, performance of the downstream process, user satisfaction) cannot be predicted with perfect certainty.

Therefore, we should learn policies under assumptions that include uncertainty about the proxy objective. This is known as robust RL.
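As a minimal sketch of this idea, consider scoring candidates by a pessimistic lower bound on the proxy reward. The candidate records, the `uncertainty` field, and the interval-style uncertainty set are illustrative assumptions, not a specific algorithm from the literature:

```python
def pessimistic_pick(candidates):
    """Choose the candidate with the best worst-case score, i.e. the
    proxy reward minus its estimated uncertainty (a robust criterion)."""
    return max(candidates, key=lambda c: c["proxy"] - c["uncertainty"])

candidates = [
    {"name": "A", "proxy": 0.95, "uncertainty": 0.30},  # impressive but noisy
    {"name": "B", "proxy": 0.85, "uncertainty": 0.05},  # modest but reliable
]

# A naive argmax on the proxy would pick A; the robust criterion picks B.
print(pessimistic_pick(candidates)["name"])
```

The design choice here is the shape of the uncertainty set: an interval per candidate is the simplest option, and richer robust RL formulations replace it with distributional or adversarial uncertainty over the whole reward function.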

Entropy to Balance: More Than Just "More or Less Randomness"

There are numerous instances in which AI systems intentionally create randomness:

  • To prevent the system from becoming stuck on a local minimum
  • To allow exploration into unexplored regions of the space
  • To mitigate over-fitting the proxy objective

This type of randomness is typically induced via entropy regularization, a formal method of encouraging the actions taken during exploration to "spread out." Unfortunately, entropy alone acts along a single dimension:

  • Be more random/diverse

However, effective discovery requires balancing two dimensions:

  • Too little randomization → miss potential opportunities
  • Too much randomization → dilute quality

Selective exploration is missing.
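A standard Boltzmann (softmax) policy makes this one-dimensionality concrete: a single temperature parameter slides the entire distribution between greedy and uniform, with no way to be selective about where the randomness goes. The scores below are illustrative:

```python
import math

def softmax(scores, temperature=1.0):
    """Boltzmann policy: low temperature -> near-greedy,
    high temperature -> near-uniform. One knob, one dimension."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, 1.0, 0.5]
print(softmax(scores, temperature=0.1))   # almost all mass on the top score
print(softmax(scores, temperature=10.0))  # close to uniform
```

Turning the temperature up makes every action more likely, including the poor ones; turning it down collapses onto the current best guess. There is no setting that explores selectively.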

Generalized Mellowmax: An Interpolation Between Two Endpoints

Very recent research presents an intuitive and elegant concept:

  • Interpolate between exploration (entropy-based) and exploitation (aggressive optimization).

Consider generalized mellowmax as a control knob, rather than an on/off switch.

At one endpoint, the system generates candidates randomly and equally frequently.

At the other endpoint, the system selects only those candidates whose path(s) to the current state are well understood.

Between endpoints:

  • The system continues to sample candidates randomly, but it places increasing emphasis on those whose path(s) to the current state are well understood.
  • Practitioners now control how much diversity is appropriate for the task at hand, rather than relying on a single parameter value to suit all tasks.
  • Furthermore, this interpolation happens structurally, not through ad-hoc heuristics.
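The classic mellowmax operator illustrates the knob. (The generalized version in recent work adds further parameters, so treat this as a simplified sketch rather than the paper's exact operator.) A single parameter omega interpolates between the mean of the values, which weights everything evenly, and their max, which is purely greedy:

```python
import math

def mellowmax(values, omega):
    """mm_omega(v) = (1/omega) * log(mean(exp(omega * v))).
    As omega -> 0 this approaches the mean; as omega -> inf, the max."""
    m = max(values)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega

vals = [1.0, 2.0, 10.0]
print(mellowmax(vals, omega=0.001))  # close to mean(vals) ~ 4.33
print(mellowmax(vals, omega=100.0))  # close to max(vals) = 10.0
```

Every intermediate omega gives a smooth, well-defined blend of the two endpoints, which is what makes it a continuous control knob rather than an on/off switch.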

Why Do Full Trajectories Matter? (Not Just Their Final Scores)

Crucially, the generative trajectory of an object matters, not just its terminal reward. In compositional tasks:

  • Errors and uncertainties build up step by step.
  • Early selections limit future possibilities.
  • A poor intermediate selection can rarely be corrected at the end.

Modern techniques utilize trajectory-level learning to connect early-stage decisions to late-stage outcomes. These advances improve:

  • Credit assignment
  • Stability
  • Sample efficiency

In practice, this means the model learns why a particular outcome is good, not just that it received a high score.
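The simplest possible example of trajectory-level credit assignment is the discounted return: a reward observed only at the final step is propagated back so that every earlier decision shares in the outcome. Modern trajectory-level objectives are considerably more sophisticated, so treat this as a sketch of the underlying idea only:

```python
def trajectory_returns(rewards, gamma=0.99):
    """Return-to-go for each step: g_t = r_t + gamma * g_{t+1}.
    Early steps inherit (discounted) credit from late outcomes."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# Only the last step receives the proxy reward...
rets = trajectory_returns([0.0, 0.0, 0.0, 1.0])
# ...yet every step now carries a learning signal.
print(rets)
```

Here the first decision receives a discounted share of the terminal reward, which is exactly the link between early-stage decisions and late-stage outcomes described above.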

What Does This Teach Us About AI Systems Practically?

Taking a step back from the algorithms themselves, several generalizations emerge:

  1. Optimizing proxy objectives is dangerous without some form of structure. Optimizing a score is simple. Optimizing the right score, under uncertainty about it, is difficult.
  2. Exploration must be directed, not merely encouraged at random. Randomness is a tool, not an inherent virtue. Realistic systems explore deliberately, where it makes sense.
  3. Robustness should be built in, not bolted on later. By assuming uncertainty from the start, we obtain better reliability downstream.
  4. Interpretability naturally emerges from good abstractions. If regularization corresponds to some real-world uncertainty, reasoning about system behavior becomes simpler.

These concepts generalize well across the entire AI life cycle, particularly in areas such as data curation, quality of training signals, evaluation, and governance.

How Will This Impact the Development of AI Systems at Scale?

As AI systems move further into scientific discovery, enterprise decision-making, and high-risk domains, there will be a growing disconnect between success with respect to proxy objectives and success in the real world. The next generation of AI systems will not be characterized solely by larger models or more data. They will be characterized by smarter objectives, better treatment of uncertainty, and policies that provide a good balance between precision and exploration.

For us at Innodata, this relates directly to our view of AI lifecycle services: treating data, evaluation, and robustness as essential characteristics of systems, not as afterthoughts.

Innodata Inc.

Bring Intelligence to Your Enterprise Processes with Generative AI.

Innodata provides high-quality data solutions for developing industry-leading generative AI models, including diverse golden datasets, fine-tuning data, human preference optimization, red teaming, model safety, and evaluation.