Judge

Confidence & Uncertainty

system limits, cognitive accessibility, audience adaptation, interactive inquiry

User question

How certain is the system?

Consulting signal

Surfaces when a client's model produces a clean, precise-looking score that is driving consequential decisions, and nobody has checked how reliable that score actually is across different case types.

Overview

Why this pattern exists

AI systems produce precise-looking outputs: scores, classifications, and recommendations. A risk score of 73. A confidence of 94%. A prediction shown on a clean dashboard. This visual precision is a design problem masquerading as a feature.

Most AI systems are not as certain as they appear. A score of 73 is typically a point estimate with a range of plausible values around it. A prediction made on edge-case data is less reliable than one made on familiar data. A model encountering inputs very different from its training distribution may be confidently wrong.

When users encounter a precise-looking number without any uncertainty signal, they tend to trust it more than is warranted. This over-trust is one of the most consistent findings in human-AI interaction research, and it has real consequences. Clinicians accept AI-flagged diagnoses they should question. Caseworkers approve recommendations they should override.

Confidence & Uncertainty is the pattern that corrects this. It surfaces the system's own uncertainty at the moment of presentation, in the interface where the decision is made, rather than burying it in documentation or a technical appendix.

Note: this pattern addresses statistical/predictive uncertainty in structured AI systems. For the specific failure mode of LLM hallucination, see Grounding & Hallucination Indicators, which covers a distinct problem requiring different design treatment.

Design goal

Surface the AI system's confidence level and uncertainty range in a way that is immediately visible, readable by non-technical users, and calibrated to the actual uncertainty in the output, so that it prevents both over-trust and unnecessary alarm.

Usage guidance

When to use

The system produces probabilistic outputs (scores, risk levels, predictions, classifications)
Decisions based on the output have material consequences
The system may encounter inputs outside its reliable operating range
Users may over-trust a clean numeric output without uncertainty context
Model confidence varies meaningfully across cases and that variation is relevant to the user

When not to use

The system is deterministic: there is no meaningful uncertainty to express
The output is a verified fact, not a prediction
Showing uncertainty would be technically dishonest (e.g. displaying a confidence interval that the model doesn't actually produce)
The user context makes uncertainty display counterproductive, e.g. emergency interfaces where a clear, fast signal is needed and nuance creates dangerous hesitation

Design

UI primitives

Data Visualization / Visualization

Confidence band on numeric output

Rather than showing a single number (score: 73), show a range (score: 68–78). The width of the band communicates uncertainty. Visually: a central value with a shaded range extending on either side.
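One way to obtain such a band, sketched here under assumptions, is to take percentiles of repeated score estimates for the same case (for example from bootstrap resamples or ensemble members). The function names, the 80% default coverage, and the display format are illustrative choices, not part of the pattern.

```python
def percentile(ordered, p):
    """Linearly interpolated percentile of an already sorted list (0 <= p <= 1)."""
    idx = p * (len(ordered) - 1)
    lower = int(idx)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (idx - lower) * (ordered[upper] - ordered[lower])


def confidence_band(resampled_scores, coverage=0.8):
    """Return (low, central, high) from repeated score estimates for one case.

    `resampled_scores` is assumed to come from bootstrap resamples or
    ensemble members; `coverage` is the share of estimates the band spans.
    """
    if len(resampled_scores) < 2:
        raise ValueError("need at least two resampled scores")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(resampled_scores)
    tail = (1 - coverage) / 2
    return (
        percentile(ordered, tail),
        percentile(ordered, 0.5),
        percentile(ordered, 1 - tail),
    )


def format_band(low, central, high):
    """Render the central value together with its range."""
    return f"score: {central:.0f} (range {low:.0f}–{high:.0f})"


if __name__ == "__main__":
    low, central, high = confidence_band(list(range(68, 79)), coverage=1.0)
    print(format_band(low, central, high))
```

A wider spread of resampled scores produces a wider band, which is exactly the signal the shaded range is meant to carry.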

Inline Signal / Label

Verbal confidence label

A plain-language label alongside the numeric output: "High confidence," "Moderate confidence," "Low confidence: additional review recommended." More accessible than a numeric range for non-technical users.
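A minimal sketch of the mapping from a calibrated confidence value to one of these labels. The cut-offs of 0.85 and 0.60 are placeholders for illustration; real thresholds would need to be set with the technical team and tested with users.

```python
def confidence_label(confidence, high_cutoff=0.85, low_cutoff=0.60):
    """Map a calibrated confidence in [0, 1] to a plain-language label.

    The cut-offs are assumed values, not recommendations.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high_cutoff:
        return "High confidence"
    if confidence >= low_cutoff:
        return "Moderate confidence"
    return "Low confidence: additional review recommended"


if __name__ == "__main__":
    for value in (0.94, 0.70, 0.41):
        print(f"{value:.2f} -> {confidence_label(value)}")
```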

Inline Signal / Icon

Uncertainty indicator icon

A visual signal (⚠, a shaded zone, a distinct color) that appears when confidence is below a defined threshold. Draws attention to uncertain cases without requiring the user to interpret a number.

Data Visualization / Visualization

Distribution visualization

For users who need more detail: a probability distribution showing the full range of possible outputs, not just the central estimate. Appropriate for analyst or expert views. Can be placed in a secondary layer.

Data Visualization / Chart

Quantile dotplot

A discrete representation of a probability distribution: countable dots rather than a smooth density curve, where each dot represents one equally probable outcome drawn from the distribution. Empirical research shows that this format reduces the variance of user probability estimates compared to density strips, because it leverages subitizing, the human capacity to recognise groups of fewer than five items instantly and accurately without counting. Particularly effective for non-technical users and for small-screen contexts where a density curve compresses into an unreadable sliver. Prefer quantile dotplots over smooth distributions when user decision precision matters more than visual elegance.
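The dot positions can be derived by evaluating the distribution's inverse CDF at evenly spaced probabilities, so that each dot stands for the same share of probability. The sketch below assumes a normal predictive distribution centred on 73 purely for illustration; the function names, the 20-dot default, and the bin width are also assumptions.

```python
import math
from collections import Counter
from statistics import NormalDist


def quantile_dots(inv_cdf, n_dots=20):
    """Return n_dots values, each representing a 1/n_dots share of probability."""
    return [inv_cdf((i + 0.5) / n_dots) for i in range(n_dots)]


def stack_into_columns(dots, bin_width):
    """Group dots into bins so they can be drawn as stacked columns.

    Returns {bin_start: number_of_dots}, ordered by bin_start.
    """
    counts = Counter(math.floor(d / bin_width) * bin_width for d in dots)
    return dict(sorted(counts.items()))


def share_at_or_above(dots, threshold):
    """What a reader estimates by counting: dots at or above the threshold."""
    return sum(1 for d in dots if d >= threshold) / len(dots)


if __name__ == "__main__":
    # Hypothetical predictive distribution for a single case.
    predictive = NormalDist(mu=73, sigma=4)
    dots = quantile_dots(predictive.inv_cdf, n_dots=20)
    print(stack_into_columns(dots, bin_width=2))
    print(share_at_or_above(dots, 73))
```

With 20 dots, each dot reads as "1 in 20", which lets a user answer a question such as "how likely is a score above the cut-off?" by counting rather than by judging area under a curve.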

Interactive Control / Slider

Interactive risk slider

A user-adjustable control allowing a reviewer to shift the decision threshold applied to a displayed point estimate in a more conservative or more permissive direction, to match their situation-specific risk tolerance. For example, a reviewer might treat a borderline score more cautiously before a high-stakes final decision. The adjustment is made explicit and logged rather than silently absorbed into the reviewer's reasoning. Appropriate in professional contexts where the user's risk tolerance legitimately varies by case type. Connect to the override mechanism in Human-in-the-Loop so that threshold adjustments are recorded alongside the final decision.
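A rough sketch of what "explicit and logged" could look like in data terms. Every name here (`ThresholdAdjustment`, `review_with_adjustment`, the field names) is hypothetical, and the in-memory list stands in for whatever audit store the Human-in-the-Loop mechanism actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ThresholdAdjustment:
    """One reviewer-initiated change to the decision threshold (hypothetical schema)."""
    reviewer_id: str
    default_threshold: float
    adjusted_threshold: float
    reason: str
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def is_flagged(score, threshold):
    """A case is flagged when its score reaches the threshold."""
    return score >= threshold


def review_with_adjustment(score, adjustment, audit_log):
    """Apply the adjusted threshold and record the adjustment alongside the outcome."""
    audit_log.append(adjustment)
    return {
        "score": score,
        "flagged_at_default": is_flagged(score, adjustment.default_threshold),
        "flagged_at_adjusted": is_flagged(score, adjustment.adjusted_threshold),
        "adjustment": adjustment,
    }


if __name__ == "__main__":
    log = []
    adjustment = ThresholdAdjustment(
        reviewer_id="reviewer-17",
        default_threshold=75.0,
        adjusted_threshold=70.0,
        reason="High-stakes final decision; treating borderline score cautiously",
    )
    print(review_with_adjustment(71.0, adjustment, log))
```

Keeping both the default and the adjusted outcome in the record makes it possible to audit later how often adjustments changed the decision.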

Contextual Overlay / Warning

Out-of-distribution warning

A specific warning that appears when the current input falls outside the model's training distribution: "This case has characteristics the system hasn't seen frequently. Results may be less reliable." This is separate from uncertainty in the output: it is uncertainty about whether the model applies to the case at all.
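As a deliberately crude illustration of the trigger, the sketch below flags a case when any numeric feature lies more than a few standard deviations from the training mean. This per-feature z-score check is an assumption chosen for brevity; production systems may rely on density estimates, distance measures, or dedicated detectors instead. Feature names and the limit of 3.0 are hypothetical.

```python
from statistics import mean, stdev

OOD_MESSAGE = (
    "This case has characteristics the system hasn't seen frequently. "
    "Results may be less reliable."
)


def fit_feature_stats(training_rows):
    """Summarise each numeric feature of the training data as (mean, std)."""
    stats = {}
    for name in training_rows[0]:
        values = [row[name] for row in training_rows]
        stats[name] = (mean(values), stdev(values))
    return stats


def unusual_features(case, stats, z_limit=3.0):
    """Return the features whose value is far from what training data covered."""
    flagged = []
    for name, (mu, sigma) in stats.items():
        if sigma == 0:
            # Constant feature in training: any other value is unfamiliar.
            if case[name] != mu:
                flagged.append(name)
        elif abs(case[name] - mu) / sigma > z_limit:
            flagged.append(name)
    return flagged


def applicability_warning(case, stats, z_limit=3.0):
    """Return the warning text for unfamiliar cases, otherwise None."""
    return OOD_MESSAGE if unusual_features(case, stats, z_limit) else None


if __name__ == "__main__":
    training = [{"income": 30 + i, "tenure": 5 + i % 3} for i in range(20)]
    stats = fit_feature_stats(training)
    print(applicability_warning({"income": 200, "tenure": 6}, stats))
```

Returning the list of unusual features, not just a flag, lets the interface say which characteristics are unfamiliar.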

Inline Signal / Color Coding

Confidence-based color coding

Calibrated color scales (e.g. green/amber/red) tied to confidence levels, not just to the output value itself. A high score with low confidence should look different from a high score with high confidence.
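One possible encoding, sketched under assumptions: keep a separate visual channel for confidence so that it cannot be confused with the score itself. The band boundaries, colour names, and the idea of a separate outline treatment for the score are illustrative choices only, and colour should not be the sole carrier of the signal.

```python
def confidence_color(confidence, high_cutoff=0.85, low_cutoff=0.60):
    """Colour driven by confidence only (assumed cut-offs)."""
    if confidence >= high_cutoff:
        return "green"
    if confidence >= low_cutoff:
        return "amber"
    return "red"


def score_band(score, high_cutoff=70, low_cutoff=40):
    """Band driven by the output value only (assumed cut-offs)."""
    if score >= high_cutoff:
        return "high"
    if score >= low_cutoff:
        return "medium"
    return "low"


def cell_style(score, confidence):
    """Combine both channels; the two never share a visual variable."""
    color = confidence_color(confidence)
    return {
        "score_band": score_band(score),      # e.g. drives position or outline
        "confidence_color": color,            # drives fill colour
        "show_warning_icon": color == "red",  # redundant cue besides colour
    }


if __name__ == "__main__":
    print(cell_style(90, 0.95))
    print(cell_style(90, 0.40))
```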

How to use

Distinguish types of uncertainty.

Aleatoric uncertainty (inherent randomness in the world) and epistemic uncertainty (the model's lack of knowledge) are different things with different implications for users. Where technically feasible, surface both, but at minimum, make clear whether the uncertainty is in the data, the model, or both.
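Where the system is an ensemble of binary classifiers, one commonly used way to separate the two is an entropy decomposition: the entropy of the averaged prediction is the total uncertainty, the average entropy of the individual members approximates the aleatoric part, and the difference (how much the members disagree) approximates the epistemic part. The sketch below assumes such an ensemble exists; whether this decomposition suits a given system is a question for the technical team.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def decompose_uncertainty(member_probs):
    """Split uncertainty for one case, given each ensemble member's probability.

    Returns a dict with total, aleatoric, and epistemic uncertainty in bits.
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    epistemic = max(0.0, total - aleatoric)  # guard against float round-off
    return {"total": total, "aleatoric": aleatoric, "epistemic": epistemic}


if __name__ == "__main__":
    # Members agree the case is a coin flip: uncertainty sits in the data.
    print(decompose_uncertainty([0.5, 0.5, 0.5, 0.5]))
    # Members are individually sure but contradict each other: model uncertainty.
    print(decompose_uncertainty([0.0, 1.0, 0.0, 1.0]))
```

For the interface, the practical difference is the recommended response: more or better data about the case rarely reduces aleatoric uncertainty, while high epistemic uncertainty suggests the model itself is on unfamiliar ground.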

Calibrate the display to the actual uncertainty.

A confidence interval that is too wide is useless. One that is too narrow is misleading. The display must reflect the model's actual calibration, not be designed to look impressive. Modern neural networks are systematically overconfident: their raw softmax scores are not reliable probabilities. Confirm with your technical team whether post-hoc calibration (temperature scaling is the most common remedy) has been applied before surfacing the model's own confidence values. An uncalibrated confidence score displayed as though it were trustworthy is worse than showing no confidence at all.
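To make the temperature scaling idea concrete: the logits are divided by a single scalar T before the softmax, and T is chosen to minimise negative log-likelihood on held-out validation data. The sketch below uses a plain grid search and toy data as simplifying assumptions; real implementations usually use a numerical optimiser and a proper validation set.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    """Average NLL of the true labels under temperature-scaled probabilities."""
    losses = [
        -math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # 0.5 to 5.0
    return min(
        candidates,
        key=lambda t: negative_log_likelihood(logit_rows, labels, t),
    )


if __name__ == "__main__":
    # Toy validation set: the model is ~98% sure of class 0 every time,
    # but class 0 is correct only three times out of four.
    logit_rows = [[4.0, 0.0]] * 4
    labels = [0, 0, 0, 1]
    t = fit_temperature(logit_rows, labels)
    print("raw confidence:   ", round(softmax([4.0, 0.0])[0], 3))
    print("fitted T:         ", round(t, 1))
    print("scaled confidence:", round(softmax([4.0, 0.0], t)[0], 3))
```

In this toy case the scaled confidence moves toward the observed accuracy of 75%, which is the value that is defensible to show in the interface.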

Make low confidence visible at the decision point, not buried.

If confidence is low and the user is about to act on the output, the uncertainty signal must be prominent, not a footnote or a collapsed detail.

Don't use uncertainty as a liability disclaimer.

Showing an uncertainty range is not a way to hedge responsibility. It is a tool for helping users make better decisions. Design it to be useful, not to be technically defensible.

Pair uncertainty with guidance.

Low confidence should trigger a clear action: "This result has low confidence: human review recommended." Don't surface uncertainty without giving users a way to respond to it.
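A small sketch of pairing each low-confidence signal with a concrete next step. The reason codes, the wording of the actions, and the 0.60 threshold are all hypothetical; the point is only that the notice is never produced without an action attached.

```python
# Hypothetical reason codes mapped to the action the user can take.
GUIDANCE = {
    "incomplete_data": "Consider requesting updated records before making a final decision.",
    "unverified_information": "Ask the applicant to provide verification documents.",
    "uncommon_case_type": "Route the case to a specialist reviewer.",
}
DEFAULT_ACTION = "Human review recommended."


def uncertainty_notice(confidence, reason_code=None, low_threshold=0.60):
    """Return a notice with an attached action for low-confidence results.

    Returns None when confidence is at or above the (assumed) threshold.
    """
    if confidence >= low_threshold:
        return None
    action = GUIDANCE.get(reason_code, DEFAULT_ACTION)
    return f"This result has low confidence. {action}"


if __name__ == "__main__":
    print(uncertainty_notice(0.45, "incomplete_data"))
    print(uncertainty_notice(0.45))
    print(uncertainty_notice(0.90, "incomplete_data"))
```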

Use cases

Flow A

Caseworker handling a borderline case

1. Caseworker opens a risk assessment.
2. Score shows: 71, with a confidence band of 64–78.
3. A ⚠ indicator shows: "Moderate confidence: employment data is incomplete."
4. Caseworker is prompted: "Consider requesting updated employment records before making a final decision."
5. Caseworker requests the documents, and the decision is deferred.

Flow B

Applicant reading their score

1. Applicant sees: "Risk score: Moderate" with a plain-language description.
2. Below the score: "The system is less certain than usual about this result because some of your information could not be verified."
3. Applicant understands why and is prompted to provide verification documents.

Flow C

Out-of-distribution alert

1. System encounters a self-employed applicant with non-standard income structure.
2. Score is generated but flagged: "This application type is less common in our dataset. The estimate may be less accurate than usual."
3. Case is automatically routed to a specialist caseworker.

Design trade-offs

Uncertainty vs. usability

Too many uncertainty signals create alert fatigue: users start ignoring them. Reserve high-visibility uncertainty indicators for cases where confidence is genuinely low. Use subtle indicators for routine moderate-confidence cases.

Precision vs. comprehension

Exact confidence intervals (73.4% ± 4.1%) are technically accurate but harder to act on than "moderate confidence." Find the right level of precision for the user's context.

Honesty vs. system credibility

Consistently showing low confidence may erode user trust in the system overall, even when the system is performing adequately. Communicate what "low confidence" means in context: what the user should do with it, not just that it exists.

Connections

Relation to other patterns

Grounding & Hallucination Indicators

Can I trust what the system generated?

Give users visible signals about the factual grounding of AI-generated content: distinguishing between responses that are well-supported by retrievable evidence and those that may contain fabricated, outdated, or unverifiable claims.

Uncertainty covers statistical confidence in predictions. Grounding covers factual reliability in generative AI output. These are distinct failure modes that require different design treatment.

Model Scope & Limits

What can this system reliably do?

Communicate the intended operating scope of an AI system, and signal clearly when a specific case, query, or context falls outside it, so that users can make informed judgments about whether to rely on the output.

Out-of-distribution uncertainty is a specific form of scope violation. The two patterns work together to communicate where the system's reliable operating range ends.

Human-in-the-Loop

Who can review or override this decision?

Provide authorized users with clear, accessible mechanisms to review, modify, and override AI recommendations, and ensure that all interventions are documented in a way that supports accountability and audit.

Low confidence is one of the primary triggers for human review. Uncertainty signals and the HITL mechanism should be directly linked.

Attribution

Why did the system make this decision?

Surface the key factors that influenced an AI decision in a way that is readable, appropriately qualified, and supports the user's ability to evaluate, challenge, or act on the outcome.

Uncertainty applies not just to the final output but to individual attribution factors. A factor's contribution may itself be uncertain, especially when the underlying data is noisy or incomplete.

Sources

Doshi-Velez et al. (2017) — Accountability of AI Under Uncertainty

argues that AI systems must be able to communicate their uncertainty to be held accountable. Foundational for the normative case underlying this pattern

Kay et al. (2016) — When (ish) Is My Bus? User-Centered Visualizations of Uncertainty

empirical study comparing uncertainty visualization formats. Finds that quantile dotplots outperform continuous density strips on decision precision because they leverage subitizing, the ability to instantly recognise small group quantities, reducing estimation variance by approximately 1.15× compared to density plots. Directly motivates the quantile dotplot primitive in this pattern

Yaniv & Kleinberger (2000) — How Humans Incorporate Advice: Understanding the Weight of Advice

shows that humans systematically under-use uncertain information and over-weight confident-seeming outputs. The psychological basis for why explicit uncertainty signals are necessary

Guo et al. (2017) — On Calibration of Modern Neural Networks

demonstrates that modern deep neural networks are often poorly calibrated and tend toward overconfidence, so their softmax confidence scores are not reliable probabilities. Links miscalibration to factors such as model depth and width, weight decay, and batch normalization, and shows that temperature scaling is a simple and effective post-hoc remedy. Essential context for designing honest uncertainty displays: you cannot simply surface the model's own confidence score as trustworthy