How certain is the system?
Surfaces when a client's model produces a clean, precise-looking score that is driving consequential decisions, and nobody has checked how reliable that score actually is across different case types.
AI systems produce precise-looking outputs: scores, classifications, and recommendations. A risk score of 73. A confidence of 94%. A prediction shown on a clean dashboard. This visual precision is a design problem masquerading as a feature.
Most AI systems are not as certain as they appear. A score of 73 is typically a point estimate with a range of plausible values around it. A prediction made on edge-case data is less reliable than one made on familiar data. A model encountering inputs very different from its training distribution may be confidently wrong.
When users encounter a precise-looking number without any uncertainty signal, they tend to trust it more than is warranted. This over-trust is one of the most consistent findings in human-AI interaction research, and it has real consequences. Clinicians accept AI-flagged diagnoses they should question. Caseworkers approve recommendations they should override.
Confidence & Uncertainty is the pattern that corrects this. It surfaces the system's own uncertainty at the moment of presentation, not buried in documentation, not in a technical appendix, but in the interface where the decision is made.
Note: this pattern addresses statistical/predictive uncertainty in structured AI systems. For the specific failure mode of LLM hallucination, see Grounding & Hallucination Indicators: that is a distinct problem requiring different design treatment.
Surface the AI system's confidence level and uncertainty range in a way that is immediately visible, readable by non-technical users, and calibrated to the actual uncertainty in the output, so that users neither over-trust the result nor become unnecessarily alarmed.
Rather than showing a single number (score: 73), show a range (score: 68–78). The width of the band communicates uncertainty. Visually: a central value with a shaded range extending on either side.
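One possible way to obtain such a band, sketched here under the assumption that the model can produce repeated estimates for the same case (for example one per ensemble member or bootstrap resample), is to take percentiles of those estimates. The function names and the 10th/90th percentile choice are illustrative, not part of the pattern.

```python
from statistics import median, quantiles


def score_band(samples, low_pct=10, high_pct=90):
    """Return (low, central, high) from repeated score estimates.

    `samples` is assumed to be a list of scores for the same case, e.g. one
    per ensemble member or bootstrap resample (hypothetical input).
    """
    if not 1 <= low_pct < high_pct <= 99:
        raise ValueError("percentiles must satisfy 1 <= low < high <= 99")
    # With n=100, cuts[k - 1] is the k-th percentile of the samples.
    cuts = quantiles(samples, n=100, method="inclusive")
    return round(cuts[low_pct - 1]), round(median(samples)), round(cuts[high_pct - 1])


def format_band(low, central, high):
    """Render the band as text; a real interface would draw a shaded range."""
    return f"score: {central} (likely range {low} to {high})"


samples = [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78]
print(format_band(*score_band(samples)))  # score: 73 (likely range 69 to 77)
```

The percentile pair is a design decision: a wider pair gives a band that contains the true value more often but says less, so it should be chosen with the technical team rather than for appearance.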
A plain-language label alongside the numeric output: "High confidence," "Moderate confidence," "Low confidence: additional review recommended." More accessible than a numeric range for non-technical users.
A visual signal (⚠, a shaded zone, a distinct color) that appears when confidence is below a defined threshold. Draws attention to uncertain cases without requiring the user to interpret a number.
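The label and the flag can share one mapping from a calibrated confidence value to display text. The sketch below assumes confidence is expressed between 0 and 1; the cut-offs 0.85 and 0.6 are placeholders that would need to be set per domain, and `confidence_label` is a hypothetical helper name.

```python
def confidence_label(confidence, high=0.85, low=0.6):
    """Map a calibrated confidence in [0, 1] to (label, show_flag).

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return "High confidence", False
    if confidence >= low:
        return "Moderate confidence", False
    # Below the low threshold: show the warning marker and suggest an action.
    return "Low confidence: additional review recommended", True


for value in (0.94, 0.70, 0.40):
    label, flagged = confidence_label(value)
    print(("⚠ " if flagged else "") + label)
```

Keeping the flag tied to a single threshold makes it easy to audit how often it fires, which matters for the alert-fatigue trade-off discussed later.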
For users who need more detail: a probability distribution showing the full range of possible outputs, not just the central estimate. Appropriate for analyst or expert views. Can be placed in a secondary layer.
A quantile dotplot is a discrete representation of a probability distribution: countable dots rather than a smooth density curve, where each dot represents one equally probable outcome drawn from the distribution. Empirical research shows that, compared to density strips, this format reduces the variance of users' probability estimates by leveraging subitizing, the human capacity to recognise groups of fewer than five items instantly and accurately without counting. It is particularly effective for non-technical users and for small-screen contexts where a density curve compresses into an unreadable sliver. Prefer quantile dotplots over smooth distributions when user decision precision matters more than visual elegance.
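To make the construction concrete, here is a minimal sketch that assumes the predictive distribution is approximately normal and uses 20 dots, so each dot stands for a 5% chance. The text rendering, bin width, and function names are illustrative only; a production chart would use a plotting library.

```python
from collections import Counter
from statistics import NormalDist


def dotplot_quantiles(dist, n_dots=20):
    """Return n_dots equally probable outcomes (mid-point quantiles)."""
    return [dist.inv_cdf((i + 0.5) / n_dots) for i in range(n_dots)]


def render_dotplot(values, bin_width):
    """Stack one 'o' per value into columns of width bin_width."""
    counts = Counter(round(v / bin_width) for v in values)
    first_bin, last_bin = min(counts), max(counts)
    rows = []
    for level in range(max(counts.values()), 0, -1):
        row = "".join(
            "o" if counts.get(b, 0) >= level else " "
            for b in range(first_bin, last_bin + 1)
        )
        rows.append(row.rstrip())
    return "\n".join(rows)


# Hypothetical predictive distribution: score 73 with standard deviation 5.
dots = dotplot_quantiles(NormalDist(mu=73, sigma=5))
print(render_dotplot(dots, bin_width=2))
```

A reader can estimate "how likely is a score above 80?" by counting the dots to the right of 80 and multiplying by 5%.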
A user-adjustable control allowing a reviewer to move a displayed point estimate toward a more conservative or permissive threshold to match their situation-specific risk tolerance: for example, treating a borderline score more cautiously before a high-stakes final decision. The adjustment is made explicit and logged rather than silently absorbed into the reviewer's reasoning. Appropriate in professional contexts where the user's risk tolerance legitimately varies by case type. Connect to the override mechanism in Human-in-the-Loop so that threshold adjustments are recorded alongside the final decision.
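One reading of this control is a per-case decision threshold that the reviewer may shift, with every shift written to an audit log. The sketch below follows that reading; the record fields, the `decide` function, and the requirement to give a reason are assumptions for illustration rather than a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ThresholdAdjustment:
    """Audit record for one reviewer adjustment (hypothetical schema)."""
    case_id: str
    reviewer: str
    default_threshold: float
    adjusted_threshold: float
    reason: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def decide(score, default_threshold, audit_log, *, case_id, reviewer,
           adjusted_threshold=None, reason=""):
    """Apply the default or reviewer-adjusted threshold and log any change."""
    threshold = default_threshold
    if adjusted_threshold is not None and adjusted_threshold != default_threshold:
        if not reason:
            raise ValueError("a threshold adjustment requires a stated reason")
        audit_log.append(ThresholdAdjustment(
            case_id, reviewer, default_threshold, adjusted_threshold, reason))
        threshold = adjusted_threshold
    return "flag" if score >= threshold else "pass"


log = []
print(decide(73, 75, log, case_id="case-1", reviewer="r-17"))
print(decide(73, 75, log, case_id="case-1", reviewer="r-17",
             adjusted_threshold=70, reason="high-stakes final decision"))
print(len(log))
```

Storing the default alongside the adjusted value lets a later review see how far the reviewer moved and why, which is what links this control to the override record.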
A specific warning that appears when the current input falls outside the model's training distribution: "This case has characteristics the system hasn't seen frequently. Results may be less reliable." Separate from uncertainty in the output: this is uncertainty in the model's applicability.
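How the system decides that an input is unfamiliar is a modelling question, but a deliberately crude sketch shows the shape of the check: compare each numeric feature of the incoming case against summary statistics of the training data. The per-feature z-score test, the limit of 3, and the function names are assumptions; this approach misses unusual combinations of individually normal values, so real systems often use stronger detectors.

```python
from statistics import mean, stdev


def fit_reference(training_rows):
    """Store (mean, standard deviation) per feature column."""
    return [(mean(col), stdev(col)) for col in zip(*training_rows)]


def unfamiliar_features(row, reference, z_limit=3.0):
    """Return indexes of features far outside the training range."""
    flagged = []
    for index, (value, (mu, sigma)) in enumerate(zip(row, reference)):
        if sigma == 0:
            if value != mu:  # constant in training, different now
                flagged.append(index)
        elif abs(value - mu) / sigma > z_limit:
            flagged.append(index)
    return flagged


# Hypothetical training data: (age, annual income).
training = [(30, 50_000), (40, 60_000), (50, 70_000), (35, 55_000), (45, 65_000)]
reference = fit_reference(training)

case = (41, 500_000)
if unfamiliar_features(case, reference):
    print("This case has characteristics the system hasn't seen frequently. "
          "Results may be less reliable.")
```

Returning the flagged feature indexes, rather than a bare yes or no, lets the interface say which characteristic is unusual.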
Calibrated color scales (e.g. green/amber/red) tied to confidence levels, not just to the output value itself. A high score with low confidence should look different from a high score with high confidence.
Aleatoric uncertainty (inherent randomness in the world) and epistemic uncertainty (the model's lack of knowledge) are different things with different implications for users. Where technically feasible, surface both, but at minimum, make clear whether the uncertainty is in the data, the model, or both.
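Where the model is an ensemble whose members each predict a mean and a variance, one common approximation separates the two kinds of uncertainty with the law of total variance: the average of the members' predicted variances stands in for aleatoric uncertainty, and the disagreement between members' means stands in for epistemic uncertainty. The sketch below assumes that setup; the function names and the factor-of-two rule for choosing a message are illustrative.

```python
from statistics import mean, pvariance


def decompose_uncertainty(member_means, member_variances):
    """Split total predictive variance into aleatoric and epistemic parts."""
    aleatoric = mean(member_variances)   # noise the members agree is in the data
    epistemic = pvariance(member_means)  # disagreement between members
    return {"aleatoric": aleatoric, "epistemic": epistemic,
            "total": aleatoric + epistemic}


def dominant_source(parts, ratio=2.0):
    """Say whether uncertainty sits mainly in the data, the model, or both."""
    if parts["epistemic"] > ratio * parts["aleatoric"]:
        return "model"
    if parts["aleatoric"] > ratio * parts["epistemic"]:
        return "data"
    return "both"


parts = decompose_uncertainty([70, 73, 76], [4, 4, 4])
print(parts, dominant_source(parts))
```

The distinction matters for the user's next step: uncertainty in the model can sometimes be reduced by collecting more data or escalating to an expert, while uncertainty inherent in the data usually cannot.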
A confidence interval that is too wide is useless. One that is too narrow is misleading. The display must reflect the model's actual calibration, not be designed to look impressive. Modern neural networks are often overconfident: their raw softmax scores are not reliable probabilities (Guo et al., 2017). Confirm with your technical team whether post-hoc calibration (temperature scaling is the most common remedy) has been applied before surfacing the model's own confidence values. An uncalibrated confidence score displayed as though it were trustworthy is worse than showing no confidence at all.
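For readers who want to see what temperature scaling does, here is a minimal sketch: the model's logits are divided by a single temperature T, chosen to minimise negative log-likelihood on held-out validation data. The grid search stands in for the gradient-based optimiser a technical team would normally use, and the toy logits and labels are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(logits, label, temperature):
    """Numerically stable log-probability of the true label."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    log_total = top + math.log(sum(math.exp(z - top) for z in scaled))
    return scaled[label] - log_total


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature with the lowest validation NLL (grid search)."""
    candidates = candidates or [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0

    def nll(t):
        return -sum(log_prob(r, y, t) for r, y in zip(logit_rows, labels)) / len(labels)

    return min(candidates, key=nll)


# Toy validation set: the model is about 98% confident but only 75% accurate.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]
t = fit_temperature(val_logits, val_labels)
print(round(softmax([4.0, 0.0])[0], 2), round(softmax([4.0, 0.0], t)[0], 2))
```

In this toy case the fitted temperature is well above 1, which pulls the displayed confidence down toward the observed accuracy. Temperature scaling does not change which class is predicted, only how confident the display should claim to be.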
If confidence is low and the user is about to act on the output, the uncertainty signal must be prominent, not a footnote or a collapsed detail.
Showing an uncertainty range is not a way to hedge responsibility. It is a tool for helping users make better decisions. Design it to be useful, not to be technically defensible.
Low confidence should trigger a clear action: "This result has low confidence: human review recommended." Don't surface uncertainty without giving users a way to respond to it.
Uncertainty vs. usability
Too many uncertainty signals create alert fatigue: users start ignoring them. Reserve high-visibility uncertainty indicators for cases where confidence is genuinely low. Use subtle indicators for routine moderate-confidence cases.
Precision vs. comprehension
Exact confidence intervals (73.4% ± 4.1%) are technically accurate but harder to act on than "moderate confidence." Find the right level of precision for the user's context.
Honesty vs. system credibility
Consistently showing low confidence may erode user trust in the system overall, even when the system is performing adequately. Communicate what "low confidence" means in context: what the user should do with it, not just that it exists.
Uncertainty covers statistical confidence in predictions. Grounding covers factual reliability in generative AI output. These are distinct failure modes that require different design treatment.
Out-of-distribution uncertainty is a specific form of scope violation. The two patterns work together to communicate where the system's reliable operating range ends.
Low confidence is one of the primary triggers for human review. Uncertainty signals and the HITL mechanism should be directly linked.
Uncertainty applies not just to the final output but to individual attribution factors. A factor's contribution may itself be uncertain, especially in noisy or incomplete data.
Doshi-Velez et al. (2017). Accountability of AI Under the Law: The Role of Explanation
Argues that AI systems must be able to communicate their uncertainty to be held accountable. Foundational for the normative case underlying this pattern.
Kay et al. (2016). When (ish) Is My Bus? User-Centered Visualizations of Uncertainty
Empirical study comparing uncertainty visualization formats. Finds that quantile dotplots outperform continuous density strips on decision precision because they leverage subitizing, the ability to instantly recognise small group quantities, reducing estimation variance by approximately 1.15× compared to density plots. Directly motivates the quantile dotplot primitive in this pattern.
Yaniv & Kleinberger (2000). Advice Taking in Decision Making: Egocentric Discounting and Reputation Formation
Shows that humans systematically under-use uncertain information and over-weight confident-seeming outputs. The psychological basis for why explicit uncertainty signals are necessary.
Guo et al. (2017). On Calibration of Modern Neural Networks
Demonstrates that modern deep neural networks are often poorly calibrated and overconfident: their softmax confidence scores are not reliable probabilities, particularly as model depth and width increase. Introduces temperature scaling as a simple post-hoc remedy. Essential context for designing honest uncertainty displays: you cannot simply surface the model's own confidence score as trustworthy.