Created with the help of AI

Category — Understand

Data Provenance & Lineage

provenance, causal reasoning, audience adaptation

User question

Where does this data come from?

Consulting signal

Surfaces when a client faces a data quality complaint (a decision made on stale or incorrect data), or when auditors ask 'What data was used in this specific decision, and where did it come from?'

Overview

Why this pattern exists

AI systems are built from data, and the data they are built from shapes everything they do. A model trained on historical loan decisions will encode the biases in those decisions. A model trained on data from one population may perform poorly on another. A model using stale data may be confidently wrong about a changed situation.

Users interacting with AI outcomes rarely know any of this. They see a number, a recommendation, or a decision, but not the data substrate underneath it. Data Provenance & Lineage makes that substrate visible: where data came from, when it was collected, how it was processed, and whether it was appropriate for this use case.

This pattern addresses two distinct levels:
1. Training data: what the model was built on (the long-term substrate)
2. Input data: what specific data was used in this decision (the immediate input)

Both matter. A model trained on well-sourced data but fed poor input data produces unreliable results. A trustworthy input fed to a biased model can produce discriminatory results. The pattern covers both.

Note: this pattern concerns training data and input data. For runtime retrieval in LLM systems, see Retrieved Source Attribution.

Design goal

Give users visibility into the origin, quality, and appropriateness of the data used in an AI decision, at both the model training level and the specific decision level.

Usage guidance

When to use

Data quality directly affects decision reliability (medical, financial, legal, public sector)
Users have a right to know what data was used in a decision about them (GDPR, AI Act)
Input data may be stale, incomplete, or inferred rather than directly collected
The model was trained on data that may not represent the current population
Auditors or compliance teams need to trace the data chain for a specific decision

When not to use

Disclosing the data sources would reveal proprietary details of the model architecture and so create commercial or security risk
The level of data detail is not meaningful for lay users in a low-stakes context
Full provenance is not technically traceable in the production system (don't show incomplete provenance as if it were complete)

Design

UI primitives

Content Block / List

Data source list

A structured list of the data categories or sources used in a specific decision:
- Source name or type (self-reported, verified document, third-party bureau, inferred)
- Recency (when the data was last updated)
- Quality indicator (complete / incomplete / stale / flagged)
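
One way to model a row of this list is sketched below. This is a minimal illustration, not a prescribed schema: the class names, the `render_entry` helper, and the example values are all hypothetical, and only the source types and quality levels come from the list above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class SourceType(Enum):
    SELF_REPORTED = "self-reported"
    VERIFIED_DOCUMENT = "verified document"
    THIRD_PARTY_BUREAU = "third-party bureau"
    INFERRED = "inferred"


class Quality(Enum):
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    STALE = "stale"
    FLAGGED = "flagged"


@dataclass(frozen=True)
class DataSourceEntry:
    """One row of the data source list (hypothetical shape)."""
    name: str                 # display name of the data category
    source_type: SourceType   # where the value came from
    last_updated: date        # recency
    quality: Quality          # quality indicator


def render_entry(entry: DataSourceEntry) -> str:
    """Format one row as plain text for a list view."""
    return (
        f"{entry.name}: {entry.source_type.value}, "
        f"updated {entry.last_updated.isoformat()}, {entry.quality.value}"
    )


# Example values are invented for illustration.
entry = DataSourceEntry("Income", SourceType.SELF_REPORTED, date(2023, 1, 15), Quality.STALE)
print(render_entry(entry))
```

Keeping source type and quality as closed enumerations, rather than free text, makes it harder for a row to appear in the interface without a stated origin or quality status.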

Inline Signal / Badge

Data quality badges

Inline badges on specific data fields indicating their status:
- ✓ Verified: confirmed against a document or external source
- ⚠ Stale: last updated more than X months ago
- ? Inferred: derived from other data, not directly collected
- ✗ Missing: expected but not present
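
The badge rules can be sketched as a small function. This is an illustration under stated assumptions: the 365-day threshold stands in for the unspecified "X months", the precedence order (missing, then inferred, then stale, then verified) is a design choice not given in the pattern, and `badge_for` and its parameters are hypothetical names.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Badge(Enum):
    VERIFIED = "✓ Verified"
    STALE = "⚠ Stale"
    INFERRED = "? Inferred"
    MISSING = "✗ Missing"


# Assumption: placeholder for the "X months" threshold, which the pattern leaves open.
STALE_AFTER_DAYS = 365


def badge_for(
    value: object,
    last_updated: Optional[date],
    inferred: bool,
    verified: bool,
    today: date,
) -> Optional[Badge]:
    """Pick at most one badge for a field (precedence is an assumption)."""
    if value is None:
        return Badge.MISSING
    if inferred:
        return Badge.INFERRED
    if last_updated is not None and (today - last_updated).days > STALE_AFTER_DAYS:
        return Badge.STALE
    if verified:
        return Badge.VERIFIED
    return None  # present and recent, but unverified: no badge shown


today = date(2024, 6, 1)
print(badge_for(42000, date(2022, 12, 1), inferred=False, verified=True, today=today))
```

Staleness is checked before verification here so that a value verified long ago is still flagged as stale; a system could reasonably show both badges instead.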

Content Block / Card

Training data summary card

A compact card (linked to the full Model Transparency Cards) summarizing:
- What kind of data the model was trained on
- The time range of training data
- Known limitations or gaps in training coverage

Inline Signal / Label

Dataset nutrition label

A standardised summary of the training dataset's composition, designed for non-technical audiences and analogous to a food nutrition label. It covers the key "ingredients": size, collection date range, source types, known demographic gaps, and any documented quality issues or biases. Where a training data summary card links to the full transparency documentation, the nutrition label is the quick-read version for practitioners who need to assess fitness-for-purpose without reading the full datasheet. It is particularly useful in operator views where the dataset's profile, not just the model's, affects trust in the output.

Data Visualization / Timeline

Data timeline

A simple visual showing when key data points were collected relative to when the decision was made. Helps users see whether the data reflects their current situation.
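
The data behind such a timeline is simply each field's age at the moment of decision. A minimal sketch follows, assuming whole calendar months are a suitable unit; the function names, field names, and dates are invented for illustration.

```python
from datetime import date


def months_before(collected: date, decided: date) -> int:
    """Whole calendar months between collection and decision (never negative)."""
    months = (decided.year - collected.year) * 12 + (decided.month - collected.month)
    if decided.day < collected.day:
        months -= 1  # the last month is not yet complete
    return max(months, 0)


def timeline_rows(collected_on: dict[str, date], decided: date) -> list[tuple[str, int]]:
    """Return (field, age in months) pairs, oldest data first."""
    rows = [(name, months_before(d, decided)) for name, d in collected_on.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


decision_date = date(2024, 6, 1)
rows = timeline_rows(
    {
        "Income": date(2022, 12, 1),
        "Address": date(2024, 3, 15),
        "Credit history": date(2024, 5, 20),
    },
    decision_date,
)
for name, age in rows:
    print(f"{name}: collected {age} months before the decision")
```

Sorting oldest first puts the data most likely to be out of date at the top of the visual.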

Content Block / Panel

What data was used panel

For decisions where specific fields drove the outcome, a panel showing exactly which data fields were accessed, along with their source and value. The panel should state not just that "financial data" was used, but that "income figure of €X from self-reported form, unverified" was used.

Contextual Overlay / Warning

Coverage warning

A visible alert when the current case falls outside the distribution the model was trained on: "This system was primarily validated on applicants in urban employment contexts. Your case may be evaluated with lower accuracy."
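
A coverage warning needs some test of whether a case lies outside what the model was validated on. The sketch below assumes the documented coverage can be written as allowed categories and numeric ranges per feature. That is a deliberately simple range check, not a statistical out-of-distribution detector, and the `Coverage` class, feature names, and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Coverage:
    """Documented training/validation coverage (hypothetical shape)."""
    categorical: dict[str, frozenset[str]]       # feature -> categories seen in validation
    numeric: dict[str, tuple[float, float]]      # feature -> (min, max) seen in validation


def out_of_coverage(case: dict, coverage: Coverage) -> list[str]:
    """List the features of this case that fall outside documented coverage."""
    gaps = []
    for feature, allowed in coverage.categorical.items():
        if feature in case and case[feature] not in allowed:
            gaps.append(feature)
    for feature, (low, high) in coverage.numeric.items():
        if feature in case and not (low <= case[feature] <= high):
            gaps.append(feature)
    return gaps


# Invented example: validation covered urban employment contexts only.
coverage = Coverage(
    categorical={"employment_context": frozenset({"urban"})},
    numeric={"age": (21, 65)},
)
case = {"employment_context": "rural", "age": 34}

gaps = out_of_coverage(case, coverage)
if gaps:
    print("Coverage warning for:", ", ".join(gaps))
```

Returning the list of affected features, rather than a single boolean, lets the warning say which aspect of the case is poorly covered.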

How to use

Distinguish training data from input data clearly.

These are different things with different implications. A model trained on biased historical data is a systemic problem. A specific decision made using stale input data is an immediate, addressable problem. Don't conflate them.

Make data quality visible at the point of decision, not buried in documentation.

A stale income figure affects this decision now. The user and caseworker need to see it here, not in a terms-of-service page.

Be honest about gaps.

If some input data is missing or of low quality, show it. A system that presents incomplete data as if it were complete is actively misleading. Missing data displayed with a "✗ Missing" badge is more trustworthy than a blank field.

Connect provenance to recourse.

If a decision was partly driven by stale or missing data, the natural recourse is to provide updated data. Make that link explicit: see Actionable Recourse.

For training data, link to the full model card.

Full training data documentation belongs in the Model Transparency Cards. The provenance display in the decision interface should show a summary with a link to that fuller documentation.

Use cases

flow a

Spotting stale data in a decision

1. Applicant reviews a loan decision.
2. The data panel shows a ⚠ badge on "Income: last verified 18 months ago."
3. Applicant recognizes this is outdated (they changed jobs).
4. System shows a link: "Provide updated income documentation to request re-evaluation."
5. Applicant uploads current payslips and requests a new assessment.

flow b

Caseworker checking data quality before override

1. Caseworker reviews a borderline case flagged for human review.
2. Data quality panel shows two ✗ Missing fields and one ? Inferred field.
3. Caseworker requests the missing documentation before making a final decision.
4. New data is added; system notes the data update in the audit trail.

flow c

Auditor tracing a decision

1. Auditor investigates a complaint about a declined application.
2. Decision audit shows the data sources used at the time of decision, with timestamps.
3. Auditor sees that the income figure used was from a third-party bureau, not from the applicant's submitted documents.
4. Auditor flags a process inconsistency for review.

Design trade-offs

Transparency vs. complexity

Full data lineage for a complex model can involve dozens of sources and transformations. Surface what's actionable for the user: quality signals on the specific fields that mattered, and link to full documentation for experts who need the complete picture.

Honesty vs. trust erosion

Displaying data quality badges prominently may make users less confident in the system overall, even when most data is high quality. Design the display to be informative rather than alarming. Show what's good as well as what's problematic.

Disclosure vs. proprietary risk

Naming specific third-party data providers may reveal commercial relationships. Consider showing data types (credit bureau, address verification, employment database) rather than specific vendor names where appropriate.
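
Showing data types in place of vendor names can be as simple as a lookup applied before display. A minimal sketch, in which the vendor names are invented, the mapping is hypothetical, and "third-party source" is an assumed fallback label:

```python
# Hypothetical vendor names mapped to the data types named in the text above.
VENDOR_TO_TYPE = {
    "ExampleBureau Ltd": "credit bureau",
    "AddressCheck Inc": "address verification",
    "WorkRecords Co": "employment database",
}


def public_source_label(vendor: str, disclose_vendor: bool = False) -> str:
    """Return the label to display: the vendor name only when disclosure is allowed."""
    if disclose_vendor:
        return vendor
    # Assumed fallback so an unmapped vendor is never shown by name.
    return VENDOR_TO_TYPE.get(vendor, "third-party source")


print(public_source_label("ExampleBureau Ltd"))
print(public_source_label("ExampleBureau Ltd", disclose_vendor=True))
```

The `disclose_vendor` flag keeps the full name available for views that need it, such as an auditor's, while defaulting to the less revealing label.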

Connections

Relation to other patterns

Retrieved Source Attribution

What sources did the system draw from?

Surface the documents, passages, or records the system retrieved at runtime, so users can verify the basis of a response, assess its reliability, and access primary sources directly.

Provenance covers training data and input data. Retrieved Source Attribution covers runtime retrieval in LLM systems. These are distinct but complementary.

Model Transparency Cards

How was this system built and tested?

Provide structured, accessible, and honest documentation of an AI system, covering its purpose, data, performance, limitations, fairness properties, and governance, in a form that serves operators, affected persons, regulators, and the public.

Model cards are the canonical place for training data documentation. This pattern surfaces a summary of that documentation at the point of decision.

Actionable Recourse

What can I do to change this outcome?

Translate the factors behind an adverse AI decision into specific, realistic, controllable next steps, giving users a genuine pathway to a different outcome rather than a list of features to optimize.

Stale or missing data offers one of the most actionable routes to recourse: the user can provide updated data and request re-evaluation. Provenance and recourse should be linked.

Ethical & Fairness Signals

Is this system fair across groups?

Surface indicators of how equitably an AI system performs across different populations, giving users, operators, and auditors the context to assess fairness concerns, and making bias a visible and contestable property rather than an invisible assumption.

Training data that underrepresents certain populations is a root cause of model bias. Provenance and fairness signals address the same underlying issue from different angles.

Audit Trail & Logging

What is the complete record of this decision?

Capture a comprehensive, tamper-evident, and accessible record of AI decisions, including inputs, outputs, model versions, human interventions, and data states, sufficient to support retrospective audit, regulatory review, and the exercise of user rights.

Audit trails should capture the data state at the time of a decision, not just the decision itself. Provenance data is a key component of a complete audit record.
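
Capturing the data state at decision time could look like the sketch below: the input fields are frozen into a record together with a content hash. This is an assumption-laden illustration. The record layout and function names are hypothetical, and a bare SHA-256 hash only detects changes to a stored record; genuine tamper evidence would also need signing, hash chaining, or write-once storage.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(record: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot_inputs(decision_id: str, fields: dict, decided_at: datetime) -> dict:
    """Freeze the input data used for a decision, with a content hash."""
    record = {
        "decision_id": decision_id,
        "decided_at": decided_at.isoformat(),
        "fields": fields,  # must be JSON-serializable
    }
    return {**record, "sha256": _digest(record)}


def is_unmodified(snapshot: dict) -> bool:
    """Recompute the hash and compare it with the stored one."""
    record = {k: v for k, v in snapshot.items() if k != "sha256"}
    return _digest(record) == snapshot["sha256"]


# Invented example values.
snap = snapshot_inputs(
    "decision-001",
    {"income": {"value": "42000", "source": "third-party bureau", "as_of": "2022-12-01"}},
    datetime(2024, 6, 1, 9, 30, tzinfo=timezone.utc),
)
print(is_unmodified(snap))
```

Storing each field's source and as-of date inside the snapshot is what lets an auditor later see not only which value was used, but where it came from at the time.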

Sources

Gebru et al. (2018) — Datasheets for Datasets

Proposes a standardized documentation format for datasets covering motivation, composition, collection process, and known biases. The direct foundation for training data provenance disclosure.

Holland et al. (2018) — Data Nutrition Labels

Proposes a "nutrition label" metaphor for dataset transparency, designed for non-technical audiences. Directly informs how provenance summaries should be designed for lay users.

Mitchell et al. (2019) — Model Cards for Model Reporting

Although focused on model-level documentation, model cards include dataset information as a core component. Establishes the connection between data provenance and model-level transparency.

Created as a side project by Christian Laesser & AI