Where does this data come from?
Surfaces when a client faces a data quality complaint (a decision made on stale or incorrect data), or when auditors ask: 'What data was used in this specific decision, and where did it come from?'
AI systems are built from data, and the data they are built from shapes everything they do. A model trained on historical loan decisions will encode the biases in those decisions. A model trained on data from one population may perform poorly on another. A model using stale data may be confidently wrong about a changed situation.
Users interacting with AI outcomes rarely know any of this. They see a number, a recommendation, or a decision, but not the data substrate underneath it. Data Provenance & Lineage makes that substrate visible: where data came from, when it was collected, how it was processed, and whether it was appropriate for this use case.
This pattern addresses two distinct levels:
1. Training data: what the model was built on (the long-term substrate)
2. Input data: what specific data was used in this decision (the immediate input)
Both matter. A well-trained model fed poor input data produces unreliable results. A trustworthy input fed to a biased model produces discriminatory results. The pattern covers both.
Note: this pattern concerns training data and input data. For runtime retrieval in LLM systems, see Retrieved Source Attribution.
Give users visibility into the origin, quality, and appropriateness of the data used in an AI decision, at both the model training level and the specific decision level.
A structured list of the data categories or sources used in a specific decision:
- Source name or type (self-reported, verified document, third-party bureau, inferred)
- Recency (when the data was last updated)
- Quality indicator (complete / incomplete / stale / flagged)
Inline badges on specific data fields indicating their status:
- ✓ Verified: confirmed against a document or external source
- ⚠ Stale: last updated more than X months ago
- ? Inferred: derived from other data, not directly collected
- ✗ Missing: expected but not present
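The badge rules above could be computed from per-field metadata along these lines. This is a minimal sketch, not a prescribed implementation: the `FieldProvenance` record, the `badge_for` function, the 12-month default for "X", the precedence order (missing, then stale, then inferred, then verified), and the `UNVERIFIED` fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Badge(Enum):
    VERIFIED = "✓ Verified"
    STALE = "⚠ Stale"
    INFERRED = "? Inferred"
    MISSING = "✗ Missing"
    UNVERIFIED = "Unverified"  # assumption: fallback, not one of the four badges


@dataclass
class FieldProvenance:
    """Hypothetical per-field metadata record."""
    name: str
    value: Optional[str]
    last_updated: Optional[date]
    verified: bool = False
    inferred: bool = False


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later` (never negative)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the current month has not fully elapsed yet
    return max(months, 0)


def badge_for(field: FieldProvenance, today: date, stale_after_months: int = 12) -> Badge:
    """Pick one badge per field; the precedence order is an assumption."""
    if field.value is None:
        return Badge.MISSING
    if (field.last_updated is not None
            and months_between(field.last_updated, today) > stale_after_months):
        return Badge.STALE
    if field.inferred:
        return Badge.INFERRED
    if field.verified:
        return Badge.VERIFIED
    return Badge.UNVERIFIED
```

Staleness is checked before verification here so that an old but once-verified value is still flagged; a team might reasonably choose to show both badges instead.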
A compact card (linked to the full Model Transparency Cards) summarizing:
- What kind of data the model was trained on
- The time range of training data
- Known limitations or gaps in training coverage
A standardized summary of the training dataset's composition, designed for non-technical audiences and analogous to a food nutrition label. It covers the key "ingredients": size, collection date range, source types, known demographic gaps, and any documented quality issues or biases. Where the training data summary card links to the full transparency documentation, the nutrition label is the quick-read version for practitioners who need to assess fitness for purpose without reading the full datasheet. It is particularly useful in operator views where the dataset's profile, not just the model's, affects trust in the output.
A simple visual showing when key data points were collected relative to when the decision was made. Helps users see whether the data reflects their current situation.
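The data behind such a timeline is just the age of each data point at decision time. A minimal sketch, assuming collection dates are available per field (the function name and the row format are illustrative, and a real interface would render this graphically):

```python
from datetime import date


def recency_timeline(collected: dict[str, date], decision_date: date) -> list[str]:
    """Return one text row per data point, oldest first, with its age at decision time."""
    rows = []
    for name, when in sorted(collected.items(), key=lambda item: item[1]):
        age_days = (decision_date - when).days
        rows.append(f"{when.isoformat()}  {name}: {age_days} days before decision")
    return rows
```

Sorting oldest first puts the data most likely to be out of date at the top of the display.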
For decisions where specific fields drove the outcome, a panel showing exactly which data fields were accessed, their source, and their value: not just that "financial data" was used, but that "income figure of €X from self-reported form, unverified" was used.
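As an illustration of the level of detail described, each accessed field can be held as a small record and rendered as one panel line. This is a sketch only; `AccessedField` and `panel_line` are hypothetical names, and the wording of the line simply mirrors the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessedField:
    """Hypothetical record of one input field read during a decision."""
    label: str      # e.g. "income figure"
    value: str      # display value, already formatted for the user
    source: str     # e.g. "self-reported form"
    verified: bool


def panel_line(field: AccessedField) -> str:
    """Render one field as '<label> of <value> from <source>, <status>'."""
    status = "verified" if field.verified else "unverified"
    return f"{field.label} of {field.value} from {field.source}, {status}"
```

For example, `panel_line(AccessedField("income figure", "€X", "self-reported form", False))` produces the line quoted above.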
A visible alert when the current case falls outside the distribution the model was trained on: "This system was primarily validated on applicants in urban employment contexts. Your case may be evaluated with lower accuracy."
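One very simple way to trigger such an alert is to compare a case's categorical attributes against the values the model was validated on. The sketch below assumes that validated coverage is recorded per feature as a set of values; the function name, the data shapes, and the message wording are illustrative, and real out-of-distribution detection is usually more involved than a set-membership check.

```python
def coverage_warnings(case: dict[str, str], validated: dict[str, set[str]]) -> list[str]:
    """Return a warning for each feature whose value falls outside validated coverage."""
    warnings = []
    for feature, covered in validated.items():
        value = case.get(feature)
        # An absent value is a missing-data problem, not a coverage problem.
        if value is not None and value not in covered:
            warnings.append(
                f"This system was primarily validated on cases where {feature} "
                f"is one of {sorted(covered)}. Your case ({value}) may be "
                f"evaluated with lower accuracy."
            )
    return warnings
```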
These are different things with different implications. A model trained on biased historical data is a systemic problem. A specific decision made using stale input data is an immediate, addressable problem. Don't conflate them.
A stale income figure affects this decision now. The user and caseworker need to see it here, not in a terms-of-service page.
If some input data is missing or of low quality, show it. A system that presents incomplete data as if it were complete is actively misleading. Missing data displayed with a "✗ Missing" badge is more trustworthy than a blank field.
If a decision was partly driven by stale or missing data, the natural recourse is to provide updated data. Make that link explicit: see Actionable Recourse.
Full training data documentation belongs in the Model Transparency Cards. The provenance display in the decision interface should show a summary with a link to that fuller documentation.
Transparency vs. complexity
Full data lineage for a complex model can involve dozens of sources and transformations. Surface what's actionable for the user: quality signals on the specific fields that mattered, and link to full documentation for experts who need the complete picture.
Honesty vs. trust erosion
Displaying data quality badges prominently may make users less confident in the system overall, even when most data is high quality. Design the display to be informative rather than alarming. Show what's good as well as what's problematic.
Disclosure vs. proprietary risk
Naming specific third-party data providers may reveal commercial relationships. Consider showing data types (credit bureau, address verification, employment database) rather than specific vendor names where appropriate.
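Showing data types instead of vendor names can be as simple as a lookup applied before display. In this sketch the vendor names, the `VENDOR_CATEGORIES` table, and the `display_source` function are all hypothetical; only the three category labels come from the text.

```python
# Hypothetical mapping from internal vendor identifiers to user-facing data types.
VENDOR_CATEGORIES = {
    "ExampleBureau Ltd": "credit bureau",
    "ExampleAddressCheck": "address verification",
    "ExampleEmploymentDB": "employment database",
}


def display_source(vendor: str, disclose_vendor: bool = False) -> str:
    """Return the data type, optionally followed by the vendor name."""
    category = VENDOR_CATEGORIES.get(vendor, "third-party source")
    return f"{category} ({vendor})" if disclose_vendor else category
```

Keeping the vendor name available behind a flag lets auditor or operator views show the full source while the default user view shows only the data type.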
Provenance covers training data and input data. Retrieved Source Attribution covers runtime retrieval in LLM systems. These are distinct but complementary.
Model cards are the canonical place for training data documentation. This pattern surfaces a summary of that documentation at the point of decision.
Stale or missing data is one of the most actionable grounds for recourse: the user can provide updated data or request re-evaluation. Provenance and recourse should be linked.
Training data that underrepresents certain populations is a root cause of model bias. Provenance and fairness signals address the same underlying issue from different angles.
Audit trails should capture the data state at the time of a decision, not just the decision itself. Provenance data is a key component of a complete audit record.
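Capturing the data state at decision time means copying the inputs into the audit record instead of storing a reference to data that may change later. A minimal sketch, assuming the inputs are a JSON-serializable dictionary (the function name, the record layout, and the use of a SHA-256 digest are illustrative choices, not part of the pattern):

```python
import copy
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Optional


def snapshot_decision(
    decision_id: str,
    outcome: str,
    inputs: dict[str, Any],
    decided_at: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build an audit record that freezes the input data as it was at decision time."""
    frozen = copy.deepcopy(inputs)  # later edits to `inputs` must not alter the record
    # Canonical JSON (sorted keys) so identical inputs always give the same digest.
    canonical = json.dumps(frozen, sort_keys=True, ensure_ascii=False, default=str)
    return {
        "decision_id": decision_id,
        "outcome": outcome,
        "decided_at": (decided_at or datetime.now(timezone.utc)).isoformat(),
        "inputs": frozen,
        "inputs_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

The digest gives auditors a cheap way to check that the stored inputs have not been altered since the decision was recorded.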
Gebru et al. (2018): Datasheets for Datasets
Proposes a standardized documentation format for datasets covering motivation, composition, collection process, and known biases. The direct foundation for training data provenance disclosure.
Holland et al. (2018): The Dataset Nutrition Label
Proposes a "nutrition label" metaphor for dataset transparency, designed for non-technical audiences. Directly informs how provenance summaries should be designed for lay users.
Mitchell et al. (2019): Model Cards for Model Reporting
While focused on model-level documentation, model cards include dataset information as a core component. Establishes the connection between data provenance and model-level transparency.