Incident

Incidents represent data quality issues, operational problems, or any other type of issue that affects data assets in DataHub. They provide a structured way to track, manage, and resolve problems across datasets, dashboards, charts, data flows, data jobs, and schema fields. Incidents help teams maintain data reliability by documenting problems, assigning responsibility, tracking resolution progress, and maintaining an audit trail of data quality events.

Identity

Incidents are uniquely identified by a generated UUID string. Unlike most other DataHub entities that derive their identity from external systems, incidents are created within DataHub and assigned a unique identifier at creation time.

The URN structure for an incident is:

urn:li:incident:<uuid>

Example:

urn:li:incident:a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d

The UUID is automatically generated by the system when an incident is raised, ensuring global uniqueness across all incidents in the DataHub instance.

Important Capabilities

Incident Types

Incidents can be categorized by type to help teams understand the nature of the problem. DataHub supports several predefined incident types as well as custom types:

  • FRESHNESS: Triggered when data is not updated within expected time windows. Often raised by freshness assertions that detect stale data.
  • VOLUME: Raised when data volume falls outside expected ranges (too much or too little data). Typically generated by volume assertions.
  • FIELD: Indicates issues with specific field values, such as null values, invalid formats, or values outside acceptable ranges. Associated with field-level assertions.
  • SQL: Triggered by SQL-based assertions that validate data using custom queries.
  • DATA_SCHEMA: Raised when schema changes are detected, such as column additions, removals, or type changes.
  • OPERATIONAL: General operational incidents such as pipeline failures, permission issues, or system errors.
  • CUSTOM: User-defined incident types for organization-specific problems. When using the CUSTOM type, you must provide a customType string to describe the incident category, as shown in the sketch after this list.
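
For illustration, here is a minimal sketch of an incident info aspect using the CUSTOM type, built with the Python SDK (the customType value, title, and dataset URN are hypothetical):

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models

audit_stamp = models.AuditStampClass(
    time=builder.get_sys_time(),  # epoch milliseconds
    actor=builder.make_user_urn("datahub"),
)

custom_incident_info = models.IncidentInfoClass(
    type=models.IncidentTypeClass.CUSTOM,
    customType="DATA_CONTRACT_BREACH",  # hypothetical org-specific category
    title="Orders feed violated its data contract",
    entities=[
        builder.make_dataset_urn(platform="kafka", name="orders", env="PROD")
    ],
    status=models.IncidentStatusClass(
        state=models.IncidentStateClass.ACTIVE,
        lastUpdated=audit_stamp,
    ),
    created=audit_stamp,
)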

Incident Status and Lifecycle

Incidents follow a lifecycle from creation through resolution, tracked through status and stage fields:

Status State

The top-level state indicates whether an incident is active or resolved:

  • ACTIVE: The incident is ongoing and requires attention or action.
  • RESOLVED: The incident has been addressed and is no longer active.

Lifecycle Stages

Incidents can be assigned to specific stages that represent where they are in the resolution process:

  • TRIAGE: The impact and priority of the incident are being actively assessed. This is typically the first stage for newly reported incidents.
  • INVESTIGATION: The root cause of the incident is being investigated by the assigned team.
  • WORK_IN_PROGRESS: The incident is in the remediation stage, with active work happening to resolve the issue.
  • FIXED: The incident has been resolved through corrective action (completed remediation).
  • NO_ACTION_REQUIRED: The incident is resolved with no action required, for example if it was a false positive, expected behavior, or resolved itself.

The status also includes a message field for providing context about the current state and a lastUpdated timestamp tracking when the status was last modified.
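
For example, marking an incident resolved after remediation pairs the RESOLVED state with the FIXED stage. A minimal sketch of the status value (this is set on the incidentInfo aspect and re-emitted, as the update example under Code Examples below shows in full):

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models

resolved_status = models.IncidentStatusClass(
    state=models.IncidentStateClass.RESOLVED,
    stage=models.IncidentStageClass.FIXED,
    message="Backfill completed; freshness checks are passing again.",
    lastUpdated=models.AuditStampClass(
        time=builder.get_sys_time(),  # epoch milliseconds
        actor=builder.make_user_urn("jdoe"),
    ),
)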

Priority Levels

Incidents can be assigned a priority to help teams triage and focus on the most critical issues:

  • CRITICAL (priority 0): Severe issues requiring immediate attention that significantly impact business operations or data quality.
  • HIGH (priority 1): Important issues that should be addressed promptly but are not immediately blocking.
  • MEDIUM (priority 2): Moderate issues that should be addressed in the normal course of work.
  • LOW (priority 3): Minor issues that can be addressed when time permits.

The priority field is stored as an integer (0-3) in the data model, allowing for programmatic sorting and filtering.
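
Because lower numbers mean higher urgency, an ascending sort surfaces the most critical incidents first. A small illustrative sketch (the incident list is made up):

PRIORITY_NAMES = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

# Hypothetical (urn, priority) pairs fetched from DataHub elsewhere.
open_incidents = [
    ("urn:li:incident:aaa", 2),
    ("urn:li:incident:bbb", 0),
    ("urn:li:incident:ccc", 1),
]

for urn, priority in sorted(open_incidents, key=lambda pair: pair[1]):
    print(f"{PRIORITY_NAMES[priority]}: {urn}")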

Assignees

Incidents can be assigned to one or more users or groups responsible for investigating and resolving the issue. Each assignee includes:

  • actor: The URN of the user (corpUser) or group (corpGroup) assigned to the incident.
  • assignedAt: An audit stamp capturing who made the assignment and when it occurred.

Multiple assignees can collaborate on resolving a single incident, making it easy to involve cross-functional teams.

Affected Entities

A key feature of incidents is the ability to link them to one or more affected data assets. The entities field contains an array of URNs referencing the assets impacted by the incident. Supported entity types include:

  • dataset: Tables, views, streams, or other data collections
  • chart: Data visualizations
  • dashboard: Dashboard pages containing multiple charts
  • dataFlow: Pipelines or workflows
  • dataJob: Individual tasks or jobs within a pipeline
  • schemaField: Specific fields/columns within a dataset

This linkage allows users to see all incidents affecting a particular asset and understand the scope of an incident across multiple assets.
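
The SDK provides URN builders for each of these types; a sketch of assembling a mixed entities list (the platform and asset names are illustrative):

import datahub.emitter.mce_builder as builder

dataset_urn = builder.make_dataset_urn(
    platform="snowflake", name="analytics.sales_fact", env="PROD"
)

affected_entities = [
    dataset_urn,
    # A specific column within the dataset.
    builder.make_schema_field_urn(dataset_urn, "order_total"),
    # A task within a pipeline.
    builder.make_data_job_urn(
        orchestrator="airflow", flow_id="sales_etl", job_id="load_sales_fact"
    ),
]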

Incident Source

The source field tracks how the incident was created:

  • MANUAL: The incident was manually created by a user through the UI or API.
  • ASSERTION_FAILURE: The incident was automatically raised by a failed assertion. In this case, the sourceUrn field contains the URN of the assertion that triggered the incident.

This distinction helps teams understand which incidents require manual investigation versus those generated by automated monitoring.
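
A source value for an assertion-raised incident might look like the following sketch (the assertion URN is hypothetical):

import datahub.metadata.schema_classes as models

source = models.IncidentSourceClass(
    type=models.IncidentSourceTypeClass.ASSERTION_FAILURE,
    # URN of the hypothetical assertion whose failure raised the incident.
    sourceUrn="urn:li:assertion:freshness-check-1234",
)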

Temporal Tracking

Incidents maintain detailed temporal information:

  • startedAt: The time when the incident actually began (may be earlier than when it was reported).
  • created: An audit stamp tracking who created the incident and when it was first reported.
  • lastUpdated: An audit stamp on the status tracking the most recent status change.

This temporal data helps teams understand incident timelines and calculate mean time to detection (MTTD) and mean time to resolution (MTTR).
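
A rough sketch of deriving these metrics for a single resolved incident from a fetched IncidentInfoClass (since lastUpdated reflects only the most recent status change, this approximates time to resolution):

def resolution_metrics_minutes(info):
    """Approximate MTTD/MTTR inputs for one incident; times are epoch millis."""
    started = info.startedAt or info.created.time
    time_to_detect = (info.created.time - started) / 60_000
    time_to_resolve = (info.status.lastUpdated.time - info.created.time) / 60_000
    return time_to_detect, time_to_resolve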

Tags

Like other DataHub entities, incidents can be tagged using the globalTags aspect. Tags help categorize and filter incidents, making it easier to find related issues or analyze incident patterns by category.

Code Examples

Create an Incident

The following example demonstrates creating a new incident and associating it with a dataset that has a data quality issue.

Python SDK: Create a basic incident
# metadata-ingestion/examples/library/incident_create.py
import logging
import os
import uuid

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata._urns.urn_defs import IncidentUrn

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Configuration
gms_endpoint = os.getenv("DATAHUB_GMS_URL", "http://localhost:8080")
token = os.getenv("DATAHUB_GMS_TOKEN")
emitter = DatahubRestEmitter(gms_server=gms_endpoint, token=token)

# Generate a unique incident ID
incident_id = str(uuid.uuid4())
incident_urn = IncidentUrn(incident_id)

# Create the dataset URN that this incident affects
dataset_urn = builder.make_dataset_urn(
    platform="snowflake", name="analytics.sales_fact", env="PROD"
)

# Get the current actor URN for audit stamps.
# get_sys_time() already returns epoch milliseconds, so no further scaling is needed.
actor_urn = builder.make_user_urn("datahub")
audit_stamp = models.AuditStampClass(
    time=builder.get_sys_time(),
    actor=actor_urn,
)

# Create the incident info aspect
incident_info = models.IncidentInfoClass(
    type=models.IncidentTypeClass.FRESHNESS,
    title="Sales data not updated in 48 hours",
    description=(
        "The sales_fact table has not been refreshed since 2023-10-15. "
        "Expected daily updates are missing, which may impact downstream "
        "reporting and dashboards."
    ),
    entities=[dataset_urn],
    status=models.IncidentStatusClass(
        state=models.IncidentStateClass.ACTIVE,
        stage=models.IncidentStageClass.TRIAGE,
        message="Investigating potential pipeline failure",
        lastUpdated=audit_stamp,
    ),
    priority=0,  # CRITICAL priority (0=CRITICAL, 1=HIGH, 2=MEDIUM, 3=LOW)
    source=models.IncidentSourceClass(type=models.IncidentSourceTypeClass.MANUAL),
    created=audit_stamp,
)

# Create and emit the metadata change proposal
metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=str(incident_urn),
    aspect=incident_info,
)

emitter.emit(metadata_change_proposal)
log.info(f"Created incident {incident_urn} for dataset {dataset_urn}")
log.info(
    f"Incident details: type={incident_info.type}, "
    f"priority={incident_info.priority}, status={incident_info.status.state}"
)

Update Incident Status

As incidents progress through their lifecycle, you'll need to update their status to reflect the current state and stage.

Python SDK: Update incident status and stage
# metadata-ingestion/examples/library/incident_update_status.py
import logging

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.urns import IncidentUrn

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Configuration
gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

# Specify the incident to update (use the incident ID from incident_create.py)
incident_id = "a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d"
incident_urn = IncidentUrn(incident_id)

# Retrieve the current incident info to preserve other fields
current_incident_info = graph.get_aspect(
    entity_urn=str(incident_urn),
    aspect_type=models.IncidentInfoClass,
)

if not current_incident_info:
    raise ValueError(f"Incident {incident_urn} not found")

# Get the current actor URN for audit stamps.
# get_sys_time() already returns epoch milliseconds.
actor_urn = builder.make_user_urn("jdoe")
audit_stamp = models.AuditStampClass(
    time=builder.get_sys_time(),
    actor=actor_urn,
)

# Update the status to reflect progress in resolving the incident
current_incident_info.status = models.IncidentStatusClass(
    state=models.IncidentStateClass.ACTIVE,
    stage=models.IncidentStageClass.WORK_IN_PROGRESS,
    message="Pipeline has been restarted. Monitoring for successful completion.",
    lastUpdated=audit_stamp,
)

# Optionally update priority if severity assessment changed
current_incident_info.priority = 1  # HIGH (0=CRITICAL, 1=HIGH, 2=MEDIUM, 3=LOW)

# Optionally assign team members to work on the incident
assignee1 = models.IncidentAssigneeClass(
    actor=builder.make_user_urn("jdoe"),
    assignedAt=audit_stamp,
)
assignee2 = models.IncidentAssigneeClass(
    actor=builder.make_user_urn("asmith"),
    assignedAt=audit_stamp,
)
current_incident_info.assignees = [assignee1, assignee2]

# Create and emit the metadata change proposal
metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=str(incident_urn),
    aspect=current_incident_info,
)

emitter.emit(metadata_change_proposal)
log.info(
    f"Updated incident {incident_urn} status to {current_incident_info.status.state}"
)
log.info(
    f"Status details: stage={current_incident_info.status.stage}, "
    f"message={current_incident_info.status.message}"
)
log.info(f"Priority updated to {current_incident_info.priority}")
log.info(f"Assigned to {len(current_incident_info.assignees)} team members")

Add Tags to an Incident

Tags can be added to incidents to categorize them by team, system, severity, or any other organizational dimension.

Python SDK: Add a tag to an incident
# metadata-ingestion/examples/library/incident_add_tag.py
import logging

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.urns import IncidentUrn

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Configuration
gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

# Specify the incident to tag (use the incident ID from incident_create.py)
incident_id = "a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d"
incident_urn = IncidentUrn(incident_id)

# Create the tag URN
tag_urn = builder.make_tag_urn("data-quality")

# Read current tags to preserve existing ones
current_tags = graph.get_aspect(
    entity_urn=str(incident_urn),
    aspect_type=models.GlobalTagsClass,
)

# Create tag association
tag_association = models.TagAssociationClass(
    tag=tag_urn,
    context="incident_categorization",
)

if current_tags:
    # Check if tag already exists before appending
    tag_exists = any(existing_tag.tag == tag_urn for existing_tag in current_tags.tags)
    if not tag_exists:
        current_tags.tags.append(tag_association)
    else:
        log.info(f"Tag {tag_urn} already exists on incident {incident_urn}")
    updated_tags = current_tags
else:
    # No existing tags, create new GlobalTags aspect
    updated_tags = models.GlobalTagsClass(tags=[tag_association])

# Create and emit the metadata change proposal
metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=str(incident_urn),
    aspect=updated_tags,
)

emitter.emit(metadata_change_proposal)
log.info(f"Added tag {tag_urn} to incident {incident_urn}")
log.info(f"Incident now has {len(updated_tags.tags)} tag(s)")

Query Incident via REST API

After creating incidents, you can retrieve them using the DataHub REST API to integrate with external monitoring or ticketing systems.

Query incident using REST API
# metadata-ingestion/examples/library/incident_query_rest_api.py
import logging
import os

import requests

import datahub.metadata.schema_classes as models
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata._urns.urn_defs import IncidentUrn

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Configuration
gms_endpoint = os.getenv("DATAHUB_GMS_URL", "http://localhost:8080")
token = os.getenv("DATAHUB_GMS_TOKEN")
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint, token=token))

# Specify the incident to query (use the incident ID from incident_create.py)
incident_id = "a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d"
incident_urn = IncidentUrn(incident_id)

# Query the incident info aspect
incident_info = graph.get_aspect(
    entity_urn=str(incident_urn),
    aspect_type=models.IncidentInfoClass,
)

if incident_info:
    log.info(f"Incident: {incident_urn}")
    log.info(f"  Type: {incident_info.type}")
    log.info(f"  Title: {incident_info.title}")
    log.info(f"  Description: {incident_info.description}")
    log.info(f"  Priority: {incident_info.priority}")
    log.info(f"  Status State: {incident_info.status.state}")
    log.info(f"  Status Stage: {incident_info.status.stage}")
    log.info(f"  Status Message: {incident_info.status.message}")
    log.info(f"  Affected Entities: {len(incident_info.entities)}")
    for entity_urn in incident_info.entities:
        log.info(f"    - {entity_urn}")

    if incident_info.assignees:
        log.info(f"  Assignees: {len(incident_info.assignees)}")
        for assignee in incident_info.assignees:
            log.info(f"    - {assignee.actor}")

    if incident_info.source:
        log.info(f"  Source Type: {incident_info.source.type}")
        if incident_info.source.sourceUrn:
            log.info(f"  Source URN: {incident_info.source.sourceUrn}")

    log.info(
        f"  Created: {incident_info.created.time} by {incident_info.created.actor}"
    )
    log.info(
        f"  Last Updated: {incident_info.status.lastUpdated.time} "
        f"by {incident_info.status.lastUpdated.actor}"
    )
else:
    log.warning(f"Incident {incident_urn} not found")

# Query the tags aspect
tags = graph.get_aspect(
    entity_urn=str(incident_urn),
    aspect_type=models.GlobalTagsClass,
)

if tags:
    log.info(f"  Tags: {len(tags.tags)}")
    for tag_association in tags.tags:
        log.info(f"    - {tag_association.tag}")

# Alternative: Use the REST API directly with requests.
# This approach is useful for integration with external systems.

# Query incident entity using the REST API
headers = {"Content-Type": "application/json"}
if token:
    headers["Authorization"] = f"Bearer {token}"

response = requests.get(
    f"{gms_endpoint}/entities/{incident_urn}",
    headers=headers,
)

if response.status_code == 200:
    entity_data = response.json()
    log.info("\nREST API Response:")
    log.info(f"  Entity URN: {entity_data.get('urn')}")
    log.info(f"  Aspects: {list(entity_data.get('aspects', {}).keys())}")
else:
    log.error(f"Failed to query incident via REST API: {response.status_code}")

Integration Points

Relationship with Assertions

Incidents are tightly integrated with DataHub's assertion framework. When assertions (data quality checks) fail and are configured to raise incidents, they automatically create incident entities. These incidents:

  • Reference the assertion that triggered them via the sourceUrn field
  • Inherit the type from the assertion (FRESHNESS, VOLUME, FIELD, SQL, DATA_SCHEMA)
  • Link to the assets being monitored by the assertion
  • Can be configured at the assertion level to control whether failures generate incidents

This integration provides automatic incident creation for monitored data quality checks.

Incidents Summary on Assets

DataHub entities that can have incidents (datasets, dashboards, charts, dataFlows, dataJobs, schemaFields) include an incidentsSummary aspect. This aspect provides:

  • A count of active incidents affecting the entity
  • A count of resolved incidents
  • The priority breakdown of active incidents
  • Quick access to incident details without querying the incident entities directly

This summary appears in the UI on asset pages, giving users immediate visibility into data quality issues.
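
A sketch of reading this summary for a dataset with the Python SDK (the exact fields on the summary aspect vary across DataHub versions, so this guards with getattr; the dataset URN is illustrative):

import datahub.metadata.schema_classes as models
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

summary = graph.get_aspect(
    entity_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.sales_fact,PROD)",
    aspect_type=models.IncidentsSummaryClass,
)

if summary:
    # Newer servers populate *IncidentDetails lists; older ones used plain URN lists.
    active = getattr(summary, "activeIncidentDetails", None) or []
    resolved = getattr(summary, "resolvedIncidentDetails", None) or []
    print(f"active={len(active)}, resolved={len(resolved)}")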

GraphQL Operations

The DataHub GraphQL API provides several operations for working with incidents:

  • raiseIncident: Creates a new incident with specified type, priority, status, and affected entities
  • updateIncident: Updates incident properties including title, description, status, priority, assignees, and affected entities
  • updateIncidentStatus: Specifically updates the status state and stage of an incident
  • entityIncidents: Queries all incidents affecting a particular entity

These operations are used by the DataHub UI and can be called directly by external applications.
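
These operations can also be called from Python via the SDK's GraphQL helper; a sketch of raiseIncident follows (the input fields shown are abbreviated and may differ slightly between DataHub versions):

from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

mutation = """
mutation raiseIncident($input: RaiseIncidentInput!) {
  raiseIncident(input: $input)
}
"""

result = graph.execute_graphql(
    mutation,
    variables={
        "input": {
            "type": "OPERATIONAL",
            "title": "Nightly load failed",
            "description": "The sales ETL pipeline failed overnight.",
            "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.sales_fact,PROD)",
        }
    },
)
print(result["raiseIncident"])  # URN of the newly raised incident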

Authorization

Incident operations respect DataHub's authorization model. Users must have the EDIT_ENTITY_INCIDENTS privilege on an entity to:

  • Create incidents affecting that entity
  • Update incidents linked to that entity
  • Change the status of incidents affecting that entity

This ensures that only users with appropriate permissions can manage incidents for sensitive data assets.

Health Status

Incidents factor into the overall health status of DataHub entities. Assets with active CRITICAL or HIGH priority incidents may be marked as unhealthy in the UI, helping users quickly identify problematic data assets.

Notable Exceptions

Single vs. Multiple Affected Entities

While the data model supports incidents affecting multiple entities (via the entities array), some GraphQL resolvers have limitations when working with multi-entity incidents. Specifically, the UpdateIncidentStatusResolver currently checks authorization only against the first entity in the array. This is noted in the code as a TODO for future enhancement.

When creating incidents, it's recommended to:

  • Use multiple entities when they're all affected by the same root cause (e.g., all downstream datasets affected by an upstream data quality issue)
  • Be aware that users need appropriate permissions on all affected entities to update the incident
  • Consider the UI implications of multi-entity incidents when displaying incident details

Priority Field Type

The priority field is stored as an integer (0-3) rather than as an enum in the PDL model. This was noted in the schema comments as a potential area for future improvement. The GraphQL layer provides an enum interface (CRITICAL, HIGH, MEDIUM, LOW) that maps to these integer values, but the underlying storage uses integers.

When working with the low-level SDK, use the integer values:

  • 0 = CRITICAL
  • 1 = HIGH
  • 2 = MEDIUM
  • 3 = LOW

Automatic vs. Manual Incidents

Incidents created automatically by assertion failures cannot have their source field changed to MANUAL, and vice versa. The source field is set at creation time and reflects the origin of the incident. This distinction is important for reporting and analytics, as it helps teams understand the effectiveness of automated monitoring versus manual incident reporting.

Status Message Length

While there is no explicit length limit on the status message field in the schema, UI components may truncate very long messages. It's recommended to keep status messages concise (under 500 characters) and use the incident description field for longer explanations.

Incident Retention

Incidents are not automatically deleted when their affected entities are removed. This preserves the historical record of data quality issues even after assets are deprecated or deleted. However, this can lead to orphaned incidents that reference non-existent entities. It's recommended to implement cleanup processes for incidents linked to deleted assets if this becomes an issue in your organization.
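
One possible cleanup sketch: soft-delete incidents whose affected entities no longer exist (the incident URN list is assumed to come from elsewhere, such as a search query):

import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Assumed input: incident URNs gathered elsewhere (e.g. via search).
incident_urns = ["urn:li:incident:a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d"]

for incident_urn in incident_urns:
    info = graph.get_aspect(
        entity_urn=incident_urn, aspect_type=models.IncidentInfoClass
    )
    if info and not any(graph.exists(entity) for entity in info.entities):
        # All affected entities are gone: soft-delete the orphaned incident.
        graph.emit_mcp(
            MetadataChangeProposalWrapper(
                entityUrn=incident_urn,
                aspect=models.StatusClass(removed=True),
            )
        )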

Technical Reference

For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.