Assertion

The assertion entity represents a data quality rule that can be applied to one or more datasets. Assertions are the foundation of DataHub's data quality framework, enabling organizations to define, monitor, and enforce expectations about their data. They encompass various types of checks including field-level validation, volume monitoring, freshness tracking, schema validation, and custom SQL-based rules.

Assertions can originate from multiple sources: they can be defined natively within DataHub, ingested from external data quality tools (such as Great Expectations, dbt tests, or Snowflake Data Quality), or inferred by ML-based systems. Each assertion tracks its evaluation history over time, maintaining a complete audit trail of passes, failures, and errors.

Identity

An Assertion is uniquely identified by an assertionId, which is a globally unique identifier that remains constant across runs of the assertion. The URN format is:

urn:li:assertion:<assertionId>

The assertionId is typically a generated GUID that uniquely identifies the assertion definition. For example:

urn:li:assertion:432475190cc846f2894b5b3aa4d55af2

Generating Stable Assertion IDs

The logic for generating stable assertion IDs differs based on the source of the assertion:

  • Native Assertions: For assertions created in DataHub Cloud's UI or API, the platform generates a UUID
  • External Assertions: Each integration tool generates IDs based on its own conventions:
    • Great Expectations: Combines expectation suite name, expectation type, and parameters
    • dbt Tests: Uses the test's unique_id from the manifest
    • Snowflake Data Quality: Uses the native DMF rule ID
  • Inferred Assertions: ML-based systems generate IDs based on the inference model and target

The key requirement is that the same assertion definition should always produce the same assertionId, enabling DataHub to track the assertion's history over time even as it's re-evaluated.
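The stability requirement can be illustrated with a small standard-library sketch. It is not DataHub's own implementation (the SDK examples on this page use builder.datahub_guid for this purpose); the function name stable_assertion_id and the choice of MD5 over canonical JSON are assumptions made for illustration.

```python
import hashlib
import json


def stable_assertion_id(definition: dict) -> str:
    """Derive a deterministic ID from the fields that define an assertion."""
    # Sort keys and fix separators so equal definitions serialize identically.
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


definition = {
    "entity": "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.myschema.users,PROD)",
    "field": "user_id",
    "type": "uniqueness",
}
assertion_urn = f"urn:li:assertion:{stable_assertion_id(definition)}"
print(assertion_urn)
```

Because the ID depends only on the definition, re-running the same integration yields the same URN, so new run events attach to the existing assertion instead of creating a duplicate.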

Important Capabilities

Assertion Types

DataHub supports several types of assertions, each designed to validate different aspects of data quality:

1. Field Assertions (FIELD)

Field assertions validate individual columns or fields within a dataset. They come in two sub-types:

Field Values Assertions: Validate that each value in a column meets certain criteria. For example:

  • Values must be within a specific range
  • Values must match a regex pattern
  • Values must be one of a set of allowed values
  • Values must not be null

Field Metric Assertions: Validate aggregated statistics about a column. For example:

  • Null percentage must be less than 5%
  • Unique count must equal row count (uniqueness check)
  • Mean value must be between 0 and 100
  • Standard deviation must be less than 10

Python SDK: Create a field uniqueness assertion
# metadata-ingestion/examples/library/assertion_field_uniqueness.py
import os

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AssertionInfoClass,
    AssertionStdOperatorClass,
    AssertionTypeClass,
    FieldAssertionInfoClass,
    FieldAssertionTypeClass,
    FieldMetricAssertionClass,
    FieldMetricTypeClass,
    SchemaFieldSpecClass,
)

emitter = DatahubRestEmitter(
    gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
    token=os.getenv("DATAHUB_GMS_TOKEN"),
)

dataset_urn = builder.make_dataset_urn(platform="snowflake", name="mydb.myschema.users")

field_assertion_info = FieldAssertionInfoClass(
    type=FieldAssertionTypeClass.FIELD_METRIC,
    entity=dataset_urn,
    fieldMetricAssertion=FieldMetricAssertionClass(
        field=SchemaFieldSpecClass(
            path="user_id",
            type="VARCHAR",
            nativeType="VARCHAR",
        ),
        metric=FieldMetricTypeClass.UNIQUE_COUNT,
        operator=AssertionStdOperatorClass.EQUAL_TO,
        parameters=None,
    ),
)

assertion_info = AssertionInfoClass(
    type=AssertionTypeClass.FIELD,
    fieldAssertion=field_assertion_info,
    description="User ID must be unique across all rows",
)

assertion_urn = builder.make_assertion_urn(
    builder.datahub_guid(
        {"entity": dataset_urn, "field": "user_id", "type": "uniqueness"}
    )
)

assertion_info_mcp = MetadataChangeProposalWrapper(
    entityUrn=assertion_urn,
    aspect=assertion_info,
)

emitter.emit_mcp(assertion_info_mcp)
print(f"Created field uniqueness assertion: {assertion_urn}")
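To make the difference between the two field sub-types concrete, the following sketch evaluates both kinds of check over a few in-memory rows. It only illustrates the semantics; it does not use the DataHub SDK, and the helper names and sample rows are hypothetical.

```python
import re

rows = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u2", "email": None},
    {"user_id": "u3", "email": "not-an-email"},
]


def field_values_failures(rows, field, predicate):
    """Field values check: return the rows whose value fails the predicate."""
    return [row for row in rows if not predicate(row[field])]


def null_percentage(rows, field):
    """Field metric: percentage of rows where the field is null."""
    nulls = sum(1 for row in rows if row[field] is None)
    return 100.0 * nulls / len(rows)


def unique_count(rows, field):
    """Field metric: number of distinct non-null values."""
    return len({row[field] for row in rows if row[field] is not None})


# Field values assertion: every email must be non-null and match a pattern.
email_pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
bad_emails = field_values_failures(
    rows, "email", lambda v: v is not None and email_pattern.match(v) is not None
)

# Field metric assertions: compare one aggregated number against a threshold.
user_id_is_unique = unique_count(rows, "user_id") == len(rows)
email_nulls_ok = null_percentage(rows, "email") < 5
print(len(bad_emails), user_id_is_unique, email_nulls_ok)
```

A field values check is evaluated row by row and can report the offending rows, while a field metric check reduces the column to a single number first.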

2. Volume Assertions (VOLUME)

Volume assertions monitor the amount of data in a dataset. They support several sub-types:

  • ROW_COUNT_TOTAL: Total number of rows must meet expectations
  • ROW_COUNT_CHANGE: Change in row count over time must meet expectations
  • INCREMENTING_SEGMENT_ROW_COUNT_TOTAL: Row count of the latest partition/segment must meet expectations
  • INCREMENTING_SEGMENT_ROW_COUNT_CHANGE: Change in row count between consecutive partitions/segments must meet expectations

Volume assertions are critical for detecting data pipeline failures, incomplete loads, or unexpected data growth.

Python SDK: Create a row count volume assertion
# metadata-ingestion/examples/library/assertion_volume_rows.py
import os

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AssertionInfoClass,
    AssertionStdOperatorClass,
    AssertionStdParameterClass,
    AssertionStdParametersClass,
    AssertionStdParameterTypeClass,
    AssertionTypeClass,
    RowCountTotalClass,
    VolumeAssertionInfoClass,
    VolumeAssertionTypeClass,
)

emitter = DatahubRestEmitter(
    gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
    token=os.getenv("DATAHUB_GMS_TOKEN"),
)

dataset_urn = builder.make_dataset_urn(
    platform="bigquery", name="project.dataset.orders"
)

volume_assertion_info = VolumeAssertionInfoClass(
    type=VolumeAssertionTypeClass.ROW_COUNT_TOTAL,
    entity=dataset_urn,
    rowCountTotal=RowCountTotalClass(
        operator=AssertionStdOperatorClass.BETWEEN,
        parameters=AssertionStdParametersClass(
            minValue=AssertionStdParameterClass(
                type=AssertionStdParameterTypeClass.NUMBER,
                value="1000",
            ),
            maxValue=AssertionStdParameterClass(
                type=AssertionStdParameterTypeClass.NUMBER,
                value="1000000",
            ),
        ),
    ),
)

assertion_info = AssertionInfoClass(
    type=AssertionTypeClass.VOLUME,
    volumeAssertion=volume_assertion_info,
    description="Orders table must contain between 1,000 and 1,000,000 rows",
)

assertion_urn = builder.make_assertion_urn(
    builder.datahub_guid({"entity": dataset_urn, "type": "row-count-range"})
)

assertion_info_mcp = MetadataChangeProposalWrapper(
    entityUrn=assertion_urn,
    aspect=assertion_info,
)

emitter.emit_mcp(assertion_info_mcp)
print(f"Created volume assertion: {assertion_urn}")
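The following sketch shows, in plain Python, what evaluating a ROW_COUNT_TOTAL and a ROW_COUNT_CHANGE assertion might look like. The function names are hypothetical, and the inclusive bounds and the percentage-based change threshold are assumptions rather than documented DataHub behavior.

```python
def row_count_total_ok(row_count: int, min_rows: int, max_rows: int) -> bool:
    """ROW_COUNT_TOTAL with BETWEEN: range check on the current row count."""
    # Assumption: both endpoints are inclusive.
    return min_rows <= row_count <= max_rows


def row_count_change_ok(previous: int, current: int, max_pct_change: float) -> bool:
    """ROW_COUNT_CHANGE: bound the relative change between two evaluations."""
    if previous == 0:
        # No baseline to divide by; only "still empty" counts as unchanged.
        return current == 0
    pct_change = abs(current - previous) / previous * 100
    return pct_change <= max_pct_change


print(row_count_total_ok(5_000, 1_000, 1_000_000))
print(row_count_change_ok(previous=1_000, current=400, max_pct_change=20))
```

The total check catches empty or runaway tables, while the change check catches a load that silently dropped or duplicated a large share of rows even though the total is still within range.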

3. Freshness Assertions (FRESHNESS)

Freshness assertions ensure data is updated within expected time windows. Two types are supported:

  • DATASET_CHANGE: Based on dataset change operations (insert, update, delete) captured from audit logs
  • DATA_JOB_RUN: Based on successful execution of a data job

Freshness assertions define a schedule that specifies when updates should occur (e.g., daily by 9 AM, every 4 hours) and what tolerance is acceptable.

Python SDK: Create a dataset change freshness assertion
# metadata-ingestion/examples/library/assertion_freshness.py
import os

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AssertionInfoClass,
    AssertionTypeClass,
    FreshnessAssertionInfoClass,
    FreshnessAssertionScheduleClass,
    FreshnessAssertionScheduleTypeClass,
    FreshnessAssertionTypeClass,
    FreshnessCronScheduleClass,
)

emitter = DatahubRestEmitter(
    gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
    token=os.getenv("DATAHUB_GMS_TOKEN"),
)

dataset_urn = builder.make_dataset_urn(
    platform="redshift", name="prod.analytics.daily_metrics"
)

freshness_assertion_info = FreshnessAssertionInfoClass(
    type=FreshnessAssertionTypeClass.DATASET_CHANGE,
    entity=dataset_urn,
    schedule=FreshnessAssertionScheduleClass(
        type=FreshnessAssertionScheduleTypeClass.CRON,
        cron=FreshnessCronScheduleClass(
            cron="0 9 * * *",
            timezone="America/Los_Angeles",
            windowStartOffsetMs=None,
        ),
    ),
)

assertion_info = AssertionInfoClass(
    type=AssertionTypeClass.FRESHNESS,
    freshnessAssertion=freshness_assertion_info,
    description="Daily metrics table must be updated every day by 9 AM Pacific Time",
)

assertion_urn = builder.make_assertion_urn(
    builder.datahub_guid({"entity": dataset_urn, "type": "freshness-daily-9am"})
)

assertion_info_mcp = MetadataChangeProposalWrapper(
    entityUrn=assertion_urn,
    aspect=assertion_info,
)

emitter.emit_mcp(assertion_info_mcp)
print(f"Created freshness assertion: {assertion_urn}")
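As a rough illustration of how a freshness check might be evaluated, the sketch below tests whether the most recent change falls inside the window that ends at evaluation time. It assumes the window equals the interval between scheduled evaluations (one day for a daily schedule); the function name is hypothetical and the logic is not taken from DataHub's evaluator.

```python
from datetime import datetime, timedelta, timezone


def freshness_ok(last_change: datetime, evaluated_at: datetime, window: timedelta) -> bool:
    """Pass if the dataset changed within the window that ends at evaluation time."""
    window_start = evaluated_at - window
    return window_start < last_change <= evaluated_at


# Evaluation runs at the scheduled time; timestamps are kept in UTC for simplicity.
evaluated_at = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
on_time = datetime(2024, 1, 2, 3, 30, tzinfo=timezone.utc)
stale = datetime(2023, 12, 31, 23, 0, tzinfo=timezone.utc)

print(freshness_ok(on_time, evaluated_at, timedelta(days=1)))
print(freshness_ok(stale, evaluated_at, timedelta(days=1)))
```

The same function covers an "every 4 hours" schedule by passing timedelta(hours=4) as the window.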

4. Schema Assertions (DATA_SCHEMA)

Schema assertions validate that a dataset's structure matches expectations. They verify:

  • Presence or absence of specific columns
  • Column data types
  • Column ordering (optional)
  • Schema compatibility modes:
    • EXACT_MATCH: Schema must match exactly
    • SUPERSET: Actual schema can have additional columns
    • SUBSET: Actual schema can have fewer columns

Schema assertions are valuable for detecting breaking changes in upstream data sources.

Python SDK: Create a schema assertion
# metadata-ingestion/examples/library/assertion_schema.py
import os
import time

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AssertionInfoClass,
    AssertionTypeClass,
    AuditStampClass,
    NumberTypeClass,
    SchemaAssertionCompatibilityClass,
    SchemaAssertionInfoClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemalessClass,
    SchemaMetadataClass,
    StringTypeClass,
)

emitter = DatahubRestEmitter(
    gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
    token=os.getenv("DATAHUB_GMS_TOKEN"),
)

dataset_urn = builder.make_dataset_urn(platform="kafka", name="prod.user_events")

current_timestamp = int(time.time() * 1000)
audit_stamp = AuditStampClass(
    time=current_timestamp,
    actor="urn:li:corpuser:datahub",
)

expected_schema = SchemaMetadataClass(
    schemaName="user_events",
    platform=builder.make_data_platform_urn("kafka"),
    version=0,
    created=audit_stamp,
    lastModified=audit_stamp,
    fields=[
        SchemaFieldClass(
            fieldPath="user_id",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="string",
            lastModified=audit_stamp,
        ),
        SchemaFieldClass(
            fieldPath="event_type",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="string",
            lastModified=audit_stamp,
        ),
        SchemaFieldClass(
            fieldPath="timestamp",
            type=SchemaFieldDataTypeClass(type=NumberTypeClass()),
            nativeDataType="long",
            lastModified=audit_stamp,
        ),
        SchemaFieldClass(
            fieldPath="properties",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="string",
            lastModified=audit_stamp,
        ),
    ],
    hash="",
    platformSchema=SchemalessClass(),
)

schema_assertion_info = SchemaAssertionInfoClass(
    entity=dataset_urn,
    schema=expected_schema,
    compatibility=SchemaAssertionCompatibilityClass.SUPERSET,
)

assertion_info = AssertionInfoClass(
    type=AssertionTypeClass.DATA_SCHEMA,
    schemaAssertion=schema_assertion_info,
    description="User events stream must have required schema fields (can include additional fields)",
)

assertion_urn = builder.make_assertion_urn(
    builder.datahub_guid({"entity": dataset_urn, "type": "schema-check"})
)

assertion_info_mcp = MetadataChangeProposalWrapper(
    entityUrn=assertion_urn,
    aspect=assertion_info,
)

emitter.emit_mcp(assertion_info_mcp)
print(f"Created schema assertion: {assertion_urn}")
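The three compatibility modes can be read as set relations between the expected and the actual columns. The sketch below is one plausible interpretation using plain dictionaries of column name to type; the function name is hypothetical, column ordering is ignored, and the rule that shared columns must agree on type in every mode is an assumption.

```python
def schema_compatible(expected: dict, actual: dict, mode: str) -> bool:
    """Compare {column: type} mappings under a compatibility mode."""
    # Assumption: columns present in both schemas must agree on type in every mode.
    shared = expected.keys() & actual.keys()
    if any(expected[col] != actual[col] for col in shared):
        return False
    if mode == "EXACT_MATCH":
        return expected.keys() == actual.keys()
    if mode == "SUPERSET":
        # Actual may add columns but must contain every expected column.
        return expected.keys() <= actual.keys()
    if mode == "SUBSET":
        # Actual may drop columns but must not introduce unexpected ones.
        return actual.keys() <= expected.keys()
    raise ValueError(f"Unknown compatibility mode: {mode}")


expected = {"user_id": "string", "event_type": "string", "timestamp": "long"}
actual = {**expected, "session_id": "string"}

for mode in ("EXACT_MATCH", "SUPERSET", "SUBSET"):
    print(mode, schema_compatible(expected, actual, mode))
```

With one extra column in the actual schema, only SUPERSET passes, which is why it suits event streams where producers may add fields over time.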

5. SQL Assertions (SQL)

SQL assertions allow custom validation logic using arbitrary SQL queries. Two types are supported:

  • METRIC: Execute SQL and assert the returned metric meets expectations
  • METRIC_CHANGE: Assert the change in a SQL metric over time

SQL assertions provide maximum flexibility for complex validation scenarios that don't fit other assertion types, such as cross-table referential integrity checks or business rule validation.

Python SDK: Create a SQL metric assertion
# metadata-ingestion/examples/library/assertion_sql_metric.py
import os

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AssertionInfoClass,
    AssertionStdOperatorClass,
    AssertionStdParameterClass,
    AssertionStdParametersClass,
    AssertionStdParameterTypeClass,
    AssertionTypeClass,
    SqlAssertionInfoClass,
    SqlAssertionTypeClass,
)

emitter = DatahubRestEmitter(
    gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
    token=os.getenv("DATAHUB_GMS_TOKEN"),
)

dataset_urn = builder.make_dataset_urn(platform="postgres", name="public.transactions")

sql_assertion_info = SqlAssertionInfoClass(
    type=SqlAssertionTypeClass.METRIC,
    entity=dataset_urn,
    statement="SELECT SUM(amount) FROM public.transactions WHERE status = 'completed' AND date = CURRENT_DATE",
    operator=AssertionStdOperatorClass.GREATER_THAN_OR_EQUAL_TO,
    parameters=AssertionStdParametersClass(
        value=AssertionStdParameterClass(
            type=AssertionStdParameterTypeClass.NUMBER,
            value="0",
        )
    ),
)

assertion_info = AssertionInfoClass(
    type=AssertionTypeClass.SQL,
    sqlAssertion=sql_assertion_info,
    description="Total completed transaction amount today must be non-negative",
)

assertion_urn = builder.make_assertion_urn(
    builder.datahub_guid(
        {"entity": dataset_urn, "type": "sql-completed-transactions-sum"}
    )
)

assertion_info_mcp = MetadataChangeProposalWrapper(
    entityUrn=assertion_urn,
    aspect=assertion_info,
)

emitter.emit_mcp(assertion_info_mcp)
print(f"Created SQL assertion: {assertion_urn}")
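For readers who want to see what evaluating a METRIC SQL assertion amounts to, here is a self-contained sketch against an in-memory SQLite database. The table, the sample rows, and the evaluate_sql_metric helper are hypothetical; treating a NULL metric as a failure is an assumption, not documented DataHub behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(120.0, "completed"), (80.5, "completed"), (-15.0, "refunded")],
)


def evaluate_sql_metric(conn, statement: str, threshold: float) -> bool:
    """Run a single-value query and apply GREATER_THAN_OR_EQUAL_TO."""
    (metric,) = conn.execute(statement).fetchone()
    # SUM over zero rows yields NULL; treat a missing metric as a failure.
    return metric is not None and metric >= threshold


statement = "SELECT SUM(amount) FROM transactions WHERE status = 'completed'"
passed = evaluate_sql_metric(conn, statement, 0)
print(passed)
```

The statement must return exactly one row with one numeric column, which is then compared using the assertion's operator and parameters.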

6. Custom Assertions (CUSTOM)

Custom assertions provide an extension point for assertion types not directly modeled in DataHub. They're useful when:

  • Integrating third-party data quality tools with unique assertion types
  • Starting integration before fully mapping to DataHub's type system
  • Implementing organization-specific validation logic

Assertion Source

The assertionInfo aspect includes an AssertionSource that identifies the origin of the assertion:

  • NATIVE: Defined directly in DataHub (DataHub Cloud feature)
  • EXTERNAL: Ingested from external tools (Great Expectations, dbt, Snowflake, etc.)
  • INFERRED: Generated by ML-based inference systems (DataHub Cloud feature)

External assertions should have a corresponding dataPlatformInstance aspect that identifies the specific platform instance they originated from.

Assertion Run Events

Assertion evaluations are tracked using the assertionRunEvent timeseries aspect. Each evaluation creates a new event with:

  • timestampMillis: When the evaluation occurred
  • runId: Platform-specific identifier for this evaluation run
  • asserteeUrn: The entity being asserted (typically a dataset)
  • assertionUrn: The assertion being evaluated
  • status: COMPLETE, RUNNING, or ERROR
  • result: SUCCESS, FAILURE, or ERROR with details
  • batchSpec: Optional information about the data batch evaluated
  • runtimeContext: Optional key-value pairs with runtime parameters

Run events enable tracking assertion health over time, identifying trends, and debugging failures.
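As an illustration of tracking health over time, the sketch below summarizes a run history. RunEvent is a hypothetical stand-in that carries only a subset of the fields listed above; it is not the SDK's class, and the sample URNs and run IDs are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunEvent:
    """Hypothetical stand-in carrying a subset of assertionRunEvent fields."""

    timestampMillis: int
    runId: str
    asserteeUrn: str
    assertionUrn: str
    status: str  # e.g. "COMPLETE"
    result: Optional[str]  # e.g. "SUCCESS", "FAILURE", "ERROR"


def failure_rate(events: List[RunEvent]) -> float:
    """Share of completed runs whose result was FAILURE."""
    completed = [e for e in events if e.status == "COMPLETE"]
    if not completed:
        return 0.0
    return sum(1 for e in completed if e.result == "FAILURE") / len(completed)


def latest_result(events: List[RunEvent]) -> Optional[str]:
    """Result of the most recent run, or None when there is no history."""
    if not events:
        return None
    return max(events, key=lambda e: e.timestampMillis).result


dataset = "urn:li:dataset:(urn:li:dataPlatform:postgres,public.transactions,PROD)"
assertion = "urn:li:assertion:432475190cc846f2894b5b3aa4d55af2"
history = [
    RunEvent(1_700_000_000_000 + i * 86_400_000, f"run-{i}", dataset, assertion, "COMPLETE", r)
    for i, r in enumerate(["SUCCESS", "FAILURE", "SUCCESS", "SUCCESS"])
]
print(failure_rate(history), latest_result(history))
```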

Assertion Actions

The assertionActions aspect defines automated responses to assertion outcomes:

  • onSuccess: Actions triggered when assertion passes
  • onFailure: Actions triggered when assertion fails

Common actions include:

  • Sending notifications (email, Slack, PagerDuty)
  • Creating incidents
  • Triggering downstream workflows
  • Updating metadata
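The onSuccess and onFailure hooks can be pictured as two lists of callbacks keyed by outcome. The sketch below is a plain-Python illustration only: the action functions are placeholders that return strings instead of calling real services, and the choice to run no actions for other results (such as ERROR) is an assumption.

```python
from typing import Callable, Dict, List

Action = Callable[[str], str]


def notify_slack(assertion_urn: str) -> str:
    # Placeholder: a real action would call a notification service.
    return f"notify-slack:{assertion_urn}"


def raise_incident(assertion_urn: str) -> str:
    return f"raise-incident:{assertion_urn}"


def resolve_incident(assertion_urn: str) -> str:
    return f"resolve-incident:{assertion_urn}"


ACTIONS: Dict[str, List[Action]] = {
    "onSuccess": [resolve_incident],
    "onFailure": [notify_slack, raise_incident],
}


def dispatch(result_type: str, assertion_urn: str) -> List[str]:
    """Run the actions registered for a SUCCESS or FAILURE result."""
    hook = {"SUCCESS": "onSuccess", "FAILURE": "onFailure"}.get(result_type)
    if hook is None:
        # Assumption: other results (such as ERROR) trigger no actions.
        return []
    return [action(assertion_urn) for action in ACTIONS[hook]]


print(dispatch("FAILURE", "urn:li:assertion:432475190cc846f2894b5b3aa4d55af2"))
```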

Tags and Metadata

Like other DataHub entities, assertions support standard metadata capabilities:

  • globalTags: Categorize and organize assertions
  • glossaryTerms: Link assertions to business concepts
  • status: Mark assertions as active or deprecated

Python SDK: Add tags to an assertion
# metadata-ingestion/examples/library/assertion_add_tags.py
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import (
    GlobalTagsClass,
    TagAssociationClass,
)

graph = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))
emitter = DatahubRestEmitter("http://localhost:8080")

assertion_urn = "urn:li:assertion:432475190cc846f2894b5b3aa4d55af2"

# Read the current tags so the new tag is appended rather than overwriting them
existing_tags = graph.get_aspect(
    entity_urn=assertion_urn,
    aspect_type=GlobalTagsClass,
)

if existing_tags is None:
    existing_tags = GlobalTagsClass(tags=[])

tag_to_add = builder.make_tag_urn("data-quality")

tag_association = TagAssociationClass(tag=tag_to_add)

# Only emit an update when the tag is not already attached
if tag_association not in existing_tags.tags:
    existing_tags.tags.append(tag_association)

    tags_mcp = MetadataChangeProposalWrapper(
        entityUrn=assertion_urn,
        aspect=existing_tags,
    )

    emitter.emit_mcp(tags_mcp)
    print(f"Added tag '{tag_to_add}' to assertion {assertion_urn}")
else:
    print(f"Tag '{tag_to_add}' already exists on assertion {assertion_urn}")

Standard Operators and Parameters

Assertions use a standard set of operators for comparisons:

Numeric: BETWEEN, LESS_THAN, LESS_THAN_OR_EQUAL_TO, GREATER_THAN, GREATER_THAN_OR_EQUAL_TO, EQUAL_TO, NOT_EQUAL_TO

String: CONTAIN, START_WITH, END_WITH, REGEX_MATCH, IN, NOT_IN

Boolean: IS_TRUE, IS_FALSE, NULL, NOT_NULL

Native: _NATIVE_ for platform-specific operators

Parameters are provided via AssertionStdParameters:

  • value: Single value for most operators
  • minValue, maxValue: Range endpoints for BETWEEN
  • Parameter types: NUMBER, STRING, SET
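The sketch below shows how a handful of these operators could be applied to an observed value. Parameters are modeled as plain dictionaries with string values, mirroring the value="1000" style used in the SDK examples on this page. The coerce and evaluate helpers are hypothetical, only a subset of operators is covered, and encoding a SET as a JSON array string is an assumption.

```python
import json
import re
from typing import Optional


def coerce(param: dict):
    """Convert a {type, value} parameter, whose value is a string, to Python."""
    if param["type"] == "NUMBER":
        return float(param["value"])
    if param["type"] == "SET":
        # Assumption: sets are encoded as a JSON array string.
        return set(json.loads(param["value"]))
    return param["value"]


def evaluate(operator: str, actual, parameters: Optional[dict] = None) -> bool:
    """Apply a standard operator to an observed value."""
    parameters = parameters or {}
    if operator == "NOT_NULL":
        return actual is not None
    if operator == "BETWEEN":
        return coerce(parameters["minValue"]) <= actual <= coerce(parameters["maxValue"])
    expected = coerce(parameters["value"])
    if operator == "EQUAL_TO":
        return actual == expected
    if operator == "GREATER_THAN_OR_EQUAL_TO":
        return actual >= expected
    if operator == "REGEX_MATCH":
        return re.search(expected, actual) is not None
    if operator == "IN":
        return actual in expected
    raise NotImplementedError(operator)


row_count_range = {
    "minValue": {"type": "NUMBER", "value": "1000"},
    "maxValue": {"type": "NUMBER", "value": "1000000"},
}
print(evaluate("BETWEEN", 5000, row_count_range))
print(evaluate("IN", "US", {"value": {"type": "SET", "value": '["US", "CA"]'}}))
```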

Standard Aggregations

Field and volume assertions can apply aggregation functions before evaluation:

Statistical: MEAN, MEDIAN, STDDEV, MIN, MAX, SUM

Count-based: ROW_COUNT, COLUMN_COUNT, UNIQUE_COUNT, NULL_COUNT

Proportional: UNIQUE_PROPORTION, NULL_PROPORTION

Identity: IDENTITY (no aggregation), COLUMNS (all columns)

Integration Points

Relationship to Datasets

Assertions have a strong relationship with datasets through the Asserts relationship:

  • Field assertions target specific dataset columns
  • Volume assertions monitor dataset row counts
  • Freshness assertions track dataset update times
  • Schema assertions validate dataset structure
  • SQL assertions query dataset contents

Datasets maintain a reverse relationship, showing all assertions that validate them. This enables users to understand the quality checks applied to any dataset.

Relationship to Data Jobs

Freshness assertions can target data jobs (pipelines) to ensure they execute on schedule. When a FreshnessAssertionInfo has type=DATA_JOB_RUN, the entity field references a dataJob URN rather than a dataset.
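A small guard can catch a mismatch between the freshness type and the kind of entity it points at. The helpers below are hypothetical and rely only on the urn:li:<type>:<id> layout shown earlier on this page; the sample dataJob URN is illustrative.

```python
def entity_type(urn: str) -> str:
    """Extract the entity type from a 'urn:li:<type>:<id>' string."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[:2] != ["urn", "li"]:
        raise ValueError(f"Not a DataHub URN: {urn}")
    return parts[2]


# Expected entity type for each freshness assertion type.
EXPECTED_ENTITY = {"DATASET_CHANGE": "dataset", "DATA_JOB_RUN": "dataJob"}


def validate_freshness_target(freshness_type: str, entity_urn: str) -> bool:
    """Check that the entity URN matches what the freshness type expects."""
    return entity_type(entity_urn) == EXPECTED_ENTITY[freshness_type]


job_urn = "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_metrics_dag,PROD),load_metrics)"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:redshift,prod.analytics.daily_metrics,PROD)"
print(validate_freshness_target("DATA_JOB_RUN", job_urn))
print(validate_freshness_target("DATA_JOB_RUN", dataset_urn))
```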

Relationship to Data Platforms

External assertions maintain a relationship to their source platform through the dataPlatformInstance aspect. This enables:

  • Filtering assertions by source tool
  • Deep-linking back to the source platform
  • Understanding the assertion's external context

GraphQL API

Assertions are fully accessible via DataHub's GraphQL API:

  • Query assertions and their run history
  • Create and update native assertions
  • Delete assertions
  • Retrieve assertions for a specific dataset

Key GraphQL types:

  • Assertion: The main assertion entity
  • AssertionInfo: Assertion definition and type
  • AssertionRunEvent: Evaluation results
  • AssertionSource: Origin metadata
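As a starting point for retrieving a dataset's assertions, the sketch below builds a GraphQL request body in Python. The query shape (the assertions and runEvents fields and their arguments) reflects the author's understanding of the schema and should be checked against the GraphQL schema of the deployed DataHub version; the build_request helper is hypothetical.

```python
import json

# Assumption: field and argument names below match the deployed GraphQL schema.
QUERY = """
query datasetAssertions($urn: String!) {
  dataset(urn: $urn) {
    assertions(start: 0, count: 10) {
      total
      assertions {
        urn
        info { type description }
        runEvents(status: COMPLETE, limit: 1) {
          runEvents { timestampMillis result { type } }
        }
      }
    }
  }
}
"""


def build_request(dataset_urn: str) -> str:
    """Serialize a GraphQL request body for the given dataset URN."""
    return json.dumps({"query": QUERY, "variables": {"urn": dataset_urn}})


body = build_request(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,public.transactions,PROD)"
)
print(body[:60])
```

The resulting body would be sent as an HTTP POST with Content-Type: application/json and an access token to the instance's GraphQL endpoint.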

Integration with dbt

DataHub's dbt integration automatically converts dbt tests into assertions:

  • Schema Tests: Mapped to field assertions (not_null, unique, accepted_values, relationships)
  • Data Tests: Mapped to SQL assertions
  • Test Metadata: Test severity, tags, and descriptions are preserved

Integration with Great Expectations

The Great Expectations integration maps expectations to assertion types:

  • Column expectations → Field assertions
  • Table expectations → Volume or schema assertions
  • Custom expectations → Custom assertions

Each expectation suite becomes a collection of assertions in DataHub.

Integration with Snowflake Data Quality

Snowflake DMF (Data Metric Functions) rules are ingested as assertions:

  • Row count rules → Volume assertions
  • Uniqueness rules → Field metric assertions
  • Freshness rules → Freshness assertions
  • Custom metric rules → SQL assertions

Notable Exceptions

Legacy Dataset Assertion Type

The DATASET assertion type is a legacy format that predates the more specific field, volume, freshness, and schema assertion types. It uses DatasetAssertionInfo with a generic structure. New integrations should use the more specific assertion types (FIELD, VOLUME, FRESHNESS, DATA_SCHEMA, SQL) as they provide better type safety and UI rendering.

Assertion Results vs. Assertion Metrics

While assertions track pass/fail status, DataHub also supports more detailed metrics through the AssertionResult object:

  • actualAggValue: The actual value observed (for numeric assertions)
  • externalUrl: Link to detailed results in the source system
  • nativeResults: Platform-specific result details

This enables richer debugging and understanding of why assertions fail.

Assertion Scheduling

DataHub tracks when assertions run through assertionRunEvent timeseries data, but does not directly schedule assertion evaluations. Scheduling is handled by:

  • Native Assertions: DataHub Cloud's built-in scheduler
  • External Assertions: The source platform's scheduler (dbt, Airflow, etc.)
  • On-Demand: Manual or API-triggered evaluations

DataHub provides monitoring and alerting based on the assertion run events, regardless of the scheduling mechanism.

Assertion vs. Test Results

DataHub has two related concepts:

  • Assertions: First-class entities that define data quality rules
  • Test Results: A simpler aspect that can be attached to datasets

Test results are lightweight pass/fail indicators without the full expressiveness of assertions. Use assertions for production data quality monitoring and test results for simple ingestion-time validation.

Technical Reference

For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.