Version: Next

Tag

Tags are one of the core metadata entities in DataHub, providing a flexible mechanism for classification, categorization, and organization of data assets. They represent labels that can be applied to entities such as datasets, dashboards, charts, and more, enabling users to quickly identify, filter, and group related assets across the data ecosystem.

Identity

Tags are identified by a single piece of information:

  • The tag name: A unique string identifier that serves as both the technical key and the human-readable reference for the tag. The name should be simple and descriptive, and it typically follows lowercase naming conventions (e.g., pii, deprecated, quarterly).

An example of a tag identifier is urn:li:tag:pii.

The URN structure is straightforward:

urn:li:tag:<tag_name>

Where <tag_name> is the unique identifier for the tag. Unlike many other DataHub entities, tags do not require platform qualifiers or environment specifications, making them universally applicable across all data assets.
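As a minimal sketch of the URN format above, the helpers below build and parse tag URNs with plain string handling. The names build_tag_urn and parse_tag_urn are hypothetical and exist only for illustration; the Python SDK's make_tag_urn, used in the code examples on this page, is the usual way to produce the URN.

```python
TAG_URN_PREFIX = "urn:li:tag:"


def build_tag_urn(tag_name: str) -> str:
    """Return the URN for a tag name (hypothetical helper)."""
    if not tag_name:
        raise ValueError("tag name must not be empty")
    return f"{TAG_URN_PREFIX}{tag_name}"


def parse_tag_urn(urn: str) -> str:
    """Return the tag name embedded in a tag URN (hypothetical helper)."""
    if not urn.startswith(TAG_URN_PREFIX) or len(urn) == len(TAG_URN_PREFIX):
        raise ValueError(f"not a tag URN: {urn!r}")
    return urn[len(TAG_URN_PREFIX):]


print(build_tag_urn("pii"))             # urn:li:tag:pii
print(parse_tag_urn("urn:li:tag:pii"))  # pii
```

Note that no platform or environment appears anywhere in the string, which is what lets the same tag URN be attached to any asset.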

Important Capabilities

Tag Properties

Tags support several properties that enhance their usability and appearance in DataHub:

  • Display Name: A human-friendly name that may differ from the technical identifier. For example, a tag with name pii might have display name "Personally Identifiable Information".
  • Description: Detailed documentation explaining what the tag represents, when it should be used, and any organizational policies related to it.
  • Color: A hex color code (e.g., #FF0000) that allows for visual distinction in the UI, making it easier to spot tagged assets at a glance.

These properties are stored in the tagProperties aspect and can be set when creating a tag or updated later.
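A small sketch of checking the color before it is written may help here. It assumes the six-digit #RRGGBB form shown in the example above; whether other forms are accepted is not stated on this page. The helper names are hypothetical, and the returned dict only mirrors the three tagProperties fields (name, description, colorHex) rather than calling the SDK.

```python
import re

# Assumption: colors use the six-digit "#RRGGBB" form, as in "#FF0000".
_HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def is_valid_hex_color(value: str) -> bool:
    """Return True if value looks like a six-digit hex color."""
    return _HEX_COLOR.fullmatch(value) is not None


def build_tag_properties(name: str, description: str, color_hex: str) -> dict:
    """Return a dict with the tagProperties field names (hypothetical helper)."""
    if not is_valid_hex_color(color_hex):
        raise ValueError(f"invalid hex color: {color_hex!r}")
    return {"name": name, "description": description, "colorHex": color_hex}


props = build_tag_properties(
    "Personally Identifiable Information",
    "Asset contains PII and must follow data privacy rules.",
    "#FF0000",
)
print(props["colorHex"])  # #FF0000
```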

Applying Tags to Entities

Tags are applied to other entities through the globalTags aspect. Almost all core DataHub entities support tagging, including:

  • Datasets: Tables, views, streams, and other data collections
  • Dashboards: BI dashboards and reporting interfaces
  • Charts: Individual visualizations and reports
  • Data Jobs: ETL jobs, transformation pipelines
  • Data Flows: Complete data pipelines and workflows
  • ML Models: Machine learning models and deployments
  • Containers: Databases, schemas, and other organizational structures
  • Glossary Terms: Business terminology and concepts

Tags can be applied at multiple levels:

  1. Entity-level: Applied to the entire asset (e.g., tagging a whole dataset as sensitive)
  2. Field-level: Applied to specific columns or fields within datasets (e.g., tagging only the email column as pii)
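To make the two levels concrete, the sketch below builds the aspect payloads as plain dictionaries. The shapes follow the aspect paths listed in the Relationships section of this page (globalTags.tags for the entity level, editableSchemaMetadata.editableSchemaFieldInfo.globalTags.tags for the field level). The helper names are hypothetical, and nothing is emitted to DataHub; this only illustrates where the tag reference sits in each case.

```python
from typing import List


def entity_tag_payload(tag_urns: List[str]) -> dict:
    """Shape of a globalTags aspect: tags apply to the whole asset."""
    return {"tags": [{"tag": urn} for urn in tag_urns]}


def field_tag_payload(field_path: str, tag_urns: List[str]) -> dict:
    """Shape of an editableSchemaMetadata aspect: tags apply to one field."""
    return {
        "editableSchemaFieldInfo": [
            {
                "fieldPath": field_path,
                "globalTags": entity_tag_payload(tag_urns),
            }
        ]
    }


# Tag the whole dataset as sensitive, but only the email column as pii.
dataset_level = entity_tag_payload(["urn:li:tag:sensitive"])
field_level = field_tag_payload("email", ["urn:li:tag:pii"])
print(dataset_level)
print(field_level)
```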

Tag vs. Glossary Terms

While both tags and glossary terms provide classification capabilities, they serve different purposes:

  • Tags are lightweight, informal labels for quick categorization. They're ideal for operational concerns like data quality states (needs_review), security classifications (confidential), or project associations (q4_initiative).
  • Glossary Terms are formal business vocabulary with rich metadata, relationships, and governance. They're best for business concepts like "Customer", "Revenue", or "Product SKU".

Read this blog for a detailed comparison.

Ownership

Like other core entities, tags support the ownership aspect. This allows organizations to designate who is responsible for maintaining tag definitions and ensuring consistent usage. Tag owners can be users or groups with various ownership types (e.g., DATAOWNER, STEWARD).

Deprecation and Status

Tags can be marked as deprecated through the deprecation aspect, signaling that they should no longer be used. The status aspect allows tags to be soft-deleted while maintaining historical references.
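The sketch below shows what these two aspects might carry, using the field names from the deprecation and status tables in the Technical Reference Guide. It builds plain dictionaries only. The helper names are hypothetical, and treating decommissionTime as epoch milliseconds is an assumption, since the table lists it only as a long.

```python
from datetime import datetime, timezone
from typing import Optional


def deprecation_payload(
    note: str, actor_urn: str, decommission_at: Optional[datetime] = None
) -> dict:
    """Dict with the deprecation aspect's field names (hypothetical helper)."""
    payload = {"deprecated": True, "note": note, "actor": actor_urn}
    if decommission_at is not None:
        # Assumption: the long value is epoch milliseconds.
        payload["decommissionTime"] = int(decommission_at.timestamp() * 1000)
    return payload


def soft_delete_payload() -> dict:
    """Dict with the status aspect's single field, marking a soft delete."""
    return {"removed": True}


dep = deprecation_payload(
    note="Superseded by the confidential tag.",
    actor_urn="urn:li:corpuser:data_steward",
    decommission_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
print(dep)
print(soft_delete_payload())
```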

Code Examples

Creating a Tag

Python SDK: Create a basic tag
# metadata-ingestion/examples/library/tag_create_basic.py
import logging
import os

from datahub.emitter.mce_builder import make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import TagPropertiesClass

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Create a tag URN
tag_urn = make_tag_urn("pii")

# Define tag properties
tag_properties = TagPropertiesClass(
    name="Personally Identifiable Information",
    description="This tag indicates that the asset contains PII data and should be handled according to data privacy regulations.",
    colorHex="#FF0000",
)

# Create the metadata change proposal
event = MetadataChangeProposalWrapper(
    entityUrn=tag_urn,
    aspect=tag_properties,
)

# Emit to DataHub
rest_emitter = DatahubRestEmitter(
    gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
    token=os.getenv("DATAHUB_GMS_TOKEN"),
)
rest_emitter.emit(event)
log.info(f"Created tag {tag_urn}")

Adding Ownership to a Tag

Python SDK: Add an owner to a tag
# metadata-ingestion/examples/library/tag_add_ownership.py
import logging

from datahub.emitter.mce_builder import make_tag_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Create a tag URN
tag_urn = make_tag_urn("data_quality")

# Define ownership
ownership = OwnershipClass(
    owners=[
        OwnerClass(
            owner=make_user_urn("data_steward"),
            type=OwnershipTypeClass.DATAOWNER,
        )
    ]
)

# Create the metadata change proposal
event = MetadataChangeProposalWrapper(
    entityUrn=tag_urn,
    aspect=ownership,
)

# Emit to DataHub
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(event)
log.info(f"Added ownership to tag {tag_urn}")

Applying Tags to Datasets

Python SDK: Apply a tag to a dataset
# metadata-ingestion/examples/library/tag_apply_to_dataset.py
import logging

from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Create URNs
dataset_urn = make_dataset_urn(
    platform="snowflake", name="db.schema.customers", env="PROD"
)
tag_urn = make_tag_urn("pii")

# Define global tags
global_tags = GlobalTagsClass(
    tags=[
        TagAssociationClass(tag=tag_urn),
    ]
)

# Create the metadata change proposal
event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=global_tags,
)

# Emit to DataHub
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(event)
log.info(f"Applied tag {tag_urn} to dataset {dataset_urn}")
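One caveat with the example above: it writes a globalTags aspect that contains only the new tag, and an upsert of an aspect replaces the stored value as a whole, so tags already on the dataset would be lost. A common pattern is to read the current tags, merge, and then write the merged list. The sketch below covers only the merge step as a pure function. The name merge_tag_urns is hypothetical, and fetching the current aspect is left out.

```python
from typing import Iterable, List


def merge_tag_urns(existing: Iterable[str], to_add: Iterable[str]) -> List[str]:
    """Combine tag URNs, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys([*existing, *to_add]))


current = ["urn:li:tag:deprecated"]  # tags already on the dataset (example data)
wanted = ["urn:li:tag:pii", "urn:li:tag:deprecated"]
print(merge_tag_urns(current, wanted))
# ['urn:li:tag:deprecated', 'urn:li:tag:pii']
```

The merged list would then be turned into TagAssociationClass entries and emitted as in the example above.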

Querying Tag Information

The standard REST APIs can be used to retrieve tag metadata and see which entities are tagged.

REST API: Fetch tag entity information
# metadata-ingestion/examples/library/tag_query_rest.py
import logging
from urllib.parse import quote

import requests

from datahub.emitter.mce_builder import make_tag_urn

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Configuration
gms_server = "http://localhost:8080"
tag_urn = make_tag_urn("pii")

# Fetch tag entity
response = requests.get(f"{gms_server}/entities/{quote(tag_urn, safe='')}")

if response.status_code == 200:
    tag_data = response.json()
    log.info(f"Successfully retrieved tag: {tag_urn}")

    # Extract tag properties
    if "aspects" in tag_data and "tagProperties" in tag_data["aspects"]:
        properties = tag_data["aspects"]["tagProperties"]["value"]
        log.info(f"Tag name: {properties.get('name')}")
        log.info(f"Description: {properties.get('description')}")
        log.info(f"Color: {properties.get('colorHex')}")

    # Extract ownership if present
    if "aspects" in tag_data and "ownership" in tag_data["aspects"]:
        ownership = tag_data["aspects"]["ownership"]["value"]
        log.info(f"Number of owners: {len(ownership.get('owners', []))}")
        for owner in ownership.get("owners", []):
            log.info(f"  - Owner: {owner['owner']}, Type: {owner['type']}")
else:
    log.error(f"Failed to retrieve tag: {response.status_code} - {response.text}")

# Query relationships to find all entities tagged with this tag
relationships_url = (
    f"{gms_server}/relationships"
    f"?direction=INCOMING"
    f"&urn={quote(tag_urn, safe='')}"
    f"&types=TaggedWith"
)

response = requests.get(relationships_url)

if response.status_code == 200:
    relationships = response.json()
    total = relationships.get("total", 0)
    log.info(f"Found {total} entities tagged with this tag")

    for rel in relationships.get("relationships", []):
        log.info(f"  - {rel['entity']} (type: {rel['type']})")
else:
    log.error(
        f"Failed to retrieve relationships: {response.status_code} - {response.text}"
    )

Searching for Tagged Assets

Tags are fully integrated with DataHub's search capabilities, allowing you to find all assets with a specific tag.

Python SDK: Search for assets by tag
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

client = DataHubClient.from_env()

# Find all assets tagged with "pii".
# get_urns may return a lazy iterable, so materialize it before calling len().
results = list(client.search.get_urns(filter=F.tag("urn:li:tag:pii")))

print(f"Found {len(results)} assets tagged with 'pii'")
for urn in results:
    print(f"  - {urn}")

Integration Points

Relationship with Other Entities

Tagging creates a TaggedWith relationship from the tagged entity to the tag entity. Because the relationship can be traversed in either direction, it enables:

  • Forward navigation: From a dataset, see all its tags
  • Reverse navigation: From a tag, see all entities using it
  • Impact analysis: Understand the scope of a tag before deprecating it
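For the impact-analysis case, a rough sketch is to count tagged entities by type before deprecating a tag. The function below assumes the response shape used in the REST example on this page (a relationships list whose items carry an entity URN) and reads the entity type from the third segment of the URN. The helper names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict


def entity_type_of(urn: str) -> str:
    """Return the entity type segment of a URN such as urn:li:dataset:(...)."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"unexpected URN format: {urn!r}")
    return parts[2]


def summarize_tag_usage(relationships_response: dict) -> Dict[str, int]:
    """Count tagged entities per entity type."""
    rels = relationships_response.get("relationships", [])
    return dict(Counter(entity_type_of(rel["entity"]) for rel in rels))


sample = {  # made-up response for illustration
    "total": 3,
    "relationships": [
        {"type": "TaggedWith", "entity": "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.customers,PROD)"},
        {"type": "TaggedWith", "entity": "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.orders,PROD)"},
        {"type": "TaggedWith", "entity": "urn:li:dashboard:(looker,sales)"},
    ],
}
print(summarize_tag_usage(sample))  # {'dataset': 2, 'dashboard': 1}
```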

GraphQL API Support

Tags are fully supported in DataHub's GraphQL API, with dedicated resolvers for:

  • Creating tags: CreateTagResolver allows programmatic tag creation with authorization checks
  • Updating tags: SetTagColorResolver and update operations for tag properties
  • Deleting tags: DeleteTagResolver for removing obsolete tags
  • Adding tags to entities: AddTagResolver, AddTagsResolver, and batch operations
  • Removing tags from entities: RemoveTagResolver and batch removal operations

These resolvers enforce authorization policies, ensuring only users with appropriate privileges (CREATE_TAG, MANAGE_TAGS, or EDIT_ENTITY) can modify tags and tag assignments.
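As an illustration of what a client might send to these resolvers, the sketch below builds GraphQL request bodies without sending them. The mutation names (createTag, addTag) and input type names (CreateTagInput, TagAssociationInput) are assumptions based on DataHub's public GraphQL schema and should be checked against the schema of your deployment. The helper graphql_request_body is hypothetical.

```python
import json

# Assumption: mutation and input type names; verify against your GraphQL schema.
CREATE_TAG_MUTATION = """
mutation createTag($input: CreateTagInput!) {
  createTag(input: $input)
}
"""

ADD_TAG_MUTATION = """
mutation addTag($input: TagAssociationInput!) {
  addTag(input: $input)
}
"""


def graphql_request_body(query: str, variables: dict) -> str:
    """Serialize a GraphQL query and its variables as a JSON request body."""
    return json.dumps({"query": query, "variables": variables})


create_body = graphql_request_body(
    CREATE_TAG_MUTATION,
    {"input": {"id": "pii", "name": "Personally Identifiable Information"}},
)
add_body = graphql_request_body(
    ADD_TAG_MUTATION,
    {
        "input": {
            "tagUrn": "urn:li:tag:pii",
            "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.customers,PROD)",
        }
    },
)
print(create_body)
print(add_body)
```

The bodies would be posted with an access token to the GraphQL endpoint; the caller needs the privileges listed above for the request to be authorized.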

Search and Discovery

Tags are indexed for search with the following capabilities:

  • Full-text search: Tag names and descriptions are searchable
  • Autocomplete: Tag names support autocomplete for easy selection
  • Filtering: Assets can be filtered by tag in all search interfaces
  • Faceting: Tags appear as filter options in search results

Notable Exceptions

Tag Naming Conventions

While DataHub doesn't enforce strict naming conventions, consider these best practices:

  • Use lowercase: Tag names are case-sensitive, so pii and PII would be two distinct tags; sticking to lowercase avoids accidental duplicates
  • Use underscores or hyphens: For multi-word tags (data_quality or data-quality)
  • Keep it concise: Short names are easier to read and apply
  • Avoid special characters: Stick to alphanumeric characters, underscores, and hyphens
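These conventions can be applied mechanically before a tag is created. The function below is a minimal sketch of one possible normalization. The name normalize_tag_name and the exact rules (whitespace becomes an underscore, anything outside a-z, 0-9, underscore, and hyphen is dropped) are illustrative choices, not something DataHub enforces.

```python
import re


def normalize_tag_name(raw: str) -> str:
    """Lowercase a candidate tag name and reduce it to [a-z0-9_-]."""
    name = raw.strip().lower()
    name = re.sub(r"\s+", "_", name)         # whitespace runs become underscores
    name = re.sub(r"[^a-z0-9_-]", "", name)  # drop remaining special characters
    if not name:
        raise ValueError(f"nothing left after normalizing {raw!r}")
    return name


print(normalize_tag_name("PII"))               # pii
print(normalize_tag_name("  Data Quality! "))  # data_quality
print(normalize_tag_name("q4-initiative"))     # q4-initiative
```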

Tag Proliferation

Organizations should establish governance around tag creation to avoid "tag sprawl":

  • Define a core set: Start with 10-20 essential tags
  • Document usage: Maintain clear descriptions for when each tag should be used
  • Regular audits: Periodically review and consolidate similar or unused tags
  • Ownership model: Assign tag owners who can approve new tags or changes
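A periodic audit along these lines can start from two simple checks: tags that differ only by case or separator, and tags that nothing uses. The sketch below works on plain Python data. The helper names are hypothetical, and obtaining the tag list and usage counts (for example from search or the relationships API) is left out.

```python
from collections import defaultdict
from typing import Dict, List


def find_similar_tags(tag_names: List[str]) -> Dict[str, List[str]]:
    """Group tag names that match after ignoring case, hyphens, and underscores."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in tag_names:
        key = name.lower().replace("-", "").replace("_", "")
        groups[key].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


def find_unused_tags(usage_counts: Dict[str, int]) -> List[str]:
    """Return tag names with zero tagged entities, sorted alphabetically."""
    return sorted(name for name, count in usage_counts.items() if count == 0)


tags = ["pii", "PII", "data_quality", "data-quality", "sandbox"]
print(find_similar_tags(tags))
# {'pii': ['pii', 'PII'], 'dataquality': ['data_quality', 'data-quality']}
print(find_unused_tags({"pii": 42, "sandbox": 0, "legacy": 0}))
# ['legacy', 'sandbox']
```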

System vs. User Tags

While DataHub doesn't formally distinguish between system and user tags, organizations often establish conventions:

  • System tags: Created by automated processes (e.g., ingestion_error, schema_drift)
  • User tags: Created manually by data practitioners (e.g., important, sandbox)

Consider using prefixes or namespacing to distinguish these categories if needed.
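One way to realize such a convention is a reserved prefix for machine-created tags. The prefix sys_ and the helper names below are purely hypothetical; DataHub itself attaches no meaning to them.

```python
from typing import List, Tuple

SYSTEM_PREFIX = "sys_"  # hypothetical convention, not a DataHub feature


def is_system_tag(tag_name: str) -> bool:
    """Return True if the tag name follows the system-tag prefix convention."""
    return tag_name.startswith(SYSTEM_PREFIX)


def partition_tags(tag_names: List[str]) -> Tuple[List[str], List[str]]:
    """Split tag names into (system tags, user tags), preserving order."""
    system = [name for name in tag_names if is_system_tag(name)]
    user = [name for name in tag_names if not is_system_tag(name)]
    return system, user


names = ["sys_schema_drift", "important", "sys_ingestion_error", "sandbox"]
print(partition_tags(names))
# (['sys_schema_drift', 'sys_ingestion_error'], ['important', 'sandbox'])
```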

Tags and Access Control

Tags themselves don't grant or restrict access to data. However, they can be used in conjunction with DataHub policies to:

  • Control who can view certain tagged assets
  • Restrict who can apply sensitive tags
  • Trigger workflows based on tag presence (e.g., auto-generating documentation for assets tagged requires_docs)

Tags are metadata about your data, not a security mechanism. Use DataHub's authorization features for access control.

Technical Reference Guide

The sections above provide an overview of how to use this entity. The following sections provide detailed technical information about how metadata is stored and represented in DataHub.

Aspects are the individual pieces of metadata that can be attached to an entity. Each aspect contains specific information (like ownership, tags, or properties) and is stored as a separate record, allowing for flexible and incremental metadata updates.

Relationships show how this entity connects to other entities in the metadata graph. These connections are derived from the fields within each aspect and form the foundation of DataHub's knowledge graph.

Reading the Field Tables

Each aspect's field table includes an Annotations column that provides additional metadata about how fields are used:

  • ⚠️ Deprecated: This field is deprecated and may be removed in a future version. Check the description for the recommended alternative
  • Searchable: This field is indexed and can be searched in DataHub's search interface
  • Searchable (fieldname): When the field name in parentheses is shown, it indicates the field is indexed under a different name in the search index. For example, dashboardTool is indexed as tool
  • → RelationshipName: This field creates a relationship to another entity. The arrow indicates this field contains a reference (URN) to another entity, and the name indicates the type of relationship (e.g., → Contains, → OwnedBy)

Fields with complex types (like Edge, AuditStamp) link to their definitions in the Common Types section below.

Aspects

tagKey

Key for a Tag

| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| name | string | | The tag name, which serves as a unique id | Searchable (id) |

ownership

Ownership information of an entity.

| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| owners | Owner[] | | List of owners of the entity. | |
| ownerTypes | map | | Ownership type to Owners map, populated via mutation hook. | Searchable |
| lastModified | AuditStamp | | Audit stamp containing who last modified the record and when. A value of 0 in the time field indi... | |

tagProperties

Properties associated with a Tag

| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| name | string | | Display name of the tag | Searchable |
| description | string | | Documentation of the tag | Searchable |
| colorHex | string | | The color associated with the Tag in Hex. For example #FFFFFF. | |

status

The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc. This aspect is used to represent soft deletes conventionally.

| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| removed | boolean | | Whether the entity has been removed (soft-deleted). | Searchable |

deprecation

Deprecation status of an entity

| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| deprecated | boolean | | Whether the entity is deprecated. | Searchable |
| decommissionTime | long | | The time the user plans to decommission this entity. | |
| note | string | | Additional information about the entity deprecation plan, such as the wiki, doc, RB. | |
| actor | string | | The user URN which will be credited for modifying this deprecation content. | |
| replacement | string | | | |

testResults

Information about a Test Result

| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| failing | TestResult[] | | Results that are failing | Searchable, → IsFailing |
| passing | TestResult[] | | Results that are passing | Searchable, → IsPassing |

Common Types

These types are used across multiple aspects in this entity.

AuditStamp

Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage.

Fields:

  • time (long): When did the resource/association/sub-resource move into the specific lifecyc...
  • actor (string): The entity (e.g. a member URN) which will be credited for moving the resource...
  • impersonator (string?): The entity (e.g. a service URN) which performs the change on behalf of the Ac...
  • message (string?): Additional context around how DataHub was informed of the particular change. ...

TestResult

Information about a Test Result

Fields:

  • test (string): The urn of the test
  • type (TestResultType): The type of the result
  • testDefinitionMd5 (string?): The md5 of the test definition that was used to compute this result. See Test...
  • lastComputed (AuditStamp?): The audit stamp of when the result was computed, including the actor who comp...

Relationships

Outgoing

These are the relationships stored in this entity's aspects

  • OwnedBy

    • Corpuser via ownership.owners.owner
    • CorpGroup via ownership.owners.owner
  • ownershipType

    • OwnershipType via ownership.owners.typeUrn
  • IsFailing

    • Test via testResults.failing
  • IsPassing

    • Test via testResults.passing

Incoming

These are the relationships stored in other entities' aspects

  • SchemaFieldTaggedWith

    • Dataset via schemaMetadata.fields.globalTags
    • Chart via inputFields.fields.schemaField.globalTags
    • Dashboard via inputFields.fields.schemaField.globalTags
  • TaggedWith

    • Dataset via schemaMetadata.fields.globalTags.tags
    • Dataset via editableSchemaMetadata.editableSchemaFieldInfo.globalTags.tags
    • Dataset via globalTags.tags
    • DataJob via globalTags.tags
    • DataFlow via globalTags.tags
    • Chart via globalTags.tags
    • Chart via inputFields.fields.schemaField.globalTags.tags
    • Dashboard via globalTags.tags
    • Dashboard via inputFields.fields.schemaField.globalTags.tags
    • Notebook via globalTags.tags
    • Corpuser via globalTags.tags
    • CorpGroup via globalTags.tags
    • Container via globalTags.tags
  • EditableSchemaFieldTaggedWith

    • Dataset via editableSchemaMetadata.editableSchemaFieldInfo.globalTags

Global Metadata Model

[Figure: Global Graph]