GlossaryTerm
A GlossaryTerm represents a standardized business definition or vocabulary term that can be associated with data assets across your organization. GlossaryTerms are the fundamental building blocks of DataHub's Business Glossary feature, enabling teams to establish and maintain a shared vocabulary for describing data concepts.
In practice, GlossaryTerms allow you to:
- Define business terminology with clear, authoritative definitions
- Create relationships between related business concepts (inheritance, containment, etc.)
- Tag data assets (datasets, dashboards, charts, etc.) with standardized business terms
- Establish governance and ownership over business vocabulary
- Link to external resources and documentation
For example, a GlossaryTerm might define "Customer Lifetime Value (CLV)" with a precise business definition, relate it to other terms like "Revenue" and "Customer", and be applied to specific dataset columns that store CLV calculations.
Identity
GlossaryTerms are uniquely identified by a single field: their name. This name serves as the persistent identifier for the term throughout its lifecycle.
URN Structure
The URN (Uniform Resource Name) for a GlossaryTerm follows this pattern:
urn:li:glossaryTerm:<term_name>
Where:
<term_name>: A unique string identifier for the term. This can be human-readable (e.g., "CustomerLifetimeValue") or a generated ID (e.g., "clv-001" or a UUID).
Examples
# Simple term name
urn:li:glossaryTerm:Revenue
# Hierarchical naming convention (common pattern)
urn:li:glossaryTerm:Finance.Revenue
urn:li:glossaryTerm:Classification.PII
urn:li:glossaryTerm:Classification.Confidential
# UUID-based identifier
urn:li:glossaryTerm:41516e31-0acb-fd90-76ff-fc2c98d2d1a3
# Descriptive identifier
urn:li:glossaryTerm:CustomerLifetimeValue
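The URN pattern above is simple enough to build and take apart with plain string operations. The sketch below uses only the standard library. The helper names `build_term_urn` and `parse_term_urn` are hypothetical and exist only for illustration; in SDK code, `make_term_urn` (used in the code examples on this page) covers the building step.

```python
TERM_URN_PREFIX = "urn:li:glossaryTerm:"


def build_term_urn(term_name: str) -> str:
    """Prepend the glossaryTerm URN prefix to a term name (hypothetical helper)."""
    if not term_name:
        raise ValueError("term name must be a non-empty string")
    return TERM_URN_PREFIX + term_name


def parse_term_urn(urn: str) -> str:
    """Return the term-name portion of a glossaryTerm URN (hypothetical helper)."""
    if not urn.startswith(TERM_URN_PREFIX):
        raise ValueError(f"not a glossaryTerm URN: {urn!r}")
    return urn[len(TERM_URN_PREFIX):]


print(build_term_urn("Finance.Revenue"))
print(parse_term_urn("urn:li:glossaryTerm:41516e31-0acb-fd90-76ff-fc2c98d2d1a3"))
```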
Best Practices for Term Names
- Use hierarchical notation: Prefix terms with their category (e.g., Classification.PII, Finance.Revenue) to indicate structure even though the name is flat.
- Be consistent: Choose a naming convention (camelCase, dot notation, etc.) and apply it uniformly.
- Keep it permanent: The term name is the identifier and should not change. Use the name field in glossaryTermInfo for the display name.
- Consider organization: While the URN is flat, you can use glossaryNodes (term groups) to create hierarchical organization in the UI.
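Because the dot in a name such as Classification.PII is only a convention, any code that wants the category has to split the string itself. A minimal sketch, assuming a dot-separated convention; `split_term_name` is a hypothetical helper, not part of DataHub.

```python
from typing import Optional, Tuple


def split_term_name(term_name: str) -> Tuple[Optional[str], str]:
    """Split 'Category.Leaf' into (category, leaf); category is None for flat names."""
    category, sep, leaf = term_name.rpartition(".")
    if not sep:
        # No dot present: the name carries no category prefix.
        return None, term_name
    return category, leaf


for name in ["Classification.PII", "Finance.Revenue", "CustomerLifetimeValue"]:
    print(name, "->", split_term_name(name))
```

Splitting on the last dot keeps multi-level prefixes together, so a name like A.B.C yields the category A.B and the leaf C.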
Important Capabilities
Core Business Definition (glossaryTermInfo)
The glossaryTermInfo aspect contains the essential business information about a term:
- definition (required): The authoritative business definition of the term. This should be clear, concise, and provide sufficient context for anyone to understand the term's meaning.
- name: The display name shown in the UI. This can be more human-friendly than the URN identifier (e.g., "Customer Lifetime Value" vs. "CustomerLifetimeValue").
- parentNode: A reference to a GlossaryNode (term group) that acts as a folder for organizing terms hierarchically.
- termSource: Indicates whether the term is "INTERNAL" (defined within your organization) or "EXTERNAL" (from an external standard like FIBO).
- sourceRef: A reference identifier for external term sources (e.g., "FIBO" for Financial Industry Business Ontology).
- sourceUrl: A URL pointing to the external definition of the term.
- customProperties: Key-value pairs for additional metadata specific to your organization.
Example:
{
  "name": "Customer Lifetime Value",
  "definition": "The total revenue a business can expect from a single customer account throughout the business relationship.",
  "termSource": "INTERNAL",
  "parentNode": "urn:li:glossaryNode:Finance"
}
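To make the field list above concrete, here is a minimal sketch that models the aspect as a plain Python dataclass and checks the two constraints the text states: a definition is required, and termSource is either "INTERNAL" or "EXTERNAL". The `TermInfo` class is a hypothetical stand-in for illustration, not the SDK's `GlossaryTermInfoClass`.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional

ALLOWED_TERM_SOURCES = {"INTERNAL", "EXTERNAL"}


@dataclass
class TermInfo:
    """Hypothetical stand-in mirroring the glossaryTermInfo fields listed above."""

    definition: str
    termSource: str
    name: Optional[str] = None
    parentNode: Optional[str] = None
    sourceRef: Optional[str] = None
    sourceUrl: Optional[str] = None
    customProperties: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject empty or whitespace-only definitions.
        if not self.definition.strip():
            raise ValueError("definition is required")
        if self.termSource not in ALLOWED_TERM_SOURCES:
            raise ValueError(
                f"termSource must be one of {sorted(ALLOWED_TERM_SOURCES)}"
            )


info = TermInfo(
    name="Customer Lifetime Value",
    definition="The total revenue a business can expect from a single customer account.",
    termSource="INTERNAL",
    parentNode="urn:li:glossaryNode:Finance",
)
print(asdict(info))
```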
Term Relationships (glossaryRelatedTerms)
GlossaryTerms support several relationship types that help model the semantic connections between business concepts:
1. IsA Relationships (Inheritance)
Indicates that one term is a specialized type of another term. This creates an "Is-A" hierarchy where more specific terms inherit the characteristics of broader terms.
Use case: Email IsA PersonalInformation, SocialSecurityNumber IsA PersonalInformation
2. HasA Relationships (Containment)
Indicates that one term contains or is composed of another term. This creates a "Has-A" relationship where a complex concept consists of simpler parts.
Use case: Address HasA ZipCode, Address HasA Street, Address HasA City
3. Values Relationships
Defines the allowed values for an enumerated term. Useful for controlled vocabularies where a term has a fixed set of valid values.
Use case: ColorEnum HasValues Red, Green, Blue
4. RelatedTo Relationships
General-purpose relationship for terms that are semantically related but don't fit the other categories.
Use case: Revenue RelatedTo Profit, Customer RelatedTo Account
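One practical consequence of IsA is that it chains: if Email IsA PersonalInformation and PersonalInformation IsA some broader term, a governance check may want to treat Email as an instance of both. The sketch below walks such a chain over a plain dictionary. The data, the extra `SensitiveData` term, and the `ancestors` helper are all hypothetical; whether and how DataHub itself resolves transitive relationships is not covered here.

```python
from typing import Dict, List, Set

# Hypothetical IsA edges: term -> broader terms it is a kind of.
IS_A: Dict[str, List[str]] = {
    "Email": ["PersonalInformation"],
    "SocialSecurityNumber": ["PersonalInformation"],
    "PersonalInformation": ["SensitiveData"],
}


def ancestors(term: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Collect every broader term reachable from `term` through IsA edges."""
    seen: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and repeated paths
        seen.add(current)
        stack.extend(is_a.get(current, []))
    return seen


print(sorted(ancestors("Email", IS_A)))
```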
Hierarchical Organization
GlossaryTerms can be organized hierarchically through GlossaryNodes (term groups). The parentNode field in glossaryTermInfo establishes this relationship:
GlossaryNode: Classification
├── GlossaryTerm: Sensitive
├── GlossaryTerm: Confidential
└── GlossaryTerm: HighlyConfidential
GlossaryNode: PersonalInformation
├── GlossaryTerm: Email
├── GlossaryTerm: Address
└── GlossaryTerm: PhoneNumber
This hierarchy is visible in the DataHub UI and helps users navigate large glossaries.
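Since each term points at its parent through parentNode, the tree above can be rebuilt from a flat child-to-parent mapping. A minimal sketch with hypothetical in-memory data (plain names rather than URNs); `group_by_node` is an illustrative helper, not a DataHub API.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical flat records: term name -> parent node name (None means root level).
PARENT_NODE: Dict[str, Optional[str]] = {
    "Sensitive": "Classification",
    "Confidential": "Classification",
    "HighlyConfidential": "Classification",
    "Email": "PersonalInformation",
    "Address": "PersonalInformation",
    "PhoneNumber": "PersonalInformation",
}


def group_by_node(parent_of: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Invert the child -> parent mapping into node -> sorted list of terms."""
    children: Dict[str, List[str]] = defaultdict(list)
    for term, node in parent_of.items():
        if node is not None:
            children[node].append(term)
    return {node: sorted(terms) for node, terms in children.items()}


for node, terms in group_by_node(PARENT_NODE).items():
    print(f"GlossaryNode: {node}")
    for term in terms:
        print(f"  GlossaryTerm: {term}")
```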
Applying Terms to Data Assets
GlossaryTerms become valuable when applied to actual data assets. Terms can be attached to:
- Datasets (tables, views, files)
- Dataset fields (columns)
- Dashboards
- Charts
- Data Jobs
- Containers
- And many other entity types
When a term is applied to a data asset, it creates a TermedWith relationship, which enables:
- Discovery: Find all assets tagged with a specific business concept
- Governance: Track which assets contain sensitive data types
- Documentation: Provide business context for technical assets
- Compliance: Identify datasets subject to regulatory requirements
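The discovery benefit amounts to a reverse lookup from term to assets. The toy sketch below shows the idea with an in-memory inverted index. The asset identifiers are made-up labels rather than real URNs, and `index_by_term` is a hypothetical helper that says nothing about how DataHub stores or indexes these associations.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical asset label -> glossary term URNs applied to it.
ASSET_TERMS: Dict[str, List[str]] = {
    "dataset:sales": ["urn:li:glossaryTerm:Classification.PII"],
    "dataset:customers": [
        "urn:li:glossaryTerm:Classification.PII",
        "urn:li:glossaryTerm:CustomerLifetimeValue",
    ],
    "dashboard:revenue": ["urn:li:glossaryTerm:Finance.Revenue"],
}


def index_by_term(asset_terms: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build term URN -> set of assets carrying that term."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for asset, terms in asset_terms.items():
        for term in terms:
            index[term].add(asset)
    return dict(index)


pii_assets = index_by_term(ASSET_TERMS)["urn:li:glossaryTerm:Classification.PII"]
print(sorted(pii_assets))
```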
Code Examples
Creating a GlossaryTerm
Python SDK: Create a basic GlossaryTerm
import os
from datahub.emitter.mce_builder import make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlossaryTermInfoClass
# Get DataHub connection details from environment
gms_server = os.getenv("DATAHUB_GMS_URL", "http://localhost:8080")
token = os.getenv("DATAHUB_GMS_TOKEN")
# Create a term URN - the unique identifier for the glossary term
term_urn = make_term_urn("CustomerLifetimeValue")
# Define the term's core information
term_info = GlossaryTermInfoClass(
    name="Customer Lifetime Value",
    definition="The total revenue a business can expect from a single customer account throughout the business relationship. This metric helps prioritize customer retention efforts and marketing spend.",
    termSource="INTERNAL",
)
# Create a metadata change proposal
event = MetadataChangeProposalWrapper(
    entityUrn=term_urn,
    aspect=term_info,
)
# Emit the metadata
rest_emitter = DatahubRestEmitter(gms_server=gms_server, token=token)
rest_emitter.emit(event)
print(f"Created glossary term: {term_urn}")
Python SDK: Create a GlossaryTerm with full metadata
import os
from datahub.emitter.mce_builder import make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermInfoClass,
    InstitutionalMemoryClass,
    InstitutionalMemoryMetadataClass,
    OwnerClass,
    OwnershipClass,
    OwnershipSourceClass,
    OwnershipSourceTypeClass,
    OwnershipTypeClass,
)
from datahub.metadata.urns import GlossaryNodeUrn
# Create the term URN
term_urn = make_term_urn("Classification.PII")
# Create GlossaryTermInfo with full metadata
term_info = GlossaryTermInfoClass(
    name="Personally Identifiable Information",
    definition="Information that can be used to identify, contact, or locate a single person, or to identify an individual in context. Examples include name, email address, phone number, and social security number.",
    termSource="INTERNAL",
    # Link to a parent term group (glossary node)
    parentNode=str(GlossaryNodeUrn("Classification")),
    # Custom properties for additional metadata
    customProperties={
        "sensitivity_level": "HIGH",
        "data_retention_period": "7_years",
        "regulatory_framework": "GDPR,CCPA",
    },
)
# Add ownership information
ownership = OwnershipClass(
    owners=[
        OwnerClass(
            owner="urn:li:corpuser:datahub",
            type=OwnershipTypeClass.DATAOWNER,
            source=OwnershipSourceClass(type=OwnershipSourceTypeClass.MANUAL),
        ),
        OwnerClass(
            owner="urn:li:corpGroup:privacy-team",
            type=OwnershipTypeClass.DATAOWNER,
            source=OwnershipSourceClass(type=OwnershipSourceTypeClass.MANUAL),
        ),
    ]
)
# Add links to related documentation
institutional_memory = InstitutionalMemoryClass(
    elements=[
        InstitutionalMemoryMetadataClass(
            url="https://wiki.company.com/privacy/pii-guidelines",
            description="Internal PII Handling Guidelines",
            createStamp=AuditStampClass(time=0, actor="urn:li:corpuser:datahub"),
        ),
        InstitutionalMemoryMetadataClass(
            url="https://gdpr.eu/",
            description="GDPR Official Documentation",
            createStamp=AuditStampClass(time=0, actor="urn:li:corpuser:datahub"),
        ),
    ]
)
# Emit all aspects for the glossary term
# Get DataHub connection details from environment
gms_server = os.getenv("DATAHUB_GMS_URL", "http://localhost:8080")
token = os.getenv("DATAHUB_GMS_TOKEN")
rest_emitter = DatahubRestEmitter(gms_server=gms_server, token=token)
# Emit term info
rest_emitter.emit(MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=term_info))
# Emit ownership
rest_emitter.emit(MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=ownership))
# Emit institutional memory (documentation links)
rest_emitter.emit(
    MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=institutional_memory)
)
print(f"Created glossary term with full metadata: {term_urn}")
Managing Term Relationships
Python SDK: Add relationships between GlossaryTerms
from datahub.emitter.mce_builder import make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlossaryRelatedTermsClass
from datahub.metadata.urns import GlossaryTermUrn
# First, ensure the related terms exist (you would have created these previously)
# For this example, assume we have:
# - Classification.PII (a broad category)
# - Classification.Sensitive (another category)
# - PersonalInformation.Email (a specific term)
# - PersonalInformation.Address (another specific term)
# Create relationships for the Email term
email_term_urn = make_term_urn("PersonalInformation.Email")
# Define relationships
email_relationships = GlossaryRelatedTermsClass(
    # IsA relationship: Email is a type of PII
    # This creates an inheritance hierarchy
    isRelatedTerms=[
        str(GlossaryTermUrn("Classification.PII")),
        str(GlossaryTermUrn("Classification.Sensitive")),
    ],
    # RelatedTo: General semantic relationship
    relatedTerms=[
        str(GlossaryTermUrn("PersonalInformation.PhoneNumber")),
        str(GlossaryTermUrn("PersonalInformation.Contact")),
    ],
)
# Create relationships for the Address term
address_term_urn = make_term_urn("PersonalInformation.Address")
address_relationships = GlossaryRelatedTermsClass(
    # IsA: Address is also a type of PII
    isRelatedTerms=[str(GlossaryTermUrn("Classification.PII"))],
    # HasA: Address contains these components
    hasRelatedTerms=[
        str(GlossaryTermUrn("PersonalInformation.ZipCode")),
        str(GlossaryTermUrn("PersonalInformation.Street")),
        str(GlossaryTermUrn("PersonalInformation.City")),
        str(GlossaryTermUrn("PersonalInformation.Country")),
    ],
)
# Create an enumeration term with fixed values
color_enum_urn = make_term_urn("ColorEnum")
color_enum_relationships = GlossaryRelatedTermsClass(
    # Values: Define the allowed values for this enumeration
    values=[
        str(GlossaryTermUrn("Colors.Red")),
        str(GlossaryTermUrn("Colors.Green")),
        str(GlossaryTermUrn("Colors.Blue")),
        str(GlossaryTermUrn("Colors.Yellow")),
    ]
)
# Emit the relationships
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
# Emit Email term relationships
rest_emitter.emit(
    MetadataChangeProposalWrapper(entityUrn=email_term_urn, aspect=email_relationships)
)
print(f"Added relationships to: {email_term_urn}")
# Emit Address term relationships
rest_emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=address_term_urn, aspect=address_relationships
    )
)
print(f"Added relationships to: {address_term_urn}")
# Emit Color enumeration relationships
rest_emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=color_enum_urn, aspect=color_enum_relationships
    )
)
print(f"Added value relationships to: {color_enum_urn}")
print("\nRelationship types explained:")
print("- isRelatedTerms (IsA): Inheritance relationship - term is a type of another")
print("- hasRelatedTerms (HasA): Containment relationship - term contains other terms")
print("- values: Enumeration values - defines allowed values for the term")
print("- relatedTerms: General semantic relationship between terms")
Applying Terms to Assets
Python SDK: Add a GlossaryTerm to a dataset
from typing import List, Optional, Union
from datahub.sdk import DataHubClient, DatasetUrn, GlossaryTermUrn
def add_terms_to_dataset(
    client: DataHubClient,
    dataset_urn: DatasetUrn,
    term_urns: List[Union[GlossaryTermUrn, str]],
) -> None:
    """
    Add glossary terms to a dataset.

    Args:
        client: DataHub client to use
        dataset_urn: URN of the dataset to update
        term_urns: List of term URNs or term names to add
    """
    dataset = client.entities.get(dataset_urn)
    for term in term_urns:
        if isinstance(term, str):
            resolved_term_urn = client.resolve.term(name=term)
            dataset.add_term(resolved_term_urn)
        else:
            dataset.add_term(term)
    client.entities.update(dataset)


def main(client: Optional[DataHubClient] = None) -> None:
    """
    Main function to add terms to dataset example.

    Args:
        client: Optional DataHub client (for testing). If not provided, creates one from env.
    """
    client = client or DataHubClient.from_env()
    dataset_urn = DatasetUrn(platform="hive", name="realestate_db.sales", env="PROD")
    # Add terms using both URN and name resolution
    add_terms_to_dataset(
        client=client,
        dataset_urn=dataset_urn,
        term_urns=[
            GlossaryTermUrn("Classification.HighlyConfidential"),
            "PII",  # Will be resolved by name
        ],
    )


if __name__ == "__main__":
    main()
Python SDK: Add a GlossaryTerm to a dataset column
from datahub.sdk import DataHubClient, DatasetUrn, GlossaryTermUrn
client = DataHubClient.from_env()
dataset = client.entities.get(
    DatasetUrn(platform="hive", name="realestate_db.sales", env="PROD")
)
dataset["address.zipcode"].add_term(GlossaryTermUrn("Classification.Location"))
client.entities.update(dataset)
Querying GlossaryTerms
REST API: Get a GlossaryTerm by URN
# Fetch a GlossaryTerm entity
curl -X GET 'http://localhost:8080/entities/urn%3Ali%3AglossaryTerm%3ACustomerLifetimeValue' \
-H 'Authorization: Bearer <token>'
# Response includes all aspects:
# - glossaryTermKey (identity)
# - glossaryTermInfo (definition, name, etc.)
# - glossaryRelatedTerms (relationships)
# - ownership (who owns this term)
# - institutionalMemory (links to documentation)
# - etc.
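The URN in the request path above is percent-encoded, colons included. A short sketch of producing that path with the Python standard library; `entity_path` is a hypothetical helper, and the `/entities/` prefix is taken from the curl example rather than from any client library.

```python
from urllib.parse import quote, unquote


def entity_path(urn: str) -> str:
    """Percent-encode the whole URN (including ':') for use as a URL path segment."""
    # safe="" makes quote() escape every reserved character, such as ':' and '/'.
    return "/entities/" + quote(urn, safe="")


path = entity_path("urn:li:glossaryTerm:CustomerLifetimeValue")
print(path)
print(unquote(path))
```

Passing `safe=""` matters for URNs whose names contain slashes or other reserved characters, which `quote` would otherwise leave unescaped.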
REST API: Search for assets tagged with a term
# Find all datasets tagged with a specific term
curl -X POST 'http://localhost:8080/entities?action=search' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{
    "entity": "dataset",
    "input": "*",
    "filter": {
      "or": [
        {
          "and": [
            {
              "field": "glossaryTerms",
              "value": "urn:li:glossaryTerm:Classification.PII",
              "condition": "EQUAL"
            }
          ]
        }
      ]
    },
    "start": 0,
    "count": 10
  }'
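When the term or entity type varies, the request body above is easier to generate than to hand-edit. A minimal sketch that reproduces the same JSON shape with the standard library; `build_term_search_body` is a hypothetical helper, and the field names are copied from the curl example rather than from a client library.

```python
import json
from typing import Any, Dict


def build_term_search_body(
    term_urn: str, entity: str = "dataset", start: int = 0, count: int = 10
) -> Dict[str, Any]:
    """Build the search payload: one OR clause holding one AND criterion on glossaryTerms."""
    criterion = {"field": "glossaryTerms", "value": term_urn, "condition": "EQUAL"}
    return {
        "entity": entity,
        "input": "*",
        "filter": {"or": [{"and": [criterion]}]},
        "start": start,
        "count": count,
    }


body = build_term_search_body("urn:li:glossaryTerm:Classification.PII")
print(json.dumps(body, indent=2))
```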
Python SDK: Query terms applied to a dataset
from datahub.sdk import DataHubClient, DatasetUrn
client = DataHubClient.from_env()
dataset = client.entities.get(
    DatasetUrn(platform="hive", name="realestate_db.sales", env="PROD")
)
print(dataset.terms)
Bulk Operations
YAML Ingestion: Create multiple terms from a Business Glossary file
# business_glossary.yml
version: "1"
source: MyOrganization
owners:
  users:
    - datahub
nodes:
  - name: Classification
    description: Data classification categories
    terms:
      - name: PII
        description: Personally Identifiable Information
      - name: Confidential
        description: Confidential business data
      - name: Public
        description: Publicly available data
  - name: Finance
    description: Financial domain terms
    terms:
      - name: Revenue
        description: Total income from business operations
      - name: Profit
        description: Financial gain after expenses
        related_terms:
          - Finance.Revenue
# Ingest using the DataHub CLI. The -c flag expects an ingestion recipe, so reference
# this file from a recipe whose source type is datahub-business-glossary
# (config.file: business_glossary.yml), then run:
# datahub ingest -c <recipe>.yml
See the Business Glossary Source documentation for the full YAML format specification.
Integration Points
Relationship with GlossaryNode
GlossaryNodes (term groups) provide hierarchical organization for GlossaryTerms. Think of GlossaryNodes as folders and GlossaryTerms as files within those folders.
- A GlossaryTerm can have at most one parent GlossaryNode (specified via parentNode in glossaryTermInfo)
- GlossaryNodes can contain both GlossaryTerms and other GlossaryNodes (creating nested hierarchies)
- Terms at the root level (no parent) appear at the top of the glossary
Application to Data Assets
GlossaryTerms can be applied to most entity types in DataHub through the glossaryTerms aspect:
Supported entities:
- dataset, schemaField (dataset columns)
- dashboard, chart
- dataJob, dataFlow
- mlModel, mlFeature, mlFeatureTable, mlPrimaryKey
- notebook
- container
- dataProduct, application
- erModelRelationship, businessAttribute
When you apply a term to an entity, DataHub creates:
- A glossaryTerms aspect on the target entity containing the term association
- A TermedWith relationship edge in the graph
- A searchable index entry allowing you to find all assets with that term
GraphQL API
The GraphQL API provides rich querying and mutation capabilities for GlossaryTerms:
Queries:
- Fetch term details with related entities
- Browse terms hierarchically
- Search terms by name or definition
- Get all entities tagged with a term
Mutations:
- createGlossaryTerm: Create a new term
- addTerms, addTerm: Apply terms to entities
- removeTerm, batchRemoveTerms: Remove terms from entities
- updateParentNode: Move a term to a different parent group
See the GraphQL API documentation for detailed examples.
Integration with Search and Discovery
GlossaryTerms enhance discoverability in multiple ways:
- Faceted Search: Users can filter search results by glossary terms
- Term Propagation: When a term is applied at the dataset level, it can be inherited by downstream assets
- Related Entities: The term's page shows all assets tagged with that term
- Autocomplete: Terms are suggested as users type in search or when tagging assets
Governance and Access Control
GlossaryTerms support fine-grained access control through DataHub's policy system:
- Manage Direct Glossary Children: Permission to create/edit/delete terms directly under a specific term group
- Manage All Glossary Children: Permission to manage any term within a term group's entire subtree
- Standard entity policies (view, edit, delete) apply to individual terms
See the Business Glossary documentation for details on privileges.
Notable Exceptions
Term Name vs Display Name
The URN identifier (name in glossaryTermKey) is separate from the display name (name in glossaryTermInfo). Best practice:
- URN name: Use a stable, unchanging identifier (e.g., "clv-001", "Classification.PII")
- Display name: Use a human-friendly label that can be updated (e.g., "Customer Lifetime Value", "Personally Identifiable Information")
External Term Sources
When using terms from external standards (FIBO, ISO, industry glossaries):
- Set termSource to "EXTERNAL"
- Populate sourceRef with the standard name (e.g., "FIBO")
- Include sourceUrl linking to the authoritative definition
- Consider using the external standard's identifier as your URN name for consistency
Term Relationships vs Hierarchy
Don't confuse:
- Parent-child hierarchy (via parentNode → GlossaryNode): Organizational structure for browsing
- Semantic relationships (via glossaryRelatedTerms): Meaning connections between concepts
A term can have a parentNode for organization (e.g., term "Email" under node "PersonalInformation") AND semantic relationships (e.g., "Email" IsA "PII", "Email" RelatedTo "Contact").
Schema Metadata on GlossaryTerm
GlossaryTerms support the schemaMetadata aspect, which is rarely used but can be helpful for defining structured attributes on terms themselves. This is an advanced feature for when terms need to carry typed properties beyond simple custom properties.
Deprecation Behavior
When a GlossaryTerm is deprecated (via the deprecation aspect):
- The term remains in the system and its relationships are preserved
- Assets tagged with the term retain those associations
- The UI displays a deprecation warning
- The term may be hidden from autocomplete and suggestions
- Consider creating a new term and migrating assets rather than reusing deprecated term names
Technical Reference
For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.