GlossaryNode
A GlossaryNode represents a hierarchical grouping or category within DataHub's Business Glossary. GlossaryNodes act as folders or containers that organize GlossaryTerms into a logical structure, making it easier to navigate and manage large business glossaries.
In practice, GlossaryNodes allow you to:
- Create hierarchical categories for organizing business terminology
- Build multi-level taxonomies (e.g., Finance > Revenue > Recurring Revenue)
- Establish ownership and governance over specific glossary sections
- Apply metadata consistently across related terms within a category
- Manage permissions at the category level
For example, you might create a GlossaryNode called "Finance" containing terms like "Revenue", "Profit", and "EBITDA", with a nested GlossaryNode "Compliance" underneath containing "SOX", "GDPR", and "CCPA" terms.
Identity
GlossaryNodes are uniquely identified by a single field: their name. This name serves as the persistent identifier for the node throughout its lifecycle.
URN Structure
The URN (Uniform Resource Name) for a GlossaryNode follows this pattern:
urn:li:glossaryNode:<node_name>
Where:
<node_name>: A unique string identifier for the node. This can be human-readable (e.g., "Finance") or a generated ID (e.g., "fin-category-001" or a UUID).
Examples
# Simple node name
urn:li:glossaryNode:Finance
# Hierarchical naming convention (common pattern)
urn:li:glossaryNode:Finance.Revenue
urn:li:glossaryNode:Classification
urn:li:glossaryNode:Classification.DataSensitivity
# UUID-based identifier
urn:li:glossaryNode:41516e31-0acb-fd90-76ff-fc2c98d2d1a3
# Descriptive identifier
urn:li:glossaryNode:PersonalInformation
Best Practices for Node Names
- Use hierarchical notation: Prefix nodes with their parent category (e.g.,
Finance.Revenue,Classification.PII) to indicate structure even though the name is flat. - Be consistent: Choose a naming convention (camelCase, dot notation, etc.) and apply it uniformly across your glossary.
- Keep it permanent: The node name is the identifier and should not change. Use the
namefield inglossaryNodeInfofor the display name. - Consider depth: While nesting is supported, keep hierarchies manageable (typically 2-4 levels deep) for usability.
Important Capabilities
Core Node Information (glossaryNodeInfo)
The glossaryNodeInfo aspect contains the essential information about a glossary node:
- definition (required): A description of what this node/category represents. This helps users understand the purpose and scope of terms within this node.
- name: The display name shown in the UI. This can be more human-friendly than the URN identifier (e.g., "Financial Metrics" vs. "FinancialMetrics").
- parentNode: A reference to another GlossaryNode that acts as the parent in the hierarchy. This creates the tree structure visible in the UI.
- id: An optional identifier field that can store an external reference or alternate ID.
- customProperties: Key-value pairs for additional metadata specific to your organization.
Example:
{
"name": "Financial Metrics",
"definition": "Category for all financial and accounting-related business terms including revenue, costs, and profitability measures.",
"parentNode": "urn:li:glossaryNode:Finance"
}
Hierarchical Structure
GlossaryNodes support arbitrary nesting through the parentNode field, creating tree structures:
GlossaryNode: DataGovernance
├── GlossaryNode: Classification
│ ├── GlossaryTerm: Public
│ ├── GlossaryTerm: Internal
│ └── GlossaryTerm: Confidential
│
├── GlossaryNode: PersonalInformation
│ ├── GlossaryNode: DirectIdentifiers
│ │ ├── GlossaryTerm: Email
│ │ └── GlossaryTerm: SSN
│ └── GlossaryNode: IndirectIdentifiers
│ ├── GlossaryTerm: IPAddress
│ └── GlossaryTerm: DeviceID
│
└── GlossaryNode: Compliance
├── GlossaryTerm: GDPR
└── GlossaryTerm: CCPA
Key characteristics:
- A GlossaryNode can have at most one parent node (single inheritance)
- A GlossaryNode can contain both GlossaryTerms and child GlossaryNodes
- Nodes at the root level (no parent) appear at the top of the glossary hierarchy
- Moving a node automatically moves all its descendants
Ownership and Governance
GlossaryNodes support standard ownership metadata through the ownership aspect. Ownership at the node level can represent:
- Stewardship responsibility for maintaining the category and its terms
- Subject matter expertise for the business domain
- Accountability for term quality and accuracy within the category
Ownership is particularly powerful for GlossaryNodes because:
- Owners can be granted special permissions (Manage Direct Children, Manage All Children)
- Ownership can cascade to terms within the node
- It establishes clear accountability for glossary sections
Documentation and Links
GlossaryNodes support the institutionalMemory aspect, allowing you to:
- Link to external documentation (Confluence pages, wikis, etc.)
- Reference governance policies or standards
- Point to training materials or style guides
- Maintain a history of important links related to the category
This is especially useful for top-level nodes representing major domains or initiatives.
Code Examples
Creating a GlossaryNode
Python SDK: Create a root-level GlossaryNode
import logging
import os
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata._urns.urn_defs import GlossaryNodeUrn
from datahub.metadata.schema_classes import GlossaryNodeInfoClass
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Create a GlossaryNode URN
node_urn = GlossaryNodeUrn("Finance")
# Create the glossary node info with definition and display name
node_info = GlossaryNodeInfoClass(
definition="Category for all financial and accounting-related business terms including revenue, costs, and profitability measures.",
name="Financial Metrics",
)
# Create metadata change proposal
event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
entityUrn=str(node_urn),
aspect=node_info,
)
# Emit to DataHub
gms_server = os.getenv("DATAHUB_GMS_URL", "http://localhost:8080")
token = os.getenv("DATAHUB_GMS_TOKEN")
rest_emitter = DatahubRestEmitter(gms_server=gms_server, token=token)
rest_emitter.emit(event)
log.info(f"Created glossary node {node_urn}")
Python SDK: Create a nested GlossaryNode with parent
import logging
import os
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata._urns.urn_defs import GlossaryNodeUrn
from datahub.metadata.schema_classes import GlossaryNodeInfoClass
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# First, ensure the parent node exists (Finance)
parent_node_urn = GlossaryNodeUrn("Finance")
parent_node_info = GlossaryNodeInfoClass(
definition="Top-level category for financial metrics and terms",
name="Finance",
)
parent_event = MetadataChangeProposalWrapper(
entityUrn=str(parent_node_urn),
aspect=parent_node_info,
)
# Create a nested child node under Finance
child_node_urn = GlossaryNodeUrn("RevenueMetrics")
child_node_info = GlossaryNodeInfoClass(
definition="Metrics related to revenue recognition and reporting",
name="Revenue Metrics",
parentNode=str(parent_node_urn), # Set the parent relationship
)
child_event = MetadataChangeProposalWrapper(
entityUrn=str(child_node_urn),
aspect=child_node_info,
)
# Emit both to DataHub
rest_emitter = DatahubRestEmitter(
gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
token=os.getenv("DATAHUB_GMS_TOKEN"),
)
rest_emitter.emit(parent_event)
rest_emitter.emit(child_event)
log.info(f"Created parent glossary node {parent_node_urn}")
log.info(f"Created child glossary node {child_node_urn} under {parent_node_urn}")
Managing Hierarchy
Python SDK: Build a multi-level glossary hierarchy
import logging
import os
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata._urns.urn_defs import GlossaryNodeUrn, GlossaryTermUrn
from datahub.metadata.schema_classes import (
GlossaryNodeInfoClass,
GlossaryTermInfoClass,
)
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Create a multi-level glossary hierarchy:
# DataGovernance
# ├── Classification
# │ ├── Public (term)
# │ └── Confidential (term)
# └── PersonalInformation
# ├── Email (term)
# └── SSN (term)
rest_emitter = DatahubRestEmitter(
gms_server=os.getenv("DATAHUB_GMS_URL", "http://localhost:8080"),
token=os.getenv("DATAHUB_GMS_TOKEN"),
)
# Level 1: Root node
root_urn = GlossaryNodeUrn("DataGovernance")
root_info = GlossaryNodeInfoClass(
definition="Top-level governance structure for data classification and management",
name="Data Governance",
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(root_urn),
aspect=root_info,
)
)
log.info(f"Created root node: {root_urn}")
# Level 2: Child nodes
classification_urn = GlossaryNodeUrn("Classification")
classification_info = GlossaryNodeInfoClass(
definition="Data classification categories",
name="Classification",
parentNode=str(root_urn),
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(classification_urn),
aspect=classification_info,
)
)
log.info(f"Created child node: {classification_urn}")
pii_urn = GlossaryNodeUrn("PersonalInformation")
pii_info = GlossaryNodeInfoClass(
definition="Personal and sensitive data categories",
name="Personal Information",
parentNode=str(root_urn),
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(pii_urn),
aspect=pii_info,
)
)
log.info(f"Created child node: {pii_urn}")
# Level 3: Terms under Classification
public_term_urn = GlossaryTermUrn("Public")
public_term_info = GlossaryTermInfoClass(
definition="Publicly available data with no restrictions",
termSource="INTERNAL",
name="Public",
parentNode=str(classification_urn),
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(public_term_urn),
aspect=public_term_info,
)
)
log.info(f"Created term: {public_term_urn}")
confidential_term_urn = GlossaryTermUrn("Confidential")
confidential_term_info = GlossaryTermInfoClass(
definition="Restricted access data for internal use only",
termSource="INTERNAL",
name="Confidential",
parentNode=str(classification_urn),
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(confidential_term_urn),
aspect=confidential_term_info,
)
)
log.info(f"Created term: {confidential_term_urn}")
# Level 3: Terms under PersonalInformation
email_term_urn = GlossaryTermUrn("Email")
email_term_info = GlossaryTermInfoClass(
definition="Email addresses that can identify individuals",
termSource="INTERNAL",
name="Email Address",
parentNode=str(pii_urn),
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(email_term_urn),
aspect=email_term_info,
)
)
log.info(f"Created term: {email_term_urn}")
ssn_term_urn = GlossaryTermUrn("SSN")
ssn_term_info = GlossaryTermInfoClass(
definition="Social Security Numbers - highly sensitive personal identifiers",
termSource="INTERNAL",
name="Social Security Number",
parentNode=str(pii_urn),
)
rest_emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=str(ssn_term_urn),
aspect=ssn_term_info,
)
)
log.info(f"Created term: {ssn_term_urn}")
log.info("Successfully created glossary hierarchy with nodes and terms")
Adding Ownership
Python SDK: Add an owner to a GlossaryNode
import logging
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata._urns.urn_defs import CorpUserUrn, GlossaryNodeUrn
from datahub.metadata.schema_classes import (
OwnerClass,
OwnershipClass,
OwnershipTypeClass,
)
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Create a glossary node URN
node_urn = GlossaryNodeUrn("Finance")
# Define the owner
owner_urn = CorpUserUrn("jdoe")
# Create ownership aspect
# This makes jdoe a TECHNICAL_OWNER of the Finance glossary node
ownership = OwnershipClass(
owners=[
OwnerClass(
owner=str(owner_urn),
type=OwnershipTypeClass.TECHNICAL_OWNER,
)
]
)
# Create the metadata change proposal
event = MetadataChangeProposalWrapper(
entityUrn=str(node_urn),
aspect=ownership,
)
# Emit to DataHub
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(event)
log.info(f"Added owner {owner_urn} to glossary node {node_urn}")
Querying GlossaryNodes
REST API: Get a GlossaryNode by URN
# Fetch a GlossaryNode entity
curl -X GET 'http://localhost:8080/entities/urn%3Ali%3AglossaryNode%3AFinance' \
-H 'Authorization: Bearer <token>'
# Response includes all aspects:
# - glossaryNodeKey (identity)
# - glossaryNodeInfo (definition, name, parentNode, etc.)
# - ownership (who owns this node)
# - institutionalMemory (links to documentation)
# - etc.
GraphQL: Query root-level GlossaryNodes
query GetRootGlossaryNodes {
getRootGlossaryNodes {
nodes {
urn
properties {
name
definition
}
ownership {
owners {
owner {
... on CorpUser {
urn
username
}
}
}
}
}
}
}
GraphQL: Query children of a GlossaryNode
query GetGlossaryNodeChildren {
glossaryNode(urn: "urn:li:glossaryNode:Finance") {
urn
properties {
name
definition
}
children {
count
relationships {
entity {
... on GlossaryNode {
urn
properties {
name
}
}
... on GlossaryTerm {
urn
properties {
name
definition
}
}
}
}
}
}
}
Bulk Operations
YAML Ingestion: Create node hierarchy from Business Glossary file
# business_glossary.yml
version: "1"
source: MyOrganization
owners:
users:
- datahub
nodes:
- name: DataGovernance
description: Top-level governance structure
nodes:
- name: Classification
description: Data classification categories
terms:
- name: Public
description: Publicly available data
- name: Internal
description: Internal use only
- name: Confidential
description: Restricted access data
- name: PersonalInformation
description: Personal and sensitive data categories
nodes:
- name: DirectIdentifiers
description: Direct personal identifiers
terms:
- name: Email
description: Email addresses
- name: SSN
description: Social Security Numbers
- name: IndirectIdentifiers
description: Indirect identifiers
terms:
- name: IPAddress
description: Internet Protocol addresses
- name: DeviceID
description: Device identifiers
# Ingest using the DataHub CLI:
# datahub ingest -c business_glossary.yml
See the Business Glossary Source documentation for the full YAML format specification.
Integration Points
Relationship with GlossaryTerm
GlossaryNodes provide organizational structure for GlossaryTerms. The relationship is established through:
- GlossaryTerm → GlossaryNode: A term's
glossaryTermInfo.parentNodefield references its containing node - Navigation: The UI renders this as a browsable hierarchy where users can expand nodes to see contained terms
- Search: Users can filter by glossary node to find all terms within a category
Think of this relationship as:
- GlossaryNode: Folder/directory (can contain terms and other nodes)
- GlossaryTerm: File (the actual business definition)
Parent-Child Relationships
GlossaryNodes form a tree structure through self-referential parent-child relationships:
- A child node references its parent via
glossaryNodeInfo.parentNode - A parent node can have many children (both nodes and terms)
- The DataHub UI displays this as an expandable tree in the glossary browser
- GraphQL resolvers provide specialized queries for traversing the hierarchy
Key operations:
getRootGlossaryNodes: Fetch all top-level nodes (no parent)parentNodes: Navigate upward to find all ancestorschildren: Navigate downward to find immediate children- Moving a node updates its
parentNodereference and affects the entire subtree
GraphQL API
The GraphQL API provides specialized operations for GlossaryNodes:
Queries:
glossaryNode(urn): Fetch a specific node with childrengetRootGlossaryNodes: Get all root-level nodessearch(entity: "glossaryNode"): Search nodes by name/definition
Mutations:
createGlossaryNode: Create a new node with optional parentupdateParentNode: Move a node to a different parentupdateName: Update the display nameupdateDescription: Update the definition
Resolvers:
children: Fetch immediate children (nodes and terms)childrenCount: Count of children under this nodeparentNodes: Fetch ancestor path from node to root
See the Business Glossary documentation for UI operations.
Access Control and Permissions
GlossaryNodes support fine-grained access control through special glossary-specific privileges:
Manage Direct Glossary Children
Users with this privilege on a node can:
- Create new terms and nodes directly under this node
- Edit terms and nodes directly under this node
- Delete terms and nodes directly under this node
- Cannot affect grandchildren or deeper descendants
Use case: Department leads managing their immediate category structure
Manage All Glossary Children
Users with this privilege on a node can:
- Create, edit, and delete any term or node in the entire subtree
- Manage nested hierarchies of any depth
- Full control over the category and all descendants
Use case: Data governance team managing an entire domain (e.g., all PII-related terms)
Global Privilege: Manage Glossaries
Users with this platform-level privilege can:
- Manage any node or term across the entire glossary
- Create root-level nodes
- Full administrative control
These privileges are checked hierarchically - if you have permission on a parent node, it may grant permissions on children depending on the privilege type.
Integration with Search and Discovery
While GlossaryNodes don't get applied to data assets directly (that's the role of GlossaryTerms), they enhance discoverability by:
- Faceted Navigation: Users can browse the glossary hierarchy to find relevant terms
- Context: The node structure provides semantic grouping that helps users understand term relationships
- Filtering: Search interfaces can filter terms by their containing node
- Autocomplete: Node structure influences term suggestions and grouping
Notable Exceptions
Node Name vs Display Name
Similar to GlossaryTerms, the URN identifier (name in glossaryNodeKey) is separate from the display name (name in glossaryNodeInfo):
- URN name: Use a stable, unchanging identifier (e.g., "finance-001", "DataGovernance")
- Display name: Use a human-friendly label that can be updated (e.g., "Financial Metrics", "Data Governance")
This separation allows you to rename nodes in the UI without breaking references.
Circular References Not Allowed
The hierarchy must be a tree structure (directed acyclic graph):
- A node cannot be its own ancestor
- Moving a node under one of its descendants is prevented
- DataHub validates the hierarchy to prevent cycles
If you attempt to create a circular reference, the operation will fail with a validation error.
Root-Level Nodes
Nodes with no parent (parentNode is null or not set) appear at the root level of the glossary:
- These represent top-level categories
- Creating root-level nodes may require higher privileges
- Root nodes typically represent major domains or organizational divisions
Deleting Nodes with Children
Current behavior (subject to change):
- DataHub may require nodes to be empty before deletion
- You must first delete or move all child nodes and terms
- This prevents accidental loss of large glossary sections
Best practice: Always move or reassign children before deleting a node, or use bulk operations that handle the entire subtree.
Display Properties
GlossaryNodes support the displayProperties aspect (added in newer versions), which provides additional UI customization:
- Custom icons or colors for the node
- Display order hints
- UI-specific rendering preferences
This is an optional enhancement for organizations that want more visual control over their glossary.
No Direct Application to Assets
Unlike GlossaryTerms, GlossaryNodes are not directly applied to data assets:
- You cannot tag a dataset with a GlossaryNode
- Only GlossaryTerms can be applied to datasets, columns, dashboards, etc.
- Nodes exist solely for organizational purposes within the glossary itself
If you need to tag assets with a category, create a GlossaryTerm within that node and apply the term.
Moving Nodes Affects All Descendants
When you move a node to a new parent:
- All child nodes and terms move with it
- The entire subtree is relocated
- References from terms to their parent node are automatically maintained
- No manual updates to individual terms are needed
This makes reorganization efficient but requires care to avoid unintended moves.
Technical Reference
For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.