
Container

The container entity is a core entity in the metadata model that represents a grouping of related data assets. Containers provide hierarchical organization for datasets, charts, dashboards, and other containers, enabling navigation and structure discovery within data platforms.

Identity

Containers are uniquely identified by a GUID (Globally Unique Identifier) that is typically derived from a combination of attributes specific to the container type. Unlike datasets, which are identified by platform, name, and environment, containers use a more flexible identification scheme based on their hierarchical properties.

The URN structure for a container is: urn:li:container:{guid}

The GUID is typically computed from container-specific properties such as:

  • Database containers: platform + instance + database name
  • Schema containers: platform + instance + database + schema name
  • Project containers: platform + instance + project_id
  • Folder containers: platform + instance + folder_abs_path
  • Bucket containers: platform + instance + bucket_name

URN Examples

urn:li:container:b5e95fce839e7d78151ed7e0a7420d84

The GUID is generated using the datahub_guid() function from a dictionary of properties. For example, a Snowflake schema container would be identified by:

{
"platform": "snowflake",
"instance": "prod_instance",
"database": "analytics",
"schema": "reporting"
}
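The derivation can be sketched with the standard library alone. The snippet below assumes the GUID is an MD5 hex digest of the properties serialized as compact JSON with sorted keys. `sketch_guid` is a hypothetical stand-in for illustration, not the real datahub_guid(), so its output is not guaranteed to match GUIDs produced by DataHub.

```python
import hashlib
import json


def sketch_guid(properties: dict) -> str:
    """Hash a property dictionary into a 32-character hex identifier.

    Assumption: sorted-key compact JSON hashed with MD5. This is an
    illustration, not the library's datahub_guid() implementation.
    """
    serialized = json.dumps(properties, separators=(",", ":"), sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


schema_properties = {
    "platform": "snowflake",
    "instance": "prod_instance",
    "database": "analytics",
    "schema": "reporting",
}

guid = sketch_guid(schema_properties)
urn = f"urn:li:container:{guid}"
print(urn)
```

Because the keys are sorted before hashing, the order in which the properties are listed does not affect the result.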

Real-World Concepts

Containers represent various hierarchical structures in data platforms:

  • Databases: Top-level organizational units in relational systems (MySQL, PostgreSQL, Snowflake)
  • Schemas: Logical groupings within databases (Snowflake schemas, PostgreSQL schemas)
  • Projects: Organizational units in cloud platforms (BigQuery projects)
  • Datasets: Logical groupings in cloud platforms (BigQuery datasets)
  • Folders: Directory structures in file systems and data lakes (S3 folders, ADLS directories)
  • Buckets: Top-level storage containers in cloud object stores (S3 buckets, GCS buckets)
  • Workspaces: Organizational units in BI platforms (Power BI workspaces, Tableau sites)
  • Catalogs: Top-level organizational units in data catalogs (Unity Catalog, Iceberg catalogs)
  • Metastores: Storage metadata repositories (Hive metastore, Unity metastore)

Important Capabilities

Container Properties

The containerProperties aspect contains metadata inherited from the source system:

  • name: Display name of the container (required)
  • qualifiedName: Fully-qualified name (optional, e.g., "prod.analytics.reporting")
  • description: Description from the source system
  • env: Environment indicator (PROD, DEV, QA, etc.)
  • customProperties: Additional key-value properties from the source system
  • externalUrl: Link to the container in the source system
  • created: Timestamp when the container was created in the source system
  • lastModified: Timestamp when the container was last modified in the source system

Editable Container Properties

The editableContainerProperties aspect allows users to override or add information via the UI:

  • description: User-provided description that supplements or overrides the source system description

This separation ensures that metadata from source systems doesn't conflict with user-provided annotations.

Hierarchical Relationships

Containers support nested hierarchies through the container aspect, which links a container to its parent container. This enables multi-level organizational structures:

Platform (implicit)
└── Database Container
    └── Schema Container
        └── Dataset

For example, in Snowflake:

Snowflake Platform
└── ANALYTICS_DB (Database Container)
    └── REPORTING (Schema Container)
        ├── SALES_METRICS (Dataset)
        └── REVENUE_TABLE (Dataset)

Subtypes

The subTypes aspect specifies the type of container, which helps the UI render appropriate icons and behaviors. Common subtypes include:

  • Database: Relational database containers
  • Schema: Schema-level containers within databases
  • Project: Cloud project containers (GCP, Azure)
  • Dataset: BigQuery dataset containers
  • Folder: File system folders
  • Bucket: Object storage buckets
  • Workspace: BI platform workspaces
  • Catalog: Data catalog containers
  • Metastore: Metadata storage containers
  • MLflow Experiment (MLAssetSubTypes.MLFLOW_EXPERIMENT): ML experiment containers that organize training runs

ML Experiments as Containers

Machine learning experiments are modeled as containers with the MLFLOW_EXPERIMENT subtype. This pattern enables organizing related training runs (which are dataProcessInstance entities) into logical groups for comparison and tracking:

ML Experiment (Container)
├── Training Run 1 (DataProcessInstance)
├── Training Run 2 (DataProcessInstance)
└── Training Run 3 (DataProcessInstance)

Training runs belong to experiments through the container aspect. This structure mirrors common ML platform patterns (like MLflow) and enables:

  • Comparing metrics across multiple training attempts
  • Tracking the evolution of a model through iterations
  • Organizing training work by project or objective

For more information on ML experiments and training runs, see:

Containable Entities

The following entity types can be contained within a container:

  • Datasets
  • Charts
  • Dashboards
  • DataProcessInstances (e.g., training runs in ML experiments)
  • Other Containers (for nested hierarchies)

Code Examples

Create a Database Container

Python SDK: Create a database container
# metadata-ingestion/examples/library/container_create_database.py
from datahub.emitter.mcp_builder import DatabaseKey
from datahub.sdk import Container, DataHubClient

client = DataHubClient.from_env()

container = Container(
    container_key=DatabaseKey(
        platform="snowflake",
        instance="production",
        database="analytics_db",
    ),
    display_name="Analytics Database",
    description="Main analytics database containing reporting and metrics data",
    subtype="Database",
    external_url="https://app.snowflake.com/analytics_db",
    parent_container=None,
)

client.entities.upsert(container)

print(f"Created database container with URN: {container.urn}")

Create a Schema Container with Parent

Python SDK: Create a schema container with parent database
# metadata-ingestion/examples/library/container_create_schema.py
from datahub.emitter.mcp_builder import DatabaseKey, SchemaKey
from datahub.sdk import Container, DataHubClient

client = DataHubClient.from_env()

# First, create the database container
database_key = DatabaseKey(
    platform="snowflake",
    instance="production",
    database="analytics_db",
)

database_container = Container(
    container_key=database_key,
    display_name="Analytics Database",
    description="Main analytics database",
    subtype="Database",
)

client.entities.upsert(database_container)
print(f"Created database container: {database_container.urn}")

# Create a schema container within the database
schema_key = SchemaKey(
    platform="snowflake",
    instance="production",
    database="analytics_db",
    schema="reporting",
)

# No parent_container is passed here; the SDK is expected to infer the
# parent database from the platform, instance, and database in schema_key.
schema_container = Container(
    container_key=schema_key,
    display_name="Reporting Schema",
    description="Schema containing all reporting tables and views",
    subtype="Schema",
)

client.entities.upsert(schema_container)
print(f"Created schema container: {schema_container.urn}")
print("Schema container is nested under database container")

Add Metadata to a Container

Python SDK: Add tags, terms, and ownership to a container
from datahub.emitter.mcp_builder import DatabaseKey
from datahub.sdk import ContainerUrn, CorpUserUrn, DataHubClient, DomainUrn, TagUrn

client = DataHubClient.from_env()

database_key = DatabaseKey(
    platform="snowflake",
    instance="production",
    database="analytics_db",
)

container = client.entities.get(ContainerUrn.from_string(database_key.as_urn()))

container.set_display_name("Analytics Database")
container.set_description(
    "Main analytics database containing reporting and metrics data"
)
container.set_subtype("Database")
container.set_external_url("https://app.snowflake.com/analytics_db")

container.set_tags([TagUrn("production"), TagUrn("analytics"), TagUrn("pii")])

container.set_terms(["urn:li:glossaryTerm:Finance.ReportingData"])

container.set_owners(
    [
        (CorpUserUrn("john.doe"), "DATAOWNER"),
        (CorpUserUrn("analytics-team"), "TECHNICAL_OWNER"),
    ]
)

container.set_domain(DomainUrn("Analytics"))

container.set_links(
    [
        (
            "https://wiki.company.com/analytics-db",
            "Database Documentation",
        ),
        (
            "https://jira.company.com/ANALYTICS-123",
            "Setup Ticket",
        ),
    ]
)

client.entities.update(container)

print(f"Updated container with comprehensive metadata: {container.urn}")
print(f" - Tags: {len(container.tags or [])} tags")
print(f" - Terms: {len(container.terms or [])} terms")
print(f" - Owners: {len(container.owners or [])} owners")
print(f" - Links: {len(container.links or [])} links")
print(f" - Domain: {container.domain}")

Query Container via REST API

Containers can be retrieved using the standard entity retrieval APIs:

Fetch container entity including all aspects
curl 'http://localhost:8080/entities/urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84'

The response will include all aspects associated with the container, including properties, ownership, tags, terms, etc.

To find all entities within a container, use the relationships API:

Find all entities contained within a container
curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84&types=IsPartOf'

This returns all entities (datasets, charts, dashboards, sub-containers) that have this container as their parent.
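The percent-encoded URNs in the curl commands above can be produced programmatically. The following sketch only builds the two request URLs with the standard library and does not send any request. `GMS_BASE_URL` is an illustrative name, and the host is the same localhost address assumed by the curl examples.

```python
from urllib.parse import quote, urlencode

GMS_BASE_URL = "http://localhost:8080"  # same host as the curl examples
container_urn = "urn:li:container:b5e95fce839e7d78151ed7e0a7420d84"

# Path segment: encode every reserved character, including ":".
entity_url = f"{GMS_BASE_URL}/entities/{quote(container_urn, safe='')}"

# Query string: urlencode percent-encodes the URN value.
query = urlencode(
    {"direction": "INCOMING", "urn": container_urn, "types": "IsPartOf"}
)
relationships_url = f"{GMS_BASE_URL}/relationships?{query}"

print(entity_url)
print(relationships_url)
```

Passing `safe=''` matters for the path form: by default `quote` leaves `/` unescaped, which would be a problem for URNs that contain slashes.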

Integration Points

Relationship with Datasets

Datasets are the most common entities contained within containers. The relationship is established through the container aspect on the dataset, which points to the container URN.

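from datahub.sdk import Dataset

# schema_key is the SchemaKey defined in the schema container example above.
# Dataset links to its parent container (schema)
dataset = Dataset(
    platform="snowflake",
    name="analytics_db.reporting.sales_table",
    env="PROD",
    parent_container=schema_key,  # Links to schema container
)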

Hierarchical Navigation

Containers enable hierarchical navigation in the DataHub UI through parent-child relationships:

  1. Top-down browsing: Users can navigate from databases to schemas to tables
  2. Bottom-up breadcrumbs: Datasets show their parent containers in breadcrumb trails
  3. Browse paths: Containers are used to generate browse paths automatically
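To make the breadcrumb idea concrete, here is a minimal sketch that walks parent links upward from an entity. It uses a plain dictionary of hypothetical child-to-parent names rather than real URNs or any DataHub API; in DataHub the equivalent links come from the container aspect.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent links, mirroring the Snowflake example.
PARENT_OF: Dict[str, Optional[str]] = {
    "SALES_METRICS": "REPORTING",
    "REPORTING": "ANALYTICS_DB",
    "ANALYTICS_DB": None,  # top-level container
}


def breadcrumb(entity: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the ancestors of `entity`, outermost first."""
    trail: List[str] = []
    parent = parent_of.get(entity)
    while parent is not None:
        trail.append(parent)
        parent = parent_of.get(parent)
    trail.reverse()
    return trail


print(" > ".join(breadcrumb("SALES_METRICS", PARENT_OF)))
```

Reading the same links in the opposite direction (parent to children) gives the top-down browsing view.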

GraphQL Resolvers

The container entity has specialized GraphQL resolvers:

  • ContainerEntitiesResolver: Retrieves all entities (datasets, charts, dashboards, sub-containers) within a container
  • ParentContainersResolver: Retrieves the full hierarchy of parent containers for any entity

These resolvers power the UI's hierarchical navigation and container overview pages.

Common Usage Patterns

  1. Database/Schema Hierarchy: Relational databases use Database and Schema containers
  2. Project/Dataset Hierarchy: BigQuery uses Project and Dataset containers
  3. Workspace/Folder Hierarchy: BI tools use Workspace containers for organization
  4. Bucket/Folder Hierarchy: Data lakes use Bucket and Folder containers
  5. Catalog/Schema Hierarchy: Modern catalogs (Unity, Iceberg) use Catalog and Schema containers

Notable Exceptions

GUID Stability

Container GUIDs must remain stable across ingestion runs. Since containers are identified by GUID rather than explicit properties in the URN, changing the GUID computation will create a new container entity instead of updating the existing one.

When creating custom containers, ensure that the properties used to generate the GUID are:

  • Stable across time
  • Unique within the platform
  • Derived from immutable source system identifiers
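The effect of an unstable property can be illustrated with a small sketch. It assumes the GUID is a hash of the serialized properties; `sketch_guid` is a hypothetical helper using MD5 over sorted-key JSON, not the library's own function. Any change to an input, even a casing change, yields a different identifier and therefore a different container entity.

```python
import hashlib
import json


def sketch_guid(properties: dict) -> str:
    """Illustrative hash of a property dictionary (not the DataHub function)."""
    serialized = json.dumps(properties, separators=(",", ":"), sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


stable = {
    "platform": "snowflake",
    "instance": "production",
    "database": "analytics_db",
}
rerun = dict(stable)  # a later ingestion run with identical inputs
renamed = {**stable, "database": "ANALYTICS_DB"}  # e.g. a casing change

same_entity = sketch_guid(stable) == sketch_guid(rerun)
new_entity = sketch_guid(stable) != sketch_guid(renamed)
print(same_entity, new_entity)
```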

Self-Referential Containers

While containers can contain other containers, be careful not to create circular references. The parent-child relationship should form a directed acyclic graph (DAG), not a cycle.
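A simple guard against circular references is to follow parent links from each container and stop if a node is seen twice. The sketch below works on a hypothetical dictionary of child-to-parent names and is independent of any DataHub API.

```python
from typing import Dict, Optional


def has_cycle(parent_of: Dict[str, Optional[str]]) -> bool:
    """Return True if following parent links from any node revisits a node."""
    for start in parent_of:
        seen = {start}
        current = parent_of.get(start)
        while current is not None:
            if current in seen:
                return True
            seen.add(current)
            current = parent_of.get(current)
    return False


valid = {"schema": "database", "database": None}
broken = {"a": "b", "b": "a"}
print(has_cycle(valid), has_cycle(broken))
```

A check like this could run before emitting custom container metadata, since a container that lists itself as its own parent is also reported as a cycle.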

Environment Handling

The env field in ContainerKey has special handling for backwards compatibility. In some sources, the platform instance was incorrectly set to the environment value. The backcompat_env_as_instance flag handles this case.

When using the env field:

  • Set it to a valid FabricType (PROD, DEV, QA, etc.)
  • Don't use it for platform instance identification
  • Use the separate instance field for multi-instance deployments
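As a rough illustration of the guidance above, the sketch below shows one way such a compatibility flag could behave. It is an assumption about the general idea, not the library's code: `guid_properties` is a hypothetical function in which env is left out of the identifying properties unless the flag asks for it to stand in for a missing instance.

```python
from typing import Dict, Optional


def guid_properties(
    platform: str,
    instance: Optional[str] = None,
    env: Optional[str] = None,
    backcompat_env_as_instance: bool = False,
    **extra: str,
) -> Dict[str, str]:
    """Build the property dictionary that would identify a container.

    Assumption: env is excluded unless the backwards-compatibility flag
    requests that it stand in for a missing instance.
    """
    properties: Dict[str, str] = {"platform": platform, **extra}
    if instance is not None:
        properties["instance"] = instance
    elif backcompat_env_as_instance and env is not None:
        properties["instance"] = env
    return properties


modern = guid_properties(
    "snowflake", instance="production", env="PROD", database="analytics_db"
)
legacy = guid_properties(
    "snowflake", env="PROD", backcompat_env_as_instance=True, database="analytics_db"
)
print(modern)
print(legacy)
```

Under these assumptions, the legacy call reproduces the older identifying properties, while new code keeps env and instance separate.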

Platform Instance Association

Unlike datasets, which embed the platform instance in their URN, containers associate platform instances through the dataPlatformInstance aspect. This allows containers to be associated with specific instances of a data platform while maintaining a stable GUID.

Access Control

Containers support the access aspect, which can be used to model access control policies at the container level. This is particularly useful for:

  • Database-level permissions
  • Schema-level access control
  • Project-level authorization
  • Workspace-level security

Access controls set on containers can be inherited by contained entities, though this behavior depends on the specific platform's implementation.

Technical Reference

For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.