
Data Product

Data Products are curated collections of data assets designed for easy discovery and consumption. They organize and package related data assets such as Tables, Dashboards, Charts, Pipelines, and other entities within DataHub. Data Products are a key concept in data mesh architecture, where they serve as independent units of data managed by specific domain teams.

Unlike other entities in DataHub that typically represent technical assets in source systems, Data Products are a DataHub-invented concept that provides a logical grouping mechanism for organizing assets into consumable offerings.

Identity

Data Products are identified by a single field:

  • id: A unique identifier for the Data Product, typically a human-readable string such as pet_of_the_week or customer_360.

An example of a Data Product identifier is urn:li:dataProduct:pet_of_the_week.

The simplicity of the identifier makes Data Products easy to create and reference, as they don't need to be tied to any particular platform or technology.
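
As a small illustration of the identifier format, the sketch below builds and parses a Data Product URN with plain string handling. The helper names are hypothetical and exist only for this example; they are not part of the DataHub SDK.

```python
# Hypothetical helpers for illustration only; not part of the DataHub SDK.
DATA_PRODUCT_URN_PREFIX = "urn:li:dataProduct:"


def build_data_product_urn(product_id: str) -> str:
    """Return the URN for a Data Product id such as 'customer_360'."""
    if not product_id:
        raise ValueError("Data Product id must be a non-empty string")
    return f"{DATA_PRODUCT_URN_PREFIX}{product_id}"


def parse_data_product_id(urn: str) -> str:
    """Extract the id from a Data Product URN, rejecting other URN types."""
    if not urn.startswith(DATA_PRODUCT_URN_PREFIX):
        raise ValueError(f"Not a Data Product URN: {urn}")
    return urn[len(DATA_PRODUCT_URN_PREFIX):]


print(build_data_product_urn("pet_of_the_week"))
print(parse_data_product_id("urn:li:dataProduct:customer_360"))
```

Because the URN carries no platform or environment component, the id alone is enough to round-trip between the two forms.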

Important Capabilities

Data Product Properties

The core properties of a Data Product are captured in the dataProductProperties aspect, which includes:

  • name: The display name of the Data Product, which is searchable and used for autocomplete
  • description: Documentation describing what the Data Product offers and how to use it
  • assets: A list of data assets that are part of this Data Product, with each asset having an optional outputPort flag

Asset Associations

Data Products can contain a wide variety of asset types as defined in the dataProductProperties aspect:

  • Datasets (tables, views, streams)
  • Data Jobs and Data Flows (pipelines)
  • Dashboards and Charts (visualizations)
  • Notebooks
  • Containers (schemas, databases)
  • ML Models, ML Model Groups, ML Feature Tables, ML Features, and ML Primary Keys

Each asset association can be marked as an output port, which in data mesh terminology represents a data asset that is intended to be shared and consumed by other teams. This allows Data Product owners to distinguish between:

  • Internal assets: Data used internally within the Data Product for processing
  • Output ports: Data explicitly published for external consumption
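
The distinction above can be sketched as a simple partition over asset associations. This is a minimal example that assumes each association is a dictionary with a destinationUrn key and an optional outputPort flag, matching the field names used by the query example later on this page; the function name and sample URNs are illustrative.

```python
from typing import Dict, List, Tuple

# Sample associations shaped like entries of dataProductProperties.assets.
# The URNs are illustrative only.
SAMPLE_ASSETS: List[Dict] = [
    {
        "destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_profile,PROD)",
        "outputPort": True,
    },
    {
        # outputPort omitted: treated as an internal asset
        "destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_transactions,PROD)",
    },
]


def split_assets(associations: List[Dict]) -> Tuple[List[str], List[str]]:
    """Partition asset URNs into (internal, output_ports)."""
    internal: List[str] = []
    output_ports: List[str] = []
    for association in associations:
        # A missing flag is treated as False, i.e. an internal asset.
        target = output_ports if association.get("outputPort", False) else internal
        target.append(association["destinationUrn"])
    return internal, output_ports


internal, output_ports = split_assets(SAMPLE_ASSETS)
print(f"Internal assets: {len(internal)}, output ports: {len(output_ports)}")
```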

The following code snippet shows how to create a Data Product with multiple assets, including marking one as an output port.

Python: Create a Data Product with assets
# Inlined from /metadata-ingestion/examples/library/dataproduct_create_sdk.py
from datahub.api.entities.dataproduct.dataproduct import DataProduct
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

data_product = DataProduct(
    id="customer_360",
    display_name="Customer 360",
    domain="urn:li:domain:marketing",
    description="A comprehensive view of customer data including profiles, transactions, and behaviors.",
    assets=[
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_profile,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_transactions,PROD)",
        "urn:li:dashboard:(looker,customer_overview)",
    ],
    output_ports=[
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_profile,PROD)"
    ],
    owners=[
        {"id": "urn:li:corpuser:datahub", "type": "BUSINESS_OWNER"},
        {"id": "urn:li:corpuser:jdoe", "type": "TECHNICAL_OWNER"},
    ],
    terms=["urn:li:glossaryTerm:CustomerData"],
    tags=["urn:li:tag:production"],
    properties={"tier": "gold", "sla": "99.9%"},
    external_url="https://wiki.company.com/customer-360",
)

for mcp in data_product.generate_mcp(upsert=True):
    graph.emit(mcp)

print(f"Created Data Product: urn:li:dataProduct:{data_product.id}")

Asset Settings

The assetSettings aspect allows Data Products to configure custom settings, such as custom asset summary configurations. This aspect is shared with other organizational entities like Domains and Glossary Terms, providing a consistent way to customize how assets are displayed and summarized.

Tags and Glossary Terms

Data Products support Tags and Glossary Terms, allowing you to categorize and document your data offerings. Tags can be used for informal categorization (e.g., "adoption", "experimental"), while Glossary Terms provide formal business vocabulary linkage.

Here is an example of adding metadata to a Data Product:

Python SDK: Add tags and terms to a Data Product
# Inlined from /metadata-ingestion/examples/library/dataproduct_add_metadata.py
import logging

from datahub.emitter.mce_builder import (
    make_data_product_urn,
    make_tag_urn,
    make_term_urn,
)
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlossaryTermAssociationClass,
    TagAssociationClass,
)
from datahub.specific.dataproduct import DataProductPatchBuilder

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

data_product_urn = make_data_product_urn("customer_360")

for mcp in (
    DataProductPatchBuilder(data_product_urn)
    .add_tag(TagAssociationClass(tag=make_tag_urn("production")))
    .add_tag(TagAssociationClass(tag=make_tag_urn("pii")))
    .add_term(
        GlossaryTermAssociationClass(urn=make_term_urn("CustomerData.PersonalInfo"))
    )
    .build()
):
    rest_emitter.emit(mcp)
log.info(f"Added metadata to Data Product {data_product_urn}")

Ownership

Data Products support ownership through the ownership aspect. Owners can be individuals or groups, and can have different ownership types (BUSINESS_OWNER, TECHNICAL_OWNER, DATA_STEWARD, etc.). When a Data Product is created through the UI, the creator is automatically added as an owner.

Ownership helps establish accountability and makes it clear who is responsible for maintaining the Data Product and ensuring data quality.

Domains

Every Data Product must belong to exactly one Domain. This is a core organizational principle in DataHub's Data Product model: Data Products cannot exist independently but must be associated with a Domain that represents the business area or team responsible for the Data Product.

The Domain association is captured in the domains aspect and is enforced by the UI and API when creating Data Products.
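
The exactly-one-Domain rule can be sketched as a client-side check over a domains aspect. This is a minimal illustration, assuming the aspect is available as a dictionary with a domains list of URNs (the shape used by the query example on this page); the function name is hypothetical and the check does not replace the enforcement done by the UI and API.

```python
from typing import Dict, List


def get_single_domain(domains_aspect: Dict[str, List[str]]) -> str:
    """Return the one Domain URN of a Data Product, or raise if the rule is violated."""
    domain_urns = domains_aspect.get("domains", [])
    if len(domain_urns) != 1:
        raise ValueError(
            f"A Data Product must belong to exactly one Domain, found {len(domain_urns)}"
        )
    return domain_urns[0]


print(get_single_domain({"domains": ["urn:li:domain:marketing"]}))
```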

Documentation and Institutional Memory

Data Products can have rich documentation beyond the basic description field:

  • institutionalMemory: Links to external resources like Confluence pages, Google Docs, or other documentation
  • forms: Structured documentation through DataHub's Forms feature
  • structuredProperties: Custom metadata fields defined by your organization

Adding Assets to a Data Product

Assets can be associated with a Data Product in two ways:

  1. From the Data Product page: Use the "Add Assets" button to search for and add multiple assets at once
  2. From the Asset page: Use the "Set Data Product" option in the asset's sidebar to add it to a Data Product

Python SDK: Add assets to an existing Data Product
# Inlined from /metadata-ingestion/examples/library/dataproduct_add_assets.py
import logging

from datahub.emitter.mce_builder import make_data_product_urn, make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.specific.dataproduct import DataProductPatchBuilder

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

data_product_urn = make_data_product_urn("customer_360")

new_assets = [
    make_dataset_urn(
        platform="snowflake",
        name="customer_db.public.customer_orders",
        env="PROD",
    ),
    make_dataset_urn(
        platform="snowflake",
        name="customer_db.public.customer_support_tickets",
        env="PROD",
    ),
]

for mcp in (
    DataProductPatchBuilder(data_product_urn)
    .add_asset(new_assets[0])
    .add_asset(new_assets[1])
    .build()
):
    rest_emitter.emit(mcp)
log.info(f"Added assets to Data Product {data_product_urn}")

Querying Data Products

Data Products can be queried using the REST API to retrieve their properties and associated assets.

Query a Data Product via REST API
# Inlined from /metadata-ingestion/examples/library/dataproduct_query_rest.py
import logging

from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

data_product_urn = "urn:li:dataProduct:customer_360"

data_product = graph.get_entity_raw(
    entity_urn=data_product_urn,
    aspects=[
        "dataProductKey",
        "dataProductProperties",
        "ownership",
        "domains",
        "globalTags",
        "glossaryTerms",
    ],
)

if data_product:
    log.info(f"Successfully retrieved Data Product: {data_product_urn}")

    properties = data_product.get("dataProductProperties")
    if properties:
        log.info(f"Name: {properties.get('name')}")
        log.info(f"Description: {properties.get('description')}")

        assets = properties.get("assets", [])
        log.info(f"Number of assets: {len(assets)}")
        for asset in assets:
            asset_urn = asset.get("destinationUrn")
            is_output_port = asset.get("outputPort", False)
            log.info(f" - Asset: {asset_urn} (Output Port: {is_output_port})")

    domains = data_product.get("domains")
    if domains:
        domain_urns = domains.get("domains", [])
        log.info(f"Domain: {domain_urns}")

    ownership = data_product.get("ownership")
    if ownership:
        owners = ownership.get("owners", [])
        log.info(f"Number of owners: {len(owners)}")
        for owner in owners:
            log.info(f" - Owner: {owner.get('owner')} (Type: {owner.get('type')})")

    tags = data_product.get("globalTags")
    if tags:
        tag_list = tags.get("tags", [])
        log.info(f"Tags: {[t.get('tag') for t in tag_list]}")

    terms = data_product.get("glossaryTerms")
    if terms:
        term_list = terms.get("terms", [])
        log.info(f"Glossary Terms: {[t.get('urn') for t in term_list]}")
else:
    log.error(f"Data Product not found: {data_product_urn}")

Integration Points

Data Products integrate with several key areas of DataHub:

Relationship to Domains

Data Products must belong to a Domain, creating a hierarchical organization:

Domain (e.g., "Marketing")
└── Data Product (e.g., "Customer 360")
    ├── Dataset: customer_profile
    ├── Dataset: customer_transactions
    ├── Dashboard: customer_overview
    └── DataFlow: customer_pipeline

This hierarchy allows organizations to implement data mesh principles where each domain owns and manages its Data Products.

Relationship to Assets

Data Products create a DataProductContains relationship with their assets. This relationship is bidirectional:

  • From the Data Product, you can see all contained assets
  • From any asset, you can see which Data Product(s) it belongs to

An asset can belong to multiple Data Products, allowing for flexible organization schemes (e.g., an asset could be part of both a "Customer 360" product and a "Marketing Analytics" product).
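
To make the bidirectional view concrete, the sketch below inverts a mapping of Data Products to their assets into a mapping of assets to the Data Products that contain them. It works on plain dictionaries and does not call DataHub; the function name, the marketing_analytics id, and the sample URNs are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List

PROFILE_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_profile,PROD)"
ORDERS_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_orders,PROD)"

# Forward direction: Data Product URN -> contained asset URNs (illustrative data).
PRODUCT_ASSETS: Dict[str, List[str]] = {
    "urn:li:dataProduct:customer_360": [PROFILE_URN, ORDERS_URN],
    "urn:li:dataProduct:marketing_analytics": [PROFILE_URN],
}


def build_membership_index(product_assets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert product -> assets into asset -> products."""
    index: Dict[str, List[str]] = defaultdict(list)
    for product_urn, asset_urns in product_assets.items():
        for asset_urn in asset_urns:
            index[asset_urn].append(product_urn)
    return dict(index)


membership = build_membership_index(PRODUCT_ASSETS)
for asset_urn, product_urns in membership.items():
    print(f"{asset_urn} belongs to {len(product_urns)} Data Product(s)")
```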

Authorization and Access Control

DataHub provides fine-grained permissions for Data Products:

  • Manage Data Product: Required to create/delete Data Products within a Domain
  • Edit Data Product: Required to add/remove assets from a Data Product

These privileges can be granted through Metadata Policies, allowing organizations to control who can create and modify Data Products.

GraphQL API

The DataHub GraphQL API provides several operations for working with Data Products. The first four below are mutations; listDataProductAssets is a query:

  • createDataProduct: Create a new Data Product within a Domain
  • updateDataProduct: Update Data Product properties
  • deleteDataProduct: Delete a Data Product
  • batchSetDataProduct: Add or remove multiple assets from a Data Product
  • listDataProductAssets: Query assets belonging to a Data Product
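
As a rough sketch of how one of these operations might be called, the example below builds the JSON body for a batchSetDataProduct request. Only the operation name comes from the list above: the input type name (BatchSetDataProductInput) and the input fields (dataProductUrn, resourceUrns) are assumptions that should be checked against your deployment's GraphQL schema, and the helper name is hypothetical.

```python
import json
from typing import Dict, List

# ASSUMPTION: the input type and field names below are not confirmed by this page;
# verify them against the GraphQL schema of your DataHub version.
BATCH_SET_DATA_PRODUCT_MUTATION = """
mutation batchSetDataProduct($input: BatchSetDataProductInput!) {
  batchSetDataProduct(input: $input)
}
"""


def build_batch_set_payload(data_product_urn: str, resource_urns: List[str]) -> Dict:
    """Build the request body for assigning assets to a Data Product."""
    return {
        "query": BATCH_SET_DATA_PRODUCT_MUTATION,
        "variables": {
            "input": {
                "dataProductUrn": data_product_urn,
                "resourceUrns": list(resource_urns),
            }
        },
    }


payload = build_batch_set_payload(
    "urn:li:dataProduct:customer_360",
    ["urn:li:dataset:(urn:li:dataPlatform:snowflake,customer_db.public.customer_orders,PROD)"],
)
print(json.dumps(payload["variables"], indent=2))
```

The resulting dictionary would be sent as the JSON body of an authenticated POST to your deployment's GraphQL endpoint. Passing the URNs as variables instead of interpolating them into the query string avoids quoting problems with the parentheses and commas in dataset URNs.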

Search and Discovery

Data Products are searchable entities in DataHub. The name and description fields are indexed, and Data Products can be filtered by:

  • Domain
  • Ownership
  • Tags
  • Glossary Terms
  • Structured Properties

This makes it easy for data consumers to discover relevant Data Products across the organization.

Notable Exceptions

Domain Requirement

Unlike many other entities in DataHub, Data Products have a hard requirement to belong to a Domain. This is by design to support data mesh principles where every Data Product must have a clear organizational owner. You cannot create a Data Product without first having a Domain to associate it with.

Output Ports

The outputPort flag on asset associations is a forward-looking feature aligned with data mesh principles. While the flag can be set today, advanced features around output ports (such as differentiated access control or versioning) are still being developed. The current roadmap includes:

  • Support for marking data assets in a Data Product as private versus shareable
  • Support for declaring data lineage manually between Data Products
  • Support for declaring logical schemas for Data Products
  • Support for associating data contracts with Data Products
  • Support for semantic versioning of Data Products

YAML-based Management

DataHub supports managing Data Products as code through YAML files. This enables GitOps workflows where Data Product definitions are version-controlled and deployed through CI/CD pipelines. The datahub CLI provides commands to:

  • datahub dataproduct upsert: Create or update Data Products from YAML
  • datahub dataproduct diff: Compare YAML with current state
  • datahub dataproduct delete: Remove Data Products

This allows for a hybrid model where business users can manage Data Products through the UI while technical teams can use infrastructure-as-code practices.
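
Conceptually, comparing a version-controlled definition with the current state comes down to a set difference over the declared and existing assets. The sketch below illustrates that idea only; it is not how the datahub CLI implements dataproduct diff, and the function name and sample URNs are hypothetical.

```python
from typing import Dict, Iterable, List


def diff_assets(declared: Iterable[str], current: Iterable[str]) -> Dict[str, List[str]]:
    """Compare asset URNs declared in a definition file with those currently attached."""
    declared_set, current_set = set(declared), set(current)
    return {
        "to_add": sorted(declared_set - current_set),
        "to_remove": sorted(current_set - declared_set),
        "unchanged": sorted(declared_set & current_set),
    }


declared_assets = ["urn:li:dataset:orders", "urn:li:dataset:profile"]  # from the definition file
current_assets = ["urn:li:dataset:profile", "urn:li:dataset:legacy"]  # from the running instance

changes = diff_assets(declared_assets, current_assets)
for action, urns in changes.items():
    print(f"{action}: {urns}")
```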

Multi-Asset Membership

Unlike some organizational constructs in other systems, an asset in DataHub can belong to multiple Data Products simultaneously. This flexibility supports different organizational perspectives. For example, a dataset might be part of a domain-specific product while also being included in a cross-functional analytics product.

Technical Reference Guide

The sections above provide an overview of how to use this entity. The following sections provide detailed technical information about how metadata is stored and represented in DataHub.

Aspects are the individual pieces of metadata that can be attached to an entity. Each aspect contains specific information (like ownership, tags, or properties) and is stored as a separate record, allowing for flexible and incremental metadata updates.

Relationships show how this entity connects to other entities in the metadata graph. These connections are derived from the fields within each aspect and form the foundation of DataHub's knowledge graph.

Reading the Field Tables

Each aspect's field table includes an Annotations column that provides additional metadata about how fields are used:

  • ⚠️ Deprecated: This field is deprecated and may be removed in a future version. Check the description for the recommended alternative
  • Searchable: This field is indexed and can be searched in DataHub's search interface
  • Searchable (fieldname): When the field name in parentheses is shown, it indicates the field is indexed under a different name in the search index. For example, dashboardTool is indexed as tool
  • → RelationshipName: This field creates a relationship to another entity. The arrow indicates this field contains a reference (URN) to another entity, and the name indicates the type of relationship (e.g., → Contains, → OwnedBy)

Fields with complex types (like Edge, AuditStamp) link to their definitions in the Common Types section below.
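
The annotation conventions above are regular enough to parse mechanically. The following sketch splits an Annotations cell into its parts; the function name and the shape of the returned dictionary are choices made for this example and are not part of DataHub.

```python
import re
from typing import Dict


def parse_annotations(cell: str) -> Dict:
    """Parse an Annotations cell such as 'Searchable, → TaggedWith'."""
    result = {
        "deprecated": False,
        "searchable": False,
        "search_field": None,  # set only when indexed under a different name
        "relationships": [],
    }
    for part in (piece.strip() for piece in cell.split(",")):
        if not part:
            continue
        if part.startswith("→"):
            # Arrow prefix: the field holds a URN and creates a relationship.
            result["relationships"].append(part[1:].strip())
        elif part.startswith("Searchable"):
            result["searchable"] = True
            match = re.search(r"\(([^)]+)\)", part)
            if match:
                result["search_field"] = match.group(1)
        elif "Deprecated" in part:
            result["deprecated"] = True
    return result


print(parse_annotations("Searchable, → TaggedWith"))
print(parse_annotations("Searchable (tool)"))
```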

Aspects

ownership

Ownership information of an entity.

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| owners | Owner[] | | List of owners of the entity. | |
| ownerTypes | map | | Ownership type to Owners map, populated via mutation hook. | Searchable |
| lastModified | AuditStamp | | Audit stamp containing who last modified the record and when. A value of 0 in the time field indi... | |

glossaryTerms

Related business terms information

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| terms | GlossaryTermAssociation[] | | The related business terms | |
| auditStamp | AuditStamp | | Audit stamp containing who reported the related business term | |

globalTags

Tag aspect used for applying tags to an entity

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| tags | TagAssociation[] | | Tags associated with a given entity | Searchable, → TaggedWith |

domains

Links from an Asset to its Domains

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| domains | string[] | | The Domains attached to an Asset | Searchable, → AssociatedWith |

applications

Links from an Asset to its Applications

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| applications | string[] | | The Applications attached to an Asset | Searchable, → AssociatedWith |

dataProductProperties

The main properties of a Data Product

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| customProperties | map | | Custom property bag. | Searchable |
| externalUrl | string | | URL where the reference exists | Searchable |
| name | string | | Display name of the Data Product | Searchable |
| description | string | | Documentation of the data product | Searchable |
| assets | DataProductAssociation[] | | A list of assets that are part of this Data Product | → DataProductContains |

institutionalMemory

Institutional memory of an entity. This is a way to link to relevant documentation and provide description of the documentation. Institutional or tribal knowledge is very important for users to leverage the entity.

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| elements | InstitutionalMemoryMetadata[] | | List of records that represent institutional memory of an entity. Each record consists of a link,... | |

status

The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc. This aspect is used to represent soft deletes conventionally.

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| removed | boolean | | Whether the entity has been removed (soft-deleted). | Searchable |

structuredProperties

Properties about an entity governed by StructuredPropertyDefinition

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| properties | StructuredPropertyValueAssignment[] | | Custom property bag. | |

forms

Forms that are assigned to this entity to be filled out

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| incompleteForms | FormAssociation[] | | All incomplete forms assigned to the entity. | Searchable |
| completedForms | FormAssociation[] | | All complete forms assigned to the entity. | Searchable |
| verifications | FormVerificationAssociation[] | | Verifications that have been applied to the entity via completed forms. | Searchable |

testResults

Information about a Test Result

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| failing | TestResult[] | | Results that are failing | Searchable, → IsFailing |
| passing | TestResult[] | | Results that are passing | Searchable, → IsPassing |

subTypes

Sub Types. Use this aspect to specialize a generic Entity e.g. Making a Dataset also be a View or also be a LookerExplore

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| typeNames | string[] | | The names of the specific types. | Searchable |

assetSettings

Settings associated with this asset

| Field | Type | Required | Description | Annotations |
| --- | --- | --- | --- | --- |
| assetSummary | AssetSummarySettings | | Information related to the asset summary for this asset | |

Common Types

These types are used across multiple aspects in this entity.

AuditStamp

Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage.

Fields:

  • time (long): When did the resource/association/sub-resource move into the specific lifecyc...
  • actor (string): The entity (e.g. a member URN) which will be credited for moving the resource...
  • impersonator (string?): The entity (e.g. a service URN) which performs the change on behalf of the Ac...
  • message (string?): Additional context around how DataHub was informed of the particular change. ...
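
The AuditStamp fields listed above can be mirrored in a small record type. This is an illustrative sketch, not the SDK's generated class: the class and helper names are hypothetical, and treating time as epoch milliseconds is an assumption that the field list does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditStampSketch:
    """Illustrative mirror of the AuditStamp fields; not the SDK class."""

    time: int  # long; ASSUMED here to be epoch milliseconds
    actor: str  # URN of the entity credited with the change
    impersonator: Optional[str] = None  # string?: URN acting on behalf of the actor
    message: Optional[str] = None  # string?: extra context about the change


def stamp_now(actor_urn: str) -> AuditStampSketch:
    """Create a stamp for the current UTC time."""
    now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
    return AuditStampSketch(time=now_ms, actor=actor_urn)


stamp = stamp_now("urn:li:corpuser:datahub")
print(stamp.actor, stamp.impersonator, stamp.message)
```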

FormAssociation

Properties of an applied form.

Fields:

  • urn (string): Urn of the applied form
  • incompletePrompts (FormPromptAssociation[]): A list of prompts that are not yet complete for this form.
  • completedPrompts (FormPromptAssociation[]): A list of prompts that have been completed for this form.

TestResult

Information about a Test Result

Fields:

  • test (string): The urn of the test
  • type (TestResultType): The type of the result
  • testDefinitionMd5 (string?): The md5 of the test definition that was used to compute this result. See Test...
  • lastComputed (AuditStamp?): The audit stamp of when the result was computed, including the actor who comp...

Relationships

Outgoing

These are the relationships stored in this entity's aspects

  • OwnedBy

    • Corpuser via ownership.owners.owner
    • CorpGroup via ownership.owners.owner
  • ownershipType

    • OwnershipType via ownership.owners.typeUrn
  • TermedWith

    • GlossaryTerm via glossaryTerms.terms.urn
  • TaggedWith

    • Tag via globalTags.tags
  • AssociatedWith

    • Domain via domains.domains
    • Application via applications.applications
  • DataProductContains

    • Dataset via dataProductProperties.assets
    • DataJob via dataProductProperties.assets
    • DataFlow via dataProductProperties.assets
    • Chart via dataProductProperties.assets
    • Dashboard via dataProductProperties.assets
    • Notebook via dataProductProperties.assets
    • Container via dataProductProperties.assets
    • MlModel via dataProductProperties.assets
    • MlModelGroup via dataProductProperties.assets
    • MlFeatureTable via dataProductProperties.assets
    • MlFeature via dataProductProperties.assets
    • MlPrimaryKey via dataProductProperties.assets
  • IsFailing

    • Test via testResults.failing
  • IsPassing

    • Test via testResults.passing
  • HasSummaryTemplate

    • DataHubPageTemplate via assetSettings.assetSummary.templates

Incoming

These are the relationships stored in other entities' aspects

  • PostTarget

    • Post via postInfo.target
