Version: Next

Dataset

The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data that typically appear as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (S3, ADLS, etc.).

Identity

Datasets are identified by three pieces of information:

  • The platform that the dataset belongs to: this is the specific data technology that hosts the dataset. Examples are hive, bigquery, redshift, etc. See dataplatform for more details.
  • The name of the dataset in the specific platform. Each platform has its own way of naming assets within its system. Usually, names are composed by combining the structural elements of the name, separating them with a dot (.); e.g. relational datasets are usually named <db>.<schema>.<table>, except for platforms like MySQL which do not have the concept of a schema; as a result, MySQL datasets are named <db>.<table>. In cases where the specific platform can have multiple instances (e.g. multiple different MySQL database instances that hold different data assets), names can also include an instance id, making the general pattern for a name <platform_instance>.<db>.<schema>.<table>.
  • The environment or fabric to which the dataset belongs: this is an additional qualifier on the identifier that disambiguates datasets that live in Production environments from datasets that live in Non-production environments, such as Staging, QA, etc. The full list of supported environments / fabrics is available in FabricType.pdl.

An example of a dataset identifier is urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,PROD).
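The three identity components combine into the URN mechanically. Here is a minimal sketch of that assembly; the make_dataset_urn helper below is purely illustrative (in real code, use the SDK's DatasetUrn class or datahub.emitter.mce_builder.make_dataset_urn):

```python
# Hypothetical helper for illustration only -- use the DataHub SDK's
# DatasetUrn class rather than building the string by hand.
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Assemble a dataset URN from its three identity components."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

urn = make_dataset_urn("redshift", "userdb.public.customer_table")
```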

Important Capabilities

Schemas

Datasets support flat and nested schemas. Metadata about schemas is contained in the schemaMetadata aspect. Schemas are represented as an array of fields, each identified by a specific field path.

Field Paths explained

Fields that are either top-level or expressible unambiguously using a dot-based notation can be identified via a v1 path name, whereas fields that are part of a union need further disambiguation using [type=X] markers. Consider the simple nested schema below:

{
  "type": "record",
  "name": "Customer",
  "fields": [
    {
      "type": "record",
      "name": "address",
      "fields": [
        {"name": "zipcode", "type": "string"},
        {"name": "street", "type": "string"}
      ]
    }
  ]
}
  • v1 field path: address.zipcode
  • v2 field path: [version=2.0].[type=struct].address.[type=string].zipcode. More examples and a formal specification of a v2 fieldPath can be found here.
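For simple, unambiguous paths like the one above, the v1 form can be recovered from the v2 form by stripping the bracketed markers. The sketch below is illustrative only, not the official parser; unions and arrays require the full v2 fieldPath specification:

```python
import re


def v2_to_v1_field_path(v2_path: str) -> str:
    """Strip the bracketed [key=value] markers from a v2 field path.

    Illustrative only: valid for simple paths; unions and arrays need
    the full v2 fieldPath specification to disambiguate.
    """
    return re.sub(r"\[[^\]]+\]\.?", "", v2_path).strip(".")


v1 = v2_to_v1_field_path("[version=2.0].[type=struct].address.[type=string].zipcode")
```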

Understanding field paths is important because they are the identifiers through which tags, terms, and documentation on fields are expressed. Besides the type and name of each field, schemas also contain descriptions attached to the individual fields, as well as information about primary and foreign keys.
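To see how v1 paths like address.zipcode arise from the Customer schema above, the nested record can be flattened with a few lines of plain Python (no DataHub dependency; a sketch, not the SDK's actual schema handling):

```python
customer = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {
            "type": "record",
            "name": "address",
            "fields": [
                {"name": "zipcode", "type": "string"},
                {"name": "street", "type": "string"},
            ],
        }
    ],
}


def field_paths(record: dict, prefix: str = "") -> list:
    """Walk a nested record and emit dot-separated v1 field paths."""
    paths = []
    for field in record["fields"]:
        name = prefix + field["name"]
        if "fields" in field:  # nested record: recurse with a longer prefix
            paths.extend(field_paths(field, name + "."))
        else:
            paths.append(name)
    return paths


paths = field_paths(customer)
```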

The following code snippet shows how to add a schema containing three fields to a dataset.

Python SDK: Add a schema to a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_schema.py
from datahub.sdk import DataHubClient, Dataset

client = DataHubClient.from_env()

dataset = Dataset(
    platform="hive",
    name="realestate_db.sales",
    schema=[
        # tuples of (field name / field path, data type, description)
        (
            "address.zipcode",
            "varchar(50)",
            "This is the zipcode of the address. Specified using extended form and limited to addresses in the United States",
        ),
        ("address.street", "varchar(100)", "Street corresponding to the address"),
        ("last_sold_date", "date", "Date of the last sale date for this property"),
    ],
)

client.entities.upsert(dataset)

Tags and Glossary Terms

Datasets can have Tags or Terms attached to them. Read this blog to understand the difference between tags and terms, and when to use each.

Adding Tags or Glossary Terms to a dataset at the top level

At the top level, tags are added to datasets using the globalTags aspect, while terms are added using the glossaryTerms aspect.

Here is an example of how to add a tag to a dataset. Note that this involves reading the currently set tags on the dataset and then adding a new one if needed.
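The read-modify-write pattern can be sketched against the shape of the globalTags aspect (a simplified dict stand-in for illustration, not the SDK's actual aspect classes; the SDK's add_tag does this bookkeeping for you):

```python
def add_tag_association(global_tags: dict, tag_urn: str) -> dict:
    """Idempotently add a tag association to a globalTags-shaped payload."""
    tags = global_tags.setdefault("tags", [])
    if not any(assoc["tag"] == tag_urn for assoc in tags):
        tags.append({"tag": tag_urn})
    return global_tags


aspect = {"tags": [{"tag": "urn:li:tag:legacy"}]}
add_tag_association(aspect, "urn:li:tag:purchase")
add_tag_association(aspect, "urn:li:tag:purchase")  # second call is a no-op
```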

Python SDK: Add a tag to a dataset at the top-level
# Inlined from /metadata-ingestion/examples/library/dataset_add_tag.py
from datahub.sdk import DataHubClient, DatasetUrn, TagUrn

client = DataHubClient.from_env()

dataset = client.entities.get(DatasetUrn(platform="hive", name="realestate_db.sales"))
dataset.add_tag(TagUrn("purchase"))

client.entities.update(dataset)

Here is an example of adding a term to a dataset. Note that this involves reading the currently set terms on the dataset and then adding a new one if needed.

Python SDK: Add a term to a dataset at the top-level
# Inlined from /metadata-ingestion/examples/library/dataset_add_term.py
from typing import List, Optional, Union

from datahub.sdk import DataHubClient, DatasetUrn, GlossaryTermUrn


def add_terms_to_dataset(
    client: DataHubClient,
    dataset_urn: DatasetUrn,
    term_urns: List[Union[GlossaryTermUrn, str]],
) -> None:
    """
    Add glossary terms to a dataset.

    Args:
        client: DataHub client to use
        dataset_urn: URN of the dataset to update
        term_urns: List of term URNs or term names to add
    """
    dataset = client.entities.get(dataset_urn)

    for term in term_urns:
        if isinstance(term, str):
            resolved_term_urn = client.resolve.term(name=term)
            dataset.add_term(resolved_term_urn)
        else:
            dataset.add_term(term)

    client.entities.update(dataset)


def main(client: Optional[DataHubClient] = None) -> None:
    """
    Main function to add terms to dataset example.

    Args:
        client: Optional DataHub client (for testing). If not provided, creates one from env.
    """
    client = client or DataHubClient.from_env()

    dataset_urn = DatasetUrn(platform="hive", name="realestate_db.sales", env="PROD")

    # Add terms using both URN and name resolution
    add_terms_to_dataset(
        client=client,
        dataset_urn=dataset_urn,
        term_urns=[
            GlossaryTermUrn("Classification.HighlyConfidential"),
            "PII",  # Will be resolved by name
        ],
    )


if __name__ == "__main__":
    main()

Adding Tags or Glossary Terms to columns / fields of a dataset

Tags and Terms can also be attached to an individual column (field) of a dataset. These attachments are done via the schemaMetadata aspect by ingestion connectors / transformers and via the editableSchemaMetadata aspect by the UI. This separation allows the writes from the replication of metadata from the source system to be isolated from the edits made in the UI.
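Why two aspects? Conceptually, readers see a merge of both: re-running ingestion rewrites only schemaMetadata, so edits stored in editableSchemaMetadata survive. A toy sketch of that merge with plain dicts (illustrative only, not the actual aspect classes or merge logic):

```python
def effective_field_tags(ingested: dict, edited: dict) -> dict:
    """Union per-field tag lists from the ingestion-owned and UI-owned aspects.

    Simplified stand-in: re-ingestion rewrites only `ingested`, so UI
    edits in `edited` are never clobbered.
    """
    merged = {}
    for path in ingested.keys() | edited.keys():
        merged[path] = sorted(set(ingested.get(path, ())) | set(edited.get(path, ())))
    return merged


ingested = {"address.zipcode": ["urn:li:tag:pii"]}
edited = {
    "address.zipcode": ["urn:li:tag:deprecated"],
    "address.street": ["urn:li:tag:pii"],
}
merged = effective_field_tags(ingested, edited)
```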

Here is an example of how you can add a tag to a field in a dataset using the Python SDK.

Python SDK: Add a tag to a column (field) of a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_column_tag.py
from datahub.sdk import DataHubClient, DatasetUrn, TagUrn

client = DataHubClient.from_env()

dataset = client.entities.get(
    DatasetUrn(platform="hive", name="fct_users_created", env="PROD")
)

dataset["user_name"].add_tag(TagUrn("deprecated"))

client.entities.update(dataset)

Similarly, here is an example of how you would add a term to a field in a dataset using the Python SDK.

Python SDK: Add a term to a column (field) of a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_column_term.py
from datahub.sdk import DataHubClient, DatasetUrn, GlossaryTermUrn

client = DataHubClient.from_env()

dataset = client.entities.get(
    DatasetUrn(platform="hive", name="realestate_db.sales", env="PROD")
)

dataset["address.zipcode"].add_term(GlossaryTermUrn("Classification.Location"))

client.entities.update(dataset)

Ownership

Ownership is associated with a dataset using the ownership aspect. Owners can be of different types, such as DATAOWNER, PRODUCER, DEVELOPER, and CONSUMER. See OwnershipType.pdl for the full list of ownership types and their meanings. Ownership can be inherited from source systems, or added in DataHub via the UI. Ingestion connectors for sources will automatically set owners when the source system supports it.

Adding Owners

The following script shows how to add an owner to a dataset using the Python SDK.

Python SDK: Add an owner to a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_owner.py
from datahub.sdk import CorpUserUrn, DataHubClient, DatasetUrn

client = DataHubClient.from_env()

dataset = client.entities.get(DatasetUrn(platform="hive", name="realestate_db.sales"))

# Add owner with the TECHNICAL_OWNER type
dataset.add_owner(CorpUserUrn("jdoe"))

client.entities.update(dataset)

Fine-grained lineage

Fine-grained lineage at the field level can be associated with a dataset in two ways: either directly attached to the upstreamLineage aspect of a dataset, or captured as part of the dataJobInputOutput aspect of a dataJob.

Python SDK: Add fine-grained lineage to a dataset
# Inlined from /metadata-ingestion/examples/library/lineage_dataset_add_with_query_node.py
from datahub.metadata.urns import DatasetUrn
from datahub.sdk.main_client import DataHubClient

client = DataHubClient.from_env()

upstream_urn = DatasetUrn(platform="snowflake", name="upstream_table")
downstream_urn = DatasetUrn(platform="snowflake", name="downstream_table")

transformation_text = """
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HighValueFilter").getOrCreate()
df = spark.read.table("customers")
high_value = df.filter("lifetime_value > 10000")
high_value.write.saveAsTable("high_value_customers")
"""

client.lineage.add_lineage(
    upstream=upstream_urn,
    downstream=downstream_urn,
    transformation_text=transformation_text,
    column_lineage={"id": ["id", "customer_id"]},
)

# By passing transformation_text, the query node will be created with the table-level lineage.
# transformation_text can be any transformation logic, e.g. a Spark job, an Airflow DAG, a Python script, etc.
# If you have a SQL query, we recommend using add_dataset_lineage_from_sql instead.
# Note that transformation_text itself will not create column-level lineage.
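Under the hood, the column_lineage mapping expands into schemaField URNs, one per referenced column. The expansion can be sketched with hand-built strings (illustrative only; in real code, builder.make_schema_field_urn produces these, and the URN format matches the relationship responses shown later in this page):

```python
def schema_field_urn(dataset_urn: str, field_path: str) -> str:
    """Build a schemaField URN from a dataset URN and a field path (sketch)."""
    return f"urn:li:schemaField:({dataset_urn},{field_path})"


upstream = "urn:li:dataset:(urn:li:dataPlatform:snowflake,upstream_table,PROD)"
column_lineage = {"id": ["id", "customer_id"]}  # downstream column -> upstream columns

upstream_fields = [
    schema_field_urn(upstream, col)
    for cols in column_lineage.values()
    for col in cols
]
```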

Python SDK: Add fine-grained lineage to a datajob
# Inlined from /metadata-ingestion/examples/library/lineage_emitter_datajob_finegrained.py
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    FineGrainedLineage,
    FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType,
)
from datahub.metadata.schema_classes import DataJobInputOutputClass


def datasetUrn(tbl):
    return builder.make_dataset_urn("postgres", tbl)


def fldUrn(tbl, fld):
    return builder.make_schema_field_urn(datasetUrn(tbl), fld)


# Lineage of fields output by a job
# bar.c1 <-- unknownFunc(bar2.c1, bar4.c1)
# bar.c2 <-- myfunc(bar3.c2)
# {bar.c3,bar.c4} <-- unknownFunc(bar2.c2, bar2.c3, bar3.c1)
# bar.c5 <-- unknownFunc(bar3)
# {bar.c6,bar.c7} <-- unknownFunc(bar4)
# bar2.c9 has no upstream i.e. its values are somehow created independently within this job.

# Note that the semantics of the "transformOperation" value are contextual.
# In the above example, it is regarded as some kind of UDF, but it could also be an expression, etc.

fineGrainedLineages = [
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c1"), fldUrn("bar4", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c1")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar3", "c2")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c2")],
        confidenceScore=0.8,
        transformOperation="myfunc",
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c2"), fldUrn("bar2", "c3"), fldUrn("bar3", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c3"), fldUrn("bar", "c4")],
        confidenceScore=0.7,
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar3")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c5")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar4")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c6"), fldUrn("bar", "c7")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.NONE,
        upstreams=[],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar2", "c9")],
    ),
]

# The lineage of output col bar.c9 is unknown. So there is no lineage for it above.
# Note that bar2 is an input as well as an output dataset, but some fields are inputs while other fields are outputs.

dataJobInputOutput = DataJobInputOutputClass(
    inputDatasets=[datasetUrn("bar2"), datasetUrn("bar3"), datasetUrn("bar4")],
    outputDatasets=[datasetUrn("bar"), datasetUrn("bar2")],
    inputDatajobs=None,
    inputDatasetFields=[
        fldUrn("bar2", "c1"),
        fldUrn("bar2", "c2"),
        fldUrn("bar2", "c3"),
        fldUrn("bar3", "c1"),
        fldUrn("bar3", "c2"),
        fldUrn("bar4", "c1"),
    ],
    outputDatasetFields=[
        fldUrn("bar", "c1"),
        fldUrn("bar", "c2"),
        fldUrn("bar", "c3"),
        fldUrn("bar", "c4"),
        fldUrn("bar", "c5"),
        fldUrn("bar", "c6"),
        fldUrn("bar", "c7"),
        fldUrn("bar", "c9"),
        fldUrn("bar2", "c9"),
    ],
    fineGrainedLineages=fineGrainedLineages,
)

dataJobLineageMcp = MetadataChangeProposalWrapper(
    entityUrn=builder.make_data_job_urn("spark", "Flow1", "Task1"),
    aspect=dataJobInputOutput,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(dataJobLineageMcp)

Querying lineage information

The standard GET APIs to retrieve entities can be used to fetch the dataset/datajob created by the above example. The response will include the fine-grained lineage information as well.

Fetch entity snapshot, including fine-grained lineages
curl 'http://localhost:8080/entities/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Apostgres,bar,PROD)'
curl 'http://localhost:8080/entities/urn%3Ali%3AdataJob%3A(urn%3Ali%3AdataFlow%3A(spark,Flow1,prod),Task1)'
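The URL-encoding used in these curl commands can be reproduced with the Python standard library. Colons in the URN must be percent-encoded, while parentheses and commas may stay literal, as in the examples above:

```python
from urllib.parse import quote

urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD)"
# Encode colons; leave parentheses and commas literal.
encoded = quote(urn, safe="(),")
url = f"http://localhost:8080/entities/{encoded}"
```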

The queries below can be used to find the upstream/downstream datasets/fields of a dataset/datajob.

Find upstream datasets and fields of a dataset
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Apostgres,bar,PROD)&types=DownstreamOf'

{
  "start": 0,
  "count": 9,
  "relationships": [
    {
      "type": "DownstreamOf",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c1)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c3)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c2)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c2)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD),c1)"
    },
    {
      "type": "DownstreamOf",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c1)"
    }
  ],
  "total": 9
}
Find the datasets and fields consumed by a datajob, i.e. inputs to a datajob
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AdataJob%3A(urn%3Ali%3AdataFlow%3A(spark,Flow1,prod),Task1)&types=Consumes'

{
  "start": 0,
  "count": 9,
  "relationships": [
    {
      "type": "Consumes",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD),c1)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c2)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c1)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c3)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c2)"
    },
    {
      "type": "Consumes",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c1)"
    }
  ],
  "total": 9
}
Find the datasets and fields produced by a datajob, i.e. outputs of a datajob
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AdataJob%3A(urn%3Ali%3AdataFlow%3A(spark,Flow1,prod),Task1)&types=Produces'

{
  "start": 0,
  "count": 11,
  "relationships": [
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c9)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c9)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c7)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c6)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c5)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c4)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c3)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c2)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c1)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD)"
    },
    {
      "type": "Produces",
      "entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD)"
    }
  ],
  "total": 11
}

Documentation, Links

Documentation for datasets is available via the datasetProperties aspect (typically filled out by ingestion connectors when the information is already present in the source system) and via the editableDatasetProperties aspect (typically filled out via the UI).

Links that contain more knowledge about the dataset (e.g. links to Confluence pages) can be added via the institutionalMemory aspect.

Here is a simple script that shows how to add documentation and links to a dataset using the Python SDK.

Python SDK: Add documentation, links to a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_documentation.py
from datahub.sdk import DataHubClient, DatasetUrn

client = DataHubClient.from_env()

dataset = client.entities.get(DatasetUrn(platform="hive", name="realestate_db.sales"))

# Add dataset documentation
documentation = """## The Real Estate Sales Dataset
This is a really important Dataset that contains all the relevant information about sales that have happened organized by address.
"""
dataset.set_description(documentation)

# Add link to institutional memory
dataset.add_link(
    (
        "https://wikipedia.com/real_estate",
        "This is the definition of what real estate means",  # link description
    )
)

client.entities.update(dataset)

Notable Exceptions

The following overloaded uses of the Dataset entity exist for convenience, but will likely move to fully modeled entity types in the future.

  • OpenAPI endpoints: the GET endpoints of OpenAPI services are currently modeled as Datasets, but should really be modeled as a Service/API entity once one is created in the metadata model.
  • DataHub's Logical Entities (e.g. Dataset, Chart, Dashboard) are represented as Datasets, with sub-type Entity. These should really be modeled as Entities in a logical ER model once this is created in the metadata model.

Technical Reference Guide

The sections above provide an overview of how to use this entity. The following sections provide detailed technical information about how metadata is stored and represented in DataHub.

Aspects are the individual pieces of metadata that can be attached to an entity. Each aspect contains specific information (like ownership, tags, or properties) and is stored as a separate record, allowing for flexible and incremental metadata updates.

Relationships show how this entity connects to other entities in the metadata graph. These connections are derived from the fields within each aspect and form the foundation of DataHub's knowledge graph.
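A toy sketch of the storage model described above, with plain dicts (illustrative only; the actual storage layer is far richer): each aspect is stored as a separate record keyed by entity URN and aspect name, so writing one aspect never disturbs the others.

```python
# Toy model of aspect-oriented storage: each aspect is a separate record
# keyed by (entity urn, aspect name), so updates are incremental.
store: dict = {}


def upsert_aspect(store: dict, urn: str, aspect_name: str, payload: dict) -> None:
    """Create or replace a single aspect without touching sibling aspects."""
    store.setdefault(urn, {})[aspect_name] = payload


urn = "urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD)"
upsert_aspect(store, urn, "ownership", {"owners": [{"owner": "urn:li:corpuser:jdoe"}]})
upsert_aspect(store, urn, "globalTags", {"tags": [{"tag": "urn:li:tag:purchase"}]})
# Re-writing one aspect leaves the others untouched.
upsert_aspect(store, urn, "globalTags", {"tags": []})
```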

Reading the Field Tables

Each aspect's field table includes an Annotations column that provides additional metadata about how fields are used:

  • ⚠️ Deprecated: This field is deprecated and may be removed in a future version. Check the description for the recommended alternative
  • Searchable: This field is indexed and can be searched in DataHub's search interface
  • Searchable (fieldname): When the field name in parentheses is shown, it indicates the field is indexed under a different name in the search index. For example, dashboardTool is indexed as tool
  • → RelationshipName: This field creates a relationship to another entity. The arrow indicates this field contains a reference (URN) to another entity, and the name indicates the type of relationship (e.g., → Contains, → OwnedBy)

Fields with complex types (like Edge, AuditStamp) link to their definitions in the Common Types section below.

Aspects

datasetKey

Key for a Dataset

| Field | Type | Description | Annotations |
|---|---|---|---|
| platform | string | Data platform urn associated with the dataset | Searchable |
| name | string | Unique guid for dataset | Searchable (id) |
| origin | FabricType | Fabric type where dataset belongs to or where it was generated. | Searchable |

datasetProperties

Properties associated with a Dataset

| Field | Type | Description | Annotations |
|---|---|---|---|
| customProperties | map | Custom property bag. | Searchable |
| externalUrl | string | URL where the reference exists | Searchable |
| name | string | Display name of the Dataset | Searchable |
| qualifiedName | string | Fully-qualified name of the Dataset | Searchable |
| description | string | Documentation of the dataset | Searchable |
| uri | string | The abstracted URI such as hdfs:///data/tracking/PageViewEvent, file:///dir/file_name. Uri should... | ⚠️ Deprecated |
| created | TimeStamp | A timestamp documenting when the asset was created in the source Data Platform (not on DataHub) | Searchable |
| lastModified | TimeStamp | A timestamp documenting when the asset was last modified in the source Data Platform (not on Data... | Searchable |
| tags | string[] | [Legacy] Unstructured tags for the dataset. Structured tags can be applied via the GlobalTags a... | ⚠️ Deprecated |

editableDatasetProperties

EditableDatasetProperties stores editable changes made to dataset properties. This separates changes made from ingestion pipelines and edits in the UI to avoid accidental overwrites of user-provided data by ingestion pipelines

| Field | Type | Description | Annotations |
|---|---|---|---|
| created | AuditStamp | An AuditStamp corresponding to the creation of this resource/association/sub-resource. A value of... | |
| lastModified | AuditStamp | An AuditStamp corresponding to the last modification of this resource/association/sub-resource. I... | |
| deleted | AuditStamp | An AuditStamp corresponding to the deletion of this resource/association/sub-resource. Logically,... | |
| description | string | Documentation of the dataset | Searchable (editedDescription) |
| name | string | Editable display name of the Dataset | Searchable (editedName) |

datasetUpstreamLineage

Fine Grained upstream lineage for fields in a dataset

| Field | Type | Description | Annotations |
|---|---|---|---|
| fieldMappings | DatasetFieldMapping[] | Upstream to downstream field level lineage mappings | |

upstreamLineage

Upstream lineage of a dataset

| Field | Type | Description | Annotations |
|---|---|---|---|
| upstreams | Upstream[] | List of upstream dataset lineage information | |
| fineGrainedLineages | FineGrainedLineage[] | List of fine-grained lineage information, including field-level lineage | → DownstreamOf |

institutionalMemory

Institutional memory of an entity. This is a way to link to relevant documentation and provide description of the documentation. Institutional or tribal knowledge is very important for users to leverage the entity.

| Field | Type | Description | Annotations |
|---|---|---|---|
| elements | InstitutionalMemoryMetadata[] | List of records that represent institutional memory of an entity. Each record consists of a link,... | |

ownership

Ownership information of an entity.

| Field | Type | Description | Annotations |
|---|---|---|---|
| owners | Owner[] | List of owners of the entity. | |
| ownerTypes | map | Ownership type to Owners map, populated via mutation hook. | Searchable |
| lastModified | AuditStamp | Audit stamp containing who last modified the record and when. A value of 0 in the time field indi... | |

status

The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc. This aspect is used to represent soft deletes conventionally.

| Field | Type | Description | Annotations |
|---|---|---|---|
| removed | boolean | Whether the entity has been removed (soft-deleted). | Searchable |

schemaMetadata

SchemaMetadata to describe metadata related to store schema

| Field | Type | Description | Annotations |
|---|---|---|---|
| schemaName | string | Schema name e.g. PageViewEvent, identity.Profile, ams.account_management_tracking | |
| platform | string | Standardized platform urn where schema is defined. The data platform Urn (urn:li:platform:{platfo... | |
| version | long | Every change to SchemaMetadata in the resource results in a new version. Version is server assign... | |
| created | AuditStamp | An AuditStamp corresponding to the creation of this resource/association/sub-resource. A value of... | |
| lastModified | AuditStamp | An AuditStamp corresponding to the last modification of this resource/association/sub-resource. I... | |
| deleted | AuditStamp | An AuditStamp corresponding to the deletion of this resource/association/sub-resource. Logically,... | |
| dataset | string | Dataset this schema metadata is associated with. | |
| cluster | string | The cluster this schema metadata resides from | |
| hash | string | the SHA1 hash of the schema content | |
| platformSchema | union | The native schema in the dataset's platform. | |
| fields | SchemaField[] | Client provided a list of fields from document schema. | |
| primaryKeys | string[] | Client provided list of fields that define primary keys to access record. Field order defines hie... | |
| foreignKeysSpecs | map | Map captures all the references schema makes to external datasets. Map key is ForeignKeySpecName ... | ⚠️ Deprecated |
| foreignKeys | ForeignKeyConstraint[] | List of foreign key constraints for the schema | |

editableSchemaMetadata

EditableSchemaMetadata stores editable changes made to schema metadata. This separates changes made from ingestion pipelines and edits in the UI to avoid accidental overwrites of user-provided data by ingestion pipelines.

| Field | Type | Description | Annotations |
|---|---|---|---|
| created | AuditStamp | An AuditStamp corresponding to the creation of this resource/association/sub-resource. A value of... | |
| lastModified | AuditStamp | An AuditStamp corresponding to the last modification of this resource/association/sub-resource. I... | |
| deleted | AuditStamp | An AuditStamp corresponding to the deletion of this resource/association/sub-resource. Logically,... | |
| editableSchemaFieldInfo | EditableSchemaFieldInfo[] | Client provided a list of fields from document schema. | |

globalTags

Tag aspect used for applying tags to an entity

| Field | Type | Description | Annotations |
|---|---|---|---|
| tags | TagAssociation[] | Tags associated with a given entity | Searchable, → TaggedWith |

glossaryTerms

Related business terms information

| Field | Type | Description | Annotations |
|---|---|---|---|
| terms | GlossaryTermAssociation[] | The related business terms | |
| auditStamp | AuditStamp | Audit stamp containing who reported the related business term | |

browsePaths

Shared aspect containing Browse Paths to be indexed for an entity.

| Field | Type | Description | Annotations |
|---|---|---|---|
| paths | string[] | A list of valid browse paths for the entity. Browse paths are expected to be forward slash-separ... | Searchable |

dataPlatformInstance

The specific instance of the data platform that this entity belongs to

| Field | Type | Description | Annotations |
|---|---|---|---|
| platform | string | Data Platform | Searchable |
| instance | string | Instance of the data platform (e.g. db instance) | Searchable (platformInstance) |

viewProperties

Details about a View. e.g. Gets activated when subTypes is view

| Field | Type | Description | Annotations |
|---|---|---|---|
| materialized | boolean | Whether the view is materialized | Searchable |
| viewLogic | string | The view logic | |
| formattedViewLogic | string | The formatted view logic. This is particularly used for SQL sources, where the SQL logic is forma... | |
| viewLanguage | string | The view logic language / dialect | |

browsePathsV2

Shared aspect containing a Browse Path to be indexed for an entity.

| Field | Type | Description | Annotations |
|---|---|---|---|
| path | BrowsePathEntry[] | A valid browse path for the entity. This field is provided by DataHub by default. This aspect is ... | Searchable |

subTypes

Sub Types. Use this aspect to specialize a generic Entity e.g. Making a Dataset also be a View or also be a LookerExplore

| Field | Type | Description | Annotations |
|---|---|---|---|
| typeNames | string[] | The names of the specific types. | Searchable |

domains

Links from an Asset to its Domains

| Field | Type | Description | Annotations |
|---|---|---|---|
| domains | string[] | The Domains attached to an Asset | Searchable, → AssociatedWith |

applications

Links from an Asset to its Applications

| Field | Type | Description | Annotations |
|---|---|---|---|
| applications | string[] | The Applications attached to an Asset | Searchable, → AssociatedWith |

container

Link from an asset to its parent container

| Field | Type | Description | Annotations |
|---|---|---|---|
| container | string | The parent container of an asset | Searchable, → IsPartOf |

deprecation

Deprecation status of an entity

| Field | Type | Description | Annotations |
|---|---|---|---|
| deprecated | boolean | Whether the entity is deprecated. | Searchable |
| decommissionTime | long | The time user plan to decommission this entity. | |
| note | string | Additional information about the entity deprecation plan, such as the wiki, doc, RB. | |
| actor | string | The user URN which will be credited for modifying this deprecation content. | |
| replacement | string | | |

testResults

Information about a Test Result

| Field | Type | Description | Annotations |
|---|---|---|---|
| failing | TestResult[] | Results that are failing | Searchable, → IsFailing |
| passing | TestResult[] | Results that are passing | Searchable, → IsPassing |

siblings

Siblings information of an entity.

| Field | Type | Description | Annotations |
|---|---|---|---|
| siblings | string[] | List of sibling entities | Searchable, → SiblingOf |
| primary | boolean | If this is the leader entity of the set of siblings | |

embed

Information regarding rendering an embed for an asset.

| Field | Type | Description | Annotations |
|---|---|---|---|
| renderUrl | string | An embed URL to be rendered inside of an iframe. | |

incidentsSummary

Summary related incidents on an entity.

| Field | Type | Description | Annotations |
|---|---|---|---|
| resolvedIncidents | string[] | Resolved incidents for an asset. Deprecated! Use the richer resolvedIncidentsDetails instead. | ⚠️ Deprecated |
| activeIncidents | string[] | Active incidents for an asset. Deprecated! Use the richer activeIncidentsDetails instead. | ⚠️ Deprecated |
| resolvedIncidentDetails | IncidentSummaryDetails[] | Summary details about the set of resolved incidents | Searchable, → ResolvedIncidents |
| activeIncidentDetails | IncidentSummaryDetails[] | Summary details about the set of active incidents | Searchable, → ActiveIncidents |

access

Aspect used for associating roles to a dataset or any asset

| Field | Type | Description | Annotations |
|---|---|---|---|
| roles | RoleAssociation[] | List of Roles which needs to be associated | |

structuredProperties

Properties about an entity governed by StructuredPropertyDefinition

| Field | Type | Description | Annotations |
|---|---|---|---|
| properties | StructuredPropertyValueAssignment[] | Custom property bag. | |

forms

Forms that are assigned to this entity to be filled out

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| incompleteForms | FormAssociation[] | All incomplete forms assigned to the entity. | Searchable |
| completedForms | FormAssociation[] | All complete forms assigned to the entity. | Searchable |
| verifications | FormVerificationAssociation[] | Verifications that have been applied to the entity via completed forms. | Searchable |

partitionsSummary

Defines how the data is partitioned for Data Lake tables (e.g. Hive, S3, Iceberg, Delta, Hudi, etc).

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| minPartition | PartitionSummary | The minimum partition, as ordered | |
| maxPartition | PartitionSummary | The maximum partition, as ordered | |

versionProperties

Properties about a versioned asset i.e. dataset, ML Model, etc.

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| versionSet | string | The linked Version Set entity that ties multiple versioned assets together | Searchable, → VersionOf |
| version | VersionTag | Label for this versioned asset; unique within a version set | Searchable |
| aliases | VersionTag[] | Associated aliases for this versioned asset | Searchable |
| comment | string | Comment documenting what this version was created for, changes, or represents | |
| sortId | string | Sort identifier that determines where a version lives in the order of the Version Set. What this ... | Searchable (versionSortId) |
| versioningScheme | VersioningScheme | What versioning scheme sortId belongs to. Defaults to a plain string that is lexicographically ... | |
| sourceCreatedTimestamp | AuditStamp | Timestamp reflecting when this asset version was created in the source system. | |
| metadataCreatedTimestamp | AuditStamp | Timestamp reflecting when the metadata for this version was created in DataHub. | |
| isLatest | boolean | Marks whether this version is currently the latest. Set by a side effect and should not be modifi... | Searchable |
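
Because the default versioning scheme compares sortId values lexicographically, zero-padding numeric counters keeps lexicographic and numeric order aligned. A minimal sketch (the helper and padding width are assumptions, not part of DataHub):

```python
def make_sort_id(n, width=8):
    """Zero-pad a numeric version counter so lexicographic order matches
    numeric order (hypothetical helper; width is an assumption)."""
    return str(n).zfill(width)

ids = [make_sort_id(n) for n in (2, 10, 1)]
# Padded ids sort in true numeric order...
assert sorted(ids) == [make_sort_id(1), make_sort_id(2), make_sort_id(10)]
# ...whereas unpadded strings would put "10" before "2".
assert sorted(["2", "10", "1"]) == ["1", "10", "2"]
```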

icebergCatalogInfo

Iceberg Catalog metadata associated with an Iceberg table/view

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| metadataPointer | string | When DataHub is the REST Catalog for an Iceberg Table, stores the current metadata pointer. If th... | |
| view | boolean | | |

logicalParent

Relates a physical asset to a logical model.

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| parent | Edge | | Searchable, → PhysicalInstanceOf |

datasetProfile (Timeseries)

Stats corresponding to datasets

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| timestampMillis | long | The event timestamp field as epoch at UTC in milliseconds. | |
| eventGranularity | TimeWindowSize | Granularity of the event, if applicable | |
| partitionSpec | PartitionSpec | The optional partition specification. | |
| messageId | string | The optional messageId; if provided, serves as a custom user-defined unique identifier for an aspe... | |
| rowCount | long | The total number of rows | Searchable |
| columnCount | long | The total number of columns (or schema fields) | Searchable |
| fieldProfiles | DatasetFieldProfile[] | Profiles for each column (or schema field) | |
| sizeInBytes | long | Storage size in bytes | Searchable |
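
A profiler ultimately reduces a table to counts such as rowCount and columnCount. A toy sketch over in-memory rows (`profile_rows` is hypothetical; real profilers also emit per-field fieldProfiles and a storage-level sizeInBytes):

```python
def profile_rows(rows):
    """Compute coarse datasetProfile-style stats from a list of row dicts.
    Illustrative only: counts rows and columns, nothing more."""
    row_count = len(rows)
    column_count = len(rows[0]) if rows else 0
    return {"rowCount": row_count, "columnCount": column_count}

stats = profile_rows([
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": None},
])
```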

datasetUsageStatistics (Timeseries)

Stats corresponding to dataset's usage.

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| timestampMillis | long | The event timestamp field as epoch at UTC in milliseconds. | |
| eventGranularity | TimeWindowSize | Granularity of the event, if applicable | |
| partitionSpec | PartitionSpec | The optional partition specification. | |
| messageId | string | The optional messageId; if provided, serves as a custom user-defined unique identifier for an aspe... | |
| uniqueUserCount | int | Unique user count | Searchable |
| totalSqlQueries | int | Total SQL query count | Searchable |
| topSqlQueries | string[] | Frequent SQL queries; mostly makes sense for datasets in SQL databases | |
| userCounts | DatasetUserUsageCounts[] | Users within this bucket, with frequency counts | |
| fieldCounts | DatasetFieldUsageCounts[] | Field-level usage stats | |
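
The usage fields are aggregations over a query log for one time bucket. A sketch of how such a bucket might be computed (`aggregate_usage` and the dict shapes are illustrative assumptions, not the DataHub ingestion code):

```python
from collections import Counter

def aggregate_usage(query_log):
    """Aggregate (user, query) pairs into datasetUsageStatistics-style counts.
    Illustrative only; real ingestion sources parse warehouse audit logs."""
    user_counts = Counter(user for user, _ in query_log)
    query_counts = Counter(q for _, q in query_log)
    return {
        "uniqueUserCount": len(user_counts),
        "totalSqlQueries": len(query_log),
        "topSqlQueries": [q for q, _ in query_counts.most_common(3)],
        "userCounts": [{"user": u, "count": c} for u, c in user_counts.most_common()],
    }

usage = aggregate_usage([
    ("urn:li:corpuser:jdoe", "SELECT * FROM customer_table"),
    ("urn:li:corpuser:jdoe", "SELECT id FROM customer_table"),
    ("urn:li:corpuser:asmith", "SELECT * FROM customer_table"),
])
```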

operation (Timeseries)

Operational info for an entity.

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| timestampMillis | long | The event timestamp field as epoch at UTC in milliseconds. | |
| eventGranularity | TimeWindowSize | Granularity of the event, if applicable | |
| partitionSpec | PartitionSpec | The optional partition specification. | |
| messageId | string | The optional messageId; if provided, serves as a custom user-defined unique identifier for an aspe... | |
| actor | string | Actor who issued this operation. | |
| operationType | OperationType | Operation type of change. | |
| customOperationType | string | A custom type of operation. Required if operationType is CUSTOM. | |
| numAffectedRows | long | How many rows were affected by this operation. | |
| affectedDatasets | string[] | Which other datasets were affected by this operation. | |
| sourceType | OperationSourceType | Source type | |
| customProperties | map | Custom properties | |
| lastUpdatedTimestamp | long | The time at which the operation occurred. Would be better named 'operationTime'. | Searchable (lastOperationTime) |
| queries | string[] | Which queries were used in this operation. | |
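
Note the constraint in the table: customOperationType is required when operationType is CUSTOM. A sketch that enforces it while building an operation-style payload (hypothetical helper, illustrative dict):

```python
def make_operation(timestamp_ms, operation_type, actor=None, custom_operation_type=None):
    """Assemble an operation-style payload as a plain dict (illustrative only).
    Enforces the rule from the table: customOperationType must accompany CUSTOM."""
    if operation_type == "CUSTOM" and custom_operation_type is None:
        raise ValueError("customOperationType is required when operationType is CUSTOM")
    op = {"timestampMillis": timestamp_ms, "operationType": operation_type}
    if actor is not None:
        op["actor"] = actor
    if custom_operation_type is not None:
        op["customOperationType"] = custom_operation_type
    return op

op = make_operation(1700000000000, "INSERT", actor="urn:li:corpuser:etl-bot")
```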

datasetDeprecation (Deprecated)

Dataset deprecation status. Deprecated! This aspect is deprecated in favor of the more general-purpose 'deprecation' aspect.

| Field | Type | Description | Annotations |
| ----- | ---- | ----------- | ----------- |
| deprecated | boolean | Whether the dataset is deprecated by its owner. | Searchable |
| decommissionTime | long | The time at which the user plans to decommission this dataset. | |
| note | string | Additional information about the dataset deprecation plan, such as the wiki, doc, RB. | |
| actor | string | The corpuser URN which will be credited for modifying this deprecation content. | |

Common Types

These types are used across multiple aspects in this entity.

AuditStamp

Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage.

Fields:

  • time (long): When did the resource/association/sub-resource move into the specific lifecyc...
  • actor (string): The entity (e.g. a member URN) which will be credited for moving the resource...
  • impersonator (string?): The entity (e.g. a service URN) which performs the change on behalf of the Ac...
  • message (string?): Additional context around how DataHub was informed of the particular change. ...
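
An AuditStamp is essentially an epoch-millisecond time paired with an actor URN. A minimal sketch (`make_audit_stamp` is a hypothetical helper, not an SDK function):

```python
import time

def make_audit_stamp(actor_urn, when_ms=None):
    """Build an AuditStamp-style dict: who acted and when (epoch millis).
    Illustrative only; defaults to the current time when no timestamp is given."""
    return {
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "actor": actor_urn,
    }

stamp = make_audit_stamp("urn:li:corpuser:jdoe", when_ms=1700000000000)
```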

Edge

A common structure to represent all edges to entities when used inside aspects as collections. This ensures that all edges have a common structure around audit stamps and will automatically support PATCH and time travel.

Fields:

  • sourceUrn (string?): Urn of the source of this relationship edge. If not specified, assumed to be ...
  • destinationUrn (string): Urn of the destination of this relationship edge.
  • created (AuditStamp?): Audit stamp containing who created this relationship edge and when
  • lastModified (AuditStamp?): Audit stamp containing who last modified this relationship edge and when
  • properties (map?): A generic properties bag that allows us to store specific information on this...

FormAssociation

Properties of an applied form.

Fields:

  • urn (string): Urn of the applied form
  • incompletePrompts (FormPromptAssociation[]): A list of prompts that are not yet complete for this form.
  • completedPrompts (FormPromptAssociation[]): A list of prompts that have been completed for this form.

IncidentSummaryDetails

Summary statistics about incidents on an entity.

Fields:

  • urn (string): The urn of the incident
  • type (string): The type of an incident
  • createdAt (long): The time at which the incident was raised in milliseconds since epoch.
  • resolvedAt (long?): The time at which the incident was marked as resolved in milliseconds since e...
  • priority (int?): The priority of the incident

PartitionSpec

A reference to a specific partition in a dataset.

Fields:

  • partition (string): A unique id / value for the partition for which statistics were collected, ge...
  • timePartition (TimeWindow?): Time window of the partition, if we are able to extract it from the partition...
  • type (PartitionType): Unused!

PartitionSummary

Defines how the data is partitioned

Fields:

  • partition (string): A unique id / value for the partition for which statistics were collected, ge...
  • createdTime (long?): The created time for a given partition.
  • lastModifiedTime (long?): The last modified / touched time for a given partition.
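
Given partition ids whose string order matches partition order (true for date-formatted ids like `2024-01-01`, which is an assumption here), the min/max entries of partitionsSummary can be derived directly:

```python
def partitions_summary(partition_ids):
    """Derive partitionsSummary-style min/max entries from partition ids,
    assuming lexicographic order equals partition order. Illustrative only."""
    ordered = sorted(partition_ids)
    return {
        "minPartition": {"partition": ordered[0]},
        "maxPartition": {"partition": ordered[-1]},
    }

ps = partitions_summary(["2024-03-01", "2024-01-01", "2024-02-01"])
```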

TestResult

Information about a Test Result

Fields:

  • test (string): The urn of the test
  • type (TestResultType): The type of the result
  • testDefinitionMd5 (string?): The md5 of the test definition that was used to compute this result. See Test...
  • lastComputed (AuditStamp?): The audit stamp of when the result was computed, including the actor who comp...

TimeStamp

A standard event timestamp

Fields:

  • time (long): When did the event occur
  • actor (string?): Optional: The actor urn involved in the event.

TimeWindowSize

Defines the size of a time window.

Fields:

  • unit (CalendarInterval): Interval unit such as minute/hour/day etc.
  • multiple (int): How many units. Defaults to 1.
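
For fixed-length units, a (unit, multiple) pair converts directly to a duration in milliseconds. A sketch (the unit names mirror typical CalendarInterval values but are assumptions; variable-length units like MONTH are deliberately omitted because their duration depends on the calendar):

```python
# Milliseconds per fixed-length interval unit (illustrative mapping).
UNIT_MS = {
    "SECOND": 1_000,
    "MINUTE": 60_000,
    "HOUR": 3_600_000,
    "DAY": 86_400_000,
}

def window_size_ms(unit, multiple=1):
    """Convert a TimeWindowSize-style (unit, multiple) pair to milliseconds.
    'multiple' defaults to 1, matching the field's documented default."""
    return UNIT_MS[unit] * multiple

hour_bucket = window_size_ms("HOUR")
```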

VersionTag

A resource-defined string representing the resource state for the purpose of concurrency control

Fields:

  • versionTag (string?):
  • metadataAttribution (MetadataAttribution?):

Relationships

Self

These are the relationships to itself, stored in this entity's aspects

  • DownstreamOf (via upstreamLineage.upstreams.dataset)
  • DownstreamOf (via upstreamLineage.fineGrainedLineages)
  • ForeignKeyToDataset (via schemaMetadata.foreignKeys.foreignDataset)
  • SiblingOf (via siblings.siblings)
  • PhysicalInstanceOf (via logicalParent.parent)

Outgoing

These are the relationships stored in this entity's aspects

  • DownstreamOf

    • SchemaField via upstreamLineage.fineGrainedLineages
  • OwnedBy

    • Corpuser via ownership.owners.owner
    • CorpGroup via ownership.owners.owner
  • ownershipType

    • OwnershipType via ownership.owners.typeUrn
  • SchemaFieldTaggedWith

    • Tag via schemaMetadata.fields.globalTags
  • TaggedWith

    • Tag via schemaMetadata.fields.globalTags.tags
    • Tag via editableSchemaMetadata.editableSchemaFieldInfo.globalTags.tags
    • Tag via globalTags.tags
  • SchemaFieldWithGlossaryTerm

    • GlossaryTerm via schemaMetadata.fields.glossaryTerms
  • TermedWith

    • GlossaryTerm via schemaMetadata.fields.glossaryTerms.terms.urn
    • GlossaryTerm via editableSchemaMetadata.editableSchemaFieldInfo.glossaryTerms.terms.urn
    • GlossaryTerm via glossaryTerms.terms.urn
  • ForeignKeyTo

    • SchemaField via schemaMetadata.foreignKeys.foreignFields
  • EditableSchemaFieldTaggedWith

    • Tag via editableSchemaMetadata.editableSchemaFieldInfo.globalTags
  • EditableSchemaFieldWithGlossaryTerm

    • GlossaryTerm via editableSchemaMetadata.editableSchemaFieldInfo.glossaryTerms
  • AssociatedWith

    • Domain via domains.domains
    • Application via applications.applications
    • Role via access.roles.urn
  • IsPartOf

    • Container via container.container
  • IsFailing

    • Test via testResults.failing
  • IsPassing

    • Test via testResults.passing
  • ResolvedIncidents

    • Incident via incidentsSummary.resolvedIncidentDetails
  • ActiveIncidents

    • Incident via incidentsSummary.activeIncidentDetails
  • VersionOf

    • VersionSet via versionProperties.versionSet
  • PhysicalInstanceOf

    • SchemaField via logicalParent.parent

Global Metadata Model

Global Graph