Version: Next

DataJob

Data jobs represent individual units of data processing work within a data pipeline or workflow. They are the tasks, steps, or operations that transform, move, or process data as part of a larger data flow. Examples include Airflow tasks, dbt models, Spark jobs, Databricks notebooks, and similar processing units in orchestration systems.

Identity

Data jobs are identified by two pieces of information:

  • The data flow (pipeline/workflow) that they belong to: this is represented as a URN pointing to the parent dataFlow entity. The data flow defines the orchestrator (e.g., airflow, spark, dbt), the flow ID (e.g., the DAG name or pipeline name), and the cluster where it runs.
  • The unique job identifier within that flow: this is a string that uniquely identifies the task within its parent flow (e.g., task name, step name, model name).

The URN structure for a data job is: urn:li:dataJob:(urn:li:dataFlow:(<orchestrator>,<flow_id>,<cluster>),<job_id>)

Examples

Airflow task:

urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)

dbt model:

urn:li:dataJob:(urn:li:dataFlow:(dbt,analytics_project,prod),staging.stg_customers)

Spark job:

urn:li:dataJob:(urn:li:dataFlow:(spark,data_processing_pipeline,PROD),aggregate_sales_task)

Databricks notebook:

urn:li:dataJob:(urn:li:dataFlow:(databricks,etl_workflow,production),process_events_notebook)
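
The same URNs can be constructed programmatically. Below is a minimal sketch using two helpers that also appear in the full examples later on this page; the flow and job names are placeholders.

from datahub.emitter.mce_builder import make_data_job_urn
from datahub.metadata.urns import DataFlowUrn, DataJobUrn

# Low-level builder: orchestrator + flow id + job id (+ cluster) -> URN string.
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="daily_etl_dag",
    job_id="transform_customer_data",
    cluster="prod",
)
print(job_urn)
# urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_dag,prod),transform_customer_data)

# Typed URN classes: the parent DataFlowUrn plus the job id within that flow.
flow_urn = DataFlowUrn(orchestrator="airflow", flow_id="daily_etl_dag", cluster="prod")
print(DataJobUrn(flow=flow_urn, job_id="transform_customer_data"))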

Important Capabilities

Job Information (dataJobInfo)

The dataJobInfo aspect captures the core properties of a data job:

  • Name: Human-readable name of the job (searchable with autocomplete)
  • Description: Detailed description of what the job does
  • Type: The type of job (e.g., SQL, Python, Spark, etc.)
  • Flow URN: Reference to the parent data flow
  • Created/Modified timestamps: When the job was created or last modified in the source system
  • Environment: The fabric/environment where the job runs (PROD, DEV, QA, etc.)
  • Custom properties: Additional key-value properties specific to the source system
  • External references: Links to external documentation or definitions (e.g., GitHub links)
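
If you are not using the high-level SDK, the same information can be emitted directly as a dataJobInfo aspect with the low-level emitter. The sketch below is illustrative: the job, property, and URL values are placeholders, and the field names follow the dataJobInfo table in the reference section further down.

from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInfoClass

job_info = DataJobInfoClass(
    name="transform_customer_data",
    type="SQL",  # free-form job type string
    description="Transforms raw customer data into analytics-ready format",
    customProperties={"team": "data-eng"},  # placeholder custom properties
    externalUrl="https://github.com/example-org/pipelines",  # placeholder link
)

mcp = MetadataChangeProposalWrapper(
    entityUrn=make_data_job_urn(
        orchestrator="airflow",
        flow_id="daily_etl_pipeline",
        job_id="transform_customer_data",
        cluster="prod",
    ),
    aspect=job_info,
)
DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)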

Input/Output Lineage (dataJobInputOutput)

The dataJobInputOutput aspect defines the data lineage relationships for the job:

  • Input datasets: Datasets consumed by the job during processing (via inputDatasetEdges)
  • Output datasets: Datasets produced by the job (via outputDatasetEdges)
  • Input data jobs: Other data jobs that this job depends on (via inputDatajobEdges)
  • Input dataset fields: Specific schema fields consumed from input datasets
  • Output dataset fields: Specific schema fields produced in output datasets
  • Fine-grained lineage: Column-level lineage mappings showing which upstream fields contribute to downstream fields

This aspect establishes the critical relationships that enable DataHub to build and visualize data lineage graphs across your entire data ecosystem.

Editable Properties (editableDataJobProperties)

The editableDataJobProperties aspect stores documentation edits made through the DataHub UI:

  • Description: User-edited documentation that complements or overrides the ingested description
  • Change audit stamps: Tracks who made edits and when

This separation ensures that manual edits in the UI are preserved and not overwritten by ingestion pipelines.

Ownership

Like other entities, data jobs support ownership through the ownership aspect. Owners can be users or groups with various ownership types (DATAOWNER, PRODUCER, DEVELOPER, etc.). This helps identify who is responsible for maintaining and troubleshooting the job.
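
The SDK example later on this page attaches owners with add_owner. If you need to set an explicit ownership type, a minimal sketch using the low-level ownership aspect looks like the following; the user names and server address are placeholders, and emitting the aspect this way replaces any owners already on the job.

from datahub.emitter.mce_builder import make_data_job_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

# Two owners with explicit ownership types (placeholder users).
ownership = OwnershipClass(
    owners=[
        OwnerClass(owner=make_user_urn("john.doe"), type=OwnershipTypeClass.DATAOWNER),
        OwnerClass(owner=make_user_urn("jane.smith"), type=OwnershipTypeClass.DEVELOPER),
    ]
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=make_data_job_urn(
            orchestrator="airflow",
            flow_id="daily_etl_pipeline",
            job_id="transform_customer_data",
        ),
        aspect=ownership,
    )
)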

Tags and Glossary Terms

Data jobs can be tagged and associated with glossary terms:

  • Tags (globalTags aspect): Used for categorization, classification, or operational purposes (e.g., PII, critical, deprecated)
  • Glossary terms (glossaryTerms aspect): Link jobs to business terminology and concepts from your glossary

Domains and Applications

Data jobs can be organized into:

  • Domains (domains aspect): Business domains or data domains for organizational structure
  • Applications (applications aspect): Associated with specific applications or systems
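
As a minimal sketch, a job can be attached to a domain by emitting the domains aspect directly; the domain URN and server address below are placeholders, and a domain with that URN is assumed to already exist.

from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DomainsClass

datajob_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="daily_etl_pipeline",
    job_id="transform_customer_data",
)

# The domains aspect holds a list of domain URNs (placeholder domain here).
domains_aspect = DomainsClass(domains=["urn:li:domain:customer_analytics"])

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=datajob_urn, aspect=domains_aspect)
)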

Structured Properties and Forms

Data jobs support:

  • Structured properties: Custom typed properties defined by your organization
  • Forms: Structured documentation forms for consistency
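
A minimal sketch of assigning a structured property value is shown below; it assumes a structured property with the given URN has already been defined in your instance, and the property URN, value, and server address are placeholders.

from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    StructuredPropertiesClass,
    StructuredPropertyValueAssignmentClass,
)

# Assign a single value to a pre-defined structured property (placeholder URN).
assignment = StructuredPropertyValueAssignmentClass(
    propertyUrn="urn:li:structuredProperty:io.example.retentionDays",
    values=[90.0],
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=make_data_job_urn(
            orchestrator="airflow",
            flow_id="daily_etl_pipeline",
            job_id="transform_customer_data",
        ),
        aspect=StructuredPropertiesClass(properties=[assignment]),
    )
)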

Code Examples

Creating a Data Job

The simplest way to create a data job is using the Python SDK v2:

Python SDK: Create a basic data job
# Inlined from /metadata-ingestion/examples/library/datajob_create_basic.py
from datahub.metadata.urns import DataFlowUrn, DatasetUrn
from datahub.sdk import DataHubClient, DataJob

client = DataHubClient.from_env()

# Define the job under its parent flow, with upstream (inlets) and downstream (outlets) datasets.
datajob = DataJob(
    name="transform_customer_data",
    flow_urn=DataFlowUrn(
        orchestrator="airflow",
        flow_id="daily_etl_pipeline",
        cluster="prod",
    ),
    description="Transforms raw customer data into analytics-ready format",
    inlets=[
        DatasetUrn(platform="postgres", name="raw.customers", env="PROD"),
        DatasetUrn(platform="postgres", name="raw.addresses", env="PROD"),
    ],
    outlets=[
        DatasetUrn(platform="snowflake", name="analytics.dim_customers", env="PROD"),
    ],
)

client.entities.upsert(datajob)
print(f"Created data job: {datajob.urn}")

Adding Tags, Terms, and Ownership

Common metadata can be added to data jobs to enhance discoverability and governance:

Python SDK: Add tags, terms, and ownership to a data job
# Inlined from /metadata-ingestion/examples/library/datajob_add_tags_terms_ownership.py
from datahub.metadata.urns import (
    CorpUserUrn,
    DataFlowUrn,
    DataJobUrn,
    GlossaryTermUrn,
    TagUrn,
)
from datahub.sdk import DataHubClient

client = DataHubClient.from_env()

datajob_urn = DataJobUrn(
    job_id="transform_customer_data",
    flow=DataFlowUrn(
        orchestrator="airflow", flow_id="daily_etl_pipeline", cluster="prod"
    ),
)

# Fetch the existing job, enrich it, and write the changes back.
datajob = client.entities.get(datajob_urn)

datajob.add_tag(TagUrn("Critical"))
datajob.add_tag(TagUrn("ETL"))

datajob.add_term(GlossaryTermUrn("CustomerData"))
datajob.add_term(GlossaryTermUrn("DataTransformation"))

datajob.add_owner(CorpUserUrn("data_engineering_team"))
datajob.add_owner(CorpUserUrn("john.doe"))

client.entities.update(datajob)

print(f"Added tags, terms, and ownership to {datajob_urn}")

Updating Job Properties

You can update job properties, such as the description, using the Python SDK:

Python SDK: Update data job description
# Inlined from /metadata-ingestion/examples/library/datajob_update_description.py
from datahub.sdk import DataFlowUrn, DataHubClient, DataJobUrn

client = DataHubClient.from_env()

dataflow_urn = DataFlowUrn(
    orchestrator="airflow", flow_id="daily_etl_pipeline", cluster="prod"
)
datajob_urn = DataJobUrn(flow=dataflow_urn, job_id="transform_customer_data")

# Fetch the job, replace its description, and persist the change.
datajob = client.entities.get(datajob_urn)
datajob.set_description(
    "This job performs critical customer data transformation. "
    "It joins raw customer records with address information and applies "
    "data quality rules before loading into the analytics warehouse."
)

client.entities.update(datajob)

print(f"Updated description for {datajob_urn}")

Querying Data Job Information

Retrieve data job information via the REST API:

REST API: Query a data job
# Inlined from /metadata-ingestion/examples/library/datajob_query_rest.py
import json
from urllib.parse import quote

import requests

datajob_urn = "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_pipeline,prod),transform_customer_data)"

gms_server = "http://localhost:8080"
url = f"{gms_server}/entities/{quote(datajob_urn, safe='')}"

response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    print(json.dumps(data, indent=2))

    # Pull out a few interesting aspects if they are present on the entity.
    if "aspects" in data:
        aspects = data["aspects"]

        if "dataJobInfo" in aspects:
            job_info = aspects["dataJobInfo"]["value"]
            print(f"\nJob Name: {job_info.get('name')}")
            print(f"Description: {job_info.get('description')}")
            print(f"Type: {job_info.get('type')}")

        if "dataJobInputOutput" in aspects:
            lineage = aspects["dataJobInputOutput"]["value"]
            print(f"\nInput Datasets: {len(lineage.get('inputDatasetEdges', []))}")
            print(f"Output Datasets: {len(lineage.get('outputDatasetEdges', []))}")

        if "ownership" in aspects:
            ownership = aspects["ownership"]["value"]
            print(f"\nOwners: {len(ownership.get('owners', []))}")
            for owner in ownership.get("owners", []):
                print(f" - {owner.get('owner')} ({owner.get('type')})")

        if "globalTags" in aspects:
            tags = aspects["globalTags"]["value"]
            print("\nTags:")
            for tag in tags.get("tags", []):
                print(f" - {tag.get('tag')}")
else:
    print(f"Failed to retrieve data job: {response.status_code}")
    print(response.text)

Adding Lineage to Data Jobs

Data jobs are often used to define lineage relationships. The examples below show how to add lineage incrementally with the patch builder and how to emit fine-grained, column-level lineage directly:

Python SDK: Add lineage using DataJobPatchBuilder
# Inlined from /metadata-ingestion/examples/library/datajob_add_lineage_patch.py
from datahub.emitter.mce_builder import (
    make_data_job_urn,
    make_dataset_urn,
    make_schema_field_urn,
)
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import (
    FineGrainedLineageClass as FineGrainedLineage,
    FineGrainedLineageDownstreamTypeClass as FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamTypeClass as FineGrainedLineageUpstreamType,
)
from datahub.specific.datajob import DataJobPatchBuilder

# Create DataHub Client
datahub_client = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Create DataJob URN
datajob_urn = make_data_job_urn(
    orchestrator="airflow", flow_id="dag_abc", job_id="task_456"
)

# Create DataJob Patch to Add Lineage
patch_builder = DataJobPatchBuilder(datajob_urn)
patch_builder.add_input_dataset(
    make_dataset_urn(platform="kafka", name="SampleKafkaDataset", env="PROD")
)
patch_builder.add_output_dataset(
    make_dataset_urn(platform="hive", name="SampleHiveDataset", env="PROD")
)
patch_builder.add_input_datajob(
    make_data_job_urn(orchestrator="airflow", flow_id="dag_abc", job_id="task_123")
)
patch_builder.add_input_dataset_field(
    make_schema_field_urn(
        parent_urn=make_dataset_urn(
            platform="hive", name="fct_users_deleted", env="PROD"
        ),
        field_path="user_id",
    )
)
patch_builder.add_output_dataset_field(
    make_schema_field_urn(
        parent_urn=make_dataset_urn(
            platform="hive", name="fct_users_created", env="PROD"
        ),
        field_path="user_id",
    )
)

# Update column-level lineage through the Data Job
lineage1 = FineGrainedLineage(
    upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
    upstreams=[
        make_schema_field_urn(make_dataset_urn("postgres", "raw_data.users"), "user_id")
    ],
    downstreamType=FineGrainedLineageDownstreamType.FIELD,
    downstreams=[
        make_schema_field_urn(
            make_dataset_urn("postgres", "analytics.user_metrics"),
            "user_id",
        )
    ],
    transformOperation="IDENTITY",
    confidenceScore=1.0,
)
patch_builder.add_fine_grained_lineage(lineage1)
patch_builder.remove_fine_grained_lineage(lineage1)
# Replaces all existing fine-grained lineages
patch_builder.set_fine_grained_lineages([lineage1])

patch_mcps = patch_builder.build()

# Emit DataJob Patch
for patch_mcp in patch_mcps:
    datahub_client.emit(patch_mcp)

Python SDK: Define fine-grained lineage through a data job
# Inlined from /metadata-ingestion/examples/library/lineage_emitter_datajob_finegrained.py
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    FineGrainedLineage,
    FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType,
)
from datahub.metadata.schema_classes import DataJobInputOutputClass


def datasetUrn(tbl):
    return builder.make_dataset_urn("postgres", tbl)


def fldUrn(tbl, fld):
    return builder.make_schema_field_urn(datasetUrn(tbl), fld)


# Lineage of fields output by a job
# bar.c1 <-- unknownFunc(bar2.c1, bar4.c1)
# bar.c2 <-- myfunc(bar3.c2)
# {bar.c3,bar.c4} <-- unknownFunc(bar2.c2, bar2.c3, bar3.c1)
# bar.c5 <-- unknownFunc(bar3)
# {bar.c6,bar.c7} <-- unknownFunc(bar4)
# bar2.c9 has no upstream i.e. its values are somehow created independently within this job.

# Note that the semantic of the "transformOperation" value is contextual.
# In above example, it is regarded as some kind of UDF; but it could also be an expression etc.

fineGrainedLineages = [
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c1"), fldUrn("bar4", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c1")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar3", "c2")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c2")],
        confidenceScore=0.8,
        transformOperation="myfunc",
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c2"), fldUrn("bar2", "c3"), fldUrn("bar3", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c3"), fldUrn("bar", "c4")],
        confidenceScore=0.7,
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar3")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c5")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar4")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c6"), fldUrn("bar", "c7")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.NONE,
        upstreams=[],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar2", "c9")],
    ),
]

# The lineage of output col bar.c9 is unknown. So there is no lineage for it above.
# Note that bar2 is an input as well as an output dataset, but some fields are inputs while other fields are outputs.

dataJobInputOutput = DataJobInputOutputClass(
    inputDatasets=[datasetUrn("bar2"), datasetUrn("bar3"), datasetUrn("bar4")],
    outputDatasets=[datasetUrn("bar"), datasetUrn("bar2")],
    inputDatajobs=None,
    inputDatasetFields=[
        fldUrn("bar2", "c1"),
        fldUrn("bar2", "c2"),
        fldUrn("bar2", "c3"),
        fldUrn("bar3", "c1"),
        fldUrn("bar3", "c2"),
        fldUrn("bar4", "c1"),
    ],
    outputDatasetFields=[
        fldUrn("bar", "c1"),
        fldUrn("bar", "c2"),
        fldUrn("bar", "c3"),
        fldUrn("bar", "c4"),
        fldUrn("bar", "c5"),
        fldUrn("bar", "c6"),
        fldUrn("bar", "c7"),
        fldUrn("bar", "c9"),
        fldUrn("bar2", "c9"),
    ],
    fineGrainedLineages=fineGrainedLineages,
)

dataJobLineageMcp = MetadataChangeProposalWrapper(
    entityUrn=builder.make_data_job_urn("spark", "Flow1", "Task1"),
    aspect=dataJobInputOutput,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(dataJobLineageMcp)

Integration Points

Relationship with DataFlow

Every data job belongs to exactly one dataFlow entity, which represents the parent pipeline or workflow. The data flow captures:

  • The orchestrator/platform (Airflow, Spark, dbt, etc.)
  • The flow/pipeline/DAG identifier
  • The cluster or environment where it executes

This hierarchical relationship allows DataHub to organize jobs within their workflows and understand the execution context.
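
A minimal sketch of registering the parent flow itself, by emitting a dataFlowInfo aspect against the dataFlow URN, is shown below; the names and server address are placeholders, and this is only one of several ways to create the flow (the high-level SDK can do it as well).

from datahub.emitter.mce_builder import make_data_flow_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass

# The parent pipeline that data jobs will reference via their URN.
flow_urn = make_data_flow_urn(
    orchestrator="airflow", flow_id="daily_etl_pipeline", cluster="prod"
)

flow_info = DataFlowInfoClass(
    name="daily_etl_pipeline",
    description="Nightly ETL DAG that feeds the analytics warehouse",
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=flow_info)
)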

Relationship with Datasets

Data jobs establish lineage by defining:

  • Consumes relationships with input datasets
  • Produces relationships with output datasets

These relationships are the foundation of DataHub's lineage graph. When a job processes data, it creates a connection between upstream sources and downstream outputs, enabling impact analysis and data discovery.

Relationship with DataProcessInstance

While dataJob represents the definition of a processing task, dataProcessInstance represents a specific execution or run of that job. Process instances capture:

  • Runtime information (start time, end time, duration)
  • Status (success, failure, running)
  • Input/output datasets for that specific run
  • Error messages and logs

This separation allows you to track both the static definition of a job and its dynamic runtime behavior.

GraphQL Resolvers

The DataHub GraphQL API provides rich query capabilities for data jobs:

  • DataJobType: Main type for querying data job information
  • DataJobRunsResolver: Resolves execution history and run information
  • DataFlowDataJobsRelationshipsMapper: Maps relationships between flows and jobs
  • UpdateLineageResolver: Handles lineage updates for jobs
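
A rough sketch of querying a data job through the GraphQL endpoint over plain HTTP is shown below; the server URL and access token are placeholders, and the selected fields are a small subset that may need adjusting for your deployment.

import requests

query = """
query getDataJob($urn: String!) {
  dataJob(urn: $urn) {
    urn
    properties {
      name
      description
    }
  }
}
"""

variables = {
    "urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_etl_pipeline,prod),transform_customer_data)"
}

response = requests.post(
    "http://localhost:8080/api/graphql",  # placeholder GMS address
    json={"query": query, "variables": variables},
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
)
print(response.json())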

Ingestion Sources

Data jobs are commonly ingested from:

  • Airflow: Tasks and DAGs with lineage extraction
  • dbt: Models as data jobs with SQL-based lineage
  • Spark: Job definitions with dataset dependencies
  • Databricks: Notebooks and workflows
  • Dagster: Ops and assets as processing units
  • Prefect: Tasks and flows
  • AWS Glue: ETL jobs
  • Azure Data Factory: Pipeline activities
  • Looker: LookML models and derived tables

These connectors automatically extract job definitions, lineage, and metadata from the source systems.

Notable Exceptions

DataHub Ingestion Jobs

DataHub's own ingestion pipelines are represented as data jobs with special aspects:

  • datahubIngestionRunSummary: Tracks ingestion run statistics, entities processed, warnings, and errors
  • datahubIngestionCheckpoint: Maintains state for incremental ingestion

These aspects are specific to DataHub's internal ingestion framework and are not used for general-purpose data jobs.

Job Status Deprecation

The status field in dataJobInfo is deprecated in favor of the dataProcessInstance model. Instead of storing job status on the job definition itself, create separate process instance entities for each execution with their own status information. This provides a cleaner separation between job definitions and runtime execution history.

Subtype Usage

The subTypes aspect allows you to classify jobs into categories:

  • SQL jobs
  • Python jobs
  • Notebook jobs
  • Container jobs
  • Custom job types

This helps with filtering and organizing jobs in the UI and API queries.
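
A minimal sketch of assigning a subtype by emitting the subTypes aspect is shown below; the job identifiers and server address are placeholders.

from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import SubTypesClass

DatahubRestEmitter("http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=make_data_job_urn(
            orchestrator="airflow",
            flow_id="daily_etl_pipeline",
            job_id="transform_customer_data",
        ),
        # Mark this job as a SQL job; any list of type names is allowed.
        aspect=SubTypesClass(typeNames=["SQL"]),
    )
)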

Technical Reference Guide

The sections above provide an overview of how to use this entity. The following sections provide detailed technical information about how metadata is stored and represented in DataHub.

Aspects are the individual pieces of metadata that can be attached to an entity. Each aspect contains specific information (like ownership, tags, or properties) and is stored as a separate record, allowing for flexible and incremental metadata updates.

Relationships show how this entity connects to other entities in the metadata graph. These connections are derived from the fields within each aspect and form the foundation of DataHub's knowledge graph.

Reading the Field Tables

Each aspect's field table includes an Annotations column that provides additional metadata about how fields are used:

  • ⚠️ Deprecated: This field is deprecated and may be removed in a future version. Check the description for the recommended alternative
  • Searchable: This field is indexed and can be searched in DataHub's search interface
  • Searchable (fieldname): When the field name in parentheses is shown, it indicates the field is indexed under a different name in the search index. For example, dashboardTool is indexed as tool
  • → RelationshipName: This field creates a relationship to another entity. The arrow indicates this field contains a reference (URN) to another entity, and the name indicates the type of relationship (e.g., → Contains, → OwnedBy)

Fields with complex types (like Edge, AuditStamp) link to their definitions in the Common Types section below.

Aspects

dataJobKey

Key for a Data Job

Field | Type | Description | Annotations
flow | string | Standardized data processing flow urn representing the flow for the job | Searchable (dataFlow), → IsPartOf
jobId | string | Unique Identifier of the data job | Searchable

dataJobInfo

Information about a Data processing job

Field | Type | Description | Annotations
customProperties | map | Custom property bag. | Searchable
externalUrl | string | URL where the reference exist | Searchable
name | string | Job name | Searchable
description | string | Job description | Searchable
type | union | Datajob type. NOTE: AzkabanJobType is deprecated. Please use strings instead. |
flowUrn | string | DataFlow urn that this job is part of |
created | TimeStamp | A timestamp documenting when the asset was created in the source Data Platform (not on DataHub) | Searchable
lastModified | TimeStamp | A timestamp documenting when the asset was last modified in the source Data Platform (not on Data... | Searchable
status | JobStatus | Status of the job - Deprecated for Data Process Instance model. | ⚠️ Deprecated
env | FabricType | Environment for this job | Searchable

dataJobInputOutput

Information about the inputs and outputs of a Data processing job

Field | Type | Description | Annotations
inputDatasets | string[] | Input datasets consumed by the data job during processing. Deprecated! Use inputDatasetEdges instead. | ⚠️ Deprecated, Searchable, → Consumes
inputDatasetEdges | Edge[] | Input datasets consumed by the data job during processing | Searchable, → Consumes
outputDatasets | string[] | Output datasets produced by the data job during processing. Deprecated! Use outputDatasetEdges ins... | ⚠️ Deprecated, Searchable, → Produces
outputDatasetEdges | Edge[] | Output datasets produced by the data job during processing | Searchable, → Produces
inputDatajobs | string[] | Input datajobs that this data job depends on. Deprecated! Use inputDatajobEdges instead. | ⚠️ Deprecated, → DownstreamOf
inputDatajobEdges | Edge[] | Input datajobs that this data job depends on | → DownstreamOf
inputDatasetFields | string[] | Fields of the input datasets used by this job | Searchable, → Consumes
outputDatasetFields | string[] | Fields of the output datasets this job writes to | Searchable, → Produces
fineGrainedLineages | FineGrainedLineage[] | Fine-grained column-level lineages. Not currently supported in the UI. Use UpstreamLineage aspect f... |

editableDataJobProperties

Stores editable changes made to properties. This separates changes made from ingestion pipelines and edits in the UI to avoid accidental overwrites of user-provided data by ingestion pipelines

Field | Type | Description | Annotations
created | AuditStamp | An AuditStamp corresponding to the creation of this resource/association/sub-resource. A value of... |
lastModified | AuditStamp | An AuditStamp corresponding to the last modification of this resource/association/sub-resource. I... |
deleted | AuditStamp | An AuditStamp corresponding to the deletion of this resource/association/sub-resource. Logically,... |
description | string | Edited documentation of the data job | Searchable (editedDescription)

ownership

Ownership information of an entity.

Field | Type | Description | Annotations
owners | Owner[] | List of owners of the entity. |
ownerTypes | map | Ownership type to Owners map, populated via mutation hook. | Searchable
lastModified | AuditStamp | Audit stamp containing who last modified the record and when. A value of 0 in the time field indi... |

status

The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc. This aspect is used to represent soft deletes conventionally.

Field | Type | Description | Annotations
removed | boolean | Whether the entity has been removed (soft-deleted). | Searchable

globalTags

Tag aspect used for applying tags to an entity

Field | Type | Description | Annotations
tags | TagAssociation[] | Tags associated with a given entity | Searchable, → TaggedWith

browsePaths

Shared aspect containing Browse Paths to be indexed for an entity.

Field | Type | Description | Annotations
paths | string[] | A list of valid browse paths for the entity. Browse paths are expected to be forward slash-separ... | Searchable

glossaryTerms

Related business terms information

Field | Type | Description | Annotations
terms | GlossaryTermAssociation[] | The related business terms |
auditStamp | AuditStamp | Audit stamp containing who reported the related business term |

institutionalMemory

Institutional memory of an entity. This is a way to link to relevant documentation and provide description of the documentation. Institutional or tribal knowledge is very important for users to leverage the entity.

Field | Type | Description | Annotations
elements | InstitutionalMemoryMetadata[] | List of records that represent institutional memory of an entity. Each record consists of a link,... |

dataPlatformInstance

The specific instance of the data platform that this entity belongs to

Field | Type | Description | Annotations
platform | string | Data Platform | Searchable
instance | string | Instance of the data platform (e.g. db instance) | Searchable (platformInstance)

browsePathsV2

Shared aspect containing a Browse Path to be indexed for an entity.

Field | Type | Description | Annotations
path | BrowsePathEntry[] | A valid browse path for the entity. This field is provided by DataHub by default. This aspect is ... | Searchable

domains

Links from an Asset to its Domains

Field | Type | Description | Annotations
domains | string[] | The Domains attached to an Asset | Searchable, → AssociatedWith

applications

Links from an Asset to its Applications

Field | Type | Description | Annotations
applications | string[] | The Applications attached to an Asset | Searchable, → AssociatedWith

deprecation

Deprecation status of an entity

Field | Type | Description | Annotations
deprecated | boolean | Whether the entity is deprecated. | Searchable
decommissionTime | long | The time user plan to decommission this entity. |
note | string | Additional information about the entity deprecation plan, such as the wiki, doc, RB. |
actor | string | The user URN which will be credited for modifying this deprecation content. |
replacement | string | |

versionInfo

Information about a Data processing job

Field | Type | Description | Annotations
customProperties | map | Custom property bag. | Searchable
externalUrl | string | URL where the reference exist | Searchable
version | string | The version which can identify a job version like a commit hash or md5 hash |
versionType | string | The type of the version like git hash or md5 hash |

container

Link from an asset to its parent container

Field | Type | Description | Annotations
container | string | The parent container of an asset | Searchable, → IsPartOf

structuredProperties

Properties about an entity governed by StructuredPropertyDefinition

Field | Type | Description | Annotations
properties | StructuredPropertyValueAssignment[] | Custom property bag. |

forms

Forms that are assigned to this entity to be filled out

Field | Type | Description | Annotations
incompleteForms | FormAssociation[] | All incomplete forms assigned to the entity. | Searchable
completedForms | FormAssociation[] | All complete forms assigned to the entity. | Searchable
verifications | FormVerificationAssociation[] | Verifications that have been applied to the entity via completed forms. | Searchable

subTypes

Sub Types. Use this aspect to specialize a generic Entity e.g. Making a Dataset also be a View or also be a LookerExplore

Field | Type | Description | Annotations
typeNames | string[] | The names of the specific types. | Searchable

incidentsSummary

Summary related incidents on an entity.

Field | Type | Description | Annotations
resolvedIncidents | string[] | Resolved incidents for an asset. Deprecated! Use the richer resolvedIncidentsDetails instead. | ⚠️ Deprecated
activeIncidents | string[] | Active incidents for an asset. Deprecated! Use the richer activeIncidentsDetails instead. | ⚠️ Deprecated
resolvedIncidentDetails | IncidentSummaryDetails[] | Summary details about the set of resolved incidents | Searchable, → ResolvedIncidents
activeIncidentDetails | IncidentSummaryDetails[] | Summary details about the set of active incidents | Searchable, → ActiveIncidents

testResults

Information about a Test Result

Field | Type | Description | Annotations
failing | TestResult[] | Results that are failing | Searchable, → IsFailing
passing | TestResult[] | Results that are passing | Searchable, → IsPassing

dataTransformLogic

Information about a Query against one or more data assets (e.g. Tables or Views).

Field | Type | Description | Annotations
transforms | DataTransform[] | List of transformations applied |

datahubIngestionRunSummary (Timeseries)

Summary of a datahub ingestion run for a given platform.

Field | Type | Description | Annotations
timestampMillis | long | The event timestamp field as epoch at UTC in milli seconds. |
eventGranularity | TimeWindowSize | Granularity of the event if applicable |
partitionSpec | PartitionSpec | The optional partition specification. |
messageId | string | The optional messageId, if provided serves as a custom user-defined unique identifier for an aspe... |
pipelineName | string | The name of the pipeline that ran ingestion, a stable unique user provided identifier. e.g. my_s... |
platformInstanceId | string | The id of the instance against which the ingestion pipeline ran. e.g.: Bigquery project ids, MySQ... |
runId | string | The runId for this pipeline instance. |
runStatus | JobStatus | Run Status - Succeeded/Skipped/Failed etc. |
numWorkUnitsCommitted | long | The number of workunits written to sink. |
numWorkUnitsCreated | long | The number of workunits that are produced. |
numEvents | long | The number of events produced (MCE + MCP). |
numEntities | long | The total number of entities produced (unique entity urns). |
numAspects | long | The total number of aspects produced across all entities. |
numSourceAPICalls | long | Total number of source API calls. |
totalLatencySourceAPICalls | long | Total latency across all source API calls. |
numSinkAPICalls | long | Total number of sink API calls. |
totalLatencySinkAPICalls | long | Total latency across all sink API calls. |
numWarnings | long | Number of warnings generated. |
numErrors | long | Number of errors generated. |
numEntitiesSkipped | long | Number of entities skipped. |
config | string | The non-sensitive key-value pairs of the yaml config used as json string. |
custom_summary | string | Custom value. |
softwareVersion | string | The software version of this ingestion. |
systemHostName | string | The hostname the ingestion pipeline ran on. |
operatingSystemName | string | The os the ingestion pipeline ran on. |
numProcessors | int | The number of processors on the host the ingestion pipeline ran on. |
totalMemory | long | The total amount of memory on the host the ingestion pipeline ran on. |
availableMemory | long | The available memory on the host the ingestion pipeline ran on. |

datahubIngestionCheckpoint (Timeseries)

Checkpoint of a datahub ingestion run for a given job.

Field | Type | Description | Annotations
timestampMillis | long | The event timestamp field as epoch at UTC in milli seconds. |
eventGranularity | TimeWindowSize | Granularity of the event if applicable |
partitionSpec | PartitionSpec | The optional partition specification. |
messageId | string | The optional messageId, if provided serves as a custom user-defined unique identifier for an aspe... |
pipelineName | string | The name of the pipeline that ran ingestion, a stable unique user provided identifier. e.g. my_s... |
platformInstanceId | string | The id of the instance against which the ingestion pipeline ran. e.g.: Bigquery project ids, MySQ... |
config | string | Json-encoded string representation of the non-secret members of the config. |
state | IngestionCheckpointState | Opaque blob of the state representation. |
runId | string | The run identifier of this job. |

Common Types

These types are used across multiple aspects in this entity.

AuditStamp

Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage.

Fields:

  • time (long): When did the resource/association/sub-resource move into the specific lifecyc...
  • actor (string): The entity (e.g. a member URN) which will be credited for moving the resource...
  • impersonator (string?): The entity (e.g. a service URN) which performs the change on behalf of the Ac...
  • message (string?): Additional context around how DataHub was informed of the particular change. ...

Edge

A common structure to represent all edges to entities when used inside aspects as collections. This ensures that all edges have a common structure around audit stamps and will support PATCH and time-travel automatically.

Fields:

  • sourceUrn (string?): Urn of the source of this relationship edge. If not specified, assumed to be ...
  • destinationUrn (string): Urn of the destination of this relationship edge.
  • created (AuditStamp?): Audit stamp containing who created this relationship edge and when
  • lastModified (AuditStamp?): Audit stamp containing who last modified this relationship edge and when
  • properties (map?): A generic properties bag that allows us to store specific information on this...

FormAssociation

Properties of an applied form.

Fields:

  • urn (string): Urn of the applied form
  • incompletePrompts (FormPromptAssociation[]): A list of prompts that are not yet complete for this form.
  • completedPrompts (FormPromptAssociation[]): A list of prompts that have been completed for this form.

IncidentSummaryDetails

Summary statistics about incidents on an entity.

Fields:

  • urn (string): The urn of the incident
  • type (string): The type of an incident
  • createdAt (long): The time at which the incident was raised in milliseconds since epoch.
  • resolvedAt (long?): The time at which the incident was marked as resolved in milliseconds since e...
  • priority (int?): The priority of the incident

PartitionSpec

A reference to a specific partition in a dataset.

Fields:

  • partition (string): A unique id / value for the partition for which statistics were collected, ge...
  • timePartition (TimeWindow?): Time window of the partition, if we are able to extract it from the partition...
  • type (PartitionType): Unused!

TestResult

Information about a Test Result

Fields:

  • test (string): The urn of the test
  • type (TestResultType): The type of the result
  • testDefinitionMd5 (string?): The md5 of the test definition that was used to compute this result. See Test...
  • lastComputed (AuditStamp?): The audit stamp of when the result was computed, including the actor who comp...

TimeStamp

A standard event timestamp

Fields:

  • time (long): When did the event occur
  • actor (string?): Optional: The actor urn involved in the event.

TimeWindowSize

Defines the size of a time window.

Fields:

  • unit (CalendarInterval): Interval unit such as minute/hour/day etc.
  • multiple (int): How many units. Defaults to 1.

Relationships

Self

These are the relationships to itself, stored in this entity's aspects

  • DownstreamOf (via dataJobInputOutput.inputDatajobs)
  • DownstreamOf (via dataJobInputOutput.inputDatajobEdges)

Outgoing

These are the relationships stored in this entity's aspects

  • IsPartOf

    • DataFlow via dataJobKey.flow
    • Container via container.container
  • Consumes

    • Dataset via dataJobInputOutput.inputDatasets
    • Dataset via dataJobInputOutput.inputDatasetEdges
    • SchemaField via dataJobInputOutput.inputDatasetFields
  • Produces

    • Dataset via dataJobInputOutput.outputDatasets
    • Dataset via dataJobInputOutput.outputDatasetEdges
    • SchemaField via dataJobInputOutput.outputDatasetFields
  • OwnedBy

    • Corpuser via ownership.owners.owner
    • CorpGroup via ownership.owners.owner
  • ownershipType

    • OwnershipType via ownership.owners.typeUrn
  • TaggedWith

    • Tag via globalTags.tags
  • TermedWith

    • GlossaryTerm via glossaryTerms.terms.urn
  • AssociatedWith

    • Domain via domains.domains
    • Application via applications.applications
  • ResolvedIncidents

    • Incident via incidentsSummary.resolvedIncidentDetails
  • ActiveIncidents

    • Incident via incidentsSummary.activeIncidentDetails
  • IsFailing

    • Test via testResults.failing
  • IsPassing

    • Test via testResults.passing

Global Metadata Model

Global Graph