DataProcess
DEPRECATED: This entity should not be used for new implementations.
Use dataFlow and dataJob instead.
The `dataProcess` entity was an early attempt to model data processing tasks. It has been superseded by the more robust and flexible `dataFlow` and `dataJob` entities, which better represent the hierarchical nature of modern data pipelines.
Deprecation Notice
The dataProcess entity was deprecated to provide a clearer separation between:
- DataFlow: Represents the overall pipeline/workflow (e.g., an Airflow DAG, dbt project, Spark application)
- DataJob: Represents individual tasks within a pipeline (e.g., an Airflow task, dbt model, Spark job)
This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.
Why was it deprecated?
The original dataProcess entity had several limitations:
- No hierarchical structure: It couldn't represent the relationship between a pipeline and its constituent tasks
- Limited orchestrator support: The flat structure didn't map well to DAG-based orchestration platforms like Airflow, Prefect, or Dagster
- Unclear semantics: It was ambiguous whether a dataProcess represented a whole pipeline or a single task
- Poor lineage modeling: Without task-level granularity, lineage relationships were less precise
The new dataFlow and dataJob model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.
Identity (Historical Reference)
DataProcess entities were identified by three components:
- Name: The process name (typically an ETL job name)
- Orchestrator: The workflow management platform (e.g., `airflow`, `azkaban`)
- Origin (Fabric): The environment where the process runs (PROD, DEV, etc.)
The URN structure was:
urn:li:dataProcess:(<name>,<orchestrator>,<origin>)
Example URNs
urn:li:dataProcess:(customer_etl_job,airflow,PROD)
urn:li:dataProcess:(sales_aggregation,azkaban,DEV)
Important Capabilities (Historical Reference)
DataProcessInfo Aspect
The dataProcessInfo aspect captured inputs and outputs of the process:
- Inputs: Array of dataset URNs consumed by the process
- Outputs: Array of dataset URNs produced by the process
This established basic lineage through "Consumes" relationships with datasets.
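For reference, the aspect can be read directly with the Python SDK's graph client. This is a minimal sketch, assuming a DataHub instance at localhost:8080 and a hypothetical dataProcess URN:

```python
# Minimal sketch: read the dataProcessInfo aspect of an existing, deprecated
# dataProcess entity. The server address and URN below are assumptions.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DataProcessInfoClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

info = graph.get_aspect(
    entity_urn="urn:li:dataProcess:(customer_etl_job,airflow,PROD)",
    aspect_type=DataProcessInfoClass,
)
if info:
    print("Inputs:", info.inputs or [])
    print("Outputs:", info.outputs or [])
```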
Common Aspects
Like other entities, dataProcess supported:
- Ownership: Assigning owners to processes
- Status: Marking processes as removed
- Global Tags: Categorization and classification
- Institutional Memory: Links to documentation
Migration Guide
When to use DataFlow vs DataJob
Use DataFlow when representing:
- Airflow DAGs
- dbt projects
- Prefect flows
- Dagster pipelines
- Azkaban workflows
- Any container of related data processing tasks
Use DataJob when representing:
- Airflow tasks within a DAG
- dbt models within a project
- Prefect tasks within a flow
- Dagster ops/assets within a pipeline
- Individual processing steps
Use both together:
- Create a DataFlow for the pipeline
- Create DataJobs for each task within that pipeline
- Link DataJobs to their parent DataFlow
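A minimal sketch of this pattern with the Python SDK (the pipeline and task names are hypothetical):

```python
# Sketch: one DataFlow (the pipeline) plus one DataJob per task, linked via flow=.
from datahub.sdk import DataFlow, DataHubClient, DataJob

client = DataHubClient.from_env()

flow = DataFlow(platform="airflow", name="nightly_pipeline")
jobs = [
    DataJob(name=task, flow=flow)  # flow= links each job to its parent DataFlow
    for task in ["extract", "transform", "load"]
]

client.entities.upsert(flow)
for job in jobs:
    client.entities.upsert(job)
```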
Conceptual Mapping
| DataProcess Concept | New Model Equivalent | Notes |
|---|---|---|
| Process with tasks | DataFlow + DataJobs | Split into two entities |
| Process name | DataFlow flowId | Becomes the parent identifier |
| Single-step process | DataFlow + 1 DataJob | Still requires both entities |
| Orchestrator | DataFlow orchestrator | Same concept, better modeling |
| Origin/Fabric | DataFlow cluster | Often matches environment |
| Inputs/Outputs | DataJob dataJobInputOutput | Moved to job level for precision |
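To make the mapping concrete, here is a small sketch that derives the new URNs from an old dataProcess URN using plain string parsing; `translate_urn` is a hypothetical helper, not a DataHub API:

```python
# Sketch: derive dataFlow/dataJob URNs from a deprecated dataProcess URN.
def translate_urn(dataprocess_urn: str, job_name: str = "main_task") -> tuple[str, str]:
    # urn:li:dataProcess:(<name>,<orchestrator>,<origin>)
    inner = dataprocess_urn.split(":(", 1)[1].rstrip(")")
    name, orchestrator, origin = inner.split(",")
    flow_urn = f"urn:li:dataFlow:({orchestrator},{name},{origin.lower()})"
    job_urn = f"urn:li:dataJob:({flow_urn},{job_name})"
    return flow_urn, job_urn

flow_urn, job_urn = translate_urn("urn:li:dataProcess:(customer_etl_job,airflow,PROD)")
print(flow_urn)  # urn:li:dataFlow:(airflow,customer_etl_job,prod)
print(job_urn)   # urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_etl_job,prod),main_task)
```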
Migration Steps
To migrate from dataProcess to dataFlow/dataJob:
1. Identify your process structure: Determine if your dataProcess represents a pipeline (has multiple steps) or a single task
2. Create a DataFlow: This represents the overall pipeline/workflow
   - Use the same orchestrator value
   - Use the process name as the flow ID
   - Use a cluster identifier (often matches the origin/fabric)
3. Create DataJob(s): Create one or more jobs within the flow
   - For single-step processes: create one job named after the process
   - For multi-step processes: create a job for each step
   - Link each job to its parent DataFlow
4. Migrate lineage: Move input/output dataset relationships from the process level to the job level
5. Migrate metadata: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata)
Migration Examples
Example 1: Simple single-task process
Old dataProcess:
urn:li:dataProcess:(daily_report,airflow,PROD)
New structure:
DataFlow: urn:li:dataFlow:(airflow,daily_report,prod)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)
Example 2: Multi-step ETL pipeline
Old dataProcess:
urn:li:dataProcess:(customer_pipeline,airflow,PROD)
New structure:
DataFlow: urn:li:dataFlow:(airflow,customer_pipeline,prod)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)
Code Examples
Querying Existing DataProcess Entities
If you need to query existing dataProcess entities for migration purposes:
Python SDK: Query a dataProcess entity
# Inlined from /metadata-ingestion/examples/library/dataprocess_query_deprecated.py
"""
Example: Query an existing (deprecated) dataProcess entity for migration purposes.
This example shows how to read a deprecated dataProcess entity from DataHub
to understand its structure before migrating it to dataFlow and dataJob entities.
Note: This is only for reading existing data. Do NOT create new dataProcess entities.
Use dataFlow and dataJob instead for all new implementations.
"""
from datahub.emitter.rest_emitter import DatahubRestEmitter
# Create emitter to read from DataHub
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
# URN of the deprecated dataProcess entity to query
dataprocess_urn = "urn:li:dataProcess:(customer_etl_job,airflow,PROD)"
# Fetch the entity using the REST API
try:
entity = emitter._session.get(
f"{emitter._gms_server}/entities/{dataprocess_urn}"
).json()
print(f"Found dataProcess: {dataprocess_urn}")
print("\n=== Entity Aspects ===")
    # Extract key information for migration
    key = {}  # default so the migration recommendation below works even if the key aspect is missing
    if "aspects" in entity:
        aspects = entity["aspects"]
        # Key aspect (identity)
        if "dataProcessKey" in aspects:
            key = aspects["dataProcessKey"]
print("\nIdentity:")
print(f" Name: {key.get('name')}")
print(f" Orchestrator: {key.get('orchestrator')}")
print(f" Origin (Fabric): {key.get('origin')}")
# Core process information
if "dataProcessInfo" in aspects:
info = aspects["dataProcessInfo"]
print("\nProcess Info:")
if "inputs" in info:
print(f" Input Datasets: {len(info['inputs'])}")
for inp in info["inputs"]:
print(f" - {inp}")
if "outputs" in info:
print(f" Output Datasets: {len(info['outputs'])}")
for out in info["outputs"]:
print(f" - {out}")
# Ownership information
if "ownership" in aspects:
ownership = aspects["ownership"]
print("\nOwnership:")
for owner in ownership.get("owners", []):
print(f" - {owner['owner']} (type: {owner.get('type', 'UNKNOWN')})")
# Tags
if "globalTags" in aspects:
tags = aspects["globalTags"]
print("\nTags:")
for tag in tags.get("tags", []):
print(f" - {tag['tag']}")
# Status
if "status" in aspects:
status = aspects["status"]
print(f"\nStatus: {status.get('removed', False)}")
print("\n=== Migration Recommendation ===")
print("Replace this dataProcess with:")
print(
f" DataFlow URN: urn:li:dataFlow:({key.get('orchestrator')},{key.get('name')},{key.get('origin', 'PROD').lower()})"
)
print(
f" DataJob URN: urn:li:dataJob:(urn:li:dataFlow:({key.get('orchestrator')},{key.get('name')},{key.get('origin', 'PROD').lower()}),main_task)"
)
print("\nSee dataprocess_migrate_to_flow_job.py for migration code examples.")
except Exception as e:
print(f"Error querying dataProcess: {e}")
print("\nThis is expected if the entity doesn't exist.")
print("DataProcess is deprecated - use dataFlow and dataJob instead.")
Creating Equivalent DataFlow and DataJob (Recommended)
Instead of using dataProcess, create the modern equivalent:
Python SDK: Create DataFlow and DataJob to replace dataProcess
# Inlined from /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py
"""
Example: Migrate from deprecated dataProcess to modern dataFlow and dataJob entities.
This example shows how to create the modern dataFlow and dataJob entities
that replace the deprecated dataProcess entity.
The dataProcess entity used a flat structure:
dataProcess -> inputs/outputs
The new model uses a hierarchical structure:
dataFlow (pipeline) -> dataJob (task) -> inputs/outputs
This provides better organization and more precise lineage tracking.
"""
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import DataFlow, DataHubClient, DataJob
client = DataHubClient.from_env()
# Old dataProcess would have been:
# urn:li:dataProcess:(customer_etl_job,airflow,PROD)
# with inputs and outputs at the process level
# New approach: Create a DataFlow for the pipeline
dataflow = DataFlow(
platform="airflow", # Same as the old 'orchestrator' field
name="customer_etl_job", # Same as the old 'name' field
platform_instance="prod", # Based on old 'origin' field
description="ETL pipeline for customer data processing",
)
# Create DataJob(s) within the flow
# For a simple single-task process, create one job
# For complex multi-step processes, create multiple jobs
datajob = DataJob(
name="customer_etl_task", # Task name within the flow
flow=dataflow, # Link to parent flow
description="Main ETL task for customer data",
# Inputs and outputs now live at the job level for precise lineage
inlets=[
DatasetUrn(platform="mysql", name="raw_db.customers", env="PROD"),
DatasetUrn(platform="mysql", name="raw_db.orders", env="PROD"),
],
outlets=[
DatasetUrn(platform="postgres", name="analytics.customer_summary", env="PROD"),
],
)
# Upsert both entities
client.entities.upsert(dataflow)
client.entities.upsert(datajob)
print("Successfully migrated from dataProcess to dataFlow + dataJob!")
print(f"DataFlow URN: {dataflow.urn}")
print(f"DataJob URN: {datajob.urn}")
print("\nKey improvements over dataProcess:")
print("- Clear separation between pipeline (DataFlow) and task (DataJob)")
print("- Support for multi-step pipelines with multiple DataJobs")
print("- More precise lineage at the task level")
print("- Better integration with modern orchestration platforms")
Complete Migration Example
Python SDK: Full migration from dataProcess to dataFlow/dataJob
# Inlined from /metadata-ingestion/examples/library/dataprocess_full_migration.py
"""
Example: Complete migration from dataProcess to dataFlow/dataJob with metadata preservation.
This example demonstrates a full migration path that:
1. Reads an existing deprecated dataProcess entity
2. Extracts all its metadata (inputs, outputs, ownership, tags)
3. Creates equivalent dataFlow and dataJob entities
4. Preserves all metadata relationships
Use this as a template for migrating multiple dataProcess entities in bulk.
"""
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
GlobalTagsClass,
OwnerClass,
OwnershipClass,
OwnershipTypeClass,
TagAssociationClass,
)
from datahub.sdk import DataFlow, DataHubClient, DataJob
# Initialize clients
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
client = DataHubClient.from_env()
# Step 1: Define the dataProcess to migrate
old_dataprocess_urn = "urn:li:dataProcess:(sales_pipeline,airflow,PROD)"
print(f"Migrating: {old_dataprocess_urn}")
try:
# Step 2: Fetch the existing dataProcess entity
entity = rest_emitter._session.get(
f"{rest_emitter._gms_server}/entities/{old_dataprocess_urn}"
).json()
aspects = entity.get("aspects", {})
# Extract identity information
key = aspects.get("dataProcessKey", {})
name = key.get("name", "unknown_process")
orchestrator = key.get("orchestrator", "unknown")
origin = key.get("origin", "PROD")
# Extract process info
process_info = aspects.get("dataProcessInfo", {})
input_datasets = process_info.get("inputs", [])
output_datasets = process_info.get("outputs", [])
# Extract ownership
ownership_aspect = aspects.get("ownership", {})
owners = ownership_aspect.get("owners", [])
# Extract tags
tags_aspect = aspects.get("globalTags", {})
tags = tags_aspect.get("tags", [])
print("\n=== Extracted Metadata ===")
print(f"Name: {name}")
print(f"Orchestrator: {orchestrator}")
print(f"Environment: {origin}")
print(f"Inputs: {len(input_datasets)} datasets")
print(f"Outputs: {len(output_datasets)} datasets")
print(f"Owners: {len(owners)}")
print(f"Tags: {len(tags)}")
# Step 3: Create the new DataFlow
dataflow = DataFlow(
platform=orchestrator,
name=name,
platform_instance=origin.lower(),
description=f"Migrated from dataProcess {name}",
)
# Step 4: Create the DataJob(s)
# For simplicity, creating one job. In practice, you might split into multiple jobs.
datajob = DataJob(
name=f"{name}_main",
flow=dataflow,
description=f"Main task for {name}",
inlets=[inp for inp in input_datasets], # These should be dataset URNs
outlets=[out for out in output_datasets], # These should be dataset URNs
)
# Step 5: Upsert the entities
client.entities.upsert(dataflow)
client.entities.upsert(datajob)
print("\n=== Created New Entities ===")
print(f"DataFlow: {dataflow.urn}")
print(f"DataJob: {datajob.urn}")
# Step 6: Migrate ownership to DataFlow
if owners:
ownership_to_add = OwnershipClass(
owners=[
OwnerClass(
owner=owner.get("owner"),
type=getattr(OwnershipTypeClass, owner.get("type", "DATAOWNER")),
)
for owner in owners
]
)
rest_emitter.emit_mcp(
MetadataChangeProposalWrapper(
entityUrn=str(dataflow.urn),
aspect=ownership_to_add,
)
)
print(f"Migrated {len(owners)} owner(s) to DataFlow")
# Step 7: Migrate tags to DataFlow
if tags:
tags_to_add = GlobalTagsClass(
tags=[TagAssociationClass(tag=tag.get("tag")) for tag in tags]
)
rest_emitter.emit_mcp(
MetadataChangeProposalWrapper(
entityUrn=str(dataflow.urn),
aspect=tags_to_add,
)
)
print(f"Migrated {len(tags)} tag(s) to DataFlow")
print("\n=== Migration Complete ===")
print("Next steps:")
print("1. Verify the new entities in DataHub UI")
print("2. Update any downstream systems to reference the new URNs")
print("3. Consider soft-deleting the old dataProcess entity")
except Exception as e:
print(f"Error during migration: {e}")
print("\nCommon issues:")
print("- DataProcess entity doesn't exist (already migrated or never created)")
print("- Network connectivity to DataHub GMS")
print("- Permission issues writing to DataHub")
Integration Points
Historical Usage
The dataProcess entity was previously used by:
- Early ingestion connectors: Original Airflow, Azkaban connectors before they migrated to dataFlow/dataJob
- Custom integrations: User-built integrations that haven't been updated
- Legacy metadata: Historical data in existing DataHub instances
Modern Replacements
All modern DataHub connectors use dataFlow and dataJob:
- Airflow: DAGs → DataFlow, Tasks → DataJob
- dbt: Projects → DataFlow, Models → DataJob
- Prefect: Flows → DataFlow, Tasks → DataJob
- Dagster: Pipelines → DataFlow, Ops/Assets → DataJob
- Fivetran: Connectors → DataFlow, Sync operations → DataJob
- AWS Glue: Jobs → DataFlow, Steps → DataJob
- Azure Data Factory: Pipelines → DataFlow, Activities → DataJob
DataProcessInstance
Note that dataProcessInstance is NOT deprecated. It represents a specific execution/run of either:
- A dataJob (recommended)
- A legacy dataProcess (for backward compatibility)
DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.
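For example, here is a sketch of recording a run against a dataJob with the library's job-level API; the flow, job, and run ids are hypothetical:

```python
# Sketch: track one execution of a dataJob with a dataProcessInstance.
import time

from datahub.api.entities.datajob import DataFlow, DataJob
from datahub.api.entities.dataprocess.dataprocess_instance import (
    DataProcessInstance,
    InstanceRunResult,
)
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

flow = DataFlow(orchestrator="airflow", cluster="prod", id="customer_pipeline")
job = DataJob(flow_urn=flow.urn, id="extract_customers")

run = DataProcessInstance.from_datajob(datajob=job, id="run_2024_01_01")
run.emit_process_start(emitter=emitter, start_timestamp_millis=int(time.time() * 1000))
run.emit_process_end(
    emitter=emitter,
    end_timestamp_millis=int(time.time() * 1000),
    result=InstanceRunResult.SUCCESS,
)
```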
Notable Exceptions
Timeline for Removal
- Deprecated: Early 2021 (with introduction of dataFlow/dataJob)
- Status: Still exists in the entity registry for backward compatibility
- Current State: No active ingestion sources create dataProcess entities
- Removal: No specific timeline, maintained for existing data
Reading Existing Data
The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.
No New Writes Recommended
While technically possible to create new dataProcess entities, it is strongly discouraged. All new integrations should use dataFlow and dataJob.
Upgrade Path
There is no automatic migration tool. Organizations with significant dataProcess data should:
- Use the Python SDK to query existing dataProcess entities
- Create equivalent dataFlow and dataJob entities
- Preserve URN mappings for lineage continuity
- Consider soft-deleting old dataProcess entities once migration is verified
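A sketch of the first step, assuming the graph client accepts a `dataProcess` entity-type filter:

```python
# Sketch: enumerate existing dataProcess entities for bulk migration.
# The "dataProcess" entity-type string is an assumption; verify it against your registry.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

for urn in graph.get_urns_by_filter(entity_types=["dataProcess"]):
    print(urn)  # feed each URN into the migration script shown above
    # After verifying the new entities, the old one can be retired, e.g.:
    # graph.soft_delete_entity(urn)
```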
GraphQL API
The dataProcess entity is minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.
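For comparison, a minimal sketch of fetching a dataFlow through GraphQL (the selected fields are a plausible subset of the schema):

```python
# Sketch: query the modern dataFlow entity via GraphQL instead of dataProcess.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

result = graph.execute_graphql(
    """
    query getDataFlow($urn: String!) {
      dataFlow(urn: $urn) {
        urn
        properties { name description }
      }
    }
    """,
    variables={"urn": "urn:li:dataFlow:(airflow,customer_pipeline,prod)"},
)
print(result)
```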
Additional Resources
- DataFlow Entity Documentation
- DataJob Entity Documentation
- Lineage Documentation
- Airflow Integration Guide
Technical Reference Guide
The sections above provide an overview of how to use this entity. The following sections provide detailed technical information about how metadata is stored and represented in DataHub.
Aspects are the individual pieces of metadata that can be attached to an entity. Each aspect contains specific information (like ownership, tags, or properties) and is stored as a separate record, allowing for flexible and incremental metadata updates.
Relationships show how this entity connects to other entities in the metadata graph. These connections are derived from the fields within each aspect and form the foundation of DataHub's knowledge graph.
Reading the Field Tables
Each aspect's field table includes an Annotations column that provides additional metadata about how fields are used:
- ⚠️ Deprecated: This field is deprecated and may be removed in a future version. Check the description for the recommended alternative
- Searchable: This field is indexed and can be searched in DataHub's search interface
- Searchable (fieldname): When a field name appears in parentheses, the field is indexed under a different name in the search index. For example, `dashboardTool` is indexed as `tool`
- → RelationshipName: This field creates a relationship to another entity. The arrow indicates this field contains a reference (URN) to another entity, and the name indicates the type of relationship (e.g., → Contains, → OwnedBy)
Fields with complex types (like Edge, AuditStamp) link to their definitions in the Common Types section below.
Aspects
dataProcessKey
Key for a Data Process
- Fields
- Raw Schema
| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| name | string | ✓ | Process name i.e. an ETL job name | Searchable |
| orchestrator | string | ✓ | Standardized Orchestrator where data process is defined. TODO: Migrate towards something that can... | Searchable |
| origin | FabricType | ✓ | Fabric type where dataset belongs to or where it was generated. | Searchable |
{
"type": "record",
"Aspect": {
"name": "dataProcessKey"
},
"name": "DataProcessKey",
"namespace": "com.linkedin.metadata.key",
"fields": [
{
"Searchable": {
"boostScore": 4.0,
"enableAutocomplete": true,
"fieldType": "WORD_GRAM"
},
"type": "string",
"name": "name",
"doc": "Process name i.e. an ETL job name"
},
{
"Searchable": {
"enableAutocomplete": true,
"fieldType": "TEXT_PARTIAL"
},
"type": "string",
"name": "orchestrator",
"doc": "Standardized Orchestrator where data process is defined.\nTODO: Migrate towards something that can be validated like DataPlatform urn"
},
{
"Searchable": {
"fieldType": "TEXT_PARTIAL",
"queryByDefault": false
},
"type": {
"type": "enum",
"symbolDocs": {
"CORP": "Designates corporation fabrics",
"DEV": "Designates development fabrics",
"EI": "Designates early-integration fabrics",
"NON_PROD": "Designates non-production fabrics",
"PRD": "Alternative Prod spelling",
"PRE": "Designates pre-production fabrics",
"PROD": "Designates production fabrics",
"QA": "Designates quality assurance fabrics",
"RVW": "Designates review fabrics",
"SANDBOX": "Designates sandbox fabrics",
"SBX": "Alternative spelling for sandbox",
"SIT": "System Integration Testing",
"STG": "Designates staging fabrics",
"TEST": "Designates testing fabrics",
"TST": "Alternative Test spelling",
"UAT": "Designates user acceptance testing fabrics"
},
"name": "FabricType",
"namespace": "com.linkedin.common",
"symbols": [
"DEV",
"TEST",
"QA",
"UAT",
"EI",
"PRE",
"STG",
"NON_PROD",
"PROD",
"CORP",
"RVW",
"PRD",
"TST",
"SIT",
"SBX",
"SANDBOX"
],
"doc": "Fabric group type"
},
"name": "origin",
"doc": "Fabric type where dataset belongs to or where it was generated."
}
],
"doc": "Key for a Data Process"
}
ownership
Ownership information of an entity.
- Fields
- Raw Schema
| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| owners | Owner[] | ✓ | List of owners of the entity. | |
| ownerTypes | map | | Ownership type to Owners map, populated via mutation hook. | Searchable |
| lastModified | AuditStamp | ✓ | Audit stamp containing who last modified the record and when. A value of 0 in the time field indi... | |
{
"type": "record",
"Aspect": {
"name": "ownership"
},
"name": "Ownership",
"namespace": "com.linkedin.common",
"fields": [
{
"type": {
"type": "array",
"items": {
"type": "record",
"name": "Owner",
"namespace": "com.linkedin.common",
"fields": [
{
"Relationship": {
"entityTypes": [
"corpuser",
"corpGroup"
],
"name": "OwnedBy"
},
"Searchable": {
"addToFilters": true,
"fieldName": "owners",
"fieldType": "URN",
"filterNameOverride": "Owned By",
"hasValuesFieldName": "hasOwners",
"queryByDefault": false
},
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "owner",
"doc": "Owner URN, e.g. urn:li:corpuser:ldap, urn:li:corpGroup:group_name, and urn:li:multiProduct:mp_name\n(Caveat: only corpuser is currently supported in the frontend.)"
},
{
"deprecated": true,
"type": {
"type": "enum",
"symbolDocs": {
"BUSINESS_OWNER": "A person or group who is responsible for logical, or business related, aspects of the asset.",
"CONSUMER": "A person, group, or service that consumes the data\nDeprecated! Use TECHNICAL_OWNER or BUSINESS_OWNER instead.",
"CUSTOM": "Set when ownership type is unknown or a when new one is specified as an ownership type entity for which we have no\nenum value for. This is used for backwards compatibility",
"DATAOWNER": "A person or group that is owning the data\nDeprecated! Use TECHNICAL_OWNER instead.",
"DATA_STEWARD": "A steward, expert, or delegate responsible for the asset.",
"DELEGATE": "A person or a group that overseas the operation, e.g. a DBA or SRE.\nDeprecated! Use TECHNICAL_OWNER instead.",
"DEVELOPER": "A person or group that is in charge of developing the code\nDeprecated! Use TECHNICAL_OWNER instead.",
"NONE": "No specific type associated to the owner.",
"PRODUCER": "A person, group, or service that produces/generates the data\nDeprecated! Use TECHNICAL_OWNER instead.",
"STAKEHOLDER": "A person or a group that has direct business interest\nDeprecated! Use TECHNICAL_OWNER, BUSINESS_OWNER, or STEWARD instead.",
"TECHNICAL_OWNER": "person or group who is responsible for technical aspects of the asset."
},
"deprecatedSymbols": {
"CONSUMER": true,
"DATAOWNER": true,
"DELEGATE": true,
"DEVELOPER": true,
"PRODUCER": true,
"STAKEHOLDER": true
},
"name": "OwnershipType",
"namespace": "com.linkedin.common",
"symbols": [
"CUSTOM",
"TECHNICAL_OWNER",
"BUSINESS_OWNER",
"DATA_STEWARD",
"NONE",
"DEVELOPER",
"DATAOWNER",
"DELEGATE",
"PRODUCER",
"CONSUMER",
"STAKEHOLDER"
],
"doc": "Asset owner types"
},
"name": "type",
"doc": "The type of the ownership"
},
{
"Relationship": {
"entityTypes": [
"ownershipType"
],
"name": "ownershipType"
},
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "typeUrn",
"default": null,
"doc": "The type of the ownership\nUrn of type O"
},
{
"type": [
"null",
{
"type": "record",
"name": "OwnershipSource",
"namespace": "com.linkedin.common",
"fields": [
{
"type": {
"type": "enum",
"symbolDocs": {
"AUDIT": "Auditing system or audit logs",
"DATABASE": "Database, e.g. GRANTS table",
"FILE_SYSTEM": "File system, e.g. file/directory owner",
"ISSUE_TRACKING_SYSTEM": "Issue tracking system, e.g. Jira",
"MANUAL": "Manually provided by a user",
"OTHER": "Other sources",
"SERVICE": "Other ownership-like service, e.g. Nuage, ACL service etc",
"SOURCE_CONTROL": "SCM system, e.g. GIT, SVN"
},
"name": "OwnershipSourceType",
"namespace": "com.linkedin.common",
"symbols": [
"AUDIT",
"DATABASE",
"FILE_SYSTEM",
"ISSUE_TRACKING_SYSTEM",
"MANUAL",
"SERVICE",
"SOURCE_CONTROL",
"OTHER"
]
},
"name": "type",
"doc": "The type of the source"
},
{
"type": [
"null",
"string"
],
"name": "url",
"default": null,
"doc": "A reference URL for the source"
}
],
"doc": "Source/provider of the ownership information"
}
],
"name": "source",
"default": null,
"doc": "Source information for the ownership"
},
{
"Searchable": {
"/actor": {
"fieldName": "ownerAttributionActors",
"fieldType": "URN",
"queryByDefault": false
},
"/source": {
"fieldName": "ownerAttributionSources",
"fieldType": "URN",
"queryByDefault": false
},
"/time": {
"fieldName": "ownerAttributionDates",
"fieldType": "DATETIME",
"queryByDefault": false
}
},
"type": [
"null",
{
"type": "record",
"name": "MetadataAttribution",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When this metadata was updated."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) responsible for applying the assocated metadata. This can\neither be a user (in case of UI edits) or the datahub system for automation."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "source",
"default": null,
"doc": "The DataHub source responsible for applying the associated metadata. This will only be filled out\nwhen a DataHub source is responsible. This includes the specific metadata test urn, the automation urn."
},
{
"type": {
"type": "map",
"values": "string"
},
"name": "sourceDetail",
"default": {},
"doc": "The details associated with why this metadata was applied. For example, this could include\nthe actual regex rule, sql statement, ingestion pipeline ID, etc."
}
],
"doc": "Information about who, why, and how this metadata was applied"
}
],
"name": "attribution",
"default": null,
"doc": "Information about who, why, and how this metadata was applied"
}
],
"doc": "Ownership information"
}
},
"name": "owners",
"doc": "List of owners of the entity."
},
{
"Searchable": {
"/*": {
"fieldType": "MAP_ARRAY",
"queryByDefault": false
}
},
"type": [
{
"type": "map",
"values": {
"type": "array",
"items": "string"
}
},
"null"
],
"name": "ownerTypes",
"default": {},
"doc": "Ownership type to Owners map, populated via mutation hook."
},
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "lastModified",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "Audit stamp containing who last modified the record and when. A value of 0 in the time field indicates missing data."
}
],
"doc": "Ownership information of an entity."
}
dataProcessInfo
The inputs and outputs of this data process
- Fields
- Raw Schema
| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| inputs | string[] | | the inputs of the data process | Searchable, → Consumes |
| outputs | string[] | | the outputs of the data process | Searchable, → Consumes |
{
"type": "record",
"Aspect": {
"name": "dataProcessInfo"
},
"name": "DataProcessInfo",
"namespace": "com.linkedin.dataprocess",
"fields": [
{
"Relationship": {
"/*": {
"entityTypes": [
"dataset"
],
"isLineage": true,
"name": "Consumes"
}
},
"Searchable": {
"/*": {
"fieldName": "inputs",
"fieldType": "URN",
"numValuesFieldName": "numInputDatasets",
"queryByDefault": false
}
},
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"name": "inputs",
"default": null,
"doc": "the inputs of the data process"
},
{
"Relationship": {
"/*": {
"entityTypes": [
"dataset"
],
"isLineage": true,
"name": "Consumes"
}
},
"Searchable": {
"/*": {
"fieldName": "outputs",
"fieldType": "URN",
"numValuesFieldName": "numOutputDatasets",
"queryByDefault": false
}
},
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"name": "outputs",
"default": null,
"doc": "the outputs of the data process"
}
],
"doc": "The inputs and outputs of this data process"
}
status
The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc. This aspect is used to represent soft deletes conventionally.
- Fields
- Raw Schema
| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| removed | boolean | ✓ | Whether the entity has been removed (soft-deleted). | Searchable |
{
"type": "record",
"Aspect": {
"name": "status"
},
"name": "Status",
"namespace": "com.linkedin.common",
"fields": [
{
"Searchable": {
"fieldType": "BOOLEAN"
},
"type": "boolean",
"name": "removed",
"default": false,
"doc": "Whether the entity has been removed (soft-deleted)."
}
],
"doc": "The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc.\nThis aspect is used to represent soft deletes conventionally."
}
testResults
Information about a Test Result
- Fields
- Raw Schema
| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| failing | TestResult[] | ✓ | Results that are failing | Searchable, → IsFailing |
| passing | TestResult[] | ✓ | Results that are passing | Searchable, → IsPassing |
{
"type": "record",
"Aspect": {
"name": "testResults"
},
"name": "TestResults",
"namespace": "com.linkedin.test",
"fields": [
{
"Relationship": {
"/*/test": {
"entityTypes": [
"test"
],
"name": "IsFailing"
}
},
"Searchable": {
"/*/test": {
"fieldName": "failingTests",
"fieldType": "URN",
"hasValuesFieldName": "hasFailingTests",
"queryByDefault": false
}
},
"type": {
"type": "array",
"items": {
"type": "record",
"name": "TestResult",
"namespace": "com.linkedin.test",
"fields": [
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "test",
"doc": "The urn of the test"
},
{
"type": {
"type": "enum",
"symbolDocs": {
"FAILURE": " The Test Failed",
"SUCCESS": " The Test Succeeded"
},
"name": "TestResultType",
"namespace": "com.linkedin.test",
"symbols": [
"SUCCESS",
"FAILURE"
]
},
"name": "type",
"doc": "The type of the result"
},
{
"type": [
"null",
"string"
],
"name": "testDefinitionMd5",
"default": null,
"doc": "The md5 of the test definition that was used to compute this result.\nSee TestInfo.testDefinition.md5 for more information."
},
{
"type": [
"null",
{
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
}
],
"name": "lastComputed",
"default": null,
"doc": "The audit stamp of when the result was computed, including the actor who computed it."
}
],
"doc": "Information about a Test Result"
}
},
"name": "failing",
"doc": "Results that are failing"
},
{
"Relationship": {
"/*/test": {
"entityTypes": [
"test"
],
"name": "IsPassing"
}
},
"Searchable": {
"/*/test": {
"fieldName": "passingTests",
"fieldType": "URN",
"hasValuesFieldName": "hasPassingTests",
"queryByDefault": false
}
},
"type": {
"type": "array",
"items": "com.linkedin.test.TestResult"
},
"name": "passing",
"doc": "Results that are passing"
}
],
"doc": "Information about a Test Result"
}
subTypes
Sub Types. Use this aspect to specialize a generic Entity e.g. Making a Dataset also be a View or also be a LookerExplore
- Fields
- Raw Schema
| Field | Type | Required | Description | Annotations |
|---|---|---|---|---|
| typeNames | string[] | ✓ | The names of the specific types. | Searchable |
{
"type": "record",
"Aspect": {
"name": "subTypes"
},
"name": "SubTypes",
"namespace": "com.linkedin.common",
"fields": [
{
"Searchable": {
"/*": {
"addToFilters": true,
"fieldType": "KEYWORD",
"filterNameOverride": "Sub Type",
"queryByDefault": false
}
},
"type": {
"type": "array",
"items": "string"
},
"name": "typeNames",
"doc": "The names of the specific types."
}
],
"doc": "Sub Types. Use this aspect to specialize a generic Entity\ne.g. Making a Dataset also be a View or also be a LookerExplore"
}
Common Types
These types are used across multiple aspects in this entity.
AuditStamp
Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage.
Fields:
- `time` (long): When did the resource/association/sub-resource move into the specific lifecyc...
- `actor` (string): The entity (e.g. a member URN) which will be credited for moving the resource...
- `impersonator` (string?): The entity (e.g. a service URN) which performs the change on behalf of the Ac...
- `message` (string?): Additional context around how DataHub was informed of the particular change. ...
TestResult
Information about a Test Result
Fields:
- `test` (string): The urn of the test
- `type` (TestResultType): The type of the result
- `testDefinitionMd5` (string?): The md5 of the test definition that was used to compute this result. See Test...
- `lastComputed` (AuditStamp?): The audit stamp of when the result was computed, including the actor who comp...
Relationships
Outgoing
These are the relationships stored in this entity's aspects
OwnedBy
- Corpuser via `ownership.owners.owner`
- CorpGroup via `ownership.owners.owner`
ownershipType
- OwnershipType via `ownership.owners.typeUrn`
Consumes
- Dataset via `dataProcessInfo.inputs`
- Dataset via `dataProcessInfo.outputs`
IsFailing
- Test via `testResults.failing`
IsPassing
- Test via `testResults.passing`
Global Metadata Model
