
DataHubGc

Overview

DataHub GC is a maintenance utility that runs against DataHub itself. Learn more in the official DataHub GC documentation.

Unlike most sources, this connector does not ingest metadata from an external system. It operates on DataHub's own metadata: truncating old timeseries indices, revoking expired tokens, and removing stale data process instances, execution requests, and soft-deleted entities.

Concept Mapping

A connector-specific concept mapping has not been documented yet; the table below shows the generic DataHub concept mapping.

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Platform/account/project scope | Platform Instance, Container | Organizes assets within the platform context. |
| Core technical asset (for example, table/view/topic/file) | Dataset | Primary ingested technical asset. |
| Schema fields / columns | SchemaField | Included when schema extraction is supported. |
| Ownership and collaboration principals | CorpUser, CorpGroup | Emitted by modules that support ownership and identity metadata. |
| Dependencies and processing relationships | Lineage edges | Available when lineage extraction is supported and enabled. |

Module datahub-gc

Status: Testing

Important Capabilities

Capability metadata is not explicitly declared for this module. Refer to module documentation and configuration sections below.

Overview

The DataHub Garbage Collection (GC) source is a maintenance component responsible for cleaning up various types of metadata to maintain system performance and data quality. It performs multiple cleanup tasks, each focusing on different aspects of DataHub's metadata.

Prerequisites

Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.

Install the Plugin

pip install 'acryl-datahub[datahub-gc]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: datahub-gc
  config:
    dry_run: false
    cleanup_expired_tokens: true
    truncate_indices: true
    dataprocess_cleanup:
      retention_days: 10
      delete_empty_data_jobs: true
      delete_empty_data_flows: true
      hard_delete_entities: false
      keep_last_n: 5
```

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
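For instance, a dotted field like `dataprocess_cleanup.retention_days` corresponds to this nesting in the recipe:

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      retention_days: 10
```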

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| cleanup_expired_tokens | boolean | Whether to clean up expired tokens | True |
| dry_run | boolean | Whether to perform a dry run. Only supported for data process cleanup and soft-deleted entities cleanup. | False |
| truncate_index_older_than_days | integer | Indices older than this number of days will be truncated | 30 |
| truncate_indices | boolean | Whether to truncate Elasticsearch indices that can be safely truncated | True |
| truncation_sleep_between_seconds | integer | Sleep between truncation monitoring checks | 30 |
| truncation_watch_until | integer | Wait for truncation of indices until this number of documents are left | 10000 |
| dataprocess_cleanup | DataProcessCleanupConfig | | |
| dataprocess_cleanup.batch_size | integer | The number of entities to fetch in a batch from the API | 500 |
| dataprocess_cleanup.delay | number or null | Delay between each batch | 0.25 |
| dataprocess_cleanup.delete_empty_data_flows | boolean | Whether to delete Data Flows without runs | False |
| dataprocess_cleanup.delete_empty_data_jobs | boolean | Whether to delete Data Jobs without runs | False |
| dataprocess_cleanup.enabled | boolean | Whether to do data process cleanup | True |
| dataprocess_cleanup.hard_delete_entities | boolean | Whether to hard delete entities | False |
| dataprocess_cleanup.keep_last_n | integer or null | Number of latest aspects to keep | 5 |
| dataprocess_cleanup.max_workers | integer | The number of workers to use for deletion | 10 |
| dataprocess_cleanup.retention_days | integer or null | Number of days to retain metadata in DataHub | 10 |
| dataprocess_cleanup.aspects_to_clean | array | List of aspect names to clean up | ['DataprocessInstance'] |
| dataprocess_cleanup.aspects_to_clean.string | string | | |
| execution_request_cleanup | DatahubExecutionRequestCleanupConfig | | |
| execution_request_cleanup.batch_read_size | integer | Number of records per read operation | 100 |
| execution_request_cleanup.enabled | boolean | Global switch for this cleanup task | True |
| execution_request_cleanup.keep_history_max_count | integer | Maximum number of execution requests to keep, per ingestion source | 1000 |
| execution_request_cleanup.keep_history_max_days | integer | Maximum number of days to keep execution requests for, per ingestion source | 90 |
| execution_request_cleanup.keep_history_min_count | integer | Minimum number of execution requests to keep, per ingestion source | 10 |
| execution_request_cleanup.limit_entities_delete | integer or null | Max number of execution requests to hard delete | 10000 |
| execution_request_cleanup.max_read_errors | integer | Maximum number of read errors before aborting | 10 |
| execution_request_cleanup.runtime_limit_seconds | integer | Maximum runtime in seconds for the cleanup task | 3600 |
| soft_deleted_entities_cleanup | SoftDeletedEntitiesCleanupConfig | | |
| soft_deleted_entities_cleanup.batch_size | integer | The number of entities to fetch in a batch from GraphQL | 500 |
| soft_deleted_entities_cleanup.delay | number or null | Delay between each batch | 0.25 |
| soft_deleted_entities_cleanup.enabled | boolean | Whether to do soft-deletion cleanup | True |
| soft_deleted_entities_cleanup.futures_max_at_time | integer | Max number of futures to have in flight at a time | 1000 |
| soft_deleted_entities_cleanup.limit_entities_delete | integer or null | Max number of entities to delete | 25000 |
| soft_deleted_entities_cleanup.max_workers | integer | The number of workers to use for deletion | 10 |
| soft_deleted_entities_cleanup.platform | string or null | Platform to clean up | None |
| soft_deleted_entities_cleanup.query | string or null | Query to filter entities | None |
| soft_deleted_entities_cleanup.retention_days | integer | Number of days to retain metadata in DataHub | 10 |
| soft_deleted_entities_cleanup.runtime_limit_seconds | integer | Runtime limit in seconds | 7200 |
| soft_deleted_entities_cleanup.env | string or null | Environment to clean up | None |
| soft_deleted_entities_cleanup.entity_types | array or null | List of entity types to clean up | ['dataset', 'dashboard', 'chart', 'mlmodel', 'mlmo... |
| soft_deleted_entities_cleanup.entity_types.string | string | | |

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Index Cleanup

Manages Elasticsearch indices in DataHub, particularly focusing on time-series data.

Configuration

```yaml
source:
  type: datahub-gc
  config:
    truncate_indices: true
    truncate_index_older_than_days: 30
    truncation_watch_until: 10000
    truncation_sleep_between_seconds: 30
```

Features
  • Truncates old Elasticsearch indices for the following timeseries aspects:
    • DatasetOperations
    • DatasetUsageStatistics
    • ChartUsageStatistics
    • DashboardUsageStatistics
    • QueryUsageStatistics
    • Other timeseries aspects
  • Monitors truncation progress
  • Implements safe deletion with monitoring thresholds
  • Supports gradual truncation with sleep intervals
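The watch-and-sleep behavior above can be pictured with a small sketch; `get_doc_count` is a hypothetical stand-in for querying the index's remaining document count, and this illustrates the documented knobs rather than the actual implementation:

```python
import time

def watch_truncation(get_doc_count, watch_until=10_000,
                     sleep_between_s=30, max_checks=100):
    """Poll the remaining document count of an index being truncated.

    Mirrors truncation_watch_until / truncation_sleep_between_seconds:
    truncation is considered settled once at most `watch_until` documents
    remain in the index.
    """
    for _ in range(max_checks):
        remaining = get_doc_count()
        if remaining <= watch_until:
            return remaining  # safe to move on to the next index
        time.sleep(sleep_between_s)
    raise TimeoutError("index truncation did not settle in time")

# Simulated counts shrinking as documents are deleted.
counts = iter([50_000, 20_000, 8_000])
print(watch_truncation(lambda: next(counts), sleep_between_s=0))  # prints 8000
```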

Expired Token Cleanup

Manages access tokens in DataHub to maintain security and prevent token accumulation.

Configuration

```yaml
source:
  type: datahub-gc
  config:
    cleanup_expired_tokens: true
```

Features
  • Automatically identifies and revokes expired access tokens
  • Processes tokens in batches for efficiency
  • Maintains system security by removing outdated credentials
  • Reports number of tokens revoked
  • Uses GraphQL API for token management
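As an illustration of the expiry check (the real task revokes tokens through DataHub's GraphQL API; the token shape used here is hypothetical):

```python
import time

def expired_tokens(tokens, now_ms=None):
    """Return the IDs of tokens whose expiry timestamp (epoch millis) has
    passed. Sketch only; the actual cleanup revokes these via GraphQL.
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    return [t["id"] for t in tokens
            if t["expiresAt"] is not None and t["expiresAt"] < now_ms]

tokens = [
    {"id": "t1", "expiresAt": 1_000},
    {"id": "t2", "expiresAt": None},   # non-expiring token: never revoked here
    {"id": "t3", "expiresAt": 5_000},
]
print(expired_tokens(tokens, now_ms=2_000))  # prints ['t1']
```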

Data Process Cleanup

Manages the lifecycle of data processes, jobs, and their instances (DPIs) within DataHub.

Features
  • Cleans up Data Process Instances (DPIs) based on age and count
  • Can remove empty DataJobs and DataFlows
  • Supports both soft and hard deletion
  • Uses parallel processing for efficient cleanup
  • Maintains configurable retention policies
Configuration

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      enabled: true
      retention_days: 10
      keep_last_n: 5
      delete_empty_data_jobs: false
      delete_empty_data_flows: false
      hard_delete_entities: false
      batch_size: 500
      max_workers: 10
      delay: 0.25
```

Limitations
  • Processes at most 9,000 DPIs per job, for performance reasons
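How `retention_days` and `keep_last_n` interact can be sketched roughly as follows (an illustration of the documented knobs, not the exact source logic):

```python
from datetime import datetime, timedelta, timezone

def select_dpis_to_delete(run_times, keep_last_n=5, retention_days=10, now=None):
    """Pick Data Process Instances to remove: always keep the newest
    `keep_last_n`, and of the rest delete anything older than
    `retention_days`.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(run_times, reverse=True)
    return [t for t in newest_first[keep_last_n:] if t < cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
runs = [now - timedelta(days=d) for d in (1, 2, 3, 20, 40)]
# Keeps the three newest runs; the 20- and 40-day-old ones are past retention.
print(len(select_dpis_to_delete(runs, keep_last_n=3, retention_days=10, now=now)))  # prints 2
```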

Execution Request Cleanup

Manages DataHub execution request records to prevent accumulation of historical execution data.

Features
  • Maintains execution history per ingestion source
  • Preserves minimum number of recent requests
  • Removes old requests beyond retention period
  • Special handling for running/pending requests
  • Automatic cleanup of corrupted records
Configuration

```yaml
source:
  type: datahub-gc
  config:
    execution_request_cleanup:
      enabled: true
      keep_history_min_count: 10
      keep_history_max_count: 1000
      keep_history_max_days: 30
      batch_read_size: 100
      runtime_limit_seconds: 3600
      max_read_errors: 10
```
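The interplay of the three `keep_history_*` settings can be sketched as a per-request decision, with requests ordered newest-first (an illustration, not the exact source logic):

```python
def should_remove(request_index, request_age_days,
                  keep_history_min_count=10,
                  keep_history_max_count=1000,
                  keep_history_max_days=90):
    """Decide whether one execution request should be removed:
      - the newest `keep_history_min_count` requests are always kept;
      - anything past `keep_history_max_count` is removed;
      - otherwise, requests older than `keep_history_max_days` are removed.
    Index 0 is the most recent request for the ingestion source.
    """
    if request_index < keep_history_min_count:
        return False
    if request_index >= keep_history_max_count:
        return True
    return request_age_days > keep_history_max_days

print(should_remove(3, 200))   # prints False: within the always-keep window
print(should_remove(50, 120))  # prints True: older than keep_history_max_days
```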

Soft-Deleted Entities Cleanup

Manages the permanent removal of soft-deleted entities after a retention period.

Features
  • Permanently removes soft-deleted entities after retention period
  • Handles entity references cleanup
  • Special handling for query entities
  • Supports filtering by entity type, platform, or environment
  • Concurrent processing with safety limits
Configuration

```yaml
source:
  type: datahub-gc
  config:
    soft_deleted_entities_cleanup:
      enabled: true
      retention_days: 10
      batch_size: 500
      max_workers: 10
      delay: 0.25
      entity_types: null # Optional list of entity types to clean
      platform: null # Optional platform filter
      env: null # Optional environment filter
      query: null # Optional custom query filter
      limit_entities_delete: 25000
      futures_max_at_time: 1000
      runtime_limit_seconds: 7200
```

Performance Considerations
  • Concurrent processing using thread pools
  • Configurable batch sizes for optimal performance
  • Rate limiting through configurable delays
  • Maximum limits on concurrent operations
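The batching, rate limiting, and safety limit can be pictured with a small sketch; `delete_batch` is a hypothetical callable standing in for the actual hard-delete requests:

```python
import time

def hard_delete_soft_deleted(urns, delete_batch, batch_size=500,
                             delay_s=0.25, limit_entities_delete=25_000):
    """Walk soft-deleted URNs in batches, pausing between batches and
    stopping once the safety limit is reached. Illustration of the
    documented knobs, not the exact source logic.
    """
    deleted = 0
    for start in range(0, len(urns), batch_size):
        batch = urns[start:start + batch_size]
        if deleted + len(batch) > limit_entities_delete:
            batch = batch[:limit_entities_delete - deleted]
        delete_batch(batch)
        deleted += len(batch)
        if deleted >= limit_entities_delete:
            break  # safety limit reached; stop for this run
        time.sleep(delay_s)  # rate limiting between batches
    return deleted

calls = []
n = hard_delete_soft_deleted([f"urn:{i}" for i in range(1200)],
                             calls.append, batch_size=500,
                             delay_s=0, limit_entities_delete=1_000)
print(n, [len(b) for b in calls])  # prints 1000 [500, 500]
```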

Reporting

Each cleanup task maintains detailed reports including:

  • Number of entities processed
  • Number of entities removed
  • Errors encountered
  • Sample of affected entities
  • Runtime statistics
  • Task-specific metrics

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Code Coordinates

  • Class Name: datahub.ingestion.source.gc.datahub_gc.DataHubGcSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for DataHubGc, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.