
DataHubGc

Overview

DataHub GC is a maintenance utility that runs against DataHub itself. Learn more in the official DataHub GC documentation.

Unlike most sources, this connector does not ingest metadata from an external system. It operates on DataHub's own metadata: truncating old timeseries indices, revoking expired tokens, and removing stale data process instances, execution requests, and soft-deleted entities.

Concept Mapping

A connector-specific concept mapping has not been documented yet; the table below shows the generic DataHub concept mapping.

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Platform/account/project scope | Platform Instance, Container | Organizes assets within the platform context. |
| Core technical asset (for example, table/view/topic/file) | Dataset | Primary ingested technical asset. |
| Schema fields / columns | SchemaField | Included when schema extraction is supported. |
| Ownership and collaboration principals | CorpUser, CorpGroup | Emitted by modules that support ownership and identity metadata. |
| Dependencies and processing relationships | Lineage edges | Available when lineage extraction is supported and enabled. |

Module datahub-gc

Status: Testing

Important Capabilities

Capability metadata is not explicitly declared for this module. Refer to module documentation and configuration sections below.

Overview

The DataHub Garbage Collection (GC) source is a maintenance component responsible for cleaning up various types of metadata to maintain system performance and data quality. It performs multiple cleanup tasks, each focusing on different aspects of DataHub's metadata.

Prerequisites

Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.

Install the Plugin

pip install 'acryl-datahub[datahub-gc]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: datahub-gc
  config:
    dry_run: false
    cleanup_expired_tokens: true
    truncate_indices: true
    dataprocess_cleanup:
      retention_days: 10
      delete_empty_data_jobs: true
      delete_empty_data_flows: true
      hard_delete_entities: false
      keep_last_n: 5
```

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
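For instance, a dotted field like `dataprocess_cleanup.retention_days` corresponds to this nesting in the recipe:

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      retention_days: 10
```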

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| cleanup_expired_tokens | boolean | Whether to clean up expired tokens | True |
| dry_run | boolean | Whether to perform a dry run. Only supported for data process cleanup and soft-deleted entities cleanup. | False |
| truncate_index_older_than_days | integer | Indices older than this number of days will be truncated | 30 |
| truncate_indices | boolean | Whether to truncate Elasticsearch indices that can be safely truncated | True |
| truncation_sleep_between_seconds | integer | Sleep between truncation monitoring checks | 30 |
| truncation_watch_until | integer | Wait for truncation of indices until this number of documents are left | 10000 |
| dataprocess_cleanup | DataProcessCleanupConfig | | |
| dataprocess_cleanup.batch_size | integer | The number of entities to fetch in a batch from the API | 500 |
| dataprocess_cleanup.delay | number or null | Delay between each batch | 0.25 |
| dataprocess_cleanup.delete_empty_data_flows | boolean | Whether to delete Data Flows without runs | False |
| dataprocess_cleanup.delete_empty_data_jobs | boolean | Whether to delete Data Jobs without runs | False |
| dataprocess_cleanup.enabled | boolean | Whether to do data process cleanup | True |
| dataprocess_cleanup.hard_delete_entities | boolean | Whether to hard delete entities | False |
| dataprocess_cleanup.keep_last_n | integer or null | Number of latest aspects to keep | 5 |
| dataprocess_cleanup.max_workers | integer | The number of workers to use for deletion | 10 |
| dataprocess_cleanup.retention_days | integer or null | Number of days to retain metadata in DataHub | 10 |
| dataprocess_cleanup.aspects_to_clean | array | List of aspect names to clean up | ['DataprocessInstance'] |
| dataprocess_cleanup.aspects_to_clean.string | string | | |
| execution_request_cleanup | DatahubExecutionRequestCleanupConfig | | |
| execution_request_cleanup.batch_read_size | integer | Number of records per read operation | 100 |
| execution_request_cleanup.enabled | boolean | Global switch for this cleanup task | True |
| execution_request_cleanup.keep_history_max_count | integer | Maximum number of execution requests to keep, per ingestion source | 1000 |
| execution_request_cleanup.keep_history_max_days | integer | Maximum number of days to keep execution requests for, per ingestion source | 90 |
| execution_request_cleanup.keep_history_min_count | integer | Minimum number of execution requests to keep, per ingestion source | 10 |
| execution_request_cleanup.limit_entities_delete | integer or null | Max number of execution requests to hard delete | 10000 |
| execution_request_cleanup.max_read_errors | integer | Maximum number of read errors before aborting | 10 |
| execution_request_cleanup.runtime_limit_seconds | integer | Maximum runtime in seconds for the cleanup task | 3600 |
| soft_deleted_entities_cleanup | SoftDeletedEntitiesCleanupConfig | | |
| soft_deleted_entities_cleanup.batch_size | integer | The number of entities to fetch in a batch from GraphQL | 500 |
| soft_deleted_entities_cleanup.delay | number or null | Delay between each batch | 0.25 |
| soft_deleted_entities_cleanup.enabled | boolean | Whether to do soft-deletion cleanup | True |
| soft_deleted_entities_cleanup.futures_max_at_time | integer | Max number of futures to have in flight at a time | 1000 |
| soft_deleted_entities_cleanup.limit_entities_delete | integer or null | Max number of entities to delete | 25000 |
| soft_deleted_entities_cleanup.max_workers | integer | The number of workers to use for deletion | 10 |
| soft_deleted_entities_cleanup.platform | string or null | Platform to clean up | None |
| soft_deleted_entities_cleanup.query | string or null | Query to filter entities | None |
| soft_deleted_entities_cleanup.retention_days | integer | Number of days to retain metadata in DataHub | 10 |
| soft_deleted_entities_cleanup.runtime_limit_seconds | integer | Runtime limit in seconds | 7200 |
| soft_deleted_entities_cleanup.env | string or null | Environment to clean up | None |
| soft_deleted_entities_cleanup.entity_types | array or null | List of entity types to clean up | ['dataset', 'dashboard', 'chart', 'mlmodel', 'mlmo... |
| soft_deleted_entities_cleanup.entity_types.string | string | | |

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Index Cleanup

Manages Elasticsearch indices in DataHub, particularly focusing on time-series data.

Configuration

```yaml
source:
  type: datahub-gc
  config:
    truncate_indices: true
    truncate_index_older_than_days: 30
    truncation_watch_until: 10000
    truncation_sleep_between_seconds: 30
```

Features
  • Truncates old Elasticsearch indices for the following timeseries aspects:
    • DatasetOperations
    • DatasetUsageStatistics
    • ChartUsageStatistics
    • DashboardUsageStatistics
    • QueryUsageStatistics
    • Other timeseries aspects
  • Monitors truncation progress
  • Implements safe deletion with monitoring thresholds
  • Supports gradual truncation with sleep intervals
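The watch-and-sleep behavior above can be pictured with a small sketch; `get_doc_count` is a hypothetical stand-in for querying the index's remaining document count, and this illustrates the documented knobs rather than the actual implementation:

```python
import time

def watch_truncation(get_doc_count, watch_until=10_000,
                     sleep_between_s=30, max_checks=100):
    """Poll the remaining document count of an index being truncated.

    Mirrors truncation_watch_until / truncation_sleep_between_seconds:
    truncation is considered settled once at most `watch_until` documents
    remain in the index.
    """
    for _ in range(max_checks):
        remaining = get_doc_count()
        if remaining <= watch_until:
            return remaining  # safe to move on to the next index
        time.sleep(sleep_between_s)
    raise TimeoutError("index truncation did not settle in time")

# Simulated counts shrinking as documents are deleted.
counts = iter([50_000, 20_000, 8_000])
print(watch_truncation(lambda: next(counts), sleep_between_s=0))  # prints 8000
```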

Expired Token Cleanup

Manages access tokens in DataHub to maintain security and prevent token accumulation.

Configuration

```yaml
source:
  type: datahub-gc
  config:
    cleanup_expired_tokens: true
```

Features
  • Automatically identifies and revokes expired access tokens
  • Processes tokens in batches for efficiency
  • Maintains system security by removing outdated credentials
  • Reports number of tokens revoked
  • Uses GraphQL API for token management
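As an illustration of the expiry check (the real task revokes tokens through DataHub's GraphQL API; the token shape used here is hypothetical):

```python
import time

def expired_tokens(tokens, now_ms=None):
    """Return the IDs of tokens whose expiry timestamp (epoch millis) has
    passed. Sketch only; the actual cleanup revokes these via GraphQL.
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    return [t["id"] for t in tokens
            if t["expiresAt"] is not None and t["expiresAt"] < now_ms]

tokens = [
    {"id": "t1", "expiresAt": 1_000},
    {"id": "t2", "expiresAt": None},   # non-expiring token: never revoked here
    {"id": "t3", "expiresAt": 5_000},
]
print(expired_tokens(tokens, now_ms=2_000))  # prints ['t1']
```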

Data Process Cleanup

Manages the lifecycle of data processes, jobs, and their instances (DPIs) within DataHub.

Features
  • Cleans up Data Process Instances (DPIs) based on age and count
  • Can remove empty DataJobs and DataFlows
  • Supports both soft and hard deletion
  • Uses parallel processing for efficient cleanup
  • Maintains configurable retention policies
Configuration

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      enabled: true
      retention_days: 10
      keep_last_n: 5
      delete_empty_data_jobs: false
      delete_empty_data_flows: false
      hard_delete_entities: false
      batch_size: 500
      max_workers: 10
      delay: 0.25
```

Limitations
  • Processes at most 9,000 DPIs per job, for performance reasons
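How `retention_days` and `keep_last_n` interact can be sketched roughly as follows (an illustration of the documented knobs, not the exact source logic):

```python
from datetime import datetime, timedelta, timezone

def select_dpis_to_delete(run_times, keep_last_n=5, retention_days=10, now=None):
    """Pick Data Process Instances to remove: always keep the newest
    `keep_last_n`, and of the rest delete anything older than
    `retention_days`.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(run_times, reverse=True)
    return [t for t in newest_first[keep_last_n:] if t < cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
runs = [now - timedelta(days=d) for d in (1, 2, 3, 20, 40)]
# Keeps the three newest runs; the 20- and 40-day-old ones are past retention.
print(len(select_dpis_to_delete(runs, keep_last_n=3, retention_days=10, now=now)))  # prints 2
```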

Execution Request Cleanup

Manages DataHub execution request records to prevent accumulation of historical execution data.

Features
  • Maintains execution history per ingestion source
  • Preserves minimum number of recent requests
  • Removes old requests beyond retention period
  • Special handling for running/pending requests
  • Automatic cleanup of corrupted records
Configuration

```yaml
source:
  type: datahub-gc
  config:
    execution_request_cleanup:
      enabled: true
      keep_history_min_count: 10
      keep_history_max_count: 1000
      keep_history_max_days: 30
      batch_read_size: 100
      runtime_limit_seconds: 3600
      max_read_errors: 10
```
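The interplay of the three `keep_history_*` settings can be sketched as a per-request decision, with requests ordered newest-first (an illustration, not the exact source logic):

```python
def should_remove(request_index, request_age_days,
                  keep_history_min_count=10,
                  keep_history_max_count=1000,
                  keep_history_max_days=90):
    """Decide whether one execution request should be removed:
      - the newest `keep_history_min_count` requests are always kept;
      - anything past `keep_history_max_count` is removed;
      - otherwise, requests older than `keep_history_max_days` are removed.
    Index 0 is the most recent request for the ingestion source.
    """
    if request_index < keep_history_min_count:
        return False
    if request_index >= keep_history_max_count:
        return True
    return request_age_days > keep_history_max_days

print(should_remove(3, 200))   # prints False: within the always-keep window
print(should_remove(50, 120))  # prints True: older than keep_history_max_days
```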

Soft-Deleted Entities Cleanup

Manages the permanent removal of soft-deleted entities after a retention period.

Features
  • Permanently removes soft-deleted entities after retention period
  • Handles entity references cleanup
  • Special handling for query entities
  • Supports filtering by entity type, platform, or environment
  • Concurrent processing with safety limits
Configuration

```yaml
source:
  type: datahub-gc
  config:
    soft_deleted_entities_cleanup:
      enabled: true
      retention_days: 10
      batch_size: 500
      max_workers: 10
      delay: 0.25
      entity_types: null # Optional list of entity types to clean
      platform: null # Optional platform filter
      env: null # Optional environment filter
      query: null # Optional custom query filter
      limit_entities_delete: 25000
      futures_max_at_time: 1000
      runtime_limit_seconds: 7200
```

Performance Considerations
  • Concurrent processing using thread pools
  • Configurable batch sizes for optimal performance
  • Rate limiting through configurable delays
  • Maximum limits on concurrent operations
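The batching, rate limiting, and safety limit can be pictured with a small sketch; `delete_batch` is a hypothetical callable standing in for the actual hard-delete requests:

```python
import time

def hard_delete_soft_deleted(urns, delete_batch, batch_size=500,
                             delay_s=0.25, limit_entities_delete=25_000):
    """Walk soft-deleted URNs in batches, pausing between batches and
    stopping once the safety limit is reached. Illustration of the
    documented knobs, not the exact source logic.
    """
    deleted = 0
    for start in range(0, len(urns), batch_size):
        batch = urns[start:start + batch_size]
        if deleted + len(batch) > limit_entities_delete:
            batch = batch[:limit_entities_delete - deleted]
        delete_batch(batch)
        deleted += len(batch)
        if deleted >= limit_entities_delete:
            break  # safety limit reached; stop for this run
        time.sleep(delay_s)  # rate limiting between batches
    return deleted

calls = []
n = hard_delete_soft_deleted([f"urn:{i}" for i in range(1200)],
                             calls.append, batch_size=500,
                             delay_s=0, limit_entities_delete=1_000)
print(n, [len(b) for b in calls])  # prints 1000 [500, 500]
```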

Reporting

Each cleanup task maintains detailed reports including:

  • Number of entities processed
  • Number of entities removed
  • Errors encountered
  • Sample of affected entities
  • Runtime statistics
  • Task-specific metrics

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Code Coordinates

  • Class Name: datahub.ingestion.source.gc.datahub_gc.DataHubGcSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for DataHubGc, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.