# DataHubDocuments

## Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
This source extracts Document entities from DataHub and generates semantic embeddings.
It supports:
- Batch mode: Fetches documents via GraphQL
- Event-driven mode: Processes documents in real-time from Kafka MCL events (recommended)
- Incremental processing: Only reprocesses documents when content changes
- Smart defaults: Auto-configures connection, chunking, and embeddings from server
The minimal configuration is just `config: {}` when the DATAHUB_GMS_URL and
DATAHUB_GMS_TOKEN environment variables are set; the source then automatically
aligns with your server's semantic search configuration.
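With that in mind, the two environment variables can be exported before running the CLI. The token value below is a placeholder:

```shell
# Point the source at your DataHub GMS instance; the URL here matches
# the datahub.server default documented below.
export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="<your-token>"
```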
Prerequisites: Before using this source, configure semantic search on your DataHub server. See the Semantic Search Configuration Guide for setup instructions.
## CLI based Ingestion

## Config Details
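Like other DataHub sources, this one runs from a recipe file. A minimal recipe sketch is shown below; note that the source `type` string `datahub-documents` is an assumption inferred from this page's title, so verify it against your installed CLI's plugin list:

```yaml
# recipe.yaml -- minimal sketch; the `type` string is assumed and may
# differ in your datahub CLI version.
source:
  type: datahub-documents
  config: {}  # connection, chunking, and embeddings auto-configure from the server
```

With DATAHUB_GMS_URL and DATAHUB_GMS_TOKEN set, run it with the standard CLI entry point: `datahub ingest -c recipe.yaml`.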
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
| `min_text_length` (integer) | Minimum text length in characters to process; shorter documents are skipped. Default: 50 |
| `partition_strategy` (string) | Text partitioning strategy. Currently only 'markdown' is supported. This field is included in the document hash so that changing the strategy triggers reprocessing. Default: markdown |
| `skip_empty_text` (boolean) | Skip documents with no text content. Default: True |
| `chunking` (ChunkingConfig) | Chunking strategy configuration. |
| `chunking.combine_text_under_n_chars` (integer) | Combine chunks smaller than this size. Default: 100 |
| `chunking.max_characters` (integer) | Maximum characters per chunk. Default: 500 |
| `chunking.overlap` (integer) | Character overlap between chunks. Default: 0 |
| `chunking.strategy` (enum) | One of: "basic", "by_title". Default: by_title |
| `datahub` (DataHubConnectionConfig) | DataHub connection configuration. |
| `datahub.server` (string) | DataHub GMS server URL. Default: http://localhost:8080 |
| `datahub.token` (string or null) | DataHub API token for authentication. Default: None |
| `document_urns` (array or null) | Specific document URNs to process (if None, all documents matching the platform filter are processed). Default: None |
| `document_urns.string` (string) | |
| `embedding` (EmbeddingConfig) | Embedding generation configuration. Default behavior: fetches configuration from the DataHub server automatically. Override behavior: validates local config against the server when explicitly set. |
| `embedding.allow_local_embedding_config` (boolean) | BREAK-GLASS: allow local config without server validation. NOT RECOMMENDED; may break semantic search. Default: False |
| `embedding.api_key` (string or null) | API key for Cohere (not needed for Bedrock with IAM roles). Default: None |
| `embedding.aws_region` (string or null) | AWS region for Bedrock. If not set, loads from the server. Default: None |
| `embedding.batch_size` (integer) | Batch size for embedding API calls. Default: 25 |
| `embedding.input_type` (string or null) | Input type for Cohere embeddings. Default: search_document |
| `embedding.model` (string or null) | Model name. If not set, loads from the server. Default: None |
| `embedding.model_embedding_key` (string or null) | Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding the server config; if not set, loads from the server. Default: None |
| `embedding.provider` (enum or null) | Embedding provider (bedrock uses AWS, cohere uses an API key). If not set, loads from the server. Default: None |
| `event_mode` (EventModeConfig) | Event-driven mode configuration. |
| `event_mode.consumer_id` (string or null) | Consumer ID for offset tracking (defaults to 'datahub-documents-{pipeline_name}'). Default: None |
| `event_mode.enabled` (boolean) | Enable event-driven mode (polls MCL events instead of the GraphQL batch). Default: False |
| `event_mode.idle_timeout_seconds` (integer) | Exit after this many seconds with no new events (incremental batch mode). Default: 30 |
| `event_mode.lookback_days` (integer or null) | Number of days to look back for events on the first run (None means start from latest). Default: None |
| `event_mode.poll_limit` (integer) | Maximum number of events to fetch per poll. Default: 100 |
| `event_mode.poll_timeout_seconds` (integer) | Timeout in seconds for each poll request. Default: 2 |
| `event_mode.reset_offsets` (boolean) | Reset consumer offsets to start from the beginning. Default: False |
| `event_mode.topics` (array) | Topics to consume for document changes. Default: ['MetadataChangeLog_Versioned_v1'] |
| `event_mode.topics.string` (string) | |
| `incremental` (IncrementalConfig) | Incremental processing configuration. |
| `incremental.enabled` (boolean) | Only process documents whose text content has changed (tracks a content hash). Uses stateful ingestion when enabled; the deprecated state_file_path option is ignored in that case. Default: True |
| `incremental.force_reprocess` (boolean) | Force reprocessing of all documents regardless of content hash. Default: False |
| `incremental.state_file_path` (string or null) | [DEPRECATED] Path to a state file. Ignored when stateful ingestion is enabled; state is now managed through DataHub's stateful ingestion framework. Default: None |
| `platform_filter` (array or null) | Filter documents by platform. Default (None): process all NATIVE documents (sourceType=NATIVE) regardless of platform. To also include EXTERNAL documents from specific platforms, list them here (e.g., ['notion', 'confluence']); NATIVE documents plus EXTERNAL documents from those platforms are then processed. Use ['*'] or ['ALL'] to process all documents regardless of source type or platform. Default: None |
| `platform_filter.string` (string) | |
| `stateful_ingestion` (DocumentChunkingStatefulIngestionConfig) | Configuration for document chunking stateful ingestion. |
| `stateful_ingestion.enabled` (boolean) | Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False. |
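Putting several of these fields together, an override-style recipe might look like the following. This is a sketch using the field names documented above with illustrative values; the source `type` string is assumed from this page's title:

```yaml
source:
  type: datahub-documents   # assumed type string; verify against your CLI version
  config:
    platform_filter: ["notion", "confluence"]  # NATIVE docs + EXTERNAL docs from these platforms
    chunking:
      strategy: by_title
      max_characters: 500
      overlap: 0
    event_mode:
      enabled: true              # poll Kafka MCL events instead of GraphQL batch
      idle_timeout_seconds: 30   # exit after 30 idle seconds
    incremental:
      enabled: true              # skip documents whose content hash is unchanged
```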
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"ChunkingConfig": {
"additionalProperties": false,
"description": "Chunking strategy configuration.",
"properties": {
"strategy": {
"default": "by_title",
"description": "Chunking strategy to use",
"enum": [
"basic",
"by_title"
],
"title": "Strategy",
"type": "string"
},
"max_characters": {
"default": 500,
"description": "Maximum characters per chunk",
"title": "Max Characters",
"type": "integer"
},
"overlap": {
"default": 0,
"description": "Character overlap between chunks",
"title": "Overlap",
"type": "integer"
},
"combine_text_under_n_chars": {
"default": 100,
"description": "Combine chunks smaller than this size",
"title": "Combine Text Under N Chars",
"type": "integer"
}
},
"title": "ChunkingConfig",
"type": "object"
},
"DataHubConnectionConfig": {
"additionalProperties": false,
"description": "DataHub connection configuration.",
"properties": {
"server": {
"default": "http://localhost:8080",
"description": "DataHub GMS server URL",
"title": "Server",
"type": "string"
},
"token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "DataHub API token for authentication",
"title": "Token"
}
},
"title": "DataHubConnectionConfig",
"type": "object"
},
"DocumentChunkingStatefulIngestionConfig": {
"additionalProperties": false,
"description": "Configuration for document chunking stateful ingestion.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
}
},
"title": "DocumentChunkingStatefulIngestionConfig",
"type": "object"
},
"EmbeddingConfig": {
"additionalProperties": false,
"description": "Embedding generation configuration.\n\nDefault behavior: Fetches configuration from DataHub server automatically.\nOverride behavior: Validates local config against server when explicitly set.",
"properties": {
"provider": {
"anyOf": [
{
"enum": [
"bedrock",
"cohere"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Embedding provider (bedrock uses AWS, cohere uses API key). If not set, loads from server.",
"title": "Provider"
},
"model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Model name. If not set, loads from server.",
"title": "Model"
},
"model_embedding_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.",
"title": "Model Embedding Key"
},
"aws_region": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "AWS region for Bedrock. If not set, loads from server.",
"title": "Aws Region"
},
"api_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "API key for Cohere (not needed for Bedrock with IAM roles)",
"title": "Api Key"
},
"batch_size": {
"default": 25,
"description": "Batch size for embedding API calls",
"title": "Batch Size",
"type": "integer"
},
"input_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "search_document",
"description": "Input type for Cohere embeddings",
"title": "Input Type"
},
"allow_local_embedding_config": {
"default": false,
"description": "BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.",
"title": "Allow Local Embedding Config",
"type": "boolean"
}
},
"title": "EmbeddingConfig",
"type": "object"
},
"EventModeConfig": {
"additionalProperties": false,
"description": "Event-driven mode configuration.",
"properties": {
"enabled": {
"default": false,
"description": "Enable event-driven mode (polls MCL events instead of GraphQL batch)",
"title": "Enabled",
"type": "boolean"
},
"consumer_id": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Consumer ID for offset tracking (defaults to 'datahub-documents-{pipeline_name}')",
"title": "Consumer Id"
},
"topics": {
"default": [
"MetadataChangeLog_Versioned_v1"
],
"description": "Topics to consume for document changes",
"items": {
"type": "string"
},
"title": "Topics",
"type": "array"
},
"lookback_days": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of days to look back for events on first run (None means start from latest)",
"title": "Lookback Days"
},
"reset_offsets": {
"default": false,
"description": "Reset consumer offsets to start from beginning",
"title": "Reset Offsets",
"type": "boolean"
},
"idle_timeout_seconds": {
"default": 30,
"description": "Exit after this many seconds with no new events (incremental batch mode)",
"title": "Idle Timeout Seconds",
"type": "integer"
},
"poll_timeout_seconds": {
"default": 2,
"description": "Timeout for each poll request",
"title": "Poll Timeout Seconds",
"type": "integer"
},
"poll_limit": {
"default": 100,
"description": "Maximum number of events to fetch per poll",
"title": "Poll Limit",
"type": "integer"
}
},
"title": "EventModeConfig",
"type": "object"
},
"IncrementalConfig": {
"additionalProperties": false,
"description": "Incremental processing configuration.",
"properties": {
"enabled": {
"default": true,
"description": "Only process documents whose text content has changed (tracks content hash). Uses stateful ingestion when enabled. The state_file_path option is deprecated and ignored when stateful ingestion is enabled.",
"title": "Enabled",
"type": "boolean"
},
"state_file_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "[DEPRECATED] Path to state file. This option is ignored when stateful ingestion is enabled. State is now managed through DataHub's stateful ingestion framework.",
"title": "State File Path"
},
"force_reprocess": {
"default": false,
"description": "Force reprocess all documents regardless of content hash",
"title": "Force Reprocess",
"type": "boolean"
}
},
"title": "IncrementalConfig",
"type": "object"
}
},
"description": "Configuration for DataHub Documents Source.",
"properties": {
"stateful_ingestion": {
"$ref": "#/$defs/DocumentChunkingStatefulIngestionConfig",
"description": "Stateful ingestion configuration. Enabled by default to support incremental mode (document hash tracking) and event mode (offset tracking)."
},
"datahub": {
"$ref": "#/$defs/DataHubConnectionConfig",
"description": "DataHub connection configuration"
},
"platform_filter": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Filter documents by platforms. Default (None): Process all NATIVE documents (sourceType=NATIVE) regardless of platform. To include external documents from specific platforms, add them here (e.g., ['notion', 'confluence']). This will process NATIVE documents + EXTERNAL documents from the specified platforms. Use ['*'] or ['ALL'] to process all documents regardless of source type or platform.",
"title": "Platform Filter"
},
"document_urns": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Specific document URNs to process (if None, process all matching platforms)",
"title": "Document Urns"
},
"event_mode": {
"$ref": "#/$defs/EventModeConfig",
"description": "Event-driven mode configuration (polls Kafka MCL events)"
},
"incremental": {
"$ref": "#/$defs/IncrementalConfig",
"description": "Incremental processing configuration (skip unchanged documents)"
},
"chunking": {
"$ref": "#/$defs/ChunkingConfig",
"description": "Text chunking strategy configuration"
},
"embedding": {
"$ref": "#/$defs/EmbeddingConfig",
"description": "Embedding generation configuration (LiteLLM with Cohere/Bedrock)"
},
"partition_strategy": {
"const": "markdown",
"default": "markdown",
"description": "Text partitioning strategy. Currently only 'markdown' is supported. This field is included in the document hash to trigger reprocessing if the strategy changes.",
"title": "Partition Strategy",
"type": "string"
},
"skip_empty_text": {
"default": true,
"description": "Skip documents with no text content",
"title": "Skip Empty Text",
"type": "boolean"
},
"min_text_length": {
"default": 50,
"description": "Minimum text length in characters to process (shorter documents are skipped)",
"title": "Min Text Length",
"type": "integer"
}
},
"title": "DataHubDocumentsSourceConfig",
"type": "object"
}
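Incremental mode (`incremental.enabled`, `incremental.force_reprocess` above) keys off a per-document content hash that also folds in the partition strategy. The sketch below illustrates that idea only; the helper names are hypothetical and not the actual implementation:

```python
import hashlib


def content_hash(text: str, partition_strategy: str = "markdown") -> str:
    # Mix the partition strategy into the hash so that changing it
    # triggers reprocessing, mirroring the partition_strategy docs above.
    payload = f"{partition_strategy}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_processing(urn: str, text: str, state: dict, force: bool = False) -> bool:
    # Process a document only when its hash differs from the stored state,
    # unless force_reprocess-style behavior is requested.
    new_hash = content_hash(text)
    if force or state.get(urn) != new_hash:
        state[urn] = new_hash
        return True
    return False
```

On a first run every document is new; on later runs only changed documents (or all of them, with `force=True`) are reprocessed.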
## Code Coordinates

- Class Name: `datahub.ingestion.source.datahub_documents.datahub_documents_source.DataHubDocumentsSource` (Browse on GitHub)
## Questions
If you've got any questions on configuring ingestion for DataHubDocuments, feel free to ping us on our Slack.