DataHubDocuments

Incubating

Important Capabilities

  • Detect Deleted Entities: Enabled by default via stateful ingestion.

This source extracts Document entities from DataHub and generates semantic embeddings.

It supports:

  • Batch mode: Fetches documents via GraphQL
  • Event-driven mode: Processes documents in real-time from Kafka MCL events (recommended)
  • Incremental processing: Only reprocesses documents when content changes
  • Smart defaults: Auto-configures connection, chunking, and embeddings from server

The minimal configuration requires just config: {} when the DATAHUB_GMS_URL and DATAHUB_GMS_TOKEN environment variables are set; the source then automatically aligns with your server's semantic search configuration.

Prerequisites: Before using this source, configure semantic search on your DataHub server. See the Semantic Search Configuration Guide for setup instructions.
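As a sketch, a minimal recipe relying on environment variables might look like the following. The source type name `datahub-documents` is inferred from the class name in Code Coordinates and should be verified against your CLI version:

```yaml
# DATAHUB_GMS_URL and DATAHUB_GMS_TOKEN are read from the environment
source:
  type: datahub-documents   # assumed source type name; verify with your CLI version
  config: {}                # connection, chunking, and embeddings auto-configured from the server
```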

CLI based Ingestion

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
min_text_length
integer
Minimum text length in characters to process (shorter documents are skipped)
Default: 50
partition_strategy
string
Text partitioning strategy. Currently only 'markdown' is supported. This field is included in the document hash to trigger reprocessing if the strategy changes.
Default: markdown
skip_empty_text
boolean
Skip documents with no text content
Default: True
chunking
ChunkingConfig
Chunking strategy configuration.
chunking.combine_text_under_n_chars
integer
Combine chunks smaller than this size
Default: 100
chunking.max_characters
integer
Maximum characters per chunk
Default: 500
chunking.overlap
integer
Character overlap between chunks
Default: 0
chunking.strategy
Enum
One of: "basic", "by_title"
Default: by_title
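The chunking defaults above can be overridden in a recipe; a sketch with illustrative values (the `datahub-documents` source type name is an assumption):

```yaml
source:
  type: datahub-documents   # assumed source type name
  config:
    chunking:
      strategy: by_title            # or "basic"
      max_characters: 500
      combine_text_under_n_chars: 100
      overlap: 50                   # illustrative; default is 0
```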
datahub
DataHubConnectionConfig
DataHub connection configuration.
datahub.server
string
DataHub GMS server URL
Default: http://localhost:8080
datahub.token
One of string, null
DataHub API token for authentication
Default: None
document_urns
One of array, null
Specific document URNs to process (if None, process all matching platforms)
Default: None
document_urns.string
string
embedding
EmbeddingConfig
Embedding generation configuration.

Default behavior: Fetches configuration from DataHub server automatically.
Override behavior: Validates local config against server when explicitly set.
embedding.allow_local_embedding_config
boolean
BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.
Default: False
embedding.api_key
One of string, null
API key for Cohere (not needed for Bedrock with IAM roles)
Default: None
embedding.aws_region
One of string, null
AWS region for Bedrock. If not set, loads from server.
Default: None
embedding.batch_size
integer
Batch size for embedding API calls
Default: 25
embedding.input_type
One of string, null
Input type for Cohere embeddings
Default: search_document
embedding.model
One of string, null
Model name. If not set, loads from server.
Default: None
embedding.model_embedding_key
One of string, null
Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.
Default: None
embedding.provider
One of Enum, null
Embedding provider (bedrock uses AWS, cohere uses API key). If not set, loads from server.
Default: None
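When overriding the server's embedding configuration, explicitly set values are validated against the server unless allow_local_embedding_config is set (not recommended). A hedged sketch for a Cohere setup, with an illustrative model name and the `datahub-documents` source type assumed:

```yaml
source:
  type: datahub-documents   # assumed source type name
  config:
    embedding:
      provider: cohere                    # or "bedrock"
      model: embed-english-v3.0           # illustrative model name
      model_embedding_key: cohere_embed_v3
      api_key: ${COHERE_API_KEY}          # not needed for Bedrock with IAM roles
```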
event_mode
EventModeConfig
Event-driven mode configuration.
event_mode.consumer_id
One of string, null
Consumer ID for offset tracking (defaults to 'datahub-documents-{pipeline_name}')
Default: None
event_mode.enabled
boolean
Enable event-driven mode (polls MCL events instead of GraphQL batch)
Default: False
event_mode.idle_timeout_seconds
integer
Exit after this many seconds with no new events (incremental batch mode)
Default: 30
event_mode.lookback_days
One of integer, null
Number of days to look back for events on first run (None means start from latest)
Default: None
event_mode.poll_limit
integer
Maximum number of events to fetch per poll
Default: 100
event_mode.poll_timeout_seconds
integer
Timeout for each poll request
Default: 2
event_mode.reset_offsets
boolean
Reset consumer offsets to start from beginning
Default: False
event_mode.topics
array
Topics to consume for document changes
Default: ['MetadataChangeLog_Versioned_v1']
event_mode.topics.string
string
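Event-driven mode is enabled through the event_mode block; a sketch with illustrative values (source type name assumed):

```yaml
source:
  type: datahub-documents   # assumed source type name
  config:
    event_mode:
      enabled: true
      lookback_days: 7          # illustrative; default (None) starts from the latest event
      idle_timeout_seconds: 30
      poll_limit: 100
```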
incremental
IncrementalConfig
Incremental processing configuration.
incremental.enabled
boolean
Only process documents whose text content has changed (tracks content hash). Uses stateful ingestion when enabled. The state_file_path option is deprecated and ignored when stateful ingestion is enabled.
Default: True
incremental.force_reprocess
boolean
Force reprocess all documents regardless of content hash
Default: False
incremental.state_file_path
One of string, null
[DEPRECATED] Path to state file. This option is ignored when stateful ingestion is enabled. State is now managed through DataHub's stateful ingestion framework.
Default: None
platform_filter
One of array, null
Filter documents by platforms. Default (None): Process all NATIVE documents (sourceType=NATIVE) regardless of platform. To include external documents from specific platforms, add them here (e.g., ['notion', 'confluence']). This will process NATIVE documents + EXTERNAL documents from the specified platforms. Use ['*'] or ['ALL'] to process all documents regardless of source type or platform.
Default: None
platform_filter.string
string
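The platform_filter semantics above can be expressed in a recipe like this (source type name assumed, platform names illustrative):

```yaml
source:
  type: datahub-documents   # assumed source type name
  config:
    # NATIVE documents plus EXTERNAL documents from these platforms:
    platform_filter: ['notion', 'confluence']
    # Or process all documents regardless of source type or platform:
    # platform_filter: ['*']
```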
stateful_ingestion
DocumentChunkingStatefulIngestionConfig
Configuration for document chunking stateful ingestion.
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
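Since stateful ingestion requires a pipeline_name, a sketch of enabling it explicitly (source type name assumed):

```yaml
pipeline_name: documents_embedding_pipeline   # illustrative; required for stateful ingestion
source:
  type: datahub-documents   # assumed source type name
  config:
    stateful_ingestion:
      enabled: true
```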

Code Coordinates

  • Class Name: datahub.ingestion.source.datahub_documents.datahub_documents_source.DataHubDocumentsSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for DataHubDocuments, feel free to ping us on our Slack.