Confluence
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Detect Deleted Entities | ✅ | Enabled by default. |
| Platform Instance | ✅ | Enabled by default. |
| Test Connection | ✅ | Enabled by default. |
Overview
The Confluence source ingests pages and spaces from Confluence (Cloud or Data Center) as DataHub Document entities, optionally generating embeddings for semantic search.
Key Features
1. Content Extraction
- Page Content: Full text extraction from Confluence pages including all content types
- Space Discovery: Automatic discovery of all pages within specified spaces
- Hierarchical Structure: Maintains parent-child relationships between pages
- Metadata Extraction: Captures creation/modification timestamps, authors, labels, and custom properties
2. Hierarchical Relationships
- Parent-Child Links: Preserves Confluence page hierarchy in DataHub
- Recursive Discovery: Recursively discovers nested pages starting from root pages or entire spaces
- Space Organization: Maintains space context as custom properties
- Flexible Navigation: Browse documentation structure in DataHub UI
3. Embedding Generation
Optional semantic search support with sensible defaults:
- Supported providers: Cohere (API key), AWS Bedrock (IAM roles)
- Automatic chunking: Documents are automatically chunked for optimal embedding generation
- Automatic deduplication: Prevents duplicate chunk embeddings
See Semantic Search Configuration for detailed setup and advanced options.
4. Stateful Ingestion
Supports smart incremental updates via stateful ingestion:
- Content Change Detection: Only reprocesses documents when content or embeddings config changes
- Deletion Detection: Automatically removes stale entities from DataHub
- Flexible Discovery: Ingest entire spaces, specific pages, or page trees
- State Persistence: Maintains processing state between runs to skip unchanged documents
Prerequisites
1. Confluence API Access
For Confluence Cloud
Create an API token:
- Go to https://id.atlassian.com/manage-profile/security/api-tokens
- Click "Create API token"
- Give it a name (e.g., "DataHub Integration")
- Copy the token (you won't be able to see it again)
You'll need:
- Base URL: Your Confluence Cloud URL (e.g., https://your-domain.atlassian.net/wiki)
- Username: Your Atlassian account email
- API Token: The token you just created
For Confluence Data Center / Server
Create a Personal Access Token:
- Go to your Confluence → Profile → Personal Access Tokens
- Click "Create token"
- Give it a name and set expiration
- Copy the token
You'll need:
- Base URL: Your Confluence server URL (e.g., https://confluence.company.com)
- Personal Access Token: The token you created
Note: For Data Center, you can also use username/password, but Personal Access Tokens are recommended.
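The two credential styles map onto different HTTP Authorization headers. The following is a minimal sketch, not the connector's own code: `build_auth_headers` is a hypothetical helper, and it assumes the usual Atlassian conventions of Basic auth (email plus API token) for Cloud and a Bearer token for Data Center Personal Access Tokens. The username/password option for Data Center is not covered.

```python
import base64


def build_auth_headers(cloud, username=None, api_token=None, personal_access_token=None):
    """Return an Authorization header for a Confluence REST call (hypothetical helper)."""
    if cloud:
        # Cloud: Basic auth built from "<email>:<api_token>"
        if not username or not api_token:
            raise ValueError("Cloud requires both username (email) and api_token")
        raw = f"{username}:{api_token}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    # Data Center / Server: the Personal Access Token is sent as a Bearer token
    if not personal_access_token:
        raise ValueError("Data Center requires personal_access_token")
    return {"Authorization": f"Bearer {personal_access_token}"}
```

A quick manual check of credentials could pass these headers to any HTTP client before running a full ingestion.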
2. Required Permissions
The API credentials must have:
- Read access to all spaces and pages you want to ingest
- For Cloud: User must be added to spaces or have site-wide read access
- For Data Center: User must have "View" permissions on spaces
3. Embedding Provider (Optional)
If you want semantic search capabilities, configure an embedding provider in your DataHub instance.
Supported providers include Cohere (API key) and AWS Bedrock (IAM roles). The connector will use sensible defaults for chunking and embedding configuration.
See Semantic Search Configuration for detailed provider setup and configuration options.
Common Use Cases
1. Auto-Discover All Spaces (Default)
By default, the connector discovers and ingests all accessible spaces:
source:
  type: confluence
  config:
    # Confluence Cloud
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # No filtering - discovers all accessible spaces
    # Optional: limit number of spaces for large instances
    max_spaces: 100
2. Include Specific Spaces
Ingest only specific Confluence spaces:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Include only these spaces
    spaces:
      allow:
        - "ENGINEERING"
        - "PRODUCT"
        - "DESIGN"
3. Exclude Personal and Archive Spaces
Ingest all spaces except specific ones:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Exclude personal spaces and archived content
    spaces:
      deny:
        - "~john.doe"
        - "~jane.smith"
        - "ARCHIVE"
        - "OLD_DOCS"
4. Specific Page Trees Only
Ingest specific pages and their descendants:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Start from specific pages
    pages:
      allow:
        - "123456789"  # API Documentation page tree
        - "987654321"  # User Guides page tree
    recursive: true  # Include all child pages
5. Combined Space and Page Filtering
Combine space and page filters for fine-grained control:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Include specific spaces
    spaces:
      allow:
        - "ENGINEERING"
        - "PRODUCT"
      # Exclude personal spaces even if in allow list
      deny:
        - "~admin"
    # Exclude specific pages (e.g., drafts, archived content)
    pages:
      deny:
        - "999999"  # Draft page
        - "888888"  # Archived page
6. Data Center / Server Setup
Connect to Confluence Data Center or Server:
source:
  type: confluence
  config:
    # Data Center / Server
    cloud: false
    url: "https://confluence.company.com"
    personal_access_token: "${CONFLUENCE_PAT}"
    spaces:
      allow:
        - "WIKI"
        - "DOCS"
7. Production Setup with Stateful Ingestion
Enterprise setup with incremental updates:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    spaces:
      allow:
        - "COMPANY"
        - "PUBLIC"
    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true
Note: Embedding configuration is managed by your DataHub instance. See Semantic Search Configuration for setup.
8. Using URLs for Allow/Deny
You can specify spaces and pages using full URLs for both allow and deny lists:
source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"
    # Use full URLs - connector extracts keys/IDs automatically
    spaces:
      allow:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG"
        - "https://your-domain.atlassian.net/wiki/spaces/PRODUCT"
      deny:
        - "https://your-domain.atlassian.net/wiki/spaces/ARCHIVE"
        - "~john.doe"  # Can mix URLs and keys
    pages:
      allow:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Getting+Started"
      deny:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/999999/Draft"
Filtering Content
The connector provides flexible filtering options through allow and deny lists for both spaces and pages.
Space Filtering
Control which Confluence spaces are ingested:
spaces.allow: Include only specific spaces (by default, all accessible spaces are discovered)
spaces:
  allow:
    - "ENGINEERING"  # Space key
    - "PRODUCT"
    - "https://your-domain.atlassian.net/wiki/spaces/DESIGN"  # Or full URL
spaces.deny: Exclude specific spaces (applied after spaces.allow)
spaces:
  deny:
    - "~john.doe"  # Personal space
    - "ARCHIVE"  # Archived content
    - "TEST"  # Test space
Page Filtering
Control which pages are ingested:
pages.allow: Include only specific pages (triggers page-based mode, bypasses space discovery)
pages:
  allow:
    - "123456789"  # Page ID
    - "987654321"
    - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/111111/API+Docs"  # Or full URL
recursive: true  # Include child pages
pages.deny: Exclude specific pages (works in both space-based and page-based modes)
pages:
  deny:
    - "999999"  # Draft page
    - "888888"  # Archived page
Filtering Rules
Precedence:
- Deny lists always take precedence over allow lists
- If a space/page is in both allow and deny lists, it will be excluded
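The precedence rule can be sketched in a few lines. This is an illustration only, not the connector's implementation: `is_included` is a hypothetical function, it assumes exact, case-sensitive matching on already-normalized keys or IDs, and it treats a missing allow list as "no restriction".

```python
def is_included(identifier, allow=None, deny=None):
    """Apply allow/deny filtering where deny always wins (hypothetical helper)."""
    # Deny is checked first, so an entry in both lists is excluded
    if deny and identifier in deny:
        return False
    # No allow list configured: everything not denied is included
    if allow is None:
        return True
    return identifier in allow
```

For example, `is_included("~admin", allow=["ENGINEERING", "~admin"], deny=["~admin"])` evaluates to `False`, because the deny entry overrides the allow entry.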
Modes:
- Space-based mode (default): Discovers spaces, then ingests all pages within allowed spaces
- Page-based mode: When pages.allow is specified, bypasses space discovery and fetches specific page trees
Format Support:
- Space keys: "ENGINEERING", "~username" (for personal spaces)
- Page IDs: "123456789" (numeric string)
- Full URLs: Both space URLs and page URLs are automatically parsed
Common Filtering Patterns
Exclude all personal spaces:
spaces:
  deny:
    # Wildcards such as "~*" are not supported; list each personal space explicitly
    - "~john.doe"
    - "~jane.smith"
Ingest only documentation spaces:
spaces:
  allow:
    - "DOCS"
    - "API_DOCS"
    - "USER_GUIDES"
Focus on specific documentation trees:
pages:
  allow:
    - "123456"  # API Documentation root page
    - "789012"  # User Guides root page
recursive: true
Exclude drafts and WIP pages:
pages:
  deny:
    - "999999"  # Draft page ID
    - "888888"  # WIP page ID
How It Works
Processing Pipeline
- Discovery: Discovers spaces and pages via the Confluence API
- Download: Downloads page content via Confluence REST API
- Extraction: Extracts text, metadata, and hierarchy from pages
- Chunking: Splits documents into semantic chunks (if embeddings enabled)
- Embedding: Generates vector embeddings for each chunk (if embeddings enabled)
- Emission: Emits Document entities with SemanticContent aspects to DataHub
URL Format Support
The connector supports multiple input formats for spaces and pages in allow/deny lists:
Space Identifiers:
- Space key: "ENGINEERING", "~username" (for personal spaces)
- Full URL: "https://your-domain.atlassian.net/wiki/spaces/ENGINEERING"
Page Identifiers:
- Page ID: "123456789" (numeric string)
- Full URL (Cloud): "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Page+Title"
- Full URL (Data Center): "https://confluence.company.com/pages/viewpage.action?pageId=123456"
The connector automatically extracts space keys and page IDs from URLs, so you can use either format interchangeably in the spaces.allow, spaces.deny, pages.allow, and pages.deny lists.
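To make the accepted formats concrete, here is a rough sketch of how keys and IDs could be pulled out of the identifier formats listed in this section. The function names `extract_space_key` and `extract_page_id` are hypothetical and the connector's actual parsing may differ; only the URL shapes shown in this section are handled.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_space_key(value):
    """Return a space key from a bare key or a .../spaces/<KEY> URL (hypothetical helper)."""
    if not value.startswith(("http://", "https://")):
        return value  # already a key such as "ENGINEERING" or "~username"
    match = re.search(r"/spaces/([^/?#]+)", urlparse(value).path)
    return match.group(1) if match else None


def extract_page_id(value):
    """Return a numeric page ID from a bare ID, a Cloud URL, or a Data Center URL."""
    if not value.startswith(("http://", "https://")):
        return value if value.isdigit() else None
    parsed = urlparse(value)
    # Cloud style: .../spaces/ENG/pages/123456/Page+Title
    match = re.search(r"/pages/(\d+)", parsed.path)
    if match:
        return match.group(1)
    # Data Center style: .../pages/viewpage.action?pageId=123456
    page_ids = parse_qs(parsed.query).get("pageId")
    return page_ids[0] if page_ids else None
```

Both functions return `None` when nothing recognizable is found, which makes it easy to flag a malformed entry in a recipe before ingestion starts.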
Stateful Ingestion Details
The source uses content-based change detection:
- Calculates SHA-256 hash of document content + embedding configuration
- Compares hash with previous run to detect changes
- Only reprocesses documents when hash changes
- Tracks all emitted URNs to detect deletions
This means:
- First run: Processes all documents
- Subsequent runs: Only processes new/changed documents
- Deleted pages: Automatically soft-deleted from DataHub
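The change-detection idea can be illustrated with a short sketch. It is not the connector's code: `content_fingerprint` and `plan_run` are hypothetical names, and the way content and embedding configuration are combined before hashing (text plus a sorted JSON dump) is an assumption made for the example.

```python
import hashlib
import json


def content_fingerprint(text, embedding_config):
    """SHA-256 over page text plus a canonical dump of the embedding settings."""
    # sort_keys makes the hash independent of dictionary key order
    canonical_config = json.dumps(embedding_config, sort_keys=True)
    payload = text + "\n" + canonical_config
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_run(current, previous):
    """Compare {urn: hash} maps from this run and the last one.

    Returns (urns to reprocess, urns to soft-delete).
    """
    to_process = sorted(urn for urn, digest in current.items() if previous.get(urn) != digest)
    to_soft_delete = sorted(set(previous) - set(current))
    return to_process, to_soft_delete
```

On a first run `previous` is empty, so every document lands in the reprocess list. On later runs only new or changed hashes do, and URNs that were seen before but are now missing become soft-delete candidates.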
Limitations and Considerations
Confluence API Limits
- Rate Limits: Confluence enforces rate limits (Cloud: varies by plan, Data Center: configurable)
- Content Types: Complex macros may not extract perfectly (e.g., embedded content, custom macros)
- Attachments: File attachments are not ingested (only page content)
Performance Considerations
- Large Spaces: First run may take significant time for large spaces (1000+ pages)
- Embedding Generation: Adds processing time proportional to content volume
- API Costs: Embedding providers may incur costs based on usage
Content Extraction
- Supported Content: Text, headings, lists, code blocks, tables, panels
- Limited Support: Some macros extract as text/links
- Not Supported: Attachments, complex custom macros, embedded Jira issues (content only)
Troubleshooting
Common Issues
"401 Unauthorized" or "Authentication failed" errors:
- Cloud: Verify username (email) and api_token are correct
- Data Center: Verify personal_access_token is valid and not expired
- Check that cloud: true/false matches your Confluence type
- Ensure the URL includes the /wiki suffix for Cloud (e.g., https://domain.atlassian.net/wiki)
"403 Forbidden" or "Space not found" errors:
- Verify the user has read access to the specified spaces
- Check that space keys are correct (case-sensitive)
- For Cloud, ensure user is added to private spaces
- For Data Center, verify "View Space" permissions
Empty or missing content:
- Verify pages contain text (empty pages are skipped by default with skip_empty_documents: true)
- Check the min_text_length filter setting (default: 50 characters)
- Ensure recursive: true if expecting child pages
- Check that pages are not restricted or have special permissions
Slow ingestion:
- Increase processing.parallelism.num_processes (default: 2)
- Consider filtering specific spaces instead of all spaces
- First run is always slower - subsequent runs use incremental updates
- Large spaces with 1000+ pages may take several minutes
Embedding generation failures:
- Verify provider API key is correct
- Check provider-specific rate limits (Cohere: 10k requests/min)
- Ensure embedding model name is valid for your provider
- For Bedrock: verify IAM permissions and model access is enabled in AWS Console
Stateful ingestion not working:
- Ensure stateful_ingestion.enabled: true in config
- Check DataHub connection (source needs to query previous state)
- Verify state file path is writable (if using file-based state)
- Look for state persistence logs in ingestion output
Missing hierarchy/parent relationships:
- Verify hierarchy.enabled: true (default)
- Check that parent pages are being ingested
- Ensure recursive: true to discover parent-child relationships
- Parent pages must be accessible to the API credentials
Page IDs not working:
- For Cloud, use the numeric page ID from the URL (after /pages/)
- For Data Center, page IDs may differ - use the ID from the page URL or the ?pageId= query parameter
- Alternatively, use full page URLs instead of IDs in pages.allow or pages.deny
How to find space keys and page IDs:
- Space key: Visible in the space URL: https://domain.atlassian.net/wiki/spaces/ENGINEERING → key is ENGINEERING
- Page ID (Cloud): In the page URL after /pages/: https://domain.atlassian.net/wiki/spaces/ENG/pages/123456/Title → ID is 123456
- Page ID (Data Center): In the URL query parameter: https://confluence.company.com/pages/viewpage.action?pageId=123456 → ID is 123456
- Personal space key: Format is ~username (e.g., ~john.doe for user john.doe)
Performance Tuning
Parallelism Settings
processing:
  parallelism:
    num_processes: 4  # Increase for faster processing (default: 2)
    max_connections: 20  # Concurrent API connections (default: 10)
Guidelines:
- Small spaces (<100 pages): num_processes: 2
- Medium spaces (100-500 pages): num_processes: 4
- Large spaces (>500 pages): num_processes: 8
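The guidelines above amount to a simple lookup by page count. The sketch below encodes them as a hypothetical helper, `suggest_num_processes`, which is not part of the connector and only restates the thresholds listed in this section.

```python
def suggest_num_processes(page_count):
    """Map an approximate page count to the num_processes values suggested above."""
    if page_count < 100:
        return 2  # small spaces
    if page_count <= 500:
        return 4  # medium spaces
    return 8  # large spaces
```

Treat the result as a starting point: available CPU cores and Confluence rate limits may call for a lower value.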
Filtering
filtering:
  min_text_length: 100  # Skip short pages (default: 50)
  skip_empty_documents: true  # Skip empty pages (default: true)
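As a rough illustration of how these two settings might interact, the sketch below decides whether a page would be skipped. `should_skip` is a hypothetical function, and measuring length on whitespace-stripped text is an assumption rather than documented connector behavior.

```python
def should_skip(text, min_text_length=50, skip_empty_documents=True):
    """Return True when a page would be filtered out under the assumed rules."""
    stripped = text.strip()
    # Pages with no text at all are dropped when skip_empty_documents is on
    if skip_empty_documents and not stripped:
        return True
    # Otherwise the page must reach the minimum character count
    return len(stripped) < min_text_length
```

Raising `min_text_length` to 100, as in the snippet above, would also filter out stub pages that contain only a sentence or two.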
Space Selection
Instead of ingesting all spaces, select specific ones:
spaces:
  allow:
    - "ENGINEERING"  # High-value documentation space
    - "PRODUCT"  # Product requirements space
  deny:
    # Wildcards such as "~*" are not supported; list personal spaces individually
    - "~john.doe"  # Exclude a personal space
    - "ARCHIVE"  # Exclude archived content
    - "TEST"  # Exclude test spaces
Related Documentation
- Confluence Cloud REST API
- Confluence Data Center REST API
- Semantic Search Configuration
- DataHub Document Ingestion
CLI based Ingestion
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
url ✅ string | Base URL of your Confluence instance. Examples: 'https://your-domain.atlassian.net/wiki' (Cloud) or 'https://confluence.your-company.com' (Data Center) |
api_token One of string(password), null | API token for Confluence Cloud authentication. Generate at: https://id.atlassian.com/manage-profile/security/api-tokens Default: None |
cloud boolean | Whether this is a Confluence Cloud instance (True) or Data Center/Server (False). Default: True |
max_pages_per_space integer | Maximum number of pages to ingest per space. Default: 1000 |
max_spaces integer | Maximum number of spaces to ingest when auto-discovering (applies when urls is not set). Default: 100 |
personal_access_token One of string(password), null | Personal Access Token for Confluence Data Center authentication. Generate from: User Profile > Settings > Personal Access Tokens Default: None |
platform_instance One of string, null | Optional human-readable identifier for this Confluence instance (e.g., 'mycompany-prod', 'team-a-confluence'). If not provided, automatically generated by hashing the base URL, which guarantees global uniqueness across all Confluence installations (both Cloud and Data Center). Use explicit values for more readable URNs, but auto-generated hashes are perfectly fine and require no manual configuration. Default: None |
recursive boolean | Whether to recursively fetch child pages (applies to page URLs only). Default: True |
username One of string, null | Username for Confluence Cloud authentication (required for Cloud). Default: None |
advanced AdvancedConfig | Advanced configuration options. |
advanced.continue_on_failure boolean | Default: True |
advanced.max_errors integer | Default: 10 |
advanced.output_format Enum | One of: "json", "xml" Default: json |
advanced.preserve_outputs boolean | Default: False |
advanced.raise_on_error boolean | Default: False |
advanced.work_dir string | Default: /tmp/unstructured_datahub |
advanced.cache CacheConfig | Cache configuration. |
advanced.cache.cache_dir string | Default: ~/.cache/unstructured_datahub |
advanced.cache.enabled boolean | Default: True |
advanced.cache.ttl integer | Cache TTL in seconds Default: 86400 |
advanced.retry RetryConfig | Retry configuration. |
advanced.retry.backoff_factor integer | Default: 2 |
advanced.retry.enabled boolean | Default: True |
advanced.retry.max_attempts integer | Default: 3 |
advanced.retry.retry_on_timeout boolean | Default: True |
chunking ChunkingConfig | Chunking strategy configuration. |
chunking.combine_text_under_n_chars integer | Combine chunks smaller than this size Default: 100 |
chunking.max_characters integer | Maximum characters per chunk Default: 500 |
chunking.overlap integer | Character overlap between chunks Default: 0 |
chunking.strategy Enum | One of: "basic", "by_title" Default: by_title |
document_mapping DocumentMappingConfig | Document entity mapping configuration. |
document_mapping.id_pattern string | Pattern for generating document IDs Default: {source_type}-{directory}-{basename} |
document_mapping.status Enum | One of: "PUBLISHED", "UNPUBLISHED" Default: PUBLISHED |
document_mapping.id_normalization IdNormalizationConfig | Document ID normalization rules. |
document_mapping.id_normalization.lowercase boolean | Convert to lowercase Default: True |
document_mapping.id_normalization.max_length integer | Maximum ID length Default: 200 |
document_mapping.id_normalization.remove_special_chars boolean | Remove special characters except _ and - Default: True |
document_mapping.id_normalization.replace_spaces_with string | Replace spaces with this character Default: - |
document_mapping.source SourceConfig | Document source configuration. |
document_mapping.source.include_external_id boolean | Include external ID in DocumentSource Default: True |
document_mapping.source.include_external_url boolean | Include external URL in DocumentSource Default: True |
document_mapping.source.type Enum | One of: "NATIVE", "EXTERNAL" Default: EXTERNAL |
document_mapping.title TitleExtractionConfig | Title extraction configuration. |
document_mapping.title.extract_from_content boolean | Try to extract title from document content Default: True |
document_mapping.title.fallback_to_filename boolean | Use filename as title if not found in content Default: True |
document_mapping.title.max_length integer | Maximum title length Default: 500 |
embedding EmbeddingConfig | Embedding generation configuration. Default behavior: Fetches configuration from DataHub server automatically. Override behavior: Validates local config against server when explicitly set. |
embedding.allow_local_embedding_config boolean | BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search. Default: False |
embedding.api_key One of string, null | API key for Cohere (not needed for Bedrock with IAM roles) Default: None |
embedding.aws_region One of string, null | AWS region for Bedrock. If not set, loads from server. Default: None |
embedding.batch_size integer | Batch size for embedding API calls Default: 25 |
embedding.input_type One of string, null | Input type for Cohere embeddings Default: search_document |
embedding.model One of string, null | Model name. If not set, loads from server. Default: None |
embedding.model_embedding_key One of string, null | Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server. Default: None |
embedding.provider One of Enum, null | Embedding provider (bedrock uses AWS, cohere/openai use API key). If not set, loads from server. Default: None |
filtering FilteringConfig | File filtering configuration. |
filtering.max_file_size One of integer, null | Maximum file size in bytes Default: None |
filtering.min_file_size One of integer, null | Minimum file size in bytes Default: None |
filtering.min_text_length integer | Minimum text length in characters Default: 50 |
filtering.modified_after One of string, null | Only files modified after this date (ISO format) Default: None |
filtering.modified_before One of string, null | Only files modified before this date (ISO format) Default: None |
filtering.skip_empty_documents boolean | Skip documents with no text content Default: True |
filtering.exclude_patterns array | Glob patterns to exclude |
filtering.exclude_patterns.string string | |
filtering.include_patterns array | Glob patterns to include |
filtering.include_patterns.string string | |
hierarchy HierarchyConfig | Hierarchy configuration. |
hierarchy.enabled boolean | Enable parent-child relationships Default: True |
hierarchy.parent_strategy Enum | One of: "folder", "none", "custom", "notion", "confluence" Default: folder |
hierarchy.custom_mapping One of CustomMappingConfig, null | Custom mapping configuration Default: None |
hierarchy.custom_mapping.rules array | Custom parent mapping rules |
hierarchy.custom_mapping.rules.CustomParentRule CustomParentRule | Custom parent mapping rule. |
hierarchy.custom_mapping.rules.CustomParentRule.parent_id ❓ string | Parent document ID for matching files |
hierarchy.custom_mapping.rules.CustomParentRule.pattern ❓ string | Glob pattern to match file paths |
hierarchy.folder_mapping FolderMappingConfig | Folder hierarchy mapping configuration. |
hierarchy.folder_mapping.create_parent_docs boolean | Create Document entities for folders Default: True |
hierarchy.folder_mapping.max_depth integer | Maximum hierarchy depth Default: 10 |
hierarchy.folder_mapping.parent_id_pattern string | Pattern for parent document IDs Default: {source_type}-{directory} |
hierarchy.folder_mapping.root_parent One of string, null | Optional root document URN Default: None |
pages PageFilterConfig | Configuration for filtering Confluence pages. |
pages.allow One of array, null | List of specific Confluence pages to include in ingestion. By default, all pages in discovered spaces are included. Specify page IDs or URLs to limit ingestion to specific pages and their children. Examples: - Page IDs: ['123456', '789012'] - Page URLs: ['https://domain.atlassian.net/wiki/spaces/ENG/pages/123456/API-Docs'] When specified, only these page trees will be ingested (if recursive=true). This allows focusing on specific documentation sections. Default: None |
pages.allow.string string | |
pages.deny One of array, null | List of specific Confluence pages to exclude from ingestion. Applies after allow filtering. Examples: - Exclude specific pages: ['123456', '789012'] - Page URLs: ['https://domain.atlassian.net/wiki/spaces/ENG/pages/999999/Draft'] Useful for excluding specific pages within otherwise included spaces. Default: None |
pages.deny.string string | |
processing ProcessingConfig | Processing configuration (partitioning only, no chunking). |
processing.parallelism ParallelismConfig | Parallelism configuration. |
processing.parallelism.disable_parallelism boolean | Disable all parallelism Default: False |
processing.parallelism.max_connections integer | Max concurrent connections for async operations Default: 10 |
processing.parallelism.num_processes integer | Number of worker processes Default: 2 |
processing.partition PartitionConfig | Unstructured partitioning configuration. |
processing.partition.additional_args object | Additional partition arguments |
processing.partition.api_key One of string, null | Unstructured API key Default: None |
processing.partition.partition_by_api boolean | Use Unstructured API for partitioning Default: False |
processing.partition.split_pdf_concurrency_level integer | Number of parallel requests for PDF pages Default: 5 |
processing.partition.split_pdf_page boolean | Enable page-level splitting for large PDFs Default: False |
processing.partition.strategy Enum | One of: "auto", "hi_res", "fast", "ocr_only" Default: auto |
processing.partition.ocr_languages array | Languages for OCR Default: ['eng'] |
processing.partition.ocr_languages.string string | |
spaces SpaceFilterConfig | Configuration for filtering Confluence spaces. |
spaces.allow One of array, null | List of Confluence spaces to include in ingestion. By default, all accessible spaces are discovered. Specify space keys or URLs to limit ingestion to specific spaces. Examples: - Space keys: ['ENGINEERING', 'PRODUCT', 'DESIGN'] - Space URLs: ['https://domain.atlassian.net/wiki/spaces/TEAM'] - Mixed: ['ENGINEERING', 'https://domain.atlassian.net/wiki/spaces/PRODUCT'] If specified, only these spaces will be ingested. Use deny to exclude specific spaces from discovery. Default: None |
spaces.allow.string string | |
spaces.deny One of array, null | List of Confluence spaces to exclude from ingestion. Applies after allow filtering. Examples: - Exclude personal spaces: ['~user1', '~user2'] - Exclude specific spaces: ['ARCHIVE', 'OLD_DOCS'] - Space URLs: ['https://domain.atlassian.net/wiki/spaces/TEST'] Useful for excluding personal spaces or archived content. Default: None |
spaces.deny.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Stateful Ingestion Config Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AdvancedConfig": {
"additionalProperties": false,
"description": "Advanced configuration options.",
"properties": {
"work_dir": {
"default": "/tmp/unstructured_datahub",
"title": "Work Dir",
"type": "string"
},
"preserve_outputs": {
"default": false,
"title": "Preserve Outputs",
"type": "boolean"
},
"output_format": {
"default": "json",
"enum": [
"json",
"xml"
],
"title": "Output Format",
"type": "string"
},
"raise_on_error": {
"default": false,
"title": "Raise On Error",
"type": "boolean"
},
"max_errors": {
"default": 10,
"title": "Max Errors",
"type": "integer"
},
"continue_on_failure": {
"default": true,
"title": "Continue On Failure",
"type": "boolean"
},
"retry": {
"$ref": "#/$defs/RetryConfig"
},
"cache": {
"$ref": "#/$defs/CacheConfig"
}
},
"title": "AdvancedConfig",
"type": "object"
},
"CacheConfig": {
"additionalProperties": false,
"description": "Cache configuration.",
"properties": {
"enabled": {
"default": true,
"title": "Enabled",
"type": "boolean"
},
"cache_dir": {
"default": "~/.cache/unstructured_datahub",
"title": "Cache Dir",
"type": "string"
},
"ttl": {
"default": 86400,
"description": "Cache TTL in seconds",
"title": "Ttl",
"type": "integer"
}
},
"title": "CacheConfig",
"type": "object"
},
"ChunkingConfig": {
"additionalProperties": false,
"description": "Chunking strategy configuration.",
"properties": {
"strategy": {
"default": "by_title",
"description": "Chunking strategy to use",
"enum": [
"basic",
"by_title"
],
"title": "Strategy",
"type": "string"
},
"max_characters": {
"default": 500,
"description": "Maximum characters per chunk",
"title": "Max Characters",
"type": "integer"
},
"overlap": {
"default": 0,
"description": "Character overlap between chunks",
"title": "Overlap",
"type": "integer"
},
"combine_text_under_n_chars": {
"default": 100,
"description": "Combine chunks smaller than this size",
"title": "Combine Text Under N Chars",
"type": "integer"
}
},
"title": "ChunkingConfig",
"type": "object"
},
"CustomMappingConfig": {
"additionalProperties": false,
"description": "Custom parent mapping configuration.",
"properties": {
"rules": {
"description": "Custom parent mapping rules",
"items": {
"$ref": "#/$defs/CustomParentRule"
},
"title": "Rules",
"type": "array"
}
},
"title": "CustomMappingConfig",
"type": "object"
},
"CustomParentRule": {
"additionalProperties": false,
"description": "Custom parent mapping rule.",
"properties": {
"pattern": {
"description": "Glob pattern to match file paths",
"title": "Pattern",
"type": "string"
},
"parent_id": {
"description": "Parent document ID for matching files",
"title": "Parent Id",
"type": "string"
}
},
"required": [
"pattern",
"parent_id"
],
"title": "CustomParentRule",
"type": "object"
},
"DocumentMappingConfig": {
"additionalProperties": false,
"description": "Document entity mapping configuration.",
"properties": {
"id_pattern": {
"default": "{source_type}-{directory}-{basename}",
"description": "Pattern for generating document IDs",
"title": "Id Pattern",
"type": "string"
},
"id_normalization": {
"$ref": "#/$defs/IdNormalizationConfig",
"description": "ID normalization rules"
},
"title": {
"$ref": "#/$defs/TitleExtractionConfig",
"description": "Title extraction configuration"
},
"source": {
"$ref": "#/$defs/SourceConfig",
"description": "Source configuration"
},
"status": {
"default": "PUBLISHED",
"description": "Default publication status",
"enum": [
"PUBLISHED",
"UNPUBLISHED"
],
"title": "Status",
"type": "string"
}
},
"title": "DocumentMappingConfig",
"type": "object"
},
"EmbeddingConfig": {
"additionalProperties": false,
"description": "Embedding generation configuration.\n\nDefault behavior: Fetches configuration from DataHub server automatically.\nOverride behavior: Validates local config against server when explicitly set.",
"properties": {
"provider": {
"anyOf": [
{
"enum": [
"bedrock",
"cohere",
"openai"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Embedding provider (bedrock uses AWS, cohere/openai use API key). If not set, loads from server.",
"title": "Provider"
},
"model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Model name. If not set, loads from server.",
"title": "Model"
},
"model_embedding_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.",
"title": "Model Embedding Key"
},
"aws_region": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "AWS region for Bedrock. If not set, loads from server.",
"title": "Aws Region"
},
"api_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "API key for Cohere (not needed for Bedrock with IAM roles)",
"title": "Api Key"
},
"batch_size": {
"default": 25,
"description": "Batch size for embedding API calls",
"title": "Batch Size",
"type": "integer"
},
"input_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "search_document",
"description": "Input type for Cohere embeddings",
"title": "Input Type"
},
"allow_local_embedding_config": {
"default": false,
"description": "BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.",
"title": "Allow Local Embedding Config",
"type": "boolean"
}
},
"title": "EmbeddingConfig",
"type": "object"
},
"FilteringConfig": {
"additionalProperties": false,
"description": "File filtering configuration.",
"properties": {
"include_patterns": {
"description": "Glob patterns to include",
"items": {
"type": "string"
},
"title": "Include Patterns",
"type": "array"
},
"exclude_patterns": {
"description": "Glob patterns to exclude",
"items": {
"type": "string"
},
"title": "Exclude Patterns",
"type": "array"
},
"min_file_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Minimum file size in bytes",
"title": "Min File Size"
},
"max_file_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Maximum file size in bytes",
"title": "Max File Size"
},
"modified_after": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Only files modified after this date (ISO format)",
"title": "Modified After"
},
"modified_before": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Only files modified before this date (ISO format)",
"title": "Modified Before"
},
"skip_empty_documents": {
"default": true,
"description": "Skip documents with no text content",
"title": "Skip Empty Documents",
"type": "boolean"
},
"min_text_length": {
"default": 50,
"description": "Minimum text length in characters",
"title": "Min Text Length",
"type": "integer"
}
},
"title": "FilteringConfig",
"type": "object"
},
"FolderMappingConfig": {
"additionalProperties": false,
"description": "Folder hierarchy mapping configuration.",
"properties": {
"create_parent_docs": {
"default": true,
"description": "Create Document entities for folders",
"title": "Create Parent Docs",
"type": "boolean"
},
"parent_id_pattern": {
"default": "{source_type}-{directory}",
"description": "Pattern for parent document IDs",
"title": "Parent Id Pattern",
"type": "string"
},
"max_depth": {
"default": 10,
"description": "Maximum hierarchy depth",
"maximum": 50,
"minimum": 1,
"title": "Max Depth",
"type": "integer"
},
"root_parent": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Optional root document URN",
"title": "Root Parent"
}
},
"title": "FolderMappingConfig",
"type": "object"
},
"HierarchyConfig": {
"additionalProperties": false,
"description": "Hierarchy configuration.",
"properties": {
"enabled": {
"default": true,
"description": "Enable parent-child relationships",
"title": "Enabled",
"type": "boolean"
},
"parent_strategy": {
"default": "folder",
"description": "Parent document creation strategy. 'notion' extracts parent from Notion API metadata. 'confluence' extracts parent from Confluence page ancestors.",
"enum": [
"folder",
"none",
"custom",
"notion",
"confluence"
],
"title": "Parent Strategy",
"type": "string"
},
"folder_mapping": {
"$ref": "#/$defs/FolderMappingConfig",
"description": "Folder mapping configuration"
},
"custom_mapping": {
"anyOf": [
{
"$ref": "#/$defs/CustomMappingConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Custom mapping configuration"
}
},
"title": "HierarchyConfig",
"type": "object"
},
"IdNormalizationConfig": {
"additionalProperties": false,
"description": "Document ID normalization rules.",
"properties": {
"lowercase": {
"default": true,
"description": "Convert to lowercase",
"title": "Lowercase",
"type": "boolean"
},
"replace_spaces_with": {
"default": "-",
"description": "Replace spaces with this character",
"title": "Replace Spaces With",
"type": "string"
},
"remove_special_chars": {
"default": true,
"description": "Remove special characters except _ and -",
"title": "Remove Special Chars",
"type": "boolean"
},
"max_length": {
"default": 200,
"description": "Maximum ID length",
"title": "Max Length",
"type": "integer"
}
},
"title": "IdNormalizationConfig",
"type": "object"
},
"PageFilterConfig": {
"additionalProperties": false,
"description": "Configuration for filtering Confluence pages.",
"properties": {
"allow": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of specific Confluence pages to include in ingestion. By default, all pages in discovered spaces are included. Specify page IDs or URLs to limit ingestion to specific pages and their children.\n\nExamples:\n - Page IDs: ['123456', '789012']\n - Page URLs: ['https://domain.atlassian.net/wiki/spaces/ENG/pages/123456/API-Docs']\n\nWhen specified, only these page trees will be ingested (if recursive=true). This allows focusing on specific documentation sections.",
"title": "Allow"
},
"deny": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of specific Confluence pages to exclude from ingestion. Applies after allow filtering.\n\nExamples:\n - Exclude specific pages: ['123456', '789012']\n - Page URLs: ['https://domain.atlassian.net/wiki/spaces/ENG/pages/999999/Draft']\n\nUseful for excluding specific pages within otherwise included spaces.",
"title": "Deny"
}
},
"title": "PageFilterConfig",
"type": "object"
},
"ParallelismConfig": {
"additionalProperties": false,
"description": "Parallelism configuration.",
"properties": {
"num_processes": {
"default": 2,
"description": "Number of worker processes",
"maximum": 32,
"minimum": 1,
"title": "Num Processes",
"type": "integer"
},
"disable_parallelism": {
"default": false,
"description": "Disable all parallelism",
"title": "Disable Parallelism",
"type": "boolean"
},
"max_connections": {
"default": 10,
"description": "Max concurrent connections for async operations",
"title": "Max Connections",
"type": "integer"
}
},
"title": "ParallelismConfig",
"type": "object"
},
"PartitionConfig": {
"additionalProperties": false,
"description": "Unstructured partitioning configuration.",
"properties": {
"strategy": {
"default": "auto",
"description": "Partitioning strategy",
"enum": [
"auto",
"hi_res",
"fast",
"ocr_only"
],
"title": "Strategy",
"type": "string"
},
"partition_by_api": {
"default": false,
"description": "Use Unstructured API for partitioning",
"title": "Partition By Api",
"type": "boolean"
},
"api_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Unstructured API key",
"title": "Api Key"
},
"split_pdf_page": {
"default": false,
"description": "Enable page-level splitting for large PDFs",
"title": "Split Pdf Page",
"type": "boolean"
},
"split_pdf_concurrency_level": {
"default": 5,
"description": "Number of parallel requests for PDF pages",
"title": "Split Pdf Concurrency Level",
"type": "integer"
},
"ocr_languages": {
"default": [
"eng"
],
"description": "Languages for OCR",
"items": {
"type": "string"
},
"title": "Ocr Languages",
"type": "array"
},
"additional_args": {
"additionalProperties": true,
"description": "Additional partition arguments",
"title": "Additional Args",
"type": "object"
}
},
"title": "PartitionConfig",
"type": "object"
},
"ProcessingConfig": {
"additionalProperties": false,
"description": "Processing configuration (partitioning only, no chunking).",
"properties": {
"partition": {
"$ref": "#/$defs/PartitionConfig",
"description": "Partition configuration"
},
"parallelism": {
"$ref": "#/$defs/ParallelismConfig",
"description": "Parallelism configuration"
}
},
"title": "ProcessingConfig",
"type": "object"
},
"RetryConfig": {
"additionalProperties": false,
"description": "Retry configuration.",
"properties": {
"enabled": {
"default": true,
"title": "Enabled",
"type": "boolean"
},
"max_attempts": {
"default": 3,
"title": "Max Attempts",
"type": "integer"
},
"backoff_factor": {
"default": 2,
"title": "Backoff Factor",
"type": "integer"
},
"retry_on_timeout": {
"default": true,
"title": "Retry On Timeout",
"type": "boolean"
}
},
"title": "RetryConfig",
"type": "object"
},
"SourceConfig": {
"additionalProperties": false,
"description": "Document source configuration.",
"properties": {
"type": {
"default": "EXTERNAL",
"description": "Source type (always EXTERNAL for ingested docs)",
"enum": [
"NATIVE",
"EXTERNAL"
],
"title": "Type",
"type": "string"
},
"include_external_url": {
"default": true,
"description": "Include external URL in DocumentSource",
"title": "Include External Url",
"type": "boolean"
},
"include_external_id": {
"default": true,
"description": "Include external ID in DocumentSource",
"title": "Include External Id",
"type": "boolean"
}
},
"title": "SourceConfig",
"type": "object"
},
"SpaceFilterConfig": {
"additionalProperties": false,
"description": "Configuration for filtering Confluence spaces.",
"properties": {
"allow": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of Confluence spaces to include in ingestion. By default, all accessible spaces are discovered. Specify space keys or URLs to limit ingestion to specific spaces.\n\nExamples:\n - Space keys: ['ENGINEERING', 'PRODUCT', 'DESIGN']\n - Space URLs: ['https://domain.atlassian.net/wiki/spaces/TEAM']\n - Mixed: ['ENGINEERING', 'https://domain.atlassian.net/wiki/spaces/PRODUCT']\n\nIf specified, only these spaces will be ingested. Use deny to exclude specific spaces from discovery.",
"title": "Allow"
},
"deny": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of Confluence spaces to exclude from ingestion. Applies after allow filtering.\n\nExamples:\n - Exclude personal spaces: ['~user1', '~user2']\n - Exclude specific spaces: ['ARCHIVE', 'OLD_DOCS']\n - Space URLs: ['https://domain.atlassian.net/wiki/spaces/TEST']\n\nUseful for excluding personal spaces or archived content.",
"title": "Deny"
}
},
"title": "SpaceFilterConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
},
"TitleExtractionConfig": {
"additionalProperties": false,
"description": "Title extraction configuration.",
"properties": {
"extract_from_content": {
"default": true,
"description": "Try to extract title from document content",
"title": "Extract From Content",
"type": "boolean"
},
"fallback_to_filename": {
"default": true,
"description": "Use filename as title if not found in content",
"title": "Fallback To Filename",
"type": "boolean"
},
"max_length": {
"default": 500,
"description": "Maximum title length",
"title": "Max Length",
"type": "integer"
}
},
"title": "TitleExtractionConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Confluence source connector.",
"properties": {
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful Ingestion Config"
},
"url": {
"description": "Base URL of your Confluence instance. Examples: 'https://your-domain.atlassian.net/wiki' (Cloud) or 'https://confluence.your-company.com' (Data Center)",
"title": "Url",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Optional human-readable identifier for this Confluence instance (e.g., 'mycompany-prod', 'team-a-confluence'). If not provided, automatically generated by hashing the base URL, which guarantees global uniqueness across all Confluence installations (both Cloud and Data Center). Use explicit values for more readable URNs, but auto-generated hashes are perfectly fine and require no manual configuration.",
"title": "Platform Instance"
},
"cloud": {
"default": true,
"description": "Whether this is a Confluence Cloud instance (True) or Data Center/Server (False).",
"title": "Cloud",
"type": "boolean"
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Username for Confluence Cloud authentication (required for Cloud).",
"title": "Username"
},
"api_token": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "API token for Confluence Cloud authentication. Generate at: https://id.atlassian.com/manage-profile/security/api-tokens",
"title": "Api Token"
},
"personal_access_token": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "Personal Access Token for Confluence Data Center authentication. Generate from: User Profile > Settings > Personal Access Tokens",
"title": "Personal Access Token"
},
"spaces": {
"$ref": "#/$defs/SpaceFilterConfig"
},
"pages": {
"$ref": "#/$defs/PageFilterConfig"
},
"max_spaces": {
"default": 100,
"description": "Maximum number of spaces to ingest when auto-discovering (applies when urls is not set).",
"title": "Max Spaces",
"type": "integer"
},
"max_pages_per_space": {
"default": 1000,
"description": "Maximum number of pages to ingest per space.",
"title": "Max Pages Per Space",
"type": "integer"
},
"recursive": {
"default": true,
"description": "Whether to recursively fetch child pages (applies to page URLs only).",
"title": "Recursive",
"type": "boolean"
},
"processing": {
"$ref": "#/$defs/ProcessingConfig",
"description": "Document processing configuration (partitioning strategy, OCR, etc.)."
},
"document_mapping": {
"$ref": "#/$defs/DocumentMappingConfig",
"description": "Configuration for mapping Confluence pages to DataHub documents."
},
"hierarchy": {
"$ref": "#/$defs/HierarchyConfig",
"description": "Parent-child relationship configuration."
},
"filtering": {
"$ref": "#/$defs/FilteringConfig",
"description": "Filtering options for document content."
},
"chunking": {
"$ref": "#/$defs/ChunkingConfig",
"description": "Configuration for document chunking (required for embeddings)."
},
"embedding": {
"$ref": "#/$defs/EmbeddingConfig",
"description": "Configuration for generating vector embeddings for semantic search."
},
"advanced": {
"$ref": "#/$defs/AdvancedConfig",
"description": "Advanced ingestion options."
}
},
"required": [
"url"
],
"title": "ConfluenceSourceConfig",
"type": "object"
}
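To make the top-level fields of the schema above more concrete, here is a minimal sketch of two source configs written as Python dicts, together with a small checker. The checker `check_confluence_config` is a hypothetical helper written for this illustration, not part of DataHub; it only mirrors constraints that the schema states (`url` is required, `cloud` defaults to true, `fail_safe_threshold` is bounded to 0 through 100) and the credential fields that the descriptions associate with Cloud and Data Center. The credential values and the email-style username are placeholders and assumptions.

```python
from typing import Any


def check_confluence_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a Confluence source config dict.

    Hypothetical helper for illustration only; it is not DataHub's own
    validation and covers just a few constraints stated in the schema.
    """
    problems: list[str] = []

    # "url" is the only field listed under "required" in the schema.
    if not config.get("url"):
        problems.append("url is required")

    # "cloud" defaults to true in the schema.
    if config.get("cloud", True):
        # The schema describes username and api_token as the Cloud credentials.
        for key in ("username", "api_token"):
            if not config.get(key):
                problems.append(f"{key} is expected for Confluence Cloud")
    else:
        # The schema describes personal_access_token as the Data Center credential.
        if not config.get("personal_access_token"):
            problems.append("personal_access_token is expected for Data Center")

    # fail_safe_threshold has minimum 0.0 and maximum 100.0 in the schema.
    stateful = config.get("stateful_ingestion") or {}
    threshold = stateful.get("fail_safe_threshold", 75.0)
    if not 0.0 <= threshold <= 100.0:
        problems.append(
            "stateful_ingestion.fail_safe_threshold must be between 0 and 100"
        )

    return problems


# Cloud config using only field names that appear in the schema.
cloud_config = {
    "url": "https://your-domain.atlassian.net/wiki",
    "cloud": True,
    "username": "user@example.com",  # placeholder value (assumed email-style)
    "api_token": "<api-token>",      # placeholder value
    "spaces": {"allow": ["ENGINEERING"], "deny": ["ARCHIVE"]},
    "max_pages_per_space": 1000,
    "recursive": True,
    "stateful_ingestion": {"enabled": True, "fail_safe_threshold": 75.0},
}

# Data Center config that is missing its personal access token.
data_center_config = {
    "url": "https://confluence.your-company.com",
    "cloud": False,
}

print(check_confluence_config(cloud_config))        # []
print(check_confluence_config(data_center_config))  # one problem reported
```

In an actual recipe these keys would sit under the source's `config` block; the connector performs its own validation, so a check like this is only useful as a quick pre-flight sanity test.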
Code Coordinates
- Class Name: datahub.ingestion.source.confluence.confluence_source.ConfluenceSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Confluence, feel free to ping us on our Slack.