
Confluence

Incubating

Important Capabilities

  • Detect Deleted Entities: Enabled by default.
  • Platform Instance: Enabled by default.
  • Test Connection: Enabled by default.

Overview

The Confluence source ingests pages and spaces from Confluence (Cloud or Data Center) as DataHub Document entities, and can optionally generate embeddings for semantic search.

Key Features

1. Content Extraction

  • Page Content: Full text extraction from Confluence pages including all content types
  • Space Discovery: Automatic discovery of all pages within specified spaces
  • Hierarchical Structure: Maintains parent-child relationships between pages
  • Metadata Extraction: Captures creation/modification timestamps, authors, labels, and custom properties

2. Hierarchical Relationships

  • Parent-Child Links: Preserves Confluence page hierarchy in DataHub
  • Recursive Discovery: Recursively discovers nested pages starting from root pages or entire spaces
  • Space Organization: Maintains space context as custom properties
  • Flexible Navigation: Browse documentation structure in DataHub UI

3. Embedding Generation

Optional semantic search support with sensible defaults:

  • Supported providers: Cohere (API key), AWS Bedrock (IAM roles)
  • Automatic chunking: Documents are automatically chunked for optimal embedding generation
  • Automatic deduplication: Prevents duplicate chunk embeddings

See Semantic Search Configuration for detailed setup and advanced options.

4. Stateful Ingestion

Supports smart incremental updates via stateful ingestion:

  • Content Change Detection: Only reprocesses documents when content or embeddings config changes
  • Deletion Detection: Automatically removes stale entities from DataHub
  • Flexible Discovery: Ingest entire spaces, specific pages, or page trees
  • State Persistence: Maintains processing state between runs to skip unchanged documents

Prerequisites

1. Confluence API Access

For Confluence Cloud

Create an API token:

  1. Go to https://id.atlassian.com/manage-profile/security/api-tokens
  2. Click "Create API token"
  3. Give it a name (e.g., "DataHub Integration")
  4. Copy the token (you won't be able to see it again)

You'll need:

  • Base URL: Your Confluence Cloud URL (e.g., https://your-domain.atlassian.net/wiki)
  • Username: Your Atlassian account email
  • API Token: The token you just created

For Confluence Data Center / Server

Create a Personal Access Token:

  1. Go to your Confluence → Profile → Personal Access Tokens
  2. Click "Create token"
  3. Give it a name and set expiration
  4. Copy the token

You'll need:

  • Base URL: Your Confluence server URL (e.g., https://confluence.company.com)
  • Personal Access Token: The token you created

Note: For Data Center, you can also use username/password, but Personal Access Tokens are recommended.
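As a rough illustration of how the two credential types differ on the wire, the sketch below builds the HTTP Authorization header for each case. It assumes Cloud uses HTTP Basic auth (email plus API token) and Data Center sends the Personal Access Token as a Bearer token; build_auth_headers is a hypothetical helper, not part of the connector.

```python
import base64


def build_auth_headers(cloud, username=None, api_token=None, personal_access_token=None):
    """Return HTTP headers for a Confluence REST call (hypothetical helper)."""
    if cloud:
        if not username or not api_token:
            raise ValueError("Cloud requires both username and api_token")
        # Cloud: HTTP Basic auth, email as the user and the API token as the password.
        raw = f"{username}:{api_token}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if not personal_access_token:
        raise ValueError("Data Center requires personal_access_token")
    # Data Center / Server: the Personal Access Token is sent as a Bearer token.
    return {"Authorization": f"Bearer {personal_access_token}"}
```

Headers built this way can be passed to any HTTP client to check credentials by hand before running ingestion.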

2. Required Permissions

The API credentials must have:

  • Read access to all spaces and pages you want to ingest
  • For Cloud: User must be added to spaces or have site-wide read access
  • For Data Center: User must have "View" permissions on spaces

3. Embedding Provider (Optional)

If you want semantic search capabilities, configure an embedding provider in your DataHub instance.

Supported providers include Cohere (API key) and AWS Bedrock (IAM roles). The connector will use sensible defaults for chunking and embedding configuration.

See Semantic Search Configuration for detailed provider setup and configuration options.

Common Use Cases

1. Auto-Discover All Spaces (Default)

By default, the connector discovers and ingests all accessible spaces:

source:
  type: confluence
  config:
    # Confluence Cloud
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    # No filtering - discovers all accessible spaces
    # Optional: limit number of spaces for large instances
    max_spaces: 100

2. Include Specific Spaces

Ingest only specific Confluence spaces:

source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    # Include only these spaces
    spaces:
      allow:
        - "ENGINEERING"
        - "PRODUCT"
        - "DESIGN"

3. Exclude Personal and Archive Spaces

Ingest all spaces except specific ones:

source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    # Exclude personal spaces and archived content
    spaces:
      deny:
        - "~john.doe"
        - "~jane.smith"
        - "ARCHIVE"
        - "OLD_DOCS"

4. Specific Page Trees Only

Ingest specific pages and their descendants:

source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    # Start from specific pages
    pages:
      allow:
        - "123456789" # API Documentation page tree
        - "987654321" # User Guides page tree
    recursive: true # Include all child pages

5. Combined Space and Page Filtering

Combine space and page filters for fine-grained control:

source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    # Include specific spaces
    spaces:
      allow:
        - "ENGINEERING"
        - "PRODUCT"
      # Exclude personal spaces even if in allow list
      deny:
        - "~admin"

    # Exclude specific pages (e.g., drafts, archived content)
    pages:
      deny:
        - "999999" # Draft page
        - "888888" # Archived page

6. Data Center / Server Setup

Connect to Confluence Data Center or Server:

source:
  type: confluence
  config:
    # Data Center / Server
    cloud: false
    url: "https://confluence.company.com"
    personal_access_token: "${CONFLUENCE_PAT}"

    spaces:
      allow:
        - "WIKI"
        - "DOCS"

7. Production Setup with Stateful Ingestion

Enterprise setup with incremental updates:

source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    spaces:
      allow:
        - "COMPANY"
        - "PUBLIC"

    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true

Note: Embedding configuration is managed by your DataHub instance. See Semantic Search Configuration for setup.

8. Using URLs for Allow/Deny

You can specify spaces and pages using full URLs for both allow and deny lists:

source:
  type: confluence
  config:
    cloud: true
    url: "https://your-domain.atlassian.net/wiki"
    username: "user@company.com"
    api_token: "${CONFLUENCE_API_TOKEN}"

    # Use full URLs - connector extracts keys/IDs automatically
    spaces:
      allow:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG"
        - "https://your-domain.atlassian.net/wiki/spaces/PRODUCT"
      deny:
        - "https://your-domain.atlassian.net/wiki/spaces/ARCHIVE"
        - "~john.doe" # Can mix URLs and keys

    pages:
      allow:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Getting+Started"
      deny:
        - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/999999/Draft"

Filtering Content

The connector provides flexible filtering options through allow and deny lists for both spaces and pages.

Space Filtering

Control which Confluence spaces are ingested:

spaces.allow: Include only specific spaces (by default, all accessible spaces are discovered)

spaces:
  allow:
    - "ENGINEERING" # Space key
    - "PRODUCT"
    - "https://your-domain.atlassian.net/wiki/spaces/DESIGN" # Or full URL

spaces.deny: Exclude specific spaces (applied after spaces.allow)

spaces:
  deny:
    - "~john.doe" # Personal space
    - "ARCHIVE" # Archived content
    - "TEST" # Test space

Page Filtering

Control which pages are ingested:

pages.allow: Include only specific pages (triggers page-based mode, bypasses space discovery)

pages:
  allow:
    - "123456789" # Page ID
    - "987654321"
    - "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/111111/API+Docs" # Or full URL
recursive: true # Include child pages

pages.deny: Exclude specific pages (works in both space-based and page-based modes)

pages:
  deny:
    - "999999" # Draft page
    - "888888" # Archived page

Filtering Rules

Precedence:

  • Deny lists always take precedence over allow lists
  • If a space/page is in both allow and deny lists, it will be excluded
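The precedence rule can be sketched in a few lines. This is an illustrative model only, assuming exact key matching after URLs have been reduced to keys or IDs; is_included is a hypothetical name, not the connector's API.

```python
def is_included(key, allow=None, deny=None):
    """Decide whether a space key or page ID passes the allow/deny filters."""
    # Deny always wins, even when the same key also appears in the allow list.
    if deny and key in deny:
        return False
    # With an allow list, only listed keys pass; with none, everything passes.
    if allow is not None:
        return key in allow
    return True
```

For example, a space listed in both spaces.allow and spaces.deny is rejected, while a space that appears in neither list is accepted only when no allow list is configured.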

Modes:

  • Space-based mode (default): Discovers spaces, then ingests all pages within allowed spaces
  • Page-based mode: When pages.allow is specified, bypasses space discovery and fetches specific page trees

Format Support:

  • Space keys: "ENGINEERING", "~username" (for personal spaces)
  • Page IDs: "123456789" (numeric string)
  • Full URLs: Both space URLs and page URLs are automatically parsed

Common Filtering Patterns

Exclude personal spaces (wildcards such as "~*" are not supported, so list each personal space key explicitly):

spaces:
  deny:
    - "~john.doe"
    - "~jane.smith"

Ingest only documentation spaces:

spaces:
  allow:
    - "DOCS"
    - "API_DOCS"
    - "USER_GUIDES"

Focus on specific documentation trees:

pages:
  allow:
    - "123456" # API Documentation root page
    - "789012" # User Guides root page
recursive: true

Exclude drafts and WIP pages:

pages:
  deny:
    - "999999" # Draft page ID
    - "888888" # WIP page ID

How It Works

Processing Pipeline

  1. Discovery: Discovers spaces and pages via the Confluence API
  2. Download: Downloads page content via Confluence REST API
  3. Extraction: Extracts text, metadata, and hierarchy from pages
  4. Chunking: Splits documents into semantic chunks (if embeddings enabled)
  5. Embedding: Generates vector embeddings for each chunk (if embeddings enabled)
  6. Emission: Emits Document entities with SemanticContent aspects to DataHub
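To make the chunking step concrete, here is a minimal fixed-size chunker. It is a simplified stand-in, not the connector's "basic" or "by_title" strategy; the parameter names mirror chunking.max_characters and chunking.overlap from the config reference, and chunk_text itself is a hypothetical helper.

```python
def chunk_text(text, max_characters=500, overlap=0):
    """Split text into chunks of at most max_characters, sharing `overlap` characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")
    step = max_characters - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        # Stop once this chunk reaches the end, so no trailing duplicate is emitted.
        if start + max_characters >= len(text):
            break
        start += step
    return chunks
```

Each resulting chunk would then be sent to the embedding provider in step 5.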

URL Format Support

The connector supports multiple input formats for spaces and pages in allow/deny lists:

Space Identifiers:

  • Space key: "ENGINEERING", "~username" (for personal spaces)
  • Full URL: "https://your-domain.atlassian.net/wiki/spaces/ENGINEERING"

Page Identifiers:

  • Page ID: "123456789" (numeric string)
  • Full URL (Cloud): "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Page+Title"
  • Full URL (Data Center): "https://confluence.company.com/pages/viewpage.action?pageId=123456"

The connector automatically extracts space keys and page IDs from URLs, so you can use either format interchangeably in the spaces.allow, spaces.deny, pages.allow, and pages.deny lists.
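The sketch below shows one way the identifier formats listed above could be normalized. It is an illustration based only on the URL shapes shown in this section; extract_space_key and extract_page_id are hypothetical names, and the connector's real parsing may differ.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_space_key(value):
    """Return the space key from a bare key or a /spaces/<KEY> URL."""
    if not value.startswith(("http://", "https://")):
        return value  # already a key such as "ENGINEERING" or "~username"
    match = re.search(r"/spaces/([^/?#]+)", urlparse(value).path)
    if match is None:
        raise ValueError(f"No space key found in {value!r}")
    return match.group(1)


def extract_page_id(value):
    """Return the numeric page ID from a bare ID, a Cloud URL, or a Data Center URL."""
    if value.isdigit():
        return value
    parsed = urlparse(value)
    # Cloud style: .../spaces/ENG/pages/123456/Page+Title
    match = re.search(r"/pages/(\d+)", parsed.path)
    if match:
        return match.group(1)
    # Data Center style: .../pages/viewpage.action?pageId=123456
    page_ids = parse_qs(parsed.query).get("pageId")
    if page_ids and page_ids[0].isdigit():
        return page_ids[0]
    raise ValueError(f"No page ID found in {value!r}")
```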

Stateful Ingestion Details

The source uses content-based change detection:

  • Calculates SHA-256 hash of document content + embedding configuration
  • Compares hash with previous run to detect changes
  • Only reprocesses documents when hash changes
  • Tracks all emitted URNs to detect deletions

This means:

  • First run: Processes all documents
  • Subsequent runs: Only processes new/changed documents
  • Deleted pages: Automatically soft-deleted from DataHub
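A minimal sketch of this kind of change detection follows. The way content and embedding configuration are combined into the hash input, and the shape of the stored state, are assumptions made for illustration; content_fingerprint and plan_run are hypothetical names.

```python
import hashlib
import json


def content_fingerprint(text, embedding_config):
    """SHA-256 over page text plus embedding config (input layout is an assumption)."""
    config_json = json.dumps(embedding_config, sort_keys=True)  # stable key order
    digest = hashlib.sha256()
    digest.update(text.encode("utf-8"))
    digest.update(b"\x00")  # separator so text and config cannot run together
    digest.update(config_json.encode("utf-8"))
    return digest.hexdigest()


def plan_run(current, previous):
    """Compare {urn: fingerprint} maps; return (urns_to_process, urns_to_soft_delete)."""
    to_process = sorted(urn for urn, fp in current.items() if previous.get(urn) != fp)
    to_soft_delete = sorted(set(previous) - set(current))
    return to_process, to_soft_delete
```

On a first run the previous map is empty, so every document is processed; on later runs only new or changed fingerprints are, and URNs missing from the current run become deletion candidates.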

Limitations and Considerations

Confluence API Limits

  • Rate Limits: Confluence enforces rate limits (Cloud: varies by plan, Data Center: configurable)
  • Content Types: Complex macros may not extract perfectly (e.g., embedded content, custom macros)
  • Attachments: File attachments are not ingested (only page content)

Performance Considerations

  • Large Spaces: First run may take significant time for large spaces (1000+ pages)
  • Embedding Generation: Adds processing time proportional to content volume
  • API Costs: Embedding providers may incur costs based on usage

Content Extraction

  • Supported Content: Text, headings, lists, code blocks, tables, panels
  • Limited Support: Some macros extract as text/links
  • Not Supported: Attachments, complex custom macros, embedded Jira issues (content only)

Troubleshooting

Common Issues

"401 Unauthorized" or "Authentication failed" errors:

  • Cloud: Verify username (email) and api_token are correct
  • Data Center: Verify personal_access_token is valid and not expired
  • Check that cloud: true/false matches your Confluence type
  • Ensure the URL includes /wiki suffix for Cloud (e.g., https://domain.atlassian.net/wiki)

"403 Forbidden" or "Space not found" errors:

  • Verify the user has read access to the specified spaces
  • Check that space keys are correct (case-sensitive)
  • For Cloud, ensure user is added to private spaces
  • For Data Center, verify "View Space" permissions

Empty or missing content:

  • Verify pages contain text (empty pages are skipped by default with skip_empty_documents: true)
  • Check min_text_length filter setting (default: 50 characters)
  • Ensure recursive: true if expecting child pages
  • Check that pages are not restricted or have special permissions

Slow ingestion:

  • Increase processing.parallelism.num_processes (default: 2)
  • Consider filtering specific spaces instead of all spaces
  • First run is always slower - subsequent runs use incremental updates
  • Large spaces with 1000+ pages may take several minutes

Embedding generation failures:

  • Verify provider API key is correct
  • Check provider-specific rate limits (Cohere: 10k requests/min)
  • Ensure embedding model name is valid for your provider
  • For Bedrock: verify IAM permissions and model access is enabled in AWS Console

Stateful ingestion not working:

  • Ensure stateful_ingestion.enabled: true in config
  • Check DataHub connection (source needs to query previous state)
  • Verify state file path is writable (if using file-based state)
  • Look for state persistence logs in ingestion output

Missing hierarchy/parent relationships:

  • Verify hierarchy.enabled: true (default)
  • Check that parent pages are being ingested
  • Ensure recursive: true to discover parent-child relationships
  • Parent pages must be accessible to the API credentials

Page IDs not working:

  • For Cloud, use the numeric page ID from the URL (after /pages/)
  • For Data Center, page IDs may differ - use the ID from the page URL or query param ?pageId=
  • Alternatively, use full page URLs instead of IDs in pages.allow or pages.deny

How to find space keys and page IDs:

  • Space key: Visible in the space URL: https://domain.atlassian.net/wiki/spaces/ENGINEERING → key is ENGINEERING
  • Page ID (Cloud): In the page URL after /pages/: https://domain.atlassian.net/wiki/spaces/ENG/pages/123456/Title → ID is 123456
  • Page ID (Data Center): In the URL query parameter: https://confluence.company.com/pages/viewpage.action?pageId=123456 → ID is 123456
  • Personal space key: Format is ~username (e.g., ~john.doe for user john.doe)

Performance Tuning

Parallelism Settings

processing:
  parallelism:
    num_processes: 4 # Increase for faster processing (default: 2)
    max_connections: 20 # Concurrent API connections (default: 10)

Guidelines:

  • Small spaces (<100 pages): num_processes: 2
  • Medium spaces (100-500 pages): num_processes: 4
  • Large spaces (>500 pages): num_processes: 8
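If recipes are generated programmatically, the guidelines above can be encoded as a small helper. This is only a convenience sketch; suggested_num_processes is a hypothetical function, and the thresholds simply restate the guideline values.

```python
def suggested_num_processes(page_count):
    """Map an approximate page count to the num_processes guideline."""
    if page_count < 100:
        return 2  # small spaces
    if page_count <= 500:
        return 4  # medium spaces
    return 8  # large spaces
```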

Filtering

filtering:
  min_text_length: 100 # Skip short pages (default: 50)
  skip_empty_documents: true # Skip empty pages (default: true)
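The effect of these two settings can be pictured as a simple predicate. How the connector actually combines them is not documented here, so the sketch assumes that empty pages are governed only by skip_empty_documents and that min_text_length applies to non-empty text; should_skip is a hypothetical name.

```python
def should_skip(text, min_text_length=50, skip_empty_documents=True):
    """Return True when a page would be filtered out (assumed semantics)."""
    stripped = text.strip()
    if not stripped:
        # Whitespace-only pages count as empty in this sketch.
        return skip_empty_documents
    return len(stripped) < min_text_length
```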

Space Selection

Instead of ingesting all spaces, select specific ones:

spaces:
  allow:
    - "ENGINEERING" # High-value documentation space
    - "PRODUCT" # Product requirements space
  deny:
    - "~john.doe" # Exclude personal spaces (list each user; wildcards are not supported)
    - "ARCHIVE" # Exclude archived content
    - "TEST" # Exclude test spaces

CLI based Ingestion

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field | Description
url 
string
Base URL of your Confluence instance. Examples: 'https://your-domain.atlassian.net/wiki' (Cloud) or 'https://confluence.your-company.com' (Data Center)
api_token
One of string(password), null
API token for Confluence Cloud authentication. Generate at: https://id.atlassian.com/manage-profile/security/api-tokens
Default: None
cloud
boolean
Whether this is a Confluence Cloud instance (True) or Data Center/Server (False).
Default: True
max_pages_per_space
integer
Maximum number of pages to ingest per space.
Default: 1000
max_spaces
integer
Maximum number of spaces to ingest when auto-discovering (applies when spaces.allow and pages.allow are not set).
Default: 100
personal_access_token
One of string(password), null
Personal Access Token for Confluence Data Center authentication. Generate from: User Profile > Settings > Personal Access Tokens
Default: None
platform_instance
One of string, null
Optional human-readable identifier for this Confluence instance (e.g., 'mycompany-prod', 'team-a-confluence'). If not provided, automatically generated by hashing the base URL, which guarantees global uniqueness across all Confluence installations (both Cloud and Data Center). Use explicit values for more readable URNs, but auto-generated hashes are perfectly fine and require no manual configuration.
Default: None
recursive
boolean
Whether to recursively fetch child pages (applies to page URLs only).
Default: True
username
One of string, null
Username for Confluence Cloud authentication (required for Cloud).
Default: None
advanced
AdvancedConfig
Advanced configuration options.
advanced.continue_on_failure
boolean
Default: True
advanced.max_errors
integer
Default: 10
advanced.output_format
Enum
One of: "json", "xml"
Default: json
advanced.preserve_outputs
boolean
Default: False
advanced.raise_on_error
boolean
Default: False
advanced.work_dir
string
Default: /tmp/unstructured_datahub
advanced.cache
CacheConfig
Cache configuration.
advanced.cache.cache_dir
string
Default: ~/.cache/unstructured_datahub
advanced.cache.enabled
boolean
Default: True
advanced.cache.ttl
integer
Cache TTL in seconds
Default: 86400
advanced.retry
RetryConfig
Retry configuration.
advanced.retry.backoff_factor
integer
Default: 2
advanced.retry.enabled
boolean
Default: True
advanced.retry.max_attempts
integer
Default: 3
advanced.retry.retry_on_timeout
boolean
Default: True
chunking
ChunkingConfig
Chunking strategy configuration.
chunking.combine_text_under_n_chars
integer
Combine chunks smaller than this size
Default: 100
chunking.max_characters
integer
Maximum characters per chunk
Default: 500
chunking.overlap
integer
Character overlap between chunks
Default: 0
chunking.strategy
Enum
One of: "basic", "by_title"
Default: by_title
document_mapping
DocumentMappingConfig
Document entity mapping configuration.
document_mapping.id_pattern
string
Pattern for generating document IDs
Default: {source_type}-{directory}-{basename}
document_mapping.status
Enum
One of: "PUBLISHED", "UNPUBLISHED"
Default: PUBLISHED
document_mapping.id_normalization
IdNormalizationConfig
Document ID normalization rules.
document_mapping.id_normalization.lowercase
boolean
Convert to lowercase
Default: True
document_mapping.id_normalization.max_length
integer
Maximum ID length
Default: 200
document_mapping.id_normalization.remove_special_chars
boolean
Remove special characters except _ and -
Default: True
document_mapping.id_normalization.replace_spaces_with
string
Replace spaces with this character
Default: -
document_mapping.source
SourceConfig
Document source configuration.
document_mapping.source.include_external_id
boolean
Include external ID in DocumentSource
Default: True
document_mapping.source.include_external_url
boolean
Include external URL in DocumentSource
Default: True
document_mapping.source.type
Enum
One of: "NATIVE", "EXTERNAL"
Default: EXTERNAL
document_mapping.title
TitleExtractionConfig
Title extraction configuration.
document_mapping.title.extract_from_content
boolean
Try to extract title from document content
Default: True
document_mapping.title.fallback_to_filename
boolean
Use filename as title if not found in content
Default: True
document_mapping.title.max_length
integer
Maximum title length
Default: 500
embedding
EmbeddingConfig
Embedding generation configuration.

Default behavior: Fetches configuration from DataHub server automatically.
Override behavior: Validates local config against server when explicitly set.
embedding.allow_local_embedding_config
boolean
BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.
Default: False
embedding.api_key
One of string, null
API key for Cohere (not needed for Bedrock with IAM roles)
Default: None
embedding.aws_region
One of string, null
AWS region for Bedrock. If not set, loads from server.
Default: None
embedding.batch_size
integer
Batch size for embedding API calls
Default: 25
embedding.input_type
One of string, null
Input type for Cohere embeddings
Default: search_document
embedding.model
One of string, null
Model name. If not set, loads from server.
Default: None
embedding.model_embedding_key
One of string, null
Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.
Default: None
embedding.provider
One of Enum, null
Embedding provider (bedrock uses AWS, cohere/openai use API key). If not set, loads from server.
Default: None
filtering
FilteringConfig
File filtering configuration.
filtering.max_file_size
One of integer, null
Maximum file size in bytes
Default: None
filtering.min_file_size
One of integer, null
Minimum file size in bytes
Default: None
filtering.min_text_length
integer
Minimum text length in characters
Default: 50
filtering.modified_after
One of string, null
Only files modified after this date (ISO format)
Default: None
filtering.modified_before
One of string, null
Only files modified before this date (ISO format)
Default: None
filtering.skip_empty_documents
boolean
Skip documents with no text content
Default: True
filtering.exclude_patterns
array
Glob patterns to exclude
filtering.exclude_patterns.string
string
filtering.include_patterns
array
Glob patterns to include
filtering.include_patterns.string
string
hierarchy
HierarchyConfig
Hierarchy configuration.
hierarchy.enabled
boolean
Enable parent-child relationships
Default: True
hierarchy.parent_strategy
Enum
One of: "folder", "none", "custom", "notion", "confluence"
Default: folder
hierarchy.custom_mapping
One of CustomMappingConfig, null
Custom mapping configuration
Default: None
hierarchy.custom_mapping.rules
array
Custom parent mapping rules
hierarchy.custom_mapping.rules.CustomParentRule
CustomParentRule
Custom parent mapping rule.
hierarchy.custom_mapping.rules.CustomParentRule.parent_id 
string
Parent document ID for matching files
hierarchy.custom_mapping.rules.CustomParentRule.pattern 
string
Glob pattern to match file paths
hierarchy.folder_mapping
FolderMappingConfig
Folder hierarchy mapping configuration.
hierarchy.folder_mapping.create_parent_docs
boolean
Create Document entities for folders
Default: True
hierarchy.folder_mapping.max_depth
integer
Maximum hierarchy depth
Default: 10
hierarchy.folder_mapping.parent_id_pattern
string
Pattern for parent document IDs
Default: {source_type}-{directory}
hierarchy.folder_mapping.root_parent
One of string, null
Optional root document URN
Default: None
pages
PageFilterConfig
Configuration for filtering Confluence pages.
pages.allow
One of array, null
List of specific Confluence pages to include in ingestion. By default, all pages in discovered spaces are included. Specify page IDs or URLs to limit ingestion to specific pages and their children.

Examples:
- Page IDs: ['123456', '789012']
- Page URLs: ['https://domain.atlassian.net/wiki/spaces/ENG/pages/123456/API-Docs']

When specified, only these page trees will be ingested (if recursive=true). This allows focusing on specific documentation sections.
Default: None
pages.allow.string
string
pages.deny
One of array, null
List of specific Confluence pages to exclude from ingestion. Applies after allow filtering.

Examples:
- Exclude specific pages: ['123456', '789012']
- Page URLs: ['https://domain.atlassian.net/wiki/spaces/ENG/pages/999999/Draft']

Useful for excluding specific pages within otherwise included spaces.
Default: None
pages.deny.string
string
processing
ProcessingConfig
Processing configuration (partitioning only, no chunking).
processing.parallelism
ParallelismConfig
Parallelism configuration.
processing.parallelism.disable_parallelism
boolean
Disable all parallelism
Default: False
processing.parallelism.max_connections
integer
Max concurrent connections for async operations
Default: 10
processing.parallelism.num_processes
integer
Number of worker processes
Default: 2
processing.partition
PartitionConfig
Unstructured partitioning configuration.
processing.partition.additional_args
object
Additional partition arguments
processing.partition.api_key
One of string, null
Unstructured API key
Default: None
processing.partition.partition_by_api
boolean
Use Unstructured API for partitioning
Default: False
processing.partition.split_pdf_concurrency_level
integer
Number of parallel requests for PDF pages
Default: 5
processing.partition.split_pdf_page
boolean
Enable page-level splitting for large PDFs
Default: False
processing.partition.strategy
Enum
One of: "auto", "hi_res", "fast", "ocr_only"
Default: auto
processing.partition.ocr_languages
array
Languages for OCR
Default: ['eng']
processing.partition.ocr_languages.string
string
spaces
SpaceFilterConfig
Configuration for filtering Confluence spaces.
spaces.allow
One of array, null
List of Confluence spaces to include in ingestion. By default, all accessible spaces are discovered. Specify space keys or URLs to limit ingestion to specific spaces.

Examples:
- Space keys: ['ENGINEERING', 'PRODUCT', 'DESIGN']
- Space URLs: ['https://domain.atlassian.net/wiki/spaces/TEAM']
- Mixed: ['ENGINEERING', 'https://domain.atlassian.net/wiki/spaces/PRODUCT']

If specified, only these spaces will be ingested. Use deny to exclude specific spaces from discovery.
Default: None
spaces.allow.string
string
spaces.deny
One of array, null
List of Confluence spaces to exclude from ingestion. Applies after allow filtering.

Examples:
- Exclude personal spaces: ['~user1', '~user2']
- Exclude specific spaces: ['ARCHIVE', 'OLD_DOCS']
- Space URLs: ['https://domain.atlassian.net/wiki/spaces/TEST']

Useful for excluding personal spaces or archived content.
Default: None
spaces.deny.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful Ingestion Config
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Code Coordinates

  • Class Name: datahub.ingestion.source.confluence.confluence_source.ConfluenceSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Confluence, feel free to ping us on our Slack.