Notion
Notion Source
Ingest pages and databases from Notion workspaces as DataHub Document entities with optional semantic embeddings.
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Test Connection | ✅ | Enabled by default. |
Overview
The Notion source ingests pages and databases from Notion workspaces as DataHub Document entities and can optionally generate embeddings for semantic search.
Key Features
1. Content Extraction
- Page Content: Full text extraction from Notion pages including all supported block types
- Database Rows: Ingests database entries as individual documents
- Hierarchical Structure: Maintains parent-child relationships between pages
- Metadata Extraction: Captures creation/modification timestamps, authors, and custom properties
2. Hierarchical Relationships
- Parent-Child Links: Preserves Notion's page hierarchy in DataHub
- Automatic Discovery: Recursively discovers nested pages starting from root pages
- Flexible Navigation: Browse documentation structure in DataHub UI
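The recursive discovery described above can be pictured as a breadth-first walk that records each page's parent. This is a minimal sketch over an in-memory tree, not the source's actual implementation; `discover_pages` and the `children_of` mapping are hypothetical names, and a real run would obtain children from the Notion API.

```python
from collections import deque


def discover_pages(root_ids, children_of):
    """Breadth-first walk from the root pages.

    Returns {page_id: parent_id}, with None as the parent of each root.
    `children_of` is a hypothetical {page_id: [child_id, ...]} mapping.
    """
    parents = {}
    queue = deque((root_id, None) for root_id in root_ids)
    while queue:
        page_id, parent_id = queue.popleft()
        if page_id in parents:
            continue  # already visited; guards against duplicates and cycles
        parents[page_id] = parent_id
        for child_id in children_of.get(page_id, []):
            queue.append((child_id, page_id))
    return parents


# Hypothetical workspace: "wiki" is the only page shared as a root.
tree = {"wiki": ["eng", "hr"], "eng": ["runbooks"]}
parents = discover_pages(["wiki"], tree)
```

The resulting mapping is enough to emit one parent link per document, which is how the page hierarchy can be preserved.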
3. Embedding Generation
Optional semantic search support:
- Supported providers: Cohere (API key), AWS Bedrock (IAM roles)
- Chunking strategies: by_title, basic
- Configurable chunk size: Optimize for your embedding model (in characters)
- Automatic deduplication: Prevents duplicate chunk embeddings
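One plausible way to avoid embedding the same chunk twice is to key chunks by a content hash before calling the provider. The sketch below illustrates that idea only; `dedupe_chunks` is a hypothetical helper, and the source may deduplicate differently.

```python
import hashlib


def dedupe_chunks(chunks):
    """Keep the first occurrence of each chunk, comparing by SHA-256 of the trimmed text."""
    seen = set()
    unique = []
    for text in chunks:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Because only unique chunks reach the embedding provider, repeated boilerplate (for example a footer copied onto many pages) is embedded once.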
4. Stateful Ingestion
Supports smart incremental updates via stateful ingestion:
- Content Change Detection: Only reprocesses documents when content or embeddings config changes
- Deletion Detection: Automatically removes stale entities from DataHub
- Recursive Discovery: Start from root pages/databases, automatically discovers and ingests child pages
- State Persistence: Maintains processing state between runs to skip unchanged documents
Prerequisites
1. Notion Integration
Create a Notion internal integration:
- Go to https://www.notion.so/my-integrations
- Click "+ New integration"
- Give it a name (e.g., "DataHub Integration")
- Select the workspace
- Copy the Internal Integration Token (starts with secret_)
2. Share Pages with Integration
The integration can only access pages explicitly shared with it:
- Open the page or database in Notion
- Click "Share" in the top right
- Search for your integration name
- Click "Invite"
Important: For recursive ingestion, you only need to share top-level pages. Child pages inherit access automatically.
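Before running an ingestion it can help to confirm that the token is accepted at all. The sketch below calls the Notion API's "retrieve your token's bot user" endpoint (`GET /v1/users/me`) using only the standard library. It is independent of the DataHub source; the helper names and the pinned `Notion-Version` value are assumptions chosen for illustration.

```python
import json
import urllib.request

NOTION_VERSION = "2022-06-28"  # assumed API version; adjust to the one you target


def build_whoami_request(token):
    """Build (but do not send) a request for the integration's bot user."""
    return urllib.request.Request(
        "https://api.notion.com/v1/users/me",
        headers={
            "Authorization": f"Bearer {token}",
            "Notion-Version": NOTION_VERSION,
        },
    )


def check_token(token):
    """Send the request; raises urllib.error.HTTPError (e.g. 401) if the token is rejected."""
    with urllib.request.urlopen(build_whoami_request(token), timeout=10) as resp:
        return json.load(resp)
```

A successful call only proves that the token is valid. Pages still have to be shared with the integration before they become visible to it.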
3. Embedding Provider (Optional)
If you want semantic search capabilities, set up one of these providers:
Cohere
- Sign up at https://cohere.ai/
- Create an API key
- Supports: embed-english-v3.0, embed-multilingual-v3.0
AWS Bedrock
- AWS account with Bedrock access
- Enable Cohere models in AWS Console → Bedrock → Model access
- IAM permissions for bedrock:InvokeModel
- Recommended region: us-west-2
See Semantic Search Configuration for detailed embedding setup.
Common Use Cases
1. Workspace-wide Documentation Search
Ingest entire workspace documentation with semantic search:
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    # Start from workspace root page
    page_ids:
      - "workspace_root_page_id"
    recursive: true
    # Enable semantic embeddings
    embedding:
      provider: "cohere"
      model: "embed-english-v3.0"
      api_key: "${COHERE_API_KEY}"
2. Specific Database Ingestion
Ingest a specific Notion database (e.g., "Product Requirements"):
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    # Only this database
    database_ids:
      - "product_requirements_db_id"
    recursive: false  # Only database entries, not child pages
3. Multi-workspace Setup
Ingest from multiple workspaces (requires multiple integrations):
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    # Multiple root pages from different workspaces
    page_ids:
      - "workspace_1_page_id"
      - "workspace_2_page_id"
    recursive: true
4. Production Setup with AWS Bedrock
Enterprise setup using AWS Bedrock for embeddings:
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    page_ids:
      - "company_wiki_root"
    recursive: true
    # Use AWS Bedrock (no API key needed, uses IAM roles)
    embedding:
      provider: "bedrock"
      aws_region: "us-west-2"
      model: "cohere.embed-english-v3"
    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true
How It Works
Processing Pipeline
- Discovery: Notion API discovers pages/databases
- Download: Unstructured.io downloads and converts content to structured format
- Extraction: Extracts text, metadata, and hierarchy from Notion pages
- Chunking: Splits documents into semantic chunks (if embeddings enabled)
- Embedding: Generates vector embeddings for each chunk (if embeddings enabled)
- Emission: Emits Document entities with SemanticContent aspects to DataHub
Stateful Ingestion Details
The source uses content-based change detection:
- Calculates SHA-256 hash of document content + embedding configuration
- Compares hash with previous run to detect changes
- Only reprocesses documents when hash changes
- Tracks all emitted URNs to detect deletions
This means:
- First run: Processes all documents
- Subsequent runs: Only processes new/changed documents
- Deleted pages: Automatically soft-deleted from DataHub
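The change-detection steps above can be sketched as follows. This is an illustration under stated assumptions rather than the source's code: `content_fingerprint` and `plan_run` are hypothetical names, the way content and embedding configuration are combined before hashing is assumed, and state is modeled as a plain `{urn: fingerprint}` dictionary.

```python
import hashlib
import json


def content_fingerprint(text, embedding_config):
    """SHA-256 over the document text plus the embedding configuration."""
    payload = json.dumps({"text": text, "embedding": embedding_config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_run(previous, current):
    """Compare two {urn: fingerprint} snapshots.

    Returns (urns_to_process, urns_to_soft_delete).
    """
    to_process = sorted(urn for urn, fp in current.items() if previous.get(urn) != fp)
    to_soft_delete = sorted(set(previous) - set(current))
    return to_process, to_soft_delete
```

Because the embedding configuration is part of the hashed payload, switching models changes every fingerprint and causes a full reprocess, which matches the behavior described above.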
Limitations and Considerations
Notion API Limits
- Rate Limits: Notion enforces rate limits (an average of 3 requests/second per integration; requests over the limit are rejected with HTTP 429)
- Access Scope: Integration only sees explicitly shared pages
- Content Types: Some Notion blocks may not extract perfectly (e.g., complex embeds, synced blocks)
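If you call the Notion API from your own tooling alongside ingestion, a client-side throttle keeps request starts under a chosen rate. The class below is a minimal sketch; `Throttle` is a hypothetical helper, not part of the source, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Spaces calls so that at most `rate_per_second` of them start per second."""

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; returns the seconds slept."""
        now = self.clock()
        delay = self.next_allowed - now
        if delay > 0:
            self.sleep(delay)
            now += delay
        self.next_allowed = now + self.min_interval
        return max(delay, 0.0)
```

Calling `Throttle(3).wait()` before each request spaces requests roughly a third of a second apart.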
Performance Considerations
- Large Workspaces: First run may take significant time for large workspaces
- Embedding Generation: Adds processing time proportional to content volume
- API Costs: Unstructured API and embedding providers may incur costs
Content Extraction
- Supported Blocks: Text, headings, lists, code blocks, tables, callouts, toggles, quotes
- Limited Support: Embeds, equations, files (extracted as links/references)
- Not Supported: Live charts, board/gallery/timeline views (database views)
Troubleshooting
Common Issues
"Integration not found" or "Unauthorized" errors:
- Verify the api_key is correct (should start with secret_)
- Ensure pages are shared with the integration
- Check that the integration has "Read content" capability
Empty or missing content:
- Verify pages contain text (empty pages are skipped by default with skip_empty_documents: true)
- Check min_text_length filter setting (default: 50 characters)
- Ensure recursive: true if expecting child pages
- Check that child pages are not explicitly restricted
Slow ingestion:
- Increase processing.parallelism.num_processes (default: 2)
- Consider using partition_by_api: false for local processing (requires more memory)
- Filter specific pages instead of entire workspace using page_ids
- First run is always slower; subsequent runs use incremental updates
Embedding generation failures:
- Verify provider API key is correct
- Check provider-specific rate limits (Cohere: 10k requests/min)
- Ensure embedding model name is valid for your provider
- For Bedrock: verify IAM permissions and model access is enabled in AWS Console
Stateful ingestion not working:
- Ensure stateful_ingestion.enabled: true in config
- Check DataHub connection (source needs to query previous state)
- Verify state file path is writable (if using file-based state)
- Look for state persistence logs in ingestion output
Missing hierarchy/parent relationships:
- Verify hierarchy.enabled: true (default)
- Check that parent pages are being ingested
- Ensure recursive: true to discover parent-child relationships
- Parent pages must be accessible to the integration
Performance Tuning
Parallelism Settings
processing:
  parallelism:
    num_processes: 4  # Increase for faster processing (default: 2)
    max_connections: 20  # Concurrent API connections (default: 10)
Guidelines:
- Small workspaces (<100 pages): num_processes: 2
- Medium workspaces (100-1000 pages): num_processes: 4
- Large workspaces (>1000 pages): num_processes: 8
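The guidelines above translate directly into a small lookup, which can be handy when recipes are generated by a script. `suggest_num_processes` is a hypothetical helper; the thresholds are the ones listed above and are starting points rather than hard rules.

```python
def suggest_num_processes(page_count):
    """Map an estimated page count to the num_processes value suggested in the guidelines."""
    if page_count < 100:
        return 2  # small workspace
    if page_count <= 1000:
        return 4  # medium workspace
    return 8  # large workspace
```

The configuration schema allows values from 1 to 32, so all three suggestions are valid settings.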
Filtering
filtering:
  min_text_length: 100  # Skip short pages (default: 50)
  skip_empty_documents: true  # Skip empty pages (default: true)
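To reason about which pages these two settings exclude, the sketch below applies them as two sequential checks. How the source actually combines the settings, and whether it trims whitespace first, is an assumption here; `should_ingest` is a hypothetical helper.

```python
def should_ingest(text, min_text_length=50, skip_empty_documents=True):
    """Return True if a page with this extracted text would pass both filters."""
    stripped = text.strip()
    if skip_empty_documents and not stripped:
        return False  # blank page
    return len(stripped) >= min_text_length
```

Under this reading, skip_empty_documents only makes a visible difference when min_text_length is 0, because any positive minimum already rejects blank pages.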
Chunking Optimization
chunking:
  strategy: "by_title"  # Preserves document structure (recommended)
  max_characters: 500  # Chunk size (default: 500)
  combine_text_under_n_chars: 100  # Merge small chunks (default: 100)
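To get a feel for how max_characters and combine_text_under_n_chars interact, here is a deliberately simplified paragraph chunker. It is not the algorithm used by Unstructured or by this source; `chunk_text` is a hypothetical function that only mimics the two settings: small chunks are merged into the following text, and nothing exceeds the maximum size.

```python
def chunk_text(text, max_characters=500, combine_text_under_n_chars=100):
    """Split on blank lines, merge undersized chunks forward, and cap chunk length."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for para in paragraphs:
        # Hard-split any paragraph that is longer than the limit.
        pieces = [para[i:i + max_characters] for i in range(0, len(para), max_characters)]
        for piece in pieces:
            can_merge = (
                len(chunks) > 0
                and len(chunks[-1]) < combine_text_under_n_chars
                and len(chunks[-1]) + 1 + len(piece) <= max_characters
            )
            if can_merge:
                chunks[-1] = chunks[-1] + "\n" + piece
            else:
                chunks.append(piece)
    return chunks
```

Raising combine_text_under_n_chars produces fewer, larger chunks; lowering max_characters produces more, smaller ones. Both settings are measured in characters, so they should be sized against your embedding model's input limit.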
Related Documentation
- Notion API Documentation
- Semantic Search Configuration
- Unstructured.io Documentation
- Cohere Embeddings API
- AWS Bedrock Embeddings
- DataHub Document Ingestion
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: notion
  config:
    # Notion API token from your integration
    api_key: "${NOTION_API_KEY}"
    # Ingest specific pages (get IDs from page URLs)
    page_ids:
      - "your-page-id-here"
    # Or ingest all accessible content (leave page_ids and database_ids empty)
    # page_ids: []
    # database_ids: []
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
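Since page and database IDs come from Notion URLs, a small helper can pull them out when you are assembling a recipe from a list of links. This is a sketch under the assumption that the ID is the 32 hexadecimal characters at the end of the URL path (optionally written with hyphens); `extract_notion_id` is a hypothetical name, and whether the source expects hyphenated or plain IDs is not stated here.

```python
import re


def extract_notion_id(url):
    """Return the 32-hex-character ID at the end of a Notion URL path, or None."""
    # Drop query string and fragment (database links often carry ?v=...).
    path = url.split("?", 1)[0].split("#", 1)[0].rstrip("/")
    # Remove hyphens so both plain and UUID-style IDs are handled.
    match = re.search(r"([0-9a-fA-F]{32})$", path.replace("-", ""))
    return match.group(1).lower() if match else None
```

For example, a link of the form https://www.notion.so/Page-Title-{PAGE_ID} yields just the {PAGE_ID} portion, which can then be listed under page_ids.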
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
api_key ✅ string(password) | Notion internal integration token. Create one at https://www.notion.so/my-integrations |
recursive boolean | Recursively fetch child pages. When true, ingests all descendant pages of specified pages/databases. Default: True |
advanced AdvancedConfig | Advanced configuration options. |
advanced.continue_on_failure boolean | Default: True |
advanced.max_errors integer | Default: 10 |
advanced.output_format Enum | One of: "json", "xml" Default: json |
advanced.preserve_outputs boolean | Default: False |
advanced.raise_on_error boolean | Default: False |
advanced.work_dir string | Default: /tmp/unstructured_datahub |
advanced.cache CacheConfig | Cache configuration. |
advanced.cache.cache_dir string | Default: ~/.cache/unstructured_datahub |
advanced.cache.enabled boolean | Default: True |
advanced.cache.ttl integer | Cache TTL in seconds Default: 86400 |
advanced.retry RetryConfig | Retry configuration. |
advanced.retry.backoff_factor integer | Default: 2 |
advanced.retry.enabled boolean | Default: True |
advanced.retry.max_attempts integer | Default: 3 |
advanced.retry.retry_on_timeout boolean | Default: True |
chunking ChunkingConfig | Chunking strategy configuration. |
chunking.combine_text_under_n_chars integer | Combine chunks smaller than this size Default: 100 |
chunking.max_characters integer | Maximum characters per chunk Default: 500 |
chunking.overlap integer | Character overlap between chunks Default: 0 |
chunking.strategy Enum | One of: "basic", "by_title" Default: by_title |
database_ids array | List of Notion database IDs to ingest. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration. IDs can be found in database URLs: https://www.notion.so/{DATABASE_ID} |
database_ids.string string | |
datahub DataHubConnectionConfig | DataHub connection configuration. |
datahub.server string | DataHub GMS server URL Default: http://localhost:8080 |
datahub.token One of string, null | DataHub API token for authentication Default: None |
document_mapping DocumentMappingConfig | Document entity mapping configuration. |
document_mapping.id_pattern string | Pattern for generating document IDs Default: {source_type}-{directory}-{basename} |
document_mapping.status Enum | One of: "PUBLISHED", "UNPUBLISHED" Default: PUBLISHED |
document_mapping.id_normalization IdNormalizationConfig | Document ID normalization rules. |
document_mapping.id_normalization.lowercase boolean | Convert to lowercase Default: True |
document_mapping.id_normalization.max_length integer | Maximum ID length Default: 200 |
document_mapping.id_normalization.remove_special_chars boolean | Remove special characters except _ and - Default: True |
document_mapping.id_normalization.replace_spaces_with string | Replace spaces with this character Default: - |
document_mapping.source SourceConfig | Document source configuration. |
document_mapping.source.include_external_id boolean | Include external ID in DocumentSource Default: True |
document_mapping.source.include_external_url boolean | Include external URL in DocumentSource Default: True |
document_mapping.source.type Enum | One of: "NATIVE", "EXTERNAL" Default: EXTERNAL |
document_mapping.title TitleExtractionConfig | Title extraction configuration. |
document_mapping.title.extract_from_content boolean | Try to extract title from document content Default: True |
document_mapping.title.fallback_to_filename boolean | Use filename as title if not found in content Default: True |
document_mapping.title.max_length integer | Maximum title length Default: 500 |
embedding EmbeddingConfig | Embedding generation configuration. Default behavior: Fetches configuration from DataHub server automatically. Override behavior: Validates local config against server when explicitly set. |
embedding.allow_local_embedding_config boolean | BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search. Default: False |
embedding.api_key One of string, null | API key for Cohere (not needed for Bedrock with IAM roles) Default: None |
embedding.aws_region One of string, null | AWS region for Bedrock. If not set, loads from server. Default: None |
embedding.batch_size integer | Batch size for embedding API calls Default: 25 |
embedding.input_type One of string, null | Input type for Cohere embeddings Default: search_document |
embedding.model One of string, null | Model name. If not set, loads from server. Default: None |
embedding.model_embedding_key One of string, null | Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server. Default: None |
embedding.provider One of Enum, null | Embedding provider (bedrock uses AWS, cohere uses API key). If not set, loads from server. Default: None |
filtering FilteringConfig | File filtering configuration. |
filtering.max_file_size One of integer, null | Maximum file size in bytes Default: None |
filtering.min_file_size One of integer, null | Minimum file size in bytes Default: None |
filtering.min_text_length integer | Minimum text length in characters Default: 50 |
filtering.modified_after One of string, null | Only files modified after this date (ISO format) Default: None |
filtering.modified_before One of string, null | Only files modified before this date (ISO format) Default: None |
filtering.skip_empty_documents boolean | Skip documents with no text content Default: True |
filtering.exclude_patterns array | Glob patterns to exclude |
filtering.exclude_patterns.string string | |
filtering.include_patterns array | Glob patterns to include |
filtering.include_patterns.string string | |
hierarchy HierarchyConfig | Hierarchy configuration. |
hierarchy.enabled boolean | Enable parent-child relationships Default: True |
hierarchy.parent_strategy Enum | One of: "folder", "none", "custom", "notion" Default: folder |
hierarchy.custom_mapping One of CustomMappingConfig, null | Custom mapping configuration Default: None |
hierarchy.custom_mapping.rules array | Custom parent mapping rules |
hierarchy.custom_mapping.rules.CustomParentRule CustomParentRule | Custom parent mapping rule. |
hierarchy.custom_mapping.rules.CustomParentRule.parent_id ❓ string | Parent document ID for matching files |
hierarchy.custom_mapping.rules.CustomParentRule.pattern ❓ string | Glob pattern to match file paths |
hierarchy.folder_mapping FolderMappingConfig | Folder hierarchy mapping configuration. |
hierarchy.folder_mapping.create_parent_docs boolean | Create Document entities for folders Default: True |
hierarchy.folder_mapping.max_depth integer | Maximum hierarchy depth Default: 10 |
hierarchy.folder_mapping.parent_id_pattern string | Pattern for parent document IDs Default: {source_type}-{directory} |
hierarchy.folder_mapping.root_parent One of string, null | Optional root document URN Default: None |
page_ids array | List of Notion page IDs to ingest. IDs can be found in page URLs: https://www.notion.so/Page-Title-{PAGE_ID}. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration. |
page_ids.string string | |
processing ProcessingConfig | Processing configuration (partitioning only, no chunking). |
processing.parallelism ParallelismConfig | Parallelism configuration. |
processing.parallelism.disable_parallelism boolean | Disable all parallelism Default: False |
processing.parallelism.max_connections integer | Max concurrent connections for async operations Default: 10 |
processing.parallelism.num_processes integer | Number of worker processes Default: 2 |
processing.partition PartitionConfig | Unstructured partitioning configuration. |
processing.partition.additional_args object | Additional partition arguments |
processing.partition.api_key One of string, null | Unstructured API key Default: None |
processing.partition.partition_by_api boolean | Use Unstructured API for partitioning Default: False |
processing.partition.split_pdf_concurrency_level integer | Number of parallel requests for PDF pages Default: 5 |
processing.partition.split_pdf_page boolean | Enable page-level splitting for large PDFs Default: False |
processing.partition.strategy Enum | One of: "auto", "hi_res", "fast", "ocr_only" Default: auto |
processing.partition.ocr_languages array | Languages for OCR Default: ['eng'] |
processing.partition.ocr_languages.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Stateful Ingestion Config Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AdvancedConfig": {
"additionalProperties": false,
"description": "Advanced configuration options.",
"properties": {
"work_dir": {
"default": "/tmp/unstructured_datahub",
"title": "Work Dir",
"type": "string"
},
"preserve_outputs": {
"default": false,
"title": "Preserve Outputs",
"type": "boolean"
},
"output_format": {
"default": "json",
"enum": [
"json",
"xml"
],
"title": "Output Format",
"type": "string"
},
"raise_on_error": {
"default": false,
"title": "Raise On Error",
"type": "boolean"
},
"max_errors": {
"default": 10,
"title": "Max Errors",
"type": "integer"
},
"continue_on_failure": {
"default": true,
"title": "Continue On Failure",
"type": "boolean"
},
"retry": {
"$ref": "#/$defs/RetryConfig"
},
"cache": {
"$ref": "#/$defs/CacheConfig"
}
},
"title": "AdvancedConfig",
"type": "object"
},
"CacheConfig": {
"additionalProperties": false,
"description": "Cache configuration.",
"properties": {
"enabled": {
"default": true,
"title": "Enabled",
"type": "boolean"
},
"cache_dir": {
"default": "~/.cache/unstructured_datahub",
"title": "Cache Dir",
"type": "string"
},
"ttl": {
"default": 86400,
"description": "Cache TTL in seconds",
"title": "Ttl",
"type": "integer"
}
},
"title": "CacheConfig",
"type": "object"
},
"ChunkingConfig": {
"additionalProperties": false,
"description": "Chunking strategy configuration.",
"properties": {
"strategy": {
"default": "by_title",
"description": "Chunking strategy to use",
"enum": [
"basic",
"by_title"
],
"title": "Strategy",
"type": "string"
},
"max_characters": {
"default": 500,
"description": "Maximum characters per chunk",
"title": "Max Characters",
"type": "integer"
},
"overlap": {
"default": 0,
"description": "Character overlap between chunks",
"title": "Overlap",
"type": "integer"
},
"combine_text_under_n_chars": {
"default": 100,
"description": "Combine chunks smaller than this size",
"title": "Combine Text Under N Chars",
"type": "integer"
}
},
"title": "ChunkingConfig",
"type": "object"
},
"CustomMappingConfig": {
"additionalProperties": false,
"description": "Custom parent mapping configuration.",
"properties": {
"rules": {
"description": "Custom parent mapping rules",
"items": {
"$ref": "#/$defs/CustomParentRule"
},
"title": "Rules",
"type": "array"
}
},
"title": "CustomMappingConfig",
"type": "object"
},
"CustomParentRule": {
"additionalProperties": false,
"description": "Custom parent mapping rule.",
"properties": {
"pattern": {
"description": "Glob pattern to match file paths",
"title": "Pattern",
"type": "string"
},
"parent_id": {
"description": "Parent document ID for matching files",
"title": "Parent Id",
"type": "string"
}
},
"required": [
"pattern",
"parent_id"
],
"title": "CustomParentRule",
"type": "object"
},
"DataHubConnectionConfig": {
"additionalProperties": false,
"description": "DataHub connection configuration.",
"properties": {
"server": {
"default": "http://localhost:8080",
"description": "DataHub GMS server URL",
"title": "Server",
"type": "string"
},
"token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "DataHub API token for authentication",
"title": "Token"
}
},
"title": "DataHubConnectionConfig",
"type": "object"
},
"DocumentMappingConfig": {
"additionalProperties": false,
"description": "Document entity mapping configuration.",
"properties": {
"id_pattern": {
"default": "{source_type}-{directory}-{basename}",
"description": "Pattern for generating document IDs",
"title": "Id Pattern",
"type": "string"
},
"id_normalization": {
"$ref": "#/$defs/IdNormalizationConfig",
"description": "ID normalization rules"
},
"title": {
"$ref": "#/$defs/TitleExtractionConfig",
"description": "Title extraction configuration"
},
"source": {
"$ref": "#/$defs/SourceConfig",
"description": "Source configuration"
},
"status": {
"default": "PUBLISHED",
"description": "Default publication status",
"enum": [
"PUBLISHED",
"UNPUBLISHED"
],
"title": "Status",
"type": "string"
}
},
"title": "DocumentMappingConfig",
"type": "object"
},
"EmbeddingConfig": {
"additionalProperties": false,
"description": "Embedding generation configuration.\n\nDefault behavior: Fetches configuration from DataHub server automatically.\nOverride behavior: Validates local config against server when explicitly set.",
"properties": {
"provider": {
"anyOf": [
{
"enum": [
"bedrock",
"cohere"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Embedding provider (bedrock uses AWS, cohere uses API key). If not set, loads from server.",
"title": "Provider"
},
"model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Model name. If not set, loads from server.",
"title": "Model"
},
"model_embedding_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.",
"title": "Model Embedding Key"
},
"aws_region": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "AWS region for Bedrock. If not set, loads from server.",
"title": "Aws Region"
},
"api_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "API key for Cohere (not needed for Bedrock with IAM roles)",
"title": "Api Key"
},
"batch_size": {
"default": 25,
"description": "Batch size for embedding API calls",
"title": "Batch Size",
"type": "integer"
},
"input_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "search_document",
"description": "Input type for Cohere embeddings",
"title": "Input Type"
},
"allow_local_embedding_config": {
"default": false,
"description": "BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.",
"title": "Allow Local Embedding Config",
"type": "boolean"
}
},
"title": "EmbeddingConfig",
"type": "object"
},
"FilteringConfig": {
"additionalProperties": false,
"description": "File filtering configuration.",
"properties": {
"include_patterns": {
"description": "Glob patterns to include",
"items": {
"type": "string"
},
"title": "Include Patterns",
"type": "array"
},
"exclude_patterns": {
"description": "Glob patterns to exclude",
"items": {
"type": "string"
},
"title": "Exclude Patterns",
"type": "array"
},
"min_file_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Minimum file size in bytes",
"title": "Min File Size"
},
"max_file_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Maximum file size in bytes",
"title": "Max File Size"
},
"modified_after": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Only files modified after this date (ISO format)",
"title": "Modified After"
},
"modified_before": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Only files modified before this date (ISO format)",
"title": "Modified Before"
},
"skip_empty_documents": {
"default": true,
"description": "Skip documents with no text content",
"title": "Skip Empty Documents",
"type": "boolean"
},
"min_text_length": {
"default": 50,
"description": "Minimum text length in characters",
"title": "Min Text Length",
"type": "integer"
}
},
"title": "FilteringConfig",
"type": "object"
},
"FolderMappingConfig": {
"additionalProperties": false,
"description": "Folder hierarchy mapping configuration.",
"properties": {
"create_parent_docs": {
"default": true,
"description": "Create Document entities for folders",
"title": "Create Parent Docs",
"type": "boolean"
},
"parent_id_pattern": {
"default": "{source_type}-{directory}",
"description": "Pattern for parent document IDs",
"title": "Parent Id Pattern",
"type": "string"
},
"max_depth": {
"default": 10,
"description": "Maximum hierarchy depth",
"maximum": 50,
"minimum": 1,
"title": "Max Depth",
"type": "integer"
},
"root_parent": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Optional root document URN",
"title": "Root Parent"
}
},
"title": "FolderMappingConfig",
"type": "object"
},
"HierarchyConfig": {
"additionalProperties": false,
"description": "Hierarchy configuration.",
"properties": {
"enabled": {
"default": true,
"description": "Enable parent-child relationships",
"title": "Enabled",
"type": "boolean"
},
"parent_strategy": {
"default": "folder",
"description": "Parent document creation strategy. 'notion' extracts parent from Notion API metadata",
"enum": [
"folder",
"none",
"custom",
"notion"
],
"title": "Parent Strategy",
"type": "string"
},
"folder_mapping": {
"$ref": "#/$defs/FolderMappingConfig",
"description": "Folder mapping configuration"
},
"custom_mapping": {
"anyOf": [
{
"$ref": "#/$defs/CustomMappingConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Custom mapping configuration"
}
},
"title": "HierarchyConfig",
"type": "object"
},
"IdNormalizationConfig": {
"additionalProperties": false,
"description": "Document ID normalization rules.",
"properties": {
"lowercase": {
"default": true,
"description": "Convert to lowercase",
"title": "Lowercase",
"type": "boolean"
},
"replace_spaces_with": {
"default": "-",
"description": "Replace spaces with this character",
"title": "Replace Spaces With",
"type": "string"
},
"remove_special_chars": {
"default": true,
"description": "Remove special characters except _ and -",
"title": "Remove Special Chars",
"type": "boolean"
},
"max_length": {
"default": 200,
"description": "Maximum ID length",
"title": "Max Length",
"type": "integer"
}
},
"title": "IdNormalizationConfig",
"type": "object"
},
"ParallelismConfig": {
"additionalProperties": false,
"description": "Parallelism configuration.",
"properties": {
"num_processes": {
"default": 2,
"description": "Number of worker processes",
"maximum": 32,
"minimum": 1,
"title": "Num Processes",
"type": "integer"
},
"disable_parallelism": {
"default": false,
"description": "Disable all parallelism",
"title": "Disable Parallelism",
"type": "boolean"
},
"max_connections": {
"default": 10,
"description": "Max concurrent connections for async operations",
"title": "Max Connections",
"type": "integer"
}
},
"title": "ParallelismConfig",
"type": "object"
},
"PartitionConfig": {
"additionalProperties": false,
"description": "Unstructured partitioning configuration.",
"properties": {
"strategy": {
"default": "auto",
"description": "Partitioning strategy",
"enum": [
"auto",
"hi_res",
"fast",
"ocr_only"
],
"title": "Strategy",
"type": "string"
},
"partition_by_api": {
"default": false,
"description": "Use Unstructured API for partitioning",
"title": "Partition By Api",
"type": "boolean"
},
"api_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Unstructured API key",
"title": "Api Key"
},
"split_pdf_page": {
"default": false,
"description": "Enable page-level splitting for large PDFs",
"title": "Split Pdf Page",
"type": "boolean"
},
"split_pdf_concurrency_level": {
"default": 5,
"description": "Number of parallel requests for PDF pages",
"title": "Split Pdf Concurrency Level",
"type": "integer"
},
"ocr_languages": {
"default": [
"eng"
],
"description": "Languages for OCR",
"items": {
"type": "string"
},
"title": "Ocr Languages",
"type": "array"
},
"additional_args": {
"additionalProperties": true,
"description": "Additional partition arguments",
"title": "Additional Args",
"type": "object"
}
},
"title": "PartitionConfig",
"type": "object"
},
"ProcessingConfig": {
"additionalProperties": false,
"description": "Processing configuration (partitioning only, no chunking).",
"properties": {
"partition": {
"$ref": "#/$defs/PartitionConfig",
"description": "Partition configuration"
},
"parallelism": {
"$ref": "#/$defs/ParallelismConfig",
"description": "Parallelism configuration"
}
},
"title": "ProcessingConfig",
"type": "object"
},
"RetryConfig": {
"additionalProperties": false,
"description": "Retry configuration.",
"properties": {
"enabled": {
"default": true,
"title": "Enabled",
"type": "boolean"
},
"max_attempts": {
"default": 3,
"title": "Max Attempts",
"type": "integer"
},
"backoff_factor": {
"default": 2,
"title": "Backoff Factor",
"type": "integer"
},
"retry_on_timeout": {
"default": true,
"title": "Retry On Timeout",
"type": "boolean"
}
},
"title": "RetryConfig",
"type": "object"
},
"SourceConfig": {
"additionalProperties": false,
"description": "Document source configuration.",
"properties": {
"type": {
"default": "EXTERNAL",
"description": "Source type (always EXTERNAL for ingested docs)",
"enum": [
"NATIVE",
"EXTERNAL"
],
"title": "Type",
"type": "string"
},
"include_external_url": {
"default": true,
"description": "Include external URL in DocumentSource",
"title": "Include External Url",
"type": "boolean"
},
"include_external_id": {
"default": true,
"description": "Include external ID in DocumentSource",
"title": "Include External Id",
"type": "boolean"
}
},
"title": "SourceConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "When stateful_ingestion is enabled, soft-deletes entities that were present in the last successful run but are missing from the current run.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Guards against accidental changes to the source configuration: if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold', the soft deletes are skipped and the state is not committed.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
},
"TitleExtractionConfig": {
"additionalProperties": false,
"description": "Title extraction configuration.",
"properties": {
"extract_from_content": {
"default": true,
"description": "Try to extract title from document content",
"title": "Extract From Content",
"type": "boolean"
},
"fallback_to_filename": {
"default": true,
"description": "Use filename as title if not found in content",
"title": "Fallback To Filename",
"type": "boolean"
},
"max_length": {
"default": 500,
"description": "Maximum title length",
"title": "Max Length",
"type": "integer"
}
},
"title": "TitleExtractionConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Notion ingestion configuration.\n\nThis source extracts documents from Notion pages and databases\nusing the Notion API and Unstructured.io text extraction.",
"properties": {
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful Ingestion Config"
},
"api_key": {
"description": "Notion internal integration token. Create one at https://www.notion.so/my-integrations",
"format": "password",
"title": "Api Key",
"type": "string",
"writeOnly": true
},
"page_ids": {
"description": "List of Notion page IDs to ingest. IDs can be found in page URLs: https://www.notion.so/Page-Title-{PAGE_ID}. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration.",
"items": {
"type": "string"
},
"title": "Page Ids",
"type": "array"
},
"database_ids": {
"description": "List of Notion database IDs to ingest. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration. IDs can be found in database URLs: https://www.notion.so/{DATABASE_ID}",
"items": {
"type": "string"
},
"title": "Database Ids",
"type": "array"
},
"recursive": {
"default": true,
"description": "Recursively fetch child pages. When true, ingests all descendant pages of specified pages/databases.",
"title": "Recursive",
"type": "boolean"
},
"processing": {
"$ref": "#/$defs/ProcessingConfig",
"description": "Text extraction and partitioning configuration"
},
"document_mapping": {
"$ref": "#/$defs/DocumentMappingConfig",
"description": "Document entity mapping configuration (ID generation, title extraction)"
},
"hierarchy": {
"$ref": "#/$defs/HierarchyConfig",
"description": "Parent-child relationship configuration"
},
"filtering": {
"$ref": "#/$defs/FilteringConfig",
"description": "Document filtering configuration"
},
"datahub": {
"$ref": "#/$defs/DataHubConnectionConfig",
"description": "DataHub connection configuration (for querying server-side embedding config)"
},
"chunking": {
"$ref": "#/$defs/ChunkingConfig",
"description": "Chunking strategy configuration (for embeddings)"
},
"embedding": {
"$ref": "#/$defs/EmbeddingConfig",
"description": "Embedding generation configuration (LiteLLM with Cohere/Bedrock)"
},
"advanced": {
"$ref": "#/$defs/AdvancedConfig",
"description": "Advanced configuration options (work directory, error handling)"
}
},
"required": [
"api_key"
],
"title": "NotionSourceConfig",
"type": "object"
}
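The schema above can be hard to read as a whole, so here is a minimal sketch of how a few of its constraints might be checked before running a recipe. It covers only the required `api_key`, the string-array shape of `page_ids` and `database_ids`, and the 0 to 100 range of `fail_safe_threshold`. The helper names `validate_notion_config` and `discovers_everything` are hypothetical and are not part of DataHub; the token and page ID are placeholders.

```python
from typing import Any


def validate_notion_config(config: dict[str, Any]) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems: list[str] = []

    # "api_key" is the only entry in the schema's "required" list.
    if not isinstance(config.get("api_key"), str):
        problems.append("api_key is required and must be a string")

    # page_ids and database_ids are arrays of strings.
    for key in ("page_ids", "database_ids"):
        value = config.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{key} must be a list of strings")

    # fail_safe_threshold has minimum 0.0 and maximum 100.0 (default 75.0).
    stateful = config.get("stateful_ingestion") or {}
    threshold = stateful.get("fail_safe_threshold", 75.0)
    if not 0.0 <= threshold <= 100.0:
        problems.append("fail_safe_threshold must be between 0 and 100")

    return problems


def discovers_everything(config: dict[str, Any]) -> bool:
    """True when both ID lists are empty, so all accessible content is ingested."""
    return not config.get("page_ids") and not config.get("database_ids")


example_config = {
    "api_key": "<notion-integration-token>",  # placeholder, not a real token
    "page_ids": ["<page-id>"],  # placeholder ID
    "recursive": True,
    "stateful_ingestion": {"enabled": True, "fail_safe_threshold": 75.0},
}

print(validate_notion_config(example_config))  # prints []
print(discovers_everything(example_config))  # prints False
```

This sketch deliberately ignores `additionalProperties: false` and the nested `$defs`; a full JSON Schema validator would be the better choice for checking a complete configuration.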
Code Coordinates
- Class Name: datahub.ingestion.source.notion.notion_source.NotionSource (Browse on GitHub)
Questions
If you've got any questions on configuring ingestion for Notion, feel free to ping us on our Slack.