Notion

Overview

Notion is a documentation and collaboration platform. Learn more in the official Notion documentation.

The DataHub integration for Notion covers document/workspace entities and hierarchy context for knowledge assets. Depending on module capabilities, it can also capture features such as lineage, usage, profiling, ownership, tags, and stateful deletion detection.

Ingest pages and databases from Notion workspaces as DataHub Document entities with optional semantic embeddings.

Not Supported with Remote Executor

This source is not supported with the Remote Executor in DataHub Cloud. It must be run using a self-hosted ingestion setup.

Concept Mapping

A Notion-specific concept mapping is still pending; the table below shows the generic DataHub concept mapping.

Source Concept	DataHub Concept	Notes
Platform/account/project scope	Platform Instance, Container	Organizes assets within the platform context.
Core technical asset (for example table/view/topic/file)	Dataset	Primary ingested technical asset.
Schema fields / columns	SchemaField	Included when schema extraction is supported.
Ownership and collaboration principals	CorpUser, CorpGroup	Emitted by modules that support ownership and identity metadata.
Dependencies and processing relationships	Lineage edges	Available when lineage extraction is supported and enabled.

Module `notion`

Important Capabilities

Capability	Status	Notes
Detect Deleted Entities	✅	Enabled by default via stateful ingestion.
Test Connection	✅	Enabled by default.

Overview

This source is available as a private beta feature on DataHub Cloud. Note that running the connector using the Remote Executor is not yet supported.

The Notion source ingests pages and databases from Notion workspaces as DataHub Document entities, with optional embeddings that enable semantic search.

Key Features

1. Content Extraction

Page Content: Full text extraction from Notion pages including all supported block types
Database Rows: Ingests database entries as individual documents
Hierarchical Structure: Maintains parent-child relationships between pages
Metadata Extraction: Captures creation/modification timestamps, authors, and custom properties

2. Hierarchical Relationships

Parent-Child Links: Preserves Notion's page hierarchy in DataHub
Automatic Discovery: Recursively discovers nested pages starting from root pages
Flexible Navigation: Browse documentation structure in DataHub UI
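
As a rough illustration of the hierarchy features above, the fragment below starts from a single root page and lets the source discover its descendants. It is a sketch built only from fields listed in the configuration table on this page; the page ID is a placeholder, and setting parent_strategy to notion explicitly is an assumption (the schema describes that value as extracting the parent from Notion API metadata, but whether the source selects it automatically is not stated here).

```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    page_ids:
      - "your-root-page-id"      # placeholder: ID of a top-level page shared with the integration
    recursive: true              # schema default; ingests descendant pages of the listed pages
    hierarchy:
      enabled: true              # schema default; emits parent-child relationships
      parent_strategy: notion    # assumption: set explicitly here for illustration
```

Setting recursive to false should limit ingestion to the listed pages themselves.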

3. Embedding Generation

Optional semantic search support:

Supported providers: Cohere (API key), AWS Bedrock (IAM roles)
Chunking strategies: by_title, basic
Configurable chunk size: Set the maximum chunk size (in characters) to suit your embedding model
Automatic deduplication: Prevents duplicate chunk embeddings
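
The chunking options above correspond to the chunking block of the recipe. The following is a minimal sketch using the field names from the configuration table; the numeric values are illustrative choices, not recommendations from the source.

```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    chunking:
      strategy: by_title               # or "basic"
      max_characters: 1000             # illustrative; schema default is 500
      overlap: 50                      # illustrative; schema default is 0 (no overlap)
      combine_text_under_n_chars: 100  # schema default; merges very small chunks
```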

4. Stateful Ingestion

Supports smart incremental updates via stateful ingestion:

Content Change Detection: Only reprocesses documents when content or embeddings config changes
Deletion Detection: Automatically removes stale entities from DataHub
Recursive Discovery: Starts from root pages or databases and automatically discovers and ingests child pages
State Persistence: Maintains processing state between runs to skip unchanged documents
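
A recipe that turns on the stateful behavior described above might look like the sketch below. The pipeline_name value is a hypothetical placeholder; the configuration table notes that stateful ingestion defaults to enabled only when a pipeline_name is set and a datahub-rest sink (or datahub_api) is configured, so both are included here.

```yaml
pipeline_name: notion_docs_ingestion   # hypothetical name; keep it stable across runs so state can be matched
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true      # schema default; soft-deletes entities missing from the current run
      fail_safe_threshold: 75.0        # schema default
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```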

Prerequisites

1. Notion Integration

Create a Notion internal integration:

Go to https://www.notion.so/my-integrations
Click "+ New integration"
Give it a name (e.g., "DataHub Integration")
Select the workspace
Copy the Internal Integration Token (starts with secret_; tokens issued for newer integrations may start with ntn_ instead)

2. Share Pages with the Integration

The integration can only access pages explicitly shared with it:

Open the page or database in Notion
Click "Share" in the top right
Search for your integration name
Click "Invite"

Important: For recursive ingestion, you only need to share the top-level pages. Child pages inherit access automatically.

3. Embedding Provider (Optional)

If you want semantic search capabilities, set up one of these providers:

Cohere

Sign up at https://cohere.ai/
Create an API key
Supports: embed-english-v3.0, embed-multilingual-v3.0

AWS Bedrock

AWS account with Bedrock access
Enable Cohere models in AWS Console → Bedrock → Model access
IAM permissions for bedrock:InvokeModel
Recommended region: us-west-2

See Semantic Search Configuration for detailed embedding setup.
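
For orientation, an explicit embedding override in the recipe could look like the sketch below. Per the configuration table, the source fetches embedding settings from the DataHub server by default and validates any local override against it, so this block may be unnecessary in many setups. The COHERE_API_KEY variable name is a placeholder, and the model_embedding_key value is the example given in the field description rather than a confirmed setting for your server.

```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    embedding:
      provider: cohere
      model: embed-english-v3.0
      model_embedding_key: cohere_embed_v3   # example key from the field description; must match the server config
      api_key: "${COHERE_API_KEY}"           # placeholder env var; not needed for Bedrock with IAM roles
      # Bedrock alternative (model name omitted because it is not given on this page):
      # provider: bedrock
      # aws_region: us-west-2
```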

Install the Plugin

pip install 'acryl-datahub[notion]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: notion
  config:
    # Notion API token from your integration
    api_key: "${NOTION_API_KEY}"

    # Ingest specific pages (get IDs from page URLs)
    page_ids:
      - "your-page-id-here"

    # Or ingest all accessible content (leave page_ids and database_ids empty)
    # page_ids: []
    # database_ids: []

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
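
As a variation on the starter recipe, the sketch below targets a database rather than a page, turns off recursion, and lowers the document cap. The file name and database ID are placeholders, and the chosen limit is illustrative.

```yaml
# notion_recipe.yml (hypothetical file name)
# Run with: datahub ingest -c notion_recipe.yml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    database_ids:
      - "your-database-id-here"   # placeholder; taken from the database URL
    recursive: false              # ingest only the listed database, not descendant pages
    max_documents: 500            # illustrative; the run fails with an error once this limit is reached

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```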

Config Details

Options

Note that a . is used to denote nested fields in the YAML recipe.

Field	Description
api_key ✅ string(password)	Notion internal integration token. Create one at https://www.notion.so/my-integrations
max_documents integer	Maximum number of documents to process per ingestion run. The job will stop and fail with an error once this limit is reached. Set to 0 or -1 to disable the limit. Default: 10000
recursive boolean	Recursively fetch child pages. When true, ingests all descendant pages of specified pages/databases. Default: True
advanced AdvancedConfig	Advanced configuration options.
advanced.continue_on_failure boolean	Default: True
advanced.max_errors integer	Default: 10
advanced.output_format Enum	One of: "json", "xml" Default: json
advanced.preserve_outputs boolean	Default: False
advanced.raise_on_error boolean	Default: False
advanced.work_dir string	Default: /tmp/unstructured_datahub
advanced.cache CacheConfig	Cache configuration.
advanced.cache.cache_dir string	Default: ~/.cache/unstructured_datahub
advanced.cache.enabled boolean	Default: True
advanced.cache.ttl integer	Cache TTL in seconds Default: 86400
advanced.retry RetryConfig	Retry configuration.
advanced.retry.backoff_factor integer	Default: 2
advanced.retry.enabled boolean	Default: True
advanced.retry.max_attempts integer	Default: 3
advanced.retry.retry_on_timeout boolean	Default: True
chunking ChunkingConfig	Chunking strategy configuration.
chunking.combine_text_under_n_chars integer	Combine chunks smaller than this size Default: 100
chunking.max_characters integer	Maximum characters per chunk Default: 500
chunking.overlap integer	Character overlap between chunks Default: 0
chunking.strategy Enum	One of: "basic", "by_title" Default: by_title
database_ids array	List of Notion database IDs to ingest. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration. IDs can be found in database URLs: https://www.notion.so/{DATABASE_ID}
database_ids.string string
datahub DataHubConnectionConfig	DataHub connection configuration. Defaults are loaded in this priority: (1) explicitly configured values in the recipe; (2) environment variables (DATAHUB_GMS_URL, DATAHUB_GMS_TOKEN); (3) the ~/.datahubenv file (created by `datahub init`); (4) hardcoded defaults (http://localhost:8080, no token).
datahub.server string	DataHub GMS server URL (defaults to DATAHUB_GMS_URL env var, ~/.datahubenv, or localhost:8080)
datahub.token One of string(password), null	DataHub API token for authentication (defaults to DATAHUB_GMS_TOKEN env var or ~/.datahubenv)
document_mapping DocumentMappingConfig	Document entity mapping configuration.
document_mapping.id_pattern string	Pattern for generating document IDs Default: {source_type}-{directory}-{basename}
document_mapping.status Enum	One of: "PUBLISHED", "UNPUBLISHED" Default: PUBLISHED
document_mapping.id_normalization IdNormalizationConfig	Document ID normalization rules.
document_mapping.id_normalization.lowercase boolean	Convert to lowercase Default: True
document_mapping.id_normalization.max_length integer	Maximum ID length Default: 200
document_mapping.id_normalization.remove_special_chars boolean	Remove special characters except _ and - Default: True
document_mapping.id_normalization.replace_spaces_with string	Replace spaces with this character Default: -
document_mapping.source SourceConfig	Document source configuration.
document_mapping.source.include_external_id boolean	Include external ID in DocumentSource Default: True
document_mapping.source.include_external_url boolean	Include external URL in DocumentSource Default: True
document_mapping.source.type Enum	One of: "NATIVE", "EXTERNAL" Default: EXTERNAL
document_mapping.title TitleExtractionConfig	Title extraction configuration.
document_mapping.title.extract_from_content boolean	Try to extract title from document content Default: True
document_mapping.title.fallback_to_filename boolean	Use filename as title if not found in content Default: True
document_mapping.title.max_length integer	Maximum title length Default: 500
embedding EmbeddingConfig	Embedding generation configuration. Default behavior: Fetches configuration from DataHub server automatically. Override behavior: Validates local config against server when explicitly set.
embedding.allow_local_embedding_config boolean	BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search. Default: False
embedding.api_key One of string(password), null	API key for Cohere (not needed for Bedrock with IAM roles) Default: None
embedding.aws_region One of string, null	AWS region for Bedrock. If not set, loads from server. Default: None
embedding.batch_size integer	Batch size for embedding API calls Default: 25
embedding.documents_per_minute integer	Maximum number of documents to embed per minute when rate_limit is enabled. Default: 300
embedding.input_type One of string, null	Input type for Cohere embeddings Default: search_document
embedding.model One of string, null	Model name. If not set, loads from server. Default: None
embedding.model_embedding_key One of string, null	Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server. Default: None
embedding.provider One of Enum, null	Embedding provider (bedrock uses AWS, cohere/openai use API key). If not set, loads from server. Default: None
embedding.rate_limit boolean	Enable rate limiting for embedding API calls. Default: True
filtering FilteringConfig	File filtering configuration.
filtering.max_file_size One of integer, null	Maximum file size in bytes Default: None
filtering.min_file_size One of integer, null	Minimum file size in bytes Default: None
filtering.min_text_length integer	Minimum text length in characters Default: 50
filtering.modified_after One of string, null	Only files modified after this date (ISO format) Default: None
filtering.modified_before One of string, null	Only files modified before this date (ISO format) Default: None
filtering.skip_empty_documents boolean	Skip documents with no text content Default: True
filtering.exclude_patterns array	Glob patterns to exclude
filtering.exclude_patterns.string string
filtering.include_patterns array	Glob patterns to include
filtering.include_patterns.string string
hierarchy HierarchyConfig	Hierarchy configuration.
hierarchy.enabled boolean	Enable parent-child relationships Default: True
hierarchy.parent_strategy Enum	One of: "folder", "none", "custom", "notion", "confluence" Default: folder
hierarchy.custom_mapping One of CustomMappingConfig, null	Custom mapping configuration Default: None
hierarchy.custom_mapping.rules array	Custom parent mapping rules
hierarchy.custom_mapping.rules.CustomParentRule CustomParentRule	Custom parent mapping rule.
hierarchy.custom_mapping.rules.CustomParentRule.parent_id ❓ string	Parent document ID for matching files
hierarchy.custom_mapping.rules.CustomParentRule.pattern ❓ string	Glob pattern to match file paths
hierarchy.folder_mapping FolderMappingConfig	Folder hierarchy mapping configuration.
hierarchy.folder_mapping.create_parent_docs boolean	Create Document entities for folders Default: True
hierarchy.folder_mapping.max_depth integer	Maximum hierarchy depth Default: 10
hierarchy.folder_mapping.parent_id_pattern string	Pattern for parent document IDs Default: {source_type}-{directory}
hierarchy.folder_mapping.root_parent One of string, null	Optional root document URN Default: None
page_ids array	List of Notion page IDs to ingest. IDs can be found in page URLs: https://www.notion.so/Page-Title-{PAGE_ID}. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration.
page_ids.string string
processing ProcessingConfig	Processing configuration (partitioning only, no chunking).
processing.parallelism ParallelismConfig	Parallelism configuration.
processing.parallelism.disable_parallelism boolean	Disable all parallelism Default: False
processing.parallelism.max_connections integer	Max concurrent connections for async operations Default: 10
processing.parallelism.num_processes integer	Number of worker processes Default: 2
processing.partition PartitionConfig	Unstructured partitioning configuration.
processing.partition.additional_args object	Additional partition arguments
processing.partition.api_key One of string(password), null	Unstructured API key Default: None
processing.partition.partition_by_api boolean	Use Unstructured API for partitioning Default: False
processing.partition.split_pdf_concurrency_level integer	Number of parallel requests for PDF pages Default: 5
processing.partition.split_pdf_page boolean	Enable page-level splitting for large PDFs Default: False
processing.partition.strategy Enum	One of: "auto", "hi_res", "fast", "ocr_only" Default: auto
processing.partition.ocr_languages array	Languages for OCR Default: ['eng']
processing.partition.ocr_languages.string string
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null	Stateful Ingestion Config Default: None
stateful_ingestion.enabled boolean	Whether to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified; otherwise False
stateful_ingestion.fail_safe_threshold number	Guards against accidental changes to the source configuration: if the relative change (in percent) in entities compared to the previous state exceeds fail_safe_threshold, the soft deletes are not applied and the new state is not committed. Default: 75.0
stateful_ingestion.remove_stale_metadata boolean	Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True
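
To make the dot notation concrete, the fragment below shows how a few dotted field names from the table would be written as nested YAML keys. The values are the schema defaults, except for the token, which is shown as an environment variable reference; treat it as a sketch rather than a recommended configuration.

```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    advanced:
      max_errors: 10               # advanced.max_errors
      retry:
        max_attempts: 3            # advanced.retry.max_attempts
        backoff_factor: 2          # advanced.retry.backoff_factor
    filtering:
      min_text_length: 50          # filtering.min_text_length
      skip_empty_documents: true   # filtering.skip_empty_documents
    datahub:
      server: "http://localhost:8080"   # datahub.server
      token: "${DATAHUB_GMS_TOKEN}"     # datahub.token
```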

The JSONSchema for this configuration is inlined below.

{
  "$defs": {
    "AdvancedConfig": {
      "additionalProperties": false,
      "description": "Advanced configuration options.",
      "properties": {
        "work_dir": {
          "default": "/tmp/unstructured_datahub",
          "title": "Work Dir",
          "type": "string"
        },
        "preserve_outputs": {
          "default": false,
          "title": "Preserve Outputs",
          "type": "boolean"
        },
        "output_format": {
          "default": "json",
          "enum": [
            "json",
            "xml"
          ],
          "title": "Output Format",
          "type": "string"
        },
        "raise_on_error": {
          "default": false,
          "title": "Raise On Error",
          "type": "boolean"
        },
        "max_errors": {
          "default": 10,
          "title": "Max Errors",
          "type": "integer"
        },
        "continue_on_failure": {
          "default": true,
          "title": "Continue On Failure",
          "type": "boolean"
        },
        "retry": {
          "$ref": "#/$defs/RetryConfig"
        },
        "cache": {
          "$ref": "#/$defs/CacheConfig"
        }
      },
      "title": "AdvancedConfig",
      "type": "object"
    },
    "CacheConfig": {
      "additionalProperties": false,
      "description": "Cache configuration.",
      "properties": {
        "enabled": {
          "default": true,
          "title": "Enabled",
          "type": "boolean"
        },
        "cache_dir": {
          "default": "~/.cache/unstructured_datahub",
          "title": "Cache Dir",
          "type": "string"
        },
        "ttl": {
          "default": 86400,
          "description": "Cache TTL in seconds",
          "title": "Ttl",
          "type": "integer"
        }
      },
      "title": "CacheConfig",
      "type": "object"
    },
    "ChunkingConfig": {
      "additionalProperties": false,
      "description": "Chunking strategy configuration.",
      "properties": {
        "strategy": {
          "default": "by_title",
          "description": "Chunking strategy to use",
          "enum": [
            "basic",
            "by_title"
          ],
          "title": "Strategy",
          "type": "string"
        },
        "max_characters": {
          "default": 500,
          "description": "Maximum characters per chunk",
          "title": "Max Characters",
          "type": "integer"
        },
        "overlap": {
          "default": 0,
          "description": "Character overlap between chunks",
          "title": "Overlap",
          "type": "integer"
        },
        "combine_text_under_n_chars": {
          "default": 100,
          "description": "Combine chunks smaller than this size",
          "title": "Combine Text Under N Chars",
          "type": "integer"
        }
      },
      "title": "ChunkingConfig",
      "type": "object"
    },
    "CustomMappingConfig": {
      "additionalProperties": false,
      "description": "Custom parent mapping configuration.",
      "properties": {
        "rules": {
          "description": "Custom parent mapping rules",
          "items": {
            "$ref": "#/$defs/CustomParentRule"
          },
          "title": "Rules",
          "type": "array"
        }
      },
      "title": "CustomMappingConfig",
      "type": "object"
    },
    "CustomParentRule": {
      "additionalProperties": false,
      "description": "Custom parent mapping rule.",
      "properties": {
        "pattern": {
          "description": "Glob pattern to match file paths",
          "title": "Pattern",
          "type": "string"
        },
        "parent_id": {
          "description": "Parent document ID for matching files",
          "title": "Parent Id",
          "type": "string"
        }
      },
      "required": [
        "pattern",
        "parent_id"
      ],
      "title": "CustomParentRule",
      "type": "object"
    },
    "DataHubConnectionConfig": {
      "additionalProperties": false,
      "description": "DataHub connection configuration.\n\nDefaults are loaded in this priority:\n1. Explicitly configured values in the recipe\n2. Environment variables (DATAHUB_GMS_URL, DATAHUB_GMS_TOKEN)\n3. ~/.datahubenv file (created by `datahub init`)\n4. Hardcoded defaults (http://localhost:8080, no token)",
      "properties": {
        "server": {
          "description": "DataHub GMS server URL (defaults to DATAHUB_GMS_URL env var, ~/.datahubenv, or localhost:8080)",
          "title": "Server",
          "type": "string"
        },
        "token": {
          "anyOf": [
            {
              "format": "password",
              "type": "string",
              "writeOnly": true
            },
            {
              "type": "null"
            }
          ],
          "description": "DataHub API token for authentication (defaults to DATAHUB_GMS_TOKEN env var or ~/.datahubenv)",
          "title": "Token"
        }
      },
      "title": "DataHubConnectionConfig",
      "type": "object"
    },
    "DocumentMappingConfig": {
      "additionalProperties": false,
      "description": "Document entity mapping configuration.",
      "properties": {
        "id_pattern": {
          "default": "{source_type}-{directory}-{basename}",
          "description": "Pattern for generating document IDs",
          "title": "Id Pattern",
          "type": "string"
        },
        "id_normalization": {
          "$ref": "#/$defs/IdNormalizationConfig",
          "description": "ID normalization rules"
        },
        "title": {
          "$ref": "#/$defs/TitleExtractionConfig",
          "description": "Title extraction configuration"
        },
        "source": {
          "$ref": "#/$defs/SourceConfig",
          "description": "Source configuration"
        },
        "status": {
          "default": "PUBLISHED",
          "description": "Default publication status",
          "enum": [
            "PUBLISHED",
            "UNPUBLISHED"
          ],
          "title": "Status",
          "type": "string"
        }
      },
      "title": "DocumentMappingConfig",
      "type": "object"
    },
    "EmbeddingConfig": {
      "additionalProperties": false,
      "description": "Embedding generation configuration.\n\nDefault behavior: Fetches configuration from DataHub server automatically.\nOverride behavior: Validates local config against server when explicitly set.",
      "properties": {
        "provider": {
          "anyOf": [
            {
              "enum": [
                "bedrock",
                "cohere",
                "openai"
              ],
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Embedding provider (bedrock uses AWS, cohere/openai use API key). If not set, loads from server.",
          "title": "Provider"
        },
        "model": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Model name. If not set, loads from server.",
          "title": "Model"
        },
        "model_embedding_key": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.",
          "title": "Model Embedding Key"
        },
        "aws_region": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "AWS region for Bedrock. If not set, loads from server.",
          "title": "Aws Region"
        },
        "api_key": {
          "anyOf": [
            {
              "format": "password",
              "type": "string",
              "writeOnly": true
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "API key for Cohere (not needed for Bedrock with IAM roles)",
          "title": "Api Key"
        },
        "batch_size": {
          "default": 25,
          "description": "Batch size for embedding API calls",
          "title": "Batch Size",
          "type": "integer"
        },
        "input_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": "search_document",
          "description": "Input type for Cohere embeddings",
          "title": "Input Type"
        },
        "rate_limit": {
          "default": true,
          "description": "Enable rate limiting for embedding API calls.",
          "title": "Rate Limit",
          "type": "boolean"
        },
        "documents_per_minute": {
          "default": 300,
          "description": "Maximum number of documents to embed per minute when rate_limit is enabled.",
          "exclusiveMinimum": 0,
          "title": "Documents Per Minute",
          "type": "integer"
        },
        "allow_local_embedding_config": {
          "default": false,
          "description": "BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.",
          "title": "Allow Local Embedding Config",
          "type": "boolean"
        }
      },
      "title": "EmbeddingConfig",
      "type": "object"
    },
    "FilteringConfig": {
      "additionalProperties": false,
      "description": "File filtering configuration.",
      "properties": {
        "include_patterns": {
          "description": "Glob patterns to include",
          "items": {
            "type": "string"
          },
          "title": "Include Patterns",
          "type": "array"
        },
        "exclude_patterns": {
          "description": "Glob patterns to exclude",
          "items": {
            "type": "string"
          },
          "title": "Exclude Patterns",
          "type": "array"
        },
        "min_file_size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Minimum file size in bytes",
          "title": "Min File Size"
        },
        "max_file_size": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Maximum file size in bytes",
          "title": "Max File Size"
        },
        "modified_after": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Only files modified after this date (ISO format)",
          "title": "Modified After"
        },
        "modified_before": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Only files modified before this date (ISO format)",
          "title": "Modified Before"
        },
        "skip_empty_documents": {
          "default": true,
          "description": "Skip documents with no text content",
          "title": "Skip Empty Documents",
          "type": "boolean"
        },
        "min_text_length": {
          "default": 50,
          "description": "Minimum text length in characters",
          "title": "Min Text Length",
          "type": "integer"
        }
      },
      "title": "FilteringConfig",
      "type": "object"
    },
    "FolderMappingConfig": {
      "additionalProperties": false,
      "description": "Folder hierarchy mapping configuration.",
      "properties": {
        "create_parent_docs": {
          "default": true,
          "description": "Create Document entities for folders",
          "title": "Create Parent Docs",
          "type": "boolean"
        },
        "parent_id_pattern": {
          "default": "{source_type}-{directory}",
          "description": "Pattern for parent document IDs",
          "title": "Parent Id Pattern",
          "type": "string"
        },
        "max_depth": {
          "default": 10,
          "description": "Maximum hierarchy depth",
          "maximum": 50,
          "minimum": 1,
          "title": "Max Depth",
          "type": "integer"
        },
        "root_parent": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Optional root document URN",
          "title": "Root Parent"
        }
      },
      "title": "FolderMappingConfig",
      "type": "object"
    },
    "HierarchyConfig": {
      "additionalProperties": false,
      "description": "Hierarchy configuration.",
      "properties": {
        "enabled": {
          "default": true,
          "description": "Enable parent-child relationships",
          "title": "Enabled",
          "type": "boolean"
        },
        "parent_strategy": {
          "default": "folder",
          "description": "Parent document creation strategy. 'notion' extracts parent from Notion API metadata. 'confluence' extracts parent from Confluence page ancestors.",
          "enum": [
            "folder",
            "none",
            "custom",
            "notion",
            "confluence"
          ],
          "title": "Parent Strategy",
          "type": "string"
        },
        "folder_mapping": {
          "$ref": "#/$defs/FolderMappingConfig",
          "description": "Folder mapping configuration"
        },
        "custom_mapping": {
          "anyOf": [
            {
              "$ref": "#/$defs/CustomMappingConfig"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Custom mapping configuration"
        }
      },
      "title": "HierarchyConfig",
      "type": "object"
    },
    "IdNormalizationConfig": {
      "additionalProperties": false,
      "description": "Document ID normalization rules.",
      "properties": {
        "lowercase": {
          "default": true,
          "description": "Convert to lowercase",
          "title": "Lowercase",
          "type": "boolean"
        },
        "replace_spaces_with": {
          "default": "-",
          "description": "Replace spaces with this character",
          "title": "Replace Spaces With",
          "type": "string"
        },
        "remove_special_chars": {
          "default": true,
          "description": "Remove special characters except _ and -",
          "title": "Remove Special Chars",
          "type": "boolean"
        },
        "max_length": {
          "default": 200,
          "description": "Maximum ID length",
          "title": "Max Length",
          "type": "integer"
        }
      },
      "title": "IdNormalizationConfig",
      "type": "object"
    },
    "ParallelismConfig": {
      "additionalProperties": false,
      "description": "Parallelism configuration.",
      "properties": {
        "num_processes": {
          "default": 2,
          "description": "Number of worker processes",
          "maximum": 32,
          "minimum": 1,
          "title": "Num Processes",
          "type": "integer"
        },
        "disable_parallelism": {
          "default": false,
          "description": "Disable all parallelism",
          "title": "Disable Parallelism",
          "type": "boolean"
        },
        "max_connections": {
          "default": 10,
          "description": "Max concurrent connections for async operations",
          "title": "Max Connections",
          "type": "integer"
        }
      },
      "title": "ParallelismConfig",
      "type": "object"
    },
    "PartitionConfig": {
      "additionalProperties": false,
      "description": "Unstructured partitioning configuration.",
      "properties": {
        "strategy": {
          "default": "auto",
          "description": "Partitioning strategy",
          "enum": [
            "auto",
            "hi_res",
            "fast",
            "ocr_only"
          ],
          "title": "Strategy",
          "type": "string"
        },
        "partition_by_api": {
          "default": false,
          "description": "Use Unstructured API for partitioning",
          "title": "Partition By Api",
          "type": "boolean"
        },
        "api_key": {
          "anyOf": [
            {
              "format": "password",
              "type": "string",
              "writeOnly": true
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Unstructured API key",
          "title": "Api Key"
        },
        "split_pdf_page": {
          "default": false,
          "description": "Enable page-level splitting for large PDFs",
          "title": "Split Pdf Page",
          "type": "boolean"
        },
        "split_pdf_concurrency_level": {
          "default": 5,
          "description": "Number of parallel requests for PDF pages",
          "title": "Split Pdf Concurrency Level",
          "type": "integer"
        },
        "ocr_languages": {
          "default": [
            "eng"
          ],
          "description": "Languages for OCR",
          "items": {
            "type": "string"
          },
          "title": "Ocr Languages",
          "type": "array"
        },
        "additional_args": {
          "additionalProperties": true,
          "description": "Additional partition arguments",
          "title": "Additional Args",
          "type": "object"
        }
      },
      "title": "PartitionConfig",
      "type": "object"
    },
    "ProcessingConfig": {
      "additionalProperties": false,
      "description": "Processing configuration (partitioning only, no chunking).",
      "properties": {
        "partition": {
          "$ref": "#/$defs/PartitionConfig",
          "description": "Partition configuration"
        },
        "parallelism": {
          "$ref": "#/$defs/ParallelismConfig",
          "description": "Parallelism configuration"
        }
      },
      "title": "ProcessingConfig",
      "type": "object"
    },
    "RetryConfig": {
      "additionalProperties": false,
      "description": "Retry configuration.",
      "properties": {
        "enabled": {
          "default": true,
          "title": "Enabled",
          "type": "boolean"
        },
        "max_attempts": {
          "default": 3,
          "title": "Max Attempts",
          "type": "integer"
        },
        "backoff_factor": {
          "default": 2,
          "title": "Backoff Factor",
          "type": "integer"
        },
        "retry_on_timeout": {
          "default": true,
          "title": "Retry On Timeout",
          "type": "boolean"
        }
      },
      "title": "RetryConfig",
      "type": "object"
    },
    "SourceConfig": {
      "additionalProperties": false,
      "description": "Document source configuration.",
      "properties": {
        "type": {
          "default": "EXTERNAL",
          "description": "Source type (always EXTERNAL for ingested docs)",
          "enum": [
            "NATIVE",
            "EXTERNAL"
          ],
          "title": "Type",
          "type": "string"
        },
        "include_external_url": {
          "default": true,
          "description": "Include external URL in DocumentSource",
          "title": "Include External Url",
          "type": "boolean"
        },
        "include_external_id": {
          "default": true,
          "description": "Include external ID in DocumentSource",
          "title": "Include External Id",
          "type": "boolean"
        }
      },
      "title": "SourceConfig",
      "type": "object"
    },
    "StatefulStaleMetadataRemovalConfig": {
      "additionalProperties": false,
      "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
      "properties": {
        "enabled": {
          "default": false,
          "description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
          "title": "Enabled",
          "type": "boolean"
        },
        "remove_stale_metadata": {
          "default": true,
          "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
          "title": "Remove Stale Metadata",
          "type": "boolean"
        },
        "fail_safe_threshold": {
          "default": 75.0,
          "description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
          "maximum": 100.0,
          "minimum": 0.0,
          "title": "Fail Safe Threshold",
          "type": "number"
        }
      },
      "title": "StatefulStaleMetadataRemovalConfig",
      "type": "object"
    },
    "TitleExtractionConfig": {
      "additionalProperties": false,
      "description": "Title extraction configuration.",
      "properties": {
        "extract_from_content": {
          "default": true,
          "description": "Try to extract title from document content",
          "title": "Extract From Content",
          "type": "boolean"
        },
        "fallback_to_filename": {
          "default": true,
          "description": "Use filename as title if not found in content",
          "title": "Fallback To Filename",
          "type": "boolean"
        },
        "max_length": {
          "default": 500,
          "description": "Maximum title length",
          "title": "Max Length",
          "type": "integer"
        }
      },
      "title": "TitleExtractionConfig",
      "type": "object"
    }
  },
  "additionalProperties": false,
  "description": "Notion ingestion configuration.\n\nThis source extracts documents from Notion pages and databases\nusing the Notion API and Unstructured.io text extraction.",
  "properties": {
    "stateful_ingestion": {
      "anyOf": [
        {
          "$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Stateful Ingestion Config"
    },
    "api_key": {
      "description": "Notion internal integration token. Create one at https://www.notion.so/my-integrations",
      "format": "password",
      "title": "Api Key",
      "type": "string",
      "writeOnly": true
    },
    "page_ids": {
      "description": "List of Notion page IDs to ingest. IDs can be found in page URLs: https://www.notion.so/Page-Title-{PAGE_ID}. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration.",
      "items": {
        "type": "string"
      },
      "title": "Page Ids",
      "type": "array"
    },
    "database_ids": {
      "description": "List of Notion database IDs to ingest. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration. IDs can be found in database URLs: https://www.notion.so/{DATABASE_ID}",
      "items": {
        "type": "string"
      },
      "title": "Database Ids",
      "type": "array"
    },
    "recursive": {
      "default": true,
      "description": "Recursively fetch child pages. When true, ingests all descendant pages of specified pages/databases.",
      "title": "Recursive",
      "type": "boolean"
    },
    "processing": {
      "$ref": "#/$defs/ProcessingConfig",
      "description": "Text extraction and partitioning configuration"
    },
    "document_mapping": {
      "$ref": "#/$defs/DocumentMappingConfig",
      "description": "Document entity mapping configuration (ID generation, title extraction)"
    },
    "hierarchy": {
      "$ref": "#/$defs/HierarchyConfig",
      "description": "Parent-child relationship configuration"
    },
    "filtering": {
      "$ref": "#/$defs/FilteringConfig",
      "description": "Document filtering configuration"
    },
    "datahub": {
      "$ref": "#/$defs/DataHubConnectionConfig",
      "description": "DataHub connection configuration (for querying server-side embedding config)"
    },
    "chunking": {
      "$ref": "#/$defs/ChunkingConfig",
      "description": "Chunking strategy configuration (for embeddings)"
    },
    "embedding": {
      "$ref": "#/$defs/EmbeddingConfig",
      "description": "Embedding generation configuration (LiteLLM with Cohere/Bedrock)"
    },
    "max_documents": {
      "default": 10000,
      "description": "Maximum number of documents to process per ingestion run. The job will stop and fail with an error once this limit is reached. Set to 0 or -1 to disable the limit.",
      "minimum": -1,
      "title": "Max Documents",
      "type": "integer"
    },
    "advanced": {
      "$ref": "#/$defs/AdvancedConfig",
      "description": "Advanced configuration options (work directory, error handling)"
    }
  },
  "required": [
    "api_key"
  ],
  "title": "NotionSourceConfig",
  "type": "object"
}

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.


Common Use Cases

1. Workspace-wide Documentation Search

Ingest entire workspace documentation with semantic search:

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    # Start from workspace root page
    page_ids:
      - "workspace_root_page_id"
    recursive: true

    # Enable semantic embeddings
    embedding:
      provider: "cohere"
      model: "embed-english-v3.0"
      api_key: "${COHERE_API_KEY}"

2. Specific Database Ingestion

Ingest a specific Notion database (e.g., "Product Requirements"):

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    # Only this database
    database_ids:
      - "product_requirements_db_id"
    recursive: false # Only database entries, not child pages

3. Multi-workspace Setup

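Ingest root pages from several workspaces. A Notion internal integration is tied to one workspace, so each workspace needs its own integration and token; if a single token cannot read every page listed below, run a separate recipe per workspace: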

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    # Multiple root pages from different workspaces
    page_ids:
      - "workspace_1_page_id"
      - "workspace_2_page_id"
    recursive: true

4. Production Setup with AWS Bedrock

Enterprise setup using AWS Bedrock for embeddings:

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    page_ids:
      - "company_wiki_root"
    recursive: true

    # Use AWS Bedrock (no API key needed, uses IAM roles)
    embedding:
      provider: "bedrock"
      aws_region: "us-west-2"
      model: "cohere.embed-english-v3"

    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true

How It Works

Processing Pipeline

Discovery: Notion API discovers pages/databases
Download: Unstructured.io downloads and converts content to structured format
Extraction: Extracts text, metadata, and hierarchy from Notion pages
Chunking: Splits documents into semantic chunks (if embeddings enabled)
Embedding: Generates vector embeddings for each chunk (if embeddings enabled)
Emission: Emits Document entities with SemanticContent aspects to DataHub

Stateful Ingestion Details

The source uses content-based change detection:

Calculates SHA-256 hash of document content + embedding configuration
Compares hash with previous run to detect changes
Only reprocesses documents when hash changes
Tracks all emitted URNs to detect deletions

This means:

First run: Processes all documents
Subsequent runs: Only processes new/changed documents
Deleted pages: Automatically soft-deleted from DataHub
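
The change-detection idea above can be sketched in a few lines of Python. This is a minimal illustration, not the connector's implementation: the function names, the way the text and embedding settings are combined, and the in-memory `{urn: fingerprint}` dictionaries are all assumptions made for the example.

```python
import hashlib
import json


def content_fingerprint(text: str, embedding_config: dict) -> str:
    """Hash document text together with the embedding settings (hypothetical layout)."""
    # Canonical JSON so that key order does not change the hash.
    config_blob = json.dumps(embedding_config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(text.encode("utf-8"))
    digest.update(b"\x00")  # separator between content and config
    digest.update(config_blob.encode("utf-8"))
    return digest.hexdigest()


def plan_run(previous: dict, current: dict) -> tuple:
    """Compare {urn: fingerprint} maps from the last run and the current run."""
    # New or changed documents have no matching fingerprint in the previous state.
    to_process = sorted(urn for urn, fp in current.items() if previous.get(urn) != fp)
    # URNs seen last time but missing now are candidates for soft deletion.
    to_soft_delete = sorted(urn for urn in previous if urn not in current)
    return to_process, to_soft_delete
```

Because the embedding settings are part of the hashed input, switching the embedding model in this sketch changes every fingerprint, so every document would be reprocessed on the next run.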

Performance Tuning

Parallelism Settings

processing:
  parallelism:
    num_processes: 4 # Increase for faster processing (default: 2)
    max_connections: 20 # Concurrent API connections (default: 10)

Guidelines:

Small workspaces (<100 pages): num_processes: 2
Medium workspaces (100-1000 pages): num_processes: 4
Large workspaces (>1000 pages): num_processes: 8
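
If recipes are generated by a script, the guidelines above can be encoded as a small helper. The function name is hypothetical, and the thresholds simply restate the guideline values; treat them as starting points rather than tuned settings.

```python
def suggested_num_processes(page_count: int) -> int:
    """Map an approximate page count to the guideline value for num_processes."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if page_count < 100:  # small workspace
        return 2
    if page_count <= 1000:  # medium workspace
        return 4
    return 8  # large workspace
```

All three values fall inside the 1 to 32 range that the config schema allows for `num_processes`.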

Filtering

filtering:
  min_text_length: 100 # Skip short pages (default: 50)
  skip_empty_documents: true # Skip empty pages (default: true)
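
To preview which pages these two settings would drop, the checks can be approximated as follows. This is a sketch under stated assumptions: the function name is hypothetical, whitespace-only text is treated as empty, and length is measured on the stripped text; the connector's exact rules may differ.

```python
def should_skip(text: str, min_text_length: int = 50, skip_empty_documents: bool = True) -> bool:
    """Return True if a page would be filtered out under the assumed rules."""
    stripped = text.strip()
    # Assumption: a page with only whitespace counts as empty.
    if skip_empty_documents and not stripped:
        return True
    # Assumption: the length check applies to the stripped text.
    return len(stripped) < min_text_length
```

Under these assumptions an empty page is also dropped by the length check whenever `min_text_length` is above zero, so keeping empty pages would require relaxing both settings.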

Chunking Optimization

chunking:
  strategy: "by_title" # Preserves document structure (recommended)
  max_characters: 500 # Chunk size (default: 500)
  combine_text_under_n_chars: 100 # Merge small chunks (default: 100)

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Notion API Limits

Rate Limits: Notion enforces a per-integration rate limit (its API documentation cites an average of 3 requests per second); requests over the limit are rejected with HTTP 429
Access Scope: Integration only sees explicitly shared pages
Content Types: Some Notion blocks may not extract perfectly (e.g., complex embeds, synced blocks)
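
Scripts that call the Notion API alongside ingestion (for example, to look up page IDs) may need to pace their own requests to stay under the rate limit. The class below is a hypothetical helper, not part of the connector; the `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.

```python
import time


class MinIntervalThrottle:
    """Space calls at least 1/max_per_second apart (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the next slot one interval after the slot just used.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Calling `throttle.wait()` before each request keeps a single-threaded script at or below the chosen rate; it does not replace handling of HTTP 429 responses.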

Performance Considerations

Large Workspaces: First run may take significant time for large workspaces
Embedding Generation: Adds processing time proportional to content volume
API Costs: Unstructured API and embedding providers may incur costs

Content Extraction

Supported Blocks: Text, headings, lists, code blocks, tables, callouts, toggles, quotes
Limited Support: Embeds, equations, files (extracted as links/references)
Not Supported: Live charts, board/gallery/timeline views (database views)

Troubleshooting

"Integration not found" or "Unauthorized" errors:

Verify the api_key is correct (internal integration tokens start with secret_, or ntn_ for more recently issued tokens)
Ensure pages are shared with the integration
Check that the integration has "Read content" capability
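
One way to separate token problems from sharing problems is to call Notion's "retrieve your token's bot user" endpoint directly. The sketch below uses only the standard library; the function names are hypothetical, and the `Notion-Version` value is an assumption that should be adjusted to the API version you target.

```python
import json
import urllib.error
import urllib.request

NOTION_ME_URL = "https://api.notion.com/v1/users/me"


def build_token_check_request(api_key: str, notion_version: str = "2022-06-28") -> urllib.request.Request:
    """Build a GET request that identifies the integration behind a token."""
    return urllib.request.Request(
        NOTION_ME_URL,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Notion-Version": notion_version,  # assumed API version
        },
    )


def check_token(api_key: str) -> str:
    """Return the integration's bot name, or raise with the HTTP status."""
    try:
        with urllib.request.urlopen(build_token_check_request(api_key), timeout=10) as resp:
            return json.load(resp).get("name") or "<unnamed>"
    except urllib.error.HTTPError as err:
        raise RuntimeError(f"Notion rejected the token (HTTP {err.code})") from err
```

A successful call only shows that the token is valid. Pages still have to be shared with the integration before ingestion can read them.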

Empty or missing content:

Verify pages contain text (empty pages are skipped by default with skip_empty_documents: true)
Check min_text_length filter setting (default: 50 characters)
Ensure recursive: true if expecting child pages
Check that child pages are not explicitly restricted

Slow ingestion:

Increase processing.parallelism.num_processes (default: 2)
Keep processing.partition.partition_by_api: false (the default) for local processing (requires more memory)
Filter specific pages instead of entire workspace using page_ids
The first run is always slower; with stateful ingestion enabled, subsequent runs only process new or changed documents

Embedding generation failures:

Verify provider API key is correct
Check provider-specific rate limits (Cohere: 10k requests/min)
Ensure embedding model name is valid for your provider
For Bedrock: verify IAM permissions and model access is enabled in AWS Console

Stateful ingestion not working:

Ensure stateful_ingestion.enabled: true in config
Check DataHub connection (source needs to query previous state)
Verify state file path is writable (if using file-based state)
Look for state persistence logs in ingestion output

Missing hierarchy/parent relationships:

Verify hierarchy.enabled: true (default)
Check that parent pages are being ingested
Ensure recursive: true to discover parent-child relationships
Parent pages must be accessible to the integration

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Code Coordinates

Class Name: datahub.ingestion.source.notion.notion_source.NotionSource

Questions?

If you've got any questions on configuring ingestion for Notion, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

