Notion

Notion Source

Ingest pages and databases from Notion workspaces as DataHub Document entities with optional semantic embeddings.

Status: Incubating

Important Capabilities

Capability               Status  Notes
Detect Deleted Entities  ✅      Enabled by default via stateful ingestion.
Test Connection          ✅      Enabled by default.

Overview

The Notion source ingests pages and databases from Notion workspaces as DataHub Document entities. It can optionally generate embeddings for each document to support semantic search.

Key Features

1. Content Extraction

  • Page Content: Full text extraction from Notion pages including all supported block types
  • Database Rows: Ingests database entries as individual documents
  • Hierarchical Structure: Maintains parent-child relationships between pages
  • Metadata Extraction: Captures creation/modification timestamps, authors, and custom properties

2. Hierarchical Relationships

  • Parent-Child Links: Preserves Notion's page hierarchy in DataHub
  • Automatic Discovery: Recursively discovers nested pages starting from root pages
  • Flexible Navigation: Browse documentation structure in DataHub UI

3. Embedding Generation

Optional semantic search support:

  • Supported providers: Cohere (API key), AWS Bedrock (IAM roles)
  • Chunking strategies: by_title, basic
  • Configurable chunk size: Optimize for your embedding model (in characters)
  • Automatic deduplication: Prevents duplicate chunk embeddings

4. Stateful Ingestion

Supports smart incremental updates via stateful ingestion:

  • Content Change Detection: Only reprocesses documents when content or embeddings config changes
  • Deletion Detection: Automatically removes stale entities from DataHub
  • Recursive Discovery: Start from root pages/databases, automatically discovers and ingests child pages
  • State Persistence: Maintains processing state between runs to skip unchanged documents

Prerequisites

1. Notion Integration

Create a Notion internal integration:

  1. Go to https://www.notion.so/my-integrations
  2. Click "+ New integration"
  3. Give it a name (e.g., "DataHub Integration")
  4. Select the workspace
  5. Copy the Internal Integration Token (starts with secret_, or ntn_ for more recently created integrations)

2. Share Pages with Integration

The integration can only access pages explicitly shared with it:

  1. Open the page or database in Notion
  2. Click "Share" in the top right
  3. Search for your integration name
  4. Click "Invite"

Important: For recursive ingestion, you only need to share the top-level pages. Child pages inherit access from their parent automatically.

3. Embedding Provider (Optional)

If you want semantic search capabilities, set up one of these providers:

Cohere

  • Sign up at https://cohere.ai/
  • Create an API key
  • Supports: embed-english-v3.0, embed-multilingual-v3.0

AWS Bedrock

  • AWS account with Bedrock access
  • Enable Cohere models in AWS Console → Bedrock → Model access
  • IAM permissions for bedrock:InvokeModel
  • Recommended region: us-west-2

See Semantic Search Configuration for detailed embedding setup.

Common Use Cases

1. Full Workspace Ingestion

Ingest an entire workspace's documentation with semantic search:

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    # Start from workspace root page
    page_ids:
      - "workspace_root_page_id"
    recursive: true

    # Enable semantic embeddings
    embedding:
      provider: "cohere"
      model: "embed-english-v3.0"
      api_key: "${COHERE_API_KEY}"

2. Specific Database Ingestion

Ingest a specific Notion database (e.g., "Product Requirements"):

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    # Only this database
    database_ids:
      - "product_requirements_db_id"
    recursive: false # Only database entries, not child pages

3. Multi-workspace Setup

Ingest from multiple workspaces (requires multiple integrations):

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    # Multiple root pages from different workspaces
    page_ids:
      - "workspace_1_page_id"
      - "workspace_2_page_id"
    recursive: true

4. Production Setup with AWS Bedrock

Enterprise setup using AWS Bedrock for embeddings:

source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"

    page_ids:
      - "company_wiki_root"
    recursive: true

    # Use AWS Bedrock (no API key needed, uses IAM roles)
    embedding:
      provider: "bedrock"
      aws_region: "us-west-2"
      model: "cohere.embed-english-v3"

    # Enable stateful ingestion for incremental updates
    stateful_ingestion:
      enabled: true
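
The recipes above reference secrets such as ${NOTION_API_KEY} and ${COHERE_API_KEY}, which are filled in from environment variables when the recipe is loaded. A minimal pre-flight sketch for catching unset variables before a run is shown below. It assumes only the simple ${VAR} form is used, and missing_env_vars is a hypothetical helper, not part of DataHub.

```python
import re

# Matches the simple ${VAR_NAME} form used in the recipes above.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(recipe_text: str, environ: dict) -> list:
    """Return referenced variable names that are unset or empty, in first-seen order."""
    seen = []
    for name in _ENV_REF.findall(recipe_text):
        if name not in seen:
            seen.append(name)
    return [name for name in seen if not environ.get(name)]


recipe = """
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    embedding:
      api_key: "${COHERE_API_KEY}"
"""
```

In a real script you would pass os.environ as the second argument and abort if the returned list is non-empty.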

How It Works

Processing Pipeline

  1. Discovery: Notion API discovers pages/databases
  2. Download: Unstructured.io downloads and converts content to structured format
  3. Extraction: Extracts text, metadata, and hierarchy from Notion pages
  4. Chunking: Splits documents into semantic chunks (if embeddings enabled)
  5. Embedding: Generates vector embeddings for each chunk (if embeddings enabled)
  6. Emission: Emits Document entities with SemanticContent aspects to DataHub

Stateful Ingestion Details

The source uses content-based change detection:

  • Calculates SHA-256 hash of document content + embedding configuration
  • Compares hash with previous run to detect changes
  • Only reprocesses documents when hash changes
  • Tracks all emitted URNs to detect deletions

This means:

  • First run: Processes all documents
  • Subsequent runs: Only processes new/changed documents
  • Deleted pages: Automatically soft-deleted from DataHub
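
The change-detection idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the source's actual implementation: the function names, the JSON serialization of the embedding config, and the separator byte are all hypothetical choices.

```python
import hashlib
import json


def content_fingerprint(text: str, embedding_config: dict) -> str:
    """SHA-256 over the document text plus a canonical form of the embedding config."""
    # sort_keys makes the serialization independent of dict key order.
    config_blob = json.dumps(embedding_config, sort_keys=True)
    digest = hashlib.sha256()
    digest.update(text.encode("utf-8"))
    digest.update(b"\x00")  # separator so text and config cannot run together
    digest.update(config_blob.encode("utf-8"))
    return digest.hexdigest()


def plan_run(current: dict, previous: dict) -> tuple:
    """Compare this run's fingerprints with the previous run's.

    Both arguments map document URN -> fingerprint.
    Returns (urns_to_process, urns_to_soft_delete), each sorted.
    """
    to_process = sorted(urn for urn, fp in current.items() if previous.get(urn) != fp)
    to_delete = sorted(urn for urn in previous if urn not in current)
    return to_process, to_delete
```

Because the embedding configuration is part of the hashed input, switching models changes every fingerprint and forces a full reprocess, which matches the behavior described above.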

Limitations and Considerations

Notion API Limits

  • Rate Limits: Notion enforces rate limits (an average of 3 requests/second per integration, according to Notion's API documentation)
  • Access Scope: Integration only sees explicitly shared pages
  • Content Types: Some Notion blocks may not extract perfectly (e.g., complex embeds, synced blocks)
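
If you call the Notion API from your own scripts alongside this source (for example, to look up page IDs), you may want to pace those calls so they stay under the rate limit. The sketch below spaces calls at a fixed minimum interval. Throttle is a hypothetical helper, and nothing here describes how the Notion source itself handles throttling.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
        return delay
```

Call throttle.wait() immediately before each API request. The clock and sleep parameters exist only so the class can be exercised without real delays.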

Performance Considerations

  • Large Workspaces: First run may take significant time for large workspaces
  • Embedding Generation: Adds processing time proportional to content volume
  • API Costs: Unstructured API and embedding providers may incur costs

Content Extraction

  • Supported Blocks: Text, headings, lists, code blocks, tables, callouts, toggles, quotes
  • Limited Support: Embeds, equations, files (extracted as links/references)
  • Not Supported: Live charts, board/gallery/timeline views (database views)

Troubleshooting

Common Issues

"Integration not found" or "Unauthorized" errors:

  • Verify the api_key is correct (should start with secret_, or ntn_ for more recently created integrations)
  • Ensure pages are shared with the integration
  • Check that the integration has "Read content" capability

Empty or missing content:

  • Verify pages contain text (empty pages are skipped by default with skip_empty_documents: true)
  • Check min_text_length filter setting (default: 50 characters)
  • Ensure recursive: true if expecting child pages
  • Check that child pages are not explicitly restricted

Slow ingestion:

  • Increase processing.parallelism.num_processes (default: 2)
  • Consider setting processing.partition.partition_by_api: false (the default) to process content locally instead of through the Unstructured API (requires more memory)
  • Filter specific pages instead of entire workspace using page_ids
  • First run is always slower - subsequent runs use incremental updates

Embedding generation failures:

  • Verify provider API key is correct
  • Check provider-specific rate limits (Cohere: 10k requests/min)
  • Ensure embedding model name is valid for your provider
  • For Bedrock: verify IAM permissions and model access is enabled in AWS Console

Stateful ingestion not working:

  • Ensure stateful_ingestion.enabled: true in config
  • Check DataHub connection (source needs to query previous state)
  • Verify state file path is writable (if using file-based state)
  • Look for state persistence logs in ingestion output

Missing hierarchy/parent relationships:

  • Verify hierarchy.enabled: true (default)
  • Check that parent pages are being ingested
  • Ensure recursive: true to discover parent-child relationships
  • Parent pages must be accessible to the integration

Performance Tuning

Parallelism Settings

processing:
  parallelism:
    num_processes: 4 # Increase for faster processing (default: 2)
    max_connections: 20 # Concurrent API connections (default: 10)

Guidelines:

  • Small workspaces (<100 pages): num_processes: 2
  • Medium workspaces (100-1000 pages): num_processes: 4
  • Large workspaces (>1000 pages): num_processes: 8
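
If you generate recipes programmatically, the guidelines above reduce to a small lookup. The function name below is hypothetical, and the thresholds are copied directly from the list.

```python
def suggested_num_processes(page_count: int) -> int:
    """Map a workspace's page count to the num_processes value from the guidelines."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if page_count < 100:
        return 2  # small workspace
    if page_count <= 1000:
        return 4  # medium workspace
    return 8  # large workspace
```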

Filtering

filtering:
  min_text_length: 100 # Skip short pages (default: 50)
  skip_empty_documents: true # Skip empty pages (default: true)
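
To reason about which pages these two settings would drop, a rough model is sketched below. It is an assumption, not the source's code: it measures length after stripping surrounding whitespace and treats whitespace-only pages as empty, and should_skip is a hypothetical name.

```python
def should_skip(text: str, min_text_length: int = 50, skip_empty_documents: bool = True) -> bool:
    """Return True if a document would be filtered out under these two settings."""
    stripped = text.strip()
    if not stripped:
        # Assumed: empty documents are governed only by skip_empty_documents.
        return skip_empty_documents
    return len(stripped) < min_text_length
```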

Chunking Optimization

chunking:
  strategy: "by_title" # Preserves document structure (recommended)
  max_characters: 500 # Chunk size (default: 500)
  combine_text_under_n_chars: 100 # Merge small chunks (default: 100)
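
To build intuition for how max_characters and combine_text_under_n_chars interact, here is a deliberately simplified model. It is not the chunker the source uses: it assumes the input is already split into sections, lets a chunk that is still shorter than combine_text_under_n_chars absorb the next piece when the result fits, and hard-splits oversized sections. chunk_sections is a hypothetical name.

```python
def chunk_sections(sections, max_characters=500, combine_text_under_n_chars=100):
    """Pack text sections into chunks no longer than max_characters."""
    chunks = []
    for section in sections:
        # Hard-split any section longer than max_characters.
        pieces = [section[i:i + max_characters] for i in range(0, len(section), max_characters)]
        for piece in pieces:
            can_merge = bool(
                chunks
                and len(chunks[-1]) < combine_text_under_n_chars
                and len(chunks[-1]) + 1 + len(piece) <= max_characters  # +1 for the newline
            )
            if can_merge:
                chunks[-1] = chunks[-1] + "\n" + piece
            else:
                chunks.append(piece)
    return chunks
```

In this model, raising combine_text_under_n_chars produces fewer, larger chunks, while lowering max_characters caps how much text a single embedding has to represent.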

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: notion
  config:
    # Notion API token from your integration
    api_key: "${NOTION_API_KEY}"

    # Ingest specific pages (get IDs from page URLs)
    page_ids:
      - "your-page-id-here"

    # Or ingest all accessible content (leave page_ids and database_ids empty)
    # page_ids: []
    # database_ids: []

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
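
The recipe comment says page IDs come from page URLs. A small sketch for pulling the ID out of a URL is shown below. It assumes the ID is the 32 hexadecimal characters at the end of the URL path (with or without dashes), and notion_id_from_url is a hypothetical helper. Whether the source expects the dashed or undashed form is not stated here, so the sketch returns the undashed, lowercase form.

```python
import re
from urllib.parse import urlparse

# The last 32 hex characters of the final path segment, once dashes are removed.
_HEX32_AT_END = re.compile(r"[0-9a-fA-F]{32}$")


def notion_id_from_url(url: str) -> str:
    """Return the 32-character hex ID at the end of a Notion URL path, lowercased."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = _HEX32_AT_END.search(last_segment.replace("-", ""))
    if match is None:
        raise ValueError(f"no 32-character hex ID found in {url!r}")
    return match.group(0).lower()
```

Query strings such as ?v=... on database URLs are ignored because only the URL path is inspected.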

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

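Each entry lists the field name, its type, a description where one is given, and the default value where one exists.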
api_key 
string(password)
Notion internal integration token. Create one at https://www.notion.so/my-integrations
recursive
boolean
Recursively fetch child pages. When true, ingests all descendant pages of specified pages/databases.
Default: True
advanced
AdvancedConfig
Advanced configuration options.
advanced.continue_on_failure
boolean
Default: True
advanced.max_errors
integer
Default: 10
advanced.output_format
Enum
One of: "json", "xml"
Default: json
advanced.preserve_outputs
boolean
Default: False
advanced.raise_on_error
boolean
Default: False
advanced.work_dir
string
Default: /tmp/unstructured_datahub
advanced.cache
CacheConfig
Cache configuration.
advanced.cache.cache_dir
string
Default: ~/.cache/unstructured_datahub
advanced.cache.enabled
boolean
Default: True
advanced.cache.ttl
integer
Cache TTL in seconds
Default: 86400
advanced.retry
RetryConfig
Retry configuration.
advanced.retry.backoff_factor
integer
Default: 2
advanced.retry.enabled
boolean
Default: True
advanced.retry.max_attempts
integer
Default: 3
advanced.retry.retry_on_timeout
boolean
Default: True
chunking
ChunkingConfig
Chunking strategy configuration.
chunking.combine_text_under_n_chars
integer
Combine chunks smaller than this size
Default: 100
chunking.max_characters
integer
Maximum characters per chunk
Default: 500
chunking.overlap
integer
Character overlap between chunks
Default: 0
chunking.strategy
Enum
One of: "basic", "by_title"
Default: by_title
database_ids
array
List of Notion database IDs to ingest. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration. IDs can be found in database URLs: https://www.notion.so/{DATABASE_ID}
database_ids.string
string
datahub
DataHubConnectionConfig
DataHub connection configuration.
datahub.server
string
DataHub GMS server URL
Default: http://localhost:8080
datahub.token
One of string, null
DataHub API token for authentication
Default: None
document_mapping
DocumentMappingConfig
Document entity mapping configuration.
document_mapping.id_pattern
string
Pattern for generating document IDs
Default: {source_type}-{directory}-{basename}
document_mapping.status
Enum
One of: "PUBLISHED", "UNPUBLISHED"
Default: PUBLISHED
document_mapping.id_normalization
IdNormalizationConfig
Document ID normalization rules.
document_mapping.id_normalization.lowercase
boolean
Convert to lowercase
Default: True
document_mapping.id_normalization.max_length
integer
Maximum ID length
Default: 200
document_mapping.id_normalization.remove_special_chars
boolean
Remove special characters except _ and -
Default: True
document_mapping.id_normalization.replace_spaces_with
string
Replace spaces with this character
Default: -
document_mapping.source
SourceConfig
Document source configuration.
document_mapping.source.include_external_id
boolean
Include external ID in DocumentSource
Default: True
document_mapping.source.include_external_url
boolean
Include external URL in DocumentSource
Default: True
document_mapping.source.type
Enum
One of: "NATIVE", "EXTERNAL"
Default: EXTERNAL
document_mapping.title
TitleExtractionConfig
Title extraction configuration.
document_mapping.title.extract_from_content
boolean
Try to extract title from document content
Default: True
document_mapping.title.fallback_to_filename
boolean
Use filename as title if not found in content
Default: True
document_mapping.title.max_length
integer
Maximum title length
Default: 500
embedding
EmbeddingConfig
Embedding generation configuration.

Default behavior: Fetches configuration from DataHub server automatically.
Override behavior: Validates local config against server when explicitly set.
embedding.allow_local_embedding_config
boolean
BREAK-GLASS: Allow local config without server validation. NOT RECOMMENDED - may break semantic search.
Default: False
embedding.api_key
One of string, null
API key for Cohere (not needed for Bedrock with IAM roles)
Default: None
embedding.aws_region
One of string, null
AWS region for Bedrock. If not set, loads from server.
Default: None
embedding.batch_size
integer
Batch size for embedding API calls
Default: 25
embedding.input_type
One of string, null
Input type for Cohere embeddings
Default: search_document
embedding.model
One of string, null
Model name. If not set, loads from server.
Default: None
embedding.model_embedding_key
One of string, null
Storage key for embeddings (e.g., 'cohere_embed_v3'). Required if overriding server config. If not set, loads from server.
Default: None
embedding.provider
One of Enum, null
Embedding provider (bedrock uses AWS, cohere uses API key). If not set, loads from server.
Default: None
filtering
FilteringConfig
File filtering configuration.
filtering.max_file_size
One of integer, null
Maximum file size in bytes
Default: None
filtering.min_file_size
One of integer, null
Minimum file size in bytes
Default: None
filtering.min_text_length
integer
Minimum text length in characters
Default: 50
filtering.modified_after
One of string, null
Only files modified after this date (ISO format)
Default: None
filtering.modified_before
One of string, null
Only files modified before this date (ISO format)
Default: None
filtering.skip_empty_documents
boolean
Skip documents with no text content
Default: True
filtering.exclude_patterns
array
Glob patterns to exclude
filtering.exclude_patterns.string
string
filtering.include_patterns
array
Glob patterns to include
filtering.include_patterns.string
string
hierarchy
HierarchyConfig
Hierarchy configuration.
hierarchy.enabled
boolean
Enable parent-child relationships
Default: True
hierarchy.parent_strategy
Enum
One of: "folder", "none", "custom", "notion"
Default: folder
hierarchy.custom_mapping
One of CustomMappingConfig, null
Custom mapping configuration
Default: None
hierarchy.custom_mapping.rules
array
Custom parent mapping rules
hierarchy.custom_mapping.rules.CustomParentRule
CustomParentRule
Custom parent mapping rule.
hierarchy.custom_mapping.rules.CustomParentRule.parent_id 
string
Parent document ID for matching files
hierarchy.custom_mapping.rules.CustomParentRule.pattern 
string
Glob pattern to match file paths
hierarchy.folder_mapping
FolderMappingConfig
Folder hierarchy mapping configuration.
hierarchy.folder_mapping.create_parent_docs
boolean
Create Document entities for folders
Default: True
hierarchy.folder_mapping.max_depth
integer
Maximum hierarchy depth
Default: 10
hierarchy.folder_mapping.parent_id_pattern
string
Pattern for parent document IDs
Default: {source_type}-{directory}
hierarchy.folder_mapping.root_parent
One of string, null
Optional root document URN
Default: None
page_ids
array
List of Notion page IDs to ingest. IDs can be found in page URLs: https://www.notion.so/Page-Title-{PAGE_ID}. If both page_ids and database_ids are empty, the source will automatically discover and ingest ALL pages and databases accessible to the integration.
page_ids.string
string
processing
ProcessingConfig
Processing configuration (partitioning only, no chunking).
processing.parallelism
ParallelismConfig
Parallelism configuration.
processing.parallelism.disable_parallelism
boolean
Disable all parallelism
Default: False
processing.parallelism.max_connections
integer
Max concurrent connections for async operations
Default: 10
processing.parallelism.num_processes
integer
Number of worker processes
Default: 2
processing.partition
PartitionConfig
Unstructured partitioning configuration.
processing.partition.additional_args
object
Additional partition arguments
processing.partition.api_key
One of string, null
Unstructured API key
Default: None
processing.partition.partition_by_api
boolean
Use Unstructured API for partitioning
Default: False
processing.partition.split_pdf_concurrency_level
integer
Number of parallel requests for PDF pages
Default: 5
processing.partition.split_pdf_page
boolean
Enable page-level splitting for large PDFs
Default: False
processing.partition.strategy
Enum
One of: "auto", "hi_res", "fast", "ocr_only"
Default: auto
processing.partition.ocr_languages
array
Languages for OCR
Default: ['eng']
processing.partition.ocr_languages.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful Ingestion Config
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Guards against accidental changes to the source configuration: if the relative change (in percent) in entities compared to the previous state is above the 'fail_safe_threshold', the run neither commits the new state nor performs the resulting soft deletes.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
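
The stateful_ingestion.fail_safe_threshold check can be pictured as a simple percentage comparison. The sketch below assumes the relative change is the share of previously seen entities that are missing from the current run. That definition and the function name are assumptions for illustration, not DataHub's actual code.

```python
def exceeds_fail_safe(previous_count: int, removed_count: int, fail_safe_threshold: float = 75.0) -> bool:
    """True if the percentage of previously seen entities now missing is above the threshold."""
    if previous_count == 0:
        return False  # nothing to compare against on a first run
    percent_changed = 100.0 * removed_count / previous_count
    return percent_changed > fail_safe_threshold
```

Under this model, narrowing page_ids from a whole workspace to a single page could trip the threshold, so stale entities are left in place instead of being soft-deleted.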

Code Coordinates

  • Class Name: datahub.ingestion.source.notion.notion_source.NotionSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Notion, feel free to ping us on our Slack.