# Snowplow

## Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Descriptions | ✅ | Enabled by default from schema descriptions. |
| Detect Deleted Entities | ✅ | Enabled via stateful ingestion. |
| Domains | ✅ | Supported via configuration. |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default for event and entity schemas. |
| Table-Level Lineage | ✅ | Optionally enabled via the `warehouse_lineage.enabled` configuration (requires BDP). |
This connector ingests metadata from Snowplow.

It extracts:

- Organizations (as containers)
- Event schemas (as datasets)
- Entity schemas (as datasets)
- Event specifications (as datasets) - BDP only
- Tracking plans (as containers) - BDP only
- Pipelines (as DataFlow entities) - BDP only
- Enrichments (as DataJob entities) - BDP only
- Warehouse lineage (optional) - requires a warehouse connection

It supports:

- Snowplow BDP (Behavioral Data Platform) deployments
- Open-source Snowplow with an Iglu registry
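For open-source deployments, the connector can run against an Iglu registry alone. A minimal sketch of such a recipe (the registry URL and server address are placeholders to substitute with your own):

```yaml
# Minimal Iglu-only recipe (open-source Snowplow, no BDP Console).
# "https://iglu.example.com" is a placeholder for your registry URL.
source:
  type: snowplow
  config:
    iglu_connection:
      iglu_server_url: "https://iglu.example.com"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```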
## CLI-based Ingestion

### Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
```yaml
# Snowplow Comprehensive Recipe
# This recipe demonstrates ALL available configuration options with detailed comments
source:
  type: snowplow
  config:
    # ============================================
    # Connection Configuration
    # ============================================

    # BDP Console API connection (for managed Snowplow).
    # Required for: event specifications, tracking plans, pipelines, enrichments.
    # Optional: can be omitted if using Iglu-only mode.
    bdp_connection:
      # Organization UUID - found in the BDP Console URL
      # Example: https://console.snowplowanalytics.com/{org_id}/data-structures
      organization_id: "<YOUR_ORG_UUID>"

      # API credentials from BDP Console → Settings → API Credentials.
      # Use environment variables for security.
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

      # Optional: BDP Console API base URL (default shown).
      # Only change this if using a custom/regional Snowplow deployment.
      console_api_url: "https://console.snowplowanalytics.com/api/msc/v1"

      # Optional: request timeout in seconds (default: 60)
      timeout_seconds: 60

      # Optional: maximum retry attempts for failed requests (default: 3)
      max_retries: 3

    # Iglu Schema Registry connection (for open-source Snowplow).
    # Required for: open-source Snowplow deployments without the BDP Console API.
    # Note: at least one of bdp_connection or iglu_connection is required;
    # iglu_connection can also be provided alongside bdp_connection as a fallback.
    iglu_connection:
      # Iglu server base URL
      # Examples:
      #   - Public:  http://iglucentral.com
      #   - Private: https://iglu.example.com
      iglu_server_url: "https://iglu.example.com"

      # Optional: API key for private Iglu registries (UUID format).
      # Not required for public registries like Iglu Central.
      api_key: "${IGLU_API_KEY}"

      # Optional: request timeout in seconds (default: 30)
      timeout_seconds: 30

    # ============================================
    # Filtering Configuration
    # ============================================

    # Filter schemas by vendor/name pattern.
    # Pattern format: "vendor/name" (e.g., "com.example/page_view").
    # Requires permission: read:data-structures
    schema_pattern:
      allow:
        - ".*" # Allow all schemas (default)
        # Examples of allow patterns:
        # - "com\\.example\\..*"            # Allow all com.example schemas
        # - "com\\.acme\\.events\\..*"      # Allow com.acme.events schemas
        # - "com\\.snowplowanalytics\\..*"  # Allow Snowplow standard schemas
      deny: []
      # Examples of deny patterns:
      # - ".*\\.test$"                      # Deny schemas ending with .test
      # - ".*_sandbox.*"                    # Deny sandbox schemas
      # - "com\\.example\\.deprecated\\..*" # Deny deprecated schemas

    # Filter event specifications by name.
    # Only applies when extract_event_specifications is enabled.
    # Requires permission: read:event-specs
    event_spec_pattern:
      allow:
        - ".*" # Allow all event specifications (default)
      deny: []

    # Filter tracking plans by name.
    # Only applies when extract_tracking_plans is enabled.
    # Requires permission: read:tracking-scenarios
    tracking_plan_pattern:
      allow:
        - ".*" # Allow all tracking plans (default)
      deny: []

    # ============================================
    # Feature Flags
    # ============================================

    # Extract event specifications (BDP only).
    # Requires permission: read:event-specs
    # Default: true
    extract_event_specifications: true

    # Extract tracking plans (BDP only).
    # Requires permission: read:tracking-scenarios
    # Default: true
    extract_tracking_plans: true

    # Include the full JSON Schema definition in dataset properties.
    # Useful for downstream schema analysis.
    # Default: true
    include_schema_definitions: true

    # Include schemas marked as hidden in BDP Console.
    # Default: false
    include_hidden_schemas: false

    # ============================================
    # Warehouse Lineage (BDP only - Advanced)
    # ============================================
    # Extract TABLE-LEVEL lineage from atomic.events to derived tables via the
    # Data Models API.
    # Creates lineage: atomic.events → derived tables (e.g., derived.sessions)
    #
    # ⚠️ IMPORTANT: Disabled by default.
    # Warehouse connectors (Snowflake, BigQuery, etc.) provide BETTER lineage:
    #   - Column-level lineage (not just table-level)
    #   - Transformation logic from actual SQL queries
    #   - Complete dependency graphs
    #
    # Only enable this if:
    #   - You want quick table-level lineage without setting up a warehouse connector
    #   - You don't have access to warehouse query logs
    #   - You want to document Data Models API metadata specifically
    warehouse_lineage:
      # Enable warehouse lineage extraction (default: false).
      # Disabled by default - prefer a warehouse connector for detailed lineage.
      enabled: false

      # Optional: default platform instance for warehouse URNs.
      # Example: "prod_snowflake", "prod_bigquery".
      # Can be overridden per destination using destination_mappings.
      platform_instance: "prod_snowflake"

      # Optional: default environment for warehouse datasets (default: PROD)
      env: "PROD"

      # Optional: per-destination mappings (override defaults for specific destinations)
      destination_mappings: []
      # Example: override the platform instance for a specific destination:
      # destination_mappings:
      #   - destination_id: "12345678-1234-1234-1234-123456789012"
      #     platform_instance: "staging_snowflake"
      #     env: "DEV"

      # Optional: validate that warehouse URNs exist in DataHub before creating lineage.
      # Requires DataHub Graph API access (default: true).
      validate_urns: true

    # ============================================
    # Schema Extraction Options
    # ============================================

    # Schema types to extract.
    # Options: "event" and/or "entity". Default: ["event", "entity"]
    schema_types_to_extract:
      - "event"  # Event schemas (self-describing events)
      - "entity" # Entity schemas (contexts and entities)

    # ============================================
    # Platform Instance (Optional)
    # ============================================
    # Platform instance identifier for multi-environment deployments.
    # Groups schemas by environment (e.g., production, staging, dev).
    # Uncomment to enable:
    # platform_instance: "production"

    # ============================================
    # Environment (Optional)
    # ============================================
    # Environment tag (PROD, DEV, QA, etc.).
    # Uncomment to enable:
    # env: "PROD"

    # ============================================
    # Stateful Ingestion (Optional)
    # ============================================
    # Enable stateful ingestion for deletion detection.
    # Tracks which schemas have been seen and removes stale ones.
    # Requires permission: read:data-structures (to track existence)
    stateful_ingestion:
      enabled: false
      remove_stale_metadata: true # Remove schemas that no longer exist

# ============================================
# Sink Configuration
# ============================================
sink:
  type: datahub-rest
  config:
    # DataHub GMS server URL
    server: "http://localhost:8080"

    # Optional: authentication token
    # token: "${DATAHUB_TOKEN}"

    # Optional: timeout for REST requests (default: 30s)
    # timeout_sec: 30

    # Optional: extra headers
    # extra_headers:
    #   X-Custom-Header: "value"
```
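References like `${SNOWPLOW_API_KEY_ID}` in the recipe are resolved from environment variables when the recipe is loaded, so secrets never need to live in the YAML file. A simplified sketch of that substitution (a hypothetical `expand_env_vars` helper for illustration, not DataHub's actual loader):

```python
import os
import re


def expand_env_vars(text: str) -> str:
    """Replace ${VAR} references with values from the environment.

    Simplified sketch of recipe variable expansion: unset variables are
    left untouched here, whereas a real loader may raise an error instead.
    """

    def replace(match: re.Match) -> str:
        name = match.group(1)
        return os.environ.get(name, match.group(0))

    return re.sub(r"\$\{(\w+)\}", replace, text)


os.environ["SNOWPLOW_API_KEY_ID"] = "abc123"
snippet = 'api_key_id: "${SNOWPLOW_API_KEY_ID}"'
print(expand_env_vars(snippet))  # api_key_id: "abc123"
```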
## Config Details
Note that a . is used to denote nested fields in the YAML recipe.
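For example, the field `warehouse_lineage.validate_urns` in the table below corresponds to this nesting in the recipe:

```yaml
# Table row "warehouse_lineage.validate_urns" maps to:
warehouse_lineage:
  validate_urns: true
```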
| Field | Description |
|---|---|
deployed_since One of string, null | Only extract schemas deployed/updated since this timestamp (ISO 8601 format: 2025-12-15T00:00:00Z). Enables incremental ingestion by filtering based on deployment timestamps. Leave empty to fetch all schemas. Default: None |
enrichment_owner One of string, null | Default owner for enrichments (e.g., 'data-platform@company.com'). Applied as DATAOWNER to all enrichment DataJobs. Leave empty to skip enrichment ownership. Default: None |
extract_enrichments boolean | Extract enrichments as DataJob entities linked to pipelines (requires BDP connection) Default: True |
extract_event_specifications boolean | Extract event specifications (requires BDP connection) Default: True |
extract_pipelines boolean | Extract pipelines as DataFlow entities (requires BDP connection) Default: True |
extract_standard_schemas boolean | Extract Snowplow standard schemas from Iglu Central that are referenced by event specifications. Standard schemas (vendor: com.snowplowanalytics.*) are not in the Data Structures API but are publicly available. When enabled, creates dataset entities for standard schemas and completes lineage from event specs. Only fetches schemas that are actually referenced, not all standard schemas. Disable if you don't want to fetch from Iglu Central. Default: True |
extract_tracking_plans boolean | Extract tracking plans (requires BDP connection) Default: True |
iglu_central_url string | Iglu Central base URL for fetching Snowplow standard schemas Default: http://iglucentral.com |
include_hidden_schemas boolean | Include schemas marked as hidden in BDP Console Default: False |
include_version_in_urn boolean | Include version in dataset URN (legacy behavior). When False (recommended), version is stored in dataset properties instead. Set to True for backwards compatibility with existing metadata. Default: False |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schema_page_size integer | Number of schemas to fetch per API page (default: 100). Adjust based on organization size and API performance. Default: 100 |
env string | The environment that all assets produced by this connector belong to Default: PROD |
bdp_connection One of SnowplowBDPConnectionConfig, null | BDP Console API connection (required for BDP mode) Default: None |
bdp_connection.api_key (required) string(password) | API Key secret from BDP Console credentials |
bdp_connection.api_key_id (required) string | API Key ID from BDP Console credentials |
bdp_connection.organization_id (required) string | Organization UUID (found in BDP Console URL) |
bdp_connection.console_api_url string | BDP Console API base URL |
bdp_connection.max_retries integer | Maximum number of retry attempts for failed requests Default: 3 |
bdp_connection.timeout_seconds integer | Request timeout in seconds Default: 60 |
event_spec_pattern AllowDenyPattern | Regex patterns for event specifications to filter |
event_spec_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
field_tagging FieldTaggingConfig | Configuration for auto-tagging schema fields. |
field_tagging.authorship_pattern string | Pattern for authorship tags. Use {author} placeholder. Default: added_by_{author} |
field_tagging.emit_tags_and_structured_properties boolean | Emit both tags and structured properties for fields. When True, both tags and structured properties are emitted. When False, only the method specified by use_structured_properties is used. Useful during migration from tags to structured properties. Default: False |
field_tagging.enabled boolean | Enable automatic field tagging Default: True |
field_tagging.event_type_pattern string | Pattern for event type tags. Use {name} placeholder. Default: snowplow_event_{name} |
field_tagging.pii_tags_only boolean | When emit_tags_and_structured_properties is true, only emit tags for PII/sensitive data classification. Version, authorship, and event type will only be in structured properties, not as tags. Useful when you want detailed structured properties but only highlight PII fields with tags. Default: False |
field_tagging.schema_version_pattern string | Pattern for schema version tags. Use {version} placeholder. Default: snowplow_schema_v{version} |
field_tagging.tag_authorship boolean | Tag fields with authorship (e.g., added_by_ryan_smith) Default: True |
field_tagging.tag_data_class boolean | Tag fields with data classification (e.g., PII, Sensitive) Default: True |
field_tagging.tag_event_type boolean | Tag fields with event type (e.g., snowplow_event_checkout) Default: True |
field_tagging.tag_schema_version boolean | Tag fields with schema version (e.g., snowplow_schema_v1-0-0) Default: True |
field_tagging.track_field_versions boolean | Track which version each field was added in. When enabled, compares schema versions to determine when fields were introduced. Tags fields with their introduction version and adds 'Added in version X' to descriptions. Disabled by default as it requires fetching all schema versions (slower ingestion). Default: False |
field_tagging.use_pii_enrichment boolean | Extract PII fields from PII Pseudonymization enrichment config Default: True |
field_tagging.use_structured_properties boolean | Use structured properties for field metadata instead of (or in addition to) tags. Structured properties provide strongly-typed metadata with better querying capabilities. When enabled, field authorship, version, timestamp, and classification are emitted as structured properties on the schemaField entity. Note: Requires structured property definitions to be registered in DataHub first. See snowplow_field_structured_properties.yaml in the connector directory. Default: True |
field_tagging.pii_field_patterns array | Field name patterns to classify as PII (fallback if enrichment not available) |
field_tagging.pii_field_patterns.string string | |
field_tagging.sensitive_field_patterns array | Field name patterns to classify as Sensitive |
field_tagging.sensitive_field_patterns.string string | |
iglu_connection One of IgluConnectionConfig, null | Iglu Schema Registry connection (required for Iglu mode, optional for BDP mode as fallback) Default: None |
iglu_connection.iglu_server_url (required) string | Iglu server base URL (e.g., 'https://iglu.acme.com' or 'http://iglucentral.com') |
iglu_connection.api_key One of string(password), null | API key for private Iglu registry (UUID format) Default: None |
iglu_connection.timeout_seconds integer | Request timeout in seconds Default: 30 |
performance PerformanceConfig | Performance and scaling configuration. Controls parallel processing, caching, and limits for large-scale deployments. |
performance.enable_parallel_fetching boolean | Enable parallel fetching of schema deployments. Significantly speeds up ingestion when field version tracking is enabled. Disable for debugging or if API rate limits are strict. Default: True |
performance.max_concurrent_api_calls integer | Maximum concurrent API calls for deployment fetching. Increase for faster ingestion of large organizations with many schemas. Recommended: 5-20 depending on API rate limits. Default: 10 |
schema_pattern AllowDenyPattern | Regex patterns for schemas to filter (vendor/name format) |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_types_to_extract array | Schema types to extract: 'event' and/or 'entity' |
schema_types_to_extract.string string | |
tracking_plan_pattern AllowDenyPattern | Regex patterns for tracking plans to filter |
tracking_plan_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
warehouse_lineage WarehouseLineageConfig | Configuration for extracting lineage to warehouse destinations. This feature creates table-level lineage from atomic.events to derived tables by querying the Snowplow BDP Data Models API. IMPORTANT: Disabled by default because warehouse connectors (Snowflake, BigQuery, etc.) provide more detailed lineage by parsing actual SQL queries, including: - Column-level lineage - Transformation logic - Complete dependency graphs Only enable this if: - You want quick table-level lineage without setting up warehouse connector - You don't have access to warehouse query logs - You want to document Data Models API metadata specifically |
warehouse_lineage.enabled boolean | Enable warehouse lineage extraction via data models API. Disabled by default - prefer using warehouse connector (Snowflake, BigQuery) for detailed lineage. Default: False |
warehouse_lineage.platform_instance One of string, null | Default platform instance prefix for warehouse URNs (e.g., 'prod_snowflake'). Applied globally unless overridden by destination_mappings. Default: None |
warehouse_lineage.validate_urns boolean | Validate that warehouse table URNs exist in DataHub before creating lineage. Requires DataHub Graph API access. Set to False to skip validation. Default: True |
warehouse_lineage.env string | Default environment for warehouse datasets Default: PROD |
warehouse_lineage.destination_mappings array | Per-destination platform instance mappings. Overrides global platform_instance for specific destinations. |
warehouse_lineage.destination_mappings.DestinationMapping DestinationMapping | Mapping configuration for a Snowplow warehouse destination. |
warehouse_lineage.destination_mappings.DestinationMapping.destination_id (required) string | Snowplow destination UUID from data models |
warehouse_lineage.destination_mappings.DestinationMapping.platform_instance One of string, null | Platform instance to prepend to dataset name in URN (e.g., 'prod_snowflake') Default: None |
warehouse_lineage.destination_mappings.DestinationMapping.env string | Environment for warehouse datasets (e.g., PROD, DEV) Default: PROD |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Stateful ingestion configuration for deletion detection Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Safety guard against accidental source-config changes: if the relative change in entity count compared to the previous state exceeds this percentage, soft deletion is skipped and the state is not committed. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
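The `AllowDenyPattern` fields above share the same semantics: a name is kept if it matches at least one allow regex and no deny regex, with deny taking precedence and case-insensitive matching by default. A rough sketch of that logic for illustration (a hypothetical `allowed` function, not DataHub's actual implementation):

```python
import re
from typing import List


def allowed(name: str, allow: List[str], deny: List[str], ignore_case: bool = True) -> bool:
    """Sketch of allow/deny filtering semantics.

    A name passes if it matches any allow regex and no deny regex;
    deny takes precedence over allow. Illustration only - not the
    actual DataHub AllowDenyPattern implementation.
    """
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(p, name, flags) for p in deny):
        return False
    return any(re.match(p, name, flags) for p in allow)


# Schemas are matched in "vendor/name" form:
print(allowed("com.example/page_view", [r"com\.example/.*"], [r".*_sandbox.*"]))          # True
print(allowed("com.example/page_view_sandbox", [r"com\.example/.*"], [r".*_sandbox.*"]))  # False
```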
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"DestinationMapping": {
"additionalProperties": false,
"description": "Mapping configuration for a Snowplow warehouse destination.",
"properties": {
"destination_id": {
"description": "Snowplow destination UUID from data models",
"title": "Destination Id",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance to prepend to dataset name in URN (e.g., 'prod_snowflake')",
"title": "Platform Instance"
},
"env": {
"default": "PROD",
"description": "Environment for warehouse datasets (e.g., PROD, DEV)",
"title": "Env",
"type": "string"
}
},
"required": [
"destination_id"
],
"title": "DestinationMapping",
"type": "object"
},
"FieldTaggingConfig": {
"additionalProperties": false,
"description": "Configuration for auto-tagging schema fields.",
"properties": {
"enabled": {
"default": true,
"description": "Enable automatic field tagging",
"title": "Enabled",
"type": "boolean"
},
"tag_schema_version": {
"default": true,
"description": "Tag fields with schema version (e.g., snowplow_schema_v1-0-0)",
"title": "Tag Schema Version",
"type": "boolean"
},
"tag_event_type": {
"default": true,
"description": "Tag fields with event type (e.g., snowplow_event_checkout)",
"title": "Tag Event Type",
"type": "boolean"
},
"tag_data_class": {
"default": true,
"description": "Tag fields with data classification (e.g., PII, Sensitive)",
"title": "Tag Data Class",
"type": "boolean"
},
"tag_authorship": {
"default": true,
"description": "Tag fields with authorship (e.g., added_by_ryan_smith)",
"title": "Tag Authorship",
"type": "boolean"
},
"track_field_versions": {
"default": false,
"description": "Track which version each field was added in. When enabled, compares schema versions to determine when fields were introduced. Tags fields with their introduction version and adds 'Added in version X' to descriptions. Disabled by default as it requires fetching all schema versions (slower ingestion).",
"title": "Track Field Versions",
"type": "boolean"
},
"use_structured_properties": {
"default": true,
"description": "Use structured properties for field metadata instead of (or in addition to) tags. Structured properties provide strongly-typed metadata with better querying capabilities. When enabled, field authorship, version, timestamp, and classification are emitted as structured properties on the schemaField entity. Note: Requires structured property definitions to be registered in DataHub first. See snowplow_field_structured_properties.yaml in the connector directory.",
"title": "Use Structured Properties",
"type": "boolean"
},
"emit_tags_and_structured_properties": {
"default": false,
"description": "Emit both tags and structured properties for fields. When True, both tags and structured properties are emitted. When False, only the method specified by use_structured_properties is used. Useful during migration from tags to structured properties.",
"title": "Emit Tags And Structured Properties",
"type": "boolean"
},
"pii_tags_only": {
"default": false,
"description": "When emit_tags_and_structured_properties is true, only emit tags for PII/sensitive data classification. Version, authorship, and event type will only be in structured properties, not as tags. Useful when you want detailed structured properties but only highlight PII fields with tags.",
"title": "Pii Tags Only",
"type": "boolean"
},
"schema_version_pattern": {
"default": "snowplow_schema_v{version}",
"description": "Pattern for schema version tags. Use {version} placeholder.",
"title": "Schema Version Pattern",
"type": "string"
},
"event_type_pattern": {
"default": "snowplow_event_{name}",
"description": "Pattern for event type tags. Use {name} placeholder.",
"title": "Event Type Pattern",
"type": "string"
},
"authorship_pattern": {
"default": "added_by_{author}",
"description": "Pattern for authorship tags. Use {author} placeholder.",
"title": "Authorship Pattern",
"type": "string"
},
"use_pii_enrichment": {
"default": true,
"description": "Extract PII fields from PII Pseudonymization enrichment config",
"title": "Use Pii Enrichment",
"type": "boolean"
},
"pii_field_patterns": {
"description": "Field name patterns to classify as PII (fallback if enrichment not available)",
"items": {
"type": "string"
},
"title": "Pii Field Patterns",
"type": "array"
},
"sensitive_field_patterns": {
"description": "Field name patterns to classify as Sensitive",
"items": {
"type": "string"
},
"title": "Sensitive Field Patterns",
"type": "array"
}
},
"title": "FieldTaggingConfig",
"type": "object"
},
"IgluConnectionConfig": {
"additionalProperties": false,
"description": "Connection configuration for Iglu Schema Registry.\n\nUse this for open-source Snowplow deployments or as fallback for BDP.",
"properties": {
"iglu_server_url": {
"description": "Iglu server base URL (e.g., 'https://iglu.acme.com' or 'http://iglucentral.com')",
"title": "Iglu Server Url",
"type": "string"
},
"api_key": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "API key for private Iglu registry (UUID format)",
"title": "Api Key"
},
"timeout_seconds": {
"default": 30,
"description": "Request timeout in seconds",
"title": "Timeout Seconds",
"type": "integer"
}
},
"required": [
"iglu_server_url"
],
"title": "IgluConnectionConfig",
"type": "object"
},
"PerformanceConfig": {
"additionalProperties": false,
"description": "Performance and scaling configuration.\n\nControls parallel processing, caching, and limits for large-scale deployments.",
"properties": {
"max_concurrent_api_calls": {
"default": 10,
"description": "Maximum concurrent API calls for deployment fetching. Increase for faster ingestion of large organizations with many schemas. Recommended: 5-20 depending on API rate limits.",
"title": "Max Concurrent Api Calls",
"type": "integer"
},
"enable_parallel_fetching": {
"default": true,
"description": "Enable parallel fetching of schema deployments. Significantly speeds up ingestion when field version tracking is enabled. Disable for debugging or if API rate limits are strict.",
"title": "Enable Parallel Fetching",
"type": "boolean"
}
},
"title": "PerformanceConfig",
"type": "object"
},
"SnowplowBDPConnectionConfig": {
"additionalProperties": false,
"description": "Connection configuration for Snowplow BDP (Behavioral Data Platform).\n\nUse this for managed Snowplow deployments with Console API access.",
"properties": {
"organization_id": {
"description": "Organization UUID (found in BDP Console URL)",
"title": "Organization Id",
"type": "string"
},
"api_key_id": {
"description": "API Key ID from BDP Console credentials",
"title": "Api Key Id",
"type": "string"
},
"api_key": {
"description": "API Key secret from BDP Console credentials",
"format": "password",
"title": "Api Key",
"type": "string",
"writeOnly": true
},
"console_api_url": {
"default": "https://console.snowplowanalytics.com/api/msc/v1",
"description": "BDP Console API base URL",
"title": "Console Api Url",
"type": "string"
},
"timeout_seconds": {
"default": 60,
"description": "Request timeout in seconds",
"title": "Timeout Seconds",
"type": "integer"
},
"max_retries": {
"default": 3,
"description": "Maximum number of retry attempts for failed requests",
"title": "Max Retries",
"type": "integer"
}
},
"required": [
"organization_id",
"api_key_id",
"api_key"
],
"title": "SnowplowBDPConnectionConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
},
"WarehouseLineageConfig": {
"additionalProperties": false,
"description": "Configuration for extracting lineage to warehouse destinations.\n\nThis feature creates table-level lineage from atomic.events to derived tables\nby querying the Snowplow BDP Data Models API.\n\nIMPORTANT: Disabled by default because warehouse connectors (Snowflake, BigQuery, etc.)\nprovide more detailed lineage by parsing actual SQL queries, including:\n- Column-level lineage\n- Transformation logic\n- Complete dependency graphs\n\nOnly enable this if:\n- You want quick table-level lineage without setting up warehouse connector\n- You don't have access to warehouse query logs\n- You want to document Data Models API metadata specifically",
"properties": {
"enabled": {
"default": false,
"description": "Enable warehouse lineage extraction via data models API. Disabled by default - prefer using warehouse connector (Snowflake, BigQuery) for detailed lineage.",
"title": "Enabled",
"type": "boolean"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Default platform instance prefix for warehouse URNs (e.g., 'prod_snowflake'). Applied globally unless overridden by destination_mappings.",
"title": "Platform Instance"
},
"env": {
"default": "PROD",
"description": "Default environment for warehouse datasets",
"title": "Env",
"type": "string"
},
"destination_mappings": {
"description": "Per-destination platform instance mappings. Overrides global platform_instance for specific destinations.",
"items": {
"$ref": "#/$defs/DestinationMapping"
},
"title": "Destination Mappings",
"type": "array"
},
"validate_urns": {
"default": true,
"description": "Validate that warehouse table URNs exist in DataHub before creating lineage. Requires DataHub Graph API access. Set to False to skip validation.",
"title": "Validate Urns",
"type": "boolean"
}
},
"title": "WarehouseLineageConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Snowplow source.\n\nSupports two modes:\n1. BDP mode: Uses Snowplow BDP Console API (requires bdp_connection)\n2. Iglu mode: Uses Iglu Schema Registry only (requires iglu_connection)",
"properties": {
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful ingestion configuration for deletion detection"
},
"bdp_connection": {
"anyOf": [
{
"$ref": "#/$defs/SnowplowBDPConnectionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "BDP Console API connection (required for BDP mode)"
},
"iglu_connection": {
"anyOf": [
{
"$ref": "#/$defs/IgluConnectionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Iglu Schema Registry connection (required for Iglu mode, optional for BDP mode as fallback)"
},
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter (vendor/name format)"
},
"event_spec_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for event specifications to filter"
},
"tracking_plan_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tracking plans to filter"
},
"extract_event_specifications": {
"default": true,
"description": "Extract event specifications (requires BDP connection)",
"title": "Extract Event Specifications",
"type": "boolean"
},
"extract_tracking_plans": {
"default": true,
"description": "Extract tracking plans (requires BDP connection)",
"title": "Extract Tracking Plans",
"type": "boolean"
},
"extract_pipelines": {
"default": true,
"description": "Extract pipelines as DataFlow entities (requires BDP connection)",
"title": "Extract Pipelines",
"type": "boolean"
},
"extract_enrichments": {
"default": true,
"description": "Extract enrichments as DataJob entities linked to pipelines (requires BDP connection)",
"title": "Extract Enrichments",
"type": "boolean"
},
"enrichment_owner": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Default owner for enrichments (e.g., 'data-platform@company.com'). Applied as DATAOWNER to all enrichment DataJobs. Leave empty to skip enrichment ownership.",
"title": "Enrichment Owner"
},
"include_hidden_schemas": {
"default": false,
"description": "Include schemas marked as hidden in BDP Console",
"title": "Include Hidden Schemas",
"type": "boolean"
},
"include_version_in_urn": {
"default": false,
"description": "Include version in dataset URN (legacy behavior). When False (recommended), version is stored in dataset properties instead. Set to True for backwards compatibility with existing metadata.",
"title": "Include Version In Urn",
"type": "boolean"
},
"schema_types_to_extract": {
"description": "Schema types to extract: 'event' and/or 'entity'",
"items": {
"type": "string"
},
"title": "Schema Types To Extract",
"type": "array"
},
"deployed_since": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Only extract schemas deployed/updated since this timestamp (ISO 8601 format: 2025-12-15T00:00:00Z). Enables incremental ingestion by filtering based on deployment timestamps. Leave empty to fetch all schemas.",
"title": "Deployed Since"
},
"schema_page_size": {
"default": 100,
"description": "Number of schemas to fetch per API page (default: 100). Adjust based on organization size and API performance.",
"title": "Schema Page Size",
"type": "integer"
},
"extract_standard_schemas": {
"default": true,
"description": "Extract Snowplow standard schemas from Iglu Central that are referenced by event specifications. Standard schemas (vendor: com.snowplowanalytics.*) are not in the Data Structures API but are publicly available. When enabled, creates dataset entities for standard schemas and completes lineage from event specs. Only fetches schemas that are actually referenced, not all standard schemas. Disable if you don't want to fetch from Iglu Central.",
"title": "Extract Standard Schemas",
"type": "boolean"
},
"iglu_central_url": {
"default": "http://iglucentral.com",
"description": "Iglu Central base URL for fetching Snowplow standard schemas",
"title": "Iglu Central Url",
"type": "string"
},
"field_tagging": {
"$ref": "#/$defs/FieldTaggingConfig",
"description": "Field tagging configuration for auto-tagging schema fields"
},
"warehouse_lineage": {
"$ref": "#/$defs/WarehouseLineageConfig",
"description": "Warehouse lineage configuration for linking enrichment outputs to warehouse tables"
},
"performance": {
"$ref": "#/$defs/PerformanceConfig",
"description": "Performance and scaling configuration for large deployments"
}
},
"title": "SnowplowSourceConfig",
"type": "object"
}
Snowplow
Overview
The Snowplow source extracts metadata from Snowplow's behavioral data platform, including:
- Event schemas - Self-describing event definitions with properties and validation rules
- Entity schemas - Context and entity schemas attached to events
- Event specifications - Tracking requirements and specifications (BDP only)
- Tracking scenarios - Groupings of related events (BDP only)
- Organizations - Top-level containers for all schemas
Snowplow is an open-source behavioral data platform that collects, validates, and models event-level data. This connector supports both:
- Snowplow BDP (Behavioral Data Platform) - Managed Snowplow with Console API
- Open-source Snowplow - Self-hosted with Iglu schema registry
Supported Capabilities
| Capability | Status | Notes |
|---|---|---|
| Platform Instance | ✅ Supported | Group schemas by environment |
| Domains | ✅ Supported | Assign domains to schemas |
| Schema Metadata | ✅ Supported | Extract JSON Schema definitions |
| Descriptions | ✅ Supported | From schema descriptions |
| Lineage | ✅ Supported | Event schemas → Enrichments → Warehouse tables (BDP only) |
| Pipelines & Enrichments | ✅ Supported | Extract data pipelines and enrichment jobs (BDP only) |
| Deletion Detection | ✅ Supported | Via stateful ingestion |
Prerequisites
For Snowplow BDP (Managed)
- Snowplow BDP account with Console access
- Organization ID - Found in the Console URL: `https://console.snowplowanalytics.com/{org-id}/...`
- API credentials - Generated from Console → Settings → API Credentials:
  - API Key ID
  - API Key Secret
For Open-Source Snowplow
- Iglu Schema Registry - URL of your Iglu server
- API Key (optional) - Required for private Iglu registries
Python Requirements
- Python 3.8 or newer
- DataHub CLI installed
Installation
# Install DataHub with Snowplow support
pip install 'acryl-datahub[snowplow]'
Required Permissions
Snowplow BDP API Permissions
The connector requires read-only access to the following BDP Console API endpoints:
Minimum Required Permissions
To extract basic schema metadata:
- `read:data-structures` - Read access to data structures (event and entity schemas)
- `read:organizations` - Access to organization information
Permissions by Capability
| Capability | Required Permissions | Configuration |
|---|---|---|
| Schema Metadata | `read:data-structures` | Enabled by default |
| Event Specifications | `read:event-specs` | `extract_event_specifications: true` |
| Tracking Scenarios | `read:tracking-scenarios` | `extract_tracking_scenarios: true` |
| Tracking Plans | `read:data-products` | `extract_tracking_plans: true` |
Permission Testing
Test your API credentials and permissions:
# Get JWT token
curl -X POST \
-H "X-API-Key-ID: <API_KEY_ID>" \
-H "X-API-Key: <API_KEY>" \
https://console.snowplowanalytics.com/api/msc/v1/organizations/<ORG_ID>/credentials/v3/token
# List data structures
curl -H "Authorization: Bearer <JWT>" \
https://console.snowplowanalytics.com/api/msc/v1/organizations/<ORG_ID>/data-structures/v1
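The same two-step check can be sketched in Python using only the standard library. The endpoint paths come from the curl examples above; the `accessToken` response field name is an assumption about the token response shape.

```python
# Sketch of the permission test above: exchange BDP API credentials for a
# JWT, then list data structures. Endpoint paths match the curl examples;
# the "accessToken" response field name is assumed.
import json
import urllib.request

CONSOLE_API = "https://console.snowplowanalytics.com/api/msc/v1"

def token_url(org_id: str) -> str:
    return f"{CONSOLE_API}/organizations/{org_id}/credentials/v3/token"

def data_structures_url(org_id: str) -> str:
    return f"{CONSOLE_API}/organizations/{org_id}/data-structures/v1"

def get_jwt(org_id: str, api_key_id: str, api_key: str) -> str:
    req = urllib.request.Request(
        token_url(org_id),
        method="POST",
        headers={"X-API-Key-ID": api_key_id, "X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["accessToken"]

def list_data_structures(org_id: str, jwt: str) -> list:
    req = urllib.request.Request(
        data_structures_url(org_id),
        headers={"Authorization": f"Bearer {jwt}"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

A non-empty list from `list_data_structures` confirms both authentication and the `read:data-structures` permission.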
Iglu Registry Permissions
For open-source Snowplow with Iglu:
- Public registries: No authentication required (e.g., Iglu Central)
- Private registries: API key with read access to schemas
Configuration
See the recipe files for complete configuration examples:
- snowplow_recipe.yml - Comprehensive configuration with all options
- snowplow_bdp_basic.yml - Minimal BDP configuration
- snowplow_iglu.yml - Open-source Iglu configuration
- snowplow_with_filtering.yml - Schema filtering examples
Connection Options
BDP Console Connection
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| `organization_id` | string | ✅ | | Organization UUID from Console URL |
| `api_key_id` | string | ✅ | | API Key ID from Console credentials |
| `api_key` | string | ✅ | | API Key secret |
| `console_api_url` | string | | `https://console.snowplowanalytics.com/api/msc/v1` | BDP Console API base URL |
| `timeout_seconds` | int | | 60 | Request timeout in seconds |
| `max_retries` | int | | 3 | Maximum retry attempts |
Iglu Connection
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| `iglu_server_url` | string | ✅ | | Iglu server base URL |
| `api_key` | string | | | API key for private Iglu registry (UUID format) |
| `timeout_seconds` | int | | 30 | Request timeout in seconds |
Note: Iglu-only mode uses automatic schema discovery via the /api/schemas endpoint (requires Iglu Server 0.6+). All schemas in the registry will be automatically discovered.
Feature Options
| Option | Type | Default | Description | Required Permission |
|---|---|---|---|---|
| `extract_event_specifications` | bool | true | Extract event specifications | `read:event-specs` |
| `extract_tracking_scenarios` | bool | true | Extract tracking scenarios | `read:tracking-scenarios` |
| `extract_tracking_plans` | bool | true | Extract tracking plans | `read:data-products` |
| `extract_pipelines` | bool | true | Extract pipelines as DataFlow entities | `read:pipelines` |
| `extract_enrichments` | bool | true | Extract enrichments as DataJob entities with lineage | `read:enrichments` |
| `enrichment_owner` | string | None | Default owner email for enrichment DataJobs | N/A |
| `include_hidden_schemas` | bool | false | Include schemas marked as hidden | N/A |
| `include_version_in_urn` | bool | false | Include version in dataset URN (legacy behavior) | N/A |
| `extract_standard_schemas` | bool | true | Extract Snowplow standard schemas from Iglu Central | N/A |
| `iglu_central_url` | string | `http://iglucentral.com` | URL for fetching standard schemas | N/A |
Schema Extraction Options
| Option | Type | Default | Description |
|---|---|---|---|
| `schema_types_to_extract` | list | `["event", "entity"]` | Schema types to extract |
| `deployed_since` | string | None | Only extract schemas deployed since this ISO 8601 timestamp |
| `schema_page_size` | int | 100 | Number of schemas per API page |
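`deployed_since` enables incremental ingestion. A minimal sketch of producing a compliant timestamp (covering, say, the last 7 days) in Python:

```python
# Build a `deployed_since` value covering the last 7 days, in the ISO 8601
# format the option expects (e.g. 2025-12-15T00:00:00Z).
from datetime import datetime, timedelta, timezone

since = datetime.now(timezone.utc) - timedelta(days=7)
deployed_since = since.strftime("%Y-%m-%dT%H:%M:%SZ")
print(deployed_since)
```

The resulting string can be templated into the recipe (for example via an environment variable) so each scheduled run only re-fetches recently deployed schemas.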
Warehouse Lineage Options (Advanced)
⚠️ Note: Disabled by default. Prefer warehouse connectors (Snowflake, BigQuery) for column-level lineage.
| Option | Type | Default | Description | Required Permission |
|---|---|---|---|---|
| `warehouse_lineage.enabled` | bool | false | Extract table-level lineage via Data Models API | `read:data-products` |
| `warehouse_lineage.platform_instance` | string | None | Default platform instance for warehouse URNs | N/A |
| `warehouse_lineage.env` | string | PROD | Default environment for warehouse datasets | N/A |
| `warehouse_lineage.validate_urns` | bool | true | Validate warehouse URNs exist in DataHub | DataHub Graph API access |
| `warehouse_lineage.destination_mappings` | list | `[]` | Per-destination platform instance overrides | N/A |
Field Tagging Options
| Option | Type | Default | Description |
|---|---|---|---|
| `field_tagging.enabled` | bool | true | Enable automatic field tagging |
| `field_tagging.tag_schema_version` | bool | true | Tag fields with schema version |
| `field_tagging.tag_event_type` | bool | true | Tag fields with event type |
| `field_tagging.tag_data_class` | bool | true | Tag fields with data classification (PII, Sensitive) |
| `field_tagging.tag_authorship` | bool | true | Tag fields with authorship info |
| `field_tagging.track_field_versions` | bool | false | Track which version each field was added in |
| `field_tagging.use_structured_properties` | bool | true | Use structured properties instead of tags |
| `field_tagging.emit_tags_and_structured_properties` | bool | false | Emit both tags and structured properties |
| `field_tagging.pii_tags_only` | bool | false | Only emit tags for PII fields when using both |
| `field_tagging.use_pii_enrichment` | bool | true | Extract PII fields from PII Pseudonymization enrichment |
Performance Options
| Option | Type | Default | Description |
|---|---|---|---|
| `performance.max_concurrent_api_calls` | int | 10 | Maximum concurrent API calls for deployment fetching |
| `performance.enable_parallel_fetching` | bool | true | Enable parallel fetching of schema deployments |
Filtering Options
| Option | Type | Default | Description |
|---|---|---|---|
| `schema_pattern` | AllowDenyPattern | Allow all | Filter schemas by vendor/name pattern |
| `event_spec_pattern` | AllowDenyPattern | Allow all | Filter event specifications by name |
| `tracking_scenario_pattern` | AllowDenyPattern | Allow all | Filter tracking scenarios by name |
| `tracking_plan_pattern` | AllowDenyPattern | Allow all | Filter tracking plans by name |
Stateful Ingestion
| Option | Type | Default | Description |
|---|---|---|---|
| `stateful_ingestion.enabled` | bool | false | Enable stateful ingestion for deletion detection |
| `stateful_ingestion.remove_stale_metadata` | bool | true | Remove schemas that no longer exist |
Quick Start
1. BDP Console (Managed Snowplow)
Create a recipe file snowplow_recipe.yml:
source:
type: snowplow
config:
bdp_connection:
organization_id: "<ORG_UUID>"
api_key_id: "${SNOWPLOW_API_KEY_ID}"
api_key: "${SNOWPLOW_API_KEY}"
sink:
type: datahub-rest
config:
server: "http://localhost:8080"
Run ingestion:
datahub ingest -c snowplow_recipe.yml
2. Open-Source Snowplow (Iglu-Only Mode)
For self-hosted Snowplow with Iglu registry (without BDP Console API):
source:
type: snowplow
config:
iglu_connection:
iglu_server_url: "https://iglu.example.com"
api_key: "${IGLU_API_KEY}" # Optional for private registries
schema_types_to_extract:
- "event"
- "entity"
sink:
type: datahub-rest
config:
server: "http://localhost:8080"
Important notes for Iglu-only mode:
- ✅ Supported: Event and entity schemas with full JSON Schema definitions
- ✅ Supported: Automatic schema discovery via the `/api/schemas` endpoint (requires Iglu Server 0.6+)
- ⚠️ Not supported: Event specifications (requires BDP API)
- ⚠️ Not supported: Tracking scenarios (requires BDP API)
- ⚠️ Not supported: Field tagging/PII detection (requires BDP deployment data)
For complete configuration options, see snowplow_iglu.yml.
3. With Warehouse Lineage (BDP Only - Advanced)
⚠️ Note: This feature is disabled by default and should only be enabled in specific scenarios (see below).
Extract table-level lineage from raw events to derived tables via BDP Data Models API:
source:
type: snowplow
config:
bdp_connection:
organization_id: "<ORG_UUID>"
api_key_id: "${SNOWPLOW_API_KEY_ID}"
api_key: "${SNOWPLOW_API_KEY}"
# Enable warehouse lineage via Data Models API
warehouse_lineage:
enabled: true
platform_instance: "prod_snowflake" # Optional
env: "PROD" # Optional
validate_urns: true # Optional
sink:
type: datahub-rest
config:
server: "http://localhost:8080"
What this creates:
- Table-level lineage: `atomic.events` → `derived.sessions` (or other derived tables)
- No direct warehouse credentials needed (uses BDP API)
Supported warehouses: Snowflake, BigQuery, Redshift, Databricks
When to Enable This Feature
✅ Enable warehouse lineage if:
- You want quick table-level lineage without configuring a warehouse connector
- You don't have access to warehouse query logs
- You want to document Data Models API metadata specifically
❌ Don't enable if you're using warehouse connectors:
- Snowflake connector provides:
- Column-level lineage by parsing SQL queries
- Transformation logic from query history
- Complete dependency graphs
- BigQuery, Redshift, Databricks connectors similarly provide richer lineage
Best practice: Use a warehouse connector for detailed lineage. Only enable this feature for quick documentation of Data Models metadata.
Requirements: Data Models must be configured in your BDP organization.
Schema Versioning
Snowplow uses SchemaVer (semantic versioning for schemas) with the format MODEL-REVISION-ADDITION:
- MODEL (first digit): Breaking changes - incompatible with previous versions
- REVISION (second digit): Changes that may affect how some existing data validates (e.g., tightening validation constraints)
- ADDITION (third digit): Fully backward-compatible changes, such as adding optional fields
Example: 1-0-2
- Model: 1 (major version)
- Revision: 0 (no revisions)
- Addition: 2 (two optional field additions)
In DataHub, schemas are represented as:
- Dataset name: `{vendor}.{name}.{version}` (e.g., `com.example.page_view.1-0-0`)
- Schema version: Tracked in dataset properties
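As an illustration of this mapping (a sketch, not the connector's actual code), a self-describing schema URI can be parsed into its SchemaVer parts and the dataset name shape described above:

```python
# Sketch: parse an Iglu self-describing schema URI and build the DataHub
# dataset name described above. Illustrative only, not the connector's code.
import re

SCHEMA_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/jsonschema/"
    r"(?P<model>\d+)-(?P<revision>\d+)-(?P<addition>\d+)$"
)

def dataset_name(schema_uri: str, include_version: bool = True) -> str:
    m = SCHEMA_URI.match(schema_uri)
    if not m:
        raise ValueError(f"Not a self-describing schema URI: {schema_uri}")
    base = f"{m['vendor']}.{m['name']}"
    if include_version:
        # SchemaVer: MODEL-REVISION-ADDITION
        return f"{base}.{m['model']}-{m['revision']}-{m['addition']}"
    return base

print(dataset_name("iglu:com.example/page_view/jsonschema/1-0-0"))
# com.example.page_view.1-0-0
```

Passing `include_version=False` mirrors the recommended `include_version_in_urn: false` behavior, where the version lives in dataset properties instead of the name.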
Entity Mapping: Snowplow → DataHub
This section explains how Snowplow concepts are modeled as DataHub entities.
Entity Type Mapping
| Snowplow Concept | DataHub Entity | DataHub Subtype | Description |
|---|---|---|---|
| Organization | Container | DATABASE | Top-level container for all Snowplow metadata |
| Event Schema | Dataset | snowplow_event_schema | Self-describing event definition (JSON Schema) |
| Entity Schema | Dataset | snowplow_entity_schema | Context/entity schema attached to events |
| Event Specification | Dataset | snowplow_event_spec | Tracking requirement defining what to track |
| Tracking Scenario | Container | (custom) | Logical grouping of related event specifications |
| Tracking Plan | Container | tracking_plan | Business-level tracking plan grouping |
| Pipeline | DataFlow | - | Snowplow data pipeline (Collector → Warehouse) |
| Enrichment | DataJob | - | Data transformation job within a pipeline |
| Collector | DataJob | - | HTTP endpoint receiving tracking events |
| Atomic Events | Dataset | atomic_event | Raw enriched events table in warehouse |
| Parsed Events | Dataset | event | Parsed event data combining all schemas |
Pipeline Architecture in DataHub
Snowplow pipelines are modeled as DataFlow entities with DataJob children representing each processing stage:
Tracker SDKs (Web, Mobile, Server)
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Pipeline (DataFlow) │
│ urn:li:dataFlow:(snowplow,pipeline-id,PROD) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Collector │ ◄── Receives HTTP tracking events │
│ │ (DataJob) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐│
│ │ IP Lookup │ │ UA Parser │ │ PII Pseudonymization ││
│ │ (DataJob) │ │ (DataJob) │ │ (DataJob) ││
│ │ │ │ │ │ ││
│ │ user_ipaddress │ │ useragent │ │ user_id, email ││
│ │ → geo_*, ip_* │ │ → br_*, os_* │ │ → (hashed values) ││
│ └────────┬────────┘ └────────┬────────┘ └────────────┬────────────┘│
│ │ │ │ │
│ └────────────────────┼────────────────────────┘ │
│ ▼ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ Atomic Events (Dataset)│
│ Enriched event stream │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ Warehouse Tables │
│ (Snowflake, BigQuery) │
└─────────────────────────┘
Lineage Relationships
The connector creates the following lineage relationships:
1. Schema → Event Specification Lineage
Event specifications reference the schemas they require:
┌──────────────────────────────┐
│ Event Schema │
│ (vendor.event_name.1-0-0) │────┐
└──────────────────────────────┘ │ ┌─────────────────────────┐
├────▶│ Event Specification │
┌──────────────────────────────┐ │ │ (Tracking Requirement) │
│ Entity Schema │────┘ └─────────────────────────┘
│ (vendor.context.1-0-0) │
└──────────────────────────────┘
2. Enrichment Column-Level Lineage
Enrichments transform specific fields. Example for IP Lookup:
┌─────────────────────────────────────────────────────────────────────┐
│ IP Lookup Enrichment │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Input Output │
│ ───── ────── │
│ ┌─────────────────┐ │
│ ┌──▶│ geo_country │ │
│ │ ├─────────────────┤ │
│ ┌─────────────────┐ │ │ geo_city │ │
│ │ user_ipaddress │────────┼──▶├─────────────────┤ │
│ └─────────────────┘ │ │ geo_region │ │
│ │ ├─────────────────┤ │
│ │ │ geo_latitude │ │
│ ├──▶├─────────────────┤ │
│ │ │ geo_longitude │ │
│ │ ├─────────────────┤ │
│ └──▶│ ip_isp │ │
│ ├─────────────────┤ │
│ │ ip_organization │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Supported enrichments with column-level lineage:
- IP Lookup: `user_ipaddress` → `geo_*`, `ip_*` fields
- UA Parser: `useragent` → `br_*`, `os_*` fields
- YAUAA: `useragent` → browser, OS, device fields
- Referer Parser: `page_referrer` → `refr_*` fields
- Campaign Attribution: `page_urlquery` → `mkt_*` fields
- PII Pseudonymization: configured fields → same fields (hashed)
- Currency Conversion: currency fields → converted fields
- Event Fingerprint: event fields → `event_fingerprint`
- IAB Spiders/Robots: `useragent` → `iab_*` classification fields
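The pairs above expand into one column-lineage edge per output field. A small illustrative sketch (the concrete output field names beyond the doc's wildcards, such as `geo_city` expansions like `br_family`, are assumptions based on Snowplow's atomic event fields):

```python
# Illustrative (not exhaustive) map of column-level lineage:
# enrichment -> (input field, example output fields). Field names beyond
# the wildcards in the list above are assumptions.
ENRICHMENT_LINEAGE = {
    "ip_lookup": ("user_ipaddress", ["geo_country", "geo_city", "ip_isp"]),
    "ua_parser": ("useragent", ["br_family", "os_family"]),
    "referer_parser": ("page_referrer", ["refr_medium", "refr_source"]),
}

def lineage_pairs(enrichment: str) -> list:
    """Expand one enrichment into (input, output) column-lineage edges."""
    source, outputs = ENRICHMENT_LINEAGE[enrichment]
    return [(source, out) for out in outputs]

print(lineage_pairs("ip_lookup"))
# [('user_ipaddress', 'geo_country'), ('user_ipaddress', 'geo_city'), ('user_ipaddress', 'ip_isp')]
```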
3. Warehouse Lineage (Optional)
When warehouse_lineage.enabled: true:
┌─────────────────────────┐ ┌─────────────────────────┐
│ Atomic Events │ Data Models API │ Derived Table │
│ (snowplow.atomic.events)│───────────────────▶│ (warehouse.schema.table)│
└─────────────────────────┘ └─────────────────────────┘
Container Hierarchy
Organization (Container: DATABASE)
│
├── Event Schema: com.example.page_view.1-0-0 (Dataset)
├── Event Schema: com.example.checkout.1-0-0 (Dataset)
├── Entity Schema: com.example.user_context.1-0-0 (Dataset)
├── Event Specification: "Page View Tracking" (Dataset)
│
├── Tracking Scenario: "Checkout Flow" (Container)
│ ├── Event Specification: "Add to Cart" (Dataset)
│ └── Event Specification: "Purchase Complete" (Dataset)
│
└── Tracking Plan: "Web Analytics" (Container)
├── Event Specification (linked)
└── Schema (linked)
URN Formats
| Entity Type | URN Format |
|---|---|
| Organization | urn:li:container:{guid} |
| Event/Entity Schema | urn:li:dataset:(urn:li:dataPlatform:snowplow,vendor.name,ENV) |
| Event Specification | urn:li:dataset:(urn:li:dataPlatform:snowplow,event_spec_id,ENV) |
| Pipeline | urn:li:dataFlow:(snowplow,pipeline-id,ENV) |
| Enrichment/DataJob | urn:li:dataJob:(urn:li:dataFlow:(...),job-id) |
| Tracking Scenario | urn:li:container:{guid} |
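The dataset, pipeline, and enrichment formats from the table can be assembled as plain strings (illustrative helpers, not the connector's API):

```python
# Illustrative helpers (not the connector's API) assembling the URN shapes
# from the table above.
def dataset_urn(name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:snowplow,{name},{env})"

def pipeline_urn(pipeline_id: str, env: str = "PROD") -> str:
    return f"urn:li:dataFlow:(snowplow,{pipeline_id},{env})"

def enrichment_urn(pipeline_id: str, job_id: str, env: str = "PROD") -> str:
    # A DataJob URN nests its parent DataFlow URN.
    return f"urn:li:dataJob:({pipeline_urn(pipeline_id, env)},{job_id})"

print(enrichment_urn("pipeline-1", "ip-lookup"))
# urn:li:dataJob:(urn:li:dataFlow:(snowplow,pipeline-1,PROD),ip-lookup)
```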
Custom Properties
Each entity type includes relevant custom properties:
Event/Entity Schemas:
- `vendor`, `name`, `version` (SchemaVer format)
- `schema_type` (event/entity)
- `json_schema` (full JSON Schema definition)
- `deployed_environments` (PROD, DEV, etc.)
Event Specifications:
- `status` (draft, active, deprecated)
- `trigger_conditions`
- `referenced_schemas`
Enrichments:
- `enrichment_type`
- `input_fields`, `output_fields`
- `configuration` details
Troubleshooting
Authentication Errors
Error: Authentication failed: Invalid API credentials
Solution:
- Verify `api_key_id` and `api_key` are correct
- Check credentials are for the correct organization
- Ensure credentials haven't expired
- Generate new credentials in BDP Console if needed
Error: Authentication failed: Forbidden
Solution:
- Check `organization_id` matches your credentials
- Verify API key has required permissions
- Contact Snowplow support if permissions are unclear
Permission Errors
Error: Permission denied for /data-structures
Solution:
- API key missing `read:data-structures` permission
- Generate new credentials with correct permissions in BDP Console → Settings → API Credentials
Error: Permission denied for /event-specs
Solution:
- Set `extract_event_specifications: false` in config, or
- Request `read:event-specs` permission for your API key
Connection Errors
Error: Request timeout: https://console.snowplowanalytics.com
Solution:
- Check network connectivity to Snowplow Console
- Increase `timeout_seconds` in configuration
- Verify Console URL is correct
Error: Iglu connection failed
Solution:
- Verify `iglu_server_url` is correct and accessible
- For private registries, check `api_key` is valid
- Test connectivity: `curl https://iglu.example.com/api/schemas`
No Schemas Found
Issue: Ingestion completes but no schemas extracted
Solutions:
1. Check filtering patterns - ensure `schema_pattern.allow` matches your schemas (e.g., `allow: [".*"]` to allow all)
2. Check schema types: `schema_types_to_extract: ["event", "entity"]`
3. Include hidden schemas: `include_hidden_schemas: true`
4. Verify schemas exist in BDP Console or Iglu registry
Rate Limiting
Error: HTTP 429: Rate limit exceeded
Solution:
- Connector implements automatic retry with exponential backoff
- Rate limits should be handled automatically
- If issues persist, contact Snowplow support to increase limits
Limitations
BDP-specific features:
- Event specifications only available via BDP Console API
- Tracking scenarios only available via BDP Console API
- Tracking plans only available via BDP Console API
- Open-source Iglu users won't have these features
Iglu Server requirements:
- Automatic schema discovery requires Iglu Server 0.6+ with the `/api/schemas` endpoint
- Older Iglu implementations may not support the list schemas API
Field tagging in Iglu-only mode:
- PII/sensitive field detection requires BDP deployment metadata
- Not available when using Iglu-only mode
Advanced Configuration
Custom Platform Instance
Group schemas by environment:
source:
type: snowplow
config:
bdp_connection:
organization_id: "<ORG_UUID>"
api_key_id: "${SNOWPLOW_API_KEY_ID}"
api_key: "${SNOWPLOW_API_KEY}"
platform_instance: "production"
env: "PROD"
Schema Filtering
Extract only specific vendor schemas:
source:
type: snowplow
config:
# ... connection config ...
schema_pattern:
allow:
- "com\\.example\\..*" # Allow com.example schemas
- "com\\.acme\\.events\\..*" # Allow com.acme.events schemas
deny:
- ".*\\.test$" # Deny test schemas
Stateful Ingestion
Enable deletion detection:
source:
type: snowplow
config:
# ... connection config ...
stateful_ingestion:
enabled: true
remove_stale_metadata: true
Testing the Connection
Use DataHub's built-in test-connection command:
datahub check source-connection snowplow \
--config snowplow_recipe.yml
This will:
- Test BDP Console API authentication
- Test Iglu registry connectivity (if configured)
- Verify required permissions
- Report capability availability
References
- Snowplow Documentation
- Snowplow BDP Console API
- Iglu Schema Registry
- SchemaVer Specification
- Snowplow GitHub
Support
For issues or questions:
- DataHub Slack: #troubleshoot
- GitHub Issues: datahub-project/datahub
- Snowplow Support: Snowplow Discourse
Code Coordinates
- Class Name:
datahub.ingestion.source.snowplow.snowplow.SnowplowSource - Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Snowplow, feel free to ping us on our Slack.