
Snowplow

Incubating

Important Capabilities

| Capability              | Notes                                                                              |
|-------------------------|------------------------------------------------------------------------------------|
| Descriptions            | Enabled by default from schema descriptions.                                       |
| Detect Deleted Entities | Enabled via stateful ingestion.                                                    |
| Domains                 | Supported via configuration.                                                       |
| Platform Instance       | Enabled by default.                                                                |
| Schema Metadata         | Enabled by default for event and entity schemas.                                   |
| Table-Level Lineage     | Optionally enabled via the warehouse_lineage.enabled configuration (requires BDP). |

Ingests metadata from Snowplow.

Extracts:

  • Organizations (as containers)
  • Event schemas (as datasets)
  • Entity schemas (as datasets)
  • Event specifications (as datasets) - BDP only
  • Tracking scenarios (as containers) - BDP only
  • Warehouse lineage (optional) - requires warehouse connection

Supports:

  • Snowplow BDP (Behavioral Data Platform) deployments
  • Open-source Snowplow with Iglu registry

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

# Snowplow Comprehensive Recipe
# This recipe demonstrates ALL available configuration options with detailed comments

source:
  type: snowplow
  config:
    # ============================================
    # Connection Configuration
    # ============================================

    # BDP Console API Connection (for managed Snowplow)
    # Required for: Event specifications, tracking scenarios, data products
    # Optional: Can be omitted if using Iglu-only mode
    bdp_connection:
      # Organization UUID - found in BDP Console URL
      # Example: https://console.snowplowanalytics.com/{org_id}/data-structures
      organization_id: "<YOUR_ORG_UUID>"

      # API credentials from BDP Console → Settings → API Credentials
      # Use environment variables for security
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

      # Optional: BDP Console API base URL (default shown)
      # Only change if using a custom/regional Snowplow deployment
      console_api_url: "https://console.snowplowanalytics.com/api/msc/v1"

      # Optional: Request timeout in seconds (default: 60)
      timeout_seconds: 60

      # Optional: Maximum retry attempts for failed requests (default: 3)
      max_retries: 3

    # Iglu Schema Registry Connection (for open-source Snowplow)
    # Required for: Open-source Snowplow deployments without BDP Console API
    # Note: Either bdp_connection OR iglu_connection is required (not both)
    iglu_connection:
      # Iglu server base URL
      # Examples:
      #   - Public: http://iglucentral.com
      #   - Private: https://iglu.example.com
      iglu_server_url: "https://iglu.example.com"

      # Optional: API key for private Iglu registries (UUID format)
      # Not required for public registries like Iglu Central
      api_key: "${IGLU_API_KEY}"

      # Optional: Request timeout in seconds (default: 30)
      timeout_seconds: 30

    # ============================================
    # Filtering Configuration
    # ============================================

    # Filter schemas by vendor/name pattern
    # Pattern format: "vendor/name" (e.g., "com.example/page_view")
    # Requires permission: read:data-structures
    schema_pattern:
      allow:
        - ".*" # Allow all schemas (default)
      # Examples of allow patterns:
      #   - "com\\.example\\..*"              # Allow all com.example schemas
      #   - "com\\.acme\\.events\\..*"        # Allow com.acme.events schemas
      #   - "com\\.snowplowanalytics\\..*"    # Allow Snowplow standard schemas
      deny: []
      # Examples of deny patterns:
      #   - ".*\\.test$"                      # Deny schemas ending with .test
      #   - ".*_sandbox.*"                    # Deny sandbox schemas
      #   - "com\\.example\\.deprecated\\..*" # Deny deprecated schemas

    # Filter event specifications by name
    # Only applies when extract_event_specifications is enabled
    # Requires permission: read:event-specs
    event_spec_pattern:
      allow:
        - ".*" # Allow all event specifications (default)
      deny: []

    # Filter tracking scenarios by name
    # Only applies when extract_tracking_scenarios is enabled
    # Requires permission: read:tracking-scenarios
    tracking_scenario_pattern:
      allow:
        - ".*" # Allow all tracking scenarios (default)
      deny: []

    # ============================================
    # Feature Flags
    # ============================================

    # Extract event specifications (BDP only)
    # Requires permission: read:event-specs
    # Default: true
    extract_event_specifications: true

    # Extract tracking scenarios (BDP only)
    # Requires permission: read:tracking-scenarios
    # Default: true
    extract_tracking_scenarios: true

    # Include full JSON Schema definition in dataset properties
    # Useful for downstream schema analysis
    # Default: true
    include_schema_definitions: true

    # Include schemas marked as hidden in BDP Console
    # Default: false
    include_hidden_schemas: false

    # ============================================
    # Warehouse Lineage (BDP only - Advanced)
    # ============================================
    # Extract TABLE-LEVEL lineage from atomic.events to derived tables via Data Models API
    # Creates lineage: atomic.events → derived tables (e.g., derived.sessions)
    #
    # ⚠️ IMPORTANT: Disabled by default
    # Warehouse connectors (Snowflake, BigQuery, etc.) provide BETTER lineage:
    #   - Column-level lineage (not just table-level)
    #   - Transformation logic from actual SQL queries
    #   - Complete dependency graphs
    #
    # Only enable this if:
    #   - You want quick table-level lineage without setting up a warehouse connector
    #   - You don't have access to warehouse query logs
    #   - You want to document Data Models API metadata specifically
    warehouse_lineage:
      # Enable warehouse lineage extraction (default: false)
      # Disabled by default - prefer using a warehouse connector for detailed lineage
      enabled: false

      # Optional: Default platform instance for warehouse URNs
      # Example: "prod_snowflake", "prod_bigquery"
      # Can be overridden per destination using destination_mappings
      platform_instance: "prod_snowflake"

      # Optional: Default environment for warehouse datasets (default: PROD)
      env: "PROD"

      # Optional: Per-destination mappings (overrides defaults for specific destinations)
      destination_mappings: []
      # Example: Override platform instance for a specific destination
      #   - destination_id: "12345678-1234-1234-1234-123456789012"
      #     platform_instance: "staging_snowflake"
      #     env: "DEV"

      # Optional: Validate warehouse URNs exist in DataHub before creating lineage
      # Requires DataHub Graph API access (default: true)
      validate_urns: true

    # ============================================
    # Schema Extraction Options
    # ============================================

    # Schema types to extract
    # Options: "event" and/or "entity"
    # Default: ["event", "entity"]
    schema_types_to_extract:
      - "event"  # Event schemas (self-describing events)
      - "entity" # Entity schemas (contexts and entities)

    # ============================================
    # Platform Instance (Optional)
    # ============================================

    # Platform instance identifier for multi-environment deployments
    # Groups schemas by environment (e.g., production, staging, dev)
    # Uncomment to enable:
    # platform_instance: "production"

    # ============================================
    # Environment (Optional)
    # ============================================

    # Environment tag (PROD, DEV, QA, etc.)
    # Uncomment to enable:
    # env: "PROD"

    # ============================================
    # Stateful Ingestion (Optional)
    # ============================================

    # Enable stateful ingestion for deletion detection
    # Tracks which schemas have been seen and removes stale ones
    # Requires permission: read:data-structures (to track existence)
    stateful_ingestion:
      enabled: false
      remove_stale_metadata: true # Remove schemas that no longer exist

# ============================================
# Sink Configuration
# ============================================

sink:
  type: datahub-rest
  config:
    # DataHub GMS server URL
    server: "http://localhost:8080"

    # Optional: Authentication token
    # token: "${DATAHUB_TOKEN}"

    # Optional: Timeout for REST requests (default: 30s)
    # timeout_sec: 30

    # Optional: Extra headers
    # extra_headers:
    #   X-Custom-Header: "value"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
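For example, bdp_connection.timeout_seconds in the listing below refers to this nesting in the recipe (illustrative fragment):

```yaml
bdp_connection:
  timeout_seconds: 60
```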

Field | Description
deployed_since
One of string, null
Only extract schemas deployed/updated since this timestamp (ISO 8601 format: 2025-12-15T00:00:00Z). Enables incremental ingestion by filtering based on deployment timestamps. Leave empty to fetch all schemas.
Default: None
enrichment_owner
One of string, null
Default owner for enrichments (e.g., 'data-platform@company.com'). Applied as DATAOWNER to all enrichment DataJobs. Leave empty to skip enrichment ownership.
Default: None
extract_enrichments
boolean
Extract enrichments as DataJob entities linked to pipelines (requires BDP connection)
Default: True
extract_event_specifications
boolean
Extract event specifications (requires BDP connection)
Default: True
extract_pipelines
boolean
Extract pipelines as DataFlow entities (requires BDP connection)
Default: True
extract_standard_schemas
boolean
Extract Snowplow standard schemas from Iglu Central that are referenced by event specifications. Standard schemas (vendor: com.snowplowanalytics.*) are not in the Data Structures API but are publicly available. When enabled, creates dataset entities for standard schemas and completes lineage from event specs. Only fetches schemas that are actually referenced, not all standard schemas. Disable if you don't want to fetch from Iglu Central.
Default: True
extract_tracking_plans
boolean
Extract tracking plans (requires BDP connection)
Default: True
iglu_central_url
string
Iglu Central base URL for fetching Snowplow standard schemas
include_hidden_schemas
boolean
Include schemas marked as hidden in BDP Console
Default: False
include_version_in_urn
boolean
Include version in dataset URN (legacy behavior). When False (recommended), version is stored in dataset properties instead. Set to True for backwards compatibility with existing metadata.
Default: False
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
schema_page_size
integer
Number of schemas to fetch per API page (default: 100). Adjust based on organization size and API performance.
Default: 100
env
string
The environment that all assets produced by this connector belong to
Default: PROD
bdp_connection
One of SnowplowBDPConnectionConfig, null
BDP Console API connection (required for BDP mode)
Default: None
bdp_connection.api_key (required)
string(password)
API Key secret from BDP Console credentials
bdp_connection.api_key_id (required)
string
API Key ID from BDP Console credentials
bdp_connection.organization_id (required)
string
Organization UUID (found in BDP Console URL)
bdp_connection.console_api_url
string
BDP Console API base URL
bdp_connection.max_retries
integer
Maximum number of retry attempts for failed requests
Default: 3
bdp_connection.timeout_seconds
integer
Request timeout in seconds
Default: 60
event_spec_pattern
AllowDenyPattern
A class to store allow deny regexes
event_spec_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
field_tagging
FieldTaggingConfig
Configuration for auto-tagging schema fields.
field_tagging.authorship_pattern
string
Pattern for authorship tags. Use {author} placeholder.
Default: added_by_{author}
field_tagging.emit_tags_and_structured_properties
boolean
Emit both tags and structured properties for fields. When True, both tags and structured properties are emitted. When False, only the method specified by use_structured_properties is used. Useful during migration from tags to structured properties.
Default: False
field_tagging.enabled
boolean
Enable automatic field tagging
Default: True
field_tagging.event_type_pattern
string
Pattern for event type tags. Use {name} placeholder.
Default: snowplow_event_{name}
field_tagging.pii_tags_only
boolean
When emit_tags_and_structured_properties is true, only emit tags for PII/sensitive data classification. Version, authorship, and event type will only be in structured properties, not as tags. Useful when you want detailed structured properties but only highlight PII fields with tags.
Default: False
field_tagging.schema_version_pattern
string
Pattern for schema version tags. Use {version} placeholder.
Default: snowplow_schema_v{version}
field_tagging.tag_authorship
boolean
Tag fields with authorship (e.g., added_by_ryan_smith)
Default: True
field_tagging.tag_data_class
boolean
Tag fields with data classification (e.g., PII, Sensitive)
Default: True
field_tagging.tag_event_type
boolean
Tag fields with event type (e.g., snowplow_event_checkout)
Default: True
field_tagging.tag_schema_version
boolean
Tag fields with schema version (e.g., snowplow_schema_v1-0-0)
Default: True
field_tagging.track_field_versions
boolean
Track which version each field was added in. When enabled, compares schema versions to determine when fields were introduced. Tags fields with their introduction version and adds 'Added in version X' to descriptions. Disabled by default as it requires fetching all schema versions (slower ingestion).
Default: False
field_tagging.use_pii_enrichment
boolean
Extract PII fields from PII Pseudonymization enrichment config
Default: True
field_tagging.use_structured_properties
boolean
Use structured properties for field metadata instead of (or in addition to) tags. Structured properties provide strongly-typed metadata with better querying capabilities. When enabled, field authorship, version, timestamp, and classification are emitted as structured properties on the schemaField entity. Note: Requires structured property definitions to be registered in DataHub first. See snowplow_field_structured_properties.yaml in the connector directory.
Default: True
field_tagging.pii_field_patterns
array
Field name patterns to classify as PII (fallback if enrichment not available)
field_tagging.pii_field_patterns.string
string
field_tagging.sensitive_field_patterns
array
Field name patterns to classify as Sensitive
field_tagging.sensitive_field_patterns.string
string
iglu_connection
One of IgluConnectionConfig, null
Iglu Schema Registry connection (required for Iglu mode, optional for BDP mode as fallback)
Default: None
iglu_connection.iglu_server_url (required)
string
Iglu server base URL (e.g., 'https://iglu.acme.com' or 'http://iglucentral.com')
iglu_connection.api_key
One of string(password), null
API key for private Iglu registry (UUID format)
Default: None
iglu_connection.timeout_seconds
integer
Request timeout in seconds
Default: 30
performance
PerformanceConfig
Performance and scaling configuration.

Controls parallel processing, caching, and limits for large-scale deployments.
performance.enable_parallel_fetching
boolean
Enable parallel fetching of schema deployments. Significantly speeds up ingestion when field version tracking is enabled. Disable for debugging or if API rate limits are strict.
Default: True
performance.max_concurrent_api_calls
integer
Maximum concurrent API calls for deployment fetching. Increase for faster ingestion of large organizations with many schemas. Recommended: 5-20 depending on API rate limits.
Default: 10
schema_pattern
AllowDenyPattern
A class to store allow deny regexes
schema_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
schema_types_to_extract
array
Schema types to extract: 'event' and/or 'entity'
schema_types_to_extract.string
string
tracking_plan_pattern
AllowDenyPattern
A class to store allow deny regexes
tracking_plan_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
warehouse_lineage
WarehouseLineageConfig
Configuration for extracting lineage to warehouse destinations.

This feature creates table-level lineage from atomic.events to derived tables
by querying the Snowplow BDP Data Models API.

IMPORTANT: Disabled by default because warehouse connectors (Snowflake, BigQuery, etc.)
provide more detailed lineage by parsing actual SQL queries, including:
- Column-level lineage
- Transformation logic
- Complete dependency graphs

Only enable this if:
- You want quick table-level lineage without setting up warehouse connector
- You don't have access to warehouse query logs
- You want to document Data Models API metadata specifically
warehouse_lineage.enabled
boolean
Enable warehouse lineage extraction via data models API. Disabled by default - prefer using warehouse connector (Snowflake, BigQuery) for detailed lineage.
Default: False
warehouse_lineage.platform_instance
One of string, null
Default platform instance prefix for warehouse URNs (e.g., 'prod_snowflake'). Applied globally unless overridden by destination_mappings.
Default: None
warehouse_lineage.validate_urns
boolean
Validate that warehouse table URNs exist in DataHub before creating lineage. Requires DataHub Graph API access. Set to False to skip validation.
Default: True
warehouse_lineage.env
string
Default environment for warehouse datasets
Default: PROD
warehouse_lineage.destination_mappings
array
Per-destination platform instance mappings. Overrides global platform_instance for specific destinations.
warehouse_lineage.destination_mappings.DestinationMapping
DestinationMapping
Mapping configuration for a Snowplow warehouse destination.
warehouse_lineage.destination_mappings.DestinationMapping.destination_id (required)
string
Snowplow destination UUID from data models
warehouse_lineage.destination_mappings.DestinationMapping.platform_instance
One of string, null
Platform instance to prepend to dataset name in URN (e.g., 'prod_snowflake')
Default: None
warehouse_lineage.destination_mappings.DestinationMapping.env
string
Environment for warehouse datasets (e.g., PROD, DEV)
Default: PROD
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful ingestion configuration for deletion detection
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of soft deletes (and blocks the state from committing) when accidental changes to the source configuration cause the relative change in entities, compared to the previous state, to exceed the fail_safe_threshold.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Snowplow

Overview

The Snowplow source extracts metadata from Snowplow's behavioral data platform, including:

  • Event schemas - Self-describing event definitions with properties and validation rules
  • Entity schemas - Context and entity schemas attached to events
  • Event specifications - Tracking requirements and specifications (BDP only)
  • Tracking scenarios - Groupings of related events (BDP only)
  • Organizations - Top-level containers for all schemas

Snowplow is an open-source behavioral data platform that collects, validates, and models event-level data. This connector supports both:

  • Snowplow BDP (Behavioral Data Platform) - Managed Snowplow with Console API
  • Open-source Snowplow - Self-hosted with Iglu schema registry

Supported Capabilities

| Capability              | Status      | Notes                                                     |
|-------------------------|-------------|-----------------------------------------------------------|
| Platform Instance       | ✅ Supported | Group schemas by environment                              |
| Domains                 | ✅ Supported | Assign domains to schemas                                 |
| Schema Metadata         | ✅ Supported | Extract JSON Schema definitions                           |
| Descriptions            | ✅ Supported | From schema descriptions                                  |
| Lineage                 | ✅ Supported | Event schemas → Enrichments → Warehouse tables (BDP only) |
| Pipelines & Enrichments | ✅ Supported | Extract data pipelines and enrichment jobs (BDP only)     |
| Deletion Detection      | ✅ Supported | Via stateful ingestion                                    |

Prerequisites

For Snowplow BDP (Managed)

  1. Snowplow BDP account with Console access
  2. Organization ID - Found in Console URL: https://console.snowplowanalytics.com/{org-id}/...
  3. API credentials - Generated from Console → Settings → API Credentials:
    • API Key ID
    • API Key Secret

For Open-Source Snowplow

  1. Iglu Schema Registry - URL of your Iglu server
  2. API Key (optional) - Required for private Iglu registries

Python Requirements

  • Python 3.8 or newer
  • DataHub CLI installed

Installation

# Install DataHub with Snowplow support
pip install 'acryl-datahub[snowplow]'

Required Permissions

Snowplow BDP API Permissions

The connector requires read-only access to the following BDP Console API endpoints:

Minimum Required Permissions

To extract basic schema metadata:

  • read:data-structures - Read access to data structures (event and entity schemas)
  • read:organizations - Access to organization information

Permissions by Capability

| Capability           | Required Permissions    | Configuration                       |
|----------------------|-------------------------|-------------------------------------|
| Schema Metadata      | read:data-structures    | Enabled by default                  |
| Event Specifications | read:event-specs        | extract_event_specifications: true  |
| Tracking Scenarios   | read:tracking-scenarios | extract_tracking_scenarios: true    |
| Tracking Plans       | read:data-products      | extract_tracking_plans: true        |

Permission Testing

Test your API credentials and permissions:

# Get JWT token
curl -X POST \
  -H "X-API-Key-ID: <API_KEY_ID>" \
  -H "X-API-Key: <API_KEY>" \
  https://console.snowplowanalytics.com/api/msc/v1/organizations/<ORG_ID>/credentials/v3/token

# List data structures
curl -H "Authorization: Bearer <JWT>" \
  https://console.snowplowanalytics.com/api/msc/v1/organizations/<ORG_ID>/data-structures/v1

Iglu Registry Permissions

For open-source Snowplow with Iglu:

  • Public registries: No authentication required (e.g., Iglu Central)
  • Private registries: API key with read access to schemas

Configuration

See the recipes in this document for complete configuration examples.

Connection Options

BDP Console Connection

| Option          | Type   | Required | Default                                          | Description                         |
|-----------------|--------|----------|--------------------------------------------------|-------------------------------------|
| organization_id | string | ✅       |                                                  | Organization UUID from Console URL  |
| api_key_id      | string | ✅       |                                                  | API Key ID from Console credentials |
| api_key         | string | ✅       |                                                  | API Key secret                      |
| console_api_url | string |          | https://console.snowplowanalytics.com/api/msc/v1 | BDP Console API base URL            |
| timeout_seconds | int    |          | 60                                               | Request timeout in seconds          |
| max_retries     | int    |          | 3                                                | Maximum retry attempts              |

Iglu Connection

| Option          | Type   | Required | Default | Description                                     |
|-----------------|--------|----------|---------|-------------------------------------------------|
| iglu_server_url | string | ✅       |         | Iglu server base URL                            |
| api_key         | string |          | None    | API key for private Iglu registry (UUID format) |
| timeout_seconds | int    |          | 30      | Request timeout in seconds                      |

Note: Iglu-only mode uses automatic schema discovery via the /api/schemas endpoint (requires Iglu Server 0.6+). All schemas in the registry will be automatically discovered.
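The discovery endpoint mentioned above can be exercised directly. A minimal sketch of building the listing URL and extracting schema keys from the returned JSON — the iglu.example.com host and the sample payload are hypothetical; only the /api/schemas path comes from the note above:

```python
import json

def iglu_list_url(base_url):
    """Build the schema-list URL for an Iglu Server (0.6+)."""
    return base_url.rstrip("/") + "/api/schemas"

def parse_schema_keys(body):
    """Extract vendor/name/format/version keys from a JSON schema listing.

    Assumes each entry carries the standard self-describing 'self' section.
    """
    keys = []
    for schema in json.loads(body):
        meta = schema["self"]
        keys.append("{vendor}/{name}/{format}/{version}".format(**meta))
    return keys

# Abridged, hypothetical response body:
sample = json.dumps([
    {"self": {"vendor": "com.example", "name": "page_view",
              "format": "jsonschema", "version": "1-0-0"}}
])

print(iglu_list_url("https://iglu.example.com/"))
# https://iglu.example.com/api/schemas
print(parse_schema_keys(sample))
# ['com.example/page_view/jsonschema/1-0-0']
```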

Feature Options

| Option                       | Type   | Default                | Description                                          | Required Permission     |
|------------------------------|--------|------------------------|------------------------------------------------------|-------------------------|
| extract_event_specifications | bool   | true                   | Extract event specifications                         | read:event-specs        |
| extract_tracking_scenarios   | bool   | true                   | Extract tracking scenarios                           | read:tracking-scenarios |
| extract_tracking_plans       | bool   | true                   | Extract tracking plans                               | read:data-products      |
| extract_pipelines            | bool   | true                   | Extract pipelines as DataFlow entities               | read:pipelines          |
| extract_enrichments          | bool   | true                   | Extract enrichments as DataJob entities with lineage | read:enrichments        |
| enrichment_owner             | string | None                   | Default owner email for enrichment DataJobs          | N/A                     |
| include_hidden_schemas       | bool   | false                  | Include schemas marked as hidden                     | N/A                     |
| include_version_in_urn       | bool   | false                  | Include version in dataset URN (legacy behavior)     | N/A                     |
| extract_standard_schemas     | bool   | true                   | Extract Snowplow standard schemas from Iglu Central  | N/A                     |
| iglu_central_url             | string | http://iglucentral.com | URL for fetching standard schemas                    | N/A                     |

Schema Extraction Options

| Option                  | Type   | Default             | Description                                                 |
|-------------------------|--------|---------------------|-------------------------------------------------------------|
| schema_types_to_extract | list   | ["event", "entity"] | Schema types to extract                                     |
| deployed_since          | string | None                | Only extract schemas deployed since this ISO 8601 timestamp |
| schema_page_size        | int    | 100                 | Number of schemas per API page                              |

Warehouse Lineage Options (Advanced)

⚠️ Note: Disabled by default. Prefer warehouse connectors (Snowflake, BigQuery) for column-level lineage.

| Option                                 | Type   | Default | Description                                     | Required Permission      |
|----------------------------------------|--------|---------|-------------------------------------------------|--------------------------|
| warehouse_lineage.enabled              | bool   | false   | Extract table-level lineage via Data Models API | read:data-products       |
| warehouse_lineage.platform_instance    | string | None    | Default platform instance for warehouse URNs    | N/A                      |
| warehouse_lineage.env                  | string | PROD    | Default environment for warehouse datasets      | N/A                      |
| warehouse_lineage.validate_urns        | bool   | true    | Validate warehouse URNs exist in DataHub        | DataHub Graph API access |
| warehouse_lineage.destination_mappings | list   | []      | Per-destination platform instance overrides     | N/A                      |

Field Tagging Options

| Option                                            | Type | Default | Description                                             |
|---------------------------------------------------|------|---------|---------------------------------------------------------|
| field_tagging.enabled                             | bool | true    | Enable automatic field tagging                          |
| field_tagging.tag_schema_version                  | bool | true    | Tag fields with schema version                          |
| field_tagging.tag_event_type                      | bool | true    | Tag fields with event type                              |
| field_tagging.tag_data_class                      | bool | true    | Tag fields with data classification (PII, Sensitive)    |
| field_tagging.tag_authorship                      | bool | true    | Tag fields with authorship info                         |
| field_tagging.track_field_versions                | bool | false   | Track which version each field was added in             |
| field_tagging.use_structured_properties           | bool | true    | Use structured properties instead of tags               |
| field_tagging.emit_tags_and_structured_properties | bool | false   | Emit both tags and structured properties                |
| field_tagging.pii_tags_only                       | bool | false   | Only emit tags for PII fields when using both           |
| field_tagging.use_pii_enrichment                  | bool | true    | Extract PII fields from PII Pseudonymization enrichment |

Performance Options

| Option                               | Type | Default | Description                                          |
|--------------------------------------|------|---------|------------------------------------------------------|
| performance.max_concurrent_api_calls | int  | 10      | Maximum concurrent API calls for deployment fetching |
| performance.enable_parallel_fetching | bool | true    | Enable parallel fetching of schema deployments       |

Filtering Options

| Option                    | Type             | Default   | Description                           |
|---------------------------|------------------|-----------|---------------------------------------|
| schema_pattern            | AllowDenyPattern | Allow all | Filter schemas by vendor/name pattern |
| event_spec_pattern        | AllowDenyPattern | Allow all | Filter event specifications by name   |
| tracking_scenario_pattern | AllowDenyPattern | Allow all | Filter tracking scenarios by name     |
| tracking_plan_pattern     | AllowDenyPattern | Allow all | Filter tracking plans by name         |
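The allow/deny semantics of these patterns can be illustrated with a simplified sketch. This is not the actual AllowDenyPattern implementation — just an assumed-equivalent model in which a deny match always wins, using the example regexes from the recipe above:

```python
import re

def allowed(name, allow, deny, ignore_case=True):
    """Return True if name matches any allow regex and no deny regex.

    Deny takes precedence over allow, mirroring typical
    AllowDenyPattern-style filtering.
    """
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(p, name, flags) for p in deny):
        return False
    return any(re.match(p, name, flags) for p in allow)

allow = [r"com\.example\..*"]   # only com.example schemas
deny = [r".*_sandbox.*"]        # but never sandbox schemas

print(allowed("com.example.page_view", allow, deny))     # True
print(allowed("com.example.cart_sandbox", allow, deny))  # False (deny wins)
print(allowed("com.acme.page_view", allow, deny))        # False (not allowed)
```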

Stateful Ingestion

| Option                                   | Type | Default | Description                                      |
|------------------------------------------|------|---------|--------------------------------------------------|
| stateful_ingestion.enabled               | bool | false   | Enable stateful ingestion for deletion detection |
| stateful_ingestion.remove_stale_metadata | bool | true    | Remove schemas that no longer exist              |

Quick Start

1. BDP Console (Managed Snowplow)

Create a recipe file snowplow_recipe.yml:

source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

Run ingestion:

datahub ingest -c snowplow_recipe.yml

2. Open-Source Snowplow (Iglu-Only Mode)

For self-hosted Snowplow with Iglu registry (without BDP Console API):

source:
  type: snowplow
  config:
    iglu_connection:
      iglu_server_url: "https://iglu.example.com"
      api_key: "${IGLU_API_KEY}" # Needed for private registries only

    schema_types_to_extract:
      - "event"
      - "entity"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

Important notes for Iglu-only mode:

  • Supported: Event and entity schemas with full JSON Schema definitions
  • Supported: Automatic schema discovery via /api/schemas endpoint (requires Iglu Server 0.6+)
  • ⚠️ Not supported: Event specifications (requires BDP API)
  • ⚠️ Not supported: Tracking scenarios (requires BDP API)
  • ⚠️ Not supported: Field tagging/PII detection (requires BDP deployment data)

For complete configuration options, see snowplow_iglu.yml.

3. With Warehouse Lineage (BDP Only - Advanced)

⚠️ Note: This feature is disabled by default and should only be enabled in specific scenarios (see below).

Extract table-level lineage from raw events to derived tables via BDP Data Models API:

source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

    # Enable warehouse lineage via Data Models API
    warehouse_lineage:
      enabled: true
      platform_instance: "prod_snowflake" # Optional
      env: "PROD"                         # Optional
      validate_urns: true                 # Optional

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

What this creates:

  • Table-level lineage: atomic.events → derived.sessions (or other derived tables)
  • No direct warehouse credentials needed (uses BDP API)

Supported warehouses: Snowflake, BigQuery, Redshift, Databricks

When to Enable This Feature

✅ Enable warehouse lineage if:

  • You want quick table-level lineage without configuring a warehouse connector
  • You don't have access to warehouse query logs
  • You want to document Data Models API metadata specifically

❌ Don't enable if you're using warehouse connectors:

  • Snowflake connector provides:
    • Column-level lineage by parsing SQL queries
    • Transformation logic from query history
    • Complete dependency graphs
  • BigQuery, Redshift, Databricks connectors similarly provide richer lineage

Best practice: Use a warehouse connector for detailed lineage. Only enable this feature for quick documentation of Data Models metadata.

Requirements: Data Models must be configured in your BDP organization.

Schema Versioning

Snowplow uses SchemaVer (semantic versioning for schemas) with the format MODEL-REVISION-ADDITION:

  • MODEL (first digit): Breaking changes - incompatible with previous versions
  • REVISION (second digit): Non-breaking changes - additions that are backward compatible
  • ADDITION (third digit): Adding optional fields without breaking changes

Example: 1-0-2

  • Model: 1 (major version)
  • Revision: 0 (no revisions)
  • Addition: 2 (two optional field additions)

In DataHub, schemas are represented as:

  • Dataset name: {vendor}.{name}.{version} (e.g., com.example.page_view.1-0-0)
  • Schema version: Tracked in dataset properties
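The versioning and naming scheme above can be sketched with a couple of illustrative helpers — these are not connector internals, just a restatement of the SchemaVer format and the dataset naming rule in code:

```python
def parse_schemaver(version):
    """Split a SchemaVer string like '1-0-2' into (MODEL, REVISION, ADDITION)."""
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition

def dataset_name(vendor, name, version):
    """Build the DataHub dataset name: {vendor}.{name}.{version}."""
    return f"{vendor}.{name}.{version}"

print(parse_schemaver("1-0-2"))
# (1, 0, 2)
print(dataset_name("com.example", "page_view", "1-0-0"))
# com.example.page_view.1-0-0
```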

Entity Mapping: Snowplow → DataHub

This section explains how Snowplow concepts are modeled as DataHub entities.

Entity Type Mapping

| Snowplow Concept    | DataHub Entity | DataHub Subtype        | Description                                      |
|---------------------|----------------|------------------------|--------------------------------------------------|
| Organization        | Container      | DATABASE               | Top-level container for all Snowplow metadata    |
| Event Schema        | Dataset        | snowplow_event_schema  | Self-describing event definition (JSON Schema)   |
| Entity Schema       | Dataset        | snowplow_entity_schema | Context/entity schema attached to events         |
| Event Specification | Dataset        | snowplow_event_spec    | Tracking requirement defining what to track      |
| Tracking Scenario   | Container      | (custom)               | Logical grouping of related event specifications |
| Tracking Plan       | Container      | tracking_plan          | Business-level tracking plan grouping            |
| Pipeline            | DataFlow       | -                      | Snowplow data pipeline (Collector → Warehouse)   |
| Enrichment          | DataJob        | -                      | Data transformation job within a pipeline        |
| Collector           | DataJob        | -                      | HTTP endpoint receiving tracking events          |
| Atomic Events       | Dataset        | atomic_event           | Raw enriched events table in warehouse           |
| Parsed Events       | Dataset        | event                  | Parsed event data combining all schemas          |

Pipeline Architecture in DataHub

Snowplow pipelines are modeled as DataFlow entities with DataJob children representing each processing stage:

Tracker SDKs (Web, Mobile, Server)
                 │
                 ▼
┌──────────────────────────────────────────────────────────────────────┐
│ Pipeline (DataFlow)                                                  │
│ urn:li:dataFlow:(snowplow,pipeline-id,PROD)                          │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────┐                                                 │
│  │ Collector       │ ◄── Receives HTTP tracking events               │
│  │ (DataJob)       │                                                 │
│  └────────┬────────┘                                                 │
│           │                                                          │
│           ▼                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────────┐  │
│  │ IP Lookup       │  │ UA Parser       │  │ PII Pseudonymization │  │
│  │ (DataJob)       │  │ (DataJob)       │  │ (DataJob)            │  │
│  │                 │  │                 │  │                      │  │
│  │ user_ipaddress  │  │ useragent       │  │ user_id, email       │  │
│  │ → geo_*, ip_*   │  │ → br_*, os_*    │  │ → (hashed values)    │  │
│  └────────┬────────┘  └────────┬────────┘  └──────────┬───────────┘  │
│           │                    │                      │              │
│           └────────────────────┼──────────────────────┘              │
│                                ▼                                     │
└──────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │ Atomic Events (Dataset) │
                    │ Enriched event stream   │
                    └────────────┬────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │ Warehouse Tables        │
                    │ (Snowflake, BigQuery)   │
                    └─────────────────────────┘

Lineage Relationships

The connector creates the following lineage relationships:

1. Schema → Event Specification Lineage

Event specifications reference the schemas they require:

┌──────────────────────────────┐
│ Event Schema                 │────┐
│ (vendor.event_name.1-0-0)    │    │    ┌─────────────────────────┐
└──────────────────────────────┘    ├───▶│ Event Specification     │
┌──────────────────────────────┐    │    │ (Tracking Requirement)  │
│ Entity Schema                │────┘    └─────────────────────────┘
│ (vendor.context.1-0-0)       │
└──────────────────────────────┘

2. Enrichment Column-Level Lineage

Enrichments transform specific fields. Example for IP Lookup:

┌─────────────────────────────────────────────────────────────────────┐
│ IP Lookup Enrichment │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Input Output │
│ ───── ────── │
│ ┌─────────────────┐ │
│ ┌──▶│ geo_country │ │
│ │ ├─────────────────┤ │
│ ┌─────────────────┐ │ │ geo_city │ │
│ │ user_ipaddress │────────┼──▶├─────────────────┤ │
│ └─────────────────┘ │ │ geo_region │ │
│ │ ├─────────────────┤ │
│ │ │ geo_latitude │ │
│ ├──▶├─────────────────┤ │
│ │ │ geo_longitude │ │
│ │ ├─────────────────┤ │
│ └──▶│ ip_isp │ │
│ ├─────────────────┤ │
│ │ ip_organization │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Supported enrichments with column-level lineage:

  • IP Lookup: user_ipaddress → geo_*, ip_* fields
  • UA Parser: useragent → br_*, os_* fields
  • YAUAA: useragent → browser, OS, device fields
  • Referer Parser: page_referrer → refr_* fields
  • Campaign Attribution: page_urlquery → mkt_* fields
  • PII Pseudonymization: configured fields → same fields (hashed)
  • Currency Conversion: currency fields → converted fields
  • Event Fingerprint: event fields → event_fingerprint
  • IAB Spiders/Robots: useragent → iab_* classification fields
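These mappings can be pictured as a lookup table from each enrichment's input column to its output columns. The sketch below is illustrative, not the connector's actual code; the output names are examples of Snowplow atomic fields, and the real connector may emit a longer list per enrichment:

```python
# Illustrative input -> outputs mappings driving column-level lineage.
ENRICHMENT_FIELD_MAP = {
    "ip_lookup": ("user_ipaddress", ["geo_country", "geo_city", "geo_region",
                                     "geo_latitude", "geo_longitude",
                                     "ip_isp", "ip_organization"]),
    "ua_parser": ("useragent", ["br_name", "br_family", "os_name", "os_family"]),
    "referer_parser": ("page_referrer", ["refr_medium", "refr_source", "refr_term"]),
    "campaign_attribution": ("page_urlquery", ["mkt_medium", "mkt_source",
                                               "mkt_term", "mkt_content",
                                               "mkt_campaign"]),
}

def upstream_of(column: str) -> list:
    """Input columns that feed the given output column."""
    return [src for src, outs in ENRICHMENT_FIELD_MAP.values() if column in outs]

print(upstream_of("geo_city"))  # ['user_ipaddress']
```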

3. Warehouse Lineage (Optional)

When warehouse_lineage.enabled: true:

┌─────────────────────────┐                    ┌─────────────────────────┐
│ Atomic Events │ Data Models API │ Derived Table │
│ (snowplow.atomic.events)│───────────────────▶│ (warehouse.schema.table)│
└─────────────────────────┘ └─────────────────────────┘
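Enabling this in a recipe looks like the following minimal sketch (the `warehouse_lineage.enabled` option is the one named above; the surrounding connection config is elided):

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...

    warehouse_lineage:
      enabled: true
```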

Container Hierarchy

Organization (Container: DATABASE)

├── Event Schema: com.example.page_view.1-0-0 (Dataset)
├── Event Schema: com.example.checkout.1-0-0 (Dataset)
├── Entity Schema: com.example.user_context.1-0-0 (Dataset)
├── Event Specification: "Page View Tracking" (Dataset)

├── Tracking Scenario: "Checkout Flow" (Container)
│ ├── Event Specification: "Add to Cart" (Dataset)
│ └── Event Specification: "Purchase Complete" (Dataset)

└── Tracking Plan: "Web Analytics" (Container)
    ├── Event Specification (linked)
    └── Schema (linked)

URN Formats

| Entity Type | URN Format |
| --- | --- |
| Organization | urn:li:container:{guid} |
| Event/Entity Schema | urn:li:dataset:(urn:li:dataPlatform:snowplow,vendor.name,ENV) |
| Event Specification | urn:li:dataset:(urn:li:dataPlatform:snowplow,event_spec_id,ENV) |
| Pipeline | urn:li:dataFlow:(snowplow,pipeline-id,ENV) |
| Enrichment/DataJob | urn:li:dataJob:(urn:li:dataFlow:(...),job-id) |
| Tracking Scenario | urn:li:container:{guid} |
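For example, a schema dataset URN can be assembled with plain string formatting (a sketch of the shape above; the schema name here is a placeholder, and the real connector constructs URNs internally):

```python
def make_snowplow_dataset_urn(name: str, env: str = "PROD") -> str:
    # Shape: urn:li:dataset:(urn:li:dataPlatform:snowplow,<name>,<env>)
    return f"urn:li:dataset:(urn:li:dataPlatform:snowplow,{name},{env})"

urn = make_snowplow_dataset_urn("com.example.page_view.1-0-0")
print(urn)  # urn:li:dataset:(urn:li:dataPlatform:snowplow,com.example.page_view.1-0-0,PROD)
```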

Custom Properties

Each entity type includes relevant custom properties:

Event/Entity Schemas:

  • vendor, name, version (SchemaVer format)
  • schema_type (event/entity)
  • json_schema (full JSON Schema definition)
  • deployed_environments (PROD, DEV, etc.)

Event Specifications:

  • status (draft, active, deprecated)
  • trigger_conditions
  • referenced_schemas

Enrichments:

  • enrichment_type
  • input_fields, output_fields
  • configuration details
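As an illustration, the custom properties attached to an event schema might look like the following (the values are hypothetical examples following the fields listed above, not output from a real deployment):

```python
# Hypothetical custom properties for an event schema dataset.
schema_properties = {
    "vendor": "com.example",
    "name": "page_view",
    "version": "1-0-0",        # SchemaVer: MODEL-REVISION-ADDITION
    "schema_type": "event",    # "event" or "entity"
    "deployed_environments": "PROD, DEV",
}
print(schema_properties["version"])  # 1-0-0
```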

Troubleshooting

Authentication Errors

Error: Authentication failed: Invalid API credentials

Solution:

  1. Verify api_key_id and api_key are correct
  2. Check credentials are for the correct organization
  3. Ensure credentials haven't expired
  4. Generate new credentials in BDP Console if needed

Error: Authentication failed: Forbidden

Solution:

  • Check organization_id matches your credentials
  • Verify API key has required permissions
  • Contact Snowplow support if permissions are unclear

Permission Errors

Error: Permission denied for /data-structures

Solution:

  • API key missing read:data-structures permission
  • Generate new credentials with correct permissions in BDP Console → Settings → API Credentials

Error: Permission denied for /event-specs

Solution:

  • Set extract_event_specifications: false in config, or
  • Request read:event-specs permission for your API key

Connection Errors

Error: Request timeout: https://console.snowplowanalytics.com

Solution:

  • Check network connectivity to Snowplow Console
  • Increase timeout_seconds in configuration
  • Verify Console URL is correct

Error: Iglu connection failed

Solution:

  • Verify iglu_server_url is correct and accessible
  • For private registries, check api_key is valid
  • Test connectivity: curl https://iglu.example.com/api/schemas

No Schemas Found

Issue: Ingestion completes but no schemas extracted

Solutions:

  1. Check filtering patterns:

     schema_pattern:
       allow: [".*"]  # Allow all schemas

  2. Check schema types:

     schema_types_to_extract: ["event", "entity"]

  3. Include hidden schemas:

     include_hidden_schemas: true

  4. Verify schemas exist in BDP Console or Iglu registry
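Taken together, a permissive "allow everything" configuration for debugging missing schemas might look like this (a sketch combining the options from the steps above; the connection config is elided):

```yaml
source:
  type: snowplow
  config:
    # ... connection config ...

    schema_pattern:
      allow: [".*"]                              # Allow all schemas
    schema_types_to_extract: ["event", "entity"]
    include_hidden_schemas: true
```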

Rate Limiting

Error: HTTP 429: Rate limit exceeded

Solution:

  • Connector implements automatic retry with exponential backoff
  • Rate limits should be handled automatically
  • If issues persist, contact Snowplow support to increase limits

Limitations

  1. BDP-specific features:

    • Event specifications only available via BDP Console API
    • Tracking scenarios only available via BDP Console API
    • Tracking plans only available via BDP Console API
    • Open-source Iglu users won't have these features
  2. Iglu Server requirements:

    • Automatic schema discovery requires Iglu Server 0.6+ with /api/schemas endpoint
    • Older Iglu implementations may not support the list schemas API
  3. Field tagging in Iglu-only mode:

    • PII/sensitive field detection requires BDP deployment metadata
    • Not available when using Iglu-only mode

Advanced Configuration

Custom Platform Instance

Group schemas by environment:

source:
  type: snowplow
  config:
    bdp_connection:
      organization_id: "<ORG_UUID>"
      api_key_id: "${SNOWPLOW_API_KEY_ID}"
      api_key: "${SNOWPLOW_API_KEY}"

    platform_instance: "production"
    env: "PROD"

Schema Filtering

Extract only specific vendor schemas:

source:
  type: snowplow
  config:
    # ... connection config ...

    schema_pattern:
      allow:
        - "com\\.example\\..*"       # Allow com.example schemas
        - "com\\.acme\\.events\\..*" # Allow com.acme.events schemas
      deny:
        - ".*\\.test$"               # Deny test schemas
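To sanity-check patterns before running ingestion, you can evaluate them with Python's `re` module. This is a sketch of typical allow/deny semantics (match any allow pattern, then no deny pattern), not the connector's exact matching logic:

```python
import re

allow = [r"com\.example\..*", r"com\.acme\.events\..*"]
deny = [r".*\.test$"]

def is_allowed(name: str) -> bool:
    # A schema passes if it matches some allow pattern and no deny pattern.
    if not any(re.fullmatch(p, name) for p in allow):
        return False
    return not any(re.fullmatch(p, name) for p in deny)

print(is_allowed("com.example.page_view"))       # True
print(is_allowed("com.example.page_view.test"))  # False (denied)
print(is_allowed("org.other.event"))             # False (not allowed)
```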

Stateful Ingestion

Enable deletion detection:

source:
  type: snowplow
  config:
    # ... connection config ...

    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true

Testing the Connection

Use DataHub's built-in test-connection command:

datahub check source-connection snowplow \
--config snowplow_recipe.yml

This will:

  • Test BDP Console API authentication
  • Test Iglu registry connectivity (if configured)
  • Verify required permissions
  • Report capability availability

Code Coordinates

  • Class Name: datahub.ingestion.source.snowplow.snowplow.SnowplowSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Snowplow, feel free to ping us on our Slack.