Airbyte

Overview

Airbyte is an open-source data integration platform that syncs data from sources to destinations through configurable connections. It supports hundreds of pre-built connectors and lets you build custom ones.

This integration extracts metadata from Airbyte to give DataHub visibility into your data pipelines — including connections, sources, destinations, streams, and job execution history. It captures lineage between source and destination datasets at both the table and column level.

Concept Mapping

The table below shows how Airbyte entities and concepts map to DataHub entities:

Source Concept | DataHub Concept | Notes
Workspace | DataFlow | Top-level container for Airbyte resources
Connection | DataFlow | Represents an Airbyte connection between source and destination
Source | Dataset | Source datasets are mapped to DataHub datasets
Destination | Dataset | Destination datasets are mapped to DataHub datasets
Stream | DataJob | Each stream is represented as a DataJob within the Connection DataFlow
Connection Job | DataProcessInstance | Execution information for a connection run
Source Schema | SchemaMetadata | Schema information from source datasets
Column Mapping | FineGrainedLineage | Column-level lineage between source and destination

Module airbyte

Incubating

Important Capabilities

Capability | Status | Notes
Column-level Lineage | | Enabled by default.
Detect Deleted Entities | | Enabled by default when stateful ingestion is turned on.
Extract Tags | | Requires recipe configuration.
Platform Instance | | Enabled by default.
Table-Level Lineage | | Enabled by default.

Overview

This integration extracts metadata from Airbyte's API to capture information about your connections, sources, destinations, and the lineage between them.

Prerequisites

You need a running Airbyte instance with configured sources and destinations, and access to the Airbyte API.

Steps to Get the Required Information

  1. Determine Your Deployment Type:

    • Open Source (OSS): If you're running a self-hosted Airbyte instance
    • Cloud: If you're using Airbyte Cloud
  2. Authentication Credentials:

    • For Open Source (OSS):

      • The URL of your Airbyte instance (host and port)
      • OAuth2 client credentials (Airbyte 1.0+) - obtain via:
        • UI: Navigate to User > User settings > Applications to create an application and copy credentials
        • CLI: Run abctl local credentials (abctl v0.11.0+)
      • Username and password if basic authentication is enabled
      • API token if available
    • For Airbyte Cloud:

      • OAuth2 client ID and client secret (required)
      • OAuth2 refresh token (optional — omit to use client_credentials grant; provide to use refresh_token grant)
      • Your Airbyte Cloud workspace ID
  3. API Access:

    • For OSS users, ensure the API is accessible at /api/public/v1 path prefix
    • Verify connectivity by testing the health endpoint: http://localhost:8000/api/public/v1/health
    • Ensure you have proper network connectivity between your DataHub instance and the Airbyte API
  4. Permissions:

    • The authentication credentials should have permissions to:
      • Read workspace information
      • List and read sources, destinations, and connections
      • Access connection schemas and sync catalogs
      • View job execution history (if extracting job statuses)
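The connectivity check in step 3 can be scripted. The following is a minimal sketch using only the Python standard library; the helper names health_url and airbyte_is_reachable are hypothetical, and it assumes the health endpoint answers an unauthenticated GET with HTTP 200, which may not hold for every deployment.

```python
import urllib.error
import urllib.request


def health_url(host_port: str) -> str:
    """Build the health endpoint URL from a host_port value such as http://localhost:8000."""
    return host_port.rstrip("/") + "/api/public/v1/health"


def airbyte_is_reachable(host_port: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(health_url(host_port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP response.
        return False


if __name__ == "__main__":
    print(airbyte_is_reachable("http://localhost:8000"))
```

Run it from the machine that hosts DataHub ingestion so that the result reflects the network path the connector will actually use.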

Install the Plugin

pip install 'acryl-datahub[airbyte]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: airbyte
  config:
    # Deployment type - required
    deployment_type: oss # Options: "oss" (self-hosted) or "cloud" (Airbyte Cloud)

    # Connection details for OSS deployment
    host_port: http://localhost:8000 # Airbyte API endpoint URL

    # Authentication for OSS deployment
    username: your_username # Username for basic auth
    password: your_password # Password for basic auth
    # api_key: your_api_key # Alternative: API token if available

    # Authentication for Cloud deployment - uncomment if using Airbyte Cloud
    #deployment_type: cloud
    #oauth2_client_id: your_client_id # OAuth2 client ID for Airbyte Cloud
    #oauth2_client_secret: your_client_secret # OAuth2 client secret
    #oauth2_refresh_token: your_refresh_token # OAuth2 refresh token
    #cloud_workspace_id: your_workspace_id # Airbyte Cloud workspace ID

    # SSL configuration
    verify_ssl: false # Whether to verify SSL certificates
    #ssl_ca_cert: /path/to/cert.pem # Path to CA certificate file (optional)

    # Data extraction options
    extract_column_level_lineage: true # Extract column-level lineage information
    include_statuses: true # Include connection job statuses
    job_statuses_limit: 100 # Max number of job statuses to retrieve

    # Lineage emission mode
    incremental_lineage: true # Emit lineage as patch (incremental) rather than full replacement
    # Set to false to re-state all lineage on each run

    # Optional: Extract tags
    extract_tags: false # Extract tags from Airbyte metadata

    # Filtering options - uncomment to use
    #workspace_pattern:
    #  allow:
    #    - ".*" # Pattern to filter workspaces

    #connection_pattern:
    #  allow:
    #    - ".*" # Pattern to filter connections

    #source_pattern:
    #  allow:
    #    - ".*MySQL.*" # Pattern to filter sources

    #destination_pattern:
    #  allow:
    #    - ".*Postgres.*" # Pattern to filter destinations

    # Platform instance configuration
    platform_instance: airbyte-instance # Custom platform instance name

    # Performance settings
    request_timeout: 30 # Timeout for API requests in seconds
    max_retries: 3 # Max retries for failed requests
    retry_backoff_factor: 0.5 # Backoff factor for retries
    page_size: 20 # Items per page in API requests

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
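The filtering options in the recipe take lists of regular expressions. To build intuition for what a pattern such as ".*MySQL.*" will and will not match, here is a simplified stand-in written for illustration. SimpleAllowDeny is a hypothetical class, not DataHub's implementation; it assumes patterns are matched from the start of the name, that a deny match takes precedence over an allow match, and that matching ignores case by default (as the ignoreCase setting describes).

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleAllowDeny:
    """Illustrative allow/deny regex filter; assumptions are noted in the text above."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)
    ignore_case: bool = True

    def allowed(self, name: str) -> bool:
        flags = re.IGNORECASE if self.ignore_case else 0
        # Assumption: a deny match wins over any allow match.
        if any(re.match(pattern, name, flags) for pattern in self.deny):
            return False
        # Assumption: re.match anchors each pattern at the start of the name.
        return any(re.match(pattern, name, flags) for pattern in self.allow)


sources = SimpleAllowDeny(allow=[".*MySQL.*"])
print(sources.allowed("Prod MySQL replica"))  # True
print(sources.allowed("prod mysql replica"))  # True, case is ignored
print(sources.allowed("Stripe"))              # False
```

Testing candidate patterns this way against a list of your own source and destination names is cheaper than discovering a too-narrow filter after an ingestion run.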

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Each entry below lists the field name, its type, a description, and the default value where one exists.
api_key
One of string(password), null
API key or Personal Access Token for authentication (OSS deployment)
Default: None
cloud_api_url
string
Base URL for Airbyte Cloud API (defaults to production URL)
cloud_oauth_token_url
string
OAuth token URL for Airbyte Cloud (defaults to production URL)
cloud_workspace_id
One of string, null
Workspace ID for Airbyte Cloud (required for cloud deployment)
Default: None
deployment_type
Enum
One of: "oss", "cloud"
extra_headers
One of string, null
Additional HTTP headers to send with each request
Default: None
extract_column_level_lineage
boolean
Extract column-level lineage
Default: True
extract_tags
boolean
Extract tags from Airbyte metadata
Default: False
host_port
One of string, null
Airbyte API host and port (e.g., http://localhost:8000) - required for OSS deployment
Default: None
include_statuses
boolean
Whether to ingest run statuses
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
job_status_end_date
One of string, null
End date for job status retrieval (format: yyyy-mm-ddTHH:MM:SSZ). Default is current time.
Default: None
job_status_start_date
One of string, null
Start date for job status retrieval (format: yyyy-mm-ddTHH:MM:SSZ). Default is 7 days ago.
Default: None
job_statuses_limit
integer
Maximum number of job statuses to retrieve per connection
Default: 100
max_retries
integer
Maximum number of retries for failed API requests
Default: 3
oauth2_client_id
One of string, null
OAuth2 client ID for OSS (Airbyte 1.0+) and Cloud deployments
Default: None
oauth2_client_secret
One of string(password), null
OAuth2 client secret for OSS (Airbyte 1.0+) and Cloud deployments
Default: None
oauth2_refresh_token
One of string(password), null
OAuth2 refresh token (Cloud only). If provided, uses refresh_token grant; otherwise uses client_credentials
Default: None
page_size
integer
Number of items to fetch per page in API requests
Default: 20
password
One of string(password), null
Password for basic authentication (OSS deployment)
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
request_timeout
integer
Timeout for API requests in seconds
Default: 30
retry_backoff_factor
number
Backoff factor for retries (wait time is {factor} * (2 ^ retry_number))
Default: 0.5
source_type_mapping
map(str,string)
ssl_ca_cert
One of string, null
Path to CA certificate file (.pem) for SSL verification
Default: None
username
One of string, null
Username for basic authentication (OSS deployment)
Default: None
verify_ssl
boolean
Whether to verify SSL certificates
Default: True
env
string
The environment that all assets produced by this connector belong to
Default: PROD
connection_pattern
AllowDenyPattern
Allow and deny regex patterns used to filter connections.
connection_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
destination_pattern
AllowDenyPattern
Allow and deny regex patterns used to filter destinations.
destination_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
destinations_to_platform_instance
map(str,PlatformDetail)
Configuration for mapping a specific Airbyte source/destination to DataHub URNs.
destinations_to_platform_instance.key.platform
One of string, null
Override the platform type detection (e.g., 'postgres', 'mysql')
Default: None
destinations_to_platform_instance.key.convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase. Recommended for case-insensitive platforms to ensure lineage compatibility. Note: For Snowflake destinations, this also lowercases column names in lineage to match DataHub's native Snowflake connector behavior. For other platforms (MSSQL, Postgres, BigQuery, etc.), only dataset names are lowercased, not column names.
Default: True
destinations_to_platform_instance.key.include_schema_in_urn
One of boolean, null
Include schema in the dataset URN when database is present. If None (default), automatically detects 2-tier vs 3-tier platforms by checking if schema equals database. Set to True to force 3-tier (database.schema.table), or False to force 2-tier (database.table).
Default: None
destinations_to_platform_instance.key.platform_instance
One of string, null
The instance of the platform that all assets belong to
Default: None
destinations_to_platform_instance.key.env
One of string, null
Environment to use for dataset URNs (e.g., PROD, DEV, STAGING)
Default: None
source_pattern
AllowDenyPattern
Allow and deny regex patterns used to filter sources.
source_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
sources_to_platform_instance
map(str,PlatformDetail)
Configuration for mapping a specific Airbyte source/destination to DataHub URNs.
sources_to_platform_instance.key.platform
One of string, null
Override the platform type detection (e.g., 'postgres', 'mysql')
Default: None
sources_to_platform_instance.key.convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase. Recommended for case-insensitive platforms to ensure lineage compatibility. Note: For Snowflake destinations, this also lowercases column names in lineage to match DataHub's native Snowflake connector behavior. For other platforms (MSSQL, Postgres, BigQuery, etc.), only dataset names are lowercased, not column names.
Default: True
sources_to_platform_instance.key.include_schema_in_urn
One of boolean, null
Include schema in the dataset URN when database is present. If None (default), automatically detects 2-tier vs 3-tier platforms by checking if schema equals database. Set to True to force 3-tier (database.schema.table), or False to force 2-tier (database.table).
Default: None
sources_to_platform_instance.key.platform_instance
One of string, null
The instance of the platform that all assets belong to
Default: None
sources_to_platform_instance.key.env
One of string, null
Environment to use for dataset URNs (e.g., PROD, DEV, STAGING)
Default: None
workspace_pattern
AllowDenyPattern
Allow and deny regex patterns used to filter workspaces.
workspace_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Default: None
stateful_ingestion.enabled
boolean
Whether to enable stateful ingestion. The effective default is True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified; otherwise it is False.
Default: False
stateful_ingestion.fail_safe_threshold
number
Guards against accidental changes to the source configuration. If the relative change in entities compared to the previous state, in percent, is above 'fail_safe_threshold', the soft deletes are not applied and the state is not committed.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
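The retry_backoff_factor description gives the wait time as {factor} * (2 ^ retry_number). As a worked example, the sketch below evaluates that formula for the default settings. The function names are hypothetical, and numbering retries from 1 is an assumption; if the connector counts from 0, every delay is halved.

```python
from typing import List


def backoff_delay(factor: float, retry_number: int) -> float:
    """Wait time in seconds, following factor * (2 ^ retry_number)."""
    return factor * (2 ** retry_number)


def backoff_schedule(factor: float = 0.5, max_retries: int = 3) -> List[float]:
    # Assumption: retries are numbered 1..max_retries.
    return [backoff_delay(factor, n) for n in range(1, max_retries + 1)]


print(backoff_schedule())        # [1.0, 2.0, 4.0]
print(sum(backoff_schedule()))   # 7.0 seconds of waiting in the worst case
```

Under these assumptions, a request that fails on every attempt spends up to 7 seconds waiting between retries, on top of up to request_timeout seconds per attempt.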

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Lineage

Column-level lineage is extracted from Airbyte's sync catalog when field mapping information is available in the connection configuration. Table-level lineage is always captured between source and destination datasets.

Job History

Connection job execution history is ingested as DataProcessInstance entities, capturing run status, start time, and duration for each sync job.
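The job history window is controlled by job_status_start_date and job_status_end_date, which use the format yyyy-mm-ddTHH:MM:SSZ and default to 7 days ago and the current time. A small sketch for producing explicit values in that format follows; job_status_window is a hypothetical helper, and treating the timestamps as UTC is an assumption based on the trailing Z.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# strftime pattern for yyyy-mm-ddTHH:MM:SSZ
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def job_status_window(now: Optional[datetime] = None, days: int = 7) -> Tuple[str, str]:
    """Return (start, end) strings covering the last `days` days up to `now`."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


start, end = job_status_window(datetime(2024, 3, 10, 12, 0, 0, tzinfo=timezone.utc))
print(start)  # 2024-03-03T12:00:00Z
print(end)    # 2024-03-10T12:00:00Z
```

Passing a larger days value gives a longer window, though job_statuses_limit still caps how many job statuses are retrieved per connection.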

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by Airbyte.

  • Schema information is only available for sources that expose a sync catalog. Sources without schema discovery will produce datasets without schema metadata.
  • Column-level lineage requires field mapping to be configured in the Airbyte connection.
  • Job history depth is limited by the Airbyte API's pagination and retention settings.
  • The Airbyte Public API only supports limit + offset pagination on list endpoints; cursor pagination is not exposed. Ingestion runs against an actively-mutating Airbyte instance may therefore skip or double-count entries inserted or deleted mid-scan. Schedule ingestion during quiet periods if exactness is required.
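The pagination caveat in the last item can be reproduced with a toy model. The sketch below does not call Airbyte; it simulates limit + offset paging over an in-memory list that changes after the first page is read. The paginate function and the sample data are made up for illustration.

```python
from typing import Callable, List, Optional


def paginate(
    items: List[str],
    limit: int,
    mutate_after_first_page: Optional[Callable[[List[str]], None]] = None,
) -> List[str]:
    """Read `items` page by page with limit + offset, optionally mutating mid-scan."""
    seen: List[str] = []
    offset = 0
    while True:
        page = items[offset:offset + limit]
        if not page:
            return seen
        seen.extend(page)
        if offset == 0 and mutate_after_first_page is not None:
            # Simulates another user changing the list between two page requests.
            mutate_after_first_page(items)
        offset += limit


stable = paginate(list("abcde"), 2)
after_delete = paginate(list("abcde"), 2, lambda xs: xs.remove("a"))
after_insert = paginate(list("abcde"), 2, lambda xs: xs.insert(0, "x"))

print(stable)        # ['a', 'b', 'c', 'd', 'e']
print(after_delete)  # ['a', 'b', 'd', 'e']  -> 'c' was skipped
print(after_insert)  # ['a', 'b', 'b', 'c', 'd', 'e']  -> 'b' was seen twice
```

A deletion shifts later entries into an offset range that was already read, so one entry is never seen; an insertion shifts entries the other way, so one is read twice. Running ingestion while the instance is quiet avoids both cases.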

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Authentication Errors

Verify that your OAuth2 client credentials are correct and have not expired. For OSS deployments, confirm the API is reachable at the /api/public/v1 path prefix.

Missing Schema Metadata

If datasets are ingested without schema information, confirm that the Airbyte source supports schema discovery and that the sync catalog is populated in the connection settings.

Code Coordinates

  • Class Name: datahub.ingestion.source.airbyte.source.AirbyteSource
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Airbyte, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.
