Airbyte

Overview

Airbyte is an open-source data integration platform that syncs data from sources to destinations through configurable connections. It supports hundreds of pre-built connectors and lets you build custom ones.

This integration extracts metadata from Airbyte to give DataHub visibility into your data pipelines — including connections, sources, destinations, streams, and job execution history. It captures lineage between source and destination datasets at both the table and column level.

Concept Mapping

The table below shows how Airbyte entities and concepts map to DataHub entities:

Source Concept	DataHub Concept	Notes
Workspace	`DataFlow`	Top-level container for Airbyte resources
Connection	`DataFlow`	Represents an Airbyte connection between source and destination
Source	`Dataset`	Source datasets are mapped to DataHub datasets
Destination	`Dataset`	Destination datasets are mapped to DataHub datasets
Stream	`DataJob`	Each stream is represented as a DataJob within the Connection DataFlow
Connection Job	`DataProcessInstance`	Execution information for a connection run
Source Schema	`SchemaMetadata`	Schema information from source datasets
Column Mapping	`FineGrainedLineage`	Column-level lineage between source and destination

Module `airbyte`

Important Capabilities

Capability	Status	Notes
Column-level Lineage	✅	Enabled by default.
Detect Deleted Entities	✅	Enabled by default when stateful ingestion is turned on.
Extract Tags	✅	Requires recipe configuration.
Platform Instance	✅	Enabled by default.
Table-Level Lineage	✅	Enabled by default.

Overview

This integration extracts metadata from Airbyte's API to capture information about your connections, sources, destinations, and the lineage between them.

Prerequisites

You need a running Airbyte instance with configured sources and destinations, and access to the Airbyte API.

Steps to Get the Required Information

1. Determine Your Deployment Type:
   - Open Source (OSS): If you're running a self-hosted Airbyte instance
   - Cloud: If you're using Airbyte Cloud
2. Authentication Credentials:
   - For Open Source (OSS):
     - The URL of your Airbyte instance (host and port)
     - OAuth2 client credentials (Airbyte 1.0+), obtained via:
       - UI: Navigate to User > User settings > Applications to create an application and copy credentials
       - CLI: Run `abctl local credentials` (abctl v0.11.0+)
     - Username and password if basic authentication is enabled
     - API token if available
   - For Airbyte Cloud:
     - OAuth2 client ID and client secret (required)
     - OAuth2 refresh token (optional; omit it to use the client_credentials grant, or provide it to use the refresh_token grant)
     - Your Airbyte Cloud workspace ID
3. API Access:
   - For OSS users, ensure the API is accessible at the /api/public/v1 path prefix
   - Verify connectivity by testing the health endpoint, for example http://localhost:8000/api/public/v1/health
   - Ensure there is network connectivity between your DataHub instance and the Airbyte API
4. Permissions:
   - The authentication credentials should have permissions to:
     - Read workspace information
     - List and read sources, destinations, and connections
     - Access connection schemas and sync catalogs
     - View job execution history (if extracting job statuses)
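
The connectivity check for an OSS instance can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `health_url` and `is_reachable` are hypothetical and are not part of DataHub or Airbyte. It assumes the health endpoint answers with HTTP 200 when the API is up.

```python
import urllib.request

# Path prefix for the OSS public API, as named in the steps above.
PUBLIC_API_PREFIX = "/api/public/v1"


def health_url(host_port: str) -> str:
    """Build the health endpoint URL from a host_port value (hypothetical helper)."""
    return host_port.rstrip("/") + PUBLIC_API_PREFIX + "/health"


def is_reachable(host_port: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint responds with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(health_url(host_port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False
```

Calling `is_reachable("http://localhost:8000")` from the machine that runs DataHub ingestion exercises the same network path the connector would use.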

Install the Plugin

pip install 'acryl-datahub[airbyte]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: airbyte
  config:
    # Deployment type - required
    deployment_type: oss               # Options: "oss" (self-hosted) or "cloud" (Airbyte Cloud)

    # Connection details for OSS deployment
    host_port: http://localhost:8000   # Airbyte API endpoint URL

    # Authentication for OSS deployment
    username: your_username            # Username for basic auth
    password: your_password            # Password for basic auth
    # api_key: your_api_key            # Alternative: API token if available

    # Authentication for Cloud deployment - uncomment if using Airbyte Cloud
    #deployment_type: cloud
    #oauth2_client_id: your_client_id           # OAuth2 client ID for Airbyte Cloud
    #oauth2_client_secret: your_client_secret   # OAuth2 client secret
    #oauth2_refresh_token: your_refresh_token   # OAuth2 refresh token
    #cloud_workspace_id: your_workspace_id      # Airbyte Cloud workspace ID

    # SSL configuration
    verify_ssl: false                  # Whether to verify SSL certificates
    #ssl_ca_cert: /path/to/cert.pem    # Path to CA certificate file (optional)

    # Data extraction options
    extract_column_level_lineage: true # Extract column-level lineage information
    include_statuses: true             # Include connection job statuses
    job_statuses_limit: 100            # Max number of job statuses to retrieve
    
    # Lineage emission mode
    incremental_lineage: true          # Emit lineage as patch (incremental) rather than full replacement
                                       # Set to false to re-state all lineage on each run

    # Optional: Extract tags
    extract_tags: false                # Extract tags from Airbyte metadata

    # Filtering options - uncomment to use
    #workspace_pattern:
    #  allow:
    #    - ".*"                        # Pattern to filter workspaces

    #connection_pattern:
    #  allow:
    #    - ".*"                        # Pattern to filter connections

    #source_pattern:
    #  allow:
    #    - ".*MySQL.*"                 # Pattern to filter sources

    #destination_pattern:
    #  allow:
    #    - ".*Postgres.*"              # Pattern to filter destinations

    # Platform instance configuration
    platform_instance: airbyte-instance # Custom platform instance name

    # Performance settings
    request_timeout: 30                # Timeout for API requests in seconds
    max_retries: 3                     # Max retries for failed requests
    retry_backoff_factor: 0.5          # Backoff factor for retries
    page_size: 20                      # Items per page in API requests

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

Config Details


Note that a . is used to denote nested fields in the YAML recipe.

Field	Description
api_key One of string(password), null	API key or Personal Access Token for authentication (OSS deployment) Default: None
cloud_api_url string	Base URL for Airbyte Cloud API (defaults to production URL) Default: https://api.airbyte.com/v1
cloud_oauth_token_url string	OAuth token URL for Airbyte Cloud (defaults to production URL) Default: https://auth.airbyte.com/oauth/token
cloud_workspace_id One of string, null	Workspace ID for Airbyte Cloud (required for cloud deployment) Default: None
deployment_type Enum	Type of Airbyte deployment. One of: "oss", "cloud" Default: oss
extra_headers One of map(str,string), null	Additional HTTP headers to send with each request Default: None
extract_column_level_lineage boolean	Extract column-level lineage Default: True
extract_tags boolean	Extract tags from Airbyte metadata Default: False
host_port One of string, null	Airbyte API host and port (e.g., http://localhost:8000) - required for OSS deployment Default: None
include_statuses boolean	Whether to ingest run statuses Default: True
incremental_lineage boolean	When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False
job_status_end_date One of string, null	End date for job status retrieval (format: yyyy-mm-ddTHH:MM:SSZ). Default is current time. Default: None
job_status_start_date One of string, null	Start date for job status retrieval (format: yyyy-mm-ddTHH:MM:SSZ). Default is 7 days ago. Default: None
job_statuses_limit integer	Maximum number of job statuses to retrieve per connection Default: 100
max_retries integer	Maximum number of retries for failed API requests Default: 3
oauth2_client_id One of string, null	OAuth2 client ID for OSS (Airbyte 1.0+) and Cloud deployments Default: None
oauth2_client_secret One of string(password), null	OAuth2 client secret for OSS (Airbyte 1.0+) and Cloud deployments Default: None
oauth2_refresh_token One of string(password), null	OAuth2 refresh token (Cloud only). If provided, uses refresh_token grant; otherwise uses client_credentials Default: None
page_size integer	Number of items to fetch per page in API requests Default: 20
password One of string(password), null	Password for basic authentication (OSS deployment) Default: None
platform_instance One of string, null	The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None
request_timeout integer	Timeout for API requests in seconds Default: 30
retry_backoff_factor number	Backoff factor for retries (wait time is {factor} * (2 ^ retry_number)) Default: 0.5
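source_type_mapping map(str,string)	Mapping from Airbyte sourceType/destinationType to DataHub platform names. Use this to normalize Airbyte's source types to DataHub platform names. Example: {'PostgreSQL': 'postgres', 'MySQL': 'mysql'}. If not specified, the sourceType/destinationType from Airbyte is sanitized and used directly.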
ssl_ca_cert One of string, null	Path to CA certificate file (.pem) for SSL verification Default: None
username One of string, null	Username for basic authentication (OSS deployment) Default: None
verify_ssl boolean	Whether to verify SSL certificates Default: True
env string	The environment that all assets produced by this connector belong to Default: PROD
connection_pattern AllowDenyPattern	A class to store allow deny regexes
connection_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
destination_pattern AllowDenyPattern	A class to store allow deny regexes
destination_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
destinations_to_platform_instance map(str,PlatformDetail)	Configuration for mapping a specific Airbyte source/destination to DataHub URNs.
destinations_to_platform_instance.`key`.platform One of string, null	Override the platform type detection (e.g., 'postgres', 'mysql') Default: None
destinations_to_platform_instance.`key`.convert_urns_to_lowercase boolean	Whether to convert dataset urns to lowercase. Recommended for case-insensitive platforms to ensure lineage compatibility. Note: For Snowflake destinations, this also lowercases column names in lineage to match DataHub's native Snowflake connector behavior. For other platforms (MSSQL, Postgres, BigQuery, etc.), only dataset names are lowercased, not column names. Default: True
destinations_to_platform_instance.`key`.include_schema_in_urn One of boolean, null	Include schema in the dataset URN when database is present. If None (default), automatically detects 2-tier vs 3-tier platforms by checking if schema equals database. Set to True to force 3-tier (database.schema.table), or False to force 2-tier (database.table). Default: None
destinations_to_platform_instance.`key`.platform_instance One of string, null	The instance of the platform that all assets belong to Default: None
destinations_to_platform_instance.`key`.env One of string, null	Environment to use for dataset URNs (e.g., PROD, DEV, STAGING) Default: None
source_pattern AllowDenyPattern	A class to store allow deny regexes
source_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
sources_to_platform_instance map(str,PlatformDetail)	Configuration for mapping a specific Airbyte source/destination to DataHub URNs.
sources_to_platform_instance.`key`.platform One of string, null	Override the platform type detection (e.g., 'postgres', 'mysql') Default: None
sources_to_platform_instance.`key`.convert_urns_to_lowercase boolean	Whether to convert dataset urns to lowercase. Recommended for case-insensitive platforms to ensure lineage compatibility. Note: For Snowflake destinations, this also lowercases column names in lineage to match DataHub's native Snowflake connector behavior. For other platforms (MSSQL, Postgres, BigQuery, etc.), only dataset names are lowercased, not column names. Default: True
sources_to_platform_instance.`key`.include_schema_in_urn One of boolean, null	Include schema in the dataset URN when database is present. If None (default), automatically detects 2-tier vs 3-tier platforms by checking if schema equals database. Set to True to force 3-tier (database.schema.table), or False to force 2-tier (database.table). Default: None
sources_to_platform_instance.`key`.platform_instance One of string, null	The instance of the platform that all assets belong to Default: None
sources_to_platform_instance.`key`.env One of string, null	Environment to use for dataset URNs (e.g., PROD, DEV, STAGING) Default: None
workspace_pattern AllowDenyPattern	A class to store allow deny regexes
workspace_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null	Default: None
stateful_ingestion.enabled boolean	Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False Default: False
stateful_ingestion.fail_safe_threshold number	Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0
stateful_ingestion.remove_stale_metadata boolean	Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True
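
The retry_backoff_factor description gives the wait time as {factor} * (2 ^ retry_number). The sketch below evaluates that formula for the default settings. It assumes retry_number starts at 0, which this page does not state, and `retry_waits` is a hypothetical helper rather than connector code.

```python
from typing import List


def retry_waits(max_retries: int = 3, backoff_factor: float = 0.5) -> List[float]:
    """Wait time in seconds before each retry: factor * 2 ** retry_number.

    Assumption: retry_number counts from 0 for the first retry.
    """
    return [backoff_factor * (2 ** retry_number) for retry_number in range(max_retries)]


# With the defaults (max_retries=3, retry_backoff_factor=0.5):
print(retry_waits())  # [0.5, 1.0, 2.0]
```

If the numbering starts at 1 instead, every wait doubles, so treat the values as an order-of-magnitude guide when tuning request_timeout and max_retries together.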

The JSONSchema for this configuration is inlined below.

{
  "$defs": {
    "AirbyteDeploymentType": {
      "enum": [
        "oss",
        "cloud"
      ],
      "title": "AirbyteDeploymentType",
      "type": "string"
    },
    "AllowDenyPattern": {
      "additionalProperties": false,
      "description": "A class to store allow deny regexes",
      "properties": {
        "allow": {
          "default": [
            ".*"
          ],
          "description": "List of regex patterns to include in ingestion",
          "items": {
            "type": "string"
          },
          "title": "Allow",
          "type": "array"
        },
        "deny": {
          "default": [],
          "description": "List of regex patterns to exclude from ingestion.",
          "items": {
            "type": "string"
          },
          "title": "Deny",
          "type": "array"
        },
        "ignoreCase": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": true,
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "title": "Ignorecase"
        }
      },
      "title": "AllowDenyPattern",
      "type": "object"
    },
    "PlatformDetail": {
      "additionalProperties": false,
      "description": "Configuration for mapping a specific Airbyte source/destination to DataHub URNs.",
      "properties": {
        "platform": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Override the platform type detection (e.g., 'postgres', 'mysql')",
          "title": "Platform"
        },
        "platform_instance": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "The instance of the platform that all assets belong to",
          "title": "Platform Instance"
        },
        "env": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Environment to use for dataset URNs (e.g., PROD, DEV, STAGING)",
          "title": "Env"
        },
        "include_schema_in_urn": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Include schema in the dataset URN when database is present. If None (default), automatically detects 2-tier vs 3-tier platforms by checking if schema equals database. Set to True to force 3-tier (database.schema.table), or False to force 2-tier (database.table).",
          "title": "Include Schema In Urn"
        },
        "convert_urns_to_lowercase": {
          "default": true,
          "description": "Whether to convert dataset urns to lowercase. Recommended for case-insensitive platforms to ensure lineage compatibility. Note: For Snowflake destinations, this also lowercases column names in lineage to match DataHub's native Snowflake connector behavior. For other platforms (MSSQL, Postgres, BigQuery, etc.), only dataset names are lowercased, not column names.",
          "title": "Convert Urns To Lowercase",
          "type": "boolean"
        }
      },
      "title": "PlatformDetail",
      "type": "object"
    },
    "StatefulStaleMetadataRemovalConfig": {
      "additionalProperties": false,
      "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
      "properties": {
        "enabled": {
          "default": false,
          "description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
          "title": "Enabled",
          "type": "boolean"
        },
        "remove_stale_metadata": {
          "default": true,
          "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
          "title": "Remove Stale Metadata",
          "type": "boolean"
        },
        "fail_safe_threshold": {
          "default": 75.0,
          "description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
          "maximum": 100.0,
          "minimum": 0.0,
          "title": "Fail Safe Threshold",
          "type": "number"
        }
      },
      "title": "StatefulStaleMetadataRemovalConfig",
      "type": "object"
    }
  },
  "additionalProperties": false,
  "description": "Airbyte source configuration for metadata ingestion",
  "properties": {
    "incremental_lineage": {
      "default": false,
      "description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
      "title": "Incremental Lineage",
      "type": "boolean"
    },
    "env": {
      "default": "PROD",
      "description": "The environment that all assets produced by this connector belong to",
      "title": "Env",
      "type": "string"
    },
    "platform_instance": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
      "title": "Platform Instance"
    },
    "stateful_ingestion": {
      "anyOf": [
        {
          "$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "deployment_type": {
      "$ref": "#/$defs/AirbyteDeploymentType",
      "default": "oss",
      "description": "Type of Airbyte deployment ('oss' or 'cloud')"
    },
    "host_port": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Airbyte API host and port (e.g., http://localhost:8000) - required for OSS deployment",
      "title": "Host Port"
    },
    "username": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Username for basic authentication (OSS deployment)",
      "title": "Username"
    },
    "password": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Password for basic authentication (OSS deployment)",
      "title": "Password"
    },
    "api_key": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "API key or Personal Access Token for authentication (OSS deployment)",
      "title": "Api Key"
    },
    "oauth2_client_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "OAuth2 client ID for OSS (Airbyte 1.0+) and Cloud deployments",
      "title": "Oauth2 Client Id"
    },
    "oauth2_client_secret": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "OAuth2 client secret for OSS (Airbyte 1.0+) and Cloud deployments",
      "title": "Oauth2 Client Secret"
    },
    "oauth2_refresh_token": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "OAuth2 refresh token (Cloud only). If provided, uses refresh_token grant; otherwise uses client_credentials",
      "title": "Oauth2 Refresh Token"
    },
    "verify_ssl": {
      "default": true,
      "description": "Whether to verify SSL certificates",
      "title": "Verify Ssl",
      "type": "boolean"
    },
    "ssl_ca_cert": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Path to CA certificate file (.pem) for SSL verification",
      "title": "Ssl Ca Cert"
    },
    "extra_headers": {
      "anyOf": [
        {
          "additionalProperties": {
            "type": "string"
          },
          "type": "object"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Additional HTTP headers to send with each request",
      "title": "Extra Headers"
    },
    "request_timeout": {
      "default": 30,
      "description": "Timeout for API requests in seconds",
      "title": "Request Timeout",
      "type": "integer"
    },
    "max_retries": {
      "default": 3,
      "description": "Maximum number of retries for failed API requests",
      "title": "Max Retries",
      "type": "integer"
    },
    "retry_backoff_factor": {
      "default": 0.5,
      "description": "Backoff factor for retries (wait time is {factor} * (2 ^ retry_number))",
      "title": "Retry Backoff Factor",
      "type": "number"
    },
    "page_size": {
      "default": 20,
      "description": "Number of items to fetch per page in API requests",
      "title": "Page Size",
      "type": "integer"
    },
    "cloud_workspace_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Workspace ID for Airbyte Cloud (required for cloud deployment)",
      "title": "Cloud Workspace Id"
    },
    "cloud_api_url": {
      "default": "https://api.airbyte.com/v1",
      "description": "Base URL for Airbyte Cloud API (defaults to production URL)",
      "title": "Cloud Api Url",
      "type": "string"
    },
    "cloud_oauth_token_url": {
      "default": "https://auth.airbyte.com/oauth/token",
      "description": "OAuth token URL for Airbyte Cloud (defaults to production URL)",
      "title": "Cloud Oauth Token Url",
      "type": "string"
    },
    "extract_column_level_lineage": {
      "default": true,
      "description": "Extract column-level lineage",
      "title": "Extract Column Level Lineage",
      "type": "boolean"
    },
    "workspace_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns to filter workspaces. Use the pattern format as in other DataHub sources."
    },
    "connection_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns to filter connections. Use the pattern format as in other DataHub sources."
    },
    "source_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns to filter sources. Use the pattern format as in other DataHub sources."
    },
    "destination_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns to filter destinations. Use the pattern format as in other DataHub sources."
    },
    "source_type_mapping": {
      "additionalProperties": {
        "type": "string"
      },
      "description": "Mapping from Airbyte sourceType/destinationType to DataHub platform names. Use this to normalize Airbyte's source types to DataHub platform names. Example: {'PostgreSQL': 'postgres', 'MySQL': 'mysql'}. If not specified, the sourceType/destinationType from Airbyte is sanitized and used directly.",
      "title": "Source Type Mapping",
      "type": "object"
    },
    "sources_to_platform_instance": {
      "additionalProperties": {
        "$ref": "#/$defs/PlatformDetail"
      },
      "description": "A mapping from Airbyte source ID to its platform/instance/env/database details. Use this to override platform details for specific sources. Example: {'11111111-1111-1111-1111-111111111111': {'platform': 'postgres', 'platform_instance': 'prod-postgres', 'env': 'PROD'}}",
      "title": "Sources To Platform Instance",
      "type": "object"
    },
    "destinations_to_platform_instance": {
      "additionalProperties": {
        "$ref": "#/$defs/PlatformDetail"
      },
      "description": "A mapping from Airbyte destination ID to its platform/instance/env/database details. Use this to override platform details for specific destinations.",
      "title": "Destinations To Platform Instance",
      "type": "object"
    },
    "include_statuses": {
      "default": true,
      "description": "Whether to ingest run statuses",
      "title": "Include Statuses",
      "type": "boolean"
    },
    "job_status_start_date": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Start date for job status retrieval (format: yyyy-mm-ddTHH:MM:SSZ). Default is 7 days ago.",
      "title": "Job Status Start Date"
    },
    "job_status_end_date": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "End date for job status retrieval (format: yyyy-mm-ddTHH:MM:SSZ). Default is current time.",
      "title": "Job Status End Date"
    },
    "job_statuses_limit": {
      "default": 100,
      "description": "Maximum number of job statuses to retrieve per connection",
      "title": "Job Statuses Limit",
      "type": "integer"
    },
    "extract_tags": {
      "default": false,
      "description": "Extract tags from Airbyte metadata",
      "title": "Extract Tags",
      "type": "boolean"
    }
  },
  "title": "AirbyteSourceConfig",
  "type": "object"
}
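
The include_schema_in_urn and convert_urns_to_lowercase options interact when dataset names are built. The following sketch illustrates the behavior as described in the field documentation; `dataset_name` is a hypothetical function, and it assumes that a schema equal to the database name signals a 2-tier platform. The connector's actual logic may differ in detail.

```python
from typing import Optional


def dataset_name(
    database: str,
    schema: str,
    table: str,
    include_schema_in_urn: Optional[bool] = None,
    convert_urns_to_lowercase: bool = True,
) -> str:
    """Assemble the dataset name portion of a URN (illustrative only)."""
    if include_schema_in_urn is None:
        # Auto-detect: treat schema == database as a 2-tier platform (assumption).
        include_schema = schema != database
    else:
        include_schema = include_schema_in_urn
    parts = [database, schema, table] if include_schema else [database, table]
    name = ".".join(parts)
    return name.lower() if convert_urns_to_lowercase else name


print(dataset_name("Shop", "public", "Orders"))  # shop.public.orders
print(dataset_name("shop", "shop", "orders"))    # shop.orders
```

Setting include_schema_in_urn explicitly is the safer choice when auto-detection yields names that do not match the URNs emitted by the native connector for the same platform.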

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Lineage

Column-level lineage is extracted from Airbyte's sync catalog when field mapping information is available in the connection configuration. Table-level lineage is always captured between source and destination datasets.

Job History

Connection job execution history is ingested as DataProcessInstance entities, capturing run status, start time, and duration for each sync job.
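
The job_status_start_date and job_status_end_date options take timestamps in the format yyyy-mm-ddTHH:MM:SSZ. The sketch below produces values in that format for the documented default window (7 days ago to now). The helper `default_job_window` is hypothetical, and it assumes the timestamps are in UTC, as the trailing Z suggests.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# strftime pattern matching yyyy-mm-ddTHH:MM:SSZ
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def default_job_window(now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start, end) strings covering the 7 days before `now` (UTC assumed)."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=7)
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


fixed = datetime(2024, 1, 8, 12, 0, 0, tzinfo=timezone.utc)
print(default_job_window(fixed))  # ('2024-01-01T12:00:00Z', '2024-01-08T12:00:00Z')
```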

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by Airbyte.

- Schema information is only available for sources that expose a sync catalog. Sources without schema discovery will produce datasets without schema metadata.
- Column-level lineage requires field mapping to be configured in the Airbyte connection.
- Job history depth is limited by the Airbyte API's pagination and retention settings.
- The Airbyte Public API only supports limit + offset pagination on list endpoints; cursor pagination is not exposed. Ingestion runs against an actively mutating Airbyte instance may therefore skip or double-count entries inserted or deleted mid-scan. Schedule ingestion during quiet periods if exactness is required.
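
The pagination caveat can be reproduced with a small simulation. The sketch below pages through an in-memory list with limit + offset and mutates it between pages. It is an illustration of the general effect, not Airbyte or connector code, and `scan` is a hypothetical function.

```python
def scan(items, limit, mutate_after_page=None):
    """Read `items` page by page with limit + offset.

    `mutate_after_page` maps a 1-based page number to a function that
    modifies the list after that page has been read.
    """
    seen, offset, page_no = [], 0, 0
    while True:
        page = items[offset:offset + limit]
        if not page:
            break
        seen.extend(page)
        offset += limit
        page_no += 1
        if mutate_after_page and page_no in mutate_after_page:
            mutate_after_page[page_no](items)
    return seen


# Deleting "a" after page 1 shifts later items left, so "c" is never returned.
skipped = scan(["a", "b", "c", "d", "e", "f"], 2, {1: lambda xs: xs.remove("a")})
print(skipped)  # ['a', 'b', 'd', 'e', 'f']

# Inserting "z" at the front after page 1 shifts items right, so "b" appears twice.
doubled = scan(["a", "b", "c", "d"], 2, {1: lambda xs: xs.insert(0, "z")})
print(doubled)  # ['a', 'b', 'b', 'c', 'd']
```

A larger page_size reduces the number of page boundaries at which such shifts can occur, but does not remove the effect.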

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
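
Scope filters can be checked offline before a run. The sketch below mimics allow/deny matching with the ignoreCase option. It assumes patterns are matched from the start of the name with `re.match` and that a deny match takes precedence over allow; `is_allowed` is a hypothetical stand-in, not DataHub's AllowDenyPattern class.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Return True if `name` passes the deny list and matches an allow pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # Deny is checked first, so a deny match always excludes the name (assumption).
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


print(is_allowed("prod mysql replica", allow=[".*MySQL.*"]))                     # True
print(is_allowed("prod mysql replica", allow=[".*MySQL.*"], ignore_case=False))  # False
```

Running the names of your workspaces, connections, sources, and destinations through such a check helps confirm that a pattern is not filtering out everything.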

Authentication Errors

Verify that your OAuth2 client credentials are correct and have not expired. For OSS deployments, confirm the API is reachable at the /api/public/v1 path prefix.
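
When debugging credentials, it helps to know which OAuth2 grant applies: the refresh_token grant when oauth2_refresh_token is set, otherwise client_credentials. The sketch below shows that selection using the standard OAuth2 parameter names. The function `token_request_body` is hypothetical, and the exact request the connector sends (encoding, extra fields) is an assumption not confirmed by this page.

```python
from typing import Dict, Optional


def token_request_body(
    client_id: str,
    client_secret: str,
    refresh_token: Optional[str] = None,
) -> Dict[str, str]:
    """Build token request parameters, choosing the grant from the inputs."""
    body = {"client_id": client_id, "client_secret": client_secret}
    if refresh_token:
        # A refresh token was provided: use the refresh_token grant.
        body["grant_type"] = "refresh_token"
        body["refresh_token"] = refresh_token
    else:
        # No refresh token: fall back to the client_credentials grant.
        body["grant_type"] = "client_credentials"
    return body
```

If a stale refresh token is the cause of a failure, removing oauth2_refresh_token from the recipe switches to the client_credentials grant.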

Missing Schema Metadata

If datasets are ingested without schema information, confirm that the Airbyte source supports schema discovery and that the sync catalog is populated in the connection settings.

Code Coordinates

Class Name: datahub.ingestion.source.airbyte.source.AirbyteSource
Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for Airbyte, feel free to ping us on our Slack.

