dlt

Overview

dlt (data load tool) is an open-source Python ELT library for building data pipelines that load data from REST APIs, databases, and other sources into destinations like Postgres, BigQuery, Snowflake, and DuckDB.

The DataHub integration for dlt reads pipeline metadata from dlt's local state directory (~/.dlt/pipelines/) and emits DataFlow, DataJob, and lineage entities to DataHub. The connector also supports per-run history (DataProcessInstance) when the dlt package is installed and destination credentials are available, plus stateful deletion detection.

Concept Mapping

dlt                          | DataHub
Pipeline (pipeline_name)     | DataFlow
Resource / destination table | DataJob
Destination table            | Dataset (DataJob output)
User-configured upstream     | Dataset (DataJob input)
_dlt_loads row               | DataProcessInstance

Destination tables are mapped to Dataset URNs that match the destination platform's own DataHub connector (Postgres, BigQuery, etc.), enabling lineage stitching when both connectors run.

Module dlt

Incubating

Important Capabilities

Capability              | Notes
Detect Deleted Entities | Enabled via stateful ingestion.
Extract Ownership       | Emitted when dlt pipeline state contains owner information.
Platform Instance       | Enabled by default.
Table-Level Lineage     | Emits outlet lineage from dlt DataJobs to destination Dataset URNs. Configure destination_platform_map to match your destination connector's env/instance.

Overview

The dlt module ingests pipeline metadata from dlt (data load tool) into DataHub. It reads dlt's local state directory directly — no live connection to dlt or the destination is required for basic metadata extraction. If the dlt Python package is installed, the connector uses the SDK for richer metadata; otherwise it falls back to parsing the YAML state files directly.
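
The state files are plain YAML and JSON, so the fallback amounts to reading them off disk. Below is a rough sketch of pulling table and column names out of a schema file; the pipeline and schema names are made up, and this illustrates the file layout rather than the connector's exact code path.

import yaml
from pathlib import Path

# Illustrative path; substitute your own pipeline and schema names.
schema_file = Path("~/.dlt/pipelines/my_pipeline/schemas/my_source.schema.yaml").expanduser()

schema = yaml.safe_load(schema_file.read_text())
for table_name, table in schema.get("tables", {}).items():
    columns = list(table.get("columns", {}).keys())
    print(f"{table_name}: {columns}")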

What is ingested

  • DataFlow — one per dlt pipeline (pipeline_name)
  • DataJob — one per destination table (including auto-unnested child tables like orders__items)
  • Outlet lineage — DataJob → destination Dataset URNs (Postgres, BigQuery, Snowflake, etc.)
  • Inlet lineage — user-configured upstream Dataset URNs (dlt does not record source connection info)
  • Column-level lineage — for direct-copy pipelines with exactly one inlet and one outlet
  • DataProcessInstance — per-run history from _dlt_loads (opt-in)

How dlt stores metadata

dlt writes pipeline state to a local directory after each pipeline.run() call:

~/.dlt/pipelines/
  <pipeline_name>/
    schemas/
      <schema_name>.schema.yaml   # Table definitions with columns and types
    state.json                    # Destination type, dataset name, pipeline state
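
If you have not run a pipeline yet, any successful run() call creates these files. A minimal sketch with made-up names:

import dlt

# Creates ~/.dlt/pipelines/my_pipeline/ with schemas/ and state.json on first run.
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",   # becomes the DataFlow in DataHub
    destination="duckdb",          # recorded in state.json as the destination type
    dataset_name="my_dataset",     # recorded in state.json as the destination schema
)

pipeline.run(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    table_name="users",            # becomes a DataJob in DataHub
)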

Prerequisites

  • dlt pipeline(s) must have been run at least once (state files are created automatically)
  • The pipelines_dir must be accessible from where DataHub ingestion runs
  • For run history: the dlt package must be installed and destination credentials must be configured (see Capabilities → Run History below)

Where to find your pipelines_dir

Local / Quickstart

dlt's default location. Works out of the box:

pipelines_dir: "~/.dlt/pipelines"
CI/CD (GitHub Actions, Airflow, Jenkins)

dlt runs in one job and DataHub ingestion runs in another. Both must use the same path or shared storage:

pipelines_dir: "/data/dlt-pipelines"

Many dlt users already set a PIPELINES_DIR environment variable:

pipelines_dir: "${PIPELINES_DIR:-~/.dlt/pipelines}"
Kubernetes / Docker

dlt runs in one pod and DataHub in another. Mount the same PersistentVolumeClaim in both pods:

pipelines_dir: "/mnt/dlt-pipelines"

Required permissions

The connector reads local files only — no network permissions are needed for basic metadata extraction.

Feature                                        | Requirement
Pipeline metadata (DataFlow, DataJob, lineage) | Filesystem read access to pipelines_dir
Run history (_dlt_loads)                       | dlt package installed + destination credentials in ~/.dlt/secrets.toml

Install the Plugin

pip install 'acryl-datahub[dlt]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

# DataHub dlt (data load tool) connector recipe.
# Reads pipeline metadata from dlt's local state directory and emits
# DataFlow, DataJob, and lineage metadata to DataHub.
# See: https://datahubproject.io/docs/generated/ingestion/sources/dlt

source:
  type: dlt
  config:
    # Path to the dlt pipelines directory (~/.dlt/pipelines by default).
    # Override for CI/CD or containerized environments — must point to the
    # same directory that dlt writes to.
    pipelines_dir: "~/.dlt/pipelines"

    # Filter pipelines by name. Matched against the pipeline_name passed to dlt.pipeline().
    pipeline_pattern:
      allow:
        - ".*"
      # deny:
      #   - "^test_.*"

    # Emit outlet lineage from DataJobs to destination Dataset URNs.
    include_lineage: true

    # Maps dlt destination type names to DataHub platform config for lineage stitching.
    # Must exactly match the env/platform_instance used by your destination connector.
    # The database field is required for SQL destinations (Postgres, Redshift, etc.)
    # that use 3-part URNs (database.schema.table) — dlt only stores the schema name.
    destination_platform_map:
      postgres:
        database: my_database
        platform_instance: null
        env: PROD

      # bigquery:
      #   platform_instance: "my-gcp-project"
      #   env: PROD

      # snowflake:
      #   platform_instance: "my-account"
      #   env: PROD

    # Optional: manually specify upstream Dataset URNs.
    # dlt does not store source connection info, so inlet lineage must be configured here.
    # Use source_dataset_urns for pipeline-level inlets (e.g. REST API sources):
    # source_dataset_urns:
    #   my_pipeline:
    #     - "urn:li:dataset:(urn:li:dataPlatform:salesforce,contacts,PROD)"
    #
    # Use source_table_dataset_urns for table-level inlets (e.g. sql_database sources):
    # source_table_dataset_urns:
    #   my_pipeline:
    #     my_table:
    #       - "urn:li:dataset:(urn:li:dataPlatform:postgres,prod_db.public.my_table,PROD)"

    # Query _dlt_loads and emit DataProcessInstance run history. Disabled by default.
    # Requires dlt package + destination credentials in ~/.dlt/secrets.toml or env vars.
    include_run_history: false

    # Time window for run history (only used when include_run_history: true).
    run_history_config:
      start_time: "-7 days"
      # end_time: "now"

    # Remove DataFlow/DataJob entities when pipelines are deleted from pipelines_dir.
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true

    # Optional: distinguish multiple independent dlt installations in DataHub.
    # platform_instance: "my-dlt-instance"

    env: PROD

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field | Description
include_lineage
boolean
Whether to emit outlet lineage from dlt DataJobs to destination Dataset URNs. Constructs URNs using destination_platform_map. For lineage to stitch in DataHub, destination_platform_map env/instance must match the destination connector's configuration.
Default: True
include_run_history
boolean
Whether to query the destination's _dlt_loads table and emit DataProcessInstance run history. Requires the destination (e.g. Postgres, BigQuery) to be accessible and dlt credentials to be configured in ~/.dlt/secrets.toml. Disabled by default to avoid requiring destination access.
Default: False
pipelines_dir
string
Path to the dlt pipelines directory. dlt stores all pipeline state, schemas, and load packages here. Defaults to ~/.dlt/pipelines/ which is dlt's standard location. Override when pipelines are stored in a non-default location (e.g. /data/dlt-pipelines in a Docker environment).
Default: ~/.dlt/pipelines
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
env
string
The environment that all assets produced by this connector belong to
Default: PROD
destination_platform_map
map(str,DestinationPlatformConfig)
Per-destination URN construction config.

Maps a dlt destination type (e.g. "postgres", "bigquery") to a DataHub
platform instance and environment so that outlet Dataset URNs produced by
the dlt connector match those emitted by the destination's own DataHub
connector — enabling lineage stitching.

Example:
  postgres:
    platform_instance: null   # local dev, no instance
    env: "DEV"
  bigquery:
    platform_instance: "my-gcp-project"
    env: "PROD"
destination_platform_map.key.database
One of string, null
Database name prefix for outlet Dataset URN construction. Required for SQL destinations where the DataHub connector uses a 3-part name (database.schema.table), e.g. Postgres uses 'chess.chess_data.players_games'. dlt only stores the schema (dataset_name), not the database name, so this must be supplied manually. Leave null for destinations like BigQuery where the project is captured in platform_instance instead.
Default: None
destination_platform_map.key.platform_instance
One of string, null
DataHub platform instance for this destination. Must exactly match the platform_instance used when ingesting the destination platform (e.g. your Snowflake or BigQuery connector). Leave null if no platform_instance was configured for that connector.
Default: None
destination_platform_map.key.env
string
DataHub environment for this destination (PROD, DEV, STAGING, etc.). Must match the env used by the destination platform's own connector. One of ['CORP', 'DEV', 'EI', 'NON_PROD', 'PRD', 'PRE', 'PROD', 'QA', 'RVW', 'SANDBOX', 'SBX', 'SIT', 'STG', 'TEST', 'TST', 'UAT'].
Default: PROD
pipeline_pattern
AllowDenyPattern
Regex patterns (allow/deny) used to filter pipelines by name; matched against the pipeline_name passed to dlt.pipeline().
pipeline_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
run_history_config
DltRunHistoryConfig
Time window config for querying _dlt_loads run history.

Extends BaseTimeWindowConfig which provides start_time, end_time, and
bucket_duration — the DataHub standard for all time-windowed queries.
Use include_run_history on DltSourceConfig to enable/disable run history.
run_history_config.bucket_duration
Enum
One of: "DAY", "HOUR"
run_history_config.end_time
string(date-time)
Latest date of run history to consider. Default: Current time in UTC
run_history_config.start_time
string(date-time)
Earliest date of run history to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time, such as '-7 days' or '-7d'.
Default: None
source_dataset_urns
map(str,array)
Pipeline-level inlets. Maps a dlt pipeline_name to a list of upstream Dataset URNs attached as inputs to every DataJob in that pipeline (useful for REST API sources).
source_dataset_urns.key.string
string
source_table_dataset_urns
map(str,map)
Table-level inlets. Maps a dlt pipeline_name to a map of destination table name to upstream Dataset URNs (useful for sql_database sources where each table maps 1:1 to a source table).
source_table_dataset_urns.key.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful ingestion configuration. When enabled, automatically removes DataFlow/DataJob entities from DataHub if the corresponding dlt pipeline is deleted or no longer found in pipelines_dir.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Capabilities

Capability                        | Status                                 | Notes
DataFlow / DataJob                | ✅ Always                              | One DataFlow per pipeline, one DataJob per destination table
Outlet lineage                    | ✅ Always (when include_lineage: true) | Requires destination_platform_map to match your destination connector
Inlet lineage                     | ✅ User-configured                     | dlt does not store source identity; configure via source_dataset_urns
Column-level lineage              | ✅ Partial                             | Only for tables with exactly one inlet and one outlet (unambiguous 1:1 copy)
Run history (DataProcessInstance) | ⚙️ Opt-in                              | Requires include_run_history: true + dlt installed + destination credentials
Deletion detection                | ✅ Via stateful ingestion              | Removes DataFlow/DataJob when pipeline is deleted from pipelines_dir
Ownership                         | ❌ Not supported                       | dlt state does not contain owner information

Modeling Notes — Why one DataJob per destination table?

A single @dlt.resource can produce more than one destination table when it returns nested data. dlt unnests JSON arrays into child tables using a double-underscore convention (for example orders plus orders__items). The connector emits one DataJob per destination table rather than one DataJob per @dlt.resource, so:

  • Each destination table has a clean 1:1 outlet lineage entry (DataJob → Dataset URN on the destination).
  • Column-level lineage stays at table granularity, which downstream lineage queries already expect.
  • Browsing the dlt pipeline in DataHub shows every loaded table, not just the parent resource.
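
For example, a resource that yields nested rows (a minimal sketch; names are illustrative) loads two tables, orders and orders__items, so the connector emits two DataJobs for one resource:

import dlt

@dlt.resource(name="orders")
def orders():
    # The nested "items" list is unnested by dlt into a separate
    # orders__items child table at load time.
    yield {
        "order_id": 1,
        "customer": "acme",
        "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
    }

pipeline = dlt.pipeline(pipeline_name="shop", destination="duckdb", dataset_name="shop_data")
pipeline.run(orders())   # loads both `orders` and `orders__items`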

The trade-off is that the "this resource produced these tables" abstraction is not directly visible. To preserve that link:

  • Each child-table DataJob carries a parent_table custom property pointing to its parent (for example, parent_table: orders on orders__items).
  • All tables produced by the same resource share the same resource_name custom property.

A parent table and its child tables are siblings in DataHub's lineage graph (both are produced from the same source rows at load time, not parent → child), so no synthetic upstream/downstream lineage is added between them — that would misrepresent the actual data flow.

Lineage Stitching

For outlet lineage to connect to your destination's Dataset URNs, destination_platform_map must match the environment and platform instance used by your destination connector.

If your Postgres connector uses env: PROD and no platform_instance:

destination_platform_map:
  postgres:
    env: PROD
    platform_instance: null
    database: my_database   # required for 3-part URN: database.schema.table

Why database is needed for SQL destinations: dlt stores the schema name (dataset_name) but not the database name. Postgres URNs in DataHub use a 3-part format (database.schema.table). Supply database to match those URNs.
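
To see the shape of the resulting URN, the sketch below builds the same 3-part Postgres URN with DataHub's SDK helper. This illustrates the expected URN, not necessarily the connector's internal code path; the schema and table names are taken from the examples on this page.

from datahub.emitter.mce_builder import make_dataset_urn

# database -> destination_platform_map.postgres.database ("my_database")
# schema   -> dlt's dataset_name, stored in state.json ("chess_data")
# table    -> the destination table name ("players_games")
urn = make_dataset_urn(
    platform="postgres",
    name="my_database.chess_data.players_games",
    env="PROD",
)
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:postgres,my_database.chess_data.players_games,PROD)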

Cloud warehouses (BigQuery, Snowflake) use the project or account as platform_instance instead:

destination_platform_map:
  bigquery:
    platform_instance: "my-gcp-project"
    env: PROD
  snowflake:
    platform_instance: "my-account"
    env: PROD

Inlet Lineage (Upstream Sources)

dlt does not record where data came from — only where it went. To enable upstream lineage, manually configure Dataset URNs.

For REST API pipelines (all tables share the same source):

source_dataset_urns:
  my_pipeline:
    - "urn:li:dataset:(urn:li:dataPlatform:salesforce,contacts,PROD)"

For sql_database pipelines (each table maps 1:1 to a source table):

source_table_dataset_urns:
  my_pipeline:
    my_table:
      - "urn:li:dataset:(urn:li:dataPlatform:postgres,prod_db.public.my_table,PROD)"

Run History

Run history requires the dlt package to be installed in the DataHub ingestion environment and destination credentials to be accessible:

pip install "dlt[postgres]"   # or dlt[bigquery], dlt[snowflake], etc.

Credentials are read from ~/.dlt/secrets.toml (dlt's standard location) or environment variables:

export DESTINATION__POSTGRES__CREDENTIALS__HOST=localhost
export DESTINATION__POSTGRES__CREDENTIALS__DATABASE=my_db
export DESTINATION__POSTGRES__CREDENTIALS__USERNAME=dlt
export DESTINATION__POSTGRES__CREDENTIALS__PASSWORD=secret
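
A quick way to confirm that the ingestion environment can actually reach the destination is to attach to the pipeline and query _dlt_loads yourself. This is a sketch; the pipeline name is illustrative, and dlt resolves credentials the same way it does during loads.

import dlt

# Attaches to an existing pipeline using the state under ~/.dlt/pipelines/.
pipeline = dlt.attach(pipeline_name="my_pipeline")

# Credentials come from ~/.dlt/secrets.toml or DESTINATION__* environment variables.
with pipeline.sql_client() as client:
    rows = client.execute_sql(
        "SELECT load_id, status, inserted_at FROM _dlt_loads ORDER BY inserted_at DESC LIMIT 5"
    )
    for row in rows:
        print(row)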

The run_history_config time window is respected — configure start_time and end_time to limit which loads are ingested:

include_run_history: true
run_history_config:
  start_time: "-7 days"
  end_time: "now"

Limitations

Inlet lineage requires manual configuration

dlt's state files do not store source-system connection details. Inlet (upstream) Dataset URNs must be configured by the user via source_dataset_urns or source_table_dataset_urns.

Run history requires destination access

Querying _dlt_loads requires the dlt package and destination credentials. When dlt is not installed or credentials are missing, the connector still emits DataFlow / DataJob / outlet lineage but skips run history.

Ownership is not supported

dlt does not record pipeline owners.

Troubleshooting

No entities emitted

  • Check that pipelines_dir points to a directory containing per-pipeline subdirectories with schemas/ inside them (see the check after this list)
  • Run datahub ingest -c recipe.yml --test-source-connection to verify the path is readable
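
A small sanity check for the first point, run wherever DataHub ingestion executes (adjust the path to match your recipe's pipelines_dir):

from pathlib import Path

pipelines_dir = Path("~/.dlt/pipelines").expanduser()  # match your recipe's pipelines_dir

for pipeline_dir in sorted(p for p in pipelines_dir.iterdir() if p.is_dir()):
    has_schemas = (pipeline_dir / "schemas").is_dir()
    has_state = (pipeline_dir / "state.json").is_file()
    print(f"{pipeline_dir.name}: schemas/={has_schemas} state.json={has_state}")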

Lineage not stitching

  • Verify destination_platform_map env/instance/database match exactly what your destination connector uses
  • Check the destination Dataset URNs in DataHub and compare to what the dlt connector constructs
  • For Postgres: ensure database is set in destination_platform_map.postgres

Run history empty

  • Confirm include_run_history: true is set
  • Confirm the dlt package is installed: python -c "import dlt; print(dlt.__version__)"
  • Confirm destination credentials are in ~/.dlt/secrets.toml or environment variables
  • Check DataHub ingestion logs for warnings from the dlt connector

Nested child tables (e.g. orders__items)

dlt automatically unnests nested JSON into child tables using double-underscore naming. These appear as separate DataJobs with parent_table set in their custom properties. This is expected behavior.

Code Coordinates

  • Class Name: datahub.ingestion.source.dlt.dlt.DltSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for dlt, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.