dlt
Overview
dlt (data load tool) is an open-source Python ELT library for building data pipelines that load data from REST APIs, databases, and other sources into destinations like Postgres, BigQuery, Snowflake, and DuckDB.
The DataHub integration for dlt reads pipeline metadata from dlt's local state directory (~/.dlt/pipelines/) and emits DataFlow, DataJob, and lineage entities to DataHub. The connector also supports per-run history (DataProcessInstance) when the dlt package is installed and destination credentials are available, plus stateful deletion detection.
Concept Mapping
| dlt | DataHub |
|---|---|
| Pipeline (pipeline_name) | DataFlow |
| Resource / destination table | DataJob |
| Destination table | Dataset (DataJob output) |
| User-configured upstream | Dataset (DataJob input) |
| _dlt_loads row | DataProcessInstance |
Destination tables are mapped to Dataset URNs that match the destination platform's own DataHub connector (Postgres, BigQuery, etc.), enabling lineage stitching when both connectors run.
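For example, a table loaded into a Postgres destination should end up with the same Dataset URN that the Postgres connector would emit for it. Below is a minimal sketch using the DataHub SDK's URN helper; it is not the connector's internal code, and the database, schema, and table names are illustrative:

```python
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

# dlt state only records the schema (dataset_name) and table name; the
# database ("chess") and platform_instance come from destination_platform_map.
outlet_urn = make_dataset_urn_with_platform_instance(
    platform="postgres",
    name="chess.chess_data.players_games",  # database.schema.table
    platform_instance=None,
    env="PROD",
)
print(outlet_urn)
# urn:li:dataset:(urn:li:dataPlatform:postgres,chess.chess_data.players_games,PROD)
```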
Module dlt
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Detect Deleted Entities | ✅ | Enabled via stateful ingestion. |
| Extract Ownership | ❌ | Not supported. dlt pipeline state does not contain owner information. |
| Platform Instance | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Emits outlet lineage from dlt DataJobs to destination Dataset URNs. Configure destination_platform_map to match your destination connector's env/instance. |
Overview
The dlt module ingests pipeline metadata from dlt (data load tool) into DataHub. It reads dlt's local state directory directly — no live connection to dlt or the destination is required for basic metadata extraction. If the dlt Python package is installed, the connector uses the SDK for richer metadata; otherwise it falls back to parsing the YAML state files directly.
What is ingested
- DataFlow — one per dlt pipeline (pipeline_name)
- DataJob — one per destination table (including auto-unnested child tables like orders__items)
- Outlet lineage — DataJob → destination Dataset URNs (Postgres, BigQuery, Snowflake, etc.)
- Inlet lineage — user-configured upstream Dataset URNs (dlt does not record source connection info)
- Column-level lineage — for direct-copy pipelines with exactly one inlet and one outlet
- DataProcessInstance — per-run history from _dlt_loads (opt-in)
How dlt stores metadata
dlt writes pipeline state to a local directory after each pipeline.run() call:
~/.dlt/pipelines/
<pipeline_name>/
schemas/
<schema_name>.schema.yaml # Table definitions with columns and types
state.json # Destination type, dataset name, pipeline state
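For reference, a minimal dlt pipeline that populates this directory on its first run (pipeline, destination, and table names are illustrative):

```python
import dlt

# After run() completes, ~/.dlt/pipelines/chess_pipeline/ contains
# schemas/<schema_name>.schema.yaml and state.json, which is all the
# connector needs for basic metadata extraction.
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="chess_data",
)
pipeline.run(
    [{"player": "magnus", "rating": 2830}],
    table_name="players",
)
```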
Prerequisites
- dlt pipeline(s) must have been run at least once (state files are created automatically)
- The pipelines_dir must be accessible from where DataHub ingestion runs
- For run history: the dlt package must be installed and destination credentials must be configured (see Capabilities → Run History below)
Where to find your pipelines_dir
Local / Quickstart
dlt's default location. Works out of the box:
pipelines_dir: "~/.dlt/pipelines"
CI/CD (GitHub Actions, Airflow, Jenkins)
dlt runs in one job and DataHub ingestion runs in another. Both must use the same path or shared storage:
pipelines_dir: "/data/dlt-pipelines"
Many dlt users already set a PIPELINES_DIR environment variable:
pipelines_dir: "${PIPELINES_DIR:-~/.dlt/pipelines}"
Kubernetes / Docker
dlt runs in one pod and DataHub in another. Mount the same PersistentVolumeClaim in both pods:
pipelines_dir: "/mnt/dlt-pipelines"
Required permissions
The connector reads local files only — no network permissions are needed for basic metadata extraction.
| Feature | Requirement |
|---|---|
| Pipeline metadata (DataFlow, DataJob, lineage) | Filesystem read access to pipelines_dir |
| Run history (_dlt_loads) | dlt package installed + destination credentials in ~/.dlt/secrets.toml |
Install the Plugin
pip install 'acryl-datahub[dlt]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
# DataHub dlt (data load tool) connector recipe.
# Reads pipeline metadata from dlt's local state directory and emits
# DataFlow, DataJob, and lineage metadata to DataHub.
# See: https://datahubproject.io/docs/generated/ingestion/sources/dlt
source:
type: dlt
config:
# Path to the dlt pipelines directory (~/.dlt/pipelines by default).
# Override for CI/CD or containerized environments — must point to the
# same directory that dlt writes to.
pipelines_dir: "~/.dlt/pipelines"
# Filter pipelines by name. Matched against the pipeline_name passed to dlt.pipeline().
pipeline_pattern:
allow:
- ".*"
# deny:
# - "^test_.*"
# Emit outlet lineage from DataJobs to destination Dataset URNs.
include_lineage: true
# Maps dlt destination type names to DataHub platform config for lineage stitching.
# Must exactly match the env/platform_instance used by your destination connector.
# The database field is required for SQL destinations (Postgres, Redshift, etc.)
# that use 3-part URNs (database.schema.table) — dlt only stores the schema name.
destination_platform_map:
postgres:
database: my_database
platform_instance: null
env: PROD
# bigquery:
# platform_instance: "my-gcp-project"
# env: PROD
# snowflake:
# platform_instance: "my-account"
# env: PROD
# Optional: manually specify upstream Dataset URNs.
# dlt does not store source connection info, so inlet lineage must be configured here.
# Use source_dataset_urns for pipeline-level inlets (e.g. REST API sources):
# source_dataset_urns:
# my_pipeline:
# - "urn:li:dataset:(urn:li:dataPlatform:salesforce,contacts,PROD)"
#
# Use source_table_dataset_urns for table-level inlets (e.g. sql_database sources):
# source_table_dataset_urns:
# my_pipeline:
# my_table:
# - "urn:li:dataset:(urn:li:dataPlatform:postgres,prod_db.public.my_table,PROD)"
# Query _dlt_loads and emit DataProcessInstance run history. Disabled by default.
# Requires dlt package + destination credentials in ~/.dlt/secrets.toml or env vars.
include_run_history: false
# Time window for run history (only used when include_run_history: true).
run_history_config:
start_time: "-7 days"
# end_time: "now"
# Remove DataFlow/DataJob entities when pipelines are deleted from pipelines_dir.
stateful_ingestion:
enabled: true
remove_stale_metadata: true
# Optional: distinguish multiple independent dlt installations in DataHub.
# platform_instance: "my-dlt-instance"
env: PROD
sink:
type: datahub-rest
config:
server: "http://localhost:8080"
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
include_lineage boolean | Whether to emit outlet lineage from dlt DataJobs to destination Dataset URNs. Constructs URNs using destination_platform_map. For lineage to stitch in DataHub, destination_platform_map env/instance must match the destination connector's configuration. Default: True |
include_run_history boolean | Whether to query the destination's _dlt_loads table and emit DataProcessInstance run history. Requires the destination (e.g. Postgres, BigQuery) to be accessible and dlt credentials to be configured in ~/.dlt/secrets.toml. Disabled by default to avoid requiring destination access. Default: False |
pipelines_dir string | Path to the dlt pipelines directory. dlt stores all pipeline state, schemas, and load packages here. Defaults to ~/.dlt/pipelines/ which is dlt's standard location. Override when pipelines are stored in a non-default location (e.g. /data/dlt-pipelines in a Docker environment). Default: ~/.dlt/pipelines |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
env string | The environment that all assets produced by this connector belong to Default: PROD |
destination_platform_map map(str,DestinationPlatformConfig) | Per-destination URN construction config. Maps a dlt destination type (e.g. "postgres", "bigquery") to a DataHub platform instance and environment so that outlet Dataset URNs produced by the dlt connector match those emitted by the destination's own DataHub connector — enabling lineage stitching. Example: postgres: platform_instance: null # local dev, no instance env: "DEV" bigquery: platform_instance: "my-gcp-project" env: "PROD" |
destination_platform_map.key.database One of string, null | Database name prefix for outlet Dataset URN construction. Required for SQL destinations where the DataHub connector uses a 3-part name (database.schema.table), e.g. Postgres uses 'chess.chess_data.players_games'. dlt only stores the schema (dataset_name), not the database name, so this must be supplied manually. Leave null for destinations like BigQuery where the project is captured in platform_instance instead. Default: None |
destination_platform_map.key.platform_instance One of string, null | DataHub platform instance for this destination. Must exactly match the platform_instance used when ingesting the destination platform (e.g. your Snowflake or BigQuery connector). Leave null if no platform_instance was configured for that connector. Default: None |
destination_platform_map.key.env string | DataHub environment for this destination (PROD, DEV, STAGING, etc.). Must match the env used by the destination platform's own connector. One of ['CORP', 'DEV', 'EI', 'NON_PROD', 'PRD', 'PRE', 'PROD', 'QA', 'RVW', 'SANDBOX', 'SBX', 'SIT', 'STG', 'TEST', 'TST', 'UAT']. Default: PROD |
pipeline_pattern AllowDenyPattern | A class to store allow deny regexes |
pipeline_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
run_history_config DltRunHistoryConfig | Time window config for querying _dlt_loads run history. Extends BaseTimeWindowConfig which provides start_time, end_time, and bucket_duration — the DataHub standard for all time-windowed queries. Use include_run_history on DltSourceConfig to enable/disable run history. |
run_history_config.bucket_duration Enum | One of: "DAY", "HOUR" |
run_history_config.end_time string(date-time) | Latest date of lineage/usage to consider. Default: Current time in UTC |
run_history_config.start_time string(date-time) | Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'. Default: None |
source_dataset_urns map(str,array) | Optional: manually specify inlet (upstream) Dataset URNs per pipeline. All listed URNs are applied as inlets to every DataJob in the pipeline. Key is the pipeline_name; value is a list of Dataset URN strings. |
source_dataset_urns.key array(string) | List of upstream Dataset URN strings applied to every DataJob in the pipeline named by the key. |
source_table_dataset_urns map(str,map) | Optional: manually specify inlet Dataset URNs per pipeline per table. Outer key is pipeline_name; inner key is table_name; value is a list of URNs. |
source_table_dataset_urns.key map(str,array(string)) | Map of table_name to a list of upstream Dataset URN strings for that table. |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Stateful ingestion configuration. When enabled, automatically removes DataFlow/DataJob entities from DataHub if the corresponding dlt pipeline is deleted or no longer found in pipelines_dir. Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"BucketDuration": {
"enum": [
"DAY",
"HOUR"
],
"title": "BucketDuration",
"type": "string"
},
"DestinationPlatformConfig": {
"additionalProperties": false,
"description": "Per-destination URN construction config.\n\nMaps a dlt destination type (e.g. \"postgres\", \"bigquery\") to a DataHub\nplatform instance and environment so that outlet Dataset URNs produced by\nthe dlt connector match those emitted by the destination's own DataHub\nconnector \u2014 enabling lineage stitching.\n\nExample:\n postgres:\n platform_instance: null # local dev, no instance\n env: \"DEV\"\n bigquery:\n platform_instance: \"my-gcp-project\"\n env: \"PROD\"",
"properties": {
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Database name prefix for outlet Dataset URN construction. Required for SQL destinations where the DataHub connector uses a 3-part name (database.schema.table), e.g. Postgres uses 'chess.chess_data.players_games'. dlt only stores the schema (dataset_name), not the database name, so this must be supplied manually. Leave null for destinations like BigQuery where the project is captured in platform_instance instead.",
"title": "Database"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "DataHub platform instance for this destination. Must exactly match the platform_instance used when ingesting the destination platform (e.g. your Snowflake or BigQuery connector). Leave null if no platform_instance was configured for that connector.",
"title": "Platform Instance"
},
"env": {
"default": "PROD",
"description": "DataHub environment for this destination (PROD, DEV, STAGING, etc.). Must match the env used by the destination platform's own connector. One of ['CORP', 'DEV', 'EI', 'NON_PROD', 'PRD', 'PRE', 'PROD', 'QA', 'RVW', 'SANDBOX', 'SBX', 'SIT', 'STG', 'TEST', 'TST', 'UAT'].",
"title": "Env",
"type": "string"
}
},
"title": "DestinationPlatformConfig",
"type": "object"
},
"DltRunHistoryConfig": {
"additionalProperties": false,
"description": "Time window config for querying _dlt_loads run history.\n\nExtends BaseTimeWindowConfig which provides start_time, end_time, and\nbucket_duration \u2014 the DataHub standard for all time-windowed queries.\nUse include_run_history on DltSourceConfig to enable/disable run history.",
"properties": {
"bucket_duration": {
"$ref": "#/$defs/BucketDuration",
"default": "DAY",
"description": "Size of the time window to aggregate usage stats."
},
"end_time": {
"description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
"format": "date-time",
"title": "End Time",
"type": "string"
},
"start_time": {
"default": null,
"description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
"format": "date-time",
"title": "Start Time",
"type": "string"
}
},
"title": "DltRunHistoryConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for the dlt DataHub source connector.\n\nReads pipeline metadata from dlt's local state directory\n(~/.dlt/pipelines/ by default) and emits DataFlow, DataJob, and lineage\nmetadata to DataHub.",
"properties": {
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful ingestion configuration. When enabled, automatically removes DataFlow/DataJob entities from DataHub if the corresponding dlt pipeline is deleted or no longer found in pipelines_dir."
},
"pipelines_dir": {
"default": "~/.dlt/pipelines",
"description": "Path to the dlt pipelines directory. dlt stores all pipeline state, schemas, and load packages here. Defaults to ~/.dlt/pipelines/ which is dlt's standard location. Override when pipelines are stored in a non-default location (e.g. /data/dlt-pipelines in a Docker environment).",
"title": "Pipelines Dir",
"type": "string"
},
"pipeline_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter pipelines by name. Matched against pipeline_name (the value passed to dlt.pipeline()). Example: allow ['^prod_.*'] to only ingest production pipelines."
},
"include_run_history": {
"default": false,
"description": "Whether to query the destination's _dlt_loads table and emit DataProcessInstance run history. Requires the destination (e.g. Postgres, BigQuery) to be accessible and dlt credentials to be configured in ~/.dlt/secrets.toml. Disabled by default to avoid requiring destination access.",
"title": "Include Run History",
"type": "boolean"
},
"run_history_config": {
"$ref": "#/$defs/DltRunHistoryConfig",
"description": "Time window for run history extraction. Uses standard DataHub BaseTimeWindowConfig \u2014 supports relative times like '-7 days' or absolute ISO timestamps."
},
"include_lineage": {
"default": true,
"description": "Whether to emit outlet lineage from dlt DataJobs to destination Dataset URNs. Constructs URNs using destination_platform_map. For lineage to stitch in DataHub, destination_platform_map env/instance must match the destination connector's configuration.",
"title": "Include Lineage",
"type": "boolean"
},
"destination_platform_map": {
"additionalProperties": {
"$ref": "#/$defs/DestinationPlatformConfig"
},
"description": "Maps dlt destination type names to DataHub platform configuration. Used to construct correct Dataset URNs for lineage. The destination type is the value passed to dlt.pipeline(destination='...'). Example:\n postgres:\n platform_instance: null\n env: DEV\n bigquery:\n platform_instance: my-gcp-project\n env: PROD",
"title": "Destination Platform Map",
"type": "object"
},
"source_dataset_urns": {
"additionalProperties": {
"items": {
"type": "string"
},
"type": "array"
},
"description": "Optional: manually specify inlet (upstream) Dataset URNs per pipeline. All listed URNs are applied as inlets to every DataJob in the pipeline. Use for REST API sources where every task shares the same upstream source. For SQL sources where each task reads from exactly one table, use source_table_dataset_urns instead. Key is the pipeline_name; value is a list of Dataset URN strings. Example:\n crm_sync:\n - 'urn:li:dataset:(urn:li:dataPlatform:postgres,prod.crm.customers,PROD)'",
"title": "Source Dataset Urns",
"type": "object"
},
"source_table_dataset_urns": {
"additionalProperties": {
"additionalProperties": {
"items": {
"type": "string"
},
"type": "array"
},
"type": "object"
},
"description": "Optional: manually specify inlet Dataset URNs per pipeline per table. Use for sql_database sources where each DataJob reads from exactly one source table and 1:1 lineage is desired. Outer key is pipeline_name; inner key is table_name; value is a list of URNs. Example:\n my_pipeline:\n my_table:\n - 'urn:li:dataset:(urn:li:dataPlatform:postgres,prod_db.public.my_table,PROD)'\n other_table:\n - 'urn:li:dataset:(urn:li:dataPlatform:postgres,prod_db.public.other_table,PROD)'",
"title": "Source Table Dataset Urns",
"type": "object"
}
},
"title": "DltSourceConfig",
"type": "object"
}
Capabilities
| Capability | Status | Notes |
|---|---|---|
| DataFlow / DataJob | ✅ Always | One DataFlow per pipeline, one DataJob per destination table |
| Outlet lineage | ✅ Always (when include_lineage: true) | Requires destination_platform_map to match your destination connector |
| Inlet lineage | ✅ User-configured | dlt does not store source identity; configure via source_dataset_urns |
| Column-level lineage | ✅ Partial | Only for tables with exactly one inlet and one outlet (unambiguous 1:1 copy) |
| Run history (DataProcessInstance) | ⚙️ Opt-in | Requires include_run_history: true + dlt installed + destination credentials |
| Deletion detection | ✅ Via stateful ingestion | Removes DataFlow/DataJob when pipeline is deleted from pipelines_dir |
| Ownership | ❌ Not supported | dlt state does not contain owner information |
Modeling Notes — Why one DataJob per destination table?
A single @dlt.resource can produce more than one destination table when it returns nested data. dlt unnests JSON arrays into child tables using a double-underscore convention (for example orders plus orders__items). The connector emits one DataJob per destination table rather than one DataJob per @dlt.resource, so:
- Each destination table has a clean 1:1 outlet lineage entry (DataJob → Dataset URN on the destination).
- Column-level lineage stays at table granularity, which downstream lineage queries already expect.
- Browsing the dlt pipeline in DataHub shows every loaded table, not just the parent resource.
The trade-off is that the "this resource produced these tables" abstraction is not directly visible. To preserve that link:
- Each child-table DataJob carries a parent_table custom property pointing to its parent (for example, parent_table: orders on orders__items).
- All tables produced by the same resource share the same resource_name custom property.
The two are siblings in DataHub's lineage graph (both come from the same source rows at load time, not parent → child), so no synthetic upstream/downstream lineage is added between them — that would misrepresent the actual data flow.
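To make the convention concrete, here is a small dlt sketch (resource and pipeline names are illustrative) in which one resource yields nested rows and dlt creates a parent plus a child table:

```python
import dlt

@dlt.resource(name="orders")
def orders():
    # The nested "items" list is unnested by dlt into a child table
    # named orders__items (double-underscore convention).
    yield {
        "order_id": 1,
        "items": [
            {"sku": "A-1", "qty": 2},
            {"sku": "B-7", "qty": 1},
        ],
    }

pipeline = dlt.pipeline(
    pipeline_name="shop", destination="duckdb", dataset_name="shop_data"
)
pipeline.run(orders())

# The destination now holds two tables: orders and orders__items.
# The connector emits one DataJob per table; orders__items carries the
# custom property parent_table: orders, and both share resource_name: orders.
```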
Lineage Stitching
For outlet lineage to connect to your destination's Dataset URNs, destination_platform_map must match the environment and platform instance used by your destination connector.
If your Postgres connector uses env: PROD and no platform_instance:
destination_platform_map:
postgres:
env: PROD
platform_instance: null
database: my_database # required for 3-part URN: database.schema.table
Why database is needed for SQL destinations: dlt stores the schema name (dataset_name) but not the database name. Postgres URNs in DataHub use a 3-part format (database.schema.table). Supply database to match those URNs.
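With the configuration above (all names illustrative), a table my_table loaded into the dlt dataset my_schema would be emitted with an outlet URN of the form urn:li:dataset:(urn:li:dataPlatform:postgres,my_database.my_schema.my_table,PROD), matching the 3-part URNs produced by the Postgres connector.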
Cloud warehouses (BigQuery, Snowflake) use the project or account as platform_instance instead:
destination_platform_map:
bigquery:
platform_instance: "my-gcp-project"
env: PROD
snowflake:
platform_instance: "my-account"
env: PROD
Inlet Lineage (Upstream Sources)
dlt does not record where data came from — only where it went. To enable upstream lineage, manually configure Dataset URNs.
For REST API pipelines (all tables share the same source):
source_dataset_urns:
my_pipeline:
- "urn:li:dataset:(urn:li:dataPlatform:salesforce,contacts,PROD)"
For sql_database pipelines (each table maps 1:1 to a source table):
source_table_dataset_urns:
my_pipeline:
my_table:
- "urn:li:dataset:(urn:li:dataPlatform:postgres,prod_db.public.my_table,PROD)"
Run History
Run history requires the dlt package to be installed in the DataHub ingestion environment and destination credentials to be accessible:
pip install "dlt[postgres]" # or dlt[bigquery], dlt[snowflake], etc.
Credentials are read from ~/.dlt/secrets.toml (dlt's standard location) or environment variables:
export DESTINATION__POSTGRES__CREDENTIALS__HOST=localhost
export DESTINATION__POSTGRES__CREDENTIALS__DATABASE=my_db
export DESTINATION__POSTGRES__CREDENTIALS__USERNAME=dlt
export DESTINATION__POSTGRES__CREDENTIALS__PASSWORD=secret
The run_history_config time window is respected — configure start_time and end_time to limit which loads are ingested:
include_run_history: true
run_history_config:
start_time: "-7 days"
end_time: "now"
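Before enabling include_run_history, you can sanity-check that dlt can reach the destination with the same credentials. The following is a rough sketch using the dlt SDK (the pipeline name is illustrative, and this is not necessarily the connector's exact query):

```python
import dlt

# Attach to an existing pipeline's local state; no data is loaded.
# Pass pipelines_dir=... if your state lives outside ~/.dlt/pipelines.
pipeline = dlt.attach(pipeline_name="my_pipeline")

# sql_client() uses the credentials dlt resolves from ~/.dlt/secrets.toml
# or DESTINATION__* environment variables.
with pipeline.sql_client() as client:
    loads_table = client.make_qualified_table_name("_dlt_loads")
    rows = client.execute_sql(
        f"SELECT load_id, schema_name, status, inserted_at "
        f"FROM {loads_table} ORDER BY inserted_at DESC LIMIT 10"
    )
    for load_id, schema_name, status, inserted_at in rows:
        print(load_id, status, inserted_at)
```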
Limitations
Inlet lineage requires manual configuration
dlt's state files do not store source-system connection details. Inlet (upstream) Dataset URNs must be configured by the user via source_dataset_urns or source_table_dataset_urns.
Run history requires destination access
Querying _dlt_loads requires the dlt package and destination credentials. When dlt is not installed or credentials are missing, the connector still emits DataFlow / DataJob / outlet lineage but skips run history.
Ownership is not supported
dlt does not record pipeline owners.
Troubleshooting
No entities emitted
- Check that pipelines_dir points to a directory containing subdirectories with schemas/ inside them
- Run datahub ingest -c recipe.yml --test-source-connection to verify the path is readable
Lineage not stitching
- Verify destination_platform_map env/instance/database match exactly what your destination connector uses
- Check the destination Dataset URNs in DataHub and compare to what the dlt connector constructs
- For Postgres: ensure database is set in destination_platform_map.postgres
Run history empty
- Confirm include_run_history: true is set
- Confirm the dlt package is installed: python -c "import dlt; print(dlt.__version__)"
- Confirm destination credentials are in ~/.dlt/secrets.toml or environment variables
- Check DataHub ingestion logs for warnings from the dlt connector
Nested child tables (e.g. orders__items)
dlt automatically unnests nested JSON into child tables using double-underscore naming. These appear as separate DataJobs with parent_table set in their custom properties. This is expected behavior.
Code Coordinates
- Class Name: datahub.ingestion.source.dlt.dlt.DltSource
- Browse on GitHub
If you've got any questions on configuring ingestion for dlt, feel free to ping us on our Slack.
This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.
Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.