datahub datapack

Experimental: This command is under active development. Command surface and behavior may change in future releases.

The datapack command manages curated collections of metadata (data packs) that can be loaded into DataHub for demos, testing, or bootstrapping new instances. Data packs contain pre-built MCPs (Metadata Change Proposals) with datasets, dashboards, lineage, ownership, glossary terms, and more.

Quick Start

# List available data packs
datahub datapack list

# Load the showcase-ecommerce pack (rich demo with 1000+ entities)
datahub datapack load showcase-ecommerce

# Remove it when done
datahub datapack unload showcase-ecommerce

Commands

list

List all available data packs from the registry.

datahub datapack list [--tag TAG] [--format table|json]

Options:

--tag - Filter packs by tag (e.g., demo, snowflake, lineage)
--format - Output format: table (default) or json

Example:

datahub datapack list --tag demo

# Name                Description                              Size     Trust     Tags
# bootstrap           Default DataHub bootstrap data...        ~100 KB  verified  demo, bootstrap
# showcase-ecommerce  Rich demo dataset with 1049 entities...  ~2.7 MB  verified  demo, rich, snowflake, ...

info

Show detailed information about a specific data pack.

datahub datapack info NAME

Example:

datahub datapack info showcase-ecommerce

# Name:            showcase-ecommerce
# Description:     Rich demo dataset with 1049 entities...
# URL:             https://raw.githubusercontent.com/datahub-project/static-assets/...
# Size:            ~2.7 MB
# Trust:           verified
# Tags:            demo, rich, snowflake, looker, powerbi, tableau, lineage, governance
# Reference time:  2025-07-08T16:15:42.552000+00:00
# Cached:          yes
# Loaded:          yes (run_id=datapack-showcase-ecommerce-..., at 2026-03-22T...)

load

Download and load a data pack into DataHub.

datahub datapack load NAME [OPTIONS]

Options:

--url URL - Load from an arbitrary URL instead of the registry. Supports http://, https://, and file:// schemes.
--dry-run - Preview what would be loaded without ingesting.
--no-cache - Force re-download even if the pack is cached.
--force - Override server version compatibility checks.
--as-of DATETIME - Set the target time for time-shifting (default: current time). Useful for making historical data appear fresh.
--no-time-shift - Load with original timestamps (skip time-shifting).
--trust-community - Allow loading community-contributed packs.
--trust-custom - Allow loading from unverified URLs.

What happens during load:

1. Registry lookup - Resolves the pack name to a URL
2. Download & cache - Downloads the MCP file (cached in ~/.datahub/datapack-cache/)
3. Schema downshift - Queries the server's entity registry to filter out unsupported aspects (prevents errors from Cloud-only features on OSS)
4. Referential integrity check - Warns about dangling URN references
5. Time-shifting - Rebases timestamps so the data appears fresh
6. Ingestion - Runs the MCP file through the standard ingestion pipeline
7. Load tracking - Records the run ID for clean unload

Examples:

# Load the showcase-ecommerce pack
datahub datapack load showcase-ecommerce

# Load from a local file
datahub datapack load my-data --url file:///path/to/data.json --trust-custom

# Load with timestamps anchored to a specific date
datahub datapack load showcase-ecommerce --as-of 2025-06-15

# Preview without loading
datahub datapack load showcase-ecommerce --dry-run

unload

Remove all entities that were loaded by a data pack.

datahub datapack unload NAME [--hard] [--dry-run]

Uses the ingestion rollback infrastructure to revert the load. Only works for packs loaded via datahub datapack load.

Options:

--hard - Hard-delete entities (irreversible). Default is soft-delete (reversible).
--dry-run - Show what would be deleted without deleting.

Examples:

# Soft-delete (reversible)
datahub datapack unload showcase-ecommerce

# Hard-delete (irreversible)
datahub datapack unload showcase-ecommerce --hard

Built-in Data Packs

Pack	Description	Entities	Platforms
bootstrap	Lightweight bootstrap data with basic datasets, dashboards, users, and tags	~50	Kafka, Hive, HDFS
showcase-ecommerce	Rich e-commerce demo with lineage, governance, glossary, domains, and data products	~1,050	Snowflake, Looker, PowerBI, Tableau, dbt, Spark, PostgreSQL, S3

Trust Model

Data packs have three trust tiers:

Verified - Published by the DataHub project. Loads without prompting.
Community - Third-party contributed packs in the registry. Requires --trust-community.
Custom - Loaded via --url from an arbitrary source. Requires --trust-custom.

SHA256 checksums in the registry provide integrity verification for community packs. Verified packs from trusted origins (GitHub/datahub-project) skip checksum verification.
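
The tier rules and checksum check above can be illustrated with a short Python sketch. This is not the CLI's implementation: the function names and the lowercase tier strings are hypothetical, chosen only to mirror the three tiers and the --trust-community / --trust-custom flags described in this section.

```python
import hashlib

# Hypothetical tier labels mirroring the three tiers described above.
VERIFIED, COMMUNITY, CUSTOM = "verified", "community", "custom"


def is_load_allowed(tier: str, trust_community: bool = False,
                    trust_custom: bool = False) -> bool:
    """Return True if a pack of this tier may load with the given flags."""
    if tier == VERIFIED:
        return True  # verified packs load without prompting
    if tier == COMMUNITY:
        return trust_community  # needs --trust-community
    if tier == CUSTOM:
        return trust_custom  # needs --trust-custom
    return False  # assumption: unknown tiers are rejected


def checksum_matches(payload: bytes, expected_sha256: str) -> bool:
    """Compare the SHA256 hex digest of the payload with the registry value."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256.lower()
```

In this sketch a community pack would be ingested only if both is_load_allowed(...) and checksum_matches(...) return True.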

Ingestion Source

Data packs can also be loaded via standard ingestion recipes using the demo-data source type:

source:
  type: demo-data
  config:
    pack_name: "showcase-ecommerce"
    # OR: pack_url: "https://example.com/data.json"
    no_time_shift: false
    as_of: "2025-06-15T00:00:00Z"
    trust_community: false
    trust_custom: false
    no_cache: false

With no configuration, demo-data loads the bootstrap pack (backward compatible with existing recipes). Specify pack_name to load a different pack.

This is useful for scheduled or automated loading via datahub ingest.

Technical Details

Multi-File Index

Data packs can consist of multiple files, referenced by an index.json:

{
  "files": [
    { "path": "01-definitions.json", "wait_for_completion": true },
    { "path": "02-data.json" }
  ]
}

Files are loaded sequentially via OpenAPI async_batch ingestion. When wait_for_completion is set on an entry, that file is emitted in async_wait mode, so the loader blocks until the OpenAPI trace reports completion before starting the next file. All other files are emitted in async mode (same batch sink, no trace wait). This respects ordering dependencies: for example, structured property definitions must be persisted before assignments in a later file can reference them.

Each entry in files can be a plain string (filename) or an object with path and optional wait_for_completion.
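
As a minimal sketch of how a reader of index.json might handle the two entry shapes, the Python below normalizes each entry to an object and picks the emit mode. The function names normalize_index and emit_mode are hypothetical. Only the field names (files, path, wait_for_completion) and the mode names (async, async_wait) come from the description above.

```python
import json
from typing import Any, Dict, List


def normalize_index(index_text: str) -> List[Dict[str, Any]]:
    """Turn every entry of index.json into {"path", "wait_for_completion"}."""
    index = json.loads(index_text)
    entries: List[Dict[str, Any]] = []
    for item in index["files"]:
        if isinstance(item, str):
            # Plain string form: just a filename, no waiting requested.
            entries.append({"path": item, "wait_for_completion": False})
        else:
            entries.append({
                "path": item["path"],
                "wait_for_completion": bool(item.get("wait_for_completion", False)),
            })
    return entries


def emit_mode(entry: Dict[str, Any]) -> str:
    """Pick the mode for one file, following the rule described above."""
    return "async_wait" if entry["wait_for_completion"] else "async"
```

Normalizing up front keeps the loading loop simple: it iterates over the entries in order and never needs to check which shape an entry had.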

Schema Downshift

When loading a pack, the CLI queries the server's entity registry (/openapi/v1/registry/models/entity/specifications) to discover which (entityType, aspectName) pairs are supported. MCPs with unsupported aspects are automatically filtered out. This allows a single data pack to work across both DataHub OSS and Acryl Cloud, with Cloud-only aspects gracefully skipped on OSS.
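
The filtering step can be sketched in Python as a set-membership check. This is an illustration under assumptions, not the CLI's code: the supported set is built by hand here because the response shape of the registry endpoint is not shown on this page, and filter_supported is a hypothetical helper. The MCP dictionaries use the entityType and aspectName fields found in MCP JSON files.

```python
from typing import Any, Dict, List, Set, Tuple

Mcp = Dict[str, Any]


def filter_supported(
    mcps: List[Mcp], supported: Set[Tuple[str, str]]
) -> Tuple[List[Mcp], List[Mcp]]:
    """Split MCPs into (kept, dropped) by (entityType, aspectName) support."""
    kept: List[Mcp] = []
    dropped: List[Mcp] = []
    for mcp in mcps:
        key = (mcp["entityType"], mcp["aspectName"])
        if key in supported:
            kept.append(mcp)
        else:
            dropped.append(mcp)  # e.g. a Cloud-only aspect on an OSS server
    return kept, dropped
```

Returning the dropped MCPs alongside the kept ones makes it easy to report how many aspects were skipped.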

Time-Shifting

Each pack has a reference_timestamp recording when it was captured. It is set to the maximum timestamp in the data, so the newest data appears as "just now" after shifting. During load, all temporal fields (timestamps in system metadata, timeseries aspects, and audit stamps) are shifted by now - reference_timestamp so the data appears fresh. Use --as-of to anchor the shift to a different target time instead of now, or --no-time-shift to keep the original timestamps.
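
The shift arithmetic can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes timestamps are stored as epoch milliseconds. It shows only the offset computation, not how the loader locates temporal fields inside aspects.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def shift_offset_ms(reference: datetime, as_of: Optional[datetime] = None) -> int:
    """Offset in milliseconds: target time minus the pack's reference time."""
    # The target defaults to the current time, like omitting --as-of.
    target = datetime.now(timezone.utc) if as_of is None else as_of
    # Integer division by a 1 ms timedelta avoids float rounding.
    return (target - reference) // timedelta(milliseconds=1)


def shift_timestamp_ms(ts_ms: int, offset_ms: int) -> int:
    """Apply the same offset to one epoch-millisecond timestamp."""
    return ts_ms + offset_ms
```

Because every field receives the same offset, the relative spacing between events in the pack is preserved. Only the anchor moves.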

Caching

Pack files: Cached in ~/.datahub/datapack-cache/ by URL hash
Registry: Cached in ~/.datahub/datapack-registry-cache.json with 1-hour TTL
Schema: Cached in ~/.datahub/openapi_schema_cache/ by server URL + commit hash

Use --no-cache on load to force re-download.
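
A rough Python sketch of the two caching behaviors described above: a cache path derived from a URL hash, and a TTL freshness check. The hash algorithm (SHA256), the .json file suffix, and the function names are assumptions made for illustration. This page states only that pack files are keyed "by URL hash" and that the registry cache has a 1-hour TTL.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional

CACHE_DIR = Path.home() / ".datahub" / "datapack-cache"
REGISTRY_TTL_SECONDS = 3600  # 1-hour TTL, as stated above


def cache_path_for(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map a pack URL to a cache file name (SHA256 is an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.json"


def is_fresh(path: Path, ttl_seconds: int = REGISTRY_TTL_SECONDS,
             now: Optional[float] = None) -> bool:
    """Return True if the file exists and is younger than the TTL."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < ttl_seconds
```

In this sketch, forcing a re-download amounts to ignoring the cached file even when it exists.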
