Hex

Overview

Hex is a collaborative data workspace where teams build interactive notebooks combining SQL, Python, and visualizations.

The DataHub integration emits Hex Projects (Dashboards) and Components (Charts) along with workspace containers, ownership, tags from Collections/Status/Categories, usage statistics, and upstream lineage to the warehouses Hex queries. It also emits per-project run history, and per-project context documents for AI agent retrieval (opt-in via include_context_documents). Upstream lineage is produced directly from Hex's own APIs (SQL parsing by default; Hex's queriedTables API can be enabled on Hex Enterprise workspaces) — no warehouse ingestion dependency is required.

Concept Mapping

| Hex Concept | DataHub Concept | Notes |
| --- | --- | --- |
| "hex" | Data Platform | |
| Workspace | Container | Parent container for all projects and components in the workspace. |
| Project | Dashboard | Subtype Project. Carries usage statistics, last refresh time from run history, and upstream lineage edges to warehouse datasets. |
| Component | Chart | Subtype Component. Reusable shared cell group with its own visualization; linked to importing projects via DashboardInfo.charts. |
| Collection | Tag | Emitted as hex:collection:<name> when collections_as_tags is enabled. |
| Status | Tag | Emitted as hex:status:<name> when status_as_tag is enabled. |
| Category | Tag | Emitted as hex:category:<name> when categories_as_tags is enabled. |
| Project Doc | Document | One per Project and per Component when include_context_documents is enabled. Hidden from global search; linked to the Dashboard/Chart for AI agents. |
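
The tag naming in the table can be illustrated with a short sketch. The helper name hex_tag_urn is hypothetical, and wrapping the name in a urn:li:tag: URN is an assumption based on DataHub's general tag URN format; the connector may normalize names differently.

```python
# Hypothetical helper: builds a DataHub tag URN from the
# hex:<kind>:<name> naming described in the Concept Mapping table.
VALID_KINDS = {"collection", "status", "category"}


def hex_tag_urn(kind: str, name: str) -> str:
    """Return a tag URN such as urn:li:tag:hex:status:<name>."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown Hex tag kind: {kind}")
    # Assumption: the name is used verbatim, with no extra normalization.
    return f"urn:li:tag:hex:{kind}:{name}"


print(hex_tag_urn("collection", "Finance"))
```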

Other Hex concepts are not mapped to DataHub entities yet.

Module hex

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Asset Containers | | Enabled by default. |
| Column-level Lineage | | Column-level lineage via SQL parsing when datahub-api is configured. The graph-backed SchemaResolver fetches table schemas from DataHub on demand to expand SELECT * and resolve column references. Degrades gracefully to dataset-level lineage when datahub-api is absent. |
| Dataset Usage | | Supported by default. Supported for types - Project. |
| Descriptions | | Supported by default. |
| Detect Deleted Entities | | Enabled by default via stateful ingestion. |
| Extract Ownership | | Supported by default. |
| Extract Tags | | Status, categories, and collections emitted as tags. |
| Platform Instance | | Enabled by default. |
| Table-Level Lineage | | Enabled by default. Uses SQL parsing from cells (all Hex tiers) unless use_queried_tables_lineage is enabled, in which case Hex's queriedTables API (Hex Enterprise workspaces) is the primary source. Applied to both projects and components. Unpublished entities always use SQL parsing. No warehouse ingestion dependency required. |

Overview

The hex module ingests Hex Projects, Components, workspaces, and upstream lineage directly from the Hex REST API.

Prerequisites

Workspace Name

Open the workspace switcher dropdown in the top-left corner of the Hex app — the workspace name (and its slug) is shown next to each workspace entry. Use the slug value for workspace_name.

Authentication

The connector authenticates with a Hex Workspace token issued from Settings → API → Workspace tokens. Grant the token these read-only scopes:

  • Projects → Read access — list projects/components and read their detail and run history.
  • Cells → Read access — read SQL cells for lineage and context documents.
  • Read project queried tables — lineage from Hex's pre-resolved table list. Available on Hex Enterprise workspaces only; skip this scope on lower Hex tiers — the connector falls back to SQL parsing.
  • Data connections → Read access — map each Hex connection to its warehouse platform/database/schema.
  • Users → Read access (optional): only needed to auto-discover the workspace (org) UUID used in external URLs. To skip this scope, set workspace_id in the recipe instead.

No write scopes are required — the connector never modifies state in Hex.

Personal Access Tokens (PATs) also work but ingest with the issuing user's permissions, so projects the user cannot see in Hex will be skipped. Workspace tokens are recommended for production ingestion. See the Hex API overview for the full list of token types.
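
As a rough way to check a token before running ingestion, the sketch below builds an authenticated request for the /v1/data-connections endpoint mentioned elsewhere on this page. The helper name build_hex_request is hypothetical, and passing the token as a Bearer credential is an assumption about the Hex API rather than something this page states; the request is only constructed, not sent.

```python
import urllib.request

# Base URL quoted in the base_url field description in Config Details.
DEFAULT_BASE_URL = "https://app.hex.tech/api/v1"


def build_hex_request(
    path: str, token: str, base_url: str = DEFAULT_BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for a Hex API path."""
    url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
    headers = {
        # Assumption: Hex accepts the token as a standard Bearer credential.
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_hex_request("data-connections", token="example-token")
print(request.full_url)
```

Sending the request with urllib.request.urlopen(request) and a real token would then show whether the Data connections scope is granted; an authorization error would point to the token or a missing scope.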

Lineage URN Alignment

Upstream URNs are built from Hex's /v1/data-connections response — platform, database, and schema all come from there. Configure connection_platform_map (keyed by Hex dataConnectionId) in two cases:

  • the upstream warehouse was ingested under a platform_instance: set the same platform_instance here so that Hex's upstream URNs match the warehouse-ingested ones,
  • a Hex connection's type is unrecognized (deleted, custom, or the token lacks scope on /v1/data-connections): set platform explicitly so its cells aren't skipped.

See Connection Platform Resolution in the sections below for the full configuration shape.
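
To see why platform_instance matters, here is a minimal sketch of how a DataHub dataset URN is assembled from platform, instance, and qualified table name. The function name warehouse_dataset_urn is hypothetical and the sketch follows DataHub's general dataset URN layout; it is not the connector's own code, and per-warehouse casing rules are not modeled.

```python
from typing import Optional


def warehouse_dataset_urn(
    platform: str,
    database: str,
    schema: str,
    table: str,
    platform_instance: Optional[str] = None,
    env: str = "PROD",
) -> str:
    """Assemble a dataset URN; the instance, when set, prefixes the table name."""
    parts = [p for p in (platform_instance, database, schema, table) if p]
    name = ".".join(parts)
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


with_instance = warehouse_dataset_urn(
    "snowflake", "analytics", "public", "orders", platform_instance="prod_snowflake"
)
without_instance = warehouse_dataset_urn("snowflake", "analytics", "public", "orders")

# The two URNs differ, so lineage emitted without the instance would point
# at a different dataset than the one the warehouse connector ingested.
print(with_instance)
print(without_instance)
```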

Install the Plugin

pip install 'acryl-datahub[hex]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: hex
  config:
    # Hex workspace name; find it in the workspace switcher dropdown in the top-left corner of the Hex app
    workspace_name: my-workspace
    workspace_id: id # optional override for workspace ID (UUID); if not set, the source will call the Hex API to fetch it
    token: "${HEX_TOKEN}"

    # (Optional) platform_instance / env for the Hex side (Dashboards, Charts).
    # platform_instance: prod_hex
    # env: PROD

    # (Optional) Feature toggles; all default to true. Uncomment to opt out.
    # include_components: false
    # include_lineage: false
    # include_run_history: false
    # set_ownership_from_email: false
    # collections_as_tags: false
    # status_as_tag: false
    # categories_as_tags: false

    # (Optional) Emit a DataHub Document per Project and per Component for
    # AI agent retrieval. Off by default; opt in if you use AI agents and
    # want context documents in your catalog.
    # include_context_documents: true

    # (Optional) Hex Enterprise workspaces only: use Hex's queriedTables API
    # as the primary lineage source for published projects/components.
    # Defaults to false (SQL-cell parsing for everything).
    # use_queried_tables_lineage: true

    # (Optional) Match the platform_instance under which the upstream warehouses
    # were ingested. Required so Hex's lineage URNs match the
    # warehouse-ingested ones. Keyed by Hex dataConnectionId (UUID).
    # connection_platform_map:
    #   "8f3a1c2d-4b5e-6789-abcd-ef0123456789":
    #     platform: snowflake
    #     platform_instance: prod_snowflake
    #     default_database: ANALYTICS
    #     default_schema: PUBLIC
    #   "1a2b3c4d-5e6f-7890-abcd-1234567890ab":
    #     platform: bigquery
    #     default_database: my-gcp-project
    #     default_schema: analytics

    # (Optional) Filter projects and components by title or category.
    # project_title_pattern:
    #   allow:
    #     - "^Production .*"
    # component_title_pattern:
    #   allow:
    #     - "^Shared .*"
    # category_pattern:
    #   deny:
    #     - "^Sandbox$"

    # (Optional) Cap projects per run; useful for staged rollouts.
    # WARNING: with stateful_ingestion enabled, projects beyond the limit are
    # soft-deleted on the next run.
    # max_projects: 50

    # Enable stale-entity removal (projects deleted in Hex are soft-deleted in DataHub).
    stateful_ingestion:
      enabled: true

# sink configs: see https://docs.datahub.com/docs/metadata-ingestion/sink_docs/datahub
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| token | string(password) | Hex Workspace Token with the 'Read projects' scope. Create one at Settings → API → Workspace tokens. The 'Read projects' scope is required to access project cells for lineage; tokens without it can enumerate projects but not read their content. See https://learn.hex.tech/docs/api-integrations/api/overview for token types. | |
| workspace_name | string | Hex workspace name. Find it in the workspace switcher dropdown in the top-left corner of the Hex app. | |
| base_url | string | Hex API base URL. For most Hex users, this will be https://app.hex.tech/api/v1. Single-tenant app users should replace this with the URL they use to access Hex. | |
| categories_as_tags | boolean | Emit Hex Categories as tags. | True |
| collections_as_tags | boolean | Emit Hex Collections as tags. | True |
| include_components | boolean | Include Hex Components in the ingestion. | True |
| include_context_documents | boolean | Emit a DataHub Document per Project and per Component containing SQL sources, visualization metadata, and notebook documentation. Documents are hidden from global search and linked to the Dashboard/Chart for AI agent retrieval. | False |
| include_lineage | boolean | Extract upstream lineage. Parses SQL from cells by default (all workspaces); when use_queried_tables_lineage is enabled, Hex's queriedTables API (Hex Enterprise workspaces) becomes the primary source, with SQL parsing as the fallback. No warehouse ingestion dependency required. | True |
| include_run_history | boolean | Emit the most recent COMPLETED run as a DashboardInfo PATCH setting lastRefreshed. | True |
| max_projects | One of integer, null | Maximum number of projects to process. Useful for testing or staged rollouts. Components discovered during project processing are not counted. Defaults to None (process all projects). WARNING: with stateful ingestion enabled, projects beyond this limit are soft-deleted on the next run. | None |
| page_size | integer | Number of items to fetch per Hex API call. | 100 |
| patch_metadata | boolean | Emit metadata as patch events. | False |
| platform_instance | One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. | None |
| set_ownership_from_email | boolean | Set ownership identity from owner/creator email. | True |
| status_as_tag | boolean | Emit Hex Status as a tag. | True |
| use_queried_tables_lineage | boolean | Use Hex's queriedTables API (Hex Enterprise workspaces only) as the primary lineage source for published projects and components. Unpublished entities always fall back to SQL-cell parsing since queriedTables is only populated for published runs. Set to False to force SQL-cell parsing for everything. | False |
| workspace_id | One of string, null | Hex workspace (org) UUID, used to build external URLs to the Hex app (e.g. https://app.hex.tech/<workspace_id>/hex/<project_id>). If left unset, the connector calls /users/me to auto-discover it, which requires the token to have 'Users → Read access'. Set this explicitly to avoid granting that scope. Find the UUID in any Hex project URL. | None |
| env | string | The environment that all assets produced by this connector belong to. | PROD |
| category_pattern | AllowDenyPattern | A class to store allow deny regexes. | |
| category_pattern.ignoreCase | One of boolean, null | Whether to ignore case sensitivity during pattern matching. | True |
| component_title_pattern | AllowDenyPattern | A class to store allow deny regexes. | |
| component_title_pattern.ignoreCase | One of boolean, null | Whether to ignore case sensitivity during pattern matching. | True |
| connection_platform_map | map(str,HexConnectionDetail) | Per-connection override for upstream lineage URN construction. | |
| connection_platform_map.key.platform | One of string, null | DataHub platform name. Required only when Hex's connection type cannot be auto-resolved (deleted connections, permission gaps, custom types). | None |
| connection_platform_map.key.default_database | One of string, null | Default outer-scope qualifier for unqualified table refs in SQL cells. For BigQuery this is the GCP project ID; for Snowflake/Postgres/Redshift/MSSQL the database; for Trino/Databricks/Presto the catalog. Leave empty for 2-part platforms (MySQL/MariaDB/Clickhouse); set only default_schema there. Overrides the value auto-extracted from Hex's /v1/data-connections response. | None |
| connection_platform_map.key.default_schema | One of string, null | Default inner-scope qualifier for unqualified table refs in SQL cells. For BigQuery this is the dataset; for Snowflake/Postgres/Redshift/MSSQL/Trino/Databricks/Presto/Athena the schema; for MySQL/MariaDB/Clickhouse the database name. Overrides the value auto-extracted from Hex's /v1/data-connections response. | None |
| connection_platform_map.key.platform_instance | One of string, null | DataHub platform_instance the underlying warehouse was ingested under. Leave unset for warehouses ingested without one (e.g. typical BigQuery). | None |
| project_title_pattern | AllowDenyPattern | A class to store allow deny regexes. | |
| project_title_pattern.ignoreCase | One of boolean, null | Whether to ignore case sensitivity during pattern matching. | True |
| stateful_ingestion | One of StatefulStaleMetadataRemovalConfig, null | Configuration for stateful ingestion and stale metadata removal. | None |
| stateful_ingestion.enabled | boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False. | False |
| stateful_ingestion.fail_safe_threshold | number | Guards against accidental changes to the source configuration: if the relative change (in percent) in entities compared to the previous state is above 'fail_safe_threshold', the soft deletes are not applied and the state is not committed. | 75.0 |
| stateful_ingestion.remove_stale_metadata | boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. | True |
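
The three *_pattern fields take allow and deny regex lists plus ignoreCase. The sketch below approximates how such a filter is typically evaluated: deny wins, then at least one allow pattern must match from the start of the string. The AllowDeny class is a stand-in written for illustration, not DataHub's AllowDenyPattern itself, and the exact matching rules are an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllowDeny:
    """Illustrative stand-in for an allow/deny regex filter."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)
    ignore_case: bool = True

    def allowed(self, value: str) -> bool:
        flags = re.IGNORECASE if self.ignore_case else 0
        # Deny patterns take precedence over allow patterns.
        if any(re.match(pattern, value, flags) for pattern in self.deny):
            return False
        return any(re.match(pattern, value, flags) for pattern in self.allow)


titles = AllowDeny(allow=["^Production .*"])
categories = AllowDeny(deny=["^Sandbox$"])

print(titles.allowed("Production Revenue"))
print(categories.allowed("Sandbox"))
```

With ignoreCase left at its default of True, a title such as "production revenue" would pass the same allow pattern in this sketch.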

Capabilities

Upstream Lineage

Lineage is tiered, with both tiers opt-out via include_lineage: false:

  • Tier 1 — queriedTables (Hex Enterprise workspaces only, opt-in via use_queried_tables_lineage: true): Hex's own runtime-proven table list for published projects and components, served by /v1/projects/{id}/queriedTables. Unpublished entities always fall back to Tier 2 since queriedTables is only populated for published runs. A 403 (non-Enterprise Hex workspace) falls back to Tier 2 for everything and emits a warning.
  • Tier 2 — SQL parsing via sqlglot (all workspaces, default): each cell is parsed with its connection's dialect.

Both tiers resolve warehouse URNs via /v1/data-connections (platform + default database/schema), overridable per-connection via connection_platform_map. For projects that import components, native project SQL is separated from inlined component SQL via the export API so component lineage isn't attributed twice. Cells whose dataConnectionId cannot be resolved are skipped with a structured warning — see Missing Upstream Lineage for triage.
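
The tier rules above can be summarized as a small decision function. This is a sketch of the documented behavior only; the function name and the queried_tables_forbidden flag (standing in for a 403 from the queriedTables endpoint) are hypothetical.

```python
def choose_lineage_tier(
    include_lineage: bool,
    use_queried_tables_lineage: bool,
    is_published: bool,
    queried_tables_forbidden: bool = False,
) -> str:
    """Return which lineage source applies to one project or component."""
    if not include_lineage:
        return "none"
    # Tier 1 applies only when opted in, the entity is published,
    # and the workspace did not reject the queriedTables call (403).
    if use_queried_tables_lineage and is_published and not queried_tables_forbidden:
        return "queried_tables"
    # Tier 2: SQL parsing, the default and the fallback.
    return "sql_parsing"


print(choose_lineage_tier(True, True, is_published=True))
print(choose_lineage_tier(True, True, is_published=False))
```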

Connection Platform Resolution

Hex's /v1/data-connections endpoint returns a type field that the connector maps to a DataHub platform via CONNECTION_TYPE_TO_DATAHUB_PLATFORM. Default database/schema qualifiers come from the same response.

Configure connection_platform_map (keyed by Hex dataConnectionId UUID) when:

  1. The warehouse was ingested under a platform_instance: set the same value so the URNs match the warehouse-ingested ones.
  2. The connection is deleted, permission-gapped, or a custom type: set platform explicitly so its cells aren't skipped.

Example:

connection_platform_map:
  "8f3a1c2d-4b5e-6789-abcd-ef0123456789":
    platform: snowflake
    platform_instance: prod_snowflake
    default_database: ANALYTICS
    default_schema: PUBLIC
  "1a2b3c4d-5e6f-7890-abcd-1234567890ab":
    platform: bigquery
    default_database: my-gcp-project

Migration from query_fetcher

Earlier versions of this connector derived lineage by querying DataHub for prior Hex-emitted query metadata (query_fetcher.py). That path has been removed: lineage now comes from SQL parsing of cells by default, or from Hex's queriedTables API when use_queried_tables_lineage: true is set on a Hex Enterprise workspace.

The following config fields fed only the old path and are now removed — drop them from your recipe (the connector will emit a warning if they are still present):

  • lineage_start_time
  • lineage_end_time
  • datahub_page_size

Migration: Components are now Charts

Components were previously emitted as Dashboard entities (subtype Component); they are now Chart entities, linked from their Project's DashboardInfo.charts. This changes their URN entity type, so any saved views, glossary/tag/ownership assignments, and policies that targeted the old Dashboard-typed Component URNs are lost and must be manually reapplied to the new Chart URNs.

Legacy Dashboard-typed Components left over from the old version are soft-deleted by stale-entity removal, provided stateful_ingestion was enabled on the old run. Because every Component changes URN type, a component-heavy workspace can exceed the stale-removal fail-safe (fail_safe_threshold, default 75%); if that happens, raise the threshold or perform a one-time bulk cleanup via the DataHub UI or CLI.
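
To estimate whether a workspace would trip the fail-safe, the following simplified sketch compares the share of previously seen URNs that are missing from the current run against the threshold. The function name, the URN values, and the exact percentage formula are illustrative assumptions; DataHub's real check may differ in detail.

```python
from typing import Set


def exceeds_fail_safe(
    previous_urns: Set[str], current_urns: Set[str], fail_safe_threshold: float = 75.0
) -> bool:
    """Return True when the share of stale URNs is above the threshold (percent)."""
    if not previous_urns:
        return False
    stale = previous_urns - current_urns
    percent_removed = 100.0 * len(stale) / len(previous_urns)
    return percent_removed > fail_safe_threshold


# Placeholder workspace: 10 projects and 40 components.
projects = {f"urn:li:dashboard:(hex,project-{i})" for i in range(10)}
old_components = {f"urn:li:dashboard:(hex,component-{i})" for i in range(40)}
new_components = {f"urn:li:chart:(hex,component-{i})" for i in range(40)}

# 40 of 50 previous URNs disappear (80%), which is above the 75% default.
print(exceeds_fail_safe(projects | old_components, projects | new_components))
```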

Stale Entity Removal

Enable by configuring stateful_ingestion. Projects deleted in Hex are soft-deleted in DataHub on the next run.

max_projects caps projects per run. With stateful_ingestion enabled, projects beyond the limit are treated as stale and soft-deleted — only set it if that is the intended behavior.

Context Documents

Opt-in via include_context_documents: true. When enabled, the connector emits a DataHub Document per Project and per Component containing SQL sources, visualization metadata, and notebook documentation.

Run History

When include_run_history is enabled (default), the most recent scheduled run is emitted as an Operation aspect, and last_run_status / last_run_elapsed_seconds are written to the project's custom properties — ERRORED runs surface there so operators can see failures. Only COMPLETED runs additionally update DashboardInfo.lastRefreshed via a targeted PATCH, so projects with sustained failures keep their last known-good refresh time as a freshness signal.
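
The split between what every run reports and what only a COMPLETED run updates can be sketched as follows. The function name and its input parameters are hypothetical; only the custom property names and the COMPLETED rule come from the description above.

```python
from typing import Dict, Optional, Tuple


def summarize_latest_run(
    status: str, elapsed_seconds: float, end_time_ms: int
) -> Tuple[Dict[str, str], Optional[int]]:
    """Return (custom properties, lastRefreshed value or None) for one run."""
    custom_properties = {
        "last_run_status": status,
        "last_run_elapsed_seconds": str(elapsed_seconds),
    }
    # Only a COMPLETED run moves the freshness signal forward;
    # a failed run leaves the previous lastRefreshed untouched.
    last_refreshed = end_time_ms if status == "COMPLETED" else None
    return custom_properties, last_refreshed


properties, refreshed = summarize_latest_run("ERRORED", 12.5, 1_700_000_000_000)
print(properties["last_run_status"], refreshed)
```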

Usage Statistics

Each Project and Component emits an all-time viewsCount and a rolling 7-day window with lastViewedAt. Hex counts app views only when the published app is accessed — unpublished drafts have no view counts, so usage statistics are only emitted for published Projects and Components.

Limitations

  1. queriedTables requires a Hex Enterprise workspace and opt-in. Defaults to SQL parsing; enable use_queried_tables_lineage on Hex Enterprise workspaces to use Hex's API as the primary source.
  2. Non-SQL query paths produce no lineage. SQL parsing cannot recover table references from hextoolkit Python cells, dynamic SQL built from variables, or parameterized table names — the resulting projects will be missing those upstreams.
  3. Context documents are not a complete mirror of the Hex notebook. Only a subset of cell types is captured, so the rendered document will not match the source notebook exactly.
  4. Upstream lineage may be missing or mismatched when Hex's /v1/data-connections metadata is incomplete or uses an unrecognized connectionDetails shape. Without default_database / default_schema, neither SQL parsing nor queriedTables can assemble fully-qualified URNs; without the right platform_instance, URNs won't align with the warehouse ingestion. Set the affected dataConnectionId under connection_platform_map with the correct platform_instance / default_database / default_schema, or report the new connection shape to the DataHub team so the parser can be updated.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first, then review ingestion logs for source-specific errors.

Missing Upstream Lineage

The source report lists every skipped cell with its dataConnectionId and a reason (missing_connection_id or unresolved_platform). For each unresolved connection, add an entry under connection_platform_map and re-run. Cells with no dataConnectionId are non-SQL cells or cells without a Hex connection assigned — these cannot be recovered.
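
When many cells are skipped, it can help to reduce them to the distinct connections that need an override. The sketch below assumes the skipped cells have been copied out of the report into a list of dicts with dataConnectionId and reason keys; that shape and the function name are hypothetical, since the report's actual format is not shown here.

```python
from typing import Dict, List, Optional


def connections_needing_override(
    skipped_cells: List[Dict[str, Optional[str]]]
) -> List[str]:
    """Return the distinct connection IDs skipped for an unresolved platform."""
    ids = {
        cell["dataConnectionId"]
        for cell in skipped_cells
        if cell.get("reason") == "unresolved_platform" and cell.get("dataConnectionId")
    }
    return sorted(ids)


# Hypothetical records transcribed from a source report.
skipped = [
    {"dataConnectionId": "8f3a1c2d-4b5e-6789-abcd-ef0123456789", "reason": "unresolved_platform"},
    {"dataConnectionId": "8f3a1c2d-4b5e-6789-abcd-ef0123456789", "reason": "unresolved_platform"},
    {"dataConnectionId": None, "reason": "missing_connection_id"},
]

for connection_id in connections_needing_override(skipped):
    print(f"add to connection_platform_map: {connection_id}")
```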

Column Lineage Looks Sparse

When use_queried_tables_lineage is enabled on a Hex Enterprise workspace, the report exposes enterprise_cells_with_mismatch and enterprise_sample_mismatched_cells — SQL cells whose parsed table URN did not match the queriedTables result. Adjusting default_database / default_schema in connection_platform_map resolves most cases.

Code Coordinates

  • Class Name: datahub.ingestion.source.hex.hex.HexSource
Questions?

If you've got any questions on configuring ingestion for Hex, feel free to ping us on our Slack.
