Cube

Overview

Cube is a headless semantic layer that defines metrics, dimensions, and joins once and exposes them to BI tools, data apps, and AI agents through SQL, REST, and GraphQL APIs. Its data model is organised into cubes (business entities such as orders or customers) and views (curated, query-ready datasets built on top of cubes).

This source ingests the Cube data model into DataHub as datasets: each cube and view becomes a dataset whose measures and dimensions are modelled as schema fields. It supports both Cube Core (self-hosted, via the /v1/meta REST endpoint) and Cube Cloud, where it merges /v1/meta (structural and presentation metadata) with the richer Metadata API (warehouse and column-level lineage). On Cube Cloud the connector can mint the metadata-scoped token automatically via the Control Plane API. DataHub captures descriptions, measure/dimension classification, view-to-cube lineage, and — where the deployment exposes it — column-level lineage down to the underlying warehouse tables. On Cube Cloud it also ingests saved reports as charts and workbooks as dashboards via the Platform API, extending lineage to the BI consumption layer. Stateful ingestion removes cubes and views that have been deleted from the model.

Concept Mapping

Source Concept | DataHub Concept | Notes
Deployment / data model | Container | Subtype Cube Deployment
Cube | Dataset | Subtype Cube
View | Dataset | Subtype View
Measure | Schema Field | Tagged Measure; aggregation in native type
Dimension | Schema Field | Tagged Dimension; primary keys marked as key
format / drillMembers / cumulative | Schema Field jsonProps | Measure presentation hints
joins / hierarchies / folders / preAggregations | Dataset custom properties | Structural model metadata
public / isVisible | Ingestion filter | Hidden cubes/members skipped unless include_hidden
table_references / cube sql | Lineage | Lineage to upstream warehouse tables
View member aliasMember | Fine-Grained Lineage | Column-level view-to-cube lineage
meta | Tags / Terms / Owners / Domains | Mapped via meta_mapping / column_meta_mapping
Report (Cube Cloud) | Chart | Input lineage to queried cubes/views
Workbook (Cube Cloud) | Dashboard | Contains its reports' charts

Module cube

Incubating

Important Capabilities

Capability | Status | Notes
Asset Containers | ✅ | Enabled by default.
Column-level Lineage | ✅ | Enabled by default, can be disabled via include_column_lineage.
Descriptions | ✅ | Enabled by default.
Detect Deleted Entities | ✅ | Enabled via stateful ingestion.
Domains | ✅ | Enabled via the domain config and meta_mapping.
Extract Ownership | ✅ | Enabled via meta_mapping against Cube meta.
Extract Tags | ✅ | Enabled via meta_mapping/column_meta_mapping, plus Measure/Dimension/Temporal field tags.
Glossary Terms | ✅ | Enabled via meta_mapping/column_meta_mapping.
Platform Instance | ✅ | Enabled by default.
Schema Metadata | ✅ | Enabled by default.
Table-Level Lineage | ✅ | Enabled by default. Includes view->cube lineage and, where available, lineage to upstream warehouse tables.
Test Connection | ✅ | Enabled by default.

Overview

The cube module ingests the Cube semantic layer data model into DataHub. Every cube and view is emitted as a dataset, with its measures and dimensions modelled as schema fields, organised under a container that represents the Cube deployment. The module works against both Cube Core and Cube Cloud.

Prerequisites

Choose a deployment type

Set deployment_type to match your Cube installation:

  • CORE: a self-hosted Cube Core instance. Metadata is read from the /v1/meta REST endpoint.
  • CLOUD: a Cube Cloud deployment. Metadata is read from /v1/meta and, when use_metadata_api is enabled (the default), merged with the Metadata API, which additionally exposes lineage to upstream warehouse tables. If the supplied token lacks the required metadata scope, the connector automatically falls back to /v1/meta only.

Obtain an API token

The connector authenticates with a token sent in the Authorization header.

  • Cube Core: generate a JWT signed with your deployment's CUBEJS_API_SECRET. See Security context.
  • Cube Cloud (/v1/meta): copy a token from the deployment's Playground → API tab, or sign one with the deployment's API secret.
  • Cube Cloud Metadata API: obtain a token via the Control Plane API. This token is required for warehouse lineage.
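
As a rough illustration of the Cube Core case, the sketch below signs a JWT using only the Python standard library. It assumes the deployment uses HS256 signing with CUBEJS_API_SECRET; the helper names, the placeholder secret, and the one-hour expiry are illustrative and not part of the connector.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT compact format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256_jwt(payload: dict, secret: str) -> str:
    """Return a compact JWT signed with HMAC-SHA256 (hypothetical helper)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"


# Assumption: placeholder secret. In practice, read CUBEJS_API_SECRET from
# the environment or a secret store rather than hard-coding it.
example_token = sign_hs256_jwt({"exp": int(time.time()) + 3600}, "example-secret")
```

The resulting string is what would be supplied as api_token. Any claims placed in the payload act as the security context for that token.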

Configure the API URL

api_url is the base URL of the REST API, including the base path (defaults to /cubejs-api):

  • Cube Core: http://localhost:4000/cubejs-api
  • Cube Cloud: https://<deployment>.cubecloud.dev/cubejs-api
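
To make the role of the base path concrete, here is a small sketch of how endpoint URLs relate to api_url. The helper names are hypothetical, and how the connector builds these URLs internally is an assumption; the sketch only mirrors what this page states (endpoints such as /v1/meta live under the base path, and host-level URLs are derived from the scheme and host of api_url).

```python
from urllib.parse import urlsplit, urlunsplit


def meta_endpoint(api_url: str) -> str:
    """Append the /v1/meta path to the REST base URL (hypothetical helper)."""
    return api_url.rstrip("/") + "/v1/meta"


def host_base_url(api_url: str) -> str:
    """Keep only the scheme and host, dropping the API base path."""
    parts = urlsplit(api_url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


core_meta = meta_endpoint("http://localhost:4000/cubejs-api")
cloud_host = host_base_url("https://your-deployment.cubecloud.dev/cubejs-api")
```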

Warehouse lineage (optional)

To connect cubes to the warehouse tables they read from, set warehouse_platform (e.g. snowflake, bigquery, postgres) and, if your existing datasets use them, warehouse_platform_instance and warehouse_env. On Cube Cloud with the Metadata API enabled, the warehouse platform and database are auto-detected from the deployment's data sources. On Cube Core, set parse_sql_for_lineage to derive table lineage from each cube's SQL definition (requires warehouse_platform).

Note that cubes marked public: false are not returned by the /v1/meta endpoint. Views that reference them still produce lineage edges to those cubes, even though the cubes themselves are not ingested.
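
For a Cube Core deployment, the options above might be combined as in the following sketch, written as a Python dictionary that mirrors the YAML recipe layout. The platform, database, and server values are placeholders, and the CUBE_API_TOKEN environment variable name is an assumption.

```python
import os

# A recipe for Cube Core with warehouse lineage, expressed as a dict.
# All values below are illustrative placeholders.
core_recipe = {
    "source": {
        "type": "cube",
        "config": {
            "api_url": "http://localhost:4000/cubejs-api",
            "api_token": os.environ.get("CUBE_API_TOKEN", ""),
            "deployment_type": "CORE",
            # Required for parse_sql_for_lineage to produce lineage URNs.
            "warehouse_platform": "postgres",
            "warehouse_database": "analytics",
            "warehouse_env": "PROD",
            "parse_sql_for_lineage": True,
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}
```

With the acryl-datahub package installed, a dictionary of this shape can be passed to Pipeline.create from datahub.ingestion.run.pipeline to run the ingestion programmatically instead of through a YAML file.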

Install the Plugin

pip install 'acryl-datahub[cube]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: cube
  config:
    # Base URL of the Cube REST API, including the base path.
    api_url: "https://your-deployment.cubecloud.dev/cubejs-api"
    api_token: "${CUBE_API_TOKEN}"

    # CORE (self-hosted) or CLOUD.
    deployment_type: "CLOUD"

    # Connect cubes to their upstream warehouse tables. Auto-detected on Cube
    # Cloud via the Metadata API; set explicitly for Cube Core.
    # warehouse_platform: "snowflake"
    # warehouse_database: "ANALYTICS"

    # Cube Cloud only: ingest reports as charts and workbooks as dashboards, and
    # auto-mint a Metadata API token. cloud_api_key + deployment_id are required;
    # environment_id is needed only for the Metadata API token.
    # cloud_api_key: "${CUBE_CLOUD_API_KEY}"
    # deployment_id: "12345"
    # environment_id: "production"

    stateful_ingestion:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
api_token 
string(password)
API token used to authenticate against Cube. For Cube Core this is a JWT signed with CUBEJS_API_SECRET; for the Cube Cloud Metadata API use a token obtained from the Control Plane API.
api_url 
string
Base URL of the Cube REST API, including the base path. For Cube Core this is typically http://localhost:4000/cubejs-api; for Cube Cloud it looks like https://<name>.cubecloud.dev/cubejs-api.
cloud_api_key
One of string(password), null
Cube Cloud Control Plane API key (Account → API keys). When set together with deployment_id and environment_id, the connector automatically mints a metadata-scoped JWT via the Control Plane tokens-for-meta-sync endpoint to access the Metadata API, instead of requiring a pre-generated token in api_token.
Default: None
cloud_api_url
One of string, null
Base URL of the Cube Cloud Control Plane API (e.g. https://<tenant>.cubecloud.dev). If unset, it is derived from the scheme and host of api_url. Only used when cloud_api_key is set.
Default: None
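column_meta_mapping
map(str,object)
Rules that map meta defined on members (measures and dimensions) to tags and glossary terms on schema fields. Uses the same syntax as the dbt connector.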
convert_lineage_urns_to_lowercase
boolean
Whether to lowercase upstream warehouse table and column names when building lineage URNs. Must match the convert_urns_to_lowercase setting of the warehouse connector (e.g. Snowflake ingests lowercased URNs by default) so that the lineage edges resolve.
Default: True
deployment_id
One of string, null
Cube Cloud deployment id, used to mint a Metadata API token via the Control Plane API.
Default: None
deployment_type
Enum
One of: "CORE", "CLOUD"
deployment_url
One of string, null
Base URL of the Cube deployment UI, used to build an external link on the deployment container. If unset, it is derived from api_url by stripping the API base path.
Default: None
emit_member_details
boolean
Whether to capture Cube member presentation hints (format, drill-down members, cumulative flag) as schema-field jsonProps, and structural metadata (joins, hierarchies, folders, pre-aggregations) as dataset custom properties.
Default: True
enable_meta_mapping
boolean
Whether to process meta_mapping and column_meta_mapping rules.
Default: True
environment_id
One of string, null
Cube Cloud environment id, used to mint a Metadata API token via the Control Plane API.
Default: None
include_column_lineage
boolean
Whether to emit column-level (fine-grained) lineage. Requires include_lineage to be enabled.
Default: True
include_cubes
boolean
Whether to ingest base cubes as datasets.
Default: True
include_hidden
boolean
Whether to ingest cubes, views, and members that Cube marks as hidden (public: false / isVisible: false). Hidden cubes are typically excluded from Cube's own API consumers; enable this to surface them in DataHub anyway.
Default: False
include_lineage
boolean
Whether to emit lineage. This includes view->cube lineage and, where available, lineage from cubes to their upstream warehouse tables.
Default: True
include_reports
boolean
Cube Cloud only. Whether to ingest saved reports as DataHub charts, with lineage to the cubes/views they query. Requires Platform API access (cloud_api_key + deployment_id).
Default: True
include_views
boolean
Whether to ingest views as datasets.
Default: True
include_workbooks
boolean
Cube Cloud only. Whether to ingest workbooks as DataHub dashboards containing their reports' charts. Requires Platform API access (cloud_api_key + deployment_id).
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
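meta_mapping
map(str,object)
Rules that map meta defined on cubes and views to tags, glossary terms, owners, domains, and documentation links. Uses the same syntax as the dbt connector.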
meta_sync_token_expires_in
integer
Expiry (in seconds) of the minted Metadata API token. Defaults to 24 hours.
Default: 86400
parse_sql_for_lineage
boolean
Cube Core only. When the /v1/meta?extended response includes a cube's SQL definition, parse it to derive upstream warehouse lineage. Requires warehouse_platform to be set. The Cloud Metadata API provides lineage directly, so this is ignored for Cube Cloud.
Default: True
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
request_timeout_sec
integer
Per-request timeout, in seconds.
Default: 30
security_context
object
Security context embedded in the minted Metadata API token. Controls which parts of the data model are visible, following Cube's multi-tenancy rules.
strip_user_ids_from_email
boolean
Whether to strip the email domain from owners derived via meta_mapping.
Default: False
tag_measures_and_dimensions
boolean
Whether to tag schema fields with Measure/Dimension (and Temporal for time dimensions) so the kinds of Cube members can be distinguished and filtered in DataHub.
Default: True
tag_prefix
string
Prefix added to tags created via meta_mapping.
Default: (empty)
use_metadata_api
boolean
Cube Cloud only. When enabled, the richer Metadata API (/v1/entities) is used to extract warehouse and column-level lineage, which is merged with the structural metadata from /v1/meta. When disabled, only the /v1/meta endpoint is used. Has no effect for Cube Core deployments.
Default: True
warehouse_database
One of string, null
Database name to prepend to upstream warehouse table references that do not already include one. If unset, it is taken from the Cube data source definition when available.
Default: None
warehouse_env
string
Environment of the upstream warehouse datasets referenced by lineage.
Default: PROD
warehouse_platform
One of string, null
DataHub platform name of the warehouse that backs the Cube data model (e.g. snowflake, bigquery, postgres). Used to build upstream lineage URNs. If unset, it is auto-detected from the Cube data source type when the Metadata API is available.
Default: None
warehouse_platform_instance
One of string, null
Platform instance of the upstream warehouse, used when building lineage URNs.
Default: None
env
string
The environment that all assets produced by this connector belong to.
Default: PROD
cube_pattern
AllowDenyPattern
A class to store allow deny regexes
cube_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain
map(str,AllowDenyPattern)
A class to store allow deny regexes
domain.key.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
domain.key.allow.string
string
domain.key.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain.key.deny
array
List of regex patterns to exclude from ingestion.
Default: []
domain.key.deny.string
string
report_pattern
AllowDenyPattern
A class to store allow deny regexes
report_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
view_pattern
AllowDenyPattern
A class to store allow deny regexes
view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
workbook_pattern
AllowDenyPattern
A class to store allow deny regexes
workbook_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful ingestion configuration.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Guards against accidental changes to the source configuration: if the relative change (in percent) in entities compared to the previous state is above fail_safe_threshold, the soft deletes are not applied and the state is not committed.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
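
The *_pattern fields above all share the allow / deny / ignoreCase shape. The sketch below illustrates the usual semantics of such a filter (a name passes if it matches at least one allow regex and no deny regex). It is a standalone approximation with a hypothetical function name, not the connector's own AllowDenyPattern class.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Return True if name matches an allow regex and no deny regex."""
    flags = re.IGNORECASE if ignore_case else 0
    # Deny rules take precedence over allow rules in this sketch.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# Example: keep every cube except those whose name ends in "_staging".
kept = [
    name
    for name in ["orders", "customers", "orders_staging"]
    if is_allowed(name, deny=[r".*_staging$"])
]
```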

Capabilities

The connector extracts the following metadata:

  • Cubes and views as datasets, grouped under a container representing the deployment. The container links back to the deployment UI (derived from api_url, or set deployment_url).
  • Schema — each measure and dimension becomes a schema field. Measures carry their aggregation type (e.g. count, sum) in the native data type; primary-key dimensions are flagged as part of the key. Fields are tagged Measure or Dimension — and Temporal for time dimensions (disable with tag_measures_and_dimensions: false).
  • Descriptions and properties — titles, descriptions, segment names, source file name, and any custom meta defined in the model.
  • Structural metadata — joins (with relationship), hierarchies (with levels), folders/nested folders (with members), and pre-aggregation names are captured as dataset custom properties (disable with emit_member_details: false).
  • Measure presentation hints — each measure's format, drill-down members, and cumulative flag are stored on the schema field as jsonProps.
  • Hidden members — cubes, views, and members marked public: false / isVisible: false are skipped by default; set include_hidden: true to ingest them.
  • Tags, glossary terms, owners, domains, and documentation links — derived from the meta defined on cubes/views via meta_mapping, and from member meta via column_meta_mapping (same syntax as the dbt connector). Domains can also be assigned by name pattern via the domain config.
  • Reports and workbooks (Cube Cloud only) — saved reports become DataHub charts with input lineage to the cubes/views they query, and workbooks become DataHub dashboards containing those charts. Owners and titles are carried across. Disable with include_reports: false / include_workbooks: false, and filter with report_pattern / workbook_pattern.
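
Because meta_mapping and column_meta_mapping are described as sharing the dbt connector's syntax, a configuration fragment might look like the sketch below, written as Python dictionaries that mirror the YAML structure. The meta keys (business_owner, has_pii, glossary_term) and the tag prefix are hypothetical, and the rule shape (match / operation / config) is an assumption carried over from the dbt connector's documented format.

```python
# Rules applied to meta defined on cubes and views.
meta_mapping = {
    "business_owner": {
        "match": ".*",  # regex tested against the meta value
        "operation": "add_owner",
        "config": {"owner_type": "user"},
    },
    "has_pii": {
        "match": True,
        "operation": "add_tag",
        "config": {"tag": "has_pii"},
    },
}

# Rules applied to meta defined on individual measures and dimensions.
column_meta_mapping = {
    "glossary_term": {
        "match": ".*",
        "operation": "add_term",
        "config": {"term": "{{ $match }}"},
    },
}

# Fragment to merge into the source config of a recipe.
mapping_fragment = {
    "enable_meta_mapping": True,
    "tag_prefix": "cube:",
    "meta_mapping": meta_mapping,
    "column_meta_mapping": column_meta_mapping,
}
```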

Lineage

Lineage is emitted when include_lineage is enabled (the default):

  • View to cube: views are linked to the cubes they are built on, including column-level lineage derived from each member's aliasMember.
  • Cube to warehouse: on Cube Cloud with the Metadata API, table and column references are read directly. On Cube Core, table-level lineage is parsed from each cube's SQL definition when parse_sql_for_lineage and warehouse_platform are set. Column-level lineage on Cube Core is best-effort: because /v1/meta does not expose per-member SQL, members are matched by name against the upstream table's columns as found in DataHub. The warehouse must therefore be ingested first, and members whose name differs from the underlying column (e.g. aggregate measures) are not linked.
  • Report and workbook to view: on Cube Cloud, charts (reports) carry input lineage to the cubes/views in their query, and dashboards (workbooks) contain those charts, extending the chain to warehouse → cube → view → chart → dashboard.

Disable column-level lineage with include_column_lineage: false.
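
The best-effort name matching used for Cube Core column-level lineage can be pictured with the following sketch. It assumes members are named in the cube.member form and compares only the part after the last dot; the function name and sample data are hypothetical, and the connector's real matching logic may differ.

```python
from typing import Dict, Iterable


def match_members_to_columns(
    members: Iterable[str], upstream_columns: Iterable[str]
) -> Dict[str, str]:
    """Link each member to the upstream column that has the same short name."""
    columns = set(upstream_columns)
    links = {}
    for member in members:
        short_name = member.rsplit(".", 1)[-1]  # "orders.status" -> "status"
        if short_name in columns:
            links[member] = short_name
    return links


links = match_members_to_columns(
    members=["orders.status", "orders.total_amount", "orders.count"],
    upstream_columns=["id", "status", "amount"],
)
```

Here only orders.status is linked. The renamed total_amount and the aggregate count have no same-named column upstream, which is the gap this page describes for Cube Core.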

Cube Cloud authentication and metadata merging

On Cube Cloud the connector reads both endpoints and merges them: /v1/meta supplies the structural and presentation metadata (joins, hierarchies, folders, formats, visibility), while the Metadata API (/v1/entities, /v1/data-sources) supplies warehouse and column-level lineage. This gives a Cloud ingestion the union of both.

The Metadata API requires a metadata-scoped JWT. You can either:

  • Provide a pre-generated token in api_token, or
  • Let the connector mint one automatically: set cloud_api_key (a Cube Cloud API key from Account → API keys) together with deployment_id and environment_id. The connector calls the Control Plane tokens-for-meta-sync endpoint to obtain a short-lived, metadata-only token. Override the Control Plane host with cloud_api_url if it differs from the api_url host, and embed a security_context to scope multi-tenant visibility.

If the Metadata API cannot be reached, the connector logs a warning and continues with /v1/meta only (structural metadata and view-to-cube lineage, but no warehouse lineage).

Reports and workbooks (Cube Cloud Platform API)

Reports and workbooks are read from the Cube Cloud Platform API, which is authenticated with a Cube Cloud API key as a Bearer token. Set cloud_api_key and deployment_id to enable this (environment_id is not required for reports/workbooks — it is only needed when minting a Metadata API token). When these are absent, or for Cube Core, report/workbook ingestion is skipped silently. A failed Platform API call logs a warning and does not abort the run.
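
The credential requirements from the last two sections can be summarised in a short decision sketch. The function below is hypothetical and only restates the rules on this page: token minting needs cloud_api_key, deployment_id, and environment_id, while reports and workbooks need only the first two, and everything here applies to Cube Cloud only.

```python
def cloud_features(config: dict) -> dict:
    """Report which Cube Cloud features a source config would enable."""
    is_cloud = config.get("deployment_type") == "CLOUD"
    has_key = bool(config.get("cloud_api_key"))
    has_deployment = bool(config.get("deployment_id"))
    has_environment = bool(config.get("environment_id"))

    # Platform API access (reports/workbooks) needs the key and deployment id.
    platform_api = is_cloud and has_key and has_deployment
    return {
        # Minting a Metadata API token additionally needs environment_id.
        "mint_metadata_token": platform_api and has_environment,
        "ingest_reports": platform_api and bool(config.get("include_reports", True)),
        "ingest_workbooks": platform_api and bool(config.get("include_workbooks", True)),
    }


without_env = cloud_features(
    {"deployment_type": "CLOUD", "cloud_api_key": "key", "deployment_id": "12345"}
)
```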

Multi-tenancy and context variables

Cube context variables (COMPILE_CONTEXT, SECURITY_CONTEXT, FILTER_PARAMS, FILTER_GROUP, SQL_UTILS) are data-model authoring constructs, not metadata the APIs expose as structured fields — there is nothing separate to ingest. They affect the connector only indirectly:

  • COMPILE_CONTEXT (multi-tenancy). Cube compiles a different data model per security context. The connector ingests the single compiled model that matches the security context carried by its token: set security_context when minting a token via the Control Plane API, or rely on the claims baked into a directly-supplied api_token. To catalog multiple tenants, run one ingestion per tenant. Because their cubes and views share names, distinguish them with platform_instance / env (or cube_pattern / view_pattern) to avoid URN collisions.
  • FILTER_PARAMS / SQL_UTILS in cube SQL. The SQL returned by /v1/meta is already compiled (FILTER_PARAMS render to their defaults and COMPILE_CONTEXT is resolved), so Cube Core SQL lineage parsing operates on the resolved SQL. The parse is wrapped defensively in case a template still cannot be parsed. On Cube Cloud the Metadata API returns resolved table_references / column_references, so templating is irrelevant there.
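
One way to set up the one-ingestion-per-tenant approach is to generate a recipe per tenant from a shared base config, as sketched below. The tenant names and the tenant_id claim are hypothetical; the claim your data model reads from the security context may be named differently.

```python
import copy
from typing import Dict, Iterable


def tenant_recipes(base_config: dict, tenants: Iterable[str]) -> Dict[str, dict]:
    """Build one source recipe per tenant, each with its own platform_instance."""
    recipes = {}
    for tenant in tenants:
        config = copy.deepcopy(base_config)  # never mutate the shared base
        # A distinct platform_instance keeps same-named cubes from colliding.
        config["platform_instance"] = tenant
        # Assumption: the data model selects the tenant from this claim.
        config["security_context"] = {"tenant_id": tenant}
        recipes[tenant] = {"source": {"type": "cube", "config": config}}
    return recipes


base = {
    "api_url": "https://your-deployment.cubecloud.dev/cubejs-api",
    "deployment_type": "CLOUD",
}
recipes = tenant_recipes(base, ["tenant_a", "tenant_b"])
```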

Limitations

  • The /v1/meta endpoint does not return cubes or views marked public: false. On Cube Cloud the Metadata API may still return them (and the connector merges them in); on Cube Core such cubes are not ingested as datasets, though lineage edges to them are still emitted.
  • Warehouse lineage on Cube Cloud requires a metadata-scoped token for the Metadata API (supplied via api_token, or minted automatically with cloud_api_key + deployment_id + environment_id). Without it, the connector falls back to /v1/meta and only view-to-cube lineage is available.
  • The Control Plane audit-logs export and Orchestration API (pre-aggregation build jobs) are intentionally not used — they are operational/governance surfaces rather than data-catalog metadata, and the audit-logs export is an Enterprise-only CSV stream.
  • Column-level lineage on Cube Core relies on member names matching the warehouse column names (Cube's default convention) and on the upstream table's schema already being present in DataHub. Members backed by a renamed or computed expression (e.g. total_amount over amount, or any aggregate measure) are not column-linked, since Cube Core's /v1/meta does not expose the underlying member SQL. Cube Cloud's Metadata API provides exact references and has no such limitation.
  • Usage statistics and query profiling are not ingested. Cube does not expose query history through a pull API — it is only available via Query History export, which pushes logs to an external sink (e.g. S3). Ingesting that exported data would be a separate pipeline rather than a Metadata API feature.
  • Pre-aggregation definitions are not exposed by Cube Core's /v1/meta (it returns only measures, dimensions, segments, hierarchies, and folders); they are an internal caching concern. Where a payload does include them, their names are captured as custom properties.

Troubleshooting

"Required scope is missing" / Metadata API falls back to /v1/meta

The configured api_token is a regular REST/data token rather than a metadata-scoped token. Either set cloud_api_key + deployment_id + environment_id so the connector mints a metadata token via the Control Plane API, supply a pre-generated metadata token in api_token, or set use_metadata_api: false to silence the fallback warning.

No warehouse lineage appears

Confirm warehouse_platform is set (or auto-detected), and that the upstream datasets were ingested with the same warehouse_platform_instance and warehouse_env you configured here.

Warehouse lineage edges do not connect to existing datasets

When run against a DataHub instance (the usual case), the connector reconciles the casing of upstream warehouse table URNs and column names against what the warehouse connector actually ingested — it looks up the real schema in DataHub and snaps Cube's reported identifiers to it. This handles platforms that fold identifiers differently (Postgres/Redshift lower-case, Snowflake upper-case, BigQuery case-sensitive) without per-platform configuration.

When the upstream schema is not yet in DataHub (e.g. the warehouse has not been ingested, or a dry run with no server), there is nothing to reconcile against, so the connector falls back to its configured behaviour: it lowercases upstream warehouse table and column names by default. If the warehouse connector was configured with convert_urns_to_lowercase: false, set convert_lineage_urns_to_lowercase: false here so the fallback URNs match. Ingesting the warehouse first is the most reliable fix.
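
The two behaviours described above (snap to what DataHub already knows, otherwise fall back to the configured casing) can be sketched as follows. The function and its arguments are hypothetical and simplify the real lookup, which resolves schemas from DataHub rather than from an in-memory list.

```python
from typing import Iterable


def reconcile_identifier(
    reported: str, known: Iterable[str], lowercase_fallback: bool = True
) -> str:
    """Snap an identifier to its known casing, else apply the fallback rule."""
    by_folded = {name.casefold(): name for name in known}
    match = by_folded.get(reported.casefold())
    if match is not None:
        return match  # use the casing the warehouse connector ingested
    # Nothing to reconcile against: mirror convert_lineage_urns_to_lowercase.
    return reported.lower() if lowercase_fallback else reported


snapped = reconcile_identifier("analytics.public.orders", ["ANALYTICS.PUBLIC.ORDERS"])
fallback = reconcile_identifier("Analytics.Public.Orders", [])
```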

Code Coordinates

  • Class Name: datahub.ingestion.source.cube.cube.CubeSource
Questions?

If you've got any questions on configuring ingestion for Cube, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.
