Dataplex
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Links Dataplex-discovered BigQuery tables to BigQuery dataset containers. Supports dual API extraction: Entries API (Universal Catalog) for system-managed resources, and Entities API (Lakes/Zones) for Dataplex-managed assets. Dataplex hierarchy (lakes, zones, assets) preserved as custom properties. |
| Detect Deleted Entities | ✅ | Enabled by default when stateful ingestion is configured. Tracks entities from both Entries API (Universal Catalog) and Entities API (Lakes/Zones). |
| Schema Metadata | ✅ | Extract schema information from Entries API (Universal Catalog) and Entities API (discovered tables/filesets). Schema extraction can be disabled via include_schema config for faster ingestion. |
| Table-Level Lineage | ✅ | Extract table-level lineage from Dataplex Lineage API. Supports configurable retry logic (lineage_max_retries, lineage_retry_backoff_multiplier) for handling transient errors. |
| Test Connection | ✅ | Verifies connectivity to Dataplex API, including both Entries API (Universal Catalog) and Entities API (Lakes/Zones) if enabled. |
Source to ingest metadata from Google Dataplex. Ingesting metadata from Google Dataplex requires using the dataplex module.
Prerequisites
Please refer to the Dataplex documentation for basic information on Google Dataplex.
Credentials to access GCP
Please refer to the GCP docs to understand how to set up Application Default Credentials.
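If you want to confirm that Application Default Credentials resolve correctly before running ingestion, a quick check like the following can print the project the credentials belong to. This is a minimal sketch using the google-auth package (the cloud-platform scope is an assumption, not a connector requirement):

import google.auth

# Resolve Application Default Credentials; raises DefaultCredentialsError if none are found.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
print(f"Application Default Credentials resolved for project: {project_id}")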
Permissions
Grant the following permissions to the service account on every project from which you would like to extract metadata.
For Universal Catalog Entries API (default, include_entries: true):
Default GCP Role: roles/dataplex.catalogViewer
| Permission | Description |
|---|---|
| dataplex.entryGroups.get | Retrieve specific entry group details |
| dataplex.entryGroups.list | View all entry groups in a location |
| dataplex.entries.get | Access entry metadata and details |
| dataplex.entries.getData | View data aspects within entries |
| dataplex.entries.list | Enumerate entries within groups |
For Lakes/Zones Entities API (optional, include_entities: true):
Default GCP Role: roles/dataplex.viewer
| Permission | Description |
|---|---|
| dataplex.lakes.get | Allows a user to view details of a specific lake |
| dataplex.lakes.list | Allows a user to view and list all lakes in a project |
| dataplex.zones.get | Allows a user to view details of a specific zone |
| dataplex.zones.list | Allows a user to view and list all zones in a lake |
| dataplex.assets.get | Allows a user to view details of a specific asset |
| dataplex.assets.list | Allows a user to view and list all assets in a zone |
| dataplex.entities.get | Allows a user to view details of a specific entity |
| dataplex.entities.list | Allows a user to view and list all entities in a zone |
For lineage extraction (optional, include_lineage: true):
Default GCP Role: roles/datalineage.viewer
| Permission | Description |
|---|---|
| datalineage.links.get | Allows a user to view lineage links |
| datalineage.links.search | Allows a user to search for lineage links |
Note: If using both APIs, grant both sets of permissions. Most users only need roles/dataplex.catalogViewer for Entries API access.
Create a service account and assign roles
Set up a service account as per the GCP docs and assign the previously mentioned roles to it.
Download a service account JSON keyfile.
Example credential file:
{
"type": "service_account",
"project_id": "project-id-1234567",
"private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
"client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
"client_id": "113545814931671546333",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%suppproject-id-1234567.iam.gserviceaccount.com"
}
To provide credentials to the source, you can either:
Set an environment variable:
$ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
or
Set credential config in your source based on the credential json file. For example:
credential:
project_id: "project-id-1234567"
private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
client_id: "123456678890"
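With the roles granted and credentials in place, you can sanity-check that the service account can reach both APIs before running ingestion. This is a hedged sketch that assumes the google-cloud-dataplex client library (CatalogServiceClient requires a recent version); the project ID and locations are placeholders:

from google.cloud import dataplex_v1

project = "my-gcp-project"  # placeholder

# Entries API (Universal Catalog) - needs roles/dataplex.catalogViewer
catalog = dataplex_v1.CatalogServiceClient()
for entry_group in catalog.list_entry_groups(parent=f"projects/{project}/locations/us"):
    print("entry group:", entry_group.name)

# Entities API (Lakes/Zones) - needs roles/dataplex.viewer
service = dataplex_v1.DataplexServiceClient()
for lake in service.list_lakes(parent=f"projects/{project}/locations/us-central1"):
    print("lake:", lake.name)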
Integration Details
The Dataplex connector extracts metadata from Google Dataplex using two complementary APIs:
Universal Catalog Entries API (Primary, default enabled): Extracts entries from system-managed entry groups for Google Cloud services. This is the recommended approach for discovering resources across your GCP organization. Supported services include:
- BigQuery: datasets, tables, models, routines, connections, and linked datasets
- Cloud SQL: instances
- AlloyDB: instances, databases, schemas, tables, and views
- Spanner: instances, databases, and tables
- Pub/Sub: topics and subscriptions
- Cloud Storage: buckets
- Bigtable: instances, clusters, and tables
- Vertex AI: models, datasets, and feature stores
- Dataform: repositories and workflows
- Dataproc Metastore: services and databases
Lakes/Zones Entities API (Optional, default disabled): Extracts entities from Dataplex lakes and zones. Use this if you are using the legacy Data Catalog and need lake/zone information not available in the Entries API. See the API Selection Guide below for detailed guidance on when to use each API, as using both APIs can cause loss of custom properties.
Datasets are ingested using their source platform URNs (BigQuery, GCS, etc.) to align with native source connectors.
Concept Mapping
This ingestion source maps the following Dataplex Concepts to DataHub Concepts:
| Dataplex Concept | DataHub Concept | Notes |
|---|---|---|
| Entry (Universal Catalog) | Dataset | Metadata from Universal Catalog for Google Cloud services (BigQuery, Cloud SQL, AlloyDB, Spanner, Pub/Sub, GCS, Bigtable, Vertex AI, Dataform, Dataproc Metastore). Ingested using source platform URNs (e.g., bigquery, gcs, spanner). Schema metadata is extracted when available. |
| Entity (Lakes/Zones) | Dataset | Discovered table or fileset from lakes/zones. Ingested using source platform URNs (e.g., bigquery, gcs). Schema metadata is extracted when available. |
| BigQuery Project/Dataset | Container | BigQuery projects and datasets are created as containers to align with the native BigQuery connector. Dataplex-discovered BigQuery tables are linked to these containers. |
| Lake/Zone/Asset | Custom Properties | Dataplex hierarchy information (lake, zone, asset, zone type) is preserved as custom properties on datasets for traceability without creating separate containers. |
API Selection Guide
When to use Entries API (default, include_entries: true):
- ✅ You want to discover all BigQuery tables, Pub/Sub topics, and other Google Cloud resources
- ✅ You need comprehensive metadata from Dataplex's centralized catalog
- ✅ You want system-managed discovery without manual lake/zone configuration
- ✅ Recommended for most users
When to use Entities API (include_entities: true):
- Use this if you need lake/zone information that isn't available in the Entries API
- Provides Dataplex organizational context (lake, zone, asset metadata)
- Can be used alongside Entries API, but see warning below
Important: To access system-managed entry groups like @bigquery that contain BigQuery tables, you must use multi-region locations (us, eu, asia) via the entries_location config parameter. Regional locations (us-central1, etc.) only contain placeholder entries.
⚠️ Using Both APIs Together - Important Behavior
When both include_entries and include_entities are enabled and they discover the same table (same URN), the metadata behaves as follows:
What gets preserved:
- ✅ Schema metadata (from Entries API - most authoritative)
- ✅ Entry-specific custom properties (dataplex_entry_id, dataplex_entry_group, dataplex_fully_qualified_name, etc.)
What gets lost:
- ❌ Entity-specific custom properties (dataplex_lake, dataplex_zone, dataplex_zone_type, data_path, system, format, asset)
Why this happens: DataHub replaces metadata at the aspect level (last writer wins for each aspect). When the Entries API emits metadata for a dataset that was already processed by the Entities API, it completely replaces the datasetProperties aspect, which contains all custom properties.
Recommendation:
- For most users: Use Entries API only (default). It provides comprehensive metadata from Universal Catalog.
- For lake/zone context: Use Entities API only with include_entries: false if you specifically need Dataplex organizational metadata.
- Using both: Only enable both APIs if you need Entries API for some tables and Entities API for others (non-overlapping datasets). For overlapping tables, entry metadata will take precedence and entity context will be lost.
Example showing the data loss:
# Entity metadata (first):
custom_properties:
dataplex_lake: "production-lake"
dataplex_zone: "raw-zone"
dataplex_zone_type: "RAW"
data_path: "gs://bucket/path"
# After Entry metadata (second) - lake/zone info is lost:
custom_properties:
dataplex_entry_id: "abc123"
dataplex_entry_group: "@bigquery"
dataplex_fully_qualified_name: "bigquery:project.dataset.table"
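For readers familiar with the DataHub Python emitter, the overwrite can be illustrated with two proposals for the same URN. This is a minimal sketch with hypothetical names, not the connector's internal code; because datasetProperties is upserted as a whole aspect, the second proposal's customProperties replace the first's entirely:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass

urn = make_dataset_urn("bigquery", "my-project.sales.orders", "PROD")  # hypothetical table

# Emitted first, from the Entities API (lake/zone context):
entity_props = MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=DatasetPropertiesClass(
        customProperties={"dataplex_lake": "production-lake", "dataplex_zone": "raw-zone"}
    ),
)

# Emitted later, from the Entries API - when sent to DataHub, this aspect replaces
# the one above, dropping the lake/zone properties:
entry_props = MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=DatasetPropertiesClass(
        customProperties={"dataplex_entry_id": "abc123", "dataplex_entry_group": "@bigquery"}
    ),
)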
Filtering Configuration
The connector supports filtering at multiple levels with clear separation between Entries API and Entities API filters:
Entries API Filtering (only applies when include_entries=true):
- entries.dataset_pattern: Filter which entry IDs to ingest from Universal Catalog
- Supports regex patterns with allow/deny lists
- Applies to entries discovered from system-managed entry groups like @bigquery
Entities API Filtering (only applies when include_entities=true):
- entities.lake_pattern: Filter which lakes to process
- entities.zone_pattern: Filter which zones to process
- entities.dataset_pattern: Filter which entity IDs (tables/filesets) to ingest from lakes/zones
- Supports regex patterns with allow/deny lists
These filters are nested under filter_config.entries and filter_config.entities to make it clear which API each filter applies to. This allows you to have different filtering rules for each API when both are enabled.
Example with filtering:
source:
type: dataplex
config:
project_ids:
- "my-gcp-project"
entries_location: "us"
include_entries: true
include_entities: true
filter_config:
# Entries API filtering (Universal Catalog)
entries:
dataset_pattern:
allow:
- "bq_.*" # Allow BigQuery entries starting with bq_
- "pubsub_.*" # Allow Pub/Sub entries
deny:
- ".*_test" # Deny test entries
- ".*_temp" # Deny temporary entries
# Entities API filtering (Lakes/Zones)
entities:
lake_pattern:
allow:
- "production-.*" # Only production lakes
zone_pattern:
deny:
- ".*-sandbox" # Exclude sandbox zones
dataset_pattern:
allow:
- "table_.*" # Allow entities starting with table_
- "fileset_.*" # Allow filesets
deny:
- ".*_backup" # Exclude backups
Platform Alignment
The connector generates datasets that align with native source connectors:
BigQuery Entities:
- URN Format: urn:li:dataset:(urn:li:dataPlatform:bigquery,{project}.{dataset}.{table},PROD)
- Container: Linked to BigQuery dataset containers (same as BigQuery connector)
- Platform: bigquery
GCS Entities:
- URN Format: urn:li:dataset:(urn:li:dataPlatform:gcs,{bucket}/{path},PROD)
- Container: No container (same as GCS connector)
- Platform: gcs
This alignment ensures:
- Consistency: Dataplex-discovered entities appear alongside native BigQuery/GCS entities in the same container hierarchy
- No Duplication: If you run both Dataplex and BigQuery/GCS connectors, entities discovered by both will merge (same URN)
- Unified Navigation: Users see a single view of BigQuery datasets or GCS buckets, regardless of discovery method
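For reference, the URN formats described above can be produced with DataHub's Python helper. A short sketch with hypothetical table and bucket names:

from datahub.emitter.mce_builder import make_dataset_urn

bq_urn = make_dataset_urn(platform="bigquery", name="my-project.sales.orders", env="PROD")
gcs_urn = make_dataset_urn(platform="gcs", name="my-bucket/raw/orders", env="PROD")

print(bq_urn)
# urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.sales.orders,PROD)
print(gcs_urn)
# urn:li:dataset:(urn:li:dataPlatform:gcs,my-bucket/raw/orders,PROD)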
Dataplex Context Preservation
Dataplex-specific metadata is preserved as custom properties on each dataset:
| Custom Property | Description | Example Value |
|---|---|---|
| dataplex_ingested | Indicates this entity was discovered by Dataplex | "true" |
| dataplex_lake | Dataplex lake ID | "my-data-lake" |
| dataplex_zone | Dataplex zone ID | "raw-zone" |
| dataplex_entity_id | Dataplex entity ID | "customer_table" |
| dataplex_zone_type | Zone type (RAW or CURATED) | "RAW" |
| data_path | GCS path for the entity | "gs://bucket/..." |
| system | Storage system (BIGQUERY, CLOUD_STORAGE) | "BIGQUERY" |
| format | Data format (PARQUET, AVRO, etc.) | "PARQUET" |
These properties allow you to:
- Identify which assets were discovered through Dataplex
- Understand the Dataplex organizational structure (lakes, zones)
- Filter or search for Dataplex-managed entities
- Trace entities back to their Dataplex catalog origin
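As an example of how these properties can be used, the following hedged sketch fetches a dataset's properties from a DataHub instance (assumed to be at http://localhost:8080, with a hypothetical table name) and checks whether it was discovered through Dataplex:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DatasetPropertiesClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
urn = make_dataset_urn(platform="bigquery", name="my-project.sales.orders", env="PROD")

# Read the datasetProperties aspect and inspect its custom properties.
props = graph.get_aspect(entity_urn=urn, aspect_type=DatasetPropertiesClass)
if props and props.customProperties.get("dataplex_ingested") == "true":
    print("Discovered via Dataplex:",
          props.customProperties.get("dataplex_lake"),
          props.customProperties.get("dataplex_zone"))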
Lineage
When include_lineage is enabled and proper permissions are granted, the connector extracts table-level lineage using the Dataplex Lineage API. Dataplex automatically tracks lineage from these Google Cloud systems:
Supported Systems:
- BigQuery: DDL (CREATE TABLE, CREATE TABLE AS SELECT, views, materialized views) and DML (SELECT, INSERT, MERGE, UPDATE, DELETE) operations
- Cloud Data Fusion: Pipeline executions
- Cloud Composer: Workflow orchestration
- Dataflow: Streaming and batch jobs
- Dataproc: Spark and Hadoop jobs
- Vertex AI: ML pipeline operations
Not Supported:
- Column-level lineage: The connector extracts only table-level lineage (column-level lineage is available in Dataplex but not exposed through this connector)
- Custom sources: Only Google Cloud systems with automatic lineage tracking are supported
- BigQuery Data Transfer Service: Recurring loads are not automatically tracked
Lineage Limitations:
- Lineage data is retained for 30 days in Dataplex
- Lineage may take up to 24 hours to appear after job completion
For more details, see Dataplex Lineage Documentation.
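If you want to inspect lineage outside the connector, the same links can be queried directly with the google-cloud-datacatalog-lineage client. A hedged sketch, where the project, location, and fully qualified table name are placeholders:

from google.cloud import datacatalog_lineage_v1

client = datacatalog_lineage_v1.LineageClient()
request = datacatalog_lineage_v1.SearchLinksRequest(
    parent="projects/my-gcp-project/locations/us",  # multi-region location
    target=datacatalog_lineage_v1.EntityReference(
        fully_qualified_name="bigquery:my-project.sales.orders"
    ),
)
for link in client.search_links(request=request):
    print(link.source.fully_qualified_name, "->", link.target.fully_qualified_name)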
Metadata Extraction and Performance Configuration Options:
- include_schema (default: true): Enable schema metadata extraction (columns, types, descriptions). Set to false to skip schema extraction for faster ingestion when only basic dataset metadata is needed. Disabling schema extraction can improve performance for large deployments.
- include_lineage (default: true): Enable table-level lineage extraction. Lineage API calls automatically retry transient errors (timeouts, rate limits, service unavailable) with exponential backoff.
- batch_size (default: 1000): Controls batching for metadata emission and lineage extraction. Lower values reduce memory usage but may increase processing time. Set to null to disable batching. Recommended: 1000 for large deployments (>10k entities), null for small deployments (<1k entities).
Lineage Retry Configuration:
You can customize how the connector handles transient errors when extracting lineage:
- lineage_max_retries (default: 3, range: 1-10): Maximum number of retry attempts for lineage API calls when encountering transient errors (timeouts, rate limits, service unavailable). Each attempt uses exponential backoff. Higher values increase resilience but may slow down ingestion.
- lineage_retry_backoff_multiplier (default: 1.0, range: 0.1-10.0): Multiplier for exponential backoff between retry attempts (in seconds). Wait time formula: multiplier * (2 ^ attempt_number), capped between 2 and 10 seconds. Higher values reduce API load but increase ingestion time.
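The wait-time formula above can be expressed as a small helper. This is a sketch of the documented formula, not the connector's internal retry code:

def lineage_retry_wait(attempt_number: int, multiplier: float = 1.0) -> float:
    """Wait before the next retry: multiplier * (2 ^ attempt), clamped to 2-10 seconds."""
    wait = multiplier * (2 ** attempt_number)
    return min(max(wait, 2.0), 10.0)

for attempt in range(1, 4):
    print(f"attempt {attempt}: wait {lineage_retry_wait(attempt)}s")
# attempt 1: wait 2.0s, attempt 2: wait 4.0s, attempt 3: wait 8.0s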
Automatic Retry Behavior:
The connector automatically handles transient errors when extracting lineage:
Retried Errors (with exponential backoff):
- Timeouts: DeadlineExceeded errors from slow API responses
- Rate Limiting: HTTP 429 (TooManyRequests) errors
- Service Issues: HTTP 503 (ServiceUnavailable), HTTP 500 (InternalServerError)
Non-Retried Errors (logs warning and continues):
- Permission Denied: HTTP 403 - missing the roles/datalineage.viewer role
- Not Found: HTTP 404 - entity or lineage data doesn't exist
- Invalid Argument: HTTP 400 - incorrect API parameters (e.g., wrong region format)
Common Configuration Issues:
- Regional restrictions: The Lineage API requires a multi-region location (e.g., us, eu) rather than specific regions (e.g., us-central1). The connector automatically converts your location config.
- Missing permissions: Ensure the service account has the roles/datalineage.viewer role on all projects.
- No lineage data: Some entities may not have lineage if they weren't created through supported systems (BigQuery DDL/DML, Cloud Data Fusion, etc.).
- Rate limiting: If you encounter persistent rate limiting, increase lineage_retry_backoff_multiplier to add more delay between retries, or decrease lineage_max_retries if you prefer faster failure.
After exhausting retries, the connector logs a warning and continues processing other entities - you'll still get metadata (lakes, zones, assets, entities, schema) even if lineage extraction fails for some entities.
Example Configuration:
source:
type: dataplex
config:
project_ids:
- "my-gcp-project"
# Location for lakes/zones/entities (if using include_entities)
location: "us-central1"
# Location for entries (Universal Catalog) - defaults to "us"
# Must be multi-region (us, eu, asia) for system entry groups like @bigquery
entries_location: "us" # Default value, can be omitted
# API selection
include_entries: true # Enable Universal Catalog entries (default: true)
include_entities: false # Enable lakes/zones entities (default: false)
# Metadata extraction settings
include_schema: true # Enable schema metadata extraction (default: true)
include_lineage: true # Enable lineage extraction with automatic retries
# Lineage retry settings (optional, defaults shown)
lineage_max_retries: 3 # Max retry attempts (range: 1-10)
lineage_retry_backoff_multiplier: 1.0 # Exponential backoff multiplier (range: 0.1-10.0)
Advanced Configuration for Large Deployments:
For deployments with thousands of entities, memory optimization is critical. The connector uses batched emission to keep memory bounded:
source:
type: dataplex
config:
project_ids:
- "my-gcp-project"
location: "us-central1"
entries_location: "us"
# API selection
include_entries: true
include_entities: false
# Memory optimization for large deployments
batch_size: 1000 # Batch size for metadata emission and lineage extraction
# Entries/entities are emitted in batches of 1000 to prevent memory issues
# Set to null to disable batching (only for small deployments <1k entities)
max_workers: 10 # Parallelize entity extraction across zones
How Batching Works:
- Entries and entities are collected during API streaming
- When a batch reaches batch_size entries, it's immediately emitted to DataHub
- The batch cache is cleared to free memory
- This keeps memory usage bounded regardless of dataset size
- For deployments with 50k+ entities, batching prevents out-of-memory errors
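The pattern is essentially a bounded buffer that is flushed whenever it reaches batch_size. A simplified sketch of this batched-emission idea, not the connector's actual implementation:

from typing import Iterable, Iterator, List


def emit_in_batches(items: Iterable[dict], batch_size: int = 1000) -> Iterator[List[dict]]:
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch   # hand the full batch off for emission to DataHub
            batch = []    # drop the reference so memory stays bounded
    if batch:
        yield batch       # flush the final partial batch


for batch in emit_in_batches(({"id": i} for i in range(2500)), batch_size=1000):
    print(f"emitting {len(batch)} work units")
# prints batches of 1000, 1000, and 500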
Lineage Limitations:
- Dataplex does not support column-level lineage extraction
- Lineage retention period: 30 days (Dataplex limitation)
- Cross-region lineage is not supported by Dataplex
- Lineage is only available for entities with active lineage tracking enabled
For more details on lineage limitations, refer to the GCP docs.
Python Dependencies
The connector requires the following Python packages, which are automatically installed with acryl-datahub[dataplex]:
- google-cloud-dataplex>=1.0.0
- google-cloud-datacatalog-lineage==0.2.2 (required for lineage extraction when include_lineage: true)
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
type: dataplex
config:
# Required: GCP project ID(s) where Dataplex resources are located
project_ids:
- "my-gcp-project"
# Optional: GCP location for lakes/zones/entities (default: us-central1)
# Use regional locations like us-central1, europe-west1, etc.
location: "us-central1"
# Optional: GCP location for entries (Universal Catalog)
# Use multi-region locations (us, eu, asia) to access system entry groups like @bigquery
# If not specified, uses the same value as 'location'
entries_location: "us"
# Optional: Environment (default: PROD)
env: "PROD"
# Optional: GCP credentials (if not using Application Default Credentials)
# credential:
# project_id: "my-gcp-project"
# private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
# private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
# client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
# client_id: "123456678890"
# Optional: API Selection
# include_entries: true # Extract from Universal Catalog (default: true, recommended)
# include_entities: false # Extract from Lakes/Zones (default: false)
# include_lineage: true # Extract lineage (default: true)
# include_schema: true # Extract schema metadata (default: true)
# Optional: Filtering patterns
# filter_config:
# # Entries API filters (only applies when include_entries=true)
# entries:
# dataset_pattern:
# allow:
# - "bq_.*" # Allow BigQuery entries
# - "pubsub_.*" # Allow Pub/Sub entries
# deny:
# - ".*_test" # Deny test entries
# - ".*_temp" # Deny temporary entries
#
# # Entities API filters (only applies when include_entities=true)
# entities:
# lake_pattern:
# allow:
# - "retail-.*"
# - "finance-.*"
# deny:
# - ".*-test"
# zone_pattern:
# allow:
# - ".*"
# deny:
# - "deprecated-.*"
# dataset_pattern:
# allow:
# - "table_.*" # Allow tables
# - "fileset_.*" # Allow filesets
# deny:
# - ".*_backup" # Exclude backups
# Optional: Performance tuning
# max_workers: 10
sink:
type: datahub-rest
config:
server: "http://localhost:8080"
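As an alternative to the CLI, the same recipe can be run programmatically through DataHub's Python API. A minimal sketch mirroring the config above:

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dataplex",
            "config": {
                "project_ids": ["my-gcp-project"],
                "location": "us-central1",
                "entries_location": "us",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()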
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
batch_size One of integer, null | Batch size for metadata emission and lineage extraction. Entries and entities are emitted in batches to prevent memory issues in large deployments. Lower values reduce memory usage but may increase processing time. Set to None to disable batching (process all entities at once). Recommended: 1000 for large deployments (>10k entities), None for small deployments (<1k entities). Default: 1000 |
dataplex_url string | Base URL for Dataplex console (for generating external links). |
enable_stateful_lineage_ingestion boolean | Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. NOTE: This only works with use_queries_v2=False (legacy extraction path). For queries v2, use enable_stateful_time_window instead. Default: True |
entries_location string | GCP location for Universal Catalog entries extraction. Must be a multi-region location (us, eu, asia) to access system-managed entry groups like @bigquery. Regional locations (us-central1, etc.) only contain placeholder entries and will miss BigQuery tables. The default ('us') is recommended for most users. Default: us |
include_entities boolean | Whether to include Entity metadata from Lakes/Zones (discovered tables/filesets) as Datasets. This is optional and complements the Entries API data. WARNING: When both include_entries and include_entities are enabled and discover the same table, entries will completely replace entity metadata including custom properties (lake, zone, asset info will be lost). Recommended: Use only ONE API, or ensure APIs discover non-overlapping datasets. See documentation for details. Default: False |
include_entries boolean | Whether to extract Entries from Universal Catalog. This is the primary source of metadata and takes precedence when both sources are enabled. Default: True |
include_lineage boolean | Whether to extract lineage information using Dataplex Lineage API. Extracts table-level lineage relationships between entities. Lineage API calls automatically retry transient errors (timeouts, rate limits) with exponential backoff. Default: True |
include_schema boolean | Whether to extract and ingest schema metadata (columns, types, descriptions). Set to False to skip schema extraction for faster ingestion when only basic dataset metadata is needed. Disabling schema extraction can improve performance for large deployments. Default: True |
lineage_max_retries integer | Maximum number of retry attempts for lineage API calls when encountering transient errors (timeouts, rate limits, service unavailable). Each attempt uses exponential backoff. Higher values increase resilience but may slow down ingestion. Default: 3 |
lineage_retry_backoff_multiplier number | Multiplier for exponential backoff between lineage API retry attempts (in seconds). Wait time formula: multiplier * (2 ^ attempt_number), capped between 2-10 seconds. Higher values reduce API load but increase ingestion time. Default: 1.0 |
location string | GCP location/region where Dataplex lakes, zones, and entities are located (e.g., us-central1, europe-west1). Only used for entities extraction (include_entities=True). Default: us-central1 |
max_workers integer | Number of worker threads to use to parallelize zone entity extraction. Set to 1 to disable parallelization. Default: 10 |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
env string | The environment that all assets produced by this connector belong to Default: PROD |
credential One of GCPCredential, null | GCP credential information. If not specified, uses Application Default Credentials. Default: None |
credential.client_email ❓ string | Client email |
credential.client_id ❓ string | Client Id |
credential.private_key ❓ string | Private key in a form of '-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n' |
credential.private_key_id ❓ string | Private key id |
credential.auth_provider_x509_cert_url string | Auth provider x509 certificate url |
credential.auth_uri string | Authentication uri |
credential.client_x509_cert_url One of string, null | If not set it will be default to https://www.googleapis.com/robot/v1/metadata/x509/client_email Default: None |
credential.project_id One of string, null | Project id to set the credentials Default: None |
credential.token_uri string | Token uri Default: https://oauth2.googleapis.com/token |
credential.type string | Authentication type Default: service_account |
filter_config DataplexFilterConfig | Filter configuration for Dataplex ingestion. |
filter_config.entities EntitiesFilterConfig | Filter configuration specific to Dataplex Entities API (Lakes/Zones). These filters only apply when include_entities=True. |
filter_config.entities.dataset_pattern AllowDenyPattern | A class to store allow deny regexes |
filter_config.entities.dataset_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
filter_config.entities.lake_pattern AllowDenyPattern | A class to store allow deny regexes |
filter_config.entities.lake_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
filter_config.entities.zone_pattern AllowDenyPattern | A class to store allow deny regexes |
filter_config.entities.zone_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
filter_config.entries EntriesFilterConfig | Filter configuration specific to Dataplex Entries API (Universal Catalog). These filters only apply when include_entries=True. |
filter_config.entries.dataset_pattern AllowDenyPattern | A class to store allow deny regexes |
filter_config.entries.dataset_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
project_ids array | List of Google Cloud Project IDs to ingest Dataplex resources from. If not specified, uses project_id or attempts to detect from credentials. |
project_ids.string string | |
stateful_ingestion One of StatefulIngestionConfig, null | Stateful Ingestion Config Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"DataplexFilterConfig": {
"additionalProperties": false,
"description": "Filter configuration for Dataplex ingestion.",
"properties": {
"entities": {
"$ref": "#/$defs/EntitiesFilterConfig",
"description": "Filters specific to Dataplex Entities API (lakes, zones, and entity datasets). Only applies when include_entities=True."
},
"entries": {
"$ref": "#/$defs/EntriesFilterConfig",
"description": "Filters specific to Dataplex Entries API (Universal Catalog). Only applies when include_entries=True."
}
},
"title": "DataplexFilterConfig",
"type": "object"
},
"EntitiesFilterConfig": {
"additionalProperties": false,
"description": "Filter configuration specific to Dataplex Entities API (Lakes/Zones).\n\nThese filters only apply when include_entities=True.",
"properties": {
"lake_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for lake names to filter in ingestion. Only applies when include_entities=True."
},
"zone_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for zone names to filter in ingestion. Only applies when include_entities=True."
},
"dataset_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for entity IDs (tables/filesets) to filter in ingestion. Only applies when include_entities=True."
}
},
"title": "EntitiesFilterConfig",
"type": "object"
},
"EntriesFilterConfig": {
"additionalProperties": false,
"description": "Filter configuration specific to Dataplex Entries API (Universal Catalog).\n\nThese filters only apply when include_entries=True.",
"properties": {
"dataset_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for entry IDs to filter in ingestion. Only applies when include_entries=True."
}
},
"title": "EntriesFilterConfig",
"type": "object"
},
"GCPCredential": {
"additionalProperties": false,
"properties": {
"project_id": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Project id to set the credentials",
"title": "Project Id"
},
"private_key_id": {
"description": "Private key id",
"title": "Private Key Id",
"type": "string"
},
"private_key": {
"description": "Private key in a form of '-----BEGIN PRIVATE KEY-----\\nprivate-key\\n-----END PRIVATE KEY-----\\n'",
"title": "Private Key",
"type": "string"
},
"client_email": {
"description": "Client email",
"title": "Client Email",
"type": "string"
},
"client_id": {
"description": "Client Id",
"title": "Client Id",
"type": "string"
},
"auth_uri": {
"default": "https://accounts.google.com/o/oauth2/auth",
"description": "Authentication uri",
"title": "Auth Uri",
"type": "string"
},
"token_uri": {
"default": "https://oauth2.googleapis.com/token",
"description": "Token uri",
"title": "Token Uri",
"type": "string"
},
"auth_provider_x509_cert_url": {
"default": "https://www.googleapis.com/oauth2/v1/certs",
"description": "Auth provider x509 certificate url",
"title": "Auth Provider X509 Cert Url",
"type": "string"
},
"type": {
"default": "service_account",
"description": "Authentication type",
"title": "Type",
"type": "string"
},
"client_x509_cert_url": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If not set it will be default to https://www.googleapis.com/robot/v1/metadata/x509/client_email",
"title": "Client X509 Cert Url"
}
},
"required": [
"private_key_id",
"private_key",
"client_email",
"client_id"
],
"title": "GCPCredential",
"type": "object"
},
"StatefulIngestionConfig": {
"additionalProperties": false,
"description": "Basic Stateful Ingestion Specific Configuration for any source.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
}
},
"title": "StatefulIngestionConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Google Dataplex source.",
"properties": {
"enable_stateful_lineage_ingestion": {
"default": true,
"description": "Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. NOTE: This only works with use_queries_v2=False (legacy extraction path). For queries v2, use enable_stateful_time_window instead.",
"title": "Enable Stateful Lineage Ingestion",
"type": "boolean"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulIngestionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful Ingestion Config"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"credential": {
"anyOf": [
{
"$ref": "#/$defs/GCPCredential"
},
{
"type": "null"
}
],
"default": null,
"description": "GCP credential information. If not specified, uses Application Default Credentials."
},
"project_ids": {
"description": "List of Google Cloud Project IDs to ingest Dataplex resources from. If not specified, uses project_id or attempts to detect from credentials.",
"items": {
"type": "string"
},
"title": "Project Ids",
"type": "array"
},
"location": {
"default": "us-central1",
"description": "GCP location/region where Dataplex lakes, zones, and entities are located (e.g., us-central1, europe-west1). Only used for entities extraction (include_entities=True).",
"title": "Location",
"type": "string"
},
"entries_location": {
"default": "us",
"description": "GCP location for Universal Catalog entries extraction. Must be a multi-region location (us, eu, asia) to access system-managed entry groups like @bigquery. Regional locations (us-central1, etc.) only contain placeholder entries and will miss BigQuery tables. Default: 'us' (recommended for most users).",
"title": "Entries Location",
"type": "string"
},
"filter_config": {
"$ref": "#/$defs/DataplexFilterConfig",
"description": "Filters to control which Dataplex resources are ingested."
},
"include_entries": {
"default": true,
"description": "Whether to extract Entries from Universal Catalog. This is the primary source of metadata and takes precedence when both sources are enabled.",
"title": "Include Entries",
"type": "boolean"
},
"include_entities": {
"default": false,
"description": "Whether to include Entity metadata from Lakes/Zones (discovered tables/filesets) as Datasets. This is optional and complements the Entries API data. WARNING: When both include_entries and include_entities are enabled and discover the same table, entries will completely replace entity metadata including custom properties (lake, zone, asset info will be lost). Recommended: Use only ONE API, or ensure APIs discover non-overlapping datasets. See documentation for details.",
"title": "Include Entities",
"type": "boolean"
},
"include_schema": {
"default": true,
"description": "Whether to extract and ingest schema metadata (columns, types, descriptions). Set to False to skip schema extraction for faster ingestion when only basic dataset metadata is needed. Disabling schema extraction can improve performance for large deployments. Default: True.",
"title": "Include Schema",
"type": "boolean"
},
"include_lineage": {
"default": true,
"description": "Whether to extract lineage information using Dataplex Lineage API. Extracts table-level lineage relationships between entities. Lineage API calls automatically retry transient errors (timeouts, rate limits) with exponential backoff.",
"title": "Include Lineage",
"type": "boolean"
},
"lineage_max_retries": {
"default": 3,
"description": "Maximum number of retry attempts for lineage API calls when encountering transient errors (timeouts, rate limits, service unavailable). Each attempt uses exponential backoff. Higher values increase resilience but may slow down ingestion. Default: 3.",
"maximum": 10,
"minimum": 1,
"title": "Lineage Max Retries",
"type": "integer"
},
"lineage_retry_backoff_multiplier": {
"default": 1.0,
"description": "Multiplier for exponential backoff between lineage API retry attempts (in seconds). Wait time formula: multiplier * (2 ^ attempt_number), capped between 2-10 seconds. Higher values reduce API load but increase ingestion time. Default: 1.0.",
"maximum": 10.0,
"minimum": 0.1,
"title": "Lineage Retry Backoff Multiplier",
"type": "number"
},
"batch_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1000,
"description": "Batch size for metadata emission and lineage extraction. Entries and entities are emitted in batches to prevent memory issues in large deployments. Lower values reduce memory usage but may increase processing time. Set to None to disable batching (process all entities at once). Recommended: 1000 for large deployments (>10k entities), None for small deployments (<1k entities). Default: 1000.",
"title": "Batch Size"
},
"max_workers": {
"default": 10,
"description": "Number of worker threads to use to parallelize zone entity extraction. Set to 1 to disable parallelization.",
"title": "Max Workers",
"type": "integer"
},
"dataplex_url": {
"default": "https://console.cloud.google.com/dataplex",
"description": "Base URL for Dataplex console (for generating external links).",
"title": "Dataplex Url",
"type": "string"
}
},
"title": "DataplexConfig",
"type": "object"
}
Code Coordinates
- Class Name: datahub.ingestion.source.dataplex.dataplex.DataplexSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Dataplex, feel free to ping us on our Slack.