Hive Metastore
There are 2 sources that provide integration with Hive Metastore; this page documents the hive-metastore module.
| Source Module | Documentation |
|---|---|
| hive-metastore | Extracts metadata from Hive Metastore. Supports two connection methods selected via connection_type (sql or thrift). |
Module hive-metastore
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Catalog, Schema. |
| Classification | ❌ | Not Supported. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Data Profiling | ❌ | Not Supported. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
Extracts metadata from Hive Metastore.
Supports two connection methods selected via connection_type:
- sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
- thrift: Connection to HMS Thrift API with Kerberos support
Features:
- Table and view metadata extraction
- Schema field types including complex types (struct, map, array)
- Storage lineage to S3, HDFS, Azure, GCS
- View lineage via SQL parsing
- Stateful ingestion for stale entity removal
Prerequisites
The Hive Metastore connector supports two connection modes:
- SQL Mode (Default): Connects directly to the Hive metastore database (MySQL, PostgreSQL, etc.)
- Thrift Mode: Connects to Hive Metastore via the Thrift API (port 9083), with Kerberos support
Choose your connection mode based on your environment:
| Feature | SQL Mode (default) | Thrift Mode |
|---|---|---|
| Use when | Direct database access available | Only HMS Thrift API accessible |
| Authentication | Database credentials | Kerberos/SASL or unauthenticated |
| Port | Database port (3306/5432) | Thrift port (9083) |
| Dependencies | Database drivers | pymetastore, thrift-sasl |
| Filtering | Pattern-based (schema_pattern, database_pattern, table_pattern) | Pattern-based (database_pattern, table_pattern) |
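For orientation, a minimal recipe for each mode might look like the sketches below (hostnames and credentials are placeholders); complete examples appear later on this page.

```yaml
# SQL mode (default): connect directly to the metastore backend database
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432   # placeholder host
    database: metastore
    scheme: "postgresql+psycopg2"
    username: datahub_user
    password: ${METASTORE_PASSWORD}
```

```yaml
# Thrift mode: connect to the HMS Thrift API instead of the database
source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083            # placeholder host
```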
Before configuring the DataHub connector, ensure you have:
Database Access: Direct read access to the Hive metastore database (typically MySQL or PostgreSQL).
Network Access: The machine running DataHub ingestion must be able to reach your metastore database on the configured port.
Database Driver: Install the appropriate Python database driver:
# For PostgreSQL metastore
pip install 'acryl-datahub[hive]' psycopg2-binary
# For MySQL metastore
pip install 'acryl-datahub[hive]' PyMySQL
Metastore Schema Knowledge: Familiarity with your metastore database schema (typically public for PostgreSQL, or the database name for MySQL).
Required Database Permissions
The database user account used by DataHub needs read-only access to the Hive metastore tables.
PostgreSQL Metastore
-- Create a dedicated read-only user for DataHub
CREATE USER datahub_user WITH PASSWORD 'secure_password';
-- Grant connection privileges
GRANT CONNECT ON DATABASE metastore TO datahub_user;
-- Grant schema usage
GRANT USAGE ON SCHEMA public TO datahub_user;
-- Grant SELECT on metastore tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;
-- Grant SELECT on future tables (for metastore upgrades)
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
MySQL Metastore
-- Create a dedicated read-only user for DataHub
CREATE USER 'datahub_user'@'%' IDENTIFIED BY 'secure_password';
-- Grant SELECT privileges on metastore database
GRANT SELECT ON metastore.* TO 'datahub_user'@'%';
-- Apply changes
FLUSH PRIVILEGES;
Required Metastore Tables
DataHub queries the following metastore tables:
| Table | Purpose |
|---|---|
| DBS | Database/schema information |
| TBLS | Table metadata |
| TABLE_PARAMS | Table properties (including view definitions) |
| SDS | Storage descriptors (location, format) |
| COLUMNS_V2 | Column metadata |
| PARTITION_KEYS | Partition information |
| SERDES | Serialization/deserialization information |
Recommendation: Grant SELECT on all metastore tables to ensure compatibility with different Hive versions and for future DataHub enhancements.
Authentication
PostgreSQL
Standard Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
SSL Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
options:
connect_args:
sslmode: require
sslrootcert: /path/to/ca-cert.pem
MySQL
Standard Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:3306
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "mysql+pymysql" # Default if not specified
SSL Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:3306
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "mysql+pymysql"
options:
connect_args:
ssl:
ca: /path/to/ca-cert.pem
cert: /path/to/client-cert.pem
key: /path/to/client-key.pem
Amazon RDS (PostgreSQL or MySQL)
For AWS RDS-hosted metastore databases:
source:
type: hive-metastore
config:
host_port: metastore.abc123.us-east-1.rds.amazonaws.com:5432
database: metastore
username: datahub_user
password: ${RDS_PASSWORD}
scheme: "postgresql+psycopg2" # or 'mysql+pymysql'
options:
connect_args:
sslmode: require # RDS requires SSL
Azure Database for PostgreSQL/MySQL
source:
type: hive-metastore
config:
host_port: metastore-server.postgres.database.azure.com:5432
database: metastore
username: datahub_user@metastore-server # Note: Azure requires @server-name suffix
password: ${AZURE_DB_PASSWORD}
scheme: "postgresql+psycopg2"
options:
connect_args:
sslmode: require
Thrift Connection Mode
Use connection_type: thrift when you cannot access the metastore database directly but have access to the HMS Thrift API (typically port 9083). This is common in:
- Kerberized Hadoop clusters where database access is restricted
- Cloud-managed Hive services that only expose the Thrift API
- Environments with strict network segmentation
Thrift Mode Prerequisites
Before using Thrift mode, ensure:
- Network Access: The machine running DataHub ingestion can reach HMS on port 9083
- HMS Service Running: The Hive Metastore service is running and accepting Thrift connections
- For Kerberos: A valid Kerberos ticket is available (see Kerberos section below)
Verify connectivity:
# Test network connectivity to HMS
telnet hms.company.com 9083
# For Kerberos environments, verify ticket
klist
Thrift Mode Dependencies
# Install with Thrift support
pip install 'acryl-datahub[hive-metastore]'
# For Kerberos authentication, also install:
pip install thrift-sasl pyhive[hive-pure-sasl]
Thrift Configuration Options
| Option | Type | Default | Required | Description |
|---|---|---|---|---|
| connection_type | string | sql | Yes (for Thrift) | Set to thrift to enable Thrift mode |
| host_port | string | - | Yes | HMS host and port (e.g., hms.company.com:9083) |
| use_kerberos | boolean | false | No | Enable Kerberos/SASL authentication |
| kerberos_service_name | string | hive | No | Kerberos service principal name |
| kerberos_hostname_override | string | - | No | Override hostname for the Kerberos principal (for load balancers) |
| timeout_seconds | int | 60 | No | Connection timeout in seconds |
| max_retries | int | 3 | No | Maximum retry attempts for transient failures |
| catalog_name | string | - | No | HMS 3.x catalog name (e.g., spark_catalog) |
| include_catalog_name_in_ids | boolean | false | No | Include catalog in dataset URNs |
| database_pattern | AllowDeny | - | No | Filter databases by regex pattern |
| table_pattern | AllowDeny | - | No | Filter tables by regex pattern |
Note: SQL WHERE clause options (tables_where_clause_suffix, views_where_clause_suffix, schemas_where_clause_suffix) have been deprecated for security reasons (SQL injection risk) and are no longer supported. Use database_pattern and table_pattern instead.
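Where a deployment previously relied on WHERE clause suffixes, the same scoping can usually be expressed with regex patterns; a minimal sketch (database and table names below are illustrative placeholders):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    database_pattern:
      allow:
        - "^sales_.*"     # keep only databases starting with sales_
      deny:
        - "^tmp_.*"       # drop scratch databases
    table_pattern:
      deny:
        - ".*_backup$"    # drop backup tables
```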
Basic Thrift Configuration
source:
type: hive-metastore
config:
connection_type: thrift
host_port: hms.company.com:9083
Thrift with Kerberos Authentication
Ensure you have a valid Kerberos ticket (kinit -kt /path/to/keytab user@REALM) before running ingestion:
source:
type: hive-metastore
config:
connection_type: thrift
host_port: hms.company.com:9083
use_kerberos: true
kerberos_service_name: hive # Change if HMS uses different principal
# kerberos_hostname_override: hms-internal.company.com # If using load balancer
# catalog_name: spark_catalog # For HMS 3.x multi-catalog
database_pattern: # Pattern filtering (WHERE clauses NOT supported)
allow:
- "^prod_.*"
Thrift Mode Limitations
- No Presto/Trino view lineage: View SQL parsing requires SQL mode
- No WHERE clause filtering: Use database_pattern/table_pattern instead
- Kerberos ticket required: A valid ticket must exist before running (it is not embedded in the config)
- HMS version compatibility: Tested with HMS 2.x and 3.x
Storage Lineage
The Hive Metastore connector supports the same storage lineage features as the Hive connector, with enhanced performance due to direct database access.
Quick Start
Enable storage lineage with minimal configuration:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
# Enable storage lineage
emit_storage_lineage: true
Configuration Options
Storage lineage is controlled by the same parameters as the Hive connector:
| Parameter | Type | Default | Description |
|---|---|---|---|
| emit_storage_lineage | boolean | false | Master toggle to enable/disable storage lineage |
| hive_storage_lineage_direction | string | "upstream" | Direction: "upstream" (storage → Hive) or "downstream" (Hive → storage) |
| include_column_lineage | boolean | true | Enable column-level lineage from storage paths to Hive columns |
| storage_platform_instance | string | None | Platform instance for storage URNs (e.g., "prod-s3", "dev-hdfs") |
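As a sketch, a recipe that emits downstream storage lineage with a dedicated storage platform instance might combine these options as follows (the instance name is a placeholder):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    emit_storage_lineage: true
    hive_storage_lineage_direction: downstream   # Hive -> storage
    include_column_lineage: true
    storage_platform_instance: "prod-s3"         # placeholder instance name
```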
Supported Storage Platforms
All storage platforms supported by the Hive connector are also supported here:
- Amazon S3 (s3://, s3a://, s3n://)
- HDFS (hdfs://)
- Google Cloud Storage (gs://)
- Azure Blob Storage (wasb://, wasbs://)
- Azure Data Lake (adl://, abfs://, abfss://)
- Databricks File System (dbfs://)
- Local File System (file://)
See the sections above for complete configuration details.
Presto and Trino View Support
A key advantage of the Hive Metastore connector is its ability to extract metadata from Presto and Trino views that are stored in the metastore.
How It Works
1. View Detection: The connector identifies views by checking the TABLE_PARAMS table for Presto/Trino view definitions.
2. View Parsing: Presto/Trino view JSON is parsed to extract:
   - Original SQL text
   - Referenced tables
   - Column metadata and types
3. Lineage Extraction: SQL is parsed using sqlglot to create table-to-view lineage.
4. Storage Lineage Integration: If storage lineage is enabled, the connector also creates lineage from storage → tables → views.
Configuration
Presto/Trino view support is automatically enabled when ingesting from a metastore that contains Presto/Trino views. No additional configuration is required.
Example
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
# Enable storage lineage for complete lineage chain
emit_storage_lineage: true
This configuration will create complete lineage:
S3 Bucket → Hive Table → Presto View
Limitations
- Presto/Trino Version: The connector supports Presto 0.200+ and Trino view formats
- Complex SQL: Very complex SQL with non-standard syntax may have incomplete lineage
- Cross-Database References: Lineage is extracted for references within the same Hive metastore
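If parsing of particularly complex view SQL causes problems, view lineage extraction can be disabled entirely; a minimal sketch using the include_view_lineage and include_view_column_lineage options documented in Config Details below:

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    include_view_lineage: false         # skip SQL parsing of view definitions
    include_view_column_lineage: false  # only takes effect when include_view_lineage is enabled
```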
Schema Filtering
For large metastore deployments with many databases, use filtering to limit ingestion scope:
Database Filtering
source:
type: hive-metastore
config:
# ... connection config ...
# Only ingest from specific databases
schema_pattern:
allow:
- "^production_.*" # All databases starting with production_
- "analytics" # Specific database
deny:
- ".*_test$" # Exclude test databases
Database and Table Filtering with Patterns
For filtering by database name, use pattern-based filtering:
source:
type: hive-metastore
config:
# ... connection config ...
# Filter to specific databases using regex patterns
database_pattern:
allow:
- "^production_db$"
- "^analytics_db$"
deny:
- "^test_.*"
- ".*_staging$"
Note: The deprecated *_where_clause_suffix options have been removed for security reasons. Use database_pattern and table_pattern for filtering.
Performance Considerations
Advantages Over HiveServer2 Connector
The Hive Metastore connector is significantly faster than the Hive connector because:
- Direct Database Access: No HiveServer2 overhead
- Batch Queries: Fetches all metadata in optimized SQL queries
- No Query Execution: Doesn't run Hive queries to extract metadata
- Parallel Processing: Can process multiple databases concurrently
Performance Comparison (approximate):
- 10 databases, 1000 tables: ~2 minutes (Metastore) vs ~15 minutes (HiveServer2)
- 100 databases, 10,000 tables: ~15 minutes (Metastore) vs ~2 hours (HiveServer2)
Optimization Tips
Database Connection Pooling: The connector uses SQLAlchemy's default connection pooling. For very large deployments, consider tuning pool size:
options:
  pool_size: 10
  max_overflow: 20
Schema Filtering: Use schema_pattern to limit scope and reduce query time.
Stateful Ingestion: Enable it to only process changes:
stateful_ingestion:
  enabled: true
  remove_stale_metadata: true
Disable Column Lineage: If column-level storage lineage is not needed:
emit_storage_lineage: true
include_column_lineage: false # Faster
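Combined, a tuned recipe for a large deployment might look like the following sketch (pool sizes and patterns are illustrative):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    options:
      pool_size: 10                    # SQLAlchemy connection pool tuning
      max_overflow: 20
    schema_pattern:
      allow:
        - "^analytics$"                # placeholder schema
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    emit_storage_lineage: true
    include_column_lineage: false      # skip column-level storage lineage for speed
```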
Network Considerations
- Latency: Low latency to the metastore database is important
- Bandwidth: Minimal bandwidth required (only metadata, no data transfer)
- Connection Limits: Ensure metastore database can handle additional read connections
Platform Instances
When ingesting from multiple metastores (e.g., different clusters or environments), use platform_instance:
source:
type: hive-metastore
config:
host_port: prod-metastore-db.company.com:5432
database: metastore
platform_instance: "prod-hive"
Best Practice: Combine with storage_platform_instance:
source:
type: hive-metastore
config:
platform_instance: "prod-hive" # Hive tables
storage_platform_instance: "prod-hdfs" # Storage locations
emit_storage_lineage: true
Caveats and Limitations
Metastore Schema Compatibility
- Hive Versions: Tested with Hive 1.x, 2.x, and 3.x metastore schemas
- Schema Variations: Different Hive versions may have slightly different metastore schemas
- Custom Tables: If your organization has added custom metastore tables, they won't be processed
Database Support
- Supported: PostgreSQL, MySQL, MariaDB
- Not Supported: Oracle, MSSQL (may work but untested)
- Derby: Not recommended (embedded metastore, typically single-user)
View Lineage Parsing
- Simple SQL: Fully supported with accurate lineage
- Complex SQL: Best-effort parsing; some edge cases may have incomplete lineage
- Non-standard SQL: Presto/Trino-specific functions may not be fully parsed
Permissions Limitations
- Read-Only: The connector only needs SELECT permissions
- No Write Operations: Never requires INSERT, UPDATE, or DELETE
- Metastore Locks: Read operations don't acquire metastore locks
Storage Lineage Limitations
Same as the Hive connector:
- Only tables with defined storage locations have lineage
- Temporary tables are not supported
- Partition-level lineage is aggregated at table level
Known Issues
Large Column Lists: Tables with 500+ columns may be slow to process due to metastore query complexity.
View Definition Encoding: Some older Hive versions store view definitions in non-UTF-8 encoding, which may cause parsing issues.
Case Sensitivity:
- PostgreSQL metastore: Case-sensitive identifiers (use "quoted" names in WHERE clauses)
- MySQL metastore: Case-insensitive by default
- DataHub automatically lowercases URNs for consistency (see the snippet below)
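URN lowercasing is controlled by the convert_urns_to_lowercase option documented in Config Details; if you prefer to make the behavior explicit in the recipe, a minimal sketch:

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    convert_urns_to_lowercase: true  # normalize dataset URNs regardless of metastore casing
```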
Concurrent Metastore Writes: If the metastore is being actively modified during ingestion, some metadata may be inconsistent.
Troubleshooting
Connection Issues
Problem: Could not connect to metastore database
Solutions:
- Verify host_port, database, and scheme are correct
- Check network connectivity: telnet <host> <port>
- Verify firewall rules allow connections
- For PostgreSQL: Check that pg_hba.conf allows connections from your IP
- For MySQL: Check bind-address in my.cnf
Authentication Failures
Problem: Authentication failed or Access denied
Solutions:
- Verify username and password are correct
- Check user has CONNECT/LOGIN privileges
- For Azure: Ensure the username includes the @server-name suffix
- Review database logs for detailed error messages
Missing Tables
Problem: Not all tables appear in DataHub
Solutions:
- Verify database user has SELECT on all metastore tables
- Check if tables are filtered out by schema_pattern, database_pattern, or table_pattern
- Query the metastore directly to verify the tables exist:
SELECT d.name as db_name, t.tbl_name as table_name, t.tbl_type
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
WHERE d.name = 'your_database';
Presto/Trino Views Not Appearing
Problem: Views defined in Presto/Trino don't show up
Solutions:
- Check view definitions exist in metastore:
SELECT d.name as db_name, t.tbl_name as view_name, tp.param_value
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN TABLE_PARAMS tp ON t.tbl_id = tp.tbl_id
WHERE t.tbl_type = 'VIRTUAL_VIEW'
AND tp.param_key = 'presto_view'
LIMIT 10;
- Review ingestion logs for parsing errors
- Verify view JSON is valid
Storage Lineage Not Appearing
Problem: No storage lineage relationships visible
Solutions:
- Verify emit_storage_lineage: true is set
- Check that tables have storage locations in the metastore:
SELECT d.name as db_name, t.tbl_name as table_name, s.location
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN SDS s ON t.sd_id = s.sd_id
WHERE s.location IS NOT NULL
LIMIT 10;
- Review logs for "Failed to parse storage location" warnings
- See the "Storage Lineage" section above for troubleshooting tips
Slow Ingestion
Problem: Ingestion takes too long
Solutions:
- Use schema filtering to reduce scope
- Enable stateful ingestion to only process changes
- Check database query performance (may need indexes on metastore tables)
- Ensure low latency network connection to metastore database
- Consider disabling column lineage if not needed
Related Documentation
- Hive Metastore Configuration - Configuration examples
- Hive Connector - Alternative connector via HiveServer2
- SQLAlchemy Documentation - Underlying database connection library
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
# =============================================================================
# SQL Mode (Default) - Direct database connection
# =============================================================================
source:
type: hive-metastore
config:
# Hive metastore DB connection
host_port: localhost:5432
database: metastore
# specify the schema where metastore tables reside
schema_pattern:
allow:
- "^public"
# credentials
username: user # optional
password: pass # optional
#scheme: 'postgresql+psycopg2' # set this if metastore db is using postgres
#scheme: 'mysql+pymysql' # set this if metastore db is using mysql, default if unset
# Filter databases using pattern-based filtering
#database_pattern:
# allow:
# - "^db1$"
# deny:
# - "^test_.*"
# Storage Lineage Configuration (Optional)
# Enables lineage between Hive tables and their underlying storage locations
#emit_storage_lineage: false # Set to true to enable storage lineage
#hive_storage_lineage_direction: upstream # Direction: 'upstream' (storage -> Hive) or 'downstream' (Hive -> storage)
#include_column_lineage: true # Set to false to disable column-level lineage
#storage_platform_instance: "prod-hdfs" # Optional: platform instance for storage URNs
sink:
# sink configs
# =============================================================================
# Thrift Mode - HMS Thrift API connection (use when database access unavailable)
# =============================================================================
# Use this mode when:
# - You cannot access the metastore database directly
# - Only the HMS Thrift API (port 9083) is accessible
# - Your environment requires Kerberos authentication
#
# Prerequisites:
# - pip install 'acryl-datahub[hive-metastore]'
# - For Kerberos: pip install thrift-sasl pyhive[hive-pure-sasl]
# - For Kerberos: Run 'kinit' before ingestion to obtain ticket
#
# source:
# type: hive-metastore
# config:
# # =========================================================================
# # Connection Settings (Required)
# # =========================================================================
# connection_type: thrift # Enable Thrift mode (default is 'sql')
# host_port: hms.company.com:9083 # HMS Thrift API endpoint
#
# # =========================================================================
# # Authentication - Kerberos/SASL (Optional)
# # =========================================================================
# # Enable if HMS requires Kerberos authentication
# # Prerequisite: Run 'kinit -kt /path/to/keytab user@REALM' before ingestion
# use_kerberos: true
#
# # Kerberos service principal name (typically 'hive')
# # Check your HMS principal: klist -k /etc/hive/hive.keytab
# kerberos_service_name: hive
#
# # Override hostname for Kerberos principal (use with load balancers)
# # Set this if connecting via LB but Kerberos principal uses actual hostname
# # kerberos_hostname_override: hms-master.company.com
#
# # =========================================================================
# # Connection Tuning (Optional)
# # =========================================================================
# # timeout_seconds: 60 # Connection timeout (default: 60)
# # max_retries: 3 # Retry attempts for transient failures (default: 3)
#
# # =========================================================================
# # HMS 3.x Catalog Support (Optional)
# # =========================================================================
# # For HMS 3.x with multi-catalog support (e.g., Spark catalog)
# # catalog_name: spark_catalog
# # include_catalog_name_in_ids: true # Include catalog in dataset URNs
#
# # =========================================================================
# # Filtering (Pattern-based only - WHERE clauses NOT supported)
# # =========================================================================
# database_pattern:
# allow:
# - "^prod_.*" # Allow databases starting with 'prod_'
# - "^analytics$" # Allow exact match 'analytics'
# deny:
# - "^test_.*" # Deny databases starting with 'test_'
# - ".*_staging$" # Deny databases ending with '_staging'
#
# table_pattern:
# allow:
# - ".*" # Allow all tables by default
# deny:
# - "^tmp_.*" # Deny temporary tables
#
# # =========================================================================
# # Storage Lineage (Optional - works same as SQL mode)
# # =========================================================================
# emit_storage_lineage: true
# hive_storage_lineage_direction: upstream # or 'downstream'
# include_column_lineage: true
# # storage_platform_instance: "prod-hdfs"
#
# # =========================================================================
# # Platform Instance (Optional - for multi-cluster environments)
# # =========================================================================
# # platform_instance: "prod-hive"
#
# # =========================================================================
# # Stateful Ingestion (Optional - for incremental updates)
# # =========================================================================
# # stateful_ingestion:
# # enabled: true
# # remove_stale_metadata: true
#
# sink:
# type: datahub-rest
# config:
# server: http://localhost:8080
# =============================================================================
# Thrift Mode - Minimal Example (No Kerberos)
# =============================================================================
# source:
# type: hive-metastore
# config:
# connection_type: thrift
# host_port: hms.company.com:9083
# use_kerberos: false
#
# sink:
# type: datahub-rest
# config:
# server: http://localhost:8080
# =============================================================================
# Thrift Mode - Kerberos with Load Balancer
# =============================================================================
# source:
# type: hive-metastore
# config:
# connection_type: thrift
# host_port: hms-lb.company.com:9083 # Load balancer address
# use_kerberos: true
# kerberos_service_name: hive
# kerberos_hostname_override: hms-master.company.com # Actual HMS hostname
#
# sink:
# type: datahub-rest
# config:
# server: http://localhost:8080
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
catalog_name One of string, null | Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'. Default: None |
connection_type Enum | One of: "sql", "thrift" |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
enable_properties_merge boolean | Merge properties with existing server data instead of overwriting. Default: True |
hive_storage_lineage_direction Enum | One of: "upstream", "downstream" |
host_port string | Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083). Default: localhost:3306 |
include_catalog_name_in_ids boolean | Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD) Default: False |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Extract lineage from Hive views by parsing view definitions. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
ingestion_job_id string | Default: |
kerberos_hostname_override One of string, null | Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'. Default: None |
kerberos_service_name string | Kerberos service name for the HMS principal. Only for connection_type='thrift'. Default: hive |
metastore_db_name One of string, null | Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering. Default: None |
mode Enum | One of: "hive", "presto", "presto-on-hive", "trino" |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schemas_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
simplify_nested_field_paths boolean | Simplify v2 field paths to v1. Falls back to v2 for Union/Array types. Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
tables_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
timeout_seconds integer | Connection timeout in seconds. Only for connection_type='thrift'. Default: 60 |
use_catalog_subtype boolean | Use 'Catalog' (True) or 'Database' (False) as container subtype. Default: True |
use_dataset_pascalcase_subtype boolean | Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype. Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
use_kerberos boolean | Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'. Default: False |
username One of string, null | username Default: None |
views_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.key.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.key.allow.string string | |
domain.key.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.key.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.key.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifer type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling.Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Configuration for stateful ingestion and stale entity removal. Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"HiveMetastoreConfigMode": {
"description": "Mode for metadata extraction.",
"enum": [
"hive",
"presto",
"presto-on-hive",
"trino"
],
"title": "HiveMetastoreConfigMode",
"type": "string"
},
"HiveMetastoreConnectionType": {
"description": "Connection type for HiveMetastoreSource.",
"enum": [
"sql",
"thrift"
],
"title": "HiveMetastoreConnectionType",
"type": "string"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Hive Metastore source.\n\nSupports two connection types:\n- sql: Direct database access (MySQL/PostgreSQL) to HMS backend\n- thrift: HMS Thrift API with Kerberos support",
"properties": {
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Configuration for stateful ingestion and stale entity removal."
},
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Extract lineage from Hive views by parsing view definitions.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"default": "localhost:3306",
"description": "Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083).",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"connection_type": {
"$ref": "#/$defs/HiveMetastoreConnectionType",
"default": "sql",
"description": "Connection method: 'sql' for direct database access (MySQL/PostgreSQL), 'thrift' for HMS Thrift API with optional Kerberos support."
},
"views_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Views Where Clause Suffix",
"type": "string"
},
"tables_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Tables Where Clause Suffix",
"type": "string"
},
"schemas_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Schemas Where Clause Suffix",
"type": "string"
},
"metastore_db_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering.",
"title": "Metastore Db Name"
},
"use_kerberos": {
"default": false,
"description": "Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'.",
"title": "Use Kerberos",
"type": "boolean"
},
"kerberos_service_name": {
"default": "hive",
"description": "Kerberos service name for the HMS principal. Only for connection_type='thrift'.",
"title": "Kerberos Service Name",
"type": "string"
},
"kerberos_hostname_override": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'.",
"title": "Kerberos Hostname Override"
},
"timeout_seconds": {
"default": 60,
"description": "Connection timeout in seconds. Only for connection_type='thrift'.",
"title": "Timeout Seconds",
"type": "integer"
},
"catalog_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'.",
"title": "Catalog Name"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for databases to filter."
},
"mode": {
"$ref": "#/$defs/HiveMetastoreConfigMode",
"default": "hive",
"description": "Platform mode for metadata. Valid options: ['hive', 'presto', 'presto-on-hive', 'trino']"
},
"use_catalog_subtype": {
"default": true,
"description": "Use 'Catalog' (True) or 'Database' (False) as container subtype.",
"title": "Use Catalog Subtype",
"type": "boolean"
},
"use_dataset_pascalcase_subtype": {
"default": false,
"description": "Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype.",
"title": "Use Dataset Pascalcase Subtype",
"type": "boolean"
},
"include_catalog_name_in_ids": {
"default": false,
"description": "Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD)",
"title": "Include Catalog Name In Ids",
"type": "boolean"
},
"enable_properties_merge": {
"default": true,
"description": "Merge properties with existing server data instead of overwriting.",
"title": "Enable Properties Merge",
"type": "boolean"
},
"simplify_nested_field_paths": {
"default": false,
"description": "Simplify v2 field paths to v1. Falls back to v2 for Union/Array types.",
"title": "Simplify Nested Field Paths",
"type": "boolean"
},
"ingestion_job_id": {
"default": "",
"title": "Ingestion Job Id",
"type": "string"
}
},
"title": "HiveMetastore",
"type": "object"
}
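For orientation, here is a minimal recipe sketch exercising a few of the options defined in the schema above: schema/table filtering plus stateful ingestion. It is a sketch only, assuming the source type matches the module name (hive-metastore); the host, credentials, pipeline name, and sink address are placeholders.
source:
  type: hive-metastore                # assumed to match the module name above
  config:
    connection_type: sql              # default: direct connection to the HMS backend database
    host_port: localhost:3306         # placeholder MySQL backend; sqlalchemy_uri can be used instead
    username: datahub_user            # placeholder credentials
    password: secure_password
    metastore_db_name: metastore
    schema_pattern:
      allow:
        - "analytics"                 # only ingest the analytics schema
    table_pattern:
      deny:
        - '.*\.tmp_.*'                # skip temporary tables (database.schema.table regex)
    stateful_ingestion:
      enabled: true                   # soft-delete entities missing since the last successful run

pipeline_name: hive_metastore_prod    # stateful ingestion needs a stable pipeline_name

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080     # placeholder DataHub GMS endpoint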
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource - Browse on GitHub
Module presto-on-hive
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Catalog, Schema. |
| Classification | ❌ | Not Supported. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Data Profiling | ❌ | Not Supported. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
Extracts metadata from Hive Metastore.
Supports two connection methods selected via connection_type:
- sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
- thrift: Connection to HMS Thrift API with Kerberos support
Features:
- Table and view metadata extraction
- Schema field types including complex types (struct, map, array)
- Storage lineage to S3, HDFS, Azure, GCS
- View lineage via SQL parsing
- Stateful ingestion for stale entity removal
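The sql method is the default and only needs database credentials (host_port, username, password). For the thrift method, the sketch below shows the relevant keys; it assumes the source type matches the module name (presto-on-hive) and uses a placeholder host.
source:
  type: presto-on-hive               # assumed to match the module name
  config:
    connection_type: thrift          # talk to the HMS Thrift API instead of the backend database
    host_port: hms.example.com:9083  # placeholder Thrift endpoint
    use_kerberos: true               # enable Kerberos/SASL authentication
    kerberos_service_name: hive      # default HMS service principal name
    timeout_seconds: 60
    mode: presto-on-hive             # platform mode for the emitted metadata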
CLI based Ingestion
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
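For example, the row stateful_ingestion.enabled in the table below corresponds to this nesting in a recipe (shown here under the standard source.config block):
source:
  config:
    stateful_ingestion:
      enabled: true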
| Field | Description |
|---|---|
catalog_name One of string, null | Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'. Default: None |
connection_type Enum | One of: "sql", "thrift" |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
enable_properties_merge boolean | Merge properties with existing server data instead of overwriting. Default: True |
hive_storage_lineage_direction Enum | One of: "upstream", "downstream" |
host_port string | Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083). Default: localhost:3306 |
include_catalog_name_in_ids boolean | Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD) Default: False |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Extract lineage from Hive views by parsing view definitions. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
ingestion_job_id string | Default: |
kerberos_hostname_override One of string, null | Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'. Default: None |
kerberos_service_name string | Kerberos service name for the HMS principal. Only for connection_type='thrift'. Default: hive |
metastore_db_name One of string, null | Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering. Default: None |
mode Enum | One of: "hive", "presto", "presto-on-hive", "trino" |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schemas_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
simplify_nested_field_paths boolean | Simplify v2 field paths to v1. Falls back to v2 for Union/Array types. Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
tables_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
timeout_seconds integer | Connection timeout in seconds. Only for connection_type='thrift'. Default: 60 |
use_catalog_subtype boolean | Use 'Catalog' (True) or 'Database' (False) as container subtype. Default: True |
use_dataset_pascalcase_subtype boolean | Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype. Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
use_kerberos boolean | Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'. Default: False |
username One of string, null | username Default: None |
views_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.`key`.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.`key`.allow.string string | |
domain.`key`.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.`key`.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.`key`.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedence. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifier type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift support this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile a table only if it has been updated within this many days. If set to null, there is no last-modified-time constraint on which tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling. Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Configuration for stateful ingestion and stale entity removal. Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents a large number of soft deletes and blocks the state from committing (e.g. after an accidental source configuration change) when the relative change in entities compared to the previous state exceeds the 'fail_safe_threshold' percentage. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
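Several of the lineage options above work together: emit_storage_lineage is off by default, and include_column_lineage only takes effect once storage lineage is enabled. A minimal config fragment (field names taken from the table above; the source type, hosts, and platform instance are placeholders) might look like:
source:
  type: presto-on-hive
  config:
    connection_type: sql
    host_port: metastore-db.example.com:3306    # placeholder backend database
    emit_storage_lineage: true                  # link Hive tables to S3/HDFS/Azure/GCS locations
    hive_storage_lineage_direction: upstream    # storage feeds the Hive tables
    include_column_lineage: true                # column-level lineage to storage fields
    storage_platform_instance: my-s3-instance   # placeholder storage platform instance
With hive_storage_lineage_direction set to downstream instead, the same storage locations would be emitted as downstream of the Hive tables.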
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"HiveMetastoreConfigMode": {
"description": "Mode for metadata extraction.",
"enum": [
"hive",
"presto",
"presto-on-hive",
"trino"
],
"title": "HiveMetastoreConfigMode",
"type": "string"
},
"HiveMetastoreConnectionType": {
"description": "Connection type for HiveMetastoreSource.",
"enum": [
"sql",
"thrift"
],
"title": "HiveMetastoreConnectionType",
"type": "string"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Hive Metastore source.\n\nSupports two connection types:\n- sql: Direct database access (MySQL/PostgreSQL) to HMS backend\n- thrift: HMS Thrift API with Kerberos support",
"properties": {
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Configuration for stateful ingestion and stale entity removal."
},
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Extract lineage from Hive views by parsing view definitions.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"default": "localhost:3306",
"description": "Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083).",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"connection_type": {
"$ref": "#/$defs/HiveMetastoreConnectionType",
"default": "sql",
"description": "Connection method: 'sql' for direct database access (MySQL/PostgreSQL), 'thrift' for HMS Thrift API with optional Kerberos support."
},
"views_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Views Where Clause Suffix",
"type": "string"
},
"tables_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Tables Where Clause Suffix",
"type": "string"
},
"schemas_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Schemas Where Clause Suffix",
"type": "string"
},
"metastore_db_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering.",
"title": "Metastore Db Name"
},
"use_kerberos": {
"default": false,
"description": "Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'.",
"title": "Use Kerberos",
"type": "boolean"
},
"kerberos_service_name": {
"default": "hive",
"description": "Kerberos service name for the HMS principal. Only for connection_type='thrift'.",
"title": "Kerberos Service Name",
"type": "string"
},
"kerberos_hostname_override": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'.",
"title": "Kerberos Hostname Override"
},
"timeout_seconds": {
"default": 60,
"description": "Connection timeout in seconds. Only for connection_type='thrift'.",
"title": "Timeout Seconds",
"type": "integer"
},
"catalog_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'.",
"title": "Catalog Name"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for databases to filter."
},
"mode": {
"$ref": "#/$defs/HiveMetastoreConfigMode",
"default": "hive",
"description": "Platform mode for metadata. Valid options: ['hive', 'presto', 'presto-on-hive', 'trino']"
},
"use_catalog_subtype": {
"default": true,
"description": "Use 'Catalog' (True) or 'Database' (False) as container subtype.",
"title": "Use Catalog Subtype",
"type": "boolean"
},
"use_dataset_pascalcase_subtype": {
"default": false,
"description": "Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype.",
"title": "Use Dataset Pascalcase Subtype",
"type": "boolean"
},
"include_catalog_name_in_ids": {
"default": false,
"description": "Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD)",
"title": "Include Catalog Name In Ids",
"type": "boolean"
},
"enable_properties_merge": {
"default": true,
"description": "Merge properties with existing server data instead of overwriting.",
"title": "Enable Properties Merge",
"type": "boolean"
},
"simplify_nested_field_paths": {
"default": false,
"description": "Simplify v2 field paths to v1. Falls back to v2 for Union/Array types.",
"title": "Simplify Nested Field Paths",
"type": "boolean"
},
"ingestion_job_id": {
"default": "",
"title": "Ingestion Job Id",
"type": "string"
}
},
"title": "HiveMetastore",
"type": "object"
}
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource - Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Hive Metastore, feel free to ping us on our Slack.