Hive Metastore

There are two source modules that provide integration with Hive Metastore:

hive-metastore

Extracts metadata from Hive Metastore.

Supports two connection methods selected via connection_type:

  • sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
  • thrift: Connection to HMS Thrift API with Kerberos support

Features:

  • Table and view metadata extraction
  • Schema field types including complex types (struct, map, array)
  • Storage lineage to S3, HDFS, Azure, GCS
  • View lineage via SQL parsing
  • Stateful ingestion for stale entity removal Read more...

presto-on-hive

Extracts metadata from Hive Metastore.

Supports two connection methods selected via connection_type:

  • sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
  • thrift: Connection to HMS Thrift API with Kerberos support

Features:

  • Table and view metadata extraction
  • Schema field types including complex types (struct, map, array)
  • Storage lineage to S3, HDFS, Azure, GCS
  • View lineage via SQL parsing
  • Stateful ingestion for stale entity removal Read more...

Module hive-metastore

Certified

Important Capabilities

  • Asset Containers: Enabled by default. Supported for types: Catalog, Schema.
  • Classification: Not supported.
  • Column-level Lineage: Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types: Table, View.
  • Data Profiling: Not supported.
  • Descriptions: Enabled by default.
  • Detect Deleted Entities: Enabled by default via stateful ingestion.
  • Domains: Enabled by default.
  • Schema Metadata: Enabled by default.
  • Table-Level Lineage: Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types: Table, View.
  • Test Connection: Enabled by default.

Extracts metadata from Hive Metastore.

Supports two connection methods selected via connection_type:

  • sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
  • thrift: Connection to HMS Thrift API with Kerberos support

Features:

  • Table and view metadata extraction
  • Schema field types including complex types (struct, map, array)
  • Storage lineage to S3, HDFS, Azure, GCS
  • View lineage via SQL parsing
  • Stateful ingestion for stale entity removal

Prerequisites

The Hive Metastore connector supports two connection modes:

  1. SQL Mode (Default): Connects directly to the Hive metastore database (MySQL, PostgreSQL, etc.)
  2. Thrift Mode: Connects to Hive Metastore via the Thrift API (port 9083), with Kerberos support

Choose your connection mode based on your environment:

  • Use when: SQL mode when direct database access is available; Thrift mode when only the HMS Thrift API is accessible
  • Authentication: SQL mode uses database credentials; Thrift mode uses Kerberos/SASL or unauthenticated access
  • Port: SQL mode connects to the database port (3306/5432); Thrift mode connects to the Thrift port (9083)
  • Dependencies: SQL mode requires the appropriate database driver; Thrift mode requires pymetastore and thrift-sasl
  • Filtering: Both modes use pattern-based filtering (database_pattern/table_pattern); the legacy SQL WHERE clause options are deprecated
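
The mode is selected with the connection_type field; the rest of the recipe keeps the same overall shape. A minimal sketch of each variant (hostnames are placeholders; complete recipes appear in the sections below):

# SQL mode (default) - connect to the metastore database
source:
  type: hive-metastore
  config:
    connection_type: sql
    host_port: metastore-db.company.com:5432
    database: metastore

# Thrift mode - connect to the HMS Thrift API
source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083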

Before configuring the DataHub connector, ensure you have:

  1. Database Access: Direct read access to the Hive metastore database (typically MySQL or PostgreSQL).

  2. Network Access: The machine running DataHub ingestion must be able to reach your metastore database on the configured port.

  3. Database Driver: Install the appropriate Python database driver:

    # For PostgreSQL metastore
    pip install 'acryl-datahub[hive]' psycopg2-binary

    # For MySQL metastore
    pip install 'acryl-datahub[hive]' PyMySQL
  4. Metastore Schema Knowledge: Familiarity with your metastore database schema (typically public for PostgreSQL, or the database name for MySQL).

Required Database Permissions

The database user account used by DataHub needs read-only access to the Hive metastore tables.

PostgreSQL Metastore

-- Create a dedicated read-only user for DataHub
CREATE USER datahub_user WITH PASSWORD 'secure_password';

-- Grant connection privileges
GRANT CONNECT ON DATABASE metastore TO datahub_user;

-- Grant schema usage
GRANT USAGE ON SCHEMA public TO datahub_user;

-- Grant SELECT on metastore tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;

-- Grant SELECT on future tables (for metastore upgrades)
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;

MySQL Metastore

-- Create a dedicated read-only user for DataHub
CREATE USER 'datahub_user'@'%' IDENTIFIED BY 'secure_password';

-- Grant SELECT privileges on metastore database
GRANT SELECT ON metastore.* TO 'datahub_user'@'%';

-- Apply changes
FLUSH PRIVILEGES;

Required Metastore Tables

DataHub queries the following metastore tables:

  • DBS: Database/schema information
  • TBLS: Table metadata
  • TABLE_PARAMS: Table properties (including view definitions)
  • SDS: Storage descriptors (location, format)
  • COLUMNS_V2: Column metadata
  • PARTITION_KEYS: Partition information
  • SERDES: Serialization/deserialization information

Recommendation: Grant SELECT on all metastore tables to ensure compatibility with different Hive versions and for future DataHub enhancements.

Authentication

PostgreSQL

Standard Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

SSL Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      connect_args:
        sslmode: require
        sslrootcert: /path/to/ca-cert.pem

MySQL

Standard Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:3306
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "mysql+pymysql" # Default if not specified

SSL Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:3306
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "mysql+pymysql"
    options:
      connect_args:
        ssl:
          ca: /path/to/ca-cert.pem
          cert: /path/to/client-cert.pem
          key: /path/to/client-key.pem

Amazon RDS (PostgreSQL or MySQL)

For AWS RDS-hosted metastore databases:

source:
  type: hive-metastore
  config:
    host_port: metastore.abc123.us-east-1.rds.amazonaws.com:5432
    database: metastore
    username: datahub_user
    password: ${RDS_PASSWORD}
    scheme: "postgresql+psycopg2" # or 'mysql+pymysql'
    options:
      connect_args:
        sslmode: require # RDS requires SSL

Azure Database for PostgreSQL/MySQL

source:
  type: hive-metastore
  config:
    host_port: metastore-server.postgres.database.azure.com:5432
    database: metastore
    username: datahub_user@metastore-server # Note: Azure requires @server-name suffix
    password: ${AZURE_DB_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      connect_args:
        sslmode: require

Thrift Connection Mode

Use connection_type: thrift when you cannot access the metastore database directly but have access to the HMS Thrift API (typically port 9083). This is common in:

  • Kerberized Hadoop clusters where database access is restricted
  • Cloud-managed Hive services that only expose the Thrift API
  • Environments with strict network segmentation

Thrift Mode Prerequisites

Before using Thrift mode, ensure:

  1. Network Access: The machine running DataHub ingestion can reach HMS on port 9083
  2. HMS Service Running: The Hive Metastore service is running and accepting Thrift connections
  3. For Kerberos: A valid Kerberos ticket is available (see Kerberos section below)

Verify connectivity:

# Test network connectivity to HMS
telnet hms.company.com 9083

# For Kerberos environments, verify ticket
klist

Thrift Mode Dependencies

# Install with Thrift support
pip install 'acryl-datahub[hive-metastore]'

# For Kerberos authentication, also install:
pip install thrift-sasl pyhive[hive-pure-sasl]

Thrift Configuration Options

  • connection_type (string, default: sql; must be set to thrift for Thrift mode): Selects the connection method
  • host_port (string, required): HMS host and port (e.g., hms.company.com:9083)
  • use_kerberos (boolean, default: false): Enable Kerberos/SASL authentication
  • kerberos_service_name (string, default: hive): Kerberos service principal name
  • kerberos_hostname_override (string, optional): Override hostname for the Kerberos principal (for load balancers)
  • timeout_seconds (int, default: 60): Connection timeout in seconds
  • max_retries (int, default: 3): Maximum retry attempts for transient failures
  • catalog_name (string, optional): HMS 3.x catalog name (e.g., spark_catalog)
  • include_catalog_name_in_ids (boolean, default: false): Include catalog in dataset URNs
  • database_pattern (AllowDeny, optional): Filter databases by regex pattern
  • table_pattern (AllowDeny, optional): Filter tables by regex pattern

Note: SQL WHERE clause options (tables_where_clause_suffix, views_where_clause_suffix, schemas_where_clause_suffix) have been deprecated for security reasons (SQL injection risk) and are no longer supported. Use database_pattern and table_pattern instead.
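
If an existing recipe still uses one of the deprecated WHERE clause options, the same intent can usually be expressed with a regex pattern inside the config block. A before/after sketch (the WHERE clause value is illustrative only):

# Before (deprecated, no longer applied):
# tables_where_clause_suffix: " AND t.tbl_name NOT LIKE 'tmp_%'"

# After (pattern-based):
table_pattern:
  deny:
    - "^tmp_.*"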

Basic Thrift Configuration

source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083

Thrift with Kerberos Authentication

Ensure you have a valid Kerberos ticket (kinit -kt /path/to/keytab user@REALM) before running ingestion:

source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083
    use_kerberos: true
    kerberos_service_name: hive # Change if HMS uses a different principal
    # kerberos_hostname_override: hms-internal.company.com # If using a load balancer
    # catalog_name: spark_catalog # For HMS 3.x multi-catalog
    database_pattern: # Pattern filtering (WHERE clauses NOT supported)
      allow:
        - "^prod_.*"

Thrift Mode Limitations

  • No Presto/Trino view lineage: View SQL parsing requires SQL mode
  • No WHERE clause filtering: Use database_pattern/table_pattern instead
  • Kerberos ticket required: Must have valid ticket before running (not embedded in config)
  • HMS version compatibility: Tested with HMS 2.x and 3.x

Storage Lineage

The Hive Metastore connector supports the same storage lineage features as the Hive connector, with enhanced performance due to direct database access.

Quick Start

Enable storage lineage with minimal configuration:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    # Enable storage lineage
    emit_storage_lineage: true

Configuration Options

Storage lineage is controlled by the same parameters as the Hive connector:

  • emit_storage_lineage (boolean, default: false): Master toggle to enable/disable storage lineage
  • hive_storage_lineage_direction (string, default: "upstream"): Direction: "upstream" (storage → Hive) or "downstream" (Hive → storage)
  • include_column_lineage (boolean, default: true): Enable column-level lineage from storage paths to Hive columns
  • storage_platform_instance (string, default: None): Platform instance for storage URNs (e.g., "prod-s3", "dev-hdfs")
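
A sketch showing how the four options fit together (the connection details and instance name are placeholders); it emits upstream lineage from storage to Hive, including column-level mapping, under a dedicated storage platform instance:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    emit_storage_lineage: true
    hive_storage_lineage_direction: upstream  # storage -> Hive
    include_column_lineage: true              # map storage fields to Hive columns
    storage_platform_instance: "prod-s3"      # instance used in storage URNs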

Supported Storage Platforms

All storage platforms supported by the Hive connector are also supported here:

  • Amazon S3 (s3://, s3a://, s3n://)
  • HDFS (hdfs://)
  • Google Cloud Storage (gs://)
  • Azure Blob Storage (wasb://, wasbs://)
  • Azure Data Lake (adl://, abfs://, abfss://)
  • Databricks File System (dbfs://)
  • Local File System (file://)

See the sections above for complete configuration details.

Presto and Trino View Support

A key advantage of the Hive Metastore connector is its ability to extract metadata from Presto and Trino views that are stored in the metastore.

How It Works

  1. View Detection: The connector identifies views by checking the TABLE_PARAMS table for Presto/Trino view definitions.

  2. View Parsing: Presto/Trino view JSON is parsed to extract:

    • Original SQL text
    • Referenced tables
    • Column metadata and types
  3. Lineage Extraction: SQL is parsed using sqlglot to create table-to-view lineage.

  4. Storage Lineage Integration: If storage lineage is enabled, the connector also creates lineage from storage → tables → views.

Configuration

Presto/Trino view support is automatically enabled when ingesting from a metastore that contains Presto/Trino views. No additional configuration is required.

Example

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    # Enable storage lineage for complete lineage chain
    emit_storage_lineage: true

This configuration will create complete lineage:

S3 Bucket → Hive Table → Presto View

Limitations

  • Presto/Trino Version: The connector supports Presto 0.200+ and Trino view formats
  • Complex SQL: Very complex SQL with non-standard syntax may have incomplete lineage
  • Cross-Database References: Lineage is extracted for references within the same Hive metastore

Schema Filtering

For large metastore deployments with many databases, use filtering to limit ingestion scope:

Database Filtering

source:
  type: hive-metastore
  config:
    # ... connection config ...

    # Only ingest from specific databases
    schema_pattern:
      allow:
        - "^production_.*" # All databases starting with production_
        - "analytics"      # Specific database
      deny:
        - ".*_test$"       # Exclude test databases

Pattern-Based Database Filtering

For filtering by database name, use pattern-based filtering:

source:
  type: hive-metastore
  config:
    # ... connection config ...

    # Filter to specific databases using regex patterns
    database_pattern:
      allow:
        - "^production_db$"
        - "^analytics_db$"
      deny:
        - "^test_.*"
        - ".*_staging$"

Note: The deprecated *_where_clause_suffix options have been removed for security reasons. Use database_pattern and table_pattern for filtering.

Performance Considerations

Advantages Over HiveServer2 Connector

The Hive Metastore connector is significantly faster than the Hive connector because:

  1. Direct Database Access: No HiveServer2 overhead
  2. Batch Queries: Fetches all metadata in optimized SQL queries
  3. No Query Execution: Doesn't run Hive queries to extract metadata
  4. Parallel Processing: Can process multiple databases concurrently

Performance Comparison (approximate):

  • 10 databases, 1000 tables: ~2 minutes (Metastore) vs ~15 minutes (HiveServer2)
  • 100 databases, 10,000 tables: ~15 minutes (Metastore) vs ~2 hours (HiveServer2)

Optimization Tips

  1. Database Connection Pooling: The connector uses SQLAlchemy's default connection pooling. For very large deployments, consider tuning pool size:

    options:
      pool_size: 10
      max_overflow: 20
  2. Schema Filtering: Use schema_pattern to limit scope and reduce query time.

  3. Stateful Ingestion: Enable to only process changes:

    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
  4. Disable Column Lineage: If not needed:

    emit_storage_lineage: true
    include_column_lineage: false # Faster

Network Considerations

  • Latency: Low latency to the metastore database is important
  • Bandwidth: Minimal bandwidth required (only metadata, no data transfer)
  • Connection Limits: Ensure metastore database can handle additional read connections

Platform Instances

When ingesting from multiple metastores (e.g., different clusters or environments), use platform_instance:

source:
  type: hive-metastore
  config:
    host_port: prod-metastore-db.company.com:5432
    database: metastore
    platform_instance: "prod-hive"

Best Practice: Combine with storage_platform_instance:

source:
  type: hive-metastore
  config:
    platform_instance: "prod-hive"         # Hive tables
    storage_platform_instance: "prod-hdfs" # Storage locations
    emit_storage_lineage: true

Caveats and Limitations

Metastore Schema Compatibility

  • Hive Versions: Tested with Hive 1.x, 2.x, and 3.x metastore schemas
  • Schema Variations: Different Hive versions may have slightly different metastore schemas
  • Custom Tables: If your organization has added custom metastore tables, they won't be processed

Database Support

  • Supported: PostgreSQL, MySQL, MariaDB
  • Untested: Oracle, MSSQL (may work but are not officially supported)
  • Derby: Not recommended (embedded metastore, typically single-user)

View Lineage Parsing

  • Simple SQL: Fully supported with accurate lineage
  • Complex SQL: Best-effort parsing; some edge cases may have incomplete lineage
  • Non-standard SQL: Presto/Trino-specific functions may not be fully parsed

Permissions Limitations

  • Read-Only: The connector only needs SELECT permissions
  • No Write Operations: Never requires INSERT, UPDATE, or DELETE
  • Metastore Locks: Read operations don't acquire metastore locks

Storage Lineage Limitations

Same as the Hive connector:

  • Only tables with defined storage locations have lineage
  • Temporary tables are not supported
  • Partition-level lineage is aggregated at table level

Known Issues

  1. Large Column Lists: Tables with 500+ columns may be slow to process due to metastore query complexity.

  2. View Definition Encoding: Some older Hive versions store view definitions in non-UTF-8 encoding, which may cause parsing issues.

  3. Case Sensitivity:

    • PostgreSQL metastore: Case-sensitive identifiers (use "quoted" names in WHERE clauses)
    • MySQL metastore: Case-insensitive by default
    • DataHub automatically lowercases URNs for consistency
  4. Concurrent Metastore Writes: If the metastore is being actively modified during ingestion, some metadata may be inconsistent.

Troubleshooting

Connection Issues

Problem: Could not connect to metastore database

Solutions:

  • Verify host_port, database, and scheme are correct
  • Check network connectivity: telnet <host> <port>
  • Verify firewall rules allow connections
  • For PostgreSQL: Check pg_hba.conf allows connections from your IP
  • For MySQL: Check bind-address in my.cnf

Authentication Failures

Problem: Authentication failed or Access denied

Solutions:

  • Verify username and password are correct
  • Check user has CONNECT/LOGIN privileges
  • For Azure: Ensure username includes @server-name suffix
  • Review database logs for detailed error messages

Missing Tables

Problem: Not all tables appear in DataHub

Solutions:

  • Verify database user has SELECT on all metastore tables
  • Check whether tables are filtered out by schema_pattern, database_pattern, or table_pattern
  • Query metastore directly to verify tables exist:
    SELECT d.name as db_name, t.tbl_name as table_name, t.tbl_type
    FROM TBLS t
    JOIN DBS d ON t.db_id = d.db_id
    WHERE d.name = 'your_database';

Presto/Trino Views Not Appearing

Problem: Views defined in Presto/Trino don't show up

Solutions:

  • Check view definitions exist in metastore:
    SELECT d.name as db_name, t.tbl_name as view_name, tp.param_value
    FROM TBLS t
    JOIN DBS d ON t.db_id = d.db_id
    JOIN TABLE_PARAMS tp ON t.tbl_id = tp.tbl_id
    WHERE t.tbl_type = 'VIRTUAL_VIEW'
    AND tp.param_key = 'presto_view'
    LIMIT 10;
  • Review ingestion logs for parsing errors
  • Verify view JSON is valid

Storage Lineage Not Appearing

Problem: No storage lineage relationships visible

Solutions:

  • Verify emit_storage_lineage: true is set
  • Check tables have storage locations in metastore:
    SELECT d.name as db_name, t.tbl_name as table_name, s.location
    FROM TBLS t
    JOIN DBS d ON t.db_id = d.db_id
    JOIN SDS s ON t.sd_id = s.sd_id
    WHERE s.location IS NOT NULL
    LIMIT 10;
  • Review logs for "Failed to parse storage location" warnings
  • See the "Storage Lineage" section above for troubleshooting tips

Slow Ingestion

Problem: Ingestion takes too long

Solutions:

  • Use schema filtering to reduce scope
  • Enable stateful ingestion to only process changes
  • Check database query performance (may need indexes on metastore tables)
  • Ensure low latency network connection to metastore database
  • Consider disabling column lineage if not needed (see the combined sketch below)
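
A sketch combining these mitigations in one recipe (connection details are placeholders):

source:
  type: hive-metastore
  config:
    # ... connection config ...

    # Limit scope to the databases you need
    schema_pattern:
      allow:
        - "^production_.*"

    # Only process changes between runs
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true

    # Keep storage lineage but skip column-level mapping
    emit_storage_lineage: true
    include_column_lineage: false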

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

# =============================================================================
# SQL Mode (Default) - Direct database connection
# =============================================================================
source:
  type: hive-metastore
  config:
    # Hive metastore DB connection
    host_port: localhost:5432
    database: metastore

    # specify the schema where metastore tables reside
    schema_pattern:
      allow:
        - "^public"

    # credentials
    username: user # optional
    password: pass # optional

    #scheme: 'postgresql+psycopg2' # set this if metastore db is using postgres
    #scheme: 'mysql+pymysql' # set this if metastore db is using mysql, default if unset

    # Filter databases using pattern-based filtering
    #database_pattern:
    #  allow:
    #    - "^db1$"
    #  deny:
    #    - "^test_.*"

    # Storage Lineage Configuration (Optional)
    # Enables lineage between Hive tables and their underlying storage locations
    #emit_storage_lineage: false # Set to true to enable storage lineage
    #hive_storage_lineage_direction: upstream # Direction: 'upstream' (storage -> Hive) or 'downstream' (Hive -> storage)
    #include_column_lineage: true # Set to false to disable column-level lineage
    #storage_platform_instance: "prod-hdfs" # Optional: platform instance for storage URNs

sink:
  # sink configs

# =============================================================================
# Thrift Mode - HMS Thrift API connection (use when database access unavailable)
# =============================================================================
# Use this mode when:
# - You cannot access the metastore database directly
# - Only the HMS Thrift API (port 9083) is accessible
# - Your environment requires Kerberos authentication
#
# Prerequisites:
# - pip install 'acryl-datahub[hive-metastore]'
# - For Kerberos: pip install thrift-sasl pyhive[hive-pure-sasl]
# - For Kerberos: Run 'kinit' before ingestion to obtain ticket
#
# source:
#   type: hive-metastore
#   config:
#     # =========================================================================
#     # Connection Settings (Required)
#     # =========================================================================
#     connection_type: thrift # Enable Thrift mode (default is 'sql')
#     host_port: hms.company.com:9083 # HMS Thrift API endpoint
#
#     # =========================================================================
#     # Authentication - Kerberos/SASL (Optional)
#     # =========================================================================
#     # Enable if HMS requires Kerberos authentication
#     # Prerequisite: Run 'kinit -kt /path/to/keytab user@REALM' before ingestion
#     use_kerberos: true
#
#     # Kerberos service principal name (typically 'hive')
#     # Check your HMS principal: klist -k /etc/hive/hive.keytab
#     kerberos_service_name: hive
#
#     # Override hostname for Kerberos principal (use with load balancers)
#     # Set this if connecting via LB but Kerberos principal uses actual hostname
#     # kerberos_hostname_override: hms-master.company.com
#
#     # =========================================================================
#     # Connection Tuning (Optional)
#     # =========================================================================
#     # timeout_seconds: 60 # Connection timeout (default: 60)
#     # max_retries: 3 # Retry attempts for transient failures (default: 3)
#
#     # =========================================================================
#     # HMS 3.x Catalog Support (Optional)
#     # =========================================================================
#     # For HMS 3.x with multi-catalog support (e.g., Spark catalog)
#     # catalog_name: spark_catalog
#     # include_catalog_name_in_ids: true # Include catalog in dataset URNs
#
#     # =========================================================================
#     # Filtering (Pattern-based only - WHERE clauses NOT supported)
#     # =========================================================================
#     database_pattern:
#       allow:
#         - "^prod_.*" # Allow databases starting with 'prod_'
#         - "^analytics$" # Allow exact match 'analytics'
#       deny:
#         - "^test_.*" # Deny databases starting with 'test_'
#         - ".*_staging$" # Deny databases ending with '_staging'
#
#     table_pattern:
#       allow:
#         - ".*" # Allow all tables by default
#       deny:
#         - "^tmp_.*" # Deny temporary tables
#
#     # =========================================================================
#     # Storage Lineage (Optional - works same as SQL mode)
#     # =========================================================================
#     emit_storage_lineage: true
#     hive_storage_lineage_direction: upstream # or 'downstream'
#     include_column_lineage: true
#     # storage_platform_instance: "prod-hdfs"
#
#     # =========================================================================
#     # Platform Instance (Optional - for multi-cluster environments)
#     # =========================================================================
#     # platform_instance: "prod-hive"
#
#     # =========================================================================
#     # Stateful Ingestion (Optional - for incremental updates)
#     # =========================================================================
#     # stateful_ingestion:
#     #   enabled: true
#     #   remove_stale_metadata: true
#
# sink:
#   type: datahub-rest
#   config:
#     server: http://localhost:8080

# =============================================================================
# Thrift Mode - Minimal Example (No Kerberos)
# =============================================================================
# source:
#   type: hive-metastore
#   config:
#     connection_type: thrift
#     host_port: hms.company.com:9083
#     use_kerberos: false
#
# sink:
#   type: datahub-rest
#   config:
#     server: http://localhost:8080

# =============================================================================
# Thrift Mode - Kerberos with Load Balancer
# =============================================================================
# source:
#   type: hive-metastore
#   config:
#     connection_type: thrift
#     host_port: hms-lb.company.com:9083 # Load balancer address
#     use_kerberos: true
#     kerberos_service_name: hive
#     kerberos_hostname_override: hms-master.company.com # Actual HMS hostname
#
# sink:
#   type: datahub-rest
#   config:
#     server: http://localhost:8080

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
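
For example, the dotted field stateful_ingestion.enabled listed below corresponds to the following nesting in a recipe:

source:
  type: hive-metastore
  config:
    stateful_ingestion:
      enabled: true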

Field Description
catalog_name
One of string, null
Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'.
Default: None
connection_type
Enum
One of: "sql", "thrift"
convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase.
Default: False
database
One of string, null
database (catalog)
Default: None
emit_storage_lineage
boolean
Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).
Default: False
enable_properties_merge
boolean
Merge properties with existing server data instead of overwriting.
Default: True
hive_storage_lineage_direction
Enum
One of: "upstream", "downstream"
host_port
string
Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083).
Default: localhost:3306
include_catalog_name_in_ids
boolean
Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD)
Default: False
include_column_lineage
boolean
When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.
Default: True
include_table_location_lineage
boolean
If the source supports it, include table lineage to the underlying storage location.
Default: True
include_tables
boolean
Whether tables should be ingested.
Default: True
include_view_column_lineage
boolean
Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled.
Default: True
include_view_lineage
boolean
Extract lineage from Hive views by parsing view definitions.
Default: True
include_views
boolean
Whether views should be ingested.
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
ingestion_job_id
string
Default:
kerberos_hostname_override
One of string, null
Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'.
Default: None
kerberos_service_name
string
Kerberos service name for the HMS principal. Only for connection_type='thrift'.
Default: hive
metastore_db_name
One of string, null
Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering.
Default: None
mode
Enum
One of: "hive", "presto", "presto-on-hive", "trino"
options
object
Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args.
password
One of string(password), null
password
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
schemas_where_clause_suffix
string
DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.
Default:
simplify_nested_field_paths
boolean
Simplify v2 field paths to v1. Falls back to v2 for Union/Array types.
Default: False
sqlalchemy_uri
One of string, null
URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.
Default: None
storage_platform_instance
One of string, null
Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.
Default: None
tables_where_clause_suffix
string
DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.
Default:
timeout_seconds
integer
Connection timeout in seconds. Only for connection_type='thrift'.
Default: 60
use_catalog_subtype
boolean
Use 'Catalog' (True) or 'Database' (False) as container subtype.
Default: True
use_dataset_pascalcase_subtype
boolean
Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype.
Default: False
use_file_backed_cache
boolean
Whether to use a file backed cache for the view definitions.
Default: True
use_kerberos
boolean
Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'.
Default: False
username
One of string, null
username
Default: None
views_where_clause_suffix
string
DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.
Default:
env
string
The environment that all assets produced by this connector belong to
Default: PROD
database_pattern
AllowDenyPattern
A class to store allow deny regexes
database_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain
map(str,AllowDenyPattern)
A class to store allow deny regexes
domain.key.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
domain.key.allow.string
string
domain.key.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain.key.deny
array
List of regex patterns to exclude from ingestion.
Default: []
domain.key.deny.string
string
profile_pattern
AllowDenyPattern
A class to store allow deny regexes
profile_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
schema_pattern
AllowDenyPattern
A class to store allow deny regexes
schema_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern
AllowDenyPattern
A class to store allow deny regexes
table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
view_pattern
AllowDenyPattern
A class to store allow deny regexes
view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification
ClassificationConfig
classification.enabled
boolean
Whether classification should be used to auto-detect glossary terms
Default: False
classification.info_type_to_term
map(str,string)
classification.max_workers
integer
Number of worker processes to use for classification. Set to 1 to disable.
Default: 4
classification.sample_size
integer
Number of sample values used for classification.
Default: 100
classification.classifiers
array
Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence.
Default: [{'type': 'datahub', 'config': None}]
classification.classifiers.DynamicTypedClassifierConfig
DynamicTypedClassifierConfig
classification.classifiers.DynamicTypedClassifierConfig.type 
string
The type of the classifier to use. For DataHub, use datahub
classification.classifiers.DynamicTypedClassifierConfig.config
One of object, null
The configuration required for initializing the classifier. If not specified, uses defaults for the classifier type.
Default: None
classification.column_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.column_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification.table_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
profiling
GEProfilingConfig
profiling.catch_exceptions
boolean
Default: True
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.field_sample_values_limit
integer
Upper limit for number of sample values to collect for all columns.
Default: 20
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
One of integer, null
Max number of documents to profile. By default, profiles all documents.
Default: None
profiling.max_number_of_fields_to_profile
One of integer, null
A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.
Default: None
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable.
Default: 20
profiling.offset
One of integer, null
Offset in documents to profile. By default, uses no offset.
Default: None
profiling.partition_datetime
One of string(date-time), null
If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only BigQuery supports this.
Default: None
profiling.partition_profiling_enabled
boolean
Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, the latest partition's data is used for profiling.
Default: True
profiling.profile_external_tables
boolean
Whether to profile external tables. Only Snowflake and Redshift support this.
Default: False
profiling.profile_if_updated_since_days
One of number, null
Profile a table only if it has been updated within the specified number of days. If set to null, there is no last-modified-time constraint on which tables to profile. Supported only in Snowflake and BigQuery.
Default: None
profiling.profile_nested_fields
boolean
Whether to profile complex types like structs, arrays and maps.
Default: False
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.profile_table_row_count_estimate_only
boolean
Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
profiling.profile_table_row_limit
One of integer, null
Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats.
Default: 5000000
profiling.profile_table_size_limit
One of integer, null
Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats.
Default: 5
profiling.query_combiner_enabled
boolean
This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
profiling.report_dropped_profiles
boolean
Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes.
Default: False
profiling.sample_size
integer
Number of rows to be sampled from the table for column-level profiling. Applicable only if use_sampling is set to True.
Default: 10000
profiling.turn_off_expensive_profiling_metrics
boolean
Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
profiling.use_sampling
boolean
Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables.
Default: True
profiling.operation_config
OperationConfig
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.
Default: False
profiling.operation_config.profile_date_of_month
One of integer, null
Number between 1 and 31 for the date of the month (both inclusive). If not specified, this field does not take effect.
Default: None
profiling.operation_config.profile_day_of_week
One of integer, null
Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, this field does not take effect.
Default: None
profiling.tags_to_ignore_sampling
One of array, null
Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling.
Default: None
profiling.tags_to_ignore_sampling.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Configuration for stateful ingestion and stale entity removal.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of soft deletes, and blocks the state from committing, when accidental changes to the source configuration cause the relative change percentage in entities (compared to the previous state) to exceed the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Code Coordinates

  • Class Name: datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource
  • Browse on GitHub

Module presto-on-hive

Certified

Important Capabilities

  • Asset Containers: Enabled by default. Supported for types: Catalog, Schema.
  • Classification: Not supported.
  • Column-level Lineage: Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types: Table, View.
  • Data Profiling: Not supported.
  • Descriptions: Enabled by default.
  • Detect Deleted Entities: Enabled by default via stateful ingestion.
  • Domains: Enabled by default.
  • Schema Metadata: Enabled by default.
  • Table-Level Lineage: Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types: Table, View.
  • Test Connection: Enabled by default.

Extracts metadata from Hive Metastore.

Supports two connection methods selected via connection_type:

  • sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
  • thrift: Connection to HMS Thrift API with Kerberos support

Features:

  • Table and view metadata extraction
  • Schema field types including complex types (struct, map, array)
  • Storage lineage to S3, HDFS, Azure, GCS
  • View lineage via SQL parsing
  • Stateful ingestion for stale entity removal

CLI based Ingestion

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field Description
catalog_name
One of string, null
Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'.
Default: None
connection_type
Enum
One of: "sql", "thrift"
convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase.
Default: False
database
One of string, null
database (catalog)
Default: None
emit_storage_lineage
boolean
Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).
Default: False
enable_properties_merge
boolean
Merge properties with existing server data instead of overwriting.
Default: True
hive_storage_lineage_direction
Enum
One of: "upstream", "downstream"
host_port
string
Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083).
Default: localhost:3306
include_catalog_name_in_ids
boolean
Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD)
Default: False
include_column_lineage
boolean
When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.
Default: True
include_table_location_lineage
boolean
If the source supports it, include table lineage to the underlying storage location.
Default: True
include_tables
boolean
Whether tables should be ingested.
Default: True
include_view_column_lineage
boolean
Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled.
Default: True
include_view_lineage
boolean
Extract lineage from Hive views by parsing view definitions.
Default: True
include_views
boolean
Whether views should be ingested.
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
ingestion_job_id
string
Default:
kerberos_hostname_override
One of string, null
Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'.
Default: None
kerberos_service_name
string
Kerberos service name for the HMS principal. Only for connection_type='thrift'.
Default: hive
metastore_db_name
One of string, null
Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering.
Default: None
mode
Enum
One of: "hive", "presto", "presto-on-hive", "trino"
options
object
Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args.
password
One of string(password), null
password
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
schemas_where_clause_suffix
string
DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.
Default:
simplify_nested_field_paths
boolean
Simplify v2 field paths to v1. Falls back to v2 for Union/Array types.
Default: False
sqlalchemy_uri
One of string, null
URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.
Default: None
storage_platform_instance
One of string, null
Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.
Default: None
tables_where_clause_suffix
string
DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.
Default:
timeout_seconds
integer
Connection timeout in seconds. Only for connection_type='thrift'.
Default: 60
use_catalog_subtype
boolean
Use 'Catalog' (True) or 'Database' (False) as container subtype.
Default: True
use_dataset_pascalcase_subtype
boolean
Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype.
Default: False
use_file_backed_cache
boolean
Whether to use a file backed cache for the view definitions.
Default: True
use_kerberos
boolean
Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'.
Default: False
username
One of string, null
username
Default: None
views_where_clause_suffix
string
DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.
Default:
env
string
The environment that all assets produced by this connector belong to
Default: PROD
database_pattern
AllowDenyPattern
A class to store allow deny regexes
database_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain
map(str,AllowDenyPattern)
A class to store allow deny regexes
domain.key.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
domain.key.allow.string
string
domain.key.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain.key.deny
array
List of regex patterns to exclude from ingestion.
Default: []
domain.key.deny.string
string
profile_pattern
AllowDenyPattern
A class to store allow deny regexes
profile_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
schema_pattern
AllowDenyPattern
A class to store allow deny regexes
schema_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern
AllowDenyPattern
A class to store allow deny regexes
table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
view_pattern
AllowDenyPattern
A class to store allow deny regexes
view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification
ClassificationConfig
classification.enabled
boolean
Whether classification should be used to auto-detect glossary terms
Default: False
classification.info_type_to_term
map(str,string)
classification.max_workers
integer
Number of worker processes to use for classification. Set to 1 to disable.
Default: 4
classification.sample_size
integer
Number of sample values used for classification.
Default: 100
classification.classifiers
array
Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence.
Default: [{'type': 'datahub', 'config': None}]
classification.classifiers.DynamicTypedClassifierConfig
DynamicTypedClassifierConfig
classification.classifiers.DynamicTypedClassifierConfig.type 
string
The type of the classifier to use. For DataHub, use datahub
classification.classifiers.DynamicTypedClassifierConfig.config
One of object, null
The configuration required for initializing the classifier. If not specified, uses defaults for the classifier type.
Default: None
classification.column_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.column_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification.table_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
profiling
GEProfilingConfig
profiling.catch_exceptions
boolean
Default: True
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.field_sample_values_limit
integer
Upper limit for number of sample values to collect for all columns.
Default: 20
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
One of integer, null
Max number of documents to profile. By default, profiles all documents.
Default: None
profiling.max_number_of_fields_to_profile
One of integer, null
A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.
Default: None
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable.
Default: 20
profiling.offset
One of integer, null
Offset in documents to profile. By default, uses no offset.
Default: None
profiling.partition_datetime
One of string(date-time), null
If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only BigQuery supports this.
Default: None
profiling.partition_profiling_enabled
boolean
Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, the latest partition's data is used for profiling.
Default: True
profiling.profile_external_tables
boolean
Whether to profile external tables. Only Snowflake and Redshift support this.
Default: False
profiling.profile_if_updated_since_days
One of number, null
Profile a table only if it has been updated within the specified number of days. If set to null, there is no last-modified-time constraint on which tables to profile. Supported only in Snowflake and BigQuery.
Default: None
profiling.profile_nested_fields
boolean
Whether to profile complex types like structs, arrays and maps.
Default: False
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.profile_table_row_count_estimate_only
boolean
Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
profiling.profile_table_row_limit
One of integer, null
Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats.
Default: 5000000
profiling.profile_table_size_limit
One of integer, null
Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats.
Default: 5
profiling.query_combiner_enabled
boolean
This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
profiling.report_dropped_profiles
boolean
Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes.
Default: False
profiling.sample_size
integer
Number of rows to be sampled from the table for column-level profiling. Applicable only if use_sampling is set to True.
Default: 10000
profiling.turn_off_expensive_profiling_metrics
boolean
Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
profiling.use_sampling
boolean
Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables.
Default: True
profiling.operation_config
OperationConfig
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.
Default: False
profiling.operation_config.profile_date_of_month
One of integer, null
Number between 1 and 31 for the date of the month (both inclusive). If not specified, this field does not take effect.
Default: None
profiling.operation_config.profile_day_of_week
One of integer, null
Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, this field does not take effect.
Default: None
profiling.tags_to_ignore_sampling
One of array, null
Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling.
Default: None
profiling.tags_to_ignore_sampling.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Configuration for stateful ingestion and stale entity removal.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of soft deletes, and blocks the state from committing, when accidental changes to the source configuration cause the relative change percentage in entities (compared to the previous state) to exceed the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Code Coordinates

  • Class Name: datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Hive Metastore, feel free to ping us on our Slack.