Hive Metastore
There are 2 sources that provide integration with Hive Metastore; this page documents the hive-metastore module.
| Source Module | Documentation |
|---|---|
| hive-metastore | Extracts metadata from Hive Metastore. Supports two connection methods selected via connection_type (sql or thrift). |
Module hive-metastore
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Catalog, Schema. |
| Classification | ❌ | Not Supported. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Data Profiling | ❌ | Not Supported. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
Extracts metadata from Hive Metastore.
Supports two connection methods selected via connection_type:
- sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
- thrift: Connection to HMS Thrift API with Kerberos support
Features:
- Table and view metadata extraction
- Schema field types including complex types (struct, map, array)
- Storage lineage to S3, HDFS, Azure, GCS
- View lineage via SQL parsing
- Stateful ingestion for stale entity removal
Prerequisites
The Hive Metastore connector supports two connection modes:
- SQL Mode (Default): Connects directly to the Hive metastore database (MySQL, PostgreSQL, etc.)
- Thrift Mode: Connects to Hive Metastore via the Thrift API (port 9083), with Kerberos support
Choose your connection mode based on your environment:
| Feature | SQL Mode (default) | Thrift Mode |
|---|---|---|
| Use when | Direct database access available | Only HMS Thrift API accessible |
| Authentication | Database credentials | Kerberos/SASL or unauthenticated |
| Port | Database port (3306/5432) | Thrift port (9083) |
| Dependencies | Database drivers | pymetastore, thrift-sasl |
| Filtering | Pattern-based (schema_pattern, database_pattern, table_pattern) | Pattern-based (database_pattern, table_pattern) |
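For orientation, a minimal recipe for each mode might look like the sketches below (hostnames and credentials are placeholders); complete examples appear later on this page.

```yaml
# SQL mode (default): connect directly to the metastore backend database
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432   # placeholder host
    database: metastore
    scheme: "postgresql+psycopg2"
    username: datahub_user
    password: ${METASTORE_PASSWORD}
```

```yaml
# Thrift mode: connect to the HMS Thrift API instead of the database
source:
  type: hive-metastore
  config:
    connection_type: thrift
    host_port: hms.company.com:9083            # placeholder host
```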
Before configuring the DataHub connector, ensure you have:
Database Access: Direct read access to the Hive metastore database (typically MySQL or PostgreSQL).
Network Access: The machine running DataHub ingestion must be able to reach your metastore database on the configured port.
Database Driver: Install the appropriate Python database driver:
# For PostgreSQL metastore
pip install 'acryl-datahub[hive]' psycopg2-binary
# For MySQL metastore
pip install 'acryl-datahub[hive]' PyMySQL
Metastore Schema Knowledge: Familiarity with your metastore database schema (typically public for PostgreSQL, or the database name for MySQL).
Required Database Permissions
The database user account used by DataHub needs read-only access to the Hive metastore tables.
PostgreSQL Metastore
-- Create a dedicated read-only user for DataHub
CREATE USER datahub_user WITH PASSWORD 'secure_password';
-- Grant connection privileges
GRANT CONNECT ON DATABASE metastore TO datahub_user;
-- Grant schema usage
GRANT USAGE ON SCHEMA public TO datahub_user;
-- Grant SELECT on metastore tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;
-- Grant SELECT on future tables (for metastore upgrades)
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
MySQL Metastore
-- Create a dedicated read-only user for DataHub
CREATE USER 'datahub_user'@'%' IDENTIFIED BY 'secure_password';
-- Grant SELECT privileges on metastore database
GRANT SELECT ON metastore.* TO 'datahub_user'@'%';
-- Apply changes
FLUSH PRIVILEGES;
Required Metastore Tables
DataHub queries the following metastore tables:
| Table | Purpose |
|---|---|
| DBS | Database/schema information |
| TBLS | Table metadata |
| TABLE_PARAMS | Table properties (including view definitions) |
| SDS | Storage descriptors (location, format) |
| COLUMNS_V2 | Column metadata |
| PARTITION_KEYS | Partition information |
| SERDES | Serialization/deserialization information |
Recommendation: Grant SELECT on all metastore tables to ensure compatibility with different Hive versions and for future DataHub enhancements.
Authentication
PostgreSQL
Standard Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
SSL Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
options:
connect_args:
sslmode: require
sslrootcert: /path/to/ca-cert.pem
MySQL
Standard Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:3306
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "mysql+pymysql" # Default if not specified
SSL Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:3306
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "mysql+pymysql"
options:
connect_args:
ssl:
ca: /path/to/ca-cert.pem
cert: /path/to/client-cert.pem
key: /path/to/client-key.pem
Amazon RDS (PostgreSQL or MySQL)
For AWS RDS-hosted metastore databases:
source:
type: hive-metastore
config:
host_port: metastore.abc123.us-east-1.rds.amazonaws.com:5432
database: metastore
username: datahub_user
password: ${RDS_PASSWORD}
scheme: "postgresql+psycopg2" # or 'mysql+pymysql'
options:
connect_args:
sslmode: require # RDS requires SSL
Azure Database for PostgreSQL/MySQL
source:
type: hive-metastore
config:
host_port: metastore-server.postgres.database.azure.com:5432
database: metastore
username: datahub_user@metastore-server # Note: Azure requires @server-name suffix
password: ${AZURE_DB_PASSWORD}
scheme: "postgresql+psycopg2"
options:
connect_args:
sslmode: require
Thrift Connection Mode
Use connection_type: thrift when you cannot access the metastore database directly but have access to the HMS Thrift API (typically port 9083). This is common in:
- Kerberized Hadoop clusters where database access is restricted
- Cloud-managed Hive services that only expose the Thrift API
- Environments with strict network segmentation
Thrift Mode Prerequisites
Before using Thrift mode, ensure:
- Network Access: The machine running DataHub ingestion can reach HMS on port 9083
- HMS Service Running: The Hive Metastore service is running and accepting Thrift connections
- For Kerberos: A valid Kerberos ticket is available (see Kerberos section below)
Verify connectivity:
# Test network connectivity to HMS
telnet hms.company.com 9083
# For Kerberos environments, verify ticket
klist
Thrift Mode Dependencies
# Install with Thrift support
pip install 'acryl-datahub[hive-metastore]'
# For Kerberos authentication, also install:
pip install thrift-sasl pyhive[hive-pure-sasl]
Thrift Configuration Options
| Option | Type | Default | Required | Description |
|---|---|---|---|---|
| connection_type | string | sql | Yes (for Thrift) | Set to thrift to enable Thrift mode |
| host_port | string | - | Yes | HMS host and port (e.g., hms.company.com:9083) |
| use_kerberos | boolean | false | No | Enable Kerberos/SASL authentication |
| kerberos_service_name | string | hive | No | Kerberos service principal name |
| kerberos_hostname_override | string | - | No | Override hostname for the Kerberos principal (for load balancers) |
| timeout_seconds | int | 60 | No | Connection timeout in seconds |
| max_retries | int | 3 | No | Maximum retry attempts for transient failures |
| catalog_name | string | - | No | HMS 3.x catalog name (e.g., spark_catalog) |
| include_catalog_name_in_ids | boolean | false | No | Include catalog in dataset URNs |
| database_pattern | AllowDeny | - | No | Filter databases by regex pattern |
| table_pattern | AllowDeny | - | No | Filter tables by regex pattern |
Note: SQL WHERE clause options (tables_where_clause_suffix, views_where_clause_suffix, schemas_where_clause_suffix) have been deprecated for security reasons (SQL injection risk) and are no longer supported. Use database_pattern and table_pattern instead.
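Where a deployment previously relied on WHERE clause suffixes, the same scoping can usually be expressed with regex patterns; a minimal sketch (database and table names below are illustrative placeholders):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    database_pattern:
      allow:
        - "^sales_.*"     # keep only databases starting with sales_
      deny:
        - "^tmp_.*"       # drop scratch databases
    table_pattern:
      deny:
        - ".*_backup$"    # drop backup tables
```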
Basic Thrift Configuration
source:
type: hive-metastore
config:
connection_type: thrift
host_port: hms.company.com:9083
Thrift with Kerberos Authentication
Ensure you have a valid Kerberos ticket (kinit -kt /path/to/keytab user@REALM) before running ingestion:
source:
type: hive-metastore
config:
connection_type: thrift
host_port: hms.company.com:9083
use_kerberos: true
kerberos_service_name: hive # Change if HMS uses different principal
# kerberos_hostname_override: hms-internal.company.com # If using load balancer
# catalog_name: spark_catalog # For HMS 3.x multi-catalog
database_pattern: # Pattern filtering (WHERE clauses NOT supported)
allow:
- "^prod_.*"
Thrift Mode Limitations
- No Presto/Trino view lineage: View SQL parsing requires SQL mode
- No WHERE clause filtering: Use database_pattern/table_pattern instead
- Kerberos ticket required: A valid ticket must exist before running (it is not embedded in the config)
- HMS version compatibility: Tested with HMS 2.x and 3.x
Storage Lineage
The Hive Metastore connector supports the same storage lineage features as the Hive connector, with enhanced performance due to direct database access.
Quick Start
Enable storage lineage with minimal configuration:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
# Enable storage lineage
emit_storage_lineage: true
Configuration Options
Storage lineage is controlled by the same parameters as the Hive connector:
| Parameter | Type | Default | Description |
|---|---|---|---|
| emit_storage_lineage | boolean | false | Master toggle to enable/disable storage lineage |
| hive_storage_lineage_direction | string | "upstream" | Direction: "upstream" (storage → Hive) or "downstream" (Hive → storage) |
| include_column_lineage | boolean | true | Enable column-level lineage from storage paths to Hive columns |
| storage_platform_instance | string | None | Platform instance for storage URNs (e.g., "prod-s3", "dev-hdfs") |
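As a sketch, a recipe that emits downstream storage lineage with a dedicated storage platform instance might combine these options as follows (the instance name is a placeholder):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    emit_storage_lineage: true
    hive_storage_lineage_direction: downstream   # Hive -> storage
    include_column_lineage: true
    storage_platform_instance: "prod-s3"         # placeholder instance name
```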
Supported Storage Platforms
All storage platforms supported by the Hive connector are also supported here:
- Amazon S3 (s3://, s3a://, s3n://)
- HDFS (hdfs://)
- Google Cloud Storage (gs://)
- Azure Blob Storage (wasb://, wasbs://)
- Azure Data Lake (adl://, abfs://, abfss://)
- Databricks File System (dbfs://)
- Local File System (file://)
See the sections above for complete configuration details.
Presto and Trino View Support
A key advantage of the Hive Metastore connector is its ability to extract metadata from Presto and Trino views that are stored in the metastore.
How It Works
1. View Detection: The connector identifies views by checking the TABLE_PARAMS table for Presto/Trino view definitions.
2. View Parsing: Presto/Trino view JSON is parsed to extract:
   - Original SQL text
   - Referenced tables
   - Column metadata and types
3. Lineage Extraction: SQL is parsed using sqlglot to create table-to-view lineage.
4. Storage Lineage Integration: If storage lineage is enabled, the connector also creates lineage from storage → tables → views.
Configuration
Presto/Trino view support is automatically enabled when ingesting from a metastore that contains Presto/Trino views. No additional configuration is required.
Example
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
# Enable storage lineage for complete lineage chain
emit_storage_lineage: true
This configuration will create complete lineage:
S3 Bucket → Hive Table → Presto View
Limitations
- Presto/Trino Version: The connector supports Presto 0.200+ and Trino view formats
- Complex SQL: Very complex SQL with non-standard syntax may have incomplete lineage
- Cross-Database References: Lineage is extracted for references within the same Hive metastore
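If parsing of particularly complex view SQL causes problems, view lineage extraction can be disabled entirely; a minimal sketch using the include_view_lineage and include_view_column_lineage options documented in Config Details below:

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    include_view_lineage: false         # skip SQL parsing of view definitions
    include_view_column_lineage: false  # only takes effect when include_view_lineage is enabled
```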
Schema Filtering
For large metastore deployments with many databases, use filtering to limit ingestion scope:
Database Filtering
source:
type: hive-metastore
config:
# ... connection config ...
# Only ingest from specific databases
schema_pattern:
allow:
- "^production_.*" # All databases starting with production_
- "analytics" # Specific database
deny:
- ".*_test$" # Exclude test databases
Database and Table Filtering with Patterns
For filtering by database name, use pattern-based filtering:
source:
type: hive-metastore
config:
# ... connection config ...
# Filter to specific databases using regex patterns
database_pattern:
allow:
- "^production_db$"
- "^analytics_db$"
deny:
- "^test_.*"
- ".*_staging$"
Note: The deprecated *_where_clause_suffix options have been removed for security reasons. Use database_pattern and table_pattern for filtering.
Performance Considerations
Advantages Over HiveServer2 Connector
The Hive Metastore connector is significantly faster than the Hive connector because:
- Direct Database Access: No HiveServer2 overhead
- Batch Queries: Fetches all metadata in optimized SQL queries
- No Query Execution: Doesn't run Hive queries to extract metadata
- Parallel Processing: Can process multiple databases concurrently
Performance Comparison (approximate):
- 10 databases, 1000 tables: ~2 minutes (Metastore) vs ~15 minutes (HiveServer2)
- 100 databases, 10,000 tables: ~15 minutes (Metastore) vs ~2 hours (HiveServer2)
Optimization Tips
Database Connection Pooling: The connector uses SQLAlchemy's default connection pooling. For very large deployments, consider tuning pool size:
options:
  pool_size: 10
  max_overflow: 20
Schema Filtering: Use schema_pattern to limit scope and reduce query time.
Stateful Ingestion: Enable it to only process changes:
stateful_ingestion:
  enabled: true
  remove_stale_metadata: true
Disable Column Lineage: If column-level storage lineage is not needed:
emit_storage_lineage: true
include_column_lineage: false # Faster
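Combined, a tuned recipe for a large deployment might look like the following sketch (pool sizes and patterns are illustrative):

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    options:
      pool_size: 10                    # SQLAlchemy connection pool tuning
      max_overflow: 20
    schema_pattern:
      allow:
        - "^analytics$"                # placeholder schema
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    emit_storage_lineage: true
    include_column_lineage: false      # skip column-level storage lineage for speed
```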
Network Considerations
- Latency: Low latency to the metastore database is important
- Bandwidth: Minimal bandwidth required (only metadata, no data transfer)
- Connection Limits: Ensure metastore database can handle additional read connections
Platform Instances
When ingesting from multiple metastores (e.g., different clusters or environments), use platform_instance:
source:
type: hive-metastore
config:
host_port: prod-metastore-db.company.com:5432
database: metastore
platform_instance: "prod-hive"
Best Practice: Combine with storage_platform_instance:
source:
type: hive-metastore
config:
platform_instance: "prod-hive" # Hive tables
storage_platform_instance: "prod-hdfs" # Storage locations
emit_storage_lineage: true
Caveats and Limitations
Metastore Schema Compatibility
- Hive Versions: Tested with Hive 1.x, 2.x, and 3.x metastore schemas
- Schema Variations: Different Hive versions may have slightly different metastore schemas
- Custom Tables: If your organization has added custom metastore tables, they won't be processed
Database Support
- Supported: PostgreSQL, MySQL, MariaDB
- Not Supported: Oracle, MSSQL (may work but untested)
- Derby: Not recommended (embedded metastore, typically single-user)
View Lineage Parsing
- Simple SQL: Fully supported with accurate lineage
- Complex SQL: Best-effort parsing; some edge cases may have incomplete lineage
- Non-standard SQL: Presto/Trino-specific functions may not be fully parsed
Permissions Limitations
- Read-Only: The connector only needs SELECT permissions
- No Write Operations: Never requires INSERT, UPDATE, or DELETE
- Metastore Locks: Read operations don't acquire metastore locks
Storage Lineage Limitations
Same as the Hive connector:
- Only tables with defined storage locations have lineage
- Temporary tables are not supported
- Partition-level lineage is aggregated at table level
Known Issues
Large Column Lists: Tables with 500+ columns may be slow to process due to metastore query complexity.
View Definition Encoding: Some older Hive versions store view definitions in non-UTF-8 encoding, which may cause parsing issues.
Case Sensitivity:
- PostgreSQL metastore: Case-sensitive identifiers (use "quoted" names in WHERE clauses)
- MySQL metastore: Case-insensitive by default
- DataHub automatically lowercases URNs for consistency (see the snippet below)
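URN lowercasing is controlled by the convert_urns_to_lowercase option documented in Config Details; if you prefer to make the behavior explicit in the recipe, a minimal sketch:

```yaml
source:
  type: hive-metastore
  config:
    # ... connection config ...
    convert_urns_to_lowercase: true  # normalize dataset URNs regardless of metastore casing
```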
Concurrent Metastore Writes: If the metastore is being actively modified during ingestion, some metadata may be inconsistent.
Troubleshooting
Connection Issues
Problem: Could not connect to metastore database
Solutions:
- Verify host_port, database, and scheme are correct
- Check network connectivity: telnet <host> <port>
- Verify firewall rules allow connections
- For PostgreSQL: Check that pg_hba.conf allows connections from your IP
- For MySQL: Check bind-address in my.cnf
Authentication Failures
Problem: Authentication failed or Access denied
Solutions:
- Verify username and password are correct
- Check user has CONNECT/LOGIN privileges
- For Azure: Ensure the username includes the @server-name suffix
- Review database logs for detailed error messages
Missing Tables
Problem: Not all tables appear in DataHub
Solutions:
- Verify database user has SELECT on all metastore tables
- Check if tables are filtered out by schema_pattern, database_pattern, or table_pattern
- Query the metastore directly to verify the tables exist:
SELECT d.name as db_name, t.tbl_name as table_name, t.tbl_type
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
WHERE d.name = 'your_database';
Presto/Trino Views Not Appearing
Problem: Views defined in Presto/Trino don't show up
Solutions:
- Check view definitions exist in metastore:
SELECT d.name as db_name, t.tbl_name as view_name, tp.param_value
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN TABLE_PARAMS tp ON t.tbl_id = tp.tbl_id
WHERE t.tbl_type = 'VIRTUAL_VIEW'
AND tp.param_key = 'presto_view'
LIMIT 10;
- Review ingestion logs for parsing errors
- Verify view JSON is valid
Storage Lineage Not Appearing
Problem: No storage lineage relationships visible
Solutions:
- Verify emit_storage_lineage: true is set
- Check that tables have storage locations in the metastore:
SELECT d.name as db_name, t.tbl_name as table_name, s.location
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN SDS s ON t.sd_id = s.sd_id
WHERE s.location IS NOT NULL
LIMIT 10;
- Review logs for "Failed to parse storage location" warnings
- See the "Storage Lineage" section above for troubleshooting tips
Slow Ingestion
Problem: Ingestion takes too long
Solutions:
- Use schema filtering to reduce scope
- Enable stateful ingestion to only process changes
- Check database query performance (may need indexes on metastore tables)
- Ensure low latency network connection to metastore database
- Consider disabling column lineage if not needed
Related Documentation
- Hive Metastore Configuration - Configuration examples
- Hive Connector - Alternative connector via HiveServer2
- SQLAlchemy Documentation - Underlying database connection library
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
# =============================================================================
# SQL Mode (Default) - Direct database connection
# =============================================================================
source:
type: hive-metastore
config:
# Hive metastore DB connection
host_port: localhost:5432
database: metastore
# specify the schema where metastore tables reside
schema_pattern:
allow:
- "^public"
# credentials
username: user # optional
password: pass # optional
#scheme: 'postgresql+psycopg2' # set this if metastore db is using postgres
#scheme: 'mysql+pymysql' # set this if metastore db is using mysql, default if unset
# Filter databases using pattern-based filtering
#database_pattern:
# allow:
# - "^db1$"
# deny:
# - "^test_.*"
# Storage Lineage Configuration (Optional)
# Enables lineage between Hive tables and their underlying storage locations
#emit_storage_lineage: false # Set to true to enable storage lineage
#hive_storage_lineage_direction: upstream # Direction: 'upstream' (storage -> Hive) or 'downstream' (Hive -> storage)
#include_column_lineage: true # Set to false to disable column-level lineage
#storage_platform_instance: "prod-hdfs" # Optional: platform instance for storage URNs
sink:
# sink configs
# =============================================================================
# Thrift Mode - HMS Thrift API connection (use when database access unavailable)
# =============================================================================
# Use this mode when:
# - You cannot access the metastore database directly
# - Only the HMS Thrift API (port 9083) is accessible
# - Your environment requires Kerberos authentication
#
# Prerequisites:
# - pip install 'acryl-datahub[hive-metastore]'
# - For Kerberos: pip install thrift-sasl pyhive[hive-pure-sasl]
# - For Kerberos: Run 'kinit' before ingestion to obtain ticket
#
# source:
# type: hive-metastore
# config:
# # =========================================================================
# # Connection Settings (Required)
# # =========================================================================
# connection_type: thrift # Enable Thrift mode (default is 'sql')
# host_port: hms.company.com:9083 # HMS Thrift API endpoint
#
# # =========================================================================
# # Authentication - Kerberos/SASL (Optional)
# # =========================================================================
# # Enable if HMS requires Kerberos authentication
# # Prerequisite: Run 'kinit -kt /path/to/keytab user@REALM' before ingestion
# use_kerberos: true
#
# # Kerberos service principal name (typically 'hive')
# # Check your HMS principal: klist -k /etc/hive/hive.keytab
# kerberos_service_name: hive
#
# # Override hostname for Kerberos principal (use with load balancers)
# # Set this if connecting via LB but Kerberos principal uses actual hostname
# # kerberos_hostname_override: hms-master.company.com
#
# # =========================================================================
# # Connection Tuning (Optional)
# # =========================================================================
# # timeout_seconds: 60 # Connection timeout (default: 60)
# # max_retries: 3 # Retry attempts for transient failures (default: 3)
#
# # =========================================================================
# # HMS 3.x Catalog Support (Optional)
# # =========================================================================
# # For HMS 3.x with multi-catalog support (e.g., Spark catalog)
# # catalog_name: spark_catalog
# # include_catalog_name_in_ids: true # Include catalog in dataset URNs
#
# # =========================================================================
# # Filtering (Pattern-based only - WHERE clauses NOT supported)
# # =========================================================================
# database_pattern:
# allow:
# - "^prod_.*" # Allow databases starting with 'prod_'
# - "^analytics$" # Allow exact match 'analytics'
# deny:
# - "^test_.*" # Deny databases starting with 'test_'
# - ".*_staging$" # Deny databases ending with '_staging'
#
# table_pattern:
# allow:
# - ".*" # Allow all tables by default
# deny:
# - "^tmp_.*" # Deny temporary tables
#
# # =========================================================================
# # Storage Lineage (Optional - works same as SQL mode)
# # =========================================================================
# emit_storage_lineage: true
# hive_storage_lineage_direction: upstream # or 'downstream'
# include_column_lineage: true
# # storage_platform_instance: "prod-hdfs"
#
# # =========================================================================
# # Platform Instance (Optional - for multi-cluster environments)
# # =========================================================================
# # platform_instance: "prod-hive"
#
# # =========================================================================
# # Stateful Ingestion (Optional - for incremental updates)
# # =========================================================================
# # stateful_ingestion:
# # enabled: true
# # remove_stale_metadata: true
#
# sink:
# type: datahub-rest
# config:
# server: http://localhost:8080
# =============================================================================
# Thrift Mode - Minimal Example (No Kerberos)
# =============================================================================
# source:
# type: hive-metastore
# config:
# connection_type: thrift
# host_port: hms.company.com:9083
# use_kerberos: false
#
# sink:
# type: datahub-rest
# config:
# server: http://localhost:8080
# =============================================================================
# Thrift Mode - Kerberos with Load Balancer
# =============================================================================
# source:
# type: hive-metastore
# config:
# connection_type: thrift
# host_port: hms-lb.company.com:9083 # Load balancer address
# use_kerberos: true
# kerberos_service_name: hive
# kerberos_hostname_override: hms-master.company.com # Actual HMS hostname
#
# sink:
# type: datahub-rest
# config:
# server: http://localhost:8080
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
catalog_name One of string, null | Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'. Default: None |
connection_type Enum | One of: "sql", "thrift" |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
enable_properties_merge boolean | Merge properties with existing server data instead of overwriting. Default: True |
hive_storage_lineage_direction Enum | One of: "upstream", "downstream" |
host_port string | Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083). Default: localhost:3306 |
include_catalog_name_in_ids boolean | Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD) Default: False |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Extract lineage from Hive views by parsing view definitions. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
ingestion_job_id string | Default: |
kerberos_hostname_override One of string, null | Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'. Default: None |
kerberos_service_name string | Kerberos service name for the HMS principal. Only for connection_type='thrift'. Default: hive |
metastore_db_name One of string, null | Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering. Default: None |
mode Enum | One of: "hive", "presto", "presto-on-hive", "trino" |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schemas_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
simplify_nested_field_paths boolean | Simplify v2 field paths to v1. Falls back to v2 for Union/Array types. Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
tables_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
timeout_seconds integer | Connection timeout in seconds. Only for connection_type='thrift'. Default: 60 |
use_catalog_subtype boolean | Use 'Catalog' (True) or 'Database' (False) as container subtype. Default: True |
use_dataset_pascalcase_subtype boolean | Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype. Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
use_kerberos boolean | Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'. Default: False |
username One of string, null | username Default: None |
views_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.key.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.key.allow.string string | |
domain.key.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.key.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.key.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifer type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling.Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Configuration for stateful ingestion and stale entity removal. Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"HiveMetastoreConfigMode": {
"description": "Mode for metadata extraction.",
"enum": [
"hive",
"presto",
"presto-on-hive",
"trino"
],
"title": "HiveMetastoreConfigMode",
"type": "string"
},
"HiveMetastoreConnectionType": {
"description": "Connection type for HiveMetastoreSource.",
"enum": [
"sql",
"thrift"
],
"title": "HiveMetastoreConnectionType",
"type": "string"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Hive Metastore source.\n\nSupports two connection types:\n- sql: Direct database access (MySQL/PostgreSQL) to HMS backend\n- thrift: HMS Thrift API with Kerberos support",
"properties": {
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Configuration for stateful ingestion and stale entity removal."
},
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Extract lineage from Hive views by parsing view definitions.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"default": "localhost:3306",
"description": "Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083).",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"connection_type": {
"$ref": "#/$defs/HiveMetastoreConnectionType",
"default": "sql",
"description": "Connection method: 'sql' for direct database access (MySQL/PostgreSQL), 'thrift' for HMS Thrift API with optional Kerberos support."
},
"views_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Views Where Clause Suffix",
"type": "string"
},
"tables_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Tables Where Clause Suffix",
"type": "string"
},
"schemas_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Schemas Where Clause Suffix",
"type": "string"
},
"metastore_db_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering.",
"title": "Metastore Db Name"
},
"use_kerberos": {
"default": false,
"description": "Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'.",
"title": "Use Kerberos",
"type": "boolean"
},
"kerberos_service_name": {
"default": "hive",
"description": "Kerberos service name for the HMS principal. Only for connection_type='thrift'.",
"title": "Kerberos Service Name",
"type": "string"
},
"kerberos_hostname_override": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'.",
"title": "Kerberos Hostname Override"
},
"timeout_seconds": {
"default": 60,
"description": "Connection timeout in seconds. Only for connection_type='thrift'.",
"title": "Timeout Seconds",
"type": "integer"
},
"catalog_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'.",
"title": "Catalog Name"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for databases to filter."
},
"mode": {
"$ref": "#/$defs/HiveMetastoreConfigMode",
"default": "hive",
"description": "Platform mode for metadata. Valid options: ['hive', 'presto', 'presto-on-hive', 'trino']"
},
"use_catalog_subtype": {
"default": true,
"description": "Use 'Catalog' (True) or 'Database' (False) as container subtype.",
"title": "Use Catalog Subtype",
"type": "boolean"
},
"use_dataset_pascalcase_subtype": {
"default": false,
"description": "Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype.",
"title": "Use Dataset Pascalcase Subtype",
"type": "boolean"
},
"include_catalog_name_in_ids": {
"default": false,
"description": "Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD)",
"title": "Include Catalog Name In Ids",
"type": "boolean"
},
"enable_properties_merge": {
"default": true,
"description": "Merge properties with existing server data instead of overwriting.",
"title": "Enable Properties Merge",
"type": "boolean"
},
"simplify_nested_field_paths": {
"default": false,
"description": "Simplify v2 field paths to v1. Falls back to v2 for Union/Array types.",
"title": "Simplify Nested Field Paths",
"type": "boolean"
},
"ingestion_job_id": {
"default": "",
"title": "Ingestion Job Id",
"type": "string"
}
},
"title": "HiveMetastore",
"type": "object"
}
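For orientation, here is a minimal recipe sketch exercising a few of the options defined in the schema above: schema/table filtering plus stateful ingestion. It is a sketch only, assuming the source type matches the module name (hive-metastore); the host, credentials, pipeline name, and sink address are placeholders.
source:
  type: hive-metastore                # assumed to match the module name above
  config:
    connection_type: sql              # default: direct connection to the HMS backend database
    host_port: localhost:3306         # placeholder MySQL backend; sqlalchemy_uri can be used instead
    username: datahub_user            # placeholder credentials
    password: secure_password
    metastore_db_name: metastore
    schema_pattern:
      allow:
        - "analytics"                 # only ingest the analytics schema
    table_pattern:
      deny:
        - '.*\.tmp_.*'                # skip temporary tables (database.schema.table regex)
    stateful_ingestion:
      enabled: true                   # soft-delete entities missing since the last successful run

pipeline_name: hive_metastore_prod    # stateful ingestion needs a stable pipeline_name

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080     # placeholder DataHub GMS endpoint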
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource - Browse on GitHub
Module presto-on-hive
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Catalog, Schema. |
| Classification | ❌ | Not Supported. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Data Profiling | ❌ | Not Supported. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
Extracts metadata from Hive Metastore.
Supports two connection methods selected via connection_type:
- sql: Direct connection to HMS backend database (MySQL/PostgreSQL)
- thrift: Connection to HMS Thrift API with Kerberos support
Features:
- Table and view metadata extraction
- Schema field types including complex types (struct, map, array)
- Storage lineage to S3, HDFS, Azure, GCS
- View lineage via SQL parsing
- Stateful ingestion for stale entity removal
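The sql method is the default and only needs database credentials (host_port, username, password). For the thrift method, the sketch below shows the relevant keys; it assumes the source type matches the module name (presto-on-hive) and uses a placeholder host.
source:
  type: presto-on-hive               # assumed to match the module name
  config:
    connection_type: thrift          # talk to the HMS Thrift API instead of the backend database
    host_port: hms.example.com:9083  # placeholder Thrift endpoint
    use_kerberos: true               # enable Kerberos/SASL authentication
    kerberos_service_name: hive      # default HMS service principal name
    timeout_seconds: 60
    mode: presto-on-hive             # platform mode for the emitted metadata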
CLI based Ingestion
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
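For example, the row stateful_ingestion.enabled in the table below corresponds to this nesting in a recipe (shown here under the standard source.config block):
source:
  config:
    stateful_ingestion:
      enabled: true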
| Field | Description |
|---|---|
catalog_name One of string, null | Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'. Default: None |
connection_type Enum | One of: "sql", "thrift" |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
enable_properties_merge boolean | Merge properties with existing server data instead of overwriting. Default: True |
hive_storage_lineage_direction Enum | One of: "upstream", "downstream" |
host_port string | Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083). Default: localhost:3306 |
include_catalog_name_in_ids boolean | Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD) Default: False |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Extract lineage from Hive views by parsing view definitions. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
ingestion_job_id string | Default: |
kerberos_hostname_override One of string, null | Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'. Default: None |
kerberos_service_name string | Kerberos service name for the HMS principal. Only for connection_type='thrift'. Default: hive |
metastore_db_name One of string, null | Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering. Default: None |
mode Enum | One of: "hive", "presto", "presto-on-hive", "trino" |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schemas_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
simplify_nested_field_paths boolean | Simplify v2 field paths to v1. Falls back to v2 for Union/Array types. Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
tables_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
timeout_seconds integer | Connection timeout in seconds. Only for connection_type='thrift'. Default: 60 |
use_catalog_subtype boolean | Use 'Catalog' (True) or 'Database' (False) as container subtype. Default: True |
use_dataset_pascalcase_subtype boolean | Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype. Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
use_kerberos boolean | Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'. Default: False |
username One of string, null | username Default: None |
views_where_clause_suffix string | DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead. Default: |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.`key`.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.`key`.allow.string string | |
domain.`key`.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.`key`.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.`key`.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedence. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifier type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift support this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile a table only if it has been updated within this many days. If set to null, there is no last-modified-time constraint on which tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling. Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Configuration for stateful ingestion and stale entity removal. Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents a large number of soft deletes and blocks the state from committing (e.g. after an accidental source configuration change) when the relative change in entities compared to the previous state exceeds the 'fail_safe_threshold' percentage. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
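Several of the lineage options above work together: emit_storage_lineage is off by default, and include_column_lineage only takes effect once storage lineage is enabled. A minimal config fragment (field names taken from the table above; the source type, hosts, and platform instance are placeholders) might look like:
source:
  type: presto-on-hive
  config:
    connection_type: sql
    host_port: metastore-db.example.com:3306    # placeholder backend database
    emit_storage_lineage: true                  # link Hive tables to S3/HDFS/Azure/GCS locations
    hive_storage_lineage_direction: upstream    # storage feeds the Hive tables
    include_column_lineage: true                # column-level lineage to storage fields
    storage_platform_instance: my-s3-instance   # placeholder storage platform instance
With hive_storage_lineage_direction set to downstream instead, the same storage locations would be emitted as downstream of the Hive tables.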
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"HiveMetastoreConfigMode": {
"description": "Mode for metadata extraction.",
"enum": [
"hive",
"presto",
"presto-on-hive",
"trino"
],
"title": "HiveMetastoreConfigMode",
"type": "string"
},
"HiveMetastoreConnectionType": {
"description": "Connection type for HiveMetastoreSource.",
"enum": [
"sql",
"thrift"
],
"title": "HiveMetastoreConnectionType",
"type": "string"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Hive Metastore source.\n\nSupports two connection types:\n- sql: Direct database access (MySQL/PostgreSQL) to HMS backend\n- thrift: HMS Thrift API with Kerberos support",
"properties": {
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Configuration for stateful ingestion and stale entity removal."
},
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Extract lineage from Hive views by parsing view definitions.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"default": "localhost:3306",
"description": "Host and port. For SQL: database port (3306/5432). For Thrift: HMS Thrift port (9083).",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"connection_type": {
"$ref": "#/$defs/HiveMetastoreConnectionType",
"default": "sql",
"description": "Connection method: 'sql' for direct database access (MySQL/PostgreSQL), 'thrift' for HMS Thrift API with optional Kerberos support."
},
"views_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Views Where Clause Suffix",
"type": "string"
},
"tables_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Tables Where Clause Suffix",
"type": "string"
},
"schemas_where_clause_suffix": {
"default": "",
"description": "DEPRECATED: This option has been deprecated for security reasons and will be removed in a future release. Use 'database_pattern' instead.",
"title": "Schemas Where Clause Suffix",
"type": "string"
},
"metastore_db_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the Hive metastore's database (usually: metastore). For backward compatibility, if not provided, the database field will be used. If both 'database' and 'metastore_db_name' are set, 'database' is used for filtering.",
"title": "Metastore Db Name"
},
"use_kerberos": {
"default": false,
"description": "Whether to use Kerberos/SASL authentication. Only for connection_type='thrift'.",
"title": "Use Kerberos",
"type": "boolean"
},
"kerberos_service_name": {
"default": "hive",
"description": "Kerberos service name for the HMS principal. Only for connection_type='thrift'.",
"title": "Kerberos Service Name",
"type": "string"
},
"kerberos_hostname_override": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Override hostname for Kerberos principal construction. Use when connecting through a load balancer. Only for connection_type='thrift'.",
"title": "Kerberos Hostname Override"
},
"timeout_seconds": {
"default": 60,
"description": "Connection timeout in seconds. Only for connection_type='thrift'.",
"title": "Timeout Seconds",
"type": "integer"
},
"catalog_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Catalog name for HMS 3.x multi-catalog deployments. Only for connection_type='thrift'.",
"title": "Catalog Name"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for databases to filter."
},
"mode": {
"$ref": "#/$defs/HiveMetastoreConfigMode",
"default": "hive",
"description": "Platform mode for metadata. Valid options: ['hive', 'presto', 'presto-on-hive', 'trino']"
},
"use_catalog_subtype": {
"default": true,
"description": "Use 'Catalog' (True) or 'Database' (False) as container subtype.",
"title": "Use Catalog Subtype",
"type": "boolean"
},
"use_dataset_pascalcase_subtype": {
"default": false,
"description": "Use 'Table'/'View' (True) or 'table'/'view' (False) as dataset subtype.",
"title": "Use Dataset Pascalcase Subtype",
"type": "boolean"
},
"include_catalog_name_in_ids": {
"default": false,
"description": "Add catalog name to dataset URNs. Example: urn:li:dataset:(urn:li:dataPlatform:hive,catalog.db.table,PROD)",
"title": "Include Catalog Name In Ids",
"type": "boolean"
},
"enable_properties_merge": {
"default": true,
"description": "Merge properties with existing server data instead of overwriting.",
"title": "Enable Properties Merge",
"type": "boolean"
},
"simplify_nested_field_paths": {
"default": false,
"description": "Simplify v2 field paths to v1. Falls back to v2 for Union/Array types.",
"title": "Simplify Nested Field Paths",
"type": "boolean"
},
"ingestion_job_id": {
"default": "",
"title": "Ingestion Job Id",
"type": "string"
}
},
"title": "HiveMetastore",
"type": "object"
}
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource - Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Hive Metastore, feel free to ping us on our Slack.