
SQL Queries

Overview

SQL Queries is a metadata-focused DataHub integration. Learn more in the official SQL Queries documentation.

The DataHub integration for SQL Queries covers the metadata entities relevant to this connector. Depending on module capabilities, it can also capture lineage, usage, profiling, ownership, tags, and stateful deletion detection.

Concept Mapping

A connector-specific concept mapping is still pending; the table below shows the generic mapping of source concepts to DataHub concepts.

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Platform/account/project scope | Platform Instance, Container | Organizes assets within the platform context. |
| Core technical asset (for example table/view/topic/file) | Dataset | Primary ingested technical asset. |
| Schema fields / columns | SchemaField | Included when schema extraction is supported. |
| Ownership and collaboration principals | CorpUser, CorpGroup | Emitted by modules that support ownership and identity metadata. |
| Dependencies and processing relationships | Lineage edges | Available when lineage extraction is supported and enabled. |

Module sql-queries

Incubating

Important Capabilities

| Capability | Status | Notes |
|---|---|---|
| Column-level Lineage | ✅ | Parsed from SQL queries. |
| Table-Level Lineage | ✅ | Parsed from SQL queries. |

Overview

The sql-queries module ingests metadata from SQL Queries into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

This source reads a newline-delimited JSON file containing SQL queries and parses them to generate lineage.

Prerequisites

Before running ingestion, ensure network connectivity to the source, valid authentication credentials, and read permissions for metadata APIs required by this module.

Install the Plugin

pip install 'acryl-datahub[sql-queries]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
datahub_api:  # Only necessary if using a non-DataHub sink, e.g. the file sink
  server: http://localhost:8080
  timeout_sec: 60
source:
  type: sql-queries
  config:
    platform: "snowflake"
    default_db: "SNOWFLAKE"
    query_file: "./queries.json"
```
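
query_file can also point at an object in S3 (an s3:// URI), in which case aws_config supplies credentials (see Config Details below). A sketch of such a recipe; the platform, bucket, key, and region here are illustrative, not defaults:

```yaml
source:
  type: sql-queries
  config:
    platform: "athena"
    query_file: "s3://example-bucket/exports/queries.json"  # illustrative URI
    aws_config:
      aws_region: "us-east-1"
```

As with any recipe, it can be run with the DataHub CLI: datahub ingest -c recipe.yaml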

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Each field below is listed with its type, a description, and its default value where one exists.
platform (required)
string
The platform for which to generate data, e.g. snowflake
query_file (required)
string
Path to file to ingest
default_db
One of string, null
The default database to use for unqualified table names
Default: None
default_schema
One of string, null
The default schema to use for unqualified table names
Default: None
enable_lazy_schema_loading
boolean
Enable lazy schema loading for better performance. When enabled, schemas are fetched on-demand instead of bulk loading all schemas upfront, reducing startup time and memory usage.
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
override_dialect
One of string, null
The SQL dialect to use when parsing queries. Overrides automatic dialect detection.
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
env
string
The environment that all assets produced by this connector belong to
Default: PROD
aws_config
One of AwsConnectionConfig, null
AWS configuration for S3 access. Required when query_file is an S3 URI (s3://).
Default: None
aws_config.aws_access_key_id
One of string, null
AWS access key ID. Can be auto-detected, see the AWS boto3 docs for details.
Default: None
aws_config.aws_advanced_config
object
Advanced AWS configuration options. These are passed directly to botocore.config.Config.
aws_config.aws_endpoint_url
One of string, null
The AWS service endpoint. This is normally constructed automatically, but can be overridden here.
Default: None
aws_config.aws_profile
One of string, null
The named profile to use from AWS credentials. Falls back to default profile if not specified and no access keys provided. Profiles are configured in ~/.aws/credentials or ~/.aws/config.
Default: None
aws_config.aws_proxy
One of string, null
A set of proxy configs to use with AWS. See the botocore.config docs for details.
Default: None
aws_config.aws_region
One of string, null
AWS region code.
Default: None
aws_config.aws_retry_mode
Enum
One of: "legacy", "standard", "adaptive"
Default: standard
aws_config.aws_retry_num
integer
Number of times to retry failed AWS requests. See the botocore.retry docs for details.
Default: 5
aws_config.aws_secret_access_key
One of string(password), null
AWS secret access key. Can be auto-detected, see the AWS boto3 docs for details.
Default: None
aws_config.aws_session_token
One of string(password), null
AWS session token. Can be auto-detected, see the AWS boto3 docs for details.
Default: None
aws_config.read_timeout
number
The timeout for reading from the connection (in seconds).
Default: 60
aws_config.aws_role
One of string, array, null
AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are the same as boto3's STS.Client.assume_role.
Default: None
aws_config.aws_role.union
One of string, AwsAssumeRoleConfig
aws_config.aws_role.union.RoleArn 
string
ARN of the role to assume.
aws_config.aws_role.union.ExternalId
One of string, null
External ID to use when assuming the role.
Default: None
temp_table_patterns
array
Regex patterns for temporary tables to filter in lineage ingestion. Specify regex to match the entire table name. This is useful for platforms like Athena that don't have native temp tables but use naming patterns for fake temp tables.
Default: []
temp_table_patterns.string
string
usage
BaseUsageConfig
usage.bucket_duration
Enum
One of: "DAY", "HOUR"
usage.end_time
string(date-time)
Latest date of lineage/usage to consider. Default: Current time in UTC
usage.format_sql_queries
boolean
Whether to format sql queries
Default: False
usage.include_operational_stats
boolean
Whether to display operational stats.
Default: True
usage.include_read_operational_stats
boolean
Whether to report read operational stats. Experimental.
Default: False
usage.include_top_n_queries
boolean
Whether to ingest the top_n_queries.
Default: True
usage.start_time
string(date-time)
Earliest date of lineage/usage to consider. You can also specify a relative time with respect to end_time, such as '-7 days' or '-7d'.
Default: Last full day in UTC (or hour, depending on bucket_duration)
usage.top_n_queries
integer
Number of top queries to save to each table.
Default: 10
usage.user_email_pattern
AllowDenyPattern
Allow/deny regex patterns applied to user emails.
usage.user_email_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Example Queries File

```json
{"query": "SELECT x FROM my_table", "timestamp": 1689232738.051, "user": "user_a", "downstream_tables": [], "upstream_tables": ["my_database.my_schema.my_table"]}
{"query": "INSERT INTO my_table VALUES (1, 'a')", "timestamp": 1689232737.669, "user": "user_b", "downstream_tables": ["my_database.my_schema.my_table"], "upstream_tables": []}
```

Note: This uses newline-delimited JSON format (NDJSON), where each line is a separate JSON object.
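
A file like the one above can be produced programmatically; a minimal sketch using only the Python standard library (the file name and record contents are illustrative):

```python
import json

# Illustrative query records mirroring the example above.
records = [
    {"query": "SELECT x FROM my_table", "timestamp": 1689232738.051,
     "user": "user_a", "downstream_tables": [],
     "upstream_tables": ["my_database.my_schema.my_table"]},
    {"query": "INSERT INTO my_table VALUES (1, 'a')", "timestamp": 1689232737.669,
     "user": "user_b", "downstream_tables": ["my_database.my_schema.my_table"],
     "upstream_tables": []},
]

# NDJSON: one JSON object per line, each parseable on its own.
with open("queries.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back line by line, as the source does.
with open("queries.json") as f:
    parsed = [json.loads(line) for line in f]
print(len(parsed))  # 2
```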

Query File Format

The source expects newline-delimited JSON (NDJSON), with one query object per line. Common fields:

  • query (required): SQL text to parse
  • timestamp (optional): query execution timestamp
  • user (optional): actor used for usage attribution
  • operation_type (optional): fallback operation classification
  • session_id (optional): session key used to resolve temporary tables across related queries
  • downstream_tables / upstream_tables (optional): fallback lineage hints if parsing fails
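
The field rules above can be checked before ingestion. A hypothetical pre-flight validator sketch; the function name and error messages are illustrative, not part of the connector:

```python
import json

# Fields the sql-queries source documents; only "query" is required.
OPTIONAL_FIELDS = {"timestamp", "user", "operation_type", "session_id",
                   "downstream_tables", "upstream_tables"}

def validate_record(line: str) -> dict:
    """Parse one NDJSON line and check it against the documented fields."""
    rec = json.loads(line)
    if "query" not in rec:
        raise ValueError("missing required field: query")
    unknown = set(rec) - OPTIONAL_FIELDS - {"query"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return rec

rec = validate_record('{"query": "SELECT 1", "user": "user_a"}')
print(rec["user"])  # user_a
```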

Incremental Lineage

When incremental_lineage is enabled (see Config Details), lineage is emitted incrementally on top of the lineage already in DataHub; when it is disabled, each run re-states lineage in full.

Temporary Table Support

session_id enables correlation of temporary-table creation and downstream usage across query sequences, improving lineage continuity.
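
Relatedly, the temp_table_patterns option (see Config Details) filters temporary tables out of lineage by regex, where each pattern must match the entire table name. A sketch of that matching semantic; the pattern and table names below are illustrative, not a real Athena convention:

```python
import re

# Hypothetical naming convention for "fake" temp tables.
temp_table_patterns = [r".*__tmp_\d+"]

def is_temp_table(name: str) -> bool:
    # fullmatch mirrors the config's "match the entire table name" semantics.
    return any(re.fullmatch(p, name) for p in temp_table_patterns)

print(is_temp_table("orders__tmp_20240101"))  # True
print(is_temp_table("orders"))                # False
```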

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Code Coordinates

  • Class Name: datahub.ingestion.source.sql_queries.SqlQueriesSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for SQL Queries, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.