SQL Profiling

SQL Profiling collects table level and column level statistics. The SQL-based profiler does not run alone, but rather can be enabled for other SQL-based sources. Enabling profiling will slow down ingestion runs.

caution

Running profiling against many tables or over many rows can run up significant costs. While we've done our best to limit the expensiveness of the queries the profiler runs, you should be prudent about the set of tables profiling is enabled on or the frequency of the profiling runs.

Capabilities

Extracts:

Row and column counts for each table
For each column, if applicable:
- null counts and proportions
- distinct counts and proportions
- minimum, maximum, mean, median, standard deviation, some quantile values
- histograms or frequencies of unique values

Supported Sources

Source	Notes
ABS Data Lake	Optionally enabled via configuration.
Apache Doris	Optionally enabled via configuration.
Athena	Optionally enabled via configuration. Profiling uses sql queries on whole table which can be expensive operation.
BigQuery	Optionally enabled via configuration.
ClickHouse `clickhouse-usage`	Optionally enabled via configuration.
ClickHouse `clickhouse`	Optionally enabled via configuration.
CockroachDB	Optionally enabled via configuration.
Databricks	Supported via the `profiling.enabled` config.
Dremio	Optionally enabled via configuration.
Excel	Optionally enabled via configuration.
IBM Db2	Optionally enabled via configuration.
Iceberg	Optionally enabled via configuration.
MariaDB	Optionally enabled via configuration.
Microsoft SQL Server	Optionally enabled via configuration.
MySQL	Optionally enabled via configuration.
Postgres	Optionally enabled via configuration.
PowerBI	Optionally enabled via configuration profiling.enabled.
Presto	Optionally enabled via configuration.
Redshift	Optionally enabled via configuration.
S3 / Local Files	Optionally enabled via configuration.
Salesforce	Only table level profiling is supported via `profiling.enabled` config field.
SAP HANA	Optionally enabled via configuration.
Snowflake	Optionally enabled via configuration `profiling.enabled`.
SQLAlchemy	Optionally enabled via configuration.
StarRocks	Optionally enabled via configuration.
Teradata	Optionally enabled via configuration.
TimescaleDB	Optionally enabled via configuration.
Trino `trino`	Optionally enabled via configuration.
Vertica	Optionally enabled via configuration.

Profiler Implementation

DataHub uses a SQLAlchemy-based profiler by default for all SQL sources.

Default: SQLAlchemy Profiler

The default profiler runs profiling queries directly against your SQL source's existing SQLAlchemy connection and emits the table- and column-level statistics listed under Capabilities. No additional dependencies are required beyond the SQL connector itself.

No configuration is required to use it — any SQL source with profiling enabled will use the SQLAlchemy profiler automatically:

source:
  config:
    profiling:
      enabled: true

Optional: Great Expectations Profiler (Deprecated)

danger

The Great Expectations profiler is deprecated and is planned for removal in a future release. The SQLAlchemy profiler above is the recommended replacement and has feature parity for all dataset- and column-level metrics. Existing users still relying on method: ge should plan to migrate.

To use the legacy GE profiler, install the optional profiling-ge extra and set profiling.method explicitly:

pip install 'acryl-datahub[profiling-ge]'

source:
  config:
    profiling:
      enabled: true
      method: ge

If you set profiling.method: ge without installing the extra, the ingestion will fail with a ConfigurationError pointing at the fix.

Differences

The two profilers produce equivalent dataset- and column-level statistics. The only known difference is histogram bucket layout (controlled by include_field_histogram, off by default): the SQLAlchemy profiler uses 10 equal-width buckets, while the GE profiler uses Great Expectations' adaptive partitioning.

Is this page helpful?

SQL Profiling

Capabilities​

Supported Sources​

Profiler Implementation​

Default: SQLAlchemy Profiler​

Optional: Great Expectations Profiler (Deprecated)​

Differences​