Excel
Important Capabilities
Capability | Status | Notes |
---|---|---|
Asset Containers | ✅ | Enabled by default. |
Data Profiling | ✅ | Optionally enabled via configuration. |
Detect Deleted Entities | ✅ | Optionally enabled via stateful_ingestion.remove_stale_metadata. |
Schema Metadata | ✅ | Enabled by default. |
This connector ingests Excel worksheet datasets into DataHub. Workbooks (Excel files) can be ingested from the local filesystem, from S3 buckets, or from Azure Blob Storage. An asterisk (*) can be used in place of a directory or as part of a file name to match multiple directories or files with a single path specification.
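As a rough illustration of how one path specification can cover several directories and files, the sketch below compares a path with a specification segment by segment using Python's fnmatch module. The helper name path_matches and the segment-by-segment rule are assumptions made for this example; the connector's own matching logic may differ.

```python
from fnmatch import fnmatchcase


def path_matches(path_spec: str, path: str) -> bool:
    """Return True if every segment of `path` matches the same segment of `path_spec`.

    Hypothetical helper: an asterisk matches within a single segment only,
    so it never spans a "/" separator.
    """
    spec_parts = path_spec.split("/")
    path_parts = path.split("/")
    if len(spec_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, spec) for spec, part in zip(spec_parts, path_parts))


spec = "s3://bucket/data/excel/*/*.xlsx"
print(path_matches(spec, "s3://bucket/data/excel/2024/report.xlsx"))
print(path_matches(spec, "s3://bucket/data/excel/2024/q1/report.xlsx"))
```

Under these assumptions the first path matches, while the second does not because it has one more directory level than the specification.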
By default, this connector ingests all worksheets in a workbook (an Excel file). To filter worksheets, use the worksheet_pattern config option; to ingest only the active worksheet, use the active_sheet_only config option.
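The allow/deny filtering behind worksheet_pattern can be pictured with the minimal sketch below. It assumes worksheet names take the form filename_without_extension.worksheet_name, that patterns are matched from the start of the name, and that matching ignores case by default. The function is_allowed is a hypothetical stand-in, not DataHub's implementation.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Return True if `name` matches an allow pattern and no deny pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # A deny match always wins over an allow match in this sketch.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# Keep only Sheet1 of report.xlsx.
print(is_allowed("report.Sheet1", allow=[r"report\.Sheet1"]))
print(is_allowed("report.Sheet2", allow=[r"report\.Sheet1"]))
```

Escaping the dot (`report\.Sheet1`) keeps the pattern from matching any character between the file name and the sheet name.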
Supported file types
Supported file types are as follows:
- Excel workbook (*.xlsx)
- Excel macro-enabled workbook (*.xlsm)
The connector will attempt to identify which cells contain table data. A table is defined as a header row, which is used to derive the column names, followed by data rows. The schema is inferred from the data types that are present in a column.
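To make the idea of inferring a schema from the values in each column concrete, here is a minimal sketch. It assumes the table has already been located and is passed as a rectangular list of rows whose first row is the header, with empty cells represented as None. The function names and the type-merging rules are illustrative assumptions, not the connector's actual logic.

```python
from typing import Any, Dict, List


def infer_column_type(values: List[Any]) -> str:
    """Pick one type name for a column from the Python types of its values."""
    kinds = {type(value).__name__ for value in values if value is not None}
    if not kinds:
        return "unknown"  # column holds no data
    if len(kinds) == 1:
        return kinds.pop()
    if kinds <= {"int", "float"}:
        return "float"  # mixed integers and floats widen to float
    return "str"  # any other mixture falls back to string


def infer_schema(rows: List[List[Any]]) -> Dict[str, str]:
    """Map each header name to the inferred type of the column below it."""
    header, *data = rows
    return {
        name: infer_column_type([row[index] for row in data])
        for index, name in enumerate(header)
    }


table = [["id", "amount", "note"], [1, 2.5, "first"], [2, 3, None]]
print(infer_schema(table))
```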
Rows directly above or directly below the table in which only the first two columns have values are assumed to contain metadata. If such rows are found, they are converted to custom properties, with the first column as the key and the second column as the value. The workbook's standard and custom properties are also imported as dataset custom properties.
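The key/value rule for metadata rows can be sketched as follows. The example assumes the candidate rows next to the table have already been isolated and that empty cells are None. The helper extract_properties is hypothetical and only illustrates the "first two columns only" test described above.

```python
from typing import Any, Dict, List


def extract_properties(rows: List[List[Any]]) -> Dict[str, str]:
    """Turn rows with values in only the first two columns into key/value pairs."""
    properties: Dict[str, str] = {}
    for row in rows:
        first_two, rest = row[:2], row[2:]
        has_key_and_value = len(first_two) == 2 and all(v is not None for v in first_two)
        rest_is_empty = all(v is None for v in rest)
        if has_key_and_value and rest_is_empty:
            properties[str(row[0])] = str(row[1])
    return properties


candidate_rows = [
    ["Owner", "Finance", None],
    ["Refreshed", "2024-01-01", None],
    ["a", "b", "c"],  # third column has a value, so this is not a metadata row
]
print(extract_properties(candidate_rows))
```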
Data Model Mapping
The following table shows how Excel entities are mapped to DataHub entities:
Excel Entity | DataHub Entity | Description |
---|---|---|
Excel Worksheet | Dataset | Each worksheet becomes a dataset with URN pattern: urn:li:dataset:(urn:li:dataPlatform:excel,{path}/[{filename}]{sheet_name},PROD) |
File/Directory Structure | Container | Directory hierarchy creates containers with obfuscated URNs for organizing datasets |
Note: The Excel workbook file itself does not become a separate DataHub entity - only the individual worksheets within it are ingested as datasets.
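For orientation, the URN pattern in the table can be filled in with a few lines of Python. This is only a string-formatting sketch of the documented pattern: the function worksheet_urn and the sample path, file name, and sheet name are made up for the example.

```python
def worksheet_urn(path: str, filename: str, sheet_name: str, env: str = "PROD") -> str:
    """Format a dataset URN following the pattern shown in the mapping table."""
    dataset_name = f"{path}/[{filename}]{sheet_name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:excel,{dataset_name},{env})"


print(worksheet_urn("data/reports", "sales.xlsx", "Q1"))
```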
Prerequisites
AWS S3
When configuring an S3 ingestion source to access files in an S3 bucket, the AWS account referenced in your ingestion recipe must have appropriate S3 permissions. Create a policy with the minimum required permissions by following these steps:
- Create an IAM Policy: Create a policy that grants read access to specific S3 buckets.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
Permissions Explanation:
  - s3:ListBucket: Allows listing the objects in the bucket. This permission is necessary for the S3 ingestion source to know which objects are available to read.
  - s3:GetBucketLocation: Allows retrieving the location of the bucket.
  - s3:GetObject: Allows reading the actual content of the objects in the bucket. This is needed to infer schema from sample files.
- Link Policy to Identity: Associate your newly created policy with the appropriate IAM user or role that will be used by the S3 ingestion process.
- Set Up S3 Data Source: When configuring your S3 ingestion source, specify the IAM user to whom you assigned the permissions in the previous step.
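Before attaching a policy, it can help to confirm that it grants the three actions listed above. The sketch below does this offline with Python's json module. The function missing_actions is a hypothetical helper, and it deliberately ignores wildcards such as s3:* as well as Resource and Condition elements, so treat it as a quick sanity check rather than a full policy evaluation.

```python
import json
from typing import Set

REQUIRED_ACTIONS = {"s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"}


def missing_actions(policy_json: str) -> Set[str]:
    """Return the required actions that no Allow statement in the policy names."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    granted: Set[str] = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be given as a string
            actions = [actions]
        granted.update(actions)
    return REQUIRED_ACTIONS - granted


policy = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::your-bucket-name", "arn:aws:s3:::your-bucket-name/*"]
    }
  ]
}
"""
print(missing_actions(policy))
```

An empty set means every required action is named explicitly in some Allow statement.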
Azure Blob Storage
To access files on Azure Blob Storage, you will need the following:
Azure Storage Account: A storage account that provides a unique namespace for your data in Azure.
Authentication Credentials: One of these supported authentication methods:
- Account Key: Use your storage account's access key
- Client Secret: Use a service principal with client ID and client secret for Microsoft Entra ID authentication
- SAS Token: Use a Shared Access Signature token that provides limited, time-bound access
Container: A blob container that organizes your blobs (similar to a directory in a file system).
Access Permissions: Appropriate authorization for the authentication method:
- For account key: Full access to the storage account
- For client secret: Appropriate Azure role assignments (like Storage Blob Data Contributor)
- For SAS token: Permissions are defined within the token itself
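The requirements above translate into a small consistency check on the azure_config block. The sketch below uses the field names from this connector's configuration (account_name, container_name, account_key, sas_token, client_secret, client_id, tenant_id). The function azure_config_problems is a hypothetical helper written for illustration and does not contact Azure.

```python
from typing import Any, Dict, List


def azure_config_problems(config: Dict[str, Any]) -> List[str]:
    """List the ways an azure_config dictionary falls short of the requirements."""
    problems: List[str] = []
    for key in ("account_name", "container_name"):
        if not config.get(key):
            problems.append(f"missing {key}")
    credentials = ("account_key", "sas_token", "client_secret")
    if not any(config.get(key) for key in credentials):
        problems.append("no credential: set account_key, sas_token, or client_secret")
    if config.get("client_secret"):
        # Service principal authentication also needs the client and tenant IDs.
        for key in ("client_id", "tenant_id"):
            if not config.get(key):
                problems.append(f"client_secret requires {key}")
    return problems


print(azure_config_problems({"account_name": "storageaccountname", "container_name": "abs-data"}))
```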
Starter Recipes
Check out the following recipes to get started with ingestion.
For general pointers on writing and running a recipe, see our main recipe guide.
S3
source:
  type: excel
  config:
    path_list:
      - "s3://bucket/data/excel/*/*.xlsx"
    aws_config:
      aws_access_key_id: AKIAIOSFODNN7EXAMPLE
      aws_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      aws_region: us-east-1
    profiling:
      enabled: false
Azure Blob Storage
source:
  type: excel
  config:
    path_list:
      - "https://storageaccountname.blob.core.windows.net/abs-data/excel/*/*.xlsx"
    azure_config:
      account_name: storageaccountname
      sas_token: sv=2022-11-02&ss=b&srt=sco&sp=rwdlacx&se=2025-06-07T21:00:00Z&st=2025-05-07T13:00:00Z&spr=https&sig=a1B2c3D4%3D
      container_name: abs-data
    profiling:
      enabled: false
Local Files
source:
  type: excel
  config:
    path_list:
      - "/data/path/reporting/excel/*.xlsx"
    profiling:
      enabled: false
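When several near-identical recipes are needed, the source block can also be assembled in code. The sketch below builds a dictionary with the same layout as the recipes above and prints it as JSON for inspection. The helper excel_source is hypothetical and not part of DataHub. It covers only the source section, so any sink or pipeline settings would still have to be added separately.

```python
import json
from typing import Any, Dict, List, Optional


def excel_source(
    path_list: List[str],
    profiling_enabled: bool = False,
    extra_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the `source` section of an Excel recipe as a plain dictionary."""
    if not path_list:
        raise ValueError("path_list is required and must not be empty")
    config: Dict[str, Any] = {
        "path_list": list(path_list),
        "profiling": {"enabled": profiling_enabled},
    }
    # Extra keys (for example aws_config or azure_config) are merged in as given.
    config.update(extra_config or {})
    return {"source": {"type": "excel", "config": config}}


recipe = excel_source(
    ["s3://bucket/data/excel/*/*.xlsx"],
    extra_config={"aws_config": {"aws_region": "us-east-1"}},
)
print(json.dumps(recipe, indent=2))
```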
CLI based Ingestion
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
Field | Description |
---|---|
path_list ✅ array | List of paths to Excel files or folders to ingest. |
path_list.string string | |
active_sheet_only boolean | Enable to only ingest the active sheet of the workbook. If not set, all sheets will be ingested. Default: False |
convert_urns_to_lowercase boolean | Enable to convert the Excel asset URNs to lowercase. Default: False |
path_pattern AllowDenyPattern | Regex patterns for file paths to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
platform_instance string | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. |
profile_pattern AllowDenyPattern | Regex patterns for worksheets to profile. Worksheets are specified as 'filename_without_extension.worksheet_name'. For example to allow the worksheet Sheet1 from file report.xlsx, use the pattern: 'report.Sheet1'. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
use_abs_blob_tags boolean | Whether to create tags in DataHub from the Azure Blob Storage blob tags. Default: False |
use_s3_bucket_tags boolean | Whether or not to create tags in DataHub from the S3 bucket. Default: False |
use_s3_object_tags boolean | Whether or not to create tags in DataHub from the S3 object. Default: False |
verify_ssl One of boolean, string | Either a boolean, in which case it controls whether we verify the server's TLS certificate, or a string, in which case it must be a path to a CA bundle to use. Default: True |
worksheet_pattern AllowDenyPattern | Regex patterns for worksheets to ingest. Worksheets are specified as 'filename_without_extension.worksheet_name'. For example to allow the worksheet Sheet1 from file report.xlsx, use the pattern: 'report.Sheet1'. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
env string | The environment that all assets produced by this connector belong to. Default: PROD |
aws_config AwsConnectionConfig | AWS configuration |
aws_config.aws_access_key_id string | AWS access key ID. Can be auto-detected, see the AWS boto3 docs for details. |
aws_config.aws_advanced_config object | Advanced AWS configuration options. These are passed directly to botocore.config.Config. |
aws_config.aws_endpoint_url string | The AWS service endpoint. This is normally constructed automatically, but can be overridden here. |
aws_config.aws_profile string | The named profile to use from AWS credentials. Falls back to default profile if not specified and no access keys provided. Profiles are configured in ~/.aws/credentials or ~/.aws/config. |
aws_config.aws_proxy map(str,string) | |
aws_config.aws_region string | AWS region code. |
aws_config.aws_retry_mode Enum | One of: "legacy", "standard", "adaptive" Default: standard |
aws_config.aws_retry_num integer | Number of times to retry failed AWS requests. See the botocore.retry docs for details. Default: 5 |
aws_config.aws_secret_access_key string | AWS secret access key. Can be auto-detected, see the AWS boto3 docs for details. |
aws_config.aws_session_token string | AWS session token. Can be auto-detected, see the AWS boto3 docs for details. |
aws_config.read_timeout number | The timeout for reading from the connection (in seconds). Default: 60 |
aws_config.aws_role One of string, array | AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are the same as boto3's STS.Client.assume_role. |
aws_config.aws_role.union One of string, AwsAssumeRoleConfig | |
aws_config.aws_role.union.RoleArn ❓ string | ARN of the role to assume. |
aws_config.aws_role.union.ExternalId string | External ID to use when assuming the role. |
azure_config AzureConnectionConfig | Azure configuration |
azure_config.account_name ❓ string | Name of the Azure storage account. See Microsoft official documentation on how to create a storage account. |
azure_config.container_name ❓ string | Azure storage account container name. |
azure_config.account_key string | Azure storage account access key that can be used as a credential. An account key, a SAS token or a client secret is required for authentication. |
azure_config.base_path string | Base folder in hierarchical namespaces to start from. Default: / |
azure_config.client_id string | Azure client (Application) ID required when a client_secret is used as a credential. |
azure_config.client_secret string | Azure client secret that can be used as a credential. An account key, a SAS token or a client secret is required for authentication. |
azure_config.sas_token string | Azure storage account Shared Access Signature (SAS) token that can be used as a credential. An account key, a SAS token or a client secret is required for authentication. |
azure_config.tenant_id string | Azure tenant (Directory) ID required when a client_secret is used as a credential. |
profiling GEProfilingConfig | Configuration for profiling Default: {'enabled': False, 'operation_config': {'lower_fre... |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit integer | Max number of documents to profile. By default, profiles all documents. |
profiling.max_number_of_fields_to_profile integer | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset integer | Offset in documents to profile. By default, uses no offset. |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.operation_config OperationConfig | Experimental feature. To specify operation configs. |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month integer | Number between 1 and 31 for the date of the month (both inclusive). If not specified, defaults to Nothing and this field does not take effect. |
profiling.operation_config.profile_day_of_week integer | Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect. |
profiling.tags_to_ignore_sampling array | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion StatefulStaleMetadataRemovalConfig | Configuration for stateful ingestion and stale metadata removal. |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingestion. Effectively True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified; otherwise False. Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
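The dotted field names in the table above correspond to nesting in the recipe. As a minimal sketch, the hypothetical helper below expands dotted names into the nested structure that would appear under config in a recipe. It is an illustration only, not a DataHub utility.

```python
from typing import Any, Dict


def nest(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys such as "profiling.enabled" into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_name, value in flat.items():
        *parents, leaf = dotted_name.split(".")
        node = nested
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


config = nest(
    {
        "path_list": ["/data/path/reporting/excel/*.xlsx"],
        "profiling.enabled": False,
        "profiling.operation_config.profile_day_of_week": 0,
    }
)
print(config)
```

Here profiling.operation_config.profile_day_of_week ends up three levels deep, under profiling and then operation_config.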
The JSONSchema for this configuration is inlined below.
{
"title": "ExcelSourceConfig",
"description": "Base configuration class for stateful ingestion for source configs to inherit from.",
"type": "object",
"properties": {
"env": {
"title": "Env",
"description": "The environment that all assets produced by this connector belong to",
"default": "PROD",
"type": "string"
},
"platform_instance": {
"title": "Platform Instance",
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"type": "string"
},
"stateful_ingestion": {
"title": "Stateful Ingestion",
"description": "Configuration for stateful ingestion and stale metadata removal.",
"allOf": [
{
"$ref": "#/definitions/StatefulStaleMetadataRemovalConfig"
}
]
},
"path_list": {
"title": "Path List",
"description": "List of paths to Excel files or folders to ingest.",
"type": "array",
"items": {
"type": "string"
}
},
"path_pattern": {
"title": "Path Pattern",
"description": "Regex patterns for file paths to filter in ingestion.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"aws_config": {
"title": "Aws Config",
"description": "AWS configuration",
"allOf": [
{
"$ref": "#/definitions/AwsConnectionConfig"
}
]
},
"use_s3_bucket_tags": {
"title": "Use S3 Bucket Tags",
"description": "Whether or not to create tags in datahub from the s3 bucket",
"default": false,
"type": "boolean"
},
"use_s3_object_tags": {
"title": "Use S3 Object Tags",
"description": "Whether or not to create tags in datahub from the s3 object",
"default": false,
"type": "boolean"
},
"verify_ssl": {
"title": "Verify Ssl",
"description": "Either a boolean, in which case it controls whether we verify the server's TLS certificate, or a string, in which case it must be a path to a CA bundle to use.",
"default": true,
"anyOf": [
{
"type": "boolean"
},
{
"type": "string"
}
]
},
"azure_config": {
"title": "Azure Config",
"description": "Azure configuration",
"allOf": [
{
"$ref": "#/definitions/AzureConnectionConfig"
}
]
},
"use_abs_blob_tags": {
"title": "Use Abs Blob Tags",
"description": "Whether to create tags in datahub from the abs blob tags",
"default": false,
"type": "boolean"
},
"convert_urns_to_lowercase": {
"title": "Convert Urns To Lowercase",
"description": "Enable to convert the Excel asset urns to lowercase",
"default": false,
"type": "boolean"
},
"active_sheet_only": {
"title": "Active Sheet Only",
"description": "Enable to only ingest the active sheet of the workbook. If not set, all sheets will be ingested.",
"default": false,
"type": "boolean"
},
"worksheet_pattern": {
"title": "Worksheet Pattern",
"description": "Regex patterns for worksheets to ingest. Worksheets are specified as 'filename_without_extension.worksheet_name'. For example to allow the worksheet Sheet1 from file report.xlsx, use the pattern: 'report.Sheet1'.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"profile_pattern": {
"title": "Profile Pattern",
"description": "Regex patterns for worksheets to profile. Worksheets are specified as 'filename_without_extension.worksheet_name'. For example to allow the worksheet Sheet1 from file report.xlsx, use the pattern: 'report.Sheet1'.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"profiling": {
"title": "Profiling",
"description": "Configuration for profiling",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_day_of_week": null,
"profile_date_of_month": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
},
"allOf": [
{
"$ref": "#/definitions/GEProfilingConfig"
}
]
}
},
"required": [
"path_list"
],
"additionalProperties": false,
"definitions": {
"DynamicTypedStateProviderConfig": {
"title": "DynamicTypedStateProviderConfig",
"type": "object",
"properties": {
"type": {
"title": "Type",
"description": "The type of the state provider to use. For DataHub use `datahub`",
"type": "string"
},
"config": {
"title": "Config",
"description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19).",
"default": {},
"type": "object"
}
},
"required": [
"type"
],
"additionalProperties": false
},
"StatefulStaleMetadataRemovalConfig": {
"title": "StatefulStaleMetadataRemovalConfig",
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"type": "object",
"properties": {
"enabled": {
"title": "Enabled",
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"default": false,
"type": "boolean"
},
"remove_stale_metadata": {
"title": "Remove Stale Metadata",
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"default": true,
"type": "boolean"
},
"fail_safe_threshold": {
"title": "Fail Safe Threshold",
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"default": 75.0,
"minimum": 0.0,
"maximum": 100.0,
"type": "number"
}
},
"additionalProperties": false
},
"AllowDenyPattern": {
"title": "AllowDenyPattern",
"description": "A class to store allow deny regexes",
"type": "object",
"properties": {
"allow": {
"title": "Allow",
"description": "List of regex patterns to include in ingestion",
"default": [
".*"
],
"type": "array",
"items": {
"type": "string"
}
},
"deny": {
"title": "Deny",
"description": "List of regex patterns to exclude from ingestion.",
"default": [],
"type": "array",
"items": {
"type": "string"
}
},
"ignoreCase": {
"title": "Ignorecase",
"description": "Whether to ignore case sensitivity during pattern matching.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
},
"AwsAssumeRoleConfig": {
"title": "AwsAssumeRoleConfig",
"type": "object",
"properties": {
"RoleArn": {
"title": "Rolearn",
"description": "ARN of the role to assume.",
"type": "string"
},
"ExternalId": {
"title": "Externalid",
"description": "External ID to use when assuming the role.",
"type": "string"
}
},
"required": [
"RoleArn"
]
},
"AwsConnectionConfig": {
"title": "AwsConnectionConfig",
"description": "Common AWS credentials config.\n\nCurrently used by:\n - Glue source\n - SageMaker source\n - dbt source",
"type": "object",
"properties": {
"aws_access_key_id": {
"title": "Aws Access Key Id",
"description": "AWS access key ID. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.",
"type": "string"
},
"aws_secret_access_key": {
"title": "Aws Secret Access Key",
"description": "AWS secret access key. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.",
"type": "string"
},
"aws_session_token": {
"title": "Aws Session Token",
"description": "AWS session token. Can be auto-detected, see [the AWS boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for details.",
"type": "string"
},
"aws_role": {
"title": "Aws Role",
"description": "AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are the same as [boto3's STS.Client.assume_role](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_role).",
"anyOf": [
{
"type": "string"
},
{
"type": "array",
"items": {
"anyOf": [
{
"type": "string"
},
{
"$ref": "#/definitions/AwsAssumeRoleConfig"
}
]
}
}
]
},
"aws_profile": {
"title": "Aws Profile",
"description": "The [named profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html) to use from AWS credentials. Falls back to default profile if not specified and no access keys provided. Profiles are configured in ~/.aws/credentials or ~/.aws/config.",
"type": "string"
},
"aws_region": {
"title": "Aws Region",
"description": "AWS region code.",
"type": "string"
},
"aws_endpoint_url": {
"title": "Aws Endpoint Url",
"description": "The AWS service endpoint. This is normally [constructed automatically](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html), but can be overridden here.",
"type": "string"
},
"aws_proxy": {
"title": "Aws Proxy",
"description": "A set of proxy configs to use with AWS. See the [botocore.config](https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html) docs for details.",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"aws_retry_num": {
"title": "Aws Retry Num",
"description": "Number of times to retry failed AWS requests. See the [botocore.retry](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html) docs for details.",
"default": 5,
"type": "integer"
},
"aws_retry_mode": {
"title": "Aws Retry Mode",
"description": "Retry mode to use for failed AWS requests. See the [botocore.retry](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html) docs for details.",
"default": "standard",
"enum": [
"legacy",
"standard",
"adaptive"
],
"type": "string"
},
"read_timeout": {
"title": "Read Timeout",
"description": "The timeout for reading from the connection (in seconds).",
"default": 60,
"type": "number"
},
"aws_advanced_config": {
"title": "Aws Advanced Config",
"description": "Advanced AWS configuration options. These are passed directly to [botocore.config.Config](https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html).",
"type": "object"
}
},
"additionalProperties": false
},
"AzureConnectionConfig": {
"title": "AzureConnectionConfig",
"description": "Common Azure credentials config.\n\nhttps://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python",
"type": "object",
"properties": {
"base_path": {
"title": "Base Path",
"description": "Base folder in hierarchical namespaces to start from.",
"default": "/",
"type": "string"
},
"container_name": {
"title": "Container Name",
"description": "Azure storage account container name.",
"type": "string"
},
"account_name": {
"title": "Account Name",
"description": "Name of the Azure storage account. See [Microsoft official documentation on how to create a storage account.](https://docs.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account)",
"type": "string"
},
"account_key": {
"title": "Account Key",
"description": "Azure storage account access key that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**",
"type": "string"
},
"sas_token": {
"title": "Sas Token",
"description": "Azure storage account Shared Access Signature (SAS) token that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**",
"type": "string"
},
"client_secret": {
"title": "Client Secret",
"description": "Azure client secret that can be used as a credential. **An account key, a SAS token or a client secret is required for authentication.**",
"type": "string"
},
"client_id": {
"title": "Client Id",
"description": "Azure client (Application) ID required when a `client_secret` is used as a credential.",
"type": "string"
},
"tenant_id": {
"title": "Tenant Id",
"description": "Azure tenant (Directory) ID required when a `client_secret` is used as a credential.",
"type": "string"
}
},
"required": [
"container_name",
"account_name"
],
"additionalProperties": false
},
"OperationConfig": {
"title": "OperationConfig",
"type": "object",
"properties": {
"lower_freq_profile_enabled": {
"title": "Lower Freq Profile Enabled",
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"default": false,
"type": "boolean"
},
"profile_day_of_week": {
"title": "Profile Day Of Week",
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"type": "integer"
},
"profile_date_of_month": {
"title": "Profile Date Of Month",
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"type": "integer"
}
},
"additionalProperties": false
},
"GEProfilingConfig": {
"title": "GEProfilingConfig",
"type": "object",
"properties": {
"enabled": {
"title": "Enabled",
"description": "Whether profiling should be done.",
"default": false,
"type": "boolean"
},
"operation_config": {
"title": "Operation Config",
"description": "Experimental feature. To specify operation configs.",
"allOf": [
{
"$ref": "#/definitions/OperationConfig"
}
]
},
"limit": {
"title": "Limit",
"description": "Max number of documents to profile. By default, profiles all documents.",
"type": "integer"
},
"offset": {
"title": "Offset",
"description": "Offset in documents to profile. By default, uses no offset.",
"type": "integer"
},
"profile_table_level_only": {
"title": "Profile Table Level Only",
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"default": false,
"type": "boolean"
},
"include_field_null_count": {
"title": "Include Field Null Count",
"description": "Whether to profile for the number of nulls for each column.",
"default": true,
"type": "boolean"
},
"include_field_distinct_count": {
"title": "Include Field Distinct Count",
"description": "Whether to profile for the number of distinct values for each column.",
"default": true,
"type": "boolean"
},
"include_field_min_value": {
"title": "Include Field Min Value",
"description": "Whether to profile for the min value of numeric columns.",
"default": true,
"type": "boolean"
},
"include_field_max_value": {
"title": "Include Field Max Value",
"description": "Whether to profile for the max value of numeric columns.",
"default": true,
"type": "boolean"
},
"include_field_mean_value": {
"title": "Include Field Mean Value",
"description": "Whether to profile for the mean value of numeric columns.",
"default": true,
"type": "boolean"
},
"include_field_median_value": {
"title": "Include Field Median Value",
"description": "Whether to profile for the median value of numeric columns.",
"default": true,
"type": "boolean"
},
"include_field_stddev_value": {
"title": "Include Field Stddev Value",
"description": "Whether to profile for the standard deviation of numeric columns.",
"default": true,
"type": "boolean"
},
"include_field_quantiles": {
"title": "Include Field Quantiles",
"description": "Whether to profile for the quantiles of numeric columns.",
"default": false,
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"title": "Include Field Distinct Value Frequencies",
"description": "Whether to profile for distinct value frequencies.",
"default": false,
"type": "boolean"
},
"include_field_histogram": {
"title": "Include Field Histogram",
"description": "Whether to profile for the histogram for numeric fields.",
"default": false,
"type": "boolean"
},
"include_field_sample_values": {
"title": "Include Field Sample Values",
"description": "Whether to profile for the sample values for all columns.",
"default": true,
"type": "boolean"
},
"max_workers": {
"title": "Max Workers",
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"default": 20,
"type": "integer"
},
"report_dropped_profiles": {
"title": "Report Dropped Profiles",
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"default": false,
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"title": "Turn Off Expensive Profiling Metrics",
"description": "Whether to turn off expensive profiling metrics. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits the maximum number of fields being profiled to 10.",
"default": false,
"type": "boolean"
},
"field_sample_values_limit": {
"title": "Field Sample Values Limit",
"description": "Upper limit for number of sample values to collect for all columns.",
"default": 20,
"type": "integer"
},
"max_number_of_fields_to_profile": {
"title": "Max Number Of Fields To Profile",
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"exclusiveMinimum": 0,
"type": "integer"
},
"profile_if_updated_since_days": {
"title": "Profile If Updated Since Days",
"description": "Profile a table only if it has been updated within this many days. If set to `null`, there is no last-modified-time constraint on the tables to profile. Supported only in `Snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"exclusiveMinimum": 0,
"type": "number"
},
"profile_table_size_limit": {
"title": "Profile Table Size Limit",
"description": "Profile tables only if their size is less than the specified number of GB. If set to `null`, there is no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on the size calculated from gathered stats.",
"default": 5,
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"type": "integer"
},
"profile_table_row_limit": {
"title": "Profile Table Row Limit",
"description": "Profile tables only if their row count is less than the specified count. If set to `null`, there is no limit on the row count of tables to profile. Supported only in `Snowflake` and `BigQuery`. Supported for `Oracle` based on gathered stats.",
"default": 5000000,
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"type": "integer"
},
"profile_table_row_count_estimate_only": {
"title": "Profile Table Row Count Estimate Only",
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.",
"default": false,
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"type": "boolean"
},
"query_combiner_enabled": {
"title": "Query Combiner Enabled",
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"default": true,
"type": "boolean"
},
"catch_exceptions": {
"title": "Catch Exceptions",
"default": true,
"type": "boolean"
},
"partition_profiling_enabled": {
"title": "Partition Profiling Enabled",
"description": "Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, the latest partition data is used for profiling.",
"default": true,
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"type": "boolean"
},
"partition_datetime": {
"title": "Partition Datetime",
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only BigQuery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"type": "string",
"format": "date-time"
},
"use_sampling": {
"title": "Use Sampling",
"description": "Whether to profile column-level stats on a sample of the table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from the table. Sampling is not done for smaller tables.",
"default": true,
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"type": "boolean"
},
"sample_size": {
"title": "Sample Size",
"description": "Number of rows to be sampled from the table for column-level profiling. Applicable only if `use_sampling` is set to True.",
"default": 10000,
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"type": "integer"
},
"profile_external_tables": {
"title": "Profile External Tables",
"description": "Whether to profile external tables. Only Snowflake and Redshift support this.",
"default": false,
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"type": "boolean"
},
"tags_to_ignore_sampling": {
"title": "Tags To Ignore Sampling",
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"type": "array",
"items": {
"type": "string"
}
},
"profile_nested_fields": {
"title": "Profile Nested Fields",
"description": "Whether to profile complex types like structs, arrays and maps.",
"default": false,
"type": "boolean"
}
},
"additionalProperties": false
}
}
}
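The constraints in the profiling schema above (option types, `"exclusiveMinimum": 0` on `max_number_of_fields_to_profile`, and `"additionalProperties": false`) can be checked before running ingestion. The following is a minimal sketch, not DataHub's own validation: `PROFILING_SUBSET` and `check_profiling` are hypothetical names, and the table covers only a hand-copied subset of the options listed above, so real options missing from the subset would be reported as unknown.

```python
from typing import Any, Dict, List

# Hand-copied subset of the profiling schema shown above: option name -> JSON type.
# This table and check_profiling() are hypothetical helpers for illustration,
# not part of DataHub.
PROFILING_SUBSET: Dict[str, str] = {
    "profile_table_level_only": "boolean",
    "include_field_null_count": "boolean",
    "include_field_quantiles": "boolean",
    "include_field_histogram": "boolean",
    "include_field_sample_values": "boolean",
    "turn_off_expensive_profiling_metrics": "boolean",
    "max_workers": "integer",
    "field_sample_values_limit": "integer",
    "max_number_of_fields_to_profile": "integer",
}


def check_profiling(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a profiling config dict."""
    problems: List[str] = []
    for key, value in config.items():
        json_type = PROFILING_SUBSET.get(key)
        if json_type is None:
            # Mirrors "additionalProperties": false, but only for this subset.
            problems.append(f"unknown key: {key}")
            continue
        if json_type == "boolean":
            type_ok = isinstance(value, bool)
        else:
            # bool is a subclass of int in Python, so exclude it explicitly.
            type_ok = isinstance(value, int) and not isinstance(value, bool)
        if not type_ok:
            problems.append(f"{key} must be of type {json_type}")
        elif key == "max_number_of_fields_to_profile" and value <= 0:
            # Mirrors "exclusiveMinimum": 0 on this option.
            problems.append(f"{key} must be > 0")
    return problems


print(check_profiling({"max_workers": 4, "include_field_histogram": True}))
print(check_profiling({"max_number_of_fields_to_profile": 0, "bogus": 1}))
```

An empty list means the dict passed every check in the subset. For full coverage, a general-purpose JSON Schema validator run against the complete schema would be the more robust choice.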
Code Coordinates
- Class Name: datahub.ingestion.source.excel.source.ExcelSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Excel, feel free to ping us on our Slack.