Version: Next

Excel

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Asset Containers | ✅ | Enabled by default. |
| Data Profiling | ✅ | Optionally enabled via configuration. |
| Detect Deleted Entities | ✅ | Optionally enabled via `stateful_ingestion.remove_stale_metadata`. |
| Schema Metadata | ✅ | Enabled by default. |

This connector ingests Excel worksheet datasets into DataHub. Workbooks (Excel files) can be ingested from the local filesystem, from S3 buckets, or from Azure Blob Storage. An asterisk (*) can be used in place of a directory or as part of a file name to match multiple directories or files with a single path specification.
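To illustrate the wildcard matching described above, here is a small sketch of how a `*` that stands in for one directory level or part of a file name can be evaluated (this is an illustration of the idea, not the connector's actual matcher):

```python
import re

def matches(pattern: str, path: str) -> bool:
    # Translate the wildcard pattern: "*" matches any run of characters
    # except "/", so it stands in for one directory level or part of a name.
    regex = "^" + re.escape(pattern).replace(r"\*", "[^/]*") + "$"
    return re.match(regex, path) is not None

print(matches("s3://bucket/data/excel/*/*.xlsx",
              "s3://bucket/data/excel/2024/sales.xlsx"))   # True
print(matches("s3://bucket/data/excel/*/*.xlsx",
              "s3://bucket/data/excel/2024/notes.txt"))    # False
```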

tip

By default, this connector ingests all worksheets in a workbook (an Excel file). To filter worksheets, use the worksheet_pattern config option; to ingest only the active worksheet, use the active_sheet_only config option.
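For example, a recipe that skips a scratch worksheet might use a deny pattern like the following (the `Scratch` sheet name and local path are illustrative):

```yaml
source:
  type: excel
  config:
    path_list:
      - "/data/reports/*.xlsx"
    worksheet_pattern:
      deny:
        - ".*\\.Scratch"
    # Alternatively, ingest only the sheet that was active when the file was saved:
    # active_sheet_only: true
```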

Supported file types

Supported file types are as follows:

  • Excel workbook (*.xlsx)
  • Excel macro-enabled workbook (*.xlsm)

The connector will attempt to identify which cells contain table data. A table is defined as a header row, which is used to derive the column names, followed by data rows. The schema is inferred from the data types that are present in a column.

Rows directly above or directly below the table in which only the first two columns have values are assumed to contain metadata. If such rows are found, they are converted to custom properties, with the first column as the key and the second column as the value. The workbook's standard and custom properties are also imported as dataset custom properties.
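The detection heuristic described above can be pictured with a small sketch: the header row supplies column names, and each column's type is inferred from the values beneath it (this is an illustration of the idea, not the connector's actual implementation):

```python
# Sketch: infer column names from a header row and column types from the
# data rows beneath it (illustrative only; not the connector's real code).
rows = [
    ["id", "name", "amount"],          # header row -> column names
    [1, "alice", 10.5],                # data rows -> types inferred per column
    [2, "bob", 20.0],
]

header, *data = rows

def infer(values):
    types = {type(v).__name__ for v in values if v is not None}
    return types.pop() if len(types) == 1 else "string"  # mixed types fall back

schema = {col: infer(vals) for col, vals in zip(header, zip(*data))}
print(schema)  # {'id': 'int', 'name': 'str', 'amount': 'float'}
```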

Data Model Mapping

The following table shows how Excel entities are mapped to DataHub entities:

| Excel Entity | DataHub Entity | Description |
| --- | --- | --- |
| Excel Worksheet | Dataset | Each worksheet becomes a dataset with URN pattern: `urn:li:dataset:(urn:li:dataPlatform:excel,{path}/[(unknown)]{sheet_name},PROD)` |
| File/Directory Structure | Container | The directory hierarchy creates containers with obfuscated URNs for organizing datasets. |

Note: The Excel workbook file itself does not become a separate DataHub entity - only the individual worksheets within it are ingested as datasets.

Prerequisites

AWS S3

When configuring an S3 ingestion source to access files in an S3 bucket, the AWS account referenced in your ingestion recipe must have appropriate S3 permissions. Create a policy with the minimum required permissions by following these steps:

  1. Create an IAM Policy: Create a policy that grants read access to specific S3 buckets.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Permissions Explanation:

  • s3:ListBucket: Allows listing the objects in the bucket. This permission is necessary for the S3 ingestion source to know which objects are available to read.
  • s3:GetBucketLocation: Allows retrieving the location of the bucket.
  • s3:GetObject: Allows reading the actual content of the objects in the bucket. This is needed to infer schema from sample files.
  2. Link Policy to Identity: Associate your newly created policy with the appropriate IAM user or role that will be used by the S3 ingestion process.

  3. Set Up S3 Data Source: When configuring your S3 ingestion source, specify the IAM user to whom you assigned the permissions in the previous step.
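As a quick sanity check before attaching the policy, you can verify that the document grants the three actions the connector needs (a minimal sketch; the policy JSON is inlined here, but you could just as well load it from a file):

```python
import json

# The same minimal read-only policy shown above.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
    "Resource": ["arn:aws:s3:::your-bucket-name", "arn:aws:s3:::your-bucket-name/*"]
  }]
}
""")

required = {"s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"}
granted = {a for s in policy["Statement"] if s["Effect"] == "Allow"
           for a in s["Action"]}
missing = required - granted
print("missing actions:", missing or "none")  # missing actions: none
```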

Azure Blob Storage

To access files on Azure Blob Storage, you will need the following:

  1. Azure Storage Account: A storage account that provides a unique namespace for your data in Azure.

  2. Authentication Credentials: One of these supported authentication methods:

    • Account Key: Use your storage account's access key
    • Client Secret: Use a service principal with client ID and client secret for Microsoft Entra ID authentication
    • SAS Token: Use a Shared Access Signature token that provides limited, time-bound access
  3. Container: A blob container that organizes your blobs (similar to a directory in a file system).

  4. Access Permissions: Appropriate authorization for the authentication method:

    • For account key: Full access to the storage account
    • For client secret: Appropriate Azure role assignments (like Storage Blob Data Contributor)
    • For SAS token: Permissions are defined within the token itself
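For instance, client-secret (service principal) authentication would combine the `client_id`, `client_secret`, and `tenant_id` fields of `azure_config` (all values below are placeholders):

```yaml
source:
  type: excel
  config:
    path_list:
      - "https://storageaccountname.blob.core.windows.net/abs-data/excel/*.xlsx"
    azure_config:
      account_name: storageaccountname
      container_name: abs-data
      client_id: 00000000-0000-0000-0000-000000000000   # Application (client) ID
      client_secret: "${AZURE_CLIENT_SECRET}"           # prefer an environment variable
      tenant_id: 11111111-1111-1111-1111-111111111111   # Directory (tenant) ID
```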

Starter Recipes

Check out the following recipes to get started with ingestion.

For general pointers on writing and running a recipe, see our main recipe guide.

S3

```yaml
source:
  type: excel
  config:
    path_list:
      - "s3://bucket/data/excel/*/*.xlsx"
    aws_config:
      aws_access_key_id: AKIAIOSFODNN7EXAMPLE
      aws_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      aws_region: us-east-1
    profiling:
      enabled: false
```

Azure Blob Storage

```yaml
source:
  type: excel
  config:
    path_list:
      - "https://storageaccountname.blob.core.windows.net/abs-data/excel/*/*.xlsx"
    azure_config:
      account_name: storageaccountname
      sas_token: "sv=2022-11-02&ss=b&srt=sco&sp=rwdlacx&se=2025-06-07T21:00:00Z&st=2025-05-07T13:00:00Z&spr=https&sig=a1B2c3D4%3D"
      container_name: abs-data
    profiling:
      enabled: false
```

Local Files

```yaml
source:
  type: excel
  config:
    path_list:
      - "/data/path/reporting/excel/*.xlsx"
    profiling:
      enabled: false
```

CLI based Ingestion

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `path_list` | array(string) | List of paths to Excel files or folders to ingest. | |
| `active_sheet_only` | boolean | Enable to only ingest the active sheet of the workbook. If not set, all sheets will be ingested. | `False` |
| `convert_urns_to_lowercase` | boolean | Enable to convert the Excel asset URNs to lowercase. | `False` |
| `path_pattern` | AllowDenyPattern | Regex patterns for file paths to filter in ingestion. | `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}` |
| `platform_instance` | string | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. | |
| `profile_pattern` | AllowDenyPattern | Regex patterns for worksheets to profile, specified as `filename_without_extension.worksheet_name`. For example, to allow the worksheet Sheet1 from file report.xlsx, use the pattern `report.Sheet1`. | `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}` |
| `use_abs_blob_tags` | boolean | Whether to create tags in DataHub from the ABS blob tags. | `False` |
| `use_s3_bucket_tags` | boolean | Whether to create tags in DataHub from the S3 bucket tags. | `False` |
| `use_s3_object_tags` | boolean | Whether to create tags in DataHub from the S3 object tags. | `False` |
| `verify_ssl` | boolean or string | Either a boolean, in which case it controls whether we verify the server's TLS certificate, or a string, in which case it must be a path to a CA bundle to use. | `True` |
| `worksheet_pattern` | AllowDenyPattern | Regex patterns for worksheets to ingest, specified as `filename_without_extension.worksheet_name`. For example, to allow the worksheet Sheet1 from file report.xlsx, use the pattern `report.Sheet1`. | `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}` |
| `env` | string | The environment that all assets produced by this connector belong to. | `PROD` |
| `aws_config` | AwsConnectionConfig | AWS configuration. | |
| `aws_config.aws_access_key_id` | string | AWS access key ID. Can be auto-detected; see the AWS boto3 docs for details. | |
| `aws_config.aws_advanced_config` | object | Advanced AWS configuration options, passed directly to `botocore.config.Config`. | |
| `aws_config.aws_endpoint_url` | string | The AWS service endpoint. This is normally constructed automatically, but can be overridden here. | |
| `aws_config.aws_profile` | string | The named profile to use from AWS credentials. Falls back to the default profile if not specified and no access keys are provided. Profiles are configured in `~/.aws/credentials` or `~/.aws/config`. | |
| `aws_config.aws_proxy` | map(str, string) | | |
| `aws_config.aws_region` | string | AWS region code. | |
| `aws_config.aws_retry_mode` | enum | One of: `legacy`, `standard`, `adaptive`. | `standard` |
| `aws_config.aws_retry_num` | integer | Number of times to retry failed AWS requests. See the botocore.retry docs for details. | `5` |
| `aws_config.aws_secret_access_key` | string | AWS secret access key. Can be auto-detected; see the AWS boto3 docs for details. | |
| `aws_config.aws_session_token` | string | AWS session token. Can be auto-detected; see the AWS boto3 docs for details. | |
| `aws_config.read_timeout` | number | The timeout for reading from the connection (in seconds). | `60` |
| `aws_config.aws_role` | string or array | AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the `RoleArn` field, and the additional available arguments are the same as boto3's `STS.Client.assume_role`. | |
| `aws_config.aws_role.union.RoleArn` | string | ARN of the role to assume. | |
| `aws_config.aws_role.union.ExternalId` | string | External ID to use when assuming the role. | |
| `azure_config` | AzureConnectionConfig | Azure configuration. | |
| `azure_config.account_name` | string | Name of the Azure storage account. See the Microsoft documentation on how to create a storage account. | |
| `azure_config.container_name` | string | Azure storage account container name. | |
| `azure_config.account_key` | string | Azure storage account access key that can be used as a credential. An account key, a SAS token, or a client secret is required for authentication. | |
| `azure_config.base_path` | string | Base folder in hierarchical namespaces to start from. | `/` |
| `azure_config.client_id` | string | Azure client (Application) ID, required when a client secret is used as a credential. | |
| `azure_config.client_secret` | string | Azure client secret that can be used as a credential. An account key, a SAS token, or a client secret is required for authentication. | |
| `azure_config.sas_token` | string | Azure storage account Shared Access Signature (SAS) token that can be used as a credential. An account key, a SAS token, or a client secret is required for authentication. | |
| `azure_config.tenant_id` | string | Azure tenant (Directory) ID, required when a client secret is used as a credential. | |
| `profiling` | GEProfilingConfig | Configuration for profiling. | `{'enabled': False, 'operation_config': {'lower_fre...` |
| `profiling.catch_exceptions` | boolean | | `True` |
| `profiling.enabled` | boolean | Whether profiling should be done. | `False` |
| `profiling.field_sample_values_limit` | integer | Upper limit for the number of sample values to collect for all columns. | `20` |
| `profiling.include_field_distinct_count` | boolean | Whether to profile for the number of distinct values of each column. | `True` |
| `profiling.include_field_distinct_value_frequencies` | boolean | Whether to profile for distinct value frequencies. | `False` |
| `profiling.include_field_histogram` | boolean | Whether to profile for the histogram of numeric fields. | `False` |
| `profiling.include_field_max_value` | boolean | Whether to profile for the max value of numeric columns. | `True` |
| `profiling.include_field_mean_value` | boolean | Whether to profile for the mean value of numeric columns. | `True` |
| `profiling.include_field_median_value` | boolean | Whether to profile for the median value of numeric columns. | `True` |
| `profiling.include_field_min_value` | boolean | Whether to profile for the min value of numeric columns. | `True` |
| `profiling.include_field_null_count` | boolean | Whether to profile for the number of nulls in each column. | `True` |
| `profiling.include_field_quantiles` | boolean | Whether to profile for the quantiles of numeric columns. | `False` |
| `profiling.include_field_sample_values` | boolean | Whether to profile for sample values of all columns. | `True` |
| `profiling.include_field_stddev_value` | boolean | Whether to profile for the standard deviation of numeric columns. | `True` |
| `profiling.limit` | integer | Max number of documents to profile. By default, profiles all documents. | |
| `profiling.max_number_of_fields_to_profile` | integer | A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | |
| `profiling.max_workers` | integer | Number of worker threads to use for profiling. Set to 1 to disable. | `20` |
| `profiling.offset` | integer | Offset in documents to profile. By default, uses no offset. | |
| `profiling.profile_nested_fields` | boolean | Whether to profile complex types like structs, arrays, and maps. | `False` |
| `profiling.profile_table_level_only` | boolean | Whether to perform profiling at table level only, or to include column-level profiling as well. | `False` |
| `profiling.query_combiner_enabled` | boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. | `True` |
| `profiling.report_dropped_profiles` | boolean | Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes. | `False` |
| `profiling.turn_off_expensive_profiling_metrics` | boolean | Whether to turn off expensive profiling. This disables profiling for quantiles, distinct_value_frequencies, histogram, and sample_values, and limits the maximum number of fields profiled to 10. | `False` |
| `profiling.operation_config` | OperationConfig | Experimental feature. To specify operation configs. | |
| `profiling.operation_config.lower_freq_profile_enabled` | boolean | Whether to profile at a lower frequency. This does no scheduling; it only adds additional checks for when not to run profiling. | `False` |
| `profiling.operation_config.profile_date_of_month` | integer | Number between 1 and 31 for the date of the month (both inclusive). If not specified, this field does not take effect. | |
| `profiling.operation_config.profile_day_of_week` | integer | Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, this field does not take effect. | |
| `profiling.tags_to_ignore_sampling` | array(string) | Fixed list of tags for which to ignore sampling. If not specified, tables will be sampled based on `use_sampling`. | |
| `stateful_ingestion` | StatefulStaleMetadataRemovalConfig | Configuration for stateful ingestion and stale metadata removal. | |
| `stateful_ingestion.enabled` | boolean | Whether to enable stateful ingestion. Defaults to `True` if a `pipeline_name` is set and either a datahub-rest sink or `datahub_api` is specified; otherwise `False`. | `False` |
| `stateful_ingestion.fail_safe_threshold` | number | Prevents a large number of soft deletes, and the state from committing, after accidental changes to the source configuration, if the relative change percentage in entities compared to the previous state is above the threshold. | `75.0` |
| `stateful_ingestion.remove_stale_metadata` | boolean | Soft-deletes the entities present in the last successful run but missing in the current run, when stateful ingestion is enabled. | `True` |
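The `AllowDenyPattern` fields above combine allow and deny regex lists: a name is kept only if it matches at least one allow pattern and no deny pattern. A sketch of these semantics (assumed behaviour for illustration, not DataHub's actual implementation; matching is shown anchored at the start of the name, with `ignoreCase` defaulting to true):

```python
import re

def allowed(name, allow=(".*",), deny=(), ignore_case=True):
    # Sketch of allow/deny evaluation: the name must match at least one
    # allow regex and no deny regex (assumed semantics, for illustration).
    flags = re.IGNORECASE if ignore_case else 0
    ok = any(re.match(p, name, flags) for p in allow)
    blocked = any(re.match(p, name, flags) for p in deny)
    return ok and not blocked

# Worksheets are addressed as 'filename_without_extension.worksheet_name':
print(allowed("report.Sheet1", allow=[r"report\..*"]))            # True
print(allowed("report.Scratch", allow=[r"report\..*"],
              deny=[r".*\.Scratch"]))                             # False
```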

Code Coordinates

  • Class Name: datahub.ingestion.source.excel.source.ExcelSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Excel, feel free to ping us on our Slack.