Version: Next

Excel

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Asset Containers | ✅ | Enabled by default. |
| Data Profiling | ✅ | Optionally enabled via configuration. |
| Detect Deleted Entities | ✅ | Optionally enabled via `stateful_ingestion.remove_stale_metadata`. |
| Schema Metadata | ✅ | Enabled by default. |

This connector ingests Excel worksheet datasets into DataHub. Workbooks (Excel files) can be ingested from the local filesystem, from S3 buckets, or from Azure Blob Storage. An asterisk (*) can be used in place of a directory or as part of a file name to match multiple directories or files with a single path specification.
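To illustrate the wildcard matching described above, here is a small sketch of how a `*` that stands in for one directory level or part of a file name can be evaluated (this is an illustration of the idea, not the connector's actual matcher):

```python
import re

def matches(pattern: str, path: str) -> bool:
    # Translate the wildcard pattern: "*" matches any run of characters
    # except "/", so it stands in for one directory level or part of a name.
    regex = "^" + re.escape(pattern).replace(r"\*", "[^/]*") + "$"
    return re.match(regex, path) is not None

print(matches("s3://bucket/data/excel/*/*.xlsx",
              "s3://bucket/data/excel/2024/sales.xlsx"))   # True
print(matches("s3://bucket/data/excel/*/*.xlsx",
              "s3://bucket/data/excel/2024/notes.txt"))    # False
```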

tip

By default, this connector ingests all worksheets in a workbook (an Excel file). To filter worksheets, use the worksheet_pattern config option; to ingest only the active worksheet, use the active_sheet_only config option.
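For example, a recipe that skips a scratch worksheet might use a deny pattern like the following (the `Scratch` sheet name and local path are illustrative):

```yaml
source:
  type: excel
  config:
    path_list:
      - "/data/reports/*.xlsx"
    worksheet_pattern:
      deny:
        - ".*\\.Scratch"
    # Alternatively, ingest only the sheet that was active when the file was saved:
    # active_sheet_only: true
```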

Supported file types

Supported file types are as follows:

  • Excel workbook (*.xlsx)
  • Excel macro-enabled workbook (*.xlsm)

The connector will attempt to identify which cells contain table data. A table is defined as a header row, which is used to derive the column names, followed by data rows. The schema is inferred from the data types that are present in a column.

Rows directly above or directly below the table in which only the first two columns have values are assumed to contain metadata. If such rows are found, they are converted to custom properties, with the first column as the key and the second column as the value. The workbook's standard and custom properties are also imported as dataset custom properties.
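The detection heuristic described above can be pictured with a small sketch: the header row supplies column names, and each column's type is inferred from the values beneath it (this is an illustration of the idea, not the connector's actual implementation):

```python
# Sketch: infer column names from a header row and column types from the
# data rows beneath it (illustrative only; not the connector's real code).
rows = [
    ["id", "name", "amount"],          # header row -> column names
    [1, "alice", 10.5],                # data rows -> types inferred per column
    [2, "bob", 20.0],
]

header, *data = rows

def infer(values):
    types = {type(v).__name__ for v in values if v is not None}
    return types.pop() if len(types) == 1 else "string"  # mixed types fall back

schema = {col: infer(vals) for col, vals in zip(header, zip(*data))}
print(schema)  # {'id': 'int', 'name': 'str', 'amount': 'float'}
```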

Data Model Mapping

The following table shows how Excel entities are mapped to DataHub entities:

| Excel Entity | DataHub Entity | Description |
| --- | --- | --- |
| Excel Worksheet | Dataset | Each worksheet becomes a dataset with URN pattern: `urn:li:dataset:(urn:li:dataPlatform:excel,{path}/[(unknown)]{sheet_name},PROD)` |
| File/Directory Structure | Container | The directory hierarchy creates containers with obfuscated URNs for organizing datasets. |

Note: The Excel workbook file itself does not become a separate DataHub entity - only the individual worksheets within it are ingested as datasets.

Prerequisites

AWS S3

When configuring an S3 ingestion source to access files in an S3 bucket, the AWS account referenced in your ingestion recipe must have appropriate S3 permissions. Create a policy with the minimum required permissions by following these steps:

  1. Create an IAM Policy: Create a policy that grants read access to specific S3 buckets.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Permissions Explanation:

  • s3:ListBucket: Allows listing the objects in the bucket. This permission is necessary for the S3 ingestion source to know which objects are available to read.
  • s3:GetBucketLocation: Allows retrieving the location of the bucket.
  • s3:GetObject: Allows reading the actual content of the objects in the bucket. This is needed to infer schema from sample files.
  2. Link Policy to Identity: Associate your newly created policy with the appropriate IAM user or role that will be used by the S3 ingestion process.

  3. Set Up S3 Data Source: When configuring your S3 ingestion source, specify the IAM user to whom you assigned the permissions in the previous step.
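As a quick sanity check before attaching the policy, you can verify that the document grants the three actions the connector needs (a minimal sketch; the policy JSON is inlined here, but you could just as well load it from a file):

```python
import json

# The same minimal read-only policy shown above.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"],
    "Resource": ["arn:aws:s3:::your-bucket-name", "arn:aws:s3:::your-bucket-name/*"]
  }]
}
""")

required = {"s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject"}
granted = {a for s in policy["Statement"] if s["Effect"] == "Allow"
           for a in s["Action"]}
missing = required - granted
print("missing actions:", missing or "none")  # missing actions: none
```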

Azure Blob Storage

To access files on Azure Blob Storage, you will need the following:

  1. Azure Storage Account: A storage account that provides a unique namespace for your data in Azure.

  2. Authentication Credentials: One of these supported authentication methods:

    • Account Key: Use your storage account's access key
    • Client Secret: Use a service principal with client ID and client secret for Microsoft Entra ID authentication
    • SAS Token: Use a Shared Access Signature token that provides limited, time-bound access
  3. Container: A blob container that organizes your blobs (similar to a directory in a file system).

  4. Access Permissions: Appropriate authorization for the authentication method:

    • For account key: Full access to the storage account
    • For client secret: Appropriate Azure role assignments (like Storage Blob Data Contributor)
    • For SAS token: Permissions are defined within the token itself
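For instance, client-secret (service principal) authentication would combine the `client_id`, `client_secret`, and `tenant_id` fields of `azure_config` (all values below are placeholders):

```yaml
source:
  type: excel
  config:
    path_list:
      - "https://storageaccountname.blob.core.windows.net/abs-data/excel/*.xlsx"
    azure_config:
      account_name: storageaccountname
      container_name: abs-data
      client_id: 00000000-0000-0000-0000-000000000000   # Application (client) ID
      client_secret: "${AZURE_CLIENT_SECRET}"           # prefer an environment variable
      tenant_id: 11111111-1111-1111-1111-111111111111   # Directory (tenant) ID
```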

Starter Recipes

Check out the following recipes to get started with ingestion.

For general pointers on writing and running a recipe, see our main recipe guide.

S3

```yaml
source:
  type: excel
  config:
    path_list:
      - "s3://bucket/data/excel/*/*.xlsx"
    aws_config:
      aws_access_key_id: AKIAIOSFODNN7EXAMPLE
      aws_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      aws_region: us-east-1
    profiling:
      enabled: false
```

Azure Blob Storage

```yaml
source:
  type: excel
  config:
    path_list:
      - "https://storageaccountname.blob.core.windows.net/abs-data/excel/*/*.xlsx"
    azure_config:
      account_name: storageaccountname
      sas_token: "sv=2022-11-02&ss=b&srt=sco&sp=rwdlacx&se=2025-06-07T21:00:00Z&st=2025-05-07T13:00:00Z&spr=https&sig=a1B2c3D4%3D"
      container_name: abs-data
    profiling:
      enabled: false
```

Local Files

```yaml
source:
  type: excel
  config:
    path_list:
      - "/data/path/reporting/excel/*.xlsx"
    profiling:
      enabled: false
```

CLI based Ingestion

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `path_list` | array(string) | List of paths to Excel files or folders to ingest. | |
| `active_sheet_only` | boolean | Enable to only ingest the active sheet of the workbook. If not set, all sheets will be ingested. | `False` |
| `convert_urns_to_lowercase` | boolean | Enable to convert the Excel asset URNs to lowercase. | `False` |
| `path_pattern` | AllowDenyPattern | Regex patterns for file paths to filter in ingestion. | `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}` |
| `platform_instance` | string | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. | |
| `profile_pattern` | AllowDenyPattern | Regex patterns for worksheets to profile, specified as `filename_without_extension.worksheet_name`. For example, to allow the worksheet Sheet1 from file report.xlsx, use the pattern `report.Sheet1`. | `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}` |
| `use_abs_blob_tags` | boolean | Whether to create tags in DataHub from the ABS blob tags. | `False` |
| `use_s3_bucket_tags` | boolean | Whether to create tags in DataHub from the S3 bucket tags. | `False` |
| `use_s3_object_tags` | boolean | Whether to create tags in DataHub from the S3 object tags. | `False` |
| `verify_ssl` | boolean or string | Either a boolean, in which case it controls whether we verify the server's TLS certificate, or a string, in which case it must be a path to a CA bundle to use. | `True` |
| `worksheet_pattern` | AllowDenyPattern | Regex patterns for worksheets to ingest, specified as `filename_without_extension.worksheet_name`. For example, to allow the worksheet Sheet1 from file report.xlsx, use the pattern `report.Sheet1`. | `{'allow': ['.*'], 'deny': [], 'ignoreCase': True}` |
| `env` | string | The environment that all assets produced by this connector belong to. | `PROD` |
| `aws_config` | AwsConnectionConfig | AWS configuration. | |
| `aws_config.aws_access_key_id` | string | AWS access key ID. Can be auto-detected; see the AWS boto3 docs for details. | |
| `aws_config.aws_advanced_config` | object | Advanced AWS configuration options, passed directly to `botocore.config.Config`. | |
| `aws_config.aws_endpoint_url` | string | The AWS service endpoint. This is normally constructed automatically, but can be overridden here. | |
| `aws_config.aws_profile` | string | The named profile to use from AWS credentials. Falls back to the default profile if not specified and no access keys are provided. Profiles are configured in `~/.aws/credentials` or `~/.aws/config`. | |
| `aws_config.aws_proxy` | map(str, string) | | |
| `aws_config.aws_region` | string | AWS region code. | |
| `aws_config.aws_retry_mode` | enum | One of: `legacy`, `standard`, `adaptive`. | `standard` |
| `aws_config.aws_retry_num` | integer | Number of times to retry failed AWS requests. See the botocore.retry docs for details. | `5` |
| `aws_config.aws_secret_access_key` | string | AWS secret access key. Can be auto-detected; see the AWS boto3 docs for details. | |
| `aws_config.aws_session_token` | string | AWS session token. Can be auto-detected; see the AWS boto3 docs for details. | |
| `aws_config.read_timeout` | number | The timeout for reading from the connection (in seconds). | `60` |
| `aws_config.aws_role` | string or array | AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the `RoleArn` field, and the additional available arguments are the same as boto3's `STS.Client.assume_role`. | |
| `aws_config.aws_role.union.RoleArn` | string | ARN of the role to assume. | |
| `aws_config.aws_role.union.ExternalId` | string | External ID to use when assuming the role. | |
| `azure_config` | AzureConnectionConfig | Azure configuration. | |
| `azure_config.account_name` | string | Name of the Azure storage account. See the Microsoft documentation on how to create a storage account. | |
| `azure_config.container_name` | string | Azure storage account container name. | |
| `azure_config.account_key` | string | Azure storage account access key that can be used as a credential. An account key, a SAS token, or a client secret is required for authentication. | |
| `azure_config.base_path` | string | Base folder in hierarchical namespaces to start from. | `/` |
| `azure_config.client_id` | string | Azure client (Application) ID, required when a client secret is used as a credential. | |
| `azure_config.client_secret` | string | Azure client secret that can be used as a credential. An account key, a SAS token, or a client secret is required for authentication. | |
| `azure_config.sas_token` | string | Azure storage account Shared Access Signature (SAS) token that can be used as a credential. An account key, a SAS token, or a client secret is required for authentication. | |
| `azure_config.tenant_id` | string | Azure tenant (Directory) ID, required when a client secret is used as a credential. | |
| `profiling` | GEProfilingConfig | Configuration for profiling. | `{'enabled': False, 'operation_config': {'lower_fre...` |
| `profiling.catch_exceptions` | boolean | | `True` |
| `profiling.enabled` | boolean | Whether profiling should be done. | `False` |
| `profiling.field_sample_values_limit` | integer | Upper limit for the number of sample values to collect for all columns. | `20` |
| `profiling.include_field_distinct_count` | boolean | Whether to profile for the number of distinct values of each column. | `True` |
| `profiling.include_field_distinct_value_frequencies` | boolean | Whether to profile for distinct value frequencies. | `False` |
| `profiling.include_field_histogram` | boolean | Whether to profile for the histogram of numeric fields. | `False` |
| `profiling.include_field_max_value` | boolean | Whether to profile for the max value of numeric columns. | `True` |
| `profiling.include_field_mean_value` | boolean | Whether to profile for the mean value of numeric columns. | `True` |
| `profiling.include_field_median_value` | boolean | Whether to profile for the median value of numeric columns. | `True` |
| `profiling.include_field_min_value` | boolean | Whether to profile for the min value of numeric columns. | `True` |
| `profiling.include_field_null_count` | boolean | Whether to profile for the number of nulls in each column. | `True` |
| `profiling.include_field_quantiles` | boolean | Whether to profile for the quantiles of numeric columns. | `False` |
| `profiling.include_field_sample_values` | boolean | Whether to profile for sample values of all columns. | `True` |
| `profiling.include_field_stddev_value` | boolean | Whether to profile for the standard deviation of numeric columns. | `True` |
| `profiling.limit` | integer | Max number of documents to profile. By default, profiles all documents. | |
| `profiling.max_number_of_fields_to_profile` | integer | A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. | |
| `profiling.max_workers` | integer | Number of worker threads to use for profiling. Set to 1 to disable. | `20` |
| `profiling.offset` | integer | Offset in documents to profile. By default, uses no offset. | |
| `profiling.profile_nested_fields` | boolean | Whether to profile complex types like structs, arrays, and maps. | `False` |
| `profiling.profile_table_level_only` | boolean | Whether to perform profiling at table level only, or to include column-level profiling as well. | `False` |
| `profiling.query_combiner_enabled` | boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. | `True` |
| `profiling.report_dropped_profiles` | boolean | Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes. | `False` |
| `profiling.turn_off_expensive_profiling_metrics` | boolean | Whether to turn off expensive profiling. This disables profiling for quantiles, distinct_value_frequencies, histogram, and sample_values, and limits the maximum number of fields profiled to 10. | `False` |
| `profiling.operation_config` | OperationConfig | Experimental feature. To specify operation configs. | |
| `profiling.operation_config.lower_freq_profile_enabled` | boolean | Whether to profile at a lower frequency. This does no scheduling; it only adds additional checks for when not to run profiling. | `False` |
| `profiling.operation_config.profile_date_of_month` | integer | Number between 1 and 31 for the date of the month (both inclusive). If not specified, this field does not take effect. | |
| `profiling.operation_config.profile_day_of_week` | integer | Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, this field does not take effect. | |
| `profiling.tags_to_ignore_sampling` | array(string) | Fixed list of tags for which to ignore sampling. If not specified, tables will be sampled based on `use_sampling`. | |
| `stateful_ingestion` | StatefulStaleMetadataRemovalConfig | Configuration for stateful ingestion and stale metadata removal. | |
| `stateful_ingestion.enabled` | boolean | Whether to enable stateful ingestion. Defaults to `True` if a `pipeline_name` is set and either a datahub-rest sink or `datahub_api` is specified; otherwise `False`. | `False` |
| `stateful_ingestion.fail_safe_threshold` | number | Prevents a large number of soft deletes, and the state from committing, after accidental changes to the source configuration, if the relative change percentage in entities compared to the previous state is above the threshold. | `75.0` |
| `stateful_ingestion.remove_stale_metadata` | boolean | Soft-deletes the entities present in the last successful run but missing in the current run, when stateful ingestion is enabled. | `True` |
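The `AllowDenyPattern` fields above combine allow and deny regex lists: a name is kept only if it matches at least one allow pattern and no deny pattern. A sketch of these semantics (assumed behaviour for illustration, not DataHub's actual implementation; matching is shown anchored at the start of the name, with `ignoreCase` defaulting to true):

```python
import re

def allowed(name, allow=(".*",), deny=(), ignore_case=True):
    # Sketch of allow/deny evaluation: the name must match at least one
    # allow regex and no deny regex (assumed semantics, for illustration).
    flags = re.IGNORECASE if ignore_case else 0
    ok = any(re.match(p, name, flags) for p in allow)
    blocked = any(re.match(p, name, flags) for p in deny)
    return ok and not blocked

# Worksheets are addressed as 'filename_without_extension.worksheet_name':
print(allowed("report.Sheet1", allow=[r"report\..*"]))            # True
print(allowed("report.Scratch", allow=[r"report\..*"],
              deny=[r".*\.Scratch"]))                             # False
```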

Code Coordinates

  • Class Name: datahub.ingestion.source.excel.source.ExcelSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Excel, feel free to ping us on our Slack.