BigID
Overview
BigID is a data intelligence platform for data discovery, classification, and privacy. It scans connected data sources, classifies columns and documents against a catalog of classifiers and business glossary terms, and correlates personal data to identities (IDSoR). Learn more in the official BigID documentation.
The DataHub integration for BigID is an enrichment connector: it syncs BigID's classification findings, business glossary terms, and tags onto data assets that already exist in DataHub. It maps BigID business glossary items to GlossaryTerms, classification findings to column-level GlossaryTerms with attribution, BigID tags to DataHub Tags, and BigID risk scores to a structured property. It can optionally create Dataset and schema entities for sources not yet present in DataHub, and supports platform instance mapping, domains, ownership on terms, and stateful ingestion for stale entity removal.
Concept Mapping
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Data source (connection) | Data Platform | Mapped to a DataHub platform (e.g. snowflake, mysql) for URN build. |
| Catalog object | Dataset | Enriched in place; created only when create_datasets is enabled. |
| Business glossary item | GlossaryTerm | Grouped under a BigID root GlossaryNode. |
| Classification finding | GlossaryTerm on SchemaField | Emitted with MetadataAttribution recording confidence and counts. |
| Unlinked classifier | GlossaryTerm | Grouped under a BigID > Classifier GlossaryNode when not linked to a term. |
| IDSoR correlation attribute | GlossaryTerm | Grouped under a BigID > IDSoR GlossaryNode when not linked to a term. |
| Tag (OBJECT-scoped) | Tag | Applied to datasets; hidden and non-OBJECT tags are skipped. |
| Risk score | Structured Property (bigid.riskScore) | Numeric 0–100 value patched onto the dataset. |
| Domain / sub-domain | Domain | Optional; controlled by domain_mode. |
| Column profile | Dataset Profile | Column-level statistics from BigID columnProfile data. |
Module bigid
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Data Profiling | ✅ | Column-level profiles from BigID columnProfile data. |
| Detect Deleted Entities | ✅ | Stale entity removal via stateful ingestion. Only meaningful with create_datasets=True: in pure enrichment mode the connector owns no Dataset entities, so there is nothing for stale removal to soft-delete (glossary terms, tags and domains are shared, not per-run). |
| Domains | ✅ | Domain entities created when domain_mode is auto_namespaced or config_map. |
| Extract Ownership | ✅ | Ownership on GlossaryTerms (not Datasets); controlled by owner_type config. |
| Extract Tags | ✅ | BigID tags applied to datasets and columns. |
| Glossary Terms | ✅ | BigID classification findings as GlossaryTerms on SchemaFields. |
| Platform Instance | ✅ | Platform instance emitted per dataset when platform_instance is configured. |
| Schema Metadata | ✅ | Column schema from BigID columns API (requires create_datasets=True). |
| Table-Level Lineage | ❌ | Not supported. |
Overview
The bigid module ingests classification and governance metadata from BigID into DataHub. It reads BigID's data catalog, business glossary, classifications, and IDSoR correlation results, then enriches matching DataHub datasets with GlossaryTerms, Tags, risk scores, and profiles.
By default this connector runs in pure enrichment mode (create_datasets: false): it never emits structural aspects and only augments datasets that already exist in DataHub. Enable create_datasets to also emit DatasetProperties and SchemaMetadata for sources that BigID knows about but DataHub does not.
Prerequisites
Before running ingestion, ensure you have:
- Network connectivity to your BigID instance over HTTPS.
- A BigID service account (a System User) with a long-lived user token and read access to the catalog, classification, business glossary, and (if used) correlation/IDSoR APIs. See Authentication and Required permissions below.
- Datasets already present in DataHub for the sources BigID scans, unless you enable
create_datasets.
Authentication
The connector authenticates to the BigID REST API with a bearer token. It does not perform an interactive login, so single sign-on (SSO/SAML/OIDC) is never invoked at ingestion time — the token is what grants access. There are two ways to supply that token, in order of preference:
| Config | Token kind | Lifetime | Auto-refresh | Use for |
|---|---|---|---|---|
user_token (recommended) | Long-lived user token generated in the UI | Up to 999 days | Yes — exchanged for a short-lived session token at startup and re-fetched automatically on a 401 | Scheduled / production ingestion |
access_token | Short-lived session token, used directly | Minutes–hours (BigID default) | No — a run that outlives the token fails | One-off/manual runs, or SSO-only tenants where you cannot create a service-account user token |
Provide exactly one. If you set both, user_token takes precedence (it can auto-refresh) and the standalone access_token is ignored. Paste the raw token for either — do not add a Bearer prefix (the connector sends it exactly as given).
user_token is strongly preferred: it is exchanged for a session token at startup (via GET /api/v1/refresh-access-token) and transparently refreshed if that session token expires mid-run, so scheduled ingestion keeps working without manual rotation. access_token skips the exchange but is not refreshed, so it is only suitable for short, manual runs.
SSO / SAML environments
Because the connector uses token auth rather than an interactive login, an SSO-only tenant does not block ingestion — but note:
- Preferred: create a local System User service account (independent of your SSO directory) and generate a
user_tokenfor it. This is the most robust option and is unaffected by SSO. Username/password session login (POST /api/v1/sessions) is not supported by the connector, and would not work for SSO/federated users anyway. - If local service users are disallowed: an SSO user can sign in to BigID and obtain a short-lived session token, then pass it as
access_tokenfor a manual run. This will expire, so it is not suitable for scheduled ingestion.
Generating a user token
Generate the long-lived user_token from the BigID UI:
- Go to Administration → Access Management and select (or create) a user from the System Users List. A dedicated, read-only service-account user is recommended over a personal login.
- Open the user's profile in the right-hand detail panel and, in the Tokens section, click Generate.
- Set an expiration (BigID allows up to 999 days) and click Generate again.
- Copy the token value immediately — BigID does not display it again after the dialog closes.
- Click Save on the user profile. This step is required: an unsaved token stays inactive and the API rejects it with
{"message":"Refresh token not valid"}(HTTP 401). Tokens cannot be edited after creation — to rotate, generate a new one and Save again.
Required permissions
Assign the service user a role with read access to the resources the connector reads. It issues only GET requests to these endpoints:
| Endpoint | Purpose | Required when |
|---|---|---|
GET /api/v1/refresh-access-token | Exchange the user token for a session token | Always (when using user_token) |
GET /api/v1/ds-connections | Data source → DataHub platform resolution; connection test | Always |
GET /api/v1/data-catalog/ | Catalog objects (datasets) | Always |
GET /api/v1/data-catalog/columns | Column-level schema, classifications, and profiles | Structured sources |
GET /api/v1/all-classifications | Classifier → Business Glossary linkage | Always |
GET /api/v1/business_glossary_items | Business Glossary terms | Always |
GET /api/v1/data-catalog/results-tuning/attributes | IDSoR (correlation) attribute → glossary mapping | Only when sync_idsor is enabled (default) |
A read-only role granting the Data Catalog, Classification, Business Glossary, and Correlation permission groups covers all of the above. If IDSoR sync is not needed, you can omit the Correlation permission and set sync_idsor: false.
Connection-to-Platform Resolution
BigID connection type values are mapped to DataHub platform names automatically (for example rdb-postgresql → postgres, snowflake → snowflake). Two levers let you override this:
datasource_platform_mapping— per-connection overrides of platform,env,platform_instance, andconvert_urns_to_lowercase. Required when a connection's type has no built-in mapping, or when a dataset's URN must match a specific platform instance created by a native connector. Setconvert_urns_to_lowercaseon a connection when the native connector's URN casing differs from BigID's default (Snowflake, BigQuery and Redshift are lowercased by default) — for example a Snowflake source ingested withconvert_urns_to_lowercase: false.connection_pattern— regex allow/deny patterns matched against the BigID connection name. Use this to scope ingestion to a subset of connections in large deployments that expose hundreds of data sources.
Confidence Filtering
Classification findings carry a BigID confidence rank. Ranks map to HIGH = 0.75, MEDIUM = 0.50, LOW = 0.25. Set minimum_confidence_threshold (0.0–1.0) to drop low-confidence findings.
Domain Handling
BigID domain/sub_domain values are mapped into DataHub according to domain_mode:
none(default) — domain values are stored incustomPropertiesonly; no domain entities are created.auto_namespaced— oneurn:li:domainentity is auto-created per BigID domain/sub-domain, keyed deterministically by name (the human-readable label is carried ondomainProperties.name).config_map— BigID domain values are mapped to pre-existing DataHub domain URNs viadomain_mapping.
In auto_namespaced mode the generated domain GUID is scoped by env and platform_instance, mirroring how datasets and data products are separated. The same domain name under different env or platform_instance values resolves to distinct urn:li:domain entities. To share a domain across two BigID ingestions, give them the same env and platform_instance; to keep them separate, vary either. config_map mode is unaffected (URNs come from your domain_mapping).
Install the Plugin
pip install 'acryl-datahub[bigid]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
type: bigid
config:
# Coordinates
bigid_url: "https://bigid.example.com"
# Credentials — provide either user_token (recommended) or access_token
user_token: "${BIGID_USER_TOKEN}"
# access_token: "${BIGID_ACCESS_TOKEN}"
# HTTP behaviour (optional)
# timeout: 60
# max_retries: 3
# Environment applied to generated dataset URNs
env: PROD
# Scope ingestion to a subset of BigID connections (regex allow/deny)
# connection_pattern:
# allow:
# - "^prod-.*"
# deny:
# - "^sandbox-.*"
# Per-connection platform / env / instance overrides (and mappings for
# connection types without a built-in platform mapping)
# datasource_platform_mapping:
# my-snowflake-conn:
# platform: snowflake
# env: PROD
# platform_instance: prod-account
# Dataset creation (opt-in). Default false = pure enrichment mode.
# create_datasets: false
# Classification findings
# minimum_confidence_threshold: 0.0 # HIGH=0.75, MEDIUM=0.50, LOW=0.25
# confidence_level_tag: false # also emit urn:li:tag:bigid.confidence:{LEVEL}
# Tags
# sync_tags: true
# tag_application_types: ["sensitivityClassification", "risk", "userDefined"]
# Business glossary / classifiers / IDSoR
# sync_unlinked_classifiers: true
# sync_idsor: true
# sync_unstructured_enrichment: false
# owner_type: user # user | group | none
# domain_mode: none # none | auto_namespaced | config_map
# Stateful ingestion — removes stale entities emitted by this source
# stateful_ingestion:
# enabled: true
sink:
# sink configs
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
bigid_url ✅ string | Base URL of the BigID instance (e.g. 'https://bigid.example.com'). |
access_token One of string(password), null | Short-lived BigID session token, used directly without the startup exchange and NOT auto-refreshed — a run that outlives it fails. Intended for one-off runs or SSO-only tenants where a service-account user_token cannot be created; prefer user_token for scheduled ingestion. Provide either this or user_token; if both are set, user_token is used and this value is ignored. Default: None |
confidence_level_tag boolean | Emit urn:li:tag:bigid.confidence:{LEVEL} alongside each GlossaryTerm on a column. Lossy (can't tie level to a specific term when multiple exist), but visible in DataHub UI. Default: False |
create_datasets boolean | If True, emit DatasetProperties + SchemaMetadata for datasets not yet in DataHub. Default False (pure enrichment mode — never emits structural aspects). Default: False |
domain_mapping map(str,string) | |
domain_mode Enum | One of: "none", "auto_namespaced", "config_map" |
max_retries integer | Maximum number of retries for transient errors. Default: 3 |
minimum_confidence_threshold number | Filter column classification findings below this confidence level. Accepts 0.0–1.0 (not a rank string). BigID ranks map to: HIGH = 0.75, MEDIUM = 0.50, LOW = 0.25 (unknown ranks = 0.0). Default: 0.0 |
owner_type Enum | One of: "user", "group", "none" |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
risk_score_structured_property_urn string | URN of the StructuredProperty used for riskScore values. Default: urn:li:structuredProperty:bigid.riskScore |
sync_idsor boolean | Emit GlossaryTerms for IDSoR (Identity Source of Record) attribute findings from BigID's correlation engine. IDSoR findings are separate from classifier findings and only appear when a Correlation Set is configured and enabled in the scan profile. When the attribute links to an existing Business Glossary term (via glossaryId), that term is reused. Otherwise an auto-generated term is created under a dedicated 'bigid.idsor' GlossaryNode. Term URNs are deterministic GUIDs keyed on the attribute identity. Default: True |
sync_tags boolean | Emit BigID tags as DataHub Tag entities. Default: True |
sync_unlinked_classifiers boolean | Emit GlossaryTerms for classifier findings that have no Business Glossary linkage in BigID. Terms are auto-generated on demand (only when a column finding references the classifier) and placed under the same 'bigid' root GlossaryNode. Term URNs are deterministic GUIDs keyed on the classifier identity. Default: True |
sync_unstructured_enrichment boolean | Emit dataset-level GlossaryTerms and DatasetProfile for unstructured and email sources (SharePoint, Google Drive, O365, Kafka, AI models, etc.) using the attribute_details field returned by BigID's catalog API. Only applies to objects where BigID has classification findings (attribute_details non-empty). Controlled by the same sync_unlinked_classifiers and sync_idsor flags as structured enrichment. Default: False |
timeout integer | HTTP request timeout in seconds. Default: 60 |
user_token One of string(password), null | Recommended auth. Long-lived BigID user token, generated under Administration → Access Management → System Users (Save the user after generating so the token activates). Exchanged for a short-lived session token at startup and auto-refreshed on expiry, so it is safe for scheduled ingestion. Provide the raw token — no 'Bearer' prefix. Provide either this or access_token; if both are set, user_token takes precedence because it can auto-refresh. Default: None |
env string | The environment that all assets produced by this connector belong to Default: PROD |
connection_pattern AllowDenyPattern | A class to store allow deny regexes |
connection_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
dataset_pattern AllowDenyPattern | A class to store allow deny regexes |
dataset_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
datasource_platform_mapping map(str,ConnectionPlatformConfig) | Per-connection platform override for a single BigID data source. |
datasource_platform_mapping. key.platform ❓string | DataHub platform name (e.g. 'snowflake', 'mysql'). |
datasource_platform_mapping. key.convert_urns_to_lowercaseOne of boolean, null | Override dataset-name casing for this connection so BigID's enrichment URN byte-matches the one the native connector emitted. When set, forces (true) or disables (false) lowercasing of the dataset-name segment. Leave unset to use the built-in per-platform default (Snowflake, BigQuery and Redshift are lowercased). Set false for, e.g., a Snowflake connection ingested with convert_urns_to_lowercase: false. Default: None |
datasource_platform_mapping. key.platform_instanceOne of string, null | DataHub platform instance identifier for this connection. Default: None |
datasource_platform_mapping. key.envOne of string, null | Environment override for this connection (e.g. 'PROD', 'DEV'). Falls back to top-level env if not set. Default: None |
item_types array | Allow-list of BigID item types to sync. OOTB Personal Data Items are always included regardless of this filter. |
item_types.string string | |
tag_application_types array | BigID applicationType values to sync as tags. |
tag_application_types.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ConnectionPlatformConfig": {
"additionalProperties": false,
"description": "Per-connection platform override for a single BigID data source.",
"properties": {
"platform": {
"description": "DataHub platform name (e.g. 'snowflake', 'mysql').",
"title": "Platform",
"type": "string"
},
"env": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Environment override for this connection (e.g. 'PROD', 'DEV'). Falls back to top-level env if not set.",
"title": "Env"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "DataHub platform instance identifier for this connection.",
"title": "Platform Instance"
},
"convert_urns_to_lowercase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Override dataset-name casing for this connection so BigID's enrichment URN byte-matches the one the native connector emitted. When set, forces (`true`) or disables (`false`) lowercasing of the dataset-name segment. Leave unset to use the built-in per-platform default (Snowflake, BigQuery and Redshift are lowercased). Set `false` for, e.g., a Snowflake connection ingested with `convert_urns_to_lowercase: false`.",
"title": "Convert Urns To Lowercase"
}
},
"required": [
"platform"
],
"title": "ConnectionPlatformConfig",
"type": "object"
},
"DomainMode": {
"description": "How BigID domain/sub_domain values are mapped into DataHub.",
"enum": [
"none",
"auto_namespaced",
"config_map"
],
"title": "DomainMode",
"type": "string"
},
"OwnerType": {
"description": "How BigID owner strings are interpreted when building owner URNs.",
"enum": [
"user",
"group",
"none"
],
"title": "OwnerType",
"type": "string"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null
},
"bigid_url": {
"description": "Base URL of the BigID instance (e.g. 'https://bigid.example.com').",
"title": "Bigid Url",
"type": "string"
},
"user_token": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "Recommended auth. Long-lived BigID user token, generated under Administration \u2192 Access Management \u2192 System Users (Save the user after generating so the token activates). Exchanged for a short-lived session token at startup and auto-refreshed on expiry, so it is safe for scheduled ingestion. Provide the raw token \u2014 no 'Bearer' prefix. Provide either this or access_token; if both are set, user_token takes precedence because it can auto-refresh.",
"title": "User Token"
},
"access_token": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "Short-lived BigID session token, used directly without the startup exchange and NOT auto-refreshed \u2014 a run that outlives it fails. Intended for one-off runs or SSO-only tenants where a service-account user_token cannot be created; prefer user_token for scheduled ingestion. Provide either this or user_token; if both are set, user_token is used and this value is ignored.",
"title": "Access Token"
},
"timeout": {
"default": 60,
"description": "HTTP request timeout in seconds.",
"title": "Timeout",
"type": "integer"
},
"max_retries": {
"default": 3,
"description": "Maximum number of retries for transient errors.",
"title": "Max Retries",
"type": "integer"
},
"datasource_platform_mapping": {
"additionalProperties": {
"$ref": "#/$defs/ConnectionPlatformConfig"
},
"description": "Map BigID connection name \u2192 platform config. Auto-detected from ds-connections API if omitted; explicit entries override.",
"title": "Datasource Platform Mapping",
"type": "object"
},
"connection_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex allow/deny patterns matched against the BigID connection (data source) name. Use this to scope ingestion to a subset of connections in large BigID deployments that expose hundreds of data sources. Catalog objects whose source connection is denied are skipped entirely."
},
"dataset_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex allow/deny patterns matched against the BigID catalog object's fully-qualified name. Complements connection_pattern with finer, dataset-level scoping for large deployments where a single connection exposes many objects. Objects whose fully-qualified name is denied are skipped."
},
"create_datasets": {
"default": false,
"description": "If True, emit DatasetProperties + SchemaMetadata for datasets not yet in DataHub. Default False (pure enrichment mode \u2014 never emits structural aspects).",
"title": "Create Datasets",
"type": "boolean"
},
"minimum_confidence_threshold": {
"default": 0.0,
"description": "Filter column classification findings below this confidence level. Accepts 0.0\u20131.0 (not a rank string). BigID ranks map to: HIGH = 0.75, MEDIUM = 0.50, LOW = 0.25 (unknown ranks = 0.0).",
"maximum": 1.0,
"minimum": 0.0,
"title": "Minimum Confidence Threshold",
"type": "number"
},
"confidence_level_tag": {
"default": false,
"description": "Emit `urn:li:tag:bigid.confidence:{LEVEL}` alongside each GlossaryTerm on a column. Lossy (can't tie level to a specific term when multiple exist), but visible in DataHub UI.",
"title": "Confidence Level Tag",
"type": "boolean"
},
"item_types": {
"description": "Allow-list of BigID item types to sync. OOTB Personal Data Items are always included regardless of this filter.",
"items": {
"type": "string"
},
"title": "Item Types",
"type": "array"
},
"domain_mode": {
"$ref": "#/$defs/DomainMode",
"default": "none",
"description": "Domain handling mode. 'none': store raw domain/sub_domain in customProperties only. 'auto_namespaced': auto-create GUID-based `urn:li:domain` entities (one per BigID domain/sub-domain, keyed deterministically by name). 'config_map': map BigID domain values to existing DataHub domain URNs."
},
"domain_mapping": {
"additionalProperties": {
"type": "string"
},
"description": "Used when domain_mode='config_map'. Maps BigID domain string \u2192 DataHub domain URN.",
"title": "Domain Mapping",
"type": "object"
},
"owner_type": {
"$ref": "#/$defs/OwnerType",
"default": "user",
"description": "How to interpret BigID owner strings. 'user' \u2192 `urn:li:corpuser:{owner}`. 'group' \u2192 `urn:li:corpGroup:{owner}`. 'none' \u2192 stored in customProperties only."
},
"sync_tags": {
"default": true,
"description": "Emit BigID tags as DataHub Tag entities.",
"title": "Sync Tags",
"type": "boolean"
},
"tag_application_types": {
"description": "BigID applicationType values to sync as tags.",
"items": {
"type": "string"
},
"title": "Tag Application Types",
"type": "array"
},
"risk_score_structured_property_urn": {
"default": "urn:li:structuredProperty:bigid.riskScore",
"description": "URN of the StructuredProperty used for riskScore values.",
"title": "Risk Score Structured Property Urn",
"type": "string"
},
"sync_unlinked_classifiers": {
"default": true,
"description": "Emit GlossaryTerms for classifier findings that have no Business Glossary linkage in BigID. Terms are auto-generated on demand (only when a column finding references the classifier) and placed under the same 'bigid' root GlossaryNode. Term URNs are deterministic GUIDs keyed on the classifier identity.",
"title": "Sync Unlinked Classifiers",
"type": "boolean"
},
"sync_idsor": {
"default": true,
"description": "Emit GlossaryTerms for IDSoR (Identity Source of Record) attribute findings from BigID's correlation engine. IDSoR findings are separate from classifier findings and only appear when a Correlation Set is configured and enabled in the scan profile. When the attribute links to an existing Business Glossary term (via glossaryId), that term is reused. Otherwise an auto-generated term is created under a dedicated 'bigid.idsor' GlossaryNode. Term URNs are deterministic GUIDs keyed on the attribute identity.",
"title": "Sync Idsor",
"type": "boolean"
},
"sync_unstructured_enrichment": {
"default": false,
"description": "Emit dataset-level GlossaryTerms and DatasetProfile for unstructured and email sources (SharePoint, Google Drive, O365, Kafka, AI models, etc.) using the attribute_details field returned by BigID's catalog API. Only applies to objects where BigID has classification findings (attribute_details non-empty). Controlled by the same sync_unlinked_classifiers and sync_idsor flags as structured enrichment.",
"title": "Sync Unstructured Enrichment",
"type": "boolean"
}
},
"required": [
"bigid_url"
],
"title": "BigIDSourceConfig",
"type": "object"
}
Capabilities
- Business glossary sync — BigID business glossary items become GlossaryTerms under a
BigIDroot GlossaryNode, with domains and ownership optionally attached. - Column classification — classification findings are emitted as GlossaryTerms on schema fields, carrying
MetadataAttributionthat records the classifier, confidence level, and finding counts. Classifiers not linked to a business glossary item are auto-generated under aBigID > Classifiernode (controlled bysync_unlinked_classifiers). - IDSoR correlation — Identity Source of Record findings are resolved via a three-path strategy: reuse a linked business glossary term, auto-generate a term under a
BigID > IDSoRnode, or synthesize one from the raw attribute name. - Tags and risk score — OBJECT-scoped BigID tags become DataHub Tags; risk scores are written to the
bigid.riskScorestructured property. - Non-destructive enrichment — tags, glossary terms, schema-field annotations, and the risk score are applied to existing datasets via PATCH (merge) semantics, so BigID metadata is added alongside — never overwriting — tags and terms that stewards curate in the DataHub UI.
- No placeholder datasets — in pure-enrichment mode (
create_datasets: false) dataset aspects are emitted as non-primary, so BigID never materializes (or later soft-deletes) a dataset that a native connector has not already created. Enablecreate_datasetsto have BigID own and create datasets it scans. - Profiling — column-level statistics from BigID
columnProfiledata are emitted as Dataset Profiles. - Stateful ingestion — enables automatic removal of entities emitted by this source when they disappear from BigID.
Limitations
Enrichment adds are not retracted
Because enrichment is applied additively (PATCH), removing a classification, tag, or glossary link in BigID does not remove the previously-added term or tag from an existing DataHub dataset on the next run — the PATCH only adds. Stateful ingestion removes entities this connector owns (e.g. glossary terms/nodes it created), but it does not retract annotations merged onto datasets owned by a native connector. Remove such annotations in the DataHub UI if needed.
dataPlatformInstance is only emitted when configured
To avoid overwriting the platform instance already set by a native connector (e.g. Snowflake, BigQuery), the dataPlatformInstance aspect is emitted only when a platform_instance is explicitly configured for the connection.
Troubleshooting
Datasets are not being enriched
The connector matches BigID objects to existing DataHub dataset URNs. If enrichment does not appear, verify that the resolved platform, env, platform_instance, and URN casing produce a URN that matches the one created by your native connector. Use datasource_platform_mapping to align them — including convert_urns_to_lowercase when the native connector's casing differs from BigID's per-platform default.
No enrichment applied at all
If both the business glossary and classification map fail to load, the connector reports a failure and emits nothing. Check BigID API connectivity and that the token has read access to the catalog, classifications, and glossary APIs.
Unknown connection type
When a BigID connection type has no built-in platform mapping, the raw type is used as the platform in URNs and a warning is reported. Add an entry to datasource_platform_mapping to map it to the correct DataHub platform.
Code Coordinates
- Class Name:
datahub.ingestion.source.bigid.bigid_source.BigIDSource - Browse on GitHub
If you've got any questions on configuring ingestion for BigID, feel free to ping us on our Slack.
This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.
Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.