BigID

Overview

BigID is a data intelligence platform for data discovery, classification, and privacy. It scans connected data sources, classifies columns and documents against a catalog of classifiers and business glossary terms, and correlates personal data to identities (IDSoR). Learn more in the official BigID documentation.

The DataHub integration for BigID is an enrichment connector: it syncs BigID's classification findings, business glossary terms, and tags onto data assets that already exist in DataHub. It maps BigID business glossary items to GlossaryTerms, classification findings to column-level GlossaryTerms with attribution, BigID tags to DataHub Tags, and BigID risk scores to a structured property. It can optionally create Dataset and schema entities for sources not yet present in DataHub, and supports platform instance mapping, domains, ownership on terms, and stateful ingestion for stale entity removal.

Concept Mapping

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Data source (connection) | Data Platform | Mapped to a DataHub platform (e.g. snowflake, mysql) for URN construction. |
| Catalog object | Dataset | Enriched in place; created only when create_datasets is enabled. |
| Business glossary item | GlossaryTerm | Grouped under a BigID root GlossaryNode. |
| Classification finding | GlossaryTerm on SchemaField | Emitted with MetadataAttribution recording confidence and counts. |
| Unlinked classifier | GlossaryTerm | Grouped under a BigID > Classifier GlossaryNode when not linked to a term. |
| IDSoR correlation attribute | GlossaryTerm | Grouped under a BigID > IDSoR GlossaryNode when not linked to a term. |
| Tag (OBJECT-scoped) | Tag | Applied to datasets; hidden and non-OBJECT tags are skipped. |
| Risk score | Structured Property (bigid.riskScore) | Numeric 0–100 value patched onto the dataset. |
| Domain / sub-domain | Domain | Optional; controlled by domain_mode. |
| Column profile | Dataset Profile | Column-level statistics from BigID columnProfile data. |

Module bigid

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Data Profiling | | Column-level profiles from BigID columnProfile data. |
| Detect Deleted Entities | | Stale entity removal via stateful ingestion. Only meaningful with create_datasets=True: in pure enrichment mode the connector owns no Dataset entities, so there is nothing for stale removal to soft-delete (glossary terms, tags and domains are shared, not per-run). |
| Domains | | Domain entities created when domain_mode is auto_namespaced or config_map. |
| Extract Ownership | | Ownership on GlossaryTerms (not Datasets); controlled by owner_type config. |
| Extract Tags | | BigID tags applied to datasets and columns. |
| Glossary Terms | | BigID classification findings as GlossaryTerms on SchemaFields. |
| Platform Instance | | Platform instance emitted per dataset when platform_instance is configured. |
| Schema Metadata | | Column schema from BigID columns API (requires create_datasets=True). |
| Table-Level Lineage | | Not supported. |

Overview

The bigid module ingests classification and governance metadata from BigID into DataHub. It reads BigID's data catalog, business glossary, classifications, and IDSoR correlation results, then enriches matching DataHub datasets with GlossaryTerms, Tags, risk scores, and profiles.

By default this connector runs in pure enrichment mode (create_datasets: false): it never emits structural aspects and only augments datasets that already exist in DataHub. Enable create_datasets to also emit DatasetProperties and SchemaMetadata for sources that BigID knows about but DataHub does not.

Prerequisites

Before running ingestion, ensure you have:

  1. Network connectivity to your BigID instance over HTTPS.
  2. A BigID service account (a System User) with a long-lived user token and read access to the catalog, classification, business glossary, and (if used) correlation/IDSoR APIs. See Authentication and Required permissions below.
  3. Datasets already present in DataHub for the sources BigID scans, unless you enable create_datasets.

Authentication

The connector authenticates to the BigID REST API with a bearer token. It does not perform an interactive login, so single sign-on (SSO/SAML/OIDC) is never invoked at ingestion time — the token is what grants access. There are two ways to supply that token, in order of preference:

| Config | Token kind | Lifetime | Auto-refresh | Use for |
| --- | --- | --- | --- | --- |
| user_token (recommended) | Long-lived user token generated in the UI | Up to 999 days | Yes: exchanged for a short-lived session token at startup and re-fetched automatically on a 401 | Scheduled / production ingestion |
| access_token | Short-lived session token, used directly | Minutes–hours (BigID default) | No: a run that outlives the token fails | One-off/manual runs, or SSO-only tenants where you cannot create a service-account user token |

Provide exactly one. If you set both, user_token takes precedence (it can auto-refresh) and the standalone access_token is ignored. Paste the raw token for either — do not add a Bearer prefix (the connector sends it exactly as given).

user_token is strongly preferred: it is exchanged for a session token at startup (via GET /api/v1/refresh-access-token) and transparently refreshed if that session token expires mid-run, so scheduled ingestion keeps working without manual rotation. access_token skips the exchange but is not refreshed, so it is only suitable for short, manual runs.
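
The exchange-then-refresh behaviour described above can be sketched roughly as follows. This is an illustration only, not the connector's code: the TokenSession class and the exchange and send callables are hypothetical stand-ins for the real HTTP calls, and the BigID side is simulated in memory so the example runs without a network.

```python
from typing import Callable, List, Set, Tuple


class TokenSession:
    """Hypothetical helper: holds a session token derived from a user token."""

    def __init__(self, user_token: str, exchange: Callable[[str], str]) -> None:
        self._user_token = user_token
        self._exchange = exchange  # assumed: user token -> session token
        self._session_token = exchange(user_token)  # exchanged once at startup

    def call(self, send: Callable[[str], Tuple[int, str]]) -> Tuple[int, str]:
        """Send a request; on HTTP 401, refresh the session token and retry once."""
        status, body = send(self._session_token)
        if status == 401:
            self._session_token = self._exchange(self._user_token)
            status, body = send(self._session_token)
        return status, body


# In-memory simulation of the server side (not a real BigID client).
valid_tokens: Set[str] = set()
exchanges: List[str] = []


def fake_exchange(user_token: str) -> str:
    token = f"session-{len(exchanges) + 1}"
    exchanges.append(token)
    valid_tokens.add(token)
    return token


def fake_send(session_token: str) -> Tuple[int, str]:
    return (200, "ok") if session_token in valid_tokens else (401, "expired")


session = TokenSession("long-lived-user-token", fake_exchange)
valid_tokens.clear()  # simulate the session token expiring mid-run
result = session.call(fake_send)
print(result, exchanges)
```

With access_token there is no user token to exchange again, so the same 401 would simply end the run.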

SSO / SAML environments

Because the connector uses token auth rather than an interactive login, an SSO-only tenant does not block ingestion — but note:

  • Preferred: create a local System User service account (independent of your SSO directory) and generate a user_token for it. This is the most robust option and is unaffected by SSO. Username/password session login (POST /api/v1/sessions) is not supported by the connector, and would not work for SSO/federated users anyway.
  • If local service users are disallowed: an SSO user can sign in to BigID and obtain a short-lived session token, then pass it as access_token for a manual run. This will expire, so it is not suitable for scheduled ingestion.

Generating a user token

Generate the long-lived user_token from the BigID UI:

  1. Go to Administration → Access Management and select (or create) a user from the System Users List. A dedicated, read-only service-account user is recommended over a personal login.
  2. Open the user's profile in the right-hand detail panel and, in the Tokens section, click Generate.
  3. Set an expiration (BigID allows up to 999 days) and click Generate again.
  4. Copy the token value immediately — BigID does not display it again after the dialog closes.
  5. Click Save on the user profile. This step is required: an unsaved token stays inactive and the API rejects it with {"message":"Refresh token not valid"} (HTTP 401). Tokens cannot be edited after creation — to rotate, generate a new one and Save again.

Required permissions

Assign the service user a role with read access to the resources the connector reads. It issues only GET requests to these endpoints:

| Endpoint | Purpose | Required when |
| --- | --- | --- |
| GET /api/v1/refresh-access-token | Exchange the user token for a session token | Always (when using user_token) |
| GET /api/v1/ds-connections | Data source → DataHub platform resolution; connection test | Always |
| GET /api/v1/data-catalog/ | Catalog objects (datasets) | Always |
| GET /api/v1/data-catalog/columns | Column-level schema, classifications, and profiles | Structured sources |
| GET /api/v1/all-classifications | Classifier → Business Glossary linkage | Always |
| GET /api/v1/business_glossary_items | Business Glossary terms | Always |
| GET /api/v1/data-catalog/results-tuning/attributes | IDSoR (correlation) attribute → glossary mapping | Only when sync_idsor is enabled (default) |

A read-only role granting the Data Catalog, Classification, Business Glossary, and Correlation permission groups covers all of the above. If IDSoR sync is not needed, you can omit the Correlation permission and set sync_idsor: false.

Connection-to-Platform Resolution

BigID connection type values are mapped to DataHub platform names automatically (for example rdb-postgresql → postgres, snowflake → snowflake). Two levers let you override this:

  • datasource_platform_mapping — per-connection overrides of platform, env, platform_instance, and convert_urns_to_lowercase. Required when a connection's type has no built-in mapping, or when a dataset's URN must match a specific platform instance created by a native connector. Set convert_urns_to_lowercase on a connection when the native connector's URN casing differs from BigID's default (Snowflake, BigQuery and Redshift are lowercased by default) — for example a Snowflake source ingested with convert_urns_to_lowercase: false.
  • connection_pattern — regex allow/deny patterns matched against the BigID connection name. Use this to scope ingestion to a subset of connections in large deployments that expose hundreds of data sources.
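
To make the allow/deny scoping concrete, here is a minimal sketch of how such patterns are typically evaluated. It approximates the behaviour with the standard library rather than using DataHub's own AllowDenyPattern class; the connection_allowed function and the connection names are hypothetical, and the assumptions are that deny takes precedence over allow and that matching is case-insensitive by default (as the ignoreCase default suggests).

```python
import re
from typing import Sequence


def connection_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Return True if the connection name passes the allow/deny regexes."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a deny match always wins over an allow match.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


allow = ["^prod-.*"]
deny = ["^sandbox-.*"]
for connection in ["prod-snowflake", "PROD-mysql", "sandbox-postgres", "dev-mysql"]:
    print(connection, connection_allowed(connection, allow, deny))
```

A connection that matches neither list, such as dev-mysql here, is skipped because the allow list no longer contains a catch-all pattern.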

Confidence Filtering

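Classification findings carry a BigID confidence rank, which the connector maps to a numeric score: HIGH = 0.75, MEDIUM = 0.50, LOW = 0.25 (unknown ranks = 0.0). Set minimum_confidence_threshold (0.0–1.0) to drop findings whose score falls below the threshold; the default of 0.0 keeps every finding.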
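
As a rough illustration of the filtering, the sketch below applies the rank-to-score mapping to a few made-up findings. The keep_findings function and the finding dictionaries are hypothetical and only mirror the documented mapping; the real connector's data structures will differ. It assumes a finding is kept when its score is at or above the threshold.

```python
from typing import Dict, List

# Rank-to-score mapping as documented; unknown ranks score 0.0.
RANK_SCORES: Dict[str, float] = {"HIGH": 0.75, "MEDIUM": 0.50, "LOW": 0.25}


def rank_to_score(rank: str) -> float:
    return RANK_SCORES.get(rank, 0.0)


def keep_findings(
    findings: List[Dict[str, str]], minimum_confidence_threshold: float = 0.0
) -> List[Dict[str, str]]:
    """Drop findings whose score is below the threshold (assumed inclusive)."""
    return [
        finding
        for finding in findings
        if rank_to_score(finding["rank"]) >= minimum_confidence_threshold
    ]


# Hypothetical findings for illustration only.
findings = [
    {"column": "email", "rank": "HIGH"},
    {"column": "phone", "rank": "MEDIUM"},
    {"column": "notes", "rank": "LOW"},
    {"column": "misc", "rank": "UNRATED"},
]
kept = keep_findings(findings, minimum_confidence_threshold=0.5)
print([finding["column"] for finding in kept])
```

Under these assumptions a threshold of 0.5 keeps HIGH and MEDIUM findings and drops LOW and unknown ranks.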

Domain Handling

BigID domain/sub_domain values are mapped into DataHub according to domain_mode:

  • none (default) — domain values are stored in customProperties only; no domain entities are created.
  • auto_namespaced — one urn:li:domain entity is auto-created per BigID domain/sub-domain, keyed deterministically by name (the human-readable label is carried on domainProperties.name).
  • config_map — BigID domain values are mapped to pre-existing DataHub domain URNs via domain_mapping.

In auto_namespaced mode the generated domain GUID is scoped by env and platform_instance, mirroring how datasets and data products are separated. The same domain name under different env or platform_instance values resolves to distinct urn:li:domain entities. To share a domain across two BigID ingestions, give them the same env and platform_instance; to keep them separate, vary either. config_map mode is unaffected (URNs come from your domain_mapping).
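
The scoping rule can be pictured with a small sketch of a deterministic domain identifier. The key fields, the JSON encoding, and the MD5 hash below are assumptions chosen for illustration; the connector's actual GUID inputs are not specified here, so treat domain_urn as hypothetical.

```python
import hashlib
import json
from typing import Optional


def domain_urn(name: str, env: str, platform_instance: Optional[str] = None) -> str:
    """Hypothetical: derive a stable domain URN from name, env and instance."""
    key = {"name": name, "env": env, "platform_instance": platform_instance}
    # Sorted keys and fixed separators keep the encoding stable across runs.
    encoded = json.dumps(key, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "urn:li:domain:" + hashlib.md5(encoded).hexdigest()


shared_a = domain_urn("Finance", env="PROD", platform_instance="emea")
shared_b = domain_urn("Finance", env="PROD", platform_instance="emea")
other_env = domain_urn("Finance", env="DEV", platform_instance="emea")
other_instance = domain_urn("Finance", env="PROD", platform_instance="apac")
print(shared_a == shared_b, shared_a == other_env, shared_a == other_instance)
```

Two ingestions that agree on env and platform_instance land on the same domain, while changing either value yields a separate one.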

Install the Plugin

pip install 'acryl-datahub[bigid]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: bigid
  config:
    # Coordinates
    bigid_url: "https://bigid.example.com"

    # Credentials: provide either user_token (recommended) or access_token
    user_token: "${BIGID_USER_TOKEN}"
    # access_token: "${BIGID_ACCESS_TOKEN}"

    # HTTP behaviour (optional)
    # timeout: 60
    # max_retries: 3

    # Environment applied to generated dataset URNs
    env: PROD

    # Scope ingestion to a subset of BigID connections (regex allow/deny)
    # connection_pattern:
    #   allow:
    #     - "^prod-.*"
    #   deny:
    #     - "^sandbox-.*"

    # Per-connection platform / env / instance overrides (and mappings for
    # connection types without a built-in platform mapping)
    # datasource_platform_mapping:
    #   my-snowflake-conn:
    #     platform: snowflake
    #     env: PROD
    #     platform_instance: prod-account

    # Dataset creation (opt-in). Default false = pure enrichment mode.
    # create_datasets: false

    # Classification findings
    # minimum_confidence_threshold: 0.0 # HIGH=0.75, MEDIUM=0.50, LOW=0.25
    # confidence_level_tag: false # also emit urn:li:tag:bigid.confidence:{LEVEL}

    # Tags
    # sync_tags: true
    # tag_application_types: ["sensitivityClassification", "risk", "userDefined"]

    # Business glossary / classifiers / IDSoR
    # sync_unlinked_classifiers: true
    # sync_idsor: true
    # sync_unstructured_enrichment: false
    # owner_type: user # user | group | none
    # domain_mode: none # none | auto_namespaced | config_map

    # Stateful ingestion: removes stale entities emitted by this source
    # stateful_ingestion:
    #   enabled: true

sink:
  # sink configs

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Description |
| --- | --- |
| bigid_url (string) | Base URL of the BigID instance (e.g. 'https://bigid.example.com'). |
| access_token (One of string(password), null) | Short-lived BigID session token, used directly without the startup exchange and NOT auto-refreshed, so a run that outlives it fails. Intended for one-off runs or SSO-only tenants where a service-account user_token cannot be created; prefer user_token for scheduled ingestion. Provide either this or user_token; if both are set, user_token is used and this value is ignored. Default: None |
| confidence_level_tag (boolean) | Emit urn:li:tag:bigid.confidence:{LEVEL} alongside each GlossaryTerm on a column. Lossy (can't tie level to a specific term when multiple exist), but visible in DataHub UI. Default: False |
| create_datasets (boolean) | If True, emit DatasetProperties + SchemaMetadata for datasets not yet in DataHub. Default False (pure enrichment mode: never emits structural aspects). Default: False |
| domain_mapping (map(str,string)) | |
| domain_mode (Enum) | One of: "none", "auto_namespaced", "config_map" |
| max_retries (integer) | Maximum number of retries for transient errors. Default: 3 |
| minimum_confidence_threshold (number) | Filter column classification findings below this confidence level. Accepts 0.0–1.0 (not a rank string). BigID ranks map to: HIGH = 0.75, MEDIUM = 0.50, LOW = 0.25 (unknown ranks = 0.0). Default: 0.0 |
| owner_type (Enum) | One of: "user", "group", "none" |
| platform_instance (One of string, null) | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
| risk_score_structured_property_urn (string) | URN of the StructuredProperty used for riskScore values. Default: urn:li:structuredProperty:bigid.riskScore |
| sync_idsor (boolean) | Emit GlossaryTerms for IDSoR (Identity Source of Record) attribute findings from BigID's correlation engine. IDSoR findings are separate from classifier findings and only appear when a Correlation Set is configured and enabled in the scan profile. When the attribute links to an existing Business Glossary term (via glossaryId), that term is reused. Otherwise an auto-generated term is created under a dedicated 'bigid.idsor' GlossaryNode. Term URNs are deterministic GUIDs keyed on the attribute identity. Default: True |
| sync_tags (boolean) | Emit BigID tags as DataHub Tag entities. Default: True |
| sync_unlinked_classifiers (boolean) | Emit GlossaryTerms for classifier findings that have no Business Glossary linkage in BigID. Terms are auto-generated on demand (only when a column finding references the classifier) and placed under the same 'bigid' root GlossaryNode. Term URNs are deterministic GUIDs keyed on the classifier identity. Default: True |
| sync_unstructured_enrichment (boolean) | Emit dataset-level GlossaryTerms and DatasetProfile for unstructured and email sources (SharePoint, Google Drive, O365, Kafka, AI models, etc.) using the attribute_details field returned by BigID's catalog API. Only applies to objects where BigID has classification findings (attribute_details non-empty). Controlled by the same sync_unlinked_classifiers and sync_idsor flags as structured enrichment. Default: False |
| timeout (integer) | HTTP request timeout in seconds. Default: 60 |
| user_token (One of string(password), null) | Recommended auth. Long-lived BigID user token, generated under Administration → Access Management → System Users (Save the user after generating so the token activates). Exchanged for a short-lived session token at startup and auto-refreshed on expiry, so it is safe for scheduled ingestion. Provide the raw token with no 'Bearer' prefix. Provide either this or access_token; if both are set, user_token takes precedence because it can auto-refresh. Default: None |
| env (string) | The environment that all assets produced by this connector belong to. Default: PROD |
| connection_pattern (AllowDenyPattern) | A class to store allow deny regexes |
| connection_pattern.ignoreCase (One of boolean, null) | Whether to ignore case sensitivity during pattern matching. Default: True |
| dataset_pattern (AllowDenyPattern) | A class to store allow deny regexes |
| dataset_pattern.ignoreCase (One of boolean, null) | Whether to ignore case sensitivity during pattern matching. Default: True |
| datasource_platform_mapping (map(str,ConnectionPlatformConfig)) | Per-connection platform override for a single BigID data source. |
| datasource_platform_mapping.key.platform (string) | DataHub platform name (e.g. 'snowflake', 'mysql'). |
| datasource_platform_mapping.key.convert_urns_to_lowercase (One of boolean, null) | Override dataset-name casing for this connection so BigID's enrichment URN byte-matches the one the native connector emitted. When set, forces (true) or disables (false) lowercasing of the dataset-name segment. Leave unset to use the built-in per-platform default (Snowflake, BigQuery and Redshift are lowercased). Set false for, e.g., a Snowflake connection ingested with convert_urns_to_lowercase: false. Default: None |
| datasource_platform_mapping.key.platform_instance (One of string, null) | DataHub platform instance identifier for this connection. Default: None |
| datasource_platform_mapping.key.env (One of string, null) | Environment override for this connection (e.g. 'PROD', 'DEV'). Falls back to top-level env if not set. Default: None |
| item_types (array) | Allow-list of BigID item types to sync. OOTB Personal Data Items are always included regardless of this filter. |
| item_types.string (string) | |
| tag_application_types (array) | BigID applicationType values to sync as tags. |
| tag_application_types.string (string) | |
| stateful_ingestion (One of StatefulStaleMetadataRemovalConfig, null) | Default: None |
| stateful_ingestion.enabled (boolean) | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False. Default: False |
| stateful_ingestion.fail_safe_threshold (number) | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
| stateful_ingestion.remove_stale_metadata (boolean) | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |

Capabilities

  • Business glossary sync — BigID business glossary items become GlossaryTerms under a BigID root GlossaryNode, with domains and ownership optionally attached.
  • Column classification — classification findings are emitted as GlossaryTerms on schema fields, carrying MetadataAttribution that records the classifier, confidence level, and finding counts. Classifiers not linked to a business glossary item are auto-generated under a BigID > Classifier node (controlled by sync_unlinked_classifiers).
  • IDSoR correlation — Identity Source of Record findings are resolved via a three-path strategy: reuse a linked business glossary term, auto-generate a term under a BigID > IDSoR node, or synthesize one from the raw attribute name.
  • Tags and risk score — OBJECT-scoped BigID tags become DataHub Tags; risk scores are written to the bigid.riskScore structured property.
  • Non-destructive enrichment — tags, glossary terms, schema-field annotations, and the risk score are applied to existing datasets via PATCH (merge) semantics, so BigID metadata is added alongside — never overwriting — tags and terms that stewards curate in the DataHub UI.
  • No placeholder datasets — in pure-enrichment mode (create_datasets: false) dataset aspects are emitted as non-primary, so BigID never materializes (or later soft-deletes) a dataset that a native connector has not already created. Enable create_datasets to have BigID own and create datasets it scans.
  • Profiling — column-level statistics from BigID columnProfile data are emitted as Dataset Profiles.
  • Stateful ingestion — enables automatic removal of entities emitted by this source when they disappear from BigID.

Limitations

Enrichment adds are not retracted

Because enrichment is applied additively (PATCH), removing a classification, tag, or glossary link in BigID does not remove the previously-added term or tag from an existing DataHub dataset on the next run — the PATCH only adds. Stateful ingestion removes entities this connector owns (e.g. glossary terms/nodes it created), but it does not retract annotations merged onto datasets owned by a native connector. Remove such annotations in the DataHub UI if needed.
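
The effect of add-only merging can be shown with a toy model. The patch_add function and the term URNs below are hypothetical and stand in for DataHub's PATCH handling; the point is only that a merge which adds has no way to express a removal.

```python
from typing import List


def patch_add(existing: List[str], incoming: List[str]) -> List[str]:
    """Toy add-only merge: keep everything already present, append new items."""
    merged = list(existing)
    for urn in incoming:
        if urn not in merged:
            merged.append(urn)
    return merged


# Hypothetical term URNs for illustration.
steward_term = "urn:li:glossaryTerm:steward.pii"
bigid_term = "urn:li:glossaryTerm:bigid.email"

# Run 1: BigID reports a finding, which is merged next to the steward's term.
after_run_1 = patch_add([steward_term], [bigid_term])

# Run 2: the finding was removed in BigID, so nothing is sent for this column.
after_run_2 = patch_add(after_run_1, [])

print(after_run_1)
print(after_run_2)
```

The steward's term survives both runs, which is the intended benefit, and the BigID term also survives the second run, which is the limitation described above.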

dataPlatformInstance is only emitted when configured

To avoid overwriting the platform instance already set by a native connector (e.g. Snowflake, BigQuery), the dataPlatformInstance aspect is emitted only when a platform_instance is explicitly configured for the connection.

Troubleshooting

Datasets are not being enriched

The connector matches BigID objects to existing DataHub dataset URNs. If enrichment does not appear, verify that the resolved platform, env, platform_instance, and URN casing produce a URN that matches the one created by your native connector. Use datasource_platform_mapping to align them — including convert_urns_to_lowercase when the native connector's casing differs from BigID's per-platform default.
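
When debugging a mismatch, it can help to write out the URN each side would produce. The sketch below assembles a dataset URN in the usual DataHub shape using only string formatting; the dataset_urn function is hypothetical, and the assumptions are that lowercasing applies to the dataset name only and that the platform instance is prefixed to the name with a dot.

```python
from typing import Optional

# Platforms the text says are lowercased by default.
LOWERCASE_BY_DEFAULT = {"snowflake", "bigquery", "redshift"}


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
    convert_urns_to_lowercase: Optional[bool] = None,
) -> str:
    """Hypothetical URN builder for comparing against a native connector."""
    if convert_urns_to_lowercase is None:
        lowercase = platform in LOWERCASE_BY_DEFAULT
    else:
        lowercase = convert_urns_to_lowercase
    if lowercase:
        name = name.lower()
    if platform_instance:
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


default_urn = dataset_urn("snowflake", "SALES.PUBLIC.ORDERS")
preserved_urn = dataset_urn(
    "snowflake", "SALES.PUBLIC.ORDERS", convert_urns_to_lowercase=False
)
instance_urn = dataset_urn(
    "snowflake", "SALES.PUBLIC.ORDERS", platform_instance="prod-account"
)
print(default_urn)
print(preserved_urn)
print(instance_urn)
```

If the URN shown for a dataset in DataHub differs from the one derived this way in platform, env, instance prefix, or casing, that difference is the setting to adjust in datasource_platform_mapping.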

No enrichment applied at all

If both the business glossary and classification map fail to load, the connector reports a failure and emits nothing. Check BigID API connectivity and that the token has read access to the catalog, classifications, and glossary APIs.

Unknown connection type

When a BigID connection type has no built-in platform mapping, the raw type is used as the platform in URNs and a warning is reported. Add an entry to datasource_platform_mapping to map it to the correct DataHub platform.
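
The fallback order described here can be sketched as a small lookup. The resolve_platform function is hypothetical, the built-in table lists only the two mappings mentioned on this page, and the assumption is that a per-connection override is consulted before the built-in mapping.

```python
from typing import Dict, Optional, Tuple

# Only the two example mappings named in this page; the real table is larger.
BUILT_IN_PLATFORMS: Dict[str, str] = {
    "rdb-postgresql": "postgres",
    "snowflake": "snowflake",
}


def resolve_platform(
    connection_name: str,
    connection_type: str,
    datasource_platform_mapping: Dict[str, Dict[str, str]],
) -> Tuple[str, Optional[str]]:
    """Return (platform, warning); warning is None when a mapping was found."""
    override = datasource_platform_mapping.get(connection_name, {})
    if "platform" in override:
        return override["platform"], None
    if connection_type in BUILT_IN_PLATFORMS:
        return BUILT_IN_PLATFORMS[connection_type], None
    # Fall back to the raw type and surface a warning.
    warning = f"No built-in platform mapping for connection type '{connection_type}'"
    return connection_type, warning


print(resolve_platform("pg-main", "rdb-postgresql", {}))
print(resolve_platform("legacy", "custom-db", {}))
print(resolve_platform("legacy", "custom-db", {"legacy": {"platform": "mysql"}}))
```

The third call shows the fix: once the connection has an entry in datasource_platform_mapping, the raw type is no longer used and no warning is produced.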

Code Coordinates

  • Class Name: datahub.ingestion.source.bigid.bigid_source.BigIDSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for BigID, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.