
Fabric Data Factory

Overview

Microsoft Fabric Data Factory is a cloud-based data integration service within the Microsoft Fabric platform. Learn more in the official Microsoft Fabric Data Factory documentation.

The DataHub integration for Fabric Data Factory covers pipeline and orchestration entities such as workspaces, data pipelines, and activities. Depending on module capabilities, it can also capture features such as lineage, execution history, platform instance mapping, and stateful deletion detection.

Concept Mapping

| Fabric Data Factory Concept | DataHub Entity | Notes |
|---|---|---|
| Workspace | Container (subtype: Fabric Workspace) | Top-level organizational unit |
| Data Pipeline | DataFlow | Orchestration pipeline containing activities |
| Activity | DataJob | Individual task within a pipeline (Copy, Lookup, Spark, etc.) |
| Pipeline Run | DataProcessInstance | Execution record for a pipeline run |
| Activity Run | DataProcessInstance | Execution record for an individual activity within a pipeline |
| Connection | (resolved to external Dataset) | Used for lineage resolution to datasets on external platforms |

Hierarchy Structure

Platform (fabric-data-factory)
└── Workspace (Container)
└── Data Pipeline (DataFlow)
└── Activity (DataJob)
├── Pipeline Run (DataProcessInstance)
└── Activity Run (DataProcessInstance)

Module fabric-data-factory

Support Status: Testing

Important Capabilities

| Capability | Notes |
|---|---|
| Asset Containers | Enabled by default. |
| Detect Deleted Entities | Optionally enabled via stateful_ingestion config. |
| Platform Instance | Enabled by default. |
| Table-Level Lineage | Enabled by default via Copy and InvokePipeline activities. |

Overview

The fabric-data-factory module ingests metadata from Microsoft Fabric Data Factory into DataHub. It extracts workspaces, data pipelines, activities, and execution history, and resolves lineage from Copy activities to external datasets.

Quick Start
  1. Set up authentication — Configure Azure credentials (see Prerequisites)
  2. Enable API access — Ensure a Fabric admin has enabled service principal API access (if using SP or managed identity)
  3. Grant permissions — Add your identity as a workspace Contributor (required for pipeline definitions and lineage)
  4. Configure recipe — Use fabric-data-factory_recipe.yml as a template
  5. Run ingestion — Execute datahub ingest -c fabric-data-factory_recipe.yml

Key Features

  • Workspaces as containers, data pipelines as DataFlows (DataHub entity type), activities as DataJobs
  • Dataset-level lineage from Copy and InvokePipeline activities
  • Pipeline and activity execution history as DataProcessInstances
  • Cross-recipe lineage via platform_instance_map for connecting to externally ingested datasets
  • Pattern-based filtering for workspaces and pipelines
  • Stateful ingestion for stale entity removal
  • Multiple authentication methods (Service Principal, Managed Identity, Azure CLI, DefaultAzureCredential)

References

Azure Authentication

Fabric Data Factory Concepts

Prerequisites

Required Permissions

The connector requires the Contributor role on each workspace; without it, pipeline definitions cannot be fetched. With only the Reader role, the connector can list workspaces and pipelines but cannot extract pipeline activities, activity run details, or lineage.

Delegated (on behalf of a user) authentication

If using delegated auth (e.g., Azure CLI), the signed-in user's existing Fabric permissions apply directly. The connector requires the following delegated scopes:

  • Workspace.Read.All or Workspace.ReadWrite.All — for listing workspaces and items
  • Item.ReadWrite.All or DataPipeline.ReadWrite.All — for Get Item Definition, List Item Connections, and Query Activity Runs (Item.Read.All is not sufficient for definitions and connections)
  • Item.Read.All or DataPipeline.Read.All — sufficient for List Item Job Instances (execution history)

The Azure CLI token includes the necessary Fabric API scopes by default.

Service Principal and Managed Identity authentication

Service principals and managed identities do not inherit any permissions by default. You need to:

  1. Enable API access: A Fabric admin must enable the service principal tenant settings (see Fabric Admin Settings below)
  2. Grant workspace access: Add the SP or MI as a workspace Contributor for each workspace you want to ingest

Fabric Admin Settings

danger

For service principal and managed identity authentication, a Fabric administrator must enable API access for service principals in the Fabric admin portal. Without this, API calls will fail with 401 errors even if workspace permissions are correctly assigned.

As of mid-2025, Microsoft split the original single tenant setting into two separate settings. Configure them as follows:

  1. Go to the Fabric Admin Portal > Tenant settings
  2. Under Developer settings, enable the applicable setting(s):
    • Service principals can call Fabric public APIs — Controls access to CRUD APIs protected by the Fabric permission model (e.g., reading workspaces and items). This is enabled by default for new tenants since August 2025.
    • Service principals can create workspaces, connections, and deployment pipelines — Controls access to global APIs not protected by Fabric permissions. This is disabled by default. Enable only if needed.
  3. Restrict access to a dedicated security group containing only the service principals that need API access. This is the recommended approach.
tip

If you are on an older tenant where the legacy single setting Service principals can use Fabric APIs is still visible, enable that instead. It will be automatically migrated to the two new settings.

tip

Tenant setting changes can take up to 15 minutes to propagate. If you receive 401 errors immediately after enabling, wait and retry.

For detailed instructions, see Developer admin settings and Identity support for Fabric REST APIs.

Authentication

The connector supports four authentication methods via the shared credential config block. All methods use Azure's TokenCredential interface.

Service Principal

Register an application in Microsoft Entra ID and note the client_id, client_secret, and tenant_id. Then:

  1. Ensure the Fabric admin has enabled service principal API access (see Fabric Admin Settings above)
  2. Create a security group in Entra ID and add the service principal as a member
  3. Add the security group as Contributor in each target workspace (Contributor role grants access to pipeline definitions and item connections for lineage)
credential:
  authentication_method: service_principal
  client_id: ${AZURE_CLIENT_ID}
  client_secret: ${AZURE_CLIENT_SECRET}
  tenant_id: ${AZURE_TENANT_ID}

All three fields are required when using this method.

Managed Identity (for Azure-hosted deployments)

Use this when running DataHub ingestion on an Azure VM, AKS, App Service, or other Azure compute that supports managed identities. The managed identity must be added as a workspace Contributor in Fabric. A Fabric admin must also enable the tenant settings described in Fabric Admin Settings above — these settings govern API access for both service principals and managed identities, despite the setting name referencing only service principals.

# System-assigned managed identity (no additional config needed)
credential:
  authentication_method: managed_identity

For user-assigned managed identity, provide the client ID:

credential:
  authentication_method: managed_identity
  managed_identity_client_id: "<your-managed-identity-client-id>"

Azure CLI (for local development and testing)

Uses the credentials from your local az login session. The signed-in user's existing Fabric permissions apply directly — no additional setup needed beyond workspace access.

credential:
  authentication_method: cli

Run az login before starting ingestion. For remote servers without a browser, use az login --use-device-code.

DefaultAzureCredential (flexible auto-detection)

Uses Azure's DefaultAzureCredential chain, which tries multiple credential sources in order: environment variables, workload identity, managed identity, shared token cache, Azure CLI, Azure PowerShell, Azure Developer CLI, and more.

credential:
  authentication_method: default

You can exclude specific credential sources from the chain to speed up detection or avoid unintended auth in mixed environments:

credential:
  authentication_method: default
  exclude_cli_credential: true  # Skip Azure CLI (recommended in production)
  exclude_environment_credential: false
  exclude_managed_identity_credential: false

Setup

  1. Choose an authentication method from above and configure the credential block.
  2. If using service principal or managed identity:
    • Ensure the Fabric admin has enabled the appropriate developer settings (see Fabric Admin Settings)
    • Create a security group, add your identity, and grant Contributor on target workspaces
  3. If using Azure CLI, run az login (or az login --use-device-code on remote servers).
  4. Configure the ingestion recipe with optional workspace and pipeline filters (see the filtering sketch below).
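
As referenced in step 4, here is a minimal sketch of the optional name filters. The regex values are illustrative (e.g., keep only workspaces prefixed with prod- and drop pipelines ending in _test); adjust them to your own naming conventions:

source:
  type: fabric-data-factory
  config:
    workspace_pattern:
      allow:
        - "^prod-.*"   # illustrative: only workspaces starting with "prod-"
    pipeline_pattern:
      deny:
        - ".*_test$"   # illustrative: skip pipelines ending in "_test"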

Install the Plugin

pip install 'acryl-datahub[fabric-data-factory]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

# Example recipe for Fabric Data Factory source
# See README.md for full configuration options

source:
  type: fabric-data-factory
  config:
    # Authentication (using service principal)
    credential:
      authentication_method: service_principal
      client_id: ${AZURE_CLIENT_ID}
      client_secret: ${AZURE_CLIENT_SECRET}
      tenant_id: ${AZURE_TENANT_ID}

    # Optional: Filter workspaces by name pattern
    workspace_pattern:
      allow:
        - ".*"  # Allow all workspaces by default
      deny: []

    # Optional: Filter pipelines by name pattern
    pipeline_pattern:
      allow:
        - ".*"  # Allow all pipelines by default
      deny: []

    # Feature flags
    extract_pipelines: true
    include_lineage: true
    include_execution_history: true
    execution_history_days: 7  # 1-90 days

    # Optional: Map Fabric connection names to platform instances for accurate lineage
    # platform_instance_map:
    #   "my-snowflake-connection": "prod_snowflake"
    #   "my-bigquery-connection": "analytics_project"

    # Optional: Platform instance for this Fabric Data Factory connector
    # platform_instance: "my-fabric-tenant"

    # Environment
    env: PROD

    # Optional: Stateful ingestion for stale entity removal
    # stateful_ingestion:
    #   enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: ${DATAHUB_GMS_TOKEN}

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
api_timeout
integer
Timeout for REST API calls in seconds.
Default: 30
execution_history_days
integer
Number of days of execution history to extract. Only used when include_execution_history is True. Higher values increase ingestion time. Note: Fabric API returns at most 100 recently completed runs per pipeline.
Default: 7
extract_pipelines
boolean
Whether to extract Data Pipelines and their activities.
Default: True
include_execution_history
boolean
Extract pipeline and activity execution history as DataProcessInstance. Includes run status, duration, and parameters. Enables lineage extraction from parameterized activities using actual runtime values.
Default: True
include_lineage
boolean
Extract lineage from activity inputs/outputs. Maps Fabric connections to DataHub datasets based on connection type.
Default: True
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
platform_instance_map
map(str,string)
Mapping from Fabric connection name to the platform_instance used when ingesting that source with another recipe; used to connect lineage to existing datasets (see Lineage Extraction below).
env
string
The environment that all assets produced by this connector belong to
Default: PROD
credential
AzureCredentialConfig
Unified Azure authentication configuration.

This class provides a reusable authentication configuration that can be
composed into any Azure connector's configuration. It supports multiple
authentication methods and returns a TokenCredential that works with
any Azure SDK client.

Example usage in a connector config:
class MyAzureConnectorConfig(ConfigModel):
    credential: AzureCredentialConfig = Field(
        default_factory=AzureCredentialConfig,
        description="Azure authentication configuration"
    )
    subscription_id: str = Field(...)
credential.authentication_method
Enum
One of: "default", "service_principal", "managed_identity", "cli"
credential.client_id
One of string, null
Azure Application (client) ID. Required for service_principal authentication. Find this in Azure Portal > App registrations > Your app > Overview.
Default: None
credential.client_secret
One of string(password), null
Azure client secret. Required for service_principal authentication. Create in Azure Portal > App registrations > Your app > Certificates & secrets.
Default: None
credential.exclude_cli_credential
boolean
When using 'default' authentication, exclude Azure CLI credential. Useful in production to avoid accidentally using developer credentials.
Default: False
credential.exclude_environment_credential
boolean
When using 'default' authentication, exclude environment variables. Environment variables checked: AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID.
Default: False
credential.exclude_managed_identity_credential
boolean
When using 'default' authentication, exclude managed identity. Useful during local development when managed identity is not available.
Default: False
credential.managed_identity_client_id
One of string, null
Client ID for user-assigned managed identity. Leave empty to use system-assigned managed identity. Only used when authentication_method is 'managed_identity'.
Default: None
credential.tenant_id
One of string, null
Azure tenant (directory) ID. Required for service_principal authentication. Find this in Azure Portal > Microsoft Entra ID > Overview.
Default: None
pipeline_pattern
AllowDenyPattern
Allow/deny regex patterns for filtering data pipelines by name.
pipeline_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
workspace_pattern
AllowDenyPattern
Allow/deny regex patterns for filtering workspaces by name.
workspace_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Configuration for stateful ingestion and stale entity removal. When enabled, tracks ingested entities and removes those that no longer exist in Fabric (see the example after this table).
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Guards against accidental source configuration changes: if the relative change in entity count compared to the previous state exceeds this percentage, soft deletes are prevented and the new state is not committed.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
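
A minimal sketch of enabling stateful ingestion, as referenced in the stateful_ingestion field above. The pipeline name is hypothetical, and most of the source config is elided; see the Starter Recipe for a complete example:

pipeline_name: fabric_data_factory_prod  # hypothetical name; identifies the state across runs
source:
  type: fabric-data-factory
  config:
    # ... credential, filters, and feature flags as in the Starter Recipe ...
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"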

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Lineage Extraction

Which Activities Produce Lineage?

The connector extracts dataset-level lineage from these Fabric activity types:

| Activity Type | Lineage Behavior |
|---|---|
| Copy | Creates lineage from input dataset(s) to output dataset |
| InvokePipeline | Creates pipeline-to-pipeline lineage to the child pipeline |

Lineage is enabled by default (include_lineage: true).

How Lineage Resolution Works

For lineage to connect properly to datasets ingested from other sources (e.g., Snowflake, BigQuery), the connector resolves Fabric connections to DataHub platforms.

Step 1: Automatic Connection Mapping

The connector automatically maps Fabric connection types to DataHub platforms (e.g., a Snowflake connection maps to the snowflake platform). See FABRIC_CONNECTION_PLATFORM_MAP for the full list of supported mappings. Unsupported connection types fall back to using the connection type string as the platform name.
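
As an illustration (not literal connector output), a Copy activity reading through a Snowflake connection would be linked to a Snowflake dataset URN of the general form below, with the dataset name and env depending on the source and your recipe:

urn:li:dataset:(urn:li:dataPlatform:snowflake,<database>.<schema>.<table>,PROD)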

Step 2: Platform Instance Mapping (for cross-recipe lineage)

If you're ingesting the same data sources with other DataHub connectors (e.g., Snowflake, BigQuery), you need to ensure the platform_instance values match. Use platform_instance_map to map your Fabric connection names to the platform instance used in your other recipes:

# Fabric Data Factory Recipe
source:
  type: fabric-data-factory
  config:
    credential:
      authentication_method: service_principal
      client_id: ${AZURE_CLIENT_ID}
      client_secret: ${AZURE_CLIENT_SECRET}
      tenant_id: ${AZURE_TENANT_ID}
    platform_instance_map:
      # Key: Your Fabric connection name (exact match required)
      # Value: The platform_instance from your other source recipe
      "snowflake-prod-connection": "prod_warehouse"
      "bigquery-analytics": "analytics_project"

# Corresponding Snowflake Recipe (platform_instance must match)
source:
  type: snowflake
  config:
    platform_instance: "prod_warehouse"  # Must match the value in platform_instance_map
    # ... other config

Without matching platform_instance values, lineage will create separate dataset entities instead of connecting to your existing ingested datasets.

Execution History

Pipeline and activity runs are extracted as DataProcessInstance entities by default:

source:
  type: fabric-data-factory
  config:
    include_execution_history: true  # default
    execution_history_days: 7  # 1-90 days

This provides run status, duration, timestamps, invoke type, and activity-level details including error messages and retry attempts.

note

The Fabric API returns at most 100 recently completed runs per pipeline. Run ingestion more frequently to capture deeper history.

Advanced: Multi-Tenant Setup

When to Use platform_instance

Use the connector's platform_instance config to distinguish separate Fabric tenants when ingesting from multiple environments:

| Scenario | Risk | Solution |
|---|---|---|
| Single tenant | None | Not needed |
| Multiple tenants | High - name collision risk | Required |

# Multi-tenant example
source:
  type: fabric-data-factory
  config:
    platform_instance: "contoso-tenant"  # Prevents URN collisions
danger

Different Fabric tenants could have identically-named workspaces and pipelines. Use platform_instance to prevent entity overwrites.

URN Format

Pipeline URNs follow this format:

urn:li:dataFlow:(fabric-data-factory,{workspace_id}.{pipeline_id},{env})

With platform_instance:

urn:li:dataFlow:(fabric-data-factory,{platform_instance}.{workspace_id}.{pipeline_id},{env})
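
For example, with platform_instance set to "contoso-tenant" and illustrative workspace and pipeline GUIDs, a pipeline URN would look like:

urn:li:dataFlow:(fabric-data-factory,contoso-tenant.0f8fad5b-d9cb-469f-a165-70867728950e.7c9e6679-7425-40de-944b-e07fc1f90ae7,PROD)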

Limitations

  • Run history limit: The Fabric API returns at most 100 recently completed runs per pipeline. If execution_history_days covers more runs than this limit, only the most recent 100 are returned. Run ingestion more frequently to capture deeper history.
  • No Dataflow Gen2 support: Dataflow Gen2 items (standalone workspace-level items with transformation logic) are not extracted.
  • No CopyJob support: Standalone CopyJob items at the workspace level are not extracted. Only Copy activities embedded within pipelines produce lineage.
  • No trigger/schedule metadata: Pipeline triggers and schedules are not extracted.
  • ExecutePipeline not supported: The ExecutePipeline activity type is marked as legacy in Fabric and is not supported for cross-pipeline lineage.

Lineage

  • Lineage scope: Only Copy and InvokePipeline activities produce dataset or pipeline lineage. Other activity types (Lookup, Wait, ForEach, Script, etc.) are ingested as DataJobs without dataset-level lineage.
  • InvokePipeline Activity operation types: Only the InvokeFabricPipeline operation type is supported for cross-pipeline lineage. Other operation types (InvokeAdfPipeline, InvokeExternalPipeline) are not resolved and will be skipped.
  • Query-based Copy sources: When a Copy activity uses sqlReaderQuery or sqlReaderStoredProcedureName instead of a direct table reference, lineage is not extracted.
  • No column-level lineage: The connector extracts dataset-level lineage only. Column-to-column mappings from Copy activity translator configurations are not extracted.
  • No Notebook/SparkJobDefinition lineage: Notebook and SparkJobDefinition activities are ingested as DataJobs but their lineage is not resolved.
  • Connection resolution: Unmapped connection types fall back to using the connection type string as the platform name, which may not match your existing DataHub platform names. Use platform_instance_map to explicitly map connection names.

Troubleshooting

  • 401/403 errors: Ensure the service principal has the correct Fabric API permissions and is added as a workspace member.
  • Empty results: Check that workspace_pattern and pipeline_pattern are not filtering out all items.
  • Missing lineage: Verify that include_lineage: true is set and that Fabric connections are properly configured for the pipelines. Also review the Lineage limitations section for unsupported activity types and scenarios.
  • Stale entities: Enable stateful_ingestion to automatically remove entities that no longer exist in Fabric.

Code Coordinates

  • Class Name: datahub.ingestion.source.fabric.data_factory.source.FabricDataFactorySource
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Fabric Data Factory, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.