Informatica

Overview

Informatica Intelligent Data Management Cloud (IDMC) is a cloud-native data integration and management platform. Learn more in the official Informatica documentation.

The DataHub integration for Informatica models projects and folders as containers, Mapping Tasks as DataFlows (each with a transform DataJob), and Taskflows as DataFlows with a single orchestrate DataJob that chains the steps via inputDatajobs. It resolves table-level lineage across the data estate from each mapping's source/target connections, and also supports ownership extraction and stateful deletion detection.

Concept Mapping

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| "informatica" | Data Platform | |
| Project | Container | SubType "Project" |
| Folder | Container | SubType "Folder" |
| Taskflow | DataFlow + one orchestrate DataJob | SubTypes "Taskflow" / "Taskflow Orchestration"; the orchestrate sits at the end of the chain with inputDatajobs = [last MT] |
| Mapping Task | DataFlow + inner transform DataJob | SubTypes "Mapping Task" / "Task Logic"; MTs chain to each other via inputDatajobs in Taskflow step order |
| Mapping | Not emitted as a standalone entity | Only Mapping Tasks (runnable schedules) are emitted; the Mapping reference is surfaced via customProperties on the Task |
| Mapplet | Not emitted | Internal sub-mappings included in other mappings; skipped |
| Source/Target | Dataset | Upstream/downstream lineage; external dataset URNs receive a minimal Status stub so they resolve in lineage search |

Module informatica

Incubating

Important Capabilities

| Capability | Notes |
|---|---|
| Asset Containers | Projects and folders as containers. |
| Detect Deleted Entities | Via stateful ingestion. |
| Extract Ownership | From IDMC object createdBy/updatedBy. |
| Extract Tags | IDMC object tags emitted as DataHub GlobalTags. |
| Platform Instance | Enabled by default. |
| Table-Level Lineage | Table-level lineage via v3 Export API. |
| Test Connection | Enabled by default. |

Overview

The informatica module ingests metadata from Informatica Cloud (IDMC) into DataHub. It extracts projects, folders, Mapping Tasks, and Taskflows, and resolves table-level lineage from the Mapping each Task references. Standalone Mappings (ones without a Mapping Task) and Mapplets are not emitted.

Quick Start
  1. Create a service account — Use a dedicated IDMC user with minimum permissions (see Required Permissions)
  2. Identify your pod URL — Determine the IDMC regional login URL (US, US2, EMEA, or APAC)
  3. Configure recipe — Use informatica_recipe.yml as a template (a minimal sketch follows these steps)
  4. Run ingestion — Execute datahub ingest -c informatica_recipe.yml
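
A minimal informatica_recipe.yml looks like the sketch below (credentials supplied via environment variables; see the Starter Recipe further down for filtering, lineage, and performance options):

source:
  type: informatica
  config:
    login_url: "https://dm-us.informaticacloud.com"  # your pod's regional URL
    username: "${IDMC_USERNAME}"
    password: "${IDMC_PASSWORD}"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"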

Key Features

  • Projects and folders as Containers
  • Mapping Tasks as DataFlows with a transform DataJob each; Taskflows as DataFlows with one orchestrate DataJob that chains the MTs in step order
  • Table-level lineage (source → mapping → target) resolved via the v3 Export API and connection metadata; Mapping Tasks chain to each other in Taskflow step order and the Taskflow orchestrate DataJob anchors the end of the chain
  • Three-layer filtering: tag-based (recommended for large orgs), project/folder pattern, and mapping/taskflow name pattern
  • Cross-source lineage to datasets ingested by other connectors (Snowflake, Oracle, BigQuery, etc.) via connection type mapping
  • Manual connection type overrides for unusual or custom connectors
  • Stateful ingestion for stale entity removal
  • Ownership extraction from createdBy/updatedBy

Concept Mapping

| IDMC concept | DataHub entity | Subtype |
|---|---|---|
| Project | Container | Project |
| Folder | Container | Folder |
| Taskflow | DataFlow and one orchestrate DataJob | Taskflow / Taskflow Orchestration |
| Mapping Task | DataFlow and one transform DataJob | Mapping Task / Task Logic |
| Mapping | not emitted — see notes | |
| Mapplet | not emitted — see notes | |
| Source/target | Dataset (upstream/downstream lineage) | |

Mapping Tasks are the runnable schedules in IDMC, and that's what we emit as first-class entities. Each MT's inner transform DataJob carries the dataJobInputOutput aspect with the source/target tables resolved from the Mapping it references — so cross-source lineage lands on the thing users actually schedule and operate.

Mappings without a Mapping Task are not emitted (they're not runnable on their own). Mapplets are not emitted either — they're internal sub-mappings included in other mappings. The referenced Mapping's friendly name, v2 id, and v3 GUID are still surfaced as customProperties.mappingName / mappingId / mappingV3Id on every MT so you can cross-reference back to IDMC without leaving DataHub.
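
As an illustration, the cross-reference surfaced on each MT looks roughly like this (the values shown are hypothetical):

customProperties:
  mappingName: "m_orders_to_dwh"                       # hypothetical friendly name
  mappingId: "0001J8"                                  # hypothetical v2 id
  mappingV3Id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # hypothetical v3 GUID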

Taskflow step DAG

The Taskflow step order is resolved from the v3 Export API (.TASKFLOW.xml), parsed from the IDMC taskflowModel <eventContainer> / <service> / <link> graph. All Taskflow GUIDs for a single ingestion run are submitted as one export job for efficiency.

Rather than emitting a separate DataJob per step, the connector collapses step references into the MT they run and chains the MT transform DataJobs directly via dataJobInputOutput.inputDatajobs. A single orchestrate DataJob is emitted per Taskflow and anchored at the end of the chain: inputDatajobs = [last MT], outputDatasets mirrors the last MT's outputs.

The resulting Taskflow lineage reads cleanly end to end:

input_dataset → MT1.transform → MT2.transform → … → MTn.transform → orchestrate → output_dataset

Non-data steps (command / decision / notification / …) don't participate in the chain but are summarized in customProperties.stepSummary on the orchestrate DataJob for auditing.
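
To make the shape concrete, here is a sketch (names and URNs are hypothetical) of how a two-MT Taskflow's chain lands in the dataJobInputOutput aspects of the emitted DataJobs:

mt_stage.transform:
  inputDatasets:  ["urn:li:dataset:(urn:li:dataPlatform:oracle,sales.orders,PROD)"]
mt_publish.transform:
  inputDatajobs:  ["<mt_stage transform DataJob URN>"]
  outputDatasets: ["urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"]
nightly_load.orchestrate:
  inputDatajobs:  ["<mt_publish transform DataJob URN>"]
  outputDatasets: ["urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"]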

Prerequisites

Required Permissions

| Capability | IDMC privilege | Notes |
|---|---|---|
| Authenticate | Any active IDMC user | Uses the v2 login endpoint |
| List projects, folders, taskflows | Asset - read (or the Observer role) | Needed for all container/flow emission |
| List mappings / mapping tasks | Asset - read | Mapping Tasks are optional and skipped with a warning if 403 |
| Extract table-level lineage | Asset - export | Submits v3 export jobs; skip by setting extract_lineage: false |
| List connections | Connection - read | Needed for lineage to resolve to dataset URNs |

Regional login URLs

Set login_url to your IDMC pod's regional URL (not the API runtime URL — the connector discovers that from the login response):

| Region | login_url |
|---|---|
| US | https://dm-us.informaticacloud.com |
| US2 | https://dm2-us.informaticacloud.com |
| EMEA | https://dm-em.informaticacloud.com |
| APAC | https://dm-ap.informaticacloud.com |

Install the Plugin

pip install 'acryl-datahub[informatica]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: informatica
  config:
    # -------------------------------------------------------------------------
    # Connection
    # -------------------------------------------------------------------------

    # Regional login URL for your IDMC pod. The connector discovers the runtime
    # API URL from the login response. Common values:
    #   US    https://dm-us.informaticacloud.com
    #   US2   https://dm2-us.informaticacloud.com
    #   EMEA  https://dm-em.informaticacloud.com
    #   APAC  https://dm-ap.informaticacloud.com
    login_url: "https://dm-us.informaticacloud.com"

    # IDMC service account. Prefer a dedicated user with the Observer role
    # plus "Asset - export" (required for lineage, see README).
    username: "${IDMC_USERNAME}"
    password: "${IDMC_PASSWORD}"

    # Optional: group entities into a platform instance if you ingest more than
    # one IDMC org/pod into the same DataHub instance.
    # platform_instance: "idmc_prod"
    # env: "PROD"

    # -------------------------------------------------------------------------
    # Filtering — combine any or all three layers
    # -------------------------------------------------------------------------

    # Layer 1 (recommended for large orgs): only ingest objects tagged in IDMC
    # with at least one of these names. Tags are matched exactly.
    # Applies to Projects, Folders, Taskflows, and Mapping Tasks only —
    # Mappings and Connections are always fetched in full regardless of this filter.
    # tag_filter_names: ["datahub", "critical"]

    # Layer 2: filter by project/folder name (regex).
    # project_pattern:
    #   allow:
    #     - "^Production_.*"
    #   deny:
    #     - ".*_sandbox$"
    # folder_pattern:
    #   allow:
    #     - ".*"

    # Layer 3: filter by mapping/taskflow name (regex, applied across all matches).
    # mapping_pattern:
    #   allow:
    #     - ".*"
    # taskflow_pattern:
    #   allow:
    #     - ".*"

    # -------------------------------------------------------------------------
    # Features
    # -------------------------------------------------------------------------

    # Requires the "Asset - export" privilege on the service account.
    extract_lineage: true

    # Derives owners from IDMC createdBy/updatedBy fields.
    extract_ownership: true

    # Emits IDMC object tags as DataHub GlobalTags on Projects, Folders,
    # Taskflows, and Mapping Tasks. Defaults to true — tags will be ingested
    # even if this field is not specified.
    extract_tags: true

    # -------------------------------------------------------------------------
    # Connection → platform overrides
    # -------------------------------------------------------------------------

    # Use when IDMC reports a connection type the connector doesn't know about.
    # Keys are IDMC connection IDs; values are DataHub platform names.
    # connection_type_overrides:
    #   "01DM180B000000000008": "snowflake"

    # -------------------------------------------------------------------------
    # Performance (tune for large orgs)
    # -------------------------------------------------------------------------

    # page_size: 200                 # v3 objects per page (max 200)
    # export_batch_size: 1000        # mappings per export job (max 1000)
    # export_poll_timeout_secs: 300  # seconds to wait for an export job
    # export_poll_interval_secs: 5   # seconds between export polls

    # -------------------------------------------------------------------------
    # Stateful ingestion — recommended for automatic stale-entity removal
    # -------------------------------------------------------------------------

    stateful_ingestion:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

password 
string(password)
Informatica Cloud password.
username 
string
Informatica Cloud username (email or service account name).
connection_to_platform_instance
map(str,string)
connection_type_overrides
map(str,string)
connection_type_platform_map
map(str,string)
convert_urns_to_lowercase
boolean
Lowercase the dataset qualifier in emitted upstream URNs to match the default behavior of the Snowflake, Postgres, and BigQuery sources (which lowercase by default). Set to False only if you've disabled lowercasing on every source this connector produces lineage to.
Default: True
export_batch_size
integer
Number of mappings per v3 export batch job (max 1000).
Default: 1000
export_poll_interval_secs
integer
Interval in seconds between export job status polls.
Default: 5
export_poll_timeout_secs
integer
Timeout in seconds for polling export job completion.
Default: 300
extract_lineage
boolean
Whether to extract table-level lineage from mapping definitions. Requires the 'Asset - export' privilege on the service account. When enabled, uses the v3 Export API to fetch full mapping definitions.
Default: True
extract_ownership
boolean
Whether to extract ownership from IDMC object createdBy/updatedBy fields.
Default: True
extract_tags
boolean
Emit IDMC object tags as DataHub GlobalTags on Projects, Folders, Taskflows, and Mapping Tasks. Set to False to skip tag extraction.
Default: True
login_url
string
Informatica Cloud login URL. This is the regional pod URL, not the runtime serverUrl. After login, the connector discovers the actual API base URL from the login response. Common values: https://dm-us.informaticacloud.com (US), https://dm2-us.informaticacloud.com (US2), https://dm-em.informaticacloud.com (EMEA), https://dm-ap.informaticacloud.com (APAC).
max_concurrent_export_jobs
integer
Maximum number of v3 export jobs to run concurrently. Each job covers one batch of mappings. Increase to reduce lineage wall-clock time on large orgs; decrease if hitting IDMC rate limits.
Default: 4
page_size
integer
Number of objects to fetch per API page (max 200 for v3 objects).
Default: 200
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
request_timeout_secs
integer
HTTP timeout in seconds for IDMC API requests. Raise this for large deployments where /api/v2/mapping or /api/v2/connection returns many records and the default 60s is insufficient.
Default: 60
strip_user_email_domain
boolean
Strip the domain from IDMC user identifiers before forming the CorpUser URN (e.g. alice@acme.com → urn:li:corpuser:alice). Enable when your Okta/AzureAD source ingests users without the email domain so ownership edges align with existing CorpUser URNs.
Default: False
env
string
The environment that all assets produced by this connector belong to
Default: PROD
folder_pattern
AllowDenyPattern
A class to store allow deny regexes
folder_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
mapping_task_pattern
AllowDenyPattern
A class to store allow deny regexes
mapping_task_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
project_pattern
AllowDenyPattern
A class to store allow deny regexes
project_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
tag_filter_names
array
List of literal IDMC tag names. When set, only objects tagged with at least one of these tags will be ingested. Tags are matched exactly (not regex). This is the recommended filtering approach for large orgs — IDMC admins tag objects in the UI and the connector picks them up.
Default: []
tag_filter_names.string
string
taskflow_pattern
AllowDenyPattern
A class to store allow deny regexes
taskflow_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Configuration for stateful ingestion and stale entity removal.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Filtering

Three filter layers can be combined and are applied in order (a combined example follows the list):

  1. Tag-based (tag_filter_names, recommended for large orgs) — an allowlist of IDMC tags; only tagged objects are ingested.
  2. Path-based (project_pattern, folder_pattern) — regex allow/deny on project and folder names.
  3. Name-based (mapping_pattern, taskflow_pattern) — regex allow/deny on mapping and taskflow names.
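
For example, a recipe combining all three layers might look like this (tag names and patterns are illustrative):

source:
  type: informatica
  config:
    # ... connection settings ...
    tag_filter_names: ["datahub"]    # Layer 1: only IDMC objects tagged "datahub"
    project_pattern:
      allow: ["^Production_.*"]      # Layer 2: only Production_* projects
      deny: [".*_sandbox$"]
    taskflow_pattern:
      allow: ["^tf_daily_.*"]        # Layer 3: only taskflows named tf_daily_*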

Connection Type Mapping

When emitting lineage, each IDMC connection is mapped to a DataHub platform (e.g. Snowflake_Cloud_Data_Warehouse → snowflake). The mapping is driven by connParams["Connection Type"]. If IDMC returns an unknown type (or a customer-specific connector), set connection_type_overrides to map that connection ID to a DataHub platform name. The connector will warn about unknown platforms at config-parse time.
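
For example (the connection ID shown is illustrative):

connection_type_overrides:
  "01DM180B000000000008": "snowflake"   # IDMC connection ID -> DataHub platform name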

External Dataset Stubs

Every input/output dataset URN referenced by mapping lineage receives a minimal Status aspect when it is first seen. Without this stub, DataHub treats URNs that no other connector has ingested as non-existent and searchAcrossLineage filters them out of results — which would leave the left-side chevron on a Mapping Task's transform DataJob unable to expand upstream datasets. The stub is idempotent and does not override Schema, Ownership, or other metadata written by the source platform's own connector when it runs.
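
The stub is nothing more than a status aspect with removed: false on the external dataset URN; its shape is roughly (URN illustrative):

entityUrn: "urn:li:dataset:(urn:li:dataPlatform:oracle,sales.orders,PROD)"
aspectName: "status"
aspect:
  removed: false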

Limitations

  • No column-level lineage — the v3 export gives us transformation-level source/target tables but not column mappings.
  • No execution history — the connector does not ingest Activity Monitor runs as DataProcessInstances.
  • Taskflow step DAG requires Asset - export — Taskflow step ordering lives in a taskflowModel XML document fetched via the v3 Export API. Ingestion will silently no-op the step chain for Taskflows the user can't export (the Taskflow itself is still emitted as a DataFlow with its orchestrate DataJob, but that orchestrate won't have an inputDatajobs chain). The report includes taskflows_with_steps so you can confirm coverage.
  • Single-user auth only — service-principal / federated SSO login is not supported; use a native IDMC user.
  • v2 API endpoints are not paginated — /api/v2/mapping and /api/v2/connection return all records in a single response; the IDMC v2 API does not honour limit, skip, or maxRecordsCount parameters (verified against a live instance). For orgs with very large numbers of mappings (>10k) or connections (>1k), the single call may exceed request_timeout_secs or produce a very large response. Mitigations: raise request_timeout_secs, or use tag_filter_names to scope ingestion to a tagged subset of objects.

Troubleshooting

IDMC login failed at startup

The connector raises this when the v2 login endpoint returns non-200 or a body without icSessionId/serverUrl. Common causes:

  • Wrong login_url for your pod (see the region table in the Prerequisites section).
  • Service account locked out, MFA-protected, or password-expired. Use a dedicated IDMC user without interactive MFA.
  • Firewall blocking egress to *.informaticacloud.com.

The raised error includes the HTTP status, a truncated response body, and the login_url used.

connections_unresolved entries in the report

The connector resolves lineage dataset URNs by matching the mapping's connectionId (e.g. saas:@fed-xyz) against the IDMC connection catalog. If a connection cannot be mapped to a DataHub platform, the lineage edge is dropped and the connection is recorded in connections_unresolved. Two typical causes:

  1. The connection uses a type not in the built-in CONNECTION_TYPE_MAP (e.g. a custom connector). Add it to connection_type_overrides with the connection ID → DataHub platform.
  2. The Connection - read privilege is missing from the service account, so list_connections fetches an empty or partial catalog.

Failed to fetch mapping tasks warning

Mapping Tasks live at /api/v2/mttask, which is often restricted to specific roles. The connector treats this as a warning (not a failure) because mapping and lineage ingestion can still complete without it. Grant Asset - read on mapping tasks if you need them.

Export job timed out

The v3 Export API is asynchronous; for very large orgs, the default export_poll_timeout_secs: 300 may be too short. Try:

  • Reduce export_batch_size (default 1000) — smaller batches finish faster individually.
  • Raise export_poll_timeout_secs (max 3600).
  • Use tag_filter_names to scope the export to tagged mappings only.

The connector emits a report warning titled "IDMC export job timed out" for each timed-out batch and records it under export_jobs_failed.
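
A tuned configuration for a large org might start from values like these (illustrative):

export_batch_size: 250          # smaller batches finish faster individually
export_poll_timeout_secs: 900   # allow slower export jobs to finish (max 3600)
tag_filter_names: ["datahub"]   # scope the export to tagged mappings only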

Add-On Bundles showing up

IDMC ships several marketplace bundles (e.g. Cloud Data Integration templates). The connector filters these out automatically by checking path.startswith("Add-On Bundles/") or updated_by == "bundle-license-notifier". If you see bundle mappings leaking through, open an issue with the offending object's path.

Code Coordinates

  • Class Name: datahub.ingestion.source.informatica.source.InformaticaSource
  • Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for Informatica, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.