Skip to main content

Amazon Kinesis Data Streams

Overview

AWS Kinesis is a real-time streaming and data-delivery service. See the official AWS Kinesis page for product details.

This connector covers both AWS streaming services with one recipe and one IAM policy:

  • Amazon Kinesis Data Streams (KDS) — emitted under the kinesis platform (display name: Amazon Kinesis Data Streams) as Datasets (Stream subtype).
  • Amazon Data Firehose (formerly Amazon Kinesis Data Firehose / KDF) — each Firehose stream is emitted under the kinesis-firehose platform (display name: Amazon Data Firehose) as its own DataFlow (Firehose Stream subtype) containing a single DataJob (Delivery subtype) that carries cross-platform lineage edges to the destination (S3, Redshift, OpenSearch, Snowflake, Apache Iceberg, MongoDB).

AWS resource tags become DataHub tags (and can be turned into ownership via the extract_ownership_from_tags transformer). Glue Schema Registry can be opted in to attach Avro / JSON / Protobuf schemas to streams.

Looking specifically for Amazon Data Firehose?

Firehose streams are ingested by this same connector — see the Concept Mapping table below or the Limitations section for cross-platform lineage configuration.

Concept Mapping

Source ConceptDataHub ConceptNotes
"kinesis" / "kinesis-firehose"Data PlatformTwo platforms, mirroring AWS's own service split.
AWS RegionContainerSubtype Region. One per recipe.
Kinesis Data StreamDatasetSubtype Stream. Parent: the regional Container.
Glue Schema Registry schema (per stream)SchemaMetadata aspectAvro / JSON / Protobuf. Attached when glue_schema_registry.enabled: true and a schema is resolved for the stream.
Firehose streamDataFlowSubtype Firehose Stream. One DataFlow per Firehose stream (the pipeline).
Firehose delivery (the stream's data movement)DataJobSubtype Delivery. The single DataJob inside each Firehose stream's DataFlow; carries the lineage.
Firehose destination (S3, Redshift, OpenSearch, Snowflake, Iceberg, MongoDB)Lineage edgeEmitted via dataJobInputOutput.outputDatasets. Upstream is the source KDS stream when DeliveryStreamType=KinesisStreamAsSource.
AWS resource tag (Key=Value)TagTag URN form: urn:li:tag:Key:Value.
AWS resource tag (via the extract_ownership_from_tags transformer)OwnerOwnership is derived from the emitted tags by the transformer, not by this source directly.

Compatibility

Six Firehose destination platforms are supported: S3, Redshift, OpenSearch/Elasticsearch, Snowflake, Apache Iceberg, and MongoDB. Firehose streams targeting other destinations (HTTP, Datadog, Splunk, New Relic, etc.) are still emitted (as a DataFlow + Delivery DataJob) but without lineage output edges, and surface a warning in the ingestion report. See Limitations for the destination_platform_map configuration required when destination platforms were ingested with a non-default platform_instance.

Module kinesis

Incubating

Important Capabilities

CapabilityStatusNotes
Asset ContainersRegion containers.
DescriptionsEnabled by default.
Detect Deleted EntitiesVia stateful ingestion.
Extract TagsFrom AWS resource tags.
Schema MetadataOpt-in via glue_schema_registry.enabled (AWS Glue Schema Registry).
Table-Level LineageFirehose -> destination lineage.

Overview

Looking for Amazon Data Firehose (formerly Kinesis Data Firehose)?

You're in the right place — Firehose streams are ingested by this same kinesis connector. See the kinesis-firehose platform section below.

This connector ingests both AWS streaming services with one recipe, one IAM policy, and one ingestion job:

  • kinesis (display name: Amazon Kinesis Data Streams) — KDS streams are emitted as Datasets (Stream subtype) under a regional Container, with StreamARN, shard count, retention, encryption, and stream mode in custom properties, AWS resource tags as DataHub tags, and (optionally) schemaMetadata resolved from AWS Glue Schema Registry.
  • kinesis-firehose (display name: Amazon Data Firehose) — each Firehose stream is emitted as its own DataFlow (Firehose Stream subtype) containing a single DataJob (Delivery subtype), whose dataJobInputOutput edges draw lineage from the source Kinesis stream to the destination platform. Six destinations are supported: S3, Redshift, OpenSearch/Elasticsearch, Snowflake, Apache Iceberg, and MongoDB.

Cross-service lineage (e.g. KDS Stream → Firehose stream → S3) is rendered in the DataHub lineage viewer as edges crossing platform boundaries, making the data flow immediately legible.

The connector is API-based (boto3 + AWS IAM SigV4) and region-scoped per recipe — a multi-region setup runs multiple recipes, one per region. The region is encoded in dataset names and Firehose DataFlow ids, so multiple regions of the same account share one platform_instance (the account ID, by default) without colliding on URN.

Prerequisites

AWS IAM Permissions

The connector needs read-only access to the Kinesis, Firehose, and (optionally) Glue services. The minimum policy is:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "KinesisDataStreamsRead",
"Effect": "Allow",
"Action": [
"kinesis:ListStreams",
"kinesis:DescribeStream",
"kinesis:ListTagsForStream"
],
"Resource": "*"
},
{
"Sid": "KinesisFirehoseRead",
"Effect": "Allow",
"Action": [
"firehose:ListDeliveryStreams",
"firehose:DescribeDeliveryStream",
"firehose:ListTagsForDeliveryStream"
],
"Resource": "*"
},
{
"Sid": "GlueSchemaRegistryRead",
"Effect": "Allow",
"Action": ["glue:ListRegistries", "glue:GetSchemaVersion"],
"Resource": "*"
}
]
}

Notes on each statement:

  • Account ID resolution — no extra permission is needed. When you don't set platform_instance explicitly, the connector derives the AWS account ID from a resource ARN (parsed from the kinesis:DescribeStream / firehose:DescribeDeliveryStream calls it already makes) and uses it as the default platform_instance (<account_id>), so URNs disambiguate across accounts; the region is encoded in the dataset name and DataFlow id rather than the platform_instance. If no resource is available or the lookup fails, the connector logs a warning and continues with platform_instance=None; URNs then won't include the account ID, so cross-account collision-safety depends on you setting platform_instance explicitly in the recipe.
  • KinesisFirehoseRead — required only when include_firehose: true (the default). If you don't have these permissions and Firehose extraction is enabled, the connector logs a warning ("Permission denied for Firehose") and continues with KDS only — Firehose section is skipped, KDS ingestion proceeds normally.
  • GlueSchemaRegistryRead — required only when glue_schema_registry.enabled: true. AWS also provides a ready-made managed policy (AWSGlueSchemaRegistryReadonlyAccess) you can attach instead.
  • KinesisDataStreamsRead — denial of kinesis:ListStreams on the first page is logged as a warning and the KDS section is skipped (the user may intentionally have Firehose-only IAM). A mid-pagination failure escalates to report.failure to prevent stateful ingestion from soft-deleting un-listed streams on the next run.

Authentication

Credentials are resolved by the standard boto3 chain, in priority order:

  1. Static credentials in aws_config (aws_access_key_id + aws_secret_access_key, plus aws_session_token for STS temporary credentials).
  2. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables (and AWS_SESSION_TOKEN when applicable).
  3. A profile selected by aws_config.aws_profile from ~/.aws/credentials.
  4. An IAM role attached to the EC2 / ECS / EKS host the ingestion runs on.
  5. An AWS SSO profile.

The three patterns below cover most setups. Prefer IAM roles or short-lived SSO credentials over long-lived access keys in checked-in recipes.

Environment variables (recommended for CI / containers) — inject AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (and AWS_SESSION_TOKEN when using temporary credentials) as env vars, and leave only aws_region in the recipe:

aws_config:
aws_region: "us-east-1"

Assume-role (recommended for cross-account access) — set aws_config.aws_role to the role ARN. The credentials picked up by steps 1–5 above must have sts:AssumeRole on the target role:

aws_config:
aws_region: "us-east-1"
aws_role: "arn:aws:iam::123456789012:role/datahub-kinesis-read"
# aws_external_id: "${DATAHUB_EXTERNAL_ID}" # if required by trust policy

Named profile (recommended for local development) — reference a profile from ~/.aws/credentials:

aws_config:
aws_region: "us-east-1"
aws_profile: "datahub-prod"

Install the Plugin

pip install 'acryl-datahub[kinesis]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

pipeline_name: "kinesis_ingestion"
source:
type: "kinesis"
config:
# One recipe per AWS region. platform_instance defaults to the AWS
# account ID (derived from a resource ARN); set it explicitly only to
# disambiguate across accounts. The region is already encoded in dataset
# names and the Firehose DataFlow id, so different regions of the same
# account never collide on URN.
# platform_instance: "prod-account"
env: "PROD"

# Standard DataHub AWS connection config. Credentials are resolved by
# the boto3 chain (env vars, shared credentials file, IAM role on
# EC2/ECS/EKS, SSO profile). Prefer IAM roles where possible; the
# AWS_*_KEY env vars below are shown only for completeness.
# aws_config:
# aws_region: "us-west-2"
# aws_access_key_id: "${AWS_ACCESS_KEY_ID}"
# aws_secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
# aws_role: "arn:aws:iam::123456789012:role/datahub-kinesis-ingest"

# --- Kinesis Data Streams (KDS) -----------------------------------
include_streams: true
stream_pattern:
# Exclude internal / debug / audit streams by default.
deny:
- "^_.*"

# --- Amazon Data Firehose -----------------------------------------
include_firehose: true
firehose_stream_pattern:
allow:
- ".*"
include_table_lineage: true

# --- Tags ---------------------------------------------------------
# AWS resource tags are emitted as DataHub globalTags. To derive
# ownership from a tag, apply the extract_ownership_from_tags
# transformer (see the `transformers:` block below).
extract_tags: true

# --- Cross-platform lineage ---------------------------------------
# Required when a destination platform was ingested with a
# non-default platform_instance — without it, Firehose lineage
# edges reference valid-but-dead URNs. See the Limitations section
# of the connector docs.
#
# Allowed keys (validated at parse time):
# s3, redshift, snowflake, iceberg, mongodb, elasticsearch, glue
#
# Per-destination fields:
# platform_instance — match the destination source's setting
# env — match the destination source's env
# convert_urns_to_lowercase — default True. Set False for
# case-sensitive destinations (Iceberg,
# MongoDB) or when the destination's own
# source has convert_urns_to_lowercase=False.
# destination_platform_map:
# snowflake:
# platform_instance: "prod-snowflake-east"
# env: "PROD"
# convert_urns_to_lowercase: false # only if Snowflake source also has this
# redshift:
# platform_instance: "analytics-cluster"
# env: "PROD"
# s3:
# platform_instance: "data-lake-prod"
# iceberg:
# convert_urns_to_lowercase: false # Iceberg catalogs are case-sensitive

# --- Glue Schema Registry (optional) ------------------------------
# Disabled by default. Requires the extra `glue:Get*` / `glue:List*`
# IAM statement listed in the connector prerequisites.
glue_schema_registry:
enabled: true
registry_name: "default-registry"
# Explicit stream -> schema overrides — the recommended way to attach
# schemas. Unlike Kafka + Confluent Schema Registry, AWS does NOT define
# a relationship between a Kinesis stream and a schema; schemas are
# chosen per-record by producers. List your known stream->schema pairs
# explicitly:
stream_schema_map: {}
# events: "events-v2"
# clicks: "click-events-schema"
# Optional heuristic: look up a schema whose name == stream name.
# Off by default (no AWS-defined convention). Enable only if your
# organization has adopted schema-name == stream-name as an internal
# convention.
use_naming_convention: false

# --- Stateful ingestion (deletion detection) ----------------------
# Soft-deletes entities that have disappeared from AWS since the last
# successful run (diffed per ingestion job — independent of platform_instance).
stateful_ingestion:
enabled: true
remove_stale_metadata: true
fail_safe_threshold: 100

# Derive ownership from an AWS `owner` tag (emitted above as the globalTag
# `owner:<value>`). Uncomment to map it to a corpuser owner.
# transformers:
# - type: "extract_ownership_from_tags"
# config:
# tag_pattern: "owner:"

# Default sink is datahub-rest, picked up from ~/.datahubenv (set via
# `datahub init`). Override here only when you need a different target.
# See https://docs.datahub.com/docs/metadata-ingestion/sink_docs/datahub
# for customization options.

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
extract_tags
boolean
Extract AWS resource tags as DataHub globalTags. To derive ownership from a tag, apply the extract_ownership_from_tags transformer to the emitted tags (see the connector docs).
Default: True
include_firehose
boolean
Extract Amazon Data Firehose streams.
Default: True
include_streams
boolean
Extract Kinesis Data Streams (KDS).
Default: True
include_table_lineage
boolean
Emit Firehose-stream → destination lineage edges.
Default: True
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
env
string
The environment that all assets produced by this connector belong to
Default: PROD
aws_config
AwsConnectionConfig
Common AWS credentials config.

Currently used by:
- Glue source
- SageMaker source
- dbt source
aws_config.aws_access_key_id
One of string, null
AWS access key ID. Can be auto-detected, see the AWS boto3 docs for details.
Default: None
aws_config.aws_advanced_config
object
Advanced AWS configuration options. These are passed directly to botocore.config.Config.
aws_config.aws_endpoint_url
One of string, null
The AWS service endpoint. This is normally constructed automatically, but can be overridden here.
Default: None
aws_config.aws_profile
One of string, null
The named profile to use from AWS credentials. Falls back to default profile if not specified and no access keys provided. Profiles are configured in ~/.aws/credentials or ~/.aws/config.
Default: None
aws_config.aws_proxy
One of string, null
A set of proxy configs to use with AWS. See the botocore.config docs for details.
Default: None
aws_config.aws_region
One of string, null
AWS region code.
Default: None
aws_config.aws_retry_mode
Enum
One of: "legacy", "standard", "adaptive"
Default: standard
aws_config.aws_retry_num
integer
Number of times to retry failed AWS requests. See the botocore.retry docs for details.
Default: 5
aws_config.aws_secret_access_key
One of string(password), null
AWS secret access key. Can be auto-detected, see the AWS boto3 docs for details.
Default: None
aws_config.aws_session_token
One of string(password), null
AWS session token. Can be auto-detected, see the AWS boto3 docs for details.
Default: None
aws_config.read_timeout
number
The timeout for reading from the connection (in seconds).
Default: 60
aws_config.aws_role
One of string, array, null
AWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are the same as boto3's STS.Client.assume_role.
Default: None
aws_config.aws_role.union
One of string, AwsAssumeRoleConfig
aws_config.aws_role.union.RoleArn 
string
ARN of the role to assume.
aws_config.aws_role.union.ExternalId
One of string, null
External ID to use when assuming the role.
Default: None
destination_platform_map
map(str,DestinationPlatformDetail)
Per-destination-platform override for cross-platform lineage URN construction.

Used by KinesisSourceConfig.destination_platform_map so Firehose lineage edges
to S3 / Redshift / Snowflake / etc. resolve to the same URNs those platforms'
own DataHub ingestion sources emit.
destination_platform_map.key.platform_instance
One of string, null
platform_instance of the destination platform (must match the ingestion of that platform). If unset, the downstream URN has no platform_instance slot.
Default: None
destination_platform_map.key.convert_urns_to_lowercase
boolean
Whether to lowercase the URN name (<db>.<schema>.<table> etc.) before emitting it. Mirrors the same-named field on platforms' own DataHub source configs (notably Snowflake). Defaults to True, matching the Snowflake source's default behavior. Set to False when your destination's source recipe set convert_urns_to_lowercase=False and your identifiers are case-sensitive (Iceberg, MongoDB, or quoted-identifier Snowflake / Redshift).
Default: True
destination_platform_map.key.env
One of string, null
env (e.g., PROD, DEV) of the destination. If unset, inherits from this source's env.
Default: None
firehose_stream_pattern
AllowDenyPattern
A class to store allow deny regexes
firehose_stream_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
glue_schema_registry
KinesisGlueSchemaRegistryConfig
Glue Schema Registry attachment for Kinesis streams.

Follows the Kafka source's topic_subject_map pattern (see
confluent_schema_registry.py:81-112). AWS doesn't store stream→schema
associations server-side; we resolve via explicit map OR naming convention.
glue_schema_registry.enabled
boolean
When True, fetch Glue Schema Registry schemas for streams. Requires the additional glue:Get* / glue:List* IAM permissions listed in the connector prerequisites.
Default: False
glue_schema_registry.registry_name
string
Glue Schema Registry registry name to query.
Default: default-registry
glue_schema_registry.stream_schema_map
map(str,string)
glue_schema_registry.use_naming_convention
boolean
Heuristic fallback: when stream_schema_map has no entry for a stream, look up a Glue schema whose name matches the stream name. Disabled by default — unlike Kafka + Confluent Schema Registry (which has a documented TopicNameStrategy), AWS Glue Schema Registry has NO AWS-defined relationship between a Kinesis Data Stream and a schema: schemas are chosen per-record by producers, and one stream may carry multiple schemas. Enable only if your organization has adopted schema-name == stream-name as an internal convention.
Default: False
stream_pattern
AllowDenyPattern
A class to store allow deny regexes
stream_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Stateful ingestion configuration for stale entity removal.
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Capabilities

See the Important Capabilities table above for the full list of supported capabilities. The sections below cover the connector-specific configuration behind them.

Firehose lineage — supported destinations

The following destination types produce a dataJobInputOutput.outputDatasets edge:

AWS destinationDataHub platformURN format
Amazon S3 / Extended S3s3urn:li:dataset:(urn:li:dataPlatform:s3,<bucket>[/<prefix>],...)
Amazon Redshiftredshifturn:li:dataset:(urn:li:dataPlatform:redshift,<db>.<schema>.<table>,...)
Amazon OpenSearch / Elasticsearchelasticsearchurn:li:dataset:(urn:li:dataPlatform:elasticsearch,<index>,...)
Snowflakesnowflakeurn:li:dataset:(urn:li:dataPlatform:snowflake,<db>.<schema>.<table>,...)
Apache Icebergicebergurn:li:dataset:(urn:li:dataPlatform:iceberg,<namespace>.<table>,...)
MongoDBmongodburn:li:dataset:(urn:li:dataPlatform:mongodb,<database>.<collection>,...)

URN names are lowercased by default (matching the Snowflake source's convert_urns_to_lowercase=True default). Override per destination via destination_platform_map.<platform>.convert_urns_to_lowercase: false — see Cross-platform lineage below.

Unsupported Firehose destinations (HTTP, Datadog, Splunk, New Relic, Coralogix, LogicMonitor, Dynatrace, Honeycomb, Sumo Logic, etc.) do not produce a lineage edge — the connector logs a warning ("Unsupported Firehose destination") and records the destination configuration as a custom property on the DataJob. The DataJob itself is still emitted.

Cross-platform lineage with destination_platform_map

Firehose destinations live on other platforms (S3, Redshift, Snowflake, etc.), so Kinesis-emitted lineage URNs must match the URN convention of those platforms' own DataHub sources. The destination_platform_map lets you override per-destination URN parameters:

destination_platform_map:
snowflake:
platform_instance: "prod-snowflake-east"
env: "PROD"
# Set to false if your Snowflake source recipe also has
# `convert_urns_to_lowercase: false` (preserves UPPER_CASE identifiers):
convert_urns_to_lowercase: false
redshift:
platform_instance: "analytics-cluster"
env: "PROD"
iceberg:
# Iceberg catalogs are case-sensitive — disable lowercasing
# to preserve the table's original case:
convert_urns_to_lowercase: false

Three knobs are available per destination platform:

  • platform_instance — the same string the destination platform's own source recipe used. Without this, Firehose lineage edges resolve to dead URNs in the DataHub UI.
  • envPROD / DEV / etc. Inherits from this source's env when unset.
  • convert_urns_to_lowercase — default true. Set to false for case-sensitive destinations (Iceberg, MongoDB) or for Snowflake / Redshift sources ingested with their own convert_urns_to_lowercase=false.

Glue table lineage (Firehose format conversion)

When a Firehose stream has Parquet/ORC format conversion enabled, its SchemaConfiguration references a Glue table that defines the output schema. The connector surfaces this as an additional upstream input on the Firehose delivery DataJob:

  • The source Kinesis stream remains an input (existing behavior).
  • The destination S3 path remains an output (existing behavior).
  • The Glue table is added as a second input — its schema governs what gets written to the S3 path.

For the Glue table URN to be emitted, SchemaConfiguration must include DatabaseName and TableName. CatalogId is not required — AWS omits it from DescribeDeliveryStream responses when it equals the caller's account (per AWS docs, it's an input-side default). SchemaConfigurations present but missing DatabaseName / TableName are recorded in the source report's firehose_glue_schema_skipped field for diagnosis.

If your Glue catalog was ingested under a non-default platform_instance, set the override:

destination_platform_map:
glue:
platform_instance: "central-catalog"
env: "PROD"

This whole behavior is gated by the include_table_lineage flag — when that's off, no Glue lineage is emitted.

Filtering streams and Firehose streams

stream_pattern and firehose_stream_pattern use the standard DataHub AllowDenyPattern shape. A common deny rule excludes internal / audit / debug streams:

stream_pattern:
deny:
- "^_.*"
- ".*-debug$"

Deriving ownership from tags

The connector emits AWS resource tags as DataHub globalTags (a tag Key=Value becomes urn:li:tag:Key:Value). To turn a tag into ownership, apply the built-in extract_ownership_from_tags transformer to the emitted tags — this keeps ownership handling consistent with every other DataHub source rather than reimplementing it per connector.

For example, to treat the value of an owner tag as a corpuser owner:

transformers:
- type: "extract_ownership_from_tags"
config:
tag_pattern: "owner:"

The transformer also supports corp groups, owner types, and appending an email domain — see its documentation for the full set of options.

Glue Schema Registry

GSR is opt-in because it requires extra IAM permissions (glue:Get* / glue:List*).

What you get / miss:

  • Disabled (default) — streams are emitted without a schemaMetadata aspect. All other metadata (properties, tags, ownership, lineage) is unaffected. No glue:* permissions are needed.
  • Enabled — for each stream that resolves to a schema (see resolution order below), the connector fetches the schema from AWS Glue Schema Registry and attaches a schemaMetadata aspect with the parsed fields (Avro / JSON / Protobuf). Streams that don't resolve to a schema are still emitted, just without schemaMetadata — enabling GSR never drops a stream. Requires the GlueSchemaRegistryRead IAM statement.

To enable it:

glue_schema_registry:
enabled: true
registry_name: "default-registry"
# Recommended: declare known stream -> schema associations explicitly.
stream_schema_map:
events: "events-v2"
clicks: "click-events-schema"
# Optional heuristic — see explanation below:
use_naming_convention: false

Schema resolution order for each stream:

  1. If the stream name is a key in stream_schema_map, use the mapped schema name.
  2. Otherwise, if use_naming_convention: true, look up a schema whose name equals the stream name in the configured registry_name.
  3. Otherwise, emit the stream without schemaMetadata.

Why is use_naming_convention off by default?

Unlike Kafka + Confluent Schema Registry — which defines a standardized TopicNameStrategy (<topic>-key/-value subject naming) — AWS does not define any relationship between a Kinesis Data Stream and a Glue schema. Schemas are chosen per-record by producers (the GlueSchemaRegistrySerializer embeds the schema-id in each record), and the stream itself has no schema binding. Multiple producers can write to one stream using different schemas, and one schema can be reused across multiple streams.

Some organizations adopt "schema name == stream name" as an internal convention, but it's not AWS best practice. If your organization follows that convention, set use_naming_convention: true. Otherwise, declare your known associations in stream_schema_map — that's the most predictable mode.

(For Firehose streams with Parquet/ORC format conversion, AWS does define a relationship via SchemaConfiguration — see Glue table lineage above. That extraction is on by default and unaffected by this flag.)

Limitations

  1. destination_platform_map is required for any destination platform ingested with a non-default platform_instance. Without it, Firehose lineage edges reference URNs that are syntactically valid but resolve to nothing in DataHub — the lineage looks correct in the emitted JSON but the target is a dead link in the UI. Always populate destination_platform_map for destinations whose own source recipe set a platform_instance. Example:

    destination_platform_map:
    snowflake:
    platform_instance: "prod-snowflake-east"
    redshift:
    platform_instance: "analytics-cluster"
  2. Glue Schema Registry cross-schema references aren't resolved. Schemas with $ref (JSON Schema) or named imports (Avro / Protobuf) emit the top-level schema only — nested references aren't followed. Streams whose schemas depend on cross-schema imports will be missing some fields.

  3. One region per recipe. The connector ingests a single AWS region per run. For multi-region accounts, run one recipe per region with a distinct platform_instance.

  4. One Glue Schema Registry per recipe. Only the registry named in glue_schema_registry.registry_name is queried. If your schemas span multiple registries, run multiple recipes.

  5. No USAGE_STATS capability. Read/write throughput per stream is available in CloudWatch but isn't read by this connector.

  6. Unsupported Firehose destinations. Splunk, HTTP, Datadog, New Relic, Coralogix, LogicMonitor, Dynatrace, Honeycomb, and Sumo Logic destinations emit a DataJob without an output lineage edge and surface a warning in the report. The six supported destinations are listed in the Firehose lineage table above.

  7. No schema inference from record sampling. The connector relies on AWS Glue Schema Registry for schemas — streams without a registered schema are emitted without schemaMetadata.

  8. No Lambda consumer discovery. The connector doesn't enumerate Lambda functions consuming a stream, so Stream → Lambda lineage is not emitted.

  9. No Kinesis Data Analytics (KDA / managed Flink) support. KDA applications are not ingested by this connector.

Troubleshooting

Lineage edges show but the target dataset isn't found in DataHub. Your destination_platform_map doesn't match the destination platform's own ingestion settings. Check that platform_instance and env in destination_platform_map.<platform> exactly match the values that platform's source recipe used. See Limitation #1.

Snowflake or Redshift lineage doesn't resolve, and the destination identifiers are mixed-case. The connector lowercases destination URN names by default. If your Snowflake / Redshift source recipe set convert_urns_to_lowercase: false, mirror that on the Kinesis side:

destination_platform_map:
snowflake:
convert_urns_to_lowercase: false

Iceberg or MongoDB lineage doesn't resolve. Both platforms have case-sensitive identifiers. Disable lowercasing for them as above (convert_urns_to_lowercase: false).

Ingestion succeeded but search doesn't show the entities. Your DataHub backend may have entered a read-only state due to disk pressure on the search index. From the host running DataHub:

docker exec <opensearch-or-elasticsearch-container> \
curl -s localhost:9200/_cluster/allocation/explain

If the cluster reports disk.watermark.low exceeded, free disk space and re-index.

AccessDeniedException on kinesis:ListStreams. The IAM identity used by the recipe is missing the KinesisDataStreamsRead permissions in the AWS IAM Permissions section. A first-page denial is logged as a warning and skips the KDS section (the user may intentionally have Firehose-only IAM); a mid-pagination failure escalates to report.failure to prevent stateful soft-deletion of un-listed streams.

Firehose section is silently empty even though Firehose streams exist. The IAM identity is missing firehose:ListDeliveryStreams and/or firehose:DescribeDeliveryStream — Kinesis Data Streams permissions don't cover Firehose. A missing Firehose permission is logged as a warning ("Permission denied for Firehose") rather than a failure; check the ingestion run report.

Schema not found for a stream. glue:GetSchemaVersion returned EntityNotFoundException for the expected schema name. Either add an explicit entry in glue_schema_registry.stream_schema_map, rename the GSR schema to match the stream name (and enable use_naming_convention: true), or accept that the stream will be emitted without schemaMetadata.

Code Coordinates

  • Class Name: datahub.ingestion.source.kinesis.kinesis.KinesisSource
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Amazon Kinesis Data Streams, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.