⚡ DataHub Actions Framework
Welcome to DataHub Actions! The Actions framework makes responding to realtime changes in your Metadata Graph easy, enabling you to seamlessly integrate DataHub into a broader events-based architecture.
For a detailed introduction, check out the original announcement of the DataHub Actions Framework at the DataHub April 2022 Town Hall. For a more in-depth look at use cases and concepts, check out DataHub Actions Concepts.
Quickstart
To get started right away, check out the DataHub Actions Quickstart Guide.
Prerequisites
The DataHub Actions CLI commands are an extension of the base datahub CLI commands. We recommend
first installing the datahub CLI:
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub --version
Note that the Actions Framework requires a version of
acryl-datahub>= v0.8.34
Installation
Next, simply install the acryl-datahub-actions package from PyPi:
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub-actions
datahub actions version
Configuring an Action
Actions are configured using a YAML file, much in the same way DataHub ingestion sources are. An action configuration file consists of the following
- Action Pipeline Name (Should be unique and static)
- Source Configurations
- Filter Configurations (recommended:
filters:list) - Transform Configurations (Optional)
- Action Configuration
- Pipeline Options (Optional)
- DataHub API configs (Optional - required for select actions)
With each component being independently pluggable and configurable.
# 1. Required: Action Pipeline Name
name: <action-pipeline-name>
# 2. Required: Event Source - Where to source event from.
source:
type: <source-type>
config:
# Event Source specific configs (map)
# 3. Optional: Pipeline filters — evaluated before transformers (AND semantics across items).
# Each filter type declares which events it passes. Events that don't match ALL filters are dropped.
filters:
- type: event_type
config:
filter:
# Map from event_type string to optional body predicates.
# The event passes if its type is listed here and satisfies the associated predicate.
<event-type-string>:
event:
# List of predicate dicts — OR across list items, AND within each dict.
- <field>: <value>
# 4. Optional: Custom Transformers to run on events (array)
transform:
- type: <transformer-type>
config:
# Transformer-specific configs (map)
# 5. Required: Action - What action to take on events.
action:
type: <action-type>
config:
# Action-specific configs (map)
# 6. Optional: Additional pipeline options (error handling, etc)
options:
retry_count: 0 # The number of times to retry an Action with the same event. (If an exception is thrown). 0 by default.
failure_mode: "CONTINUE" # What to do when an event fails to be processed. Either 'CONTINUE' to make progress or 'THROW' to stop the pipeline. Either way, the failed event will be logged to a failed_events.log file.
failed_events_dir: "/tmp/datahub/actions" # The directory in which to write a failed_events.log file that tracks events which fail to be processed. Defaults to "/tmp/logs/datahub/actions".
# 7. Optional: DataHub API configuration
datahub:
server: "http://localhost:8080" # Location of DataHub API
# token: <your-access-token> # Required if Metadata Service Auth enabled
Example: Hello World
A simple configuration file for a "Hello World" action, which simply prints all events it receives, is
# 1. Action Pipeline Name
name: "hello_world"
# 2. Event Source: Where to source event from.
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 3. Action: What action to take on events.
action:
type: "hello_world"
We can filter for specific events using the filters: section.
The example below passes only EntityChangeEvent_v1 events where a PII tag was added:
# 1. Action Pipeline Name
name: "hello_world"
# 2. Event Source - Where to source event from.
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 3. Filters - Only pass events that match all filters.
filters:
- type: event_type
config:
filter:
EntityChangeEvent_v1:
event:
- category: "TAG"
operation: "ADD"
modifier: "urn:li:tag:pii"
# 4. Action - What action to take on events.
action:
type: "hello_world"
Multiple event types in one filter
The event_type filter supports multiple event types in a single filter block (OR semantics across types):
filters:
- type: event_type
config:
filter:
MetadataChangeLogEvent_v1:
event:
- entityType: schemaField
aspectName: documentation
EntityChangeEvent_v1:
event:
- category: DOCUMENTATION
entityType: schemaField
Performance tip: MCL pre-deserialization filtering with Kafka
When using the Kafka event source you can add enable_mcl_pre_deserialization_filter: true to
drop MetadataChangeLog (MCL) messages before the expensive avrogen deserialization step.
This flag only affects MCL events — EntityChangeEvent (ECE) delivery is never impacted because
ECE fields are only accessible after deserializing the PlatformEvent envelope.
This is especially effective for pipelines that only care about entity-change events (e.g. Slack or Teams notifications), where all MCL messages can be dropped without ever deserializing them:
source:
type: kafka
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
enable_mcl_pre_deserialization_filter: true
filters:
- type: event_type
config:
filter:
EntityChangeEvent_v1: {}
Legacy filter: section
The original filter: section is still accepted for backwards compatibility but is deprecated.
Migrate to filters: for improved expressiveness and optional performance benefits.
# Deprecated — use `filters:` instead
filter:
event_type: "EntityChangeEvent_v1"
event:
category: "TAG"
operation: "ADD"
modifier: "urn:li:tag:pii"
Running an Action
To run a new Action, just use the actions CLI command
datahub actions -c <config.yml>
Once the Action is running, you will see
Action Pipeline with name '<action-pipeline-name>' is now running.
Running multiple Actions
You can run multiple actions pipeline within the same command. Simply provide multiple config files by restating the "-c" command line argument.
For example,
datahub actions -c <config-1.yaml> -c <config-2.yaml>
Running in debug mode
Simply append the --debug flag to the CLI to run your action in debug mode. NOTE: This will reveal sensitive information in the logs, do not share the logs with outside resources and ensure untrusted
users will not have access to logs through UI ingestions before enabling on instances.
datahub actions -c <config.yaml> --debug
Stopping an Action
Just issue a Control-C as usual. You should see the Actions Pipeline shut down gracefully, with a small summary of processing results.
Actions Pipeline with name '<action-pipeline-name' has been stopped.
Supported Events
Two event types are currently supported. Read more about them below.
Supported Event Sources
Currently, the following event sources are supported:
Supported Actions
By default, DataHub supports a set of standard actions plugins. These can be found inside the folder
src/datahub-actions/plugins.
Some pre-included Actions include
Development
Build and Test
Notice that we support all actions command using a separate datahub-actions CLI entry point. Feel free
to use this during development.
# Build datahub-actions module
./gradlew datahub-actions:build
# Drop into virtual env
cd datahub-actions && source venv/bin/activate
# Start hello world action
datahub-actions actions -c ../examples/hello_world.yaml
# Start ingestion executor action
datahub-actions actions -c ../examples/executor.yaml
# Start multiple actions
datahub-actions actions -c ../examples/executor.yaml -c ../examples/hello_world.yaml
Developing a Transformer
To develop a new Transformer, check out the Developing a Transformer guide.
Developing an Action
To develop a new Action, check out the Developing an Action guide.
Contributing
Contributing guidelines follow those of the main DataHub project. We are accepting contributions for Actions, Transformers, and general framework improvements (tests, error handling, etc).
Resources
Check out the original announcement of the DataHub Actions Framework at the DataHub April 2022 Town Hall.
License
Apache 2.0