ML Feature Table

The ML Feature Table entity represents a collection of related machine learning features organized together in a feature store. Feature tables are fundamental building blocks in the ML feature management ecosystem, grouping features that share common characteristics such as the same primary keys, update cadence, or data source. They bridge the gap between raw data in data warehouses and the features consumed by ML models during training and inference.

Identity

ML Feature Tables are identified by two pieces of information:

The platform that hosts the feature table: this is the specific feature store or ML platform technology. Examples include feast, tecton, and sagemaker. See the dataPlatform entity for more details.
The name of the feature table: a unique identifier within the specific platform that represents this collection of features.

An example of an ML Feature Table identifier is urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,users_feature_table).

The identity is defined by the mlFeatureTableKey aspect, which contains:

platform: A URN reference to the data platform hosting the feature table
name: The unique name of the feature table within that platform
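
The URN layout described above can be illustrated with a small standard-library sketch. The helper names make_feature_table_urn and parse_feature_table_urn are hypothetical and exist only for illustration; in practice the SDK helper builder.make_ml_feature_table_urn (used in the examples on this page) would normally be used. The parser assumes the platform identifier contains no commas.

```python
PREFIX = "urn:li:mlFeatureTable:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


def make_feature_table_urn(platform: str, name: str) -> str:
    """Compose the URN from the two key fields: platform and name."""
    return f"{PREFIX}{PLATFORM_PREFIX}{platform},{name})"


def parse_feature_table_urn(urn: str) -> tuple:
    """Split a feature table URN back into (platform, name)."""
    if not (urn.startswith(PREFIX) and urn.endswith(")")):
        raise ValueError(f"not an mlFeatureTable URN: {urn}")
    # The first comma separates the platform URN from the table name.
    platform_urn, _, name = urn[len(PREFIX):-1].partition(",")
    if not platform_urn.startswith(PLATFORM_PREFIX) or not name:
        raise ValueError(f"malformed mlFeatureTable URN: {urn}")
    return platform_urn[len(PLATFORM_PREFIX):], name


urn = make_feature_table_urn("feast", "users_feature_table")
print(urn)
print(parse_feature_table_urn(urn))
```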

Important Capabilities

Feature Table Properties

ML Feature Tables support comprehensive metadata through the mlFeatureTableProperties aspect. This aspect captures the essential characteristics of the feature table:

Description and Documentation

Feature tables can have detailed descriptions explaining their purpose, the type of features they contain, and when they should be used. This documentation helps data scientists and ML engineers discover and understand feature tables in their organization.

Python SDK: Create an ML Feature Table with properties

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter

gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})

feature_table_urn = builder.make_ml_feature_table_urn(
    feature_table_name="customer_features", platform="feast"
)

feature_table_properties = models.MLFeatureTablePropertiesClass(
    description="Customer demographic and behavioral features for churn prediction models. "
    "Updated daily from the customer data warehouse.",
    customProperties={
        "update_frequency": "daily",
        "feature_count": "25",
        "team": "customer-analytics",
        "sla_hours": "24",
    },
)

metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=feature_table_urn,
    aspect=feature_table_properties,
)

emitter.emit(metadata_change_proposal)

Features

The most important property of a feature table is the collection of features it contains. Feature tables maintain explicit relationships to their constituent features through the mlFeatures property. This creates a "Contains" relationship between the feature table and each individual feature, enabling:

Discovery of all features in a table
Navigation from feature table to individual features
Understanding of feature organization and grouping

Python SDK: Add features to a feature table

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})

feature_table_urn = builder.make_ml_feature_table_urn(
    feature_table_name="customer_features", platform="feast"
)

new_feature_urns = [
    builder.make_ml_feature_urn(
        feature_name="customer_lifetime_value",
        feature_table_name="customer_features",
    ),
    builder.make_ml_feature_urn(
        feature_name="days_since_last_purchase",
        feature_table_name="customer_features",
    ),
    builder.make_ml_feature_urn(
        feature_name="total_purchase_count",
        feature_table_name="customer_features",
    ),
]

# Read the existing aspect so that other fields (description, custom
# properties, primary keys) are preserved when the feature list is updated
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
feature_table_properties = graph.get_aspect(
    entity_urn=feature_table_urn,
    aspect_type=models.MLFeatureTablePropertiesClass,
)

if feature_table_properties:
    existing_features = feature_table_properties.mlFeatures or []
    # dict.fromkeys removes duplicates while keeping a stable order
    feature_table_properties.mlFeatures = list(
        dict.fromkeys(existing_features + new_feature_urns)
    )
    updated_properties = feature_table_properties
else:
    # No aspect exists yet, so create one from scratch
    updated_properties = models.MLFeatureTablePropertiesClass(
        mlFeatures=new_feature_urns,
        description="Customer features with newly added purchase metrics",
    )

metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=feature_table_urn,
    aspect=updated_properties,
)

emitter.emit(metadata_change_proposal)

Primary Keys

Feature tables define one or more primary keys that uniquely identify each row in the table. These primary keys are critical for:

Joining features with training datasets
Looking up feature values during model inference
Understanding the entity granularity of the features (e.g., user-level, transaction-level)

When multiple primary keys are specified, they act as a composite key. The mlPrimaryKeys property creates a "KeyedBy" relationship to each primary key entity.
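
To illustrate what a composite key means for lookups, here is a minimal sketch that uses a plain dictionary as a stand-in for an online feature store. The key names (customer_id, merchant_id), the feature names, and the values are made up for illustration and do not come from DataHub or any feature store API.

```python
from typing import Optional

# Hypothetical composite key: both values are needed to identify a row.
PRIMARY_KEYS = ("customer_id", "merchant_id")

# Stand-in for an online store: composite key tuple -> feature values.
feature_rows = {
    ("c1", "m1"): {"purchase_count": 3, "avg_basket": 20.5},
    ("c1", "m2"): {"purchase_count": 1, "avg_basket": 7.0},
}


def lookup_features(entity: dict) -> Optional[dict]:
    """Build the composite key in declared order and fetch the feature row."""
    missing = [k for k in PRIMARY_KEYS if k not in entity]
    if missing:
        raise KeyError(f"missing primary key values: {missing}")
    key = tuple(entity[k] for k in PRIMARY_KEYS)
    return feature_rows.get(key)


print(lookup_features({"customer_id": "c1", "merchant_id": "m2"}))
```

Supplying only one of the two key values raises an error in this sketch, which mirrors the point that a composite key identifies a row only when all of its parts are present.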

Python SDK: Add primary keys to a feature table

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})

feature_table_urn = builder.make_ml_feature_table_urn(
    feature_table_name="customer_features", platform="feast"
)

primary_key_urns = [
    builder.make_ml_primary_key_urn(
        feature_table_name="customer_features",
        primary_key_name="customer_id",
    )
]

# Read existing properties to preserve other fields
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
feature_table_properties = graph.get_aspect(
    entity_urn=feature_table_urn,
    aspect_type=models.MLFeatureTablePropertiesClass,
)

if feature_table_properties:
    feature_table_properties.mlPrimaryKeys = primary_key_urns
    updated_properties = feature_table_properties
else:
    updated_properties = models.MLFeatureTablePropertiesClass(
        mlPrimaryKeys=primary_key_urns,
    )

metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=feature_table_urn,
    aspect=updated_properties,
)

emitter.emit(metadata_change_proposal)

# Also create the primary key entity with its properties
dataset_urn = builder.make_dataset_urn(
    name="customers", platform="snowflake", env="PROD"
)
primary_key_urn = primary_key_urns[0]

primary_key_properties = models.MLPrimaryKeyPropertiesClass(
    description="Unique identifier for customers in the system",
    dataType="TEXT",
    sources=[dataset_urn],
)

pk_metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=primary_key_urn,
    aspect=primary_key_properties,
)

emitter.emit(pk_metadata_change_proposal)

Custom Properties

Feature tables support custom properties through the customProperties field, allowing you to capture platform-specific or organization-specific metadata that doesn't fit into the standard schema. This might include information like:

Update frequency or freshness SLAs
Feature store configuration settings
Cost or resource usage information
Team or project ownership details
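
The customProperties field is a string-to-string map, which is why the examples on this page pass values such as "25" and "true" as strings. A minimal sketch of normalizing mixed values before building the aspect follows; to_custom_properties is a hypothetical helper, and JSON-encoding non-string values is just one possible convention, not something DataHub requires.

```python
import json


def to_custom_properties(raw: dict) -> dict:
    """Convert arbitrary values into the string-to-string map shape."""
    props = {}
    for key, value in raw.items():
        if value is None:
            continue  # skip unset values instead of storing "null"
        # Strings pass through unchanged; everything else is JSON-encoded.
        props[str(key)] = value if isinstance(value, str) else json.dumps(value)
    return props


props = to_custom_properties(
    {
        "update_frequency": "daily",
        "feature_count": 25,
        "critical": True,
        "cost_center": None,
    }
)
print(props)
```

The resulting dictionary could then be passed as the customProperties argument of MLFeatureTablePropertiesClass, as in the earlier example.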

Primary Key Properties

While primary keys are referenced from feature tables, they are separate entities with their own properties defined in the mlPrimaryKeyProperties aspect. Understanding primary key metadata is essential for proper feature table usage:

Data Type

Primary keys have a data type (defined using MLFeatureDataType) that specifies the type of values:

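ORDINAL: Discrete values that can be ranked or sorted (often encoded as integers)
NOMINAL: Categorical values with no inherent order
BINARY: Values restricted to one of two categories, such as booleans
COUNT: Non-negative integer counts
TIME: Timestamp values
TEXT: String values
Other types, such as CONTINUOUS and INTERVAL for numeric values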

Source Lineage

Primary keys can declare their source datasets through the sources property. This creates lineage relationships showing which upstream datasets the primary key values are derived from. This is crucial for understanding data provenance and impact analysis.

Versioning

Primary keys support versioning through the version property, allowing teams to track changes to key definitions over time and maintain multiple versions in parallel.

Tags and Glossary Terms

Like other DataHub entities, ML Feature Tables support tags and glossary terms for classification and discovery:

Tags (via globalTags aspect) provide lightweight categorization
Glossary Terms (via glossaryTerms aspect) link to business definitions and concepts

Read this blog to understand when to use tags vs terms.

Ownership

Ownership is associated with feature tables using the ownership aspect. Owners can be individuals or teams responsible for maintaining the feature table. Clear ownership is essential for:

Knowing who to contact with questions about features
Understanding responsibility for feature quality and updates
Governance and access control decisions

Domains and Organization

Feature tables can be organized into domains (via the domains aspect) to represent organizational structure or functional areas. This helps teams manage large feature catalogs by grouping related feature tables together.

Code Examples

Creating a Complete ML Feature Table

Here's a comprehensive example that creates a feature table with all core aspects:

Python SDK: Create a complete ML Feature Table

import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter

gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})

# Step 1: Build the URN of the source dataset used for lineage
dataset_urn = builder.make_dataset_urn(
    name="customer_transactions", platform="snowflake", env="PROD"
)

# Step 2: Create the primary key entity
primary_key_urn = builder.make_ml_primary_key_urn(
    feature_table_name="transaction_features",
    primary_key_name="transaction_id",
)

primary_key_properties = models.MLPrimaryKeyPropertiesClass(
    description="Unique identifier for each transaction",
    dataType="TEXT",
    sources=[dataset_urn],
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=primary_key_urn,
        aspect=primary_key_properties,
    )
)

# Step 3: Create the feature entities
feature_1_urn = builder.make_ml_feature_urn(
    feature_name="transaction_amount",
    feature_table_name="transaction_features",
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=feature_1_urn,
        aspect=models.MLFeaturePropertiesClass(
            description="Total amount of the transaction in USD",
            dataType="CONTINUOUS",
            sources=[dataset_urn],
        ),
    )
)

feature_2_urn = builder.make_ml_feature_urn(
    feature_name="is_fraud",
    feature_table_name="transaction_features",
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=feature_2_urn,
        aspect=models.MLFeaturePropertiesClass(
            description="Binary indicator of fraudulent transaction",
            dataType="BINARY",
            sources=[dataset_urn],
        ),
    )
)

# Step 4: Create the feature table with all properties
feature_table_urn = builder.make_ml_feature_table_urn(
    feature_table_name="transaction_features", platform="feast"
)

feature_table_properties = models.MLFeatureTablePropertiesClass(
    description="Real-time transaction features for fraud detection models",
    mlFeatures=[feature_1_urn, feature_2_urn],
    mlPrimaryKeys=[primary_key_urn],
    customProperties={
        "update_frequency": "real-time",
        "team": "fraud-detection",
        "critical": "true",
    },
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=feature_table_urn,
        aspect=feature_table_properties,
    )
)

# Step 5: Add tags for categorization
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=feature_table_urn,
        aspect=models.GlobalTagsClass(
            tags=[
                models.TagAssociationClass(tag=builder.make_tag_urn("Fraud Detection")),
                models.TagAssociationClass(
                    tag=builder.make_tag_urn("Real-time Features")
                ),
            ]
        ),
    )
)

# Step 6: Add ownership
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=feature_table_urn,
        aspect=models.OwnershipClass(
            owners=[
                models.OwnerClass(
                    owner=builder.make_user_urn("data_science_team"),
                    type=models.OwnershipTypeClass.DATAOWNER,
                )
            ]
        ),
    )
)

print(f"Successfully created feature table: {feature_table_urn}")

Querying ML Feature Tables

You can retrieve ML Feature Table metadata using either the Python SDK or the REST API:

Python SDK: Read an ML Feature Table

from datahub.sdk import DataHubClient, MLFeatureTableUrn

client = DataHubClient.from_env()

# Or get this from the UI (share -> copy urn) and use MLFeatureTableUrn.from_string(...)
mlfeature_table_urn = MLFeatureTableUrn(
    "feast", "test_feature_table_all_feature_dtypes"
)

mlfeature_table_entity = client.entities.get(mlfeature_table_urn)
print("MLFeature Table name:", mlfeature_table_entity.name)
print("MLFeature Table platform:", mlfeature_table_entity.platform)
print("MLFeature Table description:", mlfeature_table_entity.description)

REST API: Fetch ML Feature Table metadata

# Get the complete entity with all aspects
curl 'http://localhost:8080/entities/urn%3Ali%3AmlFeatureTable%3A(urn%3Ali%3AdataPlatform%3Afeast,users_feature_table)'

# Get relationships to see features and primary keys
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AmlFeatureTable%3A(urn%3Ali%3AdataPlatform%3Afeast,users_feature_table)&types=Contains,KeyedBy'
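
The percent-encoded URN in the curl commands above can be produced with the Python standard library. This is a minimal sketch that assumes the same local endpoint as the other examples; entity_url is a hypothetical helper name.

```python
from urllib.parse import quote

GMS_ENDPOINT = "http://localhost:8080"  # assumed local endpoint, as above


def entity_url(urn: str) -> str:
    """Build the /entities URL for a URN, encoding colons as %3A."""
    # Parentheses and commas are kept literal to match the curl example.
    return f"{GMS_ENDPOINT}/entities/{quote(urn, safe='(),')}"


urn = "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,users_feature_table)"
print(entity_url(urn))
```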

Integration Points

ML Feature Tables integrate with multiple other entities in DataHub's metadata model:

Relationships with ML Features

Feature tables contain ML Features through the "Contains" relationship. Each entry in the mlFeatures array is an individual feature that can:

Be used independently by ML models
Have its own metadata, lineage, and documentation
Be shared across multiple feature tables in some feature store implementations

Navigation works in both directions: from a feature table to its features, and from a feature back to its parent tables.

Relationships with ML Primary Keys

Feature tables reference ML Primary Keys through the "KeyedBy" relationship. Primary keys:

Define the entity granularity of the feature table
Enable joining features with entity identifiers in training datasets
Can be shared across multiple feature tables when they represent the same entity type
Have their own lineage to upstream datasets through the sources property

Relationships with ML Models

While not directly referenced in feature table metadata, ML Models consume features through the mlFeatures property in MLModelProperties. This creates a "Consumes" lineage relationship showing which models use features from a particular feature table. This lineage enables:

Understanding downstream impact when feature tables change
Discovering which models depend on specific feature tables
Tracking feature usage and adoption across models
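
The impact analysis described above can be sketched with in-memory data. The two mappings below are hypothetical stand-ins for what the "Contains" and "Consumes" relationships would return, and the short names are placeholders rather than real DataHub URNs.

```python
# Feature table -> contained features (stand-in for "Contains").
table_features = {
    "transaction_features": ["transaction_amount", "is_fraud"],
    "customer_features": ["customer_lifetime_value"],
}

# Model -> consumed features (stand-in for "Consumes").
model_features = {
    "fraud_model": ["transaction_amount", "customer_lifetime_value"],
    "churn_model": ["customer_lifetime_value"],
}


def models_depending_on(table: str) -> list:
    """Return models that consume at least one feature from the table."""
    features = set(table_features.get(table, []))
    return sorted(
        model for model, used in model_features.items() if features & set(used)
    )


print(models_depending_on("transaction_features"))
print(models_depending_on("customer_features"))
```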

Relationships with Datasets

Feature tables have indirect relationships to datasets through two paths:

Via ML Features: Individual features can declare source datasets through their sources property, creating "DerivedFrom" lineage
Via ML Primary Keys: Primary keys can declare source datasets, showing where entity identifiers originate

This lineage connects the feature store to upstream data warehouses, enabling end-to-end data lineage from raw data to model predictions.

Platform Integration

Feature tables are associated with a specific data platform (e.g., Feast, Tecton) through the platform property in the key aspect. This creates a "SourcePlatform" relationship that:

Identifies which feature store system hosts the feature table
Enables filtering and organization by platform
Supports multi-platform feature store environments

Notable Exceptions

Feature Store Platform Variations

Different feature store platforms have different capabilities and concepts:

Feast: Uses the term "feature table" directly. Feature tables in Feast correspond 1:1 with this entity.
Tecton: Uses "feature views" and "feature services" as similar concepts. These can be modeled as feature tables.
SageMaker Feature Store: Uses "feature groups" which map to feature tables.
Databricks Feature Store: Uses "feature tables" but with database.schema.table naming patterns.

When ingesting from these platforms, ensure the naming conventions match the platform's terminology for consistency.

Custom Properties Usage

Similar to the datasetProperties and editableDatasetProperties pair on datasets, feature tables have two properties aspects:

mlFeatureTableProperties: The main properties aspect (usually from ingestion)
editableMlFeatureTableProperties: UI-editable description only

For custom metadata, use the customProperties map in mlFeatureTableProperties rather than creating custom aspects.

Entity References vs. Entity Creation

When using the SDK to create feature tables:

You must create the referenced entities first: Create individual ML Features and ML Primary Keys before referencing them in the feature table
The feature table only stores URN references - it doesn't create the feature or primary key entities
If you reference non-existent entities, they will appear as broken references in the UI

This is different from some other DataHub entities where child entities can be created inline.
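
One way to guard against broken references is to check the referenced URNs before emitting the feature table. The sketch below works on plain strings; find_dangling_references is a hypothetical helper, and existing_urns stands in for whatever record you keep of entities that have already been emitted (it is not a DataHub API call). The URN strings are illustrative.

```python
def find_dangling_references(features, primary_keys, existing_urns):
    """Return referenced URNs that have not been created yet."""
    referenced = list(features) + list(primary_keys)
    return [urn for urn in referenced if urn not in existing_urns]


# Hypothetical record of entities emitted so far.
existing_urns = {
    "urn:li:mlFeature:(transaction_features,transaction_amount)",
    "urn:li:mlPrimaryKey:(transaction_features,transaction_id)",
}

dangling = find_dangling_references(
    features=[
        "urn:li:mlFeature:(transaction_features,transaction_amount)",
        "urn:li:mlFeature:(transaction_features,is_fraud)",
    ],
    primary_keys=["urn:li:mlPrimaryKey:(transaction_features,transaction_id)"],
    existing_urns=existing_urns,
)
print(dangling)
```

A non-empty result signals that the listed feature or primary key entities should be emitted before (or together with) the feature table.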

Lineage Considerations

Feature table lineage is typically established through the features and primary keys it contains:

Feature tables themselves don't have direct upstreamLineage aspects
Instead, lineage flows through the contained features' sources properties
When querying lineage, you'll need to traverse through the "Contains" relationships to find upstream datasets

This design reflects that features are the atomic unit of lineage in ML systems, while feature tables are organizational constructs.
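
The traversal described above can be sketched as follows. The dictionaries are hypothetical stand-ins for the "Contains" relationship and for each feature's sources property; the dataset names are placeholders, not real URNs.

```python
# Feature table -> contained features (stand-in for "Contains").
contains = {"transaction_features": ["transaction_amount", "is_fraud"]}

# Feature -> source datasets (stand-in for each feature's sources property).
feature_sources = {
    "transaction_amount": ["snowflake.customer_transactions"],
    "is_fraud": ["snowflake.customer_transactions", "snowflake.fraud_labels"],
}


def upstream_datasets(table: str) -> list:
    """Collect the distinct source datasets of every feature in the table."""
    seen = {}
    for feature in contains.get(table, []):
        for dataset in feature_sources.get(feature, []):
            seen.setdefault(dataset, None)  # dict keeps first-seen order
    return list(seen)


print(upstream_datasets("transaction_features"))
```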

Technical Reference

For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.
