ML Feature Table
The ML Feature Table entity represents a collection of related machine learning features organized together in a feature store. Feature tables are fundamental building blocks in the ML feature management ecosystem, grouping features that share common characteristics such as the same primary keys, update cadence, or data source. They bridge the gap between raw data in data warehouses and the features consumed by ML models during training and inference.
Identity
ML Feature Tables are identified by two pieces of information:
- The platform that hosts the feature table: this is the specific feature store or ML platform technology. Examples include feast, tecton, sagemaker, etc. See dataplatform for more details.
- The name of the feature table: a unique identifier within the specific platform that represents this collection of features.
An example of an ML Feature Table identifier is urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,users_feature_table).
The identity is defined by the mlFeatureTableKey aspect, which contains:
- platform: A URN reference to the data platform hosting the feature table
- name: The unique name of the feature table within that platform
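The identifier layout described above can be assembled and taken apart with plain string handling. The following is a minimal sketch that assumes the two-part format shown in the example URN; `build_feature_table_urn` and `parse_feature_table_urn` are hypothetical helpers written for illustration and are not part of the DataHub SDK.

```python
import re
from typing import Tuple

# Hypothetical helpers for illustration only; not part of the DataHub SDK.
_URN_PATTERN = re.compile(
    r"^urn:li:mlFeatureTable:\(urn:li:dataPlatform:([^,()]+),(.+)\)$"
)


def build_feature_table_urn(platform: str, name: str) -> str:
    """Compose a feature table URN from a platform name and a table name."""
    return f"urn:li:mlFeatureTable:(urn:li:dataPlatform:{platform},{name})"


def parse_feature_table_urn(urn: str) -> Tuple[str, str]:
    """Split a feature table URN back into (platform, name)."""
    match = _URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Not an ML Feature Table URN: {urn}")
    return match.group(1), match.group(2)


urn = build_feature_table_urn("feast", "users_feature_table")
print(urn)
print(parse_feature_table_urn(urn))
```

In real ingestion code, prefer the SDK's own builder (`make_ml_feature_table_urn` in `datahub.emitter.mce_builder`), which the examples in this document use.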
Important Capabilities
Feature Table Properties
ML Feature Tables support comprehensive metadata through the mlFeatureTableProperties aspect. This aspect captures the essential characteristics of the feature table:
Description and Documentation
Feature tables can have detailed descriptions explaining their purpose, the type of features they contain, and when they should be used. This documentation helps data scientists and ML engineers discover and understand feature tables in their organization.
Python SDK: Create an ML Feature Table with properties
import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})
feature_table_urn = builder.make_ml_feature_table_urn(
feature_table_name="customer_features", platform="feast"
)
feature_table_properties = models.MLFeatureTablePropertiesClass(
description="Customer demographic and behavioral features for churn prediction models. "
"Updated daily from the customer data warehouse.",
customProperties={
"update_frequency": "daily",
"feature_count": "25",
"team": "customer-analytics",
"sla_hours": "24",
},
)
metadata_change_proposal = MetadataChangeProposalWrapper(
entityUrn=feature_table_urn,
aspect=feature_table_properties,
)
emitter.emit(metadata_change_proposal)
Features
The most important property of a feature table is the collection of features it contains. Feature tables maintain explicit relationships to their constituent features through the mlFeatures property. This creates a "Contains" relationship between the feature table and each individual feature, enabling:
- Discovery of all features in a table
- Navigation from feature table to individual features
- Understanding of feature organization and grouping
Python SDK: Add features to a feature table
import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})
feature_table_urn = builder.make_ml_feature_table_urn(
feature_table_name="customer_features", platform="feast"
)
new_feature_urns = [
builder.make_ml_feature_urn(
feature_name="customer_lifetime_value",
feature_table_name="customer_features",
),
builder.make_ml_feature_urn(
feature_name="days_since_last_purchase",
feature_table_name="customer_features",
),
builder.make_ml_feature_urn(
feature_name="total_purchase_count",
feature_table_name="customer_features",
),
]
# Read existing features to avoid overwriting them
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
feature_table_properties = graph.get_aspect(
entity_urn=feature_table_urn,
aspect_type=models.MLFeatureTablePropertiesClass,
)
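if feature_table_properties:
    # Reuse the existing aspect so other fields (primary keys, custom properties) are preserved
    existing_features = feature_table_properties.mlFeatures or []
    # dict.fromkeys removes duplicates while keeping the original order
    feature_table_properties.mlFeatures = list(
        dict.fromkeys(existing_features + new_feature_urns)
    )
    feature_table_properties.description = (
        "Customer features with newly added purchase metrics"
    )
    updated_properties = feature_table_properties
else:
    updated_properties = models.MLFeatureTablePropertiesClass(
        mlFeatures=new_feature_urns,
        description="Customer features with newly added purchase metrics",
    )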
metadata_change_proposal = MetadataChangeProposalWrapper(
entityUrn=feature_table_urn,
aspect=updated_properties,
)
emitter.emit(metadata_change_proposal)
Primary Keys
Feature tables define one or more primary keys that uniquely identify each row in the table. These primary keys are critical for:
- Joining features with training datasets
- Looking up feature values during model inference
- Understanding the entity granularity of the features (e.g., user-level, transaction-level)
When multiple primary keys are specified, they act as a composite key. The mlPrimaryKeys property creates a "KeyedBy" relationship to each primary key entity.
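To make the composite-key behaviour concrete, here is a small, self-contained sketch of joining feature rows to training rows on two key columns. It uses plain Python dictionaries rather than any feature store API, and every column name and value in it is invented for illustration.

```python
from typing import Dict, Tuple

# Illustrative data: feature rows keyed by a composite key of
# (customer_id, merchant_id). All names and values here are made up.
feature_rows: Dict[Tuple[str, str], Dict[str, float]] = {
    ("c1", "m1"): {"total_purchase_count": 3.0},
    ("c1", "m2"): {"total_purchase_count": 1.0},
}

training_rows = [
    {"customer_id": "c1", "merchant_id": "m2", "label": 1},
    {"customer_id": "c2", "merchant_id": "m1", "label": 0},
]

key_columns = ("customer_id", "merchant_id")

joined = []
for row in training_rows:
    # Build the composite key from the row, in the same column order.
    key = tuple(row[col] for col in key_columns)
    features = feature_rows.get(key, {})  # empty dict when no match
    joined.append({**row, **features})

print(joined)
```

A row is matched only when every key column agrees, which is why the second training row receives no feature values.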
Python SDK: Add primary keys to a feature table
import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})
feature_table_urn = builder.make_ml_feature_table_urn(
feature_table_name="customer_features", platform="feast"
)
primary_key_urns = [
builder.make_ml_primary_key_urn(
feature_table_name="customer_features",
primary_key_name="customer_id",
)
]
# Read existing properties to preserve other fields
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
feature_table_properties = graph.get_aspect(
entity_urn=feature_table_urn,
aspect_type=models.MLFeatureTablePropertiesClass,
)
if feature_table_properties:
    feature_table_properties.mlPrimaryKeys = primary_key_urns
    updated_properties = feature_table_properties
else:
    updated_properties = models.MLFeatureTablePropertiesClass(
        mlPrimaryKeys=primary_key_urns,
    )
metadata_change_proposal = MetadataChangeProposalWrapper(
entityUrn=feature_table_urn,
aspect=updated_properties,
)
emitter.emit(metadata_change_proposal)
# Also create the primary key entity with its properties
dataset_urn = builder.make_dataset_urn(
name="customers", platform="snowflake", env="PROD"
)
primary_key_urn = primary_key_urns[0]
primary_key_properties = models.MLPrimaryKeyPropertiesClass(
description="Unique identifier for customers in the system",
dataType="TEXT",
sources=[dataset_urn],
)
pk_metadata_change_proposal = MetadataChangeProposalWrapper(
entityUrn=primary_key_urn,
aspect=primary_key_properties,
)
emitter.emit(pk_metadata_change_proposal)
Custom Properties
Feature tables support custom properties through the customProperties field, allowing you to capture platform-specific or organization-specific metadata that doesn't fit into the standard schema. This might include information like:
- Update frequency or freshness SLAs
- Feature store configuration settings
- Cost or resource usage information
- Team or project ownership details
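The examples in this document pass every custom property value as a string (for example "25" and "true"). The sketch below normalizes mixed Python values into that shape before building the map. It assumes string values are what the aspect expects; `to_custom_properties` is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any, Dict


def to_custom_properties(values: Dict[str, Any]) -> Dict[str, str]:
    """Convert arbitrary values into a flat string-to-string map."""
    result: Dict[str, str] = {}
    for key, value in values.items():
        if isinstance(value, str):
            result[key] = value
        else:
            # json.dumps renders True as "true", 25 as "25", lists as JSON arrays.
            result[key] = json.dumps(value)
    return result


props = to_custom_properties(
    {"update_frequency": "daily", "feature_count": 25, "critical": True}
)
print(props)
```

The resulting dictionary could then be passed as the `customProperties` argument shown in the earlier examples.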
Primary Key Properties
While primary keys are referenced from feature tables, they are separate entities with their own properties defined in the mlPrimaryKeyProperties aspect. Understanding primary key metadata is essential for proper feature table usage:
Data Type
Primary keys have a data type (defined using MLFeatureDataType) that specifies the type of values:
- ORDINAL: Integer values
- NOMINAL: Categorical values
- BINARY: Boolean values
- COUNT: Count values
- TIME: Timestamp values
- TEXT: String values
- Other numeric types like CONTINUOUS, INTERVAL
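One way to choose a data type is to infer it from a sample key value. The mapping below is a hypothetical convention, not something DataHub prescribes; it only returns type names from the list above, and `infer_data_type` is an illustrative helper.

```python
from datetime import datetime
from typing import Any


def infer_data_type(value: Any) -> str:
    """Map a sample Python value to one of the data type names listed above."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "BINARY"
    if isinstance(value, int):
        return "ORDINAL"
    if isinstance(value, float):
        return "CONTINUOUS"
    if isinstance(value, datetime):
        return "TIME"
    if isinstance(value, str):
        return "TEXT"
    raise TypeError(f"No data type mapping for {type(value).__name__}")


print(infer_data_type("cust-001"))
print(infer_data_type(42))
```

Whether an integer key is better described as ORDINAL, NOMINAL, or COUNT depends on what the values mean, so treat the result as a starting point.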
Source Lineage
Primary keys can declare their source datasets through the sources property. This creates lineage relationships showing which upstream datasets the primary key values are derived from. This is crucial for understanding data provenance and impact analysis.
Versioning
Primary keys support versioning through the version property, allowing teams to track changes to key definitions over time and maintain multiple versions in parallel.
Tags and Glossary Terms
Like other DataHub entities, ML Feature Tables support tags and glossary terms for classification and discovery:
- Tags (via globalTags aspect) provide lightweight categorization
- Glossary Terms (via glossaryTerms aspect) link to business definitions and concepts
Read this blog to understand when to use tags vs terms.
Ownership
Ownership is associated with feature tables using the ownership aspect. Owners can be individuals or teams responsible for maintaining the feature table. Clear ownership is essential for:
- Knowing who to contact with questions about features
- Understanding responsibility for feature quality and updates
- Governance and access control decisions
Domains and Organization
Feature tables can be organized into domains (via the domains aspect) to represent organizational structure or functional areas. This helps teams manage large feature catalogs by grouping related feature tables together.
Code Examples
Creating a Complete ML Feature Table
Here's a comprehensive example that creates a feature table with all core aspects:
Python SDK: Create a complete ML Feature Table
import datahub.emitter.mce_builder as builder
import datahub.metadata.schema_classes as models
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
gms_endpoint = "http://localhost:8080"
emitter = DatahubRestEmitter(gms_server=gms_endpoint, extra_headers={})
# Step 1: Create the source dataset for lineage
dataset_urn = builder.make_dataset_urn(
name="customer_transactions", platform="snowflake", env="PROD"
)
# Step 2: Create the primary key entity
primary_key_urn = builder.make_ml_primary_key_urn(
feature_table_name="transaction_features",
primary_key_name="transaction_id",
)
primary_key_properties = models.MLPrimaryKeyPropertiesClass(
description="Unique identifier for each transaction",
dataType="TEXT",
sources=[dataset_urn],
)
emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=primary_key_urn,
aspect=primary_key_properties,
)
)
# Step 3: Create the feature entities
feature_1_urn = builder.make_ml_feature_urn(
feature_name="transaction_amount",
feature_table_name="transaction_features",
)
emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=feature_1_urn,
aspect=models.MLFeaturePropertiesClass(
description="Total amount of the transaction in USD",
dataType="CONTINUOUS",
sources=[dataset_urn],
),
)
)
feature_2_urn = builder.make_ml_feature_urn(
feature_name="is_fraud",
feature_table_name="transaction_features",
)
emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=feature_2_urn,
aspect=models.MLFeaturePropertiesClass(
description="Binary indicator of fraudulent transaction",
dataType="BINARY",
sources=[dataset_urn],
),
)
)
# Step 4: Create the feature table with all properties
feature_table_urn = builder.make_ml_feature_table_urn(
feature_table_name="transaction_features", platform="feast"
)
feature_table_properties = models.MLFeatureTablePropertiesClass(
description="Real-time transaction features for fraud detection models",
mlFeatures=[feature_1_urn, feature_2_urn],
mlPrimaryKeys=[primary_key_urn],
customProperties={
"update_frequency": "real-time",
"team": "fraud-detection",
"critical": "true",
},
)
emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=feature_table_urn,
aspect=feature_table_properties,
)
)
# Step 5: Add tags for categorization
emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=feature_table_urn,
aspect=models.GlobalTagsClass(
tags=[
models.TagAssociationClass(tag=builder.make_tag_urn("Fraud Detection")),
models.TagAssociationClass(
tag=builder.make_tag_urn("Real-time Features")
),
]
),
)
)
# Step 6: Add ownership
emitter.emit(
MetadataChangeProposalWrapper(
entityUrn=feature_table_urn,
aspect=models.OwnershipClass(
owners=[
models.OwnerClass(
owner=builder.make_user_urn("data_science_team"),
type=models.OwnershipTypeClass.DATAOWNER,
)
]
),
)
)
print(f"Successfully created feature table: {feature_table_urn}")
Querying ML Feature Tables
You can retrieve ML Feature Table metadata using both the Python SDK and REST API:
Python SDK: Read an ML Feature Table
from datahub.sdk import DataHubClient, MLFeatureTableUrn
client = DataHubClient.from_env()
# Or get this from the UI (share -> copy urn) and use MLFeatureTableUrn.from_string(...)
mlfeature_table_urn = MLFeatureTableUrn(
"feast", "test_feature_table_all_feature_dtypes"
)
mlfeature_table_entity = client.entities.get(mlfeature_table_urn)
print("MLFeature Table name:", mlfeature_table_entity.name)
print("MLFeature Table platform:", mlfeature_table_entity.platform)
print("MLFeature Table description:", mlfeature_table_entity.description)
REST API: Fetch ML Feature Table metadata
# Get the complete entity with all aspects
curl 'http://localhost:8080/entities/urn%3Ali%3AmlFeatureTable%3A(urn%3Ali%3AdataPlatform%3Afeast,users_feature_table)'
# Get relationships to see features and primary keys
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AmlFeatureTable%3A(urn%3Ali%3AdataPlatform%3Afeast,users_feature_table)&types=Contains,KeyedBy'
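The URN in the curl commands above is percent-encoded, with colons escaped and parentheses and commas left as-is. A minimal sketch of producing the same URLs with the Python standard library, assuming the same local endpoint as the other examples:

```python
from urllib.parse import quote

gms_endpoint = "http://localhost:8080"
urn = "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,users_feature_table)"

# Keep parentheses and commas readable; everything else outside the
# unreserved set (notably ':') is percent-encoded.
encoded_urn = quote(urn, safe="(),")

entity_url = f"{gms_endpoint}/entities/{encoded_urn}"
relationships_url = (
    f"{gms_endpoint}/relationships"
    f"?direction=OUTGOING&urn={encoded_urn}&types=Contains,KeyedBy"
)
print(entity_url)
print(relationships_url)
```

Building the URL this way avoids hand-editing the encoded string when the platform or table name changes.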
Integration Points
ML Feature Tables integrate with multiple other entities in DataHub's metadata model:
Relationships with ML Features
Feature tables contain ML Features through the "Contains" relationship. Each entry in the mlFeatures array references an individual feature that:
- Can be used independently by ML models
- Has its own metadata, lineage, and documentation
- Can be shared across multiple feature tables in some feature store implementations
Navigation works bidirectionally - from feature table to features, and from features back to their parent tables.
Relationships with ML Primary Keys
Feature tables reference ML Primary Keys through the "KeyedBy" relationship. Primary keys:
- Define the entity granularity of the feature table
- Enable joining features with entity identifiers in training datasets
- Can be shared across multiple feature tables when they represent the same entity type
- Have their own lineage to upstream datasets through the sources property
Relationships with ML Models
While not directly referenced in feature table metadata, ML Models consume features through the mlFeatures property in MLModelProperties. This creates a "Consumes" lineage relationship showing which models use features from a particular feature table. This lineage enables:
- Understanding downstream impact when feature tables change
- Discovering which models depend on specific feature tables
- Tracking feature usage and adoption across models
Relationships with Datasets
Feature tables have indirect relationships to datasets through two paths:
- Via ML Features: Individual features can declare source datasets through their sources property, creating "DerivedFrom" lineage
- Via ML Primary Keys: Primary keys can declare source datasets, showing where entity identifiers originate
This lineage connects the feature store to upstream data warehouses, enabling end-to-end data lineage from raw data to model predictions.
Platform Integration
Feature tables are associated with a specific data platform (e.g., Feast, Tecton) through the platform property in the key aspect. This creates a "SourcePlatform" relationship that:
- Identifies which feature store system hosts the feature table
- Enables filtering and organization by platform
- Supports multi-platform feature store environments
Notable Exceptions
Feature Store Platform Variations
Different feature store platforms have different capabilities and concepts:
- Feast: Uses the term "feature table" directly. Feature tables in Feast correspond 1:1 with this entity.
- Tecton: Uses "feature views" and "feature services" as similar concepts. These can be modeled as feature tables.
- SageMaker Feature Store: Uses "feature groups" which map to feature tables.
- Databricks Feature Store: Uses "feature tables" but with database.schema.table naming patterns.
When ingesting from these platforms, ensure the naming conventions match the platform's terminology for consistency.
Custom Properties Usage
Unlike datasets which have both datasetProperties and editableDatasetProperties, feature tables have:
- mlFeatureTableProperties: The main properties aspect (usually from ingestion)
- editableMlFeatureTableProperties: UI-editable description only
For custom metadata, use the customProperties map in mlFeatureTableProperties rather than creating custom aspects.
Entity References vs. Entity Creation
When using the SDK to create feature tables:
- You must create the referenced entities first: Create individual ML Features and ML Primary Keys before referencing them in the feature table
- The feature table only stores URN references - it doesn't create the feature or primary key entities
- If you reference non-existent entities, they will appear as broken references in the UI
This is different from some other DataHub entities where child entities can be created inline.
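A simple guard against broken references is to check every URN before emitting the feature table. The sketch below is written against a generic `exists` callable so it runs without a DataHub instance; `find_missing_references` is a hypothetical helper, and the URN strings and the in-memory set standing in for a real lookup are illustrative.

```python
from typing import Callable, Iterable, List


def find_missing_references(
    urns: Iterable[str], exists: Callable[[str], bool]
) -> List[str]:
    """Return the URNs for which the existence check reports False."""
    return [urn for urn in urns if not exists(urn)]


# Stand-in for a real lookup against DataHub; the URNs are illustrative.
known_urns = {
    "urn:li:mlFeature:(transaction_features,transaction_amount)",
    "urn:li:mlPrimaryKey:(transaction_features,transaction_id)",
}

referenced = [
    "urn:li:mlFeature:(transaction_features,transaction_amount)",
    "urn:li:mlFeature:(transaction_features,is_fraud)",
    "urn:li:mlPrimaryKey:(transaction_features,transaction_id)",
]

missing = find_missing_references(referenced, known_urns.__contains__)
print(missing)
```

In practice the `exists` argument would be whatever existence check your client offers, and the feature table would be emitted only when the returned list is empty.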
Lineage Considerations
Feature table lineage is typically established through the features and primary keys it contains:
- Feature tables themselves don't have direct upstreamLineage aspects
- Instead, lineage flows through the contained features' sources properties
- When querying lineage, you'll need to traverse through the "Contains" relationships to find upstream datasets
This design reflects that features are the atomic unit of lineage in ML systems, while feature tables are organizational constructs.
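The traversal described above can be sketched with two lookups: table to features, then features to sources. The example uses in-memory dictionaries in place of real relationship and aspect queries, and the shortened identifiers are placeholders rather than real URNs.

```python
from typing import Dict, List, Set

# Illustrative in-memory metadata; all identifiers are made up for the sketch.
contains: Dict[str, List[str]] = {
    "table:transaction_features": [
        "feature:transaction_amount",
        "feature:is_fraud",
    ],
}
feature_sources: Dict[str, List[str]] = {
    "feature:transaction_amount": ["dataset:customer_transactions"],
    "feature:is_fraud": ["dataset:customer_transactions", "dataset:fraud_labels"],
}


def upstream_datasets(table: str) -> Set[str]:
    """Collect the source datasets of every feature the table contains."""
    datasets: Set[str] = set()
    for feature in contains.get(table, []):
        datasets.update(feature_sources.get(feature, []))
    return datasets


print(sorted(upstream_datasets("table:transaction_features")))
```

Against a live instance, the first lookup would correspond to the "Contains" relationships and the second to each feature's sources property.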
Technical Reference
For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.