ML Model Group
ML Model Groups represent collections of related machine learning models within an organization's ML infrastructure. They serve as logical containers for organizing model versions, experimental variants, or families of models that share common characteristics. Model groups are essential for managing the lifecycle of ML models, tracking model evolution over time, and organizing models by purpose, architecture, or business function.
Identity
ML Model Groups are identified by three pieces of information:
- Platform: The ML platform or tool where the model group exists. This represents the specific ML technology that hosts the model group. Examples include mlflow, sagemaker, databricks, kubeflow, tensorflow, pytorch, etc. The platform is represented as a URN like urn:li:dataPlatform:mlflow.
- Name: The unique name of the model group within the specific platform. This is typically a human-readable identifier that describes the purpose or family of models. Names should be meaningful and follow your organization's naming conventions. Examples include recommendation-models, fraud-detection-v2, or customer-churn-prediction.
- Origin (Fabric): The environment or fabric where the model group belongs or was generated. This qualifier helps distinguish between models in different environments such as Production (PROD), Staging (QA), Development (DEV), or Testing environments. The full list of supported environments is available in FabricType.pdl.
An example of an ML Model Group identifier is urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,recommendation-models,PROD).
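To make the structure of the identifier concrete, the three parts can be assembled and taken apart with plain string handling. This is only a sketch: make_model_group_urn and parse_model_group_urn are hypothetical helper names, not DataHub SDK functions, and the parser assumes the group name contains no commas or parentheses. In real code, the SDK's MlModelGroupUrn class used in the examples on this page is the better choice.

```python
import re
from typing import Tuple


def make_model_group_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Compose an ML Model Group URN from platform, name, and origin (hypothetical helper)."""
    return f"urn:li:mlModelGroup:(urn:li:dataPlatform:{platform},{name},{env})"


# Assumes none of the three parts contains a comma or a parenthesis.
_URN_PATTERN = re.compile(
    r"^urn:li:mlModelGroup:\(urn:li:dataPlatform:([^,()]+),([^,()]+),([^,()]+)\)$"
)


def parse_model_group_urn(urn: str) -> Tuple[str, str, str]:
    """Split an ML Model Group URN back into (platform, name, env)."""
    match = _URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Not an ML Model Group URN: {urn}")
    return match.group(1), match.group(2), match.group(3)


urn = make_model_group_urn("mlflow", "recommendation-models")
print(urn)
print(parse_model_group_urn(urn))
```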
Important Capabilities
ML Model Group Properties
Model group properties are stored in the mlModelGroupProperties aspect and contain the core metadata about a model group:
- Name: The display name of the model group, which can be more descriptive than the identifier
- Description: Detailed documentation about what the model group represents, its purpose, and any important context
- Version: An optional version tag for the entire model group (distinct from individual model versions)
- Created Timestamp: Audit information about when the model group was created and by whom
- Last Modified Timestamp: When the model group was last updated
- Custom Properties: Extensible key-value pairs for storing additional metadata specific to your organization or platform
- External References: Links to external systems or documentation (e.g., model registry URLs, experiment tracking systems)
Here is an example of creating an ML Model Group with properties:
Python SDK: Create an ML Model Group
from datahub.sdk import DataHubClient
from datahub.sdk.mlmodelgroup import MLModelGroup
client = DataHubClient.from_env()
mlmodel_group = MLModelGroup(
    id="my-recommendations-model-group",
    name="My Recommendations Model Group",
    platform="mlflow",
    description="Grouping of ML models related to home page recommendations",
    custom_properties={
        "framework": "pytorch",
    },
)
client.entities.upsert(mlmodel_group)
print(f"Created ML model group: {mlmodel_group.urn}")
Lineage and Training Information
ML Model Groups inherit lineage capabilities from the MLModelLineageInfo record, which captures important information about how models in the group are created and used:
- Training Jobs: References to data jobs or process instances that were used to train models in this group. This creates lineage relationships showing where models come from.
- Downstream Jobs: References to data jobs or process instances that consume or use models from this group. This tracks how models are deployed and utilized.
These lineage relationships are visible in DataHub's lineage graph and help track the full lifecycle of ML models from training data through deployment.
Python SDK: Add training lineage to a model group
from datahub.emitter import mce_builder
from datahub.metadata.urns import MlModelGroupUrn
from datahub.sdk import DataHubClient
client = DataHubClient.from_env()
# URN of the existing model group
group_urn = MlModelGroupUrn(
    platform="mlflow",
    name="recommendation-models",
    env="PROD",
)
# URN of the Airflow task that trains models in this group
training_job_urn = mce_builder.make_data_job_urn(
    orchestrator="airflow",
    flow_id="train_recommendation_model",
    job_id="training_task",
)
group = client.entities.get(group_urn)
group.add_training_job(training_job_urn)
client.entities.update(group)
print(f"Added training job {training_job_urn} to ML model group {group_urn}")
Ownership
Like other entities in DataHub, ML Model Groups can have owners assigned using the ownership aspect. Model group owners are typically responsible for:
- Managing which models belong to the group
- Maintaining model group metadata and documentation
- Overseeing model quality and governance standards
- Serving as points of contact for model-related questions
- Coordinating model deployment and monitoring
Ownership types for model groups follow the same patterns as other entities, including TECHNICAL_OWNER, BUSINESS_OWNER, DATA_STEWARD, DATAOWNER, PRODUCER, DEVELOPER, etc.
Python SDK: Add an owner to a model group
from datahub.sdk import CorpUserUrn, DataHubClient, MlModelGroupUrn
client = DataHubClient.from_env()
group = client.entities.get(
    MlModelGroupUrn(platform="mlflow", name="recommendation-models", env="PROD")
)
group.add_owner(CorpUserUrn("data_science_team"))
client.entities.update(group)
Tags and Glossary Terms
ML Model Groups support both tags and glossary terms for categorization and discovery:
- Tags (via globalTags aspect): Informal labels for quick categorization. Use tags for properties like experimental, production-ready, high-priority, computer-vision, nlp, etc.
- Glossary Terms (via glossaryTerms aspect): Formal business vocabulary terms that provide standardized definitions. Use terms to link model groups to business concepts.
Python SDK: Add tags to a model group
from datahub.metadata.urns import MlModelGroupUrn, TagUrn
from datahub.sdk import DataHubClient
client = DataHubClient.from_env()
group = client.entities.get(
    MlModelGroupUrn(platform="mlflow", name="recommendation-models", env="PROD")
)
group.add_tag(TagUrn("production-ready"))
client.entities.update(group)
Python SDK: Add glossary terms to a model group
from datahub.metadata.urns import MlModelGroupUrn
from datahub.sdk import DataHubClient, GlossaryTermUrn
client = DataHubClient.from_env()
group_urn = MlModelGroupUrn(platform="mlflow", name="recommendation-models", env="PROD")
mlmodel_group = client.entities.get(group_urn)
mlmodel_group.add_term(GlossaryTermUrn("Recommendation"))
client.entities.update(mlmodel_group)
print(f"Added term {GlossaryTermUrn('Recommendation')} to ML model group {group_urn}")
Documentation and Links
Model groups support documentation through multiple aspects:
- Description in mlModelGroupProperties: Primary documentation field for describing the model group
- Institutional Memory (via institutionalMemory aspect): Links to external resources such as:
  - Confluence pages describing the model group's purpose and architecture
  - Model cards and documentation
  - Experiment tracking dashboards (MLflow, Weights & Biases, etc.)
  - Research papers or technical specifications
  - Team wikis or runbooks
Python SDK: Add documentation links to a model group
from datahub.sdk import DataHubClient
from datahub.sdk.mlmodelgroup import MLModelGroup
client = DataHubClient.from_env()
mlmodel_group = client.entities.get(
    MLModelGroup.get_urn_type()(
        platform="mlflow", name="recommendation-models", env="PROD"
    )
)
doc_url = "https://wiki.example.com/ml/recommendation-models"
doc_description = "Model architecture and training documentation"
mlmodel_group.add_link((doc_url, doc_description))
client.entities.update(mlmodel_group)
Domains
ML Model Groups can be assigned to domains using the domains aspect. This allows organizing model groups by business unit, department, or functional area. A model group can belong to only one domain at a time.
Python SDK: Assign a model group to a domain
from datahub.metadata.urns import DomainUrn, MlModelGroupUrn
from datahub.sdk import DataHubClient
client = DataHubClient.from_env()
mlmodel_group = client.entities.get(
    MlModelGroupUrn(platform="mlflow", name="recommendation-models", env="PROD")
)
# If you don't know the domain urn, you can look it up:
# domain_urn = client.resolve.domain(name="marketing")
# NOTE: This will overwrite the existing domain
mlmodel_group.set_domain(DomainUrn(id="marketing"))
client.entities.update(mlmodel_group)
Deprecation
Model groups can be marked as deprecated using the deprecation aspect when they are no longer actively maintained or should be replaced. This helps users understand which model families are still supported.
Python SDK: Deprecate a model group
from datetime import datetime
import datahub.metadata.schema_classes as models
from datahub.sdk import DataHubClient, MlModelGroupUrn
client = DataHubClient.from_env()
group_urn = MlModelGroupUrn(
    platform="mlflow",
    name="legacy-recommendation-models",
    env="PROD",
)
mlmodel_group = client.entities.get(group_urn)
deprecation_aspect = models.DeprecationClass(
    deprecated=True,
    note="This model group has been replaced by the new transformer-based recommendation models",
    # Timestamp in milliseconds since the Unix epoch
    decommissionTime=int(datetime.now().timestamp() * 1000),
    actor="urn:li:corpuser:datahub",
)
mlmodel_group._set_aspect(deprecation_aspect)
client.entities.update(mlmodel_group)
print(f"Deprecated ML model group: {group_urn}")
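The decommissionTime value in the example above is a Unix timestamp in milliseconds. If you would rather record a planned decommission date than the current time, a small conversion helper keeps the units straight. This is a sketch under assumptions: to_epoch_millis is a hypothetical helper, not part of the DataHub SDK, and the dates and the 90-day grace period are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        # Naive datetimes are interpreted in local time, which varies by machine.
        raise ValueError("Pass a timezone-aware datetime")
    return int(moment.timestamp() * 1000)


# Illustrative dates: deprecation announced on 1 Jan 2025, removal planned 90 days later.
announced = datetime(2025, 1, 1, tzinfo=timezone.utc)
decommission_at = announced + timedelta(days=90)

print(to_epoch_millis(announced))
print(to_epoch_millis(decommission_at))
```

The resulting integer can be passed as decommissionTime in place of the datetime.now() expression.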
Structured Properties
Model groups support structured properties for storing typed, schema-validated metadata that goes beyond simple key-value pairs. This is useful for enforcing organizational standards around model metadata.
Integration Points
Relationship to ML Models
The primary relationship for ML Model Groups is with ML Models themselves. Models can be associated with a model group through the groups field in the mlModelProperties aspect. This creates a MemberOf relationship from the model to the model group.
Common patterns for organizing models in groups include:
- Version-based grouping: All versions of a model (v1, v2, v3) belong to the same group. Note that versioning is handled through the versionProperties aspect on individual models (which includes version numbers, versionSet URNs, and aliases like "champion" or "challenger"), while the model group serves as the organizational container.
- Experiment-based grouping: Different experimental variants of a model belong to the same group
- Architecture-based grouping: Models sharing the same architecture or approach
- Purpose-based grouping: Models serving the same business purpose or use case
Python SDK: Add a model to a model group
from datahub.metadata.urns import MlModelGroupUrn
from datahub.sdk import DataHubClient
from datahub.sdk.mlmodel import MLModel
client = DataHubClient.from_env()
model = MLModel(
    id="my-recommendations-model",
    platform="mlflow",
)
model.set_model_group(
    MlModelGroupUrn(
        platform="mlflow",
        name="my-recommendations-model-group",
    )
)
client.entities.upsert(model)
Relationship to Data Jobs
Through the MLModelLineageInfo fields, model groups can be connected to:
- Training Jobs: Data jobs that produce models in the group (upstream lineage)
- Downstream Jobs: Data jobs that consume models from the group (downstream lineage)
These relationships enable end-to-end lineage tracking from training data through model deployment.
Relationship to ML Features
While ML Model Groups don't directly reference ML Features, the individual models within the group often consume ML Features. The model group serves as an organizational layer above the model-to-feature relationships.
Relationship to Containers
Model groups can be organized within containers using the container aspect. This is useful for representing hierarchical structures like:
- ML Workspace > Project > Model Group
- Registry > Namespace > Model Group
Querying Model Groups
You can query model groups and their associated models using both the REST API and the Python SDK.
Fetching Model Group Information via REST API
REST API: Get model group by URN
curl 'http://localhost:8080/entities/urn%3Ali%3AmlModelGroup%3A(urn%3Ali%3AdataPlatform%3Amlflow,recommendation-models,PROD)' \
  -H 'Authorization: Bearer <token>'
This will return the model group entity with all its aspects, including:
- mlModelGroupKey: The unique identifier
- mlModelGroupProperties: Name, description, version, timestamps
- ownership: Owners of the model group
- globalTags: Tags attached to the group
- glossaryTerms: Business terms associated with the group
- domains: Domain assignment
- institutionalMemory: Links and documentation
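The path segment in the curl command above is the model group URN with its colons percent-encoded. A minimal sketch of producing that URL with the Python standard library follows. The entity_url helper is a hypothetical name, and the base URL is the localhost address from the curl example.

```python
from urllib.parse import quote

GMS_URL = "http://localhost:8080"  # same host as in the curl example


def entity_url(urn: str, base_url: str = GMS_URL) -> str:
    """Build the /entities URL for a URN (hypothetical helper)."""
    # Keep parentheses and commas literal, as the curl example does;
    # colons are encoded as %3A.
    return f"{base_url}/entities/{quote(urn, safe='(),')}"


urn = "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,recommendation-models,PROD)"
print(entity_url(urn))
```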
Python SDK: Read a model group
from datahub.metadata.urns import MlModelGroupUrn
from datahub.sdk import DataHubClient
client = DataHubClient.from_env()
# Or get this from the UI (share -> copy urn) and use MlModelGroupUrn.from_string(...)
mlmodel_group_urn = MlModelGroupUrn(
    platform="mlflow", name="my-recommendations-model-group"
)
mlmodel_group_entity = client.entities.get(mlmodel_group_urn)
print("Model Group Name: ", mlmodel_group_entity.name)
print("Model Group Description: ", mlmodel_group_entity.description)
print("Model Group Custom Properties: ", mlmodel_group_entity.custom_properties)
Finding Models in a Model Group
To find all models that belong to a specific model group, you can query the relationships.
REST API: Find all models in a model group
curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3AmlModelGroup%3A(urn%3Ali%3AdataPlatform%3Amlflow,recommendation-models,PROD)&types=MemberOf' \
  -H 'Authorization: Bearer <token>'
This returns all ML Model entities that are members of the specified model group.
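When the URN comes from a variable rather than being typed by hand, the query string for the relationships call can be assembled programmatically. The sketch below only builds the URL shown in the curl command and does not send a request. The relationships_url helper is a hypothetical name, and the parameter names direction, urn, and types are taken from the curl example.

```python
from urllib.parse import urlencode


def relationships_url(
    urn: str,
    relationship_type: str = "MemberOf",
    direction: str = "INCOMING",
    base_url: str = "http://localhost:8080",
) -> str:
    """Build the /relationships URL for a URN (hypothetical helper)."""
    # safe="()," keeps parentheses and commas literal, matching the curl example.
    query = urlencode(
        {"direction": direction, "urn": urn, "types": relationship_type},
        safe="(),",
    )
    return f"{base_url}/relationships?{query}"


group_urn = "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,recommendation-models,PROD)"
print(relationships_url(group_urn))
```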
Integration with ML Platforms
ML Model Groups map onto the grouping concepts of popular ML platforms:
MLflow
MLflow's registered models naturally map to DataHub model groups, with model versions becoming individual MLModel entities within the group. The MLflow ingestion connector automatically creates these relationships.
SageMaker
Amazon SageMaker model packages and model package groups can be represented as model groups in DataHub, providing a unified view across AWS environments.
Databricks
Databricks ML models registered in Unity Catalog can be organized into model groups for better organization and governance.
Kubeflow
Kubeflow model registries can leverage model groups to organize models by pipeline or serving configuration.
Custom Platforms
Any custom ML platform can use model groups by specifying an appropriate platform identifier in the URN.
Notable Exceptions
Model Group vs Individual Models
It's important to understand when to use a model group versus tracking individual models:
- Use Model Groups when: You have multiple related versions or variants of models that should be organized together. For example, all versions of a "customer churn prediction" model.
- Use Individual Models when: You have standalone models that don't have multiple versions or aren't part of a logical family.
Lineage Inheritance
Lineage information stored at the model group level (via trainingJobs and downstreamJobs) represents common lineage across all models in the group. Individual models can also have their own specific lineage information in their mlModelProperties aspect. The two levels of lineage are complementary:
- Model group lineage: Shared training pipelines or common downstream consumers
- Individual model lineage: Specific training runs or deployment-specific consumers
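One way to picture the two complementary levels is as two lists that a consumer might union when it wants the full picture for a single model. The following is a plain-Python illustration only. It does not describe how DataHub stores or resolves lineage, the Lineage class and combined_view function are hypothetical, and the job URNs are made-up examples.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lineage:
    """Hypothetical stand-in for the trainingJobs / downstreamJobs fields."""
    training_jobs: List[str] = field(default_factory=list)
    downstream_jobs: List[str] = field(default_factory=list)


def _merge(first: List[str], second: List[str]) -> List[str]:
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(first + second))


def combined_view(group: Lineage, model: Lineage) -> Lineage:
    """Union group-level (shared) and model-level (specific) lineage."""
    return Lineage(
        training_jobs=_merge(group.training_jobs, model.training_jobs),
        downstream_jobs=_merge(group.downstream_jobs, model.downstream_jobs),
    )


# Illustrative URNs, not taken from a real deployment.
shared_training = "urn:li:dataJob:(urn:li:dataFlow:(airflow,train_recommendation_model,PROD),training_task)"
retraining_run = "urn:li:dataJob:(urn:li:dataFlow:(airflow,train_recommendation_model,PROD),retrain_v3)"
batch_scoring = "urn:li:dataJob:(urn:li:dataFlow:(airflow,score_homepage,PROD),batch_score)"

group_lineage = Lineage(training_jobs=[shared_training], downstream_jobs=[batch_scoring])
model_lineage = Lineage(training_jobs=[shared_training, retraining_run])

combined = combined_view(group_lineage, model_lineage)
print(combined.training_jobs)
print(combined.downstream_jobs)
```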
Naming Considerations
When creating model groups, consider your naming strategy:
- Names should be stable over time as they are part of the identifier
- Avoid including version numbers in the group name (use individual model names for versioning)
- Use clear, descriptive names that indicate the model family or purpose
- Follow consistent naming conventions across your organization
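The advice about version numbers can be checked mechanically before a group is created. The sketch below flags names that end in a version-like suffix such as -v2. The has_version_suffix function and its regular expression are assumptions about what counts as a version suffix, not a DataHub rule, so adjust the pattern to your own conventions.

```python
import re

# Matches a trailing "v<digits>" (optionally "v1.2" or "v1_2") that starts the
# name or follows a '-', '_' or '.' separator. This is an illustrative heuristic.
_VERSION_SUFFIX = re.compile(r"(?:^|[-_.])v\d+(?:[._]\d+)*$", re.IGNORECASE)


def has_version_suffix(name: str) -> bool:
    """Return True if the group name ends in a version-like suffix."""
    return _VERSION_SUFFIX.search(name) is not None


for candidate in ["recommendation-models", "customer-churn-prediction", "fraud-detection-v2"]:
    status = "has a version suffix" if has_version_suffix(candidate) else "ok"
    print(f"{candidate}: {status}")
```

Under this heuristic, fraud-detection-v2 is flagged while recommendation-models and customer-churn-prediction pass. Whether a flagged name is actually a problem depends on whether the suffix denotes a model version or a distinct model family.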
Platform Instance Support
Model groups support platform instances via the dataPlatformInstance aspect. This is useful when you have multiple instances of the same platform (e.g., multiple MLflow registries) and need to distinguish model groups across them.
Search and Discovery
Model groups are searchable in DataHub by:
- Name (with autocomplete support)
- Description (full-text search)
- Platform
- Origin/Fabric
- Tags and glossary terms
- Domain
- Owners
This makes it easy to discover relevant model groups through the DataHub UI or search API.
Technical Reference
For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.