Version: Next

Entities

The DataHub SDK provides a set of entities that can be used to interact with DataHub’s metadata.

Dataset

class datahub.sdk.dataset.Dataset(*, platform, name, platform_instance = None, env = 'PROD', description = None, display_name = None, qualified_name = None, external_url = None, custom_properties = None, created = None, last_modified = None, parent_container = Unset.token, subtype = None, owners = None, links = None, tags = None, terms = None, domain = None, schema = None, upstreams = None, structured_properties = None, extra_aspects = None)

Bases: HasPlatformInstance, HasSubtype, HasContainer, HasOwnership, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, HasStructuredProperties, Entity

Represents a dataset in DataHub.

A dataset represents a collection of data, such as a table, view, or file. This class provides methods for managing dataset metadata including schema, lineage, and various aspects like ownership, tags, and terms.

  • Parameters:
    • platform (str)
    • name (str)
    • platform_instance (Optional[str])
    • env (str)
    • description (Optional[str])
    • display_name (Optional[str])
    • qualified_name (Optional[str])
    • external_url (Optional[str])
    • custom_properties (Optional[Dict[str, str]])
    • created (Optional[datetime])
    • last_modified (Optional[datetime])
    • parent_container (ParentContainerInputType | Unset)
    • subtype (Optional[str])
    • owners (Optional[OwnersInputType])
    • links (Optional[LinksInputType])
    • tags (Optional[TagsInputType])
    • terms (Optional[TermsInputType])
    • domain (Optional[DomainInputType])
    • schema (Optional[SchemaFieldsInputType])
    • upstreams (Optional[models.UpstreamLineageClass])
    • structured_properties (Optional[StructuredPropertyInputType])
    • extra_aspects (ExtraAspectsType)
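
As a minimal sketch of how the keyword-only constructor might be called, the example below collects a few of the parameters listed above and passes them to `Dataset`. The platform, table name, and property values are hypothetical, and the helper names `example_dataset_kwargs` and `build_example_dataset` are introduced here only for illustration. The SDK import is deferred into the function body so that the argument-building part can run without DataHub installed.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def example_dataset_kwargs() -> Dict[str, Any]:
    """Collect keyword arguments for Dataset(...); all values are made up."""
    return {
        "platform": "snowflake",            # hypothetical platform name
        "name": "analytics.public.orders",  # hypothetical table name
        "env": "PROD",                      # same as the documented default
        "description": "One row per customer order.",
        "custom_properties": {"team": "analytics"},
        "created": datetime(2024, 1, 15, tzinfo=timezone.utc),
    }


def build_example_dataset():
    """Construct the entity; requires the DataHub SDK to be installed."""
    from datahub.sdk.dataset import Dataset  # module path as documented above

    # The signature starts with `*`, so every argument is passed by keyword.
    return Dataset(**example_dataset_kwargs())
```

Keeping the arguments in a plain dictionary makes them easy to inspect or log before the entity is built.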

property created : datetime | None

Get the creation timestamp of the dataset.

  • Returns: The creation timestamp if set, None otherwise.

property custom_properties : Dict[str, str]

Get the custom properties of the dataset.

  • Returns: Dictionary of custom properties.

property description : str | None

Get the description of the dataset.

  • Returns: The description if set, None otherwise.

property display_name : str | None

Get the display name of the dataset.

  • Returns: The display name if set, None otherwise.

property external_url : str | None

Get the external URL of the dataset.

  • Returns: The external URL if set, None otherwise.

classmethod get_urn_type()

Get the URN type for datasets.

  • Return type:Type[DatasetUrn]
  • Returns: The DatasetUrn class.

property last_modified : datetime | None

Get the last modification timestamp of the dataset.

  • Returns: The last modification timestamp if set, None otherwise.

property qualified_name : str | None

Get the qualified name of the dataset.

  • Returns: The qualified name if set, None otherwise.

property schema : List[SchemaField]

Get the schema fields of the dataset.

  • Returns: List of SchemaField objects representing the dataset’s schema.

set_created(created)

Set the creation timestamp of the dataset.

  • Parameters:created (datetime) – The creation timestamp to set.
  • Return type:None
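
Source systems often report timestamps as epoch milliseconds, while `set_created` expects a `datetime`. The following sketch shows one way to do the conversion with the standard library; the helper name `epoch_millis_to_datetime` and the sample value are assumptions made for illustration, and the choice of UTC is a convention rather than a documented requirement.

```python
from datetime import datetime, timezone


def epoch_millis_to_datetime(millis: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


created = epoch_millis_to_datetime(1_700_000_000_000)
print(created.isoformat())  # 2023-11-14T22:13:20+00:00
# The value could then be passed along, e.g. dataset.set_created(created).
```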

set_custom_properties(custom_properties)

Set the custom properties of the dataset.

  • Parameters:custom_properties (Dict[str, str]) – Dictionary of custom properties to set.
  • Return type:None
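
Because the parameter is typed `Dict[str, str]`, values such as numbers or booleans need to be stringified before they are passed in. A minimal sketch of that preparation step follows; `to_custom_properties` is a hypothetical helper (not part of the SDK), and dropping `None` entries is a design choice made for this example.

```python
from typing import Any, Dict, Mapping


def to_custom_properties(raw: Mapping[str, Any]) -> Dict[str, str]:
    """Stringify keys and values, dropping entries whose value is None."""
    return {str(key): str(value) for key, value in raw.items() if value is not None}


props = to_custom_properties({"row_count": 1200, "partitioned": True, "owner_note": None})
print(props)  # {'row_count': '1200', 'partitioned': 'True'}
# The result could then be passed along, e.g. dataset.set_custom_properties(props).
```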

set_description(description)

Set the description of the dataset.

  • Parameters:description (str) – The description to set.
  • Return type:None

NOTE

If called during ingestion, this will warn if overwriting a non-ingestion description.

set_display_name(display_name)

Set the display name of the dataset.

  • Parameters:display_name (str) – The display name to set.
  • Return type:None

set_external_url(external_url)

Set the external URL of the dataset.

  • Parameters:external_url (str) – The external URL to set.
  • Return type:None

set_last_modified(last_modified)

  • Parameters:last_modified (datetime)
  • Return type:None

set_qualified_name(qualified_name)

Set the qualified name of the dataset.

  • Parameters:qualified_name (str) – The qualified name to set.
  • Return type:None

set_upstreams(upstreams)

property upstreams : UpstreamLineageClass | None

property urn : DatasetUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.
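
For orientation, the string form of a dataset URN combines the platform, name, and environment. The sketch below builds that string by hand. The layout shown is the commonly used DataHub dataset URN format rather than something stated on this page, and `dataset_urn_string` is a hypothetical helper that ignores `platform_instance`. In practice, reading the `urn` property is preferable to formatting URNs manually.

```python
def dataset_urn_string(platform: str, name: str, env: str = "PROD") -> str:
    """Format a dataset URN string (illustrative helper, not an SDK function)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


print(dataset_urn_string("snowflake", "analytics.public.orders"))
# urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)
```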

SchemaField

class datahub.sdk.dataset.SchemaField(parent, field_path)

Bases: object

  • Parameters:
    • parent (Dataset)
    • field_path (str)

add_tag(tag)

add_term(term)

property description : str | None

property field_path : str

property mapped_type : SchemaFieldDataTypeClass

property native_type : str

remove_tag(tag)

remove_term(term)

set_description(description)

  • Parameters:description (str)
  • Return type:None

set_tags(tags)

set_terms(terms)

property tags : List[TagAssociationClass] | None

property terms : List[GlossaryTermAssociationClass] | None

parse_cll_mapping

datahub.sdk.dataset.parse_cll_mapping(*, upstream, downstream, cll_mapping)

Container

class datahub.sdk.container.Container(container_key, *, display_name, qualified_name = None, description = None, external_url = None, extra_properties = None, created = None, last_modified = None, parent_container = Auto.token, subtype = None, owners = None, links = None, tags = None, terms = None, domain = None, structured_properties = None, extra_aspects = None)

Bases: HasPlatformInstance, HasSubtype, HasContainer, HasOwnership, HasInstitutionalMemory, HasStructuredProperties, HasTags, HasTerms, HasDomain, Entity

property created : datetime | None

property custom_properties : Dict[str, str] | None

property description : str | None

property display_name : str

property external_url : str | None

classmethod get_urn_type()

Get the URN type for this entity class.

  • Return type:Type[ContainerUrn]
  • Returns: The URN type class that corresponds to this entity type.

property last_modified : datetime | None

property qualified_name : str | None

set_created(created)

  • Parameters:created (datetime)
  • Return type:None

set_custom_properties(custom_properties)

  • Parameters:custom_properties (Dict[str, str])
  • Return type:None

set_description(description)

  • Parameters:description (str)
  • Return type:None

set_display_name(value)

  • Parameters:value (str)
  • Return type:None

set_external_url(external_url)

  • Parameters:external_url (str)
  • Return type:None

set_last_modified(last_modified)

  • Parameters:last_modified (datetime)
  • Return type:None

set_qualified_name(qualified_name)

  • Parameters:qualified_name (str)
  • Return type:None

MLModel

class datahub.sdk.mlmodel.MLModel(id, platform, version = None, aliases = None, platform_instance = None, env = 'PROD', name = None, description = None, training_metrics = None, hyper_params = None, external_url = None, custom_properties = None, created = None, last_modified = None, owners = None, links = None, tags = None, terms = None, domain = None, model_group = None, training_jobs = None, downstream_jobs = None, structured_properties = None, extra_aspects = None)

Bases: HasPlatformInstance, HasOwnership, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, HasVersion, HasStructuredProperties, Entity

add_downstream_job(downstream_job)

add_hyper_params(params)

  • Parameters:params (Union[List[MLHyperParamClass], Dict[str, Optional[str]]])
  • Return type:None
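
The dictionary form of this parameter is typed `Dict[str, Optional[str]]`, so numeric hyperparameters or metrics have to be converted to strings first. The sketch below shows one way to do that while keeping `None` values as they are; `to_param_dict` and the sample values are hypothetical and not part of the SDK.

```python
from typing import Any, Dict, Mapping, Optional


def to_param_dict(raw: Mapping[str, Any]) -> Dict[str, Optional[str]]:
    """Stringify values, leaving None untouched."""
    return {name: None if value is None else str(value) for name, value in raw.items()}


hyper_params = to_param_dict({"learning_rate": 0.01, "max_depth": 6, "seed": None})
print(hyper_params)  # {'learning_rate': '0.01', 'max_depth': '6', 'seed': None}
# The same shape is accepted by add_hyper_params, set_hyper_params,
# add_training_metrics, and set_training_metrics.
```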

add_training_job(training_job)

add_training_metrics(metrics)

  • Parameters:metrics (Union[List[MLMetricClass], Dict[str, Optional[str]]])
  • Return type:None

property created : datetime | None

property custom_properties : Dict[str, str] | None

property description : str | None

property downstream_jobs : List[str] | None

property external_url : str | None

classmethod get_urn_type()

Get the URN type for this entity class.

  • Return type:Type[MlModelUrn]
  • Returns: The URN type class that corresponds to this entity type.

property hyper_params : List[MLHyperParamClass] | None

property last_modified : datetime | None

property model_group : str | None

property name : str | None

remove_downstream_job(downstream_job)

remove_training_job(training_job)

set_created(created)

  • Parameters:created (datetime)
  • Return type:None

set_custom_properties(custom_properties)

  • Parameters:custom_properties (Dict[str, str])
  • Return type:None

set_description(description)

  • Parameters:description (str)
  • Return type:None

set_downstream_jobs(downstream_jobs)

set_external_url(external_url)

  • Parameters:external_url (str)
  • Return type:None

set_hyper_params(params)

  • Parameters:params (Union[List[MLHyperParamClass], Dict[str, Optional[str]]])
  • Return type:None

set_last_modified(last_modified)

  • Parameters:last_modified (datetime)
  • Return type:None

set_model_group(group)

set_name(name)

  • Parameters:name (str)
  • Return type:None

set_training_jobs(training_jobs)

set_training_metrics(metrics)

  • Parameters:metrics (Union[List[MLMetricClass], Dict[str, Optional[str]]])
  • Return type:None

property training_jobs : List[str] | None

property training_metrics : List[MLMetricClass] | None

property urn : MlModelUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.

MLModelGroup

class datahub.sdk.mlmodelgroup.MLModelGroup(id, platform, name = '', platform_instance = None, env = 'PROD', description = None, display_name = None, external_url = None, custom_properties = None, created = None, last_modified = None, owners = None, links = None, tags = None, terms = None, domain = None, training_jobs = None, downstream_jobs = None, structured_properties = None, extra_aspects = None)

Bases: HasPlatformInstance, HasOwnership, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, HasStructuredProperties, Entity

add_downstream_job(downstream_job)

add_training_job(training_job)

property created : datetime | None

property custom_properties : Dict[str, str] | None

property description : str | None

property downstream_jobs : List[str] | None

property external_url : str | None

classmethod get_urn_type()

Get the URN type for this entity class.

  • Return type:Type[MlModelGroupUrn]
  • Returns: The URN type class that corresponds to this entity type.

property last_modified : datetime | None

property name : str | None

remove_downstream_job(downstream_job)

remove_training_job(training_job)

set_created(created)

  • Parameters:created (datetime)
  • Return type:None

set_custom_properties(custom_properties)

  • Parameters:custom_properties (Dict[str, str])
  • Return type:None

set_description(description)

  • Parameters:description (str)
  • Return type:None

set_downstream_jobs(downstream_jobs)

set_external_url(external_url)

  • Parameters:external_url (str)
  • Return type:None

set_last_modified(last_modified)

  • Parameters:last_modified (datetime)
  • Return type:None

set_name(display_name)

  • Parameters:display_name (str)
  • Return type:None

set_training_jobs(training_jobs)

property training_jobs : List[str] | None

property urn : MlModelGroupUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.

Dashboard

class datahub.sdk.dashboard.Dashboard(*, name, platform, display_name = None, platform_instance = None, description = '', external_url = None, dashboard_url = None, custom_properties = None, last_modified = None, last_refreshed = None, input_datasets = None, charts = None, dashboards = None, subtype = None, owners = None, links = None, tags = None, terms = None, domain = None, extra_aspects = None)

Bases: HasPlatformInstance, HasSubtype, HasOwnership, HasContainer, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, Entity

Represents a dashboard in DataHub.

add_chart(chart)

Add a chart to the dashboard.

  • Parameters:chart (Union[str, ChartUrn, Chart])
  • Return type:None

add_dashboard(dashboard)

Add a dashboard to the dashboard.

add_input_dataset(input_dataset)

Add an input dataset to the dashboard.

property charts : List[ChartUrn]

Get the charts of the dashboard.

property custom_properties : Dict[str, str]

Get the custom properties of the dashboard.

property dashboard_url : str | None

Get the dashboard URL.

property dashboards : List[DashboardUrn]

Get the dashboards of the dashboard.

property description : str | None

Get the description of the dashboard.

property display_name : str | None

Get the display name of the dashboard.

property external_url : str | None

Get the external URL of the dashboard.

classmethod get_urn_type()

Get the URN type for dashboards.

  • Return type:Type[DashboardUrn]
  • Returns: The DashboardUrn class.

property input_datasets : List[DatasetUrn]

Get the input datasets of the dashboard.

property last_modified : datetime | None

Get the last modification timestamp of the dashboard.

property last_refreshed : datetime | None

Get the last refresh timestamp of the dashboard.

property name : str

Get the name of the dashboard.

remove_chart(chart)

Remove a chart from the dashboard.

  • Parameters:chart (Union[str, ChartUrn, Chart])
  • Return type:None

remove_input_dataset(input_dataset)

Remove an input dataset from the dashboard.

set_charts(charts)

Set the charts of the dashboard.

  • Parameters:charts (List[Union[str, ChartUrn, Chart]])
  • Return type:None

set_custom_properties(custom_properties)

Set the custom properties of the dashboard.

  • Parameters:custom_properties (Dict[str, str])
  • Return type:None

set_dashboard_url(dashboard_url)

Set the dashboard URL.

  • Parameters:dashboard_url (str)
  • Return type:None

set_dashboards(dashboards)

Set the dashboards of the dashboard.

set_description(description)

Set the description of the dashboard.

  • Parameters:description (str)
  • Return type:None

set_display_name(display_name)

Set the display name of the dashboard.

  • Parameters:display_name (str)
  • Return type:None

set_external_url(external_url)

Set the external URL of the dashboard.

  • Parameters:external_url (str)
  • Return type:None

set_input_datasets(input_datasets)

Set the input datasets of the dashboard.

  • Parameters:input_datasets (List[Union[str, DatasetUrn, Dataset]])
  • Return type:None

set_last_modified(last_modified)

Set the last modification timestamp of the dashboard.

  • Parameters:last_modified (datetime)
  • Return type:None

set_last_refreshed(last_refreshed)

Set the last refresh timestamp of the dashboard.

  • Parameters:last_refreshed (datetime)
  • Return type:None

set_title(title)

Set the title of the dashboard.

  • Parameters:title (str)
  • Return type:None

property title : str

Get the title of the dashboard.

property urn : DashboardUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.

Chart

class datahub.sdk.chart.Chart(*, name, platform, display_name = None, platform_instance = None, description = '', external_url = None, chart_url = None, custom_properties = None, last_modified = None, last_refreshed = None, chart_type = None, access = None, subtype = None, owners = None, links = None, tags = None, terms = None, domain = None, input_datasets = None, extra_aspects = None)

Bases: HasPlatformInstance, HasSubtype, HasOwnership, HasContainer, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, Entity

Represents a chart in DataHub.

property access : str | None

Get the access level of the chart as a string.

add_input_dataset(input_dataset)

Add an input to the chart.

property chart_type : str | None

Get the type of the chart as a string.

property chart_url : str | None

Get the chart URL.

property custom_properties : Dict[str, str]

Get the custom properties of the chart.

property description : str | None

Get the description of the chart.

property display_name : str | None

Get the display name of the chart.

property external_url : str | None

Get the external URL of the chart.

classmethod get_urn_type()

Get the URN type for charts.

  • Return type:Type[ChartUrn]
  • Returns: The ChartUrn class.

property input_datasets : List[DatasetUrn]

Get the input datasets of the chart.

property last_modified : datetime | None

Get the last modification timestamp of the chart.

property last_refreshed : datetime | None

Get the last refresh timestamp of the chart.

property name : str

Get the name of the chart.

remove_input_dataset(input_dataset)

Remove an input from the chart.

set_access(access)

Set the access level of the chart.

set_chart_type(chart_type)

Set the type of the chart.

  • Parameters:chart_type (Union[str, ChartTypeClass])
  • Return type:None

set_chart_url(chart_url)

Set the chart URL.

  • Parameters:chart_url (str)
  • Return type:None

set_custom_properties(custom_properties)

Set the custom properties of the chart.

  • Parameters:custom_properties (Dict[str, str])
  • Return type:None

set_description(description)

Set the description of the chart.

  • Parameters:description (str)
  • Return type:None

set_display_name(display_name)

Set the display name of the chart.

  • Parameters:display_name (str)
  • Return type:None

set_external_url(external_url)

Set the external URL of the chart.

  • Parameters:external_url (str)
  • Return type:None

set_input_datasets(input_datasets)

Set the input datasets of the chart.

  • Parameters:input_datasets (List[Union[str, DatasetUrn, Dataset]])
  • Return type:None

set_last_modified(last_modified)

Set the last modification timestamp of the chart.

  • Parameters:last_modified (datetime)
  • Return type:None

set_last_refreshed(last_refreshed)

Set the last refresh timestamp of the chart.

  • Parameters:last_refreshed (datetime)
  • Return type:None

set_title(title)

Set the title of the chart.

  • Parameters:title (str)
  • Return type:None

property title : str

Get the title of the chart.

property urn : ChartUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.

DataJob

class datahub.sdk.datajob.DataJob(*, name, flow = None, flow_urn = None, platform_instance = None, display_name = None, description = None, external_url = None, custom_properties = None, created = None, last_modified = None, subtype = None, owners = None, links = None, tags = None, terms = None, domain = None, inlets = None, outlets = None, structured_properties = None, extra_aspects = None)

Bases: HasPlatformInstance, HasSubtype, HasContainer, HasOwnership, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, HasStructuredProperties, Entity

Represents a data job in DataHub. A data job is an executable unit of a data pipeline, such as an Airflow task or a Spark job.

property created : datetime | None

Get the creation timestamp of the data job.

property custom_properties : Dict[str, str]

Get the custom properties of the data job.

property description : str | None

Get the description of the data job.

property display_name : str | None

Get the display name of the data job.

property external_url : str | None

Get the external URL of the data job.

property flow_urn : DataFlowUrn

Get the data flow associated with the data job.

classmethod get_urn_type()

Get the URN type for data jobs.

property inlets : List[DatasetUrn]

Get the inlets of the data job.

property last_modified : datetime | None

Get the last modification timestamp of the data job.

property name : str

Get the name of the data job.

property outlets : List[DatasetUrn]

Get the outlets of the data job.

set_created(created)

Set the creation timestamp of the data job.

  • Parameters:created (datetime)
  • Return type:None

set_custom_properties(custom_properties)

Set the custom properties of the data job.

  • Parameters:custom_properties (Dict[str, str])
  • Return type:None

set_description(description)

Set the description of the data job.

  • Parameters:description (str)
  • Return type:None

set_display_name(display_name)

Set the display name of the data job.

  • Parameters:display_name (str)
  • Return type:None

set_external_url(external_url)

Set the external URL of the data job.

  • Parameters:external_url (str)
  • Return type:None

set_inlets(inlets)

Set the inlets of the data job.

  • Parameters:inlets (List[Union[str, DatasetUrn]])
  • Return type:None

set_last_modified(last_modified)

Set the last modification timestamp of the data job.

  • Parameters:last_modified (datetime)
  • Return type:None

set_outlets(outlets)

Set the outlets of the data job.

  • Parameters:outlets (List[Union[str, DatasetUrn]])
  • Return type:None

property urn : DataJobUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.
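
A data job's URN embeds the URN of the data flow it belongs to, which is why the constructor takes a `flow` or `flow_urn`. The sketch below illustrates that nesting with plain string formatting. The URN layouts are the commonly used DataHub formats rather than something stated on this page, and both helper functions as well as the Airflow names are hypothetical.

```python
def data_flow_urn_string(orchestrator: str, flow_id: str, cluster: str = "PROD") -> str:
    """Format a data flow URN string (illustrative helper, not an SDK function)."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def data_job_urn_string(flow_urn: str, job_id: str) -> str:
    """Format a data job URN string by wrapping the parent flow's URN."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


flow_urn = data_flow_urn_string("airflow", "daily_orders")
print(data_job_urn_string(flow_urn, "load_orders"))
# urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_orders,PROD),load_orders)
```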

DataFlow

class datahub.sdk.dataflow.DataFlow(*, name, platform, display_name = None, platform_instance = None, env = 'PROD', description = None, external_url = None, custom_properties = None, created = None, last_modified = None, subtype = None, owners = None, links = None, tags = None, terms = None, domain = None, parent_container = Unset.token, structured_properties = None, extra_aspects = None)

Bases: HasPlatformInstance, HasSubtype, HasOwnership, HasContainer, HasInstitutionalMemory, HasTags, HasTerms, HasDomain, HasStructuredProperties, Entity

Represents a dataflow in DataHub. A dataflow represents a data processing pipeline or workflow, such as an Airflow DAG, that groups related data jobs. This class provides methods for managing dataflow metadata such as the description, display name, and custom properties, along with aspects like ownership, tags, and terms.

property created : datetime | None

Get the creation timestamp of the dataflow.

  • Returns: The creation timestamp if set, None otherwise.

property custom_properties : Dict[str, str]

Get the custom properties of the dataflow.

  • Returns: Dictionary of custom properties.

property description : str | None

Get the description of the dataflow.

  • Returns: The description if set, None otherwise.

property display_name : str | None

Get the display name of the dataflow.

  • Returns: The display name if set, None otherwise.

property env : str | FabricTypeClass | None

Get the environment of the dataflow.

property external_url : str | None

Get the external URL of the dataflow.

  • Returns: The external URL if set, None otherwise.

classmethod get_urn_type()

Get the URN type for dataflows.

  • Return type:Type[DataFlowUrn]
  • Returns: The DataFlowUrn class.

property last_modified : datetime | None

Get the last modification timestamp of the dataflow.

  • Returns: The last modification timestamp if set, None otherwise.

property name : str

Get the name of the dataflow.

  • Returns: The name of the dataflow.

set_created(created)

Set the creation timestamp of the dataflow.

  • Parameters:created (datetime) – The creation timestamp to set.
  • Return type:None

set_custom_properties(custom_properties)

Set the custom properties of the dataflow.

  • Parameters:custom_properties (Dict[str, str]) – Dictionary of custom properties to set.
  • Return type:None

set_description(description)

Set the description of the dataflow.

  • Parameters:description (str) – The description to set.
  • Return type:None

NOTE

If called during ingestion, this will warn if overwriting a non-ingestion description.

set_display_name(display_name)

Set the display name of the dataflow.

  • Parameters:display_name (str) – The display name to set.
  • Return type:None

set_external_url(external_url)

Set the external URL of the dataflow.

  • Parameters:external_url (str) – The external URL to set.
  • Return type:None

set_last_modified(last_modified)

  • Parameters:last_modified (datetime)
  • Return type:None

property urn : DataFlowUrn

Get the entity’s URN.

  • Returns: The URN that uniquely identifies this entity.