Search

DataHub's Python SDK makes it easy to search and discover metadata across your data ecosystem. Whether you're exploring unknown datasets, filtering by environment, or building advanced search tools, this guide walks you through how to do it all programmatically.

With the Search SDK, you can:

  • Search for data assets by keyword or using structured filters
  • Filter by environment, platform, type, custom properties, or other metadata fields
  • Use AND / OR / NOT logic for advanced queries

Getting Started

To use the DataHub SDK, install the acryl-datahub package and set up a connection to your DataHub instance. Follow the installation guide to get started.

Connect to your DataHub instance:

from datahub.sdk import DataHubClient

client = DataHubClient(server="<your_server>", token="<your_token>")
  • server: The URL of your DataHub GMS server
    • local: http://localhost:8080
    • hosted: https://<your_datahub_url>/gms
  • token: You'll need to generate a Personal Access Token from your DataHub instance.
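To keep the token out of source files, the two parameters above can be read from the environment. The following is a minimal sketch, not part of the SDK: the helper names and the environment variable names DATAHUB_GMS_URL and DATAHUB_GMS_TOKEN are assumptions chosen for illustration, so substitute whatever your deployment uses.

```python
import os


def load_connection_settings(env=None):
    """Return the server/token keyword arguments for DataHubClient.

    The variable names below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    # Fall back to the local GMS address mentioned above.
    server = env.get("DATAHUB_GMS_URL", "http://localhost:8080")
    token = env.get("DATAHUB_GMS_TOKEN")
    if not token:
        raise RuntimeError("Set DATAHUB_GMS_TOKEN to a Personal Access Token")
    return {"server": server, "token": token}


def make_client():
    """Build a client from the environment (hypothetical helper)."""
    # Imported lazily so the settings helper can be used without the SDK installed.
    from datahub.sdk import DataHubClient

    return DataHubClient(**load_connection_settings())
```

Calling make_client() then replaces the hard-coded DataHubClient(server=..., token=...) line in the examples on this page.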

Search Types

DataHub offers two primary search approaches:

  • Query-based search: search using simple keywords across common fields like name, description, and column names.
  • Filter-based search: search using structured filters to scope results by platform, environment, entity type, and other metadata fields.

Combining Query and Filters

Query and filters can be used together for more precise searches. Check out this example for more details.

Query-based search allows you to search using simple keywords. This matches across common fields like name, description, and column names. This is useful for exploration when you're unsure of the exact asset you're looking for.

For example, the script below searches for any assets that have "sales" in their metadata.

# Inlined from /metadata-ingestion/examples/library/search_with_query.py
from datahub.sdk import DataHubClient

client = DataHubClient(server="<your_server>", token="<your_token>")

# Search for entities with "sales" in the metadata
results = client.search.get_urns(query="sales")

print(list(results))

Example output:

[
    DatasetUrn("urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_revenue_2023,PROD)"),
    DatasetUrn("urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_forecast,PROD)")
]
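The dataset URNs in this output pack the platform, dataset name, and environment into one string. As a rough illustration of that layout, the sketch below splits the string form of a URN into its parts. The helper names are hypothetical, and it assumes that calling str() on a returned URN object yields the urn:li:... string. The SDK's URN classes may already expose these parts, and this sketch works on the string form only.

```python
def entity_type_of(urn: str) -> str:
    """Return the entity type from a 'urn:li:<type>:<id>' string."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a DataHub URN: {urn!r}")
    return parts[2]


def parse_dataset_urn(urn: str) -> dict:
    """Split a dataset URN string into platform, name, and env."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    body = urn[len(prefix):-1]
    # The platform URN comes first and the environment last; the name sits
    # in between and may itself contain commas.
    platform_urn, rest = body.split(",", 1)
    name, env = rest.rsplit(",", 1)
    return {
        "platform": platform_urn.rsplit(":", 1)[-1],
        "name": name,
        "env": env,
    }


urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_revenue_2023,PROD)"
print(entity_type_of(urn), parse_dataset_urn(urn))
```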

Filter-based search allows you to scope results by platform, environment, entity type, and other structured fields. This is useful when you want to narrow down results to specific asset types or metadata fields.

Find All Snowflake Entities

For example, the script below searches for entities on the Snowflake platform.

# Inlined from /metadata-ingestion/examples/library/search_with_filter.py
from datahub.sdk import DataHubClient, FilterDsl as F

client = DataHubClient(server="<your_server>", token="<your_token>")

# Search for entities on the Snowflake platform
results = client.search.get_urns(filter=F.platform("snowflake"))

print(list(results))
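A platform-wide filter can match a large number of assets, and list(results) collects all of them. If the value returned by get_urns is a lazy iterable (an assumption here, suggested by the examples wrapping it in list()), you can preview only the first few results with itertools.islice. The helper name first_n is hypothetical.

```python
from itertools import islice


def first_n(results, n):
    """Return at most n items from any iterable without consuming the rest."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return list(islice(results, n))


# Usage with the search client (requires a DataHub connection):
# preview = first_n(client.search.get_urns(filter=F.platform("snowflake")), 10)
```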

You can combine a query with filters to refine search results further. For example, the script below searches for Snowflake datasets that have "forecast" in their metadata.

# Inlined from /metadata-ingestion/examples/library/search_with_query_and_filter.py
from datahub.sdk import DataHubClient, FilterDsl as F

client = DataHubClient(server="<your_server>", token="<your_token>")

# Search snowflake datasets that have "forecast" in the metadata
results = client.search.get_urns(
    query="forecast",
    filter=F.and_(F.platform("snowflake"), F.entity_type("dataset")),
)
print(list(results))

For more details on available filters, see the filter options.

Common Search Patterns

Here are some common examples of advanced queries using filters and logical operations:

Find All Dashboards

# Inlined from /metadata-ingestion/examples/library/search_filter_by_entity_type.py
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

# search for all dashboards
client = DataHubClient(server="<your_server>", token="<your_token>")
results = client.search.get_urns(filter=F.entity_type("dashboard"))

Find All Snowflake Entities

# Inlined from /metadata-ingestion/examples/library/search_filter_by_platform.py
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

# search for all snowflake assets
client = DataHubClient(server="<your_server>", token="<your_token>")
results = client.search.get_urns(filter=F.platform("snowflake"))

Find All Entities in the Production Environment

# Inlined from /metadata-ingestion/examples/library/search_filter_by_env.py
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

# search for all assets in the production environment
client = DataHubClient(server="<your_server>", token="<your_token>")
results = client.search.get_urns(filter=F.env("PROD"))

Find All Entities in a Specific Domain

# Inlined from /metadata-ingestion/examples/library/search_filter_by_domain.py
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

# search for all assets in the marketing domain
client = DataHubClient(server="<your_server>", token="<your_token>")
results = client.search.get_urns(filter=F.domain("urn:li:domain:marketing"))

Find All Entities With a Specific Subtype

# Inlined from /metadata-ingestion/examples/library/search_filter_by_entity_subtype.py
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

# search for all mlflow assets of subtype "ML Experiment"
client = DataHubClient(server="<your_server>", token="<your_token>")
results = client.search.get_urns(
    filter=F.and_(F.platform("mlflow"), F.entity_subtype("ML Experiment"))
)

Find All Entities With Specific Custom Properties

# Inlined from /metadata-ingestion/examples/library/search_filter_by_custom_property.py
from datahub.sdk import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

client = DataHubClient(server="<your_server>", token="<your_token>")
# search for all assets with a custom property "my_custom_property" set to "my_value"
results = client.search.get_urns(
    filter=F.has_custom_property("my_custom_property", "my_value")
)

Find All Charts and Snowflake Datasets

You can combine filters using logical operations like and_, or_, and not_ to build advanced queries. Check the Logical Operator Options for more details.

# Inlined from /metadata-ingestion/examples/library/search_filter_combined_operation.py
from datahub.sdk import DataHubClient, FilterDsl as F

client = DataHubClient(server="<your_server>", token="<your_token>")

# Search for charts or snowflake datasets
results = client.search.get_urns(
    filter=F.or_(
        F.entity_type("chart"),
        F.and_(F.platform("snowflake"), F.entity_type("dataset")),
    )
)

print(list(results))

Find All Charts That Are Not in the Production Environment

# Inlined from /metadata-ingestion/examples/library/search_filter_not.py
from datahub.sdk import DataHubClient, FilterDsl as F

client = DataHubClient(server="<your_server>", token="<your_token>")

# Search for charts that are not in the PROD environment.
results = client.search.get_urns(
    filter=F.and_(F.entity_type("chart"), F.not_(F.env("PROD"))),
)

print(list(results))
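When the criteria come from user input, the and_ and not_ combinations shown above can be assembled conditionally. The sketch below is one possible way to do that. build_filter is a hypothetical helper, not an SDK function, and it takes the filter DSL as its first argument so that it only relies on the F.platform, F.env, F.entity_type, F.and_, and F.not_ calls used elsewhere on this page.

```python
def build_filter(F, *, platform=None, env=None, entity_type=None, exclude_env=None):
    """Combine whichever criteria were supplied into a single filter.

    Returns None when no criteria are given, a bare filter when exactly one
    is given, and an AND of all supplied criteria otherwise.
    """
    clauses = []
    if platform is not None:
        clauses.append(F.platform(platform))
    if env is not None:
        clauses.append(F.env(env))
    if entity_type is not None:
        clauses.append(F.entity_type(entity_type))
    if exclude_env is not None:
        # Negate the environment filter, as in the "not in PROD" example.
        clauses.append(F.not_(F.env(exclude_env)))

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return F.and_(*clauses)


# Usage with the SDK (requires a DataHub connection):
# from datahub.sdk import DataHubClient, FilterDsl as F
# results = client.search.get_urns(
#     filter=build_filter(F, entity_type="chart", exclude_env="PROD")
# )
```

Passing the DSL in as a parameter keeps the helper easy to unit test with a stand-in object.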

Advanced: Find entities by other searchable fields

Use F.custom_filter() to target specific fields such as urn, name, or description. Check the Supported Conditions for Custom Filter for the full list of allowed condition values.

# Inlined from /metadata-ingestion/examples/library/search_filter_custom.py
from datahub.sdk import DataHubClient, FilterDsl as F

client = DataHubClient(server="<your_server>", token="<your_token>")

# Search for datasets that have "example_dataset" in the urn
results = client.search.get_urns(
    filter=F.custom_filter(field="urn", condition="CONTAIN", values=["example_dataset"])
)

print(list(results))

Searchable Fields

With F.custom_filter(), you can filter on any field annotated with @Searchable in an entity's PDL file. For example, you can filter datajob entities by fields like name, description, or env, since they are annotated with @Searchable in DataJobInfo.pdl.

Search SDK Reference

For a full reference, see the search SDK reference.

Filter Options

The following filter options are available in the SDK:

Filter Type      Example Code
Platform         F.platform("snowflake")
Environment      F.env("PROD")
Entity Type      F.entity_type("dataset")
Domain           F.domain("urn:li:domain:xyz")
Subtype          F.entity_subtype("ML Experiment")
Deletion Status  F.soft_deleted("NOT_SOFT_DELETED")
Custom Property  F.has_custom_property("department", "sales")

Logical Operator Options

The following logical operators can be used to combine filters:

Operator  Example Code  Description
AND       F.and_(...)   Return entities matching all specified conditions.
OR        F.or_(...)    Return entities matching at least one condition.
NOT       F.not_(...)   Exclude entities that match a given condition.

Supported Conditions for Custom Filter

Use F.custom_filter() to apply conditions on specific fields such as urn, name, or description.

Condition     Description
EQUAL         Exact match for string fields.
CONTAIN       Contains substring in string fields.
START_WITH    Begins with a specific substring.
END_WITH      Ends with a specific substring.
GREATER_THAN  For numeric or timestamp fields, checks if the value is greater than the specified value.
LESS_THAN     For numeric or timestamp fields, checks if the value is less than the specified value.
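Because the condition is passed as a plain string, a typo such as "CONTAINS" only surfaces when the request is made. The sketch below checks the value locally against the conditions listed in this table before building the arguments for F.custom_filter(). The helper custom_filter_kwargs is hypothetical, and the set of conditions is copied from the table above, which may not be exhaustive for your DataHub version.

```python
# Conditions copied from the table above; treat this list as an assumption.
SUPPORTED_CONDITIONS = (
    "EQUAL",
    "CONTAIN",
    "START_WITH",
    "END_WITH",
    "GREATER_THAN",
    "LESS_THAN",
)


def custom_filter_kwargs(field, condition, values):
    """Validate the condition and return keyword arguments for F.custom_filter()."""
    normalized = condition.strip().upper()
    if normalized not in SUPPORTED_CONDITIONS:
        raise ValueError(
            f"unsupported condition {condition!r}; expected one of {SUPPORTED_CONDITIONS}"
        )
    # Accept a single string as shorthand for a one-element list.
    if isinstance(values, str):
        values = [values]
    return {"field": field, "condition": normalized, "values": list(values)}


# Usage with the SDK (requires a DataHub connection):
# filter_ = F.custom_filter(**custom_filter_kwargs("urn", "contain", "example_dataset"))
```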

FAQ

How do I handle authentication?
Generate a Personal Access Token from your DataHub instance settings and pass it into the DataHubClient. Check out the Personal Access Token Guide.

Can I combine query and filters?
Yes. Use query along with filter for more precise searches.