DataHub Semantic Search

This directory contains documentation for DataHub's semantic search capability, which enables natural language search across metadata entities using vector embeddings.

Note: This is developer documentation for the semantic search feature. For a working example, see the smoke test at smoke-test/tests/semantic/test_semantic_search.py.

Overview

Traditional keyword search requires exact term matches, limiting discoverability. Semantic search uses AI-generated embeddings to understand the meaning of queries and documents, returning relevant results even when exact keywords don't match.

Example:

  • Query: "how to request data access permissions"
  • Keyword search: ❌ No results (no exact match)
  • Semantic search: ✅ Returns "Data Access Request Process" document

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         DataHub Semantic Search                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────────┐ │
│  │  Ingestion   │     │     GMS      │────▶│       OpenSearch         │ │
│  │  Connector   │     │              │     │                          │ │
│  │              │     │ ┌──────────┐ │     │  ┌────────────────────┐  │ │
│  │ 1. Generate  │     │ │ Process  │ │     │  │ entityindex_v2     │  │ │
│  │    embeddings│     │ │ MCP +    │ │     │  │ (keyword search)   │  │ │
│  │              │     │ │ Write to │ │     │  └────────────────────┘  │ │
│  │ 2. Emit MCP  │────▶│ │ indices  │ │     │                          │ │
│  │    with      │     │ └──────────┘ │     │  ┌────────────────────┐  │ │
│  │    Semantic  │     │              │     │  │ entityindex_v2_    │  │ │
│  │    Embedding │     │              │     │  │ semantic           │  │ │
│  │    aspect    │     │              │     │  │ (vector search)    │  │ │
│  └──────────────┘     └──────────────┘     │  └────────────────────┘  │ │
│                                            └──────────────────────────┘ │
│                                                          │              │
│  ┌──────────────┐                                        │              │
│  │   GraphQL    │◀───────────────────────────────────────┘              │
│  │   Client     │  semanticSearchAcrossEntities()                       │
│  └──────────────┘                                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

How It Works

1. Data Ingestion

Documents and other entities are ingested into DataHub using standard ingestion connectors. When semantic search is enabled, GMS performs a dual-write:

  • Primary Index (entityindex_v2): Standard keyword-searchable index
  • Semantic Index (entityindex_v2_semantic): Vector-enabled index for semantic search

Note: The dual-index approach is transitional. The plan is to eventually retire v2 indices and use _semantic indices exclusively for both keyword and semantic search. See Architecture for details.
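The dual-write can be pictured with a small in-memory sketch. This is an illustration only, not GMS code: the `InMemoryIndices` class, the `embedding` field name, and the assumption that `entity` in `entityindex_v2` stands for the entity type (for example `documentindex_v2`) are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def index_names(entity: str) -> Tuple[str, str]:
    """Return (primary, semantic) index names for an entity type.

    Assumes "entity" in "entityindex_v2" is a placeholder for the entity type.
    """
    primary = f"{entity}index_v2"
    return primary, f"{primary}_semantic"


class InMemoryIndices:
    """Toy stand-in for OpenSearch: index name -> urn -> document."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, dict]] = {}

    def dual_write(
        self,
        entity: str,
        urn: str,
        fields: dict,
        embedding: Optional[List[float]] = None,
    ) -> None:
        primary, semantic = index_names(entity)
        # The keyword index receives the searchable fields only.
        self.docs.setdefault(primary, {})[urn] = dict(fields)
        # The semantic index receives the same fields plus the vector, if any.
        semantic_doc = dict(fields)
        if embedding is not None:
            semantic_doc["embedding"] = list(embedding)  # hypothetical field name
        self.docs.setdefault(semantic, {})[urn] = semantic_doc
```

Keeping the keyword fields in the semantic copy as well is consistent with the stated plan to serve both kinds of search from the _semantic indices, though the real document layout may differ.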

2. Embedding Generation

Embeddings are generated at two points:

Document Embeddings (at ingestion time):

  • Generated by the ingestion connector
  • Emitted via MCP (Metadata Change Proposal) as a SemanticContent aspect
  • GMS processes the MCP and writes embeddings to the semantic index
  • Supports privacy-sensitive use cases where only embeddings (not source text) are shared

Query Embeddings (at search time):

  • Generated by GMS using the configured embedding provider
  • Used to find similar documents via k-NN search

3. Query Processing

When a user performs a semantic search:

  1. The query text is converted to an embedding vector using the same model that produced the document embeddings
  2. OpenSearch performs k-NN (k-nearest neighbors) vector similarity search
  3. Results are ranked by cosine similarity to the query embedding
  4. Top matches are returned through the GraphQL API
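The ranking in steps 2 and 3 can be sketched in plain Python. This is an exhaustive, exact scan written for illustration; OpenSearch's k-NN search typically relies on an approximate index rather than comparing the query against every vector. The function names and the toy two-dimensional vectors are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (urn, score) pairs most similar to the query, best first."""
    scored = [(urn, cosine_similarity(query_vec, vec)) for urn, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because cosine similarity ignores vector length, a document is ranked by the direction of its embedding relative to the query, not by its magnitude.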

Quick Start

Prerequisites

  • DataHub running with semantic search enabled
  • AWS credentials (for Bedrock) or API key (for Cohere/OpenAI)

1. Configure Semantic Search

Set the following in your environment (e.g., docker/profiles/empty2.env):

ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
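A helper script or test harness may want to check these two variables before exercising semantic search. The sketch below is a hypothetical helper, not part of DataHub; in particular, treating ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES as a comma-separated list is an assumption.

```python
import os
from typing import List, Mapping, Tuple


def read_semantic_search_settings(
    env: Mapping[str, str] = os.environ,
) -> Tuple[bool, List[str]]:
    """Return (enabled, entity_types) parsed from environment variables."""
    flag = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED", "false")
    enabled = flag.strip().lower() == "true"
    # Assumption: multiple entity types would be separated by commas.
    raw = env.get("ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES", "")
    entities = [name.strip() for name in raw.split(",") if name.strip()]
    return enabled, entities
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.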

2. Run the Smoke Test

The best way to verify semantic search is working is to run the smoke test:

cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v

This test:

  • Ingests sample documents via GraphQL
  • Waits for indexing (20 seconds)
  • Executes semantic search
  • Verifies results
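When adapting this flow to your own scripts, polling for results can be an alternative to a fixed wait. The sketch below is not how the smoke test itself waits (it uses the fixed delay listed above); `wait_until` and its parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    probe: Callable[[], Optional[T]],
    timeout_s: float = 20.0,
    interval_s: float = 1.0,
) -> T:
    """Call probe() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        time.sleep(interval_s)
```

Here `probe` would be a function that runs the semantic search and returns the list of results, so the wait ends as soon as indexing has caught up.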

GraphQL API

Semantic Search Query

query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
        ... on Document {
          info {
            title
            contents {
              text
            }
          }
        }
      }
    }
  }
}

Variables:

{
  "input": {
    "query": "how to request data access",
    "types": ["DOCUMENT"],
    "start": 0,
    "count": 10
  }
}
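The query and variables above can be sent from Python with the standard library alone. This is a minimal sketch: the helper names are hypothetical, the query is trimmed to `urn` and `type`, and the endpoint URL and bearer-token header are assumptions to adjust for your deployment.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

SEMANTIC_SEARCH_QUERY = """
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_payload(
    query: str, types: List[str], start: int = 0, count: int = 10
) -> Dict[str, Any]:
    """Build the JSON body for a GraphQL POST request."""
    variables = {
        "input": {"query": query, "types": types, "start": start, "count": count}
    }
    return {"query": SEMANTIC_SEARCH_QUERY, "variables": variables}


def semantic_search(
    endpoint: str,
    payload: Dict[str, Any],
    token: Optional[str] = None,
    timeout_s: float = 30.0,
) -> Dict[str, Any]:
    """POST the payload to a GraphQL endpoint and return the decoded response."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the server accepts a bearer token in this header.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

For example, `semantic_search("http://localhost:8080/api/graphql", build_payload("how to request data access", ["DOCUMENT"]))` would run the search against a local instance, assuming the GraphQL endpoint is exposed at that address.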

Documentation Index

File              Description
README.md         This documentation - overview and quick start
ARCHITECTURE.md   Detailed architecture and design decisions
CONFIGURATION.md  Configuration options and embedding models

Testing

For a working example of semantic search:

# Run the smoke test
cd smoke-test
ENABLE_SEMANTIC_SEARCH_TESTS=true pytest tests/semantic/test_semantic_search.py -v

Further Reading