Semantic Search Architecture
This document provides a detailed explanation of DataHub's semantic search architecture, design decisions, and implementation details.
Design Philosophy
Why Semantic Search?
Traditional keyword search has limitations:
- Vocabulary Mismatch: Users may use different terms than those in documents
- Synonym Blindness: "access request" won't match "permission request"
- Context Ignorance: Keywords lack understanding of meaning
Semantic search addresses these by understanding the meaning of text through vector embeddings—numerical representations that capture semantic similarity.
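As a toy illustration (the vectors below are made up, not output from a real model), semantically similar phrases map to nearby vectors, and closeness is measured with a metric such as cosine similarity:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors (real models produce e.g. 1024 dimensions).
access_request = [0.9, 0.1, 0.3, 0.0]
permission_request = [0.85, 0.15, 0.35, 0.05]   # different words, similar meaning
quarterly_revenue = [0.0, 0.9, 0.0, 0.8]        # unrelated topic

print(cosine_similarity(access_request, permission_request))  # ~0.99, very similar
print(cosine_similarity(access_request, quarterly_revenue))   # ~0.08, unrelated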
Core Principles
- Non-invasive: Semantic search is additive; it doesn't replace keyword search
- Configurable: Organizations choose which entities and models to use
- Extensible: New embedding models can be added without architectural changes
- Async Processing: Embedding generation happens asynchronously so it does not block ingestion
Index Architecture
Dual-Index Strategy
For each entity type enabled for semantic search, two indices exist:
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ documentindex_v2 │ │ documentindex_v2_semantic │
├─────────────────────────────────┤ ├─────────────────────────────────┤
│ Standard OpenSearch index │ │ OpenSearch index with k-NN │
│ │ │ │
│ Fields: │ │ Fields: │
│ - urn │ │ - urn │
│ - title (text) │ │ - title (text) │
│ - text (text) │ │ - text (text) │
│ - browsePaths │ │ - browsePaths │
│ - tags │ │ - tags │
│ - ... │ │ - ... │
│ │ │ │
│ │ │ + embeddings (nested object): │
│ │ │ - cohere_embed_v3: │
│ │ │ - model_version │
│ │ │ - generated_at │
│ │ │ - chunks[] (nested): │
│ │ │ - position │
│ │ │ - text │
│ │ │ - vector (knn_vector) │
└─────────────────────────────────┘ └─────────────────────────────────┘
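To make the difference concrete, the sketch below shows the same document written to both indices with a raw OpenSearch client. In practice GMS performs the dual write; the index names, client setup, and placeholder vector are illustrative only:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

doc = {
    "urn": "urn:li:document:example-doc",
    "title": "Data Access Guide",
    "text": "How to request access to datasets...",
}

# The semantic index holds the same fields plus the nested embeddings object.
semantic_doc = {
    **doc,
    "embeddings": {
        "cohere_embed_v3": {
            "chunks": [{"position": 0, "vector": [0.01] * 1024}]  # placeholder vector
        }
    },
}

client.index(index="documentindex_v2", id=doc["urn"], body=doc)
client.index(index="documentindex_v2_semantic", id=doc["urn"], body=semantic_doc)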
Why Separate Indices? (Transitional Architecture)
The dual-index approach is a transitional architecture. The long-term plan is to:
- Phase 1 (Current): Run both indices in parallel during transition
- Phase 2: Migrate all search traffic to semantic indices
- Phase 3: Retire v2 indices entirely
Benefits of the transitional approach:
- Zero Downtime Migration: Users continue using keyword search while semantic capabilities are built
- Gradual Validation: Semantic search quality can be validated before full rollout
- Rollback Safety: If issues arise, keyword search remains available
- Incremental Embedding Generation: Embeddings can be backfilled without blocking operations
Future State:
Once the transition is complete, the _semantic indices will become the primary (and only) search indices. They will support both:
- Keyword search: Using standard OpenSearch text matching on the same index
- Semantic search: Using k-NN vector similarity
This unified index approach simplifies operations and reduces storage overhead.
Embeddings Schema
The semantic index stores embeddings in a nested structure:
{
"urn": "urn:li:document:example-doc",
"title": "Data Access Guide",
"text": "How to request access to datasets...",
"embeddings": {
"cohere_embed_v3": {
"model_version": "bedrock/cohere.embed-english-v3",
"generated_at": "2024-01-15T10:30:00Z",
"chunking_strategy": "sentence_boundary_400t",
"total_chunks": 3,
"total_tokens": 850,
"chunks": [
{
"position": 0,
"text": "How to request access to datasets...",
"character_offset": 0,
"character_length": 450,
"token_count": 95,
"vector": [0.023, -0.041, 0.087, ...] // 1024 dimensions
},
{
"position": 1,
"text": "For sensitive data, additional approval...",
"character_offset": 450,
"character_length": 380,
"token_count": 82,
"vector": [0.019, -0.055, 0.091, ...]
}
]
}
}
}
Multi-Model Support
The embeddings structure supports multiple embedding models:
{
"embeddings": {
"cohere_embed_v3": { ... },
"openai_text_embedding_3": { ... },
"custom_model": { ... }
}
}
This allows:
- A/B testing different models
- Gradual migration between models
- Model-specific optimizations
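Because each model lives under its own key, a query targets one model's vectors by building the nested field path from the model name. A small illustrative helper (the model keys match the example above):

def chunk_vector_path(model_key: str) -> str:
    """Nested field path holding the chunk vectors for a given embedding model."""
    return f"embeddings.{model_key}.chunks"

# Each model's vectors are addressed independently, so two models can coexist
# on the same document while a query picks just one of them.
print(chunk_vector_path("cohere_embed_v3"))          # embeddings.cohere_embed_v3.chunks
print(chunk_vector_path("openai_text_embedding_3"))  # embeddings.openai_text_embedding_3.chunks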
Data Flow
Ingestion Flow
The ingestion connector generates document embeddings and sends them to GMS along with the document content:
┌─────────────────────────────────────────────────────────────────────────┐
│ Ingestion Flow │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ Source │ 1. Extract documents │
│ │ System │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Ingestion │ 2. Generate embeddings for document content │
│ │ Connector │ (using connector's embedding provider) │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ 3. Send document + embeddings to GMS │
│ │ GMS │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ OpenSearch │ │
│ │ ┌─────────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ entityindex_v2 │ │ entityindex_v2_semantic │ │ │
│ │ │ (keyword search) │ │ (keyword + vector search) │ │ │
│ │ │ │ │ │ │ │
│ │ │ - urn │ │ - urn │ │ │
│ │ │ - title │ │ - title │ │ │
│ │ │ - text │ │ - text │ │ │
│ │ │ - ... │ │ - embeddings.model.chunks[]. │ │ │
│ │ │ │ │ vector │ │ │
│ │ └─────────────────────┘ └─────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Embedding Generation
Document Embeddings are generated by the ingestion connector at ingestion time and sent to GMS via MCP (Metadata Change Proposal). This ensures:
- Consistency: Every ingested document has embeddings from the start
- Simplicity: No separate backfill job to manage
- Freshness: Embeddings are always up-to-date with document content
- Audit Trail: Embeddings are tracked in the Metadata Change Log (MCL)
- Privacy Support: Sensitive sources can generate embeddings locally and share only vectors
MCP-Based Embedding Flow
┌─────────────────────────────────────────────────────────────────────┐
│ Ingestion Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Source │───▶│ Ingestion │───▶│ DataHub GMS │ │
│ │ System │ │ Connector │ │ │ │
│ └──────────────┘ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Generate document│ │ Process MCP and │ │
│ │ embeddings │ │ write to semantic │ │
│ │ (in connector) │ │ search index │ │
│ └────────┬─────────┘ └──────────────────┘ │
│ │ ▲ │
│ │ MCP with │ │
│ └─────SemanticContent─┘ │
│ aspect │
│ │
└─────────────────────────────────────────────────────────────────────┘
SemanticContent Aspect
Embeddings are stored as a proper DataHub aspect (SemanticContent), defined in PDL schema:
{
"entityType": "document",
"entityUrn": "urn:li:document:my-doc",
"aspectName": "semanticContent",
"aspect": {
"embeddings": {
"cohere_embed_v3": {
"modelVersion": "bedrock/cohere.embed-english-v3",
"generatedAt": 1702234567890,
"totalChunks": 2,
"chunks": [
{ "position": 0, "vector": [...], "text": "..." },
{ "position": 1, "vector": [...], "text": "..." }
]
}
}
}
}
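For illustration, a connector might assemble this payload roughly as follows. This is a minimal sketch assuming AWS Bedrock's Cohere Embed English v3 model; the chunk texts, region, and surrounding connector code are placeholders, and the MCP emission step itself is omitted:

import json
import time

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_texts(texts, input_type="search_document"):
    """Embed a batch of texts with Bedrock's Cohere Embed English v3 model."""
    response = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=json.dumps({"texts": texts, "input_type": input_type}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embeddings"]

# Chunk texts would normally come from the chunking step described below.
chunks = [
    "How to request access to datasets...",
    "For sensitive data, additional approval...",
]
vectors = embed_texts(chunks)

# Payload matching the semanticContent aspect shown above; the "text" field
# can be omitted for privacy-sensitive sources.
semantic_content = {
    "embeddings": {
        "cohere_embed_v3": {
            "modelVersion": "bedrock/cohere.embed-english-v3",
            "generatedAt": int(time.time() * 1000),
            "totalChunks": len(chunks),
            "chunks": [
                {"position": i, "vector": v, "text": t}
                for i, (t, v) in enumerate(zip(chunks, vectors))
            ],
        }
    }
}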
Privacy-Sensitive Use Cases
The text field in each chunk is optional. This supports scenarios where:
- Source data contains sensitive information (PII, trade secrets)
- Customers want semantic search without storing source text in DataHub
- Embeddings are generated locally at the data source
Note: Embeddings are one-way—original text cannot be reconstructed from vectors.
Query Embeddings are generated by GMS at search time using the configured embedding provider (e.g., AWS Bedrock):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ GraphQL │───▶│ GMS │───▶│ Embedding │───▶│ OpenSearch │
│ Client │ │ │ │ Provider │ │ k-NN Query │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│
▼
Query embedding
generated here
(for search only)
Key Point: The GMS embedding provider is used only for query embedding, not for document embedding. The ingestion connector is responsible for document embeddings.
Query Flow
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ GraphQL │───▶│ GMS │───▶│ Embedding │───▶│ OpenSearch │
│ Client │ │ │ │ Provider │ │ k-NN Query │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│
semanticSearchAcrossEntities( │
query: "how to access data" │
) │
▼
┌─────────────────────────────┐
│ Nested k-NN Query: │
│ │
│ { │
│ "nested": { │
│ "path": "embeddings │
│ .cohere_embed_v3 │
│ .chunks", │
│ "query": { │
│ "knn": { │
│ "...chunks.vector":│
│ { "vector": [...], │
│ "k": 10 } │
│ } │
│ } │
│ } │
│ } │
└─────────────────────────────┘
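In query DSL terms, the final step amounts to a nested k-NN query like the sketch below. It is shown in Python with opensearch-py purely for readability (GMS issues the equivalent query from Java), and the host, index name, and placeholder vector are illustrative:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Placeholder query vector; in practice this is produced at search time by
# the GMS embedding provider for the user's query text.
query_vector = [0.01] * 1024

results = client.search(
    index="documentindex_v2_semantic",
    body={
        "size": 10,
        "query": {
            "nested": {
                "path": "embeddings.cohere_embed_v3.chunks",
                "query": {
                    "knn": {
                        "embeddings.cohere_embed_v3.chunks.vector": {
                            "vector": query_vector,
                            "k": 10,
                        }
                    }
                },
            }
        },
    },
)

for hit in results["hits"]["hits"]:
    print(hit["_source"]["urn"], hit["_score"])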
Chunking Strategy
Why Chunk Documents?
Embedding models have token limits (512 tokens for Cohere's embed-english-v3.0). Long documents must be split into chunks:
- Token Limits: Models can't process unlimited text
- Precision: Smaller chunks allow more precise matching
- Relevance: A document may have one highly relevant section
Chunking Algorithm
import re

def chunk_text(text, max_tokens=400, chars_per_token=4):
    """Chunk text at sentence boundaries, respecting an approximate token limit (simplified sketch)."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):  # split text into sentences
        if current and len(current) + len(sentence) > max_chars:  # chunk full: save it, start a new one
            chunks.append(current.strip())
            current = ""
        while len(sentence) > max_chars:  # oversized sentence: fall back to character splitting
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current += sentence + " "  # accumulate sentences until approaching the limit
    return chunks + ([current.strip()] if current.strip() else [])
Parameters:
- max_tokens: Target chunk size in tokens (default: 400)
- chars_per_token: Estimation ratio (default: 4 characters ≈ 1 token)
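For example, with the defaults above (roughly 400 tokens ≈ 1,600 characters per chunk) and the simplified implementation sketched earlier:

text = "How to request access to datasets. For sensitive data, additional approval is required. " * 40
chunks = chunk_text(text)
print(len(chunks))                  # a few chunks for ~3,600 characters of input
print(max(len(c) for c in chunks))  # no chunk exceeds 400 * 4 = 1,600 characters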
Chunk Metadata
Each chunk stores metadata for debugging and analysis:
{
"position": 0, // Order in document
"text": "...", // Chunk content
"character_offset": 0, // Start position in original
"character_length": 450, // Length in characters
"token_count": 95, // Estimated tokens
"vector": [...] // Embedding vector
}
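A sketch of how this metadata could be assembled from the chunker's output and the corresponding vectors. The 4-characters-per-token estimate matches the chunking parameters above; character offsets are approximate because whitespace trimmed between chunks is not counted:

def build_chunk_metadata(chunks, vectors, chars_per_token=4):
    """Attach position, offset, and estimated token-count metadata to each chunk."""
    records, offset = [], 0
    for position, (text, vector) in enumerate(zip(chunks, vectors)):
        records.append({
            "position": position,              # order in document
            "text": text,                      # chunk content (optional for sensitive sources)
            "character_offset": offset,        # approximate start position in the original
            "character_length": len(text),
            "token_count": len(text) // chars_per_token,  # estimate, not an exact count
            "vector": vector,
        })
        offset += len(text)
    return records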
k-NN Search Configuration
OpenSearch k-NN Settings
The semantic index uses OpenSearch's k-NN plugin with FAISS engine:
{
"settings": {
"index.knn": true
},
"mappings": {
"properties": {
"embeddings": {
"type": "nested",
"properties": {
"cohere_embed_v3": {
"type": "nested",
"properties": {
"chunks": {
"type": "nested",
"properties": {
"vector": {
"type": "knn_vector",
"dimension": 1024,
"method": {
"name": "hnsw",
"engine": "faiss",
"space_type": "cosinesimil",
"parameters": {
"ef_construction": 128,
"m": 16
}
}
}
}
}
}
}
}
}
}
}
}
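Applying this mapping when the semantic index is created could look like the following sketch; the client setup and index name are illustrative, and the settings/mappings JSON above is assumed to be saved as semantic_index_mapping.json:

import json

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Load the settings/mappings document shown above and create the k-NN index.
with open("semantic_index_mapping.json") as f:
    knn_mapping = json.load(f)

client.indices.create(index="documentindex_v2_semantic", body=knn_mapping)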
HNSW Parameters
| Parameter | Value | Description |
|---|---|---|
| ef_construction | 128 | Build-time accuracy (higher = more accurate, slower build) |
| m | 16 | Number of connections per node (higher = more accurate, more memory) |
| space_type | cosinesimil | Similarity metric (cosine similarity) |
Security Considerations
Data Privacy
- Embedding Storage: Vectors are stored alongside documents; same access controls apply
- External API Calls: Embedding providers receive document text; ensure compliance
- Credential Management: API keys/AWS credentials must be secured
Access Control
Semantic search respects DataHub's existing access controls:
- Users only see results they have permission to view
- Entity-level permissions are enforced before returning results
Performance Considerations
Indexing Performance
- Dual-write Impact: ~10-20% increase in write latency
- Embedding Generation: Async; doesn't block ingestion
- Batch Processing: Embeddings generated in batches for efficiency
Query Performance
- k-NN Overhead: ~50-200ms per query (depends on index size)
- Embedding Generation: ~100-300ms for query embedding
- Total Latency: Typically 200-500ms end-to-end
Scaling Recommendations
| Index Size | Recommendation |
|---|---|
| < 100K docs | Single node sufficient |
| 100K - 1M docs | Consider dedicated k-NN nodes |
| > 1M docs | Sharding and replicas recommended |
Future Enhancements
- Hybrid Search: Combine keyword and semantic scores for improved relevance
- Model Fine-tuning: Domain-specific embedding models for better accuracy