# Semantic Search Configuration Guide
This guide covers all configuration options for DataHub's semantic search, including embedding models, index settings, and environment variables.
## Enabling Semantic Search

### Environment Variables

Set these in your deployment configuration (e.g., `docker/profiles/empty2.env`):

```bash
# Enable semantic search feature
ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true

# Entity types to enable (comma-separated)
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document

# Vector dimensions (must match embedding model)
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=1024
```
### Application Configuration

In `metadata-service/configuration/src/main/resources/application.yaml`:

```yaml
elasticsearch:
  search:
    semanticSearch:
      enabled: ${ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED:false}
      enabledEntities: ${ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES:document}
      models:
        cohere_embed_v3:
          vectorDimension: ${ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION:1024}
          knnEngine: faiss
          spaceType: cosinesimil
          efConstruction: 128
          m: 16
```
## Embedding Models

### Understanding Embedding Providers
There are two separate embedding contexts in DataHub's semantic search:
| Context | When | Who Generates | Configuration |
|---|---|---|---|
| Document Embeddings | At ingestion time | Ingestion Connector | Configured in connector |
| Query Embeddings | At search time | GMS | Configured in GMS (below) |
**Important:** The GMS embedding provider (configured below) is used only for query embedding. Document embeddings are generated by the ingestion connector using its own embedding configuration. Both must use the same embedding model for semantic search to work correctly.
### Ingestion via MCP (Metadata Change Proposal)

Document embeddings are sent to DataHub via MCP using the `SemanticContent` aspect. This is the standard DataHub pattern for ingesting metadata.

**MCP Payload Structure:**

```json
{
  "entityType": "document",
  "entityUrn": "urn:li:document:my-doc-123",
  "aspectName": "semanticContent",
  "aspect": {
    "value": "{...JSON encoded SemanticContent...}",
    "contentType": "application/json"
  }
}
```
**SemanticContent Aspect:**

```json
{
  "embeddings": {
    "cohere_embed_v3": {
      "modelVersion": "bedrock/cohere.embed-english-v3",
      "generatedAt": 1702234567890,
      "chunkingStrategy": "sentence_boundary_400t",
      "totalChunks": 2,
      "totalTokens": 450,
      "chunks": [
        {
          "position": 0,
          "vector": [0.123, -0.456, ...],
          "characterOffset": 0,
          "characterLength": 512,
          "tokenCount": 230,
          "text": "First chunk of text..."
        }
      ]
    }
  }
}
```
**Privacy Option:** The `text` field in each chunk is optional. For sensitive data sources, you can omit the source text and send only the embedding vectors.
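To make the wire format concrete, here is a minimal sketch of emitting this payload with the `datahub-client` Java emitter. The server URL and the use of the low-level `MetadataChangeProposal` overload of `emit` are our assumptions; production connectors (often Python-based) would build the full aspect JSON from real chunks and vectors.

```java
import com.linkedin.common.urn.Urn;
import com.linkedin.data.ByteString;
import com.linkedin.events.metadata.ChangeType;
import com.linkedin.mxe.GenericAspect;
import com.linkedin.mxe.MetadataChangeProposal;
import datahub.client.rest.RestEmitter;
import java.nio.charset.StandardCharsets;

public class SemanticContentEmitter {
  public static void main(String[] args) throws Exception {
    // JSON-encoded SemanticContent; fill in the full payload shown above.
    String aspectJson = "{\"embeddings\": {}}";

    MetadataChangeProposal mcp = new MetadataChangeProposal()
        .setEntityType("document")
        .setEntityUrn(Urn.createFromString("urn:li:document:my-doc-123"))
        .setChangeType(ChangeType.UPSERT)
        .setAspectName("semanticContent")
        .setAspect(new GenericAspect()
            .setContentType("application/json")
            .setValue(ByteString.copyString(aspectJson, StandardCharsets.UTF_8)));

    // Emit to GMS; adjust the server URL for your deployment.
    try (RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"))) {
      emitter.emit(mcp, null).get(); // block until GMS acknowledges the write
    }
  }
}
```

Using `GenericAspect` with a raw JSON string keeps the sketch independent of generated aspect classes; GMS validates the payload against the aspect schema on ingestion.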
### GMS Query Embedding Provider (Java)

GMS uses an `EmbeddingProvider` implementation to generate query embeddings at search time.

**Interface:** `com.linkedin.metadata.search.embedding.EmbeddingProvider`

```java
public interface EmbeddingProvider {
  /**
   * Returns an embedding vector for the given text.
   *
   * @param text The text to embed
   * @param model The model identifier (nullable, uses default if null)
   * @return The embedding vector
   */
  @Nonnull
  float[] embed(@Nonnull String text, @Nullable String model);
}
```
**Built-in Implementations:**

- `AwsBedrockEmbeddingProvider` - Uses AWS Bedrock (default)
- `NoOpEmbeddingProvider` - Throws an exception if called (used when semantic search is disabled)
The following providers can be configured:
#### AWS Bedrock (Default)
AWS Bedrock provides managed access to embedding models:
```bash
# Environment
EMBED_PROVIDER=bedrock
BEDROCK_MODEL=cohere.embed-english-v3
AWS_REGION=us-west-2

# For local development
AWS_PROFILE=your-profile

# For production (IAM role or explicit credentials)
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
```
**Available Bedrock Models:**

| Model ID | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| `cohere.embed-english-v3` | 1024 | 512 | Best for English |
| `cohere.embed-multilingual-v3` | 1024 | 512 | Multi-language support |
| `amazon.titan-embed-text-v1` | 1536 | 8192 | Amazon's model |
| `amazon.titan-embed-text-v2:0` | 256/512/1024 | 8192 | Configurable dimensions |
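For context, the Bedrock call behind a provider like `AwsBedrockEmbeddingProvider` looks roughly like this. It is a sketch using the AWS SDK for Java v2, not DataHub's actual implementation; the request and response shapes follow Bedrock's documented format for the Cohere embed models:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeClient;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelRequest;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelResponse;

public class BedrockEmbeddingExample {
  public static void main(String[] args) {
    try (BedrockRuntimeClient client = BedrockRuntimeClient.builder()
        .region(Region.US_WEST_2) // must match AWS_REGION
        .build()) {
      // Cohere-on-Bedrock request; "search_query" vs "search_document"
      // distinguishes query-time from ingestion-time embeddings.
      String body = "{\"texts\": [\"your natural language query\"], \"input_type\": \"search_query\"}";

      InvokeModelRequest request = InvokeModelRequest.builder()
          .modelId("cohere.embed-english-v3")
          .contentType("application/json")
          .accept("application/json")
          .body(SdkBytes.fromUtf8String(body))
          .build();

      InvokeModelResponse response = client.invokeModel(request);
      // Response JSON: {"embeddings": [[0.012, ...]], ...}; one 1024-dim
      // vector per input text.
      System.out.println(response.body().asUtf8String());
    }
  }
}
```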
#### Other Providers (Cohere Direct, OpenAI, etc.)

Currently, only AWS Bedrock is implemented as a built-in provider. To use other providers (Cohere direct API, OpenAI, etc.), you need to implement a custom `EmbeddingProvider`. See the "Custom/Self-Hosted Providers" section below.
Potential future built-in providers:
- Cohere Direct API
- OpenAI
- Azure OpenAI
- Google Vertex AI
#### Custom/Self-Hosted Providers

To add a custom embedding provider, implement the `EmbeddingProvider` interface and register it in the factory.

1. **Implement the interface** (`metadata-io/src/main/java/com/linkedin/metadata/search/embedding/`):

```java
package com.linkedin.metadata.search.embedding;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;

public class CustomEmbeddingProvider implements EmbeddingProvider {
  private final String endpoint;
  private final int dimensions;

  public CustomEmbeddingProvider(String endpoint, int dimensions) {
    this.endpoint = endpoint;
    this.dimensions = dimensions;
  }

  @Override
  @Nonnull
  public float[] embed(@Nonnull String text, @Nullable String model) {
    // Call your embedding service; the returned array must have exactly
    // `dimensions` elements. See the sketch of callEmbeddingService below.
    return callEmbeddingService(endpoint, text, model);
  }
}
```
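`callEmbeddingService` is left to your implementation. Here is a minimal sketch of it as a private method on `CustomEmbeddingProvider`, using `java.net.http` and Jackson, and assuming a service contract of `{"text": ..., "model": ...}` in and `{"embedding": [...]}` out; adjust both shapes to your service's actual API:

```java
// Additional imports for the helper below:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

private static final ObjectMapper MAPPER = new ObjectMapper();
private static final HttpClient HTTP = HttpClient.newHttpClient();

private float[] callEmbeddingService(String endpoint, String text, String model) {
  try {
    // Request shape is an assumption about your service's API; adjust as needed.
    String body = MAPPER.writeValueAsString(
        MAPPER.createObjectNode().put("text", text).put("model", model));
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(endpoint))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
    // Response shape assumed to be {"embedding": [0.1, 0.2, ...]}.
    JsonNode vector = MAPPER.readTree(response.body()).get("embedding");
    float[] result = new float[vector.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = (float) vector.get(i).asDouble();
    }
    return result;
  } catch (Exception e) {
    throw new IllegalStateException("Embedding service call failed: " + endpoint, e);
  }
}
```

A production implementation would also validate that the returned vector length matches `dimensions` and add timeouts and retries.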
2. **Register in the factory** (`metadata-service/factories/.../EmbeddingProviderFactory.java`):

```java
@Bean(name = "embeddingProvider")
@Nonnull
protected EmbeddingProvider getInstance() {
  // ... existing code ...
  String providerType = config.getType();
  if ("aws-bedrock".equalsIgnoreCase(providerType)) {
    return new AwsBedrockEmbeddingProvider(...);
  } else if ("custom".equalsIgnoreCase(providerType)) {
    return new CustomEmbeddingProvider(
        config.getCustomEndpoint(),
        config.getCustomDimensions());
  } else {
    throw new IllegalStateException("Unsupported provider: " + providerType);
  }
}
```
3. **Add configuration** (`EmbeddingProviderConfiguration.java`):

```java
@Data
public class EmbeddingProviderConfiguration {
  private String type = "aws-bedrock";
  private String awsRegion = "us-west-2";
  private String modelId = "cohere.embed-english-v3";

  // Add custom provider fields
  private String customEndpoint;
  private int customDimensions = 1024;
}
```
4. **Configure in `application.yaml`:**

```yaml
elasticsearch:
  search:
    semanticSearch:
      embeddingProvider:
        type: custom
        customEndpoint: http://your-embedding-service:8080/embed
        customDimensions: 1024
```
## Index Configuration

### k-NN Settings

The semantic index uses OpenSearch's k-NN plugin; the settings below become the HNSW method parameters on the index's `knn_vector` fields. Key parameters:
```yaml
models:
  cohere_embed_v3:
    # Vector size (must match model output)
    vectorDimension: 1024

    # k-NN engine: faiss (recommended) or nmslib
    knnEngine: faiss

    # Similarity metric
    # - cosinesimil: Cosine similarity (recommended for text)
    # - l2: Euclidean distance
    # - innerproduct: Dot product
    spaceType: cosinesimil

    # HNSW graph parameters
    efConstruction: 128  # Build-time accuracy (32-512)
    m: 16                # Connections per node (4-64)
```
### Parameter Tuning

| Parameter | Low | Medium | High | Trade-off |
|---|---|---|---|---|
| `efConstruction` | 32 | 128 | 512 | Speed vs. accuracy at build time |
| `m` | 4 | 16 | 64 | Memory vs. accuracy |
**Recommendations:**

- Development: `efConstruction: 64`, `m: 8`
- Production: `efConstruction: 128`, `m: 16`
- High Accuracy: `efConstruction: 256`, `m: 32`
### Multiple Models
Configure multiple embedding models for A/B testing or migration:
```yaml
models:
  cohere_embed_v3:
    vectorDimension: 1024
    knnEngine: faiss
    spaceType: cosinesimil
    efConstruction: 128
    m: 16
  openai_text_embedding_3_small:
    vectorDimension: 1536
    knnEngine: faiss
    spaceType: cosinesimil
    efConstruction: 128
    m: 16
```
## Query Configuration

### GMS Query Settings

GMS needs credentials to generate query embeddings at search time:

```yaml
# Mount AWS credentials into the GMS container (docker-compose)
volumes:
  - ${HOME}/.aws:/home/datahub/.aws:ro
```

```bash
# Set the profile in the GMS environment
AWS_PROFILE=your-profile
```
### Query Parameters

In GraphQL queries:

```graphql
query SemanticSearch($input: SearchAcrossEntitiesInput!) {
  semanticSearchAcrossEntities(input: $input) {
    # ...
  }
}
```
**Variables:**

```json
{
  "input": {
    "query": "your natural language query",
    "types": ["DOCUMENT"],  // Entity types to search
    "start": 0,             // Pagination start
    "count": 10             // Results per page
  }
}
```
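A sketch of issuing this query over HTTP from Java. The `/api/graphql` path is DataHub's standard GraphQL endpoint; the field selection (`total`) and the bearer token are illustrative assumptions, so adapt them to your schema and auth setup:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SemanticSearchQuery {
  public static void main(String[] args) throws Exception {
    // GraphQL request body: the query from above plus its variables.
    String payload = """
        {
          "query": "query SemanticSearch($input: SearchAcrossEntitiesInput!) { semanticSearchAcrossEntities(input: $input) { total } }",
          "variables": {
            "input": {"query": "your natural language query", "types": ["DOCUMENT"], "start": 0, "count": 10}
          }
        }""";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8080/api/graphql"))
        .header("Content-Type", "application/json")
        // Required only if metadata service authentication is enabled.
        .header("Authorization", "Bearer <personal-access-token>")
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```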
## Chunking Configuration

Chunking is handled by the ingestion connector when generating document embeddings. The typical strategy:

- **Chunk size:** ~400 tokens per chunk
- **Boundary:** Split at sentence boundaries when possible
- **Overlap:** Optional overlap between chunks for context continuity
The chunking parameters are configured in the ingestion connector, not in GMS; GMS receives pre-chunked embeddings from the connector. The sketch below illustrates the strategy.
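An illustrative sketch (not DataHub code) of the sentence-boundary strategy named above (`sentence_boundary_400t`): pack whole sentences into chunks of at most ~400 tokens, approximating token count as characters divided by four. A real connector would use the embedding model's own tokenizer and may add overlap between chunks.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceChunker {
  private static final int MAX_TOKENS_PER_CHUNK = 400;

  /** Splits a document into chunks of whole sentences, each at most ~400 tokens. */
  public static List<String> chunk(String document) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentences.setText(document);

    List<String> chunks = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int start = sentences.first();
    for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
      String sentence = document.substring(start, end);
      // Crude token estimate (~4 characters per token for English text).
      int projectedTokens = (current.length() + sentence.length()) / 4;
      if (projectedTokens > MAX_TOKENS_PER_CHUNK && current.length() > 0) {
        chunks.add(current.toString());
        current.setLength(0);
      }
      current.append(sentence);
    }
    if (current.length() > 0) {
      chunks.add(current.toString());
    }
    return chunks;
  }
}
```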
## Monitoring

### Useful Queries

**Check embedding coverage:**
curl "http://localhost:9200/documentindex_v2_semantic/_search" \
-H "Content-Type: application/json" \
-d '{
"size": 0,
"aggs": {
"with_embeddings": {
"filter": { "exists": { "field": "embeddings.cohere_embed_v3" } }
},
"without_embeddings": {
"filter": { "bool": { "must_not": { "exists": { "field": "embeddings.cohere_embed_v3" } } } }
}
}
}'
**Check index health:**

```bash
curl "http://localhost:9200/_cat/indices/*semantic*?v"
```
## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|---|---|---|
| "Unable to locate credentials" | AWS creds not available | Mount `${HOME}/.aws` to `/home/datahub/.aws` |
| "Profile file contained no credentials" | SSO session expired | Run `aws sso login --profile your-profile` |
| Empty search results | No embeddings in index | Verify the ingestion connector is generating embeddings |
| Wrong results | Model mismatch | Ensure the ingestion connector and GMS use the same embedding model |
### Debug Logging
Enable debug logging in GMS:
```yaml
logging:
  level:
    com.linkedin.metadata.search: DEBUG
```
Check logs for:
```
[DEBUG-DUALWRITE] shouldWriteToSemanticIndex returned: true
Semantic dual-write enabled=true, enabledEntities=[document]
```