Semantic Search Configuration
This guide walks you through configuring semantic search in DataHub using AWS Bedrock for embedding generation.
Overview
DataHub's semantic search uses vector embeddings to find semantically similar entities. In OSS, embeddings are generated using AWS Bedrock with Cohere Embed models. This provides:
- Natural language search: Find datasets using conversational queries like "customer churn analysis"
- Semantic understanding: Match concepts even when exact keywords differ
- Cross-entity search: Search across datasets, dashboards, and other entities simultaneously
Prerequisites
1. OpenSearch Requirements
- OpenSearch 2.17.0 or higher with k-NN plugin enabled
- Alternative: Elasticsearch with k-NN plugin (not officially tested)
Verify k-NN plugin is enabled:
```shell
curl -X GET "localhost:9200/_cat/plugins?v&s=component&h=name,component,version"
```
You should see opensearch-knn in the output.
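If you'd rather script this check, a minimal Python sketch (assuming the cluster is reachable at `http://localhost:9200` without authentication; adjust the host and add auth for your setup):

```python
import urllib.request

def has_knn_plugin(cat_plugins_text: str) -> bool:
    """True if any row of the _cat/plugins response names opensearch-knn."""
    return any("opensearch-knn" in line.split() for line in cat_plugins_text.splitlines())

def fetch_plugins(host: str = "http://localhost:9200") -> str:
    # Adjust host and add auth headers as needed for your cluster.
    with urllib.request.urlopen(f"{host}/_cat/plugins?h=name,component,version") as resp:
        return resp.read().decode()

# Example (requires a running cluster):
# print(has_knn_plugin(fetch_plugins()))
```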
2. AWS Bedrock Requirements
AWS Account Setup
- AWS Account with Bedrock access
- Supported AWS Region: Bedrock with Cohere Embed v3 is available in:
  - us-west-2 (Oregon) - Recommended
  - us-east-1 (N. Virginia)
  - Other regions - check AWS Bedrock documentation
Enable Model Access
- Go to AWS Console → Amazon Bedrock → Model access
- Request access to Cohere Embed English v3 (`cohere.embed-english-v3`)
- Wait for approval (usually instant for Cohere models)
IAM Permissions
Your AWS credentials (IAM user or role) need:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/cohere.embed-english-v3",
        "arn:aws:bedrock:*::foundation-model/cohere.embed-multilingual-v3"
      ]
    }
  ]
}
```
For broader access (all Bedrock models):
```json
{
  "Effect": "Allow",
  "Action": "bedrock:InvokeModel",
  "Resource": "arn:aws:bedrock:*::foundation-model/*"
}
```
Important: Ensure AWS roles are configured for both the GMS container (for query embeddings at search time) and the ingestion container (for document embeddings during ingestion). In typical Kubernetes deployments using the DataHub Helm chart, you'll need to configure IAM roles for service accounts (IRSA) for both the datahub-gms and datahub-ingestion pods.
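Before wiring up DataHub, you can confirm credentials and model access by invoking the model directly. The sketch below builds the Cohere Embed request payload (the `texts`/`input_type` body shape follows Bedrock's Cohere Embed format, but verify against your SDK version); the boto3 call is shown commented out since it requires live AWS credentials:

```python
import json

def build_embed_request(texts, input_type="search_document"):
    """Build the (modelId, body) pair for a Cohere Embed v3 InvokeModel call."""
    body = json.dumps({"texts": texts, "input_type": input_type})
    return "cohere.embed-english-v3", body

# Example (requires boto3 and credentials with bedrock:InvokeModel):
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-west-2")
# model_id, body = build_embed_request(["customer churn analysis"])
# resp = client.invoke_model(modelId=model_id, body=body)
# embeddings = json.loads(resp["body"].read())["embeddings"]
```

If the commented call raises `AccessDeniedException`, revisit the IAM policy above; a `ResourceNotFoundException` usually means model access has not been enabled in that region.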
Configuration
Step 1: Configure DataHub
Edit your application.yaml or set environment variables:
Option A: Environment Variables (Recommended for Production)
```shell
# Enable semantic search
export ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
export ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document # Comma-separated: dataset,dashboard,chart

# Configure embedding provider
export EMBEDDING_PROVIDER_TYPE=aws-bedrock
export EMBEDDING_PROVIDER_AWS_REGION=us-west-2
export EMBEDDING_PROVIDER_MODEL_ID=cohere.embed-english-v3
export EMBEDDING_PROVIDER_MAX_CHAR_LENGTH=2048

# Vector index configuration
export ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=1024
export ELASTICSEARCH_SEMANTIC_KNN_ENGINE=faiss
export ELASTICSEARCH_SEMANTIC_SPACE_TYPE=cosinesimil
```
Option B: application.yaml
```yaml
elasticsearch:
  index:
    semanticSearch:
      enabled: true
      enabledEntities: document # Or: dataset,dashboard,chart
      models:
        cohere_embed_v3:
          vectorDimension: 1024
          knnEngine: faiss
          spaceType: cosinesimil
          efConstruction: 128
          m: 16
      embeddingProvider:
        type: aws-bedrock
        awsRegion: us-west-2
        modelId: cohere.embed-english-v3
        maxCharacterLength: 2048
```
Step 2: Configure AWS Credentials
Choose one of these authentication methods:
Option 1: AWS Profile (Development)
Create/edit ~/.aws/credentials:
```ini
[datahub-dev]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```
Then set the profile:
```shell
export AWS_PROFILE=datahub-dev
```
Option 2: Environment Variables (CI/CD)
```shell
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
export AWS_REGION=us-west-2 # Optional, uses config default if not set
```
Option 3: EC2 Instance Role (Production - Recommended)
For production deployments on EC2:
- Create an IAM role with Bedrock permissions (see IAM Permissions above)
- Attach the role to your EC2 instance
- No additional configuration needed - credentials auto-discovered via IMDS
Option 4: ECS Task Role (Container Deployments)
For ECS/Fargate deployments:
- Create an IAM role with Bedrock permissions
- Assign the role to your ECS task definition
- No additional configuration needed - credentials auto-discovered
Step 3: Restart DataHub
```shell
# Docker Compose
docker-compose restart datahub-gms

# Kubernetes
kubectl rollout restart deployment datahub-gms
```
Step 4: Verify Configuration
Check DataHub logs for successful initialization:
```shell
# Look for these log messages
docker-compose logs datahub-gms | grep -i "semantic\|embedding"
```
Expected output:
```
Creating embedding provider with type: aws-bedrock
Configuring AWS Bedrock embedding provider: region=us-west-2, model=cohere.embed-english-v3, maxCharLength=2048
Initialized AwsBedrockEmbeddingProvider with region=us-west-2, model=cohere.embed-english-v3, maxCharLength=2048
```
Generating Embeddings
Embeddings are generated by dedicated ingestion sources that connect to specific systems where your documents live (e.g., Notion, Confluence, SharePoint). Each source is responsible for:
- Extracting document content from the source system
- Chunking the text into manageable segments
- Generating embeddings for each chunk using the configured provider (e.g., AWS Bedrock)
- Emitting the embeddings to DataHub as `SemanticContent` aspects
DataHub Documents Source
DataHub provides a native ingestion source for generating semantic embeddings for DataHub's native Document entities. This source processes documents that are already stored in DataHub (created via GraphQL, Python SDK, or other ingestion sources) and enriches them with embeddings.
Key features:
- Processes Document entities from DataHub
- Supports both batch mode (GraphQL) and event-driven mode (Kafka MCL)
- Incremental processing - only reprocesses documents when content changes
- Stateful ingestion for tracking progress
- Multiple chunking strategies (by_title, basic)
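As a rough illustration of what a character-limit chunker like the `basic` strategy does, here is a self-contained sketch (this is illustrative only, not the source's actual chunking code):

```python
def basic_chunks(text: str, max_characters: int = 500):
    """Illustrative 'basic' chunking: greedily pack paragraphs into
    chunks of at most max_characters; oversized paragraphs are hard-split."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split paragraphs that exceed the limit on their own.
        while len(para) > max_characters:
            chunks.append(para[:max_characters])
            para = para[max_characters:]
        if current and len(current) + len(para) + 1 > max_characters:
            chunks.append(current)
            current = para
        else:
            current = current + "\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

The real `by_title` strategy additionally keeps section headings attached to their body text, which generally produces more coherent embeddings than fixed-size splitting.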
Example Recipe
The DataHub Documents Source uses smart defaults to minimize configuration. Here's the minimal recipe:
```yaml
source:
  type: datahub-documents
  config: {}
sink:
  type: datahub-rest
  config: {}
```
This minimal configuration automatically:
- Connects to DataHub using the `DATAHUB_GMS_URL` and `DATAHUB_GMS_TOKEN` environment variables (defaults to `http://localhost:8080` if not set)
- Enables event-driven mode (processes documents in real-time from Kafka MCL)
- Enables incremental processing (only reprocesses documents when content changes)
- Uses by_title chunking with sensible defaults (500 characters, smart combining)
- Fetches embedding configuration from the server (matches your GMS semantic search config automatically)
Customization Options
If you need to override defaults, you can specify them explicitly:
```yaml
source:
  type: datahub-documents
  config:
    # DataHub connection (optional - defaults to env vars)
    datahub:
      server: "http://datahub-gms:8080"
      token: "${DATAHUB_TOKEN}"
    # Platform filtering (optional - defaults to all documents)
    platform_filter: ["notion", "confluence"]
    # Event mode (optional - enabled by default)
    event_mode:
      enabled: true
      idle_timeout_seconds: 60
    # Incremental processing (optional - enabled by default)
    incremental:
      enabled: true
    # Chunking strategy (optional - has sensible defaults)
    chunking:
      strategy: by_title
      max_characters: 500
    # Embedding (optional - fetches from server by default)
    embedding:
      # Leave empty to auto-fetch from server
      batch_size: 50 # Can override processing options
sink:
  type: datahub-rest
  config: {}
```
Running the source:
```shell
# Set environment variables
export DATAHUB_GMS_URL="http://localhost:8080"
export DATAHUB_GMS_TOKEN="your-token"

# Run ingestion with minimal recipe
datahub ingest -c recipe.yml

# Or inline for one-time use
DATAHUB_GMS_TOKEN=your-token datahub ingest -c recipe.yml
```
For detailed configuration options and advanced features (event-driven mode, platform filtering, chunking strategies), see the DataHub Documents Source documentation.
External Sources
For documents from external systems like Notion or Confluence, use the respective ingestion sources that support semantic search:
- Notion: Notion Source - Ingest pages, databases, and hierarchical content with embeddings
- Confluence: Coming soon
- SharePoint: Coming soon
Each external source handles fetching, chunking, and embedding generation specific to that platform's document format.
Notion Example
The Notion source automatically fetches embedding configuration from your DataHub server, so you only need to specify your Notion credentials:
```yaml
source:
  type: notion
  config:
    api_key: "${NOTION_API_KEY}"
    page_ids:
      - "your-page-id-here"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```
For complete configuration options including custom embedding providers and chunking strategies, see the Notion Source documentation.
Usage
GraphQL API
Single Entity Type Search
```graphql
query {
  semanticSearch(
    input: {
      type: DATASET
      query: "customer churn prediction models"
      start: 0
      count: 10
    }
  ) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
        ... on Dataset {
          name
          description
        }
      }
    }
  }
}
```
Multi-Entity Search
```graphql
query {
  semanticSearchAcrossEntities(
    input: {
      types: [DATASET, DASHBOARD, CHART]
      query: "revenue analysis last quarter"
      start: 0
      count: 10
    }
  ) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
      }
      matchedFields {
        name
        value
      }
    }
  }
}
```
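Until a dedicated SDK method is available, queries like the ones above can be issued over plain HTTP. A minimal Python sketch, assuming GMS exposes the standard `/api/graphql` endpoint and you have a personal access token:

```python
import json
import urllib.request

QUERY = """
query {
  semanticSearch(input: {type: DATASET, query: "customer churn prediction models", start: 0, count: 10}) {
    total
    searchResults { entity { urn type } }
  }
}
"""

def build_request(gms_url: str, token: str, query: str) -> urllib.request.Request:
    """HTTP POST against DataHub's /api/graphql endpoint with a bearer token."""
    return urllib.request.Request(
        f"{gms_url}/api/graphql",
        data=json.dumps({"query": query}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )

# Example (requires a running GMS):
# with urllib.request.urlopen(build_request("http://localhost:8080", "your-token", QUERY)) as resp:
#     print(json.loads(resp.read())["data"]["semanticSearch"]["total"])
```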
Python SDK (Coming Soon)
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://localhost:8080")

# Semantic search
results = emitter.semantic_search(
    query="customer data pipeline",
    types=["dataset"],
    start=0,
    count=10,
)
```
Supported Models
Cohere Embed v3 (Default)
| Model ID | Dimensions | Max Tokens | Languages |
|---|---|---|---|
| cohere.embed-english-v3 | 1024 | 512 | English |
| cohere.embed-multilingual-v3 | 1024 | 512 | 100+ languages |
Amazon Titan Embed
| Model ID | Dimensions | Max Tokens | Languages |
|---|---|---|---|
| amazon.titan-embed-text-v1 | 1536 | 8192 | English |
| amazon.titan-embed-text-v2:0 | 1024 (configurable) | 8192 | English |
Note: If you change models, ensure vectorDimension matches the model's output dimensions.
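This constraint is easy to sanity-check in a script. The dimension map below is transcribed from the tables above (treating Titan v2's default of 1024 as an assumption if you haven't configured its dimension explicitly):

```python
# Known output dimensions for the supported embedding models (from the tables above).
MODEL_DIMENSIONS = {
    "cohere.embed-english-v3": 1024,
    "cohere.embed-multilingual-v3": 1024,
    "amazon.titan-embed-text-v1": 1536,
    "amazon.titan-embed-text-v2:0": 1024,  # configurable; 1024 assumed as default
}

def check_vector_dimension(model_id: str, configured_dimension: int) -> None:
    """Fail fast if the configured index dimension disagrees with the model."""
    expected = MODEL_DIMENSIONS.get(model_id)
    if expected is not None and expected != configured_dimension:
        raise ValueError(
            f"{model_id} emits {expected}-dim vectors, "
            f"but the index is configured for {configured_dimension}"
        )
```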
Troubleshooting
Issue: "Semantic search is disabled or not configured"
Solution: Verify ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true and restart GMS.
Issue: AWS Credentials Error
Unable to load credentials from any provider in the chain
Solutions:
- Verify `AWS_PROFILE` is set and the profile exists in `~/.aws/credentials`
- For EC2, verify the instance role is attached: `curl http://169.254.169.254/latest/meta-data/iam/security-credentials/`
- Check environment variables are set correctly
Issue: Bedrock Access Denied
User: arn:aws:iam::123456789:user/datahub is not authorized to perform: bedrock:InvokeModel
Solution: Add IAM permissions (see IAM Permissions section above).
Issue: Model Not Found
Could not find model: cohere.embed-english-v3
Solutions:
- Verify model access is enabled in AWS Console → Bedrock → Model access
- Check that the region supports the model (use `us-west-2` for broadest support)
- Ensure `EMBEDDING_PROVIDER_AWS_REGION` matches where you enabled access
Issue: k-NN Index Creation Failed
Codec [zstd_no_dict] cannot be used with k-NN indices
Solution: This is a known issue with older DataHub versions; the semantic search changes include a fix, so ensure you are running the latest code.
Issue: Vector Dimension Mismatch
Dimension mismatch: expected 1024, got 1536
Solution: Your model's dimensions don't match the configuration. Update ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION to match your model.
Performance Tuning
OpenSearch k-NN Settings
For better performance, tune these parameters in application.yaml:
```yaml
semanticSearch:
  models:
    cohere_embed_v3:
      efConstruction: 128    # Higher = better recall, slower indexing (default: 128)
      m: 16                  # Higher = better recall, more memory (default: 16)
      spaceType: cosinesimil # cosinesimil, l2, innerproduct
      knnEngine: faiss       # faiss, nmslib, lucene
```
Recommendations:
- Small datasets (<10K docs): `efConstruction: 128, m: 16`
- Medium datasets (10K-100K docs): `efConstruction: 256, m: 32`
- Large datasets (>100K docs): `efConstruction: 512, m: 48`
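If you size indexes programmatically, the tiers above can be encoded in a small helper (an illustrative sketch, not part of DataHub):

```python
def knn_build_params(doc_count: int) -> dict:
    """Map corpus size to HNSW build parameters per the recommendations above."""
    if doc_count < 10_000:
        return {"efConstruction": 128, "m": 16}
    if doc_count <= 100_000:
        return {"efConstruction": 256, "m": 32}
    return {"efConstruction": 512, "m": 48}
```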
AWS Bedrock Rate Limits
Cohere Embed v3 on Bedrock has default limits:
- Requests per minute: 1000
- Tokens per minute: 200,000
For higher limits, request a quota increase in AWS Service Quotas.
Cost Estimation
AWS Bedrock Pricing (Cohere Embed v3)
As of December 2024 in us-west-2:
- $0.0001 per 1,000 input tokens (~750 words)
Example costs:
- 10,000 datasets with 200 tokens each = 2M tokens = $0.20
- 100,000 datasets with 200 tokens each = 20M tokens = $2.00
- Query embeddings: ~50 tokens per query = 10,000 queries = $0.05
Monthly estimates (assuming daily re-indexing):
- 10K entities: ~$6/month
- 100K entities: ~$60/month
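These figures come from a simple back-of-the-envelope calculation, sketched below using the December 2024 list price (check current pricing before budgeting):

```python
PRICE_PER_1K_TOKENS = 0.0001  # USD, Cohere Embed v3 on Bedrock (Dec 2024, us-west-2)

def embedding_cost(num_docs: int, tokens_per_doc: int, reindexes: int = 1) -> float:
    """Estimated USD cost of embedding a corpus, per the pricing above."""
    total_tokens = num_docs * tokens_per_doc * reindexes
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# 10,000 docs x 200 tokens = $0.20 per full pass;
# daily re-indexing for 30 days ~= $6/month.
```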
Check current pricing: https://aws.amazon.com/bedrock/pricing/
Security Best Practices
- Use IAM Roles: Prefer EC2 instance roles over static credentials
- Principle of Least Privilege: Grant only the `bedrock:InvokeModel` permission
- Enable CloudTrail: Monitor Bedrock API calls
- Resource Tags: Tag IAM roles for cost tracking
- Rotate Credentials: If using access keys, rotate regularly