# Pinecone

## Overview
Pinecone is a managed vector database platform used to store, index, and query high-dimensional vector embeddings for AI and machine learning applications. Learn more in the official Pinecone documentation.
The DataHub integration for Pinecone extracts metadata about indexes, namespaces, and vector collections, including inferred schemas from vector metadata fields.
## Concept Mapping
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Pinecone Account | Platform Instance | Organizes assets within the platform context. |
| Index | Container (PINECONE_INDEX) | Top-level organizational unit storing vectors. |
| Namespace | Container (PINECONE_NAMESPACE) | Logical partition within an index. |
| Vector Collection | Dataset | Represents the collection of vectors in a namespace. |
| Metadata Fields | SchemaField | Inferred from sampled vector metadata. |
## Module `pinecone`

### Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled via stateful ingestion. |
| Domains | ✅ | Supported via the domain config field. |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
### Overview
The pinecone module ingests metadata from Pinecone vector database into DataHub. It extracts index configurations, namespace statistics, and infers schemas from vector metadata fields.
### Prerequisites
Before running ingestion, ensure you have a valid Pinecone API key with read access to your indexes.
#### Steps to Get the Required Information
1. Log in to the Pinecone Console.
2. Navigate to **API Keys** in the left sidebar.
3. Copy an existing API key or create a new one with read permissions.
Schema inference samples vectors from each namespace to build a schema. This requires that your vectors have metadata fields attached. If vectors have no metadata, schema inference is skipped gracefully.
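To make the sampling-based inference concrete, here is a simplified, self-contained sketch of the idea (not the connector's actual implementation): collect the metadata dicts from a sample of vectors, record each field name with the value types observed, and cap the number of fields. Vectors without metadata contribute nothing.

```python
# Simplified sketch of metadata-based schema inference. This is an
# illustration of the technique, not the connector's real code path.
from collections import defaultdict


def infer_schema(sampled_metadata, max_metadata_fields=100):
    """Map each metadata field name to the sorted list of type names seen."""
    fields = defaultdict(set)
    for metadata in sampled_metadata:
        # Vectors with no metadata (None / empty dict) are skipped gracefully.
        for name, value in (metadata or {}).items():
            fields[name].add(type(value).__name__)
    # Keep at most max_metadata_fields fields, in first-seen order.
    return {
        name: sorted(types)
        for name, types in list(fields.items())[:max_metadata_fields]
    }


samples = [
    {"genre": "drama", "year": 2019},
    {"genre": "comedy", "year": 2021, "rating": 8.4},
    None,  # a vector without metadata
]
print(infer_schema(samples))
# → {'genre': ['str'], 'year': ['int'], 'rating': ['float']}
```

Fields that appear in any sampled vector show up in the schema; fields absent from the entire sample do not, which is why sample size matters for sparse metadata.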
### Install the Plugin

```shell
pip install 'acryl-datahub[pinecone]'
```
### Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
```yaml
source:
  type: pinecone
  config:
    # Required: Pinecone API key
    api_key: "${PINECONE_API_KEY}"

    # Optional: Platform instance for multi-environment setups
    # platform_instance: "production"

    # Optional: Filter indexes by name pattern
    # index_pattern:
    #   allow:
    #     - "prod-.*"
    #   deny:
    #     - ".*-test"

    # Optional: Filter namespaces by name pattern
    # namespace_pattern:
    #   allow:
    #     - "customer-.*"

    # Optional: Schema inference settings
    # enable_schema_inference: true
    # schema_sampling_size: 100
    # max_metadata_fields: 100

    # Optional: Stateful ingestion for stale entity removal
    # stateful_ingestion:
    #   enabled: true
    #   remove_stale_metadata: true

sink:
  # sink configs
```
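The `${PINECONE_API_KEY}` reference in the recipe is resolved from environment variables when the recipe is loaded, so the key never has to be stored in the file. The substitution behaves much like Python's `os.path.expandvars`, shown here purely as an illustration of the mechanism:

```python
# Illustration only: how a "${VAR}" reference in a recipe line resolves
# from the environment. DataHub performs a similar substitution when it
# loads the recipe file.
import os

os.environ["PINECONE_API_KEY"] = "pc-example-key"  # normally set in your shell
recipe_line = 'api_key: "${PINECONE_API_KEY}"'
print(os.path.expandvars(recipe_line))
# → api_key: "pc-example-key"
```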
### Config Details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
| `api_key` ✅ string (password) | Pinecone API key for authentication. Can be found in the Pinecone console under API Keys. |
| `enable_schema_inference` boolean | Whether to infer schemas from vector metadata. When enabled, samples vectors from each namespace to build a schema. Default: `True` |
| `environment` One of string, null | Pinecone environment (for pod-based indexes). Not required for serverless indexes. Example: `us-west1-gcp`. Default: `None` |
| `index_host_mapping` One of map (string → string), null | Optional manual mapping of index names to host URLs, useful if automatic host resolution fails. Example: `{'my-index': 'my-index-abc123.svc.pinecone.io'}`. Default: `None` |
| `max_metadata_fields` integer | Maximum number of metadata fields to include in the inferred schema. Limits schema size for namespaces with many metadata fields. Default: `100` |
| `max_workers` integer | Maximum number of parallel workers for processing indexes and namespaces. Increase for faster ingestion of many indexes. Default: `5` |
| `platform_instance` One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: `None` |
| `schema_sampling_size` integer | Number of vectors to sample per namespace for schema inference. Higher values provide more accurate schemas but increase ingestion time. Default: `100` |
| `env` string | The environment that all assets produced by this connector belong to. Default: `PROD` |
| `index_pattern` AllowDenyPattern | Regex patterns for indexes to filter in ingestion. Specify `allow` patterns to include specific indexes and `deny` patterns to exclude them. Default: allow all. |
| `index_pattern.ignoreCase` One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: `True` |
| `namespace_pattern` AllowDenyPattern | Regex patterns for namespaces to filter in ingestion. Specify `allow` patterns to include specific namespaces and `deny` patterns to exclude them. Default: allow all. |
| `namespace_pattern.ignoreCase` One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: `True` |
| `stateful_ingestion` One of StatefulStaleMetadataRemovalConfig, null | Stateful ingestion configuration for tracking processed entities and removing stale metadata. Default: `None` |
| `stateful_ingestion.enabled` boolean | Whether or not to enable stateful ingestion. Default: `True` if a `pipeline_name` is set and either a `datahub-rest` sink or `datahub_api` is specified, otherwise `False`. |
| `stateful_ingestion.fail_safe_threshold` number | Aborts the commit of soft deletes and state if the relative change in entity count compared to the previous state exceeds this percentage, protecting against accidental source configuration changes. Default: `75.0` |
| `stateful_ingestion.remove_stale_metadata` boolean | Soft-deletes entities that were present in the last successful run but missing in the current run (requires `stateful_ingestion` to be enabled). Default: `True` |
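The allow/deny patterns act as ordinary regexes: a name is ingested when it matches at least one `allow` pattern and no `deny` pattern, with case-insensitive matching by default. A minimal sketch of that filtering logic (an illustration, not the actual `AllowDenyPattern` class; it anchors matches at the start of the name, like `re.match`):

```python
# Sketch of allow/deny filtering semantics for index_pattern and
# namespace_pattern. Not the real AllowDenyPattern implementation.
import re


def is_allowed(name, allow=(".*",), deny=(), ignore_case=True):
    """Return True if name matches some allow pattern and no deny pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    if not any(re.match(p, name, flags) for p in allow):
        return False
    return not any(re.match(p, name, flags) for p in deny)


print(is_allowed("prod-movies", allow=["prod-.*"], deny=[".*-test"]))       # True
print(is_allowed("prod-movies-test", allow=["prod-.*"], deny=[".*-test"]))  # False
```

With `ignoreCase` left at its default of `True`, an index named `PROD-movies` would also pass the `prod-.*` allow pattern.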
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for Pinecone source.\n\nExtracts metadata from Pinecone vector database including:\n- Index configurations (dimension, metric, type)\n- Namespace information and statistics\n- Inferred schemas from vector metadata",
"properties": {
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful ingestion configuration for tracking processed entities and removing stale metadata."
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"api_key": {
"description": "Pinecone API key for authentication. Can be found in the Pinecone console under API Keys.",
"format": "password",
"title": "Api Key",
"type": "string",
"writeOnly": true
},
"environment": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Pinecone environment (for pod-based indexes). Not required for serverless indexes. Example: 'us-west1-gcp'",
"title": "Environment"
},
"index_host_mapping": {
"anyOf": [
{
"additionalProperties": {
"type": "string"
},
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"description": "Optional manual mapping of index names to host URLs. Useful if automatic host resolution fails. Example: {'my-index': 'my-index-abc123.svc.pinecone.io'}",
"title": "Index Host Mapping"
},
"index_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for indexes to filter in ingestion. Specify 'allow' patterns to include specific indexes, and 'deny' patterns to exclude indexes."
},
"namespace_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for namespaces to filter in ingestion. Specify 'allow' patterns to include specific namespaces, and 'deny' patterns to exclude namespaces."
},
"enable_schema_inference": {
"default": true,
"description": "Whether to infer schemas from vector metadata. When enabled, samples vectors from each namespace to build a schema.",
"title": "Enable Schema Inference",
"type": "boolean"
},
"schema_sampling_size": {
"default": 100,
"description": "Number of vectors to sample per namespace for schema inference. Higher values provide more accurate schemas but increase ingestion time.",
"exclusiveMinimum": 0,
"title": "Schema Sampling Size",
"type": "integer"
},
"max_metadata_fields": {
"default": 100,
"description": "Maximum number of metadata fields to include in the inferred schema. Limits schema size for namespaces with many metadata fields.",
"exclusiveMinimum": 0,
"title": "Max Metadata Fields",
"type": "integer"
},
"max_workers": {
"default": 5,
"description": "Maximum number of parallel workers for processing indexes and namespaces. Increase for faster ingestion of many indexes.",
"exclusiveMinimum": 0,
"title": "Max Workers",
"type": "integer"
}
},
"required": [
"api_key"
],
"title": "PineconeConfig",
"type": "object"
}
### Capabilities
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
### Limitations

- Schema inference is based on sampled vectors and may not capture all metadata fields present in the full dataset.
- The `describe_namespace()` API is only available for serverless indexes; pod-based indexes use `describe_index_stats()` for namespace discovery.
- Vector values themselves are not ingested; only metadata fields and statistics are.
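To gauge how much sampling matters in practice: if a metadata field appears on only a small fraction of vectors, the chance that a fixed-size sample misses it entirely is easy to estimate. This assumes uniform random sampling, which is an idealization of whatever sampling the service actually performs:

```python
# Probability that a field present on fraction p of vectors is absent
# from every one of n independently sampled vectors: (1 - p) ** n.
def miss_probability(field_fraction, sample_size):
    return (1 - field_fraction) ** sample_size


# A field on 1% of vectors is missed by a 100-vector sample ~37% of the time.
print(round(miss_probability(0.01, 100), 3))  # → 0.366
# Quadrupling schema_sampling_size drives that below 2%.
print(round(miss_probability(0.01, 400), 3))  # → 0.018
```

Raising `schema_sampling_size` therefore trades ingestion time for better coverage of rare metadata fields.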
### Troubleshooting

If ingestion fails, validate your API key, check network connectivity to the Pinecone API, and review the ingestion logs for source-specific errors. If schema inference is slow, reduce `schema_sampling_size` or set `enable_schema_inference: false`.
### Code Coordinates

- Class Name: `datahub.ingestion.source.pinecone.pinecone_source.PineconeSource`
- Browse on GitHub
If you've got any questions on configuring ingestion for Pinecone, feel free to ping us on our Slack.
This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.
Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.