Pinecone

Overview

Pinecone is a managed vector database platform used to store, index, and query high-dimensional vector embeddings for AI and machine learning applications. Learn more in the official Pinecone documentation.

The DataHub integration for Pinecone extracts metadata about indexes, namespaces, and vector collections, including inferred schemas from vector metadata fields.

Concept Mapping

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Pinecone Account | Platform Instance | Organizes assets within the platform context. |
| Index | Container (`PINECONE_INDEX`) | Top-level organizational unit storing vectors. |
| Namespace | Container (`PINECONE_NAMESPACE`) | Logical partition within an index. |
| Vector Collection | Dataset | Represents the collection of vectors in a namespace. |
| Metadata Fields | SchemaField | Inferred from sampled vector metadata. |

Module pinecone

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Asset Containers | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled via stateful ingestion. |
| Domains | ✅ | Supported via the `domain` config field. |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |

Overview

The `pinecone` module ingests metadata from the Pinecone vector database into DataHub. It extracts index configurations and namespace statistics, and infers schemas from vector metadata fields.

Prerequisites

Before running ingestion, ensure you have a valid Pinecone API key with read access to your indexes.

Steps to Get the Required Information

  1. Log in to the Pinecone Console.
  2. Navigate to API Keys in the left sidebar.
  3. Copy an existing API key or create a new one with read permissions.
> **Note:** Schema inference samples vectors from each namespace to build a schema. This requires that your vectors have metadata fields attached. If vectors have no metadata, schema inference is skipped gracefully.
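To make the note above concrete, here is a simplified sketch of what schema inference from sampled metadata amounts to: union the field names seen across sampled records and record the value types observed for each. This is illustrative only; `infer_schema` and `max_fields` are hypothetical names, not the connector's actual implementation.

```python
# Illustrative sketch only -- not the connector's actual code.
# Inference from samples: union the metadata field names and note the
# Python type names of the values observed for each field.

def infer_schema(sampled_metadata, max_fields=100):
    """Map each metadata field name to the set of value type names observed."""
    schema = {}
    for record in sampled_metadata:
        for field, value in record.items():
            if field not in schema and len(schema) >= max_fields:
                continue  # honor a max_metadata_fields-style cap
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

sample = [
    {"title": "doc-1", "year": 2023, "tags": ["ml"]},
    {"title": "doc-2", "year": 2024, "score": 0.87},
]
print(infer_schema(sample))
# {'title': {'str'}, 'year': {'int'}, 'tags': {'list'}, 'score': {'float'}}
```

Because only a sample is inspected, fields that appear rarely in the full dataset may be missed, which is why `schema_sampling_size` trades accuracy against ingestion time.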

Install the Plugin

```shell
pip install 'acryl-datahub[pinecone]'
```

Starter Recipe

Check out the following recipe to get started with ingestion; see below for full configuration options. Save the recipe to a file (for example, `pinecone_recipe.yml`) and run it with `datahub ingest -c pinecone_recipe.yml`.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: pinecone
  config:
    # Required: Pinecone API key
    api_key: "${PINECONE_API_KEY}"

    # Optional: Platform instance for multi-environment setups
    # platform_instance: "production"

    # Optional: Filter indexes by name pattern
    # index_pattern:
    #   allow:
    #     - "prod-.*"
    #   deny:
    #     - ".*-test"

    # Optional: Filter namespaces by name pattern
    # namespace_pattern:
    #   allow:
    #     - "customer-.*"

    # Optional: Schema inference settings
    # enable_schema_inference: true
    # schema_sampling_size: 100
    # max_metadata_fields: 100

    # Optional: Stateful ingestion for stale entity removal
    # stateful_ingestion:
    #   enabled: true
    #   remove_stale_metadata: true

sink:
  # sink configs go here
```

Config Details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `api_key` ✅ | string (password) | Pinecone API key for authentication. Can be found in the Pinecone console under API Keys. | *(required)* |
| `enable_schema_inference` | boolean | Whether to infer schemas from vector metadata. When enabled, samples vectors from each namespace to build a schema. | `True` |
| `environment` | string, null | Pinecone environment (for pod-based indexes). Not required for serverless indexes. Example: `us-west1-gcp`. | `None` |
| `index_host_mapping` | string, null | Optional manual mapping of index names to host URLs; useful if automatic host resolution fails. Example: `{'my-index': 'my-index-abc123.svc.pinecone.io'}`. | `None` |
| `max_metadata_fields` | integer | Maximum number of metadata fields to include in the inferred schema; limits schema size for namespaces with many metadata fields. | `100` |
| `max_workers` | integer | Maximum number of parallel workers for processing indexes and namespaces. Increase for faster ingestion of many indexes. | `5` |
| `platform_instance` | string, null | The instance of the platform that all assets produced by this recipe belong to. Should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for details. | `None` |
| `schema_sampling_size` | integer | Number of vectors to sample per namespace for schema inference. Higher values provide more accurate schemas but increase ingestion time. | `100` |
| `env` | string | The environment that all assets produced by this connector belong to. | `PROD` |
| `index_pattern` | AllowDenyPattern | Regexes for allowing or denying indexes by name. | |
| `index_pattern.ignoreCase` | boolean, null | Whether to ignore case sensitivity during pattern matching. | `True` |
| `namespace_pattern` | AllowDenyPattern | Regexes for allowing or denying namespaces by name. | |
| `namespace_pattern.ignoreCase` | boolean, null | Whether to ignore case sensitivity during pattern matching. | `True` |
| `stateful_ingestion` | StatefulStaleMetadataRemovalConfig, null | Stateful ingestion configuration for tracking processed entities and removing stale metadata. | `None` |
| `stateful_ingestion.enabled` | boolean | Whether to enable stateful ingestion. Defaults to `True` if a `pipeline_name` is set and either a `datahub-rest` sink or `datahub_api` is specified; otherwise `False`. | `False` |
| `stateful_ingestion.fail_safe_threshold` | number | Prevents a large number of soft deletes, and the state from committing, after accidental changes to the source configuration: the run is aborted if the relative change in entity count versus the previous state exceeds this percentage. | `75.0` |
| `stateful_ingestion.remove_stale_metadata` | boolean | Soft-deletes entities that were present in the last successful run but are missing in the current run (requires stateful ingestion to be enabled). | `True` |
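The `index_pattern` and `namespace_pattern` fields follow DataHub's usual AllowDenyPattern semantics: a name is kept only if it matches at least one allow regex (the allow list defaults to `.*`) and no deny regex, with matching anchored at the start of the name and case-insensitive by default. A minimal sketch of that behavior, where `allowed` is a hypothetical helper rather than the actual class:

```python
# Illustrative sketch of allow/deny matching -- not DataHub's actual
# AllowDenyPattern implementation. A name passes if it matches at least
# one allow regex (default ".*") and no deny regex; matching starts at
# the beginning of the name and is case-insensitive by default.
import re

def allowed(name, allow=(".*",), deny=(), ignore_case=True):
    flags = re.IGNORECASE if ignore_case else 0
    if not any(re.match(p, name, flags) for p in allow):
        return False
    return not any(re.match(p, name, flags) for p in deny)

print(allowed("prod-search", allow=["prod-.*"], deny=[".*-test"]))       # True
print(allowed("prod-search-test", allow=["prod-.*"], deny=[".*-test"]))  # False
print(allowed("PROD-search", allow=["prod-.*"]))                         # True
```

Note that deny rules win over allow rules, so a broad allow like `prod-.*` can still be narrowed with a deny such as `.*-test`.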

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Limitations

  • Schema inference is based on sampled vectors and may not capture all metadata fields present in the full dataset.
  • The `describe_namespace()` API is only available for serverless indexes; pod-based indexes use `describe_index_stats()` for namespace discovery.
  • Vector values themselves are not ingested; only metadata fields and statistics are.

Troubleshooting

If ingestion fails, validate your API key, check network connectivity to the Pinecone API, and review the ingestion logs for source-specific errors. If schema inference is slow, reduce `schema_sampling_size` or set `enable_schema_inference: false`.
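For the slow-inference case, a reasonable first step is to lower the sample size before disabling inference outright. The values below are illustrative starting points, not recommendations:

```yaml
source:
  type: pinecone
  config:
    api_key: "${PINECONE_API_KEY}"
    schema_sampling_size: 25          # illustrative: sample fewer vectors per namespace
    # enable_schema_inference: false  # last resort: skip schema inference entirely
```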

Code Coordinates

  • Class Name: `datahub.ingestion.source.pinecone.pinecone_source.PineconeSource`
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Pinecone, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.