Pinecone

Overview

Pinecone is a managed vector database platform used to store, index, and query high-dimensional vector embeddings for AI and machine learning applications. Learn more in the official Pinecone documentation.

The DataHub integration for Pinecone extracts metadata about indexes, namespaces, and vector collections, including inferred schemas from vector metadata fields.

Concept Mapping

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Pinecone Account | Platform Instance | Organizes assets within the platform context. |
| Index | Container (`PINECONE_INDEX`) | Top-level organizational unit storing vectors. |
| Namespace | Container (`PINECONE_NAMESPACE`) | Logical partition within an index. |
| Vector Collection | Dataset | Represents the collection of vectors in a namespace. |
| Metadata Fields | SchemaField | Inferred from sampled vector metadata. |

Module pinecone

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Asset Containers | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled via stateful ingestion. |
| Domains | ✅ | Supported via the `domain` config field. |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |

Overview

The `pinecone` module ingests metadata from the Pinecone vector database into DataHub. It extracts index configurations and namespace statistics, and infers schemas from vector metadata fields.

Prerequisites

Before running ingestion, ensure you have a valid Pinecone API key with read access to your indexes.

Steps to Get the Required Information

  1. Log in to the Pinecone Console.
  2. Navigate to API Keys in the left sidebar.
  3. Copy an existing API key or create a new one with read permissions.
> **Note:** Schema inference samples vectors from each namespace to build a schema. This requires that your vectors have metadata fields attached. If vectors have no metadata, schema inference is skipped gracefully.
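To make the note above concrete, here is a simplified sketch of what schema inference from sampled metadata amounts to: union the field names seen across sampled records and record the value types observed for each. This is illustrative only; `infer_schema` and `max_fields` are hypothetical names, not the connector's actual implementation.

```python
# Illustrative sketch only -- not the connector's actual code.
# Inference from samples: union the metadata field names and note the
# Python type names of the values observed for each field.

def infer_schema(sampled_metadata, max_fields=100):
    """Map each metadata field name to the set of value type names observed."""
    schema = {}
    for record in sampled_metadata:
        for field, value in record.items():
            if field not in schema and len(schema) >= max_fields:
                continue  # honor a max_metadata_fields-style cap
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

sample = [
    {"title": "doc-1", "year": 2023, "tags": ["ml"]},
    {"title": "doc-2", "year": 2024, "score": 0.87},
]
print(infer_schema(sample))
# {'title': {'str'}, 'year': {'int'}, 'tags': {'list'}, 'score': {'float'}}
```

Because only a sample is inspected, fields that appear rarely in the full dataset may be missed, which is why `schema_sampling_size` trades accuracy against ingestion time.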

Install the Plugin

```shell
pip install 'acryl-datahub[pinecone]'
```

Starter Recipe

Check out the following recipe to get started with ingestion; see below for full configuration options. Save the recipe to a file (for example, `pinecone_recipe.yml`) and run it with `datahub ingest -c pinecone_recipe.yml`.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: pinecone
  config:
    # Required: Pinecone API key
    api_key: "${PINECONE_API_KEY}"

    # Optional: Platform instance for multi-environment setups
    # platform_instance: "production"

    # Optional: Filter indexes by name pattern
    # index_pattern:
    #   allow:
    #     - "prod-.*"
    #   deny:
    #     - ".*-test"

    # Optional: Filter namespaces by name pattern
    # namespace_pattern:
    #   allow:
    #     - "customer-.*"

    # Optional: Schema inference settings
    # enable_schema_inference: true
    # schema_sampling_size: 100
    # max_metadata_fields: 100

    # Optional: Stateful ingestion for stale entity removal
    # stateful_ingestion:
    #   enabled: true
    #   remove_stale_metadata: true

sink:
  # sink configs go here
```

Config Details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `api_key` ✅ | string (password) | Pinecone API key for authentication. Can be found in the Pinecone console under API Keys. | *(required)* |
| `enable_schema_inference` | boolean | Whether to infer schemas from vector metadata. When enabled, samples vectors from each namespace to build a schema. | `True` |
| `environment` | string, null | Pinecone environment (for pod-based indexes). Not required for serverless indexes. Example: `us-west1-gcp`. | `None` |
| `index_host_mapping` | string, null | Optional manual mapping of index names to host URLs; useful if automatic host resolution fails. Example: `{'my-index': 'my-index-abc123.svc.pinecone.io'}`. | `None` |
| `max_metadata_fields` | integer | Maximum number of metadata fields to include in the inferred schema; limits schema size for namespaces with many metadata fields. | `100` |
| `max_workers` | integer | Maximum number of parallel workers for processing indexes and namespaces. Increase for faster ingestion of many indexes. | `5` |
| `platform_instance` | string, null | The instance of the platform that all assets produced by this recipe belong to. Should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for details. | `None` |
| `schema_sampling_size` | integer | Number of vectors to sample per namespace for schema inference. Higher values provide more accurate schemas but increase ingestion time. | `100` |
| `env` | string | The environment that all assets produced by this connector belong to. | `PROD` |
| `index_pattern` | AllowDenyPattern | Regexes for allowing or denying indexes by name. | |
| `index_pattern.ignoreCase` | boolean, null | Whether to ignore case sensitivity during pattern matching. | `True` |
| `namespace_pattern` | AllowDenyPattern | Regexes for allowing or denying namespaces by name. | |
| `namespace_pattern.ignoreCase` | boolean, null | Whether to ignore case sensitivity during pattern matching. | `True` |
| `stateful_ingestion` | StatefulStaleMetadataRemovalConfig, null | Stateful ingestion configuration for tracking processed entities and removing stale metadata. | `None` |
| `stateful_ingestion.enabled` | boolean | Whether to enable stateful ingestion. Defaults to `True` if a `pipeline_name` is set and either a `datahub-rest` sink or `datahub_api` is specified; otherwise `False`. | `False` |
| `stateful_ingestion.fail_safe_threshold` | number | Prevents a large number of soft deletes, and the state from committing, after accidental changes to the source configuration: the run is aborted if the relative change in entity count versus the previous state exceeds this percentage. | `75.0` |
| `stateful_ingestion.remove_stale_metadata` | boolean | Soft-deletes entities that were present in the last successful run but are missing in the current run (requires stateful ingestion to be enabled). | `True` |
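The `index_pattern` and `namespace_pattern` fields follow DataHub's usual AllowDenyPattern semantics: a name is kept only if it matches at least one allow regex (the allow list defaults to `.*`) and no deny regex, with matching anchored at the start of the name and case-insensitive by default. A minimal sketch of that behavior, where `allowed` is a hypothetical helper rather than the actual class:

```python
# Illustrative sketch of allow/deny matching -- not DataHub's actual
# AllowDenyPattern implementation. A name passes if it matches at least
# one allow regex (default ".*") and no deny regex; matching starts at
# the beginning of the name and is case-insensitive by default.
import re

def allowed(name, allow=(".*",), deny=(), ignore_case=True):
    flags = re.IGNORECASE if ignore_case else 0
    if not any(re.match(p, name, flags) for p in allow):
        return False
    return not any(re.match(p, name, flags) for p in deny)

print(allowed("prod-search", allow=["prod-.*"], deny=[".*-test"]))       # True
print(allowed("prod-search-test", allow=["prod-.*"], deny=[".*-test"]))  # False
print(allowed("PROD-search", allow=["prod-.*"]))                         # True
```

Note that deny rules win over allow rules, so a broad allow like `prod-.*` can still be narrowed with a deny such as `.*-test`.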

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Limitations

  • Schema inference is based on sampled vectors and may not capture all metadata fields present in the full dataset.
  • The `describe_namespace()` API is only available for serverless indexes; pod-based indexes use `describe_index_stats()` for namespace discovery.
  • Vector values themselves are not ingested; only metadata fields and statistics are.

Troubleshooting

If ingestion fails, validate your API key, check network connectivity to the Pinecone API, and review the ingestion logs for source-specific errors. If schema inference is slow, reduce `schema_sampling_size` or set `enable_schema_inference: false`.
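For the slow-inference case, a reasonable first step is to lower the sample size before disabling inference outright. The values below are illustrative starting points, not recommendations:

```yaml
source:
  type: pinecone
  config:
    api_key: "${PINECONE_API_KEY}"
    schema_sampling_size: 25          # illustrative: sample fewer vectors per namespace
    # enable_schema_inference: false  # last resort: skip schema inference entirely
```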

Code Coordinates

  • Class Name: `datahub.ingestion.source.pinecone.pinecone_source.PineconeSource`
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Pinecone, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.