RDF

Overview

This connector ingests RDF (Resource Description Framework) ontologies into DataHub, with a focus on business glossaries. It extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.

The RDF ingestion source processes RDF/OWL ontologies in various formats (Turtle, RDF/XML, JSON-LD, N3, N-Triples) and converts them to DataHub entities. It supports loading RDF from files, folders, URLs, and comma-separated file lists.

Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| `"rdf"` | Data Platform | |
| `skos:Concept` | GlossaryTerm | SKOS concepts become glossary terms |
| `owl:Class` | GlossaryTerm | OWL classes become glossary terms |
| IRI path hierarchy | GlossaryNode | Path segments create the glossary node hierarchy |
| `skos:broader` / `skos:narrower` | isRelatedTerms relationship | Term relationships |
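For instance, a SKOS concept like the following (an illustrative Turtle snippet, not taken from any specific ontology) would become a glossary term named "Credit Risk", with its `skos:definition` used as the description and its `skos:broader` link extracted as a term relationship:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.com/finance/> .

ex:credit-risk a skos:Concept ;
    skos:prefLabel "Credit Risk" ;
    skos:definition "The risk of loss arising from a borrower failing to repay." ;
    skos:broader ex:risk .
```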

Module rdf

Incubating

Important Capabilities

| Capability | Notes |
|---|---|
| Data Profiling | Not applicable. |
| Descriptions | Enabled by default (from `skos:definition` or `rdfs:comment`). |
| Detect Deleted Entities | Enabled via `stateful_ingestion.enabled: true`. |
| Domains | Not applicable (domains are used internally for hierarchy). |
| Extract Ownership | Not supported. |
| Extract Tags | Not supported. |
| Platform Instance | Supported via the `platform_instance` config. |
| Table-Level Lineage | Not in MVP. |

Overview

The rdf module ingests RDF/OWL ontologies into DataHub as glossary terms, glossary nodes, and term relationships. It supports multiple RDF formats and dialects.

Prerequisites

To ingest metadata from RDF files, you will need:

  • Python 3.8 or higher
  • Access to RDF files (local files, folders, or URLs)
  • RDF files in supported formats: Turtle (.ttl), RDF/XML (.rdf, .xml), JSON-LD (.jsonld), N3 (.n3), or N-Triples (.nt)
  • A DataHub instance to ingest into

RDF Format Support

The source supports multiple RDF serialization formats:

  • Turtle (.ttl) - Recommended format, human-readable
  • RDF/XML (.rdf, .xml) - XML-based RDF format
  • JSON-LD (.jsonld) - JSON-based RDF format
  • N3 (.n3) - Notation3 format
  • N-Triples (.nt) - Line-based RDF format

The format is auto-detected from the file extension, or you can specify it explicitly using the format parameter.
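Extension-based detection can be sketched roughly as follows. This is an illustrative Python sketch, not the connector's actual code; the format names follow rdflib conventions:

```python
from pathlib import Path
from typing import Optional

# Illustrative mapping of file extensions to RDF serialization formats.
# The connector's real mapping may differ in detail.
EXTENSION_FORMATS = {
    ".ttl": "turtle",
    ".rdf": "xml",
    ".xml": "xml",
    ".jsonld": "json-ld",
    ".n3": "n3",
    ".nt": "nt",
}

def detect_format(source: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if given, otherwise guess from the extension."""
    if explicit:
        return explicit
    ext = Path(source).suffix.lower()
    try:
        return EXTENSION_FORMATS[ext]
    except KeyError:
        raise ValueError(
            f"Cannot auto-detect RDF format for {source!r}; "
            "set the 'format' parameter explicitly"
        )

detect_format("path/to/glossary.ttl")        # → "turtle"
detect_format("data.bin", explicit="nt")     # → "nt"
```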

Source Types

The source parameter accepts multiple input types:

  • Single file: source: path/to/glossary.ttl
  • Folder: source: path/to/rdf_files/ (processes all RDF files, recursively if recursive: true)
  • URL: source: https://example.com/ontology.ttl
  • Comma-separated files: source: file1.ttl, file2.ttl, file3.ttl
  • Glob pattern: source: path/to/**/*.ttl
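For example, a folder source with recursion disabled and a narrowed extension list might look like this (a config sketch using the fields documented below):

```yaml
source:
  type: rdf
  config:
    source: path/to/rdf_files/
    recursive: false
    extensions: ['.ttl', '.rdf']
```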

RDF Dialects

The source supports different RDF dialects for specialized processing:

  • default - Standard RDF processing (BCBS239-style)
  • fibo - FIBO (Financial Industry Business Ontology) dialect
  • generic - Generic RDF processing

The dialect is auto-detected based on the RDF content, or you can force a specific dialect using the dialect parameter.

SPARQL Filtering

You can use SPARQL CONSTRUCT queries to filter the RDF graph before ingestion. This is useful for filtering by namespace, applying complex filtering logic, or reducing the size of large RDF graphs.

```yaml
source:
  type: rdf
  config:
    source: large_ontology.ttl
    sparql_filter: |
      CONSTRUCT { ?s ?p ?o }
      WHERE {
        ?s ?p ?o .
        FILTER(STRSTARTS(STR(?s), "https://example.org/module1/"))
      }
```

Only CONSTRUCT queries are supported. The filter is applied before entity extraction.

Selective Entity Export

You can control which entity types are ingested using export_only or skip_export:

```yaml
source:
  type: rdf
  config:
    source: glossary.ttl
    export_only:
      - glossary   # Only ingest glossary terms
```

Available entity types: glossary (or glossary_terms), relationship (or relationships).
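Conversely, `skip_export` excludes the listed entity types while ingesting everything else. For example, to ingest glossary terms but drop term relationships:

```yaml
source:
  type: rdf
  config:
    source: glossary.ttl
    skip_export:
      - relationship   # Ingest glossary terms, skip term relationships
```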

Install the Plugin

```shell
pip install 'acryl-datahub[rdf]'
```

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: rdf
  config:
    # Required: RDF source (file, folder, URL, or comma-separated files)
    source: path/to/glossary.ttl

    # Optional: DataHub environment
    environment: PROD

    # Optional: RDF format (auto-detected if not specified)
    # format: turtle

    # Optional: RDF dialect (auto-detected if not specified)
    # dialect: default

    # Optional: Export only specific entity types
    # export_only:
    #   - glossary

    # Optional: Skip specific entity types
    # skip_export:
    #   - relationship

    # Optional: Enable stateful ingestion (recommended for production)
    # stateful_ingestion:
    #   enabled: true

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
    # token: "${DATAHUB_TOKEN}"
```

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

- **source** (string, required): Web URL to a `.ttl` or `.zip` file (e.g. `https://example.org/glossary.ttl`, `https://example.org/data.zip`). Also supports web folder URLs. Local file/folder paths work only when running via the CLI. Examples: `'https://example.org/glossary.ttl'`, `'https://example.org/data.zip'`, `'https://example.org/folder/'`, `'/path/to/file.ttl'` (CLI-only).
- **dialect** (string or null, default: `None`): Force a specific RDF dialect (default: auto-detect). Options: `default`, `fibo`, `generic`.
- **environment** (string, default: `PROD`): DataHub environment (`PROD`, `DEV`, `TEST`, etc.).
- **format** (string or null, default: `None`): RDF format (auto-detected if not specified). Examples: `turtle`, `xml`, `n3`, `nt`.
- **include_provisional** (boolean, default: `False`): Include terms with provisional/work-in-progress status. When `False`, only terms that have been fully approved/released are included. Many ontologies use workflow status properties (e.g. maturity level) to mark terms that are in the pipeline but not yet fully approved; leaving this `False` reduces noise from unapproved or draft terms.
- **parent_glossary_node** (string or null, default: `None`): Optional parent Term Group (glossary node) to place the loaded hierarchy under. Use a name (e.g. `'ExternalOntologies'`) or a full URN (`urn:li:glossaryNode:ExternalOntologies`). If a name is provided, the parent node is created if it does not exist. When omitted, terms are placed at the top level.
- **platform_instance** (string or null, default: `None`): The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
- **recursive** (boolean, default: `True`): Enable recursive folder processing.
- **sparql_filter** (string or null, default: `None`): Optional SPARQL CONSTRUCT query to filter the RDF graph before ingestion. Useful for filtering by namespace, module, or custom patterns. The query should use CONSTRUCT to build a filtered graph. Example (filter to specific FIBO modules): `CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o . FILTER(STRSTARTS(STR(?s), "https://spec.edmcouncil.org/fibo/ontology/FBC/")) }`
- **env** (string, default: `PROD`): The environment that all assets produced by this connector belong to.
- **export_only** (array of string or null, default: `None`): Export only the specified entity types. Options are dynamically determined from registered entity types.
- **extensions** (array of string, default: `['.ttl', '.rdf', '.owl', '.n3', '.nt']`): File extensions to process when `source` is a folder.
- **skip_export** (array of string or null, default: `None`): Skip exporting the specified entity types. Options are dynamically determined from registered entity types.
- **stateful_ingestion** (StatefulStaleMetadataRemovalConfig or null, default: `None`): Stateful ingestion configuration. See https://datahubproject.io/docs/stateful-ingestion for more details.
- **stateful_ingestion.enabled** (boolean, default: `False`): Whether or not to enable stateful ingestion. Defaults to `True` if a `pipeline_name` is set and either a `datahub-rest` sink or `datahub_api` is specified, otherwise `False`.
- **stateful_ingestion.fail_safe_threshold** (number, default: `75.0`): Prevents a large number of soft deletes, and the state from committing, after accidental changes to the source configuration, if the relative change (in percent) of entities compared to the previous state exceeds this threshold.
- **stateful_ingestion.remove_stale_metadata** (boolean, default: `True`): Soft-deletes entities that were present in the last successful run but are missing in the current run, when stateful ingestion is enabled.

Capabilities

IRI-to-URN Mapping

RDF IRIs are converted to DataHub URNs following this pattern:

http://example.com/finance/credit-risk
→ urn:li:glossaryTerm:example.com/finance/credit-risk

IRI path segments create the glossary node hierarchy:

  • example.com → Glossary Node
  • finance → Glossary Node (child of example.com)
  • credit-risk → Glossary Term (under finance node)
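The mapping above can be sketched in a few lines of Python. This is an illustrative sketch of the documented behavior, not the connector's actual implementation:

```python
from urllib.parse import urlparse

def iri_to_glossary_parts(iri: str):
    """Split an absolute IRI into the parts the docs describe: the host and
    intermediate path segments become glossary nodes, the last segment
    becomes the glossary term, and the whole path forms the URN suffix."""
    parsed = urlparse(iri)
    segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s]
    nodes, term = segments[:-1], segments[-1]
    urn = "urn:li:glossaryTerm:" + "/".join(segments)
    return nodes, term, urn

nodes, term, urn = iri_to_glossary_parts("http://example.com/finance/credit-risk")
# nodes == ["example.com", "finance"]
# term == "credit-risk"
# urn == "urn:li:glossaryTerm:example.com/finance/credit-risk"
```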

Supported RDF Vocabularies

The source recognizes entities from these standard vocabularies:

  • SKOS: skos:Concept → Glossary Term, skos:prefLabel → name, skos:definition → definition, skos:broader/skos:narrower → relationships
  • OWL: owl:Class → Glossary Term, owl:NamedIndividual → Glossary Term
  • RDFS: rdfs:label → name (fallback), rdfs:comment → definition (fallback)

Limitations

  • MVP Scope: The current implementation focuses on glossary terms and relationships. Dataset, lineage, and structured property extraction are not included.
  • Relationship Types: Only skos:broader and skos:narrower relationships are extracted. skos:related and skos:exactMatch are not supported.
  • Term Requirements: Terms must have a label (skos:prefLabel or rdfs:label) of at least 3 characters to be extracted.
  • Large Files: Very large RDF files are loaded entirely into memory. Consider splitting large ontologies into multiple files.

Troubleshooting

No glossary terms extracted

  • Check term types - Ensure entities are typed as skos:Concept, owl:Class, or owl:NamedIndividual
  • Verify labels - Terms must have skos:prefLabel or rdfs:label with at least 3 characters
  • Check file format - Verify the RDF file is valid and in a supported format
  • Review logs - Enable debug logging: datahub ingest -c recipe.yml --debug

Terms not appearing in correct hierarchy

  • Check IRI structure - Glossary nodes are created from IRI path segments
  • Verify IRI format - IRIs should be absolute (e.g., https://example.com/path/term)

Relationships not extracted

  • Check relationship types - Only skos:broader and skos:narrower are supported
  • Verify term existence - Both source and target terms must exist in the RDF
  • Check export options - Ensure relationship is not in skip_export

Stateful ingestion not working

  • Enable it - Set stateful_ingestion.enabled: true
  • Check server - Verify your DataHub server supports stateful ingestion (version >= 0.8.20)
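A minimal stateful setup might look like the following. Note that `pipeline_name` is needed for state tracking (per the `stateful_ingestion.enabled` default described in Config Details); the name shown here is just a placeholder:

```yaml
pipeline_name: rdf_glossary_ingestion   # required for state tracking
source:
  type: rdf
  config:
    source: glossary.ttl
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
```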

Code Coordinates

  • Class Name: datahub.ingestion.source.rdf.ingestion.rdf_source.RDFSource
Questions?

If you've got any questions on configuring ingestion for RDF, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.