RDF

Overview

This connector ingests RDF (Resource Description Framework) ontologies into DataHub, with a focus on business glossaries. It extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.

The RDF ingestion source processes RDF/OWL ontologies in various formats (Turtle, RDF/XML, JSON-LD, N3, N-Triples) and converts them to DataHub entities. It supports loading RDF from files, folders, URLs, and comma-separated file lists.

Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| `"rdf"` | Data Platform | |
| `skos:Concept` | GlossaryTerm | SKOS concepts become glossary terms |
| `owl:Class` | GlossaryTerm | OWL classes become glossary terms |
| IRI path hierarchy | GlossaryNode | Path segments create the glossary node hierarchy |
| `skos:broader` / `skos:narrower` | isRelatedTerms relationship | Term relationships |
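For instance, a SKOS concept like the following (an illustrative Turtle snippet, not taken from any specific ontology) would become a glossary term named "Credit Risk", with its `skos:definition` used as the description and its `skos:broader` link extracted as a term relationship:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.com/finance/> .

ex:credit-risk a skos:Concept ;
    skos:prefLabel "Credit Risk" ;
    skos:definition "The risk of loss arising from a borrower failing to repay." ;
    skos:broader ex:risk .
```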

Module rdf

Incubating

Important Capabilities

| Capability | Notes |
|---|---|
| Data Profiling | Not applicable. |
| Descriptions | Enabled by default (from `skos:definition` or `rdfs:comment`). |
| Detect Deleted Entities | Enabled via `stateful_ingestion.enabled: true`. |
| Domains | Not applicable (domains are used internally for hierarchy). |
| Extract Ownership | Not supported. |
| Extract Tags | Not supported. |
| Platform Instance | Supported via the `platform_instance` config. |
| Table-Level Lineage | Not in MVP. |

Overview

The rdf module ingests RDF/OWL ontologies into DataHub as glossary terms, glossary nodes, and term relationships. It supports multiple RDF formats and dialects.

Prerequisites

To ingest metadata from RDF files, you will need:

  • Python 3.8 or higher
  • Access to RDF files (local files, folders, or URLs)
  • RDF files in supported formats: Turtle (.ttl), RDF/XML (.rdf, .xml), JSON-LD (.jsonld), N3 (.n3), or N-Triples (.nt)
  • A DataHub instance to ingest into

RDF Format Support

The source supports multiple RDF serialization formats:

  • Turtle (.ttl) - Recommended format, human-readable
  • RDF/XML (.rdf, .xml) - XML-based RDF format
  • JSON-LD (.jsonld) - JSON-based RDF format
  • N3 (.n3) - Notation3 format
  • N-Triples (.nt) - Line-based RDF format

The format is auto-detected from the file extension, or you can specify it explicitly using the format parameter.
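Extension-based detection can be sketched roughly as follows. This is an illustrative Python sketch, not the connector's actual code; the format names follow rdflib conventions:

```python
from pathlib import Path
from typing import Optional

# Illustrative mapping of file extensions to RDF serialization formats.
# The connector's real mapping may differ in detail.
EXTENSION_FORMATS = {
    ".ttl": "turtle",
    ".rdf": "xml",
    ".xml": "xml",
    ".jsonld": "json-ld",
    ".n3": "n3",
    ".nt": "nt",
}

def detect_format(source: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if given, otherwise guess from the extension."""
    if explicit:
        return explicit
    ext = Path(source).suffix.lower()
    try:
        return EXTENSION_FORMATS[ext]
    except KeyError:
        raise ValueError(
            f"Cannot auto-detect RDF format for {source!r}; "
            "set the 'format' parameter explicitly"
        )

detect_format("path/to/glossary.ttl")        # → "turtle"
detect_format("data.bin", explicit="nt")     # → "nt"
```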

Source Types

The source parameter accepts multiple input types:

  • Single file: source: path/to/glossary.ttl
  • Folder: source: path/to/rdf_files/ (processes all RDF files, recursively if recursive: true)
  • URL: source: https://example.com/ontology.ttl
  • Comma-separated files: source: file1.ttl, file2.ttl, file3.ttl
  • Glob pattern: source: path/to/**/*.ttl
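For example, a folder source with recursion disabled and a narrowed extension list might look like this (a config sketch using the fields documented below):

```yaml
source:
  type: rdf
  config:
    source: path/to/rdf_files/
    recursive: false
    extensions: ['.ttl', '.rdf']
```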

RDF Dialects

The source supports different RDF dialects for specialized processing:

  • default - Standard RDF processing (BCBS239-style)
  • fibo - FIBO (Financial Industry Business Ontology) dialect
  • generic - Generic RDF processing

The dialect is auto-detected based on the RDF content, or you can force a specific dialect using the dialect parameter.

SPARQL Filtering

You can use SPARQL CONSTRUCT queries to filter the RDF graph before ingestion. This is useful for filtering by namespace, applying complex filtering logic, or reducing the size of large RDF graphs.

```yaml
source:
  type: rdf
  config:
    source: large_ontology.ttl
    sparql_filter: |
      CONSTRUCT { ?s ?p ?o }
      WHERE {
        ?s ?p ?o .
        FILTER(STRSTARTS(STR(?s), "https://example.org/module1/"))
      }
```

Only CONSTRUCT queries are supported. The filter is applied before entity extraction.

Selective Entity Export

You can control which entity types are ingested using export_only or skip_export:

```yaml
source:
  type: rdf
  config:
    source: glossary.ttl
    export_only:
      - glossary   # Only ingest glossary terms
```

Available entity types: glossary (or glossary_terms), relationship (or relationships).
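Conversely, `skip_export` excludes the listed entity types while ingesting everything else. For example, to ingest glossary terms but drop term relationships:

```yaml
source:
  type: rdf
  config:
    source: glossary.ttl
    skip_export:
      - relationship   # Ingest glossary terms, skip term relationships
```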

Install the Plugin

```shell
pip install 'acryl-datahub[rdf]'
```

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

```yaml
source:
  type: rdf
  config:
    # Required: RDF source (file, folder, URL, or comma-separated files)
    source: path/to/glossary.ttl

    # Optional: DataHub environment
    environment: PROD

    # Optional: RDF format (auto-detected if not specified)
    # format: turtle

    # Optional: RDF dialect (auto-detected if not specified)
    # dialect: default

    # Optional: Export only specific entity types
    # export_only:
    #   - glossary

    # Optional: Skip specific entity types
    # skip_export:
    #   - relationship

    # Optional: Enable stateful ingestion (recommended for production)
    # stateful_ingestion:
    #   enabled: true

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
    # token: "${DATAHUB_TOKEN}"
```

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

- **source** (string, required): Web URL to a `.ttl` or `.zip` file (e.g. `https://example.org/glossary.ttl`, `https://example.org/data.zip`). Also supports web folder URLs. Local file/folder paths work only when running via the CLI. Examples: `'https://example.org/glossary.ttl'`, `'https://example.org/data.zip'`, `'https://example.org/folder/'`, `'/path/to/file.ttl'` (CLI-only).
- **dialect** (string or null, default: `None`): Force a specific RDF dialect (default: auto-detect). Options: `default`, `fibo`, `generic`.
- **environment** (string, default: `PROD`): DataHub environment (`PROD`, `DEV`, `TEST`, etc.).
- **format** (string or null, default: `None`): RDF format (auto-detected if not specified). Examples: `turtle`, `xml`, `n3`, `nt`.
- **include_provisional** (boolean, default: `False`): Include terms with provisional/work-in-progress status. When `False`, only terms that have been fully approved/released are included. Many ontologies use workflow status properties (e.g. maturity level) to mark terms that are in the pipeline but not yet fully approved; leaving this `False` reduces noise from unapproved or draft terms.
- **parent_glossary_node** (string or null, default: `None`): Optional parent Term Group (glossary node) to place the loaded hierarchy under. Use a name (e.g. `'ExternalOntologies'`) or a full URN (`urn:li:glossaryNode:ExternalOntologies`). If a name is provided, the parent node is created if it does not exist. When omitted, terms are placed at the top level.
- **platform_instance** (string or null, default: `None`): The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
- **recursive** (boolean, default: `True`): Enable recursive folder processing.
- **sparql_filter** (string or null, default: `None`): Optional SPARQL CONSTRUCT query to filter the RDF graph before ingestion. Useful for filtering by namespace, module, or custom patterns. The query should use CONSTRUCT to build a filtered graph. Example (filter to specific FIBO modules): `CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o . FILTER(STRSTARTS(STR(?s), "https://spec.edmcouncil.org/fibo/ontology/FBC/")) }`
- **env** (string, default: `PROD`): The environment that all assets produced by this connector belong to.
- **export_only** (array of string or null, default: `None`): Export only the specified entity types. Options are dynamically determined from registered entity types.
- **extensions** (array of string, default: `['.ttl', '.rdf', '.owl', '.n3', '.nt']`): File extensions to process when `source` is a folder.
- **skip_export** (array of string or null, default: `None`): Skip exporting the specified entity types. Options are dynamically determined from registered entity types.
- **stateful_ingestion** (StatefulStaleMetadataRemovalConfig or null, default: `None`): Stateful ingestion configuration. See https://datahubproject.io/docs/stateful-ingestion for more details.
- **stateful_ingestion.enabled** (boolean, default: `False`): Whether or not to enable stateful ingestion. Defaults to `True` if a `pipeline_name` is set and either a `datahub-rest` sink or `datahub_api` is specified, otherwise `False`.
- **stateful_ingestion.fail_safe_threshold** (number, default: `75.0`): Prevents a large number of soft deletes, and the state from committing, after accidental changes to the source configuration, if the relative change (in percent) of entities compared to the previous state exceeds this threshold.
- **stateful_ingestion.remove_stale_metadata** (boolean, default: `True`): Soft-deletes entities that were present in the last successful run but are missing in the current run, when stateful ingestion is enabled.

Capabilities

IRI-to-URN Mapping

RDF IRIs are converted to DataHub URNs following this pattern:

http://example.com/finance/credit-risk
→ urn:li:glossaryTerm:example.com/finance/credit-risk

IRI path segments create the glossary node hierarchy:

  • example.com → Glossary Node
  • finance → Glossary Node (child of example.com)
  • credit-risk → Glossary Term (under finance node)
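The mapping above can be sketched in a few lines of Python. This is an illustrative sketch of the documented behavior, not the connector's actual implementation:

```python
from urllib.parse import urlparse

def iri_to_glossary_parts(iri: str):
    """Split an absolute IRI into the parts the docs describe: the host and
    intermediate path segments become glossary nodes, the last segment
    becomes the glossary term, and the whole path forms the URN suffix."""
    parsed = urlparse(iri)
    segments = [parsed.netloc] + [s for s in parsed.path.split("/") if s]
    nodes, term = segments[:-1], segments[-1]
    urn = "urn:li:glossaryTerm:" + "/".join(segments)
    return nodes, term, urn

nodes, term, urn = iri_to_glossary_parts("http://example.com/finance/credit-risk")
# nodes == ["example.com", "finance"]
# term == "credit-risk"
# urn == "urn:li:glossaryTerm:example.com/finance/credit-risk"
```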

Supported RDF Vocabularies

The source recognizes entities from these standard vocabularies:

  • SKOS: skos:Concept → Glossary Term, skos:prefLabel → name, skos:definition → definition, skos:broader/skos:narrower → relationships
  • OWL: owl:Class → Glossary Term, owl:NamedIndividual → Glossary Term
  • RDFS: rdfs:label → name (fallback), rdfs:comment → definition (fallback)

Limitations

  • MVP Scope: The current implementation focuses on glossary terms and relationships. Dataset, lineage, and structured property extraction are not included.
  • Relationship Types: Only skos:broader and skos:narrower relationships are extracted. skos:related and skos:exactMatch are not supported.
  • Term Requirements: Terms must have a label (skos:prefLabel or rdfs:label) of at least 3 characters to be extracted.
  • Large Files: Very large RDF files are loaded entirely into memory. Consider splitting large ontologies into multiple files.

Troubleshooting

No glossary terms extracted

  • Check term types - Ensure entities are typed as skos:Concept, owl:Class, or owl:NamedIndividual
  • Verify labels - Terms must have skos:prefLabel or rdfs:label with at least 3 characters
  • Check file format - Verify the RDF file is valid and in a supported format
  • Review logs - Enable debug logging: datahub ingest -c recipe.yml --debug

Terms not appearing in correct hierarchy

  • Check IRI structure - Glossary nodes are created from IRI path segments
  • Verify IRI format - IRIs should be absolute (e.g., https://example.com/path/term)

Relationships not extracted

  • Check relationship types - Only skos:broader and skos:narrower are supported
  • Verify term existence - Both source and target terms must exist in the RDF
  • Check export options - Ensure relationship is not in skip_export

Stateful ingestion not working

  • Enable it - Set stateful_ingestion.enabled: true
  • Check server - Verify your DataHub server supports stateful ingestion (version >= 0.8.20)
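A minimal stateful setup might look like the following. Note that `pipeline_name` is needed for state tracking (per the `stateful_ingestion.enabled` default described in Config Details); the name shown here is just a placeholder:

```yaml
pipeline_name: rdf_glossary_ingestion   # required for state tracking
source:
  type: rdf
  config:
    source: glossary.ttl
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
```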

Code Coordinates

  • Class Name: datahub.ingestion.source.rdf.ingestion.rdf_source.RDFSource
Questions?

If you've got any questions on configuring ingestion for RDF, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.