GitHub
Overview
GitHub hosts source files, documentation, and knowledge bases in git repositories. Learn more in the official GitHub documentation.
The DataHub github-documents integration ingests markdown and text files from a GitHub repository as Document entities. Folder structure in the repository is preserved as parent-child document relationships in DataHub.
Concept Mapping
| Source Concept | DataHub Concept | Notes |
|---|---|---|
| Repository folder | Document | Optional folder documents for navigation. |
| Markdown / text file | Document | Native (editable) or external (read-only) depending on config. |
| Repository + branch | Custom metadata | Stored on documents for traceability. |
Module github-documents
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Test Connection | ✅ | Enabled by default. |
Overview
The github-documents source ingests files from a GitHub repository branch using the GitHub REST API. It is designed for scheduled, repeatable imports of documentation into DataHub—especially context documents nested under a parent folder.
Prerequisites
GitHub access token
Create a GitHub Personal Access Token (classic or fine-grained) with read access to the target repository:
- repo (classic token, private repositories) or contents: read (fine-grained token)
Store the token in a DataHub secret and reference it from your ingestion recipe.
Repository access
Ensure the token can read the repository, branch, and paths you configure.
Supported imports
- Import .md, .txt, and other configured extensions
- Preserve folder hierarchy as parent-child documents
- Optionally attach imported trees under a parent document URN
- Import as native (editable) or external (read-only) documents
- Schedule recurring syncs via ingestion sources
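To make the folder-to-parent mapping concrete, the following sketch derives parent-child pairs from a list of repository paths. It is an illustration only: the helper name `parent_child_pairs` is hypothetical, and the connector's actual ID and URN generation is not shown here.

```python
from pathlib import PurePosixPath


def parent_child_pairs(paths):
    """Return sorted (parent, child) pairs implied by a list of file paths.

    Hypothetical helper: folders act as parents, files and subfolders as
    children. Top-level items get a parent of None.
    """
    pairs = set()
    for raw in paths:
        node = PurePosixPath(raw)
        # Walk upward from the file, linking each node to its containing folder.
        while node.name:
            parent = node.parent
            parent_key = str(parent) if parent.name else None
            pairs.add((parent_key, str(node)))
            node = parent
    return sorted(pairs, key=lambda p: (p[0] or "", p[1]))


if __name__ == "__main__":
    for parent, child in parent_child_pairs(["docs/guides/setup.md", "docs/index.md"]):
        print(parent, "->", child)
```

In this sketch, `docs` has no parent, which is roughly where a repository root document or a configured parent document URN would attach.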
Install the Plugin
pip install 'acryl-datahub[github-documents]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: github-documents
  config:
    github_token: "${GITHUB_TOKEN}"
    repository: acme/handbook
    branch: main
    path_prefix: docs
    file_extensions:
      - .md
      - .txt
    parent_document_urn: null
    create_repo_root_document: true
    max_files: 500
    document_import_mode: NATIVE
    show_in_global_context: true
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
github_token ✅ string(password) | GitHub access token (PAT or GitHub App installation token). |
repository ✅ string | Repository to ingest, as 'owner/repo' or 'https://github.com/owner/repo'. |
branch string | Branch to read files from. Default: main |
create_repo_root_document boolean | When True, create a folder document named after the repository and nest imported files beneath it. Default: True |
document_import_mode Enum | One of: "NATIVE", "EXTERNAL" |
max_files integer | Maximum number of matching files to import per run. Additional matches are skipped with a report warning. Default: 500 |
parent_document_urn One of string, null | Optional parent document URN. Top-level imported items are nested beneath this document while preserving the GitHub folder hierarchy. Default: None |
path_prefix string | Only ingest files under this path (e.g. 'docs'). Default: "" (empty string, so no path filter is applied)
show_in_global_context boolean | Whether imported documents appear in global search and navigation. Default: True |
document_mapping DocumentMappingConfig | Document entity mapping configuration. |
document_mapping.id_pattern string | Pattern for generating document IDs Default: {source_type}-{directory}-{basename} |
document_mapping.status Enum | One of: "PUBLISHED", "UNPUBLISHED" Default: PUBLISHED |
document_mapping.id_normalization IdNormalizationConfig | Document ID normalization rules. |
document_mapping.id_normalization.lowercase boolean | Convert to lowercase Default: True |
document_mapping.id_normalization.max_length integer | Maximum ID length Default: 200 |
document_mapping.id_normalization.remove_special_chars boolean | Remove special characters except _ and - Default: True |
document_mapping.id_normalization.replace_spaces_with string | Replace spaces with this character Default: - |
document_mapping.source SourceConfig | Document source configuration. |
document_mapping.source.include_external_id boolean | Include external ID in DocumentSource Default: True |
document_mapping.source.include_external_url boolean | Include external URL in DocumentSource Default: True |
document_mapping.source.type Enum | One of: "NATIVE", "EXTERNAL" Default: EXTERNAL |
document_mapping.title TitleExtractionConfig | Title extraction configuration. |
document_mapping.title.extract_from_content boolean | Try to extract title from document content Default: True |
document_mapping.title.fallback_to_filename boolean | Use filename as title if not found in content Default: True |
document_mapping.title.max_length integer | Maximum title length Default: 500 |
file_extensions array | File extensions to include (include the leading dot). |
file_extensions.string string | |
hierarchy HierarchyConfig | Hierarchy configuration. |
hierarchy.enabled boolean | Enable parent-child relationships Default: True |
hierarchy.parent_strategy Enum | One of: "folder", "none", "custom", "notion", "confluence" Default: folder |
hierarchy.custom_mapping One of CustomMappingConfig, null | Custom mapping configuration Default: None |
hierarchy.custom_mapping.rules array | Custom parent mapping rules |
hierarchy.custom_mapping.rules.CustomParentRule CustomParentRule | Custom parent mapping rule. |
hierarchy.custom_mapping.rules.CustomParentRule.parent_id ❓ string | Parent document ID for matching files |
hierarchy.custom_mapping.rules.CustomParentRule.pattern ❓ string | Glob pattern to match file paths |
hierarchy.folder_mapping FolderMappingConfig | Folder hierarchy mapping configuration. |
hierarchy.folder_mapping.create_parent_docs boolean | Create Document entities for folders Default: True |
hierarchy.folder_mapping.max_depth integer | Maximum hierarchy depth Default: 10 |
hierarchy.folder_mapping.parent_id_pattern string | Pattern for parent document IDs Default: {source_type}-{directory} |
hierarchy.folder_mapping.root_parent One of string, null | Optional root document URN Default: None |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Stateful Ingestion Config Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
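The four document_mapping.id_normalization options can be pictured with a small sketch. The function name `normalize_id` is hypothetical, and the order in which the steps run is an assumption; the connector's own implementation may differ.

```python
import re


def normalize_id(raw, lowercase=True, replace_spaces_with="-",
                 remove_special_chars=True, max_length=200):
    """Apply the four normalization options in one plausible order (assumed)."""
    value = raw.lower() if lowercase else raw
    value = value.replace(" ", replace_spaces_with)
    if remove_special_chars:
        # Keep ASCII letters, digits, underscore, and hyphen; drop everything else.
        value = re.sub(r"[^A-Za-z0-9_-]", "", value)
    # Truncate to the configured maximum length.
    return value[:max_length]


if __name__ == "__main__":
    print(normalize_id("Getting Started (v2)!"))
```

With the defaults shown in the table, a title-like input such as "Getting Started (v2)!" would come out as a lowercase, hyphen-separated identifier under this sketch.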
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"CustomMappingConfig": {
"additionalProperties": false,
"description": "Custom parent mapping configuration.",
"properties": {
"rules": {
"description": "Custom parent mapping rules",
"items": {
"$ref": "#/$defs/CustomParentRule"
},
"title": "Rules",
"type": "array"
}
},
"title": "CustomMappingConfig",
"type": "object"
},
"CustomParentRule": {
"additionalProperties": false,
"description": "Custom parent mapping rule.",
"properties": {
"pattern": {
"description": "Glob pattern to match file paths",
"title": "Pattern",
"type": "string"
},
"parent_id": {
"description": "Parent document ID for matching files",
"title": "Parent Id",
"type": "string"
}
},
"required": [
"pattern",
"parent_id"
],
"title": "CustomParentRule",
"type": "object"
},
"DocumentImportMode": {
"description": "Whether ingested documents are native (editable) or external (read-only references).",
"enum": [
"NATIVE",
"EXTERNAL"
],
"title": "DocumentImportMode",
"type": "string"
},
"DocumentMappingConfig": {
"additionalProperties": false,
"description": "Document entity mapping configuration.",
"properties": {
"id_pattern": {
"default": "{source_type}-{directory}-{basename}",
"description": "Pattern for generating document IDs",
"title": "Id Pattern",
"type": "string"
},
"id_normalization": {
"$ref": "#/$defs/IdNormalizationConfig",
"description": "ID normalization rules"
},
"title": {
"$ref": "#/$defs/TitleExtractionConfig",
"description": "Title extraction configuration"
},
"source": {
"$ref": "#/$defs/SourceConfig",
"description": "Source configuration"
},
"status": {
"default": "PUBLISHED",
"description": "Default publication status",
"enum": [
"PUBLISHED",
"UNPUBLISHED"
],
"title": "Status",
"type": "string"
}
},
"title": "DocumentMappingConfig",
"type": "object"
},
"FolderMappingConfig": {
"additionalProperties": false,
"description": "Folder hierarchy mapping configuration.",
"properties": {
"create_parent_docs": {
"default": true,
"description": "Create Document entities for folders",
"title": "Create Parent Docs",
"type": "boolean"
},
"parent_id_pattern": {
"default": "{source_type}-{directory}",
"description": "Pattern for parent document IDs",
"title": "Parent Id Pattern",
"type": "string"
},
"max_depth": {
"default": 10,
"description": "Maximum hierarchy depth",
"maximum": 50,
"minimum": 1,
"title": "Max Depth",
"type": "integer"
},
"root_parent": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Optional root document URN",
"title": "Root Parent"
}
},
"title": "FolderMappingConfig",
"type": "object"
},
"HierarchyConfig": {
"additionalProperties": false,
"description": "Hierarchy configuration.",
"properties": {
"enabled": {
"default": true,
"description": "Enable parent-child relationships",
"title": "Enabled",
"type": "boolean"
},
"parent_strategy": {
"default": "folder",
"description": "Parent document creation strategy. 'notion' extracts parent from Notion API metadata. 'confluence' extracts parent from Confluence page ancestors.",
"enum": [
"folder",
"none",
"custom",
"notion",
"confluence"
],
"title": "Parent Strategy",
"type": "string"
},
"folder_mapping": {
"$ref": "#/$defs/FolderMappingConfig",
"description": "Folder mapping configuration"
},
"custom_mapping": {
"anyOf": [
{
"$ref": "#/$defs/CustomMappingConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Custom mapping configuration"
}
},
"title": "HierarchyConfig",
"type": "object"
},
"IdNormalizationConfig": {
"additionalProperties": false,
"description": "Document ID normalization rules.",
"properties": {
"lowercase": {
"default": true,
"description": "Convert to lowercase",
"title": "Lowercase",
"type": "boolean"
},
"replace_spaces_with": {
"default": "-",
"description": "Replace spaces with this character",
"title": "Replace Spaces With",
"type": "string"
},
"remove_special_chars": {
"default": true,
"description": "Remove special characters except _ and -",
"title": "Remove Special Chars",
"type": "boolean"
},
"max_length": {
"default": 200,
"description": "Maximum ID length",
"title": "Max Length",
"type": "integer"
}
},
"title": "IdNormalizationConfig",
"type": "object"
},
"SourceConfig": {
"additionalProperties": false,
"description": "Document source configuration.",
"properties": {
"type": {
"default": "EXTERNAL",
"description": "Document source type: NATIVE for editable DataHub documents, EXTERNAL for read-only references.",
"enum": [
"NATIVE",
"EXTERNAL"
],
"title": "Type",
"type": "string"
},
"include_external_url": {
"default": true,
"description": "Include external URL in DocumentSource",
"title": "Include External Url",
"type": "boolean"
},
"include_external_id": {
"default": true,
"description": "Include external ID in DocumentSource",
"title": "Include External Id",
"type": "boolean"
}
},
"title": "SourceConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
},
"TitleExtractionConfig": {
"additionalProperties": false,
"description": "Title extraction configuration.",
"properties": {
"extract_from_content": {
"default": true,
"description": "Try to extract title from document content",
"title": "Extract From Content",
"type": "boolean"
},
"fallback_to_filename": {
"default": true,
"description": "Use filename as title if not found in content",
"title": "Fallback To Filename",
"type": "boolean"
},
"max_length": {
"default": 500,
"description": "Maximum title length",
"title": "Max Length",
"type": "integer"
}
},
"title": "TitleExtractionConfig",
"type": "object"
}
},
"additionalProperties": false,
"description": "Configuration for ingesting markdown and text documents from a GitHub repository.",
"properties": {
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful Ingestion Config"
},
"github_token": {
"description": "GitHub access token (PAT or GitHub App installation token).",
"format": "password",
"title": "Github Token",
"type": "string",
"writeOnly": true
},
"repository": {
"description": "Repository to ingest, as 'owner/repo' or 'https://github.com/owner/repo'.",
"title": "Repository",
"type": "string"
},
"branch": {
"default": "main",
"description": "Branch to read files from.",
"title": "Branch",
"type": "string"
},
"path_prefix": {
"default": "",
"description": "Only ingest files under this path (e.g. 'docs').",
"title": "Path Prefix",
"type": "string"
},
"file_extensions": {
"description": "File extensions to include (include the leading dot).",
"items": {
"type": "string"
},
"title": "File Extensions",
"type": "array"
},
"max_files": {
"default": 500,
"description": "Maximum number of matching files to import per run. Additional matches are skipped with a report warning.",
"minimum": 1,
"title": "Max Files",
"type": "integer"
},
"create_repo_root_document": {
"default": true,
"description": "When True, create a folder document named after the repository and nest imported files beneath it.",
"title": "Create Repo Root Document",
"type": "boolean"
},
"parent_document_urn": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Optional parent document URN. Top-level imported items are nested beneath this document while preserving the GitHub folder hierarchy.",
"title": "Parent Document Urn"
},
"document_import_mode": {
"$ref": "#/$defs/DocumentImportMode",
"default": "NATIVE",
"description": "NATIVE imports editable documents in DataHub. EXTERNAL imports read-only references that link back to GitHub."
},
"show_in_global_context": {
"default": true,
"description": "Whether imported documents appear in global search and navigation.",
"title": "Show In Global Context",
"type": "boolean"
},
"document_mapping": {
"$ref": "#/$defs/DocumentMappingConfig"
},
"hierarchy": {
"$ref": "#/$defs/HierarchyConfig",
"description": "Parent-child relationship configuration."
}
},
"required": [
"github_token",
"repository"
],
"title": "GitHubDocumentsSourceConfig",
"type": "object"
}
Capabilities
Context documents under a parent folder
source:
  type: github-documents
  config:
    github_token: "${GITHUB_TOKEN}"
    repository: acme/handbook
    branch: main
    path_prefix: docs
    parent_document_urn: "urn:li:document:context-handbook"
    document_import_mode: NATIVE
    show_in_global_context: true
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
Read-only external references
source:
  type: github-documents
  config:
    github_token: "${GITHUB_TOKEN}"
    repository: https://github.com/acme/handbook
    document_import_mode: EXTERNAL
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
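The repository field accepts either 'owner/repo' or a full 'https://github.com/owner/repo' URL, as in the two recipes above. A minimal sketch of how both forms could be reduced to the same owner and repository pair follows; `split_repository` is a hypothetical name, not the connector's API.

```python
from urllib.parse import urlparse


def split_repository(value):
    """Return (owner, repo) from 'owner/repo' or 'https://github.com/owner/repo'."""
    text = value.strip()
    if "://" in text:
        # Full URL form: keep only the path component.
        text = urlparse(text).path
    parts = [p for p in text.strip("/").split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"expected 'owner/repo', got {value!r}")
    owner, repo = parts
    return owner, repo


if __name__ == "__main__":
    print(split_repository("acme/handbook"))
    print(split_repository("https://github.com/acme/handbook"))
```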
Plain-body serialization
GitHub files are imported as plain markdown or text. Document metadata (title, tags, ownership, etc.) lives in DataHub aspects and customProperties — not in YAML frontmatter or other file headers.
Each imported file document stores these customProperties keys:
| Key | Purpose |
|---|---|
import_source_id | Stable external key for upserts |
content_hash | SHA-256 of raw file body (change detection) |
extraction_algo_version | Bump when hash algorithm changes |
github_blob_sha | Git blob SHA at last import |
github_commit_sha | Branch HEAD at last import |
Cloud sync-back may additionally write last_exported_content_hash to prevent re-import loops after export.
Stateful ingestion
Set stateful_ingestion.enabled: true (recommended for scheduled syncs) to soft-delete documents that disappear from the GitHub tree between runs. Files whose bodies are unchanged are skipped when a graph connection is available and the stored content_hash matches.
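The skip-if-unchanged check can be sketched as a comparison between a freshly computed SHA-256 digest and the stored content_hash. The helper names are hypothetical, and hashing the raw bytes with a hex-encoded digest is an assumption about the stored format.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """SHA-256 hex digest of the raw file body."""
    return hashlib.sha256(body).hexdigest()


def needs_import(body: bytes, stored_hash):
    """True when no hash is stored yet or the body changed since the last run."""
    return stored_hash is None or content_hash(body) != stored_hash


if __name__ == "__main__":
    previous = content_hash(b"# Handbook\n")
    print(needs_import(b"# Handbook\n", previous))      # unchanged body
    print(needs_import(b"# Handbook v2\n", previous))   # edited body
```

A first-time import has no stored hash, so the sketch treats `None` as "always import".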
Limitations
- Maximum file size is 1 MB per file (larger files are skipped).
- Requires network access from the ingestion executor to api.github.com.
- Binary formats are not parsed; use text/markdown files.
- GitHub may truncate recursive tree listings for very large repositories; narrow path_prefix or split across sources.
- Imports are capped by max_files (default 500) per run.
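The limits above can be read as a filter over a recursive tree listing. The sketch below assumes a response shaped like the GitHub Git Trees API (a `tree` list of entries with `path`, `type`, and `size`, plus a `truncated` flag). The function name, the 1,000,000-byte reading of "1 MB", the default extensions, and the pairing of each warning name with a condition are all assumptions rather than the connector's confirmed behavior.

```python
MAX_FILE_BYTES = 1_000_000  # assumed reading of "1 MB"; the exact limit is not stated


def select_files(tree_response, path_prefix="", file_extensions=(".md", ".txt"),
                 max_files=500):
    """Pick importable file paths from a Git Trees API style response dict."""
    warnings = []
    if tree_response.get("truncated"):
        # GitHub signalled that the recursive listing is incomplete.
        warnings.append("github-tree-truncated")
    prefix = path_prefix.strip("/")
    selected = []
    for entry in tree_response.get("tree", []):
        path = entry["path"]
        if entry.get("type") != "blob":
            continue  # skip folders ("tree") and submodules ("commit")
        if prefix and not (path == prefix or path.startswith(prefix + "/")):
            continue  # outside the configured path_prefix
        if not path.endswith(tuple(file_extensions)):
            continue  # extension not configured
        if entry.get("size", 0) > MAX_FILE_BYTES:
            continue  # oversized files are skipped
        selected.append(path)
    if len(selected) > max_files:
        # More matches than max_files: keep the first ones and report it.
        warnings.append("github-files-truncated")
        selected = selected[:max_files]
    return selected, warnings
```

Matching on `prefix + "/"` rather than a bare string prefix keeps a sibling folder such as docs2 from slipping through a docs filter.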
Troubleshooting
- 401 / 403 from GitHub: Verify the token, repository name, and permissions.
- Branch not found: Confirm the branch exists and is spelled correctly.
- No files imported: Check path_prefix and file_extensions filters.
- Documents not removed after GitHub delete: Ensure stateful_ingestion.enabled: true and the pipeline has a graph connection for checkpoint storage.
- Partial import on large repos: Check ingestion report warnings for github-tree-truncated or github-files-truncated.
Code Coordinates
- Class Name: datahub.ingestion.source.github_documents.github_documents_source.GitHubDocumentsSource
- Browse on GitHub
If you've got any questions on configuring ingestion for GitHub, feel free to ping us on our Slack.
This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.