GitHub

Overview

GitHub hosts source files, documentation, and knowledge bases in git repositories. Learn more in the official GitHub documentation.

The DataHub github-documents integration ingests markdown and text files from a GitHub repository as Document entities. Folder structure in the repository is preserved as parent-child document relationships in DataHub.

Concept Mapping

| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| Repository folder | Document | Optional folder documents for navigation. |
| Markdown / text file | Document | Native (editable) or external (read-only) depending on config. |
| Repository + branch | Custom metadata | Stored on documents for traceability. |

Module github-documents

Incubating

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Test Connection | ✅ | Enabled by default. |

Overview

The github-documents source ingests files from a GitHub repository branch using the GitHub REST API. It is designed for scheduled, repeatable imports of documentation into DataHub, especially context documents nested under a parent folder.

Prerequisites

GitHub access token

Create a GitHub Personal Access Token (classic or fine-grained) with read access to the target repository:

  • repo scope (classic token, needed for private repositories) or contents: read permission (fine-grained token)

Store the token in a DataHub secret and reference it from your ingestion recipe.

Repository access

Ensure the token can read the repository, branch, and paths you configure.
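
One way to confirm access before scheduling a run is to request the configured path from the GitHub REST API "repository contents" endpoint. The following is a minimal sketch using only the Python standard library; the helper names build_contents_request and check_access are hypothetical and are not part of the github-documents source.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_contents_request(token, repository, branch, path=""):
    """Build a GET request for the GitHub REST 'repository contents' endpoint."""
    url = f"https://api.github.com/repos/{repository}/contents"
    quoted_path = urllib.parse.quote(path.strip("/"))
    if quoted_path:
        url += "/" + quoted_path
    # The ref query parameter selects the branch to read from.
    url += "?" + urllib.parse.urlencode({"ref": branch})
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def check_access(token, repository, branch, path=""):
    """Return the HTTP status code GitHub answers with (200 means readable)."""
    request = build_contents_request(token, repository, branch, path)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 401/403/404 responses arrive here; return the code instead of raising.
        return error.code
```

Calling check_access(token, "acme/handbook", "main", "docs") with a valid token should return 200 when the repository, branch, and path are all readable; any other status points at the token, the repository name, or the path.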

Supported imports

  • Import .md, .txt, and other configured extensions
  • Preserve folder hierarchy as parent-child documents
  • Optionally attach imported trees under a parent document URN
  • Import as native (editable) or external (read-only) documents
  • Schedule recurring syncs via ingestion sources
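
To illustrate how a folder hierarchy can be turned into parent-child relationships, the sketch below derives a parent for every file and folder from a list of repository paths. It is an illustration only: the function name parent_child_pairs is hypothetical, and the actual source may build its hierarchy differently.

```python
from pathlib import PurePosixPath


def parent_child_pairs(file_paths):
    """Map every file and folder path to its parent folder (None at the top level)."""
    parents = {}
    for file_path in file_paths:
        node = PurePosixPath(file_path)
        # Walk upward from the file, registering each folder on the way.
        while str(node) != ".":
            parent = node.parent
            parents[str(node)] = None if str(parent) == "." else str(parent)
            node = parent
    return parents
```

For a file at docs/guides/setup.md, this yields entries for the file, for docs/guides, and for docs, with docs as the top-level item that would sit under the repository root document or a configured parent document.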

Install the Plugin

pip install 'acryl-datahub[github-documents]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: github-documents
  config:
    github_token: "${GITHUB_TOKEN}"
    repository: acme/handbook
    branch: main
    path_prefix: docs
    file_extensions:
      - .md
      - .txt
    parent_document_urn: null
    create_repo_root_document: true
    max_files: 500
    document_import_mode: NATIVE
    show_in_global_context: true
    stateful_ingestion:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
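
The same recipe can also be expressed as a Python dictionary and run programmatically. The sketch below assumes the acryl-datahub package with the github-documents plugin is installed and that GITHUB_TOKEN is set in the environment; build_recipe and run_recipe are hypothetical helper names.

```python
import os


def build_recipe(server="http://localhost:8080"):
    """Return the starter recipe as a plain dict, mirroring the YAML recipe."""
    return {
        "source": {
            "type": "github-documents",
            "config": {
                # Assumption: the token is supplied through the environment.
                "github_token": os.environ.get("GITHUB_TOKEN", ""),
                "repository": "acme/handbook",
                "branch": "main",
                "path_prefix": "docs",
                "file_extensions": [".md", ".txt"],
                "document_import_mode": "NATIVE",
                "stateful_ingestion": {"enabled": True},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_recipe(recipe):
    """Run the recipe with DataHub's programmatic pipeline API."""
    # Imported lazily so the recipe can be built without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling run_recipe(build_recipe()) is roughly equivalent to saving the YAML to a file and running it with the datahub ingest command.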

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| github_token | string (password) | GitHub access token (PAT or GitHub App installation token). | |
| repository | string | Repository to ingest, as 'owner/repo' or 'https://github.com/owner/repo'. | |
| branch | string | Branch to read files from. | main |
| create_repo_root_document | boolean | When True, create a folder document named after the repository and nest imported files beneath it. | True |
| document_import_mode | Enum | One of: "NATIVE", "EXTERNAL" | |
| max_files | integer | Maximum number of matching files to import per run. Additional matches are skipped with a report warning. | 500 |
| parent_document_urn | One of string, null | Optional parent document URN. Top-level imported items are nested beneath this document while preserving the GitHub folder hierarchy. | None |
| path_prefix | string | Only ingest files under this path (e.g. 'docs'). | (empty) |
| show_in_global_context | boolean | Whether imported documents appear in global search and navigation. | True |
| document_mapping | DocumentMappingConfig | Document entity mapping configuration. | |
| document_mapping.id_pattern | string | Pattern for generating document IDs. | {source_type}-{directory}-{basename} |
| document_mapping.status | Enum | One of: "PUBLISHED", "UNPUBLISHED" | PUBLISHED |
| document_mapping.id_normalization | IdNormalizationConfig | Document ID normalization rules. | |
| document_mapping.id_normalization.lowercase | boolean | Convert to lowercase. | True |
| document_mapping.id_normalization.max_length | integer | Maximum ID length. | 200 |
| document_mapping.id_normalization.remove_special_chars | boolean | Remove special characters except _ and -. | True |
| document_mapping.id_normalization.replace_spaces_with | string | Replace spaces with this character. | - |
| document_mapping.source | SourceConfig | Document source configuration. | |
| document_mapping.source.include_external_id | boolean | Include external ID in DocumentSource. | True |
| document_mapping.source.include_external_url | boolean | Include external URL in DocumentSource. | True |
| document_mapping.source.type | Enum | One of: "NATIVE", "EXTERNAL" | EXTERNAL |
| document_mapping.title | TitleExtractionConfig | Title extraction configuration. | |
| document_mapping.title.extract_from_content | boolean | Try to extract the title from the document content. | True |
| document_mapping.title.fallback_to_filename | boolean | Use the filename as the title if none is found in the content. | True |
| document_mapping.title.max_length | integer | Maximum title length. | 500 |
| file_extensions | array | File extensions to include (include the leading dot). | |
| file_extensions.string | string | | |
| hierarchy | HierarchyConfig | Hierarchy configuration. | |
| hierarchy.enabled | boolean | Enable parent-child relationships. | True |
| hierarchy.parent_strategy | Enum | One of: "folder", "none", "custom", "notion", "confluence" | folder |
| hierarchy.custom_mapping | One of CustomMappingConfig, null | Custom mapping configuration. | None |
| hierarchy.custom_mapping.rules | array | Custom parent mapping rules. | |
| hierarchy.custom_mapping.rules.CustomParentRule | CustomParentRule | Custom parent mapping rule. | |
| hierarchy.custom_mapping.rules.CustomParentRule.parent_id | string | Parent document ID for matching files. | |
| hierarchy.custom_mapping.rules.CustomParentRule.pattern | string | Glob pattern to match file paths. | |
| hierarchy.folder_mapping | FolderMappingConfig | Folder hierarchy mapping configuration. | |
| hierarchy.folder_mapping.create_parent_docs | boolean | Create Document entities for folders. | True |
| hierarchy.folder_mapping.max_depth | integer | Maximum hierarchy depth. | 10 |
| hierarchy.folder_mapping.parent_id_pattern | string | Pattern for parent document IDs. | {source_type}-{directory} |
| hierarchy.folder_mapping.root_parent | One of string, null | Optional root document URN. | None |
| stateful_ingestion | One of StatefulStaleMetadataRemovalConfig, null | Stateful ingestion config. | None |
| stateful_ingestion.enabled | boolean | Whether to enable stateful ingestion. Defaults to True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified; otherwise False. | False |
| stateful_ingestion.fail_safe_threshold | number | Guards against accidental changes to the source configuration: if the relative change (in percent) in entities compared to the previous state is above 'fail_safe_threshold', the run does not commit the state and does not issue the resulting large number of soft deletes. | 75.0 |
| stateful_ingestion.remove_stale_metadata | boolean | With stateful_ingestion enabled, soft-deletes entities that were present in the last successful run but are missing in the current run. | True |
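
To make the document_mapping.id_pattern and id_normalization settings more concrete, the sketch below applies the documented defaults to a sample file. The order of the normalization steps and the function name build_document_id are assumptions made for illustration; the connector's actual implementation may differ (for example, in how it treats directory separators, which this sketch simply drops as special characters).

```python
import re


def build_document_id(
    source_type,
    directory,
    basename,
    pattern="{source_type}-{directory}-{basename}",
    lowercase=True,
    replace_spaces_with="-",
    remove_special_chars=True,
    max_length=200,
):
    """Fill the ID pattern, then apply the normalization options in turn."""
    doc_id = pattern.format(
        source_type=source_type, directory=directory, basename=basename
    )
    if lowercase:
        doc_id = doc_id.lower()
    doc_id = doc_id.replace(" ", replace_spaces_with)
    if remove_special_chars:
        # Keep letters, digits, underscore, and hyphen; drop everything else.
        doc_id = re.sub(r"[^A-Za-z0-9_-]", "", doc_id)
    # Truncate to the configured maximum length.
    return doc_id[:max_length]
```

With the defaults, build_document_id("github", "docs", "Getting Started!") produces github-docs-getting-started.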

Capabilities

Context documents under a parent folder

source:
  type: github-documents
  config:
    github_token: "${GITHUB_TOKEN}"
    repository: acme/handbook
    branch: main
    path_prefix: docs
    parent_document_urn: "urn:li:document:context-handbook"
    document_import_mode: NATIVE
    show_in_global_context: true
    stateful_ingestion:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

Read-only external references

source:
  type: github-documents
  config:
    github_token: "${GITHUB_TOKEN}"
    repository: https://github.com/acme/handbook
    document_import_mode: EXTERNAL
    stateful_ingestion:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
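
This recipe passes repository as a full URL rather than in owner/repo form. As a rough illustration of how the two accepted forms can be reduced to one, the sketch below normalizes either input to owner/repo; normalize_repository is a hypothetical helper, not the connector's own parser.

```python
from urllib.parse import urlparse


def normalize_repository(value):
    """Reduce 'owner/repo' or 'https://github.com/owner/repo' to 'owner/repo'."""
    text = value.strip()
    if "://" in text:
        # Keep only the path component of a full URL.
        text = urlparse(text).path
    parts = [part for part in text.strip("/").split("/") if part]
    if len(parts) != 2:
        raise ValueError(f"Expected 'owner/repo', got {value!r}")
    owner, repo = parts
    return f"{owner}/{repo}"
```

Both normalize_repository("acme/handbook") and normalize_repository("https://github.com/acme/handbook") return acme/handbook, while anything that does not resolve to exactly two path segments raises ValueError.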

Plain-body serialization

GitHub files are imported as plain markdown or text. Document metadata (title, tags, ownership, etc.) lives in DataHub aspects and customProperties, not in YAML frontmatter or other file headers.

Each imported file document stores these customProperties keys:

| Key | Purpose |
| --- | --- |
| import_source_id | Stable external key for upserts |
| content_hash | SHA-256 of raw file body (change detection) |
| extraction_algo_version | Bump when hash algorithm changes |
| github_blob_sha | Git blob SHA at last import |
| github_commit_sha | Branch HEAD at last import |

Cloud sync-back may additionally write last_exported_content_hash to prevent re-import loops after export.
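
The two hash-style properties can be reproduced with the Python standard library. The sketch below assumes content_hash is stored as a hex digest of the raw bytes and that the repository uses Git's default SHA-1 object format; the function names are hypothetical.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """SHA-256 of the raw file body, as a hex string (encoding is an assumption)."""
    return hashlib.sha256(body).hexdigest()


def git_blob_sha(body: bytes) -> str:
    """Git blob ID: SHA-1 over the header 'blob <size>' + NUL, followed by the body."""
    header = f"blob {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


def needs_update(body: bytes, stored_properties: dict) -> bool:
    """True when the stored content_hash is missing or differs from the new body."""
    return stored_properties.get("content_hash") != content_hash(body)
```

A comparison like needs_update is what makes change detection cheap: when the stored hash matches, the file body does not need to be written again.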

Stateful ingestion

Set stateful_ingestion.enabled: true (recommended for scheduled syncs) to soft-delete documents that disappear from the GitHub tree between runs. Unchanged file bodies are skipped when a graph connection is available and the stored content_hash matches.
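
Conceptually, stale-entity removal compares the set of documents seen in the previous run with the set seen in the current run, guarded by fail_safe_threshold. The sketch below illustrates that idea only; the percentage formula and the name plan_soft_deletes are assumptions, not DataHub's actual implementation.

```python
def plan_soft_deletes(previous_urns, current_urns, fail_safe_threshold=75.0):
    """Return URNs to soft-delete, or raise if the change looks accidental."""
    previous, current = set(previous_urns), set(current_urns)
    # Documents seen last time but missing now are considered stale.
    stale = previous - current
    if not previous:
        return set()
    # Assumption: relative change is measured against the previous run's size.
    change_percent = 100.0 * len(stale) / len(previous)
    if change_percent > fail_safe_threshold:
        raise RuntimeError(
            f"{change_percent:.1f}% of entities would be removed, "
            f"above the fail-safe threshold of {fail_safe_threshold}%"
        )
    return stale
```

The guard is what protects against a mistyped path_prefix or branch wiping out an entire imported tree in a single run.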

Limitations

  • Maximum file size is 1 MB per file (larger files are skipped).
  • Requires network access from the ingestion executor to api.github.com.
  • Binary formats are not parsed; use text/markdown files.
  • GitHub may truncate recursive tree listings for very large repositories; narrow path_prefix or split across sources.
  • Imports are capped by max_files (default 500) per run.
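
The limits above can be pictured as a filter over a response from GitHub's recursive Git trees endpoint, which returns a tree list of entries (path, type, size) plus a truncated flag. The sketch below is illustrative: select_files is a hypothetical helper, "1 MB" is read as 1,000,000 bytes, and tying the two warning names to these two conditions is an assumption.

```python
MAX_FILE_BYTES = 1_000_000  # Assumption: "1 MB" interpreted as 10**6 bytes.


def select_files(tree_response, path_prefix="", file_extensions=(".md", ".txt"), max_files=500):
    """Filter a Git trees API response down to the file paths that would be imported."""
    warnings = []
    if tree_response.get("truncated"):
        warnings.append("github-tree-truncated")
    prefix = path_prefix.strip("/")
    selected = []
    for entry in tree_response.get("tree", []):
        if entry.get("type") != "blob":
            continue  # Folders are reported as "tree" entries.
        path = entry["path"]
        if prefix and not path.startswith(prefix + "/"):
            continue  # Outside the configured path_prefix.
        if not path.endswith(tuple(file_extensions)):
            continue  # Extension not configured.
        if entry.get("size", 0) > MAX_FILE_BYTES:
            continue  # Larger than the per-file limit.
        selected.append(path)
    if len(selected) > max_files:
        warnings.append("github-files-truncated")
        selected = selected[:max_files]
    return selected, warnings
```

Narrowing path_prefix reduces the number of candidates before the max_files cap applies, which is why it is the first thing to adjust for large repositories.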

Troubleshooting

  • 401 / 403 from GitHub: Verify the token, repository name, and permissions.
  • Branch not found: Confirm the branch exists and is spelled correctly.
  • No files imported: Check path_prefix and file_extensions filters.
  • Documents not removed after GitHub delete: Ensure stateful_ingestion.enabled: true and the pipeline has a graph connection for checkpoint storage.
  • Partial import on large repos: Check ingestion report warnings for github-tree-truncated or github-files-truncated.

Code Coordinates

  • Class Name: datahub.ingestion.source.github_documents.github_documents_source.GitHubDocumentsSource