
DataHub Backup & Restore

DataHub stores metadata in two key storage systems that require separate backup approaches:

  1. Versioned Aspects: Stored in a relational database (MySQL/PostgreSQL) in the metadata_aspect_v2 table
  2. Time Series Aspects, Search Indexes, & Graph Relationships: Stored in Elasticsearch/OpenSearch indexes

This guide outlines how to properly back up both components to ensure complete recoverability of your DataHub instance.

Production Environment Backups

Backing Up Document Store (Versioned Metadata)

The recommended backup strategy is to periodically dump the metadata_aspect_v2 table from the datahub database. This table contains all versioned aspects and can be restored in case of database failure. Most managed database services (e.g., AWS RDS) provide automated backup capabilities.

AWS Managed RDS

Option 1: RDS Snapshots

  1. Go to AWS Console > RDS > Databases
  2. Select your DataHub RDS instance
  3. Click Actions > Take Snapshot to create a manual snapshot
  4. Name the snapshot (e.g., datahub-backup-YYYY-MM-DD)
  5. For ongoing protection, enable automated backups on the instance with an appropriate retention period (recommended: 14-30 days)
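
The same steps can be scripted rather than clicked through. The following is a minimal sketch using the AWS CLI, assuming it is installed and configured with permissions for RDS; the instance identifier datahub-db is a hypothetical placeholder, not a name DataHub defines.

```shell
#!/usr/bin/env bash
# Sketch: take a manual RDS snapshot and set the automated-backup retention.
# "datahub-db" in the usage lines is a hypothetical instance identifier.

# Build a snapshot identifier such as datahub-backup-2024-05-01.
# The date defaults to today (UTC) when no argument is given.
snapshot_id() {
  local day="${1:-$(date -u +%Y-%m-%d)}"
  printf 'datahub-backup-%s\n' "$day"
}

# Take a manual snapshot of the given DB instance.
take_snapshot() {
  local instance="$1"
  aws rds create-db-snapshot \
    --db-instance-identifier "$instance" \
    --db-snapshot-identifier "$(snapshot_id)"
}

# Keep automated backups for the given number of days (RDS accepts 0-35).
set_retention() {
  local instance="$1" days="$2"
  aws rds modify-db-instance \
    --db-instance-identifier "$instance" \
    --backup-retention-period "$days" \
    --apply-immediately
}

# Usage (not executed here):
#   take_snapshot datahub-db
#   set_retention datahub-db 14
```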

Option 2: SQL Dump (MySQL)

For a targeted backup of only the essential metadata:

mysqldump -h <rds-endpoint> -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql

To compress the backup:

mysqldump -h <rds-endpoint> -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz

Self-Hosted MySQL

mysqldump -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql

Compressed version:

mysqldump -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz
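
For scheduled backups it helps to date-stamp each dump and confirm that the archive is readable. A minimal sketch, assuming the table uses InnoDB (which --single-transaction needs to produce a consistent dump) and that MYSQL_HOST, MYSQL_USER, and BACKUP_DIR are environment variables you define yourself; none of these names come from DataHub.

```shell
#!/usr/bin/env bash
# Sketch: dated, compressed dump of metadata_aspect_v2 with an integrity check.
# MYSQL_HOST, MYSQL_USER and BACKUP_DIR are assumed environment variables.

# Build the output path, e.g. /backups/metadata_aspect_v2_2024-05-01.sql.gz.
backup_path() {
  local dir="$1" day="${2:-$(date -u +%Y-%m-%d)}"
  printf '%s/metadata_aspect_v2_%s.sql.gz\n' "$dir" "$day"
}

dump_aspects() {
  local out
  out="$(backup_path "${BACKUP_DIR:-.}")"
  # pipefail makes a mysqldump failure visible even though gzip is last in the pipe.
  # --single-transaction reads a consistent snapshot without locking (InnoDB only).
  # -p prompts for the password; for unattended runs use a MySQL option file instead.
  ( set -o pipefail
    mysqldump -h "${MYSQL_HOST:-localhost}" -u "$MYSQL_USER" -p \
      --single-transaction datahub metadata_aspect_v2 | gzip > "$out" ) || return 1
  # gzip -t exits non-zero if the archive is truncated or corrupt.
  gzip -t "$out" && printf 'wrote %s\n' "$out"
}

# Usage (not executed here):
#   BACKUP_DIR=/backups MYSQL_USER=datahub dump_aspects
```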

Backing Up Time Series Aspects (Elasticsearch/OpenSearch)

Time Series Aspects power important features like usage statistics, dataset profiles, and assertion runs. These are stored in Elasticsearch/OpenSearch and require a separate backup strategy.

AWS OpenSearch Service

  1. Create an IAM Role for Snapshots

    Create an IAM role with permissions to write to an S3 bucket:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": ["s3:ListBucket"],
          "Effect": "Allow",
          "Resource": ["arn:aws:s3:::your-backup-bucket"]
        },
        {
          "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
          "Effect": "Allow",
          "Resource": ["arn:aws:s3:::your-backup-bucket/*"]
        }
      ]
    }

    Ensure the trust relationship allows OpenSearch to assume this role:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "es.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

  2. Register a Snapshot Repository

    PUT _snapshot/datahub_s3_backup
    {
      "type": "s3",
      "settings": {
        "bucket": "your-backup-bucket",
        "region": "us-east-1",
        "role_arn": "arn:aws:iam::<account-id>:role/<snapshot-role>"
      }
    }

⚠️ Important: The S3 bucket must be in the same AWS region as your OpenSearch domain.

  3. Create a Regular Snapshot Schedule

    Set up an automated schedule using OpenSearch Snapshot Management:

    POST _plugins/_sm/policies/datahub_backup_policy
    {
      "description": "Daily DataHub snapshots",
      "creation": {
        "schedule": {
          "cron": {
            "expression": "0 0 * * *",
            "timezone": "UTC"
          }
        }
      },
      "deletion": {
        "condition": {
          "max_age": "15d",
          "min_count": 5,
          "max_count": 30
        }
      },
      "snapshot_config": {
        "repository": "datahub_s3_backup",
        "partial": false
      }
    }

This configures daily snapshots with a 15-day retention period.

  4. Take a Manual Snapshot (if needed)

    PUT _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD?wait_for_completion=true

  5. Verify Snapshot Status

    GET _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD
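
On AWS OpenSearch Service, requests such as the repository registration usually need to be SigV4-signed by an IAM identity that is allowed to pass the snapshot role. The sketch below shows one way to send the calls above with curl's --aws-sigv4 option. It assumes curl 7.75 or newer and AWS credentials in the standard environment variables; OS_ENDPOINT is a hypothetical variable holding your domain URL.

```shell
#!/usr/bin/env bash
# Sketch: SigV4-signed snapshot calls against an AWS OpenSearch Service domain.
# Assumes curl >= 7.75 and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment.
# OS_ENDPOINT is a hypothetical variable, e.g. https://<domain-endpoint>.
# With temporary credentials, also pass -H "x-amz-security-token: $AWS_SESSION_TOKEN".

# JSON body for the repository registration shown in the steps above.
repo_body() {
  local bucket="$1" region="$2" role_arn="$3"
  printf '{"type":"s3","settings":{"bucket":"%s","region":"%s","role_arn":"%s"}}\n' \
    "$bucket" "$region" "$role_arn"
}

# signed_request METHOD PATH REGION [BODY]
signed_request() {
  local method="$1" path="$2" region="$3" body="${4:-}"
  # ${body:+...} adds the --data option only when a body was supplied.
  curl --fail --silent --show-error -X "$method" \
    --aws-sigv4 "aws:amz:${region}:es" \
    --user "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}" \
    -H 'Content-Type: application/json' \
    ${body:+--data "$body"} \
    "${OS_ENDPOINT}/${path}"
}

# Usage (not executed here; ROLE_ARN is a placeholder for your snapshot role ARN):
#   signed_request PUT _snapshot/datahub_s3_backup us-east-1 \
#     "$(repo_body your-backup-bucket us-east-1 "$ROLE_ARN")"
#   signed_request PUT "_snapshot/datahub_s3_backup/snapshot_$(date -u +%Y_%m_%d)" us-east-1
```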

Self-Hosted Elasticsearch

  1. Create a Local Repository

    First, add the path.repo setting to elasticsearch.yml on every node, then restart each node so the setting takes effect:

    path.repo: ["/mnt/es-backups"]

    Ensure /mnt/es-backups is a shared filesystem mounted at the same path on every master and data node.

  2. Register the Repository

    PUT _snapshot/datahub_fs_backup
    {
      "type": "fs",
      "settings": {
        "location": "/mnt/es-backups",
        "compress": true
      }
    }

  3. Create a Snapshot

    PUT _snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD?wait_for_completion=true

  4. Check Snapshot Status

    GET _snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD
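
The last two steps lend themselves to a small script that can run from cron. This is a sketch only: ES_URL is a hypothetical variable (for example http://localhost:9200), no authentication is shown, and the state is extracted with grep and sed rather than a JSON parser to avoid extra dependencies.

```shell
#!/usr/bin/env bash
# Sketch: create a dated snapshot on a self-hosted cluster and print its state.
# ES_URL is a hypothetical variable; add credentials if security is enabled.

# Build a name in the snapshot_YYYY_MM_DD form used above (date defaults to today, UTC).
snapshot_name() {
  printf 'snapshot_%s\n' "${1:-$(date -u +%Y_%m_%d)}"
}

# Read a "GET _snapshot/<repo>/<name>" response on stdin and print the first state value,
# for example SUCCESS, IN_PROGRESS, PARTIAL or FAILED.
snapshot_state() {
  grep -o '"state" *: *"[A-Z_]*"' | head -n 1 | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

create_and_check() {
  local name
  name="$(snapshot_name)"
  curl --fail --silent --show-error -X PUT \
    "${ES_URL}/_snapshot/datahub_fs_backup/${name}?wait_for_completion=true" \
    > /dev/null || return 1
  curl --fail --silent --show-error \
    "${ES_URL}/_snapshot/datahub_fs_backup/${name}" | snapshot_state
}

# Usage (not executed here):
#   ES_URL=http://localhost:9200 create_and_check
```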

Restoring DataHub from Backups

Restoring the MySQL Database

  1. Restore from an RDS Snapshot (if using AWS RDS)

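    In the AWS Console, go to RDS > Snapshots, select your snapshot, and choose "Restore Snapshot". RDS restores a snapshot into a new DB instance, so point DataHub at the new instance's endpoint once it is available.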

  2. Restore from SQL Dump

    mysql -h <host> -u <user> -p datahub < metadata_aspect_v2_backup.sql
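
If the dump was compressed as shown in the backup section, it has to be decompressed on the way in. The sketch below handles both plain and gzip-compressed files and then prints the row count as a quick sanity check. MYSQL_HOST and MYSQL_USER are assumed environment variables. Keep in mind that a dump created with mysqldump's default options drops and recreates the table when it is loaded.

```shell
#!/usr/bin/env bash
# Sketch: restore a plain or gzip-compressed dump, then report the row count.
# MYSQL_HOST and MYSQL_USER are assumed environment variables.

# Print the dump to stdout, decompressing when the file name ends in .gz.
stream_dump() {
  case "$1" in
    *.gz) gunzip -c "$1" ;;
    *)    cat "$1" ;;
  esac
}

restore_aspects() {
  local dump="$1"
  stream_dump "$dump" | mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p datahub || return 1
  # A non-zero count is a quick sanity check, not proof of a complete restore.
  mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p -N -e \
    'SELECT COUNT(*) FROM metadata_aspect_v2' datahub
}

# Usage (not executed here):
#   MYSQL_HOST=localhost MYSQL_USER=datahub restore_aspects metadata_aspect_v2_backup.sql.gz
```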

Restoring Elasticsearch/OpenSearch Indices

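After restoring the database, restore the search, graph, and time series indices from your snapshots.

Alternatively, the search and graph indices can be rebuilt from scratch after restoring the MySQL/PostgreSQL document store, as described in DataHub's guide to restoring search and graph indices. Time series aspects are stored only in Elasticsearch/OpenSearch, so they can be recovered only from a snapshot.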

Restoring from Snapshots

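To restore indices from a snapshot, use the repository and snapshot names from your backup (datahub_fs_backup for the self-hosted setup above) and adjust the indices pattern to match the index names stored in your snapshot. The GET _snapshot status call shown earlier lists them.

POST _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD/_restore
{
  "indices": "datastream*,metadataindex*",
  "include_global_state": false
}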
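
The restore call can be parameterised so the same script serves both the S3 and the filesystem repository. This is a sketch for a self-hosted cluster without request signing; ES_URL is a hypothetical variable. A restore is rejected for indices that already exist and are open, so close or delete those indices first.

```shell
#!/usr/bin/env bash
# Sketch: parameterised snapshot restore. ES_URL is a hypothetical variable,
# e.g. http://localhost:9200. No authentication or request signing is shown.

# restore_body PATTERNS: JSON body for the _restore call.
restore_body() {
  printf '{"indices":"%s","include_global_state":false}\n' "$1"
}

# restore_snapshot REPO SNAPSHOT PATTERNS
restore_snapshot() {
  local repo="$1" snapshot="$2" patterns="$3"
  curl --fail --silent --show-error -X POST \
    -H 'Content-Type: application/json' \
    --data "$(restore_body "$patterns")" \
    "${ES_URL}/_snapshot/${repo}/${snapshot}/_restore"
}

# Usage (not executed here; the pattern is the one from the example above):
#   restore_snapshot datahub_fs_backup snapshot_2024_05_01 'datastream*,metadataindex*'
```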

Testing Your Backup Strategy

Regularly test your backup and restore procedures to ensure they work when needed:

  1. Create a test environment
  2. Restore your production backups to this environment
  3. Verify that all functionality works correctly
  4. Document any issues encountered and update your backup/restore procedures

A good practice is to test restore procedures quarterly or after significant infrastructure changes.
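
Part of the verification step can be automated by comparing counts between production and the restored environment. The sketch below is one possible smoke check and does not replace functional testing. The helper names are illustrative, and the expected values would come from running the same queries against production.

```shell
#!/usr/bin/env bash
# Sketch: compare counts between production and a restored test environment.
# Function names and the ES URL argument are illustrative assumptions.

# compare_counts LABEL EXPECTED ACTUAL
# Prints OK or MISMATCH and returns non-zero on a mismatch.
compare_counts() {
  if [ "$2" -eq "$3" ]; then
    printf 'OK %s: %s\n' "$1" "$2"
  else
    printf 'MISMATCH %s: expected %s, got %s\n' "$1" "$2" "$3"
    return 1
  fi
}

# es_count ES_URL INDEX_PATTERN: document count via the _cat/count API.
es_count() {
  curl --fail --silent --show-error "${1}/_cat/count/${2}?h=count"
}

# Usage (not executed here; PROD_ROWS and TEST_ROWS hold the results of
# SELECT COUNT(*) FROM metadata_aspect_v2 in each environment):
#   compare_counts metadata_aspect_v2 "$PROD_ROWS" "$TEST_ROWS"
```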