Upgrading from DataHub Core to DataHub Cloud
Looking to upgrade to DataHub Cloud, but don't have an account yet? Start here.
Once you have a DataHub Cloud instance, you can seamlessly transfer all metadata from your self-hosted DataHub Core instance to DataHub Cloud using the DataHub CLI. In this guide, we'll show you how.
Prerequisites
Before starting the upgrade process:
- DataHub Cloud Account: Ensure you have an active DataHub Cloud instance with an API token
- Database Access: You'll need read access to your DataHub Core MySQL or PostgreSQL database
- DataHub CLI: Install the DataHub CLI with `pip install acryl-datahub`
- Network Connectivity: Ensure your upgrade environment can access both your source database and DataHub Cloud
- Database Index: Verify that the `createdon` column is indexed in your source database (it is indexed by default in newer versions); a quick way to check is shown below
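If you're unsure whether the index exists, here is a minimal check (MySQL shown; this assumes the default DataHub schema, where aspects live in the `metadata_aspect_v2` table):

```bash
# List any indexes on the createdon column of the aspect table
mysql -h your-database-host -u your-datahub-username -p datahub \
  -e "SHOW INDEX FROM metadata_aspect_v2 WHERE Column_name = 'createdon';"
```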
Moving From Core To Cloud
DataHub supports lifting and shifting your instance from DataHub Core to DataHub Cloud, if you'd like to retain the information already present in your DataHub Core instance. To transfer your instance cleanly, you can follow the steps below.
Transferring Core Data
You can easily copy core metadata from DataHub Core to DataHub Cloud using a simple CLI command.
By default, we'll transfer:
- Data assets (datasets, dashboards, charts, etc.)
- Users and groups
- Lineage
- Descriptions
- Ownership
- Domains and data products
- Tags and glossary terms
The following is NOT automatically transferred:
- Ingestion Sources
- Ingestion Source Runs
- Ingestion Secrets
- Platform Settings
These are excluded due to the different encryption scheme employed on DataHub Cloud. This method also excludes time-series metadata such as dataset profiles, column statistics, and assertion run history.
Step 1: Create Your Upgrade Recipe
Create a file named `upgrade_recipe.yml` with the following configuration:
```yaml
pipeline_name: datahub_cloud_upgrade

source:
  type: datahub
  config:
    # Disable version history to transfer only current state
    include_all_versions: false

    # Configure your source database connection
    database_connection:
      # For MySQL
      scheme: "mysql+pymysql"
      # For PostgreSQL, use: "postgresql+psycopg2"
      host_port: "your-database-host:3306" # MySQL default port
      username: "your-datahub-username"
      password: "your-datahub-password"
      database: "datahub" # Default database name

    # Disable stateful ingestion for a one-time transfer.
    # If you intend to sync incrementally over time, you should enable this!
    stateful_ingestion:
      enabled: false

# Preserve system metadata during transfer
flags:
  set_system_metadata: true

# Configure DataHub Cloud as the destination
sink:
  type: datahub-rest
  config:
    server: "https://your-instance.acryl.io"
    token: "your-datahub-cloud-api-token"
```
Step 2: Run the Upgrade
Execute the upgrade using the DataHub CLI:
```bash
datahub ingest -c upgrade_recipe.yml
```
The upgrade will display progress as it transfers your metadata. Depending on the size of your catalog, this process can take anywhere from minutes to hours.
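Depending on your CLI version, you may be able to sanity-check the recipe before committing to the full transfer, for example:

```bash
# Validate the recipe and connectivity without writing to the sink
datahub ingest -c upgrade_recipe.yml --dry-run

# Or transfer only a small sample of records first
datahub ingest -c upgrade_recipe.yml --preview
```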
Step 3: Verify the Upgrade
After completion:
- Log into your DataHub Cloud instance
- Navigate to the Browse page to verify your assets
- Check a few key datasets to ensure documentation, owners, and tags transferred correctly
- Verify lineage relationships are intact
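Beyond the UI checks above, you can also spot-check from the command line. A sketch assuming the CLI honors the standard environment variables (the dataset URN is a placeholder; append /gms to the server URL if your instance requires it):

```bash
# Point the DataHub CLI at your Cloud instance
export DATAHUB_GMS_URL="https://your-instance.acryl.io/gms"
export DATAHUB_GMS_TOKEN="your-datahub-cloud-api-token"

# Fetch a known asset to confirm its metadata arrived
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"
```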
Transferring Time-Series Metadata
For a complete transfer that includes recent time-series metadata, you'll need to provide additional configuration to connect to your Kafka cluster. This will transfer:
- Dataset and column profiling history
- Dataset update history (inserts, updates, deletes)
- Dataset query statistics (query counts)
- Assertion run results
Important: The amount of historical data available depends on your Kafka retention policy (typically 30-90 days).
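To see how much history is actually available before running the extended recipe, you can inspect the retention configured on the time-series topic. A sketch using standard Kafka tooling (the broker address and topic name match the placeholder values used in the recipe below):

```bash
# Show topic configuration; retention.ms bounds how far back the transfer can go
kafka-configs.sh --bootstrap-server your-kafka-broker:9092 \
  --entity-type topics --entity-name MetadataChangeLog_Timeseries_v1 \
  --describe --all
```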
Extended Recipe Configuration
```yaml
pipeline_name: datahub_upgrade_with_timeseries

source:
  type: datahub
  config:
    include_all_versions: false

    # Database connection (same as quickstart)
    database_connection:
      scheme: "mysql+pymysql"
      host_port: "your-database-host:3306"
      username: "your-datahub-username"
      password: "your-datahub-password"
      database: "datahub"

    # Kafka configuration for time-series data
    kafka_connection:
      bootstrap: "your-kafka-broker:9092"
      schema_registry_url: "http://your-schema-registry:8081"
      consumer_config:
        # Optional: Add security configuration if needed
        # security.protocol: "SASL_SSL"
        # sasl.mechanism: "PLAIN"
        # sasl.username: "your-username"
        # sasl.password: "your-password"

    # Topic containing time-series data (change if it doesn't match the default name)
    kafka_topic_name: "MetadataChangeLog_Timeseries_v1"

    # Disable stateful ingestion for a one-time transfer.
    # If you intend to sync incrementally over time, you should enable this!
    stateful_ingestion:
      enabled: false

flags:
  set_system_metadata: true

sink:
  type: datahub-rest
  config:
    server: "https://your-instance.acryl.io"
    token: "your-datahub-cloud-api-token"
```
Advanced Configuration: Transferring Specific Aspects
You can override the default set of aspects excluded from transfer using the `exclude_aspects` configuration. This is useful if you want to be more restrictive and exclude additional information from being transferred to your Cloud instance.
Be careful! Some aspects, particularly those containing encrypted secrets, will NOT transfer to DataHub Cloud due to differences in encryption schemes.
```yaml
source:
  type: datahub
  config:
    # ... other config ...

    # Exclude specific aspects from transfer
    exclude_aspects:
      - dataHubIngestionSourceInfo
      - datahubIngestionCheckpoint
      - dataHubExecutionRequestInput
      - dataHubIngestionSourceKey
      - dataHubExecutionRequestResult
      - globalSettingsInfo
      - datahubIngestionRunSummary
      - dataHubExecutionRequestSignal
      - globalSettingsKey
      - testResults
      - dataHubExecutionRequestKey
      - dataHubSecretValue
      - dataHubSecretKey
      # Add any other aspects you want to exclude
```
To learn about all aspects in DataHub, check out the DataHub metadata model documentation.
Best Practices
Performance Optimization
Batch Size: For large transfers, adjust the batch configuration:
```yaml
source:
  type: datahub
  config:
    database_query_batch_size: 10000 # Adjust based on your system
    commit_state_interval: 1000 # Records before progress is saved to DataHub
```

Destination Settings: For optimal performance on DataHub Cloud:
- Ensure batch async ingestion is enabled by setting `mode: ASYNC_BATCH` in the sink section of your recipe (enabled by default)
- Increase the thread count if needed by adjusting the `max_threads` parameter in the sink section of your recipe, as sketched below
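A minimal sketch of those sink settings (the values shown are illustrative, not recommendations):

```yaml
sink:
  type: datahub-rest
  config:
    server: "https://your-instance.acryl.io"
    token: "your-datahub-cloud-api-token"
    mode: ASYNC_BATCH # batched async ingestion
    max_threads: 30   # illustrative; increase gradually if throughput is the bottleneck
```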
Check out the sink docs to learn about other configuration parameters you may want to use during the upgrade process.
Stateful Ingestion: For very large instances, use stateful ingestion:
```yaml
stateful_ingestion:
  enabled: true
  ignore_old_state: false # Set to true to restart from the beginning!
```

This enables you to upgrade incrementally over time, syncing only changes once the initial upgrade has completed.
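For example, once the initial transfer has finished, you could schedule the same recipe to re-run on a regular cadence. A minimal sketch as a crontab entry (the path, schedule, and log location are placeholders):

```bash
# Re-run the transfer nightly at 02:00 to sync incremental changes
0 2 * * * datahub ingest -c /path/to/upgrade_recipe.yml >> /var/log/datahub_upgrade.log 2>&1
```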
Troubleshooting
Common issues and solutions:
- Authentication Errors: Verify your API token has write permissions
- Network Timeouts: Check firewall rules and consider adjusting `query_timeout`
- Memory Issues: Reduce `database_query_batch_size` for large transfers (see the sketch below)
- Slow Performance: Ensure the `createdon` index exists on your source database
- Parse Errors: Set `commit_with_parse_errors: true` to continue despite errors
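A rough sketch of where these tuning options sit in the source config; the values are purely illustrative, and you should confirm the exact option names and placement against the DataHub source documentation:

```yaml
source:
  type: datahub
  config:
    query_timeout: 600 # seconds; raise this if you hit network timeouts
    database_query_batch_size: 5000 # lower this if you run into memory pressure
```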
Error Handling
By default, the upgrade job will stop committing checkpoints if errors occur, allowing you to re-run and catch missed data.
However, in some cases it's not possible to transfer data to DataHub Cloud, particularly if you've forked and extended the DataHub metadata model.
To continue making progress while ignoring such errors, set:
```yaml
source:
  type: datahub
  config:
    commit_with_parse_errors: true # Continue even with parse errors
```
Post-Upgrade Steps
- Configure Data Sources: Configure your ingestion sources on DataHub Cloud
- Configure SSO: Set up authentication for your team
- Update API Clients: Update any client applications to point to your DataHub Cloud instance (see the sketch below)
- Review Policies: Recreate any custom policies and roles as needed
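For the Update API Clients step, here is a minimal sketch of re-pointing code that uses the DataHub Python SDK; the server, token, and dataset URN are placeholders, and your applications may be configured differently:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import OwnershipClass

# Point SDK-based clients at the DataHub Cloud instance instead of the
# self-hosted GMS endpoint. Both values are placeholders.
graph = DataHubGraph(
    DatahubClientConfig(
        server="https://your-instance.acryl.io",
        token="your-datahub-cloud-api-token",
    )
)

# Quick sanity check: fetch ownership for a known (placeholder) dataset URN
ownership = graph.get_aspect(
    entity_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
    aspect_type=OwnershipClass,
)
print(ownership)
```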
Additional Resources
For more detailed configuration options, refer to the DataHub source documentation.
Need help? Contact DataHub Cloud support or visit our community Slack.