Upgrading from DataHub Core to DataHub Cloud
Looking to upgrade to DataHub Cloud, but don't have an account yet? Start here.
Once you have a DataHub Cloud instance, you can seamlessly transfer all metadata from your self-hosted DataHub Core instance to DataHub Cloud using the DataHub CLI. In this guide, we'll show you how.
Prerequisites
Before starting the upgrade process:
- DataHub Cloud Account: Ensure you have an active DataHub Cloud instance with an API token
- Database Access: You'll need read access to your DataHub Core MySQL or PostgreSQL database
- DataHub CLI: Install the DataHub CLI with `pip install acryl-datahub`
- Network Connectivity: Ensure your upgrade environment can access both your source database and DataHub Cloud
- Database Index: Verify that the `createdon` column is indexed in your source database (it is indexed by default in newer versions); a quick way to check is shown below
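If you're unsure whether the index exists, here is a minimal check (MySQL shown; this assumes the default DataHub schema, where aspects live in the `metadata_aspect_v2` table):

```bash
# List any indexes on the createdon column of the aspect table
mysql -h your-database-host -u your-datahub-username -p datahub \
  -e "SHOW INDEX FROM metadata_aspect_v2 WHERE Column_name = 'createdon';"
```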
Moving From Core To Cloud
DataHub supports lifting and shifting your instance from DataHub Core to DataHub Cloud, if you'd like to retain the information already present in your DataHub Core instance. To transfer your instance cleanly, you can follow the steps below.
Transferring Core Data
You can easily copy core metadata from DataHub Core to DataHub Cloud using a simple CLI command.
By default, we'll transfer:
- Data assets (datasets, dashboards, charts, etc.)
- Users and groups
- Lineage
- Descriptions
- Ownership
- Domains and data products
- Tags and glossary terms
The following is NOT automatically transferred:
- Ingestion Sources
- Ingestion Source Runs
- Ingestion Secrets
- Platform Settings
These are excluded due to the different encryption scheme employed on DataHub Cloud. This method also excludes time-series metadata such as dataset profiles, column statistics, and assertion run history.
Step 1: Create Your Upgrade Recipe
Create a file named `upgrade_recipe.yml` with the following configuration:
```yaml
pipeline_name: datahub_cloud_upgrade

source:
  type: datahub
  config:
    # Disable version history to transfer only current state
    include_all_versions: false

    # Configure your source database connection
    database_connection:
      # For MySQL
      scheme: "mysql+pymysql"
      # For PostgreSQL, use: "postgresql+psycopg2"
      host_port: "your-database-host:3306" # MySQL default port
      username: "your-datahub-username"
      password: "your-datahub-password"
      database: "datahub" # Default database name

    # Disable stateful ingestion for a one-time transfer.
    # If you intend to sync incrementally over time, you should enable this!
    stateful_ingestion:
      enabled: false

# Preserve system metadata during transfer
flags:
  set_system_metadata: true

# Configure DataHub Cloud as the destination
sink:
  type: datahub-rest
  config:
    server: "https://your-instance.acryl.io"
    token: "your-datahub-cloud-api-token"
```
Step 2: Run the Upgrade
Execute the upgrade using the DataHub CLI:
```bash
datahub ingest -c upgrade_recipe.yml
```
The upgrade will display progress as it transfers your metadata. Depending on the size of your catalog, this process can take anywhere from minutes to hours.
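Depending on your CLI version, you may be able to sanity-check the recipe before committing to the full transfer, for example:

```bash
# Validate the recipe and connectivity without writing to the sink
datahub ingest -c upgrade_recipe.yml --dry-run

# Or transfer only a small sample of records first
datahub ingest -c upgrade_recipe.yml --preview
```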
Step 3: Verify the Upgrade
After completion:
- Log into your DataHub Cloud instance
- Navigate to the Browse page to verify your assets
- Check a few key datasets to ensure documentation, owners, and tags transferred correctly
- Verify lineage relationships are intact
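Beyond the UI checks above, you can also spot-check from the command line. A sketch assuming the CLI honors the standard environment variables (the dataset URN is a placeholder; append /gms to the server URL if your instance requires it):

```bash
# Point the DataHub CLI at your Cloud instance
export DATAHUB_GMS_URL="https://your-instance.acryl.io/gms"
export DATAHUB_GMS_TOKEN="your-datahub-cloud-api-token"

# Fetch a known asset to confirm its metadata arrived
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"
```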
Transferring Time-Series Metadata
For a complete transfer that includes recent time-series metadata, you'll need to provide additional configuration to connect to your Kafka cluster. This will transfer:
- Dataset and column profiling history
- Dataset update history (inserts, updates, deletes)
- Dataset query statistics (query counts)
- Assertion run results
Important: The amount of historical data available depends on your Kafka retention policy (typically 30-90 days).
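To see how much history is actually available before running the extended recipe, you can inspect the retention configured on the time-series topic. A sketch using standard Kafka tooling (the broker address and topic name match the placeholder values used in the recipe below):

```bash
# Show topic configuration; retention.ms bounds how far back the transfer can go
kafka-configs.sh --bootstrap-server your-kafka-broker:9092 \
  --entity-type topics --entity-name MetadataChangeLog_Timeseries_v1 \
  --describe --all
```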
Extended Recipe Configuration
```yaml
pipeline_name: datahub_upgrade_with_timeseries

source:
  type: datahub
  config:
    include_all_versions: false

    # Database connection (same as quickstart)
    database_connection:
      scheme: "mysql+pymysql"
      host_port: "your-database-host:3306"
      username: "your-datahub-username"
      password: "your-datahub-password"
      database: "datahub"

    # Kafka configuration for time-series data
    kafka_connection:
      bootstrap: "your-kafka-broker:9092"
      schema_registry_url: "http://your-schema-registry:8081"
      consumer_config:
        # Optional: Add security configuration if needed
        # security.protocol: "SASL_SSL"
        # sasl.mechanism: "PLAIN"
        # sasl.username: "your-username"
        # sasl.password: "your-password"

    # Topic containing time-series data (change if it doesn't match the default name)
    kafka_topic_name: "MetadataChangeLog_Timeseries_v1"

    # Disable stateful ingestion for a one-time transfer.
    # If you intend to sync incrementally over time, you should enable this!
    stateful_ingestion:
      enabled: false

flags:
  set_system_metadata: true

sink:
  type: datahub-rest
  config:
    server: "https://your-instance.acryl.io"
    token: "your-datahub-cloud-api-token"
```
Advanced Configuration: Transferring Specific Aspects
You can override the default set of aspects excluded from transfer using the `exclude_aspects` configuration. This is useful if you want to be more restrictive and exclude additional information from being transferred to your Cloud instance.
Be careful! Some aspects, particularly those containing encrypted secrets, will NOT transfer to DataHub Cloud due to differences in encryption schemes.
```yaml
source:
  type: datahub
  config:
    # ... other config ...

    # Exclude specific aspects from transfer
    exclude_aspects:
      - dataHubIngestionSourceInfo
      - datahubIngestionCheckpoint
      - dataHubExecutionRequestInput
      - dataHubIngestionSourceKey
      - dataHubExecutionRequestResult
      - globalSettingsInfo
      - datahubIngestionRunSummary
      - dataHubExecutionRequestSignal
      - globalSettingsKey
      - testResults
      - dataHubExecutionRequestKey
      - dataHubSecretValue
      - dataHubSecretKey
      # Add any other aspects you want to exclude
```
To learn about all aspects in DataHub, check out the DataHub metadata model documentation.
Best Practices
Performance Optimization
Batch Size: For large transfers, adjust the batch configuration:
```yaml
source:
  type: datahub
  config:
    database_query_batch_size: 10000 # Adjust based on your system
    commit_state_interval: 1000 # Records before progress is saved to DataHub
```

Destination Settings: For optimal performance on DataHub Cloud:
- Ensure batch async ingestion is enabled by setting `mode: ASYNC_BATCH` in the sink section of your recipe (enabled by default)
- Increase the thread count if needed by adjusting the `max_threads` parameter in the sink section of your recipe, as sketched below
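A minimal sketch of those sink settings (the values shown are illustrative, not recommendations):

```yaml
sink:
  type: datahub-rest
  config:
    server: "https://your-instance.acryl.io"
    token: "your-datahub-cloud-api-token"
    mode: ASYNC_BATCH # batched async ingestion
    max_threads: 30   # illustrative; increase gradually if throughput is the bottleneck
```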
Check out the sink docs to learn about other configuration parameters you may want to use during the upgrade process.
Stateful Ingestion: For very large instances, use stateful ingestion:
```yaml
stateful_ingestion:
  enabled: true
  ignore_old_state: false # Set to true to restart from the beginning!
```

This enables you to upgrade incrementally over time, syncing only changes once the initial upgrade has completed.
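For example, once the initial transfer has finished, you could schedule the same recipe to re-run on a regular cadence. A minimal sketch as a crontab entry (the path, schedule, and log location are placeholders):

```bash
# Re-run the transfer nightly at 02:00 to sync incremental changes
0 2 * * * datahub ingest -c /path/to/upgrade_recipe.yml >> /var/log/datahub_upgrade.log 2>&1
```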
Troubleshooting
Common issues and solutions:
- Authentication Errors: Verify your API token has write permissions
- Network Timeouts: Check firewall rules and consider adjusting `query_timeout`
- Memory Issues: Reduce `database_query_batch_size` for large transfers (see the sketch below)
- Slow Performance: Ensure the `createdon` index exists on your source database
- Parse Errors: Set `commit_with_parse_errors: true` to continue despite errors
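A rough sketch of where these tuning options sit in the source config; the values are purely illustrative, and you should confirm the exact option names and placement against the DataHub source documentation:

```yaml
source:
  type: datahub
  config:
    query_timeout: 600 # seconds; raise this if you hit network timeouts
    database_query_batch_size: 5000 # lower this if you run into memory pressure
```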
Error Handling
By default, the upgrade job will stop committing checkpoints if errors occur, allowing you to re-run and catch missed data.
However, in some cases it's not possible to transfer data to DataHub Cloud, particularly if you've forked and extended the DataHub metadata model.
To continue making progress while ignoring such errors, set:
```yaml
source:
  type: datahub
  config:
    commit_with_parse_errors: true # Continue even with parse errors
```
Post-Upgrade Steps
- Configure Data Sources: Configure your ingestion sources on DataHub Cloud
- Configure SSO: Set up authentication for your team
- Update API Clients: Update any client applications to point to your DataHub Cloud instance (see the sketch below)
- Review Policies: Recreate any custom policies and roles as needed
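For the Update API Clients step, here is a minimal sketch of re-pointing code that uses the DataHub Python SDK; the server, token, and dataset URN are placeholders, and your applications may be configured differently:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import OwnershipClass

# Point SDK-based clients at the DataHub Cloud instance instead of the
# self-hosted GMS endpoint. Both values are placeholders.
graph = DataHubGraph(
    DatahubClientConfig(
        server="https://your-instance.acryl.io",
        token="your-datahub-cloud-api-token",
    )
)

# Quick sanity check: fetch ownership for a known (placeholder) dataset URN
ownership = graph.get_aspect(
    entity_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
    aspect_type=OwnershipClass,
)
print(ownership)
```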
Additional Resources
For more detailed configuration options, refer to the DataHub source documentation.
Need help? Contact DataHub Cloud support or visit our community Slack.