DataHub Java SDK V2 Design Document

Executive Summary

This document describes the design of DataHub Java SDK V2, a modern, user-friendly Java client library that targets feature parity with the Python SDK V2 for core entity management. The new SDK addresses feedback from enterprise Java customers who need a first-class SDK experience comparable to the one Python developers already have.

This document is organized into two main sections:

Part 1 - User-Facing API Design: The public API, patterns, and behaviors visible to SDK users
Part 2 - Developer-Facing Implementation: Internal architecture and implementation details for contributors

Why Hand-Crafted? For a deep dive into why we chose to hand-craft this SDK instead of using OpenAPI code generation, see Java SDK V2 Philosophy.

Background

Problem Statement

Currently, DataHub's Java SDK (datahub-client) provides only low-level emission capabilities:

Manual MCP (Metadata Change Proposal) construction required
No high-level entity builders for Dataset, Chart, Dashboard, etc.
No client for CRUD operations (read, update, delete)
No patch capabilities for granular updates
Significantly inferior developer experience compared to Python SDK V2

This gap has caused friction with enterprise customers, particularly Java shops that feel like "second-class citizens" compared to Python developers.

Goals

Feature Parity: Match Python SDK V2 capabilities for entity management
Backward Compatibility: Maintain 100% compatibility with existing Java SDK
Namespace Separation: Use datahub.client.v2.* namespace for new APIs
Builder Pattern: Fluent, type-safe API for entity construction
Patch Support: Granular updates without full entity replacement
CRUD Operations: Support create, read, update, upsert operations (delete/exists deferred)
Comprehensive Testing: Unit and integration tests validating all functionality

Non-Goals

Rewriting existing emitter infrastructure (leverage existing)
100% feature parity with Python SDK (focus on core entities first)
GraphQL client implementation (focus on REST/OpenAPI)
Search client (future enhancement)
Lineage client (future enhancement)

Part 1: User-Facing API Design

This section describes the public API that SDK users interact with - the patterns, behaviors, and interfaces that define the developer experience.

Design Principles

1. Fluent Builder Pattern

Intuitive entity construction through method chaining:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .description("My dataset")
    .build();

// Fluent metadata operations with type-safe method chaining
dataset.addTag("pii")
       .addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER)
       .setDomain("urn:li:domain:Analytics")
       .setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5);

client.entities().upsert(dataset);

2. Type Safety and Compile-Time Checking

Leverage Java's strong typing:

Strongly-typed URNs (DatasetUrn, ChartUrn, etc.)
Generic types for entity operations
CRTP (Curiously Recurring Template Pattern) for type-safe mixin interfaces
Builder validation at construction time
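
As a rough illustration of "builder validation at construction time", the sketch below fails fast inside build() instead of at emission time. DatasetKey, its fields, and the "PROD" default are hypothetical stand-ins for this example, not the SDK's Dataset or DatasetUrn classes; only the URN layout follows DataHub's dataset URN format.

```java
/** Hypothetical stand-in for a typed dataset identifier; not the SDK's DatasetUrn. */
final class DatasetKey {
    private final String platform;
    private final String name;
    private final String env;

    private DatasetKey(Builder b) {
        this.platform = b.platform;
        this.name = b.name;
        this.env = b.env;
    }

    static Builder builder() { return new Builder(); }

    /** Renders the key in DataHub's dataset URN layout. */
    String toUrnString() {
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD"; // assumed default for this sketch

        Builder platform(String platform) { this.platform = platform; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder env(String env) { this.env = env; return this; }

        /** Required fields are checked here, so an invalid key can never be constructed. */
        DatasetKey build() {
            if (platform == null || platform.isEmpty()) {
                throw new IllegalStateException("platform is required");
            }
            if (name == null || name.isEmpty()) {
                throw new IllegalStateException("name is required");
            }
            return new DatasetKey(this);
        }
    }
}
```

Because the constructor is private, the only way to obtain a DatasetKey is through a builder that has already passed validation.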

3. Mode-Aware Behavior

SDK Mode vs INGESTION Mode for proper separation of concerns:

SDK Mode (default): User edits → editableDatasetProperties
INGESTION Mode: Pipeline writes → datasetProperties
Getters intelligently prefer editable aspects over system aspects

// SDK mode - user edits go to editable aspects
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .mode(OperationMode.SDK)  // Default
    .build();

// INGESTION mode - pipeline writes go to system aspects
DataHubClientV2 ingestionClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .mode(OperationMode.INGESTION)
    .build();

4. Patch-First Philosophy

Design Decision: Prioritize patches over full aspect replacement

The SDK V2 is designed around patch-based operations because they represent the most common and intuitive way to make metadata changes:

Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
Dataset mutable = dataset.mutable();  // Get mutable copy

// These create patches internally - no server calls yet
mutable.addTag("pii")
       .addTag("sensitive")
       .addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER);

// Single call emits all accumulated patches atomically
client.entities().update(mutable);

Why patches?

Simplicity: Users think "add a tag" not "fetch all tags, add one, PUT entire tag aspect back"
Safety: Patches don't overwrite concurrent changes from other users
Efficiency: Only changed fields are transmitted and processed
Common use case: Most metadata operations are incremental additions/removals

When to use low-level SDK: If you need to completely replace an aspect (full PUT/upsert semantics), use the V1 SDK's RestEmitter directly with MetadataChangeProposalWrapper. The V2 SDK focuses on making common operations simple, not exposing every low-level primitive.
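
The safety argument can be made concrete with a toy model. The sketch below uses an in-memory TagStore (a hypothetical stand-in for a server-side tags aspect, not a DataHub API) to contrast two writers doing read-modify-write with full replacement against the same two writers sending patches.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy in-memory stand-in for a server-side tags aspect; not a DataHub API. */
final class TagStore {
    private final Set<String> tags = new LinkedHashSet<>();

    /** Full replacement: the caller's snapshot overwrites whatever is stored. */
    synchronized void replaceAll(Set<String> newTags) {
        tags.clear();
        tags.addAll(newTags);
    }

    /** Patch: only the delta is applied, so other writers' changes survive. */
    synchronized void patchAdd(String tag) {
        tags.add(tag);
    }

    /** Returns a copy, mimicking a client-side snapshot of the aspect. */
    synchronized Set<String> read() {
        return new LinkedHashSet<>(tags);
    }
}

final class PatchVsReplaceDemo {
    /** Two writers each read a snapshot, modify it locally, then write the whole set back. */
    static Set<String> withFullReplacement() {
        TagStore store = new TagStore();
        store.patchAdd("existing");
        Set<String> writerA = store.read();
        Set<String> writerB = store.read();
        writerA.add("pii");
        writerB.add("sensitive");
        store.replaceAll(writerA);
        store.replaceAll(writerB); // overwrites writer A's "pii"
        return store.read();
    }

    /** The same two writers send only their deltas. */
    static Set<String> withPatches() {
        TagStore store = new TagStore();
        store.patchAdd("existing");
        store.patchAdd("pii");       // writer A
        store.patchAdd("sensitive"); // writer B
        return store.read();
    }
}
```

In this model the full-replacement path ends with "existing" and "sensitive" only, while the patch path keeps all three tags.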

5. Composition Through Mixin Interfaces

Shared metadata operations via type-safe mixin interfaces:

HasTags<T> - Add, remove, set tags
HasOwners<T> - Manage ownership
HasGlossaryTerms<T> - Associate glossary terms
HasDomains<T> - Domain assignment
HasContainer<T> - Parent-child hierarchies

All mixins use the CRTP pattern for type-safe method chaining that returns the concrete entity type.

Architecture

Package Structure (Actual Implementation)

datahub-client/
├── src/main/java/
│   ├── datahub/client/                    # Existing v1 (unchanged)
│   │   ├── Emitter.java
│   │   ├── rest/RestEmitter.java
│   │   └── ...
│   │
│   └── datahub/client/v2/                 # New v2 namespace
│       ├── DataHubClientV2.java           # Main client entry point
│       │
│       ├── entity/                        # Entity classes
│       │   ├── Entity.java                # Base entity class (490 lines)
│       │   ├── AspectCache.java           # Unified cache with dirty tracking (184 lines)
│       │   ├── CachedAspect.java          # Aspect wrapper with metadata (68 lines)
│       │   ├── AspectSource.java          # SERVER vs LOCAL enum (23 lines)
│       │   ├── ReadMode.java              # ALLOW_DIRTY vs SERVER_ONLY (28 lines)
│       │   ├── Dataset.java               # Dataset entity (564 lines)
│       │   ├── Chart.java                 # Chart entity (587 lines)
│       │   ├── Dashboard.java             # Dashboard entity (671 lines)
│       │   ├── DataJob.java               # DataJob entity (597 lines)
│       │   ├── DataFlow.java              # DataFlow entity (467 lines)
│       │   ├── Container.java             # Container entity (500 lines)
│       │   ├── MLModel.java               # ML Model entity NEW
│       │   ├── MLModelGroup.java          # ML Model Group entity NEW
│       │   ├── HasTags.java               # Tag operations mixin
│       │   ├── HasOwners.java             # Ownership operations mixin
│       │   ├── HasGlossaryTerms.java      # Terms operations mixin
│       │   ├── HasDomains.java            # Domain operations mixin
│       │   ├── HasContainer.java          # Container hierarchy mixin
│       │   └── HasStructuredProperties.java # Structured properties mixin
│       │
│       ├── operations/                    # CRUD operation clients
│       │   └── EntityClient.java          # Entity CRUD operations (570 lines)
│       │
│       └── config/                        # Configuration
│           └── DataHubClientConfigV2.java # Config with mode support
│
└── src/test/java/                         # Tests mirror structure
    └── datahub/client/v2/
        ├── DataHubClientV2Test.java       # Client tests
        ├── entity/                        # 378 unit tests
        │   ├── AspectCacheTest.java       # 30 tests (cache infrastructure)
        │   ├── CachedAspectTest.java      # 13 tests (cache infrastructure)
        │   ├── DatasetTest.java           # 37 tests
        │   ├── ChartTest.java             # 43 tests
        │   ├── DashboardTest.java         # 52 tests
        │   ├── DataJobTest.java           # 45 tests
        │   ├── DataFlowTest.java          # 40 tests
        │   ├── ContainerTest.java         # 40 tests
        │   ├── MLModelTest.java           # 44 tests
        │   └── MLModelGroupTest.java      # 38 tests
        └── integration/                   # 79 integration tests
            ├── DatasetIntegrationTest.java
            ├── ChartIntegrationTest.java
            ├── DashboardIntegrationTest.java
            ├── DataJobIntegrationTest.java
            ├── DataFlowIntegrationTest.java
            ├── ContainerIntegrationTest.java
            ├── MLModelIntegrationTest.java
            └── MLModelGroupIntegrationTest.java

Key Design Decisions:

No separate patch/ package - patches accumulate internally within entities
Mixin interfaces in entity/ package using CRTP pattern for type safety
Support for 8 entity types including ML entities (MLModel, MLModelGroup)
Mode-aware configuration for SDK vs INGESTION behavior

Core Classes

1. DataHubClientV2 (Main Entry Point)

File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java (266 lines)

package datahub.client.v2;

/**
 * Main entry point for DataHub Java SDK V2.
 * Provides high-level operations for entity management with mode-aware behavior.
 *
 * <p>Example usage:
 * <pre>
 * DataHubClientV2 client = DataHubClientV2.builder()
 *     .server("http://localhost:8080")
 *     .token("my-token")
 *     .mode(OperationMode.SDK)  // SDK or INGESTION mode
 *     .build();
 *
 * Dataset dataset = Dataset.builder()
 *     .platform("snowflake")
 *     .name("my_table")
 *     .env("PROD")
 *     .description("My dataset")
 *     .build();
 *
 * client.entities().upsert(dataset);
 * </pre>
 */
public class DataHubClientV2 implements AutoCloseable {
    private final RestEmitter emitter;
    private final DataHubClientConfigV2 config;
    private final EntityClient entityClient;

    // Builder for client configuration
    public static Builder builder() { ... }

    // Entity operations
    public EntityClient entities() { return entityClient; }

    // Low-level emitter access (for advanced users)
    public RestEmitter emitter() { return emitter; }

    // Configuration access
    public DataHubClientConfigV2 config() { return config; }

    @Override
    public void close() throws IOException { ... }

    public static class Builder {
        public Builder server(String serverUrl) { ... }
        public Builder token(String token) { ... }
        public Builder timeout(int timeoutMs) { ... }
        public Builder mode(OperationMode mode) { ... }  // NEW
        public Builder config(DataHubClientConfigV2 config) { ... }
        public DataHubClientV2 build() { ... }
    }
}

Design Features:

Mode-aware behavior (SDK vs INGESTION) for proper aspect routing
Environment variable support for configuration
Builder pattern with sensible defaults
AutoCloseable interface for resource management

2. Entity (Base Class) - User-Facing API

File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines)

The Entity base class provides a unified interface for all DataHub entities. From a user perspective, all entities support:

Public API Methods:

// URN access
public Urn getUrn()
public abstract String getEntityType()

// Convert to MCPs for emission (primarily internal)
public List<MetadataChangeProposalWrapper> toMCPs()

Entity Construction:

Entities are constructed via fluent builders:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .description("My dataset")
    .build();

Fluent Metadata Operations:

All entities support method chaining for metadata operations (via mixin interfaces):

dataset.addTag("pii")
       .addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
       .setDomain(domainUrn)
       .addTerm(termUrn);

Lazy Loading:

Entities loaded from the server fetch aspects on-demand:

Dataset dataset = client.entities().get(datasetUrn, Dataset.class);  // Only URN loaded
String description = dataset.getDescription();         // Aspect fetched now
List<String> tags = dataset.getTags();                // Another aspect fetch
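
The on-demand behavior can be sketched in isolation. LazyEntity below is a hypothetical toy, not the SDK's Entity class: the fetcher function stands in for a server call, and a counter shows that each aspect is fetched at most once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy entity that fetches aspects on first access; the fetcher stands in for a server call. */
final class LazyEntity {
    private final Function<String, String> fetcher;
    private final Map<String, String> loaded = new HashMap<>();
    private int fetchCount = 0;

    LazyEntity(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Fetches the aspect on first access and serves later reads from the local map. */
    String getAspect(String aspectName) {
        if (!loaded.containsKey(aspectName)) {
            fetchCount++;
            loaded.put(aspectName, fetcher.apply(aspectName));
        }
        return loaded.get(aspectName);
    }

    /** Number of simulated server round trips so far. */
    int fetchCount() {
        return fetchCount;
    }
}
```

Constructing the entity performs no fetch; the first getAspect("datasetProperties") triggers one, and a second call for the same aspect is served locally.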

Patch Accumulation:

Metadata operations create patches that accumulate until save:

Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
Dataset mutable = dataset.mutable();  // Get mutable copy
mutable.addTag("pii");           // Creates patch (not sent yet)
mutable.addTag("sensitive");     // Another patch (not sent yet)
client.entities().update(mutable); // Emits all patches atomically

Immutability-by-Default:

Entities fetched from the server are read-only to prevent accidental mutations:

Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
dataset.isReadOnly();  // true
dataset.isMutable();   // false

// Attempting mutation throws ReadOnlyEntityException
// dataset.addTag("pii");  // ERROR!

// Get mutable copy for updates
Dataset mutable = dataset.mutable();
mutable.isMutable();  // true
mutable.addTag("pii");  // Works
client.entities().upsert(mutable);

Entity Lifecycle:

Builder-created entities - Mutable from creation

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
dataset.isMutable();  // true - can mutate immediately

Server-fetched entities - Immutable by default

Dataset dataset = client.entities().get(urn, Dataset.class);
dataset.isReadOnly();  // true - must call .mutable()

Mutable copies - Created via .mutable()

Dataset mutable = dataset.mutable();
mutable.isMutable();  // true - can mutate

The .mutable() method:

Creates a shallow copy with independent mutability flags
Shares aspect cache with original (read-your-own-writes semantics)
Idempotent - returns self if already mutable
Original entity remains read-only after creating mutable copy
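
A minimal sketch of these four properties, using a hypothetical TaggedEntity class rather than the SDK's Entity. A plain list stands in for the shared aspect cache, and a JDK exception replaces ReadOnlyEntityException so the example stays standalone.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of read-only-by-default entities; not the SDK's Entity class. */
final class TaggedEntity {
    private final List<String> tags; // shared between the original and its mutable copy
    private final boolean readOnly;

    private TaggedEntity(List<String> tags, boolean readOnly) {
        this.tags = tags;
        this.readOnly = readOnly;
    }

    /** Simulates an entity fetched from the server: read-only. */
    static TaggedEntity fetched(List<String> serverTags) {
        return new TaggedEntity(new ArrayList<>(serverTags), true);
    }

    boolean isReadOnly() { return readOnly; }
    boolean isMutable() { return !readOnly; }

    /** Idempotent: a mutable entity returns itself; a read-only one returns a shallow copy. */
    TaggedEntity mutable() {
        return readOnly ? new TaggedEntity(tags, false) : this;
    }

    TaggedEntity addTag(String tag) {
        if (readOnly) {
            // The SDK throws ReadOnlyEntityException; a JDK exception keeps this sketch standalone.
            throw new IllegalStateException("entity is read-only; call mutable() first");
        }
        tags.add(tag);
        return this;
    }

    /** Returns a defensive copy so callers cannot bypass the read-only check. */
    List<String> getTags() {
        return new ArrayList<>(tags);
    }
}
```

Because the copy shares state with the original, a tag added through the mutable copy is visible when reading from the original, which is the read-your-own-writes behavior described above.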

Why immutability-by-default?

Makes mutations explicit and intentional
Prevents accidental modification when passing entities between functions
Clear separation between read and write workflows
Enables safe entity sharing across threads
Common pattern in modern APIs (Rust, Python, Java immutable collections)

See "Part 2: Developer-Facing Implementation Design" below for internal architecture details.

3. Supported Entities

The SDK V2 implements 8 entity types with full metadata support:

Data Entities:

Dataset - Tables, views, files with schema support
Container - Databases, schemas, folders (hierarchical structures)

Pipeline Entities:

DataFlow - Pipelines, workflows (Airflow DAGs, Spark jobs, dbt projects)
DataJob - Individual tasks with inlet/outlet lineage

Visualization Entities:

Chart - Visualizations with input dataset lineage
Dashboard - Dashboards with chart relationships and input datasets

ML Entities:

MLModel - Machine learning models with metrics, hyperparameters, training jobs
MLModelGroup - Model families with version management

Common Entity Operations:

All entities support these fluent operations (via mixin interfaces):

// Tags
entity.addTag("pii")
      .removeTag("deprecated")
      .setTags(Arrays.asList("tag1", "tag2"))
      .clearTags();

// Owners
entity.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
      .removeOwner(ownerUrn)
      .setOwners(ownerList)
      .clearOwners();

// Glossary Terms
entity.addTerm(termUrn)
      .removeTerm(termUrn)
      .setTerms(termList)
      .clearTerms();

// Domains
entity.setDomain(domainUrn)
      .removeDomain(domainUrn)
      .clearDomains();

// Container (for hierarchical entities)
entity.setContainer(containerUrn)
      .clearContainer();

// Structured Properties (custom typed metadata)
entity.setStructuredProperty("io.acryl.dataManagement.replicationSLA", "24h")
      .setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5)
      .setStructuredProperty("io.acryl.dataManagement.certifications",
                             Arrays.asList("SOC2", "HIPAA", "GDPR"))
      .setStructuredProperty("io.acryl.privacy.retentionDays", 90, 180, 365)
      .removeStructuredProperty("io.acryl.dataManagement.deprecated");

Entity-Specific Documentation:

See comprehensive guides in metadata-integration/java/docs/sdk-v2/:

dataset-entity.md - Dataset with schema support
chart-entity.md - Chart with lineage
dashboard-entity.md - Dashboard with chart relationships
container-entity.md - Container hierarchies
dataflow-entity.md - DataFlow pipelines
datajob-entity.md - DataJob with inlet/outlet lineage
mlmodel-entity.md - MLModel with metrics
mlmodelgroup-entity.md - MLModelGroup with versions

4. EntityClient (CRUD Operations)

File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java (570 lines)

package datahub.client.v2.operations;

/**
 * Client for entity CRUD operations.
 * Provides create, read, update, and upsert operations.
 */
public class EntityClient {
    private final RestEmitter emitter;
    private final DataHubClientConfigV2 config;

    /**
     * Create a new entity (convenience method - same as upsert).
     */
    public <T extends Entity> void create(T entity) throws IOException, ExecutionException, InterruptedException {
        upsert(entity);
    }

    /**
     * Upsert an entity (create or update).
     * Emits all aspects and accumulated patches.
     */
    public <T extends Entity> void upsert(T entity) throws IOException, ExecutionException, InterruptedException {
        List<MetadataChangeProposalWrapper> mcps = entity.toMCPs();
        // Emit all MCPs asynchronously and wait for completion
        // ...
    }

    /**
     * Update an existing entity.
     * Emits only accumulated patches (not full aspects).
     */
    public <T extends Entity> void update(T entity) throws IOException, ExecutionException, InterruptedException {
        // Emit only pending patches
        // ...
    }

    /**
     * Get an entity by URN.
     * Returns entity with lazy-loaded aspects.
     */
    public <T extends Entity> T get(Urn urn, Class<T> entityClass) throws IOException {
        // Fetch entity aspects from server
        // Construct entity with lazy loading support
        // ...
    }

    // Note: delete(Urn) and exists(Urn) operations deferred to future releases
}

Supported Operations:

create() - Create new entities (wrapper for upsert)
upsert() - Create or update entities (emits all aspects + patches)
update() - Update existing entities (emits only patches)
get() - Retrieve entities with lazy loading
delete() and exists() - Deferred to future releases

Patch Behavior:

Patches are accumulated inside entities during metadata operations and emitted automatically during upsert()/update():

Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
Dataset mutable = dataset.mutable();  // Get mutable copy
mutable.addTag("pii");           // Creates internal patch
mutable.addTag("sensitive");     // Creates another internal patch
client.entities().update(mutable); // Emits both patches atomically

There is no separate patch() method - patches are managed internally by entities.

5. Mixin Interfaces (CRTP Pattern)

Files: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Has*.java

Mixin interfaces provide reusable metadata operations across entities using the Curiously Recurring Template Pattern (CRTP) for type-safe method chaining:

/**
 * Interface for entities that support tags.
 * Uses CRTP for type-safe method chaining.
 */
public interface HasTags<T extends Entity & HasTags<T>> {

    /**
     * Add a tag to this entity.
     * Creates a patch that will be emitted on save.
     */
    default T addTag(@Nonnull String tagUrn) {
        // Implementation creates patch internally
        return (T) this;
    }

    default T removeTag(@Nonnull String tagUrn) { ... }
    default T setTags(@Nonnull List<String> tagUrns) { ... }
    default T clearTags() { ... }

    // Getter methods
    default List<String> getTags() { ... }
}

Available Mixin Interfaces:

HasTags<T> - Tag operations (addTag, removeTag, setTags, clearTags)
HasOwners<T> - Ownership operations (addOwner, removeOwner, setOwners, clearOwners)
HasGlossaryTerms<T> - Glossary term operations (addTerm, removeTerm, setTerms, clearTerms)
HasDomains<T> - Domain operations (setDomain, removeDomain, clearDomains)
HasContainer<T> - Container hierarchy (setContainer, clearContainer)
HasStructuredProperties<T> - Structured properties operations (setStructuredProperty, removeStructuredProperty)

Why CRTP?

The CRTP pattern enables type-safe method chaining that returns the concrete entity type:

// Without CRTP: returns Entity
Entity entity = dataset.addTag("pii");  // Loses Dataset type!

// With CRTP: returns Dataset
Dataset tagged = dataset.addTag("pii")
                        .addOwner(ownerUrn, type)  // Still Dataset type!
                        .setDomain(domainUrn);     // Still Dataset type!

Entity Implementations:

Entities implement mixin interfaces by declaring them in the class signature:

public class Dataset extends Entity
    implements HasTags<Dataset>,
               HasOwners<Dataset>,
               HasGlossaryTerms<Dataset>,
               HasDomains<Dataset>,
               HasContainer<Dataset>,
               HasStructuredProperties<Dataset> {
    // Mixin methods provided by default implementations
}
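
For readers who want to try the pattern outside the SDK, here is a standalone sketch of CRTP mixins. BaseEntity, Taggable, Ownable, and ToyDataset are hypothetical names invented for this example; state lives in the base class, and each default method casts `this` to the concrete type.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy base class; holds the state that mixin default methods operate on. */
abstract class BaseEntity {
    final List<String> tags = new ArrayList<>();
    final List<String> owners = new ArrayList<>();
}

/** CRTP mixin: T is the concrete entity type, so chaining keeps that type. */
interface Taggable<T extends BaseEntity & Taggable<T>> {
    @SuppressWarnings("unchecked")
    default T addTag(String tag) {
        T self = (T) this; // safe as long as each class passes itself as T
        self.tags.add(tag);
        return self;
    }
}

/** Second mixin with the same shape, showing that mixins compose freely. */
interface Ownable<T extends BaseEntity & Ownable<T>> {
    @SuppressWarnings("unchecked")
    default T addOwner(String owner) {
        T self = (T) this;
        self.owners.add(owner);
        return self;
    }
}

/** Concrete entity: picks up both mixins by naming itself as the type argument. */
final class ToyDataset extends BaseEntity implements Taggable<ToyDataset>, Ownable<ToyDataset> {
    String describe() {
        return "tags=" + tags + ", owners=" + owners;
    }
}
```

With these definitions, `new ToyDataset().addTag("pii").addOwner("urn:li:corpuser:jdoe").addTag("sensitive")` compiles and yields a ToyDataset at every step. The unchecked cast is the usual cost of CRTP in Java: the compiler cannot verify that T really is the implementing class.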

Part 2: Developer-Facing Implementation Design

This section describes the internal architecture and implementation details for developers contributing to the SDK.

Internal Architecture

Entity Base Class - Internal Implementation

File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines)

The Entity base class implements three core subsystems:

1. AspectCache System with Read-Your-Own-Writes

Unified Cache Architecture: The SDK uses a unified AspectCache that provides read-your-own-writes semantics with proper dirty tracking. This architecture fixes bugs where fetched aspects would override patches.

Core Implementation Files:

AspectCache.java (184 lines) - Main cache with dirty tracking
CachedAspect.java (68 lines) - Aspect wrapper with metadata
AspectSource.java (23 lines) - Enum for SERVER vs LOCAL aspects
ReadMode.java (28 lines) - Enum for ALLOW_DIRTY vs SERVER_ONLY reads

Key Architectural Features:

AspectSource Tracking: Distinguishes between SERVER-fetched aspects (subject to TTL) and LOCAL-created aspects (no expiration)
Dirty Tracking: Explicit marking of aspects that need write-back to server via markDirty() method
Read-Your-Own-Writes: The default ReadMode.ALLOW_DIRTY returns local modifications immediately; SERVER_ONLY mode skips dirty aspects
TTL Management: A 60-second TTL is enforced only for SERVER-sourced aspects; LOCAL aspects never expire
Thread Safety: Uses ConcurrentHashMap for safe concurrent access

Internal State (Entity.java):

protected final AspectCache cache;  // Unified cache with dirty tracking
protected final Map<String, List<MetadataChangeProposal>> pendingPatches;
private DataHubClientV2 boundClient = null;

Cache Operations:

getAspectLazy() - Lazy loads from server, stores as clean SERVER-sourced aspect
getOrCreateAspect() - Gets from cache or creates new LOCAL-sourced aspect (marked dirty)
markAspectDirty() - Marks aspect dirty after in-place modification (used by domain operations)
toMCPs() - Returns only dirty aspects for emission (excludes clean fetched aspects)

Why This Architecture?

The unified cache solves a critical bug: when entities are fetched from the server and then patch operations are applied (e.g., removeTerm()), the cached aspect would be included in toMCPs() and override the patches. With dirty tracking, toMCPs() only returns modified aspects, allowing patches to work correctly.
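
The sketch below models the cache rules described in this section. All class names are hypothetical, values are plain strings instead of RecordTemplate, and timestamps are passed in explicitly so the behavior is deterministic. One detail is an assumption of this sketch rather than something the design states: a SERVER-sourced aspect that has been marked dirty is kept past the TTL so the local edit is not lost.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

enum ToySource { SERVER, LOCAL }
enum ToyReadMode { ALLOW_DIRTY, SERVER_ONLY }

/** Toy cache entry: the aspect value plus its origin, store time, and dirty flag. */
final class ToyCachedAspect {
    final String value;
    final ToySource source;
    final long storedAtMs;
    volatile boolean dirty;

    ToyCachedAspect(String value, ToySource source, long storedAtMs, boolean dirty) {
        this.value = value;
        this.source = source;
        this.storedAtMs = storedAtMs;
        this.dirty = dirty;
    }
}

/** Toy unified cache with source tracking, dirty tracking, and a TTL. */
final class ToyAspectCache {
    private final Map<String, ToyCachedAspect> entries = new ConcurrentHashMap<>();
    private final long ttlMs;

    ToyAspectCache(long ttlMs) { this.ttlMs = ttlMs; }

    /** Server-fetched aspects start clean and are subject to the TTL. */
    void putFromServer(String name, String value, long nowMs) {
        entries.put(name, new ToyCachedAspect(value, ToySource.SERVER, nowMs, false));
    }

    /** Locally created aspects start dirty and never expire. */
    void putLocal(String name, String value, long nowMs) {
        entries.put(name, new ToyCachedAspect(value, ToySource.LOCAL, nowMs, true));
    }

    /** Flags an aspect for write-back after an in-place modification. */
    void markDirty(String name) {
        ToyCachedAspect entry = entries.get(name);
        if (entry != null) entry.dirty = true;
    }

    /** Returns the cached value, or null if absent, expired, or hidden by the read mode. */
    String get(String name, ToyReadMode mode, long nowMs) {
        ToyCachedAspect entry = entries.get(name);
        if (entry == null) return null;
        if (mode == ToyReadMode.SERVER_ONLY && entry.dirty) return null; // skip local edits
        // Assumption of this sketch: dirty entries are kept past the TTL.
        boolean expired = entry.source == ToySource.SERVER
            && !entry.dirty
            && nowMs - entry.storedAtMs > ttlMs;
        return expired ? null : entry.value;
    }

    /** Only dirty aspects are candidates for emission. */
    Map<String, String> dirtyAspects() {
        Map<String, String> out = new LinkedHashMap<>();
        entries.forEach((name, entry) -> { if (entry.dirty) out.put(name, entry.value); });
        return out;
    }
}
```

A clean server-fetched aspect never appears in dirtyAspects(), which is the property that keeps fetched aspects from overriding patches at emission time.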

2. Patch Accumulation and MCP Generation

Metadata operations create patches that accumulate until emission. The system supports two types of operations:

Patch-Based Operations (incremental updates):

Tags, owners, glossary terms use PatchBuilder classes
Patches accumulate in pendingPatches map (aspect name → list of patches)
Multiple operations on same aspect create multiple patches

Cache-Based Operations (full aspect replacement):

Domains, custom properties modify aspects in cache
Aspects marked dirty via markAspectDirty() after modification
Dirty aspects included in toMCPs() output
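
The patch-based path can be sketched with a map keyed by aspect name, mirroring the shape of pendingPatches. PatchAccumulator is a hypothetical toy: patches are plain strings here, whereas the SDK stores MetadataChangeProposal objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy accumulator: aspect name -> ordered list of pending patch operations. */
final class PatchAccumulator {
    private final Map<String, List<String>> pendingPatches = new LinkedHashMap<>();

    /** Records one patch; several patches may target the same aspect. */
    void record(String aspectName, String patchOp) {
        pendingPatches.computeIfAbsent(aspectName, k -> new ArrayList<>()).add(patchOp);
    }

    /** Total number of patches waiting to be emitted. */
    int pendingCount() {
        int count = 0;
        for (List<String> ops : pendingPatches.values()) {
            count += ops.size();
        }
        return count;
    }

    /** Returns every pending patch, grouped by aspect, and clears the buffer. */
    List<String> drain() {
        List<String> out = new ArrayList<>();
        pendingPatches.forEach((aspect, ops) -> ops.forEach(op -> out.add(aspect + ":" + op)));
        pendingPatches.clear();
        return out;
    }
}
```

Nothing is sent while patches are being recorded; a single drain() call at save time hands over the whole batch, which is the accumulate-then-emit flow described above.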

MCP Generation:

The toMCPs() method returns only dirty aspects and accumulated patches:

public List<MetadataChangeProposalWrapper> toMCPs() {
    List<MetadataChangeProposalWrapper> mcps = new ArrayList<>();

    // 1. Add dirty aspects from cache (excludes clean fetched aspects)
    for (Map.Entry<String, RecordTemplate> entry : cache.getDirtyAspects().entrySet()) {
        mcps.add(createMCP(entry.getKey(), entry.getValue()));
    }

    // 2. Add accumulated patches
    for (PatchBuilder builder : patchBuilders.values()) {
        mcps.add(builder.build());
    }

    // 3. Add pending MCPs
    mcps.addAll(pendingMCPs);

    return mcps;
}

Critical Design Point: toMCPs() uses cache.getDirtyAspects() instead of all cached aspects. This ensures that fetched aspects don't override patches - only locally modified aspects are emitted.

3. Mode-Aware Aspect Routing

SDK mode vs INGESTION mode for proper aspect selection:

/**
 * Get aspect name based on operation mode.
 * SDK mode: prefer editable aspects
 * INGESTION mode: use system aspects
 */
protected String getAspectName(Class<? extends RecordTemplate> aspectClass, OperationMode mode) {
    if (mode == OperationMode.SDK) {
        // Check if editable variant exists
        String editableAspectName = getEditableAspectName(aspectClass);
        if (editableAspectName != null) {
            return editableAspectName;
        }
    }
    return aspectClass.getSimpleName();
}

/**
 * Get getter preference order: editable aspects first, then system aspects.
 */
protected <T extends RecordTemplate> T getAspectWithPreference(
    Class<T> editableClass,
    Class<T> systemClass
) {
    // Try editable aspect first
    T editable = getAspectLazy(editableClass);
    if (editable != null) {
        return editable;
    }

    // Fall back to system aspect
    return getAspectLazy(systemClass);
}
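
The routing rule can be exercised without the SDK. In the sketch below, AspectRouter and ToyOperationMode are hypothetical names, aspects are identified by plain strings rather than RecordTemplate classes, and the only editable mapping registered is the datasetProperties pair mentioned earlier in this document.

```java
import java.util.HashMap;
import java.util.Map;

enum ToyOperationMode { SDK, INGESTION }

/** Toy router: picks the aspect to write to, and the aspect to read from. */
final class AspectRouter {
    // System aspect name -> editable variant, for aspects that have one.
    private static final Map<String, String> EDITABLE_VARIANTS = new HashMap<>();
    static {
        EDITABLE_VARIANTS.put("datasetProperties", "editableDatasetProperties");
    }

    /** SDK mode writes to the editable variant when one exists; INGESTION mode never does. */
    static String writeTarget(String systemAspect, ToyOperationMode mode) {
        if (mode == ToyOperationMode.SDK) {
            return EDITABLE_VARIANTS.getOrDefault(systemAspect, systemAspect);
        }
        return systemAspect;
    }

    /** Getter preference: the editable value wins when present, else the system value. */
    static String readWithPreference(Map<String, String> aspects, String systemAspect) {
        String editableName = EDITABLE_VARIANTS.get(systemAspect);
        String editable = editableName == null ? null : aspects.get(editableName);
        return editable != null ? editable : aspects.get(systemAspect);
    }
}
```

Reads do not depend on the mode in this sketch: a user edit stored under the editable aspect always takes precedence over the pipeline-written value.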

## Implementation Phases

### Phase 1: Core Framework

Base functionality for all entities:

- Base `Entity` class with aspect management, lazy loading, and patch accumulation
- `DataHubClientV2` main client class with mode-aware behavior
- `EntityClient` with create, read, update, upsert operations
- Configuration classes with environment variable support
- Mixin interfaces using CRTP pattern for type safety

### Phase 2: Dataset Entity

Reference implementation demonstrating all patterns:

- `Dataset` entity with fluent builder
- Dataset-specific aspects (properties, schema, lineage)
- Mixin interface implementations
- Comprehensive unit tests

### Phase 3: Additional Entities

Seven additional entity types:

- `Chart` - Visualizations with lineage
- `Dashboard` - Dashboards with chart relationships
- `Container` - Hierarchical data structures
- `DataJob` - Pipeline tasks with inlet/outlet lineage
- `DataFlow` - Pipeline workflows
- `MLModel` - Machine learning models
- `MLModelGroup` - ML model families

### Phase 4: Patch Capabilities

Patch-based updates for efficient metadata changes:

- Internal patch accumulation within entities (not separate patch builders)
- Automatic patch emission on `update()` and `upsert()`
- Leverages existing `PatchBuilder` classes from entity-registry module
- Patches tested via entity unit tests

### Phase 5: Testing & Documentation

Comprehensive validation and user guides:

- Integration tests with live DataHub server
- API documentation (Javadoc) and 13 comprehensive Markdown guides
- 19 working example files demonstrating real-world usage
- Migration guide from V1
- Design principles document
- Patch operations deep-dive
- Entity-specific guides for all 8 entities

## Testing Strategy

### Unit Tests

Each entity and component has comprehensive unit tests:

- Builder validation (required fields, optional fields, validation logic)
- Aspect management (getters, setters, mode-aware routing)
- MCP generation (full aspects + patches)
- Patch operations (accumulation, emission)
- Fluent API chaining (type safety via CRTP)
- Mixin operations (tags, owners, terms, domains)

**Test Coverage by Entity:**
- Dataset: 37 tests
- Chart: 43 tests
- Dashboard: 52 tests
- DataJob: 45 tests
- DataFlow: 40 tests
- Container: 40 tests
- MLModel: 44 tests
- MLModelGroup: 38 tests

### Integration Tests

Full end-to-end tests against a real DataHub instance:

```java
@Test
public void testDatasetCreateAndRead() throws Exception {
    // Create client
    DataHubClientV2 client = DataHubClientV2.builder()
        .server(TEST_SERVER)
        .token(TEST_TOKEN)
        .build();

    // Create dataset
    Dataset dataset = Dataset.builder()
        .platform("snowflake")
        .name("db.schema.test_table_" + System.currentTimeMillis())
        .env("PROD")
        .description("Test dataset created by Java SDK V2")
        .build();

    dataset.addTag("test-tag")
           .addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER);

    // Upsert
    client.entities().upsert(dataset);

    // Read back
    Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
    assertNotNull(retrieved);
    assertEquals("Test dataset created by Java SDK V2", retrieved.getDescription());
}

@Test
public void testDatasetPatchOperations() throws Exception {
    DataHubClientV2 client = DataHubClientV2.builder()
        .server(TEST_SERVER)
        .token(TEST_TOKEN)
        .build();

    // Create dataset first
    Dataset dataset = Dataset.builder()
        .platform("snowflake")
        .name("db.schema.test_table_patch_" + System.currentTimeMillis())
        .env("PROD")
        .build();
    client.entities().upsert(dataset);

    // Retrieve and apply patches
    Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
    Dataset mutable = retrieved.mutable();  // Get mutable copy
    mutable.addTag("pii")                    // Creates patch
           .addTag("sensitive")              // Another patch
           .addTerm("urn:li:glossaryTerm:CustomerData");  // Another patch

    // All patches emitted atomically
    client.entities().update(mutable);

    // Verify patches were applied
    Dataset verified = client.entities().get(dataset.getUrn(), Dataset.class);
    assertTrue(verified.getTags().contains("urn:li:tag:pii"));
}
```

Integration Test Coverage:

Entity creation and retrieval
Tag, owner, term, domain operations
Lineage relationships (charts → datasets, jobs → datasets)
Custom properties
Full metadata workflows
Batch operations
Patch accumulation and emission

Running Integration Tests:

export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your_token

./gradlew :metadata-integration:java:datahub-client:test --tests "*Integration*"

Test Coverage Results

Unit test coverage: >80% for new code (378 unit tests + 79 integration tests = 457 total)
All public APIs covered
Edge cases tested (null values, invalid inputs, mode switching)
Async operations tested with proper synchronization
Cache infrastructure thoroughly tested (43 tests for AspectCache + CachedAspect)
Full end-to-end integration tests (79 tests)

API Documentation

All public classes and methods have comprehensive Javadoc plus extensive Markdown documentation:

Javadoc Coverage:

Class-level documentation explaining purpose and usage
Method-level documentation with parameters, returns, exceptions
Code examples for common use cases
Links to related classes and methods

Markdown Documentation (13 files):

Located in metadata-integration/java/docs/sdk-v2/:

getting-started.md - Quick start guide for new users
design-principles.md - Architecture and design decisions
dataset-entity.md - Dataset operations and schema support
chart-entity.md - Chart operations and lineage
dashboard-entity.md - Dashboard operations and relationships
container-entity.md - Container hierarchies
dataflow-entity.md - DataFlow pipeline operations
datajob-entity.md - DataJob inlet/outlet lineage
mlmodel-entity.md - MLModel metrics and hyperparameters
mlmodelgroup-entity.md - MLModelGroup version management
patch-operations.md - Deep dive into patch-based updates
migration-from-v1.md - Migration guide from V1 SDK
java-sdk-v2-design.md - This comprehensive design document

Working Examples (19 files):

Located in metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/:

Dataset examples: DatasetCreateExample, DatasetFullExample, DatasetPatchExample
Chart examples: ChartCreateExample, ChartFullExample, ChartLineageExample
Dashboard examples: DashboardCreateExample, DashboardFullExample, DashboardLineageExample
DataFlow examples: DataFlowCreateExample, DataFlowFullExample
DataJob examples: DataJobCreateExample, DataJobFullExample, DataJobLineageExample
Container examples: ContainerCreateExample, ContainerFullExample, ContainerHierarchyExample
MLModel examples: MLModelCreateExample, MLModelFullExample
MLModelGroup examples: MLModelGroupCreateExample, MLModelGroupFullExample

Migration Guide

For users of the existing Java SDK:

Before (V1):

RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"));

DatasetUrn urn = new DatasetUrn(
    new DataPlatformUrn("postgres"),
    "my_table",
    FabricType.PROD
);

DatasetProperties props = new DatasetProperties();
props.setDescription("My dataset");

MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn(urn)
    .upsert()
    .aspect(props)
    .build();

emitter.emit(mcpw).get();

After (V2):

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("my_table")
    .description("My dataset")
    .build();

client.entities().upsert(dataset);

Decision Log

1. Use Pegasus Models vs OpenAPI Models

Decision: Use Pegasus models (com.linkedin.*) for aspect classes.

Rationale:

Pegasus models are the canonical representation in DataHub
Already used by v1 SDK, maintains consistency
Generated from PDL schemas, always in sync with backend
OpenAPI models are less mature and have fewer utilities

Result: Proven correct - seamless integration with existing infrastructure.

2. Namespace Separation

Decision: Use datahub.client.v2.* namespace.

Rationale:

Clear separation from v1 API
Allows side-by-side usage
Follows semantic versioning principles
Easy to deprecate v1 in future

Result: 100% backward compatibility achieved - v1 code unchanged.

3. Builder Pattern

Decision: Use nested static Builder classes.

Rationale:

Idiomatic Java pattern
Type-safe construction
Optional parameters handled cleanly
Better than telescoping constructors

Result: Excellent developer experience with fluent API.
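
To make the nested static Builder decision concrete, here is a minimal sketch of how such a builder can be structured. DatasetSpecSketch and its fields are hypothetical stand-ins, not the SDK's Dataset class, and defaulting the environment to PROD is an assumption made only for this illustration.

```java
/** Illustrative only: a nested static Builder for a hypothetical value class. */
final class DatasetSpecSketch {
    private final String platform;
    private final String name;
    private final String env;
    private final String description;

    private DatasetSpecSketch(Builder b) {
        this.platform = b.platform;
        this.name = b.name;
        this.env = b.env;
        this.description = b.description;
    }

    static Builder builder() {
        return new Builder();
    }

    /** Renders a dataset URN from the collected fields. */
    String urn() {
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }

    String description() {
        return description;
    }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD";   // assumed default, for illustration only
        private String description;    // optional, may stay null

        Builder platform(String value) { this.platform = value; return this; }
        Builder name(String value) { this.name = value; return this; }
        Builder env(String value) { this.env = value; return this; }
        Builder description(String value) { this.description = value; return this; }

        /** Validates required fields once, instead of in many constructor overloads. */
        DatasetSpecSketch build() {
            if (platform == null || name == null) {
                throw new IllegalStateException("platform and name are required");
            }
            return new DatasetSpecSketch(this);
        }
    }
}
```

Optional fields are simply left unset by the caller, which is what removes the need for telescoping constructors.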

4. Synchronous vs Async

Decision: Provide synchronous API that wraps async operations.

Rationale:

Simpler for most users
Matches Python SDK V2 API
Can expose async API later for advanced users
RestEmitter already provides async primitives

Result: Simplified API widely adopted in examples and tests.
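
The idea of a synchronous API wrapping async primitives can be sketched as follows. This is not the SDK's implementation: SyncFacadeSketch, emitAsync, and the 30-second timeout are all assumptions chosen for illustration, and the async call is simulated rather than performing real I/O. The wrapper also shows one way failures could surface as a checked exception, in line with decision 5 below.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only: a blocking facade over an async call. All names are hypothetical. */
final class SyncFacadeSketch {

    /** Stand-in for an async emitter call; a real one would perform HTTP I/O. */
    static CompletableFuture<String> emitAsync(String urn) {
        if (urn == null || urn.isEmpty()) {
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(new IllegalArgumentException("empty urn"));
            return failed;
        }
        return CompletableFuture.supplyAsync(() -> "accepted:" + urn);
    }

    /** Synchronous wrapper: waits for the future and maps failures to a checked exception. */
    static String emit(String urn) throws IOException {
        try {
            return emitAsync(urn).get(30, TimeUnit.SECONDS);  // assumed timeout
        } catch (ExecutionException e) {
            throw new IOException("emit failed for " + urn, e.getCause());
        } catch (TimeoutException e) {
            throw new IOException("emit timed out for " + urn, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
            throw new IOException("emit interrupted for " + urn, e);
        }
    }
}
```

Callers that later need non-blocking behavior could be given the future directly, which is the "expose async API later" option listed in the rationale.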

5. Error Handling

Decision: Throw checked exceptions for I/O operations.

Rationale:

Forces callers to handle errors
Consistent with Java conventions
Clear distinction between programmer errors and runtime failures

Result: Clear error handling patterns in all code.

Exception Hierarchy:

The SDK introduces custom exceptions for common error conditions:

ReadOnlyEntityException - Thrown when attempting to mutate a read-only entity:

// Declared outside the try block so it is still in scope in the catch block
Dataset dataset = client.entities().get(urn, Dataset.class);
try {
  dataset.addTag("pii");  // Throws ReadOnlyEntityException
} catch (ReadOnlyEntityException e) {
  // Exception message explains the issue and provides fix
  System.err.println(e.getMessage());

  // Fix: Get mutable copy first
  Dataset mutable = dataset.mutable();
  mutable.addTag("pii");
  client.entities().upsert(mutable);
}

PendingMutationsException - Thrown when reading from an entity with pending mutations:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("New description");
// dataset.getDescription();  // Throws PendingMutationsException!

// Fix: Save first, then read
client.entities().upsert(dataset);  // Clears dirty flag
String desc = dataset.getDescription();  // Now works

Why these restrictions?

ReadOnlyEntityException: Makes mutations explicit, prevents accidental changes when passing entities between functions
PendingMutationsException: Prevents reading stale cached data, enforces explicit save-then-fetch workflow

Both restrictions enforce clear separation between read and write workflows. These may be relaxed in future versions as the API matures and usage patterns emerge.
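
The two restrictions can be modeled with a read-only flag and a dirty flag. The sketch below is a toy model under stated assumptions: GuardedEntitySketch and markSaved are hypothetical names, and IllegalStateException stands in for ReadOnlyEntityException and PendingMutationsException, whose actual base classes are not specified here.

```java
/** Illustrative only: a toy entity that enforces both restrictions. Names are hypothetical. */
final class GuardedEntitySketch {
    private final boolean readOnly;
    private boolean dirty;
    private String description;

    private GuardedEntitySketch(String description, boolean readOnly) {
        this.description = description;
        this.readOnly = readOnly;
    }

    /** Simulates an entity returned by a fetch: read-only until mutable() is called. */
    static GuardedEntitySketch fetched(String description) {
        return new GuardedEntitySketch(description, true);
    }

    /** Returns a writable copy; the original stays read-only. */
    GuardedEntitySketch mutable() {
        return new GuardedEntitySketch(description, false);
    }

    GuardedEntitySketch setDescription(String value) {
        if (readOnly) {
            // Stand-in for ReadOnlyEntityException
            throw new IllegalStateException("read-only entity: call mutable() first");
        }
        this.description = value;
        this.dirty = true;
        return this;
    }

    String getDescription() {
        if (dirty) {
            // Stand-in for PendingMutationsException
            throw new IllegalStateException("pending mutations: save before reading");
        }
        return description;
    }

    /** Stand-in for a successful upsert, which clears the dirty flag. */
    void markSaved() {
        this.dirty = false;
    }
}
```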

6. Patch-First over Full Aspect Replacement

Decision: Prioritize patch-based operations as the primary API, defer full aspect replacement to V1 SDK.

Rationale:

User mental model: "Add a tag" is more natural than "fetch all tags, modify list, PUT entire aspect"
Safety: Patches don't clobber concurrent changes from other users/systems
Simplicity: Most metadata operations are incremental (add owner, remove tag, etc.)
Efficiency: Only changed fields transmitted and processed by server
Escape hatch exists: Users needing full PUT semantics can use V1 SDK's RestEmitter directly

Why not both? V2 SDK focuses on making common operations simple, not exposing every low-level primitive. This keeps the API focused and prevents confusion about when to use patches vs full replacement.

Result: Clean, intuitive API for 95% of use cases. Power users can drop to the V1 SDK for the remaining 5%.
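
The safety argument in the rationale (patches do not clobber concurrent changes) can be illustrated with a small in-memory simulation. This is only a sketch: TagStoreSketch stands in for a server-side tags aspect, and both class names are hypothetical. No DataHub API is called.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only: in-memory stand-in for a server-side tags aspect. */
final class TagStoreSketch {
    private final Set<String> tags = new LinkedHashSet<>();

    /** Full replacement: the caller's copy overwrites whatever is stored. */
    void replaceAll(Set<String> newTags) {
        tags.clear();
        tags.addAll(newTags);
    }

    /** Patch: only the named tag is added; other entries are untouched. */
    void patchAdd(String tag) {
        tags.add(tag);
    }

    Set<String> snapshot() {
        return new LinkedHashSet<>(tags);
    }
}

/** Two writers touch the same aspect; writer A works from a stale read. */
final class PatchVsReplaceDemo {

    static Set<String> withFullReplace() {
        TagStoreSketch store = new TagStoreSketch();
        store.patchAdd("legacy");
        Set<String> staleCopy = store.snapshot();  // writer A reads
        store.patchAdd("sensitive");               // writer B adds a tag in the meantime
        staleCopy.add("pii");
        store.replaceAll(staleCopy);               // writer A overwrites: "sensitive" is lost
        return store.snapshot();
    }

    static Set<String> withPatch() {
        TagStoreSketch store = new TagStoreSketch();
        store.patchAdd("legacy");
        store.patchAdd("sensitive");               // writer B
        store.patchAdd("pii");                     // writer A sends only its own change
        return store.snapshot();
    }
}
```

With full replacement, writer B's tag disappears; with patches, all three tags survive because each writer transmits only its own change.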

7. Internal Patch Accumulation vs External Patch Builders

Decision: Accumulate patches inside entities rather than separate patch builder classes.

Rationale:

More intuitive API - metadata operations just work
Patches automatically emitted on save
Reduces API surface area
Simplifies user code

Original Design: Separate DatasetPatch, ChartPatch builder classes

Actual Implementation: Patches accumulate in Entity.pendingPatches and emit via toMCPs()

Result: Superior developer experience - no need to learn separate patch API.
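
A rough sketch of entity-side accumulation follows. It borrows the names pendingPatches and toMCPs() from the description above, but everything else is an assumption: patches and change proposals are modeled as plain strings, and draining the list inside toMCPs() is a simplification that may not match the real Entity class.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: simplified stand-in for entity-side patch accumulation. */
final class AccumulatingEntitySketch {
    private final String urn;
    // Mirrors the idea of Entity.pendingPatches; here a patch is just a string.
    private final List<String> pendingPatches = new ArrayList<>();

    AccumulatingEntitySketch(String urn) {
        this.urn = urn;
    }

    AccumulatingEntitySketch addTag(String tagUrn) {
        pendingPatches.add("ADD tag " + tagUrn);
        return this;
    }

    AccumulatingEntitySketch removeTag(String tagUrn) {
        pendingPatches.add("REMOVE tag " + tagUrn);
        return this;
    }

    /** Converts accumulated patches to change proposals in call order, then clears them. */
    List<String> toMCPs() {
        List<String> proposals = new ArrayList<>();
        for (String patch : pendingPatches) {
            proposals.add(urn + " :: " + patch);
        }
        pendingPatches.clear();  // assumed behavior, for illustration only
        return proposals;
    }
}
```

User code only calls addTag or removeTag; the save path is the only place that needs to know patches exist.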

8. CRTP Pattern for Mixin Interfaces

Decision: Use Curiously Recurring Template Pattern for type-safe mixin interfaces.

Rationale:

Type-safe method chaining returns concrete entity type
Compile-time type checking
No casting required in user code
Idiomatic Java generics pattern

Original Design: Simple interfaces returning Entity

Actual Implementation:

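public interface HasTags<T extends Entity & HasTags<T>> {
    // Simplified: patch recording omitted; the cast is safe when each entity passes itself as T
    @SuppressWarnings("unchecked")
    default T addTag(String tagUrn) { return (T) this; }
}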

Result: Excellent type safety and developer experience.
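
A self-contained version of the pattern shows why chaining needs no casts in user code. The classes below are hypothetical stand-ins (EntitySketch, HasTagsSketch, DatasetSketch), not the SDK's types, and the tag list is a simplification of whatever the real mixin records.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: minimal base class holding state that the mixin writes to. */
abstract class EntitySketch {
    final List<String> tags = new ArrayList<>();
}

/** CRTP mixin: T is the concrete entity type, so addTag returns that type. */
interface HasTagsSketch<T extends EntitySketch & HasTagsSketch<T>> {
    @SuppressWarnings("unchecked")
    default T addTag(String tagUrn) {
        T self = (T) this;  // safe as long as each class passes itself as T
        self.tags.add(tagUrn);
        return self;
    }
}

/** Concrete entity: binds T to itself. */
final class DatasetSketch extends EntitySketch implements HasTagsSketch<DatasetSketch> {
    String description;

    DatasetSketch setDescription(String value) {
        this.description = value;
        return this;
    }
}
```

Because addTag returns DatasetSketch rather than EntitySketch, a call such as new DatasetSketch().addTag("urn:li:tag:pii").setDescription("orders") compiles without a cast.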

9. Mode-Aware Behavior (SDK vs INGESTION)

Decision: Support SDK mode and INGESTION mode for aspect routing.

Rationale:

Proper separation of user edits vs pipeline writes
SDK mode → editable aspects (user overrides)
INGESTION mode → system aspects (pipeline data)
Getters prefer editable over system

Original Design: Not specified

Actual Implementation: OperationMode enum with aspect routing logic

Result: Clear separation of concerns, aligns with DataHub's aspect model.
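
The routing rule can be sketched for a single field. This is an assumption-laden illustration: OperationModeSketch and ModeRoutingSketch are hypothetical, aspects are stored as plain strings, and only the description field is shown, using DataHub's datasetProperties and editableDatasetProperties aspect names.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: stand-in for the SDK's OperationMode enum. */
enum OperationModeSketch { SDK, INGESTION }

/** Routes description writes by mode; reads prefer the editable aspect. */
final class ModeRoutingSketch {
    private final Map<String, String> aspects = new HashMap<>();

    /** SDK mode targets the editable aspect, INGESTION mode the system aspect. */
    static String descriptionAspect(OperationModeSketch mode) {
        return mode == OperationModeSketch.SDK
            ? "editableDatasetProperties"
            : "datasetProperties";
    }

    void setDescription(OperationModeSketch mode, String value) {
        aspects.put(descriptionAspect(mode), value);
    }

    /** Getter prefers the user-edited value and falls back to the pipeline value. */
    String getDescription() {
        String editable = aspects.get("editableDatasetProperties");
        return editable != null ? editable : aspects.get("datasetProperties");
    }
}
```

A pipeline write never overwrites a user edit in this model, because the two modes write to different aspects.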

10. Lazy Loading for GET Operations

Decision: Implement lazy loading for aspects when entities are retrieved.

Rationale:

Performance - only fetch aspects when accessed
Client binding enables on-demand fetching
Cache management with timestamps

Original Design: Not specified (GET deferred)

Actual Implementation: Full lazy loading with getAspectLazy() and client binding

Result: Efficient entity retrieval with on-demand aspect fetching.
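
On-demand fetching with a timestamped cache can be sketched as below. The method name getAspectLazy comes from the description above, but its signature here, the Function-based fetcher that stands in for the bound client, and the string-valued aspects are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: fetch-on-first-access with a timestamped cache entry. */
final class LazyAspectsSketch {

    private static final class Cached {
        final String value;
        final long fetchedAtMillis;  // kept so a caller could apply a staleness policy

        Cached(String value, long fetchedAtMillis) {
            this.value = value;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    private final Function<String, String> fetcher;  // stands in for the bound client
    private final Map<String, Cached> cache = new HashMap<>();
    private int fetchCount;

    LazyAspectsSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Fetches the aspect on first access and serves later reads from the cache. */
    String getAspectLazy(String aspectName) {
        Cached hit = cache.get(aspectName);
        if (hit == null) {
            fetchCount++;
            hit = new Cached(fetcher.apply(aspectName), System.currentTimeMillis());
            cache.put(aspectName, hit);
        }
        return hit.value;
    }

    int fetchCount() {
        return fetchCount;
    }
}
```

Aspects that are never read are never fetched, which is the performance benefit named in the rationale.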

Design Questions and Resolutions

GET operation implementation: Should we implement REST client for reading entities, or defer to future?
- Resolution: Implemented with lazy loading support
Search client: Should we include search functionality in V2?
- Resolution: Deferred to future (out of scope for V2)
Lineage client: Should we include lineage management?
- Resolution: Basic lineage on Dataset, Chart, Dashboard, DataJob entities
Schema field builders: Should we provide fluent builders for schema fields?
- Resolution: Yes, schema field support in Dataset entity

References

Quick Links for Reviewers

Start Here:

metadata-integration/java/docs/sdk-v2/getting-started.md - Quick start guide
metadata-integration/java/docs/sdk-v2/design-principles.md - Architecture overview

Core Implementation:

metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines) - Base entity class
metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java (570 lines) - CRUD operations
metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java (266 lines) - Main client

Sample Entities:

metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java (564 lines) - Reference implementation
metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/HasTags.java (145 lines) - CRTP mixin example

Examples:

metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java - Complete workflow
metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/ChartLineageExample.java - Lineage relationships

Tests:

metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/entity/DatasetTest.java (37 unit tests)
metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/integration/DatasetIntegrationTest.java - End-to-end validation

Document Status: Design document reflecting implemented architecture (includes AspectCache refactoring)
Author: DataHub OSS Team
Last Updated: 2025-01-06
