DataHub Java SDK V2 Design Document
Executive Summary
This document describes the design of DataHub Java SDK V2, a modern, user-friendly Java client library that provides feature parity with the Python SDK V2. The new SDK addresses feedback from enterprise Java customers who require a first-class SDK experience comparable to Python developers.
This document is organized into two main sections:
- Part 1 - User-Facing API Design: The public API, patterns, and behaviors visible to SDK users
- Part 2 - Developer-Facing Implementation: Internal architecture and implementation details for contributors
Why Hand-Crafted? For a deep dive into why we chose to hand-craft this SDK instead of using OpenAPI code generation, see Java SDK V2 Philosophy.
Background
Problem Statement
Currently, DataHub's Java SDK (datahub-client) provides only low-level emission capabilities:
- Manual MCP (Metadata Change Proposal) construction required
- No high-level entity builders for Dataset, Chart, Dashboard, etc.
- No client for CRUD operations (read, update, delete)
- No patch capabilities for granular updates
- Significantly inferior developer experience compared to Python SDK V2
This gap has created issues with enterprise customers, particularly Java shops who feel like "second-class citizens" when compared to Python developers.
Goals
- Feature Parity: Match Python SDK V2 capabilities for entity management
- Backward Compatibility: Maintain 100% compatibility with existing Java SDK
- Namespace Separation: Use
datahub.client.v2.*namespace for new APIs - Builder Pattern: Fluent, type-safe API for entity construction
- Patch Support: Granular updates without full entity replacement
- CRUD Operations: Support create, read, update, upsert operations (delete/exists deferred)
- Comprehensive Testing: Unit and integration tests validating all functionality
Non-Goals
- Rewriting existing emitter infrastructure (leverage existing)
- 100% feature parity with Python SDK (focus on core entities first)
- GraphQL client implementation (focus on REST/OpenAPI)
- Search client (future enhancement)
- Lineage client (future enhancement)
Part 1: User-Facing API Design
This section describes the public API that SDK users interact with - the patterns, behaviors, and interfaces that define the developer experience.
Design Principles
1. Fluent Builder Pattern
Intuitive entity construction through method chaining:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.env("PROD")
.description("My dataset")
.build();
// Fluent metadata operations with type-safe method chaining
dataset.addTag("pii")
.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER)
.setDomain("urn:li:domain:Analytics")
.setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5);
client.entities().upsert(dataset);
2. Type Safety and Compile-Time Checking
Leverage Java's strong typing:
- Strongly-typed URNs (
DatasetUrn,ChartUrn, etc.) - Generic types for entity operations
- CRTP (Curiously Recurring Template Pattern) for type-safe mixin interfaces
- Builder validation at construction time
3. Mode-Aware Behavior
SDK Mode vs INGESTION Mode for proper separation of concerns:
- SDK Mode (default): User edits →
editableDatasetProperties - INGESTION Mode: Pipeline writes →
datasetProperties - Getters intelligently prefer editable aspects over system aspects
// SDK mode - user edits go to editable aspects
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.mode(OperationMode.SDK) // Default
.build();
// INGESTION mode - pipeline writes go to system aspects
DataHubClientV2 ingestionClient = DataHubClientV2.builder()
.server("http://localhost:8080")
.mode(OperationMode.INGESTION)
.build();
4. Patch-First Philosophy
Design Decision: Prioritize patches over full aspect replacement
The SDK V2 is designed around patch-based operations because they represent the most common and intuitive way to make metadata changes:
Dataset dataset = client.entities().get(datasetUrn);
Dataset mutable = dataset.mutable(); // Get mutable copy
// These create patches internally - no server calls yet
mutable.addTag("pii")
.addTag("sensitive")
.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER);
// Single call emits all accumulated patches atomically
client.entities().update(mutable);
Why patches?
- Simplicity: Users think "add a tag" not "fetch all tags, add one, PUT entire tag aspect back"
- Safety: Patches don't overwrite concurrent changes from other users
- Efficiency: Only changed fields are transmitted and processed
- Common use case: Most metadata operations are incremental additions/removals
When to use low-level SDK:
If you need to completely replace an aspect (full PUT/upsert semantics), use the V1 SDK's RestEmitter directly with MetadataChangeProposalWrapper. The V2 SDK focuses on making common operations simple, not exposing every low-level primitive.
5. Composition Through Mixin Interfaces
Shared metadata operations via type-safe mixin interfaces:
HasTags<T>- Add, remove, set tagsHasOwners<T>- Manage ownershipHasGlossaryTerms<T>- Associate glossary termsDomainOperations<T>- Domain assignmentHasContainer<T>- Parent-child hierarchies
All mixins use CRTP pattern for type-safe method chaining that returns the concrete entity type.
Architecture
Package Structure (Actual Implementation)
datahub-client/
├── src/main/java/
│ ├── datahub/client/ # Existing v1 (unchanged)
│ │ ├── Emitter.java
│ │ ├── rest/RestEmitter.java
│ │ └── ...
│ │
│ └── datahub/client/v2/ # New v2 namespace
│ ├── DataHubClientV2.java # Main client entry point
│ │
│ ├── entity/ # Entity classes
│ │ ├── Entity.java # Base entity class (490 lines)
│ │ ├── AspectCache.java # Unified cache with dirty tracking (184 lines)
│ │ ├── CachedAspect.java # Aspect wrapper with metadata (68 lines)
│ │ ├── AspectSource.java # SERVER vs LOCAL enum (23 lines)
│ │ ├── ReadMode.java # ALLOW_DIRTY vs SERVER_ONLY (28 lines)
│ │ ├── Dataset.java # Dataset entity (564 lines)
│ │ ├── Chart.java # Chart entity (587 lines)
│ │ ├── Dashboard.java # Dashboard entity (671 lines)
│ │ ├── DataJob.java # DataJob entity (597 lines)
│ │ ├── DataFlow.java # DataFlow entity (467 lines)
│ │ ├── Container.java # Container entity (500 lines)
│ │ ├── MLModel.java # ML Model entity NEW
│ │ ├── MLModelGroup.java # ML Model Group entity NEW
│ │ ├── HasTags.java # Tag operations mixin
│ │ ├── HasOwners.java # Ownership operations mixin
│ │ ├── HasGlossaryTerms.java # Terms operations mixin
│ │ ├── HasDomains.java # Domain operations mixin
│ │ ├── HasContainer.java # Container hierarchy mixin
│ │ └── HasStructuredProperties.java # Structured properties mixin
│ │
│ ├── operations/ # CRUD operation clients
│ │ └── EntityClient.java # Entity CRUD operations (570 lines)
│ │
│ └── config/ # Configuration
│ └── DataHubClientConfigV2.java # Config with mode support
│
└── src/test/java/ # Tests mirror structure
└── datahub/client/v2/
├── DataHubClientV2Test.java # Client tests
├── entity/ # 378 unit tests
│ ├── AspectCacheTest.java # 30 tests (cache infrastructure)
│ ├── CachedAspectTest.java # 13 tests (cache infrastructure)
│ ├── DatasetTest.java # 37 tests
│ ├── ChartTest.java # 43 tests
│ ├── DashboardTest.java # 52 tests
│ ├── DataJobTest.java # 45 tests
│ ├── DataFlowTest.java # 40 tests
│ ├── ContainerTest.java # 40 tests
│ ├── MLModelTest.java # 44 tests
│ └── MLModelGroupTest.java # 38 tests
└── integration/ # 79 integration tests
├── DatasetIntegrationTest.java
├── ChartIntegrationTest.java
├── DashboardIntegrationTest.java
├── DataJobIntegrationTest.java
├── DataFlowIntegrationTest.java
├── ContainerIntegrationTest.java
├── MLModelIntegrationTest.java
└── MLModelGroupIntegrationTest.java
Key Design Decisions:
- No separate
patch/package - patches accumulate internally within entities - Mixin interfaces in
entity/package using CRTP pattern for type safety - Support for 8 entity types including ML entities (MLModel, MLModelGroup)
- Mode-aware configuration for SDK vs INGESTION behavior
Core Classes
1. DataHubClientV2 (Main Entry Point)
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java (266 lines)
package datahub.client.v2;
/**
* Main entry point for DataHub Java SDK V2.
* Provides high-level operations for entity management with mode-aware behavior.
*
* <p>Example usage:
* <pre>
* DataHubClientV2 client = DataHubClientV2.builder()
* .server("http://localhost:8080")
* .token("my-token")
* .mode(OperationMode.SDK) // SDK or INGESTION mode
* .build();
*
* Dataset dataset = Dataset.builder()
* .platform("snowflake")
* .name("my_table")
* .env("PROD")
* .description("My dataset")
* .build();
*
* client.entities().upsert(dataset);
* </pre>
*/
public class DataHubClientV2 implements AutoCloseable {
private final RestEmitter emitter;
private final DataHubClientConfigV2 config;
private final EntityClient entityClient;
// Builder for client configuration
public static Builder builder() { ... }
// Entity operations
public EntityClient entities() { return entityClient; }
// Low-level emitter access (for advanced users)
public RestEmitter emitter() { return emitter; }
// Configuration access
public DataHubClientConfigV2 config() { return config; }
@Override
public void close() throws IOException { ... }
public static class Builder {
public Builder server(String serverUrl) { ... }
public Builder token(String token) { ... }
public Builder timeout(int timeoutMs) { ... }
public Builder mode(OperationMode mode) { ... } // NEW
public Builder config(DataHubClientConfigV2 config) { ... }
public DataHubClientV2 build() { ... }
}
}
Design Features:
- Mode-aware behavior (SDK vs INGESTION) for proper aspect routing
- Environment variable support for configuration
- Builder pattern with sensible defaults
- AutoCloseable interface for resource management
2. Entity (Base Class) - User-Facing API
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines)
The Entity base class provides a unified interface for all DataHub entities. From a user perspective, all entities support:
Public API Methods:
// URN access
public Urn getUrn()
public abstract String getEntityType()
// Convert to MCPs for emission (primarily internal)
public List<MetadataChangeProposalWrapper> toMCPs()
Entity Construction:
Entities are constructed via fluent builders:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.env("PROD")
.description("My dataset")
.build();
Fluent Metadata Operations:
All entities support method chaining for metadata operations (via mixin interfaces):
dataset.addTag("pii")
.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
.setDomain(domainUrn)
.addTerm(termUrn);
Lazy Loading:
Entities loaded from the server fetch aspects on-demand:
Dataset dataset = client.entities().get(datasetUrn); // Only URN loaded
String description = dataset.getDescription(); // Aspect fetched now
List<String> tags = dataset.getTags(); // Another aspect fetch
Patch Accumulation:
Metadata operations create patches that accumulate until save:
Dataset dataset = client.entities().get(datasetUrn);
Dataset mutable = dataset.mutable(); // Get mutable copy
mutable.addTag("pii"); // Creates patch (not sent yet)
mutable.addTag("sensitive"); // Another patch (not sent yet)
client.entities().update(mutable); // Emits all patches atomically
Immutability-by-Default:
Entities fetched from the server are read-only to prevent accidental mutations:
Dataset dataset = client.entities().get(datasetUrn);
dataset.isReadOnly(); // true
dataset.isMutable(); // false
// Attempting mutation throws ReadOnlyEntityException
// dataset.addTag("pii"); // ERROR!
// Get mutable copy for updates
Dataset mutable = dataset.mutable();
mutable.isMutable(); // true
mutable.addTag("pii"); // Works
client.entities().upsert(mutable);
Entity Lifecycle:
Builder-created entities - Mutable from creation
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.isMutable(); // true - can mutate immediatelyServer-fetched entities - Immutable by default
Dataset dataset = client.entities().get(urn);
dataset.isReadOnly(); // true - must call .mutable()Mutable copies - Created via
.mutable()Dataset mutable = dataset.mutable();
mutable.isMutable(); // true - can mutate
The .mutable() method:
- Creates a shallow copy with independent mutability flags
- Shares aspect cache with original (read-your-own-writes semantics)
- Idempotent - returns self if already mutable
- Original entity remains read-only after creating mutable copy
Why immutability-by-default?
- Makes mutations explicit and intentional
- Prevents accidental modification when passing entities between functions
- Clear separation between read and write workflows
- Enables safe entity sharing across threads
- Common pattern in modern APIs (Rust, Python, Java immutable collections)
See "Developer-Facing Implementation Design" section below for internal architecture details.
3. Supported Entities
The SDK V2 implements 8 entity types with full metadata support:
Data Entities:
- Dataset - Tables, views, files with schema support
- Container - Databases, schemas, folders (hierarchical structures)
Pipeline Entities:
- DataFlow - Pipelines, workflows (Airflow DAGs, Spark jobs, dbt projects)
- DataJob - Individual tasks with inlet/outlet lineage
Visualization Entities:
- Chart - Visualizations with input dataset lineage
- Dashboard - Dashboards with chart relationships and input datasets
ML Entities:
- MLModel - Machine learning models with metrics, hyperparameters, training jobs
- MLModelGroup - Model families with version management
Common Entity Operations:
All entities support these fluent operations (via mixin interfaces):
// Tags
entity.addTag("pii")
.removeTag("deprecated")
.setTags(Arrays.asList("tag1", "tag2"))
.clearTags()
// Owners
entity.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
.removeOwner(ownerUrn)
.setOwners(ownerList)
.clearOwners()
// Glossary Terms
entity.addTerm(termUrn)
.removeTerm(termUrn)
.setTerms(termList)
.clearTerms()
// Domains
entity.setDomain(domainUrn)
.removeDomain(domainUrn)
.clearDomains()
// Container (for hierarchical entities)
entity.setContainer(containerUrn)
.clearContainer()
// Structured Properties (custom typed metadata)
entity.setStructuredProperty("io.acryl.dataManagement.replicationSLA", "24h")
.setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5)
.setStructuredProperty("io.acryl.dataManagement.certifications",
Arrays.asList("SOC2", "HIPAA", "GDPR"))
.setStructuredProperty("io.acryl.privacy.retentionDays", 90, 180, 365)
.removeStructuredProperty("io.acryl.dataManagement.deprecated")
Entity-Specific Documentation:
See comprehensive guides in metadata-integration/java/docs/sdk-v2/:
dataset-entity.md- Dataset with schema supportchart-entity.md- Chart with lineagedashboard-entity.md- Dashboard with chart relationshipscontainer-entity.md- Container hierarchiesdataflow-entity.md- DataFlow pipelinesdatajob-entity.md- DataJob with inlet/outlet lineagemlmodel-entity.md- MLModel with metricsmlmodelgroup-entity.md- MLModelGroup with versions
4. EntityClient (CRUD Operations)
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java (570 lines)
package datahub.client.v2.operations;
/**
* Client for entity CRUD operations.
* Provides create, read, update, and upsert operations.
*/
public class EntityClient {
private final RestEmitter emitter;
private final DataHubClientConfigV2 config;
/**
* Create a new entity (convenience method - same as upsert).
*/
public <T extends Entity> void create(T entity) throws IOException, ExecutionException, InterruptedException {
upsert(entity);
}
/**
* Upsert an entity (create or update).
* Emits all aspects and accumulated patches.
*/
public <T extends Entity> void upsert(T entity) throws IOException, ExecutionException, InterruptedException {
List<MetadataChangeProposalWrapper> mcps = entity.toMCPs();
// Emit all MCPs asynchronously and wait for completion
// ...
}
/**
* Update an existing entity.
* Emits only accumulated patches (not full aspects).
*/
public <T extends Entity> void update(T entity) throws IOException, ExecutionException, InterruptedException {
// Emit only pending patches
// ...
}
/**
* Get an entity by URN.
* Returns entity with lazy-loaded aspects.
*/
public <T extends Entity> T get(Urn urn, Class<T> entityClass) throws IOException {
// Fetch entity aspects from server
// Construct entity with lazy loading support
// ...
}
// Note: delete(Urn) and exists(Urn) operations deferred to future releases
}
Supported Operations:
create()- Create new entities (wrapper for upsert)upsert()- Create or update entities (emits all aspects + patches)update()- Update existing entities (emits only patches)get()- Retrieve entities with lazy loadingdelete()andexists()- Deferred to future releases
Patch Behavior:
Patches are accumulated inside entities during metadata operations and emitted automatically during upsert()/update():
Dataset dataset = client.entities().get(datasetUrn);
Dataset mutable = dataset.mutable(); // Get mutable copy
mutable.addTag("pii"); // Creates internal patch
mutable.addTag("sensitive"); // Creates another internal patch
client.entities().update(mutable); // Emits both patches atomically
There is no separate patch() method - patches are managed internally by entities.
5. Mixin Interfaces (CRTP Pattern)
Files: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Has*.java
Mixin interfaces provide reusable metadata operations across entities using the Curiously Recurring Template Pattern (CRTP) for type-safe method chaining:
/**
* Interface for entities that support tags.
* Uses CRTP for type-safe method chaining.
*/
public interface HasTags<T extends Entity & HasTags<T>> {
/**
* Add a tag to this entity.
* Creates a patch that will be emitted on save.
*/
default T addTag(@Nonnull String tagUrn) {
// Implementation creates patch internally
return (T) this;
}
default T removeTag(@Nonnull String tagUrn) { ... }
default T setTags(@Nonnull List<String> tagUrns) { ... }
default T clearTags() { ... }
// Getter methods
default List<String> getTags() { ... }
}
Available Mixin Interfaces:
HasTags<T>- Tag operations (addTag,removeTag,setTags,clearTags)HasOwners<T>- Ownership operations (addOwner,removeOwner,setOwners,clearOwners)HasGlossaryTerms<T>- Glossary term operations (addTerm,removeTerm,setTerms,clearTerms)DomainOperations<T>- Domain operations (setDomain,removeDomain,clearDomains)HasContainer<T>- Container hierarchy (setContainer,clearContainer)HasStructuredProperties<T>- Structured properties operations (setStructuredProperty,removeStructuredProperty)
Why CRTP?
The CRTP pattern enables type-safe method chaining that returns the concrete entity type:
// Without CRTP: returns Entity
Entity entity = dataset.addTag("pii"); // Loses Dataset type!
// With CRTP: returns Dataset
Dataset dataset = dataset.addTag("pii")
.addOwner(ownerUrn, type) // Still Dataset type!
.setDomain(domainUrn); // Still Dataset type!
Entity Implementations:
Entities implement mixin interfaces by declaring them in the class signature:
public class Dataset extends Entity
implements HasTags<Dataset>,
HasOwners<Dataset>,
HasGlossaryTerms<Dataset>,
DomainOperations<Dataset>,
HasContainer<Dataset>,
HasStructuredProperties<Dataset> {
// Mixin methods provided by default implementations
}
Part 2: Developer-Facing Implementation Design
This section describes the internal architecture and implementation details for developers contributing to the SDK.
Internal Architecture
Entity Base Class - Internal Implementation
File: metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines)
The Entity base class implements three core subsystems:
1. AspectCache System with Read-Your-Own-Writes
Unified Cache Architecture: The SDK uses a unified AspectCache that provides read-your-own-writes semantics with proper dirty tracking. This architecture fixes bugs where fetched aspects would override patches.
Core Implementation Files:
AspectCache.java(184 lines) - Main cache with dirty trackingCachedAspect.java(68 lines) - Aspect wrapper with metadataAspectSource.java(23 lines) - Enum for SERVER vs LOCAL aspectsReadMode.java(28 lines) - Enum for ALLOW_DIRTY vs SERVER_ONLY reads
Key Architectural Features:
AspectSource Tracking: Distinguishes between SERVER-fetched aspects (subject to TTL) and LOCAL-created aspects (no expiration)
Dirty Tracking: Explicit marking of aspects that need write-back to server via
markDirty()methodRead-Your-Own-Writes: Default
ReadMode.ALLOW_DIRTYreturns local modifications immediately,SERVER_ONLYmode skips dirty aspectsTTL Management: 60-second TTL enforced only for SERVER-sourced aspects, LOCAL aspects never expire
Thread Safety: Uses
ConcurrentHashMapfor safe concurrent access
Internal State (Entity.java):
protected final AspectCache cache; // Unified cache with dirty tracking
protected final Map<String, List<MetadataChangeProposal>> pendingPatches;
private DataHubClientV2 boundClient = null;
Cache Operations:
getAspectLazy()- Lazy loads from server, stores as clean SERVER-sourced aspectgetOrCreateAspect()- Gets from cache or creates new LOCAL-sourced aspect (marked dirty)markAspectDirty()- Marks aspect dirty after in-place modification (used by domain operations)toMCPs()- Returns only dirty aspects for emission (excludes clean fetched aspects)
Why This Architecture?
The unified cache solves a critical bug: when entities are fetched from the server and then patch operations are applied (e.g., removeTerm()), the cached aspect would be included in toMCPs() and override the patches. With dirty tracking, toMCPs() only returns modified aspects, allowing patches to work correctly.
2. Patch Accumulation and MCP Generation
Metadata operations create patches that accumulate until emission. The system supports two types of operations:
Patch-Based Operations (incremental updates):
- Tags, owners, glossary terms use
PatchBuilderclasses - Patches accumulate in
pendingPatchesmap (aspect name → list of patches) - Multiple operations on same aspect create multiple patches
Cache-Based Operations (full aspect replacement):
- Domains, custom properties modify aspects in cache
- Aspects marked dirty via
markAspectDirty()after modification - Dirty aspects included in
toMCPs()output
MCP Generation:
The toMCPs() method returns only dirty aspects and accumulated patches:
public List<MetadataChangeProposalWrapper> toMCPs() {
// 1. Add dirty aspects from cache (excludes clean fetched aspects)
for (Map.Entry<String, RecordTemplate> entry : cache.getDirtyAspects().entrySet()) {
mcps.add(createMCP(entry.getKey(), entry.getValue()));
}
// 2. Add accumulated patches
for (PatchBuilder builder : patchBuilders.values()) {
mcps.add(builder.build());
}
// 3. Add pending MCPs
mcps.addAll(pendingMCPs);
return mcps;
}
Critical Design Point: toMCPs() uses cache.getDirtyAspects() instead of all cached aspects. This ensures that fetched aspects don't override patches - only locally modified aspects are emitted.
3. Mode-Aware Aspect Routing
SDK mode vs INGESTION mode for proper aspect selection:
/**
* Get aspect name based on operation mode.
* SDK mode: prefer editable aspects
* INGESTION mode: use system aspects
*/
protected String getAspectName(Class<? extends RecordTemplate> aspectClass, OperationMode mode) {
if (mode == OperationMode.SDK) {
// Check if editable variant exists
String editableAspectName = getEditableAspectName(aspectClass);
if (editableAspectName != null) {
return editableAspectName;
}
}
return aspectClass.getSimpleName();
}
/**
* Get getter preference order: editable aspects first, then system aspects.
*/
protected <T extends RecordTemplate> T getAspectWithPreference(
Class<T> editableClass,
Class<T> systemClass
) {
// Try editable aspect first
T editable = getAspectLazy(editableClass);
if (editable != null) {
return editable;
}
// Fall back to system aspect
return getAspectLazy(systemClass);
}
## Implementation Phases
### Phase 1: Core Framework
Base functionality for all entities:
- Base `Entity` class with aspect management, lazy loading, and patch accumulation
- `DataHubClientV2` main client class with mode-aware behavior
- `EntityClient` with create, read, update, upsert operations
- Configuration classes with environment variable support
- Mixin interfaces using CRTP pattern for type safety
### Phase 2: Dataset Entity
Reference implementation demonstrating all patterns:
- `Dataset` entity with fluent builder
- Dataset-specific aspects (properties, schema, lineage)
- Mixin interface implementations
- Comprehensive unit tests
### Phase 3: Additional Entities
Seven additional entity types:
- `Chart` - Visualizations with lineage
- `Dashboard` - Dashboards with chart relationships
- `Container` - Hierarchical data structures
- `DataJob` - Pipeline tasks with inlet/outlet lineage
- `DataFlow` - Pipeline workflows
- `MLModel` - Machine learning models
- `MLModelGroup` - ML model families
### Phase 4: Patch Capabilities
Patch-based updates for efficient metadata changes:
- Internal patch accumulation within entities (not separate patch builders)
- Automatic patch emission on `update()` and `upsert()`
- Leverages existing `PatchBuilder` classes from entity-registry module
- Patches tested via entity unit tests
### Phase 5: Testing & Documentation
Comprehensive validation and user guides:
- Integration tests with live DataHub server
- API documentation (Javadoc) and 13 comprehensive Markdown guides
- 19 working example files demonstrating real-world usage
- Migration guide from V1
- Design principles document
- Patch operations deep-dive
- Entity-specific guides for all 8 entities
## Testing Strategy
### Unit Tests
Each entity and component has comprehensive unit tests:
- Builder validation (required fields, optional fields, validation logic)
- Aspect management (getters, setters, mode-aware routing)
- MCP generation (full aspects + patches)
- Patch operations (accumulation, emission)
- Fluent API chaining (type safety via CRTP)
- Mixin operations (tags, owners, terms, domains)
**Test Coverage by Entity:**
- Dataset: 37 tests
- Chart: 43 tests
- Dashboard: 52 tests
- DataJob: 45 tests
- DataFlow: 40 tests
- Container: 40 tests
- MLModel: 44 tests
- MLModelGroup: 38 tests
### Integration Tests
Full end-to-end tests against a real DataHub instance:
```java
@Test
public void testDatasetCreateAndRead() throws Exception {
// Create client
DataHubClientV2 client = DataHubClientV2.builder()
.server(TEST_SERVER)
.token(TEST_TOKEN)
.build();
// Create dataset
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("db.schema.test_table_" + System.currentTimeMillis())
.env("PROD")
.description("Test dataset created by Java SDK V2")
.build();
dataset.addTag("test-tag")
.addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER);
// Upsert
client.entities().upsert(dataset);
// Read back
Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
assertNotNull(retrieved);
assertEquals("Test dataset created by Java SDK V2", retrieved.getDescription());
}
@Test
public void testDatasetPatchOperations() throws Exception {
DataHubClientV2 client = DataHubClientV2.builder()
.server(TEST_SERVER)
.token(TEST_TOKEN)
.build();
// Create dataset first
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("db.schema.test_table_patch_" + System.currentTimeMillis())
.env("PROD")
.build();
client.entities().upsert(dataset);
// Retrieve and apply patches
Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
Dataset mutable = retrieved.mutable(); // Get mutable copy
mutable.addTag("pii") // Creates patch
.addTag("sensitive") // Another patch
.addTerm("urn:li:glossaryTerm:CustomerData"); // Another patch
// All patches emitted atomically
client.entities().update(mutable);
// Verify patches were applied
Dataset verified = client.entities().get(dataset.getUrn(), Dataset.class);
assertTrue(verified.getTags().contains("urn:li:tag:pii"));
}
Integration Test Coverage:
- Entity creation and retrieval
- Tag, owner, term, domain operations
- Lineage relationships (charts → datasets, jobs → datasets)
- Custom properties
- Full metadata workflows
- Batch operations
- Patch accumulation and emission
Running Integration Tests:
export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your_token
./gradlew :metadata-integration:java:datahub-client:test --tests "*Integration*"
Test Coverage Results
- Unit test coverage: >80% for new code (378 unit tests + 79 integration tests = 457 total)
- All public APIs covered
- Edge cases tested (null values, invalid inputs, mode switching)
- Async operations tested with proper synchronization
- Cache infrastructure thoroughly tested (43 tests for AspectCache + CachedAspect)
- Full end-to-end integration tests (79 tests)
API Documentation
All public classes and methods have comprehensive Javadoc plus extensive Markdown documentation:
Javadoc Coverage:
- Class-level documentation explaining purpose and usage
- Method-level documentation with parameters, returns, exceptions
- Code examples for common use cases
- Links to related classes and methods
Markdown Documentation (13 files):
Located in metadata-integration/java/docs/sdk-v2/:
- getting-started.md - Quick start guide for new users
- design-principles.md - Architecture and design decisions
- dataset-entity.md - Dataset operations and schema support
- chart-entity.md - Chart operations and lineage
- dashboard-entity.md - Dashboard operations and relationships
- container-entity.md - Container hierarchies
- dataflow-entity.md - DataFlow pipeline operations
- datajob-entity.md - DataJob inlet/outlet lineage
- mlmodel-entity.md - MLModel metrics and hyperparameters
- mlmodelgroup-entity.md - MLModelGroup version management
- patch-operations.md - Deep dive into patch-based updates
- migration-from-v1.md - Migration guide from V1 SDK
- java-sdk-v2-design.md - This comprehensive design document
Working Examples (19 files):
Located in metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/:
- Dataset examples: DatasetCreateExample, DatasetFullExample, DatasetPatchExample
- Chart examples: ChartCreateExample, ChartFullExample, ChartLineageExample
- Dashboard examples: DashboardCreateExample, DashboardFullExample, DashboardLineageExample
- DataFlow examples: DataFlowCreateExample, DataFlowFullExample
- DataJob examples: DataJobCreateExample, DataJobFullExample, DataJobLineageExample
- Container examples: ContainerCreateExample, ContainerFullExample, ContainerHierarchyExample
- MLModel examples: MLModelCreateExample, MLModelFullExample
- MLModelGroup examples: MLModelGroupCreateExample, MLModelGroupFullExample
Migration Guide
For users of the existing Java SDK:
Before (V1):
RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"));
DatasetUrn urn = new DatasetUrn(
new DataPlatformUrn("postgres"),
"my_table",
FabricType.PROD
);
DatasetProperties props = new DatasetProperties();
props.setDescription("My dataset");
MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
.entityType("dataset")
.entityUrn(urn)
.upsert()
.aspect(props)
.build();
emitter.emit(mcpw).get();
After (V2):
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.build();
Dataset dataset = Dataset.builder()
.platform("postgres")
.name("my_table")
.description("My dataset")
.build();
client.entities().upsert(dataset);
Decision Log
1. Use Pegasus Models vs OpenAPI Models
Decision: Use Pegasus models (com.linkedin.*) for aspect classes.
Rationale:
- Pegasus models are the canonical representation in DataHub
- Already used by v1 SDK, maintains consistency
- Generated from PDL schemas, always in sync with backend
- OpenAPI models are less mature and have fewer utilities
Result: Proven correct - seamless integration with existing infrastructure.
2. Namespace Separation
Decision: Use datahub.client.v2.* namespace.
Rationale:
- Clear separation from v1 API
- Allows side-by-side usage
- Follows semantic versioning principles
- Easy to deprecate v1 in future
Result: 100% backward compatibility achieved - v1 code unchanged.
3. Builder Pattern
Decision: Use nested static Builder classes.
Rationale:
- Idiomatic Java pattern
- Type-safe construction
- Optional parameters handled cleanly
- Better than telescoping constructors
Result: Excellent developer experience with fluent API.
4. Synchronous vs Async
Decision: Provide synchronous API that wraps async operations.
Rationale:
- Simpler for most users
- Matches Python SDK V2 API
- Can expose async API later for advanced users
- RestEmitter already provides async primitives
Result: Simplified API widely adopted in examples and tests.
5. Error Handling
Decision: Throw checked exceptions for I/O operations.
Rationale:
- Forces callers to handle errors
- Consistent with Java conventions
- Clear distinction between programmer errors and runtime failures
Result: Clear error handling patterns in all code.
Exception Hierarchy:
The SDK introduces custom exceptions for common error conditions:
ReadOnlyEntityException - Thrown when attempting to mutate a read-only entity:
try {
Dataset dataset = client.entities().get(urn);
dataset.addTag("pii"); // Throws ReadOnlyEntityException
} catch (ReadOnlyEntityException e) {
// Exception message explains the issue and provides fix
System.err.println(e.getMessage());
// Fix: Get mutable copy first
Dataset mutable = dataset.mutable();
mutable.addTag("pii");
client.entities().upsert(mutable);
}
PendingMutationsException - Thrown when reading from entity with pending mutations:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.setDescription("New description");
// dataset.getDescription(); // Throws PendingMutationsException!
// Fix: Save first, then read
client.entities().upsert(dataset); // Clears dirty flag
String desc = dataset.getDescription(); // Now works
Why these restrictions?
- ReadOnlyEntityException: Makes mutations explicit, prevents accidental changes when passing entities between functions
- PendingMutationsException: Prevents reading stale cached data, enforces explicit save-then-fetch workflow
Both restrictions enforce clear separation between read and write workflows. These may be relaxed in future versions as the API matures and usage patterns emerge.
6. Patch-First over Full Aspect Replacement
Decision: Prioritize patch-based operations as the primary API, defer full aspect replacement to V1 SDK.
Rationale:
- User mental model: "Add a tag" is more natural than "fetch all tags, modify list, PUT entire aspect"
- Safety: Patches don't clobber concurrent changes from other users/systems
- Simplicity: Most metadata operations are incremental (add owner, remove tag, etc.)
- Efficiency: Only changed fields transmitted and processed by server
- Escape hatch exists: Users needing full PUT semantics can use V1 SDK's
RestEmitterdirectly
Why not both? V2 SDK focuses on making common operations simple, not exposing every low-level primitive. This keeps the API focused and prevents confusion about when to use patches vs full replacement.
Result: Clean, intuitive API for 95% of use cases. Power users can drop to V1 SDK for remaining 5%.
7. Internal Patch Accumulation vs External Patch Builders
Decision: Accumulate patches inside entities rather than separate patch builder classes.
Rationale:
- More intuitive API - metadata operations just work
- Patches automatically emitted on save
- Reduces API surface area
- Simplifies user code
Original Design: Separate DatasetPatch, ChartPatch builder classes
Actual Implementation: Patches accumulate in Entity.pendingPatches and emit via toMCPs()
Result: Superior developer experience - no need to learn separate patch API.
8. CRTP Pattern for Mixin Interfaces
Decision: Use Curiously Recurring Template Pattern for type-safe mixin interfaces.
Rationale:
- Type-safe method chaining returns concrete entity type
- Compile-time type checking
- No casting required in user code
- Idiomatic Java generics pattern
Original Design: Simple interfaces returning Entity
Actual Implementation:
public interface HasTags<T extends Entity & HasTags<T>> {
default T addTag(String tagUrn) { return (T) this; }
}
Result: Excellent type safety and developer experience.
9. Mode-Aware Behavior (SDK vs INGESTION)
Decision: Support SDK mode and INGESTION mode for aspect routing.
Rationale:
- Proper separation of user edits vs pipeline writes
- SDK mode → editable aspects (user overrides)
- INGESTION mode → system aspects (pipeline data)
- Getters prefer editable over system
Original Design: Not specified
Actual Implementation: OperationMode enum with aspect routing logic
Result: Clear separation of concerns, aligns with DataHub's aspect model.
10. Lazy Loading for GET Operations
Decision: Implement lazy loading for aspects when entities are retrieved.
Rationale:
- Performance - only fetch aspects when accessed
- Client binding enables on-demand fetching
- Cache management with timestamps
Original Design: Not specified (GET deferred)
Actual Implementation: Full lazy loading with getAspectLazy() and client binding
Result: Efficient entity retrieval with on-demand aspect fetching.
Design Questions and Resolutions
GET operation implementation: Should we implement REST client for reading entities, or defer to future?
- Resolution: Implemented with lazy loading support
Search client: Should we include search functionality in V2?
- Resolution: Deferred to future (out of scope for V2)
Lineage client: Should we include lineage management?
- Resolution: Basic lineage on Dataset, Chart, Dashboard, DataJob entities
Schema field builders: Should we provide fluent builders for schema fields?
- Resolution: Yes, schema field support in Dataset entity
References
Quick Links for Reviewers
Start Here:
metadata-integration/java/docs/sdk-v2/getting-started.md- Quick start guidemetadata-integration/java/docs/sdk-v2/design-principles.md- Architecture overview
Core Implementation: 3. metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java (490 lines) - Base entity class 4. metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java (570 lines) - CRUD operations 5. metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java (266 lines) - Main client
Sample Entities: 6. metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java (564 lines) - Reference implementation 7. metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/HasTags.java (145 lines) - CRTP mixin example
Examples: 8. metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java - Complete workflow 9. metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/ChartLineageExample.java - Lineage relationships
Tests: 10. metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/entity/DatasetTest.java (37 unit tests) 11. metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/integration/DatasetIntegrationTest.java - End-to-end validation
Document Status: Design document reflecting implemented architecture (includes AspectCache refactoring) Author: DataHub OSS Team Last Updated: 2025-01-06