Why We Hand-Crafted the Java SDK V2 (Instead of Generating It)
The Question
When building DataHub's Java SDK V2, we faced a choice that every API platform eventually confronts: should we generate our SDK from OpenAPI specs, or hand-craft it?
OpenAPI code generation is seductive. Tools like OpenAPI Generator promise instant SDKs in dozens of languages. Run a command, get a client—complete with type-safe models, proper serialization, and comprehensive endpoint coverage. Why would anyone choose to write thousands of lines of code by hand?
We chose to hand-craft. This document explains why.
When Code Generation Works Beautifully
Let's be clear: code generation isn't wrong. It's incredibly effective when your abstraction boundary aligns with your wire protocol.
CRUD APIs: If your API exposes resources like GET /users/{id}, POST /users, DELETE /users/{id}, a generated client is perfect:
User user = client.getUser(123);
client.createUser(newUser);
client.deleteUser(456);
The user's mental model—"I want to fetch/create/delete a user"—maps directly to HTTP operations. There's no translation needed.
Protocol Buffers: Google's protobuf generators are exemplary because the .proto file is the contract:
service UserService {
rpc GetUser(UserId) returns (User);
rpc ListUsers(ListRequest) returns (UserList);
}
The service definition becomes the client API with perfect fidelity. What you define is what users get.
The Pattern: Code generation excels when the API's conceptual model matches user mental models, and the wire protocol fully captures domain semantics.
The Semantic Gap: Why DataHub Is Different
DataHub doesn't fit this mold. Our metadata platform has a semantic gap between what users want to do and what the HTTP API exposes.
The Aspect-Based Model
DataHub stores metadata as discrete "aspects"—properties, tags, ownership, schemas. But users don't think in aspects. They think:
- "I want to add a 'PII' tag to this dataset"
- "I need to assign ownership to John"
- "This table should be in the Finance domain"
An OpenAPI-generated client would expose:
// What the API provides
client.updateGlobalTags(entityUrn, globalTagsPayload);
client.updateOwnership(entityUrn, ownershipPayload);
But to use this, you need to know:
- What is GlobalTags? How do I construct it?
- Should I use PUT (full replacement) or PATCH (incremental update)?
- How do I avoid race conditions when multiple systems update tags?
- Where do tags even live—in system aspects or editable aspects?
This is expert-level knowledge pushed onto every user.
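To make that concrete, here is roughly the read-modify-write dance a generated client forces on users. The client and model classes below are hypothetical stand-ins for generator output, not DataHub's actual generated types:

```java
// Hypothetical generated-client usage - illustrative only.
GlobalTags current = client.getGlobalTags(datasetUrn);          // 1. fetch the whole aspect
List<TagAssociation> tags = new ArrayList<>(current.getTags()); // 2. copy and mutate locally
tags.add(new TagAssociation().setTag("urn:li:tag:pii"));
client.updateGlobalTags(datasetUrn, new GlobalTags().setTags(tags)); // 3. PUT it all back
// Any concurrent tag update made between steps 1 and 3 is silently lost.
```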
The Patch Complexity
DataHub supports both full aspect replacement (PUT) and JSON Patch (incremental updates). The generated client would expose both:
// Full replacement
void putGlobalTags(Urn entityUrn, GlobalTags tags);
// JSON Patch
void patchGlobalTags(Urn entityUrn, JsonPatch patch);
Now users must decide when to use each. Patches are safer (no race conditions), but how do you construct a JsonPatch? Do you use a PatchBuilder? Hand-write JSON?
Every user solves this problem independently, reinventing best practices.
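For anyone who picks "hand-write JSON," the answer looks something like this sketch (using Jackson; the path and value shape mirror the globalTags aspect but should be read as an assumption for illustration, not the canonical patch format):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

ObjectMapper mapper = new ObjectMapper();
ArrayNode patch = mapper.createArrayNode();      // an RFC 6902 patch is a JSON array of ops
ObjectNode op = patch.addObject();
op.put("op", "add");
op.put("path", "/tags/urn:li:tag:pii");          // where inside the aspect the tag lives
op.set("value", mapper.createObjectNode().put("tag", "urn:li:tag:pii"));
```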
The Mode Problem
DataHub has dual aspects: system aspects (written by ingestion pipelines) and editable aspects (written by humans via UI/SDK). Users editing metadata should write to editable aspects, but pipelines should write to system aspects.
A generated client doesn't understand this distinction. It just exposes endpoints. Users must learn DataHub's aspect model to route correctly.
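As a sketch of what "routing correctly" means, a mode-aware SDK can pick the aspect variant on the caller's behalf. The enum and aspect names here are assumptions for illustration, not the V2 SDK's internals:

```java
// Illustrative only - not the actual V2 SDK internals.
enum OperationMode { SDK, INGESTION }

static String descriptionAspectFor(OperationMode mode) {
    // Human edits land on the editable aspect so ingestion pipelines
    // cannot overwrite them; pipelines write the system aspect.
    return mode == OperationMode.SDK
            ? "editableDatasetProperties"
            : "datasetProperties";
}
```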
Six Principles of Hand-Crafted SDKs
Our hand-crafted SDK addresses these gaps through six design principles.
1. Semantic Layers Translate Domain Concepts
The SDK provides operations that match how users think:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("fact_revenue")
.build();
// Think "add a tag", not "construct and PUT a GlobalTags aspect"
dataset.addTag("pii");
// Think "assign ownership", not "build an Ownership aspect"
dataset.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER);
client.entities().upsert(dataset);
The SDK translates addTag() into the correct:
- Aspect type (GlobalTags)
- Operation type (JSON Patch for safety)
- Aspect variant (editable, in SDK mode)
- JSON path (into the aspect structure)
This is semantic translation—mapping domain intent to wire protocol. Generators can't do this because the semantics live in institutional knowledge, not OpenAPI specs.
2. Opinionated APIs: The 95/5 Rule
We optimized for the 95% case and provided escape hatches for the 5%.
The 95% case: Incremental metadata changes—add a tag, update ownership, set a domain.
dataset.addTag("sensitive")
.addOwner(ownerUrn, type)
.setDomain(domainUrn);
client.entities().update(dataset);
Users never think about PUT vs PATCH, aspect construction, or batch strategies. It just works.
The 5% case: Complete aspect replacement, custom MCPs, or operations V2 doesn't support.
// Drop to V1 SDK for full control
RestEmitter emitter = client.emitter();
MetadataChangeProposalWrapper mcpw = /* custom logic */;
emitter.emit(mcpw).get();
This philosophy—make simple things trivial, complex things possible—requires intentional API design. Generators produce flat API surfaces where every operation has equal weight.
3. Encoding Expert Knowledge
Every platform accumulates tribal knowledge:
- "Always use patches for concurrent-safe updates"
- "Editable aspects override system aspects in SDK mode"
- "Batch operations to avoid Kafka load spikes"
- "Schema field names don't always match aspect names"
A generated client leaves this knowledge in Slack threads and documentation. Users discover best practices through painful trial and error.
The hand-crafted SDK encodes this knowledge:
// Users call addTag(), SDK internally:
// - Creates a JSON Patch (not full replacement)
// - Targets the editable aspect in SDK mode
// - Accumulates patches for atomic emission
// - Uses the correct field paths
The SDK becomes executable documentation of best practices. This scales better than tribal knowledge.
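A minimal sketch of the "accumulate patches for atomic emission" behavior described above, with invented names (the PendingPatches class is not the SDK's real internals):

```java
import java.util.ArrayList;
import java.util.List;

// Invented class for illustration only.
class PendingPatches {
    private final List<String> patches = new ArrayList<>();

    void add(String jsonPatch) {        // addTag()/addOwner() queue patches here
        patches.add(jsonPatch);
    }

    List<String> drain() {              // update() emits the queue as one batch,
        List<String> batch = new ArrayList<>(patches);
        patches.clear();                // smoothing load instead of one request per call
        return batch;
    }
}
```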
Why Not an ORM Approach?
Tools like Hibernate, SQLAlchemy, and Pydantic+ORM excel at managing complex object graphs in transactional applications. Why didn't we use this pattern?
Metadata operations follow different patterns than OLTP workloads:
- Bulk mutations - "Tag 50 datasets as PII" requires only URNs and the operation, not loading full object graphs
- Point lookups - "Get this dataset's schema before querying" is a direct fetch, no relationship navigation needed
- Read-modify-write - "Infer quality scores from schema statistics" involves fetching an aspect, transforming it, and patching it back
ORMs optimize for relationship traversal (dataset.container.database.catalog), session lifecycle management, and automatic dirty tracking. But:
- Relationship traversal is handled by DataHub's search and graph query APIs, not in-memory navigation
- Explicit patches are central to our design—we want addTag() visible in code, not hidden behind session flush
- Session complexity adds cognitive overhead without benefit for metadata's bulk/point/patch patterns
The result: a simpler, more explicit API that matches how developers actually work with metadata.
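As one concrete example, the read-modify-write pattern composes directly from the calls shown elsewhere in this document; the loop below is an illustrative sketch, not prescribed usage:

```java
// Sketch: fetch each dataset, apply a change, patch it back.
// datasetUrns: a list of URNs, e.g. returned by a search query.
for (String urn : datasetUrns) {
    Dataset existing = client.entities().get(urn);
    Dataset mutable = existing.mutable();   // explicit write intent
    mutable.addTag("pii");
    client.entities().update(mutable);      // emits patches, not full aspects
}
```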
4. Centralized Maintenance vs Distributed Pain
Generated clients push maintenance costs onto users. When we improve DataHub:
- Add a new endpoint: Users regenerate their client. Breaking change? Every team upgrades simultaneously.
- Change error handling: Regenerate. Update all call sites.
- Optimize batch operations: Can't—that logic lives in user code, reinvented by every team.
Hand-crafted SDKs centralize expertise:
- Add convenience methods: Users pull the SDK update. No code changes required.
- Improve retry logic: Fixed once in the SDK. All users benefit immediately.
- Optimize batching: Built into the SDK. Users get better performance automatically.
The total maintenance cost is lower because we fix problems once instead of every team solving them independently.
5. Progressive Disclosure
Generated clients are flat—every endpoint is equally visible. Hand-crafted SDKs enable progressive disclosure: simple tasks are simple, complexity is opt-in.
Day 1 user: Create and tag a dataset
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.addTag("pii");
client.entities().upsert(dataset);
No need to understand aspects, patches, or modes.
Week 1 user: Manage governance
dataset.addOwner(ownerUrn, type)
.setDomain(domainUrn)
.addTerm(termUrn);
Still pure domain operations.
Month 1 user: Understand update vs upsert
// update() emits only patches (for existing entities)
Dataset existing = client.entities().get(urn);
Dataset mutable = existing.mutable(); // Get writable copy
mutable.addTag("sensitive");
client.entities().update(mutable);
// upsert() emits full aspects + patches
Dataset newEntity = Dataset.builder()...;
client.entities().upsert(newEntity);
Complexity revealed when needed, not upfront.
6. Immutability by Default
Entities fetched from the server are read-only by default, enforcing explicit mutation intent.
The Problem:
Traditional SDKs allow silent mutation of fetched objects:
Dataset dataset = client.get(urn);
// Pass to function - might it mutate dataset? Who knows!
processDataset(dataset);
// Is dataset still the same? Must read all code to know
The Solution:
Immutable-by-default makes mutation intent explicit:
Dataset dataset = client.get(urn);
// dataset is read-only - safe to pass anywhere
processDataset(dataset);
// Want to mutate? Make it explicit
Dataset mutable = dataset.mutable();
mutable.addTag("updated");
client.entities().upsert(mutable);
Benefits:
- Safety: Can't accidentally mutate shared references
- Clarity: .mutable() call signals write intent
- Debugging: Easier to track where mutations happen
- Concurrency: Safe to share read-only entities across threads
Design Inspiration:
This pattern is common in modern APIs because immutability scales better than defensive copying:
- Rust's ownership model - mut vs immutable borrows
- Python's frozen dataclasses - @dataclass(frozen=True)
- Java's immutable collections - Collections.unmodifiableList()
- Functional programming principles - immutable data structures
When you see .mutable() in our SDK, you're seeing battle-tested patterns from languages designed for safety and concurrency.
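A stripped-down sketch of the pattern in plain Java (illustrative only; the real entity classes carry far more state than a tag list):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReadOnlyDataset {
    private final List<String> tags;

    ReadOnlyDataset(List<String> tags) {
        // Defensive copy + unmodifiable wrapper: callers cannot mutate this view.
        this.tags = Collections.unmodifiableList(new ArrayList<>(tags));
    }

    List<String> tags() { return tags; }

    MutableDataset mutable() {              // the explicit .mutable() step
        return new MutableDataset(new ArrayList<>(tags));
    }
}

final class MutableDataset {
    private final List<String> tags;

    MutableDataset(List<String> tags) { this.tags = tags; }

    MutableDataset addTag(String tag) {     // chainable, like the SDK's entities
        tags.add(tag);
        return this;
    }
}
```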
What This Costs (And Why It's Worth It)
Hand-crafting isn't free:
- 3,000+ lines of code across entity classes, caching, and operations
- 457 tests validating workflows, not just HTTP mechanics
- 13 documentation guides teaching patterns, not just parameters
- Ongoing maintenance as DataHub evolves
But this investment compounds. Every hour we spend on the SDK saves hundreds of hours across our user community. The SDK makes metadata management effortless instead of just possible.
Compare total cost of ownership:
| Approach | Initial Dev | User Onboarding | Ongoing Support | Innovation Speed |
|---|---|---|---|---|
| Generated Client | Hours | High (steep) | High (repeated) | Slow (coupled) |
| Hand-Crafted SDK | Weeks | Low (gradual) | Low (central) | Fast (buffered) |
After 6-12 months, the hand-crafted SDK becomes cheaper because centralized expertise scales better than distributed tribal knowledge.
The Philosophy: What SDKs Should Be
This isn't about generated vs hand-crafted code. It's about what we believe SDKs should be.
SDKs are not just API wrappers. They are:
- Semantic layers that translate domain concepts to wire protocols
- Knowledge repositories that encode best practices
- Usability interfaces that optimize for human cognition
- Evolution buffers that allow internals to improve while APIs remain stable
Code generation is perfect when the API is the abstraction. But for domain-rich platforms where users think in terms of datasets, lineage, and governance—not HTTP verbs and JSON payloads—hand-crafted SDKs aren't just better. They're necessary.
When Should You Generate? When Should You Craft?
Generate when:
- Your API's conceptual model matches user mental models
- The wire protocol fully captures domain semantics
- Operations are mostly stateless CRUD
- You prioritize API coverage over workflow optimization
Hand-craft when:
- Domain concepts require translation to wire protocol
- Users need guidance on best practices
- Stateful workflows matter (accumulate changes, emit atomically)
- You prioritize usability over feature completeness
DataHub falls firmly in the second category. Our users don't want to learn aspect models, patch formats, or mode routing. They want to add a tag to a dataset and have it just work.
That's what the hand-crafted SDK delivers.
Conclusion: Empathy at Scale
In an era of automation, there's pressure to generate everything. But some problems demand craftsmanship.
The hand-crafted SDK is an act of empathy at scale. It says: "We understand your problems. We've encoded the solutions. You shouldn't have to become a DataHub expert to use DataHub."
A generated client says: "Here's our API. Figure it out."
A hand-crafted SDK says: "Here's how to solve your problems."
That difference is why we invested in hand-crafting. And it's why our users can focus on their data, not our API internals.
Document Status: Design Philosophy
Author: DataHub OSS Team
Last Updated: 2025-01-06