Why We Hand-Crafted the Java SDK V2 (Instead of Generating It)
The Question
When building DataHub's Java SDK V2, we faced a choice that every API platform eventually confronts: should we generate our SDK from OpenAPI specs, or hand-craft it?
OpenAPI code generation is seductive. Tools like OpenAPI Generator promise instant SDKs in dozens of languages. Run a command, get a client—complete with type-safe models, proper serialization, and comprehensive endpoint coverage. Why would anyone choose to write thousands of lines of code by hand?
We chose to hand-craft. This document explains why.
When Code Generation Works Beautifully
Let's be clear: code generation isn't wrong. It's incredibly effective when your abstraction boundary aligns with your wire protocol.
CRUD APIs: If your API exposes resources like GET /users/{id}, POST /users, DELETE /users/{id}, a generated client is perfect:
User user = client.getUser(123);
client.createUser(newUser);
client.deleteUser(456);
The user's mental model—"I want to fetch/create/delete a user"—maps directly to HTTP operations. There's no translation needed.
Protocol Buffers: Google's protobuf generators are exemplary because the .proto file is the contract:
service UserService {
rpc GetUser(UserId) returns (User);
rpc ListUsers(ListRequest) returns (UserList);
}
The service definition becomes the client API with perfect fidelity. What you define is what users get.
The Pattern: Code generation excels when the API's conceptual model matches user mental models, and the wire protocol fully captures domain semantics.
The Semantic Gap: Why DataHub Is Different
DataHub doesn't fit this mold. Our metadata platform has a semantic gap between what users want to do and what the HTTP API exposes.
The Aspect-Based Model
DataHub stores metadata as discrete "aspects"—properties, tags, ownership, schemas. But users don't think in aspects. They think:
- "I want to add a 'PII' tag to this dataset"
- "I need to assign ownership to John"
- "This table should be in the Finance domain"
An OpenAPI-generated client would expose:
// What the API provides
client.updateGlobalTags(entityUrn, globalTagsPayload);
client.updateOwnership(entityUrn, ownershipPayload);
But to use this, you need to know:
- What is GlobalTags? How do I construct it?
- Should I use PUT (full replacement) or PATCH (incremental update)?
- How do I avoid race conditions when multiple systems update tags?
- Where do tags even live—in system aspects or editable aspects?
This is expert-level knowledge pushed onto every user.
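To make that concrete, here is roughly the read-modify-write dance a generated client forces on users. The client and model classes below are hypothetical stand-ins for generator output, not DataHub's actual generated types:

```java
// Hypothetical generated-client usage - illustrative only.
GlobalTags current = client.getGlobalTags(datasetUrn);          // 1. fetch the whole aspect
List<TagAssociation> tags = new ArrayList<>(current.getTags()); // 2. copy and mutate locally
tags.add(new TagAssociation().setTag("urn:li:tag:pii"));
client.updateGlobalTags(datasetUrn, new GlobalTags().setTags(tags)); // 3. PUT it all back
// Any concurrent tag update made between steps 1 and 3 is silently lost.
```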
The Patch Complexity
DataHub supports both full aspect replacement (PUT) and JSON Patch (incremental updates). The generated client would expose both:
// Full replacement
void putGlobalTags(Urn entityUrn, GlobalTags tags);
// JSON Patch
void patchGlobalTags(Urn entityUrn, JsonPatch patch);
Now users must decide when to use each. Patches are safer (no race conditions), but how do you construct a JsonPatch? Do you use a PatchBuilder? Hand-write JSON?
Every user solves this problem independently, reinventing best practices.
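For anyone who picks "hand-write JSON," the answer looks something like this sketch (using Jackson; the path and value shape mirror the globalTags aspect but should be read as an assumption for illustration, not the canonical patch format):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

ObjectMapper mapper = new ObjectMapper();
ArrayNode patch = mapper.createArrayNode();      // an RFC 6902 patch is a JSON array of ops
ObjectNode op = patch.addObject();
op.put("op", "add");
op.put("path", "/tags/urn:li:tag:pii");          // where inside the aspect the tag lives
op.set("value", mapper.createObjectNode().put("tag", "urn:li:tag:pii"));
```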
The Mode Problem
DataHub has dual aspects: system aspects (written by ingestion pipelines) and editable aspects (written by humans via UI/SDK). Users editing metadata should write to editable aspects, but pipelines should write to system aspects.
A generated client doesn't understand this distinction. It just exposes endpoints. Users must learn DataHub's aspect model to route correctly.
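As a sketch of what "routing correctly" means, a mode-aware SDK can pick the aspect variant on the caller's behalf. The enum and aspect names here are assumptions for illustration, not the V2 SDK's internals:

```java
// Illustrative only - not the actual V2 SDK internals.
enum OperationMode { SDK, INGESTION }

static String descriptionAspectFor(OperationMode mode) {
    // Human edits land on the editable aspect so ingestion pipelines
    // cannot overwrite them; pipelines write the system aspect.
    return mode == OperationMode.SDK
            ? "editableDatasetProperties"
            : "datasetProperties";
}
```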
Six Principles of Hand-Crafted SDKs
Our hand-crafted SDK addresses these gaps through six design principles.
1. Semantic Layers Translate Domain Concepts
The SDK provides operations that match how users think:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("fact_revenue")
.build();
// Think "add a tag", not "construct and PUT a GlobalTags aspect"
dataset.addTag("pii");
// Think "assign ownership", not "build an Ownership aspect"
dataset.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER);
client.entities().upsert(dataset);
The SDK translates addTag() into the correct:
- Aspect type (GlobalTags)
- Operation type (JSON Patch for safety)
- Aspect variant (editable, in SDK mode)
- JSON path (into the aspect structure)
This is semantic translation—mapping domain intent to wire protocol. Generators can't do this because the semantics live in institutional knowledge, not OpenAPI specs.
2. Opinionated APIs: The 95/5 Rule
We optimized for the 95% case and provided escape hatches for the 5%.
The 95% case: Incremental metadata changes—add a tag, update ownership, set a domain.
dataset.addTag("sensitive")
.addOwner(ownerUrn, type)
.setDomain(domainUrn);
client.entities().update(dataset);
Users never think about PUT vs PATCH, aspect construction, or batch strategies. It just works.
The 5% case: Complete aspect replacement, custom MCPs, or operations V2 doesn't support.
// Drop to V1 SDK for full control
RestEmitter emitter = client.emitter();
MetadataChangeProposalWrapper mcpw = /* custom logic */;
emitter.emit(mcpw).get();
This philosophy—make simple things trivial, complex things possible—requires intentional API design. Generators produce flat API surfaces where every operation has equal weight.
3. Encoding Expert Knowledge
Every platform accumulates tribal knowledge:
- "Always use patches for concurrent-safe updates"
- "Editable aspects override system aspects in SDK mode"
- "Batch operations to avoid Kafka load spikes"
- "Schema field names don't always match aspect names"
A generated client leaves this knowledge in Slack threads and documentation. Users discover best practices through painful trial and error.
The hand-crafted SDK encodes this knowledge:
// Users call addTag(), SDK internally:
// - Creates a JSON Patch (not full replacement)
// - Targets the editable aspect in SDK mode
// - Accumulates patches for atomic emission
// - Uses the correct field paths
The SDK becomes executable documentation of best practices. This scales better than tribal knowledge.
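A minimal sketch of the "accumulate patches for atomic emission" behavior described above, with invented names (the PendingPatches class is not the SDK's real internals):

```java
import java.util.ArrayList;
import java.util.List;

// Invented class for illustration only.
class PendingPatches {
    private final List<String> patches = new ArrayList<>();

    void add(String jsonPatch) {        // addTag()/addOwner() queue patches here
        patches.add(jsonPatch);
    }

    List<String> drain() {              // update() emits the queue as one batch,
        List<String> batch = new ArrayList<>(patches);
        patches.clear();                // smoothing load instead of one request per call
        return batch;
    }
}
```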
Why Not an ORM Approach?
Tools like Hibernate, SQLAlchemy, and Pydantic+ORM excel at managing complex object graphs in transactional applications. Why didn't we use this pattern?
Metadata operations follow different patterns than OLTP workloads:
- Bulk mutations - "Tag 50 datasets as PII" requires only URNs and the operation, not loading full object graphs
- Point lookups - "Get this dataset's schema before querying" is a direct fetch, no relationship navigation needed
- Read-modify-write - "Infer quality scores from schema statistics" involves fetching an aspect, transforming it, and patching it back
ORMs optimize for relationship traversal (dataset.container.database.catalog), session lifecycle management, and automatic dirty tracking. But:
- Relationship traversal is handled by DataHub's search and graph query APIs, not in-memory navigation
- Explicit patches are central to our design—we want addTag() visible in code, not hidden behind session flush
- Session complexity adds cognitive overhead without benefit for metadata's bulk/point/patch patterns
The result: a simpler, more explicit API that matches how developers actually work with metadata.
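As one concrete example, the read-modify-write pattern composes directly from the calls shown elsewhere in this document; the loop below is an illustrative sketch, not prescribed usage:

```java
// Sketch: fetch each dataset, apply a change, patch it back.
// datasetUrns: a list of URNs, e.g. returned by a search query.
for (String urn : datasetUrns) {
    Dataset existing = client.entities().get(urn);
    Dataset mutable = existing.mutable();   // explicit write intent
    mutable.addTag("pii");
    client.entities().update(mutable);      // emits patches, not full aspects
}
```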
4. Centralized Maintenance vs Distributed Pain
Generated clients push maintenance costs onto users. When we improve DataHub:
- Add a new endpoint: Users regenerate their client. Breaking change? Every team upgrades simultaneously.
- Change error handling: Regenerate. Update all call sites.
- Optimize batch operations: Can't—that logic lives in user code, reinvented by every team.
Hand-crafted SDKs centralize expertise:
- Add convenience methods: Users pull the SDK update. No code changes required.
- Improve retry logic: Fixed once in the SDK. All users benefit immediately.
- Optimize batching: Built into the SDK. Users get better performance automatically.
The total maintenance cost is lower because we fix problems once instead of every team solving them independently.
5. Progressive Disclosure
Generated clients are flat—every endpoint is equally visible. Hand-crafted SDKs enable progressive disclosure: simple tasks are simple, complexity is opt-in.
Day 1 user: Create and tag a dataset
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.addTag("pii");
client.entities().upsert(dataset);
No need to understand aspects, patches, or modes.
Week 1 user: Manage governance
dataset.addOwner(ownerUrn, type)
.setDomain(domainUrn)
.addTerm(termUrn);
Still pure domain operations.
Month 1 user: Understand update vs upsert
// update() emits only patches (for existing entities)
Dataset existing = client.entities().get(urn);
Dataset mutable = existing.mutable(); // Get writable copy
mutable.addTag("sensitive");
client.entities().update(mutable);
// upsert() emits full aspects + patches
Dataset newEntity = Dataset.builder()...;
client.entities().upsert(newEntity);
Complexity revealed when needed, not upfront.
6. Immutability by Default
Entities fetched from the server are read-only by default, enforcing explicit mutation intent.
The Problem:
Traditional SDKs allow silent mutation of fetched objects:
Dataset dataset = client.get(urn);
// Pass to function - might it mutate dataset? Who knows!
processDataset(dataset);
// Is dataset still the same? Must read all code to know
The Solution:
Immutable-by-default makes mutation intent explicit:
Dataset dataset = client.get(urn);
// dataset is read-only - safe to pass anywhere
processDataset(dataset);
// Want to mutate? Make it explicit
Dataset mutable = dataset.mutable();
mutable.addTag("updated");
client.entities().upsert(mutable);
Benefits:
- Safety: Can't accidentally mutate shared references
- Clarity: .mutable() call signals write intent
- Debugging: Easier to track where mutations happen
- Concurrency: Safe to share read-only entities across threads
Design Inspiration:
This pattern is common in modern APIs because immutability scales better than defensive copying:
- Rust's ownership model - mut vs immutable borrows
- Python's frozen dataclasses - @dataclass(frozen=True)
- Java's immutable collections - Collections.unmodifiableList()
- Functional programming principles - immutable data structures
When you see .mutable() in our SDK, you're seeing battle-tested patterns from languages designed for safety and concurrency.
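A stripped-down sketch of the pattern in plain Java (illustrative only; the real entity classes carry far more state than a tag list):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReadOnlyDataset {
    private final List<String> tags;

    ReadOnlyDataset(List<String> tags) {
        // Defensive copy + unmodifiable wrapper: callers cannot mutate this view.
        this.tags = Collections.unmodifiableList(new ArrayList<>(tags));
    }

    List<String> tags() { return tags; }

    MutableDataset mutable() {              // the explicit .mutable() step
        return new MutableDataset(new ArrayList<>(tags));
    }
}

final class MutableDataset {
    private final List<String> tags;

    MutableDataset(List<String> tags) { this.tags = tags; }

    MutableDataset addTag(String tag) {     // chainable, like the SDK's entities
        tags.add(tag);
        return this;
    }
}
```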
What This Costs (And Why It's Worth It)
Hand-crafting isn't free:
- 3,000+ lines of code across entity classes, caching, and operations
- 457 tests validating workflows, not just HTTP mechanics
- 13 documentation guides teaching patterns, not just parameters
- Ongoing maintenance as DataHub evolves
But this investment compounds. Every hour we spend on the SDK saves hundreds of hours across our user community. The SDK makes metadata management effortless instead of just possible.
Compare total cost of ownership:
| Approach | Initial Dev | User Onboarding | Ongoing Support | Innovation Speed |
|---|---|---|---|---|
| Generated Client | Hours | High (steep) | High (repeated) | Slow (coupled) |
| Hand-Crafted SDK | Weeks | Low (gradual) | Low (central) | Fast (buffered) |
After 6-12 months, the hand-crafted SDK becomes cheaper because centralized expertise scales better than distributed tribal knowledge.
The Philosophy: What SDKs Should Be
This isn't about generated vs hand-crafted code. It's about what we believe SDKs should be.
SDKs are not just API wrappers. They are:
- Semantic layers that translate domain concepts to wire protocols
- Knowledge repositories that encode best practices
- Usability interfaces that optimize for human cognition
- Evolution buffers that allow internals to improve while APIs remain stable
Code generation is perfect when the API is the abstraction. But for domain-rich platforms where users think in terms of datasets, lineage, and governance—not HTTP verbs and JSON payloads—hand-crafted SDKs aren't just better. They're necessary.
When Should You Generate? When Should You Craft?
Generate when:
- Your API's conceptual model matches user mental models
- The wire protocol fully captures domain semantics
- Operations are mostly stateless CRUD
- You prioritize API coverage over workflow optimization
Hand-craft when:
- Domain concepts require translation to wire protocol
- Users need guidance on best practices
- Stateful workflows matter (accumulate changes, emit atomically)
- You prioritize usability over feature completeness
DataHub falls firmly in the second category. Our users don't want to learn aspect models, patch formats, or mode routing. They want to add a tag to a dataset and have it just work.
That's what the hand-crafted SDK delivers.
Conclusion: Empathy at Scale
In an era of automation, there's pressure to generate everything. But some problems demand craftsmanship.
The hand-crafted SDK is an act of empathy at scale. It says: "We understand your problems. We've encoded the solutions. You shouldn't have to become a DataHub expert to use DataHub."
A generated client says: "Here's our API. Figure it out."
A hand-crafted SDK says: "Here's how to solve your problems."
That difference is why we invested in hand-crafting. And it's why our users can focus on their data, not our API internals.
Document Status: Design Philosophy
Author: DataHub OSS Team
Last Updated: 2025-01-06