Dataset Entity
The Dataset entity represents a collection of data with a common schema (tables, views, files, topics, etc.). This guide covers dataset operations in Java SDK V2: creation, descriptions, tags, owners, glossary terms, domains, custom properties, schema, and updates.
Creating a Dataset
Minimal Dataset
Only platform and name are required:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_database.my_schema.my_table")
.build();
With Environment
Specify environment (PROD, DEV, STAGING, etc.):
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.env("PROD")
.build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)
With Metadata
Add description and display name at construction:
Dataset dataset = Dataset.builder()
.platform("bigquery")
.name("project.dataset.table")
.env("PROD")
.description("User transactions table")
.displayName("User Transactions")
.build();
With Custom Properties
Include custom properties in builder:
Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");
Dataset dataset = Dataset.builder()
.platform("postgres")
.name("public.users")
.customProperties(props)
.build();
With Platform Instance
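When the same platform runs as several instances (for example, multiple Kafka clusters), specify which instance the dataset belongs to: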
Dataset dataset = Dataset.builder()
.platform("kafka")
.name("user-events")
.platformInstance("kafka-prod-cluster")
.build();
URN Construction
Dataset URNs follow the pattern:
urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})
Automatic URN creation:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("analytics.public.events")
.env("PROD")
.build();
DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
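To make the pattern concrete, the following minimal sketch assembles the same string by hand. DatasetUrnSketch and format are hypothetical names used only for illustration and are not part of the SDK; in practice the builder produces the URN for you.

```java
// Hypothetical helper, not an SDK class: formats a dataset URN string
// from its three parts, following the documented pattern.
final class DatasetUrnSketch {
    private DatasetUrnSketch() {}

    static String format(String platform, String name, String env) {
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
    }

    public static void main(String[] args) {
        // Same inputs as the builder call above.
        System.out.println(format("snowflake", "analytics.public.events", "PROD"));
    }
}
```

The platform name is wrapped in its own urn:li:dataPlatform: URN, which is why it appears nested inside the parentheses.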
Description Operations
Mode-Aware Description
The setDescription() method writes to a different aspect depending on the active mode:
// In SDK mode (the default), this writes to editableDatasetProperties
dataset.setDescription("User-provided description");
// In INGESTION mode, the same call writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
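The routing rule can be pictured as a simple lookup from mode to aspect name. This is only a sketch of the documented behavior: ModeRoutingSketch, its Mode enum, and targetAspect are hypothetical names, and the SDK's internal implementation may look quite different.

```java
// Illustrative only: these names are hypothetical, not SDK classes.
// The aspect names are the ones given in the comments above.
final class ModeRoutingSketch {
    enum Mode { SDK, INGESTION }

    private ModeRoutingSketch() {}

    // Returns the aspect a mode-aware setter would write to in each mode.
    static String targetAspect(Mode mode) {
        switch (mode) {
            case INGESTION:
                return "datasetProperties";
            case SDK:
            default:
                return "editableDatasetProperties";
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " writes to " + targetAspect(mode));
        }
    }
}
```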
Explicit Aspect Targeting
Control which aspect to write:
// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");
// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");
Reading Description
Get description (prefers editable over system):
String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
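The read precedence amounts to a fallback from the editable value to the system value. The sketch below assumes that "set" means non-null, which the text does not state explicitly; DescriptionReadSketch and resolve are hypothetical names, not SDK methods.

```java
import java.util.Optional;

// Hypothetical illustration of "prefer editable, fall back to system".
final class DescriptionReadSketch {
    private DescriptionReadSketch() {}

    // Assumption: a value counts as "set" when it is non-null.
    // Returns null when neither value is present.
    static String resolve(String editableDescription, String systemDescription) {
        return Optional.ofNullable(editableDescription).orElse(systemDescription);
    }

    public static void main(String[] args) {
        System.out.println(resolve("User override description", "Generated by ETL pipeline"));
        System.out.println(resolve(null, "Generated by ETL pipeline"));
    }
}
```

A consequence of this ordering is that a user edit hides the ingested description for readers without overwriting it.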
Display Name Operations
Similar to description, display names are mode-aware:
// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");
// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");
// Read display name (prefers editable)
String name = dataset.getDisplayName();
Tags
Adding Tags
// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii
// Full tag URN
dataset.addTag("urn:li:tag:analytics");
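The auto-prefixing described in the comments can be sketched as a small normalization step. TagUrnSketch and toTagUrn are hypothetical names; the check shown here (prefix only when the input is not already a tag URN) is an assumption about how the two input forms are told apart.

```java
// Hypothetical helper, not part of the SDK: turns either input form
// into a full tag URN string.
final class TagUrnSketch {
    private static final String TAG_PREFIX = "urn:li:tag:";

    private TagUrnSketch() {}

    static String toTagUrn(String tag) {
        // Assumption: anything already starting with the prefix is a full URN.
        return tag.startsWith(TAG_PREFIX) ? tag : TAG_PREFIX + tag;
    }

    public static void main(String[] args) {
        System.out.println(toTagUrn("pii"));
        System.out.println(toTagUrn("urn:li:tag:analytics"));
    }
}
```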
Removing Tags
dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");
Tag Chaining
dataset.addTag("pii")
.addTag("sensitive")
.addTag("gdpr");
Owners
Adding Owners
import com.linkedin.common.OwnershipType;
// Technical owner
dataset.addOwner(
"urn:li:corpuser:john_doe",
OwnershipType.TECHNICAL_OWNER
);
// Data steward
dataset.addOwner(
"urn:li:corpuser:jane_smith",
OwnershipType.DATA_STEWARD
);
// Business owner
dataset.addOwner(
"urn:li:corpuser:alice",
OwnershipType.BUSINESS_OWNER
);
Removing Owners
dataset.removeOwner("urn:li:corpuser:john_doe");
Owner Types
Available ownership types:
- TECHNICAL_OWNER - Maintains the technical implementation
- BUSINESS_OWNER - Business stakeholder
- DATA_STEWARD - Manages data quality and compliance
- DATAOWNER - Generic data owner
- DEVELOPER - Software developer
- PRODUCER - Data producer
- CONSUMER - Data consumer
- STAKEHOLDER - Other stakeholder
Glossary Terms
Adding Terms
dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");
Removing Terms
dataset.removeTerm("urn:li:glossaryTerm:CustomerData");
Term Chaining
dataset.addTerm("urn:li:glossaryTerm:CustomerData")
.addTerm("urn:li:glossaryTerm:PII")
.addTerm("urn:li:glossaryTerm:GDPR");
Domain
Setting Domain
dataset.setDomain("urn:li:domain:Marketing");
Removing Domain
// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");
// Or clear all domains
dataset.clearDomains();
Custom Properties
Adding Individual Properties
dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");
Setting All Properties
Replace all custom properties:
Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");
dataset.setCustomProperties(properties);
Removing Properties
dataset.removeCustomProperty("retention_days");
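The difference between adding, replacing, and removing properties can be seen on a plain map. This sketch only models the semantics described in this section; CustomPropertiesSketch and its methods are hypothetical and do not reflect how the SDK stores or emits properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the three operations on a plain in-memory map.
final class CustomPropertiesSketch {
    private final Map<String, String> props = new LinkedHashMap<>();

    // Adds or overwrites a single key, leaving other keys untouched.
    CustomPropertiesSketch add(String key, String value) {
        props.put(key, value);
        return this;
    }

    // Replaces the whole set: keys missing from the replacement are dropped.
    CustomPropertiesSketch setAll(Map<String, String> replacement) {
        props.clear();
        props.putAll(replacement);
        return this;
    }

    // Removes a single key if present.
    CustomPropertiesSketch remove(String key) {
        props.remove(key);
        return this;
    }

    // Returns a copy so callers cannot mutate internal state.
    Map<String, String> snapshot() {
        return new LinkedHashMap<>(props);
    }
}
```

The point to notice is that setAll discards keys that were added earlier, whereas add and remove touch one key at a time.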
Schema
Setting Schema Metadata
import com.linkedin.schema.*;
SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);
Setting Schema Fields
import com.linkedin.schema.*;
List<SchemaField> fields = new ArrayList<>();
// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);
// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);
dataset.setSchemaFields(fields);
Complete Example
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
public class DatasetExample {
public static void main(String[] args) {
// Create client
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.build();
try {
// Build dataset with all metadata
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("analytics.public.user_events")
.env("PROD")
.description("User interaction events from web and mobile")
.displayName("User Events")
.build();
// Add tags
dataset.addTag("pii")
.addTag("analytics")
.addTag("gdpr");
// Add owners
dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
.addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);
// Add glossary terms
dataset.addTerm("urn:li:glossaryTerm:CustomerData")
.addTerm("urn:li:glossaryTerm:EventData");
// Set domain
dataset.setDomain("urn:li:domain:Analytics");
// Add custom properties
dataset.addCustomProperty("team", "data-engineering")
.addCustomProperty("retention_days", "365")
.addCustomProperty("refresh_schedule", "daily");
// Upsert to DataHub
client.entities().upsert(dataset);
System.out.println("Successfully created dataset: " + dataset.getUrn());
} catch (IOException | ExecutionException | InterruptedException e) {
e.printStackTrace();
} finally {
try {
client.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
Updating Existing Datasets
Load and Modify
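// Load existing dataset
// DatasetUrn and DataPlatformUrn come from com.linkedin.common.urn; FabricType from com.linkedin.common
DatasetUrn urn = new DatasetUrn(new DataPlatformUrn("snowflake"), "my_table", FabricType.PROD);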
Dataset dataset = client.entities().get(urn);
// Add new metadata (creates patches)
dataset.addTag("new-tag")
.addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);
// Apply patches
client.entities().update(dataset);
Incremental Updates
// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);
// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);
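One way to picture incremental updates is as a buffer of pending patches that is sent and then emptied on each update. The sketch below assumes the buffer is cleared after every send, which the patch example later on this page suggests by printing the pending count before each upsert but which is not stated outright. PatchBufferSketch, its string-based patches, and flush are hypothetical stand-ins, not SDK types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of patch accumulation; patches are plain strings here.
final class PatchBufferSketch {
    private final List<String> pending = new ArrayList<>();

    PatchBufferSketch addTag(String tag) {
        pending.add("ADD tag " + tag);
        return this;
    }

    PatchBufferSketch addCustomProperty(String key, String value) {
        pending.add("ADD property " + key + "=" + value);
        return this;
    }

    int pendingCount() {
        return pending.size();
    }

    // Stand-in for sending an update: returns what would be sent,
    // then clears the buffer (assumed behavior).
    List<String> flush() {
        List<String> sent = new ArrayList<>(pending);
        pending.clear();
        return sent;
    }
}
```

Under this model each update carries only the changes made since the previous one, so earlier metadata is not resent.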
Builder Options Reference
| Method | Required | Description |
|---|---|---|
| platform(String) | ✅ Yes | Data platform (e.g., "snowflake", "bigquery") |
| name(String) | ✅ Yes | Fully qualified dataset name |
| env(String) | No | Environment (PROD, DEV, etc.). Default: PROD |
| platformInstance(String) | No | Platform instance identifier |
| description(String) | No | Dataset description |
| displayName(String) | No | Display name shown in UI |
| customProperties(Map) | No | Map of custom key-value properties |
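The required and default columns in the table can be illustrated with a stripped-down builder. This is a sketch under assumptions: DatasetBuilderSketch is a hypothetical class, and failing with a NullPointerException on a missing required field is a choice made for this example, not documented SDK behavior.

```java
import java.util.Objects;

// Hypothetical builder covering only platform, name, and env.
final class DatasetBuilderSketch {
    private String platform;      // required
    private String name;          // required
    private String env = "PROD";  // optional, defaults to PROD per the table

    DatasetBuilderSketch platform(String platform) {
        this.platform = platform;
        return this;
    }

    DatasetBuilderSketch name(String name) {
        this.name = name;
        return this;
    }

    DatasetBuilderSketch env(String env) {
        this.env = env;
        return this;
    }

    // Validates required fields, then returns the resulting URN string.
    String buildUrn() {
        Objects.requireNonNull(platform, "platform is required");
        Objects.requireNonNull(name, "name is required");
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }
}
```

Calling buildUrn() without env(...) yields a URN ending in PROD, which matches the default listed in the table.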
Mode-Aware vs Explicit Methods
| Operation | Mode-Aware Method | SDK Mode Aspect | INGESTION Mode Aspect |
|---|---|---|---|
| Description | setDescription() | editableDatasetProperties | datasetProperties |
| Display Name | setDisplayName() | editableDatasetProperties | datasetProperties |
Explicit methods (always available):
- setSystemDescription() / setEditableDescription()
- setSystemDisplayName() / setEditableDisplayName()
Common Patterns
Creating Multiple Datasets
for (String tableName : tableNames) {
Dataset dataset = Dataset.builder()
.platform("postgres")
.name("public." + tableName)
.env("PROD")
.build();
dataset.addTag("auto-generated")
.addCustomProperty("created_by", "sync_job");
client.entities().upsert(dataset);
}
Batch Metadata Addition
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);
client.entities().upsert(dataset); // Emits all tags in one call
Conditional Metadata
if (isPII(dataset)) {
dataset.addTag("pii")
.addTerm("urn:li:glossaryTerm:PersonalData");
}
if (requiresGovernance(dataset)) {
dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}
Next Steps
- Chart Entity - Working with chart entities
- Patch Operations - Deep dive into patches
- Migration Guide - Upgrading from V1
Examples
Basic Dataset Creation
// Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java
package io.datahubproject.examples.v2;
import com.linkedin.common.OwnershipType;
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
/**
* Example demonstrating how to create a Dataset using Java SDK V2.
*
* <p>This example shows:
*
* <ul>
* <li>Creating a DataHubClientV2
* <li>Building a Dataset with fluent builder
* <li>Adding tags, owners, and custom properties
* <li>Upserting to DataHub
* </ul>
*/
public class DatasetCreateExample {
public static void main(String[] args)
throws IOException, ExecutionException, InterruptedException {
// Create client (use environment variables or pass explicit values)
DataHubClientV2 client =
DataHubClientV2.builder()
.server(System.getenv().getOrDefault("DATAHUB_SERVER", "http://localhost:8080"))
.token(System.getenv("DATAHUB_TOKEN")) // Optional
.build();
try {
// Test connection
if (!client.testConnection()) {
System.err.println("Failed to connect to DataHub server");
return;
}
System.out.println("✓ Connected to DataHub");
// Build dataset with metadata
Dataset dataset =
Dataset.builder()
.platform("snowflake")
.name("analytics.public.user_events")
.env("PROD")
.description("User interaction events from web and mobile applications")
.displayName("User Events")
.build();
System.out.println("✓ Built dataset with URN: " + dataset.getUrn());
// Add tags
dataset.addTag("pii").addTag("analytics").addTag("gdpr");
System.out.println("✓ Added 3 tags");
// Add owners
dataset
.addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER)
.addOwner("urn:li:corpuser:data_team", OwnershipType.DATA_STEWARD);
System.out.println("✓ Added 2 owners");
// Add custom properties
dataset
.addCustomProperty("team", "data-engineering")
.addCustomProperty("retention_days", "365")
.addCustomProperty("refresh_schedule", "daily")
.addCustomProperty("source_system", "web_analytics");
System.out.println("✓ Added 4 custom properties");
// Upsert to DataHub
client.entities().upsert(dataset);
System.out.println("✓ Successfully created dataset in DataHub!");
System.out.println("\n URN: " + dataset.getUrn());
System.out.println(
" View in DataHub: " + client.getConfig().getServer() + "/dataset/" + dataset.getUrn());
} finally {
client.close();
}
}
}
Dataset Patch Operations
// Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java
package io.datahubproject.examples.v2;
import com.linkedin.common.OwnershipType;
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
/**
* Example demonstrating patch-based updates using Java SDK V2.
*
* <p>This example shows:
*
* <ul>
* <li>Adding tags, owners, and properties to existing dataset
* <li>Patch accumulation pattern
* <li>Efficient incremental updates
* </ul>
*/
public class DatasetPatchExample {
public static void main(String[] args)
throws IOException, ExecutionException, InterruptedException {
// Create client
DataHubClientV2 client =
DataHubClientV2.builder()
.server(System.getenv().getOrDefault("DATAHUB_SERVER", "http://localhost:8080"))
.token(System.getenv("DATAHUB_TOKEN"))
.build();
try {
// Create a simple dataset
Dataset dataset =
Dataset.builder()
.platform("snowflake")
.name("analytics.public.user_events")
.env("PROD")
.build();
System.out.println("Dataset URN: " + dataset.getUrn());
// Add metadata using patches (accumulate without emitting)
dataset
.addTag("pii")
.addTag("analytics")
.addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER)
.addCustomProperty("team", "data-engineering");
System.out.println("Accumulated " + dataset.getPendingPatches().size() + " patches");
// Emit all patches in single upsert
client.entities().upsert(dataset);
System.out.println("✓ Successfully applied all patches!");
// Later, add more patches
dataset
.addTag("gdpr")
.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
System.out.println("Accumulated " + dataset.getPendingPatches().size() + " more patches");
// Apply new patches
client.entities().upsert(dataset);
System.out.println("✓ Successfully applied additional patches!");
} finally {
client.close();
}
}
}
Comprehensive Dataset Example
// Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java
package io.datahubproject.examples.v2;
import com.linkedin.common.OwnershipType;
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
/**
* Comprehensive example demonstrating all Dataset metadata operations using Java SDK V2.
*
* <p>This example shows:
*
* <ul>
* <li>Creating a dataset with complete metadata
* <li>Adding tags, owners, glossary terms
* <li>Setting domain and custom properties
* <li>Combining all operations in single entity
* </ul>
*/
public class DatasetFullExample {
public static void main(String[] args)
throws IOException, ExecutionException, InterruptedException {
// Create client
DataHubClientV2 client =
DataHubClientV2.builder()
.server(System.getenv().getOrDefault("DATAHUB_SERVER", "http://localhost:8080"))
.token(System.getenv("DATAHUB_TOKEN"))
.build();
try {
// Test connection
if (!client.testConnection()) {
System.err.println("Failed to connect to DataHub server");
return;
}
System.out.println("✓ Connected to DataHub");
// Build comprehensive dataset with all metadata types
Dataset dataset =
Dataset.builder()
.platform("snowflake")
.name("analytics.public.customer_transactions")
.env("PROD")
.description(
"Complete customer transaction history including purchases, refunds, and adjustments. "
+ "This dataset is the source of truth for financial reporting and customer analytics.")
.displayName("Customer Transactions")
.build();
System.out.println("✓ Built dataset with URN: " + dataset.getUrn());
// Add multiple tags for categorization
dataset
.addTag("pii")
.addTag("financial")
.addTag("analytics")
.addTag("gdpr")
.addTag("production");
System.out.println("✓ Added 5 tags");
// Add multiple owners with different roles
dataset
.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
.addOwner("urn:li:corpuser:finance_team", OwnershipType.DATA_STEWARD)
.addOwner("urn:li:corpuser:compliance_team", OwnershipType.DATA_STEWARD);
System.out.println("✓ Added 3 owners");
// Add glossary terms for business context
dataset
.addTerm("urn:li:glossaryTerm:CustomerData")
.addTerm("urn:li:glossaryTerm:FinancialTransaction")
.addTerm("urn:li:glossaryTerm:GDPR.PersonalData");
System.out.println("✓ Added 3 glossary terms");
// Set domain for organizational structure
dataset.setDomain("urn:li:domain:Finance");
System.out.println("✓ Set domain");
// Add comprehensive custom properties
dataset
.addCustomProperty("team", "data-platform")
.addCustomProperty("retention_days", "2555") // 7 years for financial data
.addCustomProperty("refresh_schedule", "hourly")
.addCustomProperty("source_system", "payment_gateway")
.addCustomProperty("sla_tier", "tier1")
.addCustomProperty("encryption", "AES-256")
.addCustomProperty("backup_frequency", "continuous")
.addCustomProperty("compliance_level", "PCI-DSS")
.addCustomProperty("data_classification", "highly-confidential")
.addCustomProperty("business_criticality", "mission-critical");
System.out.println("✓ Added 10 custom properties");
// Count accumulated patches
System.out.println("\nAccumulated " + dataset.getPendingPatches().size() + " patches");
// Upsert to DataHub - all metadata in single operation
client.entities().upsert(dataset);
System.out.println("\n✓ Successfully created comprehensive dataset in DataHub!");
System.out.println("\nSummary:");
System.out.println(" URN: " + dataset.getUrn());
System.out.println(" Platform: snowflake");
System.out.println(" Tags: 5");
System.out.println(" Owners: 3");
System.out.println(" Glossary Terms: 3");
System.out.println(" Domain: Finance");
System.out.println(" Custom Properties: 10");
System.out.println(
"\n View in DataHub: "
+ client.getConfig().getServer()
+ "/dataset/"
+ dataset.getUrn());
} finally {
client.close();
}
}
}