Dataset Entity

The Dataset entity represents a collection of data with a common schema (tables, views, files, topics, etc.). This guide covers creating, describing, and updating datasets with Java SDK V2.

Creating a Dataset

Minimal Dataset

Only platform and name are required:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();

With Environment

Specify environment (PROD, DEV, STAGING, etc.):

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)

With Metadata

Add description and display name at construction:

Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("PROD")
    .description("User transactions table")
    .displayName("User Transactions")
    .build();

With Custom Properties

Include custom properties in builder:

Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("public.users")
    .customProperties(props)
    .build();

With Platform Instance

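When the same platform runs as several instances (for example, multiple Kafka clusters), set a platform instance to tell them apart:

Dataset dataset = Dataset.builder()
    .platform("kafka")
    .name("user-events")
    .platformInstance("kafka-prod-cluster")
    .build();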

URN Construction

Dataset URNs follow the pattern:

urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})

Automatic URN creation:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .build();

DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
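
To make the pattern concrete, the following standalone sketch assembles the same URN string by hand. The class `DatasetUrnSketch` and its method are hypothetical names used for illustration only and are not part of the SDK, which builds the URN for you. The sketch assumes a missing environment falls back to PROD, as the builder reference in this guide states.

```java
/** Illustrative only: a hypothetical helper, not part of the DataHub SDK. */
final class DatasetUrnSketch {

    /** Formats urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env}). */
    static String datasetUrn(String platform, String name, String env) {
        // Assumption: a missing env falls back to PROD, matching the documented builder default.
        String effectiveEnv = (env == null || env.isEmpty()) ? "PROD" : env;
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, effectiveEnv);
    }

    public static void main(String[] args) {
        System.out.println(datasetUrn("snowflake", "analytics.public.events", "PROD"));
    }
}
```

In real code, prefer `dataset.getDatasetUrn()` over string formatting so the SDK stays the single source of truth for URN construction.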

Description Operations

Mode-Aware Description

The setDescription() method routes to different aspects based on mode:

// In SDK mode (the default), this call writes to editableDatasetProperties
dataset.setDescription("User-provided description");

// In INGESTION mode, the same call writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
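
The routing rule can be pictured with a small standalone model. The `Mode` enum and `targetAspect` method below are hypothetical stand-ins; this guide does not show how the SDK represents or selects its mode, so treat this as a sketch of the documented behavior rather than the SDK's implementation.

```java
/** Illustrative only: models the routing rule described above, not SDK code. */
final class DescriptionRoutingSketch {

    /** Hypothetical stand-in for the SDK's operating mode. */
    enum Mode { SDK, INGESTION }

    /** Returns the name of the aspect a mode-aware description write would target. */
    static String targetAspect(Mode mode) {
        switch (mode) {
            case INGESTION:
                return "datasetProperties";         // system-owned aspect
            case SDK:
            default:
                return "editableDatasetProperties"; // user-editable aspect
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + targetAspect(mode));
        }
    }
}
```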

Explicit Aspect Targeting

Control which aspect to write:

// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");

// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");

Reading Description

Get description (prefers editable over system):

String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
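
The read preference amounts to a simple fallback. The sketch below models it with plain strings; `DescriptionReadSketch` and `resolve` are hypothetical names, and returning `null` when neither value is set is an assumption, since this guide does not say what `getDescription()` returns in that case.

```java
/** Illustrative only: models the read preference described above, not SDK code. */
final class DescriptionReadSketch {

    /**
     * Prefers the editable description and falls back to the system one.
     * Assumption: returns null when neither is set.
     */
    static String resolve(String editableDescription, String systemDescription) {
        return editableDescription != null ? editableDescription : systemDescription;
    }

    public static void main(String[] args) {
        System.out.println(resolve("User override description", "Generated by ETL pipeline"));
        System.out.println(resolve(null, "Generated by ETL pipeline"));
    }
}
```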

Display Name Operations

Similar to description, display names are mode-aware:

// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");

// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");

// Read display name (prefers editable)
String name = dataset.getDisplayName();

Tags

Adding Tags

// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii

// Full tag URN
dataset.addTag("urn:li:tag:analytics");
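
Both call styles end up as a full tag URN. A minimal sketch of that normalization, assuming the only rule is "prepend `urn:li:tag:` unless it is already there", might look like this. `TagUrnSketch` and `toTagUrn` are hypothetical names, not SDK methods.

```java
/** Illustrative only: models the auto-prefixing described above, not SDK code. */
final class TagUrnSketch {

    private static final String TAG_PREFIX = "urn:li:tag:";

    /** Returns the input unchanged if it is already a tag URN, otherwise prefixes it. */
    static String toTagUrn(String tag) {
        return tag.startsWith(TAG_PREFIX) ? tag : TAG_PREFIX + tag;
    }

    public static void main(String[] args) {
        System.out.println(toTagUrn("pii"));
        System.out.println(toTagUrn("urn:li:tag:analytics"));
    }
}
```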

Removing Tags

dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");

Tag Chaining

dataset.addTag("pii")
    .addTag("sensitive")
    .addTag("gdpr");

Owners

Adding Owners

import com.linkedin.common.OwnershipType;

// Technical owner
dataset.addOwner(
    "urn:li:corpuser:john_doe",
    OwnershipType.TECHNICAL_OWNER
);

// Data steward
dataset.addOwner(
    "urn:li:corpuser:jane_smith",
    OwnershipType.DATA_STEWARD
);

// Business owner
dataset.addOwner(
    "urn:li:corpuser:alice",
    OwnershipType.BUSINESS_OWNER
);

Removing Owners

dataset.removeOwner("urn:li:corpuser:john_doe");

Owner Types

Available ownership types:

  • TECHNICAL_OWNER - Maintains the technical implementation
  • BUSINESS_OWNER - Business stakeholder
  • DATA_STEWARD - Manages data quality and compliance
  • DATAOWNER - Generic data owner
  • DEVELOPER - Software developer
  • PRODUCER - Data producer
  • CONSUMER - Data consumer
  • STAKEHOLDER - Other stakeholder

Glossary Terms

Adding Terms

dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");

Removing Terms

dataset.removeTerm("urn:li:glossaryTerm:CustomerData");

Term Chaining

dataset.addTerm("urn:li:glossaryTerm:CustomerData")
    .addTerm("urn:li:glossaryTerm:PII")
    .addTerm("urn:li:glossaryTerm:GDPR");

Domain

Setting Domain

dataset.setDomain("urn:li:domain:Marketing");

Removing Domain

// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");

// Or clear all domains
dataset.clearDomains();

Custom Properties

Adding Individual Properties

dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");

Setting All Properties

Replace all custom properties:

Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");

dataset.setCustomProperties(properties);

Removing Properties

dataset.removeCustomProperty("retention_days");
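
The three operations differ in scope: adding touches one key, setting replaces the whole map, and removing deletes one key. The sketch below models those semantics on a plain map. `CustomPropertiesSketch` and its methods are hypothetical, and the assumption that adding an existing key overwrites its value is mine, not something this guide states.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: models add / replace-all / remove semantics, not SDK code. */
final class CustomPropertiesSketch {

    private final Map<String, String> properties = new LinkedHashMap<>();

    /** Adds one property. Assumption: an existing key is overwritten. */
    CustomPropertiesSketch add(String key, String value) {
        properties.put(key, value);
        return this;
    }

    /** Replaces every property with the given map. */
    CustomPropertiesSketch setAll(Map<String, String> replacement) {
        properties.clear();
        properties.putAll(replacement);
        return this;
    }

    /** Removes one property; a missing key is ignored. */
    CustomPropertiesSketch remove(String key) {
        properties.remove(key);
        return this;
    }

    /** Read-only view of the current properties. */
    Map<String, String> view() {
        return Collections.unmodifiableMap(properties);
    }

    public static void main(String[] args) {
        CustomPropertiesSketch props = new CustomPropertiesSketch()
            .add("team", "data-engineering")
            .add("retention_days", "90");

        Map<String, String> replacement = new LinkedHashMap<>();
        replacement.put("classification", "internal");
        props.setAll(replacement); // "team" and "retention_days" are gone after this

        System.out.println(props.view());
    }
}
```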

Schema

Setting Schema Metadata

import com.linkedin.schema.*;

SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);

Setting Schema Fields

import com.linkedin.schema.*;
import java.util.ArrayList;
import java.util.List;

List<SchemaField> fields = new ArrayList<>();

// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);

// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);

dataset.setSchemaFields(fields);

Complete Example

import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class DatasetExample {
    public static void main(String[] args) {
        // Create client
        DataHubClientV2 client = DataHubClientV2.builder()
            .server("http://localhost:8080")
            .build();

        try {
            // Build dataset with all metadata
            Dataset dataset = Dataset.builder()
                .platform("snowflake")
                .name("analytics.public.user_events")
                .env("PROD")
                .description("User interaction events from web and mobile")
                .displayName("User Events")
                .build();

            // Add tags
            dataset.addTag("pii")
                .addTag("analytics")
                .addTag("gdpr");

            // Add owners
            dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
                .addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);

            // Add glossary terms
            dataset.addTerm("urn:li:glossaryTerm:CustomerData")
                .addTerm("urn:li:glossaryTerm:EventData");

            // Set domain
            dataset.setDomain("urn:li:domain:Analytics");

            // Add custom properties
            dataset.addCustomProperty("team", "data-engineering")
                .addCustomProperty("retention_days", "365")
                .addCustomProperty("refresh_schedule", "daily");

            // Upsert to DataHub
            client.entities().upsert(dataset);

            System.out.println("Successfully created dataset: " + dataset.getUrn());

        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Updating Existing Datasets

Load and Modify

// Load existing dataset
// DatasetUrn and DataPlatformUrn are in com.linkedin.common.urn; FabricType is in com.linkedin.common
DatasetUrn urn = new DatasetUrn(new DataPlatformUrn("snowflake"), "my_table", FabricType.PROD);
Dataset dataset = client.entities().get(urn);

// Add new metadata (creates patches)
dataset.addTag("new-tag")
    .addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);

// Apply patches
client.entities().update(dataset);

Incremental Updates

// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);

// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);
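
The accumulate-then-apply flow can be modeled without a server. The sketch below is a simplified stand-in: `PatchAccumulationSketch`, its string-based patches, and `flush()` are all hypothetical. It assumes that applying an update clears the pending list, which is what the patch example later in this guide suggests but does not state outright.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: models patch accumulation with strings, not SDK code. */
final class PatchAccumulationSketch {

    private final List<String> pending = new ArrayList<>();
    private final List<String> emitted = new ArrayList<>();

    /** Records a tag addition locally; nothing is sent yet. */
    PatchAccumulationSketch addTag(String tag) {
        pending.add("ADD tag " + tag);
        return this;
    }

    /** Records a custom property addition locally; nothing is sent yet. */
    PatchAccumulationSketch addCustomProperty(String key, String value) {
        pending.add("ADD property " + key + "=" + value);
        return this;
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Stands in for client.entities().update(dataset): "sends" every pending
     * patch and returns how many were sent.
     * Assumption: the pending list is empty afterwards.
     */
    int flush() {
        int sent = pending.size();
        emitted.addAll(pending);
        pending.clear();
        return sent;
    }

    List<String> emitted() {
        return Collections.unmodifiableList(emitted);
    }

    public static void main(String[] args) {
        PatchAccumulationSketch dataset = new PatchAccumulationSketch();

        dataset.addTag("sensitive");
        System.out.println("First update sent " + dataset.flush() + " patch(es)");

        dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
        System.out.println("Second update sent " + dataset.flush() + " patch(es)");
    }
}
```

Under this model each update carries only the changes made since the previous one, which is why small incremental updates stay cheap.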

Builder Options Reference

| Method | Required | Description |
| --- | --- | --- |
| platform(String) | ✅ Yes | Data platform (e.g., "snowflake", "bigquery") |
| name(String) | ✅ Yes | Fully qualified dataset name |
| env(String) | No | Environment (PROD, DEV, etc.). Default: PROD |
| platformInstance(String) | No | Platform instance identifier |
| description(String) | No | Dataset description |
| displayName(String) | No | Display name shown in UI |
| customProperties(Map) | No | Map of custom key-value properties |

Mode-Aware vs Explicit Methods

| Operation | Mode-Aware Method | SDK Mode Aspect | INGESTION Mode Aspect |
| --- | --- | --- | --- |
| Description | setDescription() | editableDatasetProperties | datasetProperties |
| Display Name | setDisplayName() | editableDatasetProperties | datasetProperties |

Explicit methods (always available):

  • setSystemDescription() / setEditableDescription()
  • setSystemDisplayName() / setEditableDisplayName()

Common Patterns

Creating Multiple Datasets

for (String tableName : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("postgres")
        .name("public." + tableName)
        .env("PROD")
        .build();

    dataset.addTag("auto-generated")
        .addCustomProperty("created_by", "sync_job");

    client.entities().upsert(dataset);
}

Batch Metadata Addition

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);

client.entities().upsert(dataset); // Emits all tags in one call

Conditional Metadata

if (isPII(dataset)) {
    dataset.addTag("pii")
        .addTerm("urn:li:glossaryTerm:PersonalData");
}

if (requiresGovernance(dataset)) {
    dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}

Examples

Basic Dataset Creation

// Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java
package io.datahubproject.examples.v2;

import com.linkedin.common.OwnershipType;
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

/**
 * Example demonstrating how to create a Dataset using Java SDK V2.
 *
 * <p>This example shows:
 *
 * <ul>
 *   <li>Creating a DataHubClientV2
 *   <li>Building a Dataset with fluent builder
 *   <li>Adding tags, owners, and custom properties
 *   <li>Upserting to DataHub
 * </ul>
 */
public class DatasetCreateExample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // Create client (use environment variables or pass explicit values)
    DataHubClientV2 client =
        DataHubClientV2.builder()
            .server(System.getenv().getOrDefault("DATAHUB_SERVER", "http://localhost:8080"))
            .token(System.getenv("DATAHUB_TOKEN")) // Optional
            .build();

    try {
      // Test connection
      if (!client.testConnection()) {
        System.err.println("Failed to connect to DataHub server");
        return;
      }
      System.out.println("✓ Connected to DataHub");

      // Build dataset with metadata
      Dataset dataset =
          Dataset.builder()
              .platform("snowflake")
              .name("analytics.public.user_events")
              .env("PROD")
              .description("User interaction events from web and mobile applications")
              .displayName("User Events")
              .build();

      System.out.println("✓ Built dataset with URN: " + dataset.getUrn());

      // Add tags
      dataset.addTag("pii").addTag("analytics").addTag("gdpr");

      System.out.println("✓ Added 3 tags");

      // Add owners
      dataset
          .addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER)
          .addOwner("urn:li:corpuser:data_team", OwnershipType.DATA_STEWARD);

      System.out.println("✓ Added 2 owners");

      // Add custom properties
      dataset
          .addCustomProperty("team", "data-engineering")
          .addCustomProperty("retention_days", "365")
          .addCustomProperty("refresh_schedule", "daily")
          .addCustomProperty("source_system", "web_analytics");

      System.out.println("✓ Added 4 custom properties");

      // Upsert to DataHub
      client.entities().upsert(dataset);

      System.out.println("✓ Successfully created dataset in DataHub!");
      System.out.println("\n URN: " + dataset.getUrn());
      System.out.println(
          " View in DataHub: " + client.getConfig().getServer() + "/dataset/" + dataset.getUrn());

    } finally {
      client.close();
    }
  }
}

Dataset Patch Operations

// Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java
package io.datahubproject.examples.v2;

import com.linkedin.common.OwnershipType;
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

/**
 * Example demonstrating patch-based updates using Java SDK V2.
 *
 * <p>This example shows:
 *
 * <ul>
 *   <li>Adding tags, owners, and properties to existing dataset
 *   <li>Patch accumulation pattern
 *   <li>Efficient incremental updates
 * </ul>
 */
public class DatasetPatchExample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // Create client
    DataHubClientV2 client =
        DataHubClientV2.builder()
            .server(System.getenv().getOrDefault("DATAHUB_SERVER", "http://localhost:8080"))
            .token(System.getenv("DATAHUB_TOKEN"))
            .build();

    try {
      // Create a simple dataset
      Dataset dataset =
          Dataset.builder()
              .platform("snowflake")
              .name("analytics.public.user_events")
              .env("PROD")
              .build();

      System.out.println("Dataset URN: " + dataset.getUrn());

      // Add metadata using patches (accumulate without emitting)
      dataset
          .addTag("pii")
          .addTag("analytics")
          .addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER)
          .addCustomProperty("team", "data-engineering");

      System.out.println("Accumulated " + dataset.getPendingPatches().size() + " patches");

      // Emit all patches in single upsert
      client.entities().upsert(dataset);

      System.out.println("✓ Successfully applied all patches!");

      // Later, add more patches
      dataset
          .addTag("gdpr")
          .addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));

      System.out.println("Accumulated " + dataset.getPendingPatches().size() + " more patches");

      // Apply new patches
      client.entities().upsert(dataset);

      System.out.println("✓ Successfully applied additional patches!");

    } finally {
      client.close();
    }
  }
}

Comprehensive Dataset Example

// Inlined from /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java
package io.datahubproject.examples.v2;

import com.linkedin.common.OwnershipType;
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

/**
 * Comprehensive example demonstrating all Dataset metadata operations using Java SDK V2.
 *
 * <p>This example shows:
 *
 * <ul>
 *   <li>Creating a dataset with complete metadata
 *   <li>Adding tags, owners, glossary terms
 *   <li>Setting domain and custom properties
 *   <li>Combining all operations in single entity
 * </ul>
 */
public class DatasetFullExample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // Create client
    DataHubClientV2 client =
        DataHubClientV2.builder()
            .server(System.getenv().getOrDefault("DATAHUB_SERVER", "http://localhost:8080"))
            .token(System.getenv("DATAHUB_TOKEN"))
            .build();

    try {
      // Test connection
      if (!client.testConnection()) {
        System.err.println("Failed to connect to DataHub server");
        return;
      }
      System.out.println("✓ Connected to DataHub");

      // Build comprehensive dataset with all metadata types
      Dataset dataset =
          Dataset.builder()
              .platform("snowflake")
              .name("analytics.public.customer_transactions")
              .env("PROD")
              .description(
                  "Complete customer transaction history including purchases, refunds, and adjustments. "
                      + "This dataset is the source of truth for financial reporting and customer analytics.")
              .displayName("Customer Transactions")
              .build();

      System.out.println("✓ Built dataset with URN: " + dataset.getUrn());

      // Add multiple tags for categorization
      dataset
          .addTag("pii")
          .addTag("financial")
          .addTag("analytics")
          .addTag("gdpr")
          .addTag("production");

      System.out.println("✓ Added 5 tags");

      // Add multiple owners with different roles
      dataset
          .addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
          .addOwner("urn:li:corpuser:finance_team", OwnershipType.DATA_STEWARD)
          .addOwner("urn:li:corpuser:compliance_team", OwnershipType.DATA_STEWARD);

      System.out.println("✓ Added 3 owners");

      // Add glossary terms for business context
      dataset
          .addTerm("urn:li:glossaryTerm:CustomerData")
          .addTerm("urn:li:glossaryTerm:FinancialTransaction")
          .addTerm("urn:li:glossaryTerm:GDPR.PersonalData");

      System.out.println("✓ Added 3 glossary terms");

      // Set domain for organizational structure
      dataset.setDomain("urn:li:domain:Finance");

      System.out.println("✓ Set domain");

      // Add comprehensive custom properties
      dataset
          .addCustomProperty("team", "data-platform")
          .addCustomProperty("retention_days", "2555") // 7 years for financial data
          .addCustomProperty("refresh_schedule", "hourly")
          .addCustomProperty("source_system", "payment_gateway")
          .addCustomProperty("sla_tier", "tier1")
          .addCustomProperty("encryption", "AES-256")
          .addCustomProperty("backup_frequency", "continuous")
          .addCustomProperty("compliance_level", "PCI-DSS")
          .addCustomProperty("data_classification", "highly-confidential")
          .addCustomProperty("business_criticality", "mission-critical");

      System.out.println("✓ Added 10 custom properties");

      // Count accumulated patches
      System.out.println("\nAccumulated " + dataset.getPendingPatches().size() + " patches");

      // Upsert to DataHub - all metadata in single operation
      client.entities().upsert(dataset);

      System.out.println("\n✓ Successfully created comprehensive dataset in DataHub!");
      System.out.println("\nSummary:");
      System.out.println(" URN: " + dataset.getUrn());
      System.out.println(" Platform: snowflake");
      System.out.println(" Tags: 5");
      System.out.println(" Owners: 3");
      System.out.println(" Glossary Terms: 3");
      System.out.println(" Domain: Finance");
      System.out.println(" Custom Properties: 10");
      System.out.println(
          "\n View in DataHub: "
              + client.getConfig().getServer()
              + "/dataset/"
              + dataset.getUrn());

    } finally {
      client.close();
    }
  }
}