Notebook
A Notebook is a metadata entity that represents interactive computational documents combining code execution, text documentation, data visualizations, and query results. Notebooks are collaborative environments for data analysis, exploration, and documentation, commonly used in data science, analytics, and business intelligence workflows.
The Notebook entity captures both the structural components (cells containing text, queries, or charts) and the metadata about notebooks from platforms like Jupyter, Databricks, QueryBook, Hex, Mode, Deepnote, and other notebook-based tools.
⚠️ Notice: The Notebook entity is currently in BETA. While the core functionality is stable, the entity model and UI features may evolve based on community feedback. Notebook support is actively being developed and improved.
Identity
A Notebook is uniquely identified by two components:
- notebookTool: The name of the notebook platform or tool (e.g., "querybook", "jupyter", "databricks", "hex")
- notebookId: A globally unique identifier for the notebook within that tool
The URN structure for a Notebook is:
urn:li:notebook:(<notebookTool>,<notebookId>)
Examples
urn:li:notebook:(querybook,773)
urn:li:notebook:(jupyter,analysis_2024_q1)
urn:li:notebook:(databricks,/Users/analyst/customer_segmentation)
urn:li:notebook:(hex,a8b3c5d7-1234-5678-90ab-cdef12345678)
Generating Stable Notebook IDs
The notebookId should be globally unique within a notebook tool, even when there are multiple deployments of that tool. Best practices include:
- URL-based IDs: Use the notebook URL or path (e.g., querybook.com/notebook/773)
- Platform IDs: Use the platform's native notebook identifier (e.g., Databricks workspace path)
- UUID: Generate a stable UUID based on notebook metadata for platforms without native IDs
- File paths: For Jupyter notebooks, use the file path relative to a known root directory
The key requirement is that the same notebook should always produce the same URN across different ingestion runs.
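As an illustration, the rules above can be sketched in plain Python. The helper names `make_notebook_urn` and `stable_notebook_id` are hypothetical, not part of the DataHub SDK:

```python
import uuid


def make_notebook_urn(notebook_tool: str, notebook_id: str) -> str:
    """Assemble a Notebook URN from its two identity components."""
    return f"urn:li:notebook:({notebook_tool},{notebook_id})"


def stable_notebook_id(platform_url: str, notebook_path: str) -> str:
    """Derive a deterministic ID for platforms without native notebook IDs.

    uuid5 hashes its inputs, so the same URL and path always yield the
    same ID across ingestion runs.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{platform_url}/{notebook_path}"))


print(make_notebook_urn("querybook", "773"))
# urn:li:notebook:(querybook,773)
```

Because `uuid5` is a pure function of its inputs, re-running ingestion produces the same URN, which is the stability requirement described above.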
Important Capabilities
Notebook Information
The notebookInfo aspect contains the core metadata about a notebook:
- title: The notebook's display name (searchable and used in autocomplete)
- description: Detailed description of what the notebook does or analyzes
- customProperties: Key-value pairs for platform-specific metadata
- externalUrl: Link to the notebook in its native platform
- changeAuditStamps: Tracking of who created/modified the notebook and when
The following code snippet shows you how to create a Notebook with basic information.
Python SDK: Create a Notebook
```python
# metadata-ingestion/examples/library/notebook_create.py
import logging
import time
from typing import Dict, Optional

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeAuditStampsClass,
    NotebookInfoClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def create_notebook_metadata(
    notebook_urn: str,
    title: str,
    description: str,
    external_url: str,
    custom_properties: Optional[Dict[str, str]] = None,
    actor: str = "urn:li:corpuser:data_scientist",
    timestamp_millis: Optional[int] = None,
) -> MetadataChangeProposalWrapper:
    """
    Create metadata for a notebook entity.

    Args:
        notebook_urn: URN of the notebook
        title: Title of the notebook
        description: Description of the notebook
        external_url: URL to access the notebook
        custom_properties: Optional dictionary of custom properties
        actor: URN of the actor creating the notebook
        timestamp_millis: Optional timestamp in milliseconds (defaults to current time)

    Returns:
        MetadataChangeProposalWrapper containing the notebook metadata
    """
    timestamp_millis = timestamp_millis or int(time.time() * 1000)
    audit_stamp = AuditStampClass(time=timestamp_millis, actor=actor)
    notebook_info = NotebookInfoClass(
        title=title,
        description=description,
        externalUrl=external_url,
        customProperties=custom_properties or {},
        changeAuditStamps=ChangeAuditStampsClass(
            created=audit_stamp,
            lastModified=audit_stamp,
        ),
    )
    return MetadataChangeProposalWrapper(
        entityUrn=notebook_urn,
        aspect=notebook_info,
    )


def main(emitter: Optional[DatahubRestEmitter] = None) -> None:
    """
    Main function to create a notebook example.

    Args:
        emitter: Optional emitter to use (for testing). If not provided, creates a new one.

    Environment Variables:
        DATAHUB_GMS_URL: DataHub GMS server URL (default: http://localhost:8080)
        DATAHUB_GMS_TOKEN: DataHub access token (if authentication is required)
    """
    if emitter is None:
        import os

        gms_server = os.getenv("DATAHUB_GMS_URL", "http://localhost:8080")
        token = os.getenv("DATAHUB_GMS_TOKEN")
        # If no token in env, try to get from datahub config
        if not token:
            try:
                from datahub.ingestion.graph.client import get_default_graph

                graph = get_default_graph()
                token = graph.config.token
            except Exception:
                # Fall back to no token
                pass
        emitter = DatahubRestEmitter(gms_server=gms_server, token=token)

    notebook_urn = "urn:li:notebook:(querybook,customer_analysis_2024)"
    event = create_notebook_metadata(
        notebook_urn=notebook_urn,
        title="Customer Segmentation Analysis 2024",
        description="Comprehensive analysis of customer segments including RFM analysis, cohort analysis, and predictive scoring for marketing campaigns",
        external_url="https://querybook.company.com/notebook/customer_analysis_2024",
        custom_properties={
            "workspace": "analytics",
            "team": "growth",
            "last_run": "2024-01-15T10:30:00Z",
        },
    )
    emitter.emit(event)
    log.info(f"Created notebook {notebook_urn}")


if __name__ == "__main__":
    main()
```
Notebook Content
The notebookContent aspect captures the actual structure and content of a notebook through a list of cells. Each cell represents a distinct block of content within the notebook.
Cell Types
Notebooks support three types of cells:
TEXT_CELL: Markdown or rich text content for documentation, explanations, and narrative
- Contains formatted text, headings, lists, images, and documentation
- Used to explain analysis steps, provide context, and create reports
QUERY_CELL: SQL or other query language statements for data retrieval and transformation
- Contains executable query code
- References datasets and produces result sets
- Can be linked to specific query entities for lineage tracking
CHART_CELL: Data visualizations and charts built from query results
- Contains configuration for charts and visualizations
- Can reference chart entities for metadata consistency
- Represents visual output from data analysis
Each cell in the notebookContent aspect includes:
- type: The cell type (TEXT_CELL, QUERY_CELL, or CHART_CELL)
- textCell: Content for text cells (null for other types)
- queryCell: Content for query cells (null for other types)
- chartCell: Content for chart cells (null for other types)
The cell list represents the sequential structure of the notebook as it appears to users.
Python SDK: Add content to a Notebook
```python
# metadata-ingestion/examples/library/notebook_add_content.py
import logging
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeAuditStampsClass,
    ChartCellClass,
    NotebookCellClass,
    NotebookCellTypeClass,
    NotebookContentClass,
    QueryCellClass,
    TextCellClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

notebook_urn = "urn:li:notebook:(querybook,customer_analysis_2024)"

audit_stamp = AuditStampClass(
    time=int(time.time() * 1000), actor="urn:li:corpuser:data_scientist"
)
change_audit = ChangeAuditStampsClass(created=audit_stamp, lastModified=audit_stamp)

cells = [
    NotebookCellClass(
        type=NotebookCellTypeClass.TEXT_CELL,
        textCell=TextCellClass(
            cellId="cell-1",
            cellTitle="Introduction",
            text="# Customer Segmentation Analysis\n\nThis notebook analyzes customer behavior patterns to identify high-value segments.",
            changeAuditStamps=change_audit,
        ),
    ),
    NotebookCellClass(
        type=NotebookCellTypeClass.QUERY_CELL,
        queryCell=QueryCellClass(
            cellId="cell-2",
            cellTitle="Customer Activity Query",
            rawQuery="SELECT customer_id, SUM(revenue) as total_revenue, COUNT(*) as order_count FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id ORDER BY total_revenue DESC LIMIT 1000",
            lastExecuted=audit_stamp,
            changeAuditStamps=change_audit,
        ),
    ),
    NotebookCellClass(
        type=NotebookCellTypeClass.CHART_CELL,
        chartCell=ChartCellClass(
            cellId="cell-3",
            cellTitle="Revenue Distribution by Segment",
            changeAuditStamps=change_audit,
        ),
    ),
]

notebook_content = NotebookContentClass(cells=cells)

event = MetadataChangeProposalWrapper(
    entityUrn=notebook_urn,
    aspect=notebook_content,
)
emitter.emit(event)
log.info(f"Added content to notebook {notebook_urn}")
```
Editable Properties
The editableNotebookProperties aspect allows users to add or modify certain notebook properties through the DataHub UI without affecting the source system:
- description: User-editable description that supplements or overrides the ingested description
This separation allows DataHub users to enrich notebook metadata while preserving the original information from the source platform.
Ownership
Notebooks support ownership through the ownership aspect, allowing you to track who is responsible for maintaining and governing each notebook. Ownership types include:
- TECHNICAL_OWNER: Engineers or data scientists who created or maintain the notebook
- BUSINESS_OWNER: Business stakeholders who own the analysis or insights
- DATA_STEWARD: Data governance personnel responsible for notebook quality and compliance
Python SDK: Add ownership to a Notebook
```python
# metadata-ingestion/examples/library/notebook_add_owner.py
import logging

from datahub.emitter.mce_builder import make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

notebook_urn = "urn:li:notebook:(querybook,customer_analysis_2024)"

owner_to_add = make_user_urn("data_scientist")
ownership_type = OwnershipTypeClass.TECHNICAL_OWNER

owners_to_add = [
    OwnerClass(owner=owner_to_add, type=ownership_type),
]
ownership = OwnershipClass(owners=owners_to_add)

event = MetadataChangeProposalWrapper(
    entityUrn=notebook_urn,
    aspect=ownership,
)
emitter.emit(event)
log.info(f"Added owner {owner_to_add} to notebook {notebook_urn}")
```
Tags and Glossary Terms
Notebooks can be tagged and associated with glossary terms for organization and discovery:
- Tags (via the globalTags aspect): Informal categorization labels like "exploratory", "production", "deprecated", "customer-analysis"
- Glossary Terms (via the glossaryTerms aspect): Formal business vocabulary linking notebooks to business concepts
Python SDK: Add tags to a Notebook
```python
# metadata-ingestion/examples/library/notebook_add_tags.py
import logging

from datahub.emitter.mce_builder import make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlobalTagsClass,
    TagAssociationClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

notebook_urn = "urn:li:notebook:(querybook,customer_analysis_2024)"

tag_to_add = make_tag_urn("production")
tag_association = TagAssociationClass(tag=tag_to_add)
global_tags = GlobalTagsClass(tags=[tag_association])

event = MetadataChangeProposalWrapper(
    entityUrn=notebook_urn,
    aspect=global_tags,
)
emitter.emit(event)
log.info(f"Added tag {tag_to_add} to notebook {notebook_urn}")
```
Domains
Notebooks can be assigned to one or more domains through the domains aspect, organizing them by business unit, team, or functional area. This helps with discovery and governance at scale.
Browse Paths
The browsePaths and browsePathsV2 aspects enable hierarchical navigation of notebooks within DataHub, allowing users to browse notebooks by platform, workspace, folder, or other organizational structures.
Applications
The applications aspect allows linking notebooks to specific applications or use cases, helping track which business applications or workflows depend on particular notebooks.
Sub Types
The subTypes aspect enables classification of notebooks into categories like:
- "Data Analysis"
- "ML Training"
- "Reporting"
- "Data Exploration"
- "ETL Development"
This helps users find notebooks relevant to their specific needs.
Institutional Memory
Through the institutionalMemory aspect, notebooks can have links to external documentation, wikis, runbooks, or other resources that provide additional context about their purpose and usage.
Test Results
The testResults aspect can capture the results of data quality tests or validation checks performed within the notebook, integrating notebook-based testing into DataHub's data quality framework.
Integration Points
Relationship with Datasets
Notebooks have relationships with datasets through query cells:
- Query Subjects: When a notebook's query cell references datasets, those relationships are captured
- Lineage: Notebooks can be sources of lineage information when their queries create or transform data
- Usage Tracking: Notebooks contribute to dataset usage statistics through their query execution patterns
Relationship with Charts
When a notebook contains chart cells, those cells can reference chart entities, creating a relationship between the notebook and the visualizations it produces. This is particularly relevant for BI notebook tools like Mode or Hex where notebooks generate reusable charts.
Relationship with Queries
Query cells in notebooks can be linked to query entities, enabling:
- Query Reuse: Track where specific queries are used across different notebooks
- Lineage Propagation: Leverage SQL parsing from query entities for notebook lineage
- Usage Analytics: Understand query patterns in the context of notebook workflows
Platform Instance
The dataPlatformInstance aspect associates a notebook with a specific instance of a notebook platform (e.g., a particular Databricks workspace or Hex account), which is essential when multiple instances of the same platform exist.
Ingestion Sources
Several DataHub connectors extract notebook metadata:
- QueryBook: Ingests notebooks with their cells and metadata
- Jupyter: Can process notebook files from repositories or file systems
- Databricks: Extracts notebooks from Databricks workspaces
- Hex: Ingests notebooks and their project context
- Mode: Extracts notebooks (called "reports" in Mode) with their queries and visualizations
- Deepnote: Can ingest collaborative notebooks from Deepnote projects
These connectors typically:
- Discover notebooks from the platform's API or file system
- Extract notebook metadata (title, description, author, timestamps)
- Parse notebook structure into cells of appropriate types
- Create relationships to referenced datasets, queries, and charts
- Track ownership and collaboration information
GraphQL API
Notebooks are accessible through DataHub's GraphQL API, supporting queries for:
- Notebook metadata and properties
- Notebook content and cell structure
- Relationships to datasets, charts, and queries
- Ownership and governance information
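As an illustrative sketch, fetching a notebook's properties might look like the query below. The field selections are assumptions based on the entity model described above; check the GraphQL schema in your DataHub instance (e.g., via GraphiQL) for the authoritative shape:

```python
# A GraphQL query for notebook metadata; execute it with any GraphQL client,
# or via DataHubGraph.execute_graphql against a running instance.
NOTEBOOK_QUERY = """
query getNotebook($urn: String!) {
  notebook(urn: $urn) {
    urn
    info {
      title
      description
      externalUrl
    }
    ownership {
      owners {
        owner {
          ... on CorpUser {
            username
          }
        }
      }
    }
  }
}
"""

variables = {"urn": "urn:li:notebook:(querybook,customer_analysis_2024)"}

# Hypothetical execution against a local instance:
# from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
# graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# result = graph.execute_graphql(query=NOTEBOOK_QUERY, variables=variables)
```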
Notable Exceptions
Beta Status
As a BETA feature, notebooks have some limitations:
- UI Support: The DataHub web interface may not fully visualize all notebook capabilities
- Lineage Extraction: Automatic lineage from notebook queries may vary by platform
- Search and Discovery: Notebook-specific search features are still evolving
- Cell Execution State: Execution results and output cells are not currently captured
Users should expect ongoing improvements and potential schema changes as the feature matures.
Cell Content Storage
Notebook cells store structural information and metadata but may not capture:
- Full Execution Output: Large result sets from query execution
- Binary Attachments: Images or files embedded in notebooks (except via URLs)
- Interactive Widgets: Dynamic UI elements in notebooks like ipywidgets
The focus is on capturing the notebook's code, structure, and metadata rather than execution artifacts.
Platform-Specific Features
Different notebook platforms have unique features that may not map perfectly to DataHub's model:
- Databricks: Collaborative features, version control, and job scheduling
- Hex: App-building features and parameter inputs
- Jupyter: Kernel-specific features and extensions
- Mode: Report scheduling and sharing configurations
Ingestion connectors capture common features while platform-specific capabilities may be stored in customProperties.
Cell Ordering
The notebookContent cells array preserves the order of cells as they appear in the source notebook. However, notebooks with complex branching logic or non-linear execution flows may not be fully represented by a simple ordered list.
Versioning
The current notebook model doesn't natively track notebook versions or revision history. The changeAuditStamps captures last modified information, but full version control requires integration with the source platform's versioning system (e.g., Git for Jupyter, platform version history for Databricks).
Large Notebooks
Very large notebooks with hundreds of cells can raise performance considerations:
- Ingestion time increases with notebook size
- UI rendering may be optimized for notebook metadata rather than full content display
- Consider splitting extremely large notebooks into smaller, focused notebooks for better manageability
Use Cases
Notebooks in DataHub enable several important use cases:
- Discovery: Find notebooks related to specific datasets, business domains, or analysis topics
- Documentation: Understand how data is analyzed and transformed through self-documenting notebook code
- Lineage: Track data flows through notebook-based ETL and transformation pipelines
- Collaboration: Identify notebook owners and subject matter experts for specific analyses
- Governance: Apply tags, terms, and classifications to notebook-based analytics
- Impact Analysis: Understand downstream dependencies when datasets used by notebooks change
- Knowledge Management: Preserve institutional knowledge embedded in analysis notebooks
By bringing notebooks into DataHub's metadata graph, organizations can treat analysis code with the same rigor as production data assets.
Technical Reference
For technical details about fields, searchability, and relationships, view the Columns tab in DataHub.