Optional PySpark Support for S3 Source
DataHub's S3 source now supports optional PySpark installation through the s3-slim variant. This allows users to choose a lightweight installation when data lake profiling is not needed.
Overview
The S3 source includes PySpark by default for backward compatibility and profiling support. For users who only need metadata extraction without profiling, the s3-slim variant provides a ~500MB smaller installation.
Current implementation status:
- ✅ S3: SparkProfiler pattern fully implemented (optional PySpark)
- ABS: Not yet implemented (still requires PySpark for profiling)
- Unity Catalog: Not affected by this change (uses separate profiling mechanisms)
- GCS: Does not support profiling
Note: This change implements the SparkProfiler pattern for S3 only. The same pattern can be applied to other sources (ABS, etc.) in future PRs.
PySpark Version
Current Version: PySpark 3.5.x (3.5.6)
PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.
Installation Options
Standard Installation (includes PySpark)
```shell
pip install 'acryl-datahub[s3]'  # S3 with PySpark/profiling support
```
Lightweight Installation (without PySpark)
For installations where you don't need profiling capabilities and want to save ~500MB:
```shell
pip install 'acryl-datahub[s3-slim]'  # S3 without profiling (~500MB smaller)
```
Recommendation: Use s3-slim when profiling is not needed.
The `data-lake-profiling` dependencies (included in the standard `s3` extra by default):
- `pyspark~=3.5.6`
- `pydeequ>=1.1.0`
- Profiling dependencies (`cachetools`)
Note: In a future major release (e.g., DataHub 2.0), the `s3-slim` variant may become the default, and PySpark will be truly optional. The current approach provides backward compatibility while giving users time to adapt.
What's Included
S3 source:
Standard s3 extra:
- ✅ Metadata extraction (schemas, tables, file listing)
- ✅ Data format detection (Parquet, Avro, CSV, JSON, etc.)
- ✅ Schema inference from files
- ✅ Table and column-level metadata
- ✅ Tags and properties extraction
- ✅ Data profiling (min/max, nulls, distinct counts)
- ✅ Data quality checks (PyDeequ-based)
- Includes: PySpark 3.5.6 + PyDeequ
s3-slim variant:
- ✅ All metadata features (same as above)
- ❌ Data profiling disabled
- No PySpark dependencies (~500MB smaller)
Feature Comparison
| Feature | s3-slim | Standard s3 |
|---|---|---|
| Metadata extraction | ✅ Full support | ✅ Full support |
| Schema inference | ✅ Full support | ✅ Full support |
| Tags & properties | ✅ Full support | ✅ Full support |
| Data profiling | ❌ Not available | ✅ Full profiling |
| Installation size | ~200MB | ~700MB |
| Install time | Fast | Slower (PySpark build) |
| PySpark dependencies | ❌ None | ✅ PySpark 3.5.6 + PyDeequ |
Configuration
With Standard Installation (PySpark included)
When you install acryl-datahub[s3], profiling works out of the box:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true # Works seamlessly with standard installation
      profile_table_level_only: false
```
With Slim Installation (no PySpark)
When you install s3-slim, disable profiling in your config:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false # Required for s3-slim installation
```
If you enable profiling with an `s3-slim` installation, you'll see a clear error message at runtime:

```
RuntimeError: PySpark is not installed, but is required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3]'
```
Developer Guide
Implementation Pattern
The S3 source demonstrates the recommended pattern for isolating PySpark-dependent code. This pattern can be applied to ABS and other sources in future PRs.
Architecture (currently implemented for S3 only):
- Main source class (`source.py`): contains no PySpark imports at module level
- Profiler class (`profiling.py`): encapsulates all PySpark/PyDeequ logic in the `SparkProfiler` class
- Conditional instantiation: `SparkProfiler` is created only when profiling is enabled
- `TYPE_CHECKING` imports: type annotations use a `TYPE_CHECKING` block for optional dependencies
Key Benefits:
- ✅ Type safety preserved (mypy passes without issues)
- ✅ Proper code layer separation
- ✅ Works with both standard and `-slim` installations
- ✅ Clear error messages when dependencies are missing
- ✅ Pattern can be reused for ABS and other sources
Example structure:
```python
# source.py
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    from datahub.ingestion.source.s3.profiling import SparkProfiler

class S3Source:
    profiler: Optional["SparkProfiler"]

    def __init__(self, config, ctx):
        if config.is_profiling_enabled():
            # Deferred import: PySpark is pulled in only when profiling is enabled
            from datahub.ingestion.source.s3.profiling import SparkProfiler

            self.profiler = SparkProfiler(...)
        else:
            self.profiler = None
```

```python
# profiling.py
from typing import Any

class SparkProfiler:
    """Encapsulates all PySpark/PyDeequ profiling logic."""

    def init_spark(self) -> Any:
        # Spark session initialization
        ...

    def read_file_spark(self, file: str, ext: str):
        # File reading with Spark
        ...

    def get_table_profile(self, table_data, dataset_urn):
        # Table profiling coordination
        ...
```
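DataHub's actual guard code is not reproduced here, but conceptually the profiler can fail fast when PySpark is absent, producing the runtime error shown earlier. A minimal sketch, assuming a hypothetical helper named `require_pyspark` (illustrative only, not DataHub's real API):

```python
# Minimal sketch of a fail-fast guard for the optional PySpark dependency.
# The helper name and placement are assumptions for illustration.
import importlib.util

def require_pyspark() -> None:
    """Raise an actionable error if PySpark is missing but profiling was requested."""
    if importlib.util.find_spec("pyspark") is None:
        raise RuntimeError(
            "PySpark is not installed, but is required for S3 profiling. "
            "Please install with: pip install 'acryl-datahub[s3]'"
        )
```

Calling such a guard early (for example, when `SparkProfiler` is constructed) keeps `source.py` importable without PySpark while still surfacing a clear message the moment profiling is actually attempted.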
For more details, see the Adding a Metadata Ingestion Source guide.
Troubleshooting
Error: "PySpark is not installed, but is required for profiling"
Problem: You installed a -slim variant but have profiling enabled in your config.
Solutions:
Recommended: Use the standard installation with PySpark:

```shell
pip uninstall acryl-datahub
pip install 'acryl-datahub[s3]'  # For S3 profiling
```

Alternative: Disable profiling in your recipe:

```yaml
profiling:
  enabled: false
```
Verifying Installation
Check if PySpark is installed:
```shell
# Check installed packages
pip list | grep pyspark

# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```
Expected output:
- Standard installation (`s3`): shows `pyspark 3.5.x`
- Slim installation (`s3-slim`): import fails or package not found
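You can also check from Python. The following standalone snippet (not part of DataHub itself) reports which situation you are in:

```python
# Standalone check: reports whether the optional PySpark dependency is available.
import importlib.util

if importlib.util.find_spec("pyspark") is None:
    print("PySpark not installed - metadata extraction only (s3-slim behavior)")
else:
    import pyspark

    print(f"PySpark {pyspark.__version__} available - profiling supported")
```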
Migration Guide
Upgrading from Previous Versions
No action required! This change is fully backward compatible:
```shell
# Existing installations continue to work exactly as before
pip install 'acryl-datahub[s3]'  # Still includes PySpark by default (profiling supported)
```
Recommended: Optimize installations
- S3 with profiling: keep using `acryl-datahub[s3]` (includes PySpark)
- S3 without profiling: switch to `acryl-datahub[s3-slim]` to save ~500MB
```shell
# Recommended installations
pip install 'acryl-datahub[s3]'       # S3 with profiling support
pip install 'acryl-datahub[s3-slim]'  # S3 metadata only (no profiling)
```
No Breaking Changes
This implementation maintains full backward compatibility:
- Standard `s3` extra includes PySpark (unchanged behavior)
- All existing recipes and configs continue to work
- New `s3-slim` variant is available for users who want smaller installations
- A future DataHub 2.0 may flip the default, but provides a migration path
Benefits for DataHub Actions
DataHub Actions depends on acryl-datahub and can benefit from s3-slim when profiling is not needed:
Reduced Installation Size
DataHub Actions typically doesn't need data lake profiling capabilities since it focuses on reacting to metadata events, not extracting metadata from data lakes. Use s3-slim to reduce footprint:
```shell
# If Actions needs S3 metadata access but not profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3-slim]'
# Result: ~500MB smaller than standard s3 extra

# If Actions needs full S3 with profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3]'
# Result: Includes PySpark for profiling capabilities
```
Faster Deployment
Actions services using s3-slim deploy faster in containerized environments:
- Faster pip install: No PySpark compilation required
- Smaller Docker images: Reduced base image size
- Quicker cold starts: Less code to load and initialize
Fewer Dependency Conflicts
Actions workflows often integrate with other tools (Slack, Teams, email services). Using s3-slim reduces:
- Python version constraint conflicts
- Java/Spark runtime conflicts in restricted environments
- Transitive dependency version mismatches
When Actions Needs Profiling
If your Actions workflow needs to trigger data lake profiling jobs, use the standard extra:
```shell
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3]'  # Includes PySpark by default
```
Common Actions use cases that DON'T need PySpark:
- Slack notifications on schema changes
- Propagating tags and terms to downstream systems
- Triggering dbt runs on metadata updates
- Sending emails on data quality failures
- Creating Jira tickets for governance issues
- Updating external catalogs (e.g., Alation, Collibra)
Rare Actions use cases that MIGHT need PySpark:
- Custom actions that programmatically trigger S3 profiling
- Actions that directly process data lake files (not typical)
Benefits Summary
✅ Backward compatible: Standard s3 extra unchanged, existing users unaffected
✅ Smaller installations: Save ~500MB with s3-slim
✅ Faster setup: No PySpark compilation with s3-slim
✅ Flexible deployment: Choose based on profiling needs
✅ Type safety maintained: Refactored with proper code layer separation (mypy passes)
✅ Clear error messages: Runtime errors guide users to correct installation
✅ Actions-friendly: DataHub Actions benefits from reduced footprint with s3-slim
Key Takeaways:
- Use `s3` if you need S3 profiling, `s3-slim` if you don't
- The pattern can be applied to other sources (ABS, etc.) in future PRs
- Existing installations continue working without changes