
Pathml
Store and batch-process whole-slide pathology imagery in HDF5 with tiled SlideData pipelines for ML-ready datasets.
Overview
Pathml is an agent skill for the Build phase that implements PathML HDF5 storage and tiled whole-slide batch processing for pathology machine learning pipelines.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill pathml
What is this skill?
- HDF5 as primary format with compression, chunked storage, and fast random tile access
- SlideData and SlideDataset APIs for single slides and globbed multi-slide corpora
- Tile generation with configurable level, tile_size, and stride before pipeline execution
- Distributed processing via dataset.run(pipeline, distributed=True, n_workers=8)
- Documented hierarchical HDF5 layout for images, masks, features, and metadata
- SlideDataset example uses n_workers=8 for distributed processing
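The tile_size and stride settings listed above determine how many tiles a slide yields and how much neighbouring tiles overlap. The sketch below illustrates that relationship in plain Python. `tile_origins` is a hypothetical helper written for this example, not a PathML function, and it assumes that edge tiles which do not fit are dropped rather than padded.

```python
from typing import Iterator, Tuple


def tile_origins(width: int, height: int, tile_size: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) top-left corners of tiles that fit fully inside the image.

    Hypothetical helper for illustration only; not part of PathML.
    """
    if tile_size <= 0 or stride <= 0:
        raise ValueError("tile_size and stride must be positive")
    # range() stops before any origin whose tile would cross the image edge
    for y in range(0, height - tile_size + 1, stride):
        for x in range(0, width - tile_size + 1, stride):
            yield (x, y)


# stride == tile_size: tiles sit edge to edge with no overlap
non_overlapping = list(tile_origins(1024, 768, tile_size=256, stride=256))

# stride == tile_size // 2: each tile shares half its extent with its neighbour
overlapping = list(tile_origins(1024, 768, tile_size=256, stride=128))

print(len(non_overlapping), len(overlapping))
```

Halving the stride roughly quadruples the tile count on a large image, which is the storage and compute cost that overlapping tiles trade for better boundary handling.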
Adoption & trust: 512 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Whole-slide pathology images are too large for naive file-per-tile workflows, and without a chunked, hierarchical store, ML code cannot random-access tiles efficiently.
Who is it for?
Builders creating pathology ML backends, dataset builders, or agent automation around SVS slides and PathML preprocessing pipelines.
Skip if: you work on clinical diagnosis workflows, build tiny image classifiers that do not need WSI tiling, or only need DICOM viewers without HDF5 training stores.
When should I use this skill?
Use it when managing large-scale pathology datasets, saving processed slides to HDF5, or optimizing tile batch workflows for ML.
What do I get? / Deliverables
Processed slides and multi-slide datasets land in structured HDF5 files with tiles and pipeline outputs ready for training and reproducible agent-driven batch jobs.
- processed_slide.h5 or processed_dataset.h5 HDF5 artifacts
- Tiled slide representations at configured level and stride
- Batch-processed dataset ready for downstream ML consumers
Journey fit
The canonical shelf is Build because the skill centers on implementing storage and batch-processing backends for pathology ML, not on distribution or validation landing pages. The Backend category fits its HDF5 persistence, hierarchical slide datasets, and distributed dataset.run workflows that feed model training code.
How it compares
Use it instead of scattering PNG tiles on disk when you need hierarchical HDF5 with masks, features, and metadata in one ML-friendly artifact.
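As a rough illustration of what "one ML-friendly artifact" can look like, the sketch below writes a single tile's image, tissue mask, and slide metadata into one HDF5 file with h5py, then reads the mask back by path. The group names mirror the layout documented in SKILL.md, but this is raw h5py under assumed shapes and chunk sizes, not PathML's own writer. The file name and array contents are placeholders.

```python
import os
import tempfile

import h5py
import numpy as np

# Placeholder data standing in for one 256x256 RGB tile and its tissue mask
image = np.zeros((256, 256, 3), dtype=np.uint8)
tissue = np.ones((256, 256), dtype=bool)

path = os.path.join(tempfile.mkdtemp(), "example_dataset.h5")

with h5py.File(path, "w") as f:
    slide = f.create_group("slide_0")
    # Small scalar metadata fits naturally in HDF5 attributes
    slide.create_group("metadata").attrs["name"] = "example_slide"

    # Intermediate groups ("tiles", "masks") are created automatically
    tile = slide.create_group("tiles/tile_0")
    # Chunked, gzip-compressed storage; the chunk shape is an assumption
    tile.create_dataset("image", data=image, chunks=(64, 64, 3), compression="gzip")
    tile.create_dataset("coords", data=np.array([0, 0]))
    tile.create_dataset("masks/tissue", data=tissue, compression="gzip")

with h5py.File(path, "r") as f:
    # Only the requested dataset is read from disk
    mask = f["slide_0/tiles/tile_0/masks/tissue"][:]
    name = f["slide_0/metadata"].attrs["name"]

print(name, mask.shape, mask.dtype)
```

Keeping image, mask, and metadata under one slide group means a training loader can open a single file and address any tile by path, instead of tracking parallel directories of PNGs and sidecar files.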
Common Questions / FAQ
Who is pathml for?
Solo developers and small teams building computational pathology pipelines who use PathML to tile, preprocess, and persist whole-slide data for model training.
When should I use pathml?
During Build when you are implementing SlideData loading, tile generation, distributed SlideDataset runs, or HDF5 export for large pathology corpora.
Is pathml safe to install?
Check the Security Audits panel on this page and validate the skill package before agents run Python that reads local slide paths and writes large HDF5 files.
SKILL.md
# Data Management & Storage

## Overview

PathML provides efficient data management solutions for handling large-scale pathology datasets through HDF5 storage, tile management strategies, and optimized batch processing workflows. The framework enables seamless storage and retrieval of images, masks, features, and metadata in formats optimized for machine learning pipelines and downstream analysis.

## HDF5 Integration

HDF5 (Hierarchical Data Format) is the primary storage format for processed PathML data, providing:

- Efficient compression and chunked storage
- Fast random access to subsets of data
- Support for arbitrarily large datasets
- Hierarchical organization of heterogeneous data types
- Cross-platform compatibility

### Saving to HDF5

**Single slide:**

```python
from pathml.core import SlideData

# Load and process slide
wsi = SlideData.from_slide("slide.svs")
wsi.generate_tiles(level=1, tile_size=256, stride=256)

# Run preprocessing pipeline
pipeline.run(wsi)

# Save to HDF5
wsi.to_hdf5("processed_slide.h5")
```

**Multiple slides (SlideDataset):**

```python
from pathml.core import SlideDataset
import glob

# Create dataset
slide_paths = glob.glob("data/*.svs")
dataset = SlideDataset(slide_paths, tile_size=256, stride=256, level=1)

# Process
dataset.run(pipeline, distributed=True, n_workers=8)

# Save entire dataset
dataset.to_hdf5("processed_dataset.h5")
```

### HDF5 File Structure

PathML HDF5 files are organized hierarchically:

```
processed_dataset.h5
├── slide_0/
│   ├── metadata/
│   │   ├── name
│   │   ├── level
│   │   ├── dimensions
│   │   └── ...
│   ├── tiles/
│   │   ├── tile_0/
│   │   │   ├── image (H, W, C) array
│   │   │   ├── coords (x, y)
│   │   │   └── masks/
│   │   │       ├── tissue
│   │   │       ├── nucleus
│   │   │       └── ...
│   │   ├── tile_1/
│   │   └── ...
│   └── features/
│       ├── tile_features (n_tiles, n_features)
│       └── feature_names
├── slide_1/
└── ...
```

### Loading from HDF5

**Load entire slide:**

```python
from pathml.core import SlideData

# Load from HDF5
wsi = SlideData.from_hdf5("processed_slide.h5")

# Access tiles
for tile in wsi.tiles:
    image = tile.image
    masks = tile.masks
    # Process tile...
```

**Load specific tiles:**

```python
# Load only tiles at specific indices
tile_indices = [0, 10, 20, 30]
tiles = wsi.load_tiles_from_hdf5("processed_slide.h5", indices=tile_indices)

for tile in tiles:
    # Process subset...
    pass
```

**Memory-mapped access:**

```python
import h5py

# Open HDF5 file without loading into memory
with h5py.File("processed_dataset.h5", 'r') as f:
    # Access specific data
    tile_0_image = f['slide_0/tiles/tile_0/image'][:]
    tissue_mask = f['slide_0/tiles/tile_0/masks/tissue'][:]

    # Iterate through tiles efficiently
    for tile_key in f['slide_0/tiles'].keys():
        tile_image = f[f'slide_0/tiles/{tile_key}/image'][:]
        # Process without loading all tiles...
```

## Tile Management

### Tile Generation Strategies

**Fixed-size tiles with no overlap:**

```python
wsi.generate_tiles(
    level=1,
    tile_size=256,
    stride=256,  # stride = tile_size → no overlap
    pad=False    # Don't pad edge tiles
)
```

- **Use case:** Standard tile-based processing, classification
- **Pros:** Simple, no redundancy, fast processing
- **Cons:** Edge effects at tile boundaries

**Overlapping tiles:**

```python
wsi.generate_tiles(
    level=1,
    tile_size=256,
    stride=128,  # 50% overlap
    pad=False
)
```

- **Use case:** Segmentation, detection (reduces boundary artifacts)
- **Pros:** Better boundary handling, smoother stitching
- **Cons:** More tiles, redundant computation

**Adaptive tiling based on tissue content:**

```python
from pathml.utils import adaptive_tile_generation

# Generate tiles only in tissue regions
wsi.generate_tiles(level=1, tile_size=256, stride=256)

# Filter to keep only tiles with sufficient tissue
tissue_tiles = []
for tile in wsi.tiles:
    if tile.masks.get('tissue') is