
UMAP Learn
Implement or tune UMAP embeddings correctly in Python pipelines with accurate parameter semantics for neighbors, metrics, and low-memory modes.
Overview
UMAP Learn is an agent skill that documents the umap-learn 0.5.12 API so that agents configure neighborhood, metric, and embedding parameters correctly in Python. It is most often used in the Build phase and also fits Validate and Grow.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill umap-learn
What is this skill?
- API reference aligned to umap-learn 0.5.12 with scikit-learn>=1.6 constraints
- Core `UMAP` constructor parameters: n_neighbors, n_components, metric, init, min_dist, spread
- Tuning guidance for local vs global structure (e.g. n_neighbors 2–5 vs 10–20)
- Coverage of densmap, precomputed_knn, transform modes, and performance flags like low_memory and n_jobs
- Links to official readthedocs API for deeper upstream detail
- Documents umap-learn version 0.5.12 with Python >=3.9 and scikit-learn>=1.6
- Default n_neighbors is 15 with typical tuning range 2 to 100 noted in reference
Adoption & trust: 541 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need UMAP in a pipeline, but constructor options and version-specific defaults are easy to misconfigure, which wastes runs and produces misleading plots.
Who is it for?
Builders adding dimensionality reduction to analytics notebooks, batch jobs, or ML feature exploration with Python 3.9+.
Skip if: you work on a non-Python stack, you want a full clustering methodology course rather than API-level grounding, or your project is greenfield with no numerical data yet.
When should I use this skill?
Implementing, debugging, or tuning UMAP embeddings in Python code against umap-learn 0.5.x APIs.
What do I get? / Deliverables
Agents apply the documented 0.5.12 parameters and tuning ranges when writing fit/transform code for scikit-learn-compatible workflows.
- Correctly parameterized UMAP fit/transform code snippets
- Parameter tuning notes for local vs global structure
- Version-aligned constructor usage for 0.5.12
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Shelved under Build because the skill is an implementation reference agents use while writing analysis, feature, or visualization code—not casual brainstorming. Backend subphase covers Python ML modules, embedding jobs, and notebook-to-service scripts where umap-learn is imported and configured.
Where it fits
Prototype 2D embeddings on survey features before committing to a dashboard.
Add a batch embedding step ahead of clustering in an ETL script.
Re-tune n_neighbors when monthly user-behavior vectors grow denser.
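The re-tuning scenario above can be sketched as a small sweep over `n_neighbors`. This is a minimal illustration, not part of the skill itself: the helper names `candidate_n_neighbors` and `sweep_embeddings` and the candidate values are hypothetical, and `X` is assumed to be a NumPy array of shape `(n_samples, n_features)`. The `umap` import is deferred so the candidate filtering can be used without umap-learn installed.

```python
def candidate_n_neighbors(n_samples, candidates=(5, 15, 30, 50, 100)):
    """Keep the candidate values that make sense for n_samples rows.

    Hypothetical helper: values below 2 or above n_samples - 1 are dropped,
    since a point cannot have more neighbors than there are other rows.
    """
    upper = n_samples - 1
    return [k for k in candidates if 2 <= k <= upper]


def sweep_embeddings(X, candidates=None, random_state=42):
    """Fit one 2D embedding per n_neighbors value and return them by key.

    Assumes X is a NumPy array of shape (n_samples, n_features).
    """
    import umap  # deferred import; requires umap-learn to be installed

    if candidates is None:
        candidates = candidate_n_neighbors(X.shape[0])
    return {
        k: umap.UMAP(n_neighbors=k, random_state=random_state).fit_transform(X)
        for k in candidates
    }
```

Comparing the resulting embeddings side by side is one way to see whether a denser dataset needs a larger neighborhood; fixing `random_state` keeps the runs comparable.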
How it compares
Use as a version-pinned API companion—not a replacement for experiment design, labeling strategy, or production monitoring of embedding drift.
Common Questions / FAQ
Who is umap-learn for?
Solo builders and small teams writing Python analytics or ML code who need accurate UMAP configuration during implementation.
When should I use umap-learn?
During Validate when prototyping embeddings on sample data; during Build when coding pipelines; during Grow when refreshing analytics visualizations on live cohorts.
Is umap-learn safe to install?
See the Security Audits panel on this Prism page; the skill is reference-only but generated code may pull scientific dependencies—pin versions in your environment.
SKILL.md
# UMAP API Reference

Reference for **umap-learn 0.5.12** (Python >=3.9; `scikit-learn>=1.6`). See [official API guide](https://umap-learn.readthedocs.io/en/latest/api.html) and the 0.5.12 GitHub tag for the full upstream reference.

## UMAP Class

`umap.UMAP(n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, low_memory=True, n_jobs=-1, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, transform_mode='embedding', force_approximation_algorithm=False, verbose=False, tqdm_kwds=None, unique=False, densmap=False, dens_lambda=2.0, dens_frac=0.3, dens_var_shift=0.1, output_dens=False, disconnection_distance=None, precomputed_knn=(None, None, None))`

Find low-dimensional embedding that approximates the underlying manifold of the data.

### Core Parameters

#### n_neighbors (int, default: 15)

Size of the local neighborhood used for manifold approximation. Larger values result in more global views of the manifold, while smaller values preserve more local structure. Generally in the range 2 to 100.

**Tuning guidance:**

- Use 2-5 for very local structure
- Use 10-20 for balanced local/global structure (typical)
- Use 50-200 for emphasizing global structure

#### n_components (int, default: 2)

Dimension of the embedding space. Unlike t-SNE, UMAP scales well with increasing embedding dimensions.

**Common values:**

- 2-3: Visualization
- 5-10: Clustering preprocessing
- 10-100: Feature engineering for downstream ML

#### metric (str or callable, default: 'euclidean')

Distance metric to use.
Accepts:

- Any metric from scipy.spatial.distance
- Any metric from sklearn.metrics
- Custom callable distance functions (must be compiled with Numba)

**Common metrics:**

- `'euclidean'`: Standard Euclidean distance (default)
- `'manhattan'`: L1 distance
- `'cosine'`: Cosine distance (good for text/document vectors)
- `'correlation'`: Correlation distance
- `'hamming'`: Hamming distance (for binary data)
- `'jaccard'`: Jaccard distance (for binary/set data)
- `'dice'`: Dice distance
- `'canberra'`: Canberra distance
- `'braycurtis'`: Bray-Curtis distance
- `'chebyshev'`: Chebyshev distance
- `'minkowski'`: Minkowski distance (specify p with metric_kwds)
- `'precomputed'`: Use precomputed distance matrix

#### output_metric (str or callable, default: 'euclidean')

Distance metric for the embedding space. Most workflows should keep the Euclidean default; advanced workflows can use a supported output metric with `output_metric_kwds`.

#### min_dist (float, default: 0.1)

Effective minimum distance between embedded points. Controls how tightly points are packed together. Smaller values result in clumpier embeddings.

**Tuning guidance:**

- Use 0.0 for clustering applications
- Use 0.1-0.3 for visualization (balanced)
- Use 0.5-0.99 for loose structure preservation

#### spread (float, default: 1.0)

Effective scale of embedded points. Combined with `min_dist` to control clumped vs. spread-out embeddings. Determines how spread out the clusters are in the embedding space.

### Training Parameters

#### n_epochs (int, default: None)

Number of training epochs. If None, automatically determined based on dataset size (typically 200-500 epochs).

**Manual tuning:**

- Smaller datasets may need 500+ epochs
- Larger datasets may converge with 200 epochs
- More epochs = better optimization but slower training

#### learning_rate (float, default: 1.0)

Initial learning rate for the SGD optimizer. Higher values lead to faster convergence but may overshoot optimal solutions.
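As a rough illustration of how the parameters described so far combine, the sketch below maps a use case to constructor keyword arguments taken from the tuning guidance in this reference. The preset names, `build_umap_kwargs`, and `fit_embedding` are hypothetical helpers, not part of umap-learn. The two checks reflect the assumption that `n_neighbors` must be at least 2 and that `min_dist` should not exceed `spread`; umap-learn is expected to reject such values at fit time, and checking here simply fails earlier.

```python
# Hypothetical presets derived from the tuning guidance in this reference.
PRESETS = {
    "visualization": {"n_neighbors": 15, "n_components": 2, "min_dist": 0.1},
    "clustering": {"n_neighbors": 15, "n_components": 10, "min_dist": 0.0},
    "global_structure": {"n_neighbors": 50, "n_components": 2, "min_dist": 0.5},
}


def build_umap_kwargs(purpose, metric="euclidean", **overrides):
    """Return keyword arguments for umap.UMAP for a named use case."""
    params = {**PRESETS[purpose], "metric": metric, "spread": 1.0, **overrides}
    if params["n_neighbors"] < 2:
        raise ValueError("n_neighbors must be at least 2")
    if params["min_dist"] > params["spread"]:
        raise ValueError("min_dist must not exceed spread")
    return params


def fit_embedding(X, purpose="visualization", **overrides):
    """Fit UMAP on X with the preset for `purpose` and return the embedding."""
    import umap  # deferred import; requires umap-learn to be installed

    reducer = umap.UMAP(**build_umap_kwargs(purpose, **overrides))
    return reducer.fit_transform(X)
```

For example, `fit_embedding(X, "clustering", metric="cosine")` would request a 10-dimensional embedding with `min_dist=0.0` under cosine distance, which could then be passed to a clustering step.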
#### init (str or np.ndarray, default: 'spectral')

Initialization method for the embedding:

- `'spectral'`: Use s