Optimize For Gpu

Name: Optimize For Gpu
Author: k-dense-ai

k-dense-ai/scientific-agent-skills

906 installs
32k repo stars
Updated July 29, 2026

optimize-for-gpu is a Claude agent skill that rewrites CPU-bound scientific Python into GPU-accelerated code using NVIDIA CUDA libraries. It targets developers who need faster NumPy, pandas, ML, and simulation workloads.

About

optimize-for-gpu is a scientific computing skill that guides agents to transform CPU-bound Python (loops, large arrays, ML pipelines, graph analytics, and image processing) into GPU code using 12 NVIDIA libraries: CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS, cuSpatial, and RAFT. It covers physics simulation, particle systems, geospatial analysis, medical imaging, vector search, and sparse eigensolvers, and it activates when users mention GPU, CUDA, or NVIDIA acceleration. Developers reach for optimize-for-gpu when pandas, scikit-learn, NetworkX, GeoPandas, or Faiss code is too slow on CPU and needs production-grade GPU equivalents without rewriting algorithms from scratch.

Detects any mention of GPU, CUDA, NVIDIA or CPU-bound numerical workloads and proactively accelerates them
Converts NumPy/pandas/scikit-learn/NetworkX/SciPy code to CuPy, Numba CUDA, cuDF, cuML, cuGraph, RAFT and Warp equivalents
Covers physics simulation, differentiable rendering, particle systems, vector search, geospatial analysis, medical imaging
Handles GPUDirect Storage I/O, interactive GPU dashboards and large-scale graph analytics
Delivers typical 10x–1000x speedups on suitable parallel workloads

Optimize For Gpu by the numbers

906 all-time installs (skills.sh)
+40 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #1,164 of 16,570 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 29, 2026 (Skillselion catalog sync)

npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill optimize-for-gpu


Security audit	3 / 3 scanners passed

How do you GPU-accelerate NumPy and pandas Python code?

Automatically transform CPU-bound numerical and scientific Python into GPU-accelerated code using NVIDIA libraries for massive performance gains.

Who is it for?

Python developers with NVIDIA GPUs who need to speed up numerical computing, ML pipelines, graph analytics, or physics simulations without leaving the scientific Python ecosystem.

Skip if: Teams without NVIDIA CUDA hardware, pure CPU-only deployments, or projects that only need lightweight scripting without array-heavy compute.

When should I use this skill?

User mentions GPU, CUDA, NVIDIA acceleration, or wants to speed up NumPy, pandas, scikit-learn, NetworkX, or Faiss workloads.

What you get

CUDA-accelerated Python modules using CuPy, Numba, cuDF, cuML, Warp, and related NVIDIA library replacements for CPU code paths.

GPU-accelerated Python modules
library migration mappings

By the numbers

Covers 12 NVIDIA GPU libraries including CuPy, cuDF, cuML, Warp, and RAFT
Targets NumPy, pandas, scikit-learn, NetworkX, GeoPandas, and Faiss workload migrations

Files

SKILL.md (Markdown)

GPU Optimization for Python with NVIDIA

You are an expert GPU optimization engineer. Your job is to help users write new GPU-accelerated code or transform their existing CPU-bound Python code to run on NVIDIA GPUs for dramatic speedups — often 10x to 1000x for suitable workloads.

When This Skill Applies

User wants to speed up numerical/scientific Python code
User is working with large arrays, matrices, or dataframes
User mentions CUDA, GPU, NVIDIA, or parallel computing
User has NumPy, pandas, SciPy, scikit-learn, NetworkX, or scipy.sparse.linalg code that processes large datasets
User needs low-level GPU primitives (sparse eigensolvers, device memory management, multi-GPU communication)
User is doing machine learning (training, inference, hyperparameter tuning, preprocessing)
User is doing graph analytics (centrality, community detection, shortest paths, PageRank, etc.)
User is doing vector search, nearest neighbor search, similarity search, or building a RAG pipeline
User has Faiss, Annoy, ScaNN, or sklearn NearestNeighbors code that could be GPU-accelerated
User wants GPU-accelerated interactive dashboards, cross-filtering, or exploratory data analysis on large datasets
User is doing geospatial analysis (point-in-polygon, spatial joins, trajectory analysis, distance calculations) with GeoPandas or shapely
User is doing image processing, computer vision, or medical imaging (filtering, segmentation, morphology, feature detection) with scikit-image or OpenCV
User is working with whole-slide images (WSI), digital pathology, microscopy, or remote sensing imagery
User is loading large binary data files into GPU memory (numpy.fromfile → cupy, or Python open() → GPU array)
User needs to read files from S3, HTTP, or WebHDFS directly into GPU memory
User mentions GPUDirect Storage (GDS) or wants to bypass CPU-memory staging for file IO
User is doing physics simulation (particles, cloth, fluids, rigid bodies) or differentiable simulation
User needs mesh operations (ray casting, closest-point queries, signed distance fields) or geometry processing on GPU
User is doing robotics (kinematics, dynamics, control) with transforms and quaternions
User has Python simulation loops that could be JIT-compiled to GPU kernels
User mentions NVIDIA Warp or wants differentiable GPU simulation integrated with PyTorch/JAX
User is doing simulations, signal processing, financial modeling, bioinformatics, physics, or any compute-intensive work
User wants to optimize existing code and GPU acceleration is the right answer

Decision Framework: Which Library to Use

Choose the right tool based on what the user's code actually does. Read the appropriate reference file(s) before writing any GPU code.

CuPy — for array/matrix operations (NumPy replacement)

Read: references/cupy.md

Use CuPy when the user's code is primarily:

NumPy array operations (element-wise math, linear algebra, FFT, sorting, reductions)
SciPy operations (sparse matrices, signal processing, image filtering, special functions)
Any code that chains NumPy calls — CuPy is a drop-in replacement

CuPy wraps NVIDIA's optimized libraries (cuBLAS, cuFFT, cuSOLVER, cuSPARSE, cuRAND) so standard operations are already tuned. Most NumPy code works by changing import numpy as np to import cupy as cp.

Best for: Linear algebra, FFTs, array math, image processing, signal processing, Monte Carlo with array ops, any NumPy-heavy workflow.

Numba CUDA — for custom GPU kernels

Read: references/numba.md

Use Numba when the user needs:

Custom algorithms that don't map to standard array operations
Fine-grained control over GPU threads, blocks, and shared memory
Element-wise operations with complex logic (use @vectorize(target='cuda'))
Reduction operations with custom logic
Stencil computations or neighbor-dependent calculations
Anything requiring the CUDA programming model directly

Numba compiles Python directly into CUDA kernels. It gives full control over the GPU's thread hierarchy, shared memory, and synchronization — essential for algorithms that can't be expressed as array operations.

Best for: Custom kernels, particle simulations, stencil codes, custom reductions, algorithms needing shared memory, any code with complex per-element logic.

Warp — for simulation, spatial computing, and differentiable programming

Read: references/warp.md

Use Warp when the user's code is primarily:

Physics simulation (particles, cloth, fluids, rigid bodies, DEM, SPH)
Geometry processing (mesh operations, ray casting, signed distance fields, marching cubes)
Robotics (kinematics, dynamics, control with transforms and quaternions)
Differentiable simulation for ML training (integrates with PyTorch/JAX autograd)
Any Python simulation loop that needs to be JIT-compiled to GPU
Spatial computing with meshes, volumes (NanoVDB), hash grids, or BVH queries

Warp JIT-compiles @wp.kernel Python functions to CUDA, with built-in types for spatial computing (vec3, mat33, quat, transform) and primitives for geometry queries (Mesh, Volume, HashGrid, BVH). All kernels are automatically differentiable.

Best for: Physics simulation, mesh ray casting, particle systems, differentiable rendering, robotics kinematics, SDF operations, any workload combining spatial data structures with GPU compute.

Warp vs Numba: Both compile Python to CUDA, but Warp provides higher-level spatial types (vec3, quat, Mesh, Volume) and automatic differentiation, while Numba gives raw CUDA control (shared memory, block/thread management, atomics). Use Warp for simulation/geometry, Numba for general-purpose custom kernels.

cuDF — for dataframe operations (pandas replacement)

Read: references/cudf.md

Use cuDF when the user's code is primarily:

pandas DataFrame operations (filtering, groupby, joins, aggregations)
CSV/Parquet/JSON reading and processing
ETL pipelines or data wrangling on large datasets
Any pandas-heavy workflow on datasets that fit in GPU memory

cuDF's cudf.pandas accelerator mode can speed up existing pandas code with zero code changes. For maximum performance, use the native cuDF API.

Best for: Data wrangling, ETL, groupby/aggregations, joins, string processing on dataframes, time series on tabular data.

cuML — for machine learning (scikit-learn replacement)

Read: references/cuml.md

Use cuML when the user's code is primarily:

scikit-learn estimators (classification, regression, clustering, dimensionality reduction)
ML preprocessing (scaling, encoding, imputation, feature extraction)
Hyperparameter tuning or cross-validation
Tree model inference (XGBoost, LightGBM, sklearn Random Forest via FIL)
UMAP, t-SNE, HDBSCAN, or KNN on large datasets

cuML's cuml.accel accelerator mode can speed up existing sklearn code with zero code changes. For maximum performance, use the native cuML API. Speedups range from 2-10x for simple linear models to 60-600x for complex algorithms like HDBSCAN and KNN.

Best for: Classification, regression, clustering, dimensionality reduction, preprocessing pipelines, model inference, any scikit-learn-heavy workflow.

cuGraph — for graph analytics (NetworkX replacement)

Read: references/cugraph.md

Use cuGraph when the user's code is primarily:

NetworkX graph algorithms (centrality, community detection, shortest paths, PageRank)
Graph construction and analysis on large networks
Social network analysis, knowledge graphs, or recommendation systems
Any graph algorithm on networks with 10K+ edges

cuGraph's nx-cugraph backend can accelerate existing NetworkX code with zero code changes via an environment variable. For maximum performance, use the native cuGraph API with cuDF DataFrames. Speedups range from 10x for small graphs to 500x+ for large graphs (millions of edges).

Best for: PageRank, betweenness centrality, community detection (Louvain, Leiden), BFS/SSSP, connected components, link prediction, graph neural network sampling, any NetworkX-heavy workflow.

KvikIO — for high-performance GPU file IO

Read: references/kvikio.md

Use KvikIO when the user's code is primarily:

Loading large binary data files directly into GPU memory
Writing GPU arrays to disk without copying to host first
Reading data from remote storage (S3, HTTP, WebHDFS) into GPU memory
Working with Zarr arrays on GPU (GDSStore backend)
Any pipeline where file IO is the bottleneck between storage and GPU

KvikIO provides Python bindings to NVIDIA cuFile, enabling GPUDirect Storage (GDS) — data flows directly between NVMe storage and GPU memory, bypassing CPU memory entirely. When GDS isn't available, it falls back to POSIX IO transparently. It handles both host and device data seamlessly.

Best for: Loading binary data to GPU, saving GPU arrays to disk, reading from S3/HTTP directly to GPU, Zarr arrays on GPU, replacing numpy.fromfile() → cupy patterns, any IO-heavy GPU pipeline where data staging through CPU memory is a bottleneck.

Note: For tabular formats (CSV, Parquet, JSON), use cuDF's built-in readers instead — they're optimized for those formats. KvikIO is for raw binary data and remote file access.

cuxfilter — for GPU-accelerated interactive dashboards

Read: references/cuxfilter.md

Use cuxfilter when the user needs:

Interactive cross-filtering dashboards on large datasets (millions of rows)
Exploratory data analysis with linked charts that filter each other
GPU-accelerated visualization with scatter plots, bar charts, heatmaps, choropleths, or graph visualizations
Dashboard prototyping from Jupyter notebooks with minimal code
Visualizing results from cuDF, cuML, or cuGraph pipelines

cuxfilter leverages cuDF for all data operations on the GPU — filtering, groupby, and aggregation happen entirely on the GPU, with only rendering results sent to the browser. It integrates Bokeh, Datashader (for millions of points), Deck.gl (for maps), and Panel widgets.

Best for: Interactive data exploration dashboards, multi-chart cross-filtering, geospatial visualization, graph visualization, visualizing RAPIDS pipeline results, any scenario where the user needs to interactively explore and filter large GPU-resident datasets.

cuCIM — for image processing (scikit-image replacement)

Read: references/cucim.md

Use cuCIM when the user's code is primarily:

scikit-image operations (filtering, morphology, segmentation, feature detection, color conversion)
Image preprocessing pipelines for deep learning (resize, normalize, augment)
Digital pathology (whole-slide image reading, H&E stain normalization, cell counting)
Microscopy, remote sensing, or medical imaging workflows
Any scikit-image-heavy pipeline processing images at 512x512 or larger

cuCIM's cucim.skimage module mirrors scikit-image's API with 200+ GPU-accelerated functions. It also provides a high-performance WSI reader (CuImage) that is 5-6x faster than OpenSlide. All functions work on CuPy arrays — zero-copy, all on GPU.

Best for: Filtering (Gaussian, Sobel, Frangi), morphology, thresholding, connected component labeling, region properties, color space conversion, image registration, denoising, whole-slide image processing, DL preprocessing pipelines.

cuVS — for vector search (Faiss/Annoy replacement)

Read: references/cuvs.md

Use cuVS when the user's code is primarily:

Approximate nearest neighbor (ANN) search on high-dimensional vectors
Similarity search for RAG, recommender systems, or semantic retrieval
k-NN graph construction for clustering or visualization
Any Faiss, Annoy, ScaNN, or sklearn NearestNeighbors workload on large embedding datasets

cuVS provides GPU-accelerated ANN index types (CAGRA, IVF-Flat, IVF-PQ, brute force) plus HNSW for CPU serving from GPU-built indexes. It powers the GPU backends of Faiss, Milvus, and Lucene. Start with CAGRA for most use cases — it's the fastest GPU-native algorithm.

Best for: Embedding search, RAG retrieval, recommender systems, image/text/audio similarity search, k-NN graph construction, any nearest-neighbor workload on 10K+ vectors.

cuSpatial — for geospatial analytics (GeoPandas replacement)

Read: references/cuspatial.md

Use cuSpatial when the user's code is primarily:

GeoPandas spatial operations (point-in-polygon, spatial joins, distance calculations)
Trajectory analysis (grouping GPS traces, computing speeds/distances)
Spatial indexing (quadtree) for large-scale spatial joins
Haversine distance calculations on lat/lon coordinates
Any GeoPandas/shapely-heavy workflow on large geospatial datasets

cuSpatial provides GPU-accelerated GeoSeries and GeoDataFrame types compatible with GeoPandas, plus spatial join, distance, and trajectory functions. Convert from GeoPandas with cuspatial.from_geopandas().

Best for: Point-in-polygon tests, spatial joins on millions of points/polygons, haversine and Euclidean distance calculations, trajectory reconstruction and analysis, any GeoPandas-heavy geospatial workflow.

RAFT (pylibraft) — for low-level GPU primitives and multi-GPU

Read: references/raft.md

Use RAFT when the user needs:

GPU-accelerated sparse eigenvalue problems (scipy.sparse.linalg.eigsh replacement)
Low-level GPU device memory management (device_ndarray)
Random graph generation (R-MAT model for benchmarking)
Multi-node multi-GPU communication infrastructure (via raft-dask)
Building blocks that underlie higher-level RAPIDS libraries

RAFT provides the foundational primitives that cuML and cuGraph are built on. Most users should reach for those higher-level libraries first — use RAFT directly when you need the specific primitives it exposes (sparse eigensolvers, device memory, graph generation) or multi-GPU communication via Dask.

Best for: Sparse eigenvalue decomposition (spectral methods, graph partitioning), R-MAT graph generation, low-level device memory management, multi-GPU orchestration.

Note: Vector search algorithms (k-NN, IVFPQ, CAGRA) have migrated to cuVS — do not use RAFT for vector search.

Combining Libraries

Many real workloads benefit from using multiple libraries together. They interoperate via the CUDA Array Interface — zero-copy data sharing between CuPy, Numba, Warp, cuDF, cuML, cuGraph, cuVS, cuCIM, cuSpatial, KvikIO, PyTorch, JAX, and other GPU libraries.

Common combinations:

cuDF + cuML: Load and preprocess data with cuDF, train/predict with cuML — the full RAPIDS pipeline
cuDF + cuGraph: Build graphs from cuDF edge lists, run graph analytics with cuGraph
cuGraph + cuML: Extract graph features with cuGraph, feed into cuML for ML
cuML + cuVS: Train an embedding model with cuML, index and search embeddings with cuVS
cuDF + CuPy: Load and filter data with cuDF, then do numerical analysis with CuPy
CuPy + cuVS: Generate embeddings with CuPy operations, build a cuVS search index — zero-copy
Warp + PyTorch: Differentiable simulation in Warp, backpropagate gradients into PyTorch training loop
Warp + CuPy: Use CuPy for array math, Warp for spatial queries (mesh, volume) — zero-copy via CUDA Array Interface
Warp + JAX: Warp kernels as JAX primitives inside jitted functions
CuPy + Numba: Use CuPy for standard ops, drop into Numba for custom kernels
cuDF + Numba: Process dataframes with cuDF, apply custom GPU functions via Numba UDFs
cuML + CuPy: Train with cuML, do custom post-processing with CuPy
cuDF + cuxfilter: Load data with cuDF, build interactive cross-filtering dashboards with cuxfilter
cuML + cuxfilter: Run ML (e.g., UMAP, clustering) with cuML, visualize results interactively with cuxfilter
cuGraph + cuxfilter: Run graph analytics with cuGraph, visualize graph structure with cuxfilter's datashader graph chart
cuCIM + CuPy: cuCIM operates on CuPy arrays natively — chain image processing with array math
cuCIM + PyTorch: Preprocess images with cuCIM, pass directly to PyTorch via DLPack — zero-copy
cuCIM + cuML: Extract image features with cuCIM (regionprops), train classifiers with cuML
KvikIO + CuPy: Load raw binary data directly into CuPy arrays via GDS, bypassing CPU memory
KvikIO + Numba: Read data directly to GPU with KvikIO, process with custom Numba CUDA kernels
KvikIO + Zarr: Use GDSStore backend to read/write chunked N-dimensional arrays directly on GPU
cuSpatial + cuDF: Load geospatial data with cuDF, do spatial joins/analysis with cuSpatial
cuSpatial + cuML: Extract spatial features with cuSpatial, train ML models with cuML
RAFT + CuPy: Use RAFT's eigsh() on sparse matrices built with CuPy/cupyx.scipy.sparse
RAFT + raft-dask: Scale GPU workloads across multiple GPUs/nodes via Dask

Installation

IMPORTANT: Always use uv add for package installation — never pip install or conda install. This applies to install instructions in code comments, docstrings, error messages, and any other output you generate. If the user's project uses a different package manager, follow their lead, but default to uv add.

# CuPy (choose the right CUDA version)
uv add cupy-cuda12x          # For CUDA 12.x (most common)

# Numba with CUDA support
uv add numba numba-cuda      # numba-cuda is the actively maintained NVIDIA package

# Warp (simulation, spatial computing, differentiable programming)
uv add warp-lang              # CUDA 12 runtime included

# cuDF (RAPIDS)
uv add --extra-index-url=https://pypi.nvidia.com cudf-cu12  # For CUDA 12.x
# For cudf.pandas accelerator mode, that's all you need
# Load it with: python -m cudf.pandas your_script.py

# cuML (RAPIDS machine learning)
uv add --extra-index-url=https://pypi.nvidia.com cuml-cu12   # For CUDA 12.x
# For cuml.accel accelerator mode (zero-change sklearn acceleration):
# Load it with: python -m cuml.accel your_script.py

# cuGraph (RAPIDS graph analytics)
uv add --extra-index-url=https://pypi.nvidia.com cugraph-cu12    # Core cuGraph
uv add --extra-index-url=https://pypi.nvidia.com nx-cugraph-cu12 # NetworkX backend
# For nx-cugraph zero-change NetworkX acceleration:
# NX_CUGRAPH_AUTOCONFIG=True python your_script.py

# KvikIO (high-performance GPU file IO)
uv add kvikio-cu12               # For CUDA 12.x
# Optional: uv add zarr          # For Zarr GPU backend support

# cuxfilter (GPU-accelerated interactive dashboards)
uv add --extra-index-url=https://pypi.nvidia.com cuxfilter-cu12   # For CUDA 12.x
# Depends on cuDF — installs it automatically

# cuCIM (RAPIDS image processing — scikit-image on GPU)
uv add --extra-index-url=https://pypi.nvidia.com cucim-cu12    # For CUDA 12.x

# cuVS (RAPIDS vector search)
uv add --extra-index-url=https://pypi.nvidia.com cuvs-cu12   # For CUDA 12.x

# cuSpatial (RAPIDS geospatial)
uv add --extra-index-url=https://pypi.nvidia.com cuspatial-cu12   # For CUDA 12.x

# RAFT (low-level GPU primitives)
uv add --extra-index-url=https://pypi.nvidia.com pylibraft-cu12   # Core primitives
uv add --extra-index-url=https://pypi.nvidia.com raft-dask-cu12   # Multi-GPU support (optional)
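
Since install hints in error messages should also default to uv add, a missing-dependency check could look like the minimal sketch below. The helper name `require_module` and the message wording are hypothetical and not part of any library listed here; the module name in the demo is deliberately fake so that the error path runs.

```python
import importlib.util


def require_module(module_name: str, install_hint: str) -> None:
    """Raise a clear ImportError with a uv-based install hint if a module is missing.

    Hypothetical helper for illustration; adapt the wording to your project.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. Install it with: uv add {install_hint}"
        )


# Demo with a made-up module name so the failure branch is exercised.
try:
    require_module("not_a_real_gpu_lib", "not-a-real-gpu-lib")
    message = ""
except ImportError as exc:
    message = str(exc)

print(message)
```

For RAPIDS packages the hint string would carry the extra index flag as well, for example `--extra-index-url=https://pypi.nvidia.com cudf-cu12`.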

To check CUDA availability after installation:

# CuPy
import cupy as cp
print(cp.cuda.runtime.getDeviceCount())  # Should be >= 1

# Numba
from numba import cuda
print(cuda.is_available())               # Should be True
cuda.detect()                            # Prints GPU details itself; returns True if a supported GPU is found

# cuDF
import cudf
print(cudf.Series([1, 2, 3]))           # Should print a GPU series

# cuML
import cuml
print(cuml.__version__)                  # Should print version

# cuGraph
import cugraph
print(cugraph.__version__)               # Should print version

# Warp
import warp as wp
wp.init()                                # Should print device info

# KvikIO
import kvikio
import kvikio.cufile_driver
print(kvikio.cufile_driver.get("is_gds_available"))  # True if GDS is set up

# cuxfilter
import cuxfilter
print(cuxfilter.__version__)             # Should print version

# cuVS
from cuvs.neighbors import cagra
import cupy as cp
dataset = cp.random.rand(1000, 128, dtype=cp.float32)
index = cagra.build(cagra.IndexParams(), dataset)
print("cuVS working")                    # Should print confirmation

# cuSpatial
import cuspatial
from shapely.geometry import Point
gs = cuspatial.GeoSeries([Point(0, 0)])
print("cuSpatial working")              # Should print confirmation

# RAFT (pylibraft)
from pylibraft.common import DeviceResources
handle = DeviceResources()
handle.sync()
print("pylibraft is working")

Optimization Workflow

When helping a user optimize code, follow this process:

1. Profile First

Before optimizing, understand where time is actually spent:

import time
# or use cProfile, line_profiler, or py-spy for detailed profiling

Don't guess — measure. The bottleneck might not be where the user thinks.
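
As a rough illustration of measuring before optimizing, the sketch below times a function with `time.perf_counter` and then uses `cProfile` to see where the time goes. The workload `slow_transform` and the helper `time_call` are hypothetical stand-ins, not code from the user's project.

```python
import cProfile
import io
import math
import pstats
import time


def slow_transform(values):
    """Stand-in for a CPU-bound Python loop (hypothetical workload)."""
    return [math.sin(v) * math.exp(-v) for v in values]


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


data = [i * 1e-4 for i in range(100_000)]
elapsed = time_call(slow_transform, data)
print(f"best of 3: {elapsed * 1e3:.2f} ms")

# cProfile attributes time to individual functions; sort by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
slow_transform(data)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Taking the best of several runs reduces noise from other processes; the profile then shows whether the hot spot is really the numerical loop or something else, such as I/O.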

2. Assess GPU Suitability

Not all code benefits from GPU acceleration. GPU excels when:

Data parallelism is high: The same operation applies to thousands/millions of elements
Compute intensity is high: Many FLOPs per byte of memory accessed
Data is large enough: GPU overhead means small arrays (< ~10K elements) may be slower on GPU
Memory fits: Data must fit in GPU memory (typically 8-80 GB)

GPU is a poor fit when:

Data is tiny (< 10K elements)
Algorithm is inherently sequential with data dependencies between steps
Code is I/O bound (disk, network), not compute bound — though KvikIO with GPUDirect Storage can help when IO feeds GPU compute
Many small, heterogeneous operations (kernel launch overhead dominates)
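
The checklist above can be turned into a quick pre-flight check. This is only a sketch: `gpu_concerns` is a hypothetical helper, the 10K-element threshold is the rule of thumb quoted above, and the memory test ignores temporaries, which in practice need extra headroom.

```python
def gpu_concerns(n_elements, itemsize, gpu_memory_bytes,
                 sequential=False, io_bound=False, min_elements=10_000):
    """Return reasons a workload may be a poor GPU fit (empty list = no red flags)."""
    concerns = []
    if n_elements < min_elements:
        concerns.append("tiny data: launch and transfer overhead may dominate")
    if n_elements * itemsize > gpu_memory_bytes:
        concerns.append("data does not fit in GPU memory")
    if sequential:
        concerns.append("sequential dependencies between steps")
    if io_bound:
        concerns.append("I/O bound rather than compute bound")
    return concerns


GIB = 1024 ** 3

# 50M float32 values (200 MB) on a 16 GiB card: no red flags.
large_ok = gpu_concerns(50_000_000, 4, 16 * GIB)
# 5,000 float64 values: too small to amortize GPU overhead.
tiny = gpu_concerns(5_000, 8, 16 * GIB)
# 5 billion float64 values (40 GB) with step-to-step dependencies.
too_big = gpu_concerns(5_000_000_000, 8, 16 * GIB, sequential=True)

print(large_ok)
print(tiny)
print(too_big)
```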

3. Start Simple, Then Optimize

1. Try the drop-in replacement first. CuPy for NumPy, cudf.pandas for pandas, cuml.accel for sklearn, nx-cugraph for NetworkX. This alone often gives 5-50x speedup.
2. Minimize host-device transfers. Keep data on GPU. Every transfer across PCIe is expensive (~12 GB/s) compared with GPU memory bandwidth (~900 GB/s+).
3. Batch operations. Fewer large GPU operations beat many small ones.
4. Only write custom kernels if needed. CuPy and cuDF use NVIDIA's hand-tuned libraries. Reserve custom Numba kernels for operations that don't have library equivalents.
5. Profile the GPU version. Use nsys (Nsight Systems), ncu (Nsight Compute), or CuPy's built-in benchmarking (cupyx.profiler.benchmark). The older nvprof is deprecated and does not support recent GPU architectures.
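
To see why host-device transfers deserve attention, the back-of-the-envelope calculation below uses the rough bandwidth figures quoted above (~12 GB/s over PCIe, ~900 GB/s for GPU memory). Both numbers are approximations that vary by hardware, and the array size is an arbitrary example.

```python
PCIE_GB_PER_S = 12.0      # rough host<->device bandwidth from the text
GPU_MEM_GB_PER_S = 900.0  # rough on-device memory bandwidth from the text


def move_time_ms(n_bytes, gb_per_s):
    """Time in milliseconds to move n_bytes at the given bandwidth."""
    return n_bytes / (gb_per_s * 1e9) * 1e3


array_bytes = 100_000_000 * 4  # 100M float32 values = 400 MB

pcie_ms = move_time_ms(array_bytes, PCIE_GB_PER_S)
gpu_ms = move_time_ms(array_bytes, GPU_MEM_GB_PER_S)
ratio = pcie_ms / gpu_ms

print(f"PCIe transfer:     {pcie_ms:.2f} ms")
print(f"On-device pass:    {gpu_ms:.3f} ms")
print(f"Transfer is about {ratio:.0f}x slower than one pass over the data on the GPU")
```

Under these assumptions a single round trip to the host costs as much as dozens of passes over the same data on the device, which is why intermediate results are best kept on the GPU.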

4. Memory Management Principles

These apply across all libraries:

Pre-allocate output arrays instead of creating new ones in loops
Reuse GPU memory — use memory pools (CuPy has this built-in)
Use pinned (page-locked) host memory for faster CPU-GPU transfers
Avoid unnecessary copies — use in-place operations where possible
Stream operations for overlapping compute and data transfer
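
The pre-allocation and in-place principles can be sketched as follows. The example assumes CuPy when a GPU is usable and falls back to NumPy otherwise (both accept the `out=` argument on ufuncs); `mean_of_squares` is a hypothetical function for illustration.

```python
try:
    import cupy as xp
    if xp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device")
    ON_GPU = True
except Exception:  # CuPy missing or no usable GPU: fall back to NumPy
    import numpy as xp
    ON_GPU = False


def mean_of_squares(batches):
    """Accumulate the element-wise mean of squares without per-iteration allocations."""
    scratch = xp.empty_like(batches[0])  # allocated once, reused every iteration
    acc = xp.zeros_like(batches[0])
    for batch in batches:
        xp.multiply(batch, batch, out=scratch)  # writes into scratch, no new array
        xp.add(acc, scratch, out=acc)           # in-place accumulation
    acc /= len(batches)
    return acc


batches = [
    xp.full(4, 2.0, dtype=xp.float32),
    xp.full(4, 4.0, dtype=xp.float32),
]
result = mean_of_squares(batches)
print(result)

if ON_GPU:
    # CuPy's default memory pool caches freed blocks for reuse.
    pool = xp.get_default_memory_pool()
    print("pool bytes in use:", pool.used_bytes())
```

Writing `batch * batch` inside the loop would allocate a fresh temporary on every iteration; the `out=` form reuses the same buffer instead.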

5. Common Pitfalls to Watch For

Implicit CPU fallback: Some operations silently fall back to CPU. Watch for warnings.
Synchronization overhead: GPU operations are asynchronous. Calling .get() or cp.asnumpy() forces a sync.
dtype mismatches: Use float32 instead of float64 when precision allows — GPU float32 throughput is 2x-32x higher.
Small kernel launches: Each kernel launch has ~5-20us overhead. Fuse operations when possible.
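
The launch-overhead pitfall is easy to quantify. The sketch below uses the ~5-20 microsecond range quoted above; the launch counts are arbitrary examples, and `launch_overhead_ms` is a hypothetical helper.

```python
def launch_overhead_ms(n_launches, overhead_us=10.0):
    """Total fixed launch overhead in milliseconds, excluding the actual compute."""
    return n_launches * overhead_us / 1e3


many_small = launch_overhead_ms(10_000)                     # 10,000 tiny kernels
best_case = launch_overhead_ms(10_000, overhead_us=5.0)     # low end of the range
worst_case = launch_overhead_ms(10_000, overhead_us=20.0)   # high end of the range
fused = launch_overhead_ms(1)                               # one fused kernel

print(f"10,000 launches: {many_small:.0f} ms (range {best_case:.0f}-{worst_case:.0f} ms)")
print(f"1 fused launch:  {fused:.2f} ms")
```

With these figures, ten thousand small launches spend on the order of 100 ms on overhead alone, regardless of how fast each kernel runs.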

Code Transformation Patterns

When converting existing CPU code, apply these patterns:

NumPy to CuPy

# Before (CPU)
import numpy as np
a = np.random.rand(10_000_000)
b = np.fft.fft(a)
c = np.sort(b.real)

# After (GPU) — often just change the import
import cupy as cp
a = cp.random.rand(10_000_000)
b = cp.fft.fft(a)
c = cp.sort(b.real)

pandas to cuDF

# Before (CPU)
import pandas as pd
df = pd.read_parquet("large_data.parquet")
result = df.groupby("category")["value"].mean()

# After (GPU) — change the import
import cudf
df = cudf.read_parquet("large_data.parquet")
result = df.groupby("category")["value"].mean()

# Or zero-code-change: python -m cudf.pandas your_script.py

Custom loop to Numba CUDA kernel

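# Before (CPU): slow Python loop
import math

def process(data, out):
    for i in range(len(data)):
        out[i] = math.sin(data[i]) * math.exp(-data[i])

# After (GPU): Numba kernel
from numba import cuda
import math

@cuda.jit
def process(data, out):
    i = cuda.grid(1)          # global thread index
    if i < data.size:         # guard threads that fall past the end of the array
        out[i] = math.sin(data[i]) * math.exp(-data[i])

d_data = cuda.to_device(data)            # data: host NumPy array, copied to the GPU once
d_out = cuda.device_array_like(d_data)   # output allocated directly on the GPU

threads = 256
blocks = (d_data.size + threads - 1) // threads   # ceiling division
process[blocks, threads](d_data, d_out)
out = d_out.copy_to_host()               # copy back only when the result is needed on the CPU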
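
The block-count formula in the launch above is a ceiling division, and it usually launches slightly more threads than there are elements, which is why the kernel needs its bounds check. A small pure-Python sketch (the helper `launch_config` is hypothetical) makes the arithmetic concrete:

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, threads_per_block) covering n_elements with a 1D grid."""
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    blocks = (n_elements + threads_per_block - 1) // threads_per_block  # ceil(n / tpb)
    return blocks, threads_per_block


n = 1_000_000
blocks, threads = launch_config(n)
total_threads = blocks * threads
surplus = total_threads - n  # threads with no element; the kernel's bounds check skips them

print(f"{blocks} blocks x {threads} threads = {total_threads} threads, {surplus} surplus")
```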

NetworkX to cuGraph

# Before (CPU)
import networkx as nx
G = nx.read_edgelist("edges.csv", delimiter=",", nodetype=int)
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)

# After (GPU) — direct cuGraph API
import cugraph
import cudf
edges = cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int32", "int32"])
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
pr = cugraph.pagerank(G)
bc = cugraph.betweenness_centrality(G)

# Or zero-code-change: NX_CUGRAPH_AUTOCONFIG=True python your_script.py

scikit-learn to cuML

# Before (CPU)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# After (GPU) — change the imports
from cuml.ensemble import RandomForestClassifier
from cuml.preprocessing import StandardScaler
from cuml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Or zero-code-change: python -m cuml.accel your_script.py

Simulation loop to Warp kernel

# Before (CPU) — slow Python loop over particles
import numpy as np

def integrate(positions, velocities, forces, dt):
    for i in range(len(positions)):
        velocities[i] += forces[i] * dt
        positions[i] += velocities[i] * dt

# After (GPU) — Warp kernel, JIT-compiled to CUDA
import warp as wp

@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3),
              velocities: wp.array(dtype=wp.vec3),
              forces: wp.array(dtype=wp.vec3),
              dt: float):
    tid = wp.tid()
    velocities[tid] = velocities[tid] + forces[tid] * dt
    positions[tid] = positions[tid] + velocities[tid] * dt

wp.launch(integrate, dim=num_particles,
          inputs=[positions, velocities, forces, 0.01], device="cuda")

File IO to GPU with KvikIO

# Before — CPU staging (disk → CPU → GPU)
import numpy as np
import cupy as cp

data = np.fromfile("data.bin", dtype=np.float32)
gpu_data = cp.asarray(data)  # Extra copy through CPU memory

# After — direct to GPU (disk → GPU via GDS)
import cupy as cp
import kvikio

gpu_data = cp.empty(1_000_000, dtype=cp.float32)
with kvikio.CuFile("data.bin", "r") as f:
    f.read(gpu_data)  # Bypasses CPU memory with GPUDirect Storage

# Reading from S3 directly to GPU
with kvikio.RemoteFile.open_s3_url("s3://bucket/data.bin") as f:
    buf = cp.empty(f.nbytes() // 4, dtype=cp.float32)
    f.read(buf)

GPU-accelerated dashboard with cuxfilter

# Before — static matplotlib/seaborn plots, no interactivity
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet("large_dataset.parquet")
fig, axes = plt.subplots(1, 2)
df.plot.scatter(x="feature1", y="feature2", ax=axes[0])
df["category"].value_counts().plot.bar(ax=axes[1])
plt.show()

# After (GPU) — interactive cross-filtering dashboard
import cudf
import cuxfilter

df = cudf.read_parquet("large_dataset.parquet")
cux_df = cuxfilter.DataFrame.from_dataframe(df)

scatter = cuxfilter.charts.scatter(x="feature1", y="feature2", pixel_shade_type="linear")
bar = cuxfilter.charts.bar("category")
slider = cuxfilter.charts.range_slider("value_col")

d = cux_df.dashboard(
    [scatter, bar],
    sidebar=[slider],
    layout=cuxfilter.layouts.feature_and_base,
    theme=cuxfilter.themes.rapids_dark,
    title="Interactive Explorer",
)
d.app()  # or d.show() for standalone web app

scikit-image to cuCIM

# Before (CPU)
from skimage.filters import gaussian, sobel, threshold_otsu
from skimage.morphology import binary_opening, disk
from skimage.measure import label, regionprops_table
import numpy as np

blurred = gaussian(image, sigma=3)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image, properties=['area', 'centroid'])

# After (GPU) — change imports, wrap input with cp.asarray
from cucim.skimage.filters import gaussian, sobel, threshold_otsu
from cucim.skimage.morphology import binary_opening, disk
from cucim.skimage.measure import label, regionprops_table
import cupy as cp

image_gpu = cp.asarray(image)  # Transfer once
blurred = gaussian(image_gpu, sigma=3)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image_gpu, properties=['area', 'centroid'])

GeoPandas to cuSpatial

# Before (CPU)
import geopandas as gpd
from shapely.geometry import Point

points = gpd.GeoDataFrame(geometry=[Point(x, y) for x, y in coords], crs="EPSG:4326")
polygons = gpd.read_file("regions.geojson")
joined = gpd.sjoin(points, polygons, predicate="within")

# After (GPU) — convert and use cuSpatial
import cuspatial
import cudf

points_cu = cuspatial.from_geopandas(points)
polygons_cu = cuspatial.from_geopandas(polygons)
# Takes two GeoSeries and returns a boolean DataFrame:
# one row per point, one column per polygon
joined = cuspatial.point_in_polygon(
    points_cu["geometry"], polygons_cu["geometry"]
)

Faiss/Annoy to cuVS

# Before (CPU) — Faiss
import faiss
import numpy as np

embeddings = np.random.rand(1_000_000, 128).astype(np.float32)
index = faiss.IndexFlatL2(128)
index.add(embeddings)
distances, neighbors = index.search(queries, k=10)

# After (GPU): cuVS CAGRA, an approximate graph-based index.
# IndexFlatL2 above is exact, so compare recall as well as speed.
import cupy as cp
from cuvs.neighbors import cagra

embeddings = cp.random.rand(1_000_000, 128, dtype=cp.float32)
index = cagra.build(cagra.IndexParams(), embeddings)
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=10)

scipy.sparse.linalg to RAFT

# Before (CPU)
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

A = sparse_random(10000, 10000, density=0.01, format="csr", dtype=np.float32)
A = A + A.T  # Make symmetric
eigenvalues, eigenvectors = eigsh(A, k=10, which="LM")

# After (GPU) — RAFT sparse eigensolver
import cupy as cp
import cupyx.scipy.sparse as sp_gpu
from pylibraft.sparse.linalg import eigsh as gpu_eigsh

A_gpu = sp_gpu.csr_matrix(A)  # Transfer to GPU
eigenvalues, eigenvectors = gpu_eigsh(A_gpu, k=10, which="LM")

Important Notes

Always handle the case where no GPU is available — provide a CPU fallback or clear error message
Test numerical correctness against CPU results (GPU floating point may differ slightly due to operation ordering)
GPU memory is limited — for datasets larger than GPU memory, consider chunking or using RAPIDS Dask for multi-GPU
The CUDA Array Interface enables zero-copy sharing between CuPy, Numba, Warp, cuDF, cuML, cuGraph, cuVS, cuSpatial, KvikIO, PyTorch, and JAX arrays on GPU
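
The CPU-fallback note above can be sketched as a small backend selector. This is a minimal illustration, not part of the skill: the helper name select_array_module is hypothetical, and it assumes CuPy and NumPy are the two candidate backends. It probes the GPU with cupy.cuda.runtime.getDeviceCount(), which raises when no driver or device is usable.

```python
import importlib
import importlib.util


def select_array_module(candidates=("cupy", "numpy")):
    """Return the first importable, usable array module from `candidates`."""
    for name in candidates:
        if importlib.util.find_spec(name) is None:
            continue  # Package is not installed
        try:
            module = importlib.import_module(name)
            if name == "cupy":
                # Raises if the driver is missing or no device is visible
                if module.cuda.runtime.getDeviceCount() < 1:
                    continue
        except Exception:
            continue  # Installed but unusable on this machine
        return module
    raise RuntimeError(
        f"No usable array backend found among {candidates}; "
        "install one with `uv add <package>`."
    )
```

Calling xp = select_array_module() once at start-up and writing the rest of the code against xp keeps the GPU path and the CPU fallback in one place. If neither backend is available, the error message says so.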

Reference Files

Before writing any GPU optimization code, read the relevant reference file(s):

File	When to Read
`references/cupy.md`	User has NumPy/SciPy code, or needs array operations on GPU
`references/numba.md`	User needs custom CUDA kernels, fine-grained GPU control, or GPU ufuncs
`references/cudf.md`	User has pandas code, or needs dataframe operations on GPU
`references/cuml.md`	User has scikit-learn code, or needs ML training/inference/preprocessing on GPU
`references/cugraph.md`	User has NetworkX code, or needs graph analytics on GPU
`references/warp.md`	User needs GPU simulation, spatial computing, mesh/volume queries, differentiable programming, or robotics
`references/kvikio.md`	User needs high-performance file IO to/from GPU, GPUDirect Storage, reading S3/HTTP to GPU, or Zarr on GPU
`references/cuxfilter.md`	User wants GPU-accelerated interactive dashboards, cross-filtering, or EDA visualization
`references/cucim.md`	User has scikit-image code, or needs image processing, digital pathology, or WSI reading on GPU
`references/cuvs.md`	User needs vector search, nearest neighbors, similarity search, or RAG retrieval on GPU
`references/cuspatial.md`	User has GeoPandas/shapely code, or needs spatial joins, distance calculations, or trajectory analysis on GPU
`references/raft.md`	User needs sparse eigensolvers, device memory management, or multi-GPU primitives

Read the specific reference before writing code — they contain detailed API patterns, optimization techniques, and pitfalls specific to each library.

cuCIM Reference

cuCIM (CUDA Clara IMage) is NVIDIA's GPU-accelerated computer vision and image processing library within the RAPIDS ecosystem. Its cucim.skimage module is a near-drop-in GPU replacement for scikit-image, with 200+ GPU-accelerated functions. It also provides a high-performance whole-slide image (WSI) reader via cucim.clara.CuImage that is 5-6x faster than OpenSlide.

Full documentation: https://docs.rapids.ai/api/cucim/stable/

GitHub: https://github.com/rapidsai/cucim

1. Installation and Setup
2. Core Concept: CuPy Arrays
3. cucim.skimage: GPU scikit-image
4. Color Operations
5. Exposure and Histogram
6. Feature Detection
7. Filters
8. Measure and Region Properties
9. Morphology
10. Segmentation
11. Registration
12. Restoration
13. Transform
14. Metrics
15. Utility Functions
16. cucim.core.operations: NVIDIA-Specific
17. Whole-Slide Image Reading (cucim.clara)
18. Performance Characteristics
19. Interoperability
20. Known Limitations vs scikit-image
21. Common Migration Patterns

---

Installation and Setup

Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.

uv add --extra-index-url=https://pypi.nvidia.com cucim-cu12    # For CUDA 12.x

Platform: Linux only (x86-64 and aarch64) — no Windows or macOS GPU support. Requires: NVIDIA GPU with CUDA 12.x, Python 3.9+, CuPy, NumPy, SciPy, scikit-image.

Verify:

import cucim
print(cucim.__version__)

import cupy as cp
from cucim.skimage.filters import gaussian
img = cp.random.rand(512, 512).astype(cp.float32)
result = gaussian(img, sigma=3)
print(f"Filtered image shape: {result.shape}")  # Should work on GPU

---

Core Concept: CuPy Arrays

cuCIM operates natively on CuPy arrays. All cucim.skimage functions accept CuPy arrays as input and return CuPy arrays as output — zero-copy, all on GPU.

import cupy as cp
import numpy as np
from cucim.skimage.filters import gaussian

# Transfer image to GPU once
image_gpu = cp.asarray(numpy_image)

# All processing stays on GPU — zero-copy between cuCIM calls
blurred = gaussian(image_gpu, sigma=3)
# ... more processing on GPU ...

# Transfer back to CPU only when needed (for display, save, etc.)
result_cpu = cp.asnumpy(blurred)

Best practice: Move data to GPU once at the start, chain all cuCIM operations on GPU, then transfer back to CPU only at the end.

---

cucim.skimage

The cucim.skimage module mirrors scikit-image's module structure. In most cases, replace from skimage with from cucim.skimage and pass CuPy arrays instead of NumPy arrays.

# Before (CPU — scikit-image)
from skimage.filters import gaussian
import numpy as np
result = gaussian(numpy_image, sigma=3)

# After (GPU — cuCIM)
from cucim.skimage.filters import gaussian
import cupy as cp
result = gaussian(cp.asarray(numpy_image), sigma=3)

---

Color Operations

cucim.skimage.color — 42 GPU-accelerated color space conversion functions.

from cucim.skimage.color import rgb2gray, rgb2hsv, rgb2lab, label2rgb
from cucim.skimage.color import separate_stains, combine_stains, hed_from_rgb

# Color space conversions
gray = rgb2gray(rgb_image_gpu)
hsv = rgb2hsv(rgb_image_gpu)
lab = rgb2lab(rgb_image_gpu)

# Stain separation (for H&E histology) using the built-in RGB-to-HED matrix
stains = separate_stains(rgb_image_gpu, hed_from_rgb)

Available conversions: rgb2gray, rgb2hsv, hsv2rgb, rgb2lab, lab2rgb, rgb2xyz, xyz2rgb, rgb2luv, luv2rgb, rgb2ycbcr, ycbcr2rgb, rgb2yuv, yuv2rgb, rgb2yiq, yiq2rgb, rgb2hed, hed2rgb, rgb2rgbcie, rgbcie2rgb, gray2rgb, gray2rgba, rgba2rgb, convert_colorspace, label2rgb

Color difference: deltaE_cie76, deltaE_ciede94, deltaE_ciede2000, deltaE_cmc

---

Exposure and Histogram

cucim.skimage.exposure — histogram equalization, contrast adjustment.

from cucim.skimage.exposure import (
    equalize_hist, equalize_adapthist,
    rescale_intensity, adjust_gamma, adjust_log, adjust_sigmoid,
    histogram, match_histograms, is_low_contrast
)

# CLAHE (Contrast Limited Adaptive Histogram Equalization)
enhanced = equalize_adapthist(image_gpu, clip_limit=0.03)

# Gamma correction
brightened = adjust_gamma(image_gpu, gamma=0.5)

# Rescale intensity to [0, 1]
normalized = rescale_intensity(image_gpu)

# Histogram matching between two images
matched = match_histograms(source_gpu, reference_gpu)

---

Feature Detection

cucim.skimage.feature — edge, corner, and blob detection.

from cucim.skimage.feature import (
    canny, corner_harris, corner_peaks,
    blob_dog, blob_doh, blob_log,
    structure_tensor, hessian_matrix, hessian_matrix_det,
    match_template, peak_local_max, daisy, multiscale_basic_features
)

# Canny edge detection
edges = canny(gray_image_gpu, sigma=2.0)

# Harris corner detection
corners = corner_harris(gray_image_gpu)
corner_coords = corner_peaks(corners, min_distance=5)

# Blob detection (Difference of Gaussian)
blobs = blob_dog(gray_image_gpu, max_sigma=30, threshold=0.1)

# Template matching
result = match_template(image_gpu, template_gpu)

---

Filters

cucim.skimage.filters — 47 GPU-accelerated filter functions. This is one of the most commonly used modules.

from cucim.skimage.filters import (
    gaussian, median, sobel, laplace, unsharp_mask,
    frangi, hessian, meijering, sato,
    threshold_otsu, threshold_multiotsu, threshold_sauvola,
    gabor, difference_of_gaussians, butterworth
)

# Gaussian blur
blurred = gaussian(image_gpu, sigma=3)

# Sobel edge detection
edges = sobel(gray_image_gpu)

# Unsharp mask (sharpening)
sharpened = unsharp_mask(image_gpu, radius=5, amount=2.0)

# Vessel/ridge detection (for medical imaging)
vessels = frangi(gray_image_gpu, sigmas=range(1, 10))

# Otsu thresholding
threshold = threshold_otsu(gray_image_gpu)
binary = gray_image_gpu > threshold

# Multi-level Otsu
thresholds = threshold_multiotsu(gray_image_gpu, classes=3)

Edge detection: sobel, scharr, prewitt, roberts, farid, laplace (plus _h/_v variants)

Smoothing: gaussian, median, unsharp_mask

Ridge/vessel detection: frangi, hessian, meijering, sato

Thresholding (10 methods): threshold_otsu, threshold_isodata, threshold_li, threshold_mean, threshold_minimum, threshold_multiotsu, threshold_niblack, threshold_sauvola, threshold_triangle, threshold_yen

Frequency domain: butterworth, wiener

---

Measure and Region Properties

cucim.skimage.measure — labeling, region properties, and shape metrics.

import cupy as cp  # Needed for func=cp.mean below
from cucim.skimage.measure import label, regionprops, regionprops_table
from cucim.skimage.measure import moments, moments_central, moments_hu
from cucim.skimage.measure import block_reduce, shannon_entropy

# Connected component labeling
labels = label(binary_image_gpu)

# Region properties (area, centroid, bounding box, etc.)
props = regionprops(labels)
table = regionprops_table(labels, intensity_image=gray_gpu,
                          properties=['area', 'centroid', 'mean_intensity'])

# Block reduce (downsampling)
downsampled = block_reduce(image_gpu, block_size=(2, 2), func=cp.mean)

Colocalization metrics (for microscopy): manders_coloc_coeff, manders_overlap_coeff, pearson_corr_coeff, intersection_coeff

---

Morphology

cucim.skimage.morphology — 30 GPU-accelerated morphological operations.

from cucim.skimage.morphology import (
    binary_erosion, binary_dilation, binary_opening, binary_closing,
    erosion, dilation, opening, closing,
    white_tophat, black_tophat,
    disk, diamond, ball, star,
    remove_small_objects, remove_small_holes,
    reconstruction, medial_axis, thin
)

# Create structuring element
selem = disk(5)

# Binary morphological operations
cleaned = binary_opening(binary_image_gpu, footprint=selem)
cleaned = binary_closing(cleaned, footprint=selem)

# Remove small objects/holes
cleaned = remove_small_objects(labels_gpu, min_size=100)
filled = remove_small_holes(binary_gpu, area_threshold=50)

# Grayscale morphology
tophat = white_tophat(gray_image_gpu, footprint=disk(10))

Structuring elements: disk, diamond, ball, octagon, octahedron, star, ellipse, footprint_rectangle

Isotropic operations: isotropic_erosion, isotropic_dilation, isotropic_opening, isotropic_closing

---

Segmentation

cucim.skimage.segmentation — level-set methods, boundary detection, label operations.

from cucim.skimage.segmentation import (
    chan_vese, morphological_chan_vese, morphological_geodesic_active_contour,
    inverse_gaussian_gradient, checkerboard_level_set,
    find_boundaries, mark_boundaries, clear_border,
    expand_labels, relabel_sequential, random_walker
)

# Chan-Vese segmentation
segmented = chan_vese(gray_image_gpu, mu=0.25, max_num_iter=200)

# Active contours (geodesic)
gimage = inverse_gaussian_gradient(gray_image_gpu)
init_ls = checkerboard_level_set(gray_image_gpu.shape)
seg = morphological_geodesic_active_contour(gimage, num_iter=200, init_level_set=init_ls)

# Find and mark boundaries
boundaries = find_boundaries(labels_gpu, mode='thick')

---

Registration

cucim.skimage.registration — image alignment.

from cucim.skimage.registration import (
    phase_cross_correlation,
    optical_flow_tvl1,
    optical_flow_ilk
)

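# Subpixel image registration (upsample_factor=10 resolves shifts to 1/10 pixel;
# the default of 1 only gives whole-pixel precision)
shift, error, diffphase = phase_cross_correlation(reference_gpu, moving_gpu, upsample_factor=10)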

# Optical flow
flow = optical_flow_tvl1(frame1_gpu, frame2_gpu)

---

Restoration

cucim.skimage.restoration — denoising and deconvolution.

from cucim.skimage.restoration import (
    denoise_tv_chambolle,
    richardson_lucy,
    wiener, unsupervised_wiener
)

# Total variation denoising
denoised = denoise_tv_chambolle(noisy_image_gpu, weight=0.1)

# Richardson-Lucy deconvolution
restored = richardson_lucy(blurred_image_gpu, psf_gpu, num_iter=30)

---

Transform

cucim.skimage.transform — geometric transforms, resizing, pyramids.

from cucim.skimage.transform import (
    resize, rescale, rotate, warp, swirl, warp_polar,
    pyramid_gaussian, pyramid_laplacian,
    downscale_local_mean, integral_image,
    AffineTransform, EuclideanTransform, SimilarityTransform
)

# Resize
resized = resize(image_gpu, (256, 256))

# Rescale
half = rescale(image_gpu, 0.5)

# Rotate
rotated = rotate(image_gpu, angle=45, resize=True)

# Gaussian pyramid
pyramid = list(pyramid_gaussian(image_gpu, max_layer=4, downscale=2))

# Affine transform
tform = AffineTransform(rotation=0.3, translation=(50, 50))
warped = warp(image_gpu, tform.inverse)

---

Metrics

cucim.skimage.metrics — image quality assessment.

from cucim.skimage.metrics import (
    mean_squared_error,
    peak_signal_noise_ratio,
    structural_similarity,
    normalized_root_mse
)

mse = mean_squared_error(original_gpu, processed_gpu)
psnr = peak_signal_noise_ratio(original_gpu, processed_gpu)
ssim = structural_similarity(original_gpu, processed_gpu)

---

Utility Functions

cucim.skimage.util — type conversion, array manipulation.

from cucim.skimage.util import (
    img_as_float, img_as_float32, img_as_ubyte,
    invert, crop, random_noise, montage
)

# Convert to float32 [0, 1]
float_img = img_as_float32(uint8_image_gpu)

# Add noise for testing
noisy = random_noise(image_gpu, mode='gaussian', var=0.01)

---

cucim.core.operations

NVIDIA-specific operations not found in scikit-image. Especially useful for digital pathology.

Pathology-Specific

from cucim.core.operations.color import (
    color_jitter,
    image_to_absorbance,
    stain_extraction_pca,
    normalize_colors_pca
)

# H&E stain normalization (digital pathology)
normalized = normalize_colors_pca(he_image_gpu)

# Color augmentation
augmented = color_jitter(image_gpu, brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

Intensity Operations

from cucim.core.operations.intensity import normalize_data, scale_intensity_range, zoom

normalized = normalize_data(image_gpu)
scaled = scale_intensity_range(image_gpu, a_min=0, a_max=255, b_min=0.0, b_max=1.0)

Spatial Augmentation

from cucim.core.operations.spatial import image_flip, image_rotate_90, rand_image_flip

flipped = image_flip(image_gpu, spatial_axis=1)
rotated = image_rotate_90(image_gpu, k=1, spatial_axis=(1, 2))  # 90 degrees; axes assume channel-first (C, H, W) input
randomly_flipped = rand_image_flip(image_gpu, prob=0.5)

Distance Transform

from cucim.core.operations.morphology import distance_transform_edt

# Exact Euclidean distance transform (faster than scipy.ndimage on GPU)
distances = distance_transform_edt(binary_image_gpu)

---

Whole-Slide Image Reading

cucim.clara.CuImage — high-performance WSI reader, compatible with OpenSlide API, 5-6x faster.

from cucim import CuImage

# Open a whole-slide image
img = CuImage("slide.svs")

# Inspect metadata
print(f"Dimensions: {img.shape}")
print(f"Resolution levels: {img.resolutions}")
print(f"Spacing: {img.spacing()}")  # spacing is a method, unlike shape and resolutions

# Read a region (returns a CuImage object)
region = img.read_region(location=(1000, 2000), size=(256, 256), level=0)

# Convert to CuPy array for processing
import cupy as cp
tile_gpu = cp.asarray(region)

# Process with cucim.skimage
from cucim.skimage.color import rgb2gray
gray_tile = rgb2gray(tile_gpu)

Supported formats: Aperio SVS, Philips TIFF, generic tiled multi-resolution RGB TIFF (JPEG, JPEG2000, LZW, Deflate compression).

Tile Caching

from cucim import CuImage

# Configure a per-process tile cache for repeated access patterns
cache = CuImage.cache("per_process", memory_capacity=2048)  # Capacity in MiB (2 GiB)

GPUDirect Storage

For large files (2GB+), GPUDirect Storage bypasses CPU memory for 25%+ additional speedup:

import cucim.clara.filesystem as fs

# Read directly into GPU memory, bypassing CPU
fd = fs.open(path, "r")  # Returns a CuFileDriver
fd.pread(gpu_buffer, size, offset)
fd.close()

---

Performance Characteristics

Headline numbers:

Up to 1245x faster than scikit-image for certain operations on large images
5-6x faster than OpenSlide for WSI multi-threaded patch reading
25%+ additional speedup with GPUDirect Storage on 2GB+ files

Scaling behavior:

4K resolution and above: GPU parallelism fully utilized, maximum speedups
~1000x1000: Moderate but measurable speedups for most operations
Below ~512x512: Diminishing returns; GPU overhead starts to matter
Below ~64x64: CPU may be faster due to CUDA kernel launch overhead

First-call overhead: JIT compilation on first kernel execution (cached after). Benchmark on subsequent calls.

Best strategy: Transfer image to GPU once, chain all processing operations, transfer back once at the end.
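
Because the first call includes JIT compilation, a fair timing discards one or more warm-up runs. The helper below is a minimal sketch of that measurement pattern using only the standard library. The name benchmark and its parameters are hypothetical. The optional sync callback is an assumption for GPU work: CuPy kernel launches are asynchronous, so the clock should stop only after a synchronize call such as cp.cuda.Stream.null.synchronize.

```python
import time


def benchmark(fn, *args, warmup=1, repeats=5, sync=None):
    """Time fn(*args), discarding warm-up calls that may include JIT compilation."""
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()  # Make sure warm-up work has finished before timing starts

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # Wait for asynchronous GPU work before stopping the clock
        timings.append(time.perf_counter() - start)

    # Best and mean wall-clock seconds over the timed runs
    return min(timings), sum(timings) / len(timings)
```

For a cuCIM filter this might be called as benchmark(gaussian, image_gpu, sync=cp.cuda.Stream.null.synchronize). For the scikit-image equivalent, leave sync unset.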

---

Interoperability

CuPy: Native array format. All cucim.skimage functions accept and return CuPy arrays.
NumPy: Convert with cp.asarray() / cp.asnumpy().
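PyTorch/TensorFlow: Zero-copy exchange. torch.as_tensor(cupy_array, device="cuda") uses the CUDA Array Interface, and torch.from_dlpack(cupy_array) uses the DLPack protocol; TensorFlow also exchanges arrays through DLPack.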
MONAI: Medical imaging framework with direct cuCIM integration for pathology transforms.
Albumentations: Can use cuCIM as GPU backend for augmentations.
NVIDIA DALI: Data loading pipeline integration.
Numba CUDA: CuPy arrays interoperable with Numba GPU kernels.
cuDF: Use for tabular operations on regionprops_table output.

CPU/GPU Agnostic Code

# Switch between CPU and GPU by changing the array module
import cupy as cp  # or: import numpy as cp
from cucim.skimage.filters import gaussian  # or: from skimage.filters import gaussian

result = gaussian(cp.asarray(image), sigma=5)

---

Known Limitations vs scikit-image

1. Incomplete API coverage: ~50-66% of scikit-image functions are implemented. Notable gaps include some graph-based segmentation (watershed, SLIC superpixels), some feature descriptors (ORB, BRIEF, HOG), and some restoration methods.

2. Linux only. No Windows or macOS GPU support.

3. NVIDIA GPU required. No AMD/Intel GPU support.

4. Data must be explicitly moved to GPU. cuCIM does not auto-transfer; you must call cp.asarray().

5. Small image penalty. Images below ~512x512 may not benefit. Below ~64x64, CPU is likely faster.

6. GPU memory constraints. Very large images must be tiled. GPU memory is typically smaller than system RAM.

7. WSI format support is limited. Supports TIFF/SVS/Philips TIFF only. DICOM, NIFTI, Zarr not yet in stable release.

8. JIT compilation overhead on first call per session (cached thereafter).
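
Limitation 6 says very large images must be tiled. The generator below is a minimal sketch of how tile windows could be produced. The name iter_tiles and the overlap parameter are hypothetical; the overlap is there because neighborhood filters such as gaussian need a halo of surrounding pixels to avoid seams at tile borders.

```python
def iter_tiles(height, width, tile_size, overlap=0):
    """Yield (row_slice, col_slice) pairs that together cover a height x width image."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")

    step = tile_size - overlap
    # Stopping at (extent - overlap) avoids a trailing tile that earlier tiles already cover
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            yield (
                slice(top, min(top + tile_size, height)),
                slice(left, min(left + tile_size, width)),
            )


for rows, cols in iter_tiles(10, 7, tile_size=4):
    print(rows, cols)
```

Each pair can index a host array directly, for example cp.asarray(image[rows, cols]), so only one tile occupies GPU memory at a time.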

---

Common Migration Patterns

Pattern 1: Direct scikit-image Replacement

# Before (CPU)
from skimage.filters import gaussian, sobel, threshold_otsu
from skimage.morphology import binary_opening, disk
from skimage.measure import label, regionprops_table
import numpy as np

image = np.array(...)  # Load image
blurred = gaussian(image, sigma=3)
edges = sobel(blurred)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image, properties=['area', 'centroid'])

# After (GPU) — change imports, wrap input with cp.asarray
from cucim.skimage.filters import gaussian, sobel, threshold_otsu
from cucim.skimage.morphology import binary_opening, disk
from cucim.skimage.measure import label, regionprops_table
import cupy as cp

image_gpu = cp.asarray(image)  # Transfer once
blurred = gaussian(image_gpu, sigma=3)
edges = sobel(blurred)
binary = blurred > threshold_otsu(blurred)
cleaned = binary_opening(binary, footprint=disk(3))
labels = label(cleaned)
props = regionprops_table(labels, image_gpu, properties=['area', 'centroid'])

Pattern 2: Digital Pathology Pipeline

from cucim import CuImage
from cucim.skimage.color import rgb2gray, separate_stains
from cucim.skimage.filters import threshold_otsu
from cucim.skimage.morphology import binary_opening, remove_small_objects, disk
from cucim.skimage.measure import label, regionprops_table
from cucim.core.operations.color import normalize_colors_pca
import cupy as cp

# Read whole-slide image tile
slide = CuImage("tissue.svs")
tile = cp.asarray(slide.read_region(location=(1000, 2000), size=(512, 512), level=0))

# Normalize staining (the tile is H x W x C, so the channel axis is last)
normalized = normalize_colors_pca(tile, channel_axis=-1)

# Segment nuclei
gray = rgb2gray(normalized)
binary = gray < threshold_otsu(gray)
cleaned = binary_opening(binary, footprint=disk(2))
cleaned = remove_small_objects(label(cleaned), min_size=50)
labels = label(cleaned)

# Extract properties
props = regionprops_table(labels, gray, properties=['area', 'centroid', 'mean_intensity'])

Pattern 3: Deep Learning Preprocessing Pipeline

import cupy as cp
from cucim.skimage.transform import resize
from cucim.skimage.exposure import equalize_adapthist
from cucim.skimage.util import img_as_float32
from cucim.core.operations.spatial import rand_image_flip
from cucim.core.operations.color import color_jitter
import torch

# Load batch of images to GPU
images_gpu = cp.asarray(numpy_batch)  # (N, H, W, C)

# Process each image on GPU
processed = []
for img in images_gpu:
    img = img_as_float32(img)
    img = resize(img, (224, 224))
    img = equalize_adapthist(img)
    img = rand_image_flip(img, prob=0.5)
    img = color_jitter(img, brightness=0.2, contrast=0.2)
    processed.append(img)

batch_gpu = cp.stack(processed)

# Zero-copy to PyTorch for model inference (memory is shared via the CUDA Array Interface)
batch_torch = torch.as_tensor(batch_gpu, device="cuda").permute(0, 3, 1, 2)  # NHWC → NCHW

cuDF Reference

cuDF is a GPU DataFrame library that provides a pandas-like API for loading, joining, aggregating, filtering, and manipulating tabular data entirely on the GPU. It's part of the NVIDIA RAPIDS ecosystem and is built on the Apache Arrow columnar memory format.

Full documentation: https://docs.rapids.ai/api/cudf/stable/

1. Installation and Setup
2. Two Usage Modes
3. cudf.pandas Accelerator Mode
4. Core API: DataFrame and Series
5. IO Operations
6. GroupBy Operations
7. String Operations
8. User Defined Functions (UDFs)
9. Missing Data Handling
10. Data Types
11. Memory Management
12. Interoperability
13. Multi-GPU with Dask-cuDF
14. Performance Optimization
15. Key Differences from pandas
16. Common Migration Patterns

---

Installation and Setup

Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.

uv add --extra-index-url=https://pypi.nvidia.com cudf-cu12    # For CUDA 12.x

Verify:

import cudf
print(cudf.Series([1, 2, 3]))  # Should print a GPU series

---

Two Usage Modes

cuDF offers two ways to accelerate pandas code:

1. cudf.pandas (Zero-Code-Change)

Drop-in replacement that automatically accelerates pandas. Falls back to CPU for unsupported operations. Best for: quick acceleration of existing code, mixed codebases, prototyping.

2. Direct cuDF API

Replace import pandas with import cudf. Maximum performance, no proxy overhead, but requires adapting code to cuDF's API (which has some behavioral differences from pandas). Best for: production pipelines, maximum performance, new GPU-first code.

---

cudf.pandas Accelerator Mode

The fastest path from pandas to GPU — no code changes required.

Activation

# Jupyter/IPython (MUST be before any pandas import)
%load_ext cudf.pandas
import pandas as pd  # Now GPU-accelerated

# Command line
# python -m cudf.pandas your_script.py
# python -m cudf.pandas --profile your_script.py  # With profiling

# Programmatic
import cudf.pandas
cudf.pandas.install()
import pandas as pd  # Now GPU-accelerated

Critical: If pandas was already imported in the session, you must restart the kernel/process.

How It Works

import pandas returns a proxy module that wraps cuDF and pandas.
Every operation is first attempted on GPU (cuDF). If it fails, it automatically falls back to CPU (pandas).
Data transfers between GPU and CPU happen only when necessary.
Uses managed memory by default — can process datasets larger than GPU memory.
Currently passes 93% of pandas' 187,000+ unit tests.
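
The try-GPU-first, fall-back-to-CPU behavior described above can be illustrated with a toy dispatcher. This is not how cudf.pandas is implemented. FallbackProxy, FastBackend, and SlowBackend are hypothetical names, and the sketch only shows the control flow of attempting the fast path and retrying on the slow one.

```python
class FallbackProxy:
    """Toy dispatcher: try the fast backend first, then the slow one."""

    def __init__(self, fast, slow):
        self._fast = fast
        self._slow = slow
        self.log = []  # Records which backend served each call

    def __getattr__(self, name):
        # Only reached for names that are not real attributes of the proxy
        def call(*args, **kwargs):
            fast_fn = getattr(self._fast, name, None)
            if fast_fn is not None:
                try:
                    result = fast_fn(*args, **kwargs)
                    self.log.append((name, "fast"))
                    return result
                except (NotImplementedError, TypeError):
                    pass  # Unsupported on the fast path: fall through
            result = getattr(self._slow, name)(*args, **kwargs)
            self.log.append((name, "slow"))
            return result

        return call


class FastBackend:
    def total(self, values):
        return sum(values)

    def median(self, values):
        raise NotImplementedError("not available on the fast path")


class SlowBackend:
    def total(self, values):
        return sum(values)

    def median(self, values):
        ordered = sorted(values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


proxy = FallbackProxy(FastBackend(), SlowBackend())
proxy.total([1, 2, 3])
proxy.median([3, 1, 2])
print(proxy.log)  # [('total', 'fast'), ('median', 'slow')]
```

The log plays the role that the cudf.pandas profiler plays in practice: it shows which operations ran on the fast path and which fell back.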

Profiling GPU vs CPU Execution

%%cudf.pandas.profile        # Shows GPU vs CPU operation breakdown per cell
%%cudf.pandas.line_profile   # Per-line GPU/CPU timing

Accessing Underlying Objects

proxy_df.as_gpu_object()  # Get the cuDF DataFrame directly
proxy_df.as_cpu_object()  # Get the pandas DataFrame directly

Note: automatic fallback stops working after you extract the underlying object.

Compatible Third-Party Libraries

cuGraph, cuML, hvPlot, HoloViews, Ibis, NumPy, Matplotlib, Plotly, PyTorch, Seaborn, scikit-learn, SciPy, TensorFlow, XGBoost.

Not compatible: Joblib. For distributed work, use Dask-cuDF instead.

Limitations

Join operations don't guarantee pandas' row ordering (for performance).
Cannot use import cudf alongside cudf.pandas in the same session.
Pickled objects are not interchangeable between regular pandas and cudf.pandas.
Proxy arrays subclass numpy.ndarray, which can cause eager device-to-host transfers.
To force CPU-only: set CUDF_PANDAS_FALLBACK_MODE=1.

---

Core API

Creating DataFrames and Series

import cudf

# From dict
df = cudf.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]})

# From pandas
import pandas as pd
gdf = cudf.DataFrame.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))
# or
gdf = cudf.DataFrame(pandas_df)

# Series
s = cudf.Series([1, 2, 3, None, 5])

# Back to pandas
pdf = gdf.to_pandas()

Common Operations (Same as pandas)

df.head(10)
df.tail(5)
df.describe()
df.info()
df.dtypes
df.columns
df.shape

# Selection
df["a"]                     # Column → Series
df[["a", "b"]]             # Multiple columns → DataFrame
df.loc[2:5, ["a", "b"]]   # Label-based indexing
df.iloc[0:3]               # Integer-based indexing

# Filtering
df[df["a"] > 2]
df.query("a > 2 and b < 6")  # Supports @var for local variables

# Sorting
df.sort_values("a", ascending=False)
df.sort_index()

# Missing data
df.fillna(0)
df.dropna()
df.isna()

# Aggregations
df["a"].sum()
df["a"].mean()
df["a"].std()
df["a"].value_counts()

# Transforms
df["a"].clip(lower=1, upper=5)
df["a"].apply(lambda x: x * 2)  # JIT-compiled

# Combining
cudf.concat([df1, df2])
df1.merge(df2, on="key")
df1.merge(df2, on="key", how="left")  # left, right, inner, outer

# Arrow interop (to_arrow copies device data into a host pyarrow.Table)
arrow_table = df.to_arrow()
df = cudf.DataFrame.from_arrow(arrow_table)

---

IO Operations

GPU-accelerated file reading and writing — often dramatically faster than pandas for large files.

Parquet (Recommended for Performance)

# Read
df = cudf.read_parquet("data.parquet")
df = cudf.read_parquet("data.parquet", columns=["a", "b"])  # Read only specific columns

# Write
df.to_parquet("output.parquet")

# Metadata inspection (without loading data)
cudf.io.parquet.read_parquet_metadata("data.parquet")

# Incremental writing
writer = cudf.io.parquet.ParquetDatasetWriter("output_dir/", partition_cols=["year"])
writer.write_table(df)
writer.close()

CSV

df = cudf.read_csv("data.csv")
df = cudf.read_csv("data.csv", usecols=["a", "b"], dtype={"a": "int32"})
df.to_csv("output.csv", index=False)

JSON

df = cudf.read_json("data.json")
df = cudf.read_json("data.json", lines=True)  # JSON Lines format
df.to_json("output.json")

ORC

df = cudf.read_orc("data.orc")
df.to_orc("output.orc")

Other Formats

Format	Read	Write	GPU-Accelerated
Avro	`cudf.read_avro()`	N/A	Yes (read only)
Text	`cudf.read_text()`	N/A	Yes (read only)
HDF5	`cudf.read_hdf()`	`df.to_hdf()`	No (uses pandas)
Feather	`cudf.read_feather()`	`df.to_feather()`	No (uses pandas)

Prefer Parquet over CSV — columnar format reads faster on GPU, supports predicate pushdown, and compresses well.

---

GroupBy Operations

Basic GroupBy

df.groupby("category").sum()
df.groupby(["category", "subcategory"]).mean()
df.groupby("category").agg({"value": "sum", "count": "max"})
df.groupby("category").agg({"value": ["sum", "min", "max"], "count": "mean"})

Supported Aggregations

Universal: count, size, nunique, nth, collect, unique
Numeric: sum, mean, var, std, median, idxmin, idxmax, min, max, quantile
Specialized: corr, cov

GroupBy Transform

df.groupby("category").transform("max")  # Broadcasts result to match group size

GroupBy Apply

df.groupby("category").apply(lambda x: x.max() - x.min())

Warning: Apply runs the function sequentially per group — can be slow with many small groups. Use vectorized aggregations whenever possible.

JIT-Compiled GroupBy (User-Defined Aggregation)

def custom_agg(df):
    return df["value"].max() - df["value"].min() / 2

result = df.groupby("category").apply(custom_agg, engine="jit")

JIT restrictions: no nulls, only int32/64 and float32/64, cannot return new columns.

Important: Sort Behavior

cuDF uses sort=False by default (unlike pandas which sorts by default). To match pandas:

df.groupby("category", sort=True).sum()
# Or globally:
cudf.set_option("mode.pandas_compatible", True)

---

String Operations

cuDF provides GPU-accelerated string operations via the .str accessor, which closely mirrors the pandas API and adds a few cuDF-only methods.

s = cudf.Series(["Hello World", "foo bar", "RAPIDS GPU", None])

# Case
s.str.lower()
s.str.upper()
s.str.title()
s.str.capitalize()

# Pattern matching
s.str.contains("World")
s.str.startswith("Hello")
s.str.endswith("GPU")
s.str.match(r"^[A-Z]")

# Extraction and replacement
s.str.extract(r"(\w+)\s(\w+)")
s.str.replace("World", "GPU")
s.str.slice(0, 5)

# Splitting and joining
s.str.split(" ")
s.str.cat(sep=", ")

# Info
s.str.len()
s.str.isalpha()
s.str.isdigit()

# cuDF-exclusive operations (not in pandas)
s.str.normalize_spaces()   # Collapse whitespace
s.str.tokenize()           # Tokenize strings
s.str.ngrams(2)            # Generate n-grams
s.str.edit_distance(other) # Levenshtein distance
s.str.url_encode()
s.str.url_decode()

---

User Defined Functions

Series.apply() — JIT-Compiled

s = cudf.Series([1, 2, 3, 4, 5])

def square_plus_one(x):
    return x ** 2 + 1

s.apply(square_plus_one)  # Compiled to GPU kernel via Numba

With arguments:

def add_constant(x, c):
    return x + c

s.apply(add_constant, args=(42,))

DataFrame.apply() — Row-wise (axis=1)

def row_func(row):
    return row["a"] + row["b"] * 2

df.apply(row_func, axis=1)  # Access columns by name via dict-like syntax

Null Handling in UDFs

Nulls propagate automatically:

s = cudf.Series([1, cudf.NA, 3])
def f(x):
    return x + 1
s.apply(f)  # Returns [2, <NA>, 4]

Explicit null checks:

def f(x):
    if x is cudf.NA:
        return 0
    return x + 1

String UDFs

String operations inside UDFs support: ==, !=, >=, <=, startswith(), endswith(), find(), rfind(), count(), in, strip/lstrip/rstrip(), upper/lower(), replace(), + (concatenation), len(), boolean checks.

For string UDFs creating intermediate strings, allocate heap:

from cudf.core.udf.utils import set_malloc_heap_size
set_malloc_heap_size(int(2e9))  # 2 GB

Rolling Window UDFs

import math

s = cudf.Series([16, 25, 36, 49, 64, 81], dtype="float64")

def max_sqrt(window):
    result = 0
    for val in window:
        result = max(result, math.sqrt(val))
    return result

s.rolling(window=3, min_periods=3).apply(max_sqrt)

Limitation: Rolling UDFs do NOT support null values.
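
To sanity-check a rolling UDF, it can help to compute the same result on the CPU in plain Python first. The sketch below mirrors max_sqrt over full windows of three values. The helper name rolling_apply_reference is hypothetical and is not part of cuDF; it only models the min_periods=window case shown above.

```python
import math

def max_sqrt(window):
    # Same logic as the UDF above: largest square root in the window.
    result = 0
    for val in window:
        result = max(result, math.sqrt(val))
    return result

def rolling_apply_reference(values, window, func):
    """CPU reference: None until a full window is available (min_periods == window)."""
    out = []
    for end in range(1, len(values) + 1):
        if end < window:
            out.append(None)
        else:
            out.append(func(values[end - window:end]))
    return out

data = [16.0, 25.0, 36.0, 49.0, 64.0, 81.0]
expected = rolling_apply_reference(data, window=3, func=max_sqrt)
print(expected)  # [None, None, 6.0, 7.0, 8.0, 9.0]
```

Comparing the GPU result against a reference like this on a small sample is a cheap way to catch window-alignment mistakes before running on the full dataset.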

Custom Numba CUDA Kernels on cuDF Columns

For maximum control, write CUDA kernels that operate directly on cuDF columns:

from numba import cuda

@cuda.jit
def gpu_multiply(in_col, out_col, multiplier):
    i = cuda.grid(1)
    if i < in_col.size:
        out_col[i] = in_col[i] * multiplier

df["result"] = 0.0
gpu_multiply.forall(len(df))(df["a"], df["result"], 10.0)

UDF Limitations

Only numeric non-decimal types have full support; strings have partial support.
**kwargs not supported.
Bitwise operations not implemented in UDFs.
GroupBy JIT: no nulls, only int32/64 and float32/64, cannot return new columns.
Rolling UDFs: no null support.

---

Missing Data Handling

Missing values are <NA> (not NaN) — cuDF uses a separate null mask, not NaN sentinels.
All dtypes are nullable (including integers — no float coercion for missing ints).
np.nan inserted into integer columns becomes <NA> without casting to float.

s = cudf.Series([1, None, 3, None, 5])

s.isna()                # Boolean mask
s.notna()
s.fillna(0)             # Fill with scalar
df.fillna({"a": 0, "b": 1})  # DataFrame only: per-column fill values
s.dropna()

# Aggregations skip NA by default
s.sum()                 # skipna=True (default)
s.sum(skipna=False)     # Propagates NA

# GroupBy excludes NA groups by default
df.groupby("a", dropna=False).sum()  # Include NA groups
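
The difference between a separate null mask and a NaN sentinel can be illustrated without a GPU. The sketch below is a plain-Python model of the idea only, not cuDF's internal layout; MaskedInts is a hypothetical name introduced for this example.

```python
from dataclasses import dataclass

@dataclass
class MaskedInts:
    """Integer column with a separate validity mask (illustrative only)."""
    data: list   # integer payload; missing slots hold a placeholder 0
    valid: list  # True where a value is present

    @classmethod
    def from_values(cls, values):
        return cls(
            data=[0 if v is None else int(v) for v in values],
            valid=[v is not None for v in values],
        )

    def isna(self):
        return [not ok for ok in self.valid]

    def sum(self, skipna=True):
        if not skipna and not all(self.valid):
            return None  # a missing value propagates
        return sum(d for d, ok in zip(self.data, self.valid) if ok)

col = MaskedInts.from_values([1, None, 3, None, 5])
print(col.isna())             # [False, True, False, True, False]
print(col.sum())              # 9
print(col.sum(skipna=False))  # None
print(all(isinstance(d, int) for d in col.data))  # True: payload stays integer
```

Because missingness lives in the mask rather than in the payload, the integer values never need to be widened to float to make room for NaN.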

---

Data Types

Category	Types
Integer	`int8`, `int16`, `int32`, `int64`, `uint32`, `uint64`
Float	`float32`, `float64`
Datetime	`datetime64[s/ms/us/ns]`
Timedelta	`timedelta64[s/ms/us/ns]`
Categorical	`CategoricalDtype`
String	`object` / `string`
Decimal	`Decimal32Dtype`, `Decimal64Dtype`, `Decimal128Dtype`
List	`ListDtype` (nested lists)
Struct	`StructDtype` (dict-like)

All types are nullable. List columns have a .list accessor (get(), len(), contains(), sort_values(), unique(), concat()). Struct columns have a .struct accessor (field(), explode()).

No `object` dtype for arbitrary Python objects — object dtype only stores strings.

---

Memory Management

RMM (RAPIDS Memory Manager)

cuDF uses RMM for GPU memory allocation. Configure it for your workload:

import rmm

# Pool allocator (recommended for production — avoids per-allocation cudaMalloc overhead)
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size="1GiB",
    maximum_pool_size="4GiB"
)
rmm.mr.set_current_device_resource(pool)

# Managed memory (allows datasets larger than GPU memory)
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

# Managed + pool (best of both)
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.ManagedMemoryResource(),
    initial_pool_size="1GiB"
)
rmm.mr.set_current_device_resource(pool)

Aligning CuPy and Numba with RMM

When using cuDF with CuPy or Numba, align all libraries on the same allocator to avoid memory fragmentation:

# CuPy
from rmm.allocators.cupy import rmm_cupy_allocator
import cupy
cupy.cuda.set_allocator(rmm_cupy_allocator)

# Numba
from rmm.allocators.numba import RMMNumbaManager
from numba import cuda
cuda.set_memory_manager(RMMNumbaManager)

Copy-on-Write

cudf.set_option("copy_on_write", True)
# or: export CUDF_COPY_ON_WRITE=1

Slices, .head(), shallow copies, and view-generating methods share memory until one is modified. Reduces memory usage significantly for workflows with many derived DataFrames.

Memory Profiling

import rmm.statistics

rmm.statistics.enable_statistics()
stats = rmm.statistics.get_statistics()
# Returns: current_bytes, current_count, peak_bytes, peak_count, total_bytes, total_count

---

Interoperability

CuPy (Zero-Copy)

import cupy as cp

# cuDF → CuPy
arr = df.to_cupy()             # DataFrame → 2D CuPy array
arr = cp.asarray(df["col"])    # Series → 1D CuPy array
arr = df["col"].values         # Series → 1D CuPy array

# CuPy → cuDF
df = cudf.DataFrame(cupy_2d_array)
s = cudf.Series(cupy_1d_array)

# Via DLPack
df = cudf.from_dlpack(cupy_array.__dlpack__())

Arrow

arrow_table = df.to_arrow()                  # PyArrow table in host memory (device-to-host copy)
df = cudf.DataFrame.from_arrow(arrow_table)  # Copies the data back to the GPU

RAPIDS Ecosystem

cuML: Accepts cuDF DataFrames directly for ML pipelines.
cuGraph: Accepts cuDF DataFrames for graph analytics.
Dask-cuDF: Distributed GPU DataFrames (see below).

CUDA Array Interface

cuDF Series exposes __cuda_array_interface__ for zero-copy sharing with any compatible library (CuPy, Numba, PyTorch, etc.).

---

Multi-GPU with Dask-cuDF

For datasets larger than a single GPU's memory, or for multi-GPU parallelism:

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# From files
ddf = dask_cudf.read_csv("path/*.csv")
ddf = dask_cudf.read_parquet("path/")

# From cuDF DataFrame
ddf = dask_cudf.from_cudf(df, npartitions=16)

# Operations (lazy — call .compute() to execute)
result = ddf.groupby("a").sum().compute()

# Persist in GPU memory for repeated access
ddf = ddf.persist()

Key differences from cuDF: .iloc not supported, must call .compute() to materialize, transpose not implemented.

---

Performance Optimization

1. Start with cudf.pandas for easiest adoption — zero code changes, automatic GPU/CPU fallback.

2. Switch to direct cuDF API for max performance — avoids proxy overhead and fallback copying costs.

3. Prefer Parquet over CSV — columnar format, faster GPU reads, predicate pushdown, better compression.

4. Use pool allocators via RMM — avoids per-allocation cudaMalloc overhead.

5. Enable copy-on-write — cudf.set_option("copy_on_write", True) reduces memory from slices and views.

6. Reshape data to be long (more rows, fewer columns) — GPUs parallelize over rows.

7. Never iterate — use vectorized operations exclusively. for row in df.iterrows() defeats the purpose of GPU acceleration.

8. Minimum dataset size: GPUs shine with 10,000-100,000+ rows. Smaller datasets may be faster on CPU.

9. Use vectorized string ops (.str. accessor) instead of row-wise string UDFs.

10. Use CuPy for row-wise math that cuDF doesn't support natively.

11. Use Numba CUDA kernels for complex element-wise operations.

12. Align all RAPIDS libraries on the same RMM allocator to avoid memory fragmentation.

13. For distributed workloads, use Dask-cuDF with persist() to keep data on GPU memory.

---

Key Differences from pandas

1. Result ordering is non-deterministic by default (groupby, joins, etc.). Use sort=True or cudf.set_option("mode.pandas_compatible", True).

2. All types are nullable. Missing values are <NA>, not NaN. Integer columns with missing values stay integer (no float coercion).

3. No iteration. for val in series is not supported. Convert to pandas first if you must iterate.

4. Unique column names required. No duplicate column names.

5. No arbitrary Python objects. The object dtype only stores strings.

6. `.apply()` uses Numba JIT. Only a subset of Python is supported inside UDFs — no arbitrary Python objects, no external library calls.

7. Floating-point results may differ slightly due to GPU parallel operation ordering. Use tolerance-based comparisons.

8. GroupBy defaults to `sort=False` (pandas defaults to sort=True).

9. No ExtensionDtype support from pandas.
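
The floating-point point above (item 7) is easy to reproduce on the CPU: regrouping the same additions changes the last bits of the result. The sketch below uses plain Python floats as a stand-in for a reordered parallel reduction and compares with a tolerance instead of ==. The tolerance value is an illustrative choice, not a cuDF recommendation.

```python
import math

a, b, c = 0.1, 0.2, 0.3
serial = (a + b) + c     # one association order
regrouped = a + (b + c)  # another order, as a parallel reduction might use

print(serial == regrouped)                            # False
print(math.isclose(serial, regrouped, rel_tol=1e-9))  # True
```

The same approach applies when validating GPU output against a pandas baseline: compare with a relative or absolute tolerance suited to the dtype rather than expecting bit-identical results.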

---

Common Migration Patterns

Pattern 1: Zero-Effort (cudf.pandas)

%load_ext cudf.pandas
import pandas as pd
# Everything else stays exactly the same

Pattern 2: Direct Import Swap

# Before
import pandas as pd
df = pd.read_csv("data.csv")
result = df.groupby("col").mean()

# After
import cudf
df = cudf.read_csv("data.csv")
result = df.groupby("col").mean()

Pattern 3: Replace Iteration with Vectorized Ops

# Before (pandas — slow even on CPU)
for idx, row in df.iterrows():
    df.at[idx, "c"] = row["a"] + row["b"]

# After (cuDF)
df["c"] = df["a"] + df["b"]

Pattern 4: Replace apply() with Vectorized

# Before
df["result"] = df.apply(lambda row: row["a"] ** 2 + row["b"], axis=1)

# After (vectorized — much faster)
df["result"] = df["a"] ** 2 + df["b"]

Pattern 5: GPU Processing, CPU at Boundaries

# Load and process on GPU
gdf = cudf.read_parquet("data.parquet")
result = gdf.groupby("key").agg({"val": "sum"})

# Convert to pandas only when needed (plotting, export, etc.)
pdf = result.to_pandas()
pdf.plot()

Pattern 6: CuPy for Unsupported Math

import cupy as cp

# Convert to CuPy for operations cuDF doesn't support
arr = df[["x", "y", "z"]].to_cupy()
norms = cp.linalg.norm(arr, axis=1)
df["norm"] = cudf.Series(norms, index=df.index)  # Reuse df's index so rows align

---

Configuration

cudf.set_option("copy_on_write", True)            # Enable copy-on-write
cudf.set_option("mode.pandas_compatible", True)    # Match pandas behavior
cudf.describe_option()                             # List all options

Environment Variable	Purpose
`CUDF_COPY_ON_WRITE=1`	Enable copy-on-write
`CUDF_PANDAS_RMM_MODE`	Control memory allocator for cudf.pandas
`CUDF_PANDAS_FALLBACK_MODE=1`	Force CPU-only execution in cudf.pandas

cuGraph Reference

cuGraph is NVIDIA's GPU-accelerated graph analytics library within the RAPIDS ecosystem. It provides NetworkX-compatible APIs for graph algorithms, delivering 10-500x+ speedup over CPU-based NetworkX on medium to large graphs. It supports both a direct Python API and a zero-code-change NetworkX backend (nx-cugraph) that accelerates existing NetworkX code with no modifications.

Full documentation: https://docs.rapids.ai/api/cugraph/stable/

Version (stable): 26.02.00

Repository: https://github.com/rapidsai/cugraph

1. Installation and Setup

2. Two Usage Modes

3. nx-cugraph: Zero-Code-Change NetworkX Backend

4. Direct cuGraph API

5. Graph Creation and Data Loading

6. Supported Graph Types

7. Algorithm Catalog

8. Multi-GPU Support with Dask

9. GNN Support (cugraph-pyg and WholeGraph)

10. Performance Characteristics and Benchmarks

11. Memory Management

12. Interoperability

13. Known Limitations vs NetworkX

14. Common Migration Patterns

---

Installation and Setup

Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.

uv add --extra-index-url=https://pypi.nvidia.com cugraph-cu12    # Core cuGraph for CUDA 12.x
uv add --extra-index-url=https://pypi.nvidia.com nx-cugraph-cu12 # NetworkX backend

Platform: Linux and WSL2 only (no native macOS or Windows).

Requires: NVIDIA GPU with CUDA 12.x support, NetworkX >= 3.2 (>= 3.4 recommended for optimal nx-cugraph).

Verify:

import cugraph
print(cugraph.__version__)

# Quick test with built-in dataset
from cugraph.datasets import karate
G = karate.get_graph()
result = cugraph.degree_centrality(G)
print(result.head())

---

Two Usage Modes

Mode 1: nx-cugraph Backend (Zero Code Change)

Accelerate existing NetworkX code by setting one environment variable. No code changes required.

NX_CUGRAPH_AUTOCONFIG=True python my_networkx_script.py

Mode 2: Direct cuGraph API

Use cuGraph's native API for maximum control, working directly with cuDF DataFrames and cuGraph graph objects.

import cugraph
import cudf

edges = cudf.DataFrame({
    "src": [0, 1, 2, 0],
    "dst": [1, 2, 3, 3],
    "weight": [1.0, 2.0, 1.5, 3.0]
})
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst", edge_attr="weight")
result = cugraph.pagerank(G)

When to use which:

nx-cugraph: Existing NetworkX codebases, rapid prototyping, when you want zero migration effort
Direct API: Maximum performance, multi-GPU workflows, integration with cuDF/cuML pipelines, GNN training

---

nx-cugraph: Zero-Code-Change NetworkX Backend

nx-cugraph is a NetworkX backend that transparently redirects supported algorithm calls to GPU-accelerated cuGraph implementations.

How It Works

NetworkX >= 3.2 has a backend dispatch system. When nx-cugraph is installed and enabled, NetworkX automatically redirects supported function calls to GPU implementations. Unsupported calls fall back to default NetworkX.
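
The dispatch-with-fallback behavior described above can be sketched in a few lines of plain Python. This is a conceptual model only: GPU_IMPLEMENTATIONS, register_gpu, and dispatchable are hypothetical names, and NetworkX's real dispatch machinery is more involved.

```python
import functools

GPU_IMPLEMENTATIONS = {}  # algorithm name -> accelerated callable (hypothetical registry)

def register_gpu(name):
    """Register an accelerated implementation under an algorithm name."""
    def decorator(func):
        GPU_IMPLEMENTATIONS[name] = func
        return func
    return decorator

def dispatchable(cpu_func):
    """Route a call to the accelerated implementation if one exists, else fall back."""
    @functools.wraps(cpu_func)
    def wrapper(*args, **kwargs):
        impl = GPU_IMPLEMENTATIONS.get(cpu_func.__name__)
        if impl is None:
            return "cpu", cpu_func(*args, **kwargs)
        return "gpu", impl(*args, **kwargs)
    return wrapper

@dispatchable
def edge_count(edges):
    return len(edges)

@dispatchable
def max_vertex(edges):
    return max(max(u, v) for u, v in edges)

@register_gpu("edge_count")
def _edge_count_accelerated(edges):
    # Stand-in for a GPU implementation; it runs on the CPU here.
    return len(edges)

edges = [(0, 1), (1, 2), (2, 3)]
print(edge_count(edges))  # ('gpu', 3)
print(max_vertex(edges))  # ('cpu', 3)
```

The caller's code is identical in both cases, which is what makes the zero-code-change mode possible: only the presence of a registered implementation decides where the call runs.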

Three Ways to Enable

1. Environment Variable (recommended for zero code change):

export NX_CUGRAPH_AUTOCONFIG=True
python my_script.py
# OR inline:
NX_CUGRAPH_AUTOCONFIG=True python my_script.py

2. Keyword Argument (explicit per-call):

import networkx as nx
result = nx.betweenness_centrality(G, k=10, backend="cugraph")

3. Type-Based Dispatch (explicit graph conversion):

import networkx as nx
import nx_cugraph as nxcg

G_nx = nx.karate_club_graph()
G_gpu = nxcg.from_networkx(G_nx)  # Convert once, reuse for multiple algorithms
result = nx.pagerank(G_gpu)       # Automatically dispatched to GPU

Supported Algorithms in nx-cugraph

Centrality:

betweenness_centrality, edge_betweenness_centrality
degree_centrality, in_degree_centrality, out_degree_centrality
eigenvector_centrality, katz_centrality

Community:

louvain_communities, leiden_communities

Components:

connected_components, is_connected, number_connected_components
node_connected_component
weakly_connected_components, is_weakly_connected, number_weakly_connected_components

Clustering:

average_clustering, clustering, transitivity, triangles

Core:

core_number, k_truss

Link Analysis:

pagerank, hits

Link Prediction:

jaccard_coefficient

Shortest Paths (23+ functions):

shortest_path, shortest_path_length
has_path, all_pairs_shortest_path, all_pairs_shortest_path_length
dijkstra_path, dijkstra_path_length, all_pairs_dijkstra, all_pairs_dijkstra_path_length
bellman_ford_path, bellman_ford_path_length, all_pairs_bellman_ford_path_length
single_source_shortest_path, single_source_shortest_path_length
single_source_dijkstra, single_source_dijkstra_path, single_source_dijkstra_path_length
single_source_bellman_ford, single_source_bellman_ford_path, single_source_bellman_ford_path_length
single_target_shortest_path_length

Traversal:

bfs_edges, bfs_layers, bfs_predecessors, bfs_successors, bfs_tree
generic_bfs_edges, descendants_at_distance

DAG:

ancestors, descendants

Bipartite:

betweenness_centrality (bipartite), biadjacency_matrix
complete_bipartite_graph, from_biadjacency_matrix

Tree:

is_arborescence, is_branching, is_forest, is_tree

Operators:

complement, reverse

Reciprocity:

overall_reciprocity, reciprocity

Isolate:

is_isolate, isolates, number_of_isolates

Lowest Common Ancestors:

lowest_common_ancestor

Layout:

forceatlas2_layout

Graph Generators: Various generators are also supported for creating graphs directly on GPU.

---

Direct cuGraph API

Quick Example

import cugraph
import cudf

# Load edges from cuDF DataFrame
edges = cudf.DataFrame({
    "source": [0, 1, 2, 3, 0, 2],
    "destination": [1, 2, 3, 4, 4, 1],
    "weight": [1.0, 2.0, 1.0, 3.0, 0.5, 1.5]
})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="source", destination="destination", edge_attr="weight")

# Run algorithms
pr = cugraph.pagerank(G)
bc = cugraph.betweenness_centrality(G)
components = cugraph.weakly_connected_components(G)

---

Graph Creation and Data Loading

From cuDF DataFrame (Primary Method)

import cudf, cugraph

df = cudf.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 3], "wt": [1.0, 2.0, 3.0]})

# Unweighted
G = cugraph.Graph()
G.from_cudf_edgelist(df, source="src", destination="dst")

# Weighted
G = cugraph.Graph()
G.from_cudf_edgelist(df, source="src", destination="dst", edge_attr="wt")

# Directed
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(df, source="src", destination="dst")

From Pandas DataFrame

import pandas as pd, cugraph

df = pd.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 3]})
G = cugraph.Graph()
G.from_pandas_edgelist(df, source="src", destination="dst")

From cuDF Adjacency List

G = cugraph.Graph()
G.from_cudf_adjlist(offsets, indices, values)  # CSR format

From NumPy Array

import numpy as np
adj_matrix = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
G = cugraph.Graph()
G.from_numpy_array(adj_matrix)

From Pandas Adjacency Matrix

G = cugraph.Graph()
G.from_pandas_adjacency(adj_df)

From Dask-cuDF (Multi-GPU)

G = cugraph.Graph()
G.from_dask_cudf_edgelist(dask_cudf_df, source="src", destination="dst")

From Built-in Datasets

from cugraph.datasets import karate, dolphins, polbooks, netscience
G = karate.get_graph()

Symmetrization (Undirected Graphs)

# Ensure all edges are bidirectional
sym_df = cugraph.symmetrize_df(df, "src", "dst")

# Or symmetrize edge-list columns directly
sym_df = cugraph.symmetrize(source_col, dest_col, weight_col)
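
As a CPU-side illustration of what symmetrization means, the sketch below adds the reverse of every edge to a plain Python edge list. symmetrize_edges is a hypothetical helper; dropping duplicates is a choice made in this sketch, so check the cuGraph documentation for how its functions treat duplicate edges and weights.

```python
def symmetrize_edges(edges):
    """Return the edge list with every (u, v) also present as (v, u), without duplicates."""
    seen = set()
    out = []
    for u, v in edges:
        for pair in ((u, v), (v, u)):
            if pair not in seen:
                seen.add(pair)
                out.append(pair)
    return out

edges = [(0, 1), (1, 2), (2, 1)]
print(symmetrize_edges(edges))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```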

Vertex Renumbering

cuGraph internally renumbers vertices to contiguous integers starting from 0. Use unrenumber() to map back to original IDs:

result = cugraph.pagerank(G)
result = G.unrenumber(result, "vertex")  # Map internal IDs back to original
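
The renumbering idea can be sketched in plain Python: assign each distinct vertex label a contiguous integer, run the algorithm on integers, then map results back. The renumber helper and the score values below are made up for illustration and do not reflect cuGraph's internal implementation.

```python
def renumber(edges):
    """Map arbitrary hashable vertex IDs to contiguous integers starting at 0."""
    to_internal = {}
    for src, dst in edges:
        for v in (src, dst):
            if v not in to_internal:
                to_internal[v] = len(to_internal)
    internal_edges = [(to_internal[s], to_internal[d]) for s, d in edges]
    to_original = {i: v for v, i in to_internal.items()}
    return internal_edges, to_original

edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]
internal_edges, to_original = renumber(edges)
print(internal_edges)  # [(0, 1), (1, 2), (0, 2)]

# A result keyed by internal ID (values are made up), mapped back to original labels.
internal_scores = {0: 0.4, 1: 0.25, 2: 0.35}
scores = {to_original[i]: s for i, s in internal_scores.items()}
print(scores)  # {'alice': 0.4, 'bob': 0.25, 'carol': 0.35}
```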

---

Supported Graph Types

Graph Type	cuGraph Class	Notes
Undirected	`cugraph.Graph()`	Default; edges are bidirectional
Directed	`cugraph.Graph(directed=True)`	Directed edges; some algorithms require directed/undirected
Weighted	Set `edge_attr` in `from_cudf_edgelist`	Edge weights used by SSSP, PageRank, Louvain, etc.
MultiGraph	`cugraph.MultiGraph()`	Multiple edges between same vertex pairs
Bipartite	Supported via standard Graph with bipartite structure	No dedicated class; algorithms in `cugraph.bipartite`

Important: cuGraph uses a CSR (Compressed Sparse Row) internal representation. Graphs are immutable after creation -- you cannot dynamically add/remove individual edges after calling from_cudf_edgelist(). To modify a graph, reconstruct it from a new DataFrame.
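
A small CSR construction in plain Python shows why in-place edge edits are awkward with this layout. The sketch below builds the offsets and indices arrays for a directed edge list; build_csr and neighbors are hypothetical helpers, and this is not cuGraph's actual implementation.

```python
def build_csr(num_vertices, edges):
    """Build CSR offsets/indices arrays from a directed edge list."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    # offsets[v] is where vertex v's neighbor list starts in indices.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]
    indices = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return offsets, indices

def neighbors(offsets, indices, v):
    return indices[offsets[v]:offsets[v + 1]]

edges = [(0, 1), (0, 3), (1, 2), (2, 3)]
offsets, indices = build_csr(4, edges)
print(offsets)  # [0, 2, 3, 4, 4]
print(indices)  # [1, 3, 2, 3]
print(neighbors(offsets, indices, 0))  # [1, 3]
```

Adding one edge out of vertex 0 would shift every later entry in indices and increment every later offset, so rebuilding from a new edge list is the simpler operation.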

---

Algorithm Catalog

Centrality

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Betweenness Centrality	`cugraph.betweenness_centrality(G)`	`cugraph.dask.centrality.betweenness_centrality()`	`nx.betweenness_centrality()`
Edge Betweenness	`cugraph.edge_betweenness_centrality(G)`	`cugraph.dask.centrality.edge_betweenness_centrality()`	`nx.edge_betweenness_centrality()`
Degree Centrality	`cugraph.degree_centrality(G)`	--	`nx.degree_centrality()`
Eigenvector Centrality	`cugraph.eigenvector_centrality(G)`	`cugraph.dask.centrality.eigenvector_centrality()`	`nx.eigenvector_centrality()`
Katz Centrality	`cugraph.katz_centrality(G)`	`cugraph.dask.centrality.katz_centrality()`	`nx.katz_centrality()`

Community Detection

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Louvain	`cugraph.louvain(G, max_level=, max_iter=, resolution=)`	`cugraph.dask.community.louvain.louvain()`	`nx.community.louvain_communities()`
Leiden	`cugraph.leiden(G, max_iter=, resolution=)`	`cugraph.dask.community.leiden.leiden()`	`nx.community.leiden_communities()`
ECG	`cugraph.ecg(G, min_weight=)`	`cugraph.dask.community.ecg.ecg()`	--
Spectral Balanced Cut	`cugraph.spectralBalancedCutClustering(G, num_clusters)`	--	--
Spectral Modularity	`cugraph.spectralModularityMaximizationClustering(G, num_clusters)`	--	--
Triangle Counting	`cugraph.triangle_count(G)`	`cugraph.dask.community.triangle_count()`	`nx.triangles()`
K-Truss	`cugraph.k_truss(G, k)` or `cugraph.ktruss_subgraph(G, k)`	`cugraph.dask.community.ktruss_subgraph()`	`nx.k_truss()`
EgoNet	`cugraph.ego_graph(G, n, radius=)`	`cugraph.dask.community.egonet()`	`nx.ego_graph()`
Induced Subgraph	`cugraph.induced_subgraph(G, vertices)`	`cugraph.dask.community.induced_subgraph()`	`G.subgraph(vertices)`

Clustering Analysis:

cugraph.analyzeClustering_edge_cut(G, n_clusters, clustering)
cugraph.analyzeClustering_modularity(G, n_clusters, clustering)
cugraph.analyzeClustering_ratio_cut(G, n_clusters, clustering)

Traversal

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
BFS	`cugraph.bfs(G, start=, depth_limit=)`	`cugraph.dask.traversal.bfs.bfs()`	`nx.bfs_edges()`
BFS Edges	`cugraph.bfs_edges(G, source)`	--	`nx.bfs_edges()`
SSSP	`cugraph.sssp(G, source=)`	`cugraph.dask.traversal.sssp.sssp()`	`nx.single_source_dijkstra()`
Shortest Path	`cugraph.shortest_path(G, source=)`	--	`nx.shortest_path()`
Shortest Path Length	`cugraph.shortest_path_length(G, source, target=)`	--	`nx.shortest_path_length()`
Filter Unreachable	`cugraph.filter_unreachable(df)`	--	--

Link Analysis

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
PageRank	`cugraph.pagerank(G, alpha=)`	`cugraph.dask.link_analysis.pagerank()`	`nx.pagerank()`
HITS	`cugraph.hits(G, max_iter=, tol=)`	`cugraph.dask.link_analysis.hits()`	`nx.hits()`

Link Prediction / Similarity

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Jaccard	`cugraph.jaccard(G, vertex_pair=)`	--	`nx.jaccard_coefficient()`
Cosine Similarity	`cugraph.cosine(G, vertex_pair=)`	--	--
Overlap	`cugraph.overlap(G, vertex_pair=)`	`cugraph.dask.link_prediction.overlap()`	--
Sorensen	`cugraph.sorensen(G, vertex_pair=)`	`cugraph.dask.link_prediction.sorensen()`	--

NetworkX-compatible wrappers: cugraph.jaccard_coefficient(G, ebunch), cugraph.overlap_coefficient(G, ebunch), cugraph.sorensen_coefficient(G, ebunch)

Components

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Connected Components	`cugraph.connected_components(G)`	--	`nx.connected_components()`
Weakly Connected	`cugraph.weakly_connected_components(G)`	`cugraph.dask.components.weakly_connected_components()`	`nx.weakly_connected_components()`
Strongly Connected	`cugraph.strongly_connected_components(G)`	--	`nx.strongly_connected_components()`

Cores

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Core Number	`cugraph.core_number(G, degree_type=)`	`cugraph.dask.cores.core_number()`	`nx.core_number()`
K-Core	`cugraph.k_core(G, k=, core_number=)`	`cugraph.dask.cores.k_core()`	`nx.k_core()`

Sampling

Algorithm	Single-GPU	Multi-GPU	Notes
Biased Random Walks	`cugraph.biased_random_walks(G, start_vertices)`	`cugraph.dask.sampling.biased_random_walks()`	Weighted/biased traversal
Uniform Random Walks	--	`cugraph.dask.sampling.uniform_random_walks()`	Padded result with max path length
Random Walks	--	`cugraph.dask.sampling.random_walks()`	General random walk
Node2Vec	--	`cugraph.dask.sampling.node2vec_random_walks()`	Node2Vec sampling framework
Homogeneous Neighbor Sample	`cugraph.homogeneous_neighbor_sample(G, start_vertices, fanout)`	--	Configurable fan-out per hop
Heterogeneous Neighbor Sample	`cugraph.heterogeneous_neighbor_sample(G, ...)`	--	Multi-type node/edge graphs

Layout

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Force Atlas 2	`cugraph.force_atlas2(G)`	--	`nx.forceatlas2_layout()` (via nx-cugraph)

Tree

Algorithm	Single-GPU	Multi-GPU	NetworkX Equivalent
Minimum Spanning Tree	`cugraph.minimum_spanning_tree(G)`	--	`nx.minimum_spanning_tree()`
Maximum Spanning Tree	`cugraph.maximum_spanning_tree(G)`	--	`nx.maximum_spanning_tree()`

Linear Assignment

Algorithm	Single-GPU	Multi-GPU
Hungarian	`cugraph.hungarian(G, workers, cost)`	--

Utilities

Function	Purpose
`cugraph.symmetrize(src, dst, val)`	Make edges bidirectional (for undirected graphs)
`cugraph.symmetrize_df(df, src, dst)`	Symmetrize a DataFrame
`cugraph.symmetrize_ddf(ddf, src, dst)`	Symmetrize a Dask DataFrame
`cugraph.NumberMap`	Map external vertex IDs to contiguous internal IDs
`G.unrenumber(df, col)`	Map internal vertex IDs back to original

---

Multi-GPU Support with Dask

cuGraph supports multi-GPU computation through Dask for graphs that exceed single-GPU memory or need faster processing.

Setup

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cugraph
import cugraph.dask as dask_cugraph
import dask_cudf

# Initialize multi-GPU cluster
cluster = LocalCUDACluster()
client = Client(cluster)

# Load distributed edge list
ddf = dask_cudf.read_csv("large_graph.csv", names=["src", "dst", "weight"])

# Create distributed graph
G = cugraph.Graph(directed=True)
G.from_dask_cudf_edgelist(ddf, source="src", destination="dst", edge_attr="weight")

# Run multi-GPU algorithms
pr = dask_cugraph.pagerank(G)
components = dask_cugraph.weakly_connected_components(G)

Algorithms with Multi-GPU Support

The following algorithms have Dask-based multi-GPU implementations:

Centrality: Betweenness, Edge Betweenness, Eigenvector, Katz
Community: Louvain, Leiden, ECG, K-Truss, Triangle Counting, EgoNet, Induced Subgraph
Components: Weakly Connected Components
Cores: Core Number, K-Core
Link Analysis: PageRank, HITS
Link Prediction: Overlap, Sorensen
Sampling: Random Walks, Biased Random Walks, Uniform Random Walks, Node2Vec, Neighborhood Sampling
Traversal: BFS, SSSP
Utilities: Renumbering, Symmetrize, Path Extraction, Two-Hop Neighbors, RMAT Generator

---

GNN Support

cugraph-pyg (PyTorch Geometric Integration)

As of release 25.06, cugraph-pyg is the recommended GNN framework integration (cuGraph-DGL has been removed).

cugraph-pyg provides native GPU-accelerated implementations of PyG's core interfaces:

GraphStore: GPU-accelerated graph storage using cuGraph's CSR representation
FeatureStore: GPU-resident feature storage for node/edge features
Sampler/Loader: GPU-accelerated neighborhood sampling with configurable fan-out

uv add --extra-index-url=https://pypi.nvidia.com cugraph-pyg-cu12

Key capabilities:

Heterogeneous graph sampling (multiple node/edge types)
Multi-GPU distributed sampling
Direct integration with PyG's NeighborLoader and training loops
GPU-accelerated centrality, community detection, and other analytics within PyG workflows

Repository: https://github.com/rapidsai/cugraph-gnn

WholeGraph (Distributed GPU Memory for GNNs)

WholeGraph provides distributed GPU memory management for large-scale GNN training through its WholeMemory abstraction.

uv add --extra-index-url=https://pypi.nvidia.com pylibwholegraph-cu12

Core concepts:

WholeMemory: A unified view of GPU memory distributed across multiple GPUs. Each GPU sees the entire memory space through a single abstraction, even though data is physically distributed.
WholeMemory Communicator: Defines the set of GPUs that collaborate, with one process per GPU.
WholeMemory Tensor: Like PyTorch tensors but distributed; supports 1D and 2D data with first dimension partitioned across GPUs.
WholeMemory Embedding: 2D tensor variant with built-in cache policies and sparse optimizers (SGD, Adam, RMSProp, AdaGrad).
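
To make "first dimension partitioned across GPUs" concrete, the sketch below splits row indices into one contiguous range per rank and looks up which rank owns a given row. The near-equal contiguous split is an assumption made for illustration; WholeGraph's actual partitioning scheme may differ, and partition_rows and owner_of are hypothetical helpers.

```python
def partition_rows(num_rows, world_size):
    """Split row indices into contiguous, near-equal half-open ranges, one per rank."""
    base, extra = divmod(num_rows, world_size)
    ranges = []
    start = 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def owner_of(row, ranges):
    """Return the rank whose range contains the given row."""
    for rank, (lo, hi) in enumerate(ranges):
        if lo <= row < hi:
            return rank
    raise IndexError(row)

ranges = partition_rows(num_rows=10, world_size=4)
print(ranges)               # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(owner_of(7, ranges))  # 2
```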

Memory modes:

Mode	Description	Use Case
Continuous	Single continuous address space via hardware peer-to-peer	NVLink systems (DGX)
Chunked	Per-GPU chunks with direct multi-pointer access	Multi-GPU with some NVLink
Distributed	Explicit communication required for remote access	Multi-node clusters

Storage locations: Host memory (pinned) or device/GPU memory.

Graph storage: CSR format with ROW_INDEX and COL_INDEX as WholeMemory Tensors for efficient distributed graph management.

Cache policies: Device-cached host memory, local-cached global memory -- critical for handling graphs larger than GPU memory.

Target hardware: NVLink systems like DGX A100/H100 servers for optimal performance.

cuGraph-DGL (REMOVED)

cuGraph-DGL has been removed as of release 25.06. Users should migrate to cugraph-pyg. The cuGraph team is not planning further work in the DGL ecosystem.

---

Performance Characteristics and Benchmarks

nx-cugraph Benchmarks (NetworkX backend)

Hardware: Intel Xeon w9-3495X (56 cores), NVIDIA RTX 3090 (24GB), 251 GB RAM, CUDA 12.8

Datasets tested:

Dataset	Nodes	Edges	Type
netscience	1,461	5,484	Small
amazon0302	262,111	1,234,877	Medium
cit-Patents	3,774,768	16,518,948	Large
soc-LiveJournal1	4,847,571	68,993,773	Very large

Speedups (GPU vs CPU NetworkX):

Algorithm	Medium Graph	Large Graph	Very Large Graph
`betweenness_centrality` (k=100)	~20x	~520x	~300x
`katz_centrality`	~100x	~5,000x	~24,768x
`average_clustering`	~50x	~1,000x	~2,828x
`transitivity`	~50x	~1,000x	~2,832x
`louvain_communities`	~30x	~273x	~200x
`pagerank`	~2x	~50x	~188x
`eigenvector_centrality`	~7x	~100x	~376x
`k_truss`	~8x	~200x	~540x

Key finding: Speedup increases dramatically with graph size. Small graphs (< 5K edges) may see overhead from GPU initialization that negates speedup. For graphs with > 100K edges, expect 10-500x+ improvement on most algorithms.

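Concrete example: Betweenness centrality on cit-Patents (3.8M nodes, 16.5M edges), measured in a single timed run. Its settings may differ from the k=100 row in the table above, so the two speedup figures are not directly comparable: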

CPU NetworkX: 7 min 41 sec
nx-cugraph GPU: 5.32 sec (~86x speedup)
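
The speedup figure follows directly from the two wall-clock times quoted above:

```python
cpu_seconds = 7 * 60 + 41  # 7 min 41 sec
gpu_seconds = 5.32

speedup = cpu_seconds / gpu_seconds
print(f"{speedup:.1f}x")  # 86.7x
```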

General Performance Guidelines

Small graphs (< 10K edges): GPU overhead may dominate; NetworkX CPU may be faster
Medium graphs (100K-1M edges): 10-100x speedup typical
Large graphs (1M-100M edges): 100-1000x+ speedup typical
Very large graphs (> 100M edges): Use multi-GPU; single GPU memory may be insufficient
First call overhead: Initial GPU kernel compilation and graph transfer adds ~1-3 seconds; subsequent calls on same graph are much faster

---

Memory Management

GPU Memory Considerations

cuGraph stores graphs in CSR format on GPU memory
Memory usage is approximately: (num_edges * 2 * 4 bytes) + (num_vertices * 4 bytes) for unweighted, plus (num_edges * 8 bytes) for weighted (float64 weights)
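Applying this formula to 100M stored edges gives roughly 0.8 GB unweighted or 1.6 GB weighted, plus 4 bytes per vertex. If an undirected graph is stored with each edge in both directions, the stored edge count (and so the edge terms) doubles.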
Algorithm working memory varies; some algorithms (like betweenness centrality) need additional O(V) or O(E) temporary space
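
The estimate above can be turned into a small calculator. The sketch below evaluates the stated formula as written, where num_edges is the number of stored edges. graph_bytes is a hypothetical helper and the vertex count is an illustrative assumption; the result excludes algorithm working memory.

```python
def graph_bytes(num_edges, num_vertices, weighted=False):
    """Estimate from the formula above: 4-byte vertex IDs, 8-byte (float64) weights."""
    total = num_edges * 2 * 4 + num_vertices * 4
    if weighted:
        total += num_edges * 8
    return total

GB = 10**9
edges, vertices = 100_000_000, 5_000_000  # vertex count is an illustrative assumption
print(graph_bytes(edges, vertices) / GB)                 # 0.82
print(graph_bytes(edges, vertices, weighted=True) / GB)  # 1.62
```

Running a check like this before loading the graph gives a quick lower bound to compare against the free memory reported for the GPU.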

Strategies for Large Graphs

1. Use multi-GPU via Dask for graphs exceeding single GPU memory.

2. Use WholeGraph for GNN workloads that need distributed feature/graph storage.

3. Use `rmm` (RAPIDS Memory Manager) for fine-grained GPU memory control:

   import rmm
   rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)  # 1 GB pool

4. Monitor memory with nvidia-smi or RMM's statistics API (rmm.statistics.enable_statistics() followed by rmm.statistics.get_statistics()).

5. Delete intermediate results explicitly: del result; import gc; gc.collect()

---

Interoperability

With cuDF

cuGraph natively consumes and produces cuDF DataFrames. Algorithm results are returned as cuDF DataFrames with vertex/edge columns.

import cudf, cugraph
# Create graph from cuDF
edges = cudf.read_csv("edges.csv")
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Results come back as cuDF DataFrames
pr = cugraph.pagerank(G)  # cuDF DataFrame with 'vertex' and 'pagerank' columns

With cuML

Pipe graph analytics results into cuML for downstream ML:

import cuml
# Use graph embeddings (e.g., from Node2Vec) as features for cuML
# Or use community labels as features for classification
parts, modularity = cugraph.louvain(G)  # parts has 'vertex' and 'partition' columns
# Feed the partition labels in parts into cuML models

With CuPy / SciPy

# cuGraph can work with CuPy and SciPy sparse matrices as input data
import cupy, scipy

With NetworkX

import networkx as nx
import cugraph

# NetworkX -> cuGraph
G_nx = nx.karate_club_graph()
G_cu = cugraph.from_networkx(G_nx)  # Not yet available in all versions

# Or use nx-cugraph backend for transparent acceleration

With PyTorch Geometric

# Via cugraph-pyg (see GNN Support section)
from cugraph_pyg.data import GraphStore
from cugraph_pyg.loader import NeighborLoader

With Pandas

import pandas as pd
df = pd.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 3]})
G = cugraph.Graph()
G.from_pandas_edgelist(df, source="src", destination="dst")

---

Known Limitations vs NetworkX

1. Immutable graphs: Cannot add/remove individual edges after graph creation. Must reconstruct from DataFrame.

2. No node/edge attributes on Graph object: cuGraph stores structure only. Node/edge properties must be maintained separately (e.g., in cuDF DataFrames). The nx-cugraph backend handles attribute mapping transparently.

3. Vertex types: Vertices must be integers (or will be renumbered to integers internally). String vertex IDs are renumbered automatically.

4. Not all NetworkX algorithms supported: Check the nx-cugraph supported algorithms list. Unsupported calls fall back to CPU NetworkX.

5. Numerical precision: GPU floating-point results may differ slightly from CPU results due to parallel reduction ordering.

6. No dynamic graphs: cuGraph is designed for static graph analytics, not streaming/dynamic graph updates.

7. Strongly Connected Components: Single-GPU only (no multi-GPU Dask variant).

8. Spectral Clustering: Single-GPU only.

9. Minimum/Maximum Spanning Tree: Single-GPU only.

10. Force Atlas 2 layout: Single-GPU only.

11. Compatibility doc: The official cuGraph compatibility document with NetworkX is listed as "coming soon" in the 26.02 release.

---

Common Migration Patterns

NetworkX to nx-cugraph (Zero Effort)

# Before (CPU):
import networkx as nx
G = nx.from_pandas_edgelist(df, "src", "dst")
pr = nx.pagerank(G)

# After (GPU, no code changes):
# Just set: NX_CUGRAPH_AUTOCONFIG=True
# Same code runs on GPU automatically
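
One way to apply that setting from inside a script is sketched below, under the assumption that `NX_CUGRAPH_AUTOCONFIG` is read when NetworkX is first imported, so the variable has to be set before the import. The guard around the import is only there so the snippet also runs where NetworkX is not installed; without nx-cugraph the call simply runs on the CPU.

```python
import os

# Assumption: the variable must be set before networkx is imported,
# because backend configuration is read at import time.
os.environ["NX_CUGRAPH_AUTOCONFIG"] = "True"

try:
    import networkx as nx
except ImportError:  # keep the sketch runnable without NetworkX
    nx = None

centrality = None
if nx is not None:
    G = nx.karate_club_graph()
    # Unchanged NetworkX call; it is dispatched to the GPU only if nx-cugraph is installed.
    centrality = nx.betweenness_centrality(G)
```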

NetworkX to Direct cuGraph API

# Before (NetworkX):
import networkx as nx
G = nx.from_pandas_edgelist(df, "src", "dst")
pr = nx.pagerank(G, alpha=0.85)
bc = nx.betweenness_centrality(G, k=100)
communities = nx.community.louvain_communities(G, resolution=1.0)

# After (cuGraph):
import cudf, cugraph
edges = cudf.from_pandas(df)
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
pr = cugraph.pagerank(G, alpha=0.85)
bc = cugraph.betweenness_centrality(G, k=100)  # keep the same sample size as the NetworkX call
parts, modularity = cugraph.louvain(G, resolution=1.0)

Pandas to cuDF + cuGraph Pipeline

# Before:
import pandas as pd
import networkx as nx
df = pd.read_csv("edges.csv")
G = nx.from_pandas_edgelist(df, "source", "target", "weight")
result = nx.pagerank(G)

# After:
import cudf
import cugraph
df = cudf.read_csv("edges.csv")
G = cugraph.Graph()
G.from_cudf_edgelist(df, source="source", destination="target", edge_attr="weight")
result = cugraph.pagerank(G)

Adding Multi-GPU to Existing cuGraph Code

# Before (single-GPU):
import cugraph
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
result = cugraph.pagerank(G)

# After (multi-GPU):
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cugraph, cugraph.dask as dcg
import dask_cudf

cluster = LocalCUDACluster()
client = Client(cluster)

ddf = dask_cudf.from_cudf(edges, npartitions=len(cluster.workers))
G = cugraph.Graph()
G.from_dask_cudf_edgelist(ddf, source="src", destination="dst")
result = dcg.pagerank(G)
result_local = result.compute()  # Collect to single GPU

cuML Reference

cuML is NVIDIA's GPU-accelerated machine learning library within the RAPIDS ecosystem. It provides scikit-learn-compatible APIs for 50+ algorithms and typically runs 10-50x faster than CPU scikit-learn, with some algorithms (HDBSCAN, t-SNE, UMAP, KNN) reaching 60-600x speedups. It follows sklearn's familiar fit/predict/transform pattern.

Full documentation: https://docs.rapids.ai/api/cuml/stable/

1. Installation and Setup
2. Two Usage Modes
3. cuml.accel Accelerator Mode
4. Direct cuML API
5. Algorithm Catalog
6. Input/Output Type Handling
7. Preprocessing
8. Feature Extraction
9. Model Selection and Tuning
10. Forest Inference Library (FIL)
11. Multi-GPU with Dask
12. Model Serialization
13. Memory Management
14. Performance Optimization
15. Interoperability
16. Key Differences from sklearn
17. Common Migration Patterns

---

Installation and Setup

Always use uv add (never pip install or conda install) in all install instructions, docstrings, comments, and error messages.

uv add --extra-index-url=https://pypi.nvidia.com cuml-cu12    # For CUDA 12.x

Platform: Linux and WSL2 only (no native macOS or Windows). Requires: scikit-learn >= 1.4, NVIDIA GPU with CUDA 12.x support.

Verify:

import cuml
print(cuml.__version__)

from cuml.datasets import make_blobs
X, y = make_blobs(n_samples=1000, n_features=10)
print(f"Generated {X.shape[0]} samples on GPU")

---

Two Usage Modes

1. cuml.accel (Zero-Code-Change)

Transparently intercepts sklearn, umap-learn, and hdbscan calls and routes them to GPU. Falls back to CPU for unsupported operations. Best for: quick acceleration of existing sklearn code, mixed codebases, prototyping.

2. Direct cuML API

Replace from sklearn with from cuml. Maximum performance, explicit control over GPU execution. Best for: production pipelines, maximum performance, new GPU-first code.
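
A codebase that has to run on machines with and without a GPU can combine the direct-API style with an import fallback. The sketch below uses a hypothetical helper, `first_available`, that returns the first importable module from a preference list. The module names in the usage lines come from this section, and the stdlib demonstration at the end is only there so the snippet runs anywhere.

```python
import importlib


def first_available(module_names):
    """Return (name, module) for the first importable module, or (None, None)."""
    for name in module_names:
        try:
            return name, importlib.import_module(name)
        except Exception:
            # ImportError is the usual case; a broken GPU install may raise others.
            continue
    return None, None


# Prefer the GPU implementation, fall back to the CPU one.
backend_name, cluster = first_available(["cuml.cluster", "sklearn.cluster"])
if cluster is not None:
    KMeans = cluster.KMeans  # same class name in both libraries

# Stdlib-only demonstration of the selection logic.
demo_name, demo_module = first_available(["no_such_module_xyz", "json"])
```

Logging `backend_name` at startup makes it obvious which implementation a given run actually used.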

---

cuml.accel Accelerator Mode

The fastest path from sklearn to GPU — no code changes required. Similar to cudf.pandas for pandas.

Activation

# Jupyter/IPython (MUST be the first cell, before any sklearn import)
%load_ext cuml.accel

import sklearn  # Now GPU-accelerated
from sklearn.cluster import KMeans  # Runs on GPU transparently

# Command line
python -m cuml.accel script.py
python -m cuml.accel -v script.py     # With info logging
python -m cuml.accel -vv script.py    # With debug logging

# Programmatic (call BEFORE importing sklearn)
import cuml
cuml.accel.install()

from sklearn.cluster import KMeans  # Now GPU-accelerated

# Environment variable
CUML_ACCEL_ENABLED=1 python script.py

How It Works

Intercepts sklearn/umap-learn/hdbscan imports and replaces estimators with GPU versions.
If an operation isn't supported on GPU, it silently falls back to CPU sklearn.
Uses managed memory by default — host RAM augments GPU VRAM.
Models pickled under cuml.accel load as standard sklearn objects in non-GPU environments.
Accelerates 30+ algorithms across sklearn, umap-learn, and hdbscan.
Compatible with scikit-learn versions 1.4-1.7.

Known Fallback Triggers (Runs on CPU Instead)

Sparse input data (most algorithms)
Callable parameters (e.g., callable init for KMeans)
Certain parameter values: n_components="mle" for PCA, positive=True for linear models, warm starts
Unsupported distance metrics for neighbors algorithms
Multi-output targets for Random Forest
String/object dtypes — must pre-encode with LabelEncoder first

Numerical Precision

GPU results are numerically equivalent but may differ at floating-point precision level due to parallel reduction order. Compare model quality via scores (accuracy, R2, etc.), not raw coefficient values.
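
The effect comes from floating-point addition not being associative. The CPU-only sketch below compares a left-to-right sum with a pairwise (tree-shaped) sum, which is a rough stand-in for how a parallel reduction groups terms; it is not how any particular GPU kernel is implemented. `tree_sum` is a hypothetical helper.

```python
import math


def tree_sum(values):
    """Sum by repeatedly adding neighbouring pairs, like a parallel reduction tree."""
    values = list(values)
    if not values:
        return 0.0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:  # carry the unpaired last element forward
            paired.append(values[-1])
        values = paired
    return values[0]


# Grouping alone changes the last bits of the result.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
grouping_matters = left_grouped != right_grouped

data = [0.1 * (i % 7) for i in range(10_000)]

# Explicit left-to-right accumulation, one element at a time.
sequential = 0.0
for x in data:
    sequential += x

pairwise = tree_sum(data)

# Compare with a tolerance, never with ==.
agree = math.isclose(sequential, pairwise, rel_tol=1e-9)
```

The same reasoning applies to fitted coefficients: check them with a tolerance, or better, compare downstream scores.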

---

Direct cuML API

Replace sklearn imports with cuml imports. The API is identical — fit/predict/transform.

from cuml.cluster import DBSCAN
from cuml.datasets import make_blobs

# Create data directly on GPU
X, y = make_blobs(n_samples=100_000, centers=5, n_features=10, random_state=42)

# Fit — runs on GPU
model = DBSCAN(eps=1.0, min_samples=5)
model.fit(X)
print(model.labels_)

from cuml import LinearRegression
from cuml.datasets import make_regression
from cuml.model_selection import train_test_split

X, y = make_regression(n_samples=100_000, n_features=50, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = model.score(X_test, y_test)
print(f"R2 score: {score:.4f}")

---

Algorithm Catalog

Clustering

cuML	sklearn Equivalent	Multi-GPU
`cuml.KMeans`	`sklearn.cluster.KMeans`	Yes
`cuml.DBSCAN`	`sklearn.cluster.DBSCAN`	Yes
`cuml.AgglomerativeClustering`	`sklearn.cluster.AgglomerativeClustering`	No
`cuml.cluster.hdbscan.HDBSCAN`	`hdbscan.HDBSCAN`	No
`cuml.cluster.SpectralClustering`	`sklearn.cluster.SpectralClustering`	No

Regression

cuML	sklearn Equivalent	Multi-GPU
`cuml.LinearRegression`	`sklearn.linear_model.LinearRegression`	Yes
`cuml.Ridge`	`sklearn.linear_model.Ridge`	Yes
`cuml.Lasso`	`sklearn.linear_model.Lasso`	Yes
`cuml.ElasticNet`	`sklearn.linear_model.ElasticNet`	Yes
`cuml.SVR`	`sklearn.svm.SVR`	No
`cuml.KernelRidge`	`sklearn.kernel_ridge.KernelRidge`	No
`cuml.ensemble.RandomForestRegressor`	`sklearn.ensemble.RandomForestRegressor`	Yes
`cuml.MBSGDRegressor`	`sklearn.linear_model.SGDRegressor`	No

Classification

cuML	sklearn Equivalent	Multi-GPU
`cuml.LogisticRegression`	`sklearn.linear_model.LogisticRegression`	No
`cuml.ensemble.RandomForestClassifier`	`sklearn.ensemble.RandomForestClassifier`	Yes
`cuml.svm.SVC`	`sklearn.svm.SVC`	No
`cuml.svm.LinearSVC`	`sklearn.svm.LinearSVC`	No
`cuml.naive_bayes.GaussianNB`	`sklearn.naive_bayes.GaussianNB`	No
`cuml.naive_bayes.MultinomialNB`	`sklearn.naive_bayes.MultinomialNB`	Yes
`cuml.naive_bayes.BernoulliNB`	`sklearn.naive_bayes.BernoulliNB`	No
`cuml.naive_bayes.CategoricalNB`	`sklearn.naive_bayes.CategoricalNB`	No
`cuml.naive_bayes.ComplementNB`	`sklearn.naive_bayes.ComplementNB`	No
`cuml.neighbors.KNeighborsClassifier`	`sklearn.neighbors.KNeighborsClassifier`	Yes
`cuml.neighbors.KNeighborsRegressor`	`sklearn.neighbors.KNeighborsRegressor`	Yes
`cuml.MBSGDClassifier`	`sklearn.linear_model.SGDClassifier`	No
`cuml.multiclass.OneVsOneClassifier`	`sklearn.multiclass.OneVsOneClassifier`	No
`cuml.multiclass.OneVsRestClassifier`	`sklearn.multiclass.OneVsRestClassifier`	No

Dimensionality Reduction and Manifold Learning

cuML	sklearn/Library Equivalent	Multi-GPU
`cuml.PCA`	`sklearn.decomposition.PCA`	Yes
`cuml.IncrementalPCA`	`sklearn.decomposition.IncrementalPCA`	No
`cuml.TruncatedSVD`	`sklearn.decomposition.TruncatedSVD`	Yes
`cuml.UMAP`	`umap.UMAP`	Yes (inference)
`cuml.TSNE`	`sklearn.manifold.TSNE`	No
`cuml.random_projection.GaussianRandomProjection`	`sklearn.random_projection.GaussianRandomProjection`	No
`cuml.random_projection.SparseRandomProjection`	`sklearn.random_projection.SparseRandomProjection`	No

Nearest Neighbors

cuML	sklearn Equivalent	Multi-GPU
`cuml.neighbors.NearestNeighbors`	`sklearn.neighbors.NearestNeighbors`	Yes
`cuml.neighbors.KNeighborsClassifier`	`sklearn.neighbors.KNeighborsClassifier`	Yes
`cuml.neighbors.KNeighborsRegressor`	`sklearn.neighbors.KNeighborsRegressor`	Yes
`cuml.neighbors.KernelDensity`	`sklearn.neighbors.KernelDensity`	No

Time Series

cuML	Description
`cuml.ExponentialSmoothing`	Holt-Winters exponential smoothing
`cuml.tsa.ARIMA`	ARIMA/SARIMA models (batched — fits multiple series simultaneously)
`cuml.tsa.auto_arima.AutoARIMA`	Automatic ARIMA order selection

Metrics (GPU-Accelerated)

Regression: r2_score, mean_squared_error, mean_absolute_error, mean_squared_log_error, median_absolute_error

Classification: accuracy_score, log_loss, roc_auc_score, precision_recall_curve, confusion_matrix

Clustering: adjusted_rand_score, silhouette_score, silhouette_samples, homogeneity_score, completeness_score, v_measure_score, mutual_info_score

Other: trustworthiness, pairwise_distances, pairwise_kernels

Model Explainability

cuML	Description
`cuml.explainer.KernelExplainer`	SHAP Kernel Explainer
`cuml.explainer.PermutationExplainer`	SHAP Permutation Explainer
`cuml.explainer.TreeExplainer`	SHAP Tree Explainer

---

Input/Output Type Handling

Supported Input Types

cuML accepts: NumPy arrays, CuPy arrays, cuDF DataFrames/Series, pandas DataFrames/Series, Numba device arrays, PyTorch tensors (via __cuda_array_interface__).

NumPy and pandas inputs are automatically transferred to GPU. For best performance, pass CuPy arrays or cuDF DataFrames to avoid transfers.

Controlling Output Type

import cuml

# Global setting
cuml.set_global_output_type('cupy')  # Options: 'input', 'cupy', 'numpy', 'cudf', 'pandas'

# Context manager
with cuml.using_output_type('cudf'):
    result = model.predict(X)  # Returns cudf Series

# Per-estimator
model = cuml.KMeans(output_type='cupy')

Performance ranking (fastest to slowest output type): 1. cupy — no host transfers, most efficient 2. cudf — slight overhead for some shapes 3. numpy / pandas — device-to-host transfer cost

Best practice: Use cupy or cudf for intermediate results. Only convert to numpy/pandas at the end for visualization or export.

---

Preprocessing

cuML provides GPU-accelerated versions of all common sklearn preprocessors.

Scalers and Transformers

from cuml.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from cuml.preprocessing import Normalizer, PowerTransformer, QuantileTransformer
from cuml.preprocessing import Binarizer, PolynomialFeatures, KBinsDiscretizer

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Encoders

from cuml.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, TargetEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)

ohe = OneHotEncoder(sparse_output=False)
X_encoded = ohe.fit_transform(X_categorical)

Imputers

from cuml.preprocessing import SimpleImputer, MissingIndicator

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

Pipeline and Composition

from cuml.compose import ColumnTransformer, make_column_transformer
from cuml.preprocessing import StandardScaler, OneHotEncoder

preprocessor = make_column_transformer(
    (StandardScaler(), ['age', 'income']),
    (OneHotEncoder(), ['category', 'region']),
)
X_processed = preprocessor.fit_transform(df)

Preprocessing Functions

scale(), minmax_scale(), maxabs_scale(), robust_scale(), normalize(), binarize(), add_dummy_feature(), label_binarize()

---

Feature Extraction

from cuml.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer

tfidf = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf.fit_transform(corpus)

---

Model Selection and Tuning

Train/Test Split

from cuml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Cross-Validation

from cuml.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # ...

Hyperparameter Tuning

For GPU-efficient hyperparameter search, use dask-ml's GridSearchCV/RandomizedSearchCV rather than sklearn's — sklearn's version causes excessive CPU-GPU data transfers per fold.

from dask_ml.model_selection import RandomizedSearchCV
from cuml.ensemble import RandomForestClassifier

param_distributions = {
    'max_depth': [8, 12, 16, 20],
    'n_estimators': [100, 200, 500],
    'max_features': [0.5, 0.75, 1.0],
}

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=25,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(f"Best score: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")

Dataset Generators

from cuml.datasets import make_blobs, make_classification, make_regression

X, y = make_blobs(n_samples=100_000, centers=5, n_features=20, random_state=42)
X, y = make_classification(n_samples=100_000, n_features=50, n_informative=25)
X, y = make_regression(n_samples=100_000, n_features=50, noise=0.1)

---

Forest Inference Library

FIL provides high-performance GPU inference for tree-based models trained in any framework — 80x+ faster than sklearn inference.

from cuml.fil import ForestInference

# Load from XGBoost, LightGBM, or sklearn saved models
fil_model = ForestInference.load("xgboost_model.ubj", is_classifier=True)

# Optional: optimize for specific batch size
fil_model.optimize()

# Predict (80x+ faster than sklearn)
predictions = fil_model.predict(X_test)
probas = fil_model.predict_proba(X_test)

Supports: XGBoost, LightGBM, sklearn Random Forests, any Treelite-compatible model.

This is especially valuable when you have a model already trained on CPU and want to speed up inference without retraining.

---

Multi-GPU with Dask

For datasets too large for a single GPU or when you want to use multiple GPUs.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU
cluster = LocalCUDACluster(
    rmm_pool_size="12GB",
    enable_cudf_spill=True,
)
client = Client(cluster)

# Create distributed data
from cuml.dask.datasets import make_blobs
X, y = make_blobs(
    n_samples=1_000_000,
    n_features=20,
    centers=5,
    n_parts=len(client.scheduler_info()['workers']) * 2,  # 2 partitions per worker
)

# Use Dask estimator
from cuml.dask.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
labels = kmeans.predict(X)

# Convert to single-GPU model for serialization
single_model = kmeans.get_combined_model()

client.close()
cluster.close()

Available Multi-GPU Estimators (`cuml.dask`)

Clustering: KMeans, DBSCAN
Linear models: LinearRegression, Ridge, Lasso, ElasticNet
Ensemble: RandomForestClassifier, RandomForestRegressor
Decomposition: PCA, TruncatedSVD
Manifold: UMAP (inference only)
Neighbors: NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor
Naive Bayes: MultinomialNB
Preprocessing: LabelEncoder, LabelBinarizer, OneHotEncoder

---

Model Serialization

import pickle

# Save cuML model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=5)

# Load cuML model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

Models trained under cuml.accel can be pickled and loaded as standard sklearn objects in non-GPU environments.
Dask distributed models must be converted first: single_model = dask_model.get_combined_model().
joblib also works for serialization.

---

Memory Management

RMM (RAPIDS Memory Manager)

import rmm

# Pre-allocate a memory pool for faster allocation
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**32)  # 4 GiB pool (2**32 bytes)

Aligning with cuDF and CuPy

When using cuML alongside cuDF and CuPy, align all libraries on the same RMM allocator:

import rmm
from rmm.allocators.cupy import rmm_cupy_allocator
import cupy
cupy.cuda.set_allocator(rmm_cupy_allocator)

cuml.accel Memory

cuml.accel uses managed memory by default (host RAM augments GPU VRAM). Disable with --disable-uvm flag if experiencing slowdowns. Managed memory does NOT work on WSL2 or when RMM is externally configured.

Best Practices

Use float32 instead of float64 when precision allows: it halves memory use and, depending on the GPU, at least doubles arithmetic throughput.
Keep data on GPU throughout the pipeline — avoid NumPy/pandas round-trips.
For datasets larger than GPU memory: use Dask multi-GPU or chunk processing.
Pre-allocate RMM pools to avoid fragmentation.
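
The first and third practices reduce to simple arithmetic. The sketch below estimates the footprint of a dense array and how many row chunks are needed to stay under a memory budget. The function names, the 16 GiB budget, and the 50% headroom factor are illustrative assumptions, not cuML recommendations; real usage also needs room for the model and temporary buffers.

```python
import math

BYTES_PER_DTYPE = {"float32": 4, "float64": 8}


def dense_nbytes(n_rows, n_cols, dtype="float32"):
    """Bytes needed to hold a dense n_rows x n_cols array of the given dtype."""
    return n_rows * n_cols * BYTES_PER_DTYPE[dtype]


def rows_per_chunk(n_cols, budget_bytes, dtype="float32", headroom=0.5):
    """Largest row count whose dense array fits in headroom * budget_bytes."""
    usable = int(budget_bytes * headroom)
    return max(1, usable // (n_cols * BYTES_PER_DTYPE[dtype]))


n_rows, n_cols = 100_000_000, 100
budget = 16 * 1024**3  # hypothetical 16 GiB device

size_f64 = dense_nbytes(n_rows, n_cols, "float64")
size_f32 = dense_nbytes(n_rows, n_cols, "float32")

chunk_rows = rows_per_chunk(n_cols, budget, "float32")
n_chunks = math.ceil(n_rows / chunk_rows)
```

With these numbers the float32 array alone exceeds the budget, so the data would have to be processed in several chunks or spread across GPUs with Dask.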

---

Performance Optimization

Expected Speedups by Algorithm

Category	Typical Speedup	Notes
HDBSCAN, t-SNE, UMAP	60-300x	Complex algorithms benefit most
KNN	Up to 600x	Scales dramatically with data size
KMeans, Random Forest	15-80x	RF: 20-45x single GPU
FIL inference	80x+	Tree model inference from any framework
Linear models, PCA, Ridge	2-10x	Simpler algorithms, lower but consistent gains

Key Optimization Tips

1. Use float32. GPU float32 throughput is 2x-32x higher than float64. Most ML algorithms don't need double precision.

2. Keep data on GPU. Pass CuPy arrays or cuDF DataFrames. Every NumPy/pandas conversion triggers a device-host transfer.

3. Larger datasets = larger speedup. GPU parallelism advantage grows with data size. Minimum ~10K rows to see benefit.

4. Wide data benefits more. 128-512 features see higher speedups than 8-16 features.

5. First call has JIT overhead. Benchmark on subsequent calls, not the first.

6. Use RMM pools. Pre-allocated memory pools are 1000x faster than raw cudaMalloc.

7. Use dask-ml for hyperparameter tuning, not sklearn's GridSearchCV — it avoids excessive CPU-GPU transfers.

8. Use FIL for tree model inference. Even if the model was trained on CPU (XGBoost, LightGBM, sklearn RF), FIL gives 80x+ inference speedup.
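
Tip 5 affects how any of these numbers should be measured. A minimal timing harness, assuming nothing beyond the standard library, is sketched below: it runs untimed warm-up calls first and reports the median of the timed runs. `benchmark` and `workload` are hypothetical names. For real GPU code the timed callable also needs to block until the device has finished (for example by synchronizing the CUDA stream), otherwise only the kernel launch is measured.

```python
import statistics
import time


def benchmark(fn, *args, warmup=1, repeats=5, **kwargs):
    """Return the median wall-clock seconds of fn(*args, **kwargs) after warm-up."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # absorbs one-time costs such as JIT compilation
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def workload(n):
    """CPU stand-in for a model.fit or model.predict call."""
    return sum(i * i for i in range(n))


median_seconds = benchmark(workload, 50_000, warmup=2, repeats=5)
```

The median is used rather than the mean so that a single slow run, such as one hit by a memory pool growth, does not dominate the reported figure.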

---

Interoperability

cuDF: Zero-copy input. cuDF DataFrames accepted directly by all estimators.
CuPy: Zero-copy via __cuda_array_interface__. Most efficient intermediate format.
NumPy/pandas: Accepted as input (auto-transferred to GPU). Output type configurable.
PyTorch: Tensors accepted via array interface.
sklearn: API-compatible. Models interconvertible. cuml.accel for transparent acceleration.
XGBoost/LightGBM: FIL provides GPU inference for externally-trained tree models.
Dask: Native distributed support via cuml.dask module.

End-to-End RAPIDS Pipeline

import cudf
import cuml
from cuml.preprocessing import StandardScaler
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split

# Load data on GPU
df = cudf.read_parquet("data.parquet")
X = df.drop("target", axis=1)
y = df["target"]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Preprocess
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = RandomForestClassifier(n_estimators=100, max_depth=16)
model.fit(X_train, y_train)

# Evaluate
score = model.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")

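After the Parquet file is read into GPU memory, splitting, scaling, training, and scoring all run on the GPU with no intermediate transfers back to the host; only the final scalar score is returned to the CPU.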

---

Key Differences from sklearn

1. Platform: Linux and WSL2 only. No native macOS or Windows.

2. Sparse data: Most cuML algorithms do not support sparse matrices. Under cuml.accel, sparse inputs fall back to CPU.

3. String data: Must be pre-encoded to numeric. No native string column support in estimators.

4. Multi-output: Not supported for Random Forest.

5. Warm starts: Not supported for most algorithms.

6. Some sklearn parameters ignored: n_jobs (GPU handles parallelism), positive=True, specific solver choices.

7. Numerical precision: Results equivalent in quality but may differ at floating-point level. Compare scores, not raw coefficients.

8. Memory: Limited by GPU VRAM (typically 8-80 GB). Use managed memory or Dask for larger datasets.

9. Missing fitted attributes: Some sklearn attributes not computed under cuml.accel (e.g., HDBSCAN exemplars_, LinearRegression rank_).

---

Common Migration Patterns

Pattern 1: Zero-Effort (cuml.accel)

# Add one line at top of notebook:
%load_ext cuml.accel

from sklearn.cluster import KMeans  # Now GPU-accelerated
from sklearn.decomposition import PCA  # Now GPU-accelerated
# Everything else stays exactly the same

Pattern 2: Direct Import Swap

# Before
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# After
from cuml.ensemble import RandomForestClassifier
from cuml.preprocessing import StandardScaler
from cuml.model_selection import train_test_split

Pattern 3: Full RAPIDS Pipeline (cuDF + cuML)

import cudf
from cuml.preprocessing import StandardScaler, LabelEncoder
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split

# Load and preprocess entirely on GPU
df = cudf.read_parquet("data.parquet")
le = LabelEncoder()
df["category_encoded"] = le.fit_transform(df["category"])

X = df[["feature1", "feature2", "category_encoded"]].to_cupy()
y = df["target"].to_cupy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=200, max_depth=16)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Pattern 4: GPU Inference for CPU-Trained Models

from cuml.fil import ForestInference

# Load XGBoost/LightGBM/sklearn model for 80x+ faster inference
fil_model = ForestInference.load("my_xgboost_model.ubj", is_classifier=True)
predictions = fil_model.predict(X_test)

How it compares

Use optimize-for-gpu for library-specific CUDA migrations in scientific Python rather than generic performance tips that do not map code to NVIDIA stacks.

FAQ

Which NVIDIA libraries does optimize-for-gpu cover?

optimize-for-gpu covers 12 NVIDIA libraries: CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS, cuSpatial, and RAFT for replacing CPU-bound NumPy, pandas, scikit-learn, and graph workloads.

When should optimize-for-gpu trigger in an agent session?

optimize-for-gpu triggers when users mention GPU, CUDA, or NVIDIA acceleration, or when CPU-bound Python code shows slow loops, large arrays, ML pipelines, graph analytics, or image processing that maps to GPU libraries.

Is Optimize For Gpu safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
