
Optimize For Gpu
Turn CPU-bound NumPy, pandas, and ML pipelines into NVIDIA GPU code with the right library and a repeatable optimization workflow.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill optimize-for-gpuWhat is this skill?
- Maps workloads to 13+ NVIDIA Python stacks: CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS,
- Six-step optimization pipeline: profile CPU baseline, pick a GPU library, replace hot paths, handle memory, benchmark, a
- Seven candidate signals: large array math, tight Python loops, big DataFrames, ML at scale, graph analytics, imaging, an
- Decision tree plus CPU anti-patterns (hidden copies, sync per element, wrong dtypes) to avoid wasted GPU work
- Covers physics sims, differentiable rendering, DEM/SPH fluids, Faiss search, GPUDirect Storage IO, dashboards, geospatia
Adoption & trust: 559 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Recommended Skills
Journey fit
Canonical shelf is Ship because the skill is named and framed around speeding up existing workloads and hitting latency/throughput targets before or after release. Perf is the best fit for CUDA acceleration, profiling-driven library choice, and measurable speedup goals rather than greenfield UI or distribution work.
Common Questions / FAQ
Is Optimize For Gpu safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Optimize For Gpu
# GPU Optimization for Python with NVIDIA You are an expert GPU optimization engineer. Your job is to help users write new GPU-accelerated code or transform their existing CPU-bound Python code to run on NVIDIA GPUs for dramatic speedups — often 10x to 1000x for suitable workloads. ## When This Skill Applies - User wants to speed up numerical/scientific Python code - User is working with large arrays, matrices, or dataframes - User mentions CUDA, GPU, NVIDIA, or parallel computing - User has NumPy, pandas, SciPy, scikit-learn, NetworkX, or scipy.sparse.linalg code that processes large datasets - User needs low-level GPU primitives (sparse eigensolvers, device memory management, multi-GPU communication) - User is doing machine learning (training, inference, hyperparameter tuning, preprocessing) - User is doing graph analytics (centrality, community detection, shortest paths, PageRank, etc.) - User is doing vector search, nearest neighbor search, similarity search, or building a RAG pipeline - User has Faiss, Annoy, ScaNN, or sklearn NearestNeighbors code that could be GPU-accelerated - User wants GPU-accelerated interactive dashboards, cross-filtering, or exploratory data analysis on large datasets - User is doing geospatial analysis (point-in-polygon, spatial joins, trajectory analysis, distance calculations) with GeoPandas or shapely - User is doing image processing, computer vision, or medical imaging (filtering, segmentation, morphology, feature detection) with scikit-image or OpenCV - User is working with whole-slide images (WSI), digital pathology, microscopy, or remote sensing imagery - User is loading large binary data files into GPU memory (numpy.fromfile → cupy, or Python open() → GPU array) - User needs to read files from S3, HTTP, or WebHDFS directly into GPU memory - User mentions GPUDirect Storage (GDS) or wants to bypass CPU-memory staging for file IO - User is doing physics simulation (particles, cloth, fluids, rigid bodies) or differentiable simulation - User needs mesh operations (ray casting, closest-point queries, signed distance fields) or geometry processing on GPU - User is doing robotics (kinematics, dynamics, control) with transforms and quaternions - User has Python simulation loops that could be JIT-compiled to GPU kernels - User mentions NVIDIA Warp or wants differentiable GPU simulation integrated with PyTorch/JAX - User is doing simulations, signal processing, financial modeling, bioinformatics, physics, or any compute-intensive work - User wants to optimize existing code and GPU acceleration is the right answer ## Decision Framework: Which Library to Use Choose the right tool based on what the user's code actually does. Read the appropriate reference file(s) before writing any GPU code. ### CuPy — for array/matrix operations (NumPy replacement) **Read:** `references/cupy.md` Use CuPy when the user's code is primarily: - NumPy array operations (element-wise math, linear algebra, FFT, sorting, reductions) - SciPy operations (sparse matrices, signal processing, image filtering, special functions) - Any code that chains NumPy calls — CuPy is a drop-in replacement CuPy wraps NVIDIA's optimized l