
Modal Serverless GPU
Run multi-GPU and DeepSpeed training jobs on Modal serverless GPUs without managing your own cluster.
Overview
Modal Serverless GPU is an agent skill most often used in Build (also Operate) that documents Modal patterns for multi-GPU and DeepSpeed training on serverless GPU functions.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill modal-serverless-gpu
What is this skill?
- Single-node multi-GPU with Hugging Face Accelerate on Modal (e.g. H100:4)
- DeepSpeed integration via TrainingArguments and ds_config.json on A100:8
- PyTorch Lightning ddp_spawn / subprocess patterns for entrypoint re-exec
- Modal App, Image, and timeout configuration for long-running train jobs
- Practical GPU count and batch/gradient accumulation tuning snippets
- H100:4 and A100:8 GPU allocation examples
- 7200s and 14400s timeout samples in snippets
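As a rough illustration of the GPU count and batch/gradient accumulation tuning mentioned above, the sketch below splits a Modal-style GPU string and computes how many samples feed each optimizer step under data parallelism. The helper names `parse_gpu_spec` and `effective_batch_size` are hypothetical and not part of Modal or the skill; the numbers come from the A100:8 DeepSpeed sample.

```python
def parse_gpu_spec(spec: str) -> tuple[str, int]:
    """Split a Modal-style GPU string such as "H100:4" into (type, count).

    Hypothetical helper for illustration; a spec without ":N" means one GPU.
    """
    gpu_type, _, count = spec.partition(":")
    return gpu_type, int(count) if count else 1


def effective_batch_size(spec: str, per_device_batch: int, grad_accum_steps: int) -> int:
    """Samples contributing to each optimizer step under data parallelism."""
    _, num_gpus = parse_gpu_spec(spec)
    return num_gpus * per_device_batch * grad_accum_steps


# Values from the A100:8 DeepSpeed sample: batch size 4, accumulation 4
print(effective_batch_size("A100:8", per_device_batch=4, grad_accum_steps=4))  # 128
```

Halving the GPU count while doubling the accumulation steps keeps the effective batch size the same, which is one way to trade cost against wall-clock time.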
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars.
What problem does it solve?
You need multi-GPU training for a model experiment but renting and operating a fixed cluster is too slow or expensive for a solo builder.
Who is it for?
Builders fine-tuning or training transformer-class models with Accelerate or DeepSpeed, where short bursts of GPU time are a better fit than persistent infrastructure.
Skip if: CPU-only ETL, frontend apps with no training step, or teams that require on-prem exclusive data residency without a Modal-approved path.
When should I use this skill?
Implementing or extending GPU training jobs on Modal including multi-GPU, DeepSpeed, or Lightning subprocess patterns.
What do I get? / Deliverables
You get copy-ready Modal function definitions with GPU types, images, and training loop integration so jobs run on Modal without hand-rolling cluster orchestration.
- Modal App function stubs
- Image pip_install definitions
- Multi-GPU and DeepSpeed training configurations
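The DeepSpeed sample points `TrainingArguments` at a `ds_config.json` file, but this page does not show the file's contents. The sketch below generates a minimal one under stated assumptions: ZeRO stage 2 is an illustrative choice, and the `"auto"` values rely on the Hugging Face Trainer integration filling them in from `TrainingArguments`. The helper `make_ds_config` is hypothetical.

```python
import json


def make_ds_config(zero_stage: int = 2) -> dict:
    """Return a minimal DeepSpeed config for use with the Hugging Face Trainer.

    Assumption: "auto" lets the Trainer copy batch size, accumulation steps,
    and fp16 settings from TrainingArguments so the two never disagree.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
        "fp16": {"enabled": "auto"},
        "zero_optimization": {"stage": zero_stage},
    }


config_text = json.dumps(make_ds_config(), indent=2)
print(config_text)  # write this text to ds_config.json inside the container
```

Keeping the batch fields on `"auto"` avoids the mismatch errors that can occur when the JSON file and `TrainingArguments` specify different values.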
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Build is canonical because the guide centers on implementing training functions, images, and GPU allocations as product/backend ML infrastructure. Backend fits serverless training endpoints, batch jobs, and model fine-tune pipelines rather than frontend or docs-only work.
Where it fits
Define a Modal function with H100:4 and Accelerate to fine-tune your agent’s ranker model.
Run a short DeepSpeed job on A100:8 to see if a base model meets quality bar before full product integration.
Schedule recurring retraining with Modal timeouts and image pins without maintaining a GPU cluster.
How it compares
The alternative is infrastructure-as-code for a fixed cloud VM fleet; this skill instead targets ephemeral Modal functions and per-job GPU sizing.
Common Questions / FAQ
Who is modal-serverless-gpu for?
Indie ML builders and agent developers who train or fine-tune models and want Modal’s serverless GPU model instead of managing nodes.
When should I use modal-serverless-gpu?
During Build when wiring training jobs into your backend, during Validate when prototyping model quality on real GPUs, or during Operate when you rerun scheduled fine-tunes on Modal.
Is modal-serverless-gpu safe to install?
It consists of documentation-style snippets from a research skills repo; check this page’s Security Audits panel and review how Modal credentials are handled in your own project.
SKILL.md
# Modal Advanced Usage Guide

## Multi-GPU Training

### Single-node multi-GPU

```python
import modal

app = modal.App("multi-gpu-training")
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.function(gpu="H100:4", image=image, timeout=7200)
def train_multi_gpu():
    from accelerate import Accelerator

    accelerator = Accelerator()
    # model, optimizer, and dataloader must be constructed before this call (omitted here)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()  # reset gradients so they do not accumulate across steps
```

### DeepSpeed integration

```python
image = modal.Image.debian_slim().pip_install(
    "torch", "transformers", "deepspeed", "accelerate"
)

@app.function(gpu="A100:8", image=image, timeout=14400)
def deepspeed_train(config: dict):
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="/outputs",
        deepspeed="ds_config.json",
        fp16=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4
    )

    # model and dataset are placeholders; load them before building the Trainer
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
```

### Multi-GPU considerations

For frameworks that re-execute the Python entrypoint (like PyTorch Lightning), use:

- `ddp_spawn` or `ddp_notebook` strategy
- Run training as a subprocess to avoid issues

```python
@app.function(gpu="H100:4")
def train_with_subprocess():
    import subprocess

    # check=True raises if the training process exits with a non-zero status
    subprocess.run(["python", "-m", "torch.distributed.launch", "train.py"], check=True)
```

## Advanced Container Configuration

### Multi-stage builds for caching

```python
# Stage 1: Base dependencies (cached)
base_image = modal.Image.debian_slim().pip_install("torch", "numpy", "scipy")

# Stage 2: ML libraries (cached separately)
ml_image = base_image.pip_install("transformers", "datasets", "accelerate")

# Stage 3: Custom code (rebuilt on changes)
final_image = ml_image.copy_local_dir("./src", "/app/src")
```

### Custom Dockerfiles

```python
image = modal.Image.from_dockerfile("./Dockerfile")
```

### Installing from Git

```python
image = modal.Image.debian_slim().pip_install(
    "git+https://github.com/huggingface/transformers.git@main"
)
```

### Using uv for faster installs

```python
image = modal.Image.debian_slim().uv_pip_install(
    "torch", "transformers", "accelerate"
)
```

## Advanced Class Patterns

### Lifecycle hooks

```python
@app.cls(gpu="A10G")
class InferenceService:
    @modal.enter()
    def startup(self):
        """Called once when container starts"""
        self.model = load_model()
        self.tokenizer = load_tokenizer()

    @modal.exit()
    def shutdown(self):
        """Called when container shuts down"""
        cleanup_resources()

    @modal.method()
    def predict(self, text: str):
        return self.model(self.tokenizer(text))
```

### Concurrent request handling

```python
@app.cls(
    gpu="A100",
    allow_concurrent_inputs=20,  # Handle 20 requests per container
    container_idle_timeout=300
)
class BatchInference:
    @modal.enter()
    def load(self):
        self.model = load_model()

    @modal.method()
    def predict(self, inputs: list):
        return self.model.batch_predict(inputs)
```

### Input concurrency vs batching

- **Input concurrency**: Multiple requests processed simultaneously (async I/O)
- **Dynamic batching**: Requests accumulated and processed together (GPU efficiency)

```python
import aiohttp

# Input concurrency - good for I/O-bound
@app.function(allow_concurrent_inputs=10)
async def fetch_data(url: str):
    async with aiohttp.ClientSession() as session:
        # Read the body while the session is still open
        async with session.get(url) as response:
            return await response.text()

# Dynamic batching - good for GPU inference
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_embed(texts: list[str]) -> list[list[float]]:
    return model.encode(texts)
```

## Advanced Volumes

### Volume operations

```python
volume = modal.Volume.from_name("my-volume", create_if_missin