
Sglang
Deploy SGLang inference servers on GPU hosts with tensor parallelism, quantization, and Docker for production LLM APIs.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill sglangWhat is this skill?
- Documents `sglang.launch_server` with host, port, and `--mem-fraction-static` tuning
- Tensor parallelism recipes for 70B-class models across 2–4 GPUs
- Quantization paths: FP8, AWQ, and GPTQ with matching `--tp` flags
- CUDA 12.1 Docker image with `sglang[all]` and FlashInfer wheel install
- Expose port 30000 with OpenAI-compatible serving defaults
Adoption & trust: 1 installs on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Recommended Skills
Azure Deploymicrosoft/azure-skills
Azure Preparemicrosoft/azure-skills
Azure Storagemicrosoft/azure-skills
Azure Validatemicrosoft/azure-skills
Appinsights Instrumentationmicrosoft/azure-skills
Azure Resource Lookupmicrosoft/azure-skills
Journey fit
Primary fit
Production server launch and containerized ops sit in Operate because the skill targets running inference after the model choice is made. Infra is the canonical shelf for GPU memory tuning, multi-GPU TP, and Docker CMD patterns that keep serving online.
Common Questions / FAQ
Is Sglang safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Sglang
# Production Deployment Guide Complete guide to deploying SGLang in production environments. ## Server Deployment ### Basic server ```bash python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-8B-Instruct \ --host 0.0.0.0 \ --port 30000 \ --mem-fraction-static 0.9 ``` ### Multi-GPU (Tensor Parallelism) ```bash # Llama 3-70B on 4 GPUs python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-70B-Instruct \ --tp 4 \ --port 30000 ``` ### Quantization ```bash # FP8 quantization (H100) python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-70B-Instruct \ --quantization fp8 \ --tp 4 # INT4 AWQ quantization python -m sglang.launch_server \ --model-path TheBloke/Llama-2-70B-AWQ \ --quantization awq \ --tp 2 # INT4 GPTQ quantization python -m sglang.launch_server \ --model-path TheBloke/Llama-2-70B-GPTQ \ --quantization gptq \ --tp 2 ``` ## Docker Deployment ### Dockerfile ```dockerfile FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 # Install Python RUN apt-get update && apt-get install -y python3.10 python3-pip git # Install SGLang RUN pip3 install "sglang[all]" flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ # Copy model (or download at runtime) WORKDIR /app # Expose port EXPOSE 30000 # Start server CMD ["python3", "-m", "sglang.launch_server", \ "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct", \ "--host", "0.0.0.0", \ "--port", "30000"] ``` ### Build and run ```bash # Build image docker build -t sglang:latest . # Run with GPU docker run --gpus all -p 30000:30000 sglang:latest # Run with specific GPUs docker run --gpus '"device=0,1,2,3"' -p 30000:30000 sglang:latest # Run with custom model docker run --gpus all -p 30000:30000 \ -e MODEL_PATH="meta-llama/Meta-Llama-3-70B-Instruct" \ -e TP_SIZE="4" \ sglang:latest ``` ## Kubernetes Deployment ### Deployment YAML ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: sglang-llama3-70b spec: replicas: 2 selector: matchLabels: app: sglang template: metadata: labels: app: sglang spec: containers: - name: sglang image: sglang:latest command: - python3 - -m - sglang.launch_server - --model-path=meta-llama/Meta-Llama-3-70B-Instruct - --tp=4 - --host=0.0.0.0 - --port=30000 - --mem-fraction-static=0.9 ports: - containerPort: 30000 name: http resources: limits: nvidia.com/gpu: 4 livenessProbe: httpGet: path: /health port: 30000 initialDelaySeconds: 60 periodSeconds: 10 readinessProbe: httpGet: path: /health port: 30000 initialDelaySeconds: 30 periodSeconds: 5 --- apiVersion: v1 kind: Service metadata: name: sglang-service spec: selector: app: sglang ports: - port: 80 targetPort: 30000 type: LoadBalancer ``` ## Monitoring ### Health checks ```bash # Health endpoint curl http://localhost:30000/health # Model info curl http://localhost:30000/v1/models # Server stats curl http://localhost:30000/stats ``` ### Prometheus metrics ```bash # Start server with metrics python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-8B-Instruct \ --enable-metrics # Metrics endpoint curl http://localhost:30000/metrics # Key metrics: # - sglang_request_total # - sglang_request_duration_seconds # - sglang_tokens_generated_total # - sglang_active_requests # - sglang_queue_size # - sglang_radix_cache_hit_rate # - sglang_gpu_memory_used_bytes ``` ### Logging ```bash # Enable debug logging python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-8B-Instruct \ --log-level debug # Log to file python -m sglang.launch_server \ --model-path meta-llama/Meta-L