
TensorRT-LLM
Choose and configure TensorRT-LLM tensor, pipeline, and expert parallelism when serving large models across multiple GPUs or nodes.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill tensorrt-llm
What is this skill?
- Tensor Parallelism (TP) for single-node low-latency sharding across GPUs
- Pipeline Parallelism (PP) for very large models across nodes with micro-batching
- Worked examples (e.g. Llama 3-70B TP=4, Llama 3-405B TP=4 × PP=2 on 8× H100)
- Guidance on communication overhead, throughput, and when NVLink matters
- Expert parallelism patterns for MoE-scale deployments
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 1/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Multi-GPU inference deployment is where production GPU topology and latency tradeoffs live, even if you first prototype the same stack during Build. Infra is the canonical shelf for scaling LLM serving across NVLink nodes, H100 clusters, and mixed TP/PP layouts.
Common Questions / FAQ
Is TensorRT-LLM safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Multi-GPU Deployment Guide

Comprehensive guide to scaling TensorRT-LLM across multiple GPUs and nodes.

## Parallelism Strategies

### Tensor Parallelism (TP)

**What it does**: Splits each layer's weights across GPUs (a horizontal split), so every GPU holds a slice of every layer.

**Use case**:
- Model fits in total GPU memory but not on a single GPU
- Need low latency (single forward pass)
- GPUs on the same node (NVLink required for best performance)

**Example** (Llama 3-70B on 4× A100):

```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="fp16"
)
# Model automatically sharded across GPUs
# Single forward pass, low latency
```

**Performance**:
- Latency: ~Same as single GPU
- Throughput: up to 4× higher (4 GPUs)
- Communication: High (activations synced every layer)

### Pipeline Parallelism (PP)

**What it does**: Splits the model vertically (layer-wise), so each pipeline stage holds a contiguous block of layers.

**Use case**:
- Very large models (175B+)
- Can tolerate higher latency
- GPUs across multiple nodes

**Example** (Llama 3-405B on 8× H100, 2 nodes × 4 GPUs):

```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=4,    # TP=4 within each node
    pipeline_parallel_size=2,  # PP=2 across nodes
    dtype="fp8"
)
# Total: 8 GPUs (4×2)
# First half of the layers: Node 1 (4 GPUs with TP)
# Second half of the layers: Node 2 (4 GPUs with TP)
```

**Performance**:
- Latency: Higher (sequential through pipeline)
- Throughput: High with micro-batching
- Communication: Lower than TP

### Expert Parallelism (EP)

**What it does**: Distributes MoE experts across GPUs.
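To make "distributes experts" concrete, here is a minimal sketch in plain Python of one way eight experts could be assigned to two expert-parallel groups. It does not call TensorRT-LLM: `assign_experts` is a hypothetical helper, and the contiguous-block mapping is an assumption chosen for illustration, so the library's actual placement may differ.

```python
def assign_experts(num_experts: int, ep_size: int) -> dict[int, list[int]]:
    """Map each expert-parallel rank to a contiguous block of expert indices.

    Hypothetical helper for illustration only; not part of TensorRT-LLM.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(ep_size)
    }


# 8 experts (as in Mixtral) split over 2 expert-parallel groups
placement = assign_experts(num_experts=8, ep_size=2)
for rank, experts in placement.items():
    print(f"EP rank {rank}: experts {experts}")
# EP rank 0: experts [0, 1, 2, 3]
# EP rank 1: experts [4, 5, 6, 7]
```

Each group stores only its own experts' weights, so tokens routed to an expert on another group have to be sent there and back.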
**Use case**: Mixture-of-Experts models (Mixtral, DeepSeek-V2)

**Example** (Mixtral-8x22B on 8× A100):

```python
from tensorrt_llm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x22B",
    tensor_parallel_size=4,
    expert_parallel_size=2,  # Distribute 8 experts across 2 groups
    dtype="fp16"  # A100 has no FP8 support; use fp8 on H100
)
```

## Configuration Examples

### Small model (7-13B) - Single GPU

```python
from tensorrt_llm import LLM

# Llama 3-8B on 1× A100 80GB
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp16"  # or fp8 for H100
)
```

**Resources**:
- GPU: 1× A100 80GB
- Memory: ~16GB model + 30GB KV cache
- Throughput: 3,000-5,000 tokens/sec

### Medium model (70B) - Multi-GPU same node

```python
from tensorrt_llm import LLM

# Llama 3-70B on 4× A100 80GB (NVLink)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    dtype="fp16"  # ~140GB of weights / 4 GPUs ≈ 35GB per GPU (A100 has no FP8)
)
```

**Resources**:
- GPU: 4× A100 80GB with NVLink
- Memory: ~35GB of weights per GPU (FP16)
- Throughput: 10,000-15,000 tokens/sec
- Latency: 15-20ms per token

### Large model (405B) - Multi-node

```python
from tensorrt_llm import LLM

# Llama 3-405B on 2 nodes × 8 H100 = 16 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,    # TP within each node
    pipeline_parallel_size=2,  # PP across 2 nodes
    dtype="fp8"
)
```

**Resources**:
- GPU: 2 nodes × 8 H100 80GB
- Memory: ~25GB per GPU (FP8)
- Throughput: 20,000-30,000 tokens/sec
- Network: InfiniBand recommended

## Server Deployment

### Single-node multi-GPU

```bash
# Llama 3-70B on 4 GPUs (automatic TP)
trtllm-serve meta-llama/Meta-Llama-3-70B \
  --tp_size 4 \
  --max_batch_size 256 \
  --dtype fp8

# Listens on http://localhost:8000
```

### Multi-node with Ray

```bash
# Node 1 (head node)
ray start --head --port=6379

# Node 2 (worker)
ray start --address='node1:6379'

# Deploy across cluster (--num_workers 2: one worker per node)
trtllm-serve meta-llama/Meta-Llama-3-405B \
  --tp_size 8 \
  --pp_size 2 \
  --num_workers 2 \
  --dtype fp8
```

### Kubernetes deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-llama3-70b
spec:
  replicas: 1
  selector:  # required by apps/v1; must match the template labels
    matchLabels:
      app: tensorrt-llm-llama3-70b  # illustrative label, not from the original
  template:
    metadata:
      labels:
        app: tensorrt-llm-llama3-70b
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
        - trtllm-serve
        - meta-llama/Meta-Llama-3-70B
        - --tp_size=4
        - --max_batch_size=256
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
```

## Parallelism Decision Tree

```
Model size < 20G