
Stable Baselines3
Pick the right Stable Baselines3 RL algorithm (PPO, SAC, DQN, HER, etc.) for your environment and action space before training.
Overview
stable-baselines3 is an agent skill for the Build phase that maps task traits to Stable Baselines3 RL algorithms via a structured comparison reference.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill stable-baselines3
What is this skill?
- Comparison table across PPO, A2C, SAC, TD3, DDPG, DQN, HER, and RecurrentPPO
- Task fit: discrete actions (DQN), continuous actions (SAC/TD3), goal-conditioned tasks (HER), partial observability (RecurrentPPO)
- Sample-efficiency and training-speed tradeoffs called out per algorithm
- PPO positioned as general-purpose on-policy default across action space types
- Notes when to prefer TD3 over DDPG and multiprocessing-friendly on-policy options
- 8 algorithms in the core comparison table (PPO, A2C, SAC, TD3, DDPG, DQN, HER, RecurrentPPO)
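The selection rules in the list above can be sketched as a small lookup. This is only an illustration: the `TaskTraits` fields, the `shortlist` helper, and the ordering of candidates are assumptions made for this example, not part of the skill or of the Stable Baselines3 API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTraits:
    """Hypothetical description of a task; field names are illustrative."""
    action_space: str                   # "discrete" or "continuous"
    goal_conditioned: bool = False
    partially_observable: bool = False
    expensive_env: bool = False         # True when each environment step is costly


def shortlist(traits: TaskTraits) -> list[str]:
    """Return candidate algorithm names, specialized choices first."""
    if traits.action_space not in ("discrete", "continuous"):
        raise ValueError(f"unknown action space: {traits.action_space!r}")

    candidates: list[str] = []
    if traits.goal_conditioned:
        candidates.append("HER")
    if traits.partially_observable:
        # RecurrentPPO is distributed in the separate sb3-contrib package.
        candidates.append("RecurrentPPO")

    off_policy = ["SAC", "TD3"] if traits.action_space == "continuous" else ["DQN"]
    if traits.expensive_env:
        # Assumed ordering: sample-efficient off-policy methods first, PPO as fallback.
        candidates += off_policy + ["PPO"]
    else:
        # Assumed ordering: PPO as the general-purpose default, off-policy as alternatives.
        candidates += ["PPO"] + off_policy
    return candidates


print(shortlist(TaskTraits("continuous", expensive_env=True)))
print(shortlist(TaskTraits("discrete", partially_observable=True)))
```

The returned list is a starting point for discussion rather than a ranking; the tradeoff notes for each algorithm still decide the final pick.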
Adoption & trust: 535 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have an RL environment defined but no clear rule for whether PPO, SAC, DQN, or HER is the right Stable Baselines3 starting point.
Who is it for?
Indie ML builders implementing Python RL agents with Stable Baselines3 who need algorithm selection guidance early in a project.
Skip if: your team needs only data-pipeline ETL, classical supervised modeling, or hosted inference ops with no RL training loop.
When should I use this skill?
Selecting a Stable Baselines3 trainer given environment action space, observability, and sample-efficiency constraints.
What do I get? / Deliverables
You leave with a short-listed algorithm aligned to action space, sample budget, and observability, plus awareness of speed and stability tradeoffs before you train.
- Recommended algorithm short list with rationale
- Tradeoff notes on sample efficiency versus training speed
- Pointers to specialized choices (HER, RecurrentPPO) when task traits match
Journey fit
Reinforcement-learning agent work happens during Build when you are implementing learning systems, simulators, or control policies. Agent-tooling is the right shelf for library selection and training recipes that power autonomous or simulated agents, not generic app frontend work.
How it compares
Algorithm-selection reference for Stable Baselines3, not a substitute for environment design or hyperparameter search tools.
Common Questions / FAQ
Who is stable-baselines3 for?
Solo builders and researchers shipping RL experiments in Python who use or plan to use the Stable Baselines3 library and want faster, informed algorithm picks.
When should I use stable-baselines3?
During Build, while defining agent tooling: after you know the action space and reward structure, but before you commit to a trainer and a training budget.
Is stable-baselines3 safe to install?
It is reference documentation; training code you generate still runs locally with full compute access—review the Security Audits panel on this Prism page for the source package.
SKILL.md
# Stable Baselines3 Algorithm Reference

This document provides detailed characteristics of all RL algorithms in Stable Baselines3 to help select the right algorithm for specific tasks.

## Algorithm Comparison Table

| Algorithm | Type | Action Space | Sample Efficiency | Training Speed | Use Case |
|-----------|------|--------------|-------------------|----------------|----------|
| **PPO** | On-Policy | All | Medium | Fast | General-purpose, stable |
| **A2C** | On-Policy | All | Low | Very Fast | Quick prototyping, multiprocessing |
| **SAC** | Off-Policy | Continuous | High | Medium | Continuous control, sample-efficient |
| **TD3** | Off-Policy | Continuous | High | Medium | Continuous control, deterministic |
| **DDPG** | Off-Policy | Continuous | High | Medium | Continuous control (use TD3 instead) |
| **DQN** | Off-Policy | Discrete | Medium | Medium | Discrete actions, Atari games |
| **HER** | Off-Policy | All | Very High | Medium | Goal-conditioned tasks |
| **RecurrentPPO** | On-Policy | All | Medium | Slow | Partial observability (POMDP) |

## Detailed Algorithm Characteristics

### PPO (Proximal Policy Optimization)

**Overview:** General-purpose on-policy algorithm with good performance across many tasks.
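
As a rough illustration of a PPO starting point, the sketch below assumes `stable-baselines3` and `gymnasium` are installed. The environment ID `Pendulum-v1`, the four parallel environments, and the helper names `samples_per_update` and `train_ppo` are choices made for this example only; the keyword values shown match the library defaults.

```python
# Keyword arguments for PPO; these values match the Stable Baselines3 defaults.
PPO_KWARGS = {
    "learning_rate": 3e-4,
    "n_steps": 2048,      # rollout length per environment before each update
    "batch_size": 64,     # minibatch size used during each update epoch
    "n_epochs": 10,
    "gamma": 0.99,
}


def samples_per_update(n_envs: int) -> int:
    """Transitions collected before each policy update (n_steps * n_envs)."""
    return PPO_KWARGS["n_steps"] * n_envs


def train_ppo(env_id: str = "Pendulum-v1", n_envs: int = 4,
              total_timesteps: int = 20_000):
    """Train PPO on n_envs copies of env_id (hypothetical helper)."""
    # Imported lazily so the configuration above can be inspected without the library.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    vec_env = make_vec_env(env_id, n_envs=n_envs)
    model = PPO("MlpPolicy", vec_env, verbose=0, **PPO_KWARGS)
    model.learn(total_timesteps=total_timesteps)
    return model
```

Calling `train_ppo()` would run a short training job and return the model. With four environments, each update sees `samples_per_update(4)` transitions, which is worth keeping a multiple of `batch_size` so that minibatches divide evenly.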

**Strengths:**
- Stable and reliable training
- Works with all action space types (Discrete, Box, MultiDiscrete, MultiBinary)
- Good balance between sample efficiency and training speed
- Excellent for multiprocessing with vectorized environments
- Easy to tune

**Weaknesses:**
- Less sample-efficient than off-policy methods
- Requires many environment interactions

**Best For:**
- General-purpose RL tasks
- When stability is important
- When you have cheap environment simulations
- Tasks with continuous or discrete actions

**Hyperparameter Guidance:**
- `n_steps`: 2048-4096 for continuous, 128-256 for Atari
- `learning_rate`: 3e-4 is a good default
- `n_epochs`: 10 for continuous, 4 for Atari
- `batch_size`: 64
- `gamma`: 0.99 (0.995-0.999 for long episodes)

### A2C (Advantage Actor-Critic)

**Overview:** Synchronous variant of A3C, simpler than PPO but less stable.

**Strengths:**
- Very fast training (simpler than PPO)
- Works with all action space types
- Good for quick prototyping
- Memory efficient

**Weaknesses:**
- Less stable than PPO
- Requires careful hyperparameter tuning
- Lower sample efficiency

**Best For:**
- Quick experimentation
- When training speed is critical
- Simple environments

**Hyperparameter Guidance:**
- `n_steps`: 5-256 depending on task
- `learning_rate`: 7e-4
- `gamma`: 0.99

### SAC (Soft Actor-Critic)

**Overview:** Off-policy algorithm with entropy regularization, state-of-the-art for continuous control.
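
A minimal sketch of an SAC starting point, assuming `stable-baselines3` and `gymnasium` are installed. The keyword values follow this reference's SAC hyperparameter guidance; the environment ID `Pendulum-v1` and the helper names `validate_budget` and `train_sac` are hypothetical and exist only for this example.

```python
# Keyword arguments for SAC, taken from the hyperparameter guidance in this reference.
SAC_KWARGS = {
    "learning_rate": 3e-4,
    "buffer_size": 1_000_000,
    "learning_starts": 10_000,  # transitions collected before any gradient update
    "batch_size": 256,
    "tau": 0.005,               # target network update rate
    "train_freq": 1,
    "gradient_steps": -1,       # as many gradient steps as environment steps collected
}


def validate_budget(total_timesteps: int) -> None:
    """Reject budgets that end before learning starts (no updates would run)."""
    if total_timesteps <= SAC_KWARGS["learning_starts"]:
        raise ValueError(
            f"total_timesteps={total_timesteps} does not exceed "
            f"learning_starts={SAC_KWARGS['learning_starts']}"
        )


def train_sac(env_id: str = "Pendulum-v1", total_timesteps: int = 50_000):
    """Train SAC on a continuous-action (Box) environment (hypothetical helper)."""
    validate_budget(total_timesteps)
    # Imported lazily so the configuration above can be inspected without the library.
    from stable_baselines3 import SAC

    model = SAC("MlpPolicy", env_id, verbose=0, **SAC_KWARGS)
    model.learn(total_timesteps=total_timesteps)
    return model
```

The budget check is a design choice for this sketch: with `learning_starts` at 10000, a short smoke test would otherwise only fill the replay buffer and never update the policy.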

**Strengths:**
- Excellent sample efficiency
- Very stable training
- Automatic entropy tuning
- Good exploration through stochastic policy
- State-of-the-art for robotics

**Weaknesses:**
- Only supports continuous action spaces (Box)
- Slower wall-clock time than on-policy methods
- More complex hyperparameters

**Best For:**
- Continuous control (robotics, physics simulations)
- When sample efficiency is critical
- Expensive environment simulations
- Tasks requiring good exploration

**Hyperparameter Guidance:**
- `learning_rate`: 3e-4
- `buffer_size`: 1M for most tasks
- `learning_starts`: 10000
- `batch_size`: 256
- `tau`: 0.005 (target network update rate)
- `train_freq`: 1 with `gradient_steps=-1` for best performance

### TD3 (Twin Delayed DDPG)

**Overview:** Improved DDPG with double Q-learning and delayed policy updates.

**Strengths:**
- High sample efficiency
- Deterministic policy (good for deployment)
- More stable than DDPG
- Good for continuous control

**Weaknesses:**
- Only supports continuous action spaces (Box)
- Less exploration than SAC
- Requires careful tuning

**Best For:**
- Continuous control tasks
- When deterministic policies are preferred
- Sample-efficient learning

**Hyperparameter Guidance:**
- `learning_rate`: 1e-3
- `buffer_size`: 1M
-