
Server Management
Decide how to run, watch, and scale your app on a VPS or bare metal without guessing which process manager or observability stack fits your stack.
Overview
Server Management is an agent skill most often used in Operate (also Ship, Launch) that teaches production server decision-making for process supervision, monitoring, and scaling—not a command cheat sheet.
Install
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill server-management
What is this skill?
- Scenario-based process tool matrix (Node/PM2, systemd, Docker/Podman, K8s/Swarm)
- Four operational goals: crash restart, zero-downtime reload, clustering, reboot persistence
- Monitoring pillars: availability, performance, errors, and resource metrics
- Three-tier alert severity strategy (critical, warning, info)
- Monitoring tool ladder from PM2/htop through Grafana/Datadog-class observability
- 4 process-management goals
- 4 monitoring metric categories
- 3 alert severity levels
Adoption & trust: 662 installs on skills.sh; 40.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are shipping to your own server but do not know whether PM2, systemd, or containers fit your app, or which metrics and alert levels actually matter.
Who is it for?
Solo builders running Node or containerized apps on a VPS who want repeatable process and monitoring choices before touching prod.
Skip if: your team is fully on serverless/PaaS with vendor-managed ops, or you only need a single paste-and-go deploy script with no scaling or monitoring strategy.
When should I use this skill?
You are planning or running production on a server you control and need process, monitoring, and scaling principles—not ad-hoc command lists.
What do I get? / Deliverables
You leave with a structured ops mental model—tool choices, monitoring categories, and severity tiers—so your agent can implement runbooks that match your stack instead of generic Linux snippets.
- Process-supervisor recommendation
- Monitoring and alert-tier plan
- Scaling decision notes
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Production server lifecycle—process supervision, reboot survival, and scaling tradeoffs—maps cleanly to Operate, where the product already runs and you own the metal. Infra is the canonical shelf because the skill centers on process orchestration (PM2, systemd, containers) and capacity decisions rather than incident triage alone.
Where it fits
Pick PM2 clustering versus a single systemd unit before your first real users hit the API.
Define critical versus warning alerts before a Product Hunt spike so you know when to scale the box.
Decide Docker versus bare-metal supervision when adding a second service on the same VPS.
Map availability, latency, errors, and resource metrics to a minimal Grafana or PM2 dashboard plan.
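The alert-tier idea above can be sketched as a small classifier that maps metric readings to the skill's three severity levels (critical, warning, info). This is a minimal illustration, not part of the skill itself: the metric names, the threshold values, and the `classify` / `classify_snapshot` helpers are all assumptions chosen for the example and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Warning and critical cut-offs for one metric (hypothetical values)."""
    warning: float
    critical: float


# Assumed thresholds for illustration only; the skill does not prescribe numbers.
THRESHOLDS = {
    "cpu_percent": Thresholds(warning=75.0, critical=90.0),
    "memory_percent": Thresholds(warning=80.0, critical=95.0),
    "disk_percent": Thresholds(warning=80.0, critical=90.0),
    "error_rate_percent": Thresholds(warning=1.0, critical=5.0),
    "p95_latency_ms": Thresholds(warning=500.0, critical=2000.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'critical', 'warning', or 'info' for a single reading."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"  # immediate action
    if value >= limits.warning:
        return "warning"   # investigate soon
    return "info"          # review daily


def classify_snapshot(snapshot: dict[str, float], healthy: bool) -> dict[str, str]:
    """Classify every metric in a snapshot and add an availability entry.

    A failed health check is treated as critical regardless of other metrics.
    """
    levels = {name: classify(name, value) for name, value in snapshot.items()}
    levels["availability"] = "info" if healthy else "critical"
    return levels


if __name__ == "__main__":
    reading = {"cpu_percent": 82.0, "disk_percent": 93.0, "error_rate_percent": 0.2}
    for name, level in classify_snapshot(reading, healthy=True).items():
        print(f"{name}: {level}")
```

Keeping the thresholds in one table makes it easy to review which readings page someone immediately and which only show up in a daily review, before wiring the same tiers into whatever dashboard or alerting tool you choose.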
How it compares
Use for ops principles and tradeoffs instead of a bare DevOps command dump or an IaC-only provisioning skill.
Common Questions / FAQ
Who is server-management for?
Indie and solo developers who self-host APIs, SaaS backends, or CLIs on Linux and need a sane framework for supervisors, health checks, and scaling conversations with their coding agent.
When should I use server-management?
Use it in Operate when tuning infra on a live box; in Ship when choosing how processes survive deploys and crashes; and at Launch when you need uptime and alerting before traffic spikes—not only after outages.
Is server-management safe to install?
The skill's frontmatter labels its source as community and its risk as safe. Review the Security Audits panel on this Prism page, and restrict agent permissions if you let it run shell commands on production hosts.
SKILL.md
# Server Management

> Server management principles for production operations.
> **Learn to THINK, not memorize commands.**

---

## 1. Process Management Principles

### Tool Selection

| Scenario | Tool |
|----------|------|
| **Node.js app** | PM2 (clustering, reload) |
| **Any app** | systemd (Linux native) |
| **Containers** | Docker/Podman |
| **Orchestration** | Kubernetes, Docker Swarm |

### Process Management Goals

| Goal | What It Means |
|------|---------------|
| **Restart on crash** | Auto-recovery |
| **Zero-downtime reload** | No service interruption |
| **Clustering** | Use all CPU cores |
| **Persistence** | Survive server reboot |

---

## 2. Monitoring Principles

### What to Monitor

| Category | Key Metrics |
|----------|-------------|
| **Availability** | Uptime, health checks |
| **Performance** | Response time, throughput |
| **Errors** | Error rate, types |
| **Resources** | CPU, memory, disk |

### Alert Severity Strategy

| Level | Response |
|-------|----------|
| **Critical** | Immediate action |
| **Warning** | Investigate soon |
| **Info** | Review daily |

### Monitoring Tool Selection

| Need | Options |
|------|---------|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |

---

## 3. Log Management Principles

### Log Strategy

| Log Type | Purpose |
|----------|---------|
| **Application logs** | Debug, audit |
| **Access logs** | Traffic analysis |
| **Error logs** | Issue detection |

### Log Principles

1. **Rotate logs** to prevent disk fill
2. **Structured logging** (JSON) for parsing
3. **Appropriate levels** (error/warn/info/debug)
4. **No sensitive data** in logs

---

## 4. Scaling Decisions

### When to Scale

| Symptom | Solution |
|---------|----------|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |

### Scaling Strategy

| Type | When to Use |
|------|-------------|
| **Vertical** | Quick fix, single instance |
| **Horizontal** | Sustainable, distributed |
| **Auto** | Variable traffic |

---

## 5. Health Check Principles

### What Constitutes Healthy

| Check | Meaning |
|-------|---------|
| **HTTP 200** | Service responding |
| **Database connected** | Data accessible |
| **Dependencies OK** | External services reachable |
| **Resources OK** | CPU/memory not exhausted |

### Health Check Implementation

- Simple: Just return 200
- Deep: Check all dependencies
- Choose based on load balancer needs

---

## 6. Security Principles

| Area | Principle |
|------|-----------|
| **Access** | SSH keys only, no passwords |
| **Firewall** | Only needed ports open |
| **Updates** | Regular security patches |
| **Secrets** | Environment vars, not files |
| **Audit** | Log access and changes |

---

## 7. Troubleshooting Priority

When something's wrong:

1. **Check if running** (process status)
2. **Check logs** (error messages)
3. **Check resources** (disk, memory, CPU)
4. **Check network** (ports, DNS)
5. **Check dependencies** (database, APIs)

---

## 8. Anti-Patterns

| ❌ Don't | ✅ Do |
|----------|-------|
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |

---

> **Remember:** A well-managed server is boring. That's the goal.

## When to Use

This skill is applicable to execute the workflow or actions described in the overview.

## Limitations

- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific