Monitoring Operations

The primary shelf is Operate because the skill targets live metrics, alarms, and troubleshooting gaps after resources exist. The Monitoring subphase covers observability setup and missing-data diagnosis, not initial Terraform authoring (though Landing Zone is referenced).

Where it fits

Example use

Verify alarms and log connectors before cutting over production traffic on OCI.

Example use

Fix empty metric charts by correcting namespace and MQL queries.

Example use

Extend Landing Zone observability with Service Connector patterns from references.

Example use

Grow: Analytics & insights

Ensure baseline metrics exist before building business dashboards on OCI telemetry.

How it compares

OCI-focused runbook skill, not a vendor-neutral APM product integration.

Common Questions / FAQ

Who is monitoring-operations for?

Indie operators and agents working on Oracle Cloud who must configure or fix metrics, alarms, and logs on deployed environments.

When should I use monitoring-operations?

In Operate when tuning alarms or fixing missing telemetry; in Ship when validating observability before go-live; whenever namespace confusion or threshold gotchas appear.

Is monitoring-operations safe to install?

Review the Security Audits panel on this page; the skill may drive shell and cloud API actions—scope credentials and test alarms in non-production first.

SKILL.md - Monitoring Operations

# OCI Monitoring and Observability - Expert Knowledge

## 🏗️ Use OCI Landing Zone Terraform Modules

**Don't reinvent the wheel.** Use [oracle-terraform-modules/landing-zone](https://github.com/oracle-terraform-modules/terraform-oci-landing-zones) for the observability stack.

**Landing Zone solves:**
- ❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
- ❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)

**This skill provides**: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.

---

## ⚠️ OCI CLI/API Knowledge Gap

**You don't know OCI CLI commands or OCI API structure.**

Your training data has limited and outdated knowledge of:
- OCI CLI syntax and parameters (updates monthly)
- OCI API endpoints and request/response formats
- Monitoring service CLI operations (`oci monitoring alarm`, `oci monitoring metric`)
- Metric namespaces and MQL (Monitoring Query Language)
- Latest Logging and Service Connector features

**When OCI operations are needed:**
1. Use exact CLI commands from this skill's references
2. Do NOT guess metric namespace names
3. Do NOT assume AWS CloudWatch patterns work in OCI
4. Load reference files for detailed MQL documentation

**What you DO know:**
- General observability concepts
- Alerting and threshold design principles
- Log aggregation patterns

This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.

---

## NEVER Do This

❌ **NEVER assume metrics are instant (10-15 minute lag)**
- Metrics published every 1-5 minutes
- Processing delay: 5-10 minutes
- **Total lag**: 10-15 minutes from event to visible metric
- Don't debug "missing metrics" within first 15 minutes of resource creation
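
The lag rule above can be applied as a guard before any troubleshooting starts. A minimal sketch, assuming the 15-minute upper bound quoted above; `too_early_to_debug` and its `created_at` argument are hypothetical names for illustration, not part of any OCI SDK:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Upper bound on metric lag, taken from the 10-15 minute figure above.
METRIC_LAG = timedelta(minutes=15)


def too_early_to_debug(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True while the resource is still inside the expected lag window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at < METRIC_LAG
```

If the guard returns True, wait and re-check the chart before treating the gap as a namespace or dimension problem.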

❌ **NEVER use `=` for alarm thresholds with sparse metrics**
```
# WRONG - alarm never fires if metric has gaps
MetricName[1m].mean() = 0

# RIGHT - handle missing data
MetricName[1m]{dataMissing=zero}.mean() > 0
```
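
Why gaps matter can be seen in a toy model. The sketch below is NOT the Monitoring service's evaluator; it only assumes that an interval with no datapoint produces no evaluation at all, and that filling gaps with zero gives every interval a value to compare. The `evaluate` helper and the sample `errors` series are hypothetical:

```python
from typing import Callable, Dict, List, Optional


def evaluate(series: Dict[int, float], minutes: range,
             condition: Callable[[float], bool],
             fill: Optional[float] = None) -> List[Optional[bool]]:
    """Check `condition` once per minute; a gap yields None unless `fill` is set."""
    results: List[Optional[bool]] = []
    for minute in minutes:
        value = series.get(minute, fill)
        results.append(None if value is None else condition(value))
    return results


# Hypothetical sparse counter: a datapoint exists only in minute 2.
errors = {2: 3.0}

# Equality against zero never becomes True, because quiet minutes have no data.
equals_zero = evaluate(errors, range(4), lambda v: v == 0)

# With gaps filled as zero, every minute is evaluated and the breach is isolated.
above_zero_filled = evaluate(errors, range(4), lambda v: v > 0, fill=0.0)
```

In this model `equals_zero` contains no `True` entry at all, while `above_zero_filled` is `True` only for the minute that actually reported errors.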

❌ **NEVER forget metric dimensions (causes "no data")**
```
# WRONG - missing required dimension
CPUUtilization[1m].mean()

# RIGHT - include resourceId dimension
CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```
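
One way to avoid dropping the dimension is to build query strings through a helper that refuses an empty dimension set. A minimal sketch that only reproduces the query shape shown above; `build_mql` is a hypothetical helper, not an OCI SDK function:

```python
from typing import Mapping


def build_mql(metric: str, interval: str, dimensions: Mapping[str, str],
              statistic: str = "mean") -> str:
    """Format a query as Metric[interval]{dim="value"}.statistic()."""
    if not dimensions:
        raise ValueError("at least one dimension (for example resourceId) is required")
    dims = ", ".join(f'{name}="{value}"' for name, value in dimensions.items())
    return f"{metric}[{interval}]{{{dims}}}.{statistic}()"


query = build_mql("CPUUtilization", "1m", {"resourceId": "<instance-ocid>"})
```

Here `query` matches the RIGHT example above character for character; calling the helper with an empty mapping raises instead of silently producing a query that returns no data.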

❌ **NEVER set alarm thresholds without trigger delay (alert fatigue)**
```
# BAD - fires on every CPU spike
CPUUtilization[1m].mean() > 80

# BETTER - sustained high CPU
CPUUtilization[5m].mean() > 80
Trigger delay: 5 minutes (fires after 5 consecutive breaches)
```
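
The effect of a trigger delay can be illustrated with a toy simulation. This sketch assumes one evaluation per minute and treats the delay as "N consecutive breaching evaluations", as described above; it is not the service's actual evaluation logic, and `firing_minutes` and the sample `cpu` series are hypothetical:

```python
from typing import Iterable, List


def firing_minutes(values: Iterable[float], threshold: float,
                   trigger_delay: int) -> List[int]:
    """Return the minute indexes at which the alarm is firing in this toy model."""
    firing: List[int] = []
    streak = 0  # consecutive breaching evaluations so far
    for minute, value in enumerate(values):
        streak = streak + 1 if value > threshold else 0
        if streak >= trigger_delay:
            firing.append(minute)
    return firing


# One short spike (minute 1) followed by sustained load (minutes 3-7).
cpu = [50, 95, 60, 85, 90, 88, 92, 91, 70]

no_delay = firing_minutes(cpu, threshold=80, trigger_delay=1)    # [1, 3, 4, 5, 6, 7]
with_delay = firing_minutes(cpu, threshold=80, trigger_delay=5)  # [7]
```

Without a delay the alarm fires on the single spike at minute 1; with a delay of 5 it fires only once the load has stayed above the threshold for five evaluations in a row.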

❌ **NEVER create alarms without notification channels**
```
# WRONG - alarm fires but nobody knows
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
```
Cost impact: Undetected outages cost $5,000-50,000/hour in production
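
A small pre-flight check can stop an alarm from being created with an empty destination list. A minimal sketch, assuming the value is passed to the `--destinations` flag shown above as a JSON array; `destinations_arg` is a hypothetical helper and does not call the CLI itself:

```python
import json
from typing import List


def destinations_arg(topic_ocids: List[str]) -> str:
    """Return a JSON array for --destinations, refusing an empty list."""
    cleaned = [ocid.strip() for ocid in topic_ocids if ocid and ocid.strip()]
    if not cleaned:
        raise ValueError("alarm needs at least one notification topic OCID")
    return json.dumps(cleaned)


arg = destinations_arg(["<notification-topic-ocid>"])
```

The resulting `arg` is the same string used in the RIGHT example above, so it can be dropped into the command unchanged.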

❌ **NEVER ignore Cloud Guard findings (security audit failure)**
- Cloud Guard detects misconfigurations BEFORE they become incidents
- Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
- Cost impact: $100,000+ per security breach vs $0 for proactive remediation

## Metric Namespace Gotchas

**OCI Metrics Use Service-Specific Namespaces:**

| Service | Namespace | Example Metric |
|---------|-----------|----------------|
| Compute | `oci_computeagent` | `CPUUtilization`, `MemoryUtilization` |
| Autonomous DB | `oci_autonomous_database` | `CpuUtilization`, `StorageUtilization` |
| Load Balancer | `oci_lbaas` | `HttpRequests`, `UnHealthyBackendServers` |
| Object Storage | `oci_objectstorage` | `ObjectCount`, `BytesUploaded` |
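
The table can be encoded as a lookup that fails loudly instead of guessing. A minimal sketch containing only the four namespaces listed above; the dictionary keys and the `namespace_for` helper are hypothetical names chosen for illustration:

```python
# Namespaces copied from the table above; keys are illustrative labels only.
NAMESPACES = {
    "compute": "oci_computeagent",
    "autonomous_db": "oci_autonomous_database",
    "load_balancer": "oci_lbaas",
    "object_storage": "oci_objectstorage",
}


def namespace_for(service: str) -> str:
    """Return the metric namespace for a known service, or raise instead of guessing."""
    try:
        return NAMESPACES[service]
    except KeyError:
        raise KeyError(
            f"no known namespace for {service!r}; check the reference table"
        ) from None
```

Any service not in the table raises a `KeyError`, which is the intended behavior: look the namespace up rather than inventing one.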

**Common Mistake**: Using wrong namespace (`oci_compute` vs `oci

What is this skill?

Expert guidance on OCI metric namespaces, MQL, and alarm threshold gotchas

Log collection setup and common monitoring gaps

Recommends oracle-terraform-modules landing-zone for observability baseline

Explicit warning: do not guess OCI CLI—use skill reference commands

Covers Cloud Guard, VSS, OSMS integration context within Landing Zone

Version 2.0.0 in skill metadata

References Landing Zone bad practices #7 and #10 for monitoring gaps

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1.3k installs on skills.sh; 11 GitHub stars; 3/3 security scanners passed (skills.sh audits).

Journey fit

Spans multiple journey phases: Operate is the primary shelf, with Ship and Grow as alternate fits.
