
Monitoring Operations
Configure OCI metrics, alarms, and log collection while avoiding namespace and threshold mistakes on a deployed stack.
Overview
monitoring-operations is an agent skill, used mostly in the Operate phase (and sometimes in Ship), that sets up and troubleshoots OCI metrics, alarms, and log collection without guessing CLI or namespace details.
Install
npx skills add https://github.com/acedergren/oci-agent-skills --skill monitoring-operations
What is this skill?
- Expert guidance on OCI metric namespaces, MQL, and alarm threshold gotchas
- Log collection setup and common monitoring gaps
- Recommends oracle-terraform-modules landing-zone for observability baseline
- Explicit warning: do not guess OCI CLI—use skill reference commands
- Covers Cloud Guard, VSS, OSMS integration context within Landing Zone
- version 2.0.0 in skill metadata
- references Landing Zone bad practices #7 and #10 for monitoring gaps
Adoption & trust: 1.3k installs on skills.sh; 11 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your OCI stack is up but dashboards are empty, alarms misfire, or you cannot map services to the right metric namespaces.
Who is it for?
Solo builders or tiny teams running OCI workloads who need alarm and metric setup aligned with Landing Zone observability.
Skip if: Non-OCI clouds, greenfield apps with zero deployed resources, or teams that want generic monitoring theory without OCI CLI specifics.
When should I use this skill?
Setting up metrics, alarms, or troubleshooting missing data in OCI Monitoring; metric namespace or alarm threshold issues.
What do I get? / Deliverables
You apply documented OCI monitoring patterns—alarms, MQL, logging connectors—and close gaps that Landing Zone observability expects you to maintain.
- Working alarm definitions and metric queries
- Log collection configuration guidance
- Troubleshooting notes for monitoring gaps
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Primary shelf is Operate because the skill targets live metrics, alarms, and troubleshooting gaps after resources exist. Monitoring subphase covers observability setup and missing-data diagnosis—not initial Terraform authoring (though Landing Zone is referenced).
Where it fits
Verify alarms and log connectors before cutting over production traffic on OCI.
Fix empty metric charts by correcting namespace and MQL queries.
Extend Landing Zone observability with Service Connector patterns from references.
Ensure baseline metrics exist before building business dashboards on OCI telemetry.
How it compares
OCI-focused runbook skill, not a vendor-neutral APM product integration.
Common Questions / FAQ
Who is monitoring-operations for?
Indie operators and agents working on Oracle Cloud who must configure or fix metrics, alarms, and logs on deployed environments.
When should I use monitoring-operations?
In Operate when tuning alarms or fixing missing telemetry; in Ship when validating observability before go-live; whenever namespace confusion or threshold gotchas appear.
Is monitoring-operations safe to install?
Review the Security Audits panel on this page; the skill may drive shell and cloud API actions—scope credentials and test alarms in non-production first.
SKILL.md - Monitoring Operations
# OCI Monitoring and Observability - Expert Knowledge

## 🏗️ Use OCI Landing Zone Terraform Modules

**Don't reinvent the wheel.** Use [oracle-terraform-modules/landing-zone](https://github.com/oracle-terraform-modules/terraform-oci-landing-zones) for observability stack.

**Landing Zone solves:**
- ❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
- ❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)

**This skill provides**: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.

---

## ⚠️ OCI CLI/API Knowledge Gap

**You don't know OCI CLI commands or OCI API structure.**

Your training data has limited and outdated knowledge of:
- OCI CLI syntax and parameters (updates monthly)
- OCI API endpoints and request/response formats
- Monitoring service CLI operations (`oci monitoring alarm`, `oci monitoring metric`)
- Metric namespaces and MQL (Monitoring Query Language)
- Latest Logging and Service Connector features

**When OCI operations are needed:**
1. Use exact CLI commands from this skill's references
2. Do NOT guess metric namespace names
3. Do NOT assume AWS CloudWatch patterns work in OCI
4. Load reference files for detailed MQL documentation

**What you DO know:**
- General observability concepts
- Alerting and threshold design principles
- Log aggregation patterns

This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.
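
The "do NOT guess metric namespace names" rule can be enforced mechanically. The following is a minimal Python sketch, not part of the skill itself: `KNOWN_NAMESPACES` and `namespace_for` are hypothetical names introduced here, and the four namespaces are copied from this skill's "Metric Namespace Gotchas" table. The helper raises an error for any service it has no documented namespace for, rather than inventing one.

```python
# Hypothetical helper: look up a documented namespace or fail loudly.
# The entries mirror the "Metric Namespace Gotchas" table in this skill;
# any other service should be resolved from the skill's reference files.
KNOWN_NAMESPACES = {
    "compute": "oci_computeagent",
    "autonomous_db": "oci_autonomous_database",
    "load_balancer": "oci_lbaas",
    "object_storage": "oci_objectstorage",
}


def namespace_for(service: str) -> str:
    """Return the documented namespace for a service, never a guess."""
    # Normalize "Load Balancer" -> "load_balancer" before the lookup.
    key = service.strip().lower().replace(" ", "_")
    try:
        return KNOWN_NAMESPACES[key]
    except KeyError:
        raise LookupError(
            f"No documented namespace for {service!r}; check the reference files"
        ) from None


if __name__ == "__main__":
    print(namespace_for("Compute"))
```

Failing on an unknown service is a deliberate choice in this sketch: a guessed namespace tends to produce an empty chart rather than an obvious error, which is harder to diagnose.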

---

## NEVER Do This

❌ **NEVER assume metrics are instant (10-15 minute lag)**
- Metrics published every 1-5 minutes
- Processing delay: 5-10 minutes
- **Total lag**: 10-15 minutes from event to visible metric
- Don't debug "missing metrics" within first 15 minutes of resource creation

❌ **NEVER use `=` for alarm thresholds with sparse metrics**
```
# WRONG - alarm never fires if metric has gaps
MetricName[1m].mean() = 0

# RIGHT - handle missing data
MetricName[1m]{dataMissing=zero}.mean() > 0
```

❌ **NEVER forget metric dimensions (causes "no data")**
```
# WRONG - missing required dimension
CPUUtilization[1m].mean()

# RIGHT - include resourceId dimension
CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```

❌ **NEVER set alarm thresholds without trigger delay (alert fatigue)**
```
# BAD - fires on every CPU spike
CPUUtilization[1m].mean() > 80

# BETTER - sustained high CPU
CPUUtilization[5m].mean() > 80
Trigger delay: 5 minutes (fires after 5 consecutive breaches)
```

❌ **NEVER create alarms without notification channels**
```
# WRONG - alarm fires but nobody knows
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
```
Cost impact: Undetected outages cost $5,000-50,000/hour in production

❌ **NEVER ignore Cloud Guard findings (security audit failure)**
- Cloud Guard detects misconfigurations BEFORE they become incidents
- Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
- Cost impact: $100,000+ per security breach vs $0 for proactive remediation

## Metric Namespace Gotchas

**OCI Metrics Use Service-Specific Namespaces:**

| Service | Namespace | Example Metric |
|---------|-----------|----------------|
| Compute | `oci_computeagent` | `CPUUtilization`, `MemoryUtilization` |
| Autonomous DB | `oci_autonomous_database` | `CpuUtilization`, `StorageUtilization` |
| Load Balancer | `oci_lbaas` | `HttpRequests`, `UnHealthyBackendServers` |
| Object Storage | `oci_objectstorage` | `ObjectCount`, `BytesUploaded` |

**Common Mistake**: Using wrong namespace (`oci_compute` vs `oci