
AWS Observability
Drop in CDK-ready CloudWatch alarm and dashboard patterns for Lambda with M-of-N evaluation and error-rate math instead of brittle defaults.
Overview
AWS Observability is an agent skill for the Operate phase that applies best-practice CloudWatch alarm and Lambda monitoring patterns in AWS CDK.
Install
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill aws-observability
What is this skill?
- createLambdaMonitoring helper with documented best-practice defaults vs common mistakes
- evaluationPeriods: 3 and datapointsToAlarm: 2 for M-of-N alarm stability
- treatMissingData: NOT_BREACHING to avoid false positives when no data arrives, and a 60s period for faster detection
- Error-rate alarm via MathExpression (percentage) with configurable threshold default 5%
- Duration alarm on p99 with default 3000 ms threshold; SNS and dashboard widget patterns
- Default evaluationPeriods: 3 (not 1)
- Default datapointsToAlarm: 2 for M-of-N
- Default error rate threshold: 5 percent
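The M-of-N defaults above can be illustrated with a small simulation. This is a simplified sketch, not CloudWatch's actual evaluation algorithm (which has additional rules for missing data); `isInAlarm` and the sample series are hypothetical names invented for illustration.

```typescript
// Simplified model of M-of-N alarm evaluation (an assumption for illustration,
// not the real CloudWatch implementation).
type Datapoint = number | null; // null stands for a period with no data

function isInAlarm(
  datapoints: Datapoint[],
  threshold: number,
  evaluationPeriods = 3, // N: how many recent periods to inspect
  datapointsToAlarm = 2, // M: how many of them must breach
): boolean {
  // Look only at the most recent N periods.
  const recent = datapoints.slice(-evaluationPeriods);
  // Missing periods count as not breaching, mirroring NOT_BREACHING.
  const breaching = recent.filter((d) => d !== null && d > threshold).length;
  return breaching >= datapointsToAlarm;
}

const spike: Datapoint[] = [1, 2, 9]; // one bad minute
const sustained: Datapoint[] = [1, 9, 8]; // two bad minutes out of three
const quiet: Datapoint[] = [9, null, null]; // one bad minute, then no traffic

console.log(isInAlarm(spike, 5)); // false: 1 of 3 breaching
console.log(isInAlarm(spike, 5, 1, 1)); // true: a 1-of-1 alarm fires on the spike
console.log(isInAlarm(sustained, 5)); // true: 2 of 3 breaching
console.log(isInAlarm(quiet, 5)); // false: missing data is not breaching
```

With 2-of-3, a single noisy minute does not page anyone, while a problem that persists across two minutes still does.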
Adoption & trust: 1.5k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your Lambda CDK stacks use one-period alarms and average duration metrics that page too often or miss real error-rate regressions.
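To see why an average duration metric can miss a real problem, consider the following sketch. It uses a nearest-rank percentile, which is an assumption chosen for simplicity and may differ from how CloudWatch computes percentile statistics; the sample durations are made up.

```typescript
// Average vs. p99 on a made-up sample of Lambda durations (milliseconds).
function average(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// 98 fast invocations and 2 very slow ones.
const durationsMs: number[] = [
  ...new Array<number>(98).fill(100),
  ...new Array<number>(2).fill(5000),
];

const thresholdMs = 3000; // the skill's default duration threshold

console.log(average(durationsMs)); // 198
console.log(percentile(durationsMs, 99)); // 5000
console.log(average(durationsMs) > thresholdMs); // false: the average hides the slow tail
console.log(percentile(durationsMs, 99) > thresholdMs); // true: p99 exposes it
```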
Who is it for?
Indie SaaS or API builders on AWS CDK who need production-grade CloudWatch defaults without re-reading every alarm operator doc.
Skip if: Non-AWS stacks, teams wanting only ad-hoc console alarms, or observability purely outside CloudWatch (e.g. only Datadog with no CDK parity).
When should I use this skill?
When implementing or reviewing AWS CDK CloudWatch alarms, Lambda monitoring dashboards, or SNS-backed alerting with production-grade defaults.
What do I get? / Deliverables
You implement Lambda monitoring with M-of-N evaluation, math-expression error rates, p99 duration alarms, and SNS-ready constructs aligned to AWS toolkit guidance.
- CDK construct patterns for Lambda error-rate and duration alarms plus dashboard widgets
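The error-rate alarm relies on the metric math expression `IF(invocations > 0, errors * 100 / invocations, 0)`. The sketch below mirrors that arithmetic in plain TypeScript to show why a percentage scales with traffic where a raw error count does not; `errorRatePercent` is a hypothetical helper and the traffic figures are invented.

```typescript
// Plain-TypeScript mirror of the metric math expression used by the error-rate alarm.
function errorRatePercent(errors: number, invocations: number): number {
  // Guard against division by zero when there was no traffic in the period.
  return invocations > 0 ? (errors * 100) / invocations : 0;
}

const thresholdPercent = 5; // the skill's default error-rate threshold

// The same 10 errors mean very different things at different traffic levels.
console.log(errorRatePercent(10, 10000)); // 0.1: healthy at high volume
console.log(errorRatePercent(10, 50)); // 20: a real regression at low volume
console.log(errorRatePercent(0, 0)); // 0: no traffic, no breach

console.log(errorRatePercent(10, 10000) > thresholdPercent); // false
console.log(errorRatePercent(10, 50) > thresholdPercent); // true
```

An alarm on the raw `Errors` count would treat both 10-error cases identically, which is the behavior the percentage form is meant to avoid.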
Journey fit
Observability is canonical in Operate when production services need alarms, dashboards, and on-call signal—not during initial ideation. Monitoring is the shelf for CloudWatch metrics, composite alarms, SNS actions, and Lambda SLO-style thresholds.
How it compares
Infrastructure-as-code alarm recipes—not a hosted APM product or passive log viewer.
Common Questions / FAQ
Who is aws-observability for?
Solo and small-team developers shipping Lambda on AWS CDK who want copy-paste monitoring constructs with sensible defaults.
When should I use aws-observability?
Use it in Operate when you add or refactor production monitoring—CloudWatch alarms, dashboards, and SNS notifications for Lambda after ship.
Is aws-observability safe to install?
Check the Security Audits panel on this page; CDK changes affect live AWS accounts—review IAM, SNS topics, and alarm actions before deploy.
SKILL.md
// Best-practice CloudWatch alarm patterns for CDK
import {
  Alarm,
  CompositeAlarm,
  AlarmRule,
  AlarmState,
  ComparisonOperator,
  MathExpression,
  TreatMissingData,
  Dashboard,
  AlarmWidget,
  GraphWidget,
  TextWidget,
  PeriodOverride,
} from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';
import { Duration } from 'aws-cdk-lib';
import { IFunction } from 'aws-cdk-lib/aws-lambda';
import { ITopic } from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

/**
 * Create Lambda monitoring with best-practice defaults.
 *
 * Best-practice defaults (vs common defaults):
 * - evaluationPeriods: 3 (not 1): reduces false positives
 * - datapointsToAlarm: 2 (not 1): M-of-N prevents flapping
 * - treatMissingData: NOT_BREACHING (not MISSING): absence of errors = OK
 * - period: 60s (not 300s): faster detection
 * - error rate uses math expression (not raw Errors count)
 * - duration uses p99 (not Average)
 */
export function createLambdaMonitoring(
  scope: Construct,
  fn: IFunction,
  snsTopic: ITopic,
  options?: {
    errorRateThreshold?: number; // default: 5 (percent)
    durationThresholdMs?: number; // default: 3000 (ms)
  },
) {
  const errorRateThreshold = options?.errorRateThreshold ?? 5;
  const durationThreshold = options?.durationThresholdMs ?? 3000;

  // Error rate alarm (percentage via math expression)
  const errorRateAlarm = new Alarm(scope, 'ErrorRateAlarm', {
    metric: new MathExpression({
      expression: 'IF(invocations > 0, errors * 100 / invocations, 0)',
      usingMetrics: {
        errors: fn.metricErrors({ period: Duration.minutes(1) }),
        invocations: fn.metricInvocations({ period: Duration.minutes(1) }),
      },
      // Set the period on the expression as well: MathExpression otherwise
      // defaults to 5 minutes and applies that period to the metrics it uses.
      period: Duration.minutes(1),
    }),
    threshold: errorRateThreshold,
    evaluationPeriods: 3,
    datapointsToAlarm: 2,
    comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

  // Duration alarm (p99, not average)
  const durationAlarm = new Alarm(scope, 'DurationP99Alarm', {
    metric: fn.metricDuration({
      statistic: 'p99',
      period: Duration.minutes(1),
    }),
    threshold: durationThreshold,
    evaluationPeriods: 3,
    datapointsToAlarm: 2,
    comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

  // Throttle alarm
  const throttleAlarm = new Alarm(scope, 'ThrottleAlarm', {
    metric: fn.metricThrottles({ period: Duration.minutes(1) }),
    threshold: 1,
    evaluationPeriods: 3,
    datapointsToAlarm: 2,
    comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

  // Composite alarm: only page when the service is unhealthy
  const serviceHealthAlarm = new CompositeAlarm(scope, 'ServiceHealthAlarm', {
    alarmRule: AlarmRule.anyOf(
      AlarmRule.fromAlarm(errorRateAlarm, AlarmState.ALARM),
      AlarmRule.fromAlarm(durationAlarm, AlarmState.ALARM),
      AlarmRule.fromAlarm(throttleAlarm, AlarmState.ALARM),
    ),
  });
  serviceHealthAlarm.addAlarmAction(new SnsAction(snsTopic));

  // Dashboard
  const dashboard = new Dashboard(scope, 'ServiceDashboard', {
    start: '-PT8H',
    periodOverride: PeriodOverride.INHERIT,
  });
  dashboard.addWidgets(
    new TextWidget({ width: 24, height: 1, markdown: '# Service Health' }),
    new AlarmWidget({ width: 8, height: 6, title: 'Error Rate', alarm: errorRateAlarm }),
    new AlarmWidget({ width: 8, height: 6, title: 'Duration P99', alarm: durationAlarm }),
    new AlarmWidget({ width: 8, height: 6, title: 'Throttles', alarm: throttleAlarm }),
    new GraphWidget({
      width: 24,
      height: 6,
      title: 'Invocations & Errors',
      left: [fn.metricInvocations({ period: Duration.minutes(1) })],
      right: [fn.metricErrors({ period: Duration.minutes(1) })],
    }),
  );

  return { errorRateAlarm, durationAlarm, throttleAlarm, serviceHealthAlarm, dashboard };
}

# ADOT