AWS Observability

Name: AWS Observability
Author: aws

aws/agent-toolkit-for-aws

Drop in CDK-ready CloudWatch alarm and dashboard patterns for Lambda with M-of-N evaluation and error-rate math instead of brittle defaults.

Overview

AWS Observability is an agent skill for the Operate phase that applies best-practice CloudWatch alarm and Lambda monitoring patterns in AWS CDK.

Install

npx skills add https://github.com/aws/agent-toolkit-for-aws --skill aws-observability

What is this skill?

createLambdaMonitoring helper that documents its best-practice defaults alongside the common mistakes they replace
evaluationPeriods: 3 (not 1) and datapointsToAlarm: 2 for M-of-N alarm stability
treatMissingData: NOT_BREACHING and a 60s period for faster detection with fewer false positives
Error-rate alarm via MathExpression (percentage) with a configurable threshold, default 5%
Duration alarm on p99 with a default 3000 ms threshold
Throttle alarm, composite alarm with an SNS action, and dashboard widget patterns
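
The M-of-N defaults listed above can be illustrated with a small model. This is a simplified sketch, not CloudWatch's actual evaluator: `isAlarming` and the `Datapoint` type are hypothetical names, and missing datapoints are counted as within threshold to mirror NOT_BREACHING.

```typescript
// Hypothetical, simplified model of M-of-N alarm evaluation.
// undefined stands for a period in which no datapoint was reported.
type Datapoint = number | undefined;

function isAlarming(
  datapoints: Datapoint[], // oldest first
  threshold: number,
  evaluationPeriods = 3,   // N: how many recent periods are inspected
  datapointsToAlarm = 2,   // M: how many of them must breach
): boolean {
  const recent = datapoints.slice(-evaluationPeriods);
  // Assumption: a missing datapoint never counts as breaching (NOT_BREACHING).
  const breaching = recent.filter((d) => d !== undefined && d > threshold).length;
  return breaching >= datapointsToAlarm;
}

const oneSpike = isAlarming([1, 9, 2], 5);                 // one breach in three periods
const sustained = isAlarming([9, 2, 8], 5);                // two breaches in three periods
const sparse = isAlarming([undefined, 9, undefined], 5);   // one breach surrounded by gaps
const onePeriodRule = isAlarming([1, 9], 5, 1, 1);         // 1-of-1 rule sees only the spike

console.log({ oneSpike, sustained, sparse, onePeriodRule });
```

In this model a single spike does not trip the 2-of-3 rule, while the same spike trips a 1-of-1 rule, which is the flapping behavior the defaults are meant to avoid.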

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 1.5k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

Your Lambda CDK stacks use one-period alarms and average duration metrics that page too often or miss real error-rate regressions.
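
To make the average-versus-tail point concrete, here is a minimal sketch with made-up durations. The `mean` and `percentile` helpers are hypothetical, and the nearest-rank percentile used here is an assumption that may differ in detail from how CloudWatch computes p99.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Nearest-rank percentile over a sorted copy of the samples.
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative data: 98 invocations at 100 ms and 2 slow ones at 6000 ms.
const durationsMs: number[] = [
  ...new Array<number>(98).fill(100),
  ...new Array<number>(2).fill(6000),
];

const thresholdMs = 3000; // the skill's default duration threshold
const avgMs = mean(durationsMs);
const p99Ms = percentile(durationsMs, 99);

console.log({ avgMs, p99Ms, avgBreaches: avgMs > thresholdMs, p99Breaches: p99Ms > thresholdMs });
```

With this data the average stays far below the 3000 ms threshold while p99 exceeds it, so an alarm on Average would stay silent about the slow requests.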

Who is it for?

Indie SaaS or API builders on AWS CDK who need production-grade CloudWatch defaults without re-reading every alarm operator doc.

Skip if: Non-AWS stacks, teams wanting only ad-hoc console alarms, or observability purely outside CloudWatch (e.g. only Datadog with no CDK parity).

When should I use this skill?

When implementing or reviewing AWS CDK CloudWatch alarms, Lambda monitoring dashboards, or SNS-backed alerting with production-grade defaults.

What do I get? / Deliverables

You implement Lambda monitoring with M-of-N evaluation, math-expression error rates, p99 duration alarms, and SNS-ready constructs aligned to AWS toolkit guidance.

CDK construct patterns for Lambda error-rate and duration alarms plus dashboard widgets
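
The error-rate math can be sketched outside CDK as a plain function. `errorRatePercent` is a hypothetical helper that mirrors the skill's `IF(invocations > 0, errors * 100 / invocations, 0)` expression; the traffic numbers are invented for illustration.

```typescript
// Percentage of invocations that failed; 0 when there was no traffic,
// which avoids a division by zero in idle periods.
function errorRatePercent(errors: number, invocations: number): number {
  return invocations > 0 ? (errors * 100) / invocations : 0;
}

const thresholdPercent = 5; // the skill's default error-rate threshold

const busy = errorRatePercent(5, 10000); // 5 errors during heavy traffic
const quiet = errorRatePercent(5, 50);   // the same 5 errors during light traffic
const idle = errorRatePercent(0, 0);     // no invocations at all

console.log({
  busy,
  quiet,
  idle,
  busyBreaches: busy > thresholdPercent,
  quietBreaches: quiet > thresholdPercent,
});
```

The raw error count is identical in the first two cases, but only the light-traffic case exceeds 5%, which is why a percentage is a more stable alarm input than the Errors count alone.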

Journey fit

Primary fit

Operate: Monitoring & observability

Observability is canonical in Operate when production services need alarms, dashboards, and on-call signal—not during initial ideation. Monitoring is the shelf for CloudWatch metrics, composite alarms, SNS actions, and Lambda SLO-style thresholds.

Also useful

Ship: CI/CD & deploy

How it compares

Infrastructure-as-code alarm recipes—not a hosted APM product or passive log viewer.

Common Questions / FAQ

Who is aws-observability for?

Solo and small-team developers shipping Lambda on AWS CDK who want copy-paste monitoring constructs with sensible defaults.

When should I use aws-observability?

Use it in Operate when you add or refactor production monitoring—CloudWatch alarms, dashboards, and SNS notifications for Lambda after ship.

Is aws-observability safe to install?

Check the Security Audits panel on this page; CDK changes affect live AWS accounts—review IAM, SNS topics, and alarm actions before deploy.

SKILL.md


// Best-practice CloudWatch alarm patterns for CDK

import {
  Alarm, CompositeAlarm, AlarmRule, AlarmState,
  ComparisonOperator, MathExpression, TreatMissingData,
  Dashboard, AlarmWidget, GraphWidget, TextWidget, PeriodOverride,
} from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';
import { Duration } from 'aws-cdk-lib';
import { IFunction } from 'aws-cdk-lib/aws-lambda';
import { ITopic } from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

/**
 * Create Lambda monitoring with best-practice defaults.
 *
 * Best-practice defaults (vs common defaults):
 * - evaluationPeriods: 3 (not 1) — reduces false positives
 * - datapointsToAlarm: 2 (not 1) — M-of-N prevents flapping
 * - treatMissingData: NOT_BREACHING (not MISSING) — absence of errors = OK
 * - period: 60s (not 300s) — faster detection
 * - error rate uses math expression (not raw Errors count)
 * - duration uses p99 (not Average)
 */
export function createLambdaMonitoring(
  scope: Construct,
  fn: IFunction,
  snsTopic: ITopic,
  options?: {
    errorRateThreshold?: number;  // default: 5 (percent)
    durationThresholdMs?: number; // default: 3000 (ms)
  },
) {
  const errorRateThreshold = options?.errorRateThreshold ?? 5;
  const durationThreshold = options?.durationThresholdMs ?? 3000;

  // Error rate alarm (percentage via math expression)
  const errorRateAlarm = new Alarm(scope, 'ErrorRateAlarm', {
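    metric: new MathExpression({
      expression: 'IF(invocations > 0, errors * 100 / invocations, 0)',
      usingMetrics: {
        errors: fn.metricErrors({ period: Duration.minutes(1) }),
        invocations: fn.metricInvocations({ period: Duration.minutes(1) }),
      },
      // MathExpression defaults to a 5-minute period that overrides the
      // periods of its input metrics, so set 60s explicitly here.
      period: Duration.minutes(1),
    }),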
    threshold: errorRateThreshold,
    evaluationPeriods: 3,
    datapointsToAlarm: 2,
    comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

  // Duration alarm (p99, not average)
  const durationAlarm = new Alarm(scope, 'DurationP99Alarm', {
    metric: fn.metricDuration({
      statistic: 'p99',
      period: Duration.minutes(1),
    }),
    threshold: durationThreshold,
    evaluationPeriods: 3,
    datapointsToAlarm: 2,
    comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

  // Throttle alarm
  const throttleAlarm = new Alarm(scope, 'ThrottleAlarm', {
    metric: fn.metricThrottles({ period: Duration.minutes(1) }),
    threshold: 1,
    evaluationPeriods: 3,
    datapointsToAlarm: 2,
    comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });

  // Composite alarm — only page when service is unhealthy
  const serviceHealthAlarm = new CompositeAlarm(scope, 'ServiceHealthAlarm', {
    alarmRule: AlarmRule.anyOf(
      AlarmRule.fromAlarm(errorRateAlarm, AlarmState.ALARM),
      AlarmRule.fromAlarm(durationAlarm, AlarmState.ALARM),
      AlarmRule.fromAlarm(throttleAlarm, AlarmState.ALARM),
    ),
  });
  serviceHealthAlarm.addAlarmAction(new SnsAction(snsTopic));

  // Dashboard
  const dashboard = new Dashboard(scope, 'ServiceDashboard', {
    start: '-PT8H',
    periodOverride: PeriodOverride.INHERIT,
  });
  dashboard.addWidgets(
    new TextWidget({ width: 24, height: 1, markdown: '# Service Health' }),
    new AlarmWidget({ width: 8, height: 6, title: 'Error Rate', alarm: errorRateAlarm }),
    new AlarmWidget({ width: 8, height: 6, title: 'Duration P99', alarm: durationAlarm }),
    new AlarmWidget({ width: 8, height: 6, title: 'Throttles', alarm: throttleAlarm }),
    new GraphWidget({
      width: 24, height: 6,
      title: 'Invocations & Errors',
      left: [fn.metricInvocations({ period: Duration.minutes(1) })],
      right: [fn.metricErrors({ period: Duration.minutes(1) })],
    }),
  );

  return { errorRateAlarm, durationAlarm, throttleAlarm, serviceHealthAlarm, dashboard };
}


# ADOT


