Troubleshooting Application Failures

Name: Troubleshooting Application Failures
Author: aws

aws/agent-toolkit-for-aws

Diagnose a failing AWS-hosted app by mining CloudWatch logs for errors, stack traces, and actionable fix steps without manual console hopping.

Overview

Troubleshooting Application Failures is an agent skill for the Operate phase that analyzes CloudWatch logs for a named AWS application and recommends fixes based on errors and stack traces.

Install

npx skills add https://github.com/aws/agent-toolkit-for-aws --skill troubleshooting-application-failures

What is this skill?

Structured SOP: collect application_name, region, and optional time_window_hours before any investigation
Discovers related CloudWatch log groups for the named application
Searches error patterns, stack traces, and exceptions in the lookback window
Requires call_aws in context and verifies tooling before execution
Delivers specific recommendations grounded in log findings
Three parameters: application_name and region (required), plus time_window_hours (optional, default 2)

Compatible agents: Claude Code, Cursor, Codex, and other compatible agents

Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

Your service is failing in AWS and you need a disciplined way to find the right log groups, surface exceptions in a time window, and get concrete next steps.

Who is it for?

Indie builders running named microservices or APIs in a known AWS region who want CloudWatch-centered triage through an agent with call_aws.

Skip if: your app is local-only with no CloudWatch logs, you cannot name the application or region, or your team needs live metric dashboards rather than a log-analysis SOP.

When should I use this skill?

A named AWS application is failing and you need CloudWatch log discovery plus error and stack-trace analysis in a configurable hour window.

What do I get? / Deliverables

You receive a log-backed diagnosis with error patterns analyzed and targeted recommendations so you can patch or escalate with evidence.

Log group inventory for the application
Error and exception analysis for the time window
Actionable remediation recommendations

Journey fit

Primary fit

Operate: Error tracking

Production incident response belongs on the Operate shelf because it assumes something is already deployed and broken. The Error tracking subphase matches log-driven failure analysis and remediation guidance rather than greenfield build or launch work.

Also useful

Operate: Monitoring & observability

How it compares

Use it instead of an unstructured “grep the logs” chat when you want a parameter-gated AWS SOP rather than a generic debugging brainstorm.

Common Questions / FAQ

Who is troubleshooting-application-failures for?

Solo and indie builders operating APIs or services on AWS who need agent-guided CloudWatch log analysis when an application is already in production.

When should I use troubleshooting-application-failures?

Use it in the Operate phase when a deployed app is failing and you already know application_name and region, or when you want the error window (2 hours by default, configurable) reviewed systematically.

Is troubleshooting-application-failures safe to install?

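It is designed to verify that call_aws exists before running any investigation steps, and the commands shown in the SOP excerpt on this page are read-only CloudWatch Logs calls (describe-log-groups, describe-log-streams). You should still review the Security Audits panel on this page before granting AWS tool access in your agent.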

SKILL.md


# Application Failure Troubleshooting

## Overview

This SOP provides comprehensive troubleshooting for failing applications through CloudWatch log analysis. It discovers log groups related to the application name, searches for error patterns, analyzes stack traces and exceptions, and provides specific recommendations based on the findings in the logs.

## Parameters

Prompt the user in a single message to provide all required parameters at once. Clearly list the required parameters and their descriptions, and include any optional parameters with their default values. Do not proceed until you have received and confirmed all required parameters. If any required parameter is missing or unclear, you MUST explicitly request the missing information before moving forward.

- **application_name** (required): The name of the failing application (e.g., "user-api", "payment-service", "web-app")
- **region** (required): The AWS region where the application is deployed
- **time_window_hours** (optional, default: 2): Number of hours to look back for analysis (e.g., 1, 2, 4, 8, 12, 24)

Only proceed to the steps below if you have all required information.
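
The parameter gate above can be sketched in a few lines. This is an illustrative sketch only, not part of the skill: the names `TroubleshootingParams`, `missing_required`, and `build_params` are hypothetical, and it assumes the user's answers arrive as a plain dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_TIME_WINDOW_HOURS = 2  # default stated in the Parameters list


@dataclass(frozen=True)
class TroubleshootingParams:
    """Hypothetical container for the three SOP parameters."""
    application_name: str
    region: str
    time_window_hours: int = DEFAULT_TIME_WINDOW_HOURS


def missing_required(raw: dict) -> list[str]:
    """Return the names of required parameters that are absent or blank."""
    required = ("application_name", "region")
    return [name for name in required if not str(raw.get(name) or "").strip()]


def build_params(raw: dict) -> TroubleshootingParams:
    """Validate the answers; raise ValueError naming what to ask the user for."""
    missing = missing_required(raw)
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    hours = raw.get("time_window_hours")
    hours = DEFAULT_TIME_WINDOW_HOURS if hours is None else int(hours)
    if hours <= 0:
        raise ValueError("time_window_hours must be a positive integer")
    return TroubleshootingParams(
        raw["application_name"].strip(), raw["region"].strip(), hours
    )
```

Calling `build_params({"application_name": "user-api", "region": "us-east-1"})` yields a two-hour window, while a blank or absent required value raises an error that lists exactly which parameters still need to be requested.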

## Steps

### 1. Verify Dependencies

Check for required tools and warn the user if any are missing.

**Constraints:**

- You MUST verify the following tools are available in your context:
  - call_aws
- You MUST ONLY check for tool existence and MUST NOT attempt to run the tools, because running tools during verification could cause unintended side effects, consume resources unnecessarily, or trigger actions before the user is ready
- You MUST inform the user about any missing tools with a clear message
- You MUST ask if the user wants to proceed anyway despite missing tools
- You MUST respect the user's decision to proceed or abort
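
As a rough illustration of an existence-only check, the sketch below compares tool names without invoking anything. It assumes the agent runtime can supply a list of available tool names; that list, and the helper names `find_missing_tools` and `missing_tools_message`, are hypothetical.

```python
REQUIRED_TOOLS = ("call_aws",)  # the only tool the SOP lists


def find_missing_tools(available_tool_names):
    """Compare names only; nothing is executed during verification."""
    available = set(available_tool_names)
    return [tool for tool in REQUIRED_TOOLS if tool not in available]


def missing_tools_message(missing):
    """Build the warning shown to the user, or None when nothing is missing."""
    if not missing:
        return None
    return (
        "The following required tools are not available: "
        + ", ".join(missing)
        + ". Do you want to proceed anyway?"
    )
```

The decision to continue or abort stays with the user; the sketch only produces the question.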

### 2. Discover Relevant Log Groups

Search for CloudWatch log groups that are related to the application name.

**Constraints:**

- You MUST search for log groups that contain the application name using: `aws logs describe-log-groups --region ${region}`
- You MUST filter the results to find log groups that contain the application_name in their log group name
- You MUST also search for common AWS service log group patterns that might be related:
  - `/aws/lambda/*${application_name}*`
  - `/aws/apigateway/*${application_name}*`
  - `/aws/ecs/*${application_name}*`
  - `/aws/applicationelb/*${application_name}*`
  - `*${application_name}*` (custom application log groups)
- You MUST present all discovered log groups to the user and ask them to confirm which ones are relevant to the application
- You MUST save the confirmed log groups for analysis
- If no relevant log groups are found, you MUST ask the user to provide specific log group names manually
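
The name filtering described above could look like the following sketch, applied to log group names already returned by `describe-log-groups`. The function name `match_log_groups` is hypothetical, case-insensitive matching is an assumption made here to widen discovery, and the sketch assumes the application name contains no glob metacharacters such as `[` or `?`.

```python
from fnmatch import fnmatchcase

# Patterns from the SOP, most specific first; {app} stands in for ${application_name}.
SERVICE_PATTERNS = (
    "/aws/lambda/*{app}*",
    "/aws/apigateway/*{app}*",
    "/aws/ecs/*{app}*",
    "/aws/applicationelb/*{app}*",
    "*{app}*",  # custom application log groups
)


def match_log_groups(log_group_names, application_name):
    """Map each matching log group name to the first pattern it satisfies."""
    app = application_name.lower()
    patterns = [pattern.format(app=app) for pattern in SERVICE_PATTERNS]
    matches = {}
    for name in log_group_names:
        lowered = name.lower()  # assumption: compare case-insensitively
        for pattern in patterns:
            if fnmatchcase(lowered, pattern):
                matches[name] = pattern
                break
    return matches
```

An empty result corresponds to the "no relevant log groups" case, where the user is asked for log group names directly. Any matches are candidates only and still need the user's confirmation.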

### 3. Validate Log Groups and Check Availability

Verify the selected log groups exist and determine the available time range for analysis.

**Constraints:**

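- You MUST validate each confirmed log group using: `aws logs describe-log-groups --log-group-name-prefix ${log_group_name} --region ${region}` (the prefix filter can return other groups that share the prefix, so confirm an exact name match in the results)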
- You MUST list available log streams for each log group: `aws logs describe-log-streams --log-group-name ${log_group_name} --order-by LastEventTime --descending --max-items 10 --region ${region}`
- You MUST verify that log streams exist before attempting any log queries
- You MUST calculate the effective time range based on the log group's retention setting (`retentionInDays`) and creation time (`creationTime`)
- You MUST extract the `lastEventTimestamp` from log streams to determine the most recent activity
- You MUST inform the user if any log groups are empty or have no recent activity
- You MUST inform the user if the requested time window exceeds available log data
- You MUST adjust the analysis time window to fit within the available log data range
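
One way to reason about the window adjustment is sketched below. It assumes the inputs were already read from the CLI output: `creationTime` and `retentionInDays` from `describe-log-groups`, and the newest `lastEventTimestamp` from `describe-log-streams`, all in epoch milliseconds. The names `Window` and `effective_window` are hypothetical, and returning `None` for "nothing to analyze" is a design choice of this sketch, not something the SOP prescribes.

```python
from typing import NamedTuple, Optional

MS_PER_HOUR = 3_600_000
MS_PER_DAY = 24 * MS_PER_HOUR


class Window(NamedTuple):
    start_ms: int
    end_ms: int
    truncated: bool  # True when the requested window had to be shortened


def effective_window(
    now_ms: int,
    time_window_hours: int,
    creation_time_ms: int,
    retention_in_days: Optional[int] = None,
    last_event_ms: Optional[int] = None,
) -> Optional[Window]:
    """Clamp the requested lookback to the data the log group can still hold."""
    requested_start = now_ms - time_window_hours * MS_PER_HOUR
    earliest = creation_time_ms
    if retention_in_days is not None:  # absent means logs never expire
        earliest = max(earliest, now_ms - retention_in_days * MS_PER_DAY)
    start = max(requested_start, earliest)
    # No streams at all, or the newest event predates the window: nothing to query.
    if last_event_ms is None or last_event_ms < start:
        return None
    return Window(start, now_ms, truncated=start > requested_start)
```

A `None` result maps to telling the user that the log group is empty or has no recent activity, and `truncated=True` maps to telling them that the requested window exceeds the available data.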

### 4. Analyze Application Logs

Search CloudWatch logs for error patterns and failure indicators.

**Constraint

