
Troubleshooting Application Failures
Diagnose a failing AWS-hosted app by mining CloudWatch logs for errors, stack traces, and actionable fix steps without manual console hopping.
Overview
Troubleshooting Application Failures is an agent skill for the Operate phase that analyzes CloudWatch logs for a named AWS application and recommends fixes based on errors and stack traces.
Install
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill troubleshooting-application-failures
What is this skill?
- Structured SOP: collect application_name, region, and optional time_window_hours before any investigation
- Discovers related CloudWatch log groups for the named application
- Searches error patterns, stack traces, and exceptions in the lookback window
- Requires call_aws in context and verifies tooling before execution
- Delivers specific recommendations grounded in log findings
- Three parameters: application_name and region (required), plus time_window_hours (optional, default 2)
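As a rough illustration of the parameter gate described in the list above, the sketch below models the three parameters and turns the lookback window into a start/end pair. The class name `TroubleshootingParams` and the method `window_ms` are hypothetical and not part of the skill; the use of epoch milliseconds is an assumption based on the unit CloudWatch Logs uses for event timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TroubleshootingParams:
    """Hypothetical container for the skill's three parameters."""

    application_name: str
    region: str
    time_window_hours: int = 2  # default taken from the parameter list

    def __post_init__(self):
        # Both required parameters must be non-empty before any investigation.
        if not self.application_name.strip():
            raise ValueError("application_name is required")
        if not self.region.strip():
            raise ValueError("region is required")
        if self.time_window_hours <= 0:
            raise ValueError("time_window_hours must be positive")

    def window_ms(self, now=None):
        """Return (start_ms, end_ms) for the lookback window in epoch milliseconds."""
        end = now or datetime.now(timezone.utc)
        start = end - timedelta(hours=self.time_window_hours)
        return int(start.timestamp() * 1000), int(end.timestamp() * 1000)
```

For example, `TroubleshootingParams("user-api", "us-east-1").window_ms()` would cover the last two hours, while an empty application name raises `ValueError` before anything else runs.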
Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your service is failing in AWS and you need a disciplined way to find the right log groups, surface exceptions in a time window, and get concrete next steps.
Who is it for?
Indie builders running named microservices or APIs in a known AWS region who want CloudWatch-centered triage through an agent with call_aws.
Skip if: Local-only apps with no CloudWatch logs, incidents where you cannot name the application or region, or teams that need live metric dashboards instead of log SOPs.
When should I use this skill?
A named AWS application is failing and you need CloudWatch log discovery plus error and stack-trace analysis in a configurable hour window.
What do I get? / Deliverables
You receive a log-backed diagnosis with error patterns analyzed and targeted recommendations so you can patch or escalate with evidence.
- Log group inventory for the application
- Error and exception analysis for the time window
- Actionable remediation recommendations
Journey fit
Production incident response belongs on the Operate shelf because it assumes something is already deployed and broken. The Errors subphase fits because the skill does log-driven failure analysis and remediation guidance rather than greenfield build or launch work.
How it compares
Use instead of unstructured “grep the logs” chat when you want a parameter-gated AWS SOP rather than a generic debugging brainstorm.
Common Questions / FAQ
Who is troubleshooting-application-failures for?
Solo and indie builders operating APIs or services on AWS who need agent-guided CloudWatch log analysis when an application is already in production.
When should I use troubleshooting-application-failures?
Use it in Operate when a deployed app is failing and you already know application_name and region, or when you need an error window (2 hours by default, configurable) reviewed systematically.
Is troubleshooting-application-failures safe to install?
It is designed to verify that call_aws exists, without running it, before any investigation step begins, but you should review the Security Audits panel on this page before granting AWS tool access in your agent.
SKILL.md
SKILL.md - Troubleshooting Application Failures
# Application Failure Troubleshooting

## Overview

This SOP provides comprehensive troubleshooting for failing applications through CloudWatch log analysis. It discovers log groups related to the application name, searches for error patterns, analyzes stack traces and exceptions, and provides specific recommendations based on the findings in the logs.

## Parameters

Prompt the user in a single message to provide all required parameters at once. Clearly list the required parameters and their descriptions, and include any optional parameters with their default values. Do not proceed until you have received and confirmed all required parameters. If any required parameter is missing or unclear, you MUST explicitly request the missing information before moving forward.

- **application_name** (required): The name of the failing application (e.g., "user-api", "payment-service", "web-app")
- **region** (required): The AWS region where the application is deployed
- **time_window_hours** (optional, default: 2): Number of hours to look back for analysis (e.g., 1, 2, 4, 8, 12, 24)

Only proceed to the steps below if you have all required information.

## Steps

### 1. Verify Dependencies

Check for required tools and warn the user if any are missing.

**Constraints:**

- You MUST verify the following tools are available in your context:
  - call_aws
- You MUST ONLY check for tool existence and MUST NOT attempt to run the tools because running tools during verification could cause unintended side effects, consume resources unnecessarily, or trigger actions before the user is ready
- You MUST inform the user about any missing tools with a clear message
- You MUST ask if the user wants to proceed anyway despite missing tools
- You MUST respect the user's decision to proceed or abort

### 2. Discover Relevant Log Groups

Search for CloudWatch log groups that are related to the application name.
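
The name matching this step calls for can be pictured with a minimal sketch. It assumes the log group names have already been fetched (for instance with `aws logs describe-log-groups --region ${region}`) and reuses the service prefix patterns from this step's constraints. The function name `match_log_groups`, the `{app}` placeholder, and the choice of case-sensitive glob matching are illustrative assumptions, not something the SOP prescribes.

```python
from fnmatch import fnmatchcase

# Patterns mirroring this step's constraints; "{app}" stands in for application_name.
# Order matters: service-specific prefixes are tried before the catch-all.
PATTERNS = [
    "/aws/lambda/*{app}*",
    "/aws/apigateway/*{app}*",
    "/aws/ecs/*{app}*",
    "/aws/applicationelb/*{app}*",
    "*{app}*",  # custom application log groups
]


def match_log_groups(log_group_names, application_name):
    """Map each matching log group name to the first pattern it satisfied.

    Assumption: application_name contains no glob metacharacters (*, ?, [).
    """
    matches = {}
    for name in log_group_names:
        for template in PATTERNS:
            if fnmatchcase(name, template.format(app=application_name)):
                matches[name] = template
                break
    return matches
```

Recording which pattern matched gives the user some context when they are asked to confirm the relevant log groups; an empty result corresponds to the case where the user has to supply log group names manually.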

**Constraints:**

- You MUST search for log groups that contain the application name using: `aws logs describe-log-groups --region ${region}`
- You MUST filter the results to find log groups that contain the application_name in their log group name
- You MUST also search for common AWS service log group patterns that might be related:
  - `/aws/lambda/*${application_name}*`
  - `/aws/apigateway/*${application_name}*`
  - `/aws/ecs/*${application_name}*`
  - `/aws/applicationelb/*${application_name}*`
  - `*${application_name}*` (custom application log groups)
- You MUST present all discovered log groups to the user and ask them to confirm which ones are relevant to the application
- You MUST handle cases where no log groups are found and ask the user to provide specific log group names
- You MUST save the confirmed log groups for analysis
- If no relevant log groups are found, You MUST ask the user to provide specific log group names manually

### 3. Validate Log Groups and Check Availability

Verify the selected log groups exist and determine the available time range for analysis.

**Constraints:**

- You MUST validate each confirmed log group using: `aws logs describe-log-groups --log-group-name-prefix ${log_group_name} --region ${region}`
- You MUST list available log streams for each log group: `aws logs describe-log-streams --log-group-name ${log_group_name} --order-by LastEventTime --descending --max-items 10 --region ${region}`
- You MUST verify that log streams exist before attempting any log queries
- You MUST calculate the effective time range based on log retention and creation time
- You MUST extract the `lastEventTimestamp` from log streams to determine the most recent activity
- You MUST inform the user if any log groups are empty or have no recent activity
- You MUST inform the user if the requested time window exceeds available log data
- You MUST adjust the analysis time window to fit within the available log data range

### 4. Analyze Application Logs

Search CloudWatch logs for error patterns and failure indicators.

**Constraint