Getting Started with Incidents

Overview

This guide introduces AppStatus Incidents and covers:

Automatic incident creation from monitor failures
Manual incident reporting with structured fields
AI-powered analysis (Gemini) with root cause hints and remediation steps
Incident lifecycle: ongoing → acknowledged → resolved
Email and chat broadcast for team communication
Impact classification: low, medium, high, critical

What are AppStatus Incidents?

Incidents centralize the detection, triage, communication, and resolution of service disruptions. When a monitor detects a failure, AppStatus automatically creates an incident with locked-transaction safety (PostgreSQL advisory locks) to prevent duplicate incidents for the same monitor.
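The duplicate-prevention idea can be sketched as follows. This is a minimal illustration, not AppStatus's actual code: the `incidents` table, its columns, the `'ongoing'` status value, and both function names are assumptions made for the example. Only `pg_advisory_xact_lock` is a real PostgreSQL function, and it takes a 64-bit integer key, so the sketch derives one from the monitor ID.

```python
import hashlib


def advisory_lock_key(monitor_id: str) -> int:
    """Map a monitor ID to a signed 64-bit integer usable as a lock key."""
    digest = hashlib.sha256(monitor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def open_incident_once(cursor, monitor_id: str) -> bool:
    """Insert an incident unless the monitor already has an unresolved one.

    `cursor` is any DB-API cursor inside an open transaction.
    Table and column names are hypothetical.
    Returns True if a new incident row was inserted.
    """
    # Transaction-scoped lock: released automatically at COMMIT or ROLLBACK,
    # so two checkers for the same monitor run this section one at a time.
    cursor.execute(
        "SELECT pg_advisory_xact_lock(%s)", (advisory_lock_key(monitor_id),)
    )
    cursor.execute(
        "SELECT 1 FROM incidents "
        "WHERE monitor_id = %s AND status <> 'resolved' LIMIT 1",
        (monitor_id,),
    )
    if cursor.fetchone() is not None:
        return False  # an open incident already exists for this monitor
    cursor.execute(
        "INSERT INTO incidents (monitor_id, status) VALUES (%s, 'ongoing')",
        (monitor_id,),
    )
    return True
```

Because the lock key is derived from the monitor ID, two different monitors never block each other, which is also why two monitors watching the same endpoint can each open their own incident.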

Each incident tracks the full lifecycle from detection to resolution with accurate timestamps, duration calculation, impact classification, and optional AI-powered root cause analysis. Teams can manually report incidents, broadcast updates via email and chat, and retain closed incident data for trend analysis and post-incident review.

Key capabilities:

Automatic creation with pg_advisory_xact_lock to prevent duplicates
Manual creation with 8 incident types: availability, latency, security, deployment, infrastructure, network, manual, other
AI analysis via Gemini 2.0 Flash with rate limiting (6 concurrent), caching, and fallback
4 impact levels: low (<10min), medium (10–30min), high (30min+), critical (30min+ ongoing)
Email broadcast to multiple recipients with full incident context
Chat integration for real-time team updates
Duration auto-calculation on resolution
Paginated history with monitor_id, date range, and status filters
Daily summary aggregation: incident count, uptime %, total downtime
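
As a rough illustration of the daily summary in the last item above, the three figures can be derived from the day's incident durations. The function and field names are hypothetical, and capping downtime at 24 hours is an assumption made so that overlapping incidents cannot push uptime below 0%.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class DailySummary:
    incident_count: int
    total_downtime_seconds: int
    uptime_percent: float


def summarize_day(downtime_seconds: list[int]) -> DailySummary:
    """Aggregate one day's incidents, given each incident's downtime in seconds."""
    # Assumption: cap at one day so overlapping incidents stay within 0-100%.
    total = min(sum(downtime_seconds), SECONDS_PER_DAY)
    uptime = 100.0 * (SECONDS_PER_DAY - total) / SECONDS_PER_DAY
    return DailySummary(len(downtime_seconds), total, round(uptime, 3))
```

For example, a single incident lasting 864 seconds is 1% of a day, so `summarize_day([864]).uptime_percent` is `99.0`.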

Incident Workflows

Each workflow maps to real AppStatus features and API endpoints used in the main app.

Incident Intake

The manual incident form captures the required lifecycle fields and normalizes incident metadata for timeline, analytics, and response tracking.

Set title, description, incident_type, impact, and status.
Assign an owner (`assignee_id`) and list the affected services.
Capture cause and consequence to keep context reusable for post-incident review.
Set accurate started/resolved timestamps for duration reporting.
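
The intake fields above could be modelled roughly like this. The field names `title`, `description`, `incident_type`, `impact`, `status`, and `assignee_id` come from this guide; `affected_services`, `cause`, `consequence`, `started_at`, and `resolved_at` are assumed spellings, and the validation rules are illustrative rather than the form's actual behavior.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

INCIDENT_TYPES = {
    "availability", "latency", "security", "deployment",
    "infrastructure", "network", "manual", "other",
}
IMPACT_LEVELS = {"low", "medium", "high", "critical"}


@dataclass
class ManualIncident:
    title: str
    description: str
    incident_type: str
    impact: str
    status: str
    started_at: datetime
    assignee_id: Optional[str] = None
    affected_services: list[str] = field(default_factory=list)
    cause: Optional[str] = None
    consequence: Optional[str] = None
    resolved_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Reject values the guide does not list before they reach analytics.
        if not self.title.strip():
            raise ValueError("title is required")
        if self.incident_type not in INCIDENT_TYPES:
            raise ValueError(f"unknown incident_type: {self.incident_type}")
        if self.impact not in IMPACT_LEVELS:
            raise ValueError(f"unknown impact: {self.impact}")
        if self.resolved_at is not None and self.resolved_at < self.started_at:
            raise ValueError("resolved_at is earlier than started_at")

    def duration_seconds(self) -> Optional[int]:
        """Duration for reporting; None while the incident is unresolved."""
        if self.resolved_at is None:
            return None
        return int((self.resolved_at - self.started_at).total_seconds())
```

Checking that the resolved timestamp is not earlier than the start keeps duration reporting from producing negative values.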

Troubleshooting

Duplicate incidents created for the same outage

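This should not happen: AppStatus uses PostgreSQL advisory locks with serializable isolation to prevent duplicate incidents for the same monitor. If you see duplicates, check whether multiple monitors point to the same endpoint. Each monitor creates its own incident, so two monitors on one endpoint produce two incidents for a single outage.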

AI analysis returns fallback instead of real analysis

Fallback triggers when the Gemini API is unavailable (rate limit, auth error, or timeout). Check your GEMINI_API_KEY environment variable. The system allows at most 6 concurrent analysis requests. Check the ai-quota endpoint for monthly usage.
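
The interaction between the concurrency limit, caching, and fallback can be sketched with an `asyncio.Semaphore`. This is an illustration under assumptions, not the AppStatus implementation: `AnalysisRunner`, the shape of the fallback result, and the 10-second timeout are hypothetical, and `analyze` stands in for whatever call reaches Gemini.

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT_ANALYSES = 6  # limit stated in this guide

# Hypothetical shape of a fallback result.
FALLBACK = {"source": "fallback", "root_cause_hints": [], "steps": []}


class AnalysisRunner:
    """Create inside a running event loop, e.g. at application startup."""

    def __init__(
        self,
        analyze: Callable[[str], Awaitable[dict]],
        timeout_s: float = 10.0,  # assumed value
    ) -> None:
        self._analyze = analyze
        self._timeout_s = timeout_s
        self._slots = asyncio.Semaphore(MAX_CONCURRENT_ANALYSES)
        self._cache: dict[str, dict] = {}

    async def run(self, incident_id: str) -> dict:
        if incident_id in self._cache:
            return self._cache[incident_id]
        async with self._slots:  # at most 6 analyses in flight
            try:
                result = await asyncio.wait_for(
                    self._analyze(incident_id), self._timeout_s
                )
            except Exception:
                # Rate limit, auth error, or timeout: return the fallback
                # and do not cache it, so a later retry can succeed.
                return dict(FALLBACK)
        self._cache[incident_id] = result
        return result
```

Leaving failures out of the cache is a design choice in this sketch: once the API key or quota problem is fixed, the next request for the same incident gets a real analysis.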

Incident not auto-resolving when service recovers

Auto-resolution happens when the monitor checker detects that the service is UP again. Verify that the monitor is not paused, then check the monitor interval: resolution happens on the next successful check cycle, so a long interval delays it.
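
A small state tracker makes the open and auto-resolve behavior easier to reason about. The sketch below is a simplified model, not the real checker: `MonitorState`, its fields, and the default threshold of 3 are assumptions; in AppStatus the threshold comes from the alert rule.

```python
from dataclasses import dataclass


@dataclass
class MonitorState:
    failure_threshold: int = 3  # hypothetical default; set by the alert rule
    consecutive_failures: int = 0
    incident_open: bool = False
    paused: bool = False

    def record_check(self, is_up: bool) -> str:
        """Apply one check result and return 'opened', 'resolved', or 'none'."""
        if self.paused:
            # A paused monitor evaluates nothing, so an open incident stays open.
            return "none"
        if is_up:
            self.consecutive_failures = 0
            if self.incident_open:
                self.incident_open = False
                return "resolved"  # first successful check after the outage
            return "none"
        self.consecutive_failures += 1
        if not self.incident_open and self.consecutive_failures >= self.failure_threshold:
            self.incident_open = True
            return "opened"
        return "none"
```

In this model an incident can only resolve when a check actually runs and succeeds, which matches the two things to verify above: the monitor is not paused, and the check interval has elapsed.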

Impact level seems wrong

Impact is auto-calculated from duration: <10min=low, 10–30min=medium, 30min+=high, 30min+ ongoing=critical. You can override impact in manual incidents.
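
The duration thresholds above translate into a short function. The name `classify_impact` is hypothetical, and the guide does not say which level an incident of exactly 30 minutes falls into; this sketch assumes it counts as "30min+".

```python
def classify_impact(duration_minutes: float, ongoing: bool) -> str:
    """Map incident duration to an impact level using the guide's thresholds."""
    if duration_minutes < 0:
        raise ValueError("duration cannot be negative")
    if duration_minutes < 10:
        return "low"
    if duration_minutes < 30:
        return "medium"
    # Assumption: exactly 30 minutes is treated as "30min+".
    return "critical" if ongoing else "high"
```

Under these rules, the same 45-minute outage reads as critical while it is still ongoing and as high once it is resolved.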

Email broadcast not sending

Verify that every email address contains an @ symbol. The system queues emails through EmailQueueService, so check the delivery logs. Ensure the workspace email sending limit has not been exceeded.
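
To pre-check a recipient list before broadcasting, a filter along these lines can help. It is a sketch only: the guide states just that addresses must contain an @ symbol, so requiring a non-empty part on each side of it is an added assumption, and `split_recipients` is a hypothetical helper name.

```python
def split_recipients(addresses: list[str]) -> tuple[list[str], list[str]]:
    """Return (valid, rejected) recipient lists using a minimal @ check."""
    valid: list[str] = []
    rejected: list[str] = []
    for raw in addresses:
        addr = raw.strip()
        local, sep, domain = addr.partition("@")
        # Assumption: both sides of the @ must be non-empty.
        if sep and local and domain:
            valid.append(addr)
        else:
            rejected.append(raw)
    return valid, rejected
```

Reporting the rejected entries back to the user is usually more helpful than dropping them silently.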

Operational Guidance

Assign one accountable incident commander per active event.
Link incidents to monitors and alerts for complete timeline context.
Preserve post-incident actions with owner and due date.

Step-by-Step Setup

Incidents are the structured record of every outage or degradation. Most incidents open automatically when a monitor fails — but you can also open one manually for planned maintenance or for issues your monitors cannot detect (customer-reported, partial degradation). The incident page is where the responder triages, broadcasts updates and finally records the resolution.

Before you start

At least one monitor (incidents are usually linked to a monitor)
At least one alert rule (so incidents auto-open on failure)
(Optional) A status page if you want customer-facing broadcast

1. Open Incidents from the sidebar
The Incidents page shows every incident in the workspace, both open and historical. Filter by status, severity or assignee with the chips at the top of the list.
Where: Sidebar → Incidents

2. Most incidents appear here automatically
When a monitor reaches the failure threshold defined in its alert rule, an incident opens in state "Investigating" with the failing monitor pre-linked. No action is needed: the incident appears in the list and you click in to triage.
Tip: Severity is inherited from the alert rule that triggered the incident, so Critical-severity rules open Critical incidents.

3. To open one manually, click "+ New incident"
Use this for planned maintenance, customer-reported issues or partial degradation your monitors did not catch. The form has the same fields as an auto-created incident; you fill them in yourself.
Where: Incidents → + New incident (top-right)

4. Set the incident metadata
Give it a short, impact-first title ("Login down for EU users", not "Auth service alert"). Pick the type (Availability / Latency / Security / Deployment / Network / Infrastructure / Manual), impact level, and the affected services. Assign one responder, the incident commander, so it is clear who is driving.
Where: Incident form → Title, Type, Impact, Affected services, Assignee

5. Broadcast the first update
Open the Updates panel and post a short status. Tick "Broadcast to status page" to publish to subscribers, and "Broadcast to chat" to mirror the same message into the team chat channel. One message, two audiences.
Where: Incident detail → Updates → New update

6. Run AI root-cause analysis (Pro plan and above)
Click "AI analyse" on the incident detail. AppStatus pulls the linked monitor logs and recent activity and uses Gemini to suggest likely causes and remediation steps. The output is appended to the incident timeline. Treat it as a starting hypothesis, not a final answer.
Where: Incident detail → AI analyse (top-right)

7. Post status changes as you investigate
As you make progress, change the lifecycle status (Investigating → Identified → Monitoring → Resolved) and post a short update at each transition. Customers on your status page see exactly the same updates.
Where: Incident detail → Status dropdown + Updates

8. Resolve and capture the post-mortem
When the issue is fixed, switch status to Resolved. Two extra fields appear: Cause (root cause) and Consequence (customer impact). Filling these in takes about a minute and pays off the next time a similar issue hits, because the analytics page can show patterns across past incidents.
Where: Incident detail → Status: Resolved → Cause + Consequence
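
The "one message, two audiences" behavior of the broadcast step can be pictured as a simple fan-out. This is an illustrative sketch: `broadcast_update` and its parameters are hypothetical, and the two callables stand in for whatever publishes to the status page and the chat channel.

```python
from typing import Callable


def broadcast_update(
    message: str,
    *,
    to_status_page: bool,
    to_chat: bool,
    publish_status_page: Callable[[str], None],
    post_chat: Callable[[str], None],
) -> list[str]:
    """Send one update to the selected audiences; return the channels used."""
    text = message.strip()
    if not text:
        raise ValueError("update message is empty")
    delivered: list[str] = []
    if to_status_page:  # mirrors the "Broadcast to status page" tick box
        publish_status_page(text)
        delivered.append("status_page")
    if to_chat:  # mirrors the "Broadcast to chat" tick box
        post_chat(text)
        delivered.append("chat")
    return delivered
```

Sending the identical text to both channels keeps customers and the internal team looking at the same wording.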

Configuration Options

Every option you can set, what each choice means, and what to pick. Use this as a reference while you fill in the form.

Incident type

Type	What it means	When to use
Availability	The service is fully or partially unreachable.	Default for monitor-triggered incidents.
Latency	The service responds but is slower than acceptable.	Use when SLA threshold breached but service still works.
Security	Authentication, access or vulnerability incident.	Use for any incident touching authn/authz or compliance.
Deployment	Issue caused by a recent release or migration.	Helps post-mortem grouping: "X% of incidents are deploy-related".
Network	Upstream/downstream provider or DNS-level issue.	Use when root cause is outside your application code.
Infrastructure	Underlying compute, database or storage issue.	Use for cloud-provider or hardware-layer incidents.
Manual	Catch-all for anything that does not fit the above.	Use for planned maintenance, drills or customer-reported issues.

Impact

Impact	What it means	Example or response
Low	Internal only, minor inconvenience.	Background job failure with no user-facing impact.
Medium	Some users affected, workaround exists.	One feature degraded, rest of app fine.
High	Most users affected, primary flow broken.	Login slow, checkout broken. Page on-call.
Critical	Full outage / data loss / security exposure.	All hands. Status page banner. Customer email.

Lifecycle status

Status	What it means	Recommended
Investigating	Default starting state; root cause unknown.	Initial post in this state with what you know.
Identified	Root cause confirmed, fix in progress.	Post the cause + expected fix time when moving here.
Monitoring	Fix applied, watching for regression.	Wait at least one monitoring interval before resolving.
Resolved	Service back to normal; incident closed.	Requires Cause + Consequence to close.
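
The lifecycle rules in this table can be expressed as a small transition check. This sketch makes two assumptions the guide does not state: statuses only move forward, and intermediate states may be skipped (for example, Investigating straight to Resolved). The function name and lowercase status strings are hypothetical.

```python
from typing import Optional

LIFECYCLE = ["investigating", "identified", "monitoring", "resolved"]


def advance(
    current: str,
    target: str,
    cause: Optional[str] = None,
    consequence: Optional[str] = None,
) -> str:
    """Validate a status change and return the new status."""
    if current not in LIFECYCLE or target not in LIFECYCLE:
        raise ValueError("unknown lifecycle status")
    # Assumption: forward-only moves; skipping intermediate states is allowed.
    if LIFECYCLE.index(target) <= LIFECYCLE.index(current):
        raise ValueError(f"cannot move from {current} to {target}")
    # The table states that closing requires Cause + Consequence.
    if target == "resolved" and not (cause and consequence):
        raise ValueError("cause and consequence are required to resolve")
    return target
```

For instance, `advance("monitoring", "resolved", cause="expired certificate", consequence="EU logins failed")` returns `"resolved"`, while the same call without the two post-mortem fields raises `ValueError`.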

Feature Reference

Every feature, where to find it in the app, and what it does. Use this when you know what you want to do but not where it lives.

Feature	Where in app	Description
Auto-create incident	Triggered by alert rule "consecutive failures"	Monitor failures open an incident automatically, linked to the monitor and region.
Manual create	Sidebar → Incidents → + New incident	For planned maintenance or human-reported issues with no monitor trigger.
Assign incident commander	Incident detail → Assignee	Single accountable owner; shown on every broadcast update.
Broadcast update	Incident detail → Updates → New update	One panel posts to chat + status page subscribers in one click.
AI root-cause analysis	Incident detail → AI analyse (Pro+)	Gemini suggests likely causes and remediation from monitor logs + recent activity.
Lifecycle status	Incident detail → Status dropdown	Investigating → Identified → Monitoring → Resolved with timestamped transitions.
Timeline	Incident detail → Timeline tab	Every state change, update, ack and AI suggestion in chronological order.
MTTR & duration	Computed automatically on Resolved	Mean-time-to-recovery rolled up across services and time windows in Analytics.
Post-mortem fields	Resolve step → Cause + Consequence	Captured at resolution; surface in monthly retrospectives and SLA reports.
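
For the MTTR row above, the underlying arithmetic is the mean of resolved-minus-started over resolved incidents. A minimal sketch, assuming each incident is represented as a `(started_at, resolved_at)` pair; how Analytics actually groups by service and time window is not described here.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> Optional[timedelta]:
    """Average duration of resolved incidents, or None if there are none."""
    if not incidents:
        return None
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)
```

Two incidents lasting 10 and 30 minutes give an MTTR of 20 minutes.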

Next Steps

Continue building your monitoring stack:

Configure Alerts

Route incident notifications to channels and escalation policies.

Publish Status Pages

Display incidents on public status pages automatically.

Set up Monitors

Create the health checks that trigger automatic incidents.

Team Governance

Assign incident commanders and responder roles.

Set up Heartbeats

Detect failing scheduled jobs as incidents.

Install the Agent

Correlate incidents with host-level metrics and logs.