Getting Started with Incidents
Overview
This guide introduces AppStatus Incidents and covers:
- Automatic incident creation from monitor failures
- Manual incident reporting with structured fields
- AI-powered analysis (Gemini) with root cause hints and remediation steps
- Incident lifecycle: ongoing → acknowledged → resolved
- Email and chat broadcast for team communication
- Impact classification: low, medium, high, critical
What are AppStatus Incidents?
Incidents centralize the detection, triage, communication, and resolution of service disruptions. When a monitor detects a failure, AppStatus automatically creates an incident inside a transaction guarded by a PostgreSQL advisory lock, which prevents duplicate incidents for the same monitor.
Each incident tracks the full lifecycle from detection to resolution with accurate timestamps, duration calculation, impact classification, and optional AI-powered root cause analysis. Teams can manually report incidents, broadcast updates via email and chat, and retain closed incident data for trend analysis and post-incident review.
Key capabilities:
- Automatic creation with pg_advisory_xact_lock to prevent duplicates
- Manual creation with 8 incident types: availability, latency, security, deployment, infrastructure, network, manual, other
- AI analysis via Gemini 2.0 Flash with rate limiting (6 concurrent), caching, and fallback
- 4 impact levels: low (<10min), medium (10–30min), high (30min+), critical (30min+ ongoing)
- Email broadcast to multiple recipients with full incident context
- Chat integration for real-time team updates
- Duration auto-calculation on resolution
- Paginated history with monitor_id, date range, and status filters
- Daily summary aggregation: incident count, uptime %, total downtime
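The duration and daily summary capabilities above can be illustrated with a short sketch. This is not AppStatus code: the `Incident` class, the function names, and the choice to clip incidents to a UTC day window and to sum overlapping incidents are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

SECONDS_PER_DAY = 86400.0


@dataclass
class Incident:
    # Hypothetical minimal record; a real incident carries many more fields.
    started_at: datetime
    resolved_at: Optional[datetime] = None  # None while the incident is open


def duration_seconds(incident: Incident, now: datetime) -> float:
    """Duration is resolved_at - started_at; open incidents run up to `now`."""
    end = incident.resolved_at or now
    return (end - incident.started_at).total_seconds()


def daily_summary(incidents: Iterable[Incident], day_start: datetime, now: datetime) -> dict:
    """Aggregate one day: incident count, total downtime, uptime percentage."""
    day_end = day_start + timedelta(days=1)
    count = 0
    downtime = 0.0
    for inc in incidents:
        end = inc.resolved_at or now
        # Clip to the day window so a multi-day outage is not over-counted.
        overlap_start = max(inc.started_at, day_start)
        overlap_end = min(end, day_end)
        if overlap_end > overlap_start:
            count += 1
            downtime += (overlap_end - overlap_start).total_seconds()
    # Assumption: overlapping incidents are summed, then capped at one day.
    downtime = min(downtime, SECONDS_PER_DAY)
    uptime = 100.0 * (1 - downtime / SECONDS_PER_DAY)
    return {
        "incident_count": count,
        "total_downtime_seconds": downtime,
        "uptime_percent": uptime,
    }
```

An incident that starts at 23:50 and ends at 00:20 the next day contributes only 10 minutes to the first day's downtime under this clipping rule.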
Incident Workflows
Each workflow maps to real AppStatus features and API endpoints used in the main app.
Incident Intake
The manual incident form captures the required lifecycle fields and normalizes incident metadata for timeline, analytics, and response tracking.
- Set title, description, incident_type, impact, and status.
- Assign an owner (`assignee_id`) and the list of affected services.
- Capture cause and consequence to keep context reusable for post-incident review.
- Set accurate started/resolved timestamps for duration reporting.
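The intake fields listed above could be modelled as follows. This is a minimal sketch, not the AppStatus schema: `title`, `description`, `incident_type`, `impact`, `status`, and `assignee_id` come from this guide, while `affected_services`, `cause`, `consequence`, `started_at`, `resolved_at`, and the `validate` method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# The 8 incident types and 4 impact levels named in this guide.
INCIDENT_TYPES = {
    "availability", "latency", "security", "deployment",
    "infrastructure", "network", "manual", "other",
}
IMPACT_LEVELS = {"low", "medium", "high", "critical"}


@dataclass
class ManualIncident:
    title: str
    description: str
    incident_type: str
    impact: str
    status: str  # not validated here; status vocabulary is left to the app
    started_at: datetime
    assignee_id: Optional[int] = None
    affected_services: List[str] = field(default_factory=list)  # hypothetical name
    cause: Optional[str] = None         # hypothetical name
    consequence: Optional[str] = None   # hypothetical name
    resolved_at: Optional[datetime] = None

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        errors = []
        if not self.title.strip():
            errors.append("title is required")
        if self.incident_type not in INCIDENT_TYPES:
            errors.append(f"unknown incident_type: {self.incident_type}")
        if self.impact not in IMPACT_LEVELS:
            errors.append(f"unknown impact: {self.impact}")
        if self.resolved_at is not None and self.resolved_at < self.started_at:
            errors.append("resolved_at is earlier than started_at")
        return errors
```

Checking that `resolved_at` is not earlier than `started_at` matters because duration reporting depends on both timestamps.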
Troubleshooting
Duplicate incidents created for the same outage
This should not happen: AppStatus uses PostgreSQL advisory locks with serializable isolation. If you see duplicates, check whether multiple monitors point to the same endpoint, because each monitor creates its own incident.
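The lock-then-check pattern behind this guarantee can be sketched in memory. The example below is an illustration only: a per-monitor `threading.Lock` stands in for `pg_advisory_xact_lock`, and `IncidentStore` and its methods are hypothetical names, not AppStatus internals.

```python
import threading
from collections import defaultdict


class IncidentStore:
    """In-memory stand-in for the incidents table (illustrative only)."""

    def __init__(self):
        # One lock per monitor, playing the role of a per-monitor advisory lock.
        self._monitor_locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table and id counter
        self.open_by_monitor = {}
        self._next_id = 1

    def _lock_for(self, monitor_id):
        with self._guard:
            return self._monitor_locks[monitor_id]

    def open_incident(self, monitor_id):
        """Return the open incident id for this monitor, creating it at most once."""
        with self._lock_for(monitor_id):
            existing = self.open_by_monitor.get(monitor_id)
            if existing is not None:
                return existing  # another checker already opened it
            with self._guard:
                incident_id = self._next_id
                self._next_id += 1
            self.open_by_monitor[monitor_id] = incident_id
            return incident_id
```

Concurrent calls for the same monitor return one incident id, while two monitors that watch the same endpoint still get one incident each, which matches the duplicate scenario described above.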
AI analysis returns fallback instead of real analysis
The fallback triggers when the Gemini API is unavailable (rate limit, auth error, or timeout). Check your GEMINI_API_KEY environment variable. The system limits analysis to 6 concurrent requests. Check the ai-quota endpoint for monthly usage.
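The combination of a concurrency limit, caching, and fallback can be sketched as follows. This is an assumption-laden illustration, not the AppStatus implementation: `IncidentAnalyzer`, `call_model`, the timeout value, and the result shape are all hypothetical, and no real Gemini client is used.

```python
import asyncio

MAX_CONCURRENT = 6  # concurrency limit stated in this guide


class IncidentAnalyzer:
    def __init__(self, call_model, timeout=10.0):
        # call_model is a hypothetical async callable: incident_id -> str
        self._call_model = call_model
        self._timeout = timeout
        self._cache = {}
        self._sem = None  # created lazily, inside the running event loop

    async def analyze(self, incident_id):
        if incident_id in self._cache:
            return self._cache[incident_id]  # cached result, no model call
        if self._sem is None:
            self._sem = asyncio.Semaphore(MAX_CONCURRENT)
        async with self._sem:  # at most MAX_CONCURRENT calls in flight
            try:
                text = await asyncio.wait_for(
                    self._call_model(incident_id), self._timeout
                )
            except Exception:
                # Rate limit, auth error, or timeout: return a fallback
                # and do not cache it, so a later retry can succeed.
                return {"source": "fallback", "analysis": "AI analysis unavailable"}
        result = {"source": "model", "analysis": text}
        self._cache[incident_id] = result
        return result
```

Leaving fallback results out of the cache is a design choice in this sketch: it means a transient outage does not pin an incident to the fallback text.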
Incident not auto-resolving when service recovers
Auto-resolution happens when the monitor checker detects that the service is UP again. Verify that the monitor is not paused, then check the monitor interval: resolution happens on the next successful check cycle.
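The check cycle described here can be sketched as a small state update. The names `MonitorState` and `apply_check` are hypothetical, and the `failure_threshold` default of 3 is an assumption standing in for whatever the alert rule defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitorState:
    paused: bool = False
    consecutive_failures: int = 0
    open_incident: bool = False


def apply_check(state: MonitorState, is_up: bool, failure_threshold: int = 3) -> Optional[str]:
    """Apply one check result; return "opened", "resolved", or None."""
    if state.paused:
        return None  # paused monitors run no checks, so nothing can resolve
    if is_up:
        state.consecutive_failures = 0
        if state.open_incident:
            state.open_incident = False
            return "resolved"  # first successful check closes the incident
        return None
    state.consecutive_failures += 1
    if not state.open_incident and state.consecutive_failures >= failure_threshold:
        state.open_incident = True
        return "opened"
    return None
```

In this sketch a paused monitor never reaches the resolution branch, which is why an incident can stay open after the service has recovered.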
Impact level seems wrong
Impact is auto-calculated from duration: under 10 min is low, 10–30 min is medium, 30 min or more is high, and 30 min or more while still ongoing is critical. You can override impact in manual incidents.
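The thresholds above translate into a short function. This is a sketch under one stated assumption: the guide lists 30 minutes in both the medium and high ranges, and this example treats exactly 30 minutes as high. The function name is hypothetical.

```python
def classify_impact(duration_minutes: float, ongoing: bool) -> str:
    """Map incident duration to an impact level using the documented thresholds."""
    if duration_minutes < 10:
        return "low"
    if duration_minutes < 30:
        return "medium"
    # 30 minutes or more: critical while still ongoing, high once resolved.
    # Assumption: exactly 30 minutes falls in this tier.
    return "critical" if ongoing else "high"
```

For example, a 45-minute incident classifies as critical while it is still open and as high after resolution.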
Email broadcast not sending
Verify that every email address contains an @ symbol. The system queues emails through EmailQueueService, so check the delivery logs. Ensure the workspace email sending limit is not exceeded.
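A quick way to pre-check a recipient list before broadcasting is sketched below. It mirrors only the "@" check mentioned above; the function name and the comma-separated input format are assumptions, and real address validation is stricter than this.

```python
from typing import List, Tuple


def split_recipients(raw: str) -> Tuple[List[str], List[str]]:
    """Split a comma-separated recipient string into (valid, rejected)."""
    valid, rejected = [], []
    for addr in (part.strip() for part in raw.split(",")):
        if not addr:
            continue  # ignore empty entries such as a trailing comma
        # Minimal check only: the address must contain an @ symbol.
        (valid if "@" in addr else rejected).append(addr)
    return valid, rejected
```

Any address that lands in the rejected list should be corrected before the broadcast is sent.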
Operational Guidance
- Assign one accountable incident commander per active event.
- Link incidents to monitors and alerts for complete timeline context.
- Preserve post-incident actions with owner and due date.
Step-by-Step Setup
Incidents are the structured record of every outage or degradation. Most incidents open automatically when a monitor fails — but you can also open one manually for planned maintenance or for issues your monitors cannot detect (customer-reported, partial degradation). The incident page is where the responder triages, broadcasts updates and finally records the resolution.
Before you start
- At least one monitor (incidents are usually linked to a monitor)
- At least one alert rule (so incidents auto-open on failure)
- (Optional) A status page if you want customer-facing broadcast
1. Open Incidents from the sidebar
The Incidents page shows every incident in the workspace — both open and historical. Filter by status, severity or assignee with the chips at the top of the list.
Where: Sidebar → Incidents
2. Most incidents appear here automatically
When a monitor reaches the failure threshold defined in its alert rule, an incident opens in state "Investigating" with the failing monitor pre-linked. No action needed — you just see it appear and click in to triage.
Tip: Severity is inherited from the alert rule that triggered the incident, so Critical-severity rules open Critical incidents.
3. To open one manually, click "+ New incident"
Use this for planned maintenance, customer-reported issues or partial degradation your monitors did not catch. The form has the same fields as an auto-created incident — you just fill them yourself.
Where: Incidents → + New incident (top-right)
4. Set the incident metadata
Give it a short, impact-first title ("Login down for EU users", not "Auth service alert"). Pick the type (Availability / Latency / Security / Deployment / Network), impact level, and the affected services. Assign one responder — the incident commander — so it is clear who is driving.
Where: Incident form → Title, Type, Impact, Affected services, Assignee
5. Broadcast the first update
Open the Updates panel and post a short status. Tick "Broadcast to status page" to publish to subscribers, and "Broadcast to chat" to mirror the same message into the team chat channel. One message, two audiences.
Where: Incident detail → Updates → New update
6. Run AI root-cause analysis (Pro plan and above)
Click "AI analyse" on the incident detail. AppStatus pulls the linked monitor logs and recent activity and uses Gemini to suggest likely causes and remediation steps. The output is appended to the incident timeline — treat it as a starting hypothesis, not a final answer.
Where: Incident detail → AI analyse (top-right)
7. Post status changes as you investigate
As you make progress, change the lifecycle status (Investigating → Identified → Monitoring → Resolved) and post a short update at each transition. Customers on your status page see exactly the same updates.
Where: Incident detail → Status dropdown + Updates
8. Resolve and capture the post-mortem
When the issue is fixed, switch status to Resolved. Two extra fields appear — Cause (root cause) and Consequence (customer impact). Filling these takes 60 seconds and pays back the next time a similar issue hits, because the analytics page can show patterns across past incidents.
Where: Incident detail → Status: Resolved → Cause + Consequence
Configuration Options
Every option you can set, what each choice means, and what to pick. Use this as a reference while you fill in the form.
Incident type
| Type | What it means | When to use |
|---|---|---|
| Availability | The service is fully or partially unreachable. | Default for monitor-triggered incidents. |
| Latency | The service responds but is slower than acceptable. | Use when an SLA threshold is breached but the service still works. |
| Security | Authentication, access or vulnerability incident. | Use for any incident touching authn/authz or compliance. |
| Deployment | Issue caused by a recent release or migration. | Helps post-mortem grouping, e.g. "X% of incidents are deploy-related". |
| Network | Upstream/downstream provider or DNS-level issue. | Use when the root cause is outside your application code. |
| Infrastructure | Underlying compute, database or storage issue. | Use for cloud-provider or hardware-layer incidents. |
| Manual | Catch-all for anything that does not fit the above. | Use for planned maintenance, drills or customer-reported issues. |
Impact
| Level | What it means | Example and response |
|---|---|---|
| Low | Internal only, minor inconvenience. | Background job failure with no user-facing impact. |
| Medium | Some users affected, workaround exists. | One feature degraded, rest of app fine. |
| High | Most users affected, primary flow broken. | Login slow, checkout broken: page on-call. |
| Critical | Full outage / data loss / security exposure. | All hands. Status page banner. Customer email. |
Lifecycle status
| Status | What it means | Recommended |
|---|---|---|
| Investigating | Default starting state; root cause unknown. | Post an initial update in this state with what you know. |
| Identified | Root cause confirmed, fix in progress. | Post the cause + expected fix time when moving here. |
| Monitoring | Fix applied, watching for regression. | Wait at least one monitoring interval before resolving. |
| Resolved | Service back to normal; incident closed. | Requires Cause + Consequence to close. |
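The lifecycle rules in this table can be expressed as a small guard function. This is a sketch, not AppStatus behaviour: the function name is hypothetical, and it assumes that status only moves forward (skipping ahead is allowed, moving back is not), which the guide does not state explicitly.

```python
from typing import Optional

LIFECYCLE = ["Investigating", "Identified", "Monitoring", "Resolved"]


def change_status(
    current: str,
    target: str,
    cause: Optional[str] = None,
    consequence: Optional[str] = None,
) -> str:
    """Validate a lifecycle transition and return the new status."""
    if current not in LIFECYCLE or target not in LIFECYCLE:
        raise ValueError("unknown lifecycle status")
    # Assumption: transitions only move forward through the lifecycle.
    if LIFECYCLE.index(target) <= LIFECYCLE.index(current):
        raise ValueError(f"cannot move from {current} to {target}")
    # Closing requires both post-mortem fields, as the table notes.
    if target == "Resolved" and not (cause and consequence):
        raise ValueError("Resolved requires Cause and Consequence")
    return target
```

Calling `change_status("Monitoring", "Resolved")` without both fields raises an error, which mirrors the requirement that Cause and Consequence are filled in before an incident closes.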
Feature Reference
Every feature, where to find it in the app, and what it does. Use this when you know what you want to do but not where it lives.
| Feature | Where in app | Description |
|---|---|---|
| Auto-create incident | Triggered by alert rule "consecutive failures" | Monitor failures open an incident automatically, linked to the monitor and region. |
| Manual create | Sidebar → Incidents → + New incident | For planned maintenance or human-reported issues with no monitor trigger. |
| Assign incident commander | Incident detail → Assignee | Single accountable owner; shown on every broadcast update. |
| Broadcast update | Incident detail → Updates → New update | One panel posts to chat + status page subscribers in one click. |
| AI root-cause analysis | Incident detail → AI analyse (Pro+) | Gemini suggests likely causes and remediation from monitor logs + recent activity. |
| Lifecycle status | Incident detail → Status dropdown | Investigating → Identified → Monitoring → Resolved with timestamped transitions. |
| Timeline | Incident detail → Timeline tab | Every state change, update, ack and AI suggestion in chronological order. |
| MTTR & duration | Computed automatically on Resolved | Mean-time-to-recovery rolled up across services and time windows in Analytics. |
| Post-mortem fields | Resolve step → Cause + Consequence | Captured at resolution; surface in monthly retrospectives and SLA reports. |
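The MTTR roll-up listed in the table is an average of resolved incident durations. The sketch below shows that arithmetic per service; the function name and the `(service, duration_seconds)` input shape are assumptions for illustration, not the Analytics data model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mttr_minutes(resolved: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean time to recovery per service, in minutes.

    `resolved` yields (service, duration_seconds) pairs for resolved incidents.
    """
    totals = defaultdict(lambda: [0.0, 0])  # service -> [sum_seconds, count]
    for service, seconds in resolved:
        totals[service][0] += seconds
        totals[service][1] += 1
    return {svc: total / count / 60 for svc, (total, count) in totals.items()}
```

Two incidents of 10 and 20 minutes on the same service give an MTTR of 15 minutes for that service.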
Next Steps
Continue building your monitoring stack:
- Configure Alerts: Route incident notifications to channels and escalation policies.
- Publish Status Pages: Display incidents on public status pages automatically.
- Set up Monitors: Create the health checks that trigger automatic incidents.
- Team Governance: Assign incident commanders and responder roles.
- Set up Heartbeats: Detect failing scheduled jobs as incidents.
- Install the Agent: Correlate incidents with host-level metrics and logs.
