AppStatus Documentation Hub for Production Operations

Getting Started with Incidents

Overview

This guide introduces AppStatus Incidents and covers:

  • Automatic incident creation from monitor failures
  • Manual incident reporting with structured fields
  • AI-powered analysis (Gemini) with root-cause hints and suggested remediation steps
  • Incident lifecycle: ongoing → acknowledged → resolved
  • Email and chat broadcast for team communication
  • Impact classification: low, medium, high, critical

What are AppStatus Incidents?

Incidents centralize the detection, triage, communication, and resolution of service disruptions. When a monitor detects a failure, AppStatus automatically creates an incident inside a transaction guarded by a PostgreSQL advisory lock, which prevents duplicate incidents for the same monitor.
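
The duplicate-prevention idea can be sketched as follows. This is a minimal illustration under stated assumptions, not AppStatus's actual implementation: the `incidents` table, its columns, and both helper names are hypothetical, and `conn` is assumed to be a DB-API connection from a PostgreSQL driver such as psycopg. Only the use of a transaction-scoped advisory lock comes from this guide.

```python
import hashlib


def advisory_lock_key(monitor_id: str) -> int:
    """Map a monitor id to a signed 64-bit integer usable as a lock key."""
    digest = hashlib.sha256(monitor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def open_incident_once(conn, monitor_id: str) -> bool:
    """Open an incident for the monitor unless an unresolved one exists.

    Returns True if a new incident row was inserted, False otherwise.
    Table and column names are hypothetical.
    """
    cur = conn.cursor()
    try:
        # Transaction-scoped lock: concurrent callers for the same monitor
        # wait here until the holder commits or rolls back.
        cur.execute(
            "SELECT pg_advisory_xact_lock(%s)",
            (advisory_lock_key(monitor_id),),
        )
        cur.execute(
            "SELECT 1 FROM incidents "
            "WHERE monitor_id = %s AND status <> 'resolved' LIMIT 1",
            (monitor_id,),
        )
        if cur.fetchone() is not None:
            conn.rollback()  # ends the transaction and releases the lock
            return False
        cur.execute(
            "INSERT INTO incidents (monitor_id, status) VALUES (%s, 'ongoing')",
            (monitor_id,),
        )
        conn.commit()  # the lock is released when the transaction ends
        return True
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because the lock is keyed per monitor, two monitors that watch the same endpoint would still each open their own incident in this sketch.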

Each incident tracks the full lifecycle from detection to resolution with accurate timestamps, duration calculation, impact classification, and optional AI-powered root cause analysis. Teams can manually report incidents, broadcast updates via email and chat, and retain closed incident data for trend analysis and post-incident review.

Key capabilities:

  • Automatic creation with pg_advisory_xact_lock to prevent duplicates
  • Manual creation with 8 incident types: availability, latency, security, deployment, infrastructure, network, manual, other
  • AI analysis via Gemini 2.0 Flash with rate limiting (6 concurrent), caching, and fallback
  • 4 impact levels: low (<10min), medium (10–30min), high (30min+), critical (30min+ ongoing)
  • Email broadcast to multiple recipients with full incident context
  • Chat integration for real-time team updates
  • Duration auto-calculation on resolution
  • Paginated history with monitor_id, date range, and status filters
  • Daily summary aggregation: incident count, uptime %, total downtime
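
As a rough illustration of the daily summary, the sketch below computes the three listed figures for one monitor and one UTC day. The function name, the input shape (pairs of start and resolution timestamps), and the rule that an unresolved incident counts as downtime until the end of the day are all assumptions made for this example, and incidents are assumed not to overlap each other.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Optional, Tuple

SECONDS_PER_DAY = 86400


def daily_summary(
    day_start: datetime,
    incidents: Iterable[Tuple[datetime, Optional[datetime]]],
) -> Dict[str, float]:
    """Summarize incidents that overlap the 24 hours starting at day_start."""
    day_end = day_start + timedelta(days=1)
    count = 0
    downtime = timedelta(0)
    for started_at, resolved_at in incidents:
        # Assumption: an unresolved incident is counted up to the day boundary.
        end = resolved_at if resolved_at is not None else day_end
        overlap_start = max(started_at, day_start)
        overlap_end = min(end, day_end)
        if overlap_end > overlap_start:
            count += 1
            downtime += overlap_end - overlap_start
    downtime_seconds = downtime.total_seconds()
    uptime_percent = 100.0 * (1 - downtime_seconds / SECONDS_PER_DAY)
    return {
        "incident_count": count,
        "uptime_percent": round(uptime_percent, 3),
        "total_downtime_seconds": downtime_seconds,
    }
```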

Incident Workflows

Each workflow maps to real AppStatus features and API endpoints used in the main app.

Incident Intake

The manual incident form captures the required lifecycle fields and normalizes incident metadata for the timeline, analytics, and response tracking.

  1. Set title, description, incident_type, impact, and status.
  2. Assign owner (`assignee_id`) and affected services list.
  3. Capture cause and consequence to keep context reusable for post-incident review.
  4. Set accurate started/resolved timestamps for duration reporting.
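
The fields in the steps above could be modeled roughly as shown below. This is a sketch, not the AppStatus schema: apart from `title`, `description`, `incident_type`, `impact`, `status`, and `assignee_id`, every field name is an assumption, and the validation rules are only one plausible choice. The allowed types and impact levels are the ones listed in this guide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

INCIDENT_TYPES = {
    "availability", "latency", "security", "deployment",
    "infrastructure", "network", "manual", "other",
}
IMPACT_LEVELS = {"low", "medium", "high", "critical"}


@dataclass
class ManualIncident:
    title: str
    description: str
    incident_type: str
    impact: str
    status: str
    assignee_id: str
    started_at: datetime
    affected_services: List[str] = field(default_factory=list)  # assumed name
    cause: Optional[str] = None
    consequence: Optional[str] = None
    resolved_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        if self.incident_type not in INCIDENT_TYPES:
            raise ValueError(f"unknown incident_type: {self.incident_type!r}")
        if self.impact not in IMPACT_LEVELS:
            raise ValueError(f"unknown impact: {self.impact!r}")
        if self.resolved_at is not None and self.resolved_at < self.started_at:
            raise ValueError("resolved_at must not be earlier than started_at")

    def duration_seconds(self) -> Optional[float]:
        """Duration for reporting, or None while the incident is unresolved."""
        if self.resolved_at is None:
            return None
        return (self.resolved_at - self.started_at).total_seconds()
```

Rejecting a resolution time earlier than the start time at construction keeps negative durations out of later reporting.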

Troubleshooting

Duplicate incidents created for the same outage

This should not happen: AppStatus uses PostgreSQL advisory locks with serializable isolation. If you see duplicates, check whether multiple monitors point to the same endpoint, because each monitor creates its own incident.

AI analysis returns fallback instead of real analysis

The fallback triggers when the Gemini API is unavailable (rate limit, auth error, or timeout). Check your GEMINI_API_KEY environment variable. The system allows at most 6 concurrent analysis requests. Check the ai-quota endpoint for monthly usage.
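
To make the interaction between the concurrency limit, caching, and the fallback concrete, here is a minimal sketch using `asyncio`. It does not call Gemini: `analyze` is a hypothetical async callable standing in for the real request, and the class name, the result shape, and the timeout default are assumptions. Only the limit of 6 concurrent requests comes from this guide.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

MAX_CONCURRENT_ANALYSES = 6  # limit stated in this guide


class AnalysisRunner:
    """Run analyses with a concurrency cap, a result cache, and a fallback."""

    def __init__(
        self,
        analyze: Callable[[str], Awaitable[Dict[str, Any]]],
        timeout_seconds: float = 30.0,  # assumed default
    ) -> None:
        self._analyze = analyze
        self._timeout = timeout_seconds
        self._semaphore = asyncio.Semaphore(MAX_CONCURRENT_ANALYSES)
        self._cache: Dict[str, Dict[str, Any]] = {}

    async def run(self, incident_id: str) -> Dict[str, Any]:
        if incident_id in self._cache:
            return self._cache[incident_id]
        async with self._semaphore:  # at most 6 requests in flight
            try:
                result = await asyncio.wait_for(
                    self._analyze(incident_id), self._timeout
                )
            except Exception:
                # Rate limit, auth error, or timeout: return a generic answer
                # and do not cache it, so a later retry can still succeed.
                return {"source": "fallback", "root_cause": None, "steps": []}
        result = {**result, "source": "ai"}
        self._cache[incident_id] = result
        return result
```

Create the runner inside a running event loop (for example, within the coroutine passed to `asyncio.run`) so the semaphore is bound to that loop on older Python versions.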

Incident not auto-resolving when service recovers

Auto-resolution happens when the monitor checker detects that the service is UP again. Verify that the monitor is not paused, and check the monitor interval: resolution happens on the next successful check cycle, not at the moment the service recovers.
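
The behaviour described here can be modeled with a small state tracker. This is an illustrative sketch only: the class, its attribute names, and the default threshold of 3 are assumptions, not AppStatus internals. It shows why a paused monitor never resolves its incident and why resolution waits for the next successful check.

```python
from typing import Optional


class MonitorIncidentTracker:
    """Track one monitor's checks and report incident open/resolve events."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold  # assumed to come from the alert rule
        self.consecutive_failures = 0
        self.incident_open = False
        self.paused = False

    def record_check(self, is_up: bool) -> Optional[str]:
        """Return "opened", "resolved", or None for a single check result."""
        if self.paused:
            # A paused monitor runs no checks, so nothing opens or resolves.
            return None
        if is_up:
            self.consecutive_failures = 0
            if self.incident_open:
                self.incident_open = False
                return "resolved"
            return None
        self.consecutive_failures += 1
        if (
            not self.incident_open
            and self.consecutive_failures >= self.failure_threshold
        ):
            self.incident_open = True
            return "opened"
        return None
```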

Impact level seems wrong

Impact is auto-calculated from duration: <10min=low, 10–30min=medium, 30min+=high, 30min+ ongoing=critical. You can override impact in manual incidents.
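
The thresholds above translate into a short function. The sketch assumes that the boundaries are inclusive on the lower side (exactly 10 minutes is medium, exactly 30 minutes is high or critical); the guide does not state how boundary values are treated, and the function name is hypothetical.

```python
def classify_impact(duration_minutes: float, ongoing: bool) -> str:
    """Classify impact from incident duration, following the documented bands."""
    if duration_minutes < 10:
        return "low"
    if duration_minutes < 30:
        return "medium"
    # 30 minutes or more: critical while still ongoing, high once resolved.
    return "critical" if ongoing else "high"
```

For example, a resolved 45-minute outage is classified as high, while the same outage is critical as long as it is still ongoing.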

Email broadcast not sending

Verify that each email address contains an @ symbol. The system queues emails through EmailQueueService, so check the delivery logs. Ensure the workspace email sending limit has not been exceeded.
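
Before sending, a recipient list can be pre-checked on the client side with something like the sketch below. The function name is hypothetical, and the check is slightly stricter than the rule stated above: it also requires text on both sides of the @ symbol. It is not a full address validator.

```python
from typing import Iterable, List, Tuple


def split_recipients(addresses: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split addresses into (plausibly valid, rejected) lists."""
    valid: List[str] = []
    rejected: List[str] = []
    for raw in addresses:
        address = raw.strip()
        local, separator, domain = address.partition("@")
        if separator and local and domain:
            valid.append(address)
        else:
            rejected.append(raw)
    return valid, rejected
```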

Operational Guidance

  • Assign one accountable incident commander per active event.
  • Link incidents to monitors and alerts for complete timeline context.
  • Preserve post-incident actions with owner and due date.

Step-by-Step Setup

Incidents are the structured record of every outage or degradation. Most incidents open automatically when a monitor fails — but you can also open one manually for planned maintenance or for issues your monitors cannot detect (customer-reported, partial degradation). The incident page is where the responder triages, broadcasts updates and finally records the resolution.

Before you start

  • At least one monitor (incidents are usually linked to a monitor)
  • At least one alert rule (so incidents auto-open on failure)
  • (Optional) A status page if you want customer-facing broadcast

  1. Open Incidents from the sidebar

    The Incidents page shows every incident in the workspace, both open and historical. Filter by status, severity or assignee with the chips at the top of the list.

    Where: Sidebar → Incidents

  2. Most incidents appear here automatically

    When a monitor reaches the failure threshold defined in its alert rule, an incident opens in state "Investigating" with the failing monitor pre-linked. No action is needed: the incident appears in the list, and you click in to triage.

    Tip: Severity is inherited from the alert rule that triggered the incident, so Critical-severity rules open Critical incidents.

  3. To open one manually, click "+ New incident"

    Use this for planned maintenance, customer-reported issues or partial degradation your monitors did not catch. The form has the same fields as an auto-created incident; you just fill them in yourself.

    Where: Incidents → + New incident (top-right)

  4. Set the incident metadata

    Give it a short, impact-first title ("Login down for EU users", not "Auth service alert"). Pick the type (Availability, Latency, Security, Deployment, Network, Infrastructure, or Manual), the impact level, and the affected services. Assign one responder, the incident commander, so it is clear who is driving.

    Where: Incident form → Title, Type, Impact, Affected services, Assignee

  5. Broadcast the first update

    Open the Updates panel and post a short status. Tick "Broadcast to status page" to publish to subscribers, and "Broadcast to chat" to mirror the same message into the team chat channel. One message, two audiences.

    Where: Incident detail → Updates → New update

  6. Run AI root-cause analysis (Pro plan and above)

    Click "AI analyse" on the incident detail. AppStatus pulls the linked monitor logs and recent activity and uses Gemini to suggest likely causes and remediation steps. The output is appended to the incident timeline. Treat it as a starting hypothesis, not a final answer.

    Where: Incident detail → AI analyse (top-right)

  7. Post status changes as you investigate

    As you make progress, change the lifecycle status (Investigating → Identified → Monitoring → Resolved) and post a short update at each transition. Customers on your status page see exactly the same updates.

    Where: Incident detail → Status dropdown + Updates

  8. Resolve and capture the post-mortem

    When the issue is fixed, switch the status to Resolved. Two extra fields appear: Cause (root cause) and Consequence (customer impact). Filling these in takes about 60 seconds and pays off the next time a similar issue hits, because the analytics page can show patterns across past incidents.

    Where: Incident detail → Status: Resolved → Cause + Consequence

Configuration Options

Every option you can set, what each choice means, and what to pick. Use this as a reference while you fill in the form.

Incident type

| Option | What it does | Recommended |
|---|---|---|
| Availability | The service is fully or partially unreachable. | Default for monitor-triggered incidents. |
| Latency | The service responds but is slower than acceptable. | Use when SLA threshold breached but service still works. |
| Security | Authentication, access or vulnerability incident. | Use for any incident touching authn/authz or compliance. |
| Deployment | Issue caused by a recent release or migration. | Helps post-mortem grouping ("X% of incidents are deploy-related"). |
| Network | Upstream/downstream provider or DNS-level issue. | Use when root cause is outside your application code. |
| Infrastructure | Underlying compute, database or storage issue. | Use for cloud-provider or hardware-layer incidents. |
| Manual | Catch-all for anything that does not fit the above. | Use for planned maintenance, drills or customer-reported issues. |

Impact

| Option | What it does | Recommended |
|---|---|---|
| Low | Internal only, minor inconvenience. | Background job failure with no user-facing impact. |
| Medium | Some users affected, workaround exists. | One feature degraded, rest of app fine. |
| High | Most users affected, primary flow broken. | Login slow, checkout broken; page on-call. |
| Critical | Full outage / data loss / security exposure. | All hands. Status page banner. Customer email. |

Lifecycle status

| Option | What it does | Recommended |
|---|---|---|
| Investigating | Default starting state; root cause unknown. | Initial post in this state with what you know. |
| Identified | Root cause confirmed, fix in progress. | Post the cause + expected fix time when moving here. |
| Monitoring | Fix applied, watching for regression. | Wait at least one monitoring interval before resolving. |
| Resolved | Service back to normal; incident closed. | Requires Cause + Consequence to close. |
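
The lifecycle can be thought of as a small state machine. The sketch below is one possible reading, not the product's actual rules: it assumes that status only moves forward, that stages may be skipped (for example, resolving directly from Investigating), and that a resolved incident is not reopened. The lowercase identifiers and the function name are hypothetical. The requirement that Resolved needs both Cause and Consequence comes from the reference above.

```python
from typing import Optional

LIFECYCLE = ["investigating", "identified", "monitoring", "resolved"]


def transition(
    current: str,
    target: str,
    cause: Optional[str] = None,
    consequence: Optional[str] = None,
) -> str:
    """Validate a status change and return the new status."""
    # list.index raises ValueError for unknown status names.
    if LIFECYCLE.index(target) <= LIFECYCLE.index(current):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    if target == "resolved" and not (cause and consequence):
        raise ValueError("resolving requires both cause and consequence")
    return target
```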

Feature Reference

Every feature, where to find it in the app, and what it does. Use this when you know what you want to do but not where it lives.

| Feature | Where in app | Description |
|---|---|---|
| Auto-create incident | Triggered by alert rule "consecutive failures" | Monitor failures open an incident automatically, linked to the monitor and region. |
| Manual create | Sidebar → Incidents → + New incident | For planned maintenance or human-reported issues with no monitor trigger. |
| Assign incident commander | Incident detail → Assignee | Single accountable owner; shown on every broadcast update. |
| Broadcast update | Incident detail → Updates → New update | One panel posts to chat + status page subscribers in one click. |
| AI root-cause analysis | Incident detail → AI analyse (Pro+) | Gemini suggests likely causes and remediation from monitor logs + recent activity. |
| Lifecycle status | Incident detail → Status dropdown | Investigating → Identified → Monitoring → Resolved with timestamped transitions. |
| Timeline | Incident detail → Timeline tab | Every state change, update, ack and AI suggestion in chronological order. |
| MTTR & duration | Computed automatically on Resolved | Mean-time-to-recovery rolled up across services and time windows in Analytics. |
| Post-mortem fields | Resolve step → Cause + Consequence | Captured at resolution; surface in monthly retrospectives and SLA reports. |
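
For reference, mean time to recovery is the average duration of resolved incidents. The sketch below shows that calculation in isolation; the function name and the input shape are assumptions, and the roll-up across services and time windows that Analytics performs is not modeled.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple


def mean_time_to_recovery_seconds(
    incidents: Iterable[Tuple[datetime, Optional[datetime]]],
) -> Optional[float]:
    """Average duration of resolved incidents, or None if there are none.

    Each item is (started_at, resolved_at); unresolved incidents are skipped.
    """
    durations = [
        (resolved_at - started_at).total_seconds()
        for started_at, resolved_at in incidents
        if resolved_at is not None
    ]
    if not durations:
        return None
    return sum(durations) / len(durations)
```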

Next Steps

Continue building your monitoring stack: