AI-SRE -- Autonomous Site Reliability Engineering
Turn pages into resolved incidents in minutes, not hours.
AI-SRE is an autonomous incident response platform that ingests alerts from any monitoring source, diagnoses root causes using LLM-powered reasoning, executes safe remediation actions, and continuously learns from incident patterns -- all with full audit trails, guardrails, and no vendor lock-in.
Incident Response Flow
Alert → Diagnose → Remediate → Learn
How It Works
graph LR
A[Alert Sources] -->|POST /webhook| B[Normalize & Dedup]
B --> C[AI Diagnosis]
C --> D{Policy Engine}
D -->|Auto-execute| E[Remediate]
D -->|Needs approval| F[Notify Operator]
F -->|Approve| E
E --> G[Audit & Learn]
G -->|Patterns| C
AI-SRE operates as an autonomous loop: alerts arrive from any monitoring tool, get normalized into a unified incident model, pass through LLM-powered diagnosis enriched with Kubernetes state and logs, and flow through a policy-governed action engine that enforces safety guardrails at every step.
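The "Normalize & Dedup" step can be illustrated with a minimal Python sketch. It is not AI-SRE's actual code: the `Incident` model, its field names, and the choice of dedup key are assumptions made for illustration. The input shape follows the Alertmanager webhook body (a top-level `alerts` list whose entries carry `labels`), and the `service` label is assumed to be one your alert rules attach.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Incident:
    """Hypothetical unified incident model; field names are illustrative."""
    source: str
    service: str
    severity: str
    title: str

    @property
    def dedup_key(self) -> tuple[str, str, str]:
        # Alerts sharing service, severity, and title collapse into one incident.
        return (self.service, self.severity, self.title)


def from_alertmanager(payload: dict[str, Any]) -> list[Incident]:
    """Map an Alertmanager-style webhook body to incidents."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        incidents.append(Incident(
            source="alertmanager",
            service=labels.get("service", "unknown"),   # assumed custom label
            severity=labels.get("severity", "unknown"),
            title=labels.get("alertname", "unnamed alert"),
        ))
    return incidents


def dedup(incidents: list[Incident]) -> list[Incident]:
    """Keep the first incident seen for each dedup key, preserving order."""
    seen: dict[tuple[str, str, str], Incident] = {}
    for incident in incidents:
        seen.setdefault(incident.dedup_key, incident)
    return list(seen.values())


payload = {
    "alerts": [
        {"labels": {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"}},
        {"labels": {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"}},
        {"labels": {"alertname": "PodCrashLooping", "service": "search", "severity": "warning"}},
    ]
}
unique = dedup(from_alertmanager(payload))
print([i.title for i in unique])  # ['HighErrorRate', 'PodCrashLooping']
```

Each additional alert source would get its own `from_...` mapper that produces the same `Incident` type, so everything downstream of normalization stays source-agnostic.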
Key Features
- Multi-Source Alert Ingestion: Receive alerts from PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Prometheus/Alertmanager, and generic webhooks. Automatic deduplication and alert grouping by service and severity.
- AI-Powered Diagnosis: LLM-driven root cause analysis combining Kubernetes cluster state, application logs, matched playbooks, similar historical incidents, and deploy correlation context. Confidence-scored with explainable reasoning.
- Safe Autonomous Remediation: Three-tier action execution: advisory-only suggestions, dry-run previews, and live execution. Policy engine enforces namespace isolation, replica limits, and approval requirements.
- Runbooks & Workflows: Built-in and auto-generated runbooks with pattern-matching triggers. YAML-defined multi-step workflows with conditional steps, timeouts, and rollback actions.
- Proactive Intelligence: Capacity forecasting, incident pattern clustering, per-service failure profiles, and operational drift detection. Identify systemic issues before they become outages.
- Multi-Tenant Workspaces: Workspace isolation with per-workspace API keys, Kubernetes namespace restrictions, and RBAC roles (viewer, operator, admin). Each workspace gets its own filtered view.
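To make the remediation guardrails concrete, here is a small Python sketch of how a policy check over the three execution tiers might look. It is an illustration under stated assumptions, not the platform's policy engine: `Mode`, `Decision`, `Action`, `Policy`, and `evaluate` are hypothetical names, and the rule order (namespace, then replica limit, then approval) is one plausible choice.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ADVISORY = "advisory"  # suggestion only, nothing is executed
    DRY_RUN = "dry_run"    # preview of the change
    LIVE = "live"          # real execution


class Decision(Enum):
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"
    ALLOW = "allow"


@dataclass(frozen=True)
class Action:
    kind: str               # illustrative values: "restart", "scale"
    namespace: str
    replicas: int = 0
    mode: Mode = Mode.ADVISORY


@dataclass(frozen=True)
class Policy:
    allowed_namespaces: frozenset[str]
    max_replicas: int
    auto_approved_kinds: frozenset[str]


def evaluate(action: Action, policy: Policy) -> Decision:
    """Apply namespace isolation, replica limits, then approval rules."""
    if action.namespace not in policy.allowed_namespaces:
        return Decision.DENY
    if action.replicas > policy.max_replicas:
        return Decision.DENY
    if action.mode is not Mode.LIVE:
        # Advisory and dry-run tiers change nothing, so no approval is needed.
        return Decision.ALLOW
    if action.kind in policy.auto_approved_kinds:
        return Decision.ALLOW
    return Decision.NEEDS_APPROVAL


policy = Policy(
    allowed_namespaces=frozenset({"staging", "payments"}),
    max_replicas=10,
    auto_approved_kinds=frozenset({"restart"}),
)
restart = evaluate(Action("restart", "payments", mode=Mode.LIVE), policy)
scale = evaluate(Action("scale", "payments", replicas=8, mode=Mode.LIVE), policy)
too_big = evaluate(Action("scale", "payments", replicas=50, mode=Mode.DRY_RUN), policy)
outside = evaluate(Action("scale", "kube-system", replicas=2), policy)
print(restart, scale, too_big, outside)
```

In this sketch the hard limits are checked before the execution tier, so even a dry-run preview of an out-of-policy action is rejected rather than shown as if it could be approved.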
Quick Start
Get AI-SRE running in under 5 minutes:
Then verify it is running:
curl http://localhost:8888/health
# {"status":"ok"}
# Seed demo data and open the console
make seed
open http://localhost:8888/console
Integrations
AI-SRE integrates with your existing stack without requiring you to replace any tools:
| Category | Supported |
|---|---|
| Alert Sources | PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Prometheus/Alertmanager, generic webhook |
| Log Backends | Grafana Loki, Elasticsearch, Datadog Logs, mock (for dev) |
| LLM Providers | Anthropic Claude, OpenAI GPT, Ollama (self-hosted) |
| Action Targets | Kubernetes, AWS ECS |
| Notifications | Slack, Microsoft Teams, PagerDuty, Email (SMTP) |
| Database | SQLite (dev/pilot), PostgreSQL (production) |
| CI/CD | Deploy event ingestion from any pipeline |
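Deploy events are useful mainly because they can be correlated with incidents during diagnosis. The Python sketch below shows one simple way to do that: select deploys of the affected service within a lookback window before the incident started. The `DeployEvent` fields, the `recent_deploys` helper, and the 30-minute default window are assumptions for illustration, not AI-SRE's schema or defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DeployEvent:
    """Hypothetical deploy record as a CI/CD pipeline might report it."""
    service: str
    version: str
    deployed_at: datetime


def recent_deploys(
    events: list[DeployEvent],
    service: str,
    incident_start: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed lookback window
) -> list[DeployEvent]:
    """Return deploys of `service` inside the window before the incident, newest first."""
    candidates = [
        e for e in events
        if e.service == service
        and timedelta(0) <= incident_start - e.deployed_at <= window
    ]
    return sorted(candidates, key=lambda e: e.deployed_at, reverse=True)


incident_start = datetime(2024, 1, 1, 12, 0)
events = [
    DeployEvent("checkout", "v0", datetime(2024, 1, 1, 10, 0)),   # too old
    DeployEvent("checkout", "v1", datetime(2024, 1, 1, 11, 45)),
    DeployEvent("search", "s1", datetime(2024, 1, 1, 11, 50)),    # other service
    DeployEvent("checkout", "v2", datetime(2024, 1, 1, 11, 55)),
    DeployEvent("checkout", "v3", datetime(2024, 1, 1, 12, 5)),   # after incident
]
suspects = recent_deploys(events, "checkout", incident_start)
print([e.version for e in suspects])  # ['v2', 'v1']
```

The most recent deploy comes first because, in practice, it is usually the first candidate worth checking or rolling back.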
Architecture at a Glance
graph TB
subgraph External["External Systems"]
PD[PagerDuty]
DD[Datadog]
GF[Grafana]
AM[Alertmanager]
CI[CI/CD Pipelines]
end
subgraph API["API Layer"]
WH[Webhook Receiver]
SEC[Auth + Rate Limit]
RT[Routers]
end
subgraph Core["Core Services"]
ORCH[Orchestration]
REASON[Reasoning Engine]
ACT[Action Engine]
AUTO[Autonomy Loop]
POL[Policy Engine]
end
subgraph Adapters["Adapter Layer"]
AS[Alert Sources]
LB[Log Backends]
AP[Action Providers]
NC[Notification Channels]
end
subgraph Storage["Storage"]
DB[(SQLite / PostgreSQL)]
CACHE[In-Memory Caches]
end
External --> API
API --> Core
Core --> Adapters
Core --> Storage
Adapters --> Storage
Full architecture documentation
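The adapter layer in the diagram suggests that core services depend on small interfaces rather than on specific vendors. As a rough Python sketch of that idea, assuming hypothetical names (`LogBackend`, `MockLogBackend`, `collect_context`) that are not taken from the AI-SRE codebase:

```python
from typing import Protocol


class LogBackend(Protocol):
    """Hypothetical interface a log adapter would satisfy."""

    def fetch_logs(self, service: str, limit: int) -> list[str]:
        ...


class MockLogBackend:
    """In-memory stand-in, similar in spirit to a dev-only mock backend."""

    def __init__(self, lines_by_service: dict[str, list[str]]) -> None:
        self._lines = lines_by_service

    def fetch_logs(self, service: str, limit: int) -> list[str]:
        if limit <= 0:
            return []
        # Return the most recent `limit` lines for the service.
        return self._lines.get(service, [])[-limit:]


def collect_context(backend: LogBackend, service: str, limit: int = 50) -> list[str]:
    """Core code sees only the protocol, never a concrete backend."""
    return backend.fetch_logs(service, limit)


backend = MockLogBackend({"checkout": ["started", "db timeout", "retrying", "db timeout"]})
tail = collect_context(backend, "checkout", limit=2)
print(tail)  # ['retrying', 'db timeout']
```

Because `LogBackend` is a structural protocol, a Loki or Elasticsearch adapter would only need a matching `fetch_logs` method to be usable by `collect_context`, with no shared base class.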
What AI-SRE Gives Your Team
| Metric | Without AI-SRE | With AI-SRE |
|---|---|---|
| Time to first response | 10-30 min (human pages) | < 1 min (automatic diagnosis) |
| Mean time to resolution | 30-120 min | 5-15 min |
| On-call toil | Manual runbook execution | Automated with approval gates |
| Incident pattern detection | Quarterly reviews | Continuous, real-time |
| Postmortem creation | Hours of writing | AI-generated draft in seconds |