AI-SRE -- Autonomous Site Reliability Engineering

Turn pages into resolved incidents in minutes, not hours.

AI-SRE is an autonomous incident response platform that ingests alerts from any monitoring source, diagnoses root causes using LLM-powered reasoning, executes safe remediation actions, and continuously learns from incident patterns -- all with full audit trails, guardrails, and no vendor lock-in.

Incident Response Flow

Alert → Diagnose → Remediate → Learn


How It Works

graph LR
    A[Alert Sources] -->|POST /webhook| B[Normalize & Dedup]
    B --> C[AI Diagnosis]
    C --> D{Policy Engine}
    D -->|Auto-execute| E[Remediate]
    D -->|Needs approval| F[Notify Operator]
    F -->|Approve| E
    E --> G[Audit & Learn]
    G -->|Patterns| C

AI-SRE operates as an autonomous loop: alerts arrive from any monitoring tool, get normalized into a unified incident model, pass through LLM-powered diagnosis enriched with Kubernetes state and logs, and flow through a policy-governed action engine that enforces safety guardrails at every step.


Key Features

  • Multi-Source Alert Ingestion


    Receive alerts from PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Prometheus/Alertmanager, and generic webhooks. Automatic deduplication and alert grouping by service and severity.

    API Reference

  • AI-Powered Diagnosis


    LLM-driven root cause analysis combining Kubernetes cluster state, application logs, matched playbooks, similar historical incidents, and deploy correlation context. Confidence-scored with explainable reasoning.

    Architecture

  • Safe Autonomous Remediation


    Three-tier action execution: advisory-only suggestions, dry-run previews, and live execution. Policy engine enforces namespace isolation, replica limits, and approval requirements.

    Configuration

  • Runbooks & Workflows


    Built-in and auto-generated runbooks with pattern-matching triggers. YAML-defined multi-step workflows with conditional steps, timeouts, and rollback actions.

    API Reference

  • Proactive Intelligence


    Capacity forecasting, incident pattern clustering, per-service failure profiles, and operational drift detection. Identify systemic issues before they become outages.

    API Reference

  • Multi-Tenant Workspaces


    Workspace isolation with per-workspace API keys, Kubernetes namespace restrictions, and RBAC roles (viewer, operator, admin). Each workspace gets its own filtered view.

    Configuration
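The three-tier action model and its guardrails can be illustrated with a small decision function. The class names, field names, and decision strings below are hypothetical, chosen to mirror the guardrails listed above (namespace isolation, replica limits, approval requirements), not AI-SRE's real policy schema:

```python
from dataclasses import dataclass

@dataclass
class ActionRequest:
    kind: str            # e.g. "scale_deployment", "restart_pod"
    namespace: str
    replicas: int = 0

@dataclass
class Policy:
    allowed_namespaces: set[str]
    max_replicas: int
    mode: str            # "advisory" | "dry_run" | "live"

def evaluate(policy: Policy, req: ActionRequest) -> str:
    """Return the execution decision for one proposed action."""
    if req.namespace not in policy.allowed_namespaces:
        return "deny: namespace outside workspace boundary"
    if req.kind == "scale_deployment" and req.replicas > policy.max_replicas:
        return "needs_approval: replica limit exceeded"
    if policy.mode == "advisory":
        return "suggest_only"          # tier 1: advisory, never executes
    if policy.mode == "dry_run":
        return "preview"               # tier 2: show what would change
    return "execute"                   # tier 3: live execution, audited

policy = Policy(allowed_namespaces={"prod-web"}, max_replicas=10, mode="live")
print(evaluate(policy, ActionRequest("scale_deployment", "prod-web", replicas=4)))
# -> execute
print(evaluate(policy, ActionRequest("scale_deployment", "prod-web", replicas=50)))
# -> needs_approval: replica limit exceeded
```

The key design point: guardrail checks (namespace, limits) run before the tier is even consulted, so no mode can bypass them.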


Quick Start

Get AI-SRE running in under 5 minutes:

Local demo (Python):

git clone https://github.com/aabhat-ai/AI-SRE.git && cd AI-SRE
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# Edit .env -- set ANTHROPIC_API_KEY or OPENAI_API_KEY
make demo

Minikube:

git clone https://github.com/aabhat-ai/AI-SRE.git && cd AI-SRE
minikube start
make minikube-deploy
make minikube-url

Docker Compose:

git clone https://github.com/aabhat-ai/AI-SRE.git && cd AI-SRE
cp .env.example .env
# Edit .env with your API keys
docker compose -f deploy/docker-compose.yml up -d

Then verify it is running:

curl http://localhost:8888/health
# {"status":"ok"}

# Seed demo data and open the console
make seed
open http://localhost:8888/console

Full quickstart guide


Integrations

AI-SRE integrates with your existing stack without requiring you to replace any tools:

| Category | Supported |
| --- | --- |
| Alert Sources | PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Prometheus/Alertmanager, generic webhook |
| Log Backends | Grafana Loki, Elasticsearch, Datadog Logs, mock (for dev) |
| LLM Providers | Anthropic Claude, OpenAI GPT, Ollama (self-hosted) |
| Action Targets | Kubernetes, AWS ECS |
| Notifications | Slack, Microsoft Teams, PagerDuty, Email (SMTP) |
| Database | SQLite (dev/pilot), PostgreSQL (production) |
| CI/CD | Deploy event ingestion from any pipeline |
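For the generic webhook source, an alert can be pushed with a plain HTTP POST. The `/webhook` path comes from the flow diagram above, but the payload fields and the bearer-token auth scheme shown here are assumptions for illustration -- consult the API reference for the actual contract:

```python
import json
from urllib import request

# Hypothetical payload shape for the generic webhook source.
alert = {
    "source": "custom-monitor",
    "service": "payments",
    "severity": "critical",
    "title": "p99 latency above SLO",
    "labels": {"region": "us-east-1"},
}

req = request.Request(
    "http://localhost:8888/webhook",
    data=json.dumps(alert).encode(),
    headers={
        "Content-Type": "application/json",
        # Auth scheme is an assumption; workspaces use per-workspace API keys.
        "Authorization": "Bearer <workspace-api-key>",
    },
    method="POST",
)
# request.urlopen(req)   # uncomment to send against a running instance
print(req.get_method(), req.full_url)  # -> POST http://localhost:8888/webhook
```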

Architecture at a Glance

graph TB
    subgraph External["External Systems"]
        PD[PagerDuty]
        DD[Datadog]
        GF[Grafana]
        AM[Alertmanager]
        CI[CI/CD Pipelines]
    end

    subgraph API["API Layer"]
        WH[Webhook Receiver]
        SEC[Auth + Rate Limit]
        RT[Routers]
    end

    subgraph Core["Core Services"]
        ORCH[Orchestration]
        REASON[Reasoning Engine]
        ACT[Action Engine]
        AUTO[Autonomy Loop]
        POL[Policy Engine]
    end

    subgraph Adapters["Adapter Layer"]
        AS[Alert Sources]
        LB[Log Backends]
        AP[Action Providers]
        NC[Notification Channels]
    end

    subgraph Storage["Storage"]
        DB[(SQLite / PostgreSQL)]
        CACHE[In-Memory Caches]
    end

    External --> API
    API --> Core
    Core --> Adapters
    Core --> Storage
    Adapters --> Storage

Full architecture documentation


What AI-SRE Gives Your Team

| Metric | Without AI-SRE | With AI-SRE |
| --- | --- | --- |
| Time to first response | 10-30 min (human paged) | < 1 min (automatic diagnosis) |
| Mean time to resolution | 30-120 min | 5-15 min |
| On-call toil | Manual runbook execution | Automated with approval gates |
| Incident pattern detection | Quarterly reviews | Continuous, real-time |
| Postmortem creation | Hours of writing | AI-generated draft in seconds |