AI-SRE -- Autonomous Site Reliability Engineering

Turn pages into resolved incidents in minutes, not hours.

AI-SRE is an autonomous incident response platform that ingests alerts from any monitoring source, diagnoses root causes using LLM-powered reasoning, executes safe remediation actions, and continuously learns from incident patterns -- all with full audit trails, guardrails, and no vendor lock-in.

Incident Response Flow

Alert → Diagnose → Remediate → Learn


How It Works

graph LR
    A[Alert Sources] -->|POST /webhook| B[Normalize & Dedup]
    B --> C[AI Diagnosis]
    C --> D{Policy Engine}
    D -->|Auto-execute| E[Remediate]
    D -->|Needs approval| F[Notify Operator]
    F -->|Approve| E
    E --> G[Audit & Learn]
    G -->|Patterns| C

AI-SRE operates as an autonomous loop: alerts arrive from any monitoring tool, get normalized into a unified incident model, pass through LLM-powered diagnosis enriched with Kubernetes state and logs, and flow through a policy-governed action engine that enforces safety guardrails at every step.


Key Features

  • Multi-Source Alert Ingestion


    Receive alerts from PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Prometheus/Alertmanager, and generic webhooks. Automatic deduplication and alert grouping by service and severity.

    API Reference

  • AI-Powered Diagnosis


    LLM-driven root cause analysis combining Kubernetes cluster state, application logs, matched playbooks, similar historical incidents, and deploy correlation context. Confidence-scored with explainable reasoning.

    Architecture

  • Safe Autonomous Remediation


    Three-tier action execution: advisory-only suggestions, dry-run previews, and live execution. Policy engine enforces namespace isolation, replica limits, and approval requirements.

    Configuration

  • Runbooks & Workflows


    Built-in and auto-generated runbooks with pattern-matching triggers. YAML-defined multi-step workflows with conditional steps, timeouts, and rollback actions.

    API Reference

  • Proactive Intelligence


    Capacity forecasting, incident pattern clustering, per-service failure profiles, and operational drift detection. Identify systemic issues before they become outages.

    API Reference

  • Multi-Tenant Workspaces


    Workspace isolation with per-workspace API keys, Kubernetes namespace restrictions, and RBAC roles (viewer, operator, admin). Each workspace gets its own filtered view.

    Configuration
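The three-tier action model and its guardrails can be illustrated with a small decision function. The class names, field names, and decision strings below are hypothetical, chosen to mirror the guardrails listed above (namespace isolation, replica limits, approval requirements), not AI-SRE's real policy schema:

```python
from dataclasses import dataclass

@dataclass
class ActionRequest:
    kind: str            # e.g. "scale_deployment", "restart_pod"
    namespace: str
    replicas: int = 0

@dataclass
class Policy:
    allowed_namespaces: set[str]
    max_replicas: int
    mode: str            # "advisory" | "dry_run" | "live"

def evaluate(policy: Policy, req: ActionRequest) -> str:
    """Return the execution decision for one proposed action."""
    if req.namespace not in policy.allowed_namespaces:
        return "deny: namespace outside workspace boundary"
    if req.kind == "scale_deployment" and req.replicas > policy.max_replicas:
        return "needs_approval: replica limit exceeded"
    if policy.mode == "advisory":
        return "suggest_only"          # tier 1: advisory, never executes
    if policy.mode == "dry_run":
        return "preview"               # tier 2: show what would change
    return "execute"                   # tier 3: live execution, audited

policy = Policy(allowed_namespaces={"prod-web"}, max_replicas=10, mode="live")
print(evaluate(policy, ActionRequest("scale_deployment", "prod-web", replicas=4)))
# -> execute
print(evaluate(policy, ActionRequest("scale_deployment", "prod-web", replicas=50)))
# -> needs_approval: replica limit exceeded
```

The key design point: guardrail checks (namespace, limits) run before the tier is even consulted, so no mode can bypass them.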


Quick Start

Get AI-SRE running in under 5 minutes:

Local demo (Python):

git clone https://github.com/aabhat-ai/AI-SRE.git && cd AI-SRE
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
# Edit .env -- set ANTHROPIC_API_KEY or OPENAI_API_KEY
make demo

Minikube:

git clone https://github.com/aabhat-ai/AI-SRE.git && cd AI-SRE
minikube start
make minikube-deploy
make minikube-url

Docker Compose:

git clone https://github.com/aabhat-ai/AI-SRE.git && cd AI-SRE
cp .env.example .env
# Edit .env with your API keys
docker compose -f deploy/docker-compose.yml up -d

Then verify it is running:

curl http://localhost:8888/health
# {"status":"ok"}

# Seed demo data and open the console
make seed
open http://localhost:8888/console

Full quickstart guide


Integrations

AI-SRE integrates with your existing stack without requiring you to replace any tools:

| Category | Supported |
| --- | --- |
| Alert Sources | PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Prometheus/Alertmanager, generic webhook |
| Log Backends | Grafana Loki, Elasticsearch, Datadog Logs, mock (for dev) |
| LLM Providers | Anthropic Claude, OpenAI GPT, Ollama (self-hosted) |
| Action Targets | Kubernetes, AWS ECS |
| Notifications | Slack, Microsoft Teams, PagerDuty, Email (SMTP) |
| Database | SQLite (dev/pilot), PostgreSQL (production) |
| CI/CD | Deploy event ingestion from any pipeline |
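For the generic webhook source, an alert can be pushed with a plain HTTP POST. The `/webhook` path comes from the flow diagram above, but the payload fields and the bearer-token auth scheme shown here are assumptions for illustration -- consult the API reference for the actual contract:

```python
import json
from urllib import request

# Hypothetical payload shape for the generic webhook source.
alert = {
    "source": "custom-monitor",
    "service": "payments",
    "severity": "critical",
    "title": "p99 latency above SLO",
    "labels": {"region": "us-east-1"},
}

req = request.Request(
    "http://localhost:8888/webhook",
    data=json.dumps(alert).encode(),
    headers={
        "Content-Type": "application/json",
        # Auth scheme is an assumption; workspaces use per-workspace API keys.
        "Authorization": "Bearer <workspace-api-key>",
    },
    method="POST",
)
# request.urlopen(req)   # uncomment to send against a running instance
print(req.get_method(), req.full_url)  # -> POST http://localhost:8888/webhook
```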

Architecture at a Glance

graph TB
    subgraph External["External Systems"]
        PD[PagerDuty]
        DD[Datadog]
        GF[Grafana]
        AM[Alertmanager]
        CI[CI/CD Pipelines]
    end

    subgraph API["API Layer"]
        WH[Webhook Receiver]
        SEC[Auth + Rate Limit]
        RT[Routers]
    end

    subgraph Core["Core Services"]
        ORCH[Orchestration]
        REASON[Reasoning Engine]
        ACT[Action Engine]
        AUTO[Autonomy Loop]
        POL[Policy Engine]
    end

    subgraph Adapters["Adapter Layer"]
        AS[Alert Sources]
        LB[Log Backends]
        AP[Action Providers]
        NC[Notification Channels]
    end

    subgraph Storage["Storage"]
        DB[(SQLite / PostgreSQL)]
        CACHE[In-Memory Caches]
    end

    External --> API
    API --> Core
    Core --> Adapters
    Core --> Storage
    Adapters --> Storage

Full architecture documentation


What AI-SRE Gives Your Team

| Metric | Without AI-SRE | With AI-SRE |
| --- | --- | --- |
| Time to first response | 10-30 min (human paged) | < 1 min (automatic diagnosis) |
| Mean time to resolution | 30-120 min | 5-15 min |
| On-call toil | Manual runbook execution | Automated with approval gates |
| Incident pattern detection | Quarterly reviews | Continuous, real-time |
| Postmortem creation | Hours of writing | AI-generated draft in seconds |