
AI-SRE Architecture

This document describes the internal architecture of the AI-SRE platform, including component responsibilities, data flow, the adapter/plugin system, storage layer, and security model.

System Overview

AI-SRE is a FastAPI application that acts as an autonomous Site Reliability Engineering agent. It receives alerts from monitoring tools, normalizes them into a unified incident model, applies AI-powered diagnosis, and executes or suggests remediation actions within configurable safety guardrails.

┌─────────────────────────────────────────────────────────────────────┐
│                        External Systems                             │
│  PagerDuty  Opsgenie  Datadog  Grafana  NewRelic  Alertmanager     │
│  CI/CD (deploy events)   Slack   Teams   Email   PagerDuty (sync)  │
└──────────┬──────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                      API Layer (FastAPI)                             │
│  ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐  │
│  │ Security │ │   Ingestion  │ │   Routers    │ │    System     │  │
│  │ (Auth +  │ │  (Webhook +  │ │ (Automation, │ │  (Health,     │  │
│  │  Rate    │ │   Dedup +    │ │ Intelligence,│ │   Metrics,    │  │
│  │  Limit)  │ │   Grouping)  │ │ OnCall, SLO, │ │   Console)   │  │
│  │          │ │              │ │ Workflows,   │ │              │  │
│  │          │ │              │ │ Notifications│ │              │  │
│  └──────────┘ └──────────────┘ └──────────────┘ └───────────────┘  │
└──────────┬──────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                       Core Services                                 │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │ Orchestration│ │   Reasoning  │ │   Actions    │                │
│  │ (Diagnosis   │ │ (LLM Agent + │ │ (Catalog +   │                │
│  │  Pipeline)   │ │  Tools)      │ │  Execution)  │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │  Autonomy    │ │  Policies    │ │  Knowledge   │                │
│  │ (Auto-       │ │ (Guardrails) │ │ (Similar     │                │
│  │  remediate)  │ │              │ │  Incidents)  │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │ Correlation  │ │ Forecasting  │ │ Intelligence │                │
│  │ (Deploy)     │ │ (Capacity)   │ │ (Patterns)   │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │  Runbooks    │ │  Workflows   │ │  Postmortem  │                │
│  │ (Match +     │ │ (YAML        │ │ + Digest     │                │
│  │  Execute +   │ │  Engine)     │ │              │                │
│  │  Generate)   │ │              │ │              │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
└──────────┬──────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                     Adapter Layer (Pluggable)                       │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐      │
│  │ Alert      │ │ Log        │ │ Action     │ │Notification│      │
│  │ Sources    │ │ Backends   │ │ Providers  │ │ Channels   │      │
│  │ (PD, OG,  │ │ (Loki,     │ │ (K8s, ECS) │ │(Slack, PD, │      │
│  │  DD, GF,  │ │  Elastic,  │ │            │ │ Teams,     │      │
│  │  NR, AM)  │ │  Datadog,  │ │            │ │ Email)     │      │
│  │           │ │  Mock)     │ │            │ │            │      │
│  └────────────┘ └────────────┘ └────────────┘ └────────────┘      │
└──────────┬──────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                      Storage Layer                                  │
│  ┌────────────────────┐ ┌────────────────────┐                      │
│  │ SQLite / PostgreSQL│ │  In-Memory Caches  │                      │
│  │ (Incidents, Actions│ │ (Dedup, Deploy     │                      │
│  │  Timeline, Metrics,│ │  Events, Runbook   │                      │
│  │  Groups, SLO)      │ │  Store)            │                      │
│  └────────────────────┘ └────────────────────┘                      │
└─────────────────────────────────────────────────────────────────────┘

Component Responsibilities

Ingestion Pipeline (src/ingestion/)

The ingestion pipeline is the entry point for all alert data.

| Module | Responsibility |
| --- | --- |
| server.py | FastAPI application, endpoint definitions, lifespan management |
| normalize.py | Converts provider-specific payloads (PagerDuty, Opsgenie, generic webhook) into the unified Incident model |
| models.py | Pydantic Incident model with fields for title, description, severity, source, namespace, pod, deployment, workspace, timestamps |
| dedup.py | Alert deduplication with three strategies: exact (fingerprint match), fuzzy (title/service similarity), window (time-window + service + severity). Includes cache-based fast path and DB-based strategy path |
| grouper.py | Groups related alerts into parent-child incident relationships based on service, namespace, and time proximity |
| retry.py | Retry wrapper with exponential backoff and dead-letter queue for failed alert processing |

Data flow: Webhook request -> source detection -> resolved/dedup fast path -> normalize -> strategy dedup -> persist -> grouping -> dedup cache update -> response.
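
The cache-based fast path for exact deduplication can be sketched as follows. This is a minimal illustration, not the code in dedup.py: the fields that feed the fingerprint, the DedupCache class, and the 300-second TTL are all assumptions.

```python
from __future__ import annotations

import hashlib
import time


def fingerprint(source: str, service: str, title: str) -> str:
    """Build a stable fingerprint from fields assumed to identify an alert."""
    raw = "|".join(part.strip().lower() for part in (source, service, title))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupCache:
    """Hypothetical fingerprint -> incident id cache with a per-entry TTL."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def lookup(self, fp: str) -> str | None:
        """Return the incident id for a live entry, or None if absent or expired."""
        entry = self._entries.get(fp)
        if entry is None:
            return None
        incident_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[fp]  # expired: the caller falls back to the DB path
            return None
        return incident_id

    def remember(self, fp: str, incident_id: str) -> None:
        """Record the fingerprint once the incident has been persisted."""
        self._entries[fp] = (incident_id, self._clock() + self._ttl)
```

A cache hit lets the webhook handler return the existing incident id without touching the database; a miss falls through to normalization and the strategy-based check.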

Reasoning Engine (src/reasoning/)

The LLM-powered diagnosis brain.

| Module | Responsibility |
| --- | --- |
| agent.py | LangGraph agent that orchestrates LLM calls with tool use for diagnosis |
| tools.py | Tool definitions the LLM can invoke during diagnosis (fetch logs, check K8s state, list pods, etc.) |
| logs.py | Log fetching and summarization logic, dispatches to the configured log backend adapter |

Orchestration (src/orchestration/)

Coordinates the diagnosis workflow.

| Module | Responsibility |
| --- | --- |
| incident_service.py | Main diagnose_incident() function that assembles context (K8s state, logs, playbooks, similar incidents, deploy correlation) and invokes the reasoning agent |
| presentation.py | Formats diagnosis results for API responses and Slack messages |

Actions (src/actions/)

Safe action execution with guardrails.

| Module | Responsibility |
| --- | --- |
| catalog.py | Action definitions (restart_pod, scale_deployment, suggest_config_fix) with metadata (blast radius, executability). Applies policy engine to every action suggestion |
| k8s_actions.py | Kubernetes action executor using the kubernetes Python client. Enforces namespace allowlists, replica limits, dry-run defaults |
| pr_generator.py | Creates GitHub PRs with suggested config fixes (resource limits, replica counts) |

Autonomy (src/autonomy/)

The autonomous remediation loop.

| Module | Responsibility |
| --- | --- |
| engine.py | run_autonomous_monitor_cycle() scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Respects AUTONOMY_ENABLED, AUTONOMOUS_ACTIONS, and approval settings |
| worker.py | Background worker that runs autonomy cycles on a configurable interval |

Policy Engine (src/policies/)

Centralized guardrail evaluation.

| Module | Responsibility |
| --- | --- |
| engine.py | evaluate_action_policy() determines whether an action is allowed, requires approval, should be dry-run only, or can be auto-executed. Considers namespace restrictions, deployment profile, and autonomy settings |

Adapter System (src/adapters/)

Four pluggable adapter registries allow extending the platform without modifying core logic.

Alert Source Registry (src/adapters/alert_registry.py)

Maps source names to normalizer functions. Built-in sources:

  • webhook: generic JSON webhook
  • pagerduty: PagerDuty v3 events
  • opsgenie: Opsgenie alert format
  • alertmanager / prometheus: Prometheus Alertmanager format
  • datadog: Datadog webhook events
  • grafana: Grafana alerting webhook
  • newrelic: New Relic alert format

Protocol: normalize_fn(payload: dict, workspace_id, workspace_name) -> Incident | None
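
One way such a registry could look is sketched below. The decorator, the `_ALERT_SOURCES` dict, the `normalize` dispatcher, and the field mapping are hypothetical, and the dataclass is only a stand-in for the Pydantic Incident model; the one thing taken from the real format is that Alertmanager webhooks carry an `alerts` list whose items have `labels` and `annotations`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Incident:
    """Stand-in for the Pydantic Incident model; only a few fields are shown."""
    title: str
    severity: str
    source: str
    workspace: Optional[str] = None


NormalizeFn = Callable[[dict, Optional[str], Optional[str]], Optional[Incident]]

# Hypothetical registry: source name -> normalizer function.
_ALERT_SOURCES: dict[str, NormalizeFn] = {}


def register_alert_source(*names: str):
    """Decorator that registers one normalizer under one or more source names."""
    def decorator(fn: NormalizeFn) -> NormalizeFn:
        for name in names:
            _ALERT_SOURCES[name.lower()] = fn
        return fn
    return decorator


@register_alert_source("alertmanager", "prometheus")
def normalize_alertmanager(payload, workspace_id, workspace_name):
    # Alertmanager webhooks carry a list under "alerts"; this sketch uses the first.
    alerts = payload.get("alerts") or []
    if not alerts:
        return None  # nothing to normalize
    labels = alerts[0].get("labels", {})
    annotations = alerts[0].get("annotations", {})
    return Incident(
        title=annotations.get("summary") or labels.get("alertname", "unknown alert"),
        severity=labels.get("severity", "unknown"),
        source="alertmanager",
        workspace=workspace_name,
    )


def normalize(source: str, payload: dict, workspace_id=None, workspace_name=None):
    """Dispatch to the registered normalizer; unknown sources yield None."""
    fn = _ALERT_SOURCES.get(source.lower())
    return fn(payload, workspace_id, workspace_name) if fn else None
```

Registering two names for one function mirrors the alertmanager / prometheus alias in the list of built-in sources.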

Log Backend Registry (src/adapters/log_registry.py)

Maps log provider names to log fetching implementations. Supported backends: Loki, Elasticsearch, Datadog Logs, Mock.

Action Provider Registry (src/adapters/actions/registry.py)

Maps action names to adapter instances. Built-in providers:

  • kubernetes: K8s actions (restart_pod, scale_deployment, etc.)
  • ecs: AWS ECS actions

Protocol:

from typing import Protocol

class ActionAdapter(Protocol):
    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict: ...
    def supported_actions(self) -> list[str]: ...
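
To make the protocol concrete, here is a sketch of a provider that satisfies it and a registry keyed by action name. The protocol is restated so the block is self-contained; `EchoAdapter`, `ACTION_PROVIDERS`, `register_provider`, and the shape of the returned dict are hypothetical and not taken from the codebase.

```python
from __future__ import annotations

import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class ActionAdapter(Protocol):
    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict: ...
    def supported_actions(self) -> list[str]: ...


class EchoAdapter:
    """Hypothetical provider used only to illustrate the protocol."""

    def supported_actions(self) -> list[str]:
        return ["echo"]

    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict:
        if action_name not in self.supported_actions():
            return {"status": "error", "detail": f"unsupported action: {action_name}"}
        if dry_run or not approved:
            # Report what would happen without performing it.
            return {"status": "dry_run", "action": action_name, "arguments": arguments}
        return {"status": "executed", "action": action_name, "arguments": arguments}


# Hypothetical registry: action name -> adapter instance.
ACTION_PROVIDERS: dict[str, ActionAdapter] = {}


def register_provider(adapter: ActionAdapter) -> None:
    """Register the adapter under every action it declares."""
    for action in adapter.supported_actions():
        ACTION_PROVIDERS[action] = adapter


register_provider(EchoAdapter())
result = asyncio.run(
    ACTION_PROVIDERS["echo"].execute(
        "echo", {"msg": "hi"}, dry_run=True, incident=None, approved=False
    )
)
```

Because the protocol is structural, `EchoAdapter` does not need to inherit from `ActionAdapter`; matching method names are enough.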

Notification Channel Registry (src/adapters/notifications/registry.py)

Maps channel names to notification adapters. Built-in channels: Slack, PagerDuty, Microsoft Teams, Email (SMTP).

Runbook System (src/runbooks/)

| Module | Responsibility |
| --- | --- |
| schema.py | Pydantic Runbook model with steps, triggers, conditions |
| executor.py | Executes runbooks step-by-step, matching trigger patterns to incidents, with approval gates |
| generator.py | Analyzes incident history patterns and generates runbooks using LLM |
| store.py | Persists generated runbooks to config/runbooks/generated/ as YAML |

Workflow Engine (src/workflows/)

| Module | Responsibility |
| --- | --- |
| engine.py | Loads YAML workflow definitions from config/workflows/, executes multi-step workflows with conditional steps, timeouts, rollbacks, and inter-step dependencies |

Intelligence (src/intelligence/, src/forecasting/, src/correlation/)

| Module | Responsibility |
| --- | --- |
| intelligence/pattern_learner.py | Clusters incidents by keywords, computes repeat rates, builds per-service failure profiles |
| forecasting/capacity.py | Predicts capacity risks from incident rate trends and metric timelines |
| correlation/deploy.py | Stores deploy events in memory, correlates with incidents by timestamp and service/namespace proximity |
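
Deploy correlation by timestamp and service/namespace proximity might look like the sketch below. `DeployEvent`, `DeployEventStore`, the 30-minute window, and the "same service or same namespace" rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DeployEvent:
    service: str
    namespace: str
    deployed_at: datetime


class DeployEventStore:
    """Hypothetical in-memory store; events are lost on restart."""

    def __init__(self) -> None:
        self._events: list[DeployEvent] = []

    def record(self, event: DeployEvent) -> None:
        self._events.append(event)

    def correlate(self, service, namespace, incident_at, window=timedelta(minutes=30)):
        """Return deploys that preceded the incident within the window, newest first.

        A deploy matches when it shares the service or the namespace.
        """
        matches = [
            e for e in self._events
            if timedelta(0) <= incident_at - e.deployed_at <= window
            and (e.service == service or e.namespace == namespace)
        ]
        return sorted(matches, key=lambda e: e.deployed_at, reverse=True)
```

Returning the newest deploy first puts the most likely culprit at the head of the list handed to the diagnosis context.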

Notification & On-Call (src/notifications/)

| Module | Responsibility |
| --- | --- |
| router.py | Severity-based notification routing with per-service overrides. Dispatches to all matching channels |
| oncall.py | Resolves current on-call engineer from YAML schedule using ISO week rotation |
| escalation.py | Time-based escalation levels (L1, L2, L3) for unacknowledged incidents |
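
The ISO week rotation and the time-based escalation levels can be illustrated with a short sketch. The function names, the modulo-based selection, and the 15- and 30-minute thresholds are assumptions; the actual schedule format lives in the YAML configuration.

```python
from datetime import date, timedelta


def current_on_call(rotation: list[str], today: date) -> str:
    """Pick the engineer for the ISO week containing `today`."""
    if not rotation:
        raise ValueError("rotation must list at least one engineer")
    iso_week = today.isocalendar()[1]  # ISO week number, 1..53
    return rotation[iso_week % len(rotation)]


def escalation_level(
    unacked_for: timedelta,
    thresholds=(timedelta(minutes=15), timedelta(minutes=30)),
) -> str:
    """Map time without acknowledgement to L1/L2/L3 (thresholds are assumed)."""
    l2_after, l3_after = thresholds
    if unacked_for >= l3_after:
        return "L3"
    if unacked_for >= l2_after:
        return "L2"
    return "L1"
```

A plain modulo on the week number is simple but can repeat or skip an engineer at the year boundary, where the week number resets from 52 or 53 to 1.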

SLO Engine (src/slo/)

Loads SLO definitions from config/slos.yaml and calculates error budget status from incident data. Supports availability, latency, and error rate indicators.
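
For an availability indicator, the error budget calculation could be sketched as below. The `ErrorBudget` type and `availability_budget` function are hypothetical, and treating total incident duration as downtime is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_minutes: float
    consumed_minutes: float

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.consumed_minutes

    @property
    def consumed_fraction(self) -> float:
        # A zero budget (100% target) is reported as fully consumed.
        return self.consumed_minutes / self.allowed_minutes if self.allowed_minutes else 1.0


def availability_budget(target: float, window_days: int, incident_minutes: list[float]) -> ErrorBudget:
    """Error budget for an availability SLO, treating incident time as downtime."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    allowed = (1.0 - target) * window_minutes
    return ErrorBudget(allowed_minutes=allowed, consumed_minutes=sum(incident_minutes))
```

For example, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so two incidents totalling 21.6 minutes consume half the budget.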

Multi-Tenancy (src/tenancy/)

| Module | Responsibility |
| --- | --- |
| workspace.py | Workspace model, YAML-based store, API key to workspace resolution |
| rbac.py | Role hierarchy (viewer < operator < admin), permission checks, FastAPI auth dependencies |

Security (src/security/)

| Module | Responsibility |
| --- | --- |
| auth.py | API key authentication via X-API-Key header or api_key query parameter. Runs in dev mode when no keys are configured |
| rate_limit.py | Per-IP rate limiting, configurable via AI_SRE_RATE_LIMIT |

Storage Layer

Database (SQLAlchemy Async)

The platform uses SQLAlchemy with async drivers for both SQLite (aiosqlite) and PostgreSQL (asyncpg). Tables are auto-created on startup; Alembic handles schema migrations for production PostgreSQL.

Tables:

| Table | Purpose |
| --- | --- |
| incidents | Core incident records with title, severity, source, status, timestamps, workspace |
| incident_groups | Parent-child relationships between grouped incidents |
| action_logs | Audit trail of all actions executed against incidents |
| timeline_events | Event-sourced timeline (alert received, diagnosis started/completed, action executed, postmortem generated, etc.) |
| incident_metrics | Time-series metric snapshots per incident |

Repositories:

| Module | Responsibility |
| --- | --- |
| storage/db.py | Engine creation, session factory, init_db() |
| storage/models.py | SQLAlchemy ORM models |
| storage/incident_repo.py | CRUD for incidents, grouping, similar search, noisy incident queries |
| storage/action_log_repo.py | Action audit log reads/writes |
| storage/timeline.py | Timeline event recording and retrieval |
| storage/metrics_repo.py | Metric aggregation (MTTR, TTFR, incident rates, trends) |
| storage/metrics_export.py | Pilot metrics computation and export |

In-Memory State

Several components maintain in-memory state for performance:

  • Dedup cache (ingestion/dedup.py): Fingerprint-to-incident-id mapping with TTL for fast dedup decisions
  • Deploy event store (correlation/deploy.py): Recent deploy events for correlation lookups
  • Runbook store (runbooks/store.py): Generated runbooks loaded from YAML files
  • Workflow engine (workflows/engine.py): Loaded workflow definitions and execution run history

Security Model

Authentication

  1. API Key Auth: All protected endpoints require a valid API key via X-API-Key header or api_key query parameter. Keys are configured in AI_SRE_API_KEYS (comma-separated). When no keys are configured, the platform runs in dev mode with open access.

  2. Workspace Resolution: API keys are mapped to workspaces via config/workspaces.yaml. Each workspace key has an assigned role.
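
The API key check described above might be sketched as follows. The helper names are hypothetical, and preferring the header over the query parameter when both are present is an assumption; the source only says either is accepted.

```python
from __future__ import annotations

import hmac


def parse_api_keys(raw: str | None) -> list[str]:
    """Split a comma-separated AI_SRE_API_KEYS value, dropping blanks."""
    return [k.strip() for k in (raw or "").split(",") if k.strip()]


def extract_key(headers: dict[str, str], query: dict[str, str]) -> str | None:
    """Prefer the X-API-Key header, then the api_key query parameter."""
    for name, value in headers.items():
        if name.lower() == "x-api-key":  # header names are case-insensitive
            return value
    return query.get("api_key")


def is_authorized(configured: list[str], presented: str | None) -> bool:
    if not configured:
        return True  # dev mode: no keys configured means open access
    if presented is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return any(hmac.compare_digest(presented.encode(), key.encode()) for key in configured)
```

In the running service the raw value would come from the AI_SRE_API_KEYS environment variable, for example via `os.environ.get`.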

Authorization (RBAC)

Three-tier role hierarchy:

  • Viewer (level 10): Read incidents, metrics, timelines
  • Operator (level 20): Execute actions, manage autonomy, seed demo data
  • Admin (level 30): All operator permissions plus workspace management

Requests with unknown or no API key default to Admin role for backwards compatibility (dev mode behavior).
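
Because roles carry numeric levels, a permission check reduces to an integer comparison. In this sketch the levels come from the list above, while the permission names and the `REQUIRED_ROLE` mapping are hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 10
    OPERATOR = 20
    ADMIN = 30


# Hypothetical permission names mapped to the minimum role that grants them.
REQUIRED_ROLE = {
    "incidents:read": Role.VIEWER,
    "actions:execute": Role.OPERATOR,
    "workspaces:manage": Role.ADMIN,
}


def has_permission(role: Role, permission: str) -> bool:
    """A role grants a permission when its level meets the required minimum."""
    required = REQUIRED_ROLE.get(permission)
    if required is None:
        return False  # unknown permissions are denied
    return role >= required
```

Spacing the levels by ten leaves room to add intermediate roles later without renumbering.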

Namespace Isolation

Workspaces define allowed_namespaces lists. When a workspace is resolved, Kubernetes actions are restricted to those namespaces. An empty list means all namespaces are allowed.

Rate Limiting

Per-IP rate limiting configured via AI_SRE_RATE_LIMIT (requests per minute). Set to 0 to disable.
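
A per-IP limit expressed in requests per minute could be enforced with a sliding window, as in this sketch. The algorithm and the `PerIpRateLimiter` class are assumptions; the source does not say which strategy rate_limit.py uses.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter keyed by client IP (illustrative only)."""

    def __init__(self, limit_per_minute: int, clock=time.monotonic):
        self._limit = limit_per_minute
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        if self._limit <= 0:
            return True  # 0 disables rate limiting
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the 60-second window.
        while hits and now - hits[0] >= 60.0:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

The limit would be read from AI_SRE_RATE_LIMIT at startup; a rejected request would typically be answered with HTTP 429.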

Action Safety Guardrails

Every action passes through the policy engine which enforces:

  1. Dry-run default: Actions default to dry-run unless explicitly overridden
  2. Approval gates: Live actions require explicit approval when APPROVAL_REQUIRED=true
  3. Namespace allowlists: Actions blocked outside allowed namespaces
  4. Replica limits: Scale actions capped at MAX_SCALE_REPLICAS
  5. Autonomous action allowlist: Only actions in AUTONOMOUS_ACTIONS can be auto-executed
  6. Blast radius metadata: Each action carries blast radius classification for operator awareness
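
The guardrails above can be read as an ordered series of checks. The sketch below is one possible ordering, not the logic of evaluate_action_policy(): the `PolicySettings` type, the decision strings, and the default values are hypothetical. It also assumes that an over-limit scale request is blocked rather than clamped, and that an allowlisted autonomous action needs no further approval.

```python
from dataclasses import dataclass


@dataclass
class PolicySettings:
    approval_required: bool = True                               # APPROVAL_REQUIRED
    max_scale_replicas: int = 10                                 # MAX_SCALE_REPLICAS (value assumed)
    autonomous_actions: frozenset = frozenset({"restart_pod"})   # AUTONOMOUS_ACTIONS
    allowed_namespaces: frozenset = frozenset()                  # empty means all namespaces


def evaluate(settings, action, namespace, *, replicas=None,
             dry_run=True, approved=False, autonomous=False):
    """Return (decision, reason).

    decision is one of 'blocked', 'dry_run', 'needs_approval', 'execute'.
    """
    if settings.allowed_namespaces and namespace not in settings.allowed_namespaces:
        return "blocked", f"namespace {namespace!r} is not allowlisted"
    if (action == "scale_deployment" and replicas is not None
            and replicas > settings.max_scale_replicas):
        return "blocked", f"replicas {replicas} exceeds cap {settings.max_scale_replicas}"
    if dry_run:
        return "dry_run", "dry-run is the default"
    if autonomous:
        if action in settings.autonomous_actions:
            return "execute", "action is on the autonomous allowlist"
        return "needs_approval", "action is not on the autonomous allowlist"
    if settings.approval_required and not approved:
        return "needs_approval", "live execution requires approval"
    return "execute", "approved live execution"
```

Checking hard limits (namespace, replica cap) before the dry-run default means a dry run also reports requests that could never be executed live.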

CORS

Configurable via AI_SRE_CORS_ORIGINS. Defaults to * for development.

Request Lifecycle

A typical alert-to-resolution flow:

  1. External monitoring sends POST to /webhook with X-Source header
  2. Security middleware validates API key and rate limit
  3. Alert payload is normalized to Incident model via the alert source registry
  4. Dedup check: cache fast-path, then strategy-based (exact/fuzzy/window)
  5. Incident persisted to database with timeline events
  6. Alert grouper checks for related open incidents
  7. Operator (or autonomy loop) requests diagnosis via /incidents/{id}/diagnosis
  8. Orchestration layer assembles context: K8s state, logs, playbooks, similar incidents, deploy correlation
  9. Reasoning agent runs LLM with tools to produce diagnosis
  10. Diagnosis suggests actions, each wrapped with policy engine metadata
  11. Operator approves and executes action via /actions/execute
  12. Action executes through the action adapter registry (K8s, ECS)
  13. Result recorded in action log and timeline
  14. Notifications dispatched per routing rules
  15. Postmortem generated when incident resolves