AI-SRE Architecture¶
This document describes the internal architecture of the AI-SRE platform, including component responsibilities, data flow, the adapter/plugin system, storage layer, and security model.
System Overview¶
AI-SRE is a FastAPI application that acts as an autonomous Site Reliability Engineering agent. It receives alerts from monitoring tools, normalizes them into a unified incident model, applies AI-powered diagnosis, and executes or suggests remediation actions within configurable safety guardrails.
┌─────────────────────────────────────────────────────────────────────┐
│ External Systems │
│ PagerDuty Opsgenie Datadog Grafana NewRelic Alertmanager │
│ CI/CD (deploy events) Slack Teams Email PagerDuty (sync) │
└──────────┬──────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ API Layer (FastAPI) │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Security │ │ Ingestion │ │ Routers │ │ System │ │
│ │ (Auth + │ │ (Webhook + │ │ (Automation, │ │ (Health, │ │
│ │ Rate │ │ Dedup + │ │ Intelligence,│ │ Metrics, │ │
│ │ Limit) │ │ Grouping) │ │ OnCall, SLO, │ │ Console) │ │
│ │ │ │ │ │ Workflows, │ │ │ │
│ │ │ │ │ │ Notifications│ │ │ │
│ └──────────┘ └──────────────┘ └──────────────┘ └───────────────┘ │
└──────────┬──────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Core Services │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Orchestration│ │ Reasoning │ │ Actions │ │
│ │ (Diagnosis │ │ (LLM Agent + │ │ (Catalog + │ │
│ │ Pipeline) │ │ Tools) │ │ Execution) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Autonomy │ │ Policies │ │ Knowledge │ │
│ │ (Auto- │ │ (Guardrails) │ │ (Similar │ │
│ │ remediate) │ │ │ │ Incidents) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Correlation │ │ Forecasting │ │ Intelligence │ │
│ │ (Deploy) │ │ (Capacity) │ │ (Patterns) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Runbooks │ │ Workflows │ │ Postmortem │ │
│ │ (Match + │ │ (YAML │ │ + Digest │ │
│ │ Execute + │ │ Engine) │ │ │ │
│ │ Generate) │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────┬──────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Adapter Layer (Pluggable) │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Alert │ │ Log │ │ Action │ │Notification│ │
│ │ Sources │ │ Backends │ │ Providers │ │ Channels │ │
│ │ (PD, OG, │ │ (Loki, │ │ (K8s, ECS) │ │(Slack, PD, │ │
│ │ DD, GF, │ │ Elastic, │ │ │ │ Teams, │ │
│ │ NR, AM) │ │ Datadog, │ │ │ │ Email) │ │
│ │ │ │ Mock) │ │ │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└──────────┬──────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ SQLite / PostgreSQL│ │ In-Memory Caches │ │
│ │ (Incidents, Actions│ │ (Dedup, Deploy │ │
│ │ Timeline, Metrics,│ │ Events, Runbook │ │
│ │ Groups, SLO) │ │ Store) │ │
│ └────────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Component Responsibilities¶
Ingestion Pipeline (src/ingestion/)¶
The ingestion pipeline is the entry point for all alert data.
| Module | Responsibility |
|---|---|
| `server.py` | FastAPI application, endpoint definitions, lifespan management |
| `normalize.py` | Converts provider-specific payloads (PagerDuty, Opsgenie, generic webhook) into the unified `Incident` model |
| `models.py` | Pydantic `Incident` model with fields for title, description, severity, source, namespace, pod, deployment, workspace, timestamps |
| `dedup.py` | Alert deduplication with three strategies: exact (fingerprint match), fuzzy (title/service similarity), window (time window + service + severity). Includes a cache-based fast path and a DB-based strategy path |
| `grouper.py` | Groups related alerts into parent-child incident relationships based on service, namespace, and time proximity |
| `retry.py` | Retry wrapper with exponential backoff and a dead-letter queue for failed alert processing |
Data flow: Webhook request -> source detection -> resolved/dedup fast path -> normalize -> strategy dedup -> persist -> grouping -> dedup cache update -> response.
Reasoning Engine (src/reasoning/)¶
The LLM-powered diagnosis brain.
| Module | Responsibility |
|---|---|
| `agent.py` | LangGraph agent that orchestrates LLM calls with tool use for diagnosis |
| `tools.py` | Tool definitions the LLM can invoke during diagnosis (fetch logs, check K8s state, list pods, etc.) |
| `logs.py` | Log fetching and summarization logic; dispatches to the configured log backend adapter |
Orchestration (src/orchestration/)¶
Coordinates the diagnosis workflow.
| Module | Responsibility |
|---|---|
| `incident_service.py` | Main `diagnose_incident()` function that assembles context (K8s state, logs, playbooks, similar incidents, deploy correlation) and invokes the reasoning agent |
| `presentation.py` | Formats diagnosis results for API responses and Slack messages |
Actions (src/actions/)¶
Safe action execution with guardrails.
| Module | Responsibility |
|---|---|
| `catalog.py` | Action definitions (`restart_pod`, `scale_deployment`, `suggest_config_fix`) with metadata (blast radius, executability). Applies the policy engine to every action suggestion |
| `k8s_actions.py` | Kubernetes action executor using the `kubernetes` Python client. Enforces namespace allowlists, replica limits, and dry-run defaults |
| `pr_generator.py` | Creates GitHub PRs with suggested config fixes (resource limits, replica counts) |
Autonomy (src/autonomy/)¶
The autonomous remediation loop.
| Module | Responsibility |
|---|---|
| `engine.py` | `run_autonomous_monitor_cycle()` scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Respects `AUTONOMY_ENABLED`, `AUTONOMOUS_ACTIONS`, and approval settings |
| `worker.py` | Background worker that runs autonomy cycles on a configurable interval |
Policy Engine (src/policies/)¶
Centralized guardrail evaluation.
| Module | Responsibility |
|---|---|
| `engine.py` | `evaluate_action_policy()` determines whether an action is allowed, requires approval, should be dry-run only, or can be auto-executed. Considers namespace restrictions, deployment profile, and autonomy settings |
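The decision logic can be sketched as a small function that returns one of the four outcomes described above. This is a simplified illustration, not the real `evaluate_action_policy()` signature; the settings fields mirror the env vars mentioned elsewhere in this document.

```python
from dataclasses import dataclass, field

@dataclass
class PolicySettings:
    """Illustrative stand-in for the real policy configuration."""
    allowed_namespaces: list[str] = field(default_factory=list)  # empty = all allowed
    approval_required: bool = True
    autonomous_actions: list[str] = field(default_factory=list)

def evaluate(action: str, namespace: str, dry_run: bool, s: PolicySettings) -> str:
    """Return one of: 'blocked', 'dry_run', 'auto', 'needs_approval'."""
    if s.allowed_namespaces and namespace not in s.allowed_namespaces:
        return "blocked"            # namespace allowlist violated
    if dry_run:
        return "dry_run"            # dry-run default wins over everything else
    if action in s.autonomous_actions:
        return "auto"               # explicitly allowlisted for autonomy
    if s.approval_required:
        return "needs_approval"     # live action gated behind a human
    return "auto"
```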
Adapter System (src/adapters/)¶
Four pluggable adapter registries allow extending the platform without modifying core logic.
Alert Source Registry (src/adapters/alert_registry.py)¶
Maps source names to normalizer functions. Built-in sources:
- webhook -- generic JSON webhook
- pagerduty -- PagerDuty v3 events
- opsgenie -- Opsgenie alert format
- alertmanager / prometheus -- Prometheus Alertmanager format
- datadog -- Datadog webhook events
- grafana -- Grafana alerting webhook
- newrelic -- New Relic alert format
Protocol: `normalize_fn(payload: dict, workspace_id, workspace_name) -> Incident | None`
Log Backend Registry (src/adapters/log_registry.py)¶
Maps log provider names to log fetching implementations. Supported backends: Loki, Elasticsearch, Datadog Logs, Mock.
Action Provider Registry (src/adapters/actions/registry.py)¶
Maps action names to adapter instances. Built-in providers:
- kubernetes -- K8s actions (restart_pod, scale_deployment, etc.)
- ecs -- AWS ECS actions
Protocol:

```python
class ActionAdapter(Protocol):
    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict: ...
    def supported_actions(self) -> list[str]: ...
```
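A provider satisfies the protocol structurally, so no base class is required. The no-op "echo" adapter below is purely illustrative:

```python
class EchoAdapter:
    """Hypothetical provider that reports what it would have executed."""

    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict:
        if action_name not in self.supported_actions():
            return {"status": "error", "reason": f"unsupported action {action_name}"}
        return {
            "status": "dry_run" if dry_run else "executed",
            "action": action_name,
            "arguments": arguments,
        }

    def supported_actions(self) -> list[str]:
        return ["echo"]
```

Registering such an adapter under a new provider name is how ECS-style backends are added without touching core logic.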
Notification Channel Registry (src/adapters/notifications/registry.py)¶
Maps channel names to notification adapters. Built-in channels: Slack, PagerDuty, Microsoft Teams, Email (SMTP).
Runbook System (src/runbooks/)¶
| Module | Responsibility |
|---|---|
| `schema.py` | Pydantic `Runbook` model with steps, triggers, and conditions |
| `executor.py` | Executes runbooks step by step, matching trigger patterns to incidents, with approval gates |
| `generator.py` | Analyzes incident history patterns and generates runbooks using the LLM |
| `store.py` | Persists generated runbooks to `config/runbooks/generated/` as YAML |
Workflow Engine (src/workflows/)¶
| Module | Responsibility |
|---|---|
| `engine.py` | Loads YAML workflow definitions from `config/workflows/` and executes multi-step workflows with conditional steps, timeouts, rollbacks, and inter-step dependencies |
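A workflow definition might look roughly like the fragment below. The field names here are illustrative assumptions, not the engine's actual schema:

```yaml
# Hypothetical workflow definition (schema names are illustrative)
name: restart-and-verify
trigger:
  severity: high
  service: checkout
steps:
  - id: restart
    action: restart_pod
    timeout_seconds: 120
  - id: verify
    action: check_health
    depends_on: [restart]
    on_failure: rollback
```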
Intelligence (src/intelligence/, src/forecasting/, src/correlation/)¶
| Module | Responsibility |
|---|---|
| `intelligence/pattern_learner.py` | Clusters incidents by keywords, computes repeat rates, and builds per-service failure profiles |
| `forecasting/capacity.py` | Predicts capacity risks from incident rate trends and metric timelines |
| `correlation/deploy.py` | Stores deploy events in memory and correlates them with incidents by timestamp and service/namespace proximity |
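Deploy correlation reduces to a service match plus a time window. The sketch below illustrates the idea; the window size, names, and data shapes are assumptions rather than the real `correlation/deploy.py` internals.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 1800  # hypothetical: deploys within 30 min before the incident

@dataclass
class Deploy:
    service: str
    timestamp: float

def correlate(incident_service: str, incident_ts: float, deploys: list[Deploy]) -> list[Deploy]:
    """Return deploys to the same service shortly before the incident."""
    return [
        d for d in deploys
        if d.service == incident_service
        and 0 <= incident_ts - d.timestamp <= WINDOW_SECONDS
    ]
```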
Notification & On-Call (src/notifications/)¶
| Module | Responsibility |
|---|---|
| `router.py` | Severity-based notification routing with per-service overrides. Dispatches to all matching channels |
| `oncall.py` | Resolves the current on-call engineer from a YAML schedule using ISO-week rotation |
| `escalation.py` | Time-based escalation levels (L1, L2, L3) for unacknowledged incidents |
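ISO-week rotation amounts to indexing the roster by the current ISO week number. A minimal sketch, assuming a flat roster list (the real schedule format is YAML and richer than this):

```python
import datetime

def current_oncall(engineers: list[str], when: datetime.date) -> str:
    """Pick the on-call engineer by rotating the roster each ISO week."""
    week = when.isocalendar()[1]  # ISO week number, 1..53
    return engineers[week % len(engineers)]
```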
SLO Engine (src/slo/)¶
Loads SLO definitions from `config/slos.yaml` and calculates error budget status from incident data. Supports availability, latency, and error-rate indicators.
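For an availability SLO, the error budget is the allowed unreliability (1 − target) over the window, and consumption is downtime divided by that budget. A worked sketch of the arithmetic (function name and output shape are illustrative):

```python
def error_budget_status(target: float, window_minutes: int, downtime_minutes: float) -> dict:
    """Return remaining error budget for an availability SLO."""
    budget_minutes = (1.0 - target) * window_minutes   # allowed downtime
    consumed = downtime_minutes / budget_minutes if budget_minutes else 1.0
    return {
        "budget_minutes": budget_minutes,
        "consumed_fraction": consumed,
        "breached": consumed >= 1.0,
    }
```

For example, a 99.9% target over a 30-day (43 200-minute) window allows about 43.2 minutes of downtime.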
Multi-Tenancy (src/tenancy/)¶
| Module | Responsibility |
|---|---|
| `workspace.py` | `Workspace` model, YAML-based store, API-key-to-workspace resolution |
| `rbac.py` | Role hierarchy (viewer < operator < admin), permission checks, FastAPI auth dependencies |
Security (src/security/)¶
| Module | Responsibility |
|---|---|
| `auth.py` | API key authentication via the `X-API-Key` header or `api_key` query parameter. Dev mode when no keys are configured |
| `rate_limit.py` | Per-IP rate limiting, configurable via `AI_SRE_RATE_LIMIT` |
Storage Layer¶
Database (SQLAlchemy Async)¶
The platform uses SQLAlchemy with async drivers for both SQLite (aiosqlite) and PostgreSQL (asyncpg). Tables are auto-created on startup; Alembic handles schema migrations for production PostgreSQL.
Tables:
| Table | Purpose |
|---|---|
| `incidents` | Core incident records with title, severity, source, status, timestamps, workspace |
| `incident_groups` | Parent-child relationships between grouped incidents |
| `action_logs` | Audit trail of all actions executed against incidents |
| `timeline_events` | Event-sourced timeline (alert received, diagnosis started/completed, action executed, postmortem generated, etc.) |
| `incident_metrics` | Time-series metric snapshots per incident |
Repositories:
| Module | Responsibility |
|---|---|
| `storage/db.py` | Engine creation, session factory, `init_db()` |
| `storage/models.py` | SQLAlchemy ORM models |
| `storage/incident_repo.py` | CRUD for incidents, grouping, similar-incident search, noisy-incident queries |
| `storage/action_log_repo.py` | Action audit log reads/writes |
| `storage/timeline.py` | Timeline event recording and retrieval |
| `storage/metrics_repo.py` | Metric aggregation (MTTR, TTFR, incident rates, trends) |
| `storage/metrics_export.py` | Pilot metrics computation and export |
In-Memory State¶
Several components maintain in-memory state for performance:
- Dedup cache (`ingestion/dedup.py`): Fingerprint-to-incident-id mapping with TTL for fast dedup decisions
- Deploy event store (`correlation/deploy.py`): Recent deploy events for correlation lookups
- Runbook store (`runbooks/store.py`): Generated runbooks loaded from YAML files
- Workflow engine (`workflows/engine.py`): Loaded workflow definitions and execution run history
Security Model¶
Authentication¶
- API Key Auth: All protected endpoints require a valid API key via the `X-API-Key` header or the `api_key` query parameter. Keys are configured in `AI_SRE_API_KEYS` (comma-separated). When no keys are configured, the platform runs in dev mode with open access.
- Workspace Resolution: API keys are mapped to workspaces via `config/workspaces.yaml`. Each workspace key has an assigned role.
Authorization (RBAC)¶
Three-tier role hierarchy:
- Viewer (level 10): Read incidents, metrics, timelines
- Operator (level 20): Execute actions, manage autonomy, seed demo data
- Admin (level 30): All operator permissions plus workspace management
Requests with an unknown or missing API key default to the Admin role for backwards compatibility (dev-mode behavior).
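The hierarchy above can be expressed as a simple numeric comparison: a role satisfies any requirement at or below its own level. The helper name below is an assumption; only the levels come from the document.

```python
# Levels taken from the hierarchy above; the helper itself is illustrative.
ROLE_LEVELS = {"viewer": 10, "operator": 20, "admin": 30}

def has_permission(role: str, required: str) -> bool:
    """A role satisfies any requirement at or below its own level."""
    return ROLE_LEVELS.get(role, 0) >= ROLE_LEVELS[required]
```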
Namespace Isolation¶
Workspaces define `allowed_namespaces` lists. When a workspace is resolved, Kubernetes actions are restricted to those namespaces. An empty list means all namespaces are allowed.
Rate Limiting¶
Per-IP rate limiting configured via `AI_SRE_RATE_LIMIT` (requests per minute). Set to `0` to disable.
Action Safety Guardrails¶
Every action passes through the policy engine which enforces:
- Dry-run default: Actions default to dry-run unless explicitly overridden
- Approval gates: Live actions require explicit approval when `APPROVAL_REQUIRED=true`
- Namespace allowlists: Actions are blocked outside allowed namespaces
- Replica limits: Scale actions are capped at `MAX_SCALE_REPLICAS`
- Autonomous action allowlist: Only actions listed in `AUTONOMOUS_ACTIONS` can be auto-executed
- Blast radius metadata: Each action carries a blast-radius classification for operator awareness
CORS¶
Configurable via `AI_SRE_CORS_ORIGINS`. Defaults to `*` for development.
Request Lifecycle¶
A typical alert-to-resolution flow:
- External monitoring sends a POST to `/webhook` with an `X-Source` header
- Security middleware validates the API key and rate limit
- The alert payload is normalized to the `Incident` model via the alert source registry
- Dedup check: cache fast path, then strategy-based (exact/fuzzy/window)
- The incident is persisted to the database with timeline events
- The alert grouper checks for related open incidents
- An operator (or the autonomy loop) requests a diagnosis via `/incidents/{id}/diagnosis`
- The orchestration layer assembles context: K8s state, logs, playbooks, similar incidents, deploy correlation
- The reasoning agent runs the LLM with tools to produce a diagnosis
- The diagnosis suggests actions, each wrapped with policy engine metadata
- An operator approves and executes an action via `/actions/execute`
- The action executes through the action adapter registry (K8s, ECS)
- The result is recorded in the action log and timeline
- Notifications are dispatched per routing rules
- A postmortem is generated when the incident resolves