AI-SRE Architecture¶

This document describes the internal architecture of the AI-SRE platform, including component responsibilities, data flow, the adapter/plugin system, storage layer, and security model.

System Overview¶

AI-SRE is a FastAPI application that acts as an autonomous Site Reliability Engineering agent. It receives alerts from monitoring tools, normalizes them into a unified incident model, applies AI-powered diagnosis, and executes or suggests remediation actions within configurable safety guardrails.

┌─────────────────────────────────────────────────────────────────────┐
│                        External Systems                             │
│  PagerDuty  Opsgenie  Datadog  Grafana  NewRelic  Alertmanager     │
│  CI/CD (deploy events)   Slack   Teams   Email   PagerDuty (sync)  │
└──────────┬──────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      API Layer (FastAPI)                             │
│  ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐  │
│  │ Security │ │   Ingestion  │ │   Routers    │ │    System     │  │
│  │ (Auth +  │ │  (Webhook +  │ │ (Automation, │ │  (Health,     │  │
│  │  Rate    │ │   Dedup +    │ │ Intelligence,│ │   Metrics,    │  │
│  │  Limit)  │ │   Grouping)  │ │ OnCall, SLO, │ │   Console)    │  │
│  │          │ │              │ │ Workflows,   │ │               │  │
│  │          │ │              │ │ Notifications│ │               │  │
│  └──────────┘ └──────────────┘ └──────────────┘ └───────────────┘  │
└──────────┬──────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Core Services                                 │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │ Orchestration│ │   Reasoning  │ │   Actions    │                │
│  │ (Diagnosis   │ │ (LLM Agent + │ │ (Catalog +   │                │
│  │  Pipeline)   │ │  Tools)      │ │  Execution)  │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │  Autonomy    │ │  Policies    │ │  Knowledge   │                │
│  │ (Auto-       │ │ (Guardrails) │ │ (Similar     │                │
│  │  remediate)  │ │              │ │  Incidents)  │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │ Correlation  │ │ Forecasting  │ │ Intelligence │                │
│  │ (Deploy)     │ │ (Capacity)   │ │ (Patterns)   │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                │
│  │  Runbooks    │ │  Workflows   │ │  Postmortem  │                │
│  │ (Match +     │ │ (YAML        │ │ + Digest     │                │
│  │  Execute +   │ │  Engine)     │ │              │                │
│  │  Generate)   │ │              │ │              │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                │
└──────────┬──────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Adapter Layer (Pluggable)                       │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐      │
│  │ Alert      │ │ Log        │ │ Action     │ │Notification│      │
│  │ Sources    │ │ Backends   │ │ Providers  │ │ Channels   │      │
│  │ (PD, OG,   │ │ (Loki,     │ │ (K8s, ECS) │ │(Slack, PD, │      │
│  │  DD, GF,   │ │  Elastic,  │ │            │ │ Teams,     │      │
│  │  NR, AM)   │ │  Datadog,  │ │            │ │ Email)     │      │
│  │            │ │  Mock)     │ │            │ │            │      │
│  └────────────┘ └────────────┘ └────────────┘ └────────────┘      │
└──────────┬──────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Storage Layer                                  │
│  ┌────────────────────┐ ┌────────────────────┐                      │
│  │ SQLite / PostgreSQL│ │  In-Memory Caches  │                      │
│  │ (Incidents, Actions│ │ (Dedup, Deploy     │                      │
│  │  Timeline, Metrics,│ │  Events, Runbook   │                      │
│  │  Groups, SLO)      │ │  Store)            │                      │
│  └────────────────────┘ └────────────────────┘                      │
└─────────────────────────────────────────────────────────────────────┘

Component Responsibilities¶

Ingestion Pipeline (`src/ingestion/`)¶

The ingestion pipeline is the entry point for all alert data.

Module	Responsibility
`server.py`	FastAPI application, endpoint definitions, lifespan management
`normalize.py`	Converts provider-specific payloads (PagerDuty, Opsgenie, generic webhook) into the unified `Incident` model
`models.py`	Pydantic `Incident` model with fields for title, description, severity, source, namespace, pod, deployment, workspace, timestamps
`dedup.py`	Alert deduplication with three strategies: `exact` (fingerprint match), `fuzzy` (title/service similarity), `window` (time-window + service + severity). Includes cache-based fast path and DB-based strategy path
`grouper.py`	Groups related alerts into parent-child incident relationships based on service, namespace, and time proximity
`retry.py`	Retry wrapper with exponential backoff and dead-letter queue for failed alert processing

Data flow: Webhook request -> source detection -> resolved/dedup fast path -> normalize -> strategy dedup -> persist -> grouping -> dedup cache update -> response.
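
The three dedup strategies can be illustrated with a minimal sketch. It is not the code in `dedup.py`: the `Alert` shape, the fingerprint field set, the 0.85 similarity threshold, and the 300-second window are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Alert:
    # Hypothetical minimal alert shape; the real Incident model has more fields.
    title: str
    service: str
    severity: str
    received_at: float  # Unix timestamp in seconds


def fingerprint(alert: Alert) -> str:
    """Stable hash over the fields assumed to identify an alert."""
    raw = f"{alert.service}|{alert.severity}|{alert.title}".lower()
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_duplicate(new: Alert, seen: Alert, strategy: str,
                 fuzzy_threshold: float = 0.85,
                 window_seconds: float = 300.0) -> bool:
    if strategy == "exact":
        # Exact: identical fingerprints.
        return fingerprint(new) == fingerprint(seen)
    if strategy == "fuzzy":
        # Fuzzy: same service and sufficiently similar titles.
        if new.service != seen.service:
            return False
        ratio = SequenceMatcher(None, new.title.lower(), seen.title.lower()).ratio()
        return ratio >= fuzzy_threshold
    if strategy == "window":
        # Window: same service and severity within a time window.
        return (new.service == seen.service
                and new.severity == seen.severity
                and abs(new.received_at - seen.received_at) <= window_seconds)
    raise ValueError(f"unknown dedup strategy: {strategy}")
```

In this sketch, two alerts titled "High CPU on api-1" and "High CPU on api-2" for the same service are distinct under `exact` but duplicates under `fuzzy` and, if they arrive within five minutes of each other, under `window`.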

Reasoning Engine (`src/reasoning/`)¶

The LLM-powered diagnosis brain.

Module	Responsibility
`agent.py`	LangGraph agent that orchestrates LLM calls with tool use for diagnosis
`tools.py`	Tool definitions the LLM can invoke during diagnosis (fetch logs, check K8s state, list pods, etc.)
`logs.py`	Log fetching and summarization logic, dispatches to the configured log backend adapter

Orchestration (`src/orchestration/`)¶

Coordinates the diagnosis workflow.

Module	Responsibility
`incident_service.py`	Main `diagnose_incident()` function that assembles context (K8s state, logs, playbooks, similar incidents, deploy correlation) and invokes the reasoning agent
`presentation.py`	Formats diagnosis results for API responses and Slack messages

Actions (`src/actions/`)¶

Safe action execution with guardrails.

Module	Responsibility
`catalog.py`	Action definitions (`restart_pod`, `scale_deployment`, `suggest_config_fix`) with metadata (blast radius, executability). Applies policy engine to every action suggestion
`k8s_actions.py`	Kubernetes action executor using the `kubernetes` Python client. Enforces namespace allowlists, replica limits, dry-run defaults
`pr_generator.py`	Creates GitHub PRs with suggested config fixes (resource limits, replica counts)

Autonomy (`src/autonomy/`)¶

The autonomous remediation loop.

Module	Responsibility
`engine.py`	`run_autonomous_monitor_cycle()` scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Respects `AUTONOMY_ENABLED`, `AUTONOMOUS_ACTIONS`, and approval settings
`worker.py`	Background worker that runs autonomy cycles on a configurable interval

Policy Engine (`src/policies/`)¶

Centralized guardrail evaluation.

Module	Responsibility
`engine.py`	`evaluate_action_policy()` determines whether an action is allowed, requires approval, should be dry-run only, or can be auto-executed. Considers namespace restrictions, deployment profile, and autonomy settings

Adapter System (`src/adapters/`)¶

Four pluggable adapter registries allow extending the platform without modifying core logic.

Alert Source Registry (`src/adapters/alert_registry.py`)¶

Maps source names to normalizer functions. Built-in sources:

- webhook -- generic JSON webhook
- pagerduty -- PagerDuty v3 events
- opsgenie -- Opsgenie alert format
- alertmanager / prometheus -- Prometheus Alertmanager format
- datadog -- Datadog webhook events
- grafana -- Grafana alerting webhook
- newrelic -- New Relic alert format

Protocol: normalize_fn(payload: dict, workspace_id, workspace_name) -> Incident | None
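
A registry of this kind could look roughly like the sketch below. The decorator name `register_alert_source`, the `normalize` dispatcher, the dict-based `Incident` stand-in, and the field mapping are hypothetical; only the `normalize_fn(payload, workspace_id, workspace_name)` signature comes from the protocol above.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the Incident model; a plain dict keeps the sketch self-contained.
Incident = dict
NormalizeFn = Callable[[dict, str, str], Optional[Incident]]

_ALERT_SOURCES: dict[str, NormalizeFn] = {}


def register_alert_source(*names: str):
    """Decorator that registers one normalizer under one or more source names."""
    def decorator(fn: NormalizeFn) -> NormalizeFn:
        for name in names:
            _ALERT_SOURCES[name.lower()] = fn
        return fn
    return decorator


@register_alert_source("alertmanager", "prometheus")
def normalize_alertmanager(payload, workspace_id, workspace_name):
    # Alertmanager webhooks carry a list of alerts, each with a labels map.
    alerts = payload.get("alerts") or []
    if not alerts:
        return None  # nothing to normalize
    labels = alerts[0].get("labels", {})
    return {
        "title": labels.get("alertname", "unknown"),
        "severity": labels.get("severity", "unknown"),
        "source": "alertmanager",
        "workspace_id": workspace_id,
        "workspace_name": workspace_name,
    }


def normalize(source: str, payload: dict, workspace_id: str, workspace_name: str):
    """Look up the normalizer for `source` and apply it to the payload."""
    fn = _ALERT_SOURCES.get(source.lower())
    if fn is None:
        raise KeyError(f"no normalizer registered for source {source!r}")
    return fn(payload, workspace_id, workspace_name)
```

Registering one function under two names mirrors the `alertmanager` / `prometheus` alias in the built-in list; a new source would only need another decorated function.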

Log Backend Registry (`src/adapters/log_registry.py`)¶

Maps log provider names to log fetching implementations. Supported backends: Loki, Elasticsearch, Datadog Logs, Mock.

Action Provider Registry (`src/adapters/actions/registry.py`)¶

Maps action names to adapter instances. Built-in providers:

- kubernetes -- K8s actions (restart_pod, scale_deployment, etc.)
- ecs -- AWS ECS actions

Protocol:

from typing import Protocol

class ActionAdapter(Protocol):
    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict: ...
    def supported_actions(self) -> list[str]: ...
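
To show how an adapter might satisfy this protocol, here is a minimal sketch. `EchoAdapter` and the `run` helper are hypothetical, the parameter type annotations are assumptions (the protocol above does not state them), and the result dictionary keys are illustrative rather than the platform's actual response shape.

```python
import asyncio
from typing import Any, Protocol


class ActionAdapter(Protocol):
    # Restated from the protocol above, with assumed parameter types.
    async def execute(self, action_name: str, arguments: dict, dry_run: bool,
                      incident: Any, approved: bool) -> dict: ...
    def supported_actions(self) -> list[str]: ...


class EchoAdapter:
    """Hypothetical adapter that only reports what it would do."""

    def supported_actions(self) -> list[str]:
        return ["restart_pod"]

    async def execute(self, action_name, arguments, dry_run, incident, approved) -> dict:
        if action_name not in self.supported_actions():
            return {"status": "error", "reason": f"unsupported action: {action_name}"}
        if dry_run:
            # Dry-run: describe the action without performing it.
            return {"status": "dry_run", "action": action_name, "arguments": arguments}
        if not approved:
            return {"status": "blocked", "reason": "approval required"}
        return {"status": "executed", "action": action_name, "arguments": arguments}


def run(adapter: ActionAdapter, action_name: str, **kwargs) -> dict:
    """Synchronous convenience wrapper around the async execute() call."""
    return asyncio.run(adapter.execute(action_name, **kwargs))
```

Because `ActionAdapter` is a structural protocol, `EchoAdapter` does not need to inherit from it; providing the two methods is enough for a type checker to accept it.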

Notification Channel Registry (`src/adapters/notifications/registry.py`)¶

Maps channel names to notification adapters. Built-in channels: Slack, PagerDuty, Microsoft Teams, Email (SMTP).

Runbook System (`src/runbooks/`)¶

Module	Responsibility
`schema.py`	Pydantic `Runbook` model with steps, triggers, conditions
`executor.py`	Executes runbooks step-by-step, matching trigger patterns to incidents, with approval gates
`generator.py`	Analyzes incident history patterns and generates runbooks using LLM
`store.py`	Persists generated runbooks to `config/runbooks/generated/` as YAML

Workflow Engine (`src/workflows/`)¶

Module	Responsibility
`engine.py`	Loads YAML workflow definitions from `config/workflows/`, executes multi-step workflows with conditional steps, timeouts, rollbacks, and inter-step dependencies
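
Inter-step dependencies imply that steps run in a dependency-respecting order. One way to derive such an order, sketched here with the standard library's `graphlib` (whether the engine uses it is not stated), is a topological sort. The `execution_order` function and the step names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(steps: dict[str, list[str]]) -> list[str]:
    """Return step names ordered so each step follows its dependencies.

    `steps` maps a step name to the names of the steps it depends on.
    """
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"workflow has a dependency cycle: {exc.args[1]}") from exc
```

For a workflow where `notify` depends on `restart` and `restart` depends on `snapshot`, this yields `snapshot`, `restart`, `notify`; a cyclic definition is rejected before any step runs.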

Intelligence (`src/intelligence/`, `src/forecasting/`, `src/correlation/`)¶

Module	Responsibility
`intelligence/pattern_learner.py`	Clusters incidents by keywords, computes repeat rates, builds per-service failure profiles
`forecasting/capacity.py`	Predicts capacity risks from incident rate trends and metric timelines
`correlation/deploy.py`	Stores deploy events in memory, correlates with incidents by timestamp and service/namespace proximity
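
Deploy correlation by timestamp and service/namespace proximity might be sketched as follows. The `DeployEvent` shape, the one-hour lookback, and the "service or namespace matches" rule are assumptions, not the behavior of `correlation/deploy.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployEvent:
    # Hypothetical deploy event shape.
    service: str
    namespace: str
    deployed_at: float  # Unix timestamp in seconds


def correlate_deploys(events, service, namespace, incident_at,
                      lookback_seconds=3600.0):
    """Return deploys that preceded the incident within the lookback window,
    most recent first, and that touch the same service or namespace."""
    matches = [
        e for e in events
        if 0 <= incident_at - e.deployed_at <= lookback_seconds
        and (e.service == service or e.namespace == namespace)
    ]
    return sorted(matches, key=lambda e: e.deployed_at, reverse=True)
```

Deploys that happened after the incident, or long before it, are excluded, so the most recent preceding deploy surfaces first as the likeliest suspect.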

Notification & On-Call (`src/notifications/`)¶

Module	Responsibility
`router.py`	Severity-based notification routing with per-service overrides. Dispatches to all matching channels
`oncall.py`	Resolves current on-call engineer from YAML schedule using ISO week rotation
`escalation.py`	Time-based escalation levels (L1, L2, L3) for unacknowledged incidents
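
An ISO-week rotation can be reduced to a modulo over the ISO week number. The sketch below assumes the schedule is an ordered list of engineer names; the function name and the schedule shape are hypothetical, and the real YAML format is not shown in this document.

```python
from datetime import date


def current_oncall(rotation: list[str], today: date) -> str:
    """Pick the engineer whose turn falls on the ISO week containing `today`."""
    if not rotation:
        raise ValueError("rotation must contain at least one engineer")
    iso_week = today.isocalendar()[1]  # ISO 8601 week number, 1..53
    return rotation[iso_week % len(rotation)]
```

A plain modulo like this keeps the handoff on Mondays, but the sequence can skip or repeat a person at the year boundary when the week number resets to 1.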

SLO Engine (`src/slo/`)¶

Loads SLO definitions from config/slos.yaml and calculates error budget status from incident data. Supports availability, latency, and error rate indicators.
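
For an availability indicator, error budget status comes down to simple arithmetic. The sketch below assumes the budget is expressed in minutes of allowed downtime over a rolling window and that incident data supplies the consumed downtime; the function and key names are hypothetical.

```python
def error_budget_status(target: float, window_days: int,
                        downtime_minutes: float) -> dict:
    """Compute error budget consumption for an availability SLO.

    target: availability objective as a fraction, e.g. 0.999 for 99.9%.
    """
    window_minutes = window_days * 24 * 60
    budget_minutes = (1.0 - target) * window_minutes
    remaining = budget_minutes - downtime_minutes
    return {
        "budget_minutes": budget_minutes,
        "consumed_minutes": downtime_minutes,
        "remaining_minutes": remaining,
        "remaining_fraction": remaining / budget_minutes if budget_minutes else 0.0,
        "exhausted": remaining <= 0,
    }
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of incident downtime leaves roughly half the budget.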

Multi-Tenancy (`src/tenancy/`)¶

Module	Responsibility
`workspace.py`	Workspace model, YAML-based store, API key to workspace resolution
`rbac.py`	Role hierarchy (viewer < operator < admin), permission checks, FastAPI auth dependencies

Security (`src/security/`)¶

Module	Responsibility
`auth.py`	API key authentication via `X-API-Key` header or `api_key` query parameter. Dev mode when no keys configured
`rate_limit.py`	Per-IP rate limiting configurable via `AI_SRE_RATE_LIMIT`

Storage Layer¶

Database (SQLAlchemy Async)¶

The platform uses SQLAlchemy with async drivers for both SQLite (aiosqlite) and PostgreSQL (asyncpg). Tables are auto-created on startup; Alembic handles schema migrations for production PostgreSQL.

Tables:

Table	Purpose
`incidents`	Core incident records with title, severity, source, status, timestamps, workspace
`incident_groups`	Parent-child relationships between grouped incidents
`action_logs`	Audit trail of all actions executed against incidents
`timeline_events`	Event-sourced timeline (alert received, diagnosis started/completed, action executed, postmortem generated, etc.)
`incident_metrics`	Time-series metric snapshots per incident

Repositories:

Module	Responsibility
`storage/db.py`	Engine creation, session factory, `init_db()`
`storage/models.py`	SQLAlchemy ORM models
`storage/incident_repo.py`	CRUD for incidents, grouping, similar search, noisy incident queries
`storage/action_log_repo.py`	Action audit log reads/writes
`storage/timeline.py`	Timeline event recording and retrieval
`storage/metrics_repo.py`	Metric aggregation (MTTR, TTFR, incident rates, trends)
`storage/metrics_export.py`	Pilot metrics computation and export

In-Memory State¶

Several components maintain in-memory state for performance:

Dedup cache (ingestion/dedup.py): Fingerprint-to-incident-id mapping with TTL for fast dedup decisions
Deploy event store (correlation/deploy.py): Recent deploy events for correlation lookups
Runbook store (runbooks/store.py): Generated runbooks loaded from YAML files
Workflow engine (workflows/engine.py): Loaded workflow definitions and execution run history
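
The dedup cache described above, a fingerprint-to-incident-id mapping with TTL, could be sketched like this. The class name, the lazy eviction policy, and the injectable clock are assumptions for illustration.

```python
import time
from typing import Optional


class TTLDedupCache:
    """Maps alert fingerprints to incident ids; entries expire after a TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, fingerprint: str, incident_id: str) -> None:
        self._entries[fingerprint] = (incident_id, self.clock() + self.ttl)

    def get(self, fingerprint: str) -> Optional[str]:
        entry = self._entries.get(fingerprint)
        if entry is None:
            return None
        incident_id, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[fingerprint]  # lazily evict expired entries
            return None
        return incident_id
```

Because this state lives in process memory, it is lost on restart and is not shared between replicas; the DB-based strategy path remains the source of truth in that case.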

Security Model¶

Authentication¶

API Key Auth: All protected endpoints require a valid API key via X-API-Key header or api_key query parameter. Keys are configured in AI_SRE_API_KEYS (comma-separated). When no keys are configured, the platform runs in dev mode with open access.
Workspace Resolution: API keys are mapped to workspaces via config/workspaces.yaml. Each workspace key has an assigned role.

Authorization (RBAC)¶

Three-tier role hierarchy:

Viewer (level 10): Read incidents, metrics, timelines
Operator (level 20): Execute actions, manage autonomy, seed demo data
Admin (level 30): All operator permissions plus workspace management

In dev mode (no API keys configured), requests with an unknown or missing API key default to the Admin role for backwards compatibility.
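
Numeric levels make the hierarchy check a simple comparison. In the sketch below the role levels come from the list above, while the permission names and the `has_permission` helper are hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 10
    OPERATOR = 20
    ADMIN = 30


# Hypothetical permission names mapped to the minimum role that grants them.
REQUIRED_ROLE = {
    "incidents:read": Role.VIEWER,
    "actions:execute": Role.OPERATOR,
    "workspaces:manage": Role.ADMIN,
}


def has_permission(role: Role, permission: str) -> bool:
    """A role grants a permission if its level meets the required minimum."""
    required = REQUIRED_ROLE.get(permission)
    if required is None:
        return False  # unknown permissions are denied
    return role >= required
```

Leaving gaps between the levels (10, 20, 30) allows intermediate roles to be added later without renumbering.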

Namespace Isolation¶

Workspaces define allowed_namespaces lists. When a workspace is resolved, Kubernetes actions are restricted to those namespaces. An empty list means all namespaces are allowed.

Rate Limiting¶

Per-IP rate limiting configured via AI_SRE_RATE_LIMIT (requests per minute). Set to 0 to disable.
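
One possible way to implement a per-IP, requests-per-minute limit is a sliding-window counter, sketched below. The document does not say which algorithm `rate_limit.py` uses, so the class and its behavior are assumptions; only the "0 disables" rule is taken from the text above.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allows at most `limit_per_minute` requests per client IP in any 60-second window."""

    def __init__(self, limit_per_minute: int, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock  # injectable so tests can control time
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        if self.limit <= 0:
            return True  # 0 disables limiting
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have left the 60-second window.
        while hits and now - hits[0] >= 60.0:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Each client IP keeps its own queue of timestamps, so one noisy sender does not consume another client's allowance.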

Action Safety Guardrails¶

Every action passes through the policy engine which enforces:

Dry-run default: Actions default to dry-run unless explicitly overridden
Approval gates: Live actions require explicit approval when APPROVAL_REQUIRED=true
Namespace allowlists: Actions blocked outside allowed namespaces
Replica limits: Scale actions capped at MAX_SCALE_REPLICAS
Autonomous action allowlist: Only actions in AUTONOMOUS_ACTIONS can be auto-executed
Blast radius metadata: Each action carries blast radius classification for operator awareness
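
The guardrails above can be read as an ordered series of checks. The following sketch is not `evaluate_action_policy()` itself: the `PolicySettings` fields, the check order, the result keys, and the choice to reject (rather than clamp) over-limit scale requests are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PolicySettings:
    # Field names mirror the settings named above; defaults are illustrative.
    approval_required: bool = True
    max_scale_replicas: int = 10
    allowed_namespaces: list[str] = field(default_factory=list)  # empty = all allowed
    autonomous_actions: set[str] = field(default_factory=set)


def evaluate(action: str, namespace: str, settings: PolicySettings, *,
             dry_run: bool = True, approved: bool = False,
             autonomous: bool = False, replicas: Optional[int] = None) -> dict:
    # Namespace allowlist: block anything outside it.
    if settings.allowed_namespaces and namespace not in settings.allowed_namespaces:
        return {"allowed": False, "reason": "namespace not in allowlist"}
    # Replica limit: this sketch rejects requests above the cap.
    if replicas is not None and replicas > settings.max_scale_replicas:
        return {"allowed": False, "reason": "replica limit exceeded"}
    # Dry-run default: nothing is changed, so no approval is needed.
    if dry_run:
        return {"allowed": True, "mode": "dry_run"}
    # Autonomous execution: only allowlisted actions may run unattended.
    if autonomous:
        if action in settings.autonomous_actions:
            return {"allowed": True, "mode": "auto"}
        return {"allowed": False, "reason": "action not in autonomous allowlist"}
    # Approval gate for operator-triggered live actions.
    if settings.approval_required and not approved:
        return {"allowed": False, "reason": "approval required"}
    return {"allowed": True, "mode": "live"}
```

Running the namespace and replica checks before the dry-run shortcut means even a dry run reports a policy violation, which gives operators earlier feedback.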

CORS¶

Configurable via AI_SRE_CORS_ORIGINS. Defaults to * for development.

Request Lifecycle¶

A typical alert-to-resolution flow:

External monitoring sends POST to /webhook with X-Source header
Security middleware validates API key and rate limit
Alert payload is normalized to Incident model via the alert source registry
Dedup check: cache fast-path, then strategy-based (exact/fuzzy/window)
Incident persisted to database with timeline events
Alert grouper checks for related open incidents
Operator (or autonomy loop) requests diagnosis via /incidents/{id}/diagnosis
Orchestration layer assembles context: K8s state, logs, playbooks, similar incidents, deploy correlation
Reasoning agent runs LLM with tools to produce diagnosis
Diagnosis suggests actions, each wrapped with policy engine metadata
Operator approves and executes action via /actions/execute
Action executes through the action adapter registry (K8s, ECS)
Result recorded in action log and timeline
Notifications dispatched per routing rules
Postmortem generated when incident resolves
