Module Catalog
AI-SRE is composed of 54 modules under src/, organized by functional area. This page documents every module, its purpose, and whether it is part of the core platform or an optional extension.
Module Overview
```mermaid
graph TB
    subgraph Core["Core Modules"]
        ING[ingestion]
        REASON[reasoning]
        ACT[actions]
        ORCH[orchestration]
        POL[policies]
        STORE[storage]
        SEC[security]
        RT[routers]
        SCH[schemas]
        RUN[runtime]
        MOD[models]
    end
    subgraph Intelligence["Intelligence Modules"]
        KNOW[knowledge]
        INTEL[intelligence]
        FORE[forecasting]
        CORR[correlation]
        ANOM[anomaly]
        ANA[analytics]
        RCA[rca]
    end
    subgraph Automation["Automation Modules"]
        AUTO[autonomy]
        RUNB[runbooks]
        WORK[workflows]
        PLAY[playbooks]
        CHAOS[chaos]
        SCALE[scaling]
    end
    subgraph Operations["Operations Modules"]
        NOTIF[notifications]
        ONCALL[oncall]
        POST[postmortem]
        DIG[digest]
        SLO[slo]
        SUPP[suppression]
        COMP[compliance]
    end
    subgraph Integration["Integration Modules"]
        ADAPT[adapters]
        INTEG[integrations]
        SLACK[slack_bot]
        BOT[bot]
        WH[webhooks]
        MCP[mcp]
    end
    subgraph Infrastructure["Infrastructure Modules"]
        MET[metrics]
        OBS[observability]
        REAL[realtime]
        TOPO[topology]
        CLUST[clusters]
        TERRA[terraform]
        MEM[memory]
    end
    subgraph Interface["Interface Modules"]
        UI[ui]
        CLI[cli]
        DEMO[demo]
        READ[readiness]
        REP[reports]
        EXP[export]
        OP[operator]
        CONF[config]
    end
```
Core Modules
These modules are required for the platform to function. They handle alert ingestion, AI diagnosis, action execution, and data persistence.
| Module | Path | Description |
|---|---|---|
| ingestion | `src/ingestion/` | Webhook server, alert normalization, deduplication (exact/fuzzy/window), alert grouping by service and severity, retry with dead-letter queue |
| reasoning | `src/reasoning/` | LLM-powered diagnosis engine using a LangGraph agent with tool use. Fetches logs, checks Kubernetes state, and lists pods during reasoning |
| actions | `src/actions/` | Action catalog (`restart_pod`, `scale_deployment`, `suggest_config_fix`), Kubernetes executor, GitHub PR generator for config fixes |
| orchestration | `src/orchestration/` | Diagnosis pipeline coordination. Assembles context from K8s, logs, playbooks, similar incidents, and deploy correlation before invoking the reasoning agent |
| policies | `src/policies/` | Centralized policy engine evaluating every action against namespace restrictions, replica limits, approval requirements, dry-run defaults, and autonomy settings |
| storage | `src/storage/` | SQLAlchemy async ORM models, repository pattern for incidents, action logs, timeline events, metric snapshots. Supports SQLite and PostgreSQL |
| security | `src/security/` | API key authentication, per-IP rate limiting, CORS configuration |
| routers | `src/routers/` | FastAPI router modules for automation, intelligence, on-call, SLO, workflows, notifications, and system endpoints |
| schemas | `src/schemas/` | Pydantic response models for API serialization |
| runtime | `src/runtime.py` | Runtime settings loader. Reads environment variables and exposes typed configuration to all modules |
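
The policy checks listed for `src/policies/` can be pictured as one function that accepts or rejects a proposed action. The sketch below is illustrative only: `Action`, `Policy`, `Decision`, and `evaluate` are hypothetical names, the default values are made up, and the real engine's interfaces are not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str                      # e.g. "restart_pod" or "scale_deployment"
    namespace: str
    replicas: int | None = None    # only set for scaling actions


@dataclass(frozen=True)
class Policy:
    # Hypothetical stand-ins for the checks named in the table above.
    blocked_namespaces: frozenset[str] = frozenset({"kube-system"})
    max_replicas: int = 10
    approval_required: frozenset[str] = frozenset({"scale_deployment"})
    dry_run_default: bool = True


@dataclass(frozen=True)
class Decision:
    allowed: bool
    needs_approval: bool
    dry_run: bool
    reason: str


def evaluate(action: Action, policy: Policy) -> Decision:
    """Check one proposed action against the policy and explain the outcome."""
    if action.namespace in policy.blocked_namespaces:
        return Decision(False, False, policy.dry_run_default, "namespace is restricted")
    if action.replicas is not None and action.replicas > policy.max_replicas:
        return Decision(False, False, policy.dry_run_default, "replica limit exceeded")
    needs_approval = action.name in policy.approval_required
    return Decision(True, needs_approval, policy.dry_run_default, "ok")
```

Returning a reason string alongside the verdict keeps rejected actions explainable in an audit log, which seems in keeping with a centralized engine, though that is a design assumption rather than a documented behavior.
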
Intelligence Modules
These modules provide pattern recognition, forecasting, and root cause analysis capabilities.
| Module | Path | Description | Status |
|---|---|---|---|
| knowledge | `src/knowledge/` | Similar incident search using title and service similarity scoring against historical incidents | Core |
| intelligence | `src/intelligence/` | Incident pattern clustering by keywords, per-service failure profiles with repeat rates and risk assessments | Core |
| forecasting | `src/forecasting/` | Capacity saturation forecasting from incident rate trends and metric history. Operational drift detection (MTTR degradation, rate spikes) | Core |
| correlation | `src/correlation/` | Deploy event ingestion and correlation. Links CI/CD deploys to subsequent incidents by timestamp and service proximity | Core |
| anomaly | `src/anomaly/` | Anomaly detection for metrics and incident patterns | Optional |
| analytics | `src/analytics/` | Advanced analytics and reporting on incident trends and team performance | Optional |
| rca | `src/rca/` | Advanced root cause analysis beyond the core reasoning engine | Optional |
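
To make the deploy correlation idea concrete, here is a minimal sketch that links an incident to recent deploys of the same service. The `Deploy` and `Incident` types, the `correlate` function, and the 30-minute window are assumptions for illustration. The sketch also simplifies "service proximity" to an exact service-name match, which the actual `src/correlation/` module may not do.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    service: str
    at: datetime


@dataclass(frozen=True)
class Incident:
    service: str
    started_at: datetime


def correlate(
    incident: Incident,
    deploys: list[Deploy],
    window: timedelta = timedelta(minutes=30),  # hypothetical default
) -> list[Deploy]:
    """Return same-service deploys that landed within `window` before the incident.

    Results are ordered most recent first, since the latest deploy is usually
    the first suspect.
    """
    candidates = [
        d
        for d in deploys
        if d.service == incident.service
        and timedelta(0) <= incident.started_at - d.at <= window
    ]
    return sorted(candidates, key=lambda d: d.at, reverse=True)
```

Deploys that happen after the incident started are excluded by the lower bound, so a rollback shipped in response to the incident is never reported as its cause.
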
Automation Modules
These modules handle autonomous remediation, runbooks, workflows, and proactive operations.
| Module | Path | Description | Status |
|---|---|---|---|
| autonomy | `src/autonomy/` | Autonomous remediation loop. Scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Background worker with configurable poll interval | Core |
| runbooks | `src/runbooks/` | Runbook schema, pattern-matching trigger system, step-by-step execution with approval gates, LLM-powered runbook generation from incident history | Core |
| workflows | `src/workflows/` | YAML-defined multi-step workflow engine with conditional steps, timeouts, rollback actions, and inter-step dependencies. Ships with OOM recovery and crashloop triage workflows | Core |
| playbooks | `src/playbooks/` | Built-in playbook catalog matched to incidents by pattern (OOMKilled, CrashLoopBackOff, high CPU, etc.) | Core |
| chaos | `src/chaos/` | Chaos engineering integration for controlled failure injection and resilience testing | Optional |
| scaling | `src/scaling/` | Advanced auto-scaling logic beyond basic replica count adjustments | Optional |
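
Inter-step dependencies in a workflow imply that steps must be ordered before they run. The sketch below shows one way to derive that order with the standard library's `graphlib` (Python 3.9+). The step names and the `execution_order` helper are hypothetical, and the workflow is a plain dict rather than the YAML format the engine reads.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(steps: dict[str, list[str]]) -> list[str]:
    """Order steps so that each one runs after the steps it depends on.

    `steps` maps a step name to the names of the steps it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"workflow has a dependency cycle: {exc.args[1]}") from exc


# Made-up steps loosely modeled on an OOM recovery workflow.
oom_recovery = {
    "collect_logs": [],
    "check_memory_limits": ["collect_logs"],
    "raise_memory_limit": ["check_memory_limits"],
    "verify_pod_healthy": ["raise_memory_limit"],
}

order = execution_order(oom_recovery)
```

Rejecting cycles up front means a malformed workflow fails at load time instead of stalling halfway through a remediation.
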
Operations Modules
These modules support day-to-day SRE operations: notifications, on-call management, SLOs, and compliance.
| Module | Path | Description | Status |
|---|---|---|---|
| notifications | `src/notifications/` | Severity-based notification routing with per-service overrides. Dispatches to Slack, PagerDuty, Teams, and email | Core |
| oncall | `src/oncall/` | On-call schedule management with YAML-defined weekly rotations and escalation contacts | Core |
| postmortem | `src/postmortem/` | AI-generated postmortems from incident timelines, action logs, and diagnosis results | Core |
| digest | `src/digest/` | Shift handoff digests summarizing open/resolved incidents and actions for a configurable lookback window | Core |
| slo | `src/slo/` | Service Level Objective engine. YAML-configured SLOs with error budget calculation, burn rate tracking, time-to-exhaustion, and alert level classification | Core |
| suppression | `src/suppression/` | Alert suppression rules for maintenance windows and known-noisy alerts | Optional |
| compliance | `src/compliance/` | Compliance and audit reporting for incident response processes | Optional |
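
The SLO terms in the table follow common SRE arithmetic, sketched below. The function names and the 30-day window are assumptions, and the `src/slo/` engine may define or compute these quantities differently.

```python
def error_budget(target: float) -> float:
    """Allowed failure ratio: a 0.999 target leaves a budget of about 0.001."""
    return 1.0 - target


def burn_rate(error_ratio: float, target: float) -> float:
    """How fast the budget is being spent.

    A rate of 1.0 spends the whole budget exactly over the SLO window;
    higher values exhaust it sooner.
    """
    return error_ratio / error_budget(target)


def hours_to_exhaustion(
    remaining_fraction: float,        # share of the budget still unspent, 0.0 to 1.0
    rate: float,                      # current burn rate
    window_hours: float = 30 * 24,    # hypothetical 30-day SLO window
) -> float:
    """Hours until the remaining budget runs out at the current burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is burning, so the budget never runs out
    return remaining_fraction * window_hours / rate
```

For example, a service with a 99.9% target that is currently failing 0.2% of requests burns at roughly twice the sustainable rate. With half the budget left in a 30-day window, `hours_to_exhaustion(0.5, 2.0)` gives 180 hours.
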
Integration Modules
These modules connect AI-SRE with external services and provide bot interaction interfaces.
| Module | Path | Description | Status |
|---|---|---|---|
| adapters | `src/adapters/` | Four pluggable registries: alert sources (PD, OG, DD, GF, NR, AM), log backends (Loki, Elastic, Datadog, Mock), action providers (K8s, ECS), notification channels (Slack, PD, Teams, Email) | Core |
| integrations | `src/integrations/` | PagerDuty bidirectional sync (incident triggered/acknowledged/resolved) | Optional |
| slack_bot | `src/slack_bot/` | Slack Bolt app for interactive incident management via Slack | Optional |
| bot | `src/bot/` | Bot reply builder and animation logic for interactive conversations | Core |
| webhooks | `src/webhooks/` | Extended webhook processing and custom webhook configurations | Optional |
| mcp | `src/mcp/` | Model Context Protocol server for AI tool integration | Optional |
Infrastructure Modules
These modules handle observability, infrastructure state, and system topology.
| Module | Path | Description | Status |
|---|---|---|---|
| metrics | `src/metrics/` | Prometheus metrics endpoint exposing `ai_sre_alerts_ingested_total`, `ai_sre_diagnosis_duration_seconds`, `ai_sre_actions_executed_total`, `ai_sre_active_incidents` | Core |
| observability | `src/observability/` | Extended observability features including distributed tracing and structured logging | Optional |
| realtime | `src/realtime/` | Real-time event streaming and WebSocket support for live dashboards | Optional |
| topology | `src/topology/` | Service topology mapping and dependency graph construction | Optional |
| clusters | `src/clusters/` | Multi-cluster Kubernetes management and cross-cluster incident correlation | Optional |
| terraform | `src/terraform/` | Terraform integration for infrastructure-as-code drift detection and remediation | Optional |
| memory | `src/memory/` | Persistent memory for the reasoning agent across sessions | Optional |
Interface Modules
These modules provide user-facing interfaces, demos, and data export capabilities.
| Module | Path | Description | Status |
|---|---|---|---|
| ui | `src/ui/` | Built-in operator console web UI served at `/console` | Core |
| cli | `src/cli/` | Command-line interface (`aisre` command) for local operations | Optional |
| demo | `src/demo/` | Demo data seeder for evaluation and customer POC demonstrations | Core |
| readiness | `src/readiness/` | Platform overview builder aggregating incident counts, MTTR, actions, and workspace status | Core |
| reports | `src/reports/` | Report generation for incident summaries and trend analysis | Optional |
| export | `src/export/` | Data export utilities for metrics, incidents, and audit trails | Optional |
| operator | `src/operator/` | Kubernetes Operator controller for managing AI-SRE via CRDs (`AISREConfig`, `IncidentPolicy`, `RemediationAction`) | Optional |
| config | `src/config/` | Configuration management utilities and validation | Core |
| tenancy | `src/tenancy/` | Multi-tenant workspace model, YAML-based store, API key to workspace resolution, RBAC | Core |
Module Enable/Disable
Use the AI_SRE_MODULES and AI_SRE_MODULES_DISABLED environment variables to control which modules start with the server.
```bash
# Load only specific modules (core modules are always added automatically)
AI_SRE_MODULES=ingestion,reasoning,actions,autonomy,notifications

# Load everything except the listed modules
AI_SRE_MODULES_DISABLED=chaos,compliance,terraform,topology

# Default: load all modules (explicit wildcard is equivalent)
AI_SRE_MODULES=*
```
Both variables can be combined. AI_SRE_MODULES_DISABLED takes precedence over AI_SRE_MODULES.
Core modules (`health`, `system`, `ui`, `ws`, `incidents`, `webhook_ingest`) are always loaded and cannot be disabled because they are required for the platform to function.
Optional modules that depend on external services (Slack, PagerDuty, etc.) gracefully degrade when their credentials are not configured.
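
The selection rules above can be summarized in a few lines of code. This is a sketch under stated assumptions, not the server's actual loader: `resolve_modules` and `_parse` are hypothetical names, and silently ignoring unknown module names is a guess at behavior this page does not specify.

```python
from __future__ import annotations

import os
from typing import Mapping

# Core module names as listed on this page.
CORE = frozenset({"health", "system", "ui", "ws", "incidents", "webhook_ingest"})


def _parse(value: str | None) -> set[str]:
    """Split a comma-separated variable into a set of trimmed names."""
    return {name.strip() for name in (value or "").split(",") if name.strip()}


def resolve_modules(available: set[str], env: Mapping[str, str] = os.environ) -> set[str]:
    """Apply AI_SRE_MODULES and AI_SRE_MODULES_DISABLED to the available modules."""
    requested = _parse(env.get("AI_SRE_MODULES"))
    disabled = _parse(env.get("AI_SRE_MODULES_DISABLED"))

    # An unset variable or the "*" wildcard selects every available module.
    if not requested or "*" in requested:
        selected = set(available)
    else:
        selected = requested & available  # assumption: unknown names are dropped

    # The disabled list takes precedence, but core modules are always re-added.
    return (selected - disabled) | CORE
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.
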
To inspect which modules are currently loaded at runtime:
Response:
```json
{
  "loaded": ["health", "system", "slo", "..."],
  "skipped": ["chaos", "terraform"],
  "core": ["health", "incidents", "system", "ui", "webhook_ingest", "ws"],
  "total_available": 42
}
```
Adding a Custom Module
AI-SRE is designed for extensibility. To add a custom module:
- Create a new directory under `src/` (e.g., `src/my_module/`)
- Add an `__init__.py` with your module's public API
- If the module needs API endpoints, create a FastAPI router and register it in `src/ingestion/server.py`
- If the module provides an adapter, register it with the appropriate registry (`alert_registry`, `log_registry`, `action_registry`, or `notification_registry`)
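
The adapter registration step can be pictured with a small name-to-factory registry. Everything in this sketch is hypothetical: the `Registry` class, its `register` and `create` methods, and `MyAlertSource` only illustrate the pattern. The real `alert_registry` interface is not documented on this page, so check the code in `src/adapters/` before relying on any of these names.

```python
from __future__ import annotations

from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Hypothetical name-to-factory registry; not the real AI-SRE class."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], T]] = {}

    def register(self, name: str) -> Callable[[Callable[[], T]], Callable[[], T]]:
        """Return a decorator that records a factory under `name`."""

        def decorator(factory: Callable[[], T]) -> Callable[[], T]:
            if name in self._factories:
                raise ValueError(f"adapter {name!r} is already registered")
            self._factories[name] = factory
            return factory

        return decorator

    def create(self, name: str) -> T:
        """Build a new adapter instance; raises KeyError for unknown names."""
        return self._factories[name]()


# Stand-in for the real alert_registry.
alert_registry: Registry[object] = Registry()


@alert_registry.register("my_source")
class MyAlertSource:
    def normalize(self, payload: dict) -> dict:
        # Map a vendor payload onto a minimal, made-up alert shape.
        return {
            "title": payload.get("summary", ""),
            "service": payload.get("svc", "unknown"),
        }
```

With this shape, `alert_registry.create("my_source")` returns a fresh `MyAlertSource`, and a duplicate registration fails loudly instead of silently replacing an existing adapter.
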