Module Catalog
AI-SRE is composed of 54 modules under src/, organized by functional area. This page documents every module, its purpose, and whether it is part of the core platform or an optional extension.
Module Overview
graph TB
subgraph Core["Core Modules"]
ING[ingestion]
REASON[reasoning]
ACT[actions]
ORCH[orchestration]
POL[policies]
STORE[storage]
SEC[security]
RT[routers]
SCH[schemas]
RUN[runtime]
MOD[models]
end
subgraph Intelligence["Intelligence Modules"]
KNOW[knowledge]
INTEL[intelligence]
FORE[forecasting]
CORR[correlation]
ANOM[anomaly]
ANA[analytics]
RCA[rca]
end
subgraph Automation["Automation Modules"]
AUTO[autonomy]
RUNB[runbooks]
WORK[workflows]
PLAY[playbooks]
CHAOS[chaos]
SCALE[scaling]
end
subgraph Operations["Operations Modules"]
NOTIF[notifications]
ONCALL[oncall]
POST[postmortem]
DIG[digest]
SLO[slo]
SUPP[suppression]
COMP[compliance]
end
subgraph Integration["Integration Modules"]
ADAPT[adapters]
INTEG[integrations]
SLACK[slack_bot]
BOT[bot]
WH[webhooks]
MCP[mcp]
end
subgraph Infrastructure["Infrastructure Modules"]
MET[metrics]
OBS[observability]
REAL[realtime]
TOPO[topology]
CLUST[clusters]
TERRA[terraform]
MEM[memory]
end
subgraph Interface["Interface Modules"]
UI[ui]
CLI[cli]
DEMO[demo]
READ[readiness]
REP[reports]
EXP[export]
OP[operator]
CONF[config]
end
Core Modules
These modules are required for the platform to function. They handle alert ingestion, AI diagnosis, action execution, and data persistence.
| Module | Path | Description |
|---|---|---|
| ingestion | src/ingestion/ | Webhook server, alert normalization, deduplication (exact/fuzzy/window), alert grouping by service and severity, retry with dead-letter queue |
| reasoning | src/reasoning/ | LLM-powered diagnosis engine using LangGraph agent with tool use. Fetches logs, checks Kubernetes state, lists pods during reasoning |
| actions | src/actions/ | Action catalog (restart_pod, scale_deployment, suggest_config_fix), Kubernetes executor, GitHub PR generator for config fixes |
| orchestration | src/orchestration/ | Diagnosis pipeline coordination. Assembles context from K8s, logs, playbooks, similar incidents, and deploy correlation before invoking the reasoning agent |
| policies | src/policies/ | Centralized policy engine evaluating every action against namespace restrictions, replica limits, approval requirements, dry-run defaults, and autonomy settings |
| storage | src/storage/ | SQLAlchemy async ORM models, repository pattern for incidents, action logs, timeline events, metric snapshots. Supports SQLite and PostgreSQL |
| security | src/security/ | API key authentication, per-IP rate limiting, CORS configuration |
| routers | src/routers/ | FastAPI router modules for automation, intelligence, on-call, SLO, workflows, notifications, and system endpoints |
| schemas | src/schemas/ | Pydantic response models for API serialization |
| runtime | src/runtime.py | Runtime settings loader. Reads environment variables and exposes typed configuration to all modules |
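The checks listed for src/policies/ can be pictured as one function that takes a proposed action and returns a decision. The sketch below is illustrative only: `Action`, `Policy`, `Decision`, and `evaluate` are hypothetical names rather than the module's actual API, and the default values are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed remediation action (hypothetical shape)."""
    name: str                    # e.g. "restart_pod" or "scale_deployment"
    namespace: str
    replicas: int | None = None  # only meaningful for scaling actions


@dataclass(frozen=True)
class Policy:
    """Illustrative policy settings; field names and defaults are assumptions."""
    blocked_namespaces: frozenset[str] = frozenset({"kube-system"})
    max_replicas: int = 10
    approval_required: frozenset[str] = frozenset({"scale_deployment"})
    dry_run_default: bool = True


@dataclass(frozen=True)
class Decision:
    allowed: bool
    needs_approval: bool
    dry_run: bool
    reason: str


def evaluate(action: Action, policy: Policy) -> Decision:
    """Check one action against namespace, replica, approval, and dry-run rules."""
    if action.namespace in policy.blocked_namespaces:
        return Decision(False, False, policy.dry_run_default,
                        f"namespace {action.namespace!r} is restricted")
    if action.replicas is not None and action.replicas > policy.max_replicas:
        return Decision(False, False, policy.dry_run_default,
                        f"replicas {action.replicas} exceeds limit {policy.max_replicas}")
    needs_approval = action.name in policy.approval_required
    return Decision(True, needs_approval, policy.dry_run_default, "ok")
```

Returning a structured decision instead of a bare boolean lets the caller distinguish "blocked", "allowed after approval", and "allowed but dry-run" without re-reading the policy.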
Intelligence Modules
These modules provide pattern recognition, forecasting, and root cause analysis capabilities.
| Module | Path | Description | Status |
|---|---|---|---|
| knowledge | src/knowledge/ | Similar incident search using title and service similarity scoring against historical incidents | Core |
| intelligence | src/intelligence/ | Incident pattern clustering by keywords, per-service failure profiles with repeat rates and risk assessments | Core |
| forecasting | src/forecasting/ | Capacity saturation forecasting from incident rate trends and metric history. Operational drift detection (MTTR degradation, rate spikes) | Core |
| correlation | src/correlation/ | Deploy event ingestion and correlation. Links CI/CD deploys to subsequent incidents by timestamp and service proximity | Core |
| anomaly | src/anomaly/ | Anomaly detection for metrics and incident patterns | Optional |
| analytics | src/analytics/ | Advanced analytics and reporting on incident trends and team performance | Optional |
| rca | src/rca/ | Advanced root cause analysis beyond the core reasoning engine | Optional |
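To make "title and service similarity scoring" concrete, here is a minimal sketch that blends a fuzzy title match with an exact service match. It does not reproduce the scoring used in src/knowledge/: the `Incident` shape, the 0.3 service weight, and the 0.5 cutoff are all assumptions, and `difflib.SequenceMatcher` is simply a convenient standard-library stand-in.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with only the fields used for scoring."""
    id: int
    title: str
    service: str


def similarity(a: Incident, b: Incident, service_weight: float = 0.3) -> float:
    """Blend title similarity (0..1) with an exact service match (0 or 1)."""
    title_score = SequenceMatcher(None, a.title.lower(), b.title.lower()).ratio()
    service_score = 1.0 if a.service == b.service else 0.0
    return (1 - service_weight) * title_score + service_weight * service_score


def find_similar(query: Incident, history: list[Incident], limit: int = 3,
                 min_score: float = 0.5) -> list[tuple[float, Incident]]:
    """Return up to `limit` past incidents scoring at least `min_score`, best first."""
    scored = [(similarity(query, past), past) for past in history if past.id != query.id]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

An incident with an identical title on a different service still ranks, but below one that matches on both, which is usually the behavior wanted when surfacing prior fixes.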
Automation Modules
These modules handle autonomous remediation, runbooks, workflows, and proactive operations.
| Module | Path | Description | Status |
|---|---|---|---|
| autonomy | src/autonomy/ | Autonomous remediation loop. Scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Background worker with configurable poll interval | Core |
| runbooks | src/runbooks/ | Runbook schema, pattern-matching trigger system, step-by-step execution with approval gates, LLM-powered runbook generation from incident history | Core |
| workflows | src/workflows/ | YAML-defined multi-step workflow engine with conditional steps, timeouts, rollback actions, and inter-step dependencies. Ships with OOM recovery and crashloop triage workflows | Core |
| playbooks | src/playbooks/ | Built-in playbook catalog matched to incidents by pattern (OOMKilled, CrashLoopBackOff, high CPU, etc.) | Core |
| chaos | src/chaos/ | Chaos engineering integration for controlled failure injection and resilience testing | Optional |
| scaling | src/scaling/ | Advanced auto-scaling logic beyond basic replica count adjustments | Optional |
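Inter-step dependencies in a workflow amount to a directed graph that has to be ordered before execution. The sketch below shows one way to do that with the standard library's `graphlib`. The workflow dictionary, its `depends_on` key, and the step names are hypothetical; the real engine reads YAML, and its schema is not documented on this page.

```python
from __future__ import annotations

from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow definition; field names are assumptions for illustration.
OOM_RECOVERY = {
    "name": "oom-recovery",
    "steps": [
        {"id": "fetch_logs", "depends_on": []},
        {"id": "check_limits", "depends_on": ["fetch_logs"]},
        {"id": "raise_memory_limit", "depends_on": ["check_limits"]},
        {"id": "restart_pod", "depends_on": ["raise_memory_limit"]},
        {"id": "notify", "depends_on": ["fetch_logs"]},
    ],
}


def execution_order(workflow: dict) -> list[str]:
    """Return step ids in an order that respects every depends_on edge."""
    known = {step["id"] for step in workflow["steps"]}
    graph: dict[str, set[str]] = {}
    for step in workflow["steps"]:
        missing = set(step["depends_on"]) - known
        if missing:
            raise ValueError(
                f"step {step['id']!r} depends on unknown steps: {sorted(missing)}")
        graph[step["id"]] = set(step["depends_on"])
    try:
        # static_order() yields each step only after all of its predecessors.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(
            f"workflow {workflow['name']!r} has a dependency cycle") from exc
```

Validating unknown step references and cycles at load time means a malformed workflow fails before any remediation step has run.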
Operations Modules
These modules support day-to-day SRE operations: notifications, on-call management, SLOs, and compliance.
| Module | Path | Description | Status |
|---|---|---|---|
| notifications | src/notifications/ | Severity-based notification routing with per-service overrides. Dispatches to Slack, PagerDuty, Teams, and email | Core |
| oncall | src/oncall/ | On-call schedule management with YAML-defined weekly rotations, escalation contacts | Core |
| postmortem | src/postmortem/ | AI-generated postmortems from incident timelines, action logs, and diagnosis results | Core |
| digest | src/digest/ | Shift handoff digests summarizing open/resolved incidents and actions for a configurable lookback window | Core |
| slo | src/slo/ | Service Level Objective engine. YAML-configured SLOs with error budget calculation, burn rate tracking, time-to-exhaustion, and alert level classification | Core |
| suppression | src/suppression/ | Alert suppression rules for maintenance windows and known-noisy alerts | Optional |
| compliance | src/compliance/ | Compliance and audit reporting for incident response processes | Optional |
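The four quantities named in the slo row (error budget, burn rate, time-to-exhaustion, alert level) are related by simple arithmetic. The sketch below works through the standard definitions. It is not the code in src/slo/: the function name, its parameters, and the burn-rate thresholds of 6 and 14.4 are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    error_budget: float         # allowed failure fraction, 1 - target
    burn_rate: float            # 1.0 means the budget lasts exactly one window
    budget_remaining: float     # fraction of the budget still unspent, 0..1
    hours_to_exhaustion: float  # float("inf") when nothing is burning
    alert_level: str            # "ok", "warning", or "critical"


def slo_status(target: float, window_hours: float, budget_consumed: float,
               recent_error_rate: float, warn_burn: float = 6.0,
               critical_burn: float = 14.4) -> SloStatus:
    """Derive burn rate, time-to-exhaustion, and an alert level for one SLO."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    error_budget = 1.0 - target
    burn_rate = recent_error_rate / error_budget
    remaining = min(1.0, max(0.0, 1.0 - budget_consumed))
    if burn_rate > 0:
        # At burn rate 1 a full budget lasts one window; scale from there.
        hours_left = remaining * window_hours / burn_rate
    else:
        hours_left = float("inf")
    if burn_rate >= critical_burn:
        level = "critical"
    elif burn_rate >= warn_burn:
        level = "warning"
    else:
        level = "ok"
    return SloStatus(error_budget, burn_rate, remaining, hours_left, level)
```

For example, with a 99.9% target over a 720-hour window, a quarter of the budget already spent, and a recent error rate of 0.2%, the burn rate is about 2 and the remaining budget lasts roughly 270 hours.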
Integration Modules
These modules connect AI-SRE with external services and provide bot interaction interfaces.
| Module | Path | Description | Status |
|---|---|---|---|
| adapters | src/adapters/ | Four pluggable registries: alert sources (PD, OG, DD, GF, NR, AM), log backends (Loki, Elastic, Datadog, Mock), action providers (K8s, ECS), notification channels (Slack, PD, Teams, Email) | Core |
| integrations | src/integrations/ | PagerDuty bidirectional sync (incident triggered/acknowledged/resolved) | Optional |
| slack_bot | src/slack_bot/ | Slack Bolt app for interactive incident management via Slack | Optional |
| bot | src/bot/ | Bot reply builder and animation logic for interactive conversations | Core |
| webhooks | src/webhooks/ | Extended webhook processing and custom webhook configurations | Optional |
| mcp | src/mcp/ | Model Context Protocol server for AI tool integration | Optional |
Infrastructure Modules
These modules handle observability, infrastructure state, and system topology.
| Module | Path | Description | Status |
|---|---|---|---|
| metrics | src/metrics/ | Prometheus metrics endpoint exposing ai_sre_alerts_ingested_total, ai_sre_diagnosis_duration_seconds, ai_sre_actions_executed_total, ai_sre_active_incidents | Core |
| observability | src/observability/ | Extended observability features including distributed tracing and structured logging | Optional |
| realtime | src/realtime/ | Real-time event streaming and WebSocket support for live dashboards | Optional |
| topology | src/topology/ | Service topology mapping and dependency graph construction | Optional |
| clusters | src/clusters/ | Multi-cluster Kubernetes management and cross-cluster incident correlation | Optional |
| terraform | src/terraform/ | Terraform integration for infrastructure-as-code drift detection and remediation | Optional |
| memory | src/memory/ | Persistent memory for the reasoning agent across sessions | Optional |
Interface Modules
These modules provide user-facing interfaces, demos, and data export capabilities.
| Module | Path | Description | Status |
|---|---|---|---|
| ui | src/ui/ | Built-in operator console web UI served at /console | Core |
| cli | src/cli/ | Command-line interface (aisre command) for local operations | Optional |
| demo | src/demo/ | Demo data seeder for evaluation and customer POC demonstrations | Core |
| readiness | src/readiness/ | Platform overview builder aggregating incident counts, MTTR, actions, and workspace status | Core |
| reports | src/reports/ | Report generation for incident summaries and trend analysis | Optional |
| export | src/export/ | Data export utilities for metrics, incidents, and audit trails | Optional |
| operator | src/operator/ | Kubernetes Operator controller for managing AI-SRE via CRDs (AISREConfig, IncidentPolicy, RemediationAction) | Optional |
| config | src/config/ | Configuration management utilities and validation | Core |
| tenancy | src/tenancy/ | Multi-tenant workspace model, YAML-based store, API key to workspace resolution, RBAC | Core |
Module Enable/Disable
Coming soon
Per-module enable/disable configuration is planned for a future release. Currently, all installed modules are active. Optional modules that depend on external services (Slack credentials, PagerDuty tokens, etc.) degrade gracefully when their configuration is absent: they return empty responses or skip their functionality.
The planned approach:
# Enable specific modules (all others disabled)
MODULES_ENABLED=ingestion,reasoning,actions,autonomy,notifications
# Disable specific modules (all others enabled)
MODULES_DISABLED=chaos,compliance,terraform,topology
Core modules cannot be disabled as they are required for the platform to function.
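Because this feature is not yet released, the following is only a sketch of how the two planned variables could be resolved into a set of active modules. The function name, the `CORE_MODULES` set (taken from the Core Modules table), and the choice to reject setting both variables at once are all assumptions, not announced behavior.

```python
from __future__ import annotations

from collections.abc import Mapping

# Assumption: "core" here means the modules listed in the Core Modules table.
CORE_MODULES = frozenset({
    "ingestion", "reasoning", "actions", "orchestration", "policies",
    "storage", "security", "routers", "schemas", "runtime",
})


def _parse(value: str | None) -> set[str]:
    """Split a comma-separated list, ignoring blanks and surrounding whitespace."""
    return {name.strip() for name in (value or "").split(",") if name.strip()}


def resolve_active_modules(installed: set[str], env: Mapping[str, str]) -> set[str]:
    """Apply MODULES_ENABLED or MODULES_DISABLED; core modules always stay active."""
    enabled = _parse(env.get("MODULES_ENABLED"))
    disabled = _parse(env.get("MODULES_DISABLED"))
    if enabled and disabled:
        # Assumption: the two variables are mutually exclusive.
        raise ValueError("set only one of MODULES_ENABLED or MODULES_DISABLED")
    if enabled:
        active = installed & enabled   # allow-list: everything else is off
    else:
        active = installed - disabled  # deny-list: everything else stays on
    return active | (CORE_MODULES & installed)
```

In practice `env` would be `os.environ`; passing it in as a mapping keeps the function easy to test.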
Adding a Custom Module
AI-SRE is designed for extensibility. To add a custom module:
- Create a new directory under src/ (e.g., src/my_module/)
- Add an __init__.py with your module's public API
- If the module needs API endpoints, create a FastAPI router and register it in src/ingestion/server.py
- If the module provides an adapter, register it with the appropriate registry (alert_registry, log_registry, action_registry, or notification_registry)
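The adapter step above can be sketched as follows. The interface of the real registries in src/adapters/ is not documented on this page, so the `Registry` class here is a hypothetical stand-in, and `MyLogBackend`, its `fetch` method, and the default URL are made up for the example.

```python
from __future__ import annotations

from typing import Any, Callable


class Registry:
    """Minimal name-to-factory registry (hypothetical stand-in for log_registry etc.)."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._factories: dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that records a class or factory function under `name`."""
        def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._factories:
                raise ValueError(f"{self.kind} {name!r} is already registered")
            self._factories[name] = factory
            return factory
        return decorator

    def create(self, name: str, **kwargs: Any) -> Any:
        """Instantiate the adapter registered under `name`."""
        if name not in self._factories:
            raise LookupError(f"unknown {self.kind}: {name!r}")
        return self._factories[name](**kwargs)


# Stand-in for the registry that a custom module would import from src/adapters/.
log_registry = Registry("log backend")


@log_registry.register("my_backend")
class MyLogBackend:
    """Example adapter living in src/my_module/ (hypothetical)."""

    def __init__(self, base_url: str = "http://localhost:3100") -> None:
        self.base_url = base_url

    def fetch(self, service: str, limit: int = 100) -> list[str]:
        # A real backend would query self.base_url; this returns canned lines.
        return [f"[{service}] example log line {i}" for i in range(min(limit, 3))]
```

A caller would then obtain the adapter by name, for example `log_registry.create("my_backend").fetch("checkout")`, without importing the custom module's class directly.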