Module Catalog

AI-SRE is composed of 54 modules under src/, organized by functional area. This page documents every module, its purpose, and whether it is part of the core platform or an optional extension.

Module Overview

```mermaid
graph TB
    subgraph Core["Core Modules"]
        ING[ingestion]
        REASON[reasoning]
        ACT[actions]
        ORCH[orchestration]
        POL[policies]
        STORE[storage]
        SEC[security]
        RT[routers]
        SCH[schemas]
        RUN[runtime]
        MOD[models]
    end

    subgraph Intelligence["Intelligence Modules"]
        KNOW[knowledge]
        INTEL[intelligence]
        FORE[forecasting]
        CORR[correlation]
        ANOM[anomaly]
        ANA[analytics]
        RCA[rca]
    end

    subgraph Automation["Automation Modules"]
        AUTO[autonomy]
        RUNB[runbooks]
        WORK[workflows]
        PLAY[playbooks]
        CHAOS[chaos]
        SCALE[scaling]
    end

    subgraph Operations["Operations Modules"]
        NOTIF[notifications]
        ONCALL[oncall]
        POST[postmortem]
        DIG[digest]
        SLO[slo]
        SUPP[suppression]
        COMP[compliance]
    end

    subgraph Integration["Integration Modules"]
        ADAPT[adapters]
        INTEG[integrations]
        SLACK[slack_bot]
        BOT[bot]
        WH[webhooks]
        MCP[mcp]
    end

    subgraph Infrastructure["Infrastructure Modules"]
        MET[metrics]
        OBS[observability]
        REAL[realtime]
        TOPO[topology]
        CLUST[clusters]
        TERRA[terraform]
        MEM[memory]
    end

    subgraph Interface["Interface Modules"]
        UI[ui]
        CLI[cli]
        DEMO[demo]
        READ[readiness]
        REP[reports]
        EXP[export]
        OP[operator]
        CONF[config]
    end
```

Core Modules

These modules are required for the platform to function. They handle alert ingestion, AI diagnosis, action execution, and data persistence.

Module	Path	Description
ingestion	`src/ingestion/`	Webhook server, alert normalization, deduplication (exact/fuzzy/window), alert grouping by service and severity, retry with dead-letter queue
reasoning	`src/reasoning/`	LLM-powered diagnosis engine built on a LangGraph agent with tool use. During reasoning, the agent fetches logs, checks Kubernetes state, and lists pods
actions	`src/actions/`	Action catalog (`restart_pod`, `scale_deployment`, `suggest_config_fix`), Kubernetes executor, GitHub PR generator for config fixes
orchestration	`src/orchestration/`	Diagnosis pipeline coordination. Assembles context from K8s, logs, playbooks, similar incidents, and deploy correlation before invoking the reasoning agent
policies	`src/policies/`	Centralized policy engine evaluating every action against namespace restrictions, replica limits, approval requirements, dry-run defaults, and autonomy settings
storage	`src/storage/`	SQLAlchemy async ORM models and a repository pattern for incidents, action logs, timeline events, and metric snapshots. Supports SQLite and PostgreSQL
security	`src/security/`	API key authentication, per-IP rate limiting, CORS configuration
routers	`src/routers/`	FastAPI router modules for automation, intelligence, on-call, SLO, workflows, notifications, and system endpoints
schemas	`src/schemas/`	Pydantic response models for API serialization
runtime	`src/runtime.py`	Runtime settings loader. Reads environment variables and exposes typed configuration to all modules
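
The ingestion row lists exact, fuzzy, and window deduplication. As a rough illustration of how exact matching can be combined with a time window, the sketch below fingerprints each alert and treats a repeat inside the window as a duplicate. The `Alert` fields, `fingerprint`, and `WindowDeduplicator` are hypothetical names chosen for this example and are not taken from `src/ingestion/`; fuzzy matching is not covered.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    # Hypothetical normalized alert shape, for illustration only.
    service: str
    title: str
    severity: str
    received_at: datetime


def fingerprint(alert: Alert) -> str:
    """Exact-match key: same service, normalized title, and severity hash alike."""
    raw = f"{alert.service}|{alert.title.strip().lower()}|{alert.severity}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class WindowDeduplicator:
    """Flags an alert as a duplicate if its fingerprint was seen within `window`."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._last_seen: dict[str, datetime] = {}

    def is_duplicate(self, alert: Alert) -> bool:
        key = fingerprint(alert)
        previous = self._last_seen.get(key)
        # Sliding window: every occurrence refreshes the stored timestamp.
        self._last_seen[key] = alert.received_at
        return previous is not None and alert.received_at - previous <= self.window
```

With a five-minute window, a second identical alert two minutes later would be flagged, while one arriving twenty minutes later would be treated as new.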

Intelligence Modules

These modules provide pattern recognition, forecasting, and root cause analysis capabilities.

Module	Path	Description	Status
knowledge	`src/knowledge/`	Similar incident search using title and service similarity scoring against historical incidents	Core
intelligence	`src/intelligence/`	Incident pattern clustering by keywords, per-service failure profiles with repeat rates and risk assessments	Core
forecasting	`src/forecasting/`	Capacity saturation forecasting from incident rate trends and metric history. Operational drift detection (MTTR degradation, rate spikes)	Core
correlation	`src/correlation/`	Deploy event ingestion and correlation. Links CI/CD deploys to subsequent incidents by timestamp and service proximity	Core
anomaly	`src/anomaly/`	Anomaly detection for metrics and incident patterns	Optional
analytics	`src/analytics/`	Advanced analytics and reporting on incident trends and team performance	Optional
rca	`src/rca/`	Advanced root cause analysis beyond the core reasoning engine	Optional
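
The correlation row describes linking deploys to subsequent incidents by timestamp and service. A minimal sketch of that idea follows, assuming a fixed lookback and an exact service-name match as a simplification of "service proximity". `Deploy`, `Incident`, `correlate_deploy`, and the one-hour default are hypothetical and do not reflect the actual `src/correlation/` interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    service: str
    version: str
    deployed_at: datetime


@dataclass(frozen=True)
class Incident:
    service: str
    opened_at: datetime


def correlate_deploy(
    incident: Incident,
    deploys: Iterable[Deploy],
    lookback: timedelta = timedelta(hours=1),  # illustrative default
) -> Optional[Deploy]:
    """Return the latest same-service deploy that precedes the incident within `lookback`."""
    candidates = [
        d
        for d in deploys
        if d.service == incident.service
        and timedelta(0) <= incident.opened_at - d.deployed_at <= lookback
    ]
    # Deploys after the incident or outside the lookback are never candidates.
    return max(candidates, key=lambda d: d.deployed_at, default=None)
```

Picking the most recent qualifying deploy is one plausible heuristic; a real implementation might instead return every candidate with a score.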

Automation Modules

These modules handle autonomous remediation, runbooks, workflows, and proactive operations.

Module	Path	Description	Status
autonomy	`src/autonomy/`	Autonomous remediation loop. Scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Background worker with configurable poll interval	Core
runbooks	`src/runbooks/`	Runbook schema, pattern-matching trigger system, step-by-step execution with approval gates, LLM-powered runbook generation from incident history	Core
workflows	`src/workflows/`	YAML-defined multi-step workflow engine with conditional steps, timeouts, rollback actions, and inter-step dependencies. Ships with OOM recovery and crashloop triage workflows	Core
playbooks	`src/playbooks/`	Built-in playbook catalog matched to incidents by pattern (OOMKilled, CrashLoopBackOff, high CPU, etc.)	Core
chaos	`src/chaos/`	Chaos engineering integration for controlled failure injection and resilience testing	Optional
scaling	`src/scaling/`	Advanced auto-scaling logic beyond basic replica count adjustments	Optional
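
The playbooks row says playbooks are matched to incidents by pattern. One way this could look is sketched below, assuming each playbook carries a regular expression that is searched for in the incident text. The `Playbook` class, the catalog entries, and `match_playbooks` are hypothetical and are not the contents of `src/playbooks/`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    name: str
    pattern: str  # regular expression searched for in the incident text


# Hypothetical catalog entries, loosely based on the patterns named above.
PLAYBOOKS = [
    Playbook("oom-recovery", r"oomkilled|out of memory"),
    Playbook("crashloop-triage", r"crashloopbackoff"),
    Playbook("high-cpu", r"high cpu|cpu throttl"),
]


def match_playbooks(incident_text: str, catalog: list[Playbook] = PLAYBOOKS) -> list[str]:
    """Return the names of playbooks whose pattern occurs in the text, in catalog order."""
    return [
        p.name
        for p in catalog
        if re.search(p.pattern, incident_text, flags=re.IGNORECASE)
    ]
```

An incident can match more than one playbook; the caller decides whether to use the first match or present all of them.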

Operations Modules

These modules support day-to-day SRE operations: notifications, on-call management, SLOs, and compliance.

Module	Path	Description	Status
notifications	`src/notifications/`	Severity-based notification routing with per-service overrides. Dispatches to Slack, PagerDuty, Teams, and email	Core
oncall	`src/oncall/`	On-call schedule management with YAML-defined weekly rotations and escalation contacts	Core
postmortem	`src/postmortem/`	AI-generated postmortems from incident timelines, action logs, and diagnosis results	Core
digest	`src/digest/`	Shift handoff digests summarizing open/resolved incidents and actions for a configurable lookback window	Core
slo	`src/slo/`	Service Level Objective engine. YAML-configured SLOs with error budget calculation, burn rate tracking, time-to-exhaustion, and alert level classification	Core
suppression	`src/suppression/`	Alert suppression rules for maintenance windows and known-noisy alerts	Optional
compliance	`src/compliance/`	Compliance and audit reporting for incident response processes	Optional
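
The slo row mentions error budget calculation, burn rate, time-to-exhaustion, and alert levels. The sketch below works through that arithmetic using the common definitions (error budget = 1 − target; burn rate = observed error ratio ÷ error budget). The function name, its parameters, and the alert thresholds are assumptions made for illustration and may differ from what `src/slo/` actually does.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOStatus:
    error_budget: float         # allowed error ratio, 1 - target
    burn_rate: float            # observed error ratio / error budget
    hours_to_exhaustion: float  # math.inf when nothing is being burned
    alert_level: str


def evaluate_slo(
    target: float,
    window_hours: float,
    observed_error_ratio: float,
    budget_consumed: float,
) -> SLOStatus:
    """Compute SLO figures; `budget_consumed` is the fraction (0..1) already spent."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    error_budget = 1.0 - target
    burn_rate = observed_error_ratio / error_budget
    remaining = max(0.0, 1.0 - budget_consumed)
    # At burn rate b the full budget lasts window_hours / b, so the rest lasts:
    hours_left = math.inf if burn_rate == 0 else remaining * window_hours / burn_rate
    # Illustrative thresholds only; real values would come from configuration.
    if burn_rate >= 10:
        level = "critical"
    elif burn_rate >= 2:
        level = "warning"
    else:
        level = "ok"
    return SLOStatus(error_budget, burn_rate, hours_left, level)
```

For a 99% target over a 720-hour window, a 5% observed error ratio corresponds to a burn rate of about 5, so with half the budget left it would run out in roughly 72 hours.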

Integration Modules

These modules connect AI-SRE with external services and provide bot interaction interfaces.

Module	Path	Description	Status
adapters	`src/adapters/`	Four pluggable registries: alert sources (PagerDuty, Opsgenie, Datadog, Grafana, New Relic, Alertmanager), log backends (Loki, Elastic, Datadog, Mock), action providers (K8s, ECS), notification channels (Slack, PagerDuty, Teams, Email)	Core
integrations	`src/integrations/`	PagerDuty bidirectional sync (incident triggered/acknowledged/resolved)	Optional
slack_bot	`src/slack_bot/`	Slack Bolt app for interactive incident management via Slack	Optional
bot	`src/bot/`	Bot reply builder and animation logic for interactive conversations	Core
webhooks	`src/webhooks/`	Extended webhook processing and custom webhook configurations	Optional
mcp	`src/mcp/`	Model Context Protocol server for AI tool integration	Optional

Infrastructure Modules

These modules handle observability, infrastructure state, and system topology.

Module	Path	Description	Status
metrics	`src/metrics/`	Prometheus metrics endpoint exposing `ai_sre_alerts_ingested_total`, `ai_sre_diagnosis_duration_seconds`, `ai_sre_actions_executed_total`, `ai_sre_active_incidents`	Core
observability	`src/observability/`	Extended observability features including distributed tracing and structured logging	Optional
realtime	`src/realtime/`	Real-time event streaming and WebSocket support for live dashboards	Optional
topology	`src/topology/`	Service topology mapping and dependency graph construction	Optional
clusters	`src/clusters/`	Multi-cluster Kubernetes management and cross-cluster incident correlation	Optional
terraform	`src/terraform/`	Terraform integration for infrastructure-as-code drift detection and remediation	Optional
memory	`src/memory/`	Persistent memory for the reasoning agent across sessions	Optional

Interface Modules

These modules provide user-facing interfaces, demos, and data export capabilities.

Module	Path	Description	Status
ui	`src/ui/`	Built-in operator console web UI served at `/console`	Core
cli	`src/cli/`	Command-line interface (`aisre` command) for local operations	Optional
demo	`src/demo/`	Demo data seeder for evaluation and customer POC demonstrations	Core
readiness	`src/readiness/`	Platform overview builder aggregating incident counts, MTTR, actions, and workspace status	Core
reports	`src/reports/`	Report generation for incident summaries and trend analysis	Optional
export	`src/export/`	Data export utilities for metrics, incidents, and audit trails	Optional
operator	`src/operator/`	Kubernetes Operator controller for managing AI-SRE via CRDs (`AISREConfig`, `IncidentPolicy`, `RemediationAction`)	Optional
config	`src/config/`	Configuration management utilities and validation	Core
tenancy	`src/tenancy/`	Multi-tenant workspace model, YAML-based store, API key to workspace resolution, RBAC	Core

Module Enable/Disable

Use the AI_SRE_MODULES and AI_SRE_MODULES_DISABLED environment variables to control which modules start with the server.

```bash
# Load only specific modules (core modules are always added automatically)
AI_SRE_MODULES=ingestion,reasoning,actions,autonomy,notifications

# Load everything except the listed modules
AI_SRE_MODULES_DISABLED=chaos,compliance,terraform,topology

# Default: load all modules (explicit wildcard is equivalent)
AI_SRE_MODULES=*
```

Both variables can be combined. AI_SRE_MODULES_DISABLED takes precedence over AI_SRE_MODULES.

Core modules (health, system, ui, ws, incidents, webhook_ingest) are always loaded and cannot be disabled because the platform requires them. These loader-level names are not the same as the `src/` directory names cataloged above.

Optional modules that depend on external services (Slack, PagerDuty, etc.) gracefully degrade when their credentials are not configured.
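
The selection rules described in this section (wildcard default, `AI_SRE_MODULES_DISABLED` taking precedence, core modules always loaded) can be sketched as a small pure function. This is an illustration of the stated rules, not the loader's actual code: `resolve_modules` and its parameters are hypothetical, and silently ignoring unknown module names is an assumption.

```python
from typing import Iterable, Optional


def _parse(value: Optional[str]) -> set[str]:
    """Split a comma-separated variable value into a set of trimmed names."""
    return {name.strip() for name in (value or "").split(",") if name.strip()}


def resolve_modules(
    available: Iterable[str],
    core: Iterable[str],
    enabled: Optional[str] = None,   # value of AI_SRE_MODULES
    disabled: Optional[str] = None,  # value of AI_SRE_MODULES_DISABLED
) -> tuple[list[str], list[str]]:
    """Return (loaded, skipped) module names under the rules described above."""
    available_set = set(available)
    requested = _parse(enabled)
    if not requested or "*" in requested:
        selected = set(available_set)           # unset or "*": load everything
    else:
        selected = requested & available_set    # assumption: unknown names ignored
    selected -= _parse(disabled)                # the disabled list wins over enabled
    selected |= set(core)                       # core modules can never be dropped
    return sorted(selected), sorted(available_set - selected)
```

In practice the two string arguments would come from `os.environ.get("AI_SRE_MODULES")` and `os.environ.get("AI_SRE_MODULES_DISABLED")`.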

To inspect which modules are currently loaded at runtime:

```
GET /system/modules
```

Response:

```json
{
  "loaded": ["health", "system", "slo", "..."],
  "skipped": ["chaos", "terraform"],
  "core": ["health", "incidents", "system", "ui", "webhook_ingest", "ws"],
  "total_available": 42
}
```
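
A deployment check might call this endpoint and confirm that modules meant to be off are in fact not loaded. The sketch below uses only the standard library and the response fields shown above. The helper names are hypothetical, and authentication is omitted because the required header is not documented on this page.

```python
import json
from urllib.request import urlopen


def fetch_modules(base_url: str) -> dict:
    """GET <base_url>/system/modules and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/system/modules"
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def optional_loaded(payload: dict) -> list[str]:
    """Loaded modules that are not part of the always-on core set."""
    return sorted(set(payload["loaded"]) - set(payload["core"]))


def unexpectedly_loaded(payload: dict, must_be_off: set[str]) -> list[str]:
    """Modules that should be disabled but appear in the loaded list."""
    return sorted(set(payload["loaded"]) & must_be_off)
```

For example, `unexpectedly_loaded(fetch_modules("http://localhost:8000"), {"chaos"})` would return an empty list when chaos is disabled; the host and port here are placeholders.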

Adding a Custom Module

AI-SRE is designed for extensibility. To add a custom module:

1. Create a new directory under `src/` (e.g., `src/my_module/`).
2. Add an `__init__.py` with your module's public API.
3. If the module needs API endpoints, create a FastAPI router and register it in `src/ingestion/server.py`.
4. If the module provides an adapter, register it with the appropriate registry (`alert_registry`, `log_registry`, `action_registry`, or `notification_registry`).

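```python
# src/my_module/__init__.py
"""My custom AI-SRE module."""

# src/my_module/router.py
from fastapi import APIRouter

# Every route defined on this router is served under /my-module.
router = APIRouter(prefix="/my-module", tags=["my-module"])


@router.get("/status")
async def status() -> dict[str, str]:
    """Report that the module is loaded and serving requests."""
    return {"module": "my_module", "status": "active"}
```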
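
For the adapter case, the registries' real interface is not shown on this page, so the following is only a generic sketch of a name-to-factory registry. `Registry`, its methods, `demo_notification_registry`, and `StdoutChannel` are all hypothetical stand-ins; consult `src/adapters/` for the actual registration calls.

```python
from typing import Callable, Dict, List


class Registry:
    """Minimal name -> factory registry (illustrative, not the AI-SRE API)."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, name: str) -> Callable:
        """Decorator that records a class or factory under `name`."""
        def decorator(factory: Callable[[], object]) -> Callable[[], object]:
            if name in self._factories:
                raise ValueError(f"{self.kind} adapter {name!r} is already registered")
            self._factories[name] = factory
            return factory
        return decorator

    def create(self, name: str) -> object:
        """Instantiate the adapter registered under `name`."""
        if name not in self._factories:
            raise KeyError(f"unknown {self.kind} adapter: {name!r}")
        return self._factories[name]()

    def names(self) -> List[str]:
        return sorted(self._factories)


# Stand-in for the real notification registry.
demo_notification_registry = Registry("notification")


@demo_notification_registry.register("stdout")
class StdoutChannel:
    """Toy notification channel that prints messages instead of sending them."""

    def send(self, message: str) -> str:
        line = f"[notify] {message}"
        print(line)
        return line
```

Registering by decorator keeps the adapter's name next to its definition, and rejecting duplicate names surfaces conflicts at import time rather than at dispatch time.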