Module Catalog

AI-SRE is composed of 54 modules under src/, organized by functional area. This page documents every module, its purpose, and whether it is part of the core platform or an optional extension.


Module Overview

graph TB
    subgraph Core["Core Modules"]
        ING[ingestion]
        REASON[reasoning]
        ACT[actions]
        ORCH[orchestration]
        POL[policies]
        STORE[storage]
        SEC[security]
        RT[routers]
        SCH[schemas]
        RUN[runtime]
        MOD[models]
    end

    subgraph Intelligence["Intelligence Modules"]
        KNOW[knowledge]
        INTEL[intelligence]
        FORE[forecasting]
        CORR[correlation]
        ANOM[anomaly]
        ANA[analytics]
        RCA[rca]
    end

    subgraph Automation["Automation Modules"]
        AUTO[autonomy]
        RUNB[runbooks]
        WORK[workflows]
        PLAY[playbooks]
        CHAOS[chaos]
        SCALE[scaling]
    end

    subgraph Operations["Operations Modules"]
        NOTIF[notifications]
        ONCALL[oncall]
        POST[postmortem]
        DIG[digest]
        SLO[slo]
        SUPP[suppression]
        COMP[compliance]
    end

    subgraph Integration["Integration Modules"]
        ADAPT[adapters]
        INTEG[integrations]
        SLACK[slack_bot]
        BOT[bot]
        WH[webhooks]
        MCP[mcp]
    end

    subgraph Infrastructure["Infrastructure Modules"]
        MET[metrics]
        OBS[observability]
        REAL[realtime]
        TOPO[topology]
        CLUST[clusters]
        TERRA[terraform]
        MEM[memory]
    end

    subgraph Interface["Interface Modules"]
        UI[ui]
        CLI[cli]
        DEMO[demo]
        READ[readiness]
        REP[reports]
        EXP[export]
        OP[operator]
        CONF[config]
    end

Core Modules

These modules are required for the platform to function. They handle alert ingestion, AI diagnosis, action execution, and data persistence.

| Module | Path | Description |
|---|---|---|
| ingestion | src/ingestion/ | Webhook server, alert normalization, deduplication (exact/fuzzy/window), alert grouping by service and severity, retry with dead-letter queue |
| reasoning | src/reasoning/ | LLM-powered diagnosis engine using LangGraph agent with tool use. Fetches logs, checks Kubernetes state, lists pods during reasoning |
| actions | src/actions/ | Action catalog (restart_pod, scale_deployment, suggest_config_fix), Kubernetes executor, GitHub PR generator for config fixes |
| orchestration | src/orchestration/ | Diagnosis pipeline coordination. Assembles context from K8s, logs, playbooks, similar incidents, and deploy correlation before invoking the reasoning agent |
| policies | src/policies/ | Centralized policy engine evaluating every action against namespace restrictions, replica limits, approval requirements, dry-run defaults, and autonomy settings |
| storage | src/storage/ | SQLAlchemy async ORM models, repository pattern for incidents, action logs, timeline events, metric snapshots. Supports SQLite and PostgreSQL |
| security | src/security/ | API key authentication, per-IP rate limiting, CORS configuration |
| routers | src/routers/ | FastAPI router modules for automation, intelligence, on-call, SLO, workflows, notifications, and system endpoints |
| schemas | src/schemas/ | Pydantic response models for API serialization |
| runtime | src/runtime.py | Runtime settings loader. Reads environment variables and exposes typed configuration to all modules |
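
To make the policies row more concrete, here is a minimal sketch of how an action might be checked against namespace restrictions, replica limits, approval requirements, and dry-run defaults. The `Policy` and `Decision` classes, the `evaluate` function, and every default value are hypothetical and chosen for illustration; they are not taken from src/policies/.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    # All field names and defaults are illustrative assumptions.
    blocked_namespaces: set = field(default_factory=lambda: {"kube-system"})
    max_replicas: int = 10
    approval_required: set = field(default_factory=lambda: {"restart_pod"})
    dry_run_default: bool = True


@dataclass
class Decision:
    allowed: bool
    needs_approval: bool
    dry_run: bool
    reason: str


def evaluate(policy, action, namespace, replicas=None, dry_run=None):
    """Check one proposed action against the policy and return a Decision."""
    if namespace in policy.blocked_namespaces:
        return Decision(False, False, True, f"namespace {namespace!r} is restricted")
    if (
        action == "scale_deployment"
        and replicas is not None
        and replicas > policy.max_replicas
    ):
        return Decision(
            False, False, True,
            f"replicas {replicas} exceed limit {policy.max_replicas}",
        )
    # Fall back to the policy's dry-run default unless the caller overrides it.
    effective_dry_run = policy.dry_run_default if dry_run is None else dry_run
    return Decision(True, action in policy.approval_required, effective_dry_run, "ok")
```

In this sketch a denied action always reports `dry_run=True`, so a caller that ignores `allowed` by mistake still cannot trigger a live change.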

Intelligence Modules

These modules provide pattern recognition, forecasting, and root cause analysis capabilities.

| Module | Path | Description | Status |
|---|---|---|---|
| knowledge | src/knowledge/ | Similar incident search using title and service similarity scoring against historical incidents | Core |
| intelligence | src/intelligence/ | Incident pattern clustering by keywords, per-service failure profiles with repeat rates and risk assessments | Core |
| forecasting | src/forecasting/ | Capacity saturation forecasting from incident rate trends and metric history. Operational drift detection (MTTR degradation, rate spikes) | Core |
| correlation | src/correlation/ | Deploy event ingestion and correlation. Links CI/CD deploys to subsequent incidents by timestamp and service proximity | Core |
| anomaly | src/anomaly/ | Anomaly detection for metrics and incident patterns | Optional |
| analytics | src/analytics/ | Advanced analytics and reporting on incident trends and team performance | Optional |
| rca | src/rca/ | Advanced root cause analysis beyond the core reasoning engine | Optional |
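
As an illustration of the title and service similarity scoring described for the knowledge module, the sketch below ranks historical incidents against a new one. It assumes a weighted blend of a `difflib` title ratio and an exact service match; the weights, the threshold, and the function names are assumptions, and the actual scoring in src/knowledge/ may differ.

```python
from difflib import SequenceMatcher


def similarity(title_a, service_a, title_b, service_b, service_weight=0.3):
    """Blend fuzzy title similarity with an exact service match (weights assumed)."""
    title_score = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    service_score = 1.0 if service_a == service_b else 0.0
    return (1 - service_weight) * title_score + service_weight * service_score


def find_similar(incident, history, top_k=3, min_score=0.5):
    """Return up to top_k (score, incident) pairs, best match first."""
    scored = [
        (similarity(incident["title"], incident["service"], h["title"], h["service"]), h)
        for h in history
    ]
    scored = [(score, h) for score, h in scored if score >= min_score]
    # Sort on the score only so that incident dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```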

Automation Modules

These modules handle autonomous remediation, runbooks, workflows, and proactive operations.

| Module | Path | Description | Status |
|---|---|---|---|
| autonomy | src/autonomy/ | Autonomous remediation loop. Scans open incidents, runs diagnosis, selects safe auto-executable actions, and applies them. Background worker with configurable poll interval | Core |
| runbooks | src/runbooks/ | Runbook schema, pattern-matching trigger system, step-by-step execution with approval gates, LLM-powered runbook generation from incident history | Core |
| workflows | src/workflows/ | YAML-defined multi-step workflow engine with conditional steps, timeouts, rollback actions, and inter-step dependencies. Ships with OOM recovery and crashloop triage workflows | Core |
| playbooks | src/playbooks/ | Built-in playbook catalog matched to incidents by pattern (OOMKilled, CrashLoopBackOff, high CPU, etc.) | Core |
| chaos | src/chaos/ | Chaos engineering integration for controlled failure injection and resilience testing | Optional |
| scaling | src/scaling/ | Advanced auto-scaling logic beyond basic replica count adjustments | Optional |
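
The pattern-based playbook matching mentioned above could look roughly like the following. The playbook names and regular expressions are hypothetical; only the incident patterns themselves (OOMKilled, CrashLoopBackOff, high CPU) come from the table.

```python
import re

# Hypothetical catalog: playbook name -> pattern that triggers it.
PLAYBOOKS = {
    "oom_recovery": re.compile(r"OOMKilled|out of memory", re.IGNORECASE),
    "crashloop_triage": re.compile(r"CrashLoopBackOff", re.IGNORECASE),
    "high_cpu": re.compile(r"high cpu|cpu throttl", re.IGNORECASE),
}


def match_playbooks(incident_text):
    """Return the names of all playbooks whose pattern occurs in the text."""
    return [
        name for name, pattern in PLAYBOOKS.items() if pattern.search(incident_text)
    ]
```

Returning every match rather than the first one lets an incident that shows several symptoms surface all relevant playbooks, in catalog order.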

Operations Modules

These modules support day-to-day SRE operations: notifications, on-call management, SLOs, and compliance.

| Module | Path | Description | Status |
|---|---|---|---|
| notifications | src/notifications/ | Severity-based notification routing with per-service overrides. Dispatches to Slack, PagerDuty, Teams, and email | Core |
| oncall | src/oncall/ | On-call schedule management with YAML-defined weekly rotations, escalation contacts | Core |
| postmortem | src/postmortem/ | AI-generated postmortems from incident timelines, action logs, and diagnosis results | Core |
| digest | src/digest/ | Shift handoff digests summarizing open/resolved incidents and actions for a configurable lookback window | Core |
| slo | src/slo/ | Service Level Objective engine. YAML-configured SLOs with error budget calculation, burn rate tracking, time-to-exhaustion, and alert level classification | Core |
| suppression | src/suppression/ | Alert suppression rules for maintenance windows and known-noisy alerts | Optional |
| compliance | src/compliance/ | Compliance and audit reporting for incident response processes | Optional |
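
The SLO quantities named above (error budget, burn rate, time-to-exhaustion, alert level) follow from simple arithmetic. The sketch below uses the conventional definitions, with burn rate as the observed error rate divided by the allowed error rate. The function name, the uniform-traffic assumption, and the alert thresholds are illustrative and not taken from src/slo/.

```python
import math


def error_budget_status(target, total, failed, window_days, elapsed_days):
    """Compute burn rate, remaining budget, days to exhaustion, and alert level."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    budget = 1.0 - target                      # allowed error rate, e.g. 0.001 for 99.9%
    error_rate = failed / total if total else 0.0
    burn_rate = error_rate / budget            # 1.0 means the budget lasts exactly one window
    consumed = burn_rate * elapsed_days / window_days   # assumes uniform traffic
    remaining = max(0.0, 1.0 - consumed)       # fraction of the budget still unspent
    days_left = remaining * window_days / burn_rate if burn_rate > 0 else math.inf
    # Thresholds below are assumptions chosen for the example.
    if burn_rate >= 10:
        level = "critical"
    elif burn_rate >= 1:
        level = "warning"
    else:
        level = "ok"
    return {
        "burn_rate": burn_rate,
        "budget_remaining": remaining,
        "days_to_exhaustion": days_left,
        "alert_level": level,
    }
```

For a 99.9% target with 2,000 failures out of 1,000,000 requests, 10 days into a 30-day window, the burn rate is about 2, roughly one third of the budget remains, and the budget would run out in about 5 days at the current rate.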

Integration Modules

These modules connect AI-SRE with external services and provide bot interaction interfaces.

| Module | Path | Description | Status |
|---|---|---|---|
| adapters | src/adapters/ | Four pluggable registries: alert sources (PD, OG, DD, GF, NR, AM), log backends (Loki, Elastic, Datadog, Mock), action providers (K8s, ECS), notification channels (Slack, PD, Teams, Email) | Core |
| integrations | src/integrations/ | PagerDuty bidirectional sync (incident triggered/acknowledged/resolved) | Optional |
| slack_bot | src/slack_bot/ | Slack Bolt app for interactive incident management via Slack | Optional |
| bot | src/bot/ | Bot reply builder and animation logic for interactive conversations | Core |
| webhooks | src/webhooks/ | Extended webhook processing and custom webhook configurations | Optional |
| mcp | src/mcp/ | Model Context Protocol server for AI tool integration | Optional |

Infrastructure Modules

These modules handle observability, infrastructure state, and system topology.

| Module | Path | Description | Status |
|---|---|---|---|
| metrics | src/metrics/ | Prometheus metrics endpoint exposing ai_sre_alerts_ingested_total, ai_sre_diagnosis_duration_seconds, ai_sre_actions_executed_total, ai_sre_active_incidents | Core |
| observability | src/observability/ | Extended observability features including distributed tracing and structured logging | Optional |
| realtime | src/realtime/ | Real-time event streaming and WebSocket support for live dashboards | Optional |
| topology | src/topology/ | Service topology mapping and dependency graph construction | Optional |
| clusters | src/clusters/ | Multi-cluster Kubernetes management and cross-cluster incident correlation | Optional |
| terraform | src/terraform/ | Terraform integration for infrastructure-as-code drift detection and remediation | Optional |
| memory | src/memory/ | Persistent memory for the reasoning agent across sessions | Optional |

Interface Modules

These modules provide user-facing interfaces, demos, and data export capabilities.

| Module | Path | Description | Status |
|---|---|---|---|
| ui | src/ui/ | Built-in operator console web UI served at /console | Core |
| cli | src/cli/ | Command-line interface (aisre command) for local operations | Optional |
| demo | src/demo/ | Demo data seeder for evaluation and customer POC demonstrations | Core |
| readiness | src/readiness/ | Platform overview builder aggregating incident counts, MTTR, actions, and workspace status | Core |
| reports | src/reports/ | Report generation for incident summaries and trend analysis | Optional |
| export | src/export/ | Data export utilities for metrics, incidents, and audit trails | Optional |
| operator | src/operator/ | Kubernetes Operator controller for managing AI-SRE via CRDs (AISREConfig, IncidentPolicy, RemediationAction) | Optional |
| config | src/config/ | Configuration management utilities and validation | Core |
| tenancy | src/tenancy/ | Multi-tenant workspace model, YAML-based store, API key to workspace resolution, RBAC | Core |

Module Enable/Disable

Coming soon

Per-module enable/disable configuration is planned for a future release. Currently, all installed modules are active. Optional modules that depend on external services (Slack credentials, PagerDuty tokens, etc.) degrade gracefully when their configuration is absent: they return empty responses or skip their functionality.

The planned approach:

# Enable specific modules (all others disabled)
MODULES_ENABLED=ingestion,reasoning,actions,autonomy,notifications

# Disable specific modules (all others enabled)
MODULES_DISABLED=chaos,compliance,terraform,topology

Core modules cannot be disabled as they are required for the platform to function.
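
Because this feature is not yet released, the following is only a sketch of how the two variables might be resolved into an active module set. Only the variable names MODULES_ENABLED and MODULES_DISABLED come from the plan above. The `resolve_active_modules` function, the rule that setting both variables is an error, and the use of the Core Modules table as the core set are assumptions.

```python
import os

# Assumed core set, taken from the Core Modules table on this page.
CORE_MODULES = frozenset({
    "ingestion", "reasoning", "actions", "orchestration", "policies",
    "storage", "security", "routers", "schemas", "runtime",
})


def _parse(value):
    """Split a comma-separated list, ignoring blanks and surrounding whitespace."""
    return {name.strip() for name in value.split(",") if name.strip()}


def resolve_active_modules(installed, env=None):
    """Return the set of modules that should be active for this process."""
    env = os.environ if env is None else env
    enabled = _parse(env.get("MODULES_ENABLED", ""))
    disabled = _parse(env.get("MODULES_DISABLED", ""))
    if enabled and disabled:
        # Assumption: the two variables are mutually exclusive.
        raise ValueError("set MODULES_ENABLED or MODULES_DISABLED, not both")
    installed = set(installed)
    active = installed & enabled if enabled else installed - disabled
    # Core modules stay active regardless of either variable.
    return active | (CORE_MODULES & installed)
```

With neither variable set, the function returns every installed module, which matches the current behavior described above.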


Adding a Custom Module

AI-SRE is designed for extensibility. To add a custom module:

  1. Create a new directory under src/ (e.g., src/my_module/)
  2. Add an __init__.py with your module's public API
  3. If the module needs API endpoints, create a FastAPI router and register it in src/ingestion/server.py
  4. If the module provides an adapter, register it with the appropriate registry (alert_registry, log_registry, action_registry, or notification_registry)

# src/my_module/__init__.py
"""My custom AI-SRE module."""

# src/my_module/router.py
from fastapi import APIRouter

# Every route defined on this router is served under /my-module.
router = APIRouter(prefix="/my-module", tags=["my-module"])

@router.get("/status")
async def status():
    """Report that the module is loaded."""
    return {"module": "my_module", "status": "active"}
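
Step 4 refers to registering an adapter with one of the registries. This page does not show the registry API, so the sketch below defines a stand-in `Registry` class with a decorator-based `register` method to illustrate the general pattern. The class, its methods, and the `StdoutNotifier` adapter are hypothetical; consult src/adapters/ for the real interface.

```python
class Registry:
    """Stand-in for an adapter registry; not the actual AI-SRE class."""

    def __init__(self, kind):
        self.kind = kind
        self._items = {}

    def register(self, name):
        """Decorator that stores an adapter class or factory under a unique name."""
        def decorator(factory):
            if name in self._items:
                raise ValueError(f"{self.kind} adapter {name!r} already registered")
            self._items[name] = factory
            return factory
        return decorator

    def get(self, name):
        try:
            return self._items[name]
        except KeyError:
            raise KeyError(f"no {self.kind} adapter named {name!r}") from None

    def names(self):
        return sorted(self._items)


# Hypothetical stand-in for the notification_registry mentioned in step 4.
notification_registry = Registry("notification")


@notification_registry.register("stdout")
class StdoutNotifier:
    """Toy notification channel that formats a message instead of sending it."""

    def send(self, message):
        return f"[stdout] {message}"
```

A caller would then look up the adapter by name, for example `notification_registry.get("stdout")().send("disk pressure on node-3")`.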