AI-SRE Deployment Guide

This guide covers production deployment of the AI-SRE platform across all supported configurations.

Deployment Options

| Option | Best For | Database | Scaling |
|---|---|---|---|
| Local dev | Development, evaluation | SQLite | Single process |
| Docker Compose | Pilot, small team | SQLite | Single node |
| Helm chart | Production Kubernetes | SQLite or PostgreSQL | Multi-replica with HPA |

Prerequisites

  • Python 3.11+ (local dev) or Docker 24+ (containerized)
  • At least one LLM API key: Anthropic (ANTHROPIC_API_KEY) or OpenAI (OPENAI_API_KEY)
  • Optional: Kubernetes cluster access (for remediation actions)
  • Optional: Slack app credentials (for Slack bot)
  • Optional: PostgreSQL 14+ (for production database)

Local Development

Setup

# Clone repository
git clone <repo-url> && cd AI-SRE

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with dev dependencies
pip install -e ".[dev]"

# Copy and configure environment
cp .env.example .env

Minimal Configuration

Edit .env with at minimum:

# One LLM provider is required
ANTHROPIC_API_KEY=sk-ant-...
# or
OPENAI_API_KEY=sk-...

# Use mock log provider for offline development
LOG_PROVIDER=mock

# SQLite database (created automatically)
DATABASE_URL=sqlite+aiosqlite:///./data/ai_sre.db

Run

# Start the server with mock log provider
make demo

# Or run directly
python -m src.ingestion.server

The server starts on http://localhost:8888 by default. Override with INGESTION_HOST and INGESTION_PORT.

Seed Demo Data

# In another terminal
make seed
# or
curl -s -X POST http://localhost:8888/demo/seed | python -m json.tool

Run Slack Bot (Optional)

Requires Slack app credentials in .env:

python -m src.slack_bot.app

Verify

curl http://localhost:8888/health
# {"status":"ok"}

curl http://localhost:8888/incidents
# {"incidents":[],"count":0}

# Open operator console
open http://localhost:8888/console

Docker Compose

The Docker Compose setup runs two services on a single node: the ingestion server and the Slack bot.

Build and Run

# Copy and configure environment
cp .env.example .env
# Edit .env with API keys and settings

# Start services
docker compose -f deploy/docker-compose.yml up -d

# View logs
docker compose -f deploy/docker-compose.yml logs -f ingestion

# Stop
docker compose -f deploy/docker-compose.yml down

Services

| Service | Port | Command | Health Check |
|---|---|---|---|
| ingestion | 8888 | python -m src.ingestion.server | GET /health every 30s |
| slack_bot | - | python -m src.slack_bot.app | Depends on ingestion health |

Volumes

  • ai_sre_data -- Persistent volume for SQLite database (/app/data)
  • Kubeconfig mounted read-only at /root/.kube/config (if KUBECONFIG is set)

Environment Variables

Docker Compose reads from your .env file. All variables from .env.example are passed through with sensible defaults. Key overrides:

# Change exposed port
INGESTION_PORT=9090

# Switch to PostgreSQL
DATABASE_URL=postgresql+asyncpg://user:pass@postgres:5432/ai_sre

Adding PostgreSQL

To add PostgreSQL to the Docker Compose stack, append a postgres service:

# Add to deploy/docker-compose.yml under services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: ai_sre
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: ai_sre
    volumes:
      - pg_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ai_sre"]
      interval: 10s
      timeout: 5s
      retries: 5

# Update the ingestion service to depend on postgres and use:
# DATABASE_URL=postgresql+asyncpg://ai_sre:changeme@postgres:5432/ai_sre

# Add to volumes:
  pg_data:

Helm Chart (Kubernetes)

The Helm chart deploys AI-SRE into a Kubernetes cluster with full RBAC for cluster actions.

Chart Location

deploy/helm/ai-sre/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    ├── secret.yaml
    ├── serviceaccount.yaml
    ├── clusterrole.yaml
    ├── clusterrolebinding.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    ├── pvc.yaml
    └── NOTES.txt

Quick Install

# Install with minimal config
helm install ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-...

# Install with custom namespace
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre --create-namespace \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-...

# Install with full config
helm install ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-... \
  --set secrets.SLACK_BOT_TOKEN=xoxb-... \
  --set secrets.SLACK_APP_TOKEN=xapp-... \
  --set config.WORKSPACE_ID=my-team \
  --set config.WORKSPACE_NAME="My Team" \
  --set config.AUTONOMY_ENABLED=true \
  --set config.DEFAULT_DRY_RUN=false

Using an Existing Secret

If you manage secrets externally (Vault, Sealed Secrets, External Secrets):

# Create secret first
kubectl create secret generic ai-sre-secrets \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
  --from-literal=SLACK_BOT_TOKEN=xoxb-...

# Reference it in the install
helm install ai-sre deploy/helm/ai-sre \
  --set existingSecret=ai-sre-secrets

Key Values

Replicas and Scaling

replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

Note: When using SQLite (default), only 1 replica is supported because SQLite does not support concurrent writers from multiple processes. Switch to PostgreSQL for multi-replica deployments.

Resources

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

Ingress

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ai-sre.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: ai-sre-tls
      hosts:
        - ai-sre.example.com

Persistence (SQLite)

persistence:
  enabled: true
  storageClass: ""       # Use cluster default
  accessModes:
    - ReadWriteOnce
  size: 5Gi

RBAC for Kubernetes Actions

The chart creates a ClusterRole with permissions for the actions the platform can execute:

rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "pods/log", "services", "events", "namespaces"]
      verbs: ["get", "list", "watch", "delete"]
    - apiGroups: ["apps"]
      resources: ["deployments", "replicasets", "statefulsets"]
      verbs: ["get", "list", "watch", "patch", "update"]
    - apiGroups: ["autoscaling"]
      resources: ["horizontalpodautoscalers"]
      verbs: ["get", "list", "watch", "patch", "update"]

Set rbac.create: false if you manage RBAC externally.

Security Context

podSecurityContext:
  fsGroup: 1000

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: false
  allowPrivilegeEscalation: false

Upgrade

helm upgrade ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-... \
  --reuse-values

Uninstall

helm uninstall ai-sre
# PVC is not deleted by default. To remove data:
kubectl delete pvc -l app.kubernetes.io/name=ai-sre

PostgreSQL Setup

For production deployments, switch from SQLite to PostgreSQL.

Install Driver

pip install -e ".[postgres]"
# This adds: asyncpg>=0.29.0

Configure Connection

# In .env or Helm values
DATABASE_URL=postgresql+asyncpg://user:password@host:5432/ai_sre

Run Migrations

# Apply all migrations
make db-migrate
# or
alembic upgrade head

# Create a new migration after model changes
make db-revision MSG="add new column"

# Rollback one migration
make db-downgrade

Connection Pooling

The platform uses NullPool for PostgreSQL async connections to avoid connection leaks. For high-traffic deployments, place a PgBouncer or pgpool-II proxy in front of PostgreSQL.
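The lifecycle NullPool enforces can be pictured with a small stand-in: every checkout opens a brand-new connection and every release closes it, so an idle worker can never hold a server slot. The sketch below is illustrative only (it uses stdlib sqlite3 as a stand-in; the real engine gets this behavior from SQLAlchemy's pool configuration):

```python
import sqlite3

class IllustrativeNullPool:
    """Stand-in for NullPool semantics: no connections are retained."""

    def __init__(self, dsn: str):
        self.dsn = dsn

    def connect(self) -> sqlite3.Connection:
        # A brand-new connection on every checkout -- nothing is reused.
        return sqlite3.connect(self.dsn)

    def release(self, conn: sqlite3.Connection) -> None:
        # Closing on release is why connections cannot leak across requests.
        conn.close()

pool = IllustrativeNullPool(":memory:")
conn = pool.connect()
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
assert conn.execute("SELECT SUM(x) FROM t").fetchone()[0] == 1
pool.release(conn)
```

The trade-off is connection-setup cost on every request, which is exactly what a PgBouncer or pgpool-II proxy absorbs in high-traffic deployments.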

Backup

# PostgreSQL
pg_dump -h host -U user ai_sre > backup.sql

# SQLite
cp data/ai_sre.db data/ai_sre.db.backup
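A plain cp can capture a mid-write snapshot if the server is running. Python's stdlib exposes SQLite's online backup API, which copies a consistent snapshot without stopping the service; a minimal sketch (the helper name is ours, and the demo writes to a throwaway directory -- point it at data/ai_sre.db in practice):

```python
import os
import sqlite3
import tempfile

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a consistent snapshot using SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    try:
        dest = sqlite3.connect(dest_path)
        try:
            with dest:
                src.backup(dest)  # safe even while other writers are active
        finally:
            dest.close()
    finally:
        src.close()

# Demo against a throwaway database; use data/ai_sre.db in a real deployment.
tmp = tempfile.mkdtemp()
src_db = os.path.join(tmp, "ai_sre.db")
with sqlite3.connect(src_db) as c:
    c.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY)")
    c.execute("INSERT INTO incidents VALUES (1)")
backup_sqlite(src_db, os.path.join(tmp, "ai_sre.db.backup"))
```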

Connecting Alert Sources

Generic Webhook

Point any monitoring tool's webhook to:

POST https://ai-sre.example.com/webhook
Header: X-Source: webhook
Header: X-API-Key: your-key
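The header contract can be exercised end-to-end against a local stub before pointing a real monitoring tool at the platform. This stdlib-only sketch posts a sample alert and shows what the receiver sees; the payload fields are illustrative, not the platform's required schema:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = {}

class StubWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        received["source"] = self.headers.get("X-Source")
        received["api_key"] = self.headers.get("X-API-Key")
        received["payload"] = json.loads(body)
        self.send_response(202)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubWebhook)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Illustrative alert body -- consult the platform's webhook schema for real fields.
alert = {"title": "High error rate", "severity": "critical", "service": "checkout"}
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/webhook",
    data=json.dumps(alert).encode(),
    headers={"Content-Type": "application/json",
             "X-Source": "webhook",
             "X-API-Key": "your-key"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    assert resp.status == 202

server.shutdown()
print(received["source"])  # webhook
```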

PagerDuty

  1. In PagerDuty, create a Generic Webhook (v3) subscription pointing to:

    POST https://ai-sre.example.com/webhook
    Header: X-Source: pagerduty

  2. For bidirectional sync, also configure:

    POST https://ai-sre.example.com/webhook/pagerduty

  3. Set PAGERDUTY_ROUTING_KEY and PAGERDUTY_API_TOKEN for outbound events.

Prometheus/Alertmanager

Configure Alertmanager webhook receiver:

# alertmanager.yml
receivers:
  - name: ai-sre
    webhook_configs:
      - url: https://ai-sre.example.com/webhook
        http_config:
          headers:
            X-Source: alertmanager
            X-API-Key: your-key
        send_resolved: true

Datadog

Configure a Datadog webhook integration pointing to:

POST https://ai-sre.example.com/webhook
Header: X-Source: datadog

Set DATADOG_API_KEY and DATADOG_APP_KEY for log fetching.

Grafana

Configure a Grafana contact point (webhook) pointing to:

POST https://ai-sre.example.com/webhook
Header: X-Source: grafana

CI/CD Deploy Events

Send deploy events from your pipeline:

# GitHub Actions example
curl -X POST https://ai-sre.example.com/webhook/deploy \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $AI_SRE_API_KEY" \
  -d '{
    "service": "'"$SERVICE_NAME"'",
    "namespace": "production",
    "image": "'"$IMAGE_TAG"'",
    "actor": "'"$GITHUB_ACTOR"'",
    "commit_sha": "'"$GITHUB_SHA"'"
  }'

Connecting Log Providers

Grafana Loki

LOG_PROVIDER=loki
LOKI_URL=https://loki.example.com
LOKI_USER=your-user        # Basic auth (optional)
LOKI_PASSWORD=your-password

Elasticsearch

LOG_PROVIDER=elastic
ELASTIC_URL=https://elasticsearch.example.com:9200
ELASTIC_API_KEY=your-api-key
ELASTIC_INDEX=logs-*

Datadog Logs

LOG_PROVIDER=datadog
DATADOG_API_KEY=your-api-key
DATADOG_APP_KEY=your-app-key
DATADOG_SITE=datadoghq.com

Self-Hosted LLM (Ollama)

For air-gapped or on-premise deployments:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://ollama-server:11434
OLLAMA_MODEL=llama3.2

Notification Setup

Slack

  1. Create a Slack app at https://api.slack.com/apps
  2. Enable Socket Mode and add the connections:write scope
  3. Add bot token scopes: chat:write, channels:read, app_mentions:read
  4. Install to workspace and copy tokens:
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
SLACK_SIGNING_SECRET=...

Microsoft Teams

Create an incoming webhook in your Teams channel:

TEAMS_WEBHOOK_URL=https://outlook.office.com/webhook/...

Email (SMTP)

SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=alerts@example.com
SMTP_PASSWORD=...
ALERT_EMAIL_TO=oncall@example.com

Security Hardening

API Keys

Always configure API keys in production:

AI_SRE_API_KEYS=key1-abc,key2-def,key3-ghi
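How a comma-separated key list might be checked is sketched below. The variable name mirrors the config above; the helper itself is illustrative, not the platform's code, and uses a constant-time comparison to avoid timing leaks:

```python
import hmac
import os

def is_authorized(presented_key: str) -> bool:
    """Accept a request if its X-API-Key matches any configured key (illustrative)."""
    configured = os.environ.get("AI_SRE_API_KEYS", "")
    keys = [k.strip() for k in configured.split(",") if k.strip()]
    # compare_digest keeps each comparison constant-time
    return any(hmac.compare_digest(presented_key, k) for k in keys)

os.environ["AI_SRE_API_KEYS"] = "key1-abc,key2-def,key3-ghi"
print(is_authorized("key2-def"))   # True
print(is_authorized("wrong-key"))  # False
```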

Rate Limiting

AI_SRE_RATE_LIMIT=60   # 60 requests per minute per IP
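The per-IP limit can be pictured as a sliding one-minute window. This sketch shows the bookkeeping under that assumption (illustrative only; the platform's actual limiter may differ):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client IP."""

    def __init__(self, limit: int = 60, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)

    def allow(self, ip: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window=60.0)
print([limiter.allow("10.0.0.1", now=t) for t in (0, 1, 2, 3)])
# [True, True, True, False]

# Once the earliest hits expire, the client is admitted again.
print(limiter.allow("10.0.0.1", now=61.5))  # True
```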

CORS

Restrict to your frontend domains:

AI_SRE_CORS_ORIGINS=https://console.example.com,https://admin.example.com

Safety Defaults

For production, start with the safest configuration:

DEFAULT_DRY_RUN=true
APPROVAL_REQUIRED=true
AUTONOMY_ENABLED=false
AUTONOMOUS_ACTIONS=restart_pod
ALLOWED_NAMESPACES=staging,production
MAX_SCALE_REPLICAS=10
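How these flags compose into a single gate can be sketched as follows. The environment variable names come from the config above; the function itself is hypothetical, not the platform's implementation:

```python
import os

def action_permitted(action: str, namespace: str, target_replicas: int = 0) -> bool:
    """Illustrative gate combining the safety flags above."""
    env = os.environ
    if env.get("AUTONOMY_ENABLED", "false").lower() != "true":
        return False
    allowed_actions = [a.strip() for a in env.get("AUTONOMOUS_ACTIONS", "").split(",") if a.strip()]
    if action not in allowed_actions:
        return False
    allowed_ns = [n.strip() for n in env.get("ALLOWED_NAMESPACES", "").split(",")]
    if namespace not in allowed_ns:
        return False
    if target_replicas > int(env.get("MAX_SCALE_REPLICAS", "10")):
        return False
    return True

os.environ.update({
    "AUTONOMY_ENABLED": "true",
    "AUTONOMOUS_ACTIONS": "restart_pod",
    "ALLOWED_NAMESPACES": "staging,production",
    "MAX_SCALE_REPLICAS": "10",
})
print(action_permitted("restart_pod", "staging"))       # True
print(action_permitted("scale_deployment", "staging"))  # False (not allow-listed)
```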

Gradually relax as you build confidence:

  1. Enable autonomy for low-risk actions in non-prod namespaces first
  2. Monitor the /activity endpoint and action logs
  3. Expand AUTONOMOUS_ACTIONS and namespaces over time
  4. Consider setting DEFAULT_DRY_RUN=false only after sufficient piloting

Monitoring the Platform

Health Check

# Kubernetes liveness/readiness probes use this
GET /health

Prometheus Metrics

Scrape /metrics with your Prometheus instance:

# prometheus.yml
scrape_configs:
  - job_name: ai-sre
    static_configs:
      - targets: ['ai-sre:8888']

Exposed metrics:

  • ai_sre_alerts_ingested_total (labels: source, severity)
  • ai_sre_diagnosis_duration_seconds
  • ai_sre_actions_executed_total (labels: action, outcome, namespace)
  • ai_sre_action_duration_seconds
  • ai_sre_active_incidents

Pilot Metrics

Export high-level pilot metrics:

GET /metrics/export

Returns: average time to first response (TTFR), MTTR, total actions, and incident count.

Platform Overview

GET /platform/overview

Returns: incident counts by state, average TTFR, average MTTR, total actions, workspace info, guardrail status.


Troubleshooting

Server fails to start

  • Check that DATABASE_URL is correct and the target directory exists
  • SQLite: the data/ directory is created automatically
  • PostgreSQL: verify the database exists and credentials are correct

No LLM responses

  • Verify LLM_PROVIDER is set correctly
  • Check that the corresponding API key is set and valid
  • For auto mode: at least one of ANTHROPIC_API_KEY or OPENAI_API_KEY must be set

Kubernetes actions fail

  • Verify KUBECONFIG path or in-cluster service account has correct RBAC
  • Check ALLOWED_NAMESPACES includes the target namespace
  • Verify the pod/deployment exists in the specified namespace

Dead-letter queue growing

curl http://localhost:8888/webhook/dead-letter

Check the error messages for patterns (database timeouts, validation errors, etc.).

Alerts not deduplicating

  • Check DEDUP_STRATEGY (exact, fuzzy, window)
  • Verify alerts include a fingerprint or consistent groupKey
  • Check ALERTMANAGER_DEDUP_TTL_SECONDS for cache TTL
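The exact strategy can be pictured as a fingerprint cache with a TTL. This illustrative sketch (not the platform's code) shows why a stable fingerprint or groupKey matters -- without one, identity falls back to whatever fields happen to be present:

```python
import hashlib
import time

class ExactDeduper:
    """Suppress alerts whose fingerprint was seen within the TTL (illustrative)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.seen = {}

    def fingerprint(self, alert: dict) -> str:
        # Prefer an explicit fingerprint/groupKey; otherwise hash identity fields.
        explicit = alert.get("fingerprint") or alert.get("groupKey")
        if explicit:
            return explicit
        identity = f"{alert.get('service')}|{alert.get('alertname')}"
        return hashlib.sha256(identity.encode()).hexdigest()

    def is_duplicate(self, alert: dict, now=None) -> bool:
        now = time.monotonic() if now is None else now
        fp = self.fingerprint(alert)
        last = self.seen.get(fp)
        self.seen[fp] = now
        return last is not None and now - last < self.ttl

d = ExactDeduper(ttl_seconds=300)
a = {"service": "checkout", "alertname": "HighErrorRate"}
print(d.is_duplicate(a, now=0))    # False -- first sighting
print(d.is_duplicate(a, now=10))   # True  -- same fingerprint within TTL
print(d.is_duplicate(a, now=400))  # False -- TTL expired
```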