AI-SRE Deployment Guide¶
This guide covers production deployment of the AI-SRE platform across all supported configurations.
Deployment Options¶
| Option | Best For | Database | Scaling |
|---|---|---|---|
| Local dev | Development, evaluation | SQLite | Single process |
| Docker Compose | Pilot, small team | SQLite | Single node |
| Helm chart | Production Kubernetes | SQLite or PostgreSQL | Multi-replica with HPA |
Prerequisites¶
- Python 3.11+ (local dev) or Docker 24+ (containerized)
- At least one LLM API key: Anthropic (`ANTHROPIC_API_KEY`) or OpenAI (`OPENAI_API_KEY`)
- Optional: Kubernetes cluster access (for remediation actions)
- Optional: Slack app credentials (for Slack bot)
- Optional: PostgreSQL 14+ (for production database)
Local Development¶
Setup¶
```bash
# Clone repository
git clone <repo-url> && cd AI-SRE

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with dev dependencies
pip install -e ".[dev]"

# Copy and configure environment
cp .env.example .env
```
Minimal Configuration¶
Edit `.env` with at minimum:
```bash
# One LLM provider is required
ANTHROPIC_API_KEY=sk-ant-...
# or
OPENAI_API_KEY=sk-...

# Use mock log provider for offline development
LOG_PROVIDER=mock

# SQLite database (created automatically)
DATABASE_URL=sqlite+aiosqlite:///./data/ai_sre.db
```
Run¶
```bash
# Start the server with mock log provider
make demo

# Or run directly
python -m src.ingestion.server
```
The server starts on http://localhost:8888 by default. Override with `INGESTION_HOST` and `INGESTION_PORT`.
Seed Demo Data¶
```bash
# In another terminal
make seed
# or
curl -s -X POST http://localhost:8888/demo/seed | python -m json.tool
```
Run Slack Bot (Optional)¶
Requires Slack app credentials in .env:
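The token names below are taken from the Helm examples later in this guide, and the bot entrypoint is the same module the Docker Compose `slack_bot` service runs; assuming those, a minimal setup is:

```bash
# .env -- token names match the Helm secrets examples in this guide
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...

# Start the bot (same module the Docker Compose slack_bot service runs)
python -m src.slack_bot.app
```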
Verify¶
```bash
curl http://localhost:8888/health
# {"status":"ok"}

curl http://localhost:8888/incidents
# {"incidents":[],"count":0}

# Open operator console
open http://localhost:8888/console
```
Docker Compose¶
The Docker Compose setup runs two services on a single node: the ingestion server and the Slack bot.
Build and Run¶
```bash
# Copy and configure environment
cp .env.example .env
# Edit .env with API keys and settings

# Start services
docker compose -f deploy/docker-compose.yml up -d

# View logs
docker compose -f deploy/docker-compose.yml logs -f ingestion

# Stop
docker compose -f deploy/docker-compose.yml down
```
Services¶
| Service | Port | Command | Health Check |
|---|---|---|---|
| `ingestion` | 8888 | `python -m src.ingestion.server` | `GET /health` every 30s |
| `slack_bot` | - | `python -m src.slack_bot.app` | Depends on ingestion health |
Volumes¶
- `ai_sre_data` -- persistent volume for the SQLite database (`/app/data`)
- Kubeconfig mounted read-only at `/root/.kube/config` (if `KUBECONFIG` is set)
Environment Variables¶
Docker Compose reads from your `.env` file. All variables from `.env.example` are passed through with sensible defaults. Key overrides:
```bash
# Change exposed port
INGESTION_PORT=9090

# Switch to PostgreSQL
DATABASE_URL=postgresql+asyncpg://user:pass@postgres:5432/ai_sre
```
Adding PostgreSQL¶
To add PostgreSQL to the Docker Compose stack, append a `postgres` service:
```yaml
# Add to deploy/docker-compose.yml under services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: ai_sre
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: ai_sre
    volumes:
      - pg_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ai_sre"]
      interval: 10s
      timeout: 5s
      retries: 5

# Update the ingestion service to depend on postgres and use:
# DATABASE_URL=postgresql+asyncpg://ai_sre:changeme@postgres:5432/ai_sre

# Add to the top-level volumes: section
  pg_data:
```
Helm Chart (Kubernetes)¶
The Helm chart deploys AI-SRE into a Kubernetes cluster with full RBAC for cluster actions.
Chart Location¶
```
deploy/helm/ai-sre/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    ├── secret.yaml
    ├── serviceaccount.yaml
    ├── clusterrole.yaml
    ├── clusterrolebinding.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    ├── pvc.yaml
    └── NOTES.txt
```
Quick Install¶
```bash
# Install with minimal config
helm install ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-...

# Install with custom namespace
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre --create-namespace \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-...

# Install with full config
helm install ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-... \
  --set secrets.SLACK_BOT_TOKEN=xoxb-... \
  --set secrets.SLACK_APP_TOKEN=xapp-... \
  --set config.WORKSPACE_ID=my-team \
  --set config.WORKSPACE_NAME="My Team" \
  --set config.AUTONOMY_ENABLED=true \
  --set config.DEFAULT_DRY_RUN=false
```
Using an Existing Secret¶
If you manage secrets externally (Vault, Sealed Secrets, External Secrets):
```bash
# Create secret first
kubectl create secret generic ai-sre-secrets \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
  --from-literal=SLACK_BOT_TOKEN=xoxb-...

# Reference it in the install
helm install ai-sre deploy/helm/ai-sre \
  --set existingSecret=ai-sre-secrets
```
Key Values¶
Replicas and Scaling¶
```yaml
replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
```
Note: When using SQLite (default), only 1 replica is supported because SQLite does not handle concurrent writes. Switch to PostgreSQL for multi-replica deployments.
Resources¶
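The chart accepts standard Kubernetes resource requests and limits under `resources:` in `values.yaml`. The figures below are illustrative starting points, not tested recommendations; tune them against your observed usage:

```yaml
# values.yaml -- illustrative sizing, adjust for your workload
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```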
Ingress¶
```yaml
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ai-sre.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: ai-sre-tls
      hosts:
        - ai-sre.example.com
```
Persistence (SQLite)¶
```yaml
persistence:
  enabled: true
  storageClass: ""  # Use cluster default
  accessModes:
    - ReadWriteOnce
  size: 5Gi
```
RBAC for Kubernetes Actions¶
The chart creates a ClusterRole with permissions for the actions the platform can execute:
```yaml
rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "pods/log", "services", "events", "namespaces"]
      verbs: ["get", "list", "watch", "delete"]
    - apiGroups: ["apps"]
      resources: ["deployments", "replicasets", "statefulsets"]
      verbs: ["get", "list", "watch", "patch", "update"]
    - apiGroups: ["autoscaling"]
      resources: ["horizontalpodautoscalers"]
      verbs: ["get", "list", "watch", "patch", "update"]
```
Set `rbac.create: false` if you manage RBAC externally.
Security Context¶
```yaml
podSecurityContext:
  fsGroup: 1000

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: false
  allowPrivilegeEscalation: false
```
Upgrade¶
```bash
helm upgrade ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-... \
  --reuse-values
```
Uninstall¶
```bash
helm uninstall ai-sre

# PVC is not deleted by default. To remove data:
kubectl delete pvc -l app.kubernetes.io/name=ai-sre
```
PostgreSQL Setup¶
For production deployments, switch from SQLite to PostgreSQL.
Install Driver¶
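The production `DATABASE_URL` examples in this guide use the `asyncpg` driver. Assuming it is not already bundled with the platform's dependencies (check for a database extra in `pyproject.toml` first), install it into the same environment:

```bash
pip install asyncpg
```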
Configure Connection¶
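The connection string follows the same shape as the Docker Compose override shown earlier; adjust host, credentials, and database name for your environment:

```bash
# .env -- async PostgreSQL connection (host/credentials are placeholders)
DATABASE_URL=postgresql+asyncpg://ai_sre:changeme@postgres:5432/ai_sre
```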
Run Migrations¶
```bash
# Apply all migrations
make db-migrate
# or
alembic upgrade head

# Create a new migration after model changes
make db-revision MSG="add new column"

# Rollback one migration
make db-downgrade
```
Connection Pooling¶
The platform uses `NullPool` for PostgreSQL async connections to avoid connection leaks. For high-traffic deployments, place a PgBouncer or pgpool-II proxy in front of PostgreSQL.
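A minimal PgBouncer sketch for this setup; the pool sizes are illustrative, and transaction pooling is the usual mode in front of short-lived connections:

```ini
; pgbouncer.ini -- illustrative values, tune for your workload
[databases]
ai_sre = host=postgres port=5432 dbname=ai_sre

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
```

Point `DATABASE_URL` at port 6432 instead of 5432 so connections flow through the pooler.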
Backup¶
```bash
# PostgreSQL
pg_dump -h host -U user ai_sre > backup.sql

# SQLite
cp data/ai_sre.db data/ai_sre.db.backup
```
Connecting Alert Sources¶
Generic Webhook¶
Point any monitoring tool's webhook to:
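Based on the `/webhook` endpoint and `X-API-Key` header used in the Alertmanager and CI/CD examples later in this section, a generic alert post looks like the sketch below; the payload fields are illustrative, so consult the ingestion schema for the accepted shape:

```bash
# Illustrative payload -- verify field names against the ingestion schema
curl -X POST https://ai-sre.example.com/webhook \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{"title": "High error rate", "severity": "critical", "service": "checkout"}'
```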
PagerDuty¶
- In PagerDuty, create a Generic Webhook (v3) subscription pointing to:
- For bidirectional sync, also configure:
- Set `PAGERDUTY_ROUTING_KEY` and `PAGERDUTY_API_TOKEN` for outbound events.
Prometheus/Alertmanager¶
Configure Alertmanager webhook receiver:
```yaml
# alertmanager.yml
receivers:
  - name: ai-sre
    webhook_configs:
      - url: https://ai-sre.example.com/webhook
        http_config:
          headers:
            X-Source: alertmanager
            X-API-Key: your-key
        send_resolved: true
```
Datadog¶
Configure a Datadog webhook integration pointing to:
Set `DATADOG_API_KEY` and `DATADOG_APP_KEY` for log fetching.
Grafana¶
Configure a Grafana contact point (webhook) pointing to:
CI/CD Deploy Events¶
Send deploy events from your pipeline:
```bash
# GitHub Actions example
curl -X POST https://ai-sre.example.com/webhook/deploy \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $AI_SRE_API_KEY" \
  -d '{
    "service": "'"$SERVICE_NAME"'",
    "namespace": "production",
    "image": "'"$IMAGE_TAG"'",
    "actor": "'"$GITHUB_ACTOR"'",
    "commit_sha": "'"$GITHUB_SHA"'"
  }'
```
Connecting Log Providers¶
Grafana Loki¶
```bash
LOG_PROVIDER=loki
LOKI_URL=https://loki.example.com
LOKI_USER=your-user        # Basic auth (optional)
LOKI_PASSWORD=your-password
```
Elasticsearch¶
```bash
LOG_PROVIDER=elastic
ELASTIC_URL=https://elasticsearch.example.com:9200
ELASTIC_API_KEY=your-api-key
ELASTIC_INDEX=logs-*
```
Datadog Logs¶
```bash
LOG_PROVIDER=datadog
DATADOG_API_KEY=your-api-key
DATADOG_APP_KEY=your-app-key
DATADOG_SITE=datadoghq.com
```
Self-Hosted LLM (Ollama)¶
For air-gapped or on-premise deployments:
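The variable names below are assumptions modeled on the other provider blocks in this guide (only `LLM_PROVIDER` is confirmed by the troubleshooting section); verify the exact keys against `.env.example`:

```bash
# Assumed variable names -- verify against .env.example
LLM_PROVIDER=ollama
OLLAMA_URL=http://ollama.internal:11434
OLLAMA_MODEL=llama3.1
```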
Notification Setup¶
Slack¶
1. Create a Slack app at https://api.slack.com/apps
2. Enable Socket Mode and add the `connections:write` scope
3. Add bot token scopes: `chat:write`, `channels:read`, `app_mentions:read`
4. Install to workspace and copy tokens:
Microsoft Teams¶
Create an incoming webhook in your Teams channel:
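The variable name below is an assumption following the naming pattern of the other notification settings; confirm the exact key in `.env.example`:

```bash
# Assumed variable name -- verify against .env.example
TEAMS_WEBHOOK_URL=https://outlook.office.com/webhook/...
```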
Email (SMTP)¶
```bash
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=alerts@example.com
SMTP_PASSWORD=...
ALERT_EMAIL_TO=oncall@example.com
```
Security Hardening¶
API Keys¶
Always configure API keys in production:
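Clients in this guide authenticate with an `X-API-Key` header; the server-side variable name below is an assumption, so confirm it against `.env.example`:

```bash
# Assumed variable name -- verify against .env.example
API_KEY=generate-a-long-random-value
```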
Rate Limiting¶
CORS¶
Restrict to your frontend domains:
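The variable name below is an assumption following the document's configuration style; confirm the exact key in `.env.example`:

```bash
# Assumed variable name -- verify against .env.example
CORS_ORIGINS=https://console.example.com,https://ops.example.com
```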
Safety Defaults¶
For production, start with the safest configuration:
```bash
DEFAULT_DRY_RUN=true
APPROVAL_REQUIRED=true
AUTONOMY_ENABLED=false
AUTONOMOUS_ACTIONS=restart_pod
ALLOWED_NAMESPACES=staging,production
MAX_SCALE_REPLICAS=10
```
Gradually relax as you build confidence:
- Enable autonomy for low-risk actions in non-prod namespaces first
- Monitor the `/activity` endpoint and action logs
- Expand `AUTONOMOUS_ACTIONS` and namespaces over time
- Consider setting `DEFAULT_DRY_RUN=false` only after sufficient piloting
Monitoring the Platform¶
Health Check¶
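The same `GET /health` endpoint used during local verification serves as the liveness check behind the example ingress:

```bash
curl https://ai-sre.example.com/health
# {"status":"ok"}
```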
Prometheus Metrics¶
Scrape /metrics with your Prometheus instance:
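A minimal scrape job, assuming the ingress host used elsewhere in this guide:

```yaml
# prometheus.yml -- scrape job for the platform's /metrics endpoint
scrape_configs:
  - job_name: ai-sre
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ["ai-sre.example.com"]
```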
Exposed metrics:
- `ai_sre_alerts_ingested_total` (labels: source, severity)
- `ai_sre_diagnosis_duration_seconds`
- `ai_sre_actions_executed_total` (labels: action, outcome, namespace)
- `ai_sre_action_duration_seconds`
- `ai_sre_active_incidents`
Pilot Metrics¶
Export high-level pilot metrics:
Returns: average time to first response, MTTR, total actions, incident count.
Platform Overview¶
Returns: incident counts by state, average TTFR, average MTTR, total actions, workspace info, guardrail status.
Troubleshooting¶
Server fails to start¶
- Check that `DATABASE_URL` is correct and the target directory exists
- SQLite: the `data/` directory is created automatically
- PostgreSQL: verify the database exists and credentials are correct
No LLM responses¶
- Verify `LLM_PROVIDER` is set correctly
- Check that the corresponding API key is set and valid
- For `auto` mode: at least one of `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` must be set
Kubernetes actions fail¶
- Verify the `KUBECONFIG` path or in-cluster service account has correct RBAC
- Check that `ALLOWED_NAMESPACES` includes the target namespace
- Verify the pod/deployment exists in the specified namespace
Dead-letter queue growing¶
Check the error messages for patterns (database timeouts, validation errors, etc.).
Alerts not deduplicating¶
- Check `DEDUP_STRATEGY` (exact, fuzzy, window)
- Verify alerts include a `fingerprint` or consistent `groupKey`
- Check `ALERTMANAGER_DEDUP_TTL_SECONDS` for cache TTL
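The dedup settings named above can be pinned explicitly; the strategy values come from this guide, while the TTL figure is only an illustrative starting point:

```bash
# Deduplication settings -- TTL value is illustrative
DEDUP_STRATEGY=fuzzy                 # exact | fuzzy | window
ALERTMANAGER_DEDUP_TTL_SECONDS=300
```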