AI-SRE Deployment Guide¶
This guide covers production deployment of the AI-SRE platform across all supported configurations.
Deployment Options¶
| Option | Best For | Database | Scaling |
|---|---|---|---|
| Local dev | Development, evaluation | SQLite | Single process |
| Docker Compose | Pilot, small team | SQLite | Single node |
| Helm chart | Production Kubernetes | SQLite or PostgreSQL | Multi-replica with HPA |
Prerequisites¶
- Python 3.11+ (local dev) or Docker 24+ (containerized)
- At least one LLM API key: Anthropic (`ANTHROPIC_API_KEY`) or OpenAI (`OPENAI_API_KEY`)
- Optional: Kubernetes cluster access (for remediation actions)
- Optional: Slack app credentials (for Slack bot)
- Optional: PostgreSQL 14+ (for production database)
Local Development¶
Setup¶
```bash
# Clone repository
git clone <repo-url> && cd AI-SRE

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with dev dependencies
pip install -e ".[dev]"

# Copy and configure environment
cp .env.example .env
```
Minimal Configuration¶
Edit `.env` with at minimum:
```bash
# One LLM provider is required
ANTHROPIC_API_KEY=sk-ant-...
# or
OPENAI_API_KEY=sk-...

# Use mock log provider for offline development
LOG_PROVIDER=mock

# SQLite database (created automatically)
DATABASE_URL=sqlite+aiosqlite:///./data/ai_sre.db
```
Run¶
```bash
# Start the server with mock log provider
make demo

# Or run directly
python -m src.ingestion.server
```
The server starts on http://localhost:8888 by default. Override with `INGESTION_HOST` and `INGESTION_PORT`.
Seed Demo Data¶
```bash
# In another terminal
make seed
# or
curl -s -X POST http://localhost:8888/demo/seed | python -m json.tool
```
Run Slack Bot (Optional)¶
Requires Slack app credentials in .env:
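The token names below are taken from the Helm examples later in this guide, and the bot entrypoint is the same module the Docker Compose `slack_bot` service runs; assuming those, a minimal setup is:

```bash
# .env -- token names match the Helm secrets examples in this guide
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...

# Start the bot (same module the Docker Compose slack_bot service runs)
python -m src.slack_bot.app
```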
Verify¶
```bash
curl http://localhost:8888/health
# {"status":"ok"}

curl http://localhost:8888/incidents
# {"incidents":[],"count":0}

# Open operator console
open http://localhost:8888/console
```
Docker Compose¶
The Docker Compose setup runs two services on a single node: the ingestion server and the Slack bot.
Build and Run¶
```bash
# Copy and configure environment
cp .env.example .env
# Edit .env with API keys and settings

# Start services
docker compose -f deploy/docker-compose.yml up -d

# View logs
docker compose -f deploy/docker-compose.yml logs -f ingestion

# Stop
docker compose -f deploy/docker-compose.yml down
```
Services¶
| Service | Port | Command | Health Check |
|---|---|---|---|
| `ingestion` | 8888 | `python -m src.ingestion.server` | `GET /health` every 30s |
| `slack_bot` | - | `python -m src.slack_bot.app` | Depends on ingestion health |
Volumes¶
- `ai_sre_data` -- persistent volume for the SQLite database (`/app/data`)
- Kubeconfig mounted read-only at `/root/.kube/config` (if `KUBECONFIG` is set)
Environment Variables¶
Docker Compose reads from your `.env` file. All variables from `.env.example` are passed through with sensible defaults. Key overrides:
```bash
# Change exposed port
INGESTION_PORT=9090

# Switch to PostgreSQL
DATABASE_URL=postgresql+asyncpg://user:pass@postgres:5432/ai_sre
```
Adding PostgreSQL¶
To add PostgreSQL to the Docker Compose stack, append a `postgres` service:
```yaml
# Add to deploy/docker-compose.yml under services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: ai_sre
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: ai_sre
    volumes:
      - pg_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ai_sre"]
      interval: 10s
      timeout: 5s
      retries: 5

# Update the ingestion service to depend on postgres and use:
# DATABASE_URL=postgresql+asyncpg://ai_sre:changeme@postgres:5432/ai_sre

# Add to the top-level volumes: section
  pg_data:
```
Helm Chart (Kubernetes)¶
The Helm chart deploys AI-SRE into a Kubernetes cluster with full RBAC for cluster actions.
Chart Location¶
```
deploy/helm/ai-sre/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    ├── secret.yaml
    ├── serviceaccount.yaml
    ├── clusterrole.yaml
    ├── clusterrolebinding.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    ├── pvc.yaml
    └── NOTES.txt
```
Quick Install¶
```bash
# Install with minimal config
helm install ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-...

# Install with custom namespace
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre --create-namespace \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-...

# Install with full config
helm install ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-... \
  --set secrets.SLACK_BOT_TOKEN=xoxb-... \
  --set secrets.SLACK_APP_TOKEN=xapp-... \
  --set config.WORKSPACE_ID=my-team \
  --set config.WORKSPACE_NAME="My Team" \
  --set config.AUTONOMY_ENABLED=true \
  --set config.DEFAULT_DRY_RUN=false
```
Using an Existing Secret¶
If you manage secrets externally (Vault, Sealed Secrets, External Secrets):
```bash
# Create secret first
kubectl create secret generic ai-sre-secrets \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
  --from-literal=SLACK_BOT_TOKEN=xoxb-...

# Reference it in the install
helm install ai-sre deploy/helm/ai-sre \
  --set existingSecret=ai-sre-secrets
```
Key Values¶
Replicas and Scaling¶
```yaml
replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
```
Note: When using SQLite (default), only 1 replica is supported because SQLite does not handle concurrent writes. Switch to PostgreSQL for multi-replica deployments.
Resources¶
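The chart accepts standard Kubernetes resource requests and limits under `resources:` in `values.yaml`. The figures below are illustrative starting points, not tested recommendations; tune them against your observed usage:

```yaml
# values.yaml -- illustrative sizing, adjust for your workload
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```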
Ingress¶
```yaml
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ai-sre.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: ai-sre-tls
      hosts:
        - ai-sre.example.com
```
Persistence (SQLite)¶
```yaml
persistence:
  enabled: true
  storageClass: ""  # Use cluster default
  accessModes:
    - ReadWriteOnce
  size: 5Gi
```
RBAC for Kubernetes Actions¶
The chart creates a ClusterRole with permissions for the actions the platform can execute:
```yaml
rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "pods/log", "services", "events", "namespaces"]
      verbs: ["get", "list", "watch", "delete"]
    - apiGroups: ["apps"]
      resources: ["deployments", "replicasets", "statefulsets"]
      verbs: ["get", "list", "watch", "patch", "update"]
    - apiGroups: ["autoscaling"]
      resources: ["horizontalpodautoscalers"]
      verbs: ["get", "list", "watch", "patch", "update"]
```
Set `rbac.create: false` if you manage RBAC externally.
Security Context¶
```yaml
podSecurityContext:
  fsGroup: 1000

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: false
  allowPrivilegeEscalation: false
```
Upgrade¶
```bash
helm upgrade ai-sre deploy/helm/ai-sre \
  --set secrets.ANTHROPIC_API_KEY=sk-ant-... \
  --reuse-values
```
Uninstall¶
```bash
helm uninstall ai-sre

# PVC is not deleted by default. To remove data:
kubectl delete pvc -l app.kubernetes.io/name=ai-sre
```
PostgreSQL Setup¶
For production deployments, switch from SQLite to PostgreSQL.
Install Driver¶
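The production `DATABASE_URL` examples in this guide use the `asyncpg` driver. Assuming it is not already bundled with the platform's dependencies (check for a database extra in `pyproject.toml` first), install it into the same environment:

```bash
pip install asyncpg
```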
Configure Connection¶
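The connection string follows the same shape as the Docker Compose override shown earlier; adjust host, credentials, and database name for your environment:

```bash
# .env -- async PostgreSQL connection (host/credentials are placeholders)
DATABASE_URL=postgresql+asyncpg://ai_sre:changeme@postgres:5432/ai_sre
```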
Run Migrations¶
```bash
# Apply all migrations
make db-migrate
# or
alembic upgrade head

# Create a new migration after model changes
make db-revision MSG="add new column"

# Rollback one migration
make db-downgrade
```
Connection Pooling¶
The platform uses `NullPool` for PostgreSQL async connections to avoid connection leaks. For high-traffic deployments, place a PgBouncer or pgpool-II proxy in front of PostgreSQL.
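A minimal PgBouncer sketch for this setup; the pool sizes are illustrative, and transaction pooling is the usual mode in front of short-lived connections:

```ini
; pgbouncer.ini -- illustrative values, tune for your workload
[databases]
ai_sre = host=postgres port=5432 dbname=ai_sre

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
```

Point `DATABASE_URL` at port 6432 instead of 5432 so connections flow through the pooler.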
Backup¶
```bash
# PostgreSQL
pg_dump -h host -U user ai_sre > backup.sql

# SQLite
cp data/ai_sre.db data/ai_sre.db.backup
```
Connecting Alert Sources¶
Generic Webhook¶
Point any monitoring tool's webhook to:
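Based on the `/webhook` endpoint and `X-API-Key` header used in the Alertmanager and CI/CD examples later in this section, a generic alert post looks like the sketch below; the payload fields are illustrative, so consult the ingestion schema for the accepted shape:

```bash
# Illustrative payload -- verify field names against the ingestion schema
curl -X POST https://ai-sre.example.com/webhook \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{"title": "High error rate", "severity": "critical", "service": "checkout"}'
```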
PagerDuty¶
- In PagerDuty, create a Generic Webhook (v3) subscription pointing to:
- For bidirectional sync, also configure:
- Set `PAGERDUTY_ROUTING_KEY` and `PAGERDUTY_API_TOKEN` for outbound events.
Prometheus/Alertmanager¶
Configure Alertmanager webhook receiver:
```yaml
# alertmanager.yml
receivers:
  - name: ai-sre
    webhook_configs:
      - url: https://ai-sre.example.com/webhook
        http_config:
          headers:
            X-Source: alertmanager
            X-API-Key: your-key
        send_resolved: true
```
Datadog¶
Configure a Datadog webhook integration pointing to:
Set `DATADOG_API_KEY` and `DATADOG_APP_KEY` for log fetching.
Grafana¶
Configure a Grafana contact point (webhook) pointing to:
CI/CD Deploy Events¶
Send deploy events from your pipeline:
```bash
# GitHub Actions example
curl -X POST https://ai-sre.example.com/webhook/deploy \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $AI_SRE_API_KEY" \
  -d '{
    "service": "'"$SERVICE_NAME"'",
    "namespace": "production",
    "image": "'"$IMAGE_TAG"'",
    "actor": "'"$GITHUB_ACTOR"'",
    "commit_sha": "'"$GITHUB_SHA"'"
  }'
```
Connecting Log Providers¶
Grafana Loki¶
```bash
LOG_PROVIDER=loki
LOKI_URL=https://loki.example.com
LOKI_USER=your-user        # Basic auth (optional)
LOKI_PASSWORD=your-password
```
Elasticsearch¶
```bash
LOG_PROVIDER=elastic
ELASTIC_URL=https://elasticsearch.example.com:9200
ELASTIC_API_KEY=your-api-key
ELASTIC_INDEX=logs-*
```
Datadog Logs¶
```bash
LOG_PROVIDER=datadog
DATADOG_API_KEY=your-api-key
DATADOG_APP_KEY=your-app-key
DATADOG_SITE=datadoghq.com
```
Self-Hosted LLM (Ollama)¶
For air-gapped or on-premise deployments:
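The variable names below are assumptions modeled on the other provider blocks in this guide (only `LLM_PROVIDER` is confirmed by the troubleshooting section); verify the exact keys against `.env.example`:

```bash
# Assumed variable names -- verify against .env.example
LLM_PROVIDER=ollama
OLLAMA_URL=http://ollama.internal:11434
OLLAMA_MODEL=llama3.1
```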
Notification Setup¶
Slack¶
1. Create a Slack app at https://api.slack.com/apps
2. Enable Socket Mode and add the `connections:write` scope
3. Add bot token scopes: `chat:write`, `channels:read`, `app_mentions:read`
4. Install to workspace and copy tokens:
Microsoft Teams¶
Create an incoming webhook in your Teams channel:
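The variable name below is an assumption following the naming pattern of the other notification settings; confirm the exact key in `.env.example`:

```bash
# Assumed variable name -- verify against .env.example
TEAMS_WEBHOOK_URL=https://outlook.office.com/webhook/...
```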
Email (SMTP)¶
```bash
SMTP_HOST=smtp.example.com
SMTP_PORT=587
SMTP_USER=alerts@example.com
SMTP_PASSWORD=...
ALERT_EMAIL_TO=oncall@example.com
```
Security Hardening¶
API Keys¶
Always configure API keys in production:
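Clients in this guide authenticate with an `X-API-Key` header; the server-side variable name below is an assumption, so confirm it against `.env.example`:

```bash
# Assumed variable name -- verify against .env.example
API_KEY=generate-a-long-random-value
```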
Rate Limiting¶
CORS¶
Restrict to your frontend domains:
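The variable name below is an assumption following the document's configuration style; confirm the exact key in `.env.example`:

```bash
# Assumed variable name -- verify against .env.example
CORS_ORIGINS=https://console.example.com,https://ops.example.com
```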
Safety Defaults¶
For production, start with the safest configuration:
```bash
DEFAULT_DRY_RUN=true
APPROVAL_REQUIRED=true
AUTONOMY_ENABLED=false
AUTONOMOUS_ACTIONS=restart_pod
ALLOWED_NAMESPACES=staging,production
MAX_SCALE_REPLICAS=10
```
Gradually relax as you build confidence:
- Enable autonomy for low-risk actions in non-prod namespaces first
- Monitor the `/activity` endpoint and action logs
- Expand `AUTONOMOUS_ACTIONS` and namespaces over time
- Consider setting `DEFAULT_DRY_RUN=false` only after sufficient piloting
Monitoring the Platform¶
Health Check¶
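The same `GET /health` endpoint used during local verification serves as the liveness check behind the example ingress:

```bash
curl https://ai-sre.example.com/health
# {"status":"ok"}
```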
Prometheus Metrics¶
Scrape /metrics with your Prometheus instance:
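A minimal scrape job, assuming the ingress host used elsewhere in this guide:

```yaml
# prometheus.yml -- scrape job for the platform's /metrics endpoint
scrape_configs:
  - job_name: ai-sre
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ["ai-sre.example.com"]
```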
Exposed metrics:
- `ai_sre_alerts_ingested_total` (labels: source, severity)
- `ai_sre_diagnosis_duration_seconds`
- `ai_sre_actions_executed_total` (labels: action, outcome, namespace)
- `ai_sre_action_duration_seconds`
- `ai_sre_active_incidents`
Pilot Metrics¶
Export high-level pilot metrics:
Returns: average time to first response, MTTR, total actions, incident count.
Platform Overview¶
Returns: incident counts by state, average TTFR, average MTTR, total actions, workspace info, guardrail status.
Troubleshooting¶
Server fails to start¶
- Check that `DATABASE_URL` is correct and the target directory exists
- SQLite: the `data/` directory is created automatically
- PostgreSQL: verify the database exists and credentials are correct
No LLM responses¶
- Verify `LLM_PROVIDER` is set correctly
- Check that the corresponding API key is set and valid
- For `auto` mode: at least one of `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` must be set
Kubernetes actions fail¶
- Verify the `KUBECONFIG` path or in-cluster service account has correct RBAC
- Check that `ALLOWED_NAMESPACES` includes the target namespace
- Verify the pod/deployment exists in the specified namespace
Dead-letter queue growing¶
Check the error messages for patterns (database timeouts, validation errors, etc.).
Alerts not deduplicating¶
- Check `DEDUP_STRATEGY` (exact, fuzzy, window)
- Verify alerts include a `fingerprint` or consistent `groupKey`
- Check `ALERTMANAGER_DEDUP_TTL_SECONDS` for cache TTL
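The dedup settings named above can be pinned explicitly; the strategy values come from this guide, while the TTL figure is only an illustrative starting point:

```bash
# Deduplication settings -- TTL value is illustrative
DEDUP_STRATEGY=fuzzy                 # exact | fuzzy | window
ALERTMANAGER_DEDUP_TTL_SECONDS=300
```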