
Configuration Reference

All AI-SRE configuration is managed through environment variables. Copy .env.example to .env and edit the values for your environment. In Kubernetes deployments, these are passed through Helm values or Kubernetes Secrets.


LLM Provider

AI-SRE requires at least one LLM provider for diagnosis, runbook generation, and postmortem creation.

| Variable | Default | Description |
| --- | --- | --- |
| LLM_PROVIDER | auto | LLM selection strategy. auto prefers Anthropic, falls back to OpenAI. Options: auto, anthropic, openai, ollama |
| ANTHROPIC_API_KEY | -- | Anthropic API key for Claude models |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Anthropic model identifier |
| OPENAI_API_KEY | -- | OpenAI API key |
| OPENAI_MODEL | gpt-4o-mini | OpenAI model identifier |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL for self-hosted LLM inference |
| OLLAMA_MODEL | llama3.2 | Ollama model name |

Provider selection

With LLM_PROVIDER=auto, the platform checks for ANTHROPIC_API_KEY first, then OPENAI_API_KEY. Set the provider explicitly if you want to force a specific backend. Use ollama for air-gapped or on-premise deployments where API calls to external services are not permitted.
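The selection order described above can be sketched as a small function. This is a minimal illustration, not the platform's actual code: the function name resolve_llm_provider is hypothetical, and raising an error when neither key is set is an assumption, since this page does not say what auto does in that case.

```python
from typing import Mapping

VALID_PROVIDERS = {"auto", "anthropic", "openai", "ollama"}


def resolve_llm_provider(env: Mapping[str, str]) -> str:
    """Return the concrete provider name implied by the environment."""
    choice = env.get("LLM_PROVIDER", "auto").strip().lower()
    if choice not in VALID_PROVIDERS:
        raise ValueError(f"unsupported LLM_PROVIDER: {choice!r}")
    if choice != "auto":
        return choice  # an explicit setting always wins
    # auto: prefer Anthropic, then fall back to OpenAI
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # Assumption: the documented behavior for this case is not stated.
    raise RuntimeError(
        "LLM_PROVIDER=auto but neither ANTHROPIC_API_KEY nor OPENAI_API_KEY is set"
    )
```

Calling resolve_llm_provider(os.environ) at startup would surface a misconfigured provider before the first diagnosis request.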


Database

| Variable | Default | Description |
| --- | --- | --- |
| DATABASE_URL | sqlite+aiosqlite:///./data/ai_sre.db | SQLAlchemy async connection URL |

Supported backends:

  • SQLite (development/pilot): sqlite+aiosqlite:///./data/ai_sre.db -- created automatically, single-writer only
  • PostgreSQL (production): postgresql+asyncpg://user:pass@host:5432/ai_sre -- requires asyncpg driver (included in base dependencies)

SQLite limitations

SQLite supports only a single writer at a time. For multi-replica Kubernetes deployments, you must use PostgreSQL. The platform uses NullPool for PostgreSQL connections to prevent connection leaks in async contexts.
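A pre-deployment check for the single-writer constraint might look like the sketch below. The helper names and the replicas parameter are hypothetical; the platform itself may or may not perform such a check.

```python
def database_backend(url: str) -> str:
    """Return the dialect of a SQLAlchemy URL, e.g. 'sqlite' or 'postgresql'."""
    scheme = url.split("://", 1)[0]   # e.g. 'sqlite+aiosqlite'
    return scheme.split("+", 1)[0]    # drop the driver suffix


def check_database_url(url: str, replicas: int) -> None:
    """Reject SQLite when more than one replica would write to the database."""
    if database_backend(url) == "sqlite" and replicas > 1:
        raise ValueError(
            "SQLite is single-writer; use a postgresql+asyncpg:// URL "
            f"when running {replicas} replicas"
        )
```

Running this against the rendered Helm values in CI is one way to catch a multi-replica deployment that still points at the default SQLite URL.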

Database Migrations

When using PostgreSQL, manage schema migrations with Alembic:

make db-migrate                    # Apply all pending migrations
make db-revision MSG="add column"  # Create a new migration
make db-downgrade                  # Roll back one migration

Log Provider

The log provider determines where AI-SRE fetches application logs during diagnosis.

| Variable | Default | Description |
| --- | --- | --- |
| LOG_PROVIDER | mock | Log backend: loki, elastic, datadog, mock, or empty for noop |

Grafana Loki

| Variable | Default | Description |
| --- | --- | --- |
| LOKI_URL | -- | Grafana Loki base URL (e.g., https://loki.example.com) |
| LOKI_USER | -- | Loki basic auth username (optional) |
| LOKI_PASSWORD | -- | Loki basic auth password (optional) |

Elasticsearch

| Variable | Default | Description |
| --- | --- | --- |
| ELASTIC_URL | -- | Elasticsearch endpoint (e.g., https://es.example.com:9200) |
| ELASTIC_API_KEY | -- | Elasticsearch API key |
| ELASTIC_INDEX | logs-* | Elasticsearch index pattern for log queries |

Datadog Logs

| Variable | Default | Description |
| --- | --- | --- |
| DATADOG_API_KEY | -- | Datadog API key |
| DATADOG_APP_KEY | -- | Datadog application key |
| DATADOG_SITE | datadoghq.com | Datadog regional site (e.g., datadoghq.eu for EU) |

Kubernetes & Actions

| Variable | Default | Description |
| --- | --- | --- |
| ACTION_PROVIDER | kubernetes | Action backend: kubernetes, ecs, or auto |
| KUBECONFIG | -- | Path to kubeconfig file. Empty uses in-cluster credentials or ~/.kube/config |
| ALLOWED_NAMESPACES | -- | Comma-separated namespace allowlist. Empty allows all namespaces |
| MAX_SCALE_REPLICAS | 25 | Maximum replica count for scale_deployment actions |
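The two guardrails in this table, the namespace allowlist and the replica ceiling, could be enforced along the following lines. This is a sketch with hypothetical helper names; whether the platform rejects an oversized scale request (as here) or clamps it to the maximum is an assumption.

```python
from typing import Mapping, Set


def parse_csv(value: str) -> Set[str]:
    """Split a comma-separated setting into a set of trimmed, non-empty items."""
    return {item.strip() for item in value.split(",") if item.strip()}


def namespace_allowed(namespace: str, env: Mapping[str, str]) -> bool:
    """An empty ALLOWED_NAMESPACES allows every namespace."""
    allowed = parse_csv(env.get("ALLOWED_NAMESPACES", ""))
    return not allowed or namespace in allowed


def check_scale_request(replicas: int, env: Mapping[str, str]) -> None:
    """Raise if a scale_deployment request exceeds MAX_SCALE_REPLICAS."""
    limit = int(env.get("MAX_SCALE_REPLICAS", "25"))
    if replicas < 0 or replicas > limit:
        # Assumption: out-of-range requests are rejected rather than clamped.
        raise ValueError(f"replicas={replicas} is outside 0..{limit}")
```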

AWS ECS (Alternative Action Provider)

| Variable | Default | Description |
| --- | --- | --- |
| AWS_REGION | us-east-1 | AWS region for ECS operations |
| ECS_CLUSTER | my-cluster | ECS cluster name |

Available actions

The platform ships with three built-in actions: restart_pod (delete a pod so its controller recreates it), scale_deployment (adjust replica count), and suggest_config_fix (generate a configuration change proposal). Custom actions can be registered through the action provider registry.


Safety & Autonomy

These settings control the guardrails around automated remediation. Start with the safest defaults and relax them gradually as you build confidence.

| Variable | Default | Description |
| --- | --- | --- |
| DEFAULT_DRY_RUN | true | When true, all actions default to dry-run mode (preview without executing) |
| APPROVAL_REQUIRED | true | When true, live actions require explicit operator approval |
| AUTONOMY_ENABLED | false | Enable the autonomous remediation loop |
| AUTONOMY_POLL_INTERVAL_SECONDS | 30 | Seconds between autonomy scan cycles |
| AUTONOMY_MAX_INCIDENTS_PER_CYCLE | 20 | Maximum incidents processed in a single autonomy cycle |
| AUTONOMOUS_ACTIONS | restart_pod | Comma-separated list of actions the autonomy loop may execute without human approval |

Production safety checklist

Before enabling autonomy in production:

  1. Set ALLOWED_NAMESPACES to restrict which namespaces actions can target
  2. Start with AUTONOMOUS_ACTIONS=restart_pod only
  3. Keep DEFAULT_DRY_RUN=true until you have reviewed dry-run outputs
  4. Monitor the /activity endpoint to audit all actions
  5. Gradually expand AUTONOMOUS_ACTIONS and namespaces as confidence grows

| Phase | DEFAULT_DRY_RUN | APPROVAL_REQUIRED | AUTONOMY_ENABLED | AUTONOMOUS_ACTIONS |
| --- | --- | --- | --- | --- |
| Evaluation | true | true | false | restart_pod |
| Pilot (staging) | false | true | false | restart_pod |
| Pilot (production) | false | true | true | restart_pod |
| Trusted | false | false | true | restart_pod,scale_deployment |
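To confirm which rollout phase a given environment corresponds to, the phase table can be encoded and matched against the current settings. The sketch below is illustrative only; identify_phase and ROLLOUT_PHASES are hypothetical names, and treating any value other than "true" as false is an assumption about how the platform parses booleans.

```python
from typing import Mapping, Optional

# (phase, DEFAULT_DRY_RUN, APPROVAL_REQUIRED, AUTONOMY_ENABLED, AUTONOMOUS_ACTIONS)
ROLLOUT_PHASES = [
    ("Evaluation", True, True, False, {"restart_pod"}),
    ("Pilot (staging)", False, True, False, {"restart_pod"}),
    ("Pilot (production)", False, True, True, {"restart_pod"}),
    ("Trusted", False, False, True, {"restart_pod", "scale_deployment"}),
]


def as_bool(value: str) -> bool:
    # Assumption: only the literal "true" (any case) counts as true.
    return value.strip().lower() == "true"


def identify_phase(env: Mapping[str, str]) -> Optional[str]:
    """Return the matching phase name, or None for a non-standard combination."""
    actions = env.get("AUTONOMOUS_ACTIONS", "restart_pod")
    current = (
        as_bool(env.get("DEFAULT_DRY_RUN", "true")),
        as_bool(env.get("APPROVAL_REQUIRED", "true")),
        as_bool(env.get("AUTONOMY_ENABLED", "false")),
        {a.strip() for a in actions.split(",") if a.strip()},
    )
    for name, *expected in ROLLOUT_PHASES:
        if tuple(expected) == current:
            return name
    return None
```

A None result flags a combination that is not one of the listed phases, for example autonomy enabled while dry-run is still on, which is worth reviewing before rollout.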

Workspace & Identity

| Variable | Default | Description |
| --- | --- | --- |
| DEPLOYMENT_PROFILE | hybrid | Deployment mode: on_prem (fully self-hosted), saas_agent (managed control plane), hybrid |
| WORKSPACE_ID | default | Default workspace identifier |
| WORKSPACE_NAME | Primary Workspace | Display name for the default workspace |

Workspaces provide multi-tenant isolation. Define additional workspaces in config/workspaces.yaml:

workspaces:
  - id: ws-acme
    name: Acme Corp
    allowed_namespaces:
      - acme-prod
      - acme-staging
    api_keys:
      - key: acme-admin-key-001
        role: admin
      - key: acme-viewer-key-001
        role: viewer

RBAC Roles

| Role | Level | Permissions |
| --- | --- | --- |
| viewer | 10 | Read incidents, metrics, timelines |
| operator | 20 | All viewer permissions + execute actions, manage autonomy, seed demo data |
| admin | 30 | All operator permissions + manage workspaces |
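Because each role includes the permissions of the roles below it, an authorization check reduces to comparing numeric levels. The following sketch uses the levels from the table and the keys from the example workspace; the function names and the in-memory key map are hypothetical stand-ins for however the platform loads config/workspaces.yaml.

```python
from typing import Dict

ROLE_LEVELS: Dict[str, int] = {"viewer": 10, "operator": 20, "admin": 30}

# Mirrors the api_keys entries of the ws-acme example workspace.
API_KEY_ROLES: Dict[str, str] = {
    "acme-admin-key-001": "admin",
    "acme-viewer-key-001": "viewer",
}


def role_allows(role: str, required: str) -> bool:
    """True when `role` is at or above the level of `required`."""
    return ROLE_LEVELS[role] >= ROLE_LEVELS[required]


def authorize(api_key: str, required: str) -> bool:
    """Look up the key's role and compare it with the required role."""
    role = API_KEY_ROLES.get(api_key)
    return role is not None and role_allows(role, required)
```

For instance, executing an action needs at least operator, so the viewer key above is refused while the admin key is accepted.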

Slack Integration

| Variable | Default | Description |
| --- | --- | --- |
| SLACK_BOT_TOKEN | -- | Slack bot OAuth token (xoxb-...) |
| SLACK_APP_TOKEN | -- | Slack app-level token (xapp-...) for Socket Mode |
| SLACK_SIGNING_SECRET | -- | Slack signing secret for request verification |

To set up Slack:

  1. Create a Slack app at api.slack.com/apps
  2. Enable Socket Mode and generate an app-level token with the connections:write scope
  3. Add bot token scopes: chat:write, channels:read, app_mentions:read
  4. Install the app to your workspace and copy the bot token, app-level token, and signing secret into the variables above
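For requests that Slack delivers over HTTP, the signing secret is used to check the X-Slack-Signature header against an HMAC-SHA256 of the timestamp and raw body. The sketch below follows Slack's published v0 signing scheme; the function name and the 300-second replay window parameter are illustrative choices, and how AI-SRE wires this in is not described on this page.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,        # value of the X-Slack-Request-Timestamp header
    body: bytes,           # raw, unparsed request body
    signature: str,        # value of the X-Slack-Signature header
    now: Optional[float] = None,
    max_age: int = 300,
) -> bool:
    """Return True when the signature matches and the request is recent."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > max_age:
        return False  # too old or too far in the future: possible replay
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest("v0=" + digest, signature)
```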

PagerDuty Integration

| Variable | Default | Description |
| --- | --- | --- |
| PAGERDUTY_ROUTING_KEY | -- | Events API v2 routing key for sending trigger/acknowledge/resolve events |
| PAGERDUTY_API_TOKEN | -- | REST API token for fetching incident details (bidirectional sync) |
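For reference, an Events API v2 request body has the shape sketched below and is sent as JSON to the enqueue endpoint. The builder function, its default source and severity values, and the choice to require a summary only for trigger events are illustrative; the payload AI-SRE actually sends may carry additional fields.

```python
from typing import Any, Dict, Optional

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
EVENT_ACTIONS = {"trigger", "acknowledge", "resolve"}


def build_pagerduty_event(
    routing_key: str,
    action: str,
    dedup_key: str,
    summary: Optional[str] = None,
    source: str = "ai-sre",        # hypothetical default
    severity: str = "critical",    # one of: critical, error, warning, info
) -> Dict[str, Any]:
    """Build the JSON body for one Events API v2 event."""
    if action not in EVENT_ACTIONS:
        raise ValueError(f"unknown event_action: {action!r}")
    event: Dict[str, Any] = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,  # ties acknowledge/resolve to the original trigger
    }
    if action == "trigger":
        if not summary:
            raise ValueError("trigger events need a summary")
        event["payload"] = {"summary": summary, "source": source, "severity": severity}
    return event
```

Reusing the same dedup_key for the trigger, acknowledge, and resolve events is what lets PagerDuty treat them as one incident lifecycle.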

Notifications

Microsoft Teams

| Variable | Default | Description |
| --- | --- | --- |
| TEAMS_WEBHOOK_URL | -- | Microsoft Teams incoming webhook URL |

Email (SMTP)

| Variable | Default | Description |
| --- | --- | --- |
| SMTP_HOST | -- | SMTP server hostname |
| SMTP_PORT | 587 | SMTP server port |
| SMTP_USER | -- | SMTP authentication username |
| SMTP_PASSWORD | -- | SMTP authentication password |
| ALERT_EMAIL_TO | -- | Recipient email address for alert notifications |
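To verify SMTP settings before relying on them for alerts, a short standalone script can build and send a test message. This sketch assumes the server accepts STARTTLS on the configured port (typical for 587) and uses SMTP_USER as the sender address; both are assumptions, and the function names are hypothetical.

```python
import smtplib
from email.message import EmailMessage
from typing import Mapping


def build_alert_email(env: Mapping[str, str], subject: str, body: str) -> EmailMessage:
    """Compose a plain-text alert addressed to ALERT_EMAIL_TO."""
    msg = EmailMessage()
    msg["From"] = env["SMTP_USER"]  # assumption: the SMTP user is a valid sender
    msg["To"] = env["ALERT_EMAIL_TO"]
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert_email(env: Mapping[str, str], msg: EmailMessage) -> None:
    """Send the message over an authenticated STARTTLS connection."""
    port = int(env.get("SMTP_PORT", "587"))
    with smtplib.SMTP(env["SMTP_HOST"], port, timeout=10) as smtp:
        smtp.starttls()  # assumption: the server supports STARTTLS
        smtp.login(env["SMTP_USER"], env["SMTP_PASSWORD"])
        smtp.send_message(msg)
```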

Web Push Notifications

| Variable | Default | Description |
| --- | --- | --- |
| VAPID_PUBLIC_KEY | -- | VAPID public key for web push notifications |
| VAPID_PRIVATE_KEY | -- | VAPID private key |
| VAPID_CLAIMS_EMAIL | mailto:admin@ai-sre.local | VAPID claims email address |

Generate VAPID keys with: npx web-push generate-vapid-keys


Alertmanager Integration

| Variable | Default | Description |
| --- | --- | --- |
| ALERTMANAGER_DEDUP_TTL_SECONDS | 300 | Duration (seconds) to suppress duplicate alerts from the same fingerprint |
| ALERTMANAGER_GROUP_ALERTS | true | When true, alerts with the same groupKey are grouped into a single incident |
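Fingerprint-based suppression with a TTL can be pictured as a small in-memory cache. The class below is a sketch under stated assumptions: the name AlertDeduplicator is hypothetical, the window starts at the first sighting and is not extended by later duplicates, and the platform may well keep this state somewhere other than process memory.

```python
import time
from typing import Dict, Optional


class AlertDeduplicator:
    """Suppress alerts whose fingerprint was seen within the last ttl_seconds."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self._first_seen: Dict[str, float] = {}

    def is_duplicate(self, fingerprint: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Forget fingerprints whose suppression window has elapsed.
        self._first_seen = {
            fp: ts for fp, ts in self._first_seen.items() if now - ts < self.ttl
        }
        if fingerprint in self._first_seen:
            return True
        self._first_seen[fingerprint] = now
        return False
```

With the default of 300 seconds, an alert that keeps firing is processed once and then again no sooner than five minutes after that first occurrence.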

| Variable | Default | Description |
| --- | --- | --- |
| SIMILAR_INCIDENT_LIMIT | 3 | Maximum number of similar incidents returned per query |
| SIMILAR_INCIDENT_SEARCH_LIMIT | 50 | Number of recent incidents to scan when searching for similar matches |
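The two limits interact as a scan window and a result cap: only the most recent incidents are scored, and only the best few are returned. The sketch below shows that interaction. The similarity measure (Jaccard overlap of lowercase words) is purely an assumption for illustration, since this page does not describe how AI-SRE scores similarity, and find_similar is a hypothetical name.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_similar(
    query: str,
    recent_incidents: List[str],   # incident summaries, newest first
    limit: int = 3,                # SIMILAR_INCIDENT_LIMIT
    search_limit: int = 50,        # SIMILAR_INCIDENT_SEARCH_LIMIT
) -> List[str]:
    """Score the newest `search_limit` incidents and return the top `limit`."""
    query_tokens = tokenize(query)
    window = recent_incidents[:search_limit]
    scored = [(jaccard(query_tokens, tokenize(text)), text) for text in window]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in matches[:limit]]
```

Raising the search limit widens the pool of candidates at the cost of more work per query, while the result limit only affects how many matches are shown.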

API Security

| Variable | Default | Description |
| --- | --- | --- |
| AI_SRE_API_KEYS | -- | Comma-separated API keys. When empty, the platform runs in dev mode with open access |
| AI_SRE_RATE_LIMIT | 60 | Requests per minute per client IP. Set to 0 to disable rate limiting |
| AI_SRE_CORS_ORIGINS | * | Comma-separated CORS allowed origins. Restrict to your frontend domains in production |

Always set API keys in production

Without AI_SRE_API_KEYS configured, all endpoints are accessible without authentication. This is convenient for local development but must not be used in production.
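The difference between dev mode and an authenticated deployment can be summarized in a few lines. This is a sketch of the documented behavior, not the platform's code: the function names are hypothetical, and how the key is presented to the server (header name, scheme) is not covered on this page.

```python
import hmac
from typing import List, Mapping, Optional


def load_api_keys(env: Mapping[str, str]) -> List[str]:
    """Parse AI_SRE_API_KEYS into a list of trimmed, non-empty keys."""
    raw = env.get("AI_SRE_API_KEYS", "")
    return [key.strip() for key in raw.split(",") if key.strip()]


def is_authorized(presented: Optional[str], env: Mapping[str, str]) -> bool:
    """Open access when no keys are configured; otherwise require a match."""
    keys = load_api_keys(env)
    if not keys:
        return True  # dev mode: every request is accepted
    if presented is None:
        return False
    # Constant-time comparison so timing does not reveal partial matches.
    return any(hmac.compare_digest(presented.encode(), key.encode()) for key in keys)
```

Note that an empty or whitespace-only AI_SRE_API_KEYS value behaves the same as an unset one here, which is exactly the situation the warning above is about.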


Module Enable/Disable

Coming soon

Per-module enable/disable configuration is planned for a future release. Currently, all modules are active when the server starts. The module catalog below documents what each module provides.

The planned configuration will look like:

# Example (not yet implemented)
MODULES_ENABLED=ingestion,reasoning,actions,autonomy
MODULES_DISABLED=chaos,compliance,terraform

Until this feature lands, modules that depend on external services (Slack, PagerDuty, Datadog) degrade gracefully when their credentials are not configured.