Configuration Reference¶

All AI-SRE configuration is managed through environment variables. Copy .env.example to .env and edit the values for your environment. In Kubernetes deployments, supply the same variables through Helm values or Kubernetes Secrets.

LLM Provider¶

AI-SRE requires at least one LLM provider for diagnosis, runbook generation, and postmortem creation.

Variable	Default	Description
`LLM_PROVIDER`	`auto`	LLM selection strategy. `auto` prefers Anthropic, falls back to OpenAI. Options: `auto`, `anthropic`, `openai`, `ollama`
`ANTHROPIC_API_KEY`	--	Anthropic API key for Claude models
`ANTHROPIC_MODEL`	`claude-sonnet-4-6`	Anthropic model identifier
`OPENAI_API_KEY`	--	OpenAI API key
`OPENAI_MODEL`	`gpt-4o-mini`	OpenAI model identifier
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama server URL for self-hosted LLM inference
`OLLAMA_MODEL`	`llama3.2`	Ollama model name

Provider selection

With LLM_PROVIDER=auto, the platform checks for ANTHROPIC_API_KEY first, then OPENAI_API_KEY. Set the provider explicitly to force a specific backend. Use ollama for air-gapped or on-premises deployments where calls to external APIs are not permitted.
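The selection order can be sketched in a few lines of Python. This is an illustration only, not the platform's code: `resolve_provider` is a hypothetical helper, and raising an error when `auto` finds no API key is an assumption, since the behaviour in that case is not documented here.

```python
from typing import Mapping

KNOWN_PROVIDERS = {"anthropic", "openai", "ollama"}


def resolve_provider(env: Mapping[str, str]) -> str:
    """Return the backend implied by LLM_PROVIDER and the configured keys."""
    choice = env.get("LLM_PROVIDER", "auto").strip().lower()
    if choice != "auto":
        if choice not in KNOWN_PROVIDERS:
            raise ValueError(f"unknown LLM_PROVIDER: {choice!r}")
        return choice
    # auto: prefer Anthropic, then fall back to OpenAI.
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # Assumption: with no key present, fail loudly instead of guessing.
    raise RuntimeError("LLM_PROVIDER=auto but no Anthropic or OpenAI key is set")
```

Passing `os.environ` as `env` applies the same check to the running process.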

Database¶

Variable	Default	Description
`DATABASE_URL`	`sqlite+aiosqlite:///./data/ai_sre.db`	SQLAlchemy async connection URL

Supported backends:

SQLite (development/pilot): sqlite+aiosqlite:///./data/ai_sre.db -- created automatically, single-writer only
PostgreSQL (production): postgresql+asyncpg://user:pass@host:5432/ai_sre -- requires the asyncpg driver (included in base dependencies)

SQLite limitations

SQLite supports only a single writer at a time. For multi-replica Kubernetes deployments, you must use PostgreSQL. The platform uses NullPool for PostgreSQL connections to prevent connection leaks in async contexts.
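A deployment script could catch the SQLite/multi-replica mismatch before rollout. The sketch below is a standalone check and not part of AI-SRE; `database_backend` and `check_replica_safety` are hypothetical names.

```python
def database_backend(url: str) -> str:
    """Return the dialect ("sqlite" or "postgresql") of a DATABASE_URL."""
    scheme = url.split("://", 1)[0]        # e.g. "postgresql+asyncpg"
    dialect = scheme.split("+", 1)[0]      # drop the driver suffix
    if dialect not in {"sqlite", "postgresql"}:
        raise ValueError(f"unsupported database backend: {dialect!r}")
    return dialect


def check_replica_safety(url: str, replicas: int) -> None:
    """Reject SQLite when more than one replica would write to the database."""
    if database_backend(url) == "sqlite" and replicas > 1:
        raise ValueError(
            "SQLite is single-writer; use PostgreSQL for multi-replica deployments"
        )
```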

Database Migrations¶

When using PostgreSQL, manage schema migrations with Alembic:

make db-migrate                    # Apply all pending migrations
make db-revision MSG="add column"  # Create a new migration
make db-downgrade                  # Roll back one migration

Log Provider¶

The log provider determines where AI-SRE fetches application logs during diagnosis.

Variable	Default	Description
`LOG_PROVIDER`	`mock`	Log backend: `loki`, `elastic`, `datadog`, `mock`, or empty for noop

Grafana Loki¶

Variable	Default	Description
`LOKI_URL`	--	Grafana Loki base URL (e.g., `https://loki.example.com`)
`LOKI_USER`	--	Loki basic auth username (optional)
`LOKI_PASSWORD`	--	Loki basic auth password (optional)
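To check the Loki settings independently of AI-SRE, you can build a request against Loki's `query_range` HTTP endpoint. This is a connectivity sketch under assumptions: `build_loki_request` is a hypothetical helper, and whether the platform itself uses this endpoint is not stated here.

```python
import base64
import urllib.parse
import urllib.request
from typing import Optional


def build_loki_request(
    base_url: str,
    logql: str,
    start_ns: int,
    end_ns: int,
    limit: int = 100,
    user: Optional[str] = None,
    password: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET request for Loki's /loki/api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    url = base_url.rstrip("/") + "/loki/api/v1/query_range?" + params
    request = urllib.request.Request(url)
    if user and password:
        # LOKI_USER / LOKI_PASSWORD map to HTTP basic auth.
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        request.add_header("Authorization", f"Basic {token}")
    return request
```

Pass the result to `urllib.request.urlopen` to run the query; `start_ns` and `end_ns` are Unix timestamps in nanoseconds.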

Elasticsearch¶

Variable	Default	Description
`ELASTIC_URL`	--	Elasticsearch endpoint (e.g., `https://es.example.com:9200`)
`ELASTIC_API_KEY`	--	Elasticsearch API key
`ELASTIC_INDEX`	`logs-*`	Elasticsearch index pattern for log queries

Datadog Logs¶

Variable	Default	Description
`DATADOG_API_KEY`	--	Datadog API key
`DATADOG_APP_KEY`	--	Datadog application key
`DATADOG_SITE`	`datadoghq.com`	Datadog regional site (e.g., `datadoghq.eu` for EU)

Kubernetes & Actions¶

Variable	Default	Description
`ACTION_PROVIDER`	`kubernetes`	Action backend: `kubernetes`, `ecs`, or `auto`
`KUBECONFIG`	--	Path to kubeconfig file. Empty uses in-cluster credentials or `~/.kube/config`
`ALLOWED_NAMESPACES`	--	Comma-separated namespace allowlist. Empty allows all namespaces
`MAX_SCALE_REPLICAS`	`25`	Maximum replica count for `scale_deployment` actions
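The two guardrails in this table amount to a pre-flight check on every action request. The sketch below assumes out-of-range requests are rejected; the platform might instead clamp the value, which is not specified here. `check_scale_request` is a hypothetical name.

```python
def parse_allowlist(raw: str) -> frozenset:
    """Split a comma-separated ALLOWED_NAMESPACES value into a set."""
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def check_scale_request(
    namespace: str,
    replicas: int,
    allowed_namespaces: str = "",
    max_scale_replicas: int = 25,
) -> None:
    """Raise if a scale_deployment request violates the configured limits."""
    allowed = parse_allowlist(allowed_namespaces)
    # An empty allowlist means every namespace is permitted.
    if allowed and namespace not in allowed:
        raise PermissionError(f"namespace {namespace!r} is not in ALLOWED_NAMESPACES")
    # Assumption: reject, rather than clamp, requests above the ceiling.
    if not 0 <= replicas <= max_scale_replicas:
        raise ValueError(f"replicas must be between 0 and {max_scale_replicas}")
```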

AWS ECS (Alternative Action Provider)¶

Variable	Default	Description
`AWS_REGION`	`us-east-1`	AWS region for ECS operations
`ECS_CLUSTER`	`my-cluster`	ECS cluster name

Available actions

The platform ships with three built-in actions: restart_pod (delete a pod so its controller recreates it), scale_deployment (adjust replica count), and suggest_config_fix (generate a configuration change proposal). Custom actions can be registered through the action provider registry.

Safety & Autonomy¶

These settings control the guardrails around automated remediation. Start with the safest defaults and relax them gradually as you build confidence.

Variable	Default	Description
`DEFAULT_DRY_RUN`	`true`	When `true`, all actions default to dry-run mode (preview without executing)
`APPROVAL_REQUIRED`	`true`	When `true`, live actions require explicit operator approval
`AUTONOMY_ENABLED`	`false`	Enable the autonomous remediation loop
`AUTONOMY_POLL_INTERVAL_SECONDS`	`30`	Seconds between autonomy scan cycles
`AUTONOMY_MAX_INCIDENTS_PER_CYCLE`	`20`	Maximum incidents processed in a single autonomy cycle
`AUTONOMOUS_ACTIONS`	`restart_pod`	Comma-separated list of actions the autonomy loop may execute without human approval
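One way to read how these flags combine is as a single decision per action request. The sketch below is an interpretation of the table, not the platform's implementation: `SafetySettings` and `decide` are hypothetical names, and the order in which the flags are evaluated is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SafetySettings:
    default_dry_run: bool = True
    approval_required: bool = True
    autonomy_enabled: bool = False
    autonomous_actions: frozenset = frozenset({"restart_pod"})


def decide(
    settings: SafetySettings,
    action: str,
    *,
    from_autonomy_loop: bool,
    dry_run: Optional[bool] = None,
) -> str:
    """Return "dry_run", "needs_approval", or "execute" for one request."""
    # An explicit dry_run flag on the request overrides the default.
    effective_dry_run = settings.default_dry_run if dry_run is None else dry_run
    if effective_dry_run:
        return "dry_run"
    if from_autonomy_loop:
        # The loop may act alone only on explicitly listed actions.
        if settings.autonomy_enabled and action in settings.autonomous_actions:
            return "execute"
        return "needs_approval"
    return "needs_approval" if settings.approval_required else "execute"
```

With the defaults, every request resolves to `dry_run`, which matches the advice to start from the safest settings.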

Production safety checklist

Before enabling autonomy in production:

Set ALLOWED_NAMESPACES to restrict which namespaces actions can target
Start with AUTONOMOUS_ACTIONS=restart_pod only
Keep DEFAULT_DRY_RUN=true until you have reviewed dry-run outputs
Monitor the /activity endpoint to audit all actions
Gradually expand AUTONOMOUS_ACTIONS and namespaces as confidence grows

Recommended progression¶

Phase	`DEFAULT_DRY_RUN`	`APPROVAL_REQUIRED`	`AUTONOMY_ENABLED`	`AUTONOMOUS_ACTIONS`
Evaluation	`true`	`true`	`false`	`restart_pod`
Pilot (staging)	`false`	`true`	`false`	`restart_pod`
Pilot (production)	`false`	`true`	`true`	`restart_pod`
Trusted	`false`	`false`	`true`	`restart_pod,scale_deployment`
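The progression can also be kept as data and rendered into .env lines when moving between phases. The phase keys below (`evaluation`, `pilot-staging`, and so on) are hypothetical labels chosen for this sketch; the values are copied from the table.

```python
PHASES = {
    "evaluation": ("true", "true", "false", "restart_pod"),
    "pilot-staging": ("false", "true", "false", "restart_pod"),
    "pilot-production": ("false", "true", "true", "restart_pod"),
    "trusted": ("false", "false", "true", "restart_pod,scale_deployment"),
}

VARIABLES = (
    "DEFAULT_DRY_RUN",
    "APPROVAL_REQUIRED",
    "AUTONOMY_ENABLED",
    "AUTONOMOUS_ACTIONS",
)


def render_env(phase: str) -> str:
    """Return the .env lines for one phase of the progression."""
    values = PHASES[phase]
    return "\n".join(f"{name}={value}" for name, value in zip(VARIABLES, values))
```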

Workspace & Identity¶

Variable	Default	Description
`DEPLOYMENT_PROFILE`	`hybrid`	Deployment mode: `on_prem` (fully self-hosted), `saas_agent` (managed control plane), `hybrid`
`WORKSPACE_ID`	`default`	Default workspace identifier
`WORKSPACE_NAME`	`Primary Workspace`	Display name for the default workspace

Workspaces provide multi-tenant isolation. Define additional workspaces in config/workspaces.yaml:

workspaces:
  - id: ws-acme
    name: Acme Corp
    allowed_namespaces:
      - acme-prod
      - acme-staging
    api_keys:
      - key: acme-admin-key-001
        role: admin
      - key: acme-viewer-key-001
        role: viewer

RBAC Roles¶

Role	Level	Permissions
`viewer`	10	Read incidents, metrics, timelines
`operator`	20	All viewer permissions + execute actions, manage autonomy, seed demo data
`admin`	30	All operator permissions + manage workspaces
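Because each role includes the permissions of the roles below it, an authorization check reduces to comparing levels. A minimal sketch, assuming the workspace file has already been parsed into Python dicts; `role_for_key` and `has_role` are hypothetical helpers, not the platform's API.

```python
ROLE_LEVELS = {"viewer": 10, "operator": 20, "admin": 30}

# Mirrors the ws-acme entry from config/workspaces.yaml as a plain dict.
WORKSPACES = [
    {
        "id": "ws-acme",
        "api_keys": [
            {"key": "acme-admin-key-001", "role": "admin"},
            {"key": "acme-viewer-key-001", "role": "viewer"},
        ],
    }
]


def role_for_key(workspaces, api_key):
    """Return (workspace_id, role) for an API key, or None if it is unknown."""
    for workspace in workspaces:
        for entry in workspace.get("api_keys", []):
            if entry["key"] == api_key:
                return workspace["id"], entry["role"]
    return None


def has_role(role: str, minimum: str) -> bool:
    """True when `role` sits at or above `minimum` in the level table."""
    return ROLE_LEVELS[role] >= ROLE_LEVELS[minimum]
```

For example, executing an action would require `has_role(role, "operator")`, which a `viewer` key fails and an `admin` key passes.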

Slack Integration¶

Variable	Default	Description
`SLACK_BOT_TOKEN`	--	Slack bot OAuth token (`xoxb-...`)
`SLACK_APP_TOKEN`	--	Slack app-level token (`xapp-...`) for Socket Mode
`SLACK_SIGNING_SECRET`	--	Slack signing secret for request verification

To set up Slack:

Create a Slack app at api.slack.com/apps
Enable Socket Mode and generate an app-level token with the connections:write scope
Add bot token scopes: chat:write, channels:read, app_mentions:read
Install the app to your workspace and copy the bot token, app-level token, and signing secret into the variables above
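For reference, this is how Slack's signing-secret scheme works when Slack delivers requests over HTTP (events received over Socket Mode arrive on a WebSocket and are not signed this way). The sketch follows Slack's documented `v0` signature format; `verify_slack_signature` is a hypothetical helper, and the five-minute replay window is a common choice, not a value taken from AI-SRE.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,          # X-Slack-Request-Timestamp header
    body: bytes,             # raw request body
    signature: str,          # X-Slack-Signature header
    *,
    now: Optional[float] = None,
    max_age_seconds: int = 300,
) -> bool:
    """Check a request signature against SLACK_SIGNING_SECRET."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale requests to limit replay attacks.
    if abs(current - sent_at) > max_age_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison on bytes.
    return hmac.compare_digest(expected.encode(), signature.encode())
```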

PagerDuty Integration¶

Variable	Default	Description
`PAGERDUTY_ROUTING_KEY`	--	Events API v2 routing key for sending trigger/acknowledge/resolve events
`PAGERDUTY_API_TOKEN`	--	REST API token for fetching incident details (bidirectional sync)
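The routing key is used with PagerDuty's Events API v2, which accepts JSON events at a single enqueue endpoint. The sketch below builds such events without sending them; `build_event` and `build_request` are hypothetical helpers, and the `source` default and the use of an incident ID as `dedup_key` are assumptions.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(
    routing_key: str,
    action: str,
    dedup_key: str,
    summary: str = "",
    source: str = "ai-sre",       # assumed default, for illustration
    severity: str = "critical",
) -> dict:
    """Build an Events API v2 body for trigger, acknowledge, or resolve."""
    if action not in {"trigger", "acknowledge", "resolve"}:
        raise ValueError(f"unsupported event_action: {action!r}")
    event = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # Only trigger events carry a payload.
        event["payload"] = {"summary": summary, "source": source, "severity": severity}
    return event


def build_request(event: dict) -> urllib.request.Request:
    """Wrap an event in a POST request to the enqueue endpoint."""
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Reusing the same `dedup_key` for the later acknowledge and resolve events is what ties them to the alert opened by the trigger.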

Notifications¶

Microsoft Teams¶

Variable	Default	Description
`TEAMS_WEBHOOK_URL`	--	Microsoft Teams incoming webhook URL

Email (SMTP)¶

Variable	Default	Description
`SMTP_HOST`	--	SMTP server hostname
`SMTP_PORT`	`587`	SMTP server port
`SMTP_USER`	--	SMTP authentication username
`SMTP_PASSWORD`	--	SMTP authentication password
`ALERT_EMAIL_TO`	--	Recipient email address for alert notifications
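To verify the SMTP settings outside the platform, a short script with Python's standard library is enough. This sketch assumes the server expects STARTTLS on the configured port, which is conventional for 587 but not stated here; the sender address and both function names are hypothetical.

```python
import smtplib
from email.message import EmailMessage
from typing import Optional


def build_alert_email(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Compose a plain-text alert message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient          # ALERT_EMAIL_TO
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(
    message: EmailMessage,
    host: str,                         # SMTP_HOST
    port: int = 587,                   # SMTP_PORT
    user: Optional[str] = None,        # SMTP_USER
    password: Optional[str] = None,    # SMTP_PASSWORD
) -> None:
    """Send the message, upgrading the connection with STARTTLS first."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()                # assumption: the server supports STARTTLS
        if user and password:
            smtp.login(user, password)
        smtp.send_message(message)
```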

Web Push Notifications¶

Variable	Default	Description
`VAPID_PUBLIC_KEY`	--	VAPID public key for web push notifications
`VAPID_PRIVATE_KEY`	--	VAPID private key
`VAPID_CLAIMS_EMAIL`	`mailto:admin@ai-sre.local`	VAPID claims email address

Generate VAPID keys with: npx web-push generate-vapid-keys

Alertmanager Integration¶

Variable	Default	Description
`ALERTMANAGER_DEDUP_TTL_SECONDS`	`300`	Duration (seconds) to suppress duplicate alerts with the same fingerprint
`ALERTMANAGER_GROUP_ALERTS`	`true`	When `true`, alerts with the same `groupKey` are grouped into a single incident
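Fingerprint deduplication with a TTL can be pictured as a small time-stamped cache. The sketch below is an in-memory illustration, not the platform's implementation: `DedupCache` is a hypothetical class, and it assumes the suppression window starts at the first sighting and is not extended by later duplicates.

```python
import time


class DedupCache:
    """Suppress repeated fingerprints for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._first_seen = {}

    def is_duplicate(self, fingerprint: str) -> bool:
        now = self._clock()
        # Forget fingerprints whose suppression window has expired.
        self._first_seen = {
            fp: seen for fp, seen in self._first_seen.items() if now - seen < self._ttl
        }
        if fingerprint in self._first_seen:
            return True
        self._first_seen[fingerprint] = now
        return False
```

The injectable `clock` keeps the class testable without sleeping.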

Similar Incident Search¶

Variable	Default	Description
`SIMILAR_INCIDENT_LIMIT`	`3`	Maximum number of similar incidents returned per query
`SIMILAR_INCIDENT_SEARCH_LIMIT`	`50`	Number of recent incidents to scan when searching for similar matches
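The two limits interact as a scan window followed by a top-k cut. The sketch below shows that shape only: the similarity measure AI-SRE uses is not described here, so `difflib.SequenceMatcher` is a stand-in, and `find_similar` is a hypothetical function that expects incident titles ordered newest first.

```python
from difflib import SequenceMatcher


def find_similar(query: str, recent_titles: list, limit: int = 3, search_limit: int = 50) -> list:
    """Return up to `limit` titles most similar to `query`.

    Only the first `search_limit` titles are scanned.
    """
    candidates = recent_titles[:search_limit]
    scored = [
        (SequenceMatcher(None, query.lower(), title.lower()).ratio(), title)
        for title in candidates
    ]
    # Highest similarity first; the sort is stable, so ties keep recency order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:limit]]
```

Raising `SIMILAR_INCIDENT_SEARCH_LIMIT` widens the pool of candidates at the cost of more comparisons per query, while `SIMILAR_INCIDENT_LIMIT` only changes how many results are returned.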

API Security¶

Variable	Default	Description
`AI_SRE_API_KEYS`	--	Comma-separated API keys. When empty, the platform runs in dev mode with open access
`AI_SRE_RATE_LIMIT`	`60`	Requests per minute per client IP. Set to `0` to disable rate limiting
`AI_SRE_CORS_ORIGINS`	`*`	Comma-separated CORS allowed origins. Restrict to your frontend domains in production

Always set API keys in production

Without AI_SRE_API_KEYS configured, all endpoints are accessible without authentication. This is convenient for local development but must not be used in production.
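The three behaviours described above (open access when no keys are set, a per-IP request budget, and `0` disabling the limit) can be sketched as follows. The sliding-window algorithm is an assumption, since the limiter's actual strategy is not documented here, and all names are hypothetical.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def parse_api_keys(raw: str) -> frozenset:
    """Split a comma-separated AI_SRE_API_KEYS value into a set."""
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def is_authorized(presented: Optional[str], configured: frozenset) -> bool:
    """Dev mode (no keys configured) admits everyone; otherwise the key must match."""
    if not configured:
        return True
    if presented is None:
        return False
    return any(hmac.compare_digest(presented.encode(), key.encode()) for key in configured)


class RateLimiter:
    """Sliding one-minute window per client IP (assumed algorithm)."""

    def __init__(self, per_minute: int = 60, clock=time.monotonic):
        self._limit = per_minute
        self._clock = clock
        self._hits = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        if self._limit == 0:           # 0 disables rate limiting
            return True
        now = self._clock()
        hits = self._hits[client_ip]
        while hits and now - hits[0] >= 60:
            hits.popleft()             # drop requests older than one minute
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```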

Module Enable/Disable¶

Coming soon

Per-module enable/disable configuration is planned for a future release. Currently, all modules are active when the server starts. The module catalog below documents what each module provides.

The planned configuration will look like:

# Example (not yet implemented)
MODULES_ENABLED=ingestion,reasoning,actions,autonomy
MODULES_DISABLED=chaos,compliance,terraform

Until this feature lands, modules that depend on external services (Slack, PagerDuty, Datadog) gracefully degrade when their credentials are not configured.