Configuration Reference
All AI-SRE configuration is managed through environment variables. Copy .env.example to .env and edit the values for your environment. In Kubernetes deployments, these are passed through Helm values or Kubernetes Secrets.
LLM Provider
AI-SRE requires at least one LLM provider for diagnosis, runbook generation, and postmortem creation.
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `auto` | LLM selection strategy. `auto` prefers Anthropic, falls back to OpenAI. Options: `auto`, `anthropic`, `openai`, `ollama` |
| `ANTHROPIC_API_KEY` | -- | Anthropic API key for Claude models |
| `ANTHROPIC_MODEL` | `claude-sonnet-4-6` | Anthropic model identifier |
| `OPENAI_API_KEY` | -- | OpenAI API key |
| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI model identifier |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL for self-hosted LLM inference |
| `OLLAMA_MODEL` | `llama3.2` | Ollama model name |
Provider selection
With LLM_PROVIDER=auto, the platform checks for ANTHROPIC_API_KEY first, then OPENAI_API_KEY. Set the provider explicitly if you want to force a specific backend. Use ollama for air-gapped or on-premise deployments where API calls to external services are not permitted.
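The selection order described above can be sketched as follows. This is a minimal illustration, not the platform's actual code: the function name `resolve_llm_provider` is hypothetical, and raising an error when `auto` finds no key is an assumption, since the source does not say what happens in that case.

```python
from typing import Mapping

# Provider names taken from the LLM_PROVIDER options listed in the table.
VALID_PROVIDERS = {"auto", "anthropic", "openai", "ollama"}


def resolve_llm_provider(env: Mapping[str, str]) -> str:
    """Return the provider implied by the environment (hypothetical helper)."""
    provider = env.get("LLM_PROVIDER", "auto").strip().lower() or "auto"
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    if provider != "auto":
        return provider  # an explicit setting always wins
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"  # checked first under auto
    if env.get("OPENAI_API_KEY"):
        return "openai"  # fallback under auto
    # Assumption: with no key available, fail loudly rather than guess.
    raise RuntimeError("LLM_PROVIDER=auto but no Anthropic or OpenAI key is set")
```

Calling `resolve_llm_provider(os.environ)` at startup is one way to surface a misconfigured provider before the first diagnosis request.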
Database
| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `sqlite+aiosqlite:///./data/ai_sre.db` | SQLAlchemy async connection URL |
Supported backends:
- SQLite (development/pilot): `sqlite+aiosqlite:///./data/ai_sre.db` (created automatically, single-writer only)
- PostgreSQL (production): `postgresql+asyncpg://user:pass@host:5432/ai_sre` (requires the `asyncpg` driver, included in base dependencies)
SQLite limitations
SQLite supports only a single writer at a time. For multi-replica Kubernetes deployments, you must use PostgreSQL. The platform uses NullPool for PostgreSQL connections to prevent connection leaks in async contexts.
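A deployment script could guard against the single-writer limitation before rollout. The sketch below is illustrative only: `database_backend` and `check_replica_count` are hypothetical names, and the platform itself may not perform this check.

```python
from urllib.parse import urlsplit


def database_backend(database_url: str) -> str:
    """Return 'sqlite' or 'postgresql' based on the URL scheme."""
    scheme = urlsplit(database_url).scheme  # e.g. "sqlite+aiosqlite"
    dialect = scheme.split("+", 1)[0]  # drop the driver suffix
    if dialect not in {"sqlite", "postgresql"}:
        raise ValueError(f"Unsupported database backend: {dialect!r}")
    return dialect


def check_replica_count(database_url: str, replicas: int) -> None:
    """Reject multi-replica deployments that point at SQLite."""
    if database_backend(database_url) == "sqlite" and replicas > 1:
        raise ValueError(
            "SQLite is single-writer; use PostgreSQL for multi-replica deployments"
        )
```

For example, `check_replica_count("sqlite+aiosqlite:///./data/ai_sre.db", 3)` raises `ValueError`, while the same call with a `postgresql+asyncpg://` URL passes.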
Database Migrations
When using PostgreSQL, manage schema migrations with Alembic:

    make db-migrate                      # Apply all pending migrations
    make db-revision MSG="add column"    # Create a new migration
    make db-downgrade                    # Roll back one migration

Log Provider
The log provider determines where AI-SRE fetches application logs during diagnosis.
| Variable | Default | Description |
|---|---|---|
| `LOG_PROVIDER` | `mock` | Log backend: `loki`, `elastic`, `datadog`, `mock`, or empty for noop |
Grafana Loki
| Variable | Default | Description |
|---|---|---|
| `LOKI_URL` | -- | Grafana Loki base URL (e.g., `https://loki.example.com`) |
| `LOKI_USER` | -- | Loki basic auth username (optional) |
| `LOKI_PASSWORD` | -- | Loki basic auth password (optional) |
Elasticsearch
| Variable | Default | Description |
|---|---|---|
| `ELASTIC_URL` | -- | Elasticsearch endpoint (e.g., `https://es.example.com:9200`) |
| `ELASTIC_API_KEY` | -- | Elasticsearch API key |
| `ELASTIC_INDEX` | `logs-*` | Elasticsearch index pattern for log queries |
Datadog Logs
| Variable | Default | Description |
|---|---|---|
| `DATADOG_API_KEY` | -- | Datadog API key |
| `DATADOG_APP_KEY` | -- | Datadog application key |
| `DATADOG_SITE` | `datadoghq.com` | Datadog regional site (e.g., `datadoghq.eu` for EU) |
Kubernetes & Actions
| Variable | Default | Description |
|---|---|---|
| `ACTION_PROVIDER` | `kubernetes` | Action backend: `kubernetes`, `ecs`, or `auto` |
| `KUBECONFIG` | -- | Path to kubeconfig file. Empty uses in-cluster credentials or `~/.kube/config` |
| `ALLOWED_NAMESPACES` | -- | Comma-separated namespace allowlist. Empty allows all namespaces |
| `MAX_SCALE_REPLICAS` | `25` | Maximum replica count for `scale_deployment` actions |
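The two guardrails in this table, the namespace allowlist and the replica ceiling, could be enforced along these lines. This is a sketch under assumptions: `validate_scale_request` is a hypothetical name, and rejecting an out-of-range request (rather than clamping it) is a guess about the platform's behavior.

```python
from typing import List, Mapping


def parse_csv(value: str) -> List[str]:
    """Split a comma-separated setting, ignoring blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def validate_scale_request(env: Mapping[str, str], namespace: str, replicas: int) -> None:
    """Raise if a scale_deployment request violates the configured limits."""
    allowed = parse_csv(env.get("ALLOWED_NAMESPACES", ""))
    # An empty allowlist means every namespace is permitted.
    if allowed and namespace not in allowed:
        raise PermissionError(f"Namespace {namespace!r} is not in ALLOWED_NAMESPACES")
    max_replicas = int(env.get("MAX_SCALE_REPLICAS", "25"))
    # Assumption: requests above the ceiling are rejected, not clamped.
    if not 0 <= replicas <= max_replicas:
        raise ValueError(f"Replica count {replicas} is outside 0..{max_replicas}")
```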
AWS ECS (Alternative Action Provider)
| Variable | Default | Description |
|---|---|---|
| `AWS_REGION` | `us-east-1` | AWS region for ECS operations |
| `ECS_CLUSTER` | `my-cluster` | ECS cluster name |
Available actions
The platform ships with three built-in actions: restart_pod (delete a pod so its controller recreates it), scale_deployment (adjust replica count), and suggest_config_fix (generate a configuration change proposal). Custom actions can be registered through the action provider registry.
Safety & Autonomy
These settings control the guardrails around automated remediation. Start with the safest defaults and gradually relax as you build confidence.
| Variable | Default | Description |
|---|---|---|
| `DEFAULT_DRY_RUN` | `true` | When true, all actions default to dry-run mode (preview without executing) |
| `APPROVAL_REQUIRED` | `true` | When true, live actions require explicit operator approval |
| `AUTONOMY_ENABLED` | `false` | Enable the autonomous remediation loop |
| `AUTONOMY_POLL_INTERVAL_SECONDS` | `30` | Seconds between autonomy scan cycles |
| `AUTONOMY_MAX_INCIDENTS_PER_CYCLE` | `20` | Maximum incidents processed in a single autonomy cycle |
| `AUTONOMOUS_ACTIONS` | `restart_pod` | Comma-separated list of actions the autonomy loop may execute without human approval |
Production safety checklist
Before enabling autonomy in production:
- Set `ALLOWED_NAMESPACES` to restrict which namespaces actions can target
- Start with `AUTONOMOUS_ACTIONS=restart_pod` only
- Keep `DEFAULT_DRY_RUN=true` until you have reviewed dry-run outputs
- Monitor the `/activity` endpoint to audit all actions
- Gradually expand `AUTONOMOUS_ACTIONS` and namespaces as confidence grows
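The checklist above lends itself to a small pre-flight script. The following is a hypothetical sketch, not a shipped tool: `autonomy_warnings` is an invented name, and the warnings are reminders rather than hard failures, since later rollout phases relax some of these settings on purpose.

```python
from typing import List, Mapping


def autonomy_warnings(env: Mapping[str, str]) -> List[str]:
    """Return checklist reminders for an environment with autonomy enabled."""

    def flag(name: str, default: str) -> bool:
        return env.get(name, default).strip().lower() == "true"

    warnings: List[str] = []
    if not flag("AUTONOMY_ENABLED", "false"):
        return warnings  # the checklist only applies once autonomy is on
    if not env.get("ALLOWED_NAMESPACES", "").strip():
        warnings.append("ALLOWED_NAMESPACES is empty: actions may target any namespace")
    actions = [
        a.strip()
        for a in env.get("AUTONOMOUS_ACTIONS", "restart_pod").split(",")
        if a.strip()
    ]
    if actions != ["restart_pod"]:
        warnings.append("AUTONOMOUS_ACTIONS goes beyond restart_pod")
    if not flag("DEFAULT_DRY_RUN", "true"):
        warnings.append("DEFAULT_DRY_RUN is false: confirm dry-run outputs were reviewed")
    return warnings
```

Running this against the production `.env` before flipping `AUTONOMY_ENABLED` gives a quick list of items to confirm with the team.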
Recommended progression
| Phase | DRY_RUN | APPROVAL | AUTONOMY | ACTIONS |
|---|---|---|---|---|
| Evaluation | `true` | `true` | `false` | `restart_pod` |
| Pilot (staging) | `false` | `true` | `false` | `restart_pod` |
| Pilot (production) | `false` | `true` | `true` | `restart_pod` |
| Trusted | `false` | `false` | `true` | `restart_pod,scale_deployment` |
Workspace & Identity
| Variable | Default | Description |
|---|---|---|
| `DEPLOYMENT_PROFILE` | `hybrid` | Deployment mode: `on_prem` (fully self-hosted), `saas_agent` (managed control plane), `hybrid` |
| `WORKSPACE_ID` | `default` | Default workspace identifier |
| `WORKSPACE_NAME` | `Primary Workspace` | Display name for the default workspace |
Workspaces provide multi-tenant isolation. Define additional workspaces in config/workspaces.yaml:

    workspaces:
      - id: ws-acme
        name: Acme Corp
        allowed_namespaces:
          - acme-prod
          - acme-staging
        api_keys:
          - key: acme-admin-key-001
            role: admin
          - key: acme-viewer-key-001
            role: viewer

RBAC Roles
| Role | Level | Permissions |
|---|---|---|
| `viewer` | 10 | Read incidents, metrics, timelines |
| `operator` | 20 | All viewer permissions + execute actions, manage autonomy, seed demo data |
| `admin` | 30 | All operator permissions + manage workspaces |
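Because each role inherits the permissions of the roles below it, an access check can reduce to comparing numeric levels. A minimal sketch, assuming the levels in the table are compared directly (`role_allows` is a hypothetical helper, not a documented API):

```python
# Role levels as listed in the RBAC table.
ROLE_LEVELS = {"viewer": 10, "operator": 20, "admin": 30}


def role_allows(role: str, required: str) -> bool:
    """True when `role` sits at or above the level of `required`."""
    return ROLE_LEVELS[role] >= ROLE_LEVELS[required]
```

For instance, `role_allows("admin", "operator")` is true, while `role_allows("viewer", "operator")` is false, so a viewer key could not execute actions.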
Slack Integration
| Variable | Default | Description |
|---|---|---|
| `SLACK_BOT_TOKEN` | -- | Slack bot OAuth token (`xoxb-...`) |
| `SLACK_APP_TOKEN` | -- | Slack app-level token (`xapp-...`) for Socket Mode |
| `SLACK_SIGNING_SECRET` | -- | Slack signing secret for request verification |
To set up Slack:
- Create a Slack app at api.slack.com/apps
- Enable Socket Mode and add the `connections:write` scope
- Add bot token scopes: `chat:write`, `channels:read`, `app_mentions:read`
- Install to your workspace and copy the three tokens above
PagerDuty Integration
| Variable | Default | Description |
|---|---|---|
| `PAGERDUTY_ROUTING_KEY` | -- | Events API v2 routing key for sending trigger/acknowledge/resolve events |
| `PAGERDUTY_API_TOKEN` | -- | REST API token for fetching incident details (bidirectional sync) |
Notifications
Microsoft Teams
| Variable | Default | Description |
|---|---|---|
| `TEAMS_WEBHOOK_URL` | -- | Microsoft Teams incoming webhook URL |
Email (SMTP)
| Variable | Default | Description |
|---|---|---|
| `SMTP_HOST` | -- | SMTP server hostname |
| `SMTP_PORT` | `587` | SMTP server port |
| `SMTP_USER` | -- | SMTP authentication username |
| `SMTP_PASSWORD` | -- | SMTP authentication password |
| `ALERT_EMAIL_TO` | -- | Recipient email address for alert notifications |
Web Push Notifications
| Variable | Default | Description |
|---|---|---|
| `VAPID_PUBLIC_KEY` | -- | VAPID public key for web push notifications |
| `VAPID_PRIVATE_KEY` | -- | VAPID private key |
| `VAPID_CLAIMS_EMAIL` | `mailto:admin@ai-sre.local` | VAPID claims email address |
Generate VAPID keys with: npx web-push generate-vapid-keys
Alertmanager Integration
| Variable | Default | Description |
|---|---|---|
| `ALERTMANAGER_DEDUP_TTL_SECONDS` | `300` | Duration (seconds) to suppress duplicate alerts from the same fingerprint |
| `ALERTMANAGER_GROUP_ALERTS` | `true` | When true, alerts with the same `groupKey` are grouped into a single incident |
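Fingerprint-based suppression with a TTL can be pictured as a small in-memory cache. The sketch below is an illustration under assumptions, not the platform's implementation: the class name `AlertDeduplicator` is hypothetical, state is kept in process memory, and the suppression window is measured from the first accepted alert rather than sliding with each duplicate.

```python
import time
from typing import Callable, Dict


class AlertDeduplicator:
    """Suppress repeat alerts that share a fingerprint within a TTL window."""

    def __init__(self, ttl_seconds: float = 300, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._last_accepted: Dict[str, float] = {}

    def should_process(self, fingerprint: str) -> bool:
        """Return True for the first alert in a window, False for duplicates."""
        now = self._clock()
        last = self._last_accepted.get(fingerprint)
        if last is not None and now - last < self._ttl:
            return False  # duplicate inside the TTL window
        self._last_accepted[fingerprint] = now
        return True
```

With the default of 300 seconds, a second alert carrying the same fingerprint four minutes after the first would be dropped, while one arriving six minutes later would be processed again.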
Similar Incident Search
| Variable | Default | Description |
|---|---|---|
| `SIMILAR_INCIDENT_LIMIT` | `3` | Maximum number of similar incidents returned per query |
| `SIMILAR_INCIDENT_SEARCH_LIMIT` | `50` | Number of recent incidents to scan when searching for similar matches |
API Security
| Variable | Default | Description |
|---|---|---|
| `AI_SRE_API_KEYS` | -- | Comma-separated API keys. When empty, the platform runs in dev mode with open access |
| `AI_SRE_RATE_LIMIT` | `60` | Requests per minute per client IP. Set to `0` to disable rate limiting |
| `AI_SRE_CORS_ORIGINS` | `*` | Comma-separated CORS allowed origins. Restrict to your frontend domains in production |
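To make the `AI_SRE_RATE_LIMIT` semantics concrete, here is a fixed-window limiter keyed by client IP. It is a sketch only: the source does not say which algorithm the platform uses (fixed window, sliding window, or token bucket), and `FixedWindowRateLimiter` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `limit_per_minute` requests per client IP per minute."""

    def __init__(self, limit_per_minute: int = 60, clock: Callable[[], float] = time.monotonic):
        self._limit = limit_per_minute
        self._clock = clock
        self._windows: Dict[str, Tuple[int, int]] = {}  # ip -> (window index, count)

    def allow(self, client_ip: str) -> bool:
        if self._limit == 0:
            return True  # 0 disables rate limiting, as documented
        window = int(self._clock() // 60)
        start, count = self._windows.get(client_ip, (window, 0))
        if start != window:
            start, count = window, 0  # a new minute resets the counter
        if count >= self._limit:
            return False
        self._windows[client_ip] = (start, count + 1)
        return True
```

A fixed window is the simplest option to reason about, though it permits short bursts at window boundaries that a sliding window would smooth out.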
Always set API keys in production
Without AI_SRE_API_KEYS configured, all endpoints are accessible without authentication. This is convenient for local development but must not be used in production.
Module Enable/Disable
Coming soon
Per-module enable/disable configuration is planned for a future release. Currently, all modules are active when the server starts. The module catalog below documents what each module provides.
The planned configuration will look like:

    # Example (not yet implemented)
    MODULES_ENABLED=ingestion,reasoning,actions,autonomy
    MODULES_DISABLED=chaos,compliance,terraform

Until this feature lands, modules that depend on external services (Slack, PagerDuty, Datadog) gracefully degrade when their credentials are not configured.