Bare Metal Deployment¶
Run AI-SRE directly on a server without containers. This is the simplest deployment option, suitable for evaluation, development, small teams, or environments where Docker/Kubernetes is not available.
Architecture¶
In a bare metal deployment, AI-SRE runs as a single Python process serving the FastAPI application. SQLite is the default database (zero configuration), but PostgreSQL is recommended for any deployment beyond evaluation.
graph LR
subgraph Server
PY[Python Process<br/>FastAPI + Uvicorn]
DB[(SQLite / PostgreSQL)]
PY --> DB
end
MON[Monitoring Tools] -->|POST /webhook| PY
PY -->|K8s API| K8S[Kubernetes Cluster]
PY -->|Notifications| SLACK[Slack / Teams / Email]
Prerequisites¶
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | CPython recommended |
| pip | Latest | For package installation |
| Git | Any | To clone the repository |
| LLM API key | -- | Anthropic or OpenAI (or Ollama for self-hosted) |
Optional:
- PostgreSQL 14+ (for production database)
- `kubectl` with cluster access (for Kubernetes remediation actions)
- Slack app credentials (for Slack bot)
Installation¶
Clone and Set Up¶
# Clone the repository
git clone https://github.com/aabhat-ai/AI-SRE.git
cd AI-SRE
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install with development dependencies
pip install -e ".[dev]"
Configure¶
Create or edit `.env` in the project root with your configuration. At minimum:
# LLM provider (at least one key required)
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Log provider (mock for offline, or configure Loki/Elastic/Datadog)
LOG_PROVIDER=mock
# Database (SQLite for quick start)
DATABASE_URL=sqlite+aiosqlite:///./data/ai_sre.db
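A missing or empty key is a common cause of silent startup failures, so it can help to sanity-check the minimum configuration before launching the server. The sketch below is illustrative and not part of AI-SRE itself; the variable names match the `.env` example above.

```python
import os

# Minimum configuration from the .env example above; at least one LLM key is required.
LLM_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")
REQUIRED = ("LOG_PROVIDER", "DATABASE_URL")

def missing_settings(env: dict[str, str]) -> list[str]:
    """Return the names of settings that must be fixed before startup."""
    problems = [name for name in REQUIRED if not env.get(name)]
    if not any(env.get(key) for key in LLM_KEYS):
        problems.append("ANTHROPIC_API_KEY or OPENAI_API_KEY")
    return problems

if __name__ == "__main__":
    problems = missing_settings(dict(os.environ))
    print("Configuration looks complete." if not problems
          else f"Missing configuration: {', '.join(problems)}")
```

Running it with a complete `.env` loaded prints a confirmation; otherwise it lists what is missing.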
For production, switch to PostgreSQL:
# No extra driver install is needed -- asyncpg is already in the base dependencies
# Update .env
DATABASE_URL=postgresql+asyncpg://ai_sre:password@localhost:5432/ai_sre
# Run migrations
make db-migrate
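If the database and role do not exist yet, create them first (run via `psql` as a superuser). The names and password below mirror the example `DATABASE_URL` above and should be changed for a real deployment:

```sql
-- Create a dedicated role and database matching the DATABASE_URL above
CREATE ROLE ai_sre WITH LOGIN PASSWORD 'password';
CREATE DATABASE ai_sre OWNER ai_sre;
```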
Running the Server¶
Development Mode¶
# Using make (starts with mock log provider)
make run
# Or directly
python -m src.ingestion.server
# Or using the installed entry point
ai-sre-ingest
The server starts on http://localhost:8888 by default.
Production Mode¶
For production, use a process manager to ensure the server restarts on failure and runs on boot.
systemd Service¶
Create /etc/systemd/system/ai-sre.service:
[Unit]
Description=AI-SRE Incident Response Platform
After=network.target postgresql.service
[Service]
Type=simple
User=ai-sre
Group=ai-sre
WorkingDirectory=/opt/ai-sre
Environment=PATH=/opt/ai-sre/.venv/bin:/usr/local/bin:/usr/bin
EnvironmentFile=/opt/ai-sre/.env
ExecStart=/opt/ai-sre/.venv/bin/python -m src.ingestion.server
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/ai-sre/data
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable ai-sre
sudo systemctl start ai-sre
# Check status
sudo systemctl status ai-sre
# View logs
sudo journalctl -u ai-sre -f
Supervisor (Alternative)¶
If you prefer Supervisor over systemd:
; /etc/supervisor/conf.d/ai-sre.conf
[program:ai-sre]
command=/opt/ai-sre/.venv/bin/python -m src.ingestion.server
directory=/opt/ai-sre
user=ai-sre
autostart=true
autorestart=true
stderr_logfile=/var/log/ai-sre/error.log
stdout_logfile=/var/log/ai-sre/output.log
environment=
ANTHROPIC_API_KEY="sk-ant-...",
DATABASE_URL="postgresql+asyncpg://...",
LOG_PROVIDER="loki"
Running the Slack Bot¶
The Slack bot runs as a separate process. Create a second systemd unit for it (e.g. `/etc/systemd/system/ai-sre-slack.service`):
[Unit]
Description=AI-SRE Slack Bot
After=ai-sre.service
[Service]
Type=simple
User=ai-sre
Group=ai-sre
WorkingDirectory=/opt/ai-sre
EnvironmentFile=/opt/ai-sre/.env
ExecStart=/opt/ai-sre/.venv/bin/python -m src.slack_bot.app
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Reverse Proxy Setup¶
For production, place AI-SRE behind a reverse proxy for TLS termination and load balancing.
Nginx¶
server {
listen 443 ssl http2;
server_name ai-sre.example.com;
ssl_certificate /etc/letsencrypt/live/ai-sre.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/ai-sre.example.com/privkey.pem;
location / {
proxy_pass http://127.0.0.1:8888;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# WebSocket support (for real-time features)
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
Caddy (Simpler Alternative)¶
Caddy handles TLS certificate provisioning automatically.
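A minimal Caddyfile for the same setup as the Nginx example might look like this (hostname and upstream port are taken from the examples above; adjust to your deployment). Caddy v2 proxies WebSocket connections transparently, so no extra upgrade headers are needed:

```
ai-sre.example.com {
    reverse_proxy 127.0.0.1:8888
}
```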
Health Monitoring¶
Health Check Endpoint¶
Use this for uptime monitoring (Pingdom, UptimeRobot, etc.) and process manager health checks.
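For example, a probe with curl (the exact path depends on your deployment; `/health` is assumed here):

```
# Exit code 0 when the server responds with a 2xx status
curl -fsS http://localhost:8888/health
```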
Prometheus Metrics¶
Scrape /metrics for operational metrics:
Key metrics:
- `ai_sre_alerts_ingested_total` -- Alert ingestion rate by source and severity
- `ai_sre_diagnosis_duration_seconds` -- LLM diagnosis latency
- `ai_sre_actions_executed_total` -- Action execution count by type and outcome
- `ai_sre_active_incidents` -- Current open incident count
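A matching Prometheus scrape job might look like the following (job name and interval are illustrative; Prometheus scrapes `/metrics` by default):

```yaml
scrape_configs:
  - job_name: ai_sre
    scrape_interval: 30s
    static_configs:
      - targets: ["localhost:8888"]
```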
Backup and Recovery¶
SQLite¶
# Simple file copy (stop the server first for consistency)
cp data/ai_sre.db data/ai_sre.db.backup
# Or use SQLite's built-in backup (works while running)
sqlite3 data/ai_sre.db ".backup data/ai_sre.db.backup"
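The same online backup can be scripted with Python's standard library, which uses SQLite's backup API and produces a consistent snapshot even while the server is running. A sketch, with paths matching the example above:

```python
import sqlite3

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # consistent snapshot, even with concurrent readers
    dest.close()
    src.close()

if __name__ == "__main__":
    backup_sqlite("data/ai_sre.db", "data/ai_sre.db.backup")
```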
PostgreSQL¶
# Full dump
pg_dump -h localhost -U ai_sre ai_sre > backup.sql
# Restore
psql -h localhost -U ai_sre ai_sre < backup.sql
Upgrading¶
cd /opt/ai-sre
# Pull latest code
git pull origin main
# Activate venv and install updates
source .venv/bin/activate
pip install -e ".[dev]"
# Run database migrations (PostgreSQL only)
make db-migrate
# Restart the service
sudo systemctl restart ai-sre
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| Server fails to start | Missing `data/` directory | Created automatically; check filesystem permissions |
| `ModuleNotFoundError` | Virtual environment not activated | Run `source .venv/bin/activate` |
| No LLM responses | Missing or invalid API key | Verify ANTHROPIC_API_KEY or OPENAI_API_KEY in .env |
| Database locked errors | Multiple writers to SQLite | Switch to PostgreSQL for concurrent access |
| Actions fail with 403 | Namespace not in allowlist | Add namespace to ALLOWED_NAMESPACES |
| Slow diagnosis | LLM API latency | Check API provider status; consider using a faster model |