Bare Metal Deployment¶
Run AI-SRE directly on a server without containers. This is the simplest deployment option, suitable for evaluation, development, small teams, or environments where Docker/Kubernetes is not available.
Architecture¶
In a bare metal deployment, AI-SRE runs as a single Python process serving the FastAPI application. SQLite is the default database (zero configuration), but PostgreSQL is recommended for any deployment beyond evaluation.
graph LR
subgraph Server
PY[Python Process<br/>FastAPI + Uvicorn]
DB[(SQLite / PostgreSQL)]
PY --> DB
end
MON[Monitoring Tools] -->|POST /webhook| PY
PY -->|K8s API| K8S[Kubernetes Cluster]
PY -->|Notifications| SLACK[Slack / Teams / Email]
Prerequisites¶
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | CPython recommended |
| pip | Latest | For package installation |
| Git | Any | To clone the repository |
| LLM API key | -- | Anthropic or OpenAI (or Ollama for self-hosted) |
Optional:
- PostgreSQL 14+ (for production database)
- `kubectl` with cluster access (for Kubernetes remediation actions)
- Slack app credentials (for Slack bot)
Installation¶
Clone and Set Up¶
# Clone the repository
git clone https://github.com/aabhat-ai/AI-SRE.git
cd AI-SRE
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install with development dependencies
pip install -e ".[dev]"
Configure¶
Create or edit `.env` in the project root with your configuration. At minimum:
# LLM provider (at least one key required)
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Log provider (mock for offline, or configure Loki/Elastic/Datadog)
LOG_PROVIDER=mock
# Database (SQLite for quick start)
DATABASE_URL=sqlite+aiosqlite:///./data/ai_sre.db
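A missing or empty key is a common cause of silent startup failures, so it can help to sanity-check the minimum configuration before launching the server. The sketch below is illustrative and not part of AI-SRE itself; the variable names match the `.env` example above.

```python
import os

# Minimum configuration from the .env example above; at least one LLM key is required.
LLM_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")
REQUIRED = ("LOG_PROVIDER", "DATABASE_URL")

def missing_settings(env: dict[str, str]) -> list[str]:
    """Return the names of settings that must be fixed before startup."""
    problems = [name for name in REQUIRED if not env.get(name)]
    if not any(env.get(key) for key in LLM_KEYS):
        problems.append("ANTHROPIC_API_KEY or OPENAI_API_KEY")
    return problems

if __name__ == "__main__":
    problems = missing_settings(dict(os.environ))
    print("Configuration looks complete." if not problems
          else f"Missing configuration: {', '.join(problems)}")
```

Running it with a complete `.env` loaded prints a confirmation; otherwise it lists what is missing.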
For production, switch to PostgreSQL:
# No extra driver install is needed -- asyncpg is already in the base dependencies
# Update .env
DATABASE_URL=postgresql+asyncpg://ai_sre:password@localhost:5432/ai_sre
# Run migrations
make db-migrate
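If the database and role do not exist yet, create them first (run via `psql` as a superuser). The names and password below mirror the example `DATABASE_URL` above and should be changed for a real deployment:

```sql
-- Create a dedicated role and database matching the DATABASE_URL above
CREATE ROLE ai_sre WITH LOGIN PASSWORD 'password';
CREATE DATABASE ai_sre OWNER ai_sre;
```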
Running the Server¶
Development Mode¶
# Using make (starts with mock log provider)
make run
# Or directly
python -m src.ingestion.server
# Or using the installed entry point
ai-sre-ingest
The server starts on http://localhost:8888 by default.
Production Mode¶
For production, use a process manager to ensure the server restarts on failure and runs on boot.
systemd Service¶
Create /etc/systemd/system/ai-sre.service:
[Unit]
Description=AI-SRE Incident Response Platform
After=network.target postgresql.service
[Service]
Type=simple
User=ai-sre
Group=ai-sre
WorkingDirectory=/opt/ai-sre
Environment=PATH=/opt/ai-sre/.venv/bin:/usr/local/bin:/usr/bin
EnvironmentFile=/opt/ai-sre/.env
ExecStart=/opt/ai-sre/.venv/bin/python -m src.ingestion.server
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/ai-sre/data
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable ai-sre
sudo systemctl start ai-sre
# Check status
sudo systemctl status ai-sre
# View logs
sudo journalctl -u ai-sre -f
Supervisor (Alternative)¶
If you prefer Supervisor over systemd:
; /etc/supervisor/conf.d/ai-sre.conf
[program:ai-sre]
command=/opt/ai-sre/.venv/bin/python -m src.ingestion.server
directory=/opt/ai-sre
user=ai-sre
autostart=true
autorestart=true
stderr_logfile=/var/log/ai-sre/error.log
stdout_logfile=/var/log/ai-sre/output.log
environment=
ANTHROPIC_API_KEY="sk-ant-...",
DATABASE_URL="postgresql+asyncpg://...",
LOG_PROVIDER="loki"
Running the Slack Bot¶
The Slack bot runs as a separate process. Create a second systemd unit for it (e.g. `/etc/systemd/system/ai-sre-slack.service`):
[Unit]
Description=AI-SRE Slack Bot
After=ai-sre.service
[Service]
Type=simple
User=ai-sre
Group=ai-sre
WorkingDirectory=/opt/ai-sre
EnvironmentFile=/opt/ai-sre/.env
ExecStart=/opt/ai-sre/.venv/bin/python -m src.slack_bot.app
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Reverse Proxy Setup¶
For production, place AI-SRE behind a reverse proxy for TLS termination and load balancing.
Nginx¶
server {
listen 443 ssl http2;
server_name ai-sre.example.com;
ssl_certificate /etc/letsencrypt/live/ai-sre.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/ai-sre.example.com/privkey.pem;
location / {
proxy_pass http://127.0.0.1:8888;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# WebSocket support (for real-time features)
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
Caddy (Simpler Alternative)¶
Caddy handles TLS certificate provisioning automatically.
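A minimal Caddyfile for the same setup as the Nginx example might look like this (hostname and upstream port are taken from the examples above; adjust to your deployment). Caddy v2 proxies WebSocket connections transparently, so no extra upgrade headers are needed:

```
ai-sre.example.com {
    reverse_proxy 127.0.0.1:8888
}
```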
Health Monitoring¶
Health Check Endpoint¶
Use this for uptime monitoring (Pingdom, UptimeRobot, etc.) and process manager health checks.
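For example, a probe with curl (the exact path depends on your deployment; `/health` is assumed here):

```
# Exit code 0 when the server responds with a 2xx status
curl -fsS http://localhost:8888/health
```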
Prometheus Metrics¶
Scrape /metrics for operational metrics:
Key metrics:
- `ai_sre_alerts_ingested_total` -- Alert ingestion rate by source and severity
- `ai_sre_diagnosis_duration_seconds` -- LLM diagnosis latency
- `ai_sre_actions_executed_total` -- Action execution count by type and outcome
- `ai_sre_active_incidents` -- Current open incident count
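A matching Prometheus scrape job might look like the following (job name and interval are illustrative; Prometheus scrapes `/metrics` by default):

```yaml
scrape_configs:
  - job_name: ai_sre
    scrape_interval: 30s
    static_configs:
      - targets: ["localhost:8888"]
```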
Backup and Recovery¶
SQLite¶
# Simple file copy (stop the server first for consistency)
cp data/ai_sre.db data/ai_sre.db.backup
# Or use SQLite's built-in backup (works while running)
sqlite3 data/ai_sre.db ".backup data/ai_sre.db.backup"
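The same online backup can be scripted with Python's standard library, which uses SQLite's backup API and produces a consistent snapshot even while the server is running. A sketch, with paths matching the example above:

```python
import sqlite3

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # consistent snapshot, even with concurrent readers
    dest.close()
    src.close()

if __name__ == "__main__":
    backup_sqlite("data/ai_sre.db", "data/ai_sre.db.backup")
```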
PostgreSQL¶
# Full dump
pg_dump -h localhost -U ai_sre ai_sre > backup.sql
# Restore
psql -h localhost -U ai_sre ai_sre < backup.sql
Upgrading¶
cd /opt/ai-sre
# Pull latest code
git pull origin main
# Activate venv and install updates
source .venv/bin/activate
pip install -e ".[dev]"
# Run database migrations (PostgreSQL only)
make db-migrate
# Restart the service
sudo systemctl restart ai-sre
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| Server fails to start | Missing `data/` directory | Created automatically; check filesystem permissions |
| `ModuleNotFoundError` | Virtual environment not activated | Run `source .venv/bin/activate` |
| No LLM responses | Missing or invalid API key | Verify ANTHROPIC_API_KEY or OPENAI_API_KEY in .env |
| Database locked errors | Multiple writers to SQLite | Switch to PostgreSQL for concurrent access |
| Actions fail with 403 | Namespace not in allowlist | Add namespace to ALLOWED_NAMESPACES |
| Slow diagnosis | LLM API latency | Check API provider status; consider using a faster model |