AI-SRE API Reference
Base URL: http://localhost:8888 (default)
Authentication
All endpoints except /health and /metrics require authentication when AI_SRE_API_KEYS is configured.
Methods:
- Header: X-API-Key: your-api-key
- Query parameter: ?api_key=your-api-key
When no API keys are configured (AI_SRE_API_KEYS is empty), the platform runs in dev mode and accepts every request without authentication.
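The two methods can be sketched in Python with the standard library only. This is a minimal illustration, not an official client: `build_request` is a hypothetical helper, and the request is built but never sent. Only the header name, the query parameter name, and the base URL come from this reference.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8888"  # default base URL from this reference


def build_request(path, api_key=None, use_query=False):
    """Build (but do not send) a GET request carrying the API key.

    build_request is a hypothetical helper, not part of AI-SRE.
    """
    url = BASE_URL + path
    headers = {}
    if api_key and use_query:
        # Method 2: pass the key as the api_key query parameter.
        url += "?" + urlencode({"api_key": api_key})
    elif api_key:
        # Method 1: pass the key in the X-API-Key header.
        headers["X-API-Key"] = api_key
    return Request(url, headers=headers, method="GET")


by_header = build_request("/incidents", api_key="your-api-key")
by_query = build_request("/incidents", api_key="your-api-key", use_query=True)
```

The header form is usually the safer default, since query strings tend to be recorded in access logs and proxy logs.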
Incidents
POST /webhook -- Ingest Alert
Ingest an alert from any supported provider. The X-Source header determines which normalizer processes the payload.
Headers:
| Header | Required | Description |
|--------|----------|-------------|
| X-Source | No | Alert source: webhook (default), pagerduty, opsgenie, alertmanager, datadog, grafana, newrelic |
| X-Workspace-Id | No | Workspace identifier for multi-tenant isolation |
| X-Workspace-Name | No | Workspace display name |
| X-API-Key | Conditional | Required when API keys are configured |
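As a rough sketch of how a client might assemble these headers, the Python below builds a POST /webhook request without sending it. `build_alert_request` is a hypothetical helper, and the client-side check against the source list is an assumption: this reference does not say how the server treats an unknown X-Source value.

```python
import json
from urllib.request import Request

# Source names copied from the X-Source row above.
KNOWN_SOURCES = {"webhook", "pagerduty", "opsgenie", "alertmanager",
                 "datadog", "grafana", "newrelic"}


def build_alert_request(payload, source="webhook", api_key=None, workspace_id=None):
    """Build (but do not send) a POST /webhook request. Hypothetical helper."""
    if source not in KNOWN_SOURCES:
        # Assumption: fail early on the client instead of relying on the server.
        raise ValueError(f"unknown X-Source: {source}")
    headers = {"Content-Type": "application/json", "X-Source": source}
    if api_key:
        headers["X-API-Key"] = api_key
    if workspace_id:
        headers["X-Workspace-Id"] = workspace_id
    body = json.dumps(payload).encode("utf-8")
    return Request("http://localhost:8888/webhook", data=body,
                   headers=headers, method="POST")


req = build_alert_request(
    {"title": "High CPU on payments-api", "severity": "critical"},
    source="webhook", api_key="your-key", workspace_id="ws-acme",
)
```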
Request (generic webhook):
curl -X POST http://localhost:8888/webhook \
-H "Content-Type: application/json" \
-H "X-Source: webhook" \
-H "X-API-Key: your-key" \
-d '{
"title": "High CPU on payments-api",
"description": "CPU usage > 90% for 5 minutes",
"severity": "critical",
"source": "prometheus",
"namespace": "production",
"pod": "payments-api-7d8f9c6b5-x2k4m",
"deployment": "payments-api",
"service": "payments-api"
}'
Request (PagerDuty format):
curl -X POST http://localhost:8888/webhook \
-H "Content-Type: application/json" \
-H "X-Source: pagerduty" \
-d '{
"event": {
"event_type": "incident.triggered",
"data": {
"id": "PD12345",
"title": "Database connection pool exhausted",
"urgency": "high",
"service": {"summary": "auth-service"}
}
}
}'
Request (Alertmanager format):
curl -X POST http://localhost:8888/webhook \
-H "Content-Type: application/json" \
-H "X-Source: alertmanager" \
-d '{
"status": "firing",
"groupKey": "alertmanager/cluster-1/{severity=\"critical\"}",
"alerts": [
{
"status": "firing",
"fingerprint": "abc123",
"labels": {
"alertname": "PodCrashLooping",
"namespace": "production",
"pod": "api-server-5d4f3c2b1-xyz"
},
"annotations": {
"summary": "Pod is crash looping",
"description": "Pod has restarted 5 times in the last 10 minutes"
}
}
]
}'
Response:
{
"ok": true,
"incident_id": "inc-a1b2c3d4",
"deduplicated": false,
"dedup_strategy": null,
"group_id": null
}
When deduplicated:
GET /webhook/dead-letter -- Inspect Dead-Letter Queue
Return alerts that failed processing after all retry attempts.
Query Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| limit | int | 100 | Max entries to return (1-1000) |
Response:
{
"entries": [
{
"payload": {"title": "..."},
"source": "webhook",
"error": "Database connection failed",
"timestamp": "2025-01-15T10:30:00Z"
}
],
"count": 1
}
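When the queue holds many entries, grouping them by error message can show whether one failure dominates. The sketch below assumes the response shape shown above; `summarize_dead_letters` is a hypothetical helper, and the second sample entry is invented purely for illustration.

```python
from collections import Counter

# Sample shaped like the response above; the second entry is invented for illustration.
response = {
    "entries": [
        {"payload": {"title": "..."}, "source": "webhook",
         "error": "Database connection failed", "timestamp": "2025-01-15T10:30:00Z"},
        {"payload": {"title": "..."}, "source": "pagerduty",
         "error": "Database connection failed", "timestamp": "2025-01-15T10:31:00Z"},
    ],
    "count": 2,
}


def summarize_dead_letters(resp):
    """Count dead-letter entries per error message. Hypothetical helper."""
    return Counter(entry["error"] for entry in resp["entries"])


summary = summarize_dead_letters(response)
```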
GET /incidents -- List Incidents
Query Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| limit | int | 20 | Max incidents (1-100) |
| state | string | null | Filter: open, acknowledged, resolved |
| workspace_id | string | null | Filter by workspace |
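A client could validate these parameters before calling the endpoint. The following is a sketch under that assumption: `incidents_url` is a hypothetical helper, and its checks simply mirror the ranges and values listed in the table.

```python
from urllib.parse import urlencode

VALID_STATES = {"open", "acknowledged", "resolved"}  # from the state row above


def incidents_url(limit=20, state=None, workspace_id=None,
                  base_url="http://localhost:8888"):
    """Build the GET /incidents URL. Hypothetical helper; checks mirror the table."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if state is not None and state not in VALID_STATES:
        raise ValueError(f"unknown state: {state}")
    params = {"limit": limit}
    # Omit filters that are unset so the server applies its own defaults.
    if state is not None:
        params["state"] = state
    if workspace_id is not None:
        params["workspace_id"] = workspace_id
    return f"{base_url}/incidents?{urlencode(params)}"


url = incidents_url(limit=50, state="open", workspace_id="ws-acme")
```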
Response:
{
"incidents": [
{
"id": "inc-a1b2c3d4",
"title": "High CPU on payments-api",
"severity": "critical",
"source": "prometheus",
"state": "open",
"namespace": "production",
"pod": "payments-api-7d8f9c6b5-x2k4m",
"deployment": "payments-api",
"created_at": "2025-01-15T10:30:00",
"resolved_at": null,
"workspace_id": "ws-acme"
}
],
"count": 1
}
GET /incidents/{incident_id} -- Incident Detail
Response:
{
"incident": {
"id": "inc-a1b2c3d4",
"title": "High CPU on payments-api",
"severity": "critical",
"state": "open",
"created_at": "2025-01-15T10:30:00"
},
"metrics": {
"time_to_first_response_minutes": 2.5,
"resolution_time_minutes": null,
"actions_taken": 1
},
"actions": [
{
"action": "restart_pod",
"outcome": "success",
"dry_run": false,
"created_at": "2025-01-15T10:32:30"
}
]
}
GET /incidents/{incident_id}/diagnosis -- AI Diagnosis
Returns an AI-powered diagnosis with Kubernetes context, log excerpts, matched playbooks, similar incidents, and suggested actions.
curl "http://localhost:8888/incidents/inc-a1b2c3d4/diagnosis?log_limit=50" \
-H "X-API-Key: your-key"
Query Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| log_limit | int | 50 | Max log lines to fetch (1-200) |
Response:
{
"incident_id": "inc-a1b2c3d4",
"summary": "The payments-api pod is experiencing OOMKilled events due to...",
"root_cause": "Memory limit of 512Mi is insufficient for current traffic load",
"confidence": 0.85,
"k8s_context": {
"pod_status": "CrashLoopBackOff",
"restart_count": 5,
"last_termination_reason": "OOMKilled"
},
"log_excerpt": "2025-01-15 10:29:45 ERROR OutOfMemoryError: Java heap space...",
"matched_playbooks": ["oom-recovery"],
"similar_incidents": [
{
"id": "inc-older-123",
"title": "OOMKilled on payments-api",
"similarity_score": 0.92,
"resolution": "Increased memory limit to 1Gi"
}
],
"deploy_correlation": {
"deploy_id": "deploy-456",
"service": "payments-api",
"minutes_before_incident": 12,
"actor": "ci-bot"
},
"suggested_actions": [
{
"name": "restart_pod",
"description": "Restart one pod so its controller can recreate it.",
"arguments": {"namespace": "production", "pod_name": "payments-api-7d8f9c6b5-x2k4m"},
"executable": true,
"auto_executable": true,
"allowed": true,
"approval_required": true,
"dry_run": true,
"blast_radius": "single_pod",
"policy_name": "default",
"policy_reason": "Action is within guardrails"
}
]
}
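One way to consume the suggested_actions list is to pick out actions that policy allows but that still await an operator. The sketch below reads the allowed and approval_required flags at face value, which is an assumption based on the field names; `needs_operator_approval` is a hypothetical helper.

```python
def needs_operator_approval(diagnosis):
    """Return names of suggested actions that are allowed but still need approval."""
    return [
        action["name"]
        for action in diagnosis.get("suggested_actions", [])
        if action.get("allowed") and action.get("approval_required")
    ]


# Trimmed copy of the response above.
diagnosis = {
    "incident_id": "inc-a1b2c3d4",
    "confidence": 0.85,
    "suggested_actions": [
        {"name": "restart_pod", "allowed": True, "approval_required": True,
         "dry_run": True, "blast_radius": "single_pod"},
    ],
}

pending = needs_operator_approval(diagnosis)
```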
GET /incidents/{incident_id}/similar -- Similar Incidents
Response:
{
"incident_id": "inc-a1b2c3d4",
"similar_incidents": [
{
"id": "inc-older-123",
"title": "OOMKilled on payments-api",
"similarity_score": 0.92,
"severity": "critical",
"state": "resolved",
"created_at": "2025-01-10T08:15:00"
}
],
"count": 1
}
GET /incidents/{incident_id}/group -- Incident Group
Response:
{
"incident_id": "inc-a1b2c3d4",
"parent_id": null,
"children": [
{
"incident_id": "inc-child-567",
"correlation_reason": "same_service_within_window"
}
],
"is_parent": true
}
GET /incidents/{incident_id}/metrics -- Incident Metrics
Response:
{
"incident_id": "inc-a1b2c3d4",
"summary": {
"time_to_first_response_minutes": 2.5,
"resolution_time_minutes": 15.3,
"actions_taken": 3
},
"timeline": [
{
"metric_name": "resolution_time_minutes",
"value": 15.3,
"created_at": "2025-01-15T10:45:30"
}
]
}
GET /incidents/{incident_id}/actions -- Action History
Response:
{
"incident_id": "inc-a1b2c3d4",
"actions": [
{
"action": "restart_pod",
"outcome": "success",
"dry_run": false,
"namespace": "production",
"created_at": "2025-01-15T10:32:30"
}
],
"count": 1
}
GET /incidents/{incident_id}/timeline -- Event Timeline
Response:
{
"incident_id": "inc-a1b2c3d4",
"events": [
{
"event_type": "alert_received",
"actor": "webhook:prometheus",
"payload": {"source": "prometheus", "severity": "critical"},
"created_at": "2025-01-15T10:30:00"
},
{
"event_type": "incident_created",
"actor": "system",
"payload": {"title": "High CPU on payments-api"},
"created_at": "2025-01-15T10:30:00"
},
{
"event_type": "diagnosis_started",
"actor": "system",
"payload": {},
"created_at": "2025-01-15T10:30:15"
},
{
"event_type": "diagnosis_completed",
"actor": "system",
"payload": {"summary": "...", "suggested_actions_count": 2},
"created_at": "2025-01-15T10:30:45"
}
],
"count": 4
}
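The created_at timestamps make it straightforward to measure how long each phase took. As a small sketch, the hypothetical helper `seconds_between` below computes the gap between the first occurrences of two event types, using the timestamps from the response above.

```python
from datetime import datetime


def seconds_between(events, start_type, end_type):
    """Seconds from the first start_type event to the first end_type event."""
    first_seen = {}
    for event in events:
        # setdefault keeps the earliest occurrence of each event type.
        first_seen.setdefault(event["event_type"],
                              datetime.fromisoformat(event["created_at"]))
    return (first_seen[end_type] - first_seen[start_type]).total_seconds()


# Timestamps copied from the response above; payloads omitted.
events = [
    {"event_type": "alert_received", "created_at": "2025-01-15T10:30:00"},
    {"event_type": "incident_created", "created_at": "2025-01-15T10:30:00"},
    {"event_type": "diagnosis_started", "created_at": "2025-01-15T10:30:15"},
    {"event_type": "diagnosis_completed", "created_at": "2025-01-15T10:30:45"},
]

diagnosis_seconds = seconds_between(events, "diagnosis_started", "diagnosis_completed")
```

For this sample the diagnosis phase spans 30 seconds.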
GET /activity -- Recent Activity
Response:
{
"activity": [
{
"action": "restart_pod",
"incident_id": "inc-a1b2c3d4",
"outcome": "success",
"created_at": "2025-01-15T10:32:30"
}
],
"count": 1
}
Integrations
POST /webhook/pagerduty -- PagerDuty Webhook
Receive PagerDuty v3 webhook events for bidirectional incident sync.
curl -X POST http://localhost:8888/webhook/pagerduty \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"event": {
"event_type": "incident.triggered",
"data": {
"id": "PD12345",
"title": "Service degraded",
"urgency": "high",
"service": {"summary": "api-gateway"}
}
}
}'
Response:
{
"ok": true,
"event_type": "incident.triggered",
"pd_incident_id": "PD12345",
"local_incident_id": "inc-a1b2c3d4",
"action_taken": "created"
}
Actions
POST /actions/execute -- Execute Action
Execute a named safe action through the policy engine.
curl -X POST http://localhost:8888/actions/execute \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"action": "restart_pod",
"namespace": "production",
"pod_name": "payments-api-7d8f9c6b5-x2k4m",
"dry_run": false,
"approved": true
}'
Request Body:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| action | string | required | Action name: restart_pod, scale_deployment, suggest_config_fix |
| namespace | string | "" | Kubernetes namespace |
| pod_name | string | "" | Target pod name |
| deployment_name | string | "" | Target deployment name |
| replicas | int | 1 | Desired replica count |
| component | string | "" | Component name for config suggestions |
| issue | string | "" | Issue description for config suggestions |
| current_spec | string | "" | Current spec for config suggestions |
| dry_run | bool | true | Preview without executing |
| approved | bool | false | Explicit operator approval |
Response (dry run):
{
"status": "dry_run",
"action": "restart_pod",
"message": "Would delete pod payments-api-7d8f9c6b5-x2k4m in namespace production",
"namespace": "production",
"pod_name": "payments-api-7d8f9c6b5-x2k4m"
}
Response (executed):
{
"status": "success",
"action": "restart_pod",
"message": "Pod payments-api-7d8f9c6b5-x2k4m deleted in namespace production",
"namespace": "production",
"pod_name": "payments-api-7d8f9c6b5-x2k4m"
}
Scale deployment example:
curl -X POST http://localhost:8888/actions/execute \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"action": "scale_deployment",
"namespace": "production",
"deployment_name": "payments-api",
"replicas": 5,
"dry_run": false,
"approved": true
}'
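Because dry_run defaults to true and approved defaults to false, a cautious client can preview an action first and only then request execution. The sketch below builds both request bodies; `action_payload` is a hypothetical helper, and the two-step flow is a suggested usage pattern, not something the API enforces according to this reference.

```python
def action_payload(action, dry_run=True, approved=False, **fields):
    """Build the JSON body for POST /actions/execute. Hypothetical helper.

    Defaults mirror the Request Body table: dry_run=True, approved=False.
    """
    body = {"action": action, "dry_run": dry_run, "approved": approved}
    body.update(fields)
    return body


# Step 1: preview only (the documented defaults).
preview = action_payload("scale_deployment", namespace="production",
                         deployment_name="payments-api", replicas=5)

# Step 2: after an operator has reviewed the preview, request real execution.
execute = dict(preview, dry_run=False, approved=True)
```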
Automation
POST /webhook/deploy -- Ingest Deploy Event
Send this event from your CI/CD pipeline after every production deploy so that incidents can be correlated with recent deploys.
curl -X POST http://localhost:8888/webhook/deploy \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"service": "payments-api",
"namespace": "production",
"environment": "prod",
"image": "payments-api:v2.3.1",
"actor": "ci-bot",
"commit_sha": "abc123def456"
}'
Response:
{
"ok": true,
"deploy_id": "deploy-a1b2c3",
"stored": {
"deploy_id": "deploy-a1b2c3",
"service": "payments-api",
"namespace": "production",
"image": "payments-api:v2.3.1",
"actor": "ci-bot",
"timestamp": "2025-01-15T10:00:00Z"
}
}
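In a pipeline, the request body would typically be assembled from CI variables. The sketch below shows one possible shape: `deploy_event` is a hypothetical helper, and the variable names IMAGE_TAG, CI_ACTOR, and COMMIT_SHA are placeholders to be mapped to whatever your CI system exposes.

```python
def deploy_event(env, service, namespace, environment="prod"):
    """Build the POST /webhook/deploy body from CI variables.

    The env keys (IMAGE_TAG, CI_ACTOR, COMMIT_SHA) are hypothetical; map them to
    whatever your CI system actually exposes.
    """
    return {
        "service": service,
        "namespace": namespace,
        "environment": environment,
        "image": f"{service}:{env['IMAGE_TAG']}",
        "actor": env.get("CI_ACTOR", "ci-bot"),  # fall back to a generic bot name
        "commit_sha": env["COMMIT_SHA"],
    }


event = deploy_event(
    {"IMAGE_TAG": "v2.3.1", "COMMIT_SHA": "abc123def456"},
    service="payments-api", namespace="production",
)
```

In a real pipeline, `os.environ` could be passed as the `env` argument.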
GET /webhook/deploy/recent -- List Recent Deploys
GET /incidents/{incident_id}/deploy-correlation -- Deploy Correlation
curl "http://localhost:8888/incidents/inc-a1b2c3d4/deploy-correlation?lookback_minutes=30" \
-H "X-API-Key: your-key"
Response:
{
"incident_id": "inc-a1b2c3d4",
"has_correlation": true,
"correlated_deploy": {
"deploy_id": "deploy-a1b2c3",
"service": "payments-api",
"actor": "ci-bot",
"minutes_before_incident": 12
},
"lookback_minutes": 30,
"message": "Deploy by ci-bot (payments-api) was 12 min before this incident"
}
GET /runbooks -- List Runbooks
GET /runbooks/all -- List All Runbooks (Built-in + Generated)
POST /runbooks/generate -- Generate Runbook
Generate a runbook from historical incident patterns.
curl -X POST http://localhost:8888/runbooks/generate \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"service": "payments-api",
"pattern": "oomkilled",
"min_incidents": 2
}'
Response:
{
"pattern": "oomkilled",
"occurrences": 5,
"runbook": {
"id": "generated-oomkilled-payments-api",
"name": "OOMKilled Recovery for payments-api",
"steps": ["..."]
},
"source": "generated",
"message": "Generated runbook for 'oomkilled' pattern (seen 5x). Review and approve to save."
}
POST /runbooks/generate/approve -- Save Generated Runbook
curl -X POST http://localhost:8888/runbooks/generate/approve \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"runbook": { "id": "generated-oomkilled-payments-api", "name": "...", "steps": ["..."] }
}'
Response:
{
"ok": true,
"runbook_id": "generated-oomkilled-payments-api",
"saved_path": "config/runbooks/generated/generated-oomkilled-payments-api.yaml",
"message": "Runbook 'generated-oomkilled-payments-api' saved successfully."
}
GET /incidents/{incident_id}/runbooks -- Match Runbooks
POST /incidents/{incident_id}/runbooks/{runbook_id}/execute -- Execute Runbook
curl -X POST "http://localhost:8888/incidents/inc-a1b2c3d4/runbooks/oom-recovery/execute?auto_approve=false" \
-H "X-API-Key: your-key"
Query Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| auto_approve | bool | false | Auto-approve all steps without operator confirmation (use with caution) |
Intelligence
GET /forecasting/capacity -- Capacity Forecast
Response:
{
"generated_at": "2025-01-15T10:00:00Z",
"analysis_period_hours": 24,
"workspace": "Primary Workspace",
"total_forecasts": 2,
"critical_count": 1,
"high_count": 1,
"forecasts": [
{
"service": "payments-api",
"risk_level": "critical",
"signal": "5 OOMKilled incidents in 24h, increasing trend",
"recommendation": "Increase memory limits or add horizontal scaling"
}
],
"incident_rate_summary": {
"rate_per_hour": 2.5,
"total_incidents": 60,
"period_hours": 24
}
}
GET /forecasting/drift -- Operational Drift Detection
Response:
{
"generated_at": "2025-01-15T10:00:00Z",
"workspace": "Primary Workspace",
"drift_signals": [
{
"type": "incident_rate_spike",
"severity": "high",
"detail": "Incident rate spiked to 8/hr (24h avg: 2.50/hr)",
"recommendation": "Investigate recent deploys and check infra health."
}
],
"signal_count": 1,
"healthy": false,
"rates": {"last_1h": 8.0, "last_24h": 2.5}
}
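The rates object lets a client quantify a spike on its own. The sketch below divides the last-hour rate by the 24-hour average; `spike_ratio` is a hypothetical helper, and the threshold the server uses to raise incident_rate_spike is not documented here, so none is assumed.

```python
def spike_ratio(rates):
    """Ratio of the last-hour incident rate to the 24-hour average."""
    baseline = rates["last_24h"]
    if baseline == 0:
        # No baseline: any activity counts as an unbounded spike.
        return float("inf") if rates["last_1h"] > 0 else 0.0
    return rates["last_1h"] / baseline


rates = {"last_1h": 8.0, "last_24h": 2.5}  # from the response above
ratio = spike_ratio(rates)
```

For the sample response, the last hour ran at 3.2 times the 24-hour average.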
GET /patterns/incidents -- Incident Patterns
Response:
{
"generated_at": "2025-01-15T10:00:00Z",
"incidents_analyzed": 100,
"patterns_found": 3,
"min_occurrences": 3,
"patterns": [
{
"pattern": "oomkilled",
"occurrences": 8,
"services": ["payments-api", "auth-service"],
"recommendation": "Audit memory limits across affected services"
}
]
}
GET /patterns/services -- Service Failure Profiles
GET /patterns/services/{service_name} -- Single Service Profile
Response:
{
"service": "payments-api",
"total_incidents": 15,
"top_failure_patterns": ["oomkilled", "high_cpu"],
"most_common_severity": "critical",
"repeat_rate": 0.6,
"risk_assessment": "high"
}
On-Call
POST /incidents/{incident_id}/postmortem -- Generate Postmortem
GET /digest/handoff -- Shift Handoff Digest
curl "http://localhost:8888/digest/handoff?hours=12&workspace_id=ws-acme" \
-H "X-API-Key: your-key"
Query Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| hours | int | 12 | Lookback window (1-48) |
| workspace_id | string | null | Filter by workspace |
GET /incidents/noisy -- Noisy Incidents
curl "http://localhost:8888/incidents/noisy?max_duration_minutes=5&limit=20" \
-H "X-API-Key: your-key"
Response:
{
"noisy_incidents": [
{
"id": "inc-noise-123",
"title": "Brief CPU spike on cache-service",
"severity": "warning",
"state": "resolved",
"created_at": "2025-01-15T09:00:00",
"resolved_at": "2025-01-15T09:02:30"
}
],
"count": 1,
"max_duration_minutes": 5.0,
"note": "These incidents resolved with no actions -- review alert thresholds."
}
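To double-check why an incident counts as noisy, its lifetime can be recomputed from created_at and resolved_at. The sketch below does that for the sample incident; `duration_minutes` is a hypothetical helper, and comparing the result with max_duration_minutes is an assumption about how the filter is applied.

```python
from datetime import datetime


def duration_minutes(incident):
    """Minutes between created_at and resolved_at, or None if still open."""
    if not incident.get("resolved_at"):
        return None
    created = datetime.fromisoformat(incident["created_at"])
    resolved = datetime.fromisoformat(incident["resolved_at"])
    return (resolved - created).total_seconds() / 60


# Fields copied from the response above.
incident = {"id": "inc-noise-123",
            "created_at": "2025-01-15T09:00:00",
            "resolved_at": "2025-01-15T09:02:30"}
minutes = duration_minutes(incident)
```

The sample incident lasted 2.5 minutes, which falls under the max_duration_minutes value of 5 used in the request.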
Workflows
GET /workflows -- List Workflows
Response:
{
"workflows": [
{
"id": "oom_recovery",
"name": "OOMKilled Recovery Workflow",
"description": "Multi-step recovery for pods killed by OOM...",
"step_count": 5
},
{
"id": "crashloop_triage",
"name": "CrashLoopBackOff Triage",
"description": "...",
"step_count": 4
}
],
"count": 2
}
POST /workflows/{workflow_id}/execute -- Execute Workflow
curl -X POST http://localhost:8888/workflows/oom_recovery/execute \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"context": {
"namespace": "production",
"pod_name": "payments-api-7d8f9c6b5-x2k4m",
"deployment_name": "payments-api",
"service": "payments-api"
},
"dry_run": true
}'
GET /workflows/{run_id}/status -- Workflow Run Status
SLO
GET /slo -- List SLOs
Response:
{
"slos": [
{
"name": "payments-api-availability",
"service": "payments-api",
"target": 0.9995,
"target_percent": 99.95,
"window_days": 30,
"indicator": "availability"
},
{
"name": "search-api-latency-p99",
"service": "search-api",
"target": 0.999,
"target_percent": 99.9,
"window_days": 30,
"indicator": "latency",
"latency_threshold_ms": 200,
"latency_percentile": "p99"
}
],
"count": 2
}
GET /slo/{name} -- SLO Details
GET /slo/{name}/budget -- Error Budget Status
Response:
{
"slo_name": "payments-api-availability",
"target": 0.9995,
"window_days": 30,
"budget_total_minutes": 21.6,
"budget_consumed_minutes": 8.5,
"budget_remaining_minutes": 13.1,
"budget_remaining_percent": 60.6,
"burn_rate": 1.2,
"time_to_exhaustion_hours": 48.0,
"alert_level": "warning"
}
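The budget figures in the sample are consistent with the usual error-budget arithmetic: the allowed downtime is (1 - target) times the window length. The sketch below reproduces three of the fields under that assumption; `error_budget` is a hypothetical helper, and the formulas behind burn_rate and time_to_exhaustion_hours are not documented here, so they are left out.

```python
def error_budget(target, window_days, consumed_minutes):
    """Derive error-budget figures from an SLO target. Assumed formulas."""
    total = (1 - target) * window_days * 24 * 60  # allowed bad minutes in the window
    remaining = total - consumed_minutes
    return {
        "budget_total_minutes": round(total, 1),
        "budget_remaining_minutes": round(remaining, 1),
        "budget_remaining_percent": round(remaining / total * 100, 1),
    }


budget = error_budget(target=0.9995, window_days=30, consumed_minutes=8.5)
```

With a 99.95% target over 30 days, the window is 43,200 minutes, so the total budget is 21.6 minutes; 8.5 consumed leaves 13.1 minutes, or 60.6%, matching the sample response.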
Notifications
GET /notifications/routes -- Routing Rules
Response:
{
"default_rules": {
"critical": ["slack", "pagerduty", "email"],
"high": ["slack", "email"],
"warning": ["slack"],
"info": ["log"]
},
"service_overrides": {
"payments": {
"critical": ["slack", "pagerduty", "email", "teams"]
}
},
"registered_adapters": ["email", "pagerduty", "slack", "teams"]
}
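A client that wants to predict where an alert will go can resolve the channels from this response. The sketch below assumes that a service override replaces the default list for that severity and that severities without an override fall back to default_rules; the reference does not state the merge semantics, so treat this as one plausible reading. `channels_for` is a hypothetical helper.

```python
def channels_for(routes, service, severity):
    """Pick notification channels, preferring a service override if present.

    Assumes an override replaces the default list for that severity.
    """
    override = routes.get("service_overrides", {}).get(service, {})
    if severity in override:
        return override[severity]
    return routes["default_rules"].get(severity, [])


# Routing data copied from the response above.
routes = {
    "default_rules": {"critical": ["slack", "pagerduty", "email"],
                      "high": ["slack", "email"],
                      "warning": ["slack"],
                      "info": ["log"]},
    "service_overrides": {
        "payments": {"critical": ["slack", "pagerduty", "email", "teams"]},
    },
}

critical = channels_for(routes, "payments", "critical")
high = channels_for(routes, "payments", "high")
```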
GET /oncall/current -- Current On-Call
Response:
{
"service": "payments",
"oncall": "dave",
"team": "payments-team",
"escalation_contact": "eve",
"week_number": 3,
"rotation_index": 1
}
POST /notifications/test -- Test Notification
curl -X POST "http://localhost:8888/notifications/test?severity=critical&service=payments&message=Test%20alert" \
-H "X-API-Key: your-key"
GET /notifications/escalation -- Escalation Preview
curl "http://localhost:8888/notifications/escalation?service=payments&minutes_elapsed=45" \
-H "X-API-Key: your-key"
System
GET /health -- Health Check
No authentication required.
Response:
GET /metrics -- Prometheus Metrics
No authentication required. Returns metrics in Prometheus text exposition format.
GET /metrics/export -- Pilot Metrics Export
GET /autonomy/status -- Autonomy Status
Response:
{
"enabled": false,
"approval_required": true,
"default_dry_run": true,
"poll_interval_seconds": 30,
"max_incidents_per_cycle": 20,
"autonomous_actions": ["restart_pod"],
"allowed_namespaces": [],
"deployment_profile": "hybrid"
}
POST /autonomy/run -- Run Autonomy Cycle
Manually trigger one autonomous remediation cycle.
GET /platform/overview -- Platform Overview
POST /demo/seed -- Seed Demo Data
POST /bot/interact -- Bot Interaction
curl -X POST http://localhost:8888/bot/interact \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"message": "What is wrong with payments-api?",
"incident_id": "inc-a1b2c3d4",
"auto_apply": false
}'
GET /adapters -- List Adapters
Response:
{
"alert_sources": ["alertmanager", "datadog", "grafana", "newrelic", "opsgenie", "pagerduty", "prometheus", "webhook"],
"log_backends": ["datadog", "elastic", "loki", "mock"],
"action_providers": ["kubernetes"],
"action_names": ["restart_pod", "scale_deployment", "suggest_config_fix"],
"notification_channels": ["email", "pagerduty", "slack", "teams"]
}
GET /console -- Operator Console
Serves the built-in web UI. With the default base URL, open http://localhost:8888/console in a browser.
Error Responses
All error responses follow this format:
| Status | Meaning |
|---|---|
| 400 | Bad request (unsupported event, invalid payload) |
| 401 | Missing or invalid API key |
| 403 | Insufficient permissions (RBAC) or namespace not allowed |
| 404 | Resource not found |
| 422 | Validation error |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
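Client code often maps these status codes to a retry decision. The sketch below is one such mapping: the meanings are taken from the table, while treating only 429 and 500 as retryable is an assumption and not documented behavior. `classify` is a hypothetical helper.

```python
# Status meanings copied from the table above.
STATUS_MEANINGS = {
    400: "Bad request",
    401: "Missing or invalid API key",
    403: "Insufficient permissions or namespace not allowed",
    404: "Resource not found",
    422: "Validation error",
    429: "Rate limit exceeded",
    500: "Internal server error",
}

# Assumption: only rate limiting and server errors are worth retrying.
RETRYABLE = {429, 500}


def classify(status):
    """Return (meaning, should_retry) for an HTTP status code."""
    meaning = STATUS_MEANINGS.get(status, "Unexpected status")
    return meaning, status in RETRYABLE


meaning, retry = classify(429)
```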