Kubernetes Deployment¶
Deploy AI-SRE into a Kubernetes cluster using the included Helm chart. This guide covers Minikube (local development), Amazon EKS, and Google GKE.
Architecture¶
In Kubernetes, AI-SRE runs as a Deployment with a ServiceAccount that has ClusterRole permissions to inspect and remediate workloads. The Helm chart provisions all required resources.
graph TB
subgraph Cluster["Kubernetes Cluster"]
subgraph NS["ai-sre namespace"]
DEP[Deployment<br/>ai-sre]
SVC[Service<br/>:8888]
SA[ServiceAccount]
PVC[(PVC<br/>data)]
CM[ConfigMap]
SEC[Secret]
HPA[HPA]
end
subgraph Target["Target Namespaces"]
POD1[production pods]
POD2[staging pods]
end
CR[ClusterRole<br/>pods, deployments]
CRB[ClusterRoleBinding]
DEP --> SVC
DEP --> PVC
DEP --> CM
DEP --> SEC
SA --> CR
CR --> CRB
HPA --> DEP
DEP -.->|remediate| Target
end
ING[Ingress Controller] --> SVC
MON[Alert Sources] --> ING
Helm Chart Structure¶
deploy/helm/ai-sre/
├── Chart.yaml
├── values.yaml
├── values-minikube.yaml
└── templates/
├── deployment.yaml
├── service.yaml
├── configmap.yaml
├── secret.yaml
├── serviceaccount.yaml
├── clusterrole.yaml
├── clusterrolebinding.yaml
├── ingress.yaml
├── hpa.yaml
├── pvc.yaml
└── NOTES.txt
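To inspect what the chart will create before installing it, you can render the templates locally. This is a sketch using the chart path and values file shown above:

```shell
# Render the manifests without touching the cluster
helm template ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --values deploy/helm/ai-sre/values-minikube.yaml
```

Piping the output through `kubectl apply --dry-run=client -f -` additionally validates the rendered resources against your cluster's API.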
Minikube (Local Development)¶
Prerequisites¶
- minikube installed
- Docker
- kubectl configured for the cluster
- Helm 3.12+
- make
One-Command Deploy¶
# Start minikube if needed
minikube start --cpus=4 --memory=4096
# Build image and deploy via Helm
make minikube-deploy
This command:
- Builds the Docker image inside minikube's Docker daemon (make minikube-build)
- Creates the ai-sre namespace
- Applies CRDs for the Kubernetes Operator
- Installs the Helm chart with values-minikube.yaml
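For reference, the Make target corresponds roughly to the manual steps below. This is a sketch; the image tag ai-sre:latest is an assumption, so check the Makefile for the actual tag:

```shell
# Point the local Docker client at minikube's daemon and build there,
# so the cluster can pull the image without a registry
eval $(minikube docker-env)
docker build -t ai-sre:latest .

# Create the namespace and install the chart with minikube overrides
kubectl create namespace ai-sre
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --values deploy/helm/ai-sre/values-minikube.yaml
```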
Access the Service¶
# Get the service URL
make minikube-url
# Or use port-forwarding
kubectl port-forward -n ai-sre svc/ai-sre 8888:8888
# View logs
make minikube-logs
Configure LLM API Key¶
# Create a secret with your API key
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-your-key-here
# Upgrade the Helm release to use the external secret
helm upgrade ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets
Clean Up¶
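To tear down the local install (assuming the release name and namespace used above):

```shell
# Remove the Helm release and its namespace
helm uninstall ai-sre --namespace ai-sre
kubectl delete namespace ai-sre

# Optionally delete the entire minikube cluster
minikube delete
```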
Amazon EKS¶
Prerequisites¶
- EKS cluster running Kubernetes 1.27+
- aws CLI configured with cluster access
- kubectl configured for the cluster
- Helm 3.12+
Install¶
# Update kubeconfig
aws eks update-kubeconfig --name my-cluster --region us-east-1
# Create namespace
kubectl create namespace ai-sre
# Create secrets
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-... \
--from-literal=SLACK_BOT_TOKEN=xoxb-... \
--from-literal=SLACK_APP_TOKEN=xapp-...
# Install with EKS-specific values
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets \
--set config.WORKSPACE_ID=my-team \
--set config.WORKSPACE_NAME="My Team" \
--set config.DATABASE_URL="postgresql+asyncpg://user:pass@rds-host:5432/ai_sre" \
--set config.LOG_PROVIDER=loki \
--set config.LOKI_URL="https://loki.internal.example.com" \
--set ingress.enabled=true \
--set ingress.className=alb \
--set ingress.annotations."alb\.ingress\.kubernetes\.io/scheme"=internet-facing \
--set ingress.annotations."alb\.ingress\.kubernetes\.io/target-type"=ip \
--set ingress.hosts[0].host=ai-sre.example.com \
--set ingress.hosts[0].paths[0].path=/ \
--set ingress.hosts[0].paths[0].pathType=Prefix
EKS with AWS Load Balancer Controller¶
If using the AWS Load Balancer Controller for Ingress:
# values-eks.yaml
ingress:
enabled: true
className: alb
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789:certificate/abc-123
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
alb.ingress.kubernetes.io/ssl-redirect: "443"
hosts:
- host: ai-sre.example.com
paths:
- path: /
pathType: Prefix
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--values values-eks.yaml \
--set existingSecret=ai-sre-secrets
EKS with RDS PostgreSQL¶
For production, use Amazon RDS instead of SQLite:
# Set DATABASE_URL to your RDS instance
--set config.DATABASE_URL="postgresql+asyncpg://ai_sre:password@mydb.abc123.us-east-1.rds.amazonaws.com:5432/ai_sre"
Google GKE¶
Prerequisites¶
- GKE cluster running Kubernetes 1.27+
- gcloud CLI configured
- kubectl configured for the cluster
- Helm 3.12+
Install¶
# Get cluster credentials
gcloud container clusters get-credentials my-cluster --zone us-central1-a
# Create namespace
kubectl create namespace ai-sre
# Create secrets
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-...
# Install with GKE-specific values
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets \
--set config.DATABASE_URL="postgresql+asyncpg://user:pass@cloud-sql-host:5432/ai_sre" \
--set ingress.enabled=true \
--set ingress.className=gce \
--set ingress.annotations."kubernetes\.io/ingress\.global-static-ip-name"=ai-sre-ip \
--set ingress.annotations."networking\.gke\.io/managed-certificates"=ai-sre-cert \
--set ingress.hosts[0].host=ai-sre.example.com \
--set ingress.hosts[0].paths[0].path=/ \
--set ingress.hosts[0].paths[0].pathType=Prefix
GKE with Cloud SQL¶
Use the Cloud SQL Auth Proxy sidecar for secure database access:
# values-gke.yaml
config:
DATABASE_URL: "postgresql+asyncpg://ai_sre:password@127.0.0.1:5432/ai_sre"
# Add Cloud SQL proxy as a sidecar (customize the deployment template)
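A sidecar along these lines could be added to the Deployment's pod spec. This is a sketch: the proxy image tag and the instance connection name are placeholders you must replace with your own.

```yaml
# Extra container for the ai-sre Deployment pod spec
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
  args:
    - "--port=5432"
    # Instance connection name (placeholder): project:region:instance
    - "my-project:us-central1:my-instance"
  securityContext:
    runAsNonRoot: true
```

With the proxy listening on 127.0.0.1:5432, the DATABASE_URL above connects to it instead of a public database endpoint.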
Helm Values Reference¶
Replicas and Autoscaling¶
replicaCount: 2
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
SQLite and replicas
SQLite supports only a single writer. When using SQLite, set replicaCount: 1 and autoscaling.enabled: false. Switch to PostgreSQL for multi-replica deployments.
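For a SQLite-backed install, the corresponding values would look like this (a sketch using the keys shown in this reference):

```yaml
# Single-writer SQLite: pin to one replica and disable the HPA
replicaCount: 1
autoscaling:
  enabled: false
persistence:
  enabled: true  # SQLite data lives on the PVC
```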
Resources¶
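The chart exposes standard container resource requests and limits. The figures below are illustrative placeholders, not tuned recommendations; size them for your own workload:

```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```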
Ingress¶
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: ai-sre.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: ai-sre-tls
hosts:
- ai-sre.example.com
Persistence (SQLite)¶
persistence:
enabled: true
storageClass: "" # Use cluster default
accessModes:
- ReadWriteOnce
size: 5Gi
RBAC¶
The chart creates a ClusterRole with the minimum permissions needed for remediation actions:
rbac:
create: true
rules:
- apiGroups: [""]
resources: ["pods", "pods/log", "services", "events", "namespaces"]
verbs: ["get", "list", "watch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets", "statefulsets"]
verbs: ["get", "list", "watch", "patch", "update"]
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "watch", "patch", "update"]
Set rbac.create: false if you manage RBAC externally (e.g., with OPA/Gatekeeper).
Security Context¶
podSecurityContext:
fsGroup: 1000
securityContext:
runAsNonRoot: true
runAsUser: 1000
readOnlyRootFilesystem: false
allowPrivilegeEscalation: false
Using an Existing Secret¶
If you manage secrets with Vault, Sealed Secrets, or External Secrets Operator:
# Create the secret with expected keys
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-... \
--from-literal=SLACK_BOT_TOKEN=xoxb-... \
--from-literal=OPENAI_API_KEY=sk-...
# Reference in Helm install
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets
Operations¶
Upgrade¶
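A typical upgrade reuses the values from the previous release (assuming the release name from the install steps):

```shell
# Apply chart or configuration changes in place
helm upgrade ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --reuse-values
```

Add further `--set` flags alongside `--reuse-values` to change individual settings without restating the full configuration.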
Rollback¶
# List revisions
helm history ai-sre --namespace ai-sre
# Rollback to previous revision
helm rollback ai-sre --namespace ai-sre
Uninstall¶
helm uninstall ai-sre --namespace ai-sre
# PVC is not deleted by default. To remove data:
kubectl delete pvc -l app.kubernetes.io/name=ai-sre -n ai-sre
Database Migrations¶
Run migrations inside the pod:
kubectl exec -n ai-sre deploy/ai-sre -- alembic upgrade head
Or as a Job:
kubectl create job --from=deployment/ai-sre ai-sre-migrate \
--namespace ai-sre \
-- alembic upgrade head
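To confirm the migration Job finished (assuming the Job name created above):

```shell
# Block until the Job completes, then inspect its output
kubectl wait --for=condition=complete job/ai-sre-migrate \
  --namespace ai-sre --timeout=120s
kubectl logs job/ai-sre-migrate --namespace ai-sre
```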
Monitoring¶
Prometheus ServiceMonitor¶
If you use the Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ai-sre
namespace: ai-sre
spec:
selector:
matchLabels:
app.kubernetes.io/name: ai-sre
endpoints:
- port: http
path: /metrics
interval: 30s
Health Probes¶
The Helm chart configures liveness and readiness probes on GET /health:
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 5
periodSeconds: 10
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| Pod in CrashLoopBackOff | Missing API keys | Check secret has required keys: kubectl get secret ai-sre-secrets -n ai-sre -o yaml |
| Ingress not routing | Ingress class mismatch | Verify ingress.className matches your cluster's Ingress controller |
| K8s actions return 403 | RBAC insufficient | Check ClusterRole and ClusterRoleBinding: kubectl describe clusterrole ai-sre |
| PVC pending | No StorageClass | Set persistence.storageClass to a valid StorageClass in your cluster |
| HPA not scaling | Metrics server missing | Install metrics-server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml |