Kubernetes Deployment¶
Deploy AI-SRE into a Kubernetes cluster using the included Helm chart. This guide covers Minikube (local development), Amazon EKS, and Google GKE.
Architecture¶
In Kubernetes, AI-SRE runs as a Deployment with a ServiceAccount that has ClusterRole permissions to inspect and remediate workloads. The Helm chart provisions all required resources.
graph TB
subgraph Cluster["Kubernetes Cluster"]
subgraph NS["ai-sre namespace"]
DEP[Deployment<br/>ai-sre]
SVC[Service<br/>:8888]
SA[ServiceAccount]
PVC[(PVC<br/>data)]
CM[ConfigMap]
SEC[Secret]
HPA[HPA]
end
subgraph Target["Target Namespaces"]
POD1[production pods]
POD2[staging pods]
end
CR[ClusterRole<br/>pods, deployments]
CRB[ClusterRoleBinding]
DEP --> SVC
DEP --> PVC
DEP --> CM
DEP --> SEC
SA --> CR
CR --> CRB
HPA --> DEP
DEP -.->|remediate| Target
end
ING[Ingress Controller] --> SVC
MON[Alert Sources] --> ING
Helm Chart Structure¶
deploy/helm/ai-sre/
├── Chart.yaml
├── values.yaml
├── values-minikube.yaml
└── templates/
├── deployment.yaml
├── service.yaml
├── configmap.yaml
├── secret.yaml
├── serviceaccount.yaml
├── clusterrole.yaml
├── clusterrolebinding.yaml
├── ingress.yaml
├── hpa.yaml
├── pvc.yaml
└── NOTES.txt
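To inspect what the chart will create before installing it, you can render the templates locally. This is a sketch using the chart path and values file shown above:

```shell
# Render the manifests without touching the cluster
helm template ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --values deploy/helm/ai-sre/values-minikube.yaml
```

Piping the output through `kubectl apply --dry-run=client -f -` additionally validates the rendered resources against your cluster's API.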
Minikube (Local Development)¶
Prerequisites¶
- minikube installed
- Docker
- kubectl configured for the cluster
- Helm 3.12+
- make
One-Command Deploy¶
# Start minikube if needed
minikube start --cpus=4 --memory=4096
# Build image and deploy via Helm
make minikube-deploy
This command:
- Builds the Docker image inside minikube's Docker daemon (make minikube-build)
- Creates the ai-sre namespace
- Applies CRDs for the Kubernetes Operator
- Installs the Helm chart with values-minikube.yaml
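For reference, the Make target corresponds roughly to the manual steps below. This is a sketch; the image tag ai-sre:latest is an assumption, so check the Makefile for the actual tag:

```shell
# Point the local Docker client at minikube's daemon and build there,
# so the cluster can pull the image without a registry
eval $(minikube docker-env)
docker build -t ai-sre:latest .

# Create the namespace and install the chart with minikube overrides
kubectl create namespace ai-sre
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --values deploy/helm/ai-sre/values-minikube.yaml
```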
Access the Service¶
# Get the service URL
make minikube-url
# Or use port-forwarding
kubectl port-forward -n ai-sre svc/ai-sre 8888:8888
# View logs
make minikube-logs
Configure LLM API Key¶
# Create a secret with your API key
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-your-key-here
# Upgrade the Helm release to use the external secret
helm upgrade ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets
Clean Up¶
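To tear down the local install (assuming the release name and namespace used above):

```shell
# Remove the Helm release and its namespace
helm uninstall ai-sre --namespace ai-sre
kubectl delete namespace ai-sre

# Optionally delete the entire minikube cluster
minikube delete
```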
Amazon EKS¶
Prerequisites¶
- EKS cluster running Kubernetes 1.27+
- aws CLI configured with cluster access
- kubectl configured for the cluster
- Helm 3.12+
Install¶
# Update kubeconfig
aws eks update-kubeconfig --name my-cluster --region us-east-1
# Create namespace
kubectl create namespace ai-sre
# Create secrets
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-... \
--from-literal=SLACK_BOT_TOKEN=xoxb-... \
--from-literal=SLACK_APP_TOKEN=xapp-...
# Install with EKS-specific values
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets \
--set config.WORKSPACE_ID=my-team \
--set config.WORKSPACE_NAME="My Team" \
--set config.DATABASE_URL="postgresql+asyncpg://user:pass@rds-host:5432/ai_sre" \
--set config.LOG_PROVIDER=loki \
--set config.LOKI_URL="https://loki.internal.example.com" \
--set ingress.enabled=true \
--set ingress.className=alb \
--set ingress.annotations."alb\.ingress\.kubernetes\.io/scheme"=internet-facing \
--set ingress.annotations."alb\.ingress\.kubernetes\.io/target-type"=ip \
--set ingress.hosts[0].host=ai-sre.example.com \
--set ingress.hosts[0].paths[0].path=/ \
--set ingress.hosts[0].paths[0].pathType=Prefix
EKS with AWS Load Balancer Controller¶
If using the AWS Load Balancer Controller for Ingress:
# values-eks.yaml
ingress:
enabled: true
className: alb
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789:certificate/abc-123
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
alb.ingress.kubernetes.io/ssl-redirect: "443"
hosts:
- host: ai-sre.example.com
paths:
- path: /
pathType: Prefix
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--values values-eks.yaml \
--set existingSecret=ai-sre-secrets
EKS with RDS PostgreSQL¶
For production, use Amazon RDS instead of SQLite:
# Set DATABASE_URL to your RDS instance
--set config.DATABASE_URL="postgresql+asyncpg://ai_sre:password@mydb.abc123.us-east-1.rds.amazonaws.com:5432/ai_sre"
Google GKE¶
Prerequisites¶
- GKE cluster running Kubernetes 1.27+
- gcloud CLI configured
- kubectl configured for the cluster
- Helm 3.12+
Install¶
# Get cluster credentials
gcloud container clusters get-credentials my-cluster --zone us-central1-a
# Create namespace
kubectl create namespace ai-sre
# Create secrets
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-...
# Install with GKE-specific values
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets \
--set config.DATABASE_URL="postgresql+asyncpg://user:pass@cloud-sql-host:5432/ai_sre" \
--set ingress.enabled=true \
--set ingress.className=gce \
--set ingress.annotations."kubernetes\.io/ingress\.global-static-ip-name"=ai-sre-ip \
--set ingress.annotations."networking\.gke\.io/managed-certificates"=ai-sre-cert \
--set ingress.hosts[0].host=ai-sre.example.com \
--set ingress.hosts[0].paths[0].path=/ \
--set ingress.hosts[0].paths[0].pathType=Prefix
GKE with Cloud SQL¶
Use the Cloud SQL Auth Proxy sidecar for secure database access:
# values-gke.yaml
config:
DATABASE_URL: "postgresql+asyncpg://ai_sre:password@127.0.0.1:5432/ai_sre"
# Add Cloud SQL proxy as a sidecar (customize the deployment template)
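A sidecar along these lines could be added to the Deployment's pod spec. This is a sketch: the proxy image tag and the instance connection name are placeholders you must replace with your own.

```yaml
# Extra container for the ai-sre Deployment pod spec
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
  args:
    - "--port=5432"
    # Instance connection name (placeholder): project:region:instance
    - "my-project:us-central1:my-instance"
  securityContext:
    runAsNonRoot: true
```

With the proxy listening on 127.0.0.1:5432, the DATABASE_URL above connects to it instead of a public database endpoint.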
Helm Values Reference¶
Replicas and Autoscaling¶
replicaCount: 2
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
SQLite and replicas
SQLite supports only a single writer. When using SQLite, set replicaCount: 1 and autoscaling.enabled: false. Switch to PostgreSQL for multi-replica deployments.
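For a SQLite-backed install, the corresponding values would look like this (a sketch using the keys shown in this reference):

```yaml
# Single-writer SQLite: pin to one replica and disable the HPA
replicaCount: 1
autoscaling:
  enabled: false
persistence:
  enabled: true  # SQLite data lives on the PVC
```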
Resources¶
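The chart exposes standard container resource requests and limits. The figures below are illustrative placeholders, not tuned recommendations; size them for your own workload:

```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```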
Ingress¶
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: ai-sre.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: ai-sre-tls
hosts:
- ai-sre.example.com
Persistence (SQLite)¶
persistence:
enabled: true
storageClass: "" # Use cluster default
accessModes:
- ReadWriteOnce
size: 5Gi
RBAC¶
The chart creates a ClusterRole with the minimum permissions needed for remediation actions:
rbac:
create: true
rules:
- apiGroups: [""]
resources: ["pods", "pods/log", "services", "events", "namespaces"]
verbs: ["get", "list", "watch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets", "statefulsets"]
verbs: ["get", "list", "watch", "patch", "update"]
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "watch", "patch", "update"]
Set rbac.create: false if you manage RBAC externally (e.g., with OPA/Gatekeeper).
Security Context¶
podSecurityContext:
fsGroup: 1000
securityContext:
runAsNonRoot: true
runAsUser: 1000
readOnlyRootFilesystem: false
allowPrivilegeEscalation: false
Using an Existing Secret¶
If you manage secrets with Vault, Sealed Secrets, or External Secrets Operator:
# Create the secret with expected keys
kubectl create secret generic ai-sre-secrets \
--namespace ai-sre \
--from-literal=ANTHROPIC_API_KEY=sk-ant-... \
--from-literal=SLACK_BOT_TOKEN=xoxb-... \
--from-literal=OPENAI_API_KEY=sk-...
# Reference in Helm install
helm install ai-sre deploy/helm/ai-sre \
--namespace ai-sre \
--set existingSecret=ai-sre-secrets
Operations¶
Upgrade¶
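A typical upgrade reuses the values from the previous release (assuming the release name from the install steps):

```shell
# Apply chart or configuration changes in place
helm upgrade ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --reuse-values
```

Add further `--set` flags alongside `--reuse-values` to change individual settings without restating the full configuration.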
Rollback¶
# List revisions
helm history ai-sre --namespace ai-sre
# Rollback to previous revision
helm rollback ai-sre --namespace ai-sre
Uninstall¶
helm uninstall ai-sre --namespace ai-sre
# PVC is not deleted by default. To remove data:
kubectl delete pvc -l app.kubernetes.io/name=ai-sre -n ai-sre
Database Migrations¶
Run migrations inside the pod:
kubectl exec -n ai-sre deploy/ai-sre -- alembic upgrade head
Or as a Job:
kubectl create job --from=deployment/ai-sre ai-sre-migrate \
--namespace ai-sre \
-- alembic upgrade head
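To confirm the migration Job finished (assuming the Job name created above):

```shell
# Block until the Job completes, then inspect its output
kubectl wait --for=condition=complete job/ai-sre-migrate \
  --namespace ai-sre --timeout=120s
kubectl logs job/ai-sre-migrate --namespace ai-sre
```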
Monitoring¶
Prometheus ServiceMonitor¶
If you use the Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ai-sre
namespace: ai-sre
spec:
selector:
matchLabels:
app.kubernetes.io/name: ai-sre
endpoints:
- port: http
path: /metrics
interval: 30s
Health Probes¶
The Helm chart configures liveness and readiness probes on GET /health:
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 5
periodSeconds: 10
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| Pod in CrashLoopBackOff | Missing API keys | Check secret has required keys: kubectl get secret ai-sre-secrets -n ai-sre -o yaml |
| Ingress not routing | Ingress class mismatch | Verify ingress.className matches your cluster's Ingress controller |
| K8s actions return 403 | RBAC insufficient | Check ClusterRole and ClusterRoleBinding: kubectl describe clusterrole ai-sre |
| PVC pending | No StorageClass | Set persistence.storageClass to a valid StorageClass in your cluster |
| HPA not scaling | Metrics server missing | Install metrics-server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml |