Kubernetes Deployment

Deploy AI-SRE into a Kubernetes cluster using the included Helm chart. This guide covers Minikube (local development), Amazon EKS, and Google GKE.


Architecture

In Kubernetes, AI-SRE runs as a Deployment with a ServiceAccount that has ClusterRole permissions to inspect and remediate workloads. The Helm chart provisions all required resources.

graph TB
    subgraph Cluster["Kubernetes Cluster"]
        subgraph NS["ai-sre namespace"]
            DEP[Deployment<br/>ai-sre]
            SVC[Service<br/>:8888]
            SA[ServiceAccount]
            PVC[(PVC<br/>data)]
            CM[ConfigMap]
            SEC[Secret]
            HPA[HPA]
        end

        subgraph Target["Target Namespaces"]
            POD1[production pods]
            POD2[staging pods]
        end

        CR[ClusterRole<br/>pods, deployments]
        CRB[ClusterRoleBinding]

        DEP --> SVC
        DEP --> PVC
        DEP --> CM
        DEP --> SEC
        SA --> CR
        CR --> CRB
        HPA --> DEP
        DEP -.->|remediate| Target
    end

    ING[Ingress Controller] --> SVC
    MON[Alert Sources] --> ING

Helm Chart Structure

deploy/helm/ai-sre/
├── Chart.yaml
├── values.yaml
├── values-minikube.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    ├── secret.yaml
    ├── serviceaccount.yaml
    ├── clusterrole.yaml
    ├── clusterrolebinding.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    ├── pvc.yaml
    └── NOTES.txt

Minikube (Local Development)

Prerequisites

minikube version   # v1.30+ recommended
helm version       # v3.12+ recommended
kubectl version --client   # client only; the cluster may not be running yet
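The version floors above can also be checked from a script. The following is a minimal sketch rather than anything shipped with AI-SRE: the helper names `have` and `version_ge` are made up for illustration, and it assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD).

```shell
#!/bin/sh
# Hypothetical helpers for checking the prerequisites listed above.

# have CMD: succeed if CMD is on PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# version_ge ACTUAL MINIMUM: succeed if ACTUAL >= MINIMUM.
# Both arguments are plain dotted versions without a leading "v".
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report which of the required CLIs are present (does not exit on failure).
for tool in minikube helm kubectl; do
  if have "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

For example, `version_ge 3.14.2 3.12.0` succeeds while `version_ge 3.9.0 3.12.0` fails. A plain lexical comparison would get the second case wrong, which is why the sketch relies on version sort.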

One-Command Deploy

# Start minikube if needed
minikube start --cpus=4 --memory=4096

# Build image and deploy via Helm
make minikube-deploy

This command:

  1. Builds the Docker image inside minikube's Docker daemon (make minikube-build)
  2. Creates the ai-sre namespace
  3. Applies CRDs for the Kubernetes Operator
  4. Installs the Helm chart with values-minikube.yaml

Access the Service

# Get the service URL
make minikube-url

# Or use port-forwarding
kubectl port-forward -n ai-sre svc/ai-sre 8888:8888

# View logs
make minikube-logs

Configure LLM API Key

# Create a secret with your API key
kubectl create secret generic ai-sre-secrets \
  --namespace ai-sre \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-your-key-here

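# Upgrade the Helm release to use the external secret.
# --reuse-values keeps the settings applied at install time (values-minikube.yaml).
helm upgrade ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --reuse-values \
  --set existingSecret=ai-sre-secrets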

Clean Up

make minikube-delete

Amazon EKS

Prerequisites

  • EKS cluster running Kubernetes 1.27+
  • aws CLI configured with cluster access
  • kubectl configured for the cluster
  • Helm 3.12+

Install

# Update kubeconfig
aws eks update-kubeconfig --name my-cluster --region us-east-1

# Create namespace
kubectl create namespace ai-sre

# Create secrets
kubectl create secret generic ai-sre-secrets \
  --namespace ai-sre \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
  --from-literal=SLACK_BOT_TOKEN=xoxb-... \
  --from-literal=SLACK_APP_TOKEN=xapp-...

# Install with EKS-specific values
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --set existingSecret=ai-sre-secrets \
  --set config.WORKSPACE_ID=my-team \
  --set config.WORKSPACE_NAME="My Team" \
  --set config.DATABASE_URL="postgresql+asyncpg://user:pass@rds-host:5432/ai_sre" \
  --set config.LOG_PROVIDER=loki \
  --set config.LOKI_URL="https://loki.internal.example.com" \
  --set ingress.enabled=true \
  --set ingress.className=alb \
  --set ingress.annotations."alb\.ingress\.kubernetes\.io/scheme"=internet-facing \
  --set ingress.annotations."alb\.ingress\.kubernetes\.io/target-type"=ip \
  --set "ingress.hosts[0].host=ai-sre.example.com" \
  --set "ingress.hosts[0].paths[0].path=/" \
  --set "ingress.hosts[0].paths[0].pathType=Prefix"

EKS with AWS Load Balancer Controller

If using the AWS Load Balancer Controller for Ingress:

# values-eks.yaml
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abc-123
    # ssl-redirect only acts on an HTTP listener, so listen on 80 as well as 443
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
  hosts:
    - host: ai-sre.example.com
      paths:
        - path: /
          pathType: Prefix

Install the chart with this values file:

helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --values values-eks.yaml \
  --set existingSecret=ai-sre-secrets

EKS with RDS PostgreSQL

For production, use Amazon RDS instead of SQLite:

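# Set DATABASE_URL to your RDS instance (the same flag works with helm install)
helm upgrade ai-sre deploy/helm/ai-sre --namespace ai-sre --reuse-values \
  --set config.DATABASE_URL="postgresql+asyncpg://ai_sre:password@mydb.abc123.us-east-1.rds.amazonaws.com:5432/ai_sre"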
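Database passwords that contain characters such as `@`, `:`, or `/` need to be percent-encoded before they are placed in the URL, otherwise the URL is parsed incorrectly. Below is a minimal sketch of building the value in POSIX shell. The `urlencode` helper and the `DB_*` variable names are illustrative and not part of AI-SRE, and the helper assumes ASCII input.

```shell
#!/bin/sh
# urlencode STRING: percent-encode every character outside the RFC 3986
# unreserved set (letters, digits, ".", "~", "_", "-"). ASCII input only.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character
    case $c in
      [A-Za-z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# Hypothetical connection details; replace with your own.
DB_USER="ai_sre"
DB_PASSWORD='p@ss:w/rd'
DB_HOST="mydb.abc123.us-east-1.rds.amazonaws.com"

DATABASE_URL="postgresql+asyncpg://${DB_USER}:$(urlencode "$DB_PASSWORD")@${DB_HOST}:5432/ai_sre"
echo "$DATABASE_URL"
```

The resulting string can then be passed as `--set config.DATABASE_URL="$DATABASE_URL"`. Because commas are encoded as well, the value does not collide with Helm's own `--set` list syntax.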

Google GKE

Prerequisites

  • GKE cluster running Kubernetes 1.27+
  • gcloud CLI configured
  • kubectl configured for the cluster
  • Helm 3.12+

Install

# Get cluster credentials
gcloud container clusters get-credentials my-cluster --zone us-central1-a

# Create namespace
kubectl create namespace ai-sre

# Create secrets
kubectl create secret generic ai-sre-secrets \
  --namespace ai-sre \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-...

# Install with GKE-specific values
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --set existingSecret=ai-sre-secrets \
  --set config.DATABASE_URL="postgresql+asyncpg://user:pass@cloud-sql-host:5432/ai_sre" \
  --set ingress.enabled=true \
  --set ingress.className=gce \
  --set ingress.annotations."kubernetes\.io/ingress\.global-static-ip-name"=ai-sre-ip \
  --set ingress.annotations."networking\.gke\.io/managed-certificates"=ai-sre-cert \
  --set "ingress.hosts[0].host=ai-sre.example.com" \
  --set "ingress.hosts[0].paths[0].path=/" \
  --set "ingress.hosts[0].paths[0].pathType=Prefix"
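The two annotations in the command above refer to a reserved global static IP named `ai-sre-ip` and a ManagedCertificate named `ai-sre-cert`. Nothing in this guide indicates that the chart creates either one, so the sketch below assumes you create them yourself before installing. The file name `managed-cert.yaml` and the function name are illustrative.

```shell
#!/bin/sh
# Write a ManagedCertificate manifest whose name matches the annotation value.
# It must live in the same namespace as the Ingress.
cat > managed-cert.yaml <<'EOF'
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: ai-sre-cert
  namespace: ai-sre
spec:
  domains:
    - ai-sre.example.com
EOF

# create_gke_ingress_prereqs: reserve the global IP and apply the certificate.
# Defined as a function so nothing touches the project or cluster until called.
create_gke_ingress_prereqs() {
  gcloud compute addresses create ai-sre-ip --global &&
  kubectl apply -f managed-cert.yaml
}
```

After calling `create_gke_ingress_prereqs`, point the DNS record for `ai-sre.example.com` at the reserved address; the certificate is only provisioned once the domain resolves to the load balancer.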

GKE with Cloud SQL

Use the Cloud SQL Auth Proxy sidecar for secure database access:

# values-gke.yaml
config:
  DATABASE_URL: "postgresql+asyncpg://ai_sre:password@127.0.0.1:5432/ai_sre"

# Add Cloud SQL proxy as a sidecar (customize the deployment template)
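One way to try the sidecar without editing the chart is a strategic merge patch against the running Deployment. This is a sketch under several assumptions: the Deployment is named `ai-sre`, the image tag is only an example pin, the instance connection name `my-project:us-central1:my-instance` is a placeholder, and the pod already has Google credentials (for example through Workload Identity). A later `helm upgrade` would overwrite the patch, so for anything permanent the container belongs in the deployment template.

```shell
#!/bin/sh
# Strategic merge patch that adds a Cloud SQL Auth Proxy container.
# Containers are merged by name, so the existing ai-sre container is untouched.
cat > cloudsql-sidecar-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: cloud-sql-proxy
          # Example tag (assumption); pin whichever release you have vetted.
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
          args:
            - "--port=5432"
            # Placeholder instance connection name: PROJECT:REGION:INSTANCE
            - "my-project:us-central1:my-instance"
          securityContext:
            runAsNonRoot: true
EOF

# apply_cloudsql_sidecar: patch the Deployment (only runs when called).
apply_cloudsql_sidecar() {
  kubectl patch deployment ai-sre --namespace ai-sre \
    --patch-file cloudsql-sidecar-patch.yaml
}
```

With the proxy listening on port 5432 inside the pod, the `127.0.0.1` address in `DATABASE_URL` above reaches Cloud SQL through the sidecar.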

Helm Values Reference

Replicas and Autoscaling

replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
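To get a feel for how these targets translate into replica counts, the HorizontalPodAutoscaler's core formula is `desired = ceil(current * currentUtilization / targetUtilization)`, where utilization is measured against the container's resource requests. The sketch below reproduces only that arithmetic. It ignores the controller's tolerance band and stabilization windows, and the function name `hpa_desired` is made up for illustration.

```shell
#!/bin/sh
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Integer ceiling of current * currentUtil / targetUtil, clamped to [MIN, MAX].
hpa_desired() {
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}

# With the values above: 2 replicas averaging 90% CPU against a 70% target.
cpu=$(hpa_desired 2 90 70 2 10)
# Memory at 60% against an 80% target would not ask for more replicas.
mem=$(hpa_desired 2 60 80 2 10)

# When several metrics are configured, the largest proposal wins.
if [ "$cpu" -gt "$mem" ]; then echo "$cpu"; else echo "$mem"; fi   # prints 3
```

Here CPU proposes `ceil(2 * 90 / 70) = 3` replicas and memory proposes 2, so the autoscaler would scale to 3.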

SQLite and replicas

SQLite supports only a single writer. When using SQLite, set replicaCount: 1 and autoscaling.enabled: false. Switch to PostgreSQL for multi-replica deployments.
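A deploy script can enforce this rule before calling Helm. The guard below is a sketch only: the function name is hypothetical, and it assumes a SQLite `DATABASE_URL` starts with the `sqlite` scheme, which this guide does not spell out.

```shell
#!/bin/sh
# check_sqlite_replicas DATABASE_URL REPLICA_COUNT AUTOSCALING_ENABLED
# Succeeds unless a SQLite URL is combined with more than one replica
# or with autoscaling turned on.
check_sqlite_replicas() {
  case $1 in
    sqlite*) ;;          # SQLite: fall through to the checks below
    *) return 0 ;;       # any other database: no restriction
  esac
  if [ "$2" -gt 1 ] || [ "$3" = "true" ]; then
    echo "SQLite requires replicaCount=1 and autoscaling.enabled=false" >&2
    return 1
  fi
  return 0
}
```

For example, `check_sqlite_replicas "$DATABASE_URL" 2 true || exit 1` stops the script when the URL points at SQLite, and passes for a PostgreSQL URL.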

Resources

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

Ingress

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: ai-sre.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: ai-sre-tls
      hosts:
        - ai-sre.example.com

Persistence (SQLite)

persistence:
  enabled: true
  storageClass: ""       # Use cluster default
  accessModes:
    - ReadWriteOnce
  size: 5Gi

RBAC

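The chart creates a ClusterRole with the permissions used by remediation actions. Verbs apply to every resource listed in the same rule, so the first rule below also allows deleting services, events, and namespaces; split it into separate rules if remediation only needs to delete pods: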

rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "pods/log", "services", "events", "namespaces"]
      verbs: ["get", "list", "watch", "delete"]
    - apiGroups: ["apps"]
      resources: ["deployments", "replicasets", "statefulsets"]
      verbs: ["get", "list", "watch", "patch", "update"]
    - apiGroups: ["autoscaling"]
      resources: ["horizontalpodautoscalers"]
      verbs: ["get", "list", "watch", "patch", "update"]

Set rbac.create: false if you manage RBAC externally (e.g., with OPA/Gatekeeper).
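Whichever way the role is managed, the effective permissions can be checked by impersonating the ServiceAccount with `kubectl auth can-i`. The sketch below assumes the ServiceAccount is named `ai-sre` in the `ai-sre` namespace and that your own user is allowed to impersonate it; the function name `check_rbac` is illustrative.

```shell
#!/bin/sh
# check_rbac [SERVICE_ACCOUNT] [TARGET_NAMESPACE]
# Asks the API server whether the ServiceAccount may perform a sample of the
# actions granted by the rules above. Returns non-zero if any is denied.
check_rbac() {
  sa="system:serviceaccount:ai-sre:${1:-ai-sre}"
  ns="${2:-default}"
  failed=0
  while read -r verb resource; do
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="$sa" --namespace "$ns" </dev/null 2>/dev/null)
    if [ "$answer" = "yes" ]; then
      echo "ok      $verb $resource"
    else
      echo "denied  $verb $resource"
      failed=1
    fi
  done <<'EOF'
delete pods
list events
patch deployments.apps
patch statefulsets.apps
update horizontalpodautoscalers.autoscaling
EOF
  return "$failed"
}
```

Calling `check_rbac ai-sre production` prints one line per action, which is a quick way to narrow down the 403 errors described under Troubleshooting.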

Security Context

podSecurityContext:
  fsGroup: 1000

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: false
  allowPrivilegeEscalation: false

Using an Existing Secret

If you manage secrets with Vault, Sealed Secrets, or External Secrets Operator:

# Create the secret with expected keys
kubectl create secret generic ai-sre-secrets \
  --namespace ai-sre \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
  --from-literal=SLACK_BOT_TOKEN=xoxb-... \
  --from-literal=OPENAI_API_KEY=sk-...

# Reference in Helm install
helm install ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --set existingSecret=ai-sre-secrets
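When the Secret is produced by another system, it helps to confirm that the expected keys actually arrived before the pod starts. The helper below is a sketch that checks for key presence without printing any values; the function name is hypothetical, and which keys are required depends on your configuration.

```shell
#!/bin/sh
# missing_secret_keys SECRET NAMESPACE KEY...
# Prints each KEY that is absent or empty in the Secret. Values are never shown.
# Key names are assumed to contain only letters, digits, and underscores.
missing_secret_keys() {
  secret=$1
  namespace=$2
  shift 2
  for key in "$@"; do
    value=$(kubectl get secret "$secret" --namespace "$namespace" \
      -o "jsonpath={.data.$key}" 2>/dev/null)
    if [ -z "$value" ]; then
      echo "$key"
    fi
  done
}
```

For example, `missing_secret_keys ai-sre-secrets ai-sre ANTHROPIC_API_KEY SLACK_BOT_TOKEN` prints nothing when both keys are present.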

Operations

Upgrade

helm upgrade ai-sre deploy/helm/ai-sre \
  --namespace ai-sre \
  --reuse-values
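For unattended upgrades it can be useful to let Helm roll back by itself when the new revision does not become ready. The wrapper below is one possible sketch, not the project's own tooling: the function name and the timeouts are arbitrary choices, and it assumes the Deployment is named `ai-sre`.

```shell
#!/bin/sh
# upgrade_ai_sre [EXTRA_HELM_ARGS...]
# Upgrades the release, rolling back automatically on failure (--atomic),
# then confirms that the Deployment finished rolling out.
upgrade_ai_sre() {
  helm upgrade ai-sre deploy/helm/ai-sre \
    --namespace ai-sre \
    --reuse-values \
    --atomic \
    --timeout 5m \
    "$@" &&
  kubectl rollout status deployment/ai-sre --namespace ai-sre --timeout=120s
}
```

Extra flags pass straight through, for example `upgrade_ai_sre --set replicaCount=3`. Keep in mind that `--reuse-values` carries forward the previous release's values, so new defaults introduced by a newer chart version are not picked up automatically.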

Rollback

# List revisions
helm history ai-sre --namespace ai-sre

# Rollback to previous revision
helm rollback ai-sre --namespace ai-sre

Uninstall

helm uninstall ai-sre --namespace ai-sre

# PVC is not deleted by default. To remove data:
kubectl delete pvc -l app.kubernetes.io/name=ai-sre -n ai-sre

Database Migrations

Run migrations inside the pod:

kubectl exec -n ai-sre deployment/ai-sre -- alembic upgrade head

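Or as a one-off Job. kubectl create job --from accepts only a CronJob, so reuse the Deployment's image explicitly. A Job created this way does not inherit the Deployment's environment (such as DATABASE_URL), so use a Job manifest with the same ConfigMap and Secret references when the migration needs them:

IMAGE=$(kubectl get deployment ai-sre -n ai-sre \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
kubectl create job ai-sre-migrate \
  --namespace ai-sre \
  --image="$IMAGE" \
  -- alembic upgrade head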
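A migration Job needs the same configuration sources as the application, because a Job does not inherit anything from the Deployment. The manifest below is a sketch under stated assumptions: the chart's ConfigMap is assumed to be named `ai-sre`, the Secret is the `ai-sre-secrets` used throughout this guide, and `REPLACE_WITH_AI_SRE_IMAGE` is a placeholder you must fill in with the image your Deployment runs.

```shell
#!/bin/sh
# Write a Job manifest that runs the migration with the app's configuration.
cat > ai-sre-migrate-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: ai-sre-migrate
  namespace: ai-sre
spec:
  backoffLimit: 0                 # do not retry a failed migration automatically
  ttlSecondsAfterFinished: 3600   # clean up the finished Job after an hour
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: REPLACE_WITH_AI_SRE_IMAGE
          command: ["alembic", "upgrade", "head"]
          envFrom:
            - configMapRef:
                name: ai-sre          # assumed ConfigMap name
            - secretRef:
                name: ai-sre-secrets
EOF

# run_migration_job: apply the Job and wait for completion (only runs when called).
run_migration_job() {
  kubectl apply -f ai-sre-migrate-job.yaml &&
  kubectl wait --for=condition=complete job/ai-sre-migrate \
    --namespace ai-sre --timeout=300s
}
```

If the wait times out, `kubectl logs job/ai-sre-migrate -n ai-sre` shows the Alembic output.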

Monitoring

Prometheus ServiceMonitor

If you use the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-sre
  namespace: ai-sre
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ai-sre
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Health Probes

The Helm chart configures liveness and readiness probes on GET /health:

livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 5
  periodSeconds: 10
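The same endpoint can be polled by hand, for instance after a deploy while `kubectl port-forward` is running. The helper below is a minimal sketch; the function name, the retry count, and the delay are arbitrary, and it assumes `/health` answers with a 2xx status when the service is ready.

```shell
#!/bin/sh
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl reports success (HTTP status below 400) or attempts run out.
wait_for_health() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl --fail --silent --output /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

With the port-forward from the Minikube section active, `wait_for_health http://localhost:8888/health` returns as soon as the service responds.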

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Pod in CrashLoopBackOff | Missing API keys | Check secret has required keys: kubectl get secret ai-sre-secrets -n ai-sre -o yaml |
| Ingress not routing | Ingress class mismatch | Verify ingress.className matches your cluster's Ingress controller |
| K8s actions return 403 | RBAC insufficient | Check ClusterRole and ClusterRoleBinding: kubectl describe clusterrole ai-sre |
| PVC pending | No StorageClass | Set persistence.storageClass to a valid StorageClass in your cluster |
| HPA not scaling | Metrics server missing | Install metrics-server: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml |