Skip to content
Refalia Defani
Go back

Kube AI Agent: AI-Powered Kubernetes Monitoring & Auto-Remediation

Project Overview

Kube AI Agent is an autonomous bot that monitors Kubernetes clusters in real-time. When a pod fails, it automatically:

  1. Detects — Scans cluster for unhealthy pods (CrashLoopBackOff, OOMKilled, etc.)
  2. Diagnoses — Sends pod logs to an AI model for root cause analysis
  3. Remediates — Executes the recommended fix (restart, scale up, rollback)
  4. Notifies — Sends a detailed alert to Telegram with full context

No more 3AM wake-up calls to manually restart crashing pods.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    AKS CLUSTER                           │
│                                                         │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│   │  Pod A   │  │  Pod B   │  │  Pod C   │            │
│   │ (healthy)│  │ (CRASH!) │  │ (healthy)│            │
│   └──────────┘  └──────────┘  └──────────┘            │
└───────────────────────│─────────────────────────────────┘


              ┌──────────────────┐
              │  WATCHER         │  ← Real-time pod monitoring
              │  (watcher.py)    │     via K8s Python Client
              └────────┬─────────┘


              ┌──────────────────┐
              │  AI AGENT        │  ← LLM diagnosis via OpenRouter
              │  (app.py)        │     (Nvidia Nemotron 9B)
              └────────┬─────────┘


              ┌──────────────────┐
              │  REMEDIATION     │  ← Auto-fix: restart, scale,
              │  (remediation.py)│     rollback, force-delete
              └────────┬─────────┘


              ┌──────────────────┐
              │  NOTIFIER        │  ← Real-time Telegram alerts
              │  (notifier.py)   │     with full context
              └──────────────────┘

Key Features

Intelligent Failure Detection

Monitors all pods across all namespaces for critical states:

StatusMeaningAI Typical Action
CrashLoopBackOffContainer crash looprestart_pod
OOMKilledOut of memoryscale_up
ImagePullBackOffImage not founddo_nothing (manual fix)
ErrorContainer exit with errorrestart_pod or rollback

AI-Powered Root Cause Analysis

The agent sends pod logs + context to an LLM that returns structured diagnosis:

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-9b-v2:free",
    messages=[
        {"role": "system", "content": "You are a Kubernetes expert. Analyze this pod failure and recommend an action."},
        {"role": "user", "content": f"Pod: {pod_name}\nStatus: {status}\nLogs:\n{logs}"}
    ]
)
# Returns: {cause: "...", action: "restart_pod", reason: "..."}

Auto-Remediation Actions

# Restart (delete → controller recreates)
v1.delete_namespaced_pod(name=pod_name, namespace=namespace)

# Scale up (add replicas for overloaded services)
apps_v1.patch_namespaced_deployment_scale(
    name=deployment, namespace=namespace,
    body={"spec": {"replicas": 3}}
)

# Rollback (revert to previous stable version)
rollback_deployment(deployment_name, namespace)

Anti-Spam Cooldown

Prevents notification flooding when a pod keeps crashing:

COOLDOWN_SECONDS = 300  # 5 minutes between handling same pod

if pod_key in handled_pods:
    elapsed = time.time() - handled_pods[pod_key]
    if elapsed < COOLDOWN_SECONDS:
        return  # Skip — already handled recently

Real-time Telegram Notifications

⚠️ ALERT: Pod Bermasalah!

📦 Pod: crashy-app-86458cddc7-zgbfb
🏷 Namespace: ai-agent-test
❌ Status: CrashLoopBackOff
🔄 Restart Count: 5x

🤖 AI Analysis:
💡 Cause: Database connection refused
🔧 Action: restart_pod
📝 Reason: Likely transient connection issue

✅ Result: Pod restarted successfully

Deployment on AKS

The agent runs inside the cluster it monitors—24/7, with auto-restart:

RBAC (Least Privilege)

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-ai-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "patch", "update"]

Container Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-ai-agent
  namespace: kube-ai-agent
spec:
  replicas: 1
  template:
    spec:
      serviceAccountName: kube-ai-agent-sa
      containers:
        - name: agent
          image: acrrefaagentsg.azurecr.io/kube-ai-agent:latest
          envFrom:
            - secretRef:
                name: kube-ai-agent-secrets
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"

CI/CD: Build & Deploy

# Build image via ACR (no local Docker needed)
az acr build --registry acrrefaagentsg \
  --image kube-ai-agent:latest --file Dockerfile .

# Deploy to cluster
kubectl apply -f k8s/deployment.yaml

# Update after code changes
kubectl rollout restart deployment/kube-ai-agent -n kube-ai-agent

Infrastructure

ComponentDetails
AKS Clusteraks-refa (Indonesia Central)
Container RegistryAzure ACR (Southeast Asia)
AI ModelNvidia Nemotron Nano 9B (free via OpenRouter)
LanguagePython 3.12
NotificationsTelegram Bot API
AuthK8s ServiceAccount + RBAC

Results

Tech Stack


Share this post on:

Next Post
End-to-End DevSecOps Pipeline with Blue-Green Deployment on AKS