Project Overview
Kube AI Agent is an autonomous bot that monitors Kubernetes clusters in real-time. When a pod fails, it automatically:
- Detects — Scans cluster for unhealthy pods (CrashLoopBackOff, OOMKilled, etc.)
- Diagnoses — Sends pod logs to an AI model for root cause analysis
- Remediates — Executes the recommended fix (restart, scale up, rollback)
- Notifies — Sends a detailed alert to Telegram with full context
No more 3AM wake-up calls to manually restart crashing pods.
Architecture
┌─────────────────────────────────────────────────────────┐
│ AKS CLUSTER │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ (healthy)│ │ (CRASH!) │ │ (healthy)│ │
│ └──────────┘ └──────────┘ └──────────┘ │
└───────────────────────│─────────────────────────────────┘
│
▼
┌──────────────────┐
│ WATCHER │ ← Real-time pod monitoring
│ (watcher.py) │ via K8s Python Client
└────────┬─────────┘
│
▼
┌──────────────────┐
│ AI AGENT │ ← LLM diagnosis via OpenRouter
│ (app.py) │ (Nvidia Nemotron 9B)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ REMEDIATION │ ← Auto-fix: restart, scale,
│ (remediation.py)│ rollback, force-delete
└────────┬─────────┘
│
▼
┌──────────────────┐
│ NOTIFIER │ ← Real-time Telegram alerts
│ (notifier.py) │ with full context
└──────────────────┘
Key Features
Intelligent Failure Detection
Monitors all pods across all namespaces for critical states:
| Status | Meaning | AI Typical Action |
|---|---|---|
CrashLoopBackOff | Container crash loop | restart_pod |
OOMKilled | Out of memory | scale_up |
ImagePullBackOff | Image not found | do_nothing (manual fix) |
Error | Container exit with error | restart_pod or rollback |
AI-Powered Root Cause Analysis
The agent sends pod logs + context to an LLM that returns structured diagnosis:
response = client.chat.completions.create(
model="nvidia/nemotron-nano-9b-v2:free",
messages=[
{"role": "system", "content": "You are a Kubernetes expert. Analyze this pod failure and recommend an action."},
{"role": "user", "content": f"Pod: {pod_name}\nStatus: {status}\nLogs:\n{logs}"}
]
)
# Returns: {cause: "...", action: "restart_pod", reason: "..."}
Auto-Remediation Actions
# Restart (delete → controller recreates)
v1.delete_namespaced_pod(name=pod_name, namespace=namespace)
# Scale up (add replicas for overloaded services)
apps_v1.patch_namespaced_deployment_scale(
name=deployment, namespace=namespace,
body={"spec": {"replicas": 3}}
)
# Rollback (revert to previous stable version)
rollback_deployment(deployment_name, namespace)
Anti-Spam Cooldown
Prevents notification flooding when a pod keeps crashing:
COOLDOWN_SECONDS = 300 # 5 minutes between handling same pod
if pod_key in handled_pods:
elapsed = time.time() - handled_pods[pod_key]
if elapsed < COOLDOWN_SECONDS:
return # Skip — already handled recently
Real-time Telegram Notifications
⚠️ ALERT: Pod Bermasalah!
📦 Pod: crashy-app-86458cddc7-zgbfb
🏷 Namespace: ai-agent-test
❌ Status: CrashLoopBackOff
🔄 Restart Count: 5x
🤖 AI Analysis:
💡 Cause: Database connection refused
🔧 Action: restart_pod
📝 Reason: Likely transient connection issue
✅ Result: Pod restarted successfully
Deployment on AKS
The agent runs inside the cluster it monitors—24/7, with auto-restart:
RBAC (Least Privilege)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-ai-agent-role
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "deployments/scale"]
verbs: ["get", "list", "patch", "update"]
Container Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-ai-agent
namespace: kube-ai-agent
spec:
replicas: 1
template:
spec:
serviceAccountName: kube-ai-agent-sa
containers:
- name: agent
image: acrrefaagentsg.azurecr.io/kube-ai-agent:latest
envFrom:
- secretRef:
name: kube-ai-agent-secrets
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
CI/CD: Build & Deploy
# Build image via ACR (no local Docker needed)
az acr build --registry acrrefaagentsg \
--image kube-ai-agent:latest --file Dockerfile .
# Deploy to cluster
kubectl apply -f k8s/deployment.yaml
# Update after code changes
kubectl rollout restart deployment/kube-ai-agent -n kube-ai-agent
Infrastructure
| Component | Details |
|---|---|
| AKS Cluster | aks-refa (Indonesia Central) |
| Container Registry | Azure ACR (Southeast Asia) |
| AI Model | Nvidia Nemotron Nano 9B (free via OpenRouter) |
| Language | Python 3.12 |
| Notifications | Telegram Bot API |
| Auth | K8s ServiceAccount + RBAC |
Results
- 24/7 autonomous monitoring — no human intervention needed
- < 30 second response time from detection to remediation
- Zero cost for AI — using free-tier LLM via OpenRouter
- Anti-spam protection — 5-minute cooldown per pod prevents alert fatigue
- Least privilege security — agent only has permissions it needs
Tech Stack
- Language: Python 3.12
- AI: OpenRouter API (Nvidia Nemotron 9B)
- Kubernetes: Python Client, RBAC, ServiceAccount
- Cloud: Azure AKS, Azure Container Registry
- Notifications: Telegram Bot API
- Container: Docker, ACR Build
- Security: K8s Secrets, RBAC, .dockerignore