Agent Self-Healing
Self-healing pipeline from alert detection to autonomous repair.
Overview
The self-healing system detects infrastructure issues automatically and, where auto-healing is enabled, repairs them without human intervention. It uses Prometheus for monitoring and AI agents for diagnosis and repair.
Healing Pipeline
Prometheus Alert → Alertmanager Webhook → Agent Runtime → Responsible Agent → Heal()
Step 1: Detection
Prometheus monitors all managed resources and fires alerts when thresholds are breached:
# Example Prometheus alert rule
- alert: InstanceDown
  expr: up{job="instance-health"} == 0
  for: 2m
  labels:
    severity: critical
    resource_kind: Instance
  annotations:
    summary: "Instance {{ $labels.instance }} is down"
Step 2: Webhook
Alertmanager sends the alert to the Agent Runtime via webhook:
{
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "resource_kind": "Instance",
        "resource_id": "inst-abc123",
        "severity": "critical"
      }
    }
  ]
}
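As a rough illustration of how a receiver might read this payload, the sketch below extracts one issue record per firing alert. The `Issue` class and `parse_webhook` function are hypothetical names, not part of the Agent Runtime API, and skipping non-firing alerts is an assumption about the desired behavior.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for one alert; field names mirror the labels above."""
    alertname: str
    resource_kind: str
    resource_id: str
    severity: str


def parse_webhook(body: str) -> list[Issue]:
    """Return one Issue per firing alert in an Alertmanager webhook body."""
    payload = json.loads(body)
    issues = []
    for alert in payload.get("alerts", []):
        # Assumption: alerts that are not firing (e.g. resolved) need no healing.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        issues.append(Issue(
            alertname=labels.get("alertname", ""),
            resource_kind=labels.get("resource_kind", ""),
            resource_id=labels.get("resource_id", ""),
            severity=labels.get("severity", ""),
        ))
    return issues
```

Missing labels default to empty strings here; a real receiver might prefer to reject such alerts instead.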
Step 3: Dispatch
The Agent Runtime maps the resource_kind to the responsible agent and calls its Heal() method with the issue details.
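The dispatch step can be pictured as a lookup table keyed by resource_kind. The following is a minimal sketch under that assumption. `AgentRuntime`, `register`, `dispatch`, and `InstanceAgent` are illustrative names, and the lowercase `heal` stands in for the Heal() method described above.

```python
from typing import Protocol


class Agent(Protocol):
    """Anything with a heal() method can act as a responsible agent."""
    def heal(self, issue: dict) -> str: ...


class InstanceAgent:
    """Hypothetical agent for resource_kind "Instance"."""
    def heal(self, issue: dict) -> str:
        return f"instance-agent healing {issue['resource_id']}"


class AgentRuntime:
    """Maps each resource_kind to the agent responsible for it."""

    def __init__(self) -> None:
        self._agents: dict[str, Agent] = {}

    def register(self, resource_kind: str, agent: Agent) -> None:
        self._agents[resource_kind] = agent

    def dispatch(self, issue: dict) -> str:
        kind = issue["resource_kind"]
        agent = self._agents.get(kind)
        if agent is None:
            # No responsible agent: surface the gap rather than drop the alert.
            raise LookupError(f"no agent registered for resource_kind {kind!r}")
        return agent.heal(issue)
```

Raising on an unknown resource_kind is one possible design choice; the source does not say how unmapped kinds are handled.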
Step 4: Diagnosis
The agent uses the LLM to diagnose the issue by analyzing available data:
- Resource status and recent state changes
- Metrics history (CPU, memory, disk, network)
- Recent logs and events
- Provider-level information
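One way to hand these four data sources to an LLM is to collect them into a single structure and render it as prompt text. The sketch below assumes that approach. `DiagnosisContext`, `build_prompt`, and the prompt wording are all hypothetical, and no real LLM client is called.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosisContext:
    """Hypothetical container for the data listed above."""
    resource_id: str
    status: str
    state_changes: list[str] = field(default_factory=list)
    metrics: dict[str, list[float]] = field(default_factory=dict)
    logs: list[str] = field(default_factory=list)
    provider_info: dict[str, str] = field(default_factory=dict)


def build_prompt(ctx: DiagnosisContext, max_log_lines: int = 20) -> str:
    """Render the context as plain text, keeping only the newest log lines."""
    lines = [f"Diagnose resource {ctx.resource_id} (status: {ctx.status})."]
    lines.append("Recent state changes: " + ("; ".join(ctx.state_changes) or "none"))
    for name, samples in sorted(ctx.metrics.items()):
        lines.append(f"Metric {name}: " + ", ".join(f"{v:g}" for v in samples))
    lines.append("Recent logs:")
    # Logs are assumed to be ordered oldest first, so the tail is the newest.
    lines.extend(f"  {entry}" for entry in ctx.logs[-max_log_lines:])
    for key, value in sorted(ctx.provider_info.items()):
        lines.append(f"Provider {key}: {value}")
    return "\n".join(lines)
```

Capping the log lines keeps the prompt bounded when a failing resource produces a large volume of output.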
Step 5: Repair
Based on the diagnosis, the agent generates and executes a repair plan using its tools.
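A repair plan could be represented as an ordered list of tool invocations that the agent runs until one fails. The sketch below assumes that representation. The step format, the `execute_plan` function, and the tool names in the usage note are hypothetical.

```python
from typing import Callable

# Assumption: each tool is a callable that returns True on success.
Tool = Callable[..., bool]


def execute_plan(plan: list[dict], tools: dict[str, Tool]) -> tuple[bool, list[str]]:
    """Run plan steps in order; stop at the first unknown or failing tool."""
    log: list[str] = []
    for step in plan:
        tool = tools.get(step["tool"])
        if tool is None:
            log.append(f"{step['tool']}: unknown tool")
            return False, log
        ok = tool(**step.get("args", {}))
        log.append(f"{step['tool']}: {'ok' if ok else 'failed'}")
        if not ok:
            return False, log
    return True, log
```

For example, a plan such as `[{"tool": "restart_service", "args": {"name": "nginx"}}, {"tool": "check_health"}]` would run the restart first and only check health if the restart reported success.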
Issue Types
| Issue Type | Description | Typical Repair Actions |
|---|---|---|
| service_down | Service is unresponsive or crashed | Restart service, reboot instance, failover |
| disk_full | Disk usage exceeds threshold | Clean old logs, rotate files, expand volume |
| high_cpu | Sustained high CPU utilization | Identify runaway process, scale up, add instances |
| replication_lag | Database replica significantly behind | Restart replication, rebuild replica |
| connection_timeout | Cannot establish connection | Check security groups, restart networking, DNS check |
Healing behavior is configurable per resource kind and issue type:
{
  "resource_kind": "Instance",
  "issue_type": "service_down",
  "auto_heal": true,
  "max_retries": 3
}
When auto_heal is true, the agent attempts repair automatically up to max_retries times. After exhausting retries, the issue is escalated for human intervention.
When auto_heal is false, the agent generates a repair plan but waits for human approval before executing it.
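The two modes can be summarized in a few lines of control flow. This is a minimal sketch of the behavior described above, not the runtime's actual implementation. `handle_issue`, the `repair` callable, and the returned status strings are hypothetical.

```python
from typing import Callable


def handle_issue(config: dict, repair: Callable[[], bool]) -> str:
    """Apply the auto_heal and max_retries settings to one issue.

    `repair` is a hypothetical callable that runs the repair plan once
    and returns True if the resource recovered.
    """
    if not config["auto_heal"]:
        # The plan exists but is not executed until a human approves it.
        return "awaiting_approval"
    for _ in range(config["max_retries"]):
        if repair():
            return "healed"
    # Every attempt failed: hand the issue over to a human.
    return "escalated"
```

With the configuration shown above, a repair that succeeds on its second attempt returns "healed" after two calls, while a repair that never succeeds is tried three times and then returns "escalated".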
Healing Example
Alert: InstanceDown for inst-abc123
Agent: instance-agent
Diagnosis: SSH connection failed. Provider reports instance is running.
           Network interface may be misconfigured.
Plan:
  1. Attempt soft reboot via provider API
  2. Wait 60s for instance to come back
  3. Verify connectivity via health check
Result: Instance recovered after soft reboot. Total healing time: 90s.
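The three plan steps in this example map naturally onto a short function. The sketch below is illustrative only. `soft_reboot_and_verify` is a hypothetical name, and the `reboot` and `is_healthy` callables stand in for the provider API call and the health check, neither of which is specified here.

```python
import time
from typing import Callable


def soft_reboot_and_verify(
    reboot: Callable[[], None],
    is_healthy: Callable[[], bool],
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the example plan: reboot, wait, then verify connectivity."""
    reboot()              # 1. Attempt soft reboot via provider API
    sleep(wait_seconds)   # 2. Wait for the instance to come back
    return is_healthy()   # 3. Verify connectivity via health check
```

Passing `sleep` as a parameter lets the wait be replaced with a no-op in tests.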