API: Self-Healing
Overview
The self-healing system automatically detects and resolves infrastructure issues. Agents monitor resources via Prometheus alerts and take corrective action based on configured policies.
Endpoints
| Method | Path | Description |
|---|
| GET | /v1/healing/status | Current healing status across all resources |
| GET | /v1/healing/history | History of healing events |
| POST | /v1/healing/policies | Create a healing policy |
| POST | /v1/healing/trigger | Manually trigger healing for a resource |
## Get Healing Status
GET /v1/healing/status
{
"active_issues": 1,
"resolved_last_24h": 5,
"resources": [
{
"resource_id": "inst-abc123",
"issue_type": "high_cpu",
"detected_at": "2025-01-15T10:25:00Z",
"status": "healing",
"agent": "instance-agent"
}
]
}
Create Healing Policy
POST /v1/healing/policies
{
"resource_kind": "Instance",
"issue_type": "service_down",
"auto_heal": true,
"max_retries": 3
}
Manually Trigger Healing
POST /v1/healing/trigger
{
"resource_id": "inst-abc123",
"issue_type": "service_down"
}
HealingPolicy Fields
| Field | Type | Description |
|---|
resource_kind | string | Resource type (Instance, Database, RedisCluster, etc.) |
issue_type | string | Issue category (see below) |
auto_heal | bool | Automatically attempt to fix issues |
max_retries | int | Maximum healing attempts before escalation |
## Issue Types
| Issue Type | Description | Typical Action |
|---|
service_down | Service is unresponsive | Restart service or instance |
disk_full | Disk usage above threshold | Clean logs, expand storage |
high_cpu | CPU usage sustained above threshold | Scale up or optimize |
replication_lag | Database replica lag too high | Restart replication |
connection_timeout | Unable to connect to resource | Check networking, restart |