API: Self-Healing

REST API endpoints for self-healing status, history, and policy management.

API: Self-Healing

Overview

The self-healing system automatically detects and resolves infrastructure issues. Agents monitor resources via Prometheus alerts and take corrective action based on configured policies.

Endpoints

MethodPathDescription
GET/v1/healing/statusCurrent healing status across all resources
GET/v1/healing/historyHistory of healing events
POST/v1/healing/policiesCreate a healing policy
POST/v1/healing/triggerManually trigger healing for a resource
## Get Healing Status
GET /v1/healing/status
{
  "active_issues": 1,
  "resolved_last_24h": 5,
  "resources": [
    {
      "resource_id": "inst-abc123",
      "issue_type": "high_cpu",
      "detected_at": "2025-01-15T10:25:00Z",
      "status": "healing",
      "agent": "instance-agent"
    }
  ]
}

Create Healing Policy

POST /v1/healing/policies
{
  "resource_kind": "Instance",
  "issue_type": "service_down",
  "auto_heal": true,
  "max_retries": 3
}

Manually Trigger Healing

POST /v1/healing/trigger
{
  "resource_id": "inst-abc123",
  "issue_type": "service_down"
}

HealingPolicy Fields

FieldTypeDescription
resource_kindstringResource type (Instance, Database, RedisCluster, etc.)
issue_typestringIssue category (see below)
auto_healboolAutomatically attempt to fix issues
max_retriesintMaximum healing attempts before escalation
## Issue Types
Issue TypeDescriptionTypical Action
service_downService is unresponsiveRestart service or instance
disk_fullDisk usage above thresholdClean logs, expand storage
high_cpuCPU usage sustained above thresholdScale up or optimize
replication_lagDatabase replica lag too highRestart replication
connection_timeoutUnable to connect to resourceCheck networking, restart