# API: Self-Healing

## Overview

The self-healing system automatically detects and resolves infrastructure issues. Agents monitor resources via Prometheus alerts and take corrective action based on configured policies.

## Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | `/v1/healing/status` | Current healing status across all resources |
| GET | `/v1/healing/history` | History of healing events |
| POST | `/v1/healing/policies` | Create a healing policy |
| POST | `/v1/healing/trigger` | Manually trigger healing for a resource |

## Get Healing Status

`GET /v1/healing/status`

Example response:

{
  "active_issues": 1,
  "resolved_last_24h": 5,
  "resources": [
    {
      "resource_id": "inst-abc123",
      "issue_type": "high_cpu",
      "detected_at": "2025-01-15T10:25:00Z",
      "status": "healing",
      "agent": "instance-agent"
    }
  ]
}
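
As a rough illustration of consuming this response, the sketch below filters the `resources` array for entries whose `status` is `healing`. It works on an already-decoded payload and makes no network call; the helper name `resources_being_healed` is hypothetical, and only the fields shown in the sample response are assumed to exist.

```python
import json

# Sample payload copied from the example response above.
SAMPLE_STATUS = json.loads("""
{
  "active_issues": 1,
  "resolved_last_24h": 5,
  "resources": [
    {
      "resource_id": "inst-abc123",
      "issue_type": "high_cpu",
      "detected_at": "2025-01-15T10:25:00Z",
      "status": "healing",
      "agent": "instance-agent"
    }
  ]
}
""")


def resources_being_healed(status: dict) -> list[str]:
    """Return the IDs of resources whose status is "healing".

    Hypothetical helper: tolerates a missing "resources" key by
    treating it as an empty list.
    """
    return [
        resource["resource_id"]
        for resource in status.get("resources", [])
        if resource.get("status") == "healing"
    ]


if __name__ == "__main__":
    print(resources_being_healed(SAMPLE_STATUS))
```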

## Create Healing Policy

`POST /v1/healing/policies`

Example request body:

{
  "resource_kind": "Instance",
  "issue_type": "service_down",
  "auto_heal": true,
  "max_retries": 3
}
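
A minimal sketch of how a client might assemble this call with Python's standard library. The base URL `https://api.example.com` and the helper name `build_policy_request` are placeholders, not part of the documented API, and authentication is left out because this page does not describe it. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the real API host.
BASE_URL = "https://api.example.com"


def build_policy_request(policy: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/healing/policies."""
    return urllib.request.Request(
        url=f"{base_url}/v1/healing/policies",
        data=json.dumps(policy).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Policy body taken from the example above.
policy = {
    "resource_kind": "Instance",
    "issue_type": "service_down",
    "auto_heal": True,
    "max_retries": 3,
}
request = build_policy_request(policy)

# To actually send it (assumes the host is reachable and needs no auth):
#     with urllib.request.urlopen(request) as response:
#         print(response.status)
```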

## Manually Trigger Healing

`POST /v1/healing/trigger`

Example request body:

{
  "resource_id": "inst-abc123",
  "issue_type": "service_down"
}

## HealingPolicy Fields

| Field | Type | Description |
| --- | --- | --- |
| `resource_kind` | string | Resource type (Instance, Database, RedisCluster, etc.) |
| `issue_type` | string | Issue category (see below) |
| `auto_heal` | bool | Automatically attempt to fix issues |
| `max_retries` | int | Maximum healing attempts before escalation |
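
One plausible reading of how `auto_heal` and `max_retries` interact, sketched below: if `auto_heal` is false the issue is escalated without any attempt; otherwise a healing action is tried up to `max_retries` times before escalating. The `HealingPolicy` dataclass mirrors the fields above, but `apply_policy`, the `heal` callback, and the `"resolved"` / `"escalated"` return values are hypothetical and only illustrate the idea; the service's actual behavior may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealingPolicy:
    """Mirrors the documented policy fields."""
    resource_kind: str
    issue_type: str
    auto_heal: bool
    max_retries: int


def apply_policy(policy: HealingPolicy, heal: Callable[[], bool]) -> str:
    """Call `heal` up to policy.max_retries times.

    Returns "resolved" as soon as one attempt succeeds, and
    "escalated" if auto-healing is disabled or every attempt fails.
    (Assumed semantics, not confirmed by the API reference.)
    """
    if not policy.auto_heal:
        return "escalated"
    for _ in range(policy.max_retries):
        if heal():
            return "resolved"
    return "escalated"


if __name__ == "__main__":
    policy = HealingPolicy("Instance", "service_down", auto_heal=True, max_retries=3)
    # Simulated healing action that fails twice, then succeeds.
    outcomes = iter([False, False, True])
    print(apply_policy(policy, lambda: next(outcomes)))
```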

## Issue Types

| Issue Type | Description | Typical Action |
| --- | --- | --- |
| `service_down` | Service is unresponsive | Restart service or instance |
| `disk_full` | Disk usage above threshold | Clean logs, expand storage |
| `high_cpu` | CPU usage sustained above threshold | Scale up or optimize |
| `replication_lag` | Database replica lag too high | Restart replication |
| `connection_timeout` | Unable to connect to resource | Check networking, restart |