AI Agents
Overview of the 13 AI agents that manage infrastructure resources in AgentMetal.
Overview
AgentMetal uses 13 specialized AI agents to manage infrastructure resources. Each agent is responsible for one domain (instances, databases, DNS, and so on). It uses a large language model (LLM) to reason about what needs to change and a tool-use system to execute the resulting actions.
Agent Architecture
The system follows an agent-per-domain pattern where each agent encapsulates deep knowledge about its resource type. An Orchestrator agent coordinates multi-service deployments that span multiple domains.
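
The agent-per-domain pattern can be illustrated with a small sketch. It is not AgentMetal's actual code: the `DomainAgent` and `Orchestrator` classes, the `handle` and `deploy` methods, and the request dictionary shape are all hypothetical names chosen for illustration. The sketch only shows the routing idea, where the orchestrator looks up the agent that owns a domain and delegates each step of a multi-service deployment to it.

```python
from dataclasses import dataclass, field


@dataclass
class DomainAgent:
    """Hypothetical agent that owns exactly one resource domain."""
    domain: str

    def handle(self, request: dict) -> str:
        # A real agent would reason with an LLM and call provider tools here.
        return f"{self.domain}: {request['action']} {request['name']}"


@dataclass
class Orchestrator:
    """Hypothetical coordinator that delegates each step to the owning agent."""
    agents: dict = field(default_factory=dict)

    def register(self, agent: DomainAgent) -> None:
        self.agents[agent.domain] = agent

    def deploy(self, steps: list) -> list:
        results = []
        for step in steps:
            agent = self.agents.get(step["domain"])
            if agent is None:
                raise KeyError(f"no agent registered for domain {step['domain']!r}")
            results.append(agent.handle(step))
        return results


orchestrator = Orchestrator()
for domain in ("compute", "data", "dns"):
    orchestrator.register(DomainAgent(domain))

# A deployment that spans three domains (illustrative resource names).
results = orchestrator.deploy([
    {"domain": "data", "action": "create", "name": "orders-db"},
    {"domain": "compute", "action": "create", "name": "api-1"},
    {"domain": "dns", "action": "create", "name": "api.example.com"},
])
```

Keeping the lookup keyed by domain means the orchestrator needs no knowledge of how any individual resource type is provisioned.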
Agent Roster
| Agent | Domain | Description |
|---|---|---|
| Orchestrator | Multi-service | Coordinates complex deployments across agents |
| Instance Agent | Compute | Provisions and manages bare-metal/virtual instances |
| Database Agent | Data | Manages PostgreSQL and MySQL databases |
| VPC Agent | Networking | Manages VPCs, subnets, and security groups |
| LoadBalancer Agent | Traffic | Manages HTTP and TCP load balancers |
| DNS Agent | DNS | Manages zones and records |
| K3s Agent | Kubernetes | Manages K3s cluster lifecycle |
| Redis Agent | Caching | Manages Redis clusters |
| Queue Agent | Messaging | Manages RabbitMQ and NATS queues |
| Function Agent | Serverless | Manages serverless functions |
| Bucket Agent | Storage | Manages object storage buckets |
| IaC Agent | Declarative | Processes Infrastructure as Code stacks |
| Healing Agent | Reliability | Monitors and auto-heals infrastructure issues |

Each agent handles a state change in five stages:
- State Change Detection: etcd watchers notify agents when desired or actual state changes
- Reasoning: The agent uses an LLM to analyze the state delta and generate an execution plan
- Risk Classification: Each planned action is classified as Safe, Moderate, or Dangerous
- Execution: Tools are invoked sequentially to carry out the plan
- Audit: Every decision and action is recorded in the audit log
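
The stages above can be sketched as a single reconcile pass. This is a minimal illustration under stated assumptions, not the real implementation: the etcd watcher is replaced by two in-memory dictionaries, the LLM reasoning step is replaced by a fixed lookup, and the tool names (`start_instance`, `delete_instance`, `resize_instance`), the `RISK_BY_TOOL` table, and all function names are hypothetical. Approval handling for Dangerous actions is left out of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    SAFE = "Safe"
    MODERATE = "Moderate"
    DANGEROUS = "Dangerous"


@dataclass
class Action:
    tool: str
    target: str
    risk: Risk = Risk.SAFE


# Hypothetical risk table; the real classification rules are not documented here.
RISK_BY_TOOL = {
    "start_instance": Risk.SAFE,
    "resize_instance": Risk.MODERATE,
    "delete_instance": Risk.DANGEROUS,
}


def detect_delta(desired: dict, actual: dict) -> dict:
    """Stage 1: return entries whose desired value differs from the actual one."""
    return {name: state for name, state in desired.items() if actual.get(name) != state}


def make_plan(delta: dict) -> list:
    """Stage 2: stand-in for LLM reasoning; maps each change to one tool call."""
    tool_for_state = {"running": "start_instance", "absent": "delete_instance"}
    return [Action(tool_for_state[state], name) for name, state in delta.items()]


def classify(actions: list) -> list:
    """Stage 3: attach a risk level; unknown tools default to Dangerous."""
    for action in actions:
        action.risk = RISK_BY_TOOL.get(action.tool, Risk.DANGEROUS)
    return actions


def reconcile(desired: dict, actual: dict, audit: list) -> None:
    actions = classify(make_plan(detect_delta(desired, actual)))
    for action in actions:
        # Stage 4: run tools one at a time (here we just pretend each succeeds).
        actual[action.target] = desired[action.target]
        # Stage 5: record what was done and how it was classified.
        audit.append(f"{action.risk.value}: {action.tool}({action.target})")


desired = {"web-1": "running", "web-2": "absent"}
actual = {"web-1": "stopped", "web-2": "running"}
audit_log = []
reconcile(desired, actual, audit_log)
```

After the pass, `actual` matches `desired`, and `audit_log` holds one entry per executed action together with its risk level.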
Key Capabilities
- Reconciliation: Continuously drive actual state toward desired state
- Self-Healing: Automatically detect and resolve infrastructure issues
- Approval Workflows: Dangerous operations require human approval
- Tool Use: Each agent has domain-specific tools for interacting with providers
- LLM Routing: Tasks are routed to an appropriately sized model based on their complexity
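
Two of the capabilities above, approval workflows and LLM routing, reduce to small decision functions. The sketch below shows one way they might look. Everything in it is an assumption made for illustration: the `may_execute` and `route_model` names, the `approved_by` parameter, the numeric complexity score, the `0.3` and `0.7` thresholds, and the `small`/`medium`/`large` tier labels do not come from AgentMetal.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    SAFE = 1
    MODERATE = 2
    DANGEROUS = 3


def may_execute(risk: Risk, approved_by: Optional[str] = None) -> bool:
    """Dangerous actions run only once a human approver is recorded."""
    if risk is Risk.DANGEROUS:
        return approved_by is not None
    return True


def route_model(complexity: float) -> str:
    """Pick a model tier from a complexity score in [0, 1] (hypothetical thresholds)."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be between 0 and 1")
    if complexity < 0.3:
        return "small"
    if complexity < 0.7:
        return "medium"
    return "large"
```

For example, `may_execute(Risk.DANGEROUS)` returns `False` until an approver is supplied, while `route_model(0.9)` selects the largest tier.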