---
title: "Training Hermes Agent as a Business Decision Partner"
description: "An experiment in growing Hermes Agent into a business-ready support agent for decision-making."
lang: "en"
canonical: "https://llm-lab.dev/en/posts/hermes-agent-001-support-triage-agent-start/"
source: "https://llm-lab.dev/en/posts/hermes-agent-001-support-triage-agent-start.md"
publishedAt: "2026-06-14"
updatedAt: "2026-06-14"
category: "Hermes Agent"
tags:
  - "hermes-agent"
  - "mdx"
  - "cloudflare"
---

# Training Hermes Agent as a Business Decision Partner

import LinkCard from "../../components/LinkCard.astro";

I'm starting an experiment to train Hermes Agent into an agent that supports business decisions.

## What is Hermes Agent

[Hermes Agent](https://hermes-agent.nousresearch.com/docs) is an open-source autonomous AI agent developed by Nous Research. Unlike IDE-integrated coding assistants or chatbots that wrap a single API in a conversation UI, Hermes combines tools, skills, persistent memory, scheduled execution, and subagents to continue working across local environments, Docker, SSH remotes, and cloud execution environments.

What the official documentation highlights most is the learning loop that carries experience from one task to the next. Hermes retains important facts and user preferences as cross-session memory, and loads repeatable procedures as skills. Because it loads only the skills it needs at each step, it doesn't stuff every procedure into every prompt. The same agent can also be reached from multiple interfaces: not just the CLI, but also messaging platforms such as Telegram, Discord, and Slack.

| Mechanism | Official Feature Summary | What This Experiment Will Examine |
|---|---|---|
| memory | Retains important information about the project and user across sessions | How past decisions and corrections are reflected in the next proposal |
| skills | Reusable work procedures loaded on demand | Whether triage criteria can be separated into fixed rules and adjustable heuristics |
| tools / mcp | Adds the ability to retrieve or operate on external information | How to separate the scope of inquiry data reading from operations with side effects |
| subagents | Delegates processing to isolated work environments | Whether classification, rationale verification, and monitoring can be role-split in the future |
| soul.md / context files | Provides response posture and project-specific assumptions | Whether a posture that prioritizes human approval and maintains operational constraints can be preserved |

However, I do not take the official "self-improving" label to mean that operational decisions automatically become correct. If memory or skills are updated, incorrect assumptions or temporary exceptions can also be carried over. The official features include settings that require approval before writing to memory or skills, so in actual practice you need to design in advance: "what to let the agent learn," "who approves the changes," and "which rules must not be altered."

From this perspective, this series will not merely introduce Hermes Agent's official features. I will also separately verify fixed rules, adjustable heuristics, and continuous monitoring of misjudgments. The L1 / L2 / L3 layers described below are not official Hermes terminology; they are my own classification adopted for this PoC to manage decision rules.

The theme for this project is customer support inquiry triage and escalation decisions. The aim is not to fully automate the response itself. It is to build an agent that reads incoming inquiries and proposes priority and routing in a form that makes it easy for human staff to decide.

Support triage is only the first use case. What I want to see here is which rules should be fixed, which heuristics should be grown, and where control should revert to humans when bringing Hermes Agent closer to real-world decision-making.

## Why Support Triage

To make Hermes Agent's growth record useful for actual business, the theme needs to be neither too vague nor too simple.

Customer support inquiry triage meets this condition.

- Multiple signals are involved: wording, urgency, customer contract status, past exchanges
- Priority outputs like immediate, normal, and low are relatively easy to compare
- Routing outcomes like immediate escalation, normal queue, and auto-reply eligible are also easy to compare
- Judgment tends to rely on staff experience and intuition
- When it misses, the reason is easy to review, and the lessons can be carried over to other inquiry operations

What I especially want to see is what Hermes gets wrong in its initial state. Rather than pretending the agent is smart from the start, I will record what is missing for the judgment to hold up.

## Scope Delegated to the Agent

In this PoC, Hermes is only responsible for proposing triage.

For each received inquiry (ticket), Hermes produces output like the following:

```json
{
  "priority": "immediate",
  "routing": "escalate",
  "rationale": [
    "strongly demands contract termination and refund",
    "implies legal action depending on the response",
    "target customer is on a premium plan with a large impact scope"
  ],
  "confidence_score": 0.82,
  "human_approval_required": true
}
```

On the other hand, the actual reply content and final resolution are handled by a human. Automating this would let misjudgments directly affect customer-facing quality and trust.

In this series, the agent is treated not as "a staff member who replies automatically," but as "an assistant that organizes the basis for judgment and proposes options." I will test this distance with support triage, and record what works and what is dangerous in a form that can be transferred to other inquiry operations.
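
Because a human reviews every proposal, it helps to reject malformed output before it reaches them. The following is a minimal sketch of such a check, not part of the PoC itself. The field names come from the JSON example above; the routing identifiers `normal_queue` and `auto_reply_eligible`, and the `parse_proposal` helper, are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Priorities named in this post; routing identifiers other than "escalate" are assumed.
PRIORITIES = {"immediate", "normal", "low"}
ROUTINGS = {"escalate", "normal_queue", "auto_reply_eligible"}


@dataclass(frozen=True)
class TriageProposal:
    priority: str
    routing: str
    rationale: list
    confidence_score: float
    human_approval_required: bool


def parse_proposal(raw: str) -> TriageProposal:
    """Parse agent output and reject proposals that are not reviewable."""
    data = json.loads(raw)
    # Raises TypeError if a key is missing or an unexpected key is present.
    proposal = TriageProposal(**data)
    if proposal.priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {proposal.priority!r}")
    if proposal.routing not in ROUTINGS:
        raise ValueError(f"unknown routing: {proposal.routing!r}")
    if not isinstance(proposal.rationale, list) or not proposal.rationale:
        raise ValueError("proposal has no rationale")
    if not isinstance(proposal.confidence_score, (int, float)):
        raise ValueError("confidence_score must be a number")
    if not 0.0 <= proposal.confidence_score <= 1.0:
        raise ValueError("confidence_score must be between 0 and 1")
    if not isinstance(proposal.human_approval_required, bool):
        raise ValueError("human_approval_required must be a boolean")
    return proposal
```

A proposal that fails this check never becomes an option for the reviewer, which keeps "no rationale, no proposal" enforceable outside the prompt.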

## Separating into L1 / L2 / L3

I divide Hermes's judgments into three layers.

### L1: Fixed Rules That Must Never Be Broken

L1 is the set of principles that must not be changed by feedback.

- Wording that implies legal risk, contract termination, or litigation must be escalated immediately
- Content involving personal or confidential information must never be auto-replied; it always requires human review
- Any mention of self-harm or harm to others must be passed to a human with the highest priority
- Anything already past its SLA must be treated with the highest priority
- Proposals must not be made without rationale, and human approval must not be bypassed

L1 is not a target for growth. Even if the agent performs well in operation, weakening L1 would allow dangerous judgments to slip through.
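
One way to keep L1 out of reach of feedback is to apply it as a post-processing step in ordinary code, outside anything the agent can rewrite. The sketch below assumes the ticket has already been annotated with boolean signals; the `L1Signals` fields, the `enforce_l1` function, and the routing identifiers are hypothetical, and how each signal is detected is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L1Signals:
    """Hypothetical flags; how each one is detected is outside this sketch."""
    legal_risk: bool = False
    personal_or_confidential: bool = False
    harm_mentioned: bool = False
    sla_breached: bool = False


def enforce_l1(proposal: dict, signals: L1Signals) -> dict:
    """Return a copy of the proposal with L1 overrides applied."""
    if not proposal.get("rationale"):
        raise ValueError("L1: proposals must not be made without rationale")

    result = dict(proposal)
    fired = []

    if signals.legal_risk:
        result.update(priority="immediate", routing="escalate")
        fired.append("legal_risk")
    if signals.harm_mentioned:
        # Assumption: "passed to a human with the highest priority" maps to escalate.
        result.update(priority="immediate", routing="escalate")
        fired.append("harm_mentioned")
    if signals.sla_breached:
        result["priority"] = "immediate"
        fired.append("sla_breached")
    if signals.personal_or_confidential:
        # Never auto-reply; fall back to a human-handled queue.
        if result.get("routing") == "auto_reply_eligible":
            result["routing"] = "normal_queue"
        fired.append("personal_or_confidential")

    if fired:
        # Any L1 hit forces human approval, whatever the agent proposed.
        result["human_approval_required"] = True
    result["l1_overrides"] = fired
    return result
```

Recording which rules fired in `l1_overrides` makes it visible, during review, when the agent's own proposal would have violated L1.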

### L2: Heuristics to Grow

L2 is the set of weights adjusted while observing feedback.

- How to adjust the weight of keywords and patterns that were frequently false positives in the past
- How much to trust routing based on each staff member's specialty and response speed
- How much to reflect customer contract plan and status in priority
- How much to treat strong wording in itself as urgency

When we say Hermes "grows," we mainly mean improvement in this L2 layer.
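
To make "weights adjusted while observing feedback" concrete, here is a deliberately simple numeric illustration. It is an assumption for explanation only: in the PoC these heuristics may live in Hermes context files or skills rather than in code, and the signal names, starting weights, and learning rate below are all hypothetical.

```python
from dataclasses import dataclass, field


def _default_weights() -> dict:
    # Hypothetical starting weights for three illustrative signals.
    return {"strong_wording": 0.3, "premium_plan": 0.2, "flagged_keyword": 0.4}


@dataclass
class L2Weights:
    weights: dict = field(default_factory=_default_weights)
    learning_rate: float = 0.1

    def urgency(self, signals: dict) -> float:
        """Weighted sum of signal strengths; unknown signals contribute nothing."""
        return sum(self.weights.get(name, 0.0) * value for name, value in signals.items())

    def record_feedback(self, signals: dict, was_false_positive: bool) -> None:
        """Lower the weights of signals behind a false positive, raise them on a confirmed hit."""
        direction = -1.0 if was_false_positive else 1.0
        for name, value in signals.items():
            if name in self.weights and value > 0:
                updated = self.weights[name] + direction * self.learning_rate * value
                # Clamp so a run of feedback cannot push a weight out of [0, 1].
                self.weights[name] = min(1.0, max(0.0, updated))
```

For example, after a reviewer marks a strongly worded but harmless complaint as a false positive, `record_feedback({"strong_wording": 1.0}, was_false_positive=True)` reduces how much strong wording alone contributes to urgency. Nothing in this layer can touch L1, which is the point of keeping the two separate.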

### L3: Monitoring When the Agent Keeps Missing

L3 is not about individual proposal judgments. It watches for patterns across them.

For example, if there are five consecutive misjudgments in the same category, it is not enough to correct each individual decision. You must suspect structural changes that Hermes is not seeing, such as a drift in category definitions or changes in operational rules.

At this point, L3 is expected to issue a separate alert like the following:

```json
{
  "alert_type": "consecutive_misclassification",
  "message": "Triage decisions for this category have deviated from accepted outcomes for five consecutive cases. Please verify whether category definitions or operational rules have changed."
}
```

In this initial PoC, I will first evaluate normal judgments with L1 and L2. L3 will be handled in a follow-up post.
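
Although L3 is deferred, the consecutive-miss check described above is small enough to sketch now. This is one possible shape, not the design the follow-up will necessarily use. The `MissMonitor` class is hypothetical, and the `category` field in the alert is an addition that the example alert above does not include.

```python
from collections import defaultdict
from typing import Optional

# "Five consecutive misjudgments" is the example threshold from the text.
CONSECUTIVE_MISS_THRESHOLD = 5


class MissMonitor:
    """Tracks consecutive misses per category and alerts when a streak reaches the threshold."""

    def __init__(self, threshold: int = CONSECUTIVE_MISS_THRESHOLD) -> None:
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, category: str, matched: bool) -> Optional[dict]:
        """Record one reviewed decision; return an alert dict or None."""
        if matched:
            self._streaks[category] = 0
            return None
        self._streaks[category] += 1
        # Fire exactly once, at the moment the streak reaches the threshold.
        if self._streaks[category] == self.threshold:
            return {
                "alert_type": "consecutive_misclassification",
                "category": category,  # assumed extra field for routing the alert
                "message": (
                    f"Triage decisions for this category have deviated from accepted "
                    f"outcomes for {self.threshold} consecutive cases. Please verify "
                    f"whether category definitions or operational rules have changed."
                ),
            }
        return None
```

Keeping the streak per category matters: five misses spread across unrelated categories are ordinary L2 noise, while five in a row within one category suggest that something structural has changed.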

## Initial Verification Scenarios

Creating all scenarios from the start would make data preparation heavier than the evaluation loop itself. I will start with only three scenarios to test normal judgment.

| ID | Scenario | Expected Judgment |
|---|---|---|
| A-01 | Typical product return or exchange request | Normal priority / normal queue |
| A-02 | Minor complaint with strong wording but no actual damage | Normal priority / normal queue |
| A-03 | Simple usage or specification question | Low priority / auto-reply eligible |

What I want to see with these three is not whether Hermes can produce "plausible-looking" output.

- Whether priority and routing judgments match expectations
- Whether rationale is tied to ticket content
- Whether confidence is neither too high nor too low
- Whether it proposes anything that violates L1

I will score these first and record the weaknesses of the initial state.

Note that all scenarios use synthetic inquiry text and synthetic customer information. No proper nouns that suggest a specific industry or company are used; I will proceed only with abstract inquiry categories and response teams.
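
Scoring is manual for now, but the mechanical parts of the checks listed above could be expressed roughly as follows. This is a sketch under stated assumptions: the routing identifiers and the `CONFIDENCE_BAND` bounds are hypothetical (the post does not define what "too high" or "too low" means numerically), and whether a rationale is really tied to the ticket content still needs a human reader.

```python
# Assumed acceptable confidence band; the post does not define numeric bounds.
CONFIDENCE_BAND = (0.5, 0.95)

# Expected judgments for the three scenarios in the table above.
EXPECTED = {
    "A-01": {"priority": "normal", "routing": "normal_queue"},
    "A-02": {"priority": "normal", "routing": "normal_queue"},
    "A-03": {"priority": "low", "routing": "auto_reply_eligible"},
}


def score(scenario_id: str, proposal: dict) -> dict:
    """Return one boolean per mechanical check for a single scenario."""
    expected = EXPECTED[scenario_id]
    low, high = CONFIDENCE_BAND
    return {
        "priority_matches": proposal.get("priority") == expected["priority"],
        "routing_matches": proposal.get("routing") == expected["routing"],
        # Only checks presence; relevance to the ticket is judged manually.
        "has_rationale": bool(proposal.get("rationale")),
        "confidence_in_band": low <= proposal.get("confidence_score", -1.0) <= high,
    }
```

Returning separate booleans rather than a single number keeps the failure mode visible. An over-escalated A-02, for instance, shows up as both `priority_matches` and `routing_matches` being false, which is a different weakness from a correct judgment with an empty rationale.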

## What Was Built This Time

Verification data and working files are kept in a private local workspace, not in a public repository.

Support operations may involve information close to actual inquiry text and customer correspondence. Even though this starts from synthetic data, real data or internal operational notes could get mixed in later. Therefore, the articles published on GitHub describe only the structure the verification follows; specific data and internal notes are not disclosed.

The main contents are as follows:

- PoC README
- Minimal design notes
- SQLite-oriented schema.sql (tickets, triage decisions, feedback, escalations)
- Hermes soul.md
- L1 / L2 context
- JSON for the first three scenarios to try
- Manual scoring notes

At this point, I have not yet created an automated evaluation script. The next step is to feed the three scenarios to the initial Hermes and record which judgments miss.
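
The actual schema.sql stays private, so the following is only an illustration of how the four tables named in the list above might relate to each other. Every column name here is hypothetical; only the table subjects (tickets, triage decisions, feedback, escalations) come from this post.

```python
import sqlite3

# Hypothetical columns; the real schema.sql is not published.
SCHEMA = """
CREATE TABLE tickets (
    id INTEGER PRIMARY KEY,
    body TEXT NOT NULL,
    received_at TEXT NOT NULL
);
CREATE TABLE triage_decisions (
    id INTEGER PRIMARY KEY,
    ticket_id INTEGER NOT NULL REFERENCES tickets(id),
    priority TEXT NOT NULL,
    routing TEXT NOT NULL,
    rationale_json TEXT NOT NULL,
    confidence_score REAL NOT NULL,
    human_approval_required INTEGER NOT NULL
);
CREATE TABLE feedback (
    id INTEGER PRIMARY KEY,
    decision_id INTEGER NOT NULL REFERENCES triage_decisions(id),
    accepted INTEGER NOT NULL,
    reviewer_note TEXT
);
CREATE TABLE escalations (
    id INTEGER PRIMARY KEY,
    ticket_id INTEGER NOT NULL REFERENCES tickets(id),
    reason TEXT NOT NULL
);
"""


def create_workspace_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the illustrative tables; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Separating `feedback` from `triage_decisions` keeps the agent's original proposal intact after a human corrects it, so later posts can compare what was proposed with what was accepted.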

## What to Check Next

In the next post, I will hand the three scenarios above to Hermes. Rather than just looking at triage accuracy, I will observe what is missing for an agent that supports business decisions.

I will record the following:

- Whether priority and routing are judged as expected
- Whether rationale corresponds to data in the input JSON
- Whether minor complaints are escalated excessively
- Whether strong wording alone pushes the judgment toward a dangerous outcome

In the initial evaluation, failures are the more valuable material. Once it is clear where the judgment breaks, the next step can show how L1 or L2 should be corrected.

## Next in the Series

The next article brings the synthetic scenarios, expected judgments, rationale, and safety constraints into a uniform format before feeding them into Hermes Agent.

<LinkCard
  href="https://llm-lab.dev/posts/hermes-agent-002-support-triage-scenarios/"
  title="Before Training Hermes Agent: Building Synthetic Support Triage Scenarios"
  description="A follow-up that designs three evaluatable synthetic scenarios and safety constraints without using real data."
  siteName="LLM Lab"
  image="/images/posts/hermes-agent-002-support-triage-scenarios/scenario-evaluation.webp"
/>

## References

- [Hermes Agent Documentation](https://hermes-agent.nousresearch.com/docs)
- [Persistent Memory](https://hermes-agent.nousresearch.com/docs/user-guide/features/memory)
- [Skills System](https://hermes-agent.nousresearch.com/docs/user-guide/features/skills)
- [Personality & SOUL.md](https://hermes-agent.nousresearch.com/docs/user-guide/features/personality)