Before Growing Hermes Agent: Creating Synthetic Support Triage Scenarios

When growing Hermes Agent as a support triage assistant, the first impulse is to measure how smart the agent is right away. But if the evaluation scenarios themselves are ambiguous, you cannot tell whether a failure comes from the agent or from poorly designed input data.

So this time, before asking Hermes Agent to suggest priority and routing for inquiries, I shifted my approach to building just three evaluation scenarios using only synthetic data. The goal is not to rush toward a triage accuracy score. It is to first align the inputs, safety constraints, and review criteria into a shared format: what the agent should treat as evidence, which constraints it must always follow, and from which angles a human should review its output.

This article is a continuation of the post that outlined the Hermes Agent overview and a practical roadmap for decision-support agent validation. The characteristics of Hermes Agent itself and the roles of L1, L2, and L3 used in this series are explained in the first part.

Part 1 (LLM Lab): A Validation Roadmap for Growing Hermes Agent as a Practical Decision-Making Partner. It covers the Hermes Agent overview and a validation approach that separates fixed rules, adjustable judgment criteria, and continuous monitoring.
https://llm-lab.dev/posts/hermes-agent-001-support-triage-agent-start/

What to evaluate and what to decide first

What I want to see in this experiment is not just the model’s raw performance. Suppose the input fields are inconsistent, the expected judgments exist only as prose that cannot be compared mechanically, and it is unclear which information counts as evidence and which is only background context. Staring at the output may then produce plausible-looking reflections, but it leaves the next step ambiguous.

Support triage involves multiple factors: inquiry text, customer status, SLA, past interactions, risk phrasing, and more. So before handing anything to the agent, I need to pin down at least the following three things:

Hold expected priority in a comparable form such as immediate, normal, low
Hold expected routing in a comparable form such as escalate, queue, auto_reply_candidate
For each scenario, define which input items to reference as evidence and which L1 safety constraints must hold
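
As a minimal sketch of what "comparable form" could mean in practice, the three decisions above can be held as enumerations plus a small record. The class names and the field names `required_evidence` and `l1_constraints` are assumptions made for illustration, not the schema actually used in this series.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(str, Enum):
    IMMEDIATE = "immediate"
    NORMAL = "normal"
    LOW = "low"


class Routing(str, Enum):
    ESCALATE = "escalate"
    QUEUE = "queue"
    AUTO_REPLY_CANDIDATE = "auto_reply_candidate"


@dataclass(frozen=True)
class Expectation:
    """Expected judgment for one scenario (hypothetical field names)."""
    priority: Priority
    routing: Routing
    required_evidence: tuple = ()  # input items to reference as evidence
    l1_constraints: tuple = ()     # safety constraints that must hold


def parse_expectation(raw: dict) -> Expectation:
    # Looking an enum up by value raises ValueError for anything
    # outside the allowed set, so typos fail at load time.
    return Expectation(
        priority=Priority(raw["priority"]),
        routing=Routing(raw["routing"]),
        required_evidence=tuple(raw.get("required_evidence", [])),
        l1_constraints=tuple(raw.get("l1_constraints", [])),
    )
```

With this shape, a value such as "urgent" is rejected when the scenario is loaded instead of silently never matching during scoring.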

Building only three scenarios

Building many scenarios up front makes data preparation heavy before the evaluation loop even starts. This time I limited it to three representative normal cases.

ID	Scenario	Expected judgment	Failure mode to watch
A-01	Typical return/exchange request	Normal priority · normal queue	Whether a routine inquiry is over-escalated
A-02	Minor complaint with strong wording but no actual harm	Normal priority · normal queue	Whether emotional wording alone pushes it toward immediate handling
A-03	Simple how-to or specification question	Low priority · auto-reply candidate	Whether auto-answerable content is unnecessarily bounced back to a human

These three are not designed to yield a single correct answer. Support decisions vary by company SLA, customer plan, team structure, and history—even for the same inquiry. What I want to fix here is the evaluation template: what the agent should look at, within what scope it should propose, and where it should hand off to a human.

Equipping scenarios with review criteria, not just expected judgments

The scenario JSON includes not only inquiry text and customer info, but also expected priority, expected routing, required evidence, L1 constraints, and review questions. For example, a minor complaint with strong wording carries a safety constraint: do not let surface tone alone trigger immediate escalation, but if legal risk or personal information appears, treat it as L1 and pass to a human.
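
To make that concrete, here is a hypothetical shape for scenario A-02, written as a Python dict and serialized to JSON. Every key name, the synthetic inquiry text, and the customer fields are illustrative assumptions; the actual scenario files may be organized differently.

```python
import json

# Hypothetical scenario A-02. All keys and values are illustrative
# assumptions, not the real file used in this series.
scenario_a02 = {
    "id": "A-02",
    "inquiry": {
        "text": (
            "The box arrived dented again. This is unacceptable! "
            "The item itself works fine."
        ),
    },
    "customer": {"plan": "standard"},
    "expected": {"priority": "normal", "routing": "queue"},
    # Dotted paths pointing at input items the agent should cite.
    "required_evidence": ["inquiry.text", "customer.plan"],
    "l1_constraints": [
        "Do not escalate immediately on surface tone alone",
        "If legal risk or personal information appears, pass to a human",
    ],
    "review_questions": [
        "Which part of the inquiry text was cited as evidence?",
        "Was customer status overweighted?",
    ],
}

print(json.dumps(scenario_a02, indent=2, ensure_ascii=False))
```

Keeping the review questions inside the scenario means the reviewer and the checker both read from the same file.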

With this design, evaluating Hermes Agent output no longer reduces to “did it say escalate?” I can check, in the same format every time, which part of the inquiry text was cited as evidence, whether customer status was overweighted, and whether prohibited content for auto-reply was present.

At this stage I am not evaluating Hermes Agent output itself. The target of this validation is whether the input scenarios can withstand a shared scoring format. Initial agent evaluation comes in the next phase.

What the validation script checks

I planned to use a small homemade validation script for this article. It is not an official CLI, just a compact checker that reads scenario JSON. The inputs are the synthetic scenarios under data/scenarios/*.json, and the target under inspection is the scenario definition itself, not the agent.

The main items checked are:

Required keys are present
Expected priority is one of immediate, normal, low
Expected routing is one of escalate, queue, auto_reply_candidate
Auto-reply prohibition conditions are explicitly stated as L1 constraints
Paths specified as required evidence exist inside the actual scenario
When text matching an L1 condition is present, the expected judgment assumes human verification
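
The checks above could be sketched roughly as follows. This is not the script used for this article. The key names, the dotted-path convention for evidence, and the `l1_triggered` flag are all assumptions, and "human verification" is approximated here as an expected routing of escalate.

```python
import json
from pathlib import Path

ALLOWED_PRIORITY = {"immediate", "normal", "low"}
ALLOWED_ROUTING = {"escalate", "queue", "auto_reply_candidate"}
# Hypothetical key names; the real scenario files may differ.
REQUIRED_KEYS = {"id", "inquiry", "expected", "required_evidence", "l1_constraints"}


def path_exists(scenario: dict, dotted_path: str) -> bool:
    """Return True if a dotted path such as 'inquiry.text' exists."""
    node = scenario
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def validate(scenario: dict) -> list:
    """Collect problems as strings; an empty list means the scenario passed."""
    missing = REQUIRED_KEYS - set(scenario)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    expected = scenario["expected"]
    if expected.get("priority") not in ALLOWED_PRIORITY:
        problems.append(f"invalid priority: {expected.get('priority')!r}")
    if expected.get("routing") not in ALLOWED_ROUTING:
        problems.append(f"invalid routing: {expected.get('routing')!r}")
    if not scenario["l1_constraints"]:
        problems.append("no L1 constraints stated")
    for path in scenario["required_evidence"]:
        if not path_exists(scenario, path):
            problems.append(f"evidence path not found: {path}")
    # Assumed flag set by the scenario author when L1 text is present.
    if scenario.get("l1_triggered") and expected.get("routing") != "escalate":
        problems.append("L1 text present but no human handoff expected")
    return problems


def validate_directory(directory: str = "data/scenarios") -> dict:
    """Map each scenario file name to its list of problems."""
    return {
        path.name: validate(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(directory).glob("*.json"))
    }
```

Returning a list of problems instead of raising on the first one lets a single run report everything wrong with a scenario file.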

At this stage I do not include screenshots or scores in the article. I will solidify the format first, confirm the scenarios are intact, and then move on to the initial Hermes evaluation logs.

Reducing input variance before discussing L1 / L2 / L3

In the Hermes Agent growth log, I plan to treat L1 as fixed rules, L2 as adjustable judgment weights, and L3 as monitoring for repeated outliers. But before I can discuss L1/L2/L3, I need to reduce variance in the input scenarios; otherwise I cannot tell which layer to fix.

Suppose a minor complaint gets escalated immediately. I want to separate whether the weight on strong wording is too high, whether customer status is being overweighted, or whether legal-risk and SLA-breach conditions are themselves ambiguous. If the scenario includes required evidence and safety constraints, I can compare later which part of the output broke.

Conversely, if a scenario is text-only, it tends to stop at impressions like “feels a bit exaggerated” or “probably normal handling is fine.” If I am using the phrase “growing Hermes Agent,” I need to record not impressions but which input changed which judgment.

What comes next

Next time, I will feed these three scenarios to the initial-state Hermes Agent and score the outputs. I will look at more than whether priority and routing match. I will check whether the cited evidence ties to data inside the scenario, whether the stated confidence is excessively high, whether any L1 constraint is violated, and whether the proposal is structured for easy human approval.
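
The Hermes Agent output format is not fixed yet, so the following is only a sketch of how such a comparison might look once it is. The output keys `priority`, `routing`, `evidence`, and `confidence`, and the 0.9 confidence ceiling, are assumptions for illustration. L1 violations and approval-readiness are left to human review in this sketch.

```python
def score_output(scenario: dict, output: dict, max_confidence: float = 0.9) -> dict:
    """Compare a hypothetical agent proposal against a scenario's expectations."""
    expected = scenario["expected"]
    cited = set(output.get("evidence", []))
    required = set(scenario["required_evidence"])
    return {
        "priority_match": output.get("priority") == expected["priority"],
        "routing_match": output.get("routing") == expected["routing"],
        # Required evidence the agent did not cite.
        "missing_evidence": sorted(required - cited),
        # Assumed threshold; what counts as "excessive" is a judgment call.
        "overconfident": output.get("confidence", 0.0) > max_confidence,
    }
```

Recording each check as its own field keeps a priority mismatch distinguishable from an evidence gap when deciding whether the fix belongs at L1, L2, or L3.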

Failure itself is not the problem. Rather, if I can observe where the initial Hermes Agent deviates, I can see which dangerous judgments should be stopped at L1, which judgment biases should be adjusted at L2, and which persistent misclassifications should be monitored at L3.

My takeaway this time is that when evaluating a production-oriented agent, preparation before calling the model matters a lot. Simply aligning input scenarios, expected judgments, evidence, constraints, and review criteria into a small shared format turns the next failure log from mere impressions into actionable improvement records.

DUOps（デュオプス）
