Before Growing Hermes Agent: Creating Synthetic Support Triage Scenarios
As a preparatory step before delegating support triage to Hermes Agent, I built three evaluation scenarios using synthetic data—fixing decision criteria and safety constraints in advance without relying on real customer data.
When growing Hermes Agent as a support triage assistant, the first impulse is to measure how smart the agent is right away. But if the evaluation scenarios themselves are ambiguous, you cannot tell whether a failure comes from the agent or from poorly designed input data.
So this time, before asking Hermes Agent to suggest priority and routing for inquiries, I shifted my approach to building just three evaluation scenarios using only synthetic data. The goal is not to rush toward a triage accuracy score. It is to first align the inputs, safety constraints, and review criteria into a shared format: what the agent should treat as evidence, which constraints it must always follow, and from which angles a human should review its output.
This article is a continuation of the post that outlined the Hermes Agent overview and a practical roadmap for decision-support agent validation. The characteristics of Hermes Agent itself and the roles of L1, L2, and L3 used in this series are explained in the first part.
Related post (LLM Lab): A Validation Roadmap for Growing Hermes Agent as a Practical Decision-Making Partner. The first part covers the Hermes Agent overview and a validation approach that separates fixed rules, adjustable judgment criteria, and continuous monitoring. https://llm-lab.dev/posts/hermes-agent-001-support-triage-agent-start/
What to evaluate and what to decide first
What I want to see in this experiment is not just the model’s raw performance. Suppose the input fields are inconsistent, the expected judgments exist only as prose that cannot be compared mechanically, and it is unclear which information counts as evidence and which is background context. Reviewing the output may then yield plausible-sounding observations, but it does not tell me what to fix next.
Support triage involves multiple factors: inquiry text, customer status, SLA, past interactions, risk phrasing, and more. So before handing anything to the agent, I need to pin down at least the following three things:
- Hold expected priority in a comparable form such as `immediate`, `normal`, `low`
- Hold expected routing in a comparable form such as `escalate`, `queue`, `auto_reply_candidate`
- For each scenario, define which input items to reference as evidence and which L1 safety constraints must hold
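One way to keep these labels mechanically comparable is to treat them as a closed vocabulary. The following is a minimal sketch, not the series' actual code: the class names `Priority` and `Routing` and the helper `parse_priority` are hypothetical, and only the label values come from the list above.

```python
from enum import Enum


class Priority(str, Enum):
    """Closed set of priority labels (values taken from the article)."""
    IMMEDIATE = "immediate"
    NORMAL = "normal"
    LOW = "low"


class Routing(str, Enum):
    """Closed set of routing labels (values taken from the article)."""
    ESCALATE = "escalate"
    QUEUE = "queue"
    AUTO_REPLY_CANDIDATE = "auto_reply_candidate"


def parse_priority(raw: str) -> Priority:
    """Normalize whitespace and case, then reject anything outside the vocabulary.

    Hypothetical helper: raises ValueError for labels such as "urgent".
    """
    return Priority(raw.strip().lower())


if __name__ == "__main__":
    print(parse_priority(" Normal ") is Priority.NORMAL)
    print(Routing("queue") == "queue")  # str-based enum compares equal to its value
```

With a closed vocabulary, an expected judgment and a proposed judgment can be compared with plain equality instead of reading prose.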
Building only three scenarios
Building many scenarios up front makes data preparation heavy before the evaluation loop even starts. This time I limited it to three representative normal cases.
| ID | Scenario | Expected judgment | Failure mode to watch |
|---|---|---|---|
| A-01 | Typical return/exchange request | Normal priority · normal queue | Whether a routine inquiry is over-escalated |
| A-02 | Minor complaint with strong wording but no actual harm | Normal priority · normal queue | Whether emotional wording alone pushes it toward immediate handling |
| A-03 | Simple how-to or specification question | Low priority · auto-reply candidate | Whether auto-answerable content is unnecessarily bounced back to a human |
These three are not designed to yield a single correct answer. Support decisions vary by company SLA, customer plan, team structure, and history—even for the same inquiry. What I want to fix here is the evaluation template: what the agent should look at, within what scope it should propose, and where it should hand off to a human.
Equipping scenarios with review criteria, not just expected judgments
The scenario JSON includes not only inquiry text and customer info, but also expected priority, expected routing, required evidence, L1 constraints, and review questions. For example, a minor complaint with strong wording carries a safety constraint: do not let surface tone alone trigger immediate escalation, but if legal risk or personal information appears, treat it as L1 and pass to a human.
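To make the shape concrete, here is a rough sketch of what scenario A-02 might look like. Every field name (`inquiry`, `customer`, `expected`, `required_evidence`, `l1_constraints`, `review_questions`) and every value is an assumption made for illustration with synthetic data. It is not the actual schema used in this series.

```python
import json

# Hypothetical scenario A-02. All keys and values are illustrative synthetic data.
scenario_a02 = {
    "id": "A-02",
    "inquiry": {
        "text": "This is unacceptable. The box arrived dented, "
                "although the item itself works fine."
    },
    "customer": {"plan": "standard", "open_tickets": 0},
    "expected": {"priority": "normal", "routing": "queue"},
    # Dotted paths pointing at fields inside this same scenario.
    "required_evidence": ["inquiry.text", "customer.plan"],
    "l1_constraints": [
        "Do not escalate immediately on surface tone alone",
        "If legal risk or personal information appears, pass to a human",
        "No auto-reply when an L1 condition is present",
    ],
    "review_questions": [
        "Which part of the inquiry text was cited as evidence?",
        "Was customer status overweighted?",
    ],
}

if __name__ == "__main__":
    # Serialize the way it might be stored under data/scenarios/.
    print(json.dumps(scenario_a02, indent=2, ensure_ascii=False))
```

Keeping evidence as paths into the scenario, rather than as free text, is one possible design choice that lets a checker confirm later that each cited item really exists.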
With this design, evaluating Hermes Agent output no longer reduces to “did it say escalate?” I can check, in the same format every time, which part of the inquiry text was cited as evidence, whether customer status was overweighted, and whether prohibited content for auto-reply was present.
At this stage I am not evaluating Hermes Agent output itself. The target of this validation is whether the input scenarios can withstand a shared scoring format. Initial agent evaluation comes in the next phase.
What the validation script checks
I plan to use a small homemade validation script for this article. It is not an official CLI, just a compact checker that reads scenario JSON. The input is the synthetic scenarios under data/scenarios/*.json, and the target under inspection is the scenario definition itself, not the agent.
The main items checked are:
- Required keys are present
- Expected priority is one of `immediate`, `normal`, `low`
- Expected routing is one of `escalate`, `queue`, `auto_reply_candidate`
- Auto-reply prohibition conditions are explicitly stated as L1 constraints
- Paths specified as required evidence exist inside the actual scenario
- When text matching L1 is present, the expected judgment assumes human verification
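The checks listed above could be sketched roughly as follows. This is not the actual script. The key names, the `L1_TRIGGER_TERMS` keyword list, and the rule that "assumes human verification" means "routing is not `auto_reply_candidate`" are all assumptions made for illustration. It assumes Python 3.9 or later.

```python
import json
from pathlib import Path

# Hypothetical schema: these key names are assumptions, not the real format.
REQUIRED_KEYS = {"id", "inquiry", "expected", "required_evidence", "l1_constraints"}
PRIORITIES = {"immediate", "normal", "low"}
ROUTINGS = {"escalate", "queue", "auto_reply_candidate"}
# Crude stand-in for L1 text detection; real rules would be richer.
L1_TRIGGER_TERMS = ("lawsuit", "lawyer", "personal information")


def path_exists(data: dict, dotted: str) -> bool:
    """Return True if a dotted path such as 'inquiry.text' resolves inside data."""
    node = data
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def check_scenario(scenario: dict) -> list[str]:
    """Return a list of problems; an empty list means the scenario passes."""
    missing = REQUIRED_KEYS - set(scenario)
    if missing:
        return [f"missing keys: {sorted(missing)}"]  # later checks need these keys
    problems = []
    expected = scenario["expected"] if isinstance(scenario["expected"], dict) else {}
    inquiry = scenario["inquiry"] if isinstance(scenario["inquiry"], dict) else {}
    if expected.get("priority") not in PRIORITIES:
        problems.append(f"invalid priority: {expected.get('priority')!r}")
    if expected.get("routing") not in ROUTINGS:
        problems.append(f"invalid routing: {expected.get('routing')!r}")
    if not any("auto-reply" in str(c).lower() for c in scenario["l1_constraints"]):
        problems.append("no auto-reply prohibition stated in l1_constraints")
    for dotted in scenario["required_evidence"]:
        if not path_exists(scenario, dotted):
            problems.append(f"evidence path not found: {dotted}")
    text = str(inquiry.get("text", "")).lower()
    # Assumption: "human verification" means the routing is not an auto-reply.
    if any(t in text for t in L1_TRIGGER_TERMS) and expected.get("routing") == "auto_reply_candidate":
        problems.append("L1 text present but expected routing allows auto-reply")
    return problems


def check_directory(directory: str = "data/scenarios") -> dict[str, list[str]]:
    """Check every *.json file in the directory and map file name to problems."""
    return {
        path.name: check_scenario(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(directory).glob("*.json"))
    }
```

Returning a list of problems instead of raising on the first one means a single run shows everything that needs fixing in a scenario file.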
At this stage I do not include screenshots or scores in the article. I will solidify the format first, confirm the scenarios are intact, and then move on to the initial Hermes evaluation logs.
Reducing input variance before discussing L1 / L2 / L3
In the Hermes Agent growth log, I plan to treat L1 as fixed rules, L2 as adjustable judgment weights, and L3 as monitoring for repeated outliers. But before I can discuss L1/L2/L3, I need to reduce variance in the input scenarios; otherwise I cannot tell which layer to fix.
Suppose a minor complaint gets escalated immediately. I want to separate whether the weight on strong wording is too high, whether customer status is being overweighted, or whether legal-risk and SLA-breach conditions are themselves ambiguous. If the scenario includes required evidence and safety constraints, I can compare later which part of the output broke.
Conversely, if a scenario is text-only, review tends to stop at impressions like “feels a bit exaggerated” or “probably normal handling is fine.” If I am going to talk about “growing Hermes Agent,” I need to record not impressions but which input changed which judgment.
What comes next
Next time, I will feed these three scenarios to the initial-state Hermes Agent and score the outputs. I will look at more than whether priority and routing match. I will check whether the cited evidence ties to data inside the scenario, whether the stated confidence is excessively high, whether any L1 constraint is violated, and whether the proposal is structured for easy human approval.
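As a rough preview of what that scoring might look like, here is a sketch under heavy assumptions. The proposal format (`priority`, `routing`, `evidence`, `confidence`), the scenario keys, and the 0.9 confidence ceiling are all hypothetical, since the Hermes Agent output format has not been fixed in this article. L1 violations and ease of approval are left to human review here.

```python
def path_exists(data: dict, dotted: str) -> bool:
    """Return True if a dotted path such as 'inquiry.text' resolves inside data."""
    node = data
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def score_proposal(scenario: dict, proposal: dict, max_confidence: float = 0.9) -> dict:
    """Compare a hypothetical agent proposal against one scenario's expectations."""
    expected = scenario["expected"]
    cited = proposal.get("evidence", [])
    return {
        "priority_match": proposal.get("priority") == expected["priority"],
        "routing_match": proposal.get("routing") == expected["routing"],
        # Every cited path must point at data that exists in the scenario.
        "evidence_grounded": bool(cited) and all(path_exists(scenario, p) for p in cited),
        # Required evidence that the proposal never cited.
        "missing_required_evidence": sorted(set(scenario["required_evidence"]) - set(cited)),
        # Arbitrary illustrative ceiling, not a recommended value.
        "confidence_ok": proposal.get("confidence", 0.0) <= max_confidence,
    }


if __name__ == "__main__":
    scenario = {
        "inquiry": {"text": "The box arrived dented."},
        "customer": {"plan": "standard"},
        "expected": {"priority": "normal", "routing": "queue"},
        "required_evidence": ["inquiry.text", "customer.plan"],
    }
    proposal = {
        "priority": "immediate",
        "routing": "queue",
        "evidence": ["inquiry.text"],
        "confidence": 0.97,
    }
    print(score_proposal(scenario, proposal))
```

Recording each check as a separate field, rather than one overall score, keeps it visible which part of the output broke.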
Failure itself is not the problem. Rather, if I can observe where the initial Hermes Agent deviates, I can see which dangerous judgments should be stopped at L1, which judgment biases should be adjusted at L2, and which persistent misclassifications should be monitored at L3.
My takeaway this time is that when evaluating a production-oriented agent, preparation before calling the model matters a lot. Simply aligning input scenarios, expected judgments, evidence, constraints, and review criteria into a small shared format turns the next failure log from mere impressions into actionable improvement records.