---
title: "Can Langfuse Assistant Serve as an Entry Point for Operations Investigation?"
description: "Testing the Langfuse Assistant public beta against existing Sakana Fugu observations and ground truth calculated through the Public API."
lang: "en"
canonical: "https://llm-lab.dev/en/posts/langfuse-assistant-public-beta-verification/"
source: "https://llm-lab.dev/en/posts/langfuse-assistant-public-beta-verification.md"
publishedAt: "2026-06-30"
updatedAt: "2026-06-30"
category: "Operations & Observability"
tags:
  - "langfuse"
  - "llmops"
  - "observability"
  - "mcp"
---

# Can Langfuse Assistant Serve as an Entry Point for Operations Investigation?

import LinkCard from "../../components/LinkCard.astro";

Even after traces accumulate in Langfuse, every incident investigation still requires someone to assemble the right filters, time window, and aggregation dimensions. Having a dashboard is not the same as reaching the facts you need quickly.

On June 19, 2026, Langfuse released the [Langfuse Assistant public beta](https://langfuse.com/changelog/2026-06-19-langfuse-assistant-public-beta). It lets users ask questions about traces, observations, and metrics in natural language, and Langfuse Cloud users can access it without opting into a Feature Preview.

This looked immediately useful.

For operations work, however, the important question is not whether the Assistant produces a plausible summary. It must interpret the time range, target, and aggregation correctly, then return an answer consistent with the underlying data.

I tested this using traces from an earlier Sakana Fugu API experiment. I compared the Assistant's search and aggregation results with ground truth calculated through the Langfuse Public API. The subject here is not another evaluation of Fugu. Fugu simply provides real observations with known results for testing how accurately Langfuse Assistant reads existing observability data.

The source data and original findings are documented in the following post.

<LinkCard
  href="https://llm-lab.dev/en/posts/sakana-fugu-langfuse-experiment/"
  title="Observing the Sakana Fugu API with Langfuse: Understanding Hidden Costs in Multi-Agent Systems"
  description="An experiment running Fugu Level 1–3 tasks and observing latency, tokens, and TTFT in Langfuse."
  siteName=""
  image="/images/posts/sakana-fugu-langfuse-experiment/heroImage.webp"
/>

![The initial Langfuse Assistant screen, showing suggested trace investigations, unusual-pattern exploration, and a natural-language input box](/images/posts/langfuse-assistant-public-beta-verification/assistant-entry.webp)

## What Langfuse Assistant Does

According to the official announcement, the Assistant is built on the Langfuse MCP Server. It receives a natural-language question, uses MCP tools to query traces, observations, and metrics, and answers with that context.

```text
Natural-language question
  ↓
Langfuse Assistant
  ↓
Langfuse MCP Server tools
  ↓
Traces / Observations / Metrics
  ↓
Contextual answer
```

This makes it closer to an agent that queries project observability data than a chat interface that merely explains the dashboard. The official page gives examples such as:

- Which traces had the highest latency yesterday?
- Show failed generations from the last hour.
- Break down this week's token spend by model.

The public beta is available only on Langfuse Cloud and is free during the beta, although pricing may change later. The underlying Langfuse MCP Server provides both read and write tools. When connecting it to an external MCP client, write operations should be restricted with an allowlist when appropriate. The changelog alone does not establish which tools the built-in Assistant permits.

Enabling the Assistant also displayed a confirmation dialog explaining that AI features would be enabled for the organization and that data may be sent to AWS Bedrock in the active data region for processing. Observability data can contain prompts, completions, and metadata, so teams should review their data-handling rules and region before enabling the feature.

![Confirmation dialog explaining that enabling AI features may send data to AWS Bedrock in the active data region](/images/posts/langfuse-assistant-public-beta-verification/assistant-confirm.webp)

## What I Wanted to Verify

A product demo already shows that natural-language questions work. I wanted to verify what happens after that.

| Question | What to verify |
|---|---|
| Five highest-latency Fugu observations | Observation names, descending order, latency, timestamps |
| Non-streaming token usage by level | Level 1–3 classification, counts, input/output/total |
| Usage difference between stream and nonstream | Name-based classification, counts, total-token aggregation |

Relative expressions such as `yesterday` and `this week` are ambiguous. The window they resolve to depends on whether the user's timezone, the project settings, or UTC is applied. A fluent answer is still wrong if it covers the wrong time window.
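To make the ambiguity concrete, the sketch below resolves "yesterday" into absolute UTC bounds for two fixed offsets. The helper name `yesterdayWindow` and the fixed-offset simplification (no daylight-saving handling) are my own assumptions for illustration; this is not how Langfuse resolves relative time.

```javascript
// Resolve "yesterday" into an absolute [from, to) window for a fixed UTC offset.
// Hypothetical helper: fixed offsets only, no daylight-saving handling.
function yesterdayWindow(now, offsetHours) {
  const offsetMs = offsetHours * 60 * 60 * 1000;
  const dayMs = 24 * 60 * 60 * 1000;
  // Shift into local wall-clock time, then floor to local midnight.
  const localNow = now.getTime() + offsetMs;
  const localMidnight = Math.floor(localNow / dayMs) * dayMs;
  // Shift back to UTC to get absolute bounds.
  const to = new Date(localMidnight - offsetMs);
  const from = new Date(localMidnight - dayMs - offsetMs);
  return { from: from.toISOString(), to: to.toISOString() };
}

const now = new Date("2026-06-23T01:00:00Z");
console.log(yesterdayWindow(now, 0)); // 2026-06-22T00:00Z to 2026-06-23T00:00Z
console.log(yesterdayWindow(now, 9)); // 2026-06-21T15:00Z to 2026-06-22T15:00Z
```

With a UTC reading, "yesterday" covers the 22:30–23:45 UTC recording window on June 22. With a UTC+9 reading at the same instant, it ends at 15:00 UTC and misses the experiment entirely.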

## Building Ground Truth from the API

I wrote a verification script that reads the observation list from the Langfuse project used for the Fugu experiment and calculates the expected answer for each question. This is not an official Langfuse CLI. It is a read-only script created for this article. It sends no new traces and only aggregates counts, order, latency, and token usage from existing Fugu observations.

```bash
node scripts/build-ground-truth.mjs
```

I fixed the target window to 22:30–23:45 UTC on June 22, 2026, when the Fugu experiment was recorded. The output stores both the questions and the exact `from` and `to` timestamps used for calculation.
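The script itself is not reproduced here, but the aggregation step can be sketched roughly as follows. The observation shape (`name`, `startTime`, `totalTokens`) and the function names are assumptions for this sketch, not the Langfuse Public API schema, and the sample rows use made-up token counts.

```javascript
// Hypothetical observation shape: { name, startTime, totalTokens }.
// Field names are assumptions for this sketch, not the Langfuse API schema.
const WINDOW = {
  from: "2026-06-22T22:30:00Z",
  to: "2026-06-22T23:45:00Z",
};

// Keep only observations whose start time falls inside the fixed UTC window.
function inWindow(observations, { from, to }) {
  const fromMs = Date.parse(from);
  const toMs = Date.parse(to);
  return observations.filter((o) => {
    const t = Date.parse(o.startTime);
    return t >= fromMs && t <= toMs;
  });
}

// Sum total tokens for non-streaming observations, keyed by level number.
function nonStreamTokensByLevel(observations) {
  const totals = {};
  for (const o of observations) {
    const m = /^level(\d+)-nonstream-/.exec(o.name);
    if (!m) continue;
    totals[m[1]] = (totals[m[1]] ?? 0) + (o.totalTokens ?? 0);
  }
  return totals;
}

// Illustrative rows with made-up token counts, not the experiment's data.
const sample = [
  { name: "level1-nonstream-1", startTime: "2026-06-22T22:31:00Z", totalTokens: 100 },
  { name: "level1-stream-1", startTime: "2026-06-22T22:32:00Z", totalTokens: 0 },
  { name: "level2-nonstream-1", startTime: "2026-06-23T00:10:00Z", totalTokens: 300 },
];
// Only the first row is both inside the window and non-streaming.
console.log(nonStreamTokensByLevel(inWindow(sample, WINDOW)));
```

Storing the absolute `from` and `to` values next to the result is what makes the later comparison with Assistant meaningful: both sides are asked about the same window.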

The API produced the following expected values.

| Check | Expected value calculated from the API |
|---|---|
| Highest latency | Four Level 3 observations occupy the top four positions; `level3-stream-1` is highest at 160.096 seconds |
| Non-streaming tokens by level | Level 1: 1,360; Level 2: 4,581; Level 3: 17,111 tokens |
| Stream/nonstream usage gap | Eight generations each; stream: 0, nonstream: 23,052 tokens |

These values come from actual Level 1–3 prompts sent to Fugu. The earlier article found that Level 3 sharply increased both latency and token usage, while streaming observations recorded usage as zero. The goal here was to see whether Assistant could retrieve not only those trends but also the exact counts and values.

## Asking Assistant the Same Questions

After calculating the ground truth, I asked Assistant with an explicit time range and timezone. I avoided relative expressions and used the same absolute UTC window as the API aggregation.

### Five Highest-Latency Observations

```text
Use UTC. From 2026-06-22T22:30:00Z to 2026-06-22T23:45:00Z,
show the top 5 Fugu observations by latency with observation name, latency, and timestamp.
```

I initially expected this to work. Instead, Assistant interpreted `Fugu` as an observation-name filter rather than the operational label for the experiment. It returned zero observations named `Fugu`, then reported that names such as `smoke-test-hello` and the `level1-*` through `level3-*` series existed in the same time range.

This was not simply a failure. Assistant did not fabricate data. It checked the names that actually existed and asked for a corrected condition. At the same time, the unit users call the “Fugu experiment” was not automatically connected to its Langfuse observation names.

![Assistant interpreted Fugu as an observation name, returned zero matches, and suggested the actual level-series names](/images/posts/langfuse-assistant-public-beta-verification/fugu-name-mismatch.webp)

I then expressed the target using names from the data model.

```text
Use UTC. From 2026-06-22T22:30:00Z to 2026-06-22T23:45:00Z,
show the top 5 observations by latency where the observation name starts with
level1-, level2-, or level3-.
Return observation name, latency in seconds, and timestamp.
Sort by latency descending.
```

The API showed `level3-stream-1` at 160.096 seconds, with four Level 3 observations occupying the top four positions. Assistant first tried to query each prefix, then discovered that the name filter only supported `any of` and `none of`. It replaced the prefixes with exact observation names from the earlier name list, excluded trace-root rows with null latency, and returned the top five.

The observation names, order, latency, and timestamps all matched the API ground truth. Even though prefix filtering was unavailable, Assistant completed the investigation by combining the available filter with names obtained earlier in the conversation. That behavior goes beyond a single natural-language-to-query translation.
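The workaround Assistant arrived at can be sketched as two small steps: expand the prefixes into exact names, then rank the matching rows. This is my reconstruction of the behavior, not Assistant's actual tool calls; the function names and the `name` and `latency` fields are assumptions.

```javascript
// Expand prefixes into exact names, for a filter that only accepts
// exact-match name lists ("any of").
function expandPrefixes(knownNames, prefixes) {
  return knownNames.filter((name) => prefixes.some((p) => name.startsWith(p)));
}

// Top N by latency, descending. Rows with null latency (such as trace-root
// rows) are skipped so they cannot distort the ranking.
function topByLatency(observations, allowedNames, n) {
  const allowed = new Set(allowedNames);
  return observations
    .filter((o) => allowed.has(o.name) && o.latency != null)
    .sort((a, b) => b.latency - a.latency)
    .slice(0, n);
}

const names = ["smoke-test-hello", "level1-nonstream-1", "level3-stream-1"];
// Keeps the two level-series names and drops smoke-test-hello.
console.log(expandPrefixes(names, ["level1-", "level2-", "level3-"]));
```

Excluding null-latency rows explicitly matters: whether they would sort first, last, or unpredictably depends on the comparison used, so filtering them out is safer than relying on sort behavior.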

![Assistant switched from unsupported prefix filtering to exact observation names and returned the five highest-latency observations](/images/posts/langfuse-assistant-public-beta-verification/high-latency-answer.webp)

### Token Usage by Level

```text
Use UTC. From 2026-06-22T22:30:00Z to 2026-06-22T23:45:00Z,
summarize total token usage for non-streaming Fugu observations,
grouped by level 1, 2, and 3.
```

The API result was 1,360 tokens across three Level 1 observations, 4,581 across three Level 2 observations, and 17,111 across two Level 3 observations. After explaining again that no observation was literally named `Fugu`, Assistant switched to the known `level1-nonstream-*` through `level3-nonstream-*` observations and returned all three level values correctly.

However, it reported the final total as `22,052` tokens. The correct sum is `23,052`. The individual aggregates matched the API, but the final elementary addition dropped 1,000 tokens.

> The mistake came in the simplest step.

This result is a concise reason not to trust the visual polish of an Assistant-generated table. Search and grouping can be correct while an arithmetic error enters during answer generation. For operational decisions, totals should be calculated by a tool or at least checked against their components.
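A check of this kind takes only a few lines. The sketch below recomputes the total from the per-level values and compares it with the figure Assistant reported; `checkTotal` is a hypothetical helper name.

```javascript
// Recompute a total from its components and compare it with the reported figure.
function checkTotal(components, reportedTotal) {
  const computed = Object.values(components).reduce((sum, v) => sum + v, 0);
  return { computed, reportedTotal, matches: computed === reportedTotal };
}

// Per-level values from the API ground truth; 22,052 is the total Assistant reported.
const perLevel = { level1: 1360, level2: 4581, level3: 17111 };
console.log(checkTotal(perLevel, 22052));
// → { computed: 23052, reportedTotal: 22052, matches: false }
```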

![Assistant returned the correct token count for every level but an incorrect total of 22,052; the correct total is 23,052](/images/posts/langfuse-assistant-public-beta-verification/token-by-level-answer.webp)

### Missing Usage on Streaming Observations

```text
Use UTC. From 2026-06-22T22:30:00Z to 2026-06-22T23:45:00Z,
compare token usage recorded for Fugu observations whose names contain stream and nonstream.
Report counts and total tokens for each group.
```

When counting token-bearing generations, the API result contained eight stream and eight nonstream generations. Stream recorded zero total tokens, while nonstream recorded 23,052. Assistant counted 16 rows in each group because the observation-list result also included placeholder rows, but explicitly explained that token-bearing generations were half that count. The actual generation counts and token totals therefore matched the API result.

More importantly, Assistant did not stop at displaying zero tokens for stream. It suspected that streaming usage had not been captured and suggested reviewing the relevant usage options or explicitly finalizing the generation with usage details. The earlier Fugu experiment had observed the same problem: only the streaming observations recorded zero usage. Assistant identified that operational anomaly from the existing data and proposed the next investigation.

This does not mean streaming was free. It means usage was not recorded on the streaming path in this instrumentation setup. Counts also change depending on whether placeholder rows are included, so an answer must state what it considers one observation.
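The counting rule can be made explicit in code. In this sketch a row counts as a generation only when its `type` is `GENERATION`, and everything else is treated as a placeholder; that rule, the row shape (`name`, `type`, `totalTokens`), and the `usageMissing` flag are assumptions of mine rather than Assistant's logic.

```javascript
// Classify by name. "nonstream" contains "stream", so test the longer token first.
function streamGroup(name) {
  if (name.includes("nonstream")) return "nonstream";
  if (name.includes("stream")) return "stream";
  return null;
}

// Hypothetical row shape: { name, type, totalTokens }.
// Assumption: only rows with type "GENERATION" carry token usage.
function compareGroups(rows) {
  const result = {
    stream: { rows: 0, generations: 0, totalTokens: 0 },
    nonstream: { rows: 0, generations: 0, totalTokens: 0 },
  };
  for (const row of rows) {
    const group = streamGroup(row.name);
    if (!group) continue;
    result[group].rows += 1;
    if (row.type === "GENERATION") {
      result[group].generations += 1;
      result[group].totalTokens += row.totalTokens ?? 0;
    }
  }
  // Generations with zero recorded tokens suggest missing instrumentation,
  // not free requests.
  for (const group of Object.values(result)) {
    group.usageMissing = group.generations > 0 && group.totalTokens === 0;
  }
  return result;
}
```

Reporting `rows` and `generations` separately avoids the 16-versus-8 confusion, and the `usageMissing` flag turns "stream recorded zero tokens" into a prompt to inspect the streaming path.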

![Assistant explained the placeholder-inclusive counts and reported nonstream at 23,052 tokens, stream at zero, with a suggestion to inspect streaming usage capture](/images/posts/langfuse-assistant-public-beta-verification/streaming-usage-gap-answer.webp)

## Comparing Assistant with the API Ground Truth

For the three questions, the high-latency and stream/nonstream results matched the API ground truth. Token usage by level matched at every level, but the final total was wrong.

| Check | API ground truth | Assistant answer | Result |
|---|---|---|---|
| Five highest-latency observations | Maximum 160.096 seconds; Level 3 occupies the top four | Same order and values | Match |
| Tokens by level | 1,360 / 4,581 / 17,111; total 23,052 | Per-level values match; total incorrectly reported as 22,052 | Partial match |
| Stream / nonstream | Eight actual generations each; 0 / 23,052 tokens | 16 rows each including placeholders, half are actual generations; token totals match | Match |

The investigation encountered two separate scope problems: the selected project and the experiment name used in natural language. Assistant searches within the currently open project, so the same query returns nothing in the wrong project. Even in the correct project, an operational label such as `Fugu` does not become a useful filter unless it maps explicitly to an observation name, metadata field, or tag.

This has a practical consequence. Making Assistant useful requires more than better questions. Observation names, metadata, and tags should reflect the units people use when discussing the system.

## Using Assistant in Three Stages

I would divide operational use into three stages.

### Exploration

Ask which executions were slow or whether failures increased, then narrow the investigation target. Natural language is valuable here because users can enter observability data without remembering dashboard columns and filter syntax.

### Verification

Check the time range, count, model, units, and aggregation reported by Assistant against the UI or API. A zero-result answer also needs interpretation: did no data exist, or did a filter or scope mismatch exclude it?
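One way to interpret a zero-result answer is to compare the filtered count with the unfiltered count for the same window. The helper below is a hypothetical sketch of that check, using the `Fugu` name mismatch from this test as the example.

```javascript
// Distinguish "nothing happened" from "the filter matched nothing".
// rowsInWindow: every row in the time window, before any name filter.
function explainResult(rowsInWindow, predicate) {
  if (rowsInWindow.length === 0) return "no-data-in-window";
  const matched = rowsInWindow.filter(predicate);
  return matched.length === 0 ? "filter-excluded-all" : "matched";
}

const windowRows = [{ name: "smoke-test-hello" }, { name: "level1-nonstream-1" }];
console.log(explainResult(windowRows, (o) => o.name === "Fugu"));
// → filter-excluded-all
```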

### Automation

Recurring notifications, SLO decisions, billing stops, and retries should use Monitors, the Public API, or fixed queries rather than natural-language answers. Assistant is useful for interactive exploration; reproducible systems should handle recurring evaluation of the same conditions.

This division does not conflict with the Langfuse morning briefing or the Monitors I built earlier. Assistant is an entry point for a person investigating an unknown anomaly. Monitors evaluate known thresholds. The morning briefing summarizes multiple conditions on a fixed schedule.

## Design Implications Visible in the Public Beta

Beyond correctness, the actual answers demonstrated that Assistant can:

1. Work with an explicitly specified time range and timezone.
2. Keep traces and observations at the requested granularity.
3. Return input, output, and total tokens in a consistent unit.
4. Avoid fabricating results when the requested name does not exist.

Questions remain about whether it always returns identifiers or links to the underlying data, and whether repeated runs remain stable when the data itself has not changed.

An operations investigation agent needs more than persuasive prose. A human must be able to trace which data it queried, under what conditions, and how it aggregated the result. With that visibility, Assistant becomes a new interface into observability data rather than merely a replacement for the dashboard.

## Conclusion

The Langfuse Assistant public beta adds a natural-language entry point for exploring traces, observations, and metrics in Langfuse Cloud. Its use of the Langfuse MCP Server is also notable: this is not merely a help chatbot, but a tool-using interface over observability data.

Operational use still requires evaluating time range, target, aggregation method, and consistency with source data rather than trusting fluent answers. In these three fixed questions, search and grouping mostly matched the Public API ground truth, but the final elementary addition still produced an error. This experiment does not prove accuracy for arbitrary questions or larger datasets. My current conclusion is that Assistant is useful as an entry point for exploration, but it should not replace automated monitoring or authoritative calculations.

Keeping an API result or fixed query available for comparison moves the evaluation beyond “natural-language questions worked” toward “we know under which conditions this is usable for operations.” When testing the public beta, avoid ambiguous relative time expressions and start with a small dataset whose correct answer is already known.
