Product · Evals & sandbox

Where agent quality gets measured and gated

Evals is the surface where agent output quality, regressions and drift are scored — and where a release can be gated before it ships. The framework, the scorecards and the console exist and are wired. What is not yet live is the part that makes it count on real traffic: a wired session source for sampling, and ordered session replay in the sandbox. We say so plainly here, because it is the next step, not a finished claim.

View the repository What’s real today

In the product

The evals console

A genuine screenshot, example data. Scorecards, regression runs, prompt A/B and drift detection over agent output — plus an isolated sandbox for pre- and post-deploy comparison. The data shown is seeded, not from any real session.

Olivares evals console: scorecards for agent output quality, regression-test runs, prompt A/B comparison and drift detection, alongside an isolated sandbox panel for pre/post-deploy comparison and session replay — populated with example data.

What it does

A framework for agent quality

Output-quality monitoring, regression testing and an isolated sandbox — the place where agent behavior is scored before and after a change.

Scorecards and output-quality monitoring

Score agent output against the checks you define and watch quality over time. The framework, the scorecards and the console are wired; what they measure becomes real once a session source is connected.

Regression testing and prompt A/B

Re-run a suite against a change to catch regressions before they ship, and compare prompt variants A against B on the same inputs — so a change is judged on evidence, not intuition.

Drift detection

Detect when agent output drifts from its expected baseline over time, so quality erosion is surfaced rather than discovered in production.

Isolated sandbox

An isolated test environment for pre- and post-deploy comparison, with session replay. The environment is wired; ordered replay needs an ordered history source, which is the roadmap item described below.

What’s real

The framework and console exist; real sampling and ordered replay are the next step

This surface is the most seam-heavy in the product, so we are blunt about it — the honesty is the feature, not an apology:

Live: the evals framework, the scorecards, the console, regression runs, prompt A/B and drift detection are built and wired, and the sandbox is an isolated environment for pre/post-deploy comparison.
Roadmap, not live: real eval sampling needs a wired session source — without one there is no real sampling yet, only the framework around it. And sandbox session replay is degraded today because there is no ordered history source to replay from. Both are near-term work, not a finished capability, and we do not claim them before they ship.
Posture: the adaptive red-teaming engine is post-v1. For v1 we document the posture with compensating controls rather than overstate an engine that is not here yet.

Evals & sandbox — questions

Can I run evals against my real agent traffic today?

Not yet. The framework, scorecards and console are wired and run against seeded example data, but real eval sampling needs a wired session source — and that source is not connected today. Until it is, there is no real sampling, only the framework around it. Connecting that session source is near-term work, and we do not claim live sampling before it ships.

Does session replay in the sandbox work?

It is degraded today. Replay needs an ordered history source to reconstruct a session in sequence, and that source is not wired yet, so ordered replay is not available. The sandbox itself — the isolated environment for pre- and post-deploy comparison — exists; ordered replay is on the roadmap alongside the session source.

Is there an automated red-teaming engine?

Not in v1. The adaptive red-teaming engine is post-v1. For v1 we document the security posture with compensating controls rather than imply an adaptive engine that is not built yet.

So what is actually usable right now?

The evals framework and console — scorecards, regression runs, prompt A/B and drift detection — plus the isolated sandbox for pre/post-deploy comparison. What they need to act on real traffic is the wired session source and ordered replay, both roadmap. This is the place agent quality and regressions will be measured and gated; the live sampling wiring is the next step.

See where agent quality gets gated

Deploy Olivares on your own infrastructure and explore the evals framework and sandbox — scorecards, regression testing and pre/post-deploy comparison — with real session sampling and ordered replay arriving as the next step.

View the repository See the access map