Product · Evals & sandbox
Where agent quality gets measured and gated
Evals is the surface where agent output quality, regressions and drift are scored — and where a release can be gated before it ships. The framework, the scorecards and the console exist and are wired. What is not yet live is the part that makes it count on real traffic: a wired session source for sampling, and ordered session replay in the sandbox. We say so plainly here, because it is the next step, not a finished claim.
In the product
The evals console
A genuine screenshot, example data. Scorecards, regression runs, prompt A/B and drift detection over agent output — plus an isolated sandbox for pre- and post-deploy comparison. The data shown is seeded, not from any real session.
What it does
A framework for agent quality
Output-quality monitoring, regression testing and an isolated sandbox — the place where agent behaviour is scored before and after a change.
Scorecards and output-quality monitoring
Score agent output against the checks you define and watch quality over time. The framework, the scorecards and the console are wired; what they measure becomes real once a session source is connected.
Regression testing and prompt A/B
Re-run a suite against a change to catch regressions before they ship, and compare prompt variants A against B on the same inputs — so a change is judged on evidence, not intuition.
Drift detection
Detect when agent output drifts from its expected baseline over time, so quality erosion is surfaced rather than discovered in production.
Isolated sandbox
An isolated test environment for pre- and post-deploy comparison, with session replay. The environment is wired; ordered replay needs an ordered history source, which is the roadmap item described below.
What’s real
The framework and console exist; real sampling and ordered replay are the next step
This surface is the most seam-heavy in the product, so we are blunt about it — the honesty is the feature, not an apology:
- Live: the evals framework, the scorecards, the console, regression runs, prompt A/B and drift detection are built and wired, and the sandbox is an isolated environment for pre/post-deploy comparison.
- Roadmap, not live: real eval sampling needs a wired session source — without one there is no real sampling yet, only the framework around it. And sandbox session replay is degraded today because there is no ordered history source to replay from. Both are a near-term session of work, not a finished capability, and we do not claim them before they ship.
- Posture: the adaptive red-teaming engine is post-v1. For v1 we document the posture with compensating controls rather than overstate an engine that is not here yet.
Evals & sandbox — questions
Can I run evals against my real agent traffic today?
Not yet. The framework, scorecards and console are wired and run against seeded example data, but real eval sampling needs a wired session source — and that source is not connected today. Until it is, there is no real sampling, only the framework around it. Connecting that session source is a near-term session of work, and we do not claim live sampling before it ships.
Does session replay in the sandbox work?
It is degraded today. Replay needs an ordered history source to reconstruct a session in sequence, and that source is not wired yet, so ordered replay is not available. The sandbox itself — the isolated environment for pre- and post-deploy comparison — exists; ordered replay is on the roadmap alongside the session source.
Is there an automated red-teaming engine?
Not in v1. The adaptive red-teaming engine is post-v1. For v1 we document the security posture with compensating controls rather than imply an adaptive engine that is not built yet.
So what is actually usable right now?
The evals framework and console — scorecards, regression runs, prompt A/B and drift detection — plus the isolated sandbox for pre/post-deploy comparison. What they need to act on real traffic is the wired session source and ordered replay, both roadmap. This is the place agent quality and regressions will be measured and gated; the live sampling wiring is the next step.
See where agent quality gets gated
Deploy Olivares on your own infrastructure and explore the evals framework and sandbox — scorecards, regression testing and pre/post-deploy comparison — with real session sampling and ordered replay arriving as the next step.