Benchmarks born from real failures

RL Gym

Stress tests for agents that need to work outside the demo.

We turn real-world agent mistakes into structured problem sets, exposing the moments where impressive models lose the plot: brittle planning, weak recovery, unreliable tool use, and bad judgment calls.

Frontier agents ready to evaluate
9 real-world datasets under test
169 tests built from observed failures
36% average score on current benchmark runs

Why test cases

Agents look capable until the task stops being polite.

Our tests are designed to make hidden weaknesses visible. Each case starts with a real agent failure, then becomes a reproducible task with clear expectations, validators, and scoring notes.

Use Cases

Failure modes worth measuring

01

Forecasting

Dataset-driven tasks that start simply, then test whether the agent can carry intent, state, constraints, and intermediate results across many steps.

sequence · memory · state
02

Imbalanced Classification

Situations where the agent needs judgment: ask the missing question, reject the bad premise, and avoid rushing into a plausible but wrong answer.

intent · clarify · refusal
03

Data Quality

Messy inputs with stale facts, corrected requirements, hidden dependencies, and conflicting context to see what the agent keeps, drops, or invents.

context · conflict · stale
04

Probabilistic Classification

Uncertain tasks where tools misfire or evidence is incomplete, revealing whether the agent can verify, adapt, and recover without bluffing.

verify · recover · tooling

Method

Built around problems with real consequences.

We choose datasets and scenarios because failures in these workflows are costly. The point is not to make agents look bad; it is to find the gaps that matter before they show up in production.

01

Start with a dataset that matters.

We focus on data that reflects workflows where agent mistakes create real downstream cost.

02

Define the user, goal, and pressure points.

We frame the task around who is using the data, what they need, and where the agent is likely to drift.

03

Stress the agent from multiple directions.

We vary constraints, context, tools, and ambiguity until the failure pattern becomes obvious.

04

Turn every miss into evidence.

When an agent fails, we capture the trace, conditions, outcome, and scoring signal.
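
As a sketch of what one captured miss might look like as structured data, here is a minimal, illustrative record; the class and field names are assumptions, not a published schema.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCase:
    """Illustrative record of one observed agent miss; not the actual schema."""
    trace: list[dict]        # ordered agent and tool events leading up to the failure
    conditions: dict         # environment, tools, and constraints in effect at the time
    outcome: str             # what the agent actually did or produced
    expected: str            # what a correct run should have produced
    score_signal: float      # validator score attached to the attempt
    tags: list[str] = field(default_factory=list)  # e.g. ["state", "recovery"]
```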

Workflow

From messy scenario to clear scorecard

Every benchmark follows the same loop: choose the scenario, configure the agent, watch the execution, then review the scorecard for failure clusters, validator misses, and recovery quality.

[Screenshot: scenario selection screen]
01

Choose the scenario

Start from a real-world dataset scenario with a task pool, environment, and validators.
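
To make those pieces concrete, here is a minimal sketch of how a scenario bundle could be represented; every name below is an illustrative assumption, not the actual RL Gym format.

```python
# Illustrative scenario bundle; keys and values are assumptions, not the real format.
scenario = {
    "name": "forecasting/demand-horizon",                     # hypothetical scenario id
    "task_pool": [
        {"id": "t-001", "prompt": "...", "expected": "..."},  # tasks built from observed failures
    ],
    "environment": {"tools": ["sql", "calculator"], "max_steps": 40},
    "validators": ["answer_matches", "no_invented_columns"],  # hypothetical validator names
}
print(f"{len(scenario['task_pool'])} task(s), {len(scenario['validators'])} validator(s)")
```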

[Screenshot: run configuration with agent connection, sampling, and validators]
02

Configure the benchmark

Connect the agent you want to test, set sampling and runtime limits, choose validators, and prepare the run.
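
A rough sketch of what that configuration might capture, assuming a plain dictionary; every key, endpoint, and validator name here is illustrative, not the actual configuration format.

```python
# Hypothetical benchmark run configuration; all names below are illustrative.
run_config = {
    "agent": {
        "endpoint": "https://example.com/agent",   # placeholder URL for the agent under test
        "model": "your-model-name",
    },
    "sampling": {"temperature": 0.2, "attempts_per_task": 3},
    "runtime": {"max_steps": 40, "timeout_s": 300, "max_retries": 2},
    "validators": ["answer_matches", "no_invented_columns", "recovered_after_tool_error"],
}
```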

[Screenshot: live run view with metrics, progress, agent state, and event log]
03

Watch the agent under pressure

Observe the live task pool, event stream, retries, latency, and agent state as the model attempts the scenario.
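
Under the hood, watching a run amounts to consuming an ordered event stream. The toy snippet below tallies retries and latency from a few hand-written events; the event fields are assumptions, not the actual log format.

```python
# Toy event stream standing in for a live run; field names are illustrative.
events = [
    {"type": "tool_call", "tool": "sql", "latency_ms": 180, "retry": False},
    {"type": "tool_call", "tool": "sql", "latency_ms": 2100, "retry": True},
    {"type": "agent_state", "state": "replanning"},
    {"type": "answer", "latency_ms": 950},
]

retries = sum(1 for e in events if e.get("retry"))
latencies = [e["latency_ms"] for e in events if "latency_ms" in e]
print(f"retries={retries}, mean latency={sum(latencies) / len(latencies):.0f} ms")
```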

[Screenshot: run report with aggregate score, validator breakdowns, and failure clusters]
04

Read the report card

Turn the run into scores, validator breakdowns, failure clusters, traces, and improvement signals.
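
Conceptually, the report is an aggregation over per-task results. A minimal sketch with made-up results, showing how an aggregate score and failure clusters might be rolled up:

```python
from collections import Counter

# Made-up per-task results; cluster names are illustrative.
results = [
    {"task": "t-001", "passed": True,  "cluster": None},
    {"task": "t-002", "passed": False, "cluster": "stale-context"},
    {"task": "t-003", "passed": False, "cluster": "tool-recovery"},
    {"task": "t-004", "passed": False, "cluster": "stale-context"},
]

score = sum(r["passed"] for r in results) / len(results)
clusters = Counter(r["cluster"] for r in results if not r["passed"])
print(f"aggregate score: {score:.0%}")             # 25%
print("failure clusters:", clusters.most_common())
```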

Next step

Found a failure case in the wild?
Send it our way.

Share the model, task, expected outcome, and the moment the agent went off track.

Submit