WorldModel Gym

Research Benchmark Platform

Benchmark agents with the calm precision of an editorial research studio.

WorldModel Gym turns environments, uploads, traces, and leaderboards into a public-facing benchmark product that feels intentional from the first click.

Sparse-reward environments with reproducible seeds
Public leaderboard slices across test, train, and continual tracks
Browser upload flow alongside API and CLI submission paths

Benchmark narrative

Build a benchmark surface that communicates as well as the experiment itself.

FastAPI + Postgres + S3
Next.js App Router
Real runs live in production

Our benchmark surfaces

Rigorous evaluation, not cherry-picked demo polish.

The site is designed to make benchmark evidence readable: task framing, run uploads, leaderboard slices, and trace inspection all move together as one public story.

Design benchmark tasks with explicit failure modes
Upload metrics, traces, and configs without leaving the product surface
Inspect runs, compare tracks, and share outcomes with one clean URL

Task library

Document environments the way a strong research deck would: clear defaults, precise constraints, and readable benchmark framing.

Browse tasks

Live leaderboards

Compare planning quality, return, and cost in one public surface instead of across notebooks, screenshots, and scattered artifacts.

Open leaderboards
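
As a rough sketch of what comparing slices programmatically could look like, the snippet below queries a hypothetical REST endpoint; the /api/v1/leaderboards path, the track parameter, and the response fields are illustrative assumptions, not a documented WorldModel Gym API.

    import requests

    # Fetch one public leaderboard slice; endpoint and field names are
    # assumptions for illustration, not the platform's documented API.
    resp = requests.get(
        "https://worldmodel-gym.example/api/v1/leaderboards",
        params={"track": "test", "sort": "return", "limit": 10},
        timeout=30,
    )
    resp.raise_for_status()

    # Each entry pairs an agent with its return and planning cost.
    for entry in resp.json()["entries"]:
        print(entry["agent"], entry["return"], entry["planning_cost"])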

Upload studio

Create a run, attach metrics and traces, and publish it from the browser while keeping automation-friendly CLI and API options.

Publish a run
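
For the automation path, here is a minimal sketch of publishing a run over HTTP; the endpoint URLs, bearer-token header, and payload fields are assumptions for illustration, not the real WorldModel Gym API.

    import requests

    # Illustrative run payload; the task id and field names are assumptions.
    run = {
        "task": "gridworld-sparse-v0",
        "seeds": [0, 1, 2],
        "metrics": {"mean_return": 12.4, "planning_budget": 256},
    }

    # Create the run via a hypothetical REST endpoint.
    resp = requests.post(
        "https://worldmodel-gym.example/api/v1/runs",
        headers={"Authorization": "Bearer <token>"},
        json=run,
        timeout=30,
    )
    resp.raise_for_status()

    # Attach a trace artifact to the created run (also hypothetical).
    run_id = resp.json()["id"]
    with open("trace.jsonl", "rb") as f:
        requests.post(
            f"https://worldmodel-gym.example/api/v1/runs/{run_id}/traces",
            headers={"Authorization": "Bearer <token>"},
            files={"trace": f},
            timeout=60,
        )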

Product workflow

A single workflow to create, evaluate, upload, and compare.

Workflow: Create

Design a benchmark brief that feels intentional from the first click.

Shape sparse-reward tasks, defaults, and success criteria before you ever touch a leaderboard. This is the fastest way to move from a research idea to a benchmark someone else can immediately understand.

Prompt

Frame a partially observable benchmark with delayed reward, reproducible seeds, and a planning budget that matches the story you want the leaderboard to tell.
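
One way to make that framing concrete is a small task spec. The sketch below uses an illustrative schema; the field names, the defaults, and the gridworld-sparse-v0 environment id are assumptions, not WorldModel Gym's actual format.

    from dataclasses import dataclass, field

    @dataclass
    class TaskSpec:
        """Illustrative benchmark brief; every field name is an assumption."""
        env_id: str
        observation: str            # "partial" marks partial observability
        reward_delay_steps: int     # how long the sparse reward is withheld
        seeds: list[int] = field(default_factory=lambda: [0, 1, 2, 3, 4])
        planning_budget: int = 256  # max planner expansions per step

    spec = TaskSpec(
        env_id="gridworld-sparse-v0",
        observation="partial",
        reward_delay_steps=50,
    )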

Step 1

Choose an environment with explicit constraints

Step 2

Set defaults that make the benchmark reproducible

Step 3

Carry the task into evaluation and upload flows
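
To show how such a spec could carry into evaluation, here is a minimal sketch that replays the same fixed seed list for every submission; the make_env and agent interfaces are stand-ins, not the platform's SDK, and the TaskSpec is the illustrative one sketched above.

    # Minimal evaluation sketch: replaying a fixed seed list keeps every
    # submission on the same episodes. The env/agent interfaces here are
    # stand-ins, not WorldModel Gym's actual SDK.
    def evaluate(spec, make_env, agent):
        returns = []
        for seed in spec.seeds:
            env = make_env(spec.env_id, seed=seed)  # seed pins the episode
            obs, done, total = env.reset(), False, 0.0
            while not done:
                action = agent.act(obs, budget=spec.planning_budget)
                obs, reward, done = env.step(action)
                total += reward
            returns.append(total)
        # Per-seed returns are what an upload would report as metrics.
        return returns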

Task framing

Task defaults
Observation mode
Reward design

Live benchmark product

Ready to publish planning research without the visual clutter.

Ship new runs, compare them publicly, and use the same benchmark surface in your README, interviews, project portfolio, or research demo.
