WorldModel Gym

Research Benchmark Platform

Benchmark agents with the calm precision of an editorial research studio.

WorldModel Gym turns environments, uploads, traces, and leaderboards into a public-facing benchmark product that feels intentional from the first click.

Sparse-reward environments with reproducible seeds
Public leaderboard slices across test, train, and continual tracks
Browser upload flow alongside API and CLI submission paths

Benchmark narrative

Build a benchmark surface that communicates as well as the experiment itself.

FastAPI + Postgres + S3
Next.js App Router
Real runs live in production

Our benchmark surfaces

Rigorous evaluation, not cherry-picked demo polish.

The site is designed to make benchmark evidence readable: task framing, run uploads, leaderboard slices, and trace inspection all move together as one public story.

Design benchmark tasks with explicit failure modes
Upload metrics, traces, and configs without leaving the product surface
Inspect runs, compare tracks, and share outcomes with one clean URL

Task library

Document environments the way a strong research deck would: clear defaults, precise constraints, and readable benchmark framing.

Browse tasks

Live leaderboards

Compare planning quality, return, and cost in one public surface instead of across notebooks, screenshots, and scattered artifacts.

Open leaderboards
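
As a rough sketch of what comparing slices programmatically could look like, the snippet below queries a hypothetical REST endpoint; the /api/v1/leaderboards path, the track parameter, and the response fields are illustrative assumptions, not a documented WorldModel Gym API.

    import requests

    # Fetch one public leaderboard slice; endpoint and field names are
    # assumptions for illustration, not the platform's documented API.
    resp = requests.get(
        "https://worldmodel-gym.example/api/v1/leaderboards",
        params={"track": "test", "sort": "return", "limit": 10},
        timeout=30,
    )
    resp.raise_for_status()

    # Each entry pairs an agent with its return and planning cost.
    for entry in resp.json()["entries"]:
        print(entry["agent"], entry["return"], entry["planning_cost"])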

Upload studio

Create a run, attach metrics and traces, and publish it from the browser while keeping automation-friendly CLI and API options.

Publish a run
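
For the automation path, here is a minimal sketch of publishing a run over HTTP; the endpoint URLs, bearer-token header, and payload fields are assumptions for illustration, not the real WorldModel Gym API.

    import requests

    # Illustrative run payload; the task id and field names are assumptions.
    run = {
        "task": "gridworld-sparse-v0",
        "seeds": [0, 1, 2],
        "metrics": {"mean_return": 12.4, "planning_budget": 256},
    }

    # Create the run via a hypothetical REST endpoint.
    resp = requests.post(
        "https://worldmodel-gym.example/api/v1/runs",
        headers={"Authorization": "Bearer <token>"},
        json=run,
        timeout=30,
    )
    resp.raise_for_status()

    # Attach a trace artifact to the created run (also hypothetical).
    run_id = resp.json()["id"]
    with open("trace.jsonl", "rb") as f:
        requests.post(
            f"https://worldmodel-gym.example/api/v1/runs/{run_id}/traces",
            headers={"Authorization": "Bearer <token>"},
            files={"trace": f},
            timeout=60,
        )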

Product workflow

A single workflow to create, evaluate, upload, and compare.

Workflow: Create

Design a benchmark brief that feels intentional from the first click.

Shape sparse-reward tasks, defaults, and success criteria before you ever touch a leaderboard. This is the fastest way to move from a research idea to a benchmark someone else can immediately understand.

Prompt

Frame a partially observable benchmark with delayed reward, reproducible seeds, and a planning budget that matches the story you want the leaderboard to tell.
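
One way to make that framing concrete is a small task spec. The sketch below uses an illustrative schema; the field names, the defaults, and the gridworld-sparse-v0 environment id are assumptions, not WorldModel Gym's actual format.

    from dataclasses import dataclass, field

    @dataclass
    class TaskSpec:
        """Illustrative benchmark brief; every field name is an assumption."""
        env_id: str
        observation: str            # "partial" marks partial observability
        reward_delay_steps: int     # how long the sparse reward is withheld
        seeds: list[int] = field(default_factory=lambda: [0, 1, 2, 3, 4])
        planning_budget: int = 256  # max planner expansions per step

    spec = TaskSpec(
        env_id="gridworld-sparse-v0",
        observation="partial",
        reward_delay_steps=50,
    )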

Step 1

Choose an environment with explicit constraints

Step 2

Set defaults that make the benchmark reproducible

Step 3

Carry the task into evaluation and upload flows
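
To show how such a spec could carry into evaluation, here is a minimal sketch that replays the same fixed seed list for every submission; the make_env and agent interfaces are stand-ins, not the platform's SDK, and the TaskSpec is the illustrative one sketched above.

    # Minimal evaluation sketch: replaying a fixed seed list keeps every
    # submission on the same episodes. The env/agent interfaces here are
    # stand-ins, not WorldModel Gym's actual SDK.
    def evaluate(spec, make_env, agent):
        returns = []
        for seed in spec.seeds:
            env = make_env(spec.env_id, seed=seed)  # seed pins the episode
            obs, done, total = env.reset(), False, 0.0
            while not done:
                action = agent.act(obs, budget=spec.planning_budget)
                obs, reward, done = env.step(action)
                total += reward
            returns.append(total)
        # Per-seed returns are what an upload would report as metrics.
        return returns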

Task framing

Task defaults
Observation mode
Reward design

Live benchmark product

Ready to publish planning research without the visual clutter.

Ship new runs, compare them publicly, and use the same benchmark surface in your README, interviews, project portfolio, or research demo.
