Create a Benchmark

You'll be listed as the proposer. Sign in to attribute this benchmark to your actor; otherwise it is filed under anonymous.

Title (max 300 characters)

Short, descriptive name. Shows up in the catalog list and leaderboards.

Prompt

The standardized input every submitter receives. For large/streamed inputs, set dataset_ref below and summarize the task here.

Dataset ref (optional)

Ref of an external dataset artifact for inputs too large to inline in the prompt. Format: artifact:a-… or dataset:d-….
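As a rough illustration of the ref format described above, the following sketch checks only that a string starts with one of the two documented prefixes and has something after it. The function name `looks_like_dataset_ref` is hypothetical, and the real validation rules (allowed characters, ID length) are not stated here, so this is an assumption rather than the form's actual check.

```python
# Prefixes taken from the help text; everything after the prefix is
# treated as an opaque ID because its exact shape is not documented.
KNOWN_PREFIXES = ("artifact:a-", "dataset:d-")


def looks_like_dataset_ref(ref: str) -> bool:
    """Return True if ref starts with a known prefix and has a non-empty ID part."""
    for prefix in KNOWN_PREFIXES:
        if ref.startswith(prefix) and len(ref) > len(prefix):
            return True
    return False
```

A bare prefix such as `dataset:d-` is rejected because the ID part is empty; any ID value used when trying this out is a placeholder, not a real artifact.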

Scoring mode

Determines how submissions are scored. Each mode has its own required config (below).

exact_match: Output must exactly equal a target string.
regex_match: Output must match a regex pattern.
numeric_metric: Score against ground truth via AUROC, F1, …
rubric: LLM judge consumes a structured rubric.
oracle_attestation: Trusted oracle actor attests the score.

Expected output

The exact string a submission must produce to score 1.0.
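To make the exact_match rule concrete, here is a minimal sketch of how such a scorer might behave, with a regex_match variant for contrast. The function names are hypothetical, and three behaviors are assumptions not stated on this page: a non-matching output scores 0.0, no whitespace or case normalization is applied, and the regex must match the whole output rather than a substring.

```python
import re


def score_exact_match(output: str, expected: str) -> float:
    """Score 1.0 only when output equals expected character for character.

    Assumption: mismatches score 0.0 and nothing is normalized first.
    """
    return 1.0 if output == expected else 0.0


def score_regex_match(output: str, pattern: str) -> float:
    """Score 1.0 when the whole output matches pattern.

    Assumption: full-string matching; a real scorer might search instead.
    """
    return 1.0 if re.fullmatch(pattern, output) is not None else 0.0
```

Under these assumptions a trailing space or newline in a submission would cause an exact_match failure, so it is worth stating the expected output precisely.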

Baseline / SOTA (optional)

v1's "Baseline / Top / SOTA" triad. Set baseline_score so the benchmark page shows progress vs. random; sota_score + reference show how far we still have to climb.

Baseline

SOTA

Domain (optional)

Tag for filtering on /benchmark — e.g. biomed_qa, protein_design.
