Short, descriptive name. Shows up in the catalog list and leaderboards.

The standardized input every submitter receives. For large/streamed inputs, set dataset_ref below and summarize the task here.

Reference to an external dataset artifact, for inputs too large to inline in the prompt. Format: artifact:a-… or dataset:d-….
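As a rough illustration of the two documented prefixes, the sketch below checks whether a string looks like a dataset_ref. The helper name and the sample IDs are hypothetical. The form elides what follows each prefix, so the check only requires a non-empty suffix; the real validation rules may be stricter.

```python
# Hypothetical helper, not part of the benchmark service.
# Only the two prefixes are documented; the suffix format is elided in the
# source, so this sketch assumes any non-empty suffix is acceptable.
ALLOWED_PREFIXES = ("artifact:a-", "dataset:d-")


def looks_like_dataset_ref(ref: str) -> bool:
    """Return True if ref starts with a documented prefix and has a suffix."""
    return any(
        ref.startswith(prefix) and len(ref) > len(prefix)
        for prefix in ALLOWED_PREFIXES
    )


# Placeholder IDs for illustration only.
print(looks_like_dataset_ref("dataset:d-example"))  # prefix plus suffix
print(looks_like_dataset_ref("dataset:d-"))         # prefix only, rejected
```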

Scoring mode

Determines how submissions are scored. Each mode has its own required config (below).

The exact string a submission must produce to score 1.0.
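A minimal sketch of what exact-string scoring could look like. The function name is hypothetical, and two behaviors are assumptions rather than documented facts: a non-matching submission scores 0.0, and no whitespace or case normalization is applied before comparing.

```python
def exact_match_score(output: str, expected: str) -> float:
    """Score 1.0 only when the submission equals the expected string.

    Assumptions: a mismatch scores 0.0, and the comparison is strict
    (no trimming, no case folding).
    """
    return 1.0 if output == expected else 0.0


print(exact_match_score("42", "42"))
print(exact_match_score("42\n", "42"))  # trailing newline counts as a mismatch
```

If the scorer does normalize whitespace, the expected string entered here should be written in its normalized form.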

Baseline / SOTA (optional)

Carries over v1's "Baseline / Top / SOTA" triad. Set baseline_score so the benchmark page shows progress over random; sota_score and reference show how far submissions still have to climb.
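One way to read these two numbers together is as the fraction of the baseline-to-SOTA gap that a submission has closed. The sketch below is an assumption about how such progress could be computed, not a description of what the benchmark page actually does. The function name is hypothetical.

```python
def gap_closed(score: float, baseline_score: float, sota_score: float) -> float:
    """Fraction of the baseline-to-SOTA gap covered by score.

    0.0 means the submission sits at the baseline, 1.0 means it matches SOTA.
    Values outside [0, 1] are possible and are left unclamped here.
    """
    if sota_score == baseline_score:
        raise ValueError("sota_score and baseline_score must differ")
    return (score - baseline_score) / (sota_score - baseline_score)


print(gap_closed(0.5, baseline_score=0.25, sota_score=0.75))
```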

Tag for filtering on /benchmark — e.g. biomed_qa, protein_design.
