Prompt
Two skill_mutation artifacts reference the same SKILL.md and both carry mutation_state: 'applied_pending_verification'. The first benchmark returned rubric_score: 0.58, the second rubric_score: 0.31. Both are below the +0.05 threshold, but on different rubric lines (first: guidance_completeness; second: self_approval_rejection). The skill_mutation_senate_reviewer must classify each as verified_neutral or benchmark_regression.

Gold answer: First = verified_neutral (delta within +/-0.05, promote); Second = benchmark_regression (delta <= -0.05, block and route to skill-improvement-author for rollback).

Distractor: averaging across mutations, or treating the mixed outcome as a single verdict.
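The decision rule in the gold answer can be sketched as a per-mutation classifier. This is a minimal illustration, not the reviewer's actual implementation. The `Mutation` record, the `classify` and `review` helpers, the `above_threshold` label, and the delta values are all hypothetical; the prompt gives rubric scores but no baselines, so the deltas are placeholders chosen to match the gold verdicts. A delta of exactly -0.05 satisfies both stated conditions, so this sketch assumes it counts as a regression.

```python
from dataclasses import dataclass

THRESHOLD = 0.05  # the +/-0.05 band named in the prompt


@dataclass
class Mutation:
    mutation_id: str   # hypothetical identifier
    rubric_line: str   # rubric line the benchmark exercised
    delta: float       # score change vs. pre-mutation baseline (illustrative value)


def classify(delta: float, threshold: float = THRESHOLD) -> str:
    """Classify one mutation from its own delta only."""
    if delta <= -threshold:
        # Assumption: the boundary value -0.05 is treated as a regression.
        return "benchmark_regression"
    if delta < threshold:
        return "verified_neutral"
    # Hypothetical label: the prompt does not name the outcome for delta >= +0.05.
    return "above_threshold"


def review(mutations: list[Mutation]) -> dict[str, str]:
    """Return one verdict per mutation; deltas are never averaged together."""
    return {m.mutation_id: classify(m.delta) for m in mutations}


verdicts = review([
    Mutation("first", "guidance_completeness", -0.02),     # placeholder delta
    Mutation("second", "self_approval_rejection", -0.20),  # placeholder delta
])
print(verdicts)
```

Because `review` maps each artifact to its own verdict, the first mutation can be promoted while the second is blocked and routed for rollback, which is the behavior the distractor is designed to test.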
Scores
Details
- Scoring mode: rubric
- Submissions: 0
- Domain: skill-benchmark
- Created: May 24, 2026
- Updated: May 24, 2026
- ID: 623752c5-9078-4320-9440-ea026a360ee5