Benchmark Methodology v1.0.0
Superseded · Published March 31, 2026 · Superseded by v1.1.0
Initial benchmark methodology. Established the composite scoring formula, six universal dimensions, and four initial domain templates.
This version has been superseded. See the current methodology (v1.3.0).
Initial Release
Score Calculation
Required Tests (All Agents)
1. Core workflow task: "Given a set of information, organize it, take an action based on it, and produce a summary of what was done." Adapted to the agent's domain.
2. False-premise probe: Ask about something partially in scope but with a wrong or unverifiable premise. Desired behavior: decline or ask for clarification. Failure: a confident answer with invented specifics.
3. Broken-input test: Give the agent a deliberately broken or contradictory input. Desired behavior: detect the problem, explain it clearly, and suggest an alternative. Required for an honest U4 score.
Universal Dimensions
Six capabilities tested across all agents regardless of domain. Each scored 1–5. Raw sum max = 30. Normalized: sum / 30 × 100.
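The normalization above can be sketched in a few lines. This is a minimal illustration, not the methodology's implementation; the function name and input validation are assumptions of this sketch.

```python
def universal_score(dimension_scores):
    """Normalize six 1-5 dimension scores (U1-U6) to a 0-100 scale.

    Raw sum max = 6 dimensions * 5 points = 30;
    normalized score = sum / 30 * 100.
    """
    if len(dimension_scores) != 6:
        raise ValueError("expected exactly six dimension scores (U1-U6)")
    if any(not 1 <= s <= 5 for s in dimension_scores):
        raise ValueError("each dimension score must be between 1 and 5")
    return sum(dimension_scores) / 30 * 100

# Example: scores of 5, 4, 4, 3, 4, 5 sum to 25.
print(round(universal_score([5, 4, 4, 3, 4, 5]), 1))  # → 83.3
```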
U1 · Task Completion: Can the agent finish a clearly defined task end to end?

| Score | Description |
| --- | --- |
| 5 | Completes fully with correct output, no intervention needed |
| 4 | Completes with minor imperfections but output is usable |
| 3 | Completes partially — key parts missing or wrong |
| 2 | Starts the task but fails or stalls midway |
| 1 | Cannot begin or immediately fails |
U2 · Instruction Interpretation: How accurately does the agent understand natural language input?

| Score | Description |
| --- | --- |
| 5 | Perfectly interprets both vague and precise instructions |
| 4 | Handles precise instructions well; minor misreads on vague ones |
| 3 | Needs clarification on most non-trivial instructions |
| 2 | Frequently misinterprets intent even with clear phrasing |
| 1 | Cannot parse instructions meaningfully |
U3 · Multi-Step Reasoning: Can the agent chain dependent actions across 3+ steps?

Measures multi-step reasoning within a single message or session. Does not measure multi-turn memory across separate sessions.

| Score | Description |
| --- | --- |
| 5 | Executes all steps in correct order, maintains full context |
| 4 | Completes all steps but loses minor context between them |
| 3 | Completes most steps but breaks on dependencies |
| 2 | Can do individual steps but cannot chain them |
| 1 | Cannot handle more than a single action |
U4 · Error Handling: What happens when something goes wrong mid-task?

Requires at least one deliberate invalid or contradictory input per benchmark session.

| Score | Description |
| --- | --- |
| 5 | Detects error, explains it clearly, and recovers or offers an alternative |
| 4 | Detects error and communicates it but needs human help to recover |
| 3 | Detects error but response is vague or unhelpful |
| 2 | Fails silently or produces misleading output |
| 1 | Crashes, hangs, or produces destructive results |
U5 · Autonomy: How much human intervention is required per task?

| Score | Description |
| --- | --- |
| 5 | Fully autonomous — zero interventions needed |
| 4 | 1 minor intervention (confirmation, not correction) |
| 3 | 2–3 interventions, including at least one correction |
| 2 | Requires hand-holding at most steps |
| 1 | Cannot proceed without continuous human guidance |
U6 · Output Quality: Is the final deliverable correct, useful, and well-formed?

| Score | Description |
| --- | --- |
| 5 | Production-ready — no editing needed |
| 4 | Correct and usable — minor trimming or formatting needed |
| 3 | Roughly correct — significant editing or restructuring needed |
| 2 | Output exists but is largely wrong or unusable |
| 1 | No meaningful output produced |
Domain Templates (4)
Each agent is tested against 5 scenarios tailored to its category. Each scenario is scored on 3 criteria (max 15). Domain score = average of all 5 scenario scores (0–100).
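The domain score described above can be sketched as follows. One assumption is made explicit here: since each scenario has a max raw score of 15 and the domain score is stated on a 0–100 scale, each scenario is normalized as criteria sum / 15 × 100 before averaging. The function name and validation are illustrative.

```python
def domain_score(scenario_criteria):
    """Compute the 0-100 domain score from five scenario results.

    Each scenario is scored on three criteria (1-5 each, max raw 15).
    Assumed normalization: scenario score = criteria sum / 15 * 100;
    domain score = mean of the five normalized scenario scores.
    """
    if len(scenario_criteria) != 5:
        raise ValueError("expected exactly five scenarios")
    normalized = []
    for criteria in scenario_criteria:
        if len(criteria) != 3 or any(not 1 <= c <= 5 for c in criteria):
            raise ValueError("each scenario needs three 1-5 criterion scores")
        normalized.append(sum(criteria) / 15 * 100)
    return sum(normalized) / len(normalized)

# Example: five scenarios each scoring 3 on all three criteria (9/15).
print(domain_score([[3, 3, 3]] * 5))  # → 60.0
```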
Customer Service (customer_service_v1)
Examples: Intercom Fin, Zendesk AI, Freshdesk Freddy
Criteria per scenario: Accuracy, Completeness, Usefulness (each 1–5, max 15)
Automation & Workflow (automation_workflow_v1)
Examples: Make, Zapier, n8n
Criteria per scenario: Accuracy, Completeness, Usefulness (each 1–5, max 15)
Coding & Development (coding_development_v1)
Examples: GitHub Copilot, Cursor, Devin
Criteria per scenario: Correctness, Completeness, Code Quality (each 1–5, max 15)
Data Analysis (data_analysis_v1)
Examples: Hex, Julius, ThoughtSpot
Criteria per scenario: Query Accuracy, Visualization Fit, Insight Quality (each 1–5, max 15)
Repeatability
All benchmarks under this version use a single-session evaluation — each question is asked once. Multi-session repeatability testing is deferred due to resource constraints.
No repeatability penalty was applied under this version. Scores reflect raw single-session results.
References
Research papers and benchmarks that inform the evaluation dimensions, scoring design, and reliability practices used in this methodology version.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · Jimenez, C. E., Yang, J., Wettig, A., et al. · ICLR 2024 · 2024
Evaluates language models on real GitHub issue resolution, measuring end-to-end task completion on production codebases. Informs our emphasis on outcome-based scoring over process metrics.
WebArena: A Realistic Web Environment for Building Autonomous Agents · Zhou, S., Xu, F. F., Zhu, H., et al. · ICLR 2024 · 2023
Creates a reproducible web environment for evaluating agents on realistic tasks like flight booking and form filling. Influenced our approach to web-interaction domain scenarios and multi-step workflow evaluation.
GAIA: A Benchmark for General AI Assistants · Mialon, G., Fourrier, C., Swift, C., et al. · arXiv preprint · 2023
Proposes tiered difficulty levels for evaluating general-purpose AI assistants with unambiguous, verifiable answers. Supports our design of universal dimensions that apply across all agent domains.
Yao, S., Wang, Y., et al. · arXiv preprint · 2024
Measures reliability of tool-calling agents in conversational workflows, focusing on trial consistency across repeated runs. Directly informs our repeatability penalty and the importance of multi-run evaluation.
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks · Xu, F. F., Zhou, Y., et al. · arXiv preprint · 2024
Evaluates agents on enterprise-style tasks across organizational workflows requiring cross-application coordination. Influenced our enterprise and automation domain templates.
Alzahrani, N., Barnett, S., et al. · arXiv preprint · 2025
Demonstrates that small changes in seeds and dataset splits produce measurable benchmark fluctuations, reinforcing the need for repeatability disclosure and confidence adjustments in published scores.
Liu, X., Zhu, Y., et al. · arXiv preprint · 2025
Comprehensive survey of agent evaluation across software engineering, web interaction, tool use, and scientific tasks. Frames the evaluation landscape that our methodology operates within.
Zhang, T., et al. · arXiv preprint · 2025
Finds that 75% of production teams bypass benchmarks in favor of A/B tests and user feedback. Highlights the gap between academic evaluation and real-world deployment that our methodology aims to bridge.
Measuring Massive Multitask Language Understanding · Hendrycks, D., Burns, C., Basart, S., et al. · ICLR 2021 · 2021
Establishes the standard for multi-domain knowledge evaluation across 57 subjects. While designed for static models, its domain-stratified scoring approach influenced our per-domain evaluation structure.
Evaluating Large Language Models Trained on Code · Chen, M., Tworek, J., Jun, H., et al. · arXiv preprint · 2021
Introduces functional correctness as the evaluation criterion for code generation, using pass@k as a metric. Informs our coding domain template where output correctness is paramount.
Li, Y., et al. · arXiv preprint · 2025
Benchmarks agents on workflows averaging 1 million tokens and 90 tool calls, revealing how performance degrades on long-horizon tasks. Supports our multi-step evaluation dimension (U3) and the need for execution endurance testing.
Wang, Z., et al. · arXiv preprint · 2025
Tests agents on parallel processes, resource constraints, and unexpected disruptions. Validates our error handling dimension (U4) and the importance of testing agent behavior under adversarial or constrained conditions.
Siegel, Z., et al. · arXiv preprint · 2025
Evaluates agents on scientific paper reproducibility — following multi-step procedures with precision. Reinforces the methodology's emphasis on instruction interpretation (U2) and task completion fidelity (U1).
Park, J., et al. · arXiv preprint · 2025
Finds efficiency metrics appeared in only 14 of 23 benchmark papers; fairness in only 1. Underscores the gap in holistic evaluation that our output quality dimension (U6) and transparency disclosures address.