Benchmark Methodology v1.1.0


Published April 1, 2026 · Superseded by v1.2.0

Required tests and rubric anchors introduced. Hallucination stress test and error injection made mandatory.

Warning: This version has been superseded. View the current methodology (v1.3.0).

What changed in v1.1.0

Added:
- Hallucination stress test is now required in every benchmark session.
- Error injection requirement for U4 — evaluator must include at least one deliberately invalid input.
- U3 scope disclosure: evaluator must note whether multi-step reflects single-message or multi-turn.
- Scenario averaging rule: if two questions are combined into one scenario, scores are averaged.
- Rubric anchors (1–5 descriptors) added to all six universal dimensions.
- Repeatability guidance and consistency note added.
- Required disclosure: all published benchmarks must state whether repeatability testing was conducted.

Score Calculation

Universal Score (40% weight): sum(U1–U6) / 30 × 100
Domain Score (60% weight): avg(D1–D5 scenario scores)
Composite Score (final ranking): (U × 0.40) + (D × 0.60)
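The weighting above can be sketched in Python. This is a minimal illustration with hypothetical function names, not part of any published tooling:

```python
def universal_score(u: list[int]) -> float:
    """Normalize the six universal dimension scores (each 1-5) to 0-100."""
    assert len(u) == 6 and all(1 <= s <= 5 for s in u)
    return sum(u) / 30 * 100

def composite_score(universal: float, domain: float) -> float:
    """Composite: 40% universal, 60% domain (both already on 0-100)."""
    return universal * 0.40 + domain * 0.60

# A perfect universal rubric combined with a domain score of 80:
u = universal_score([5, 5, 5, 5, 5, 5])  # 100.0
print(composite_score(u, 80.0))          # 88.0
```

Because of the 60/40 split, two agents with identical universal scores are ranked purely by domain performance.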

Required Tests (All Agents)

Required: Universal Standardised Task

"Given a set of information, organize it, take an action based on it, and produce a summary of what was done." Adapted to the agent's domain.

Required: Hallucination Stress Test

Ask about something partially in scope but with a wrong or unverifiable premise. Desired behavior: decline or ask for clarification. Failure: confident answer with invented specifics.

Required: Error / Invalid Input Test (U4 injection)

Give the agent a deliberately broken or contradictory input. Desired behavior: detect the problem, explain it clearly, suggest an alternative. Required for an honest U4 score.

Universal Dimensions

Six capabilities tested across all agents regardless of domain. Each scored 1–5. Raw sum max = 30. Normalized: sum / 30 × 100.

U1: Task Completion

Can the agent finish a clearly defined task end to end?

5: Completes fully with correct output, no intervention needed
4: Completes with minor imperfections but output is usable
3: Completes partially — key parts missing or wrong
2: Starts the task but fails or stalls midway
1: Cannot begin or immediately fails
U2: Instruction Interpretation

How accurately does the agent understand natural language input?

5: Perfectly interprets both vague and precise instructions
4: Handles precise instructions well; minor misreads on vague ones
3: Needs clarification on most non-trivial instructions
2: Frequently misinterprets intent even with clear phrasing
1: Cannot parse instructions meaningfully
U3: Multi-Step Execution

Can the agent chain dependent actions across 3+ steps?

Measures multi-step reasoning within a single message or session. Does not measure multi-turn memory across separate sessions.

5: Executes all steps in correct order, maintains full context
4: Completes all steps but loses minor context between them
3: Completes most steps but breaks on dependencies
2: Can do individual steps but cannot chain them
1: Cannot handle more than a single action
U4: Error Handling

What happens when something goes wrong mid-task?

Requires at least one deliberate invalid or contradictory input per benchmark session.

5: Detects the error, explains it clearly, and recovers or offers an alternative
4: Detects the error and communicates it but needs human help to recover
3: Detects the error but the response is vague or unhelpful
2: Fails silently or produces misleading output
1: Crashes, hangs, or produces destructive results
U5: Autonomy Level

How much human intervention is required per task?

5: Fully autonomous — zero interventions needed
4: 1 minor intervention (confirmation, not correction)
3: 2–3 interventions including at least one correction
2: Requires hand-holding at most steps
1: Cannot proceed without continuous human guidance
U6: Output Quality

Is the final deliverable correct, useful, and well-formed?

5: Production-ready — no editing needed
4: Correct and usable — minor trimming or formatting needed
3: Roughly correct — significant editing or restructuring needed
2: Output exists but is largely wrong or unusable
1: No meaningful output produced

Domain Templates (4)

Each agent is tested against 5 scenarios tailored to its category. Each scenario is scored on 3 criteria (each 1–5, max 15) and normalized to 0–100. Domain score = average of the 5 normalized scenario scores (0–100).
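The per-scenario arithmetic can be sketched in Python. This is an illustrative sketch with hypothetical names, assuming each scenario's three 1–5 criterion scores are summed (max 15) and normalized to 0–100 before averaging:

```python
def scenario_score(criteria: list[int]) -> float:
    """One scenario: three criteria scored 1-5, normalized to 0-100."""
    assert len(criteria) == 3 and all(1 <= c <= 5 for c in criteria)
    return sum(criteria) / 15 * 100

def domain_score(scenarios: list[list[int]]) -> float:
    """Domain score: the average of the five scenario scores (0-100)."""
    assert len(scenarios) == 5
    return sum(scenario_score(s) for s in scenarios) / 5

# Five scenarios with mixed criterion scores:
print(domain_score([[5, 5, 5], [4, 4, 4], [3, 3, 3], [5, 4, 3], [2, 2, 2]]))  # 72.0
```

Normalizing each scenario before averaging keeps the domain score on the same 0–100 scale as the universal score, so the two can be combined in the composite.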

Customer Service (customer_service_v1)

Examples: Intercom Fin, Zendesk AI, Freshdesk Freddy

D1. Informational Query: Resolve a straightforward informational query using the knowledge base
D2. Topic Change: Handle a mid-conversation topic change
D3. Emotional Message: Respond to an emotionally charged message; escalate appropriately
D4. False Premise: Answer a question with a false or unverifiable premise
D5. Complex Query: Respond to a multi-part complex query in a single coherent response

Criteria per scenario: Accuracy, Completeness, Usefulness (each 1–5, max 15)

Automation & Workflow (automation_workflow_v1)

Examples: Make, Zapier, n8n

D1. 3-Step Automation: Build a 3-step automation from a natural language description
D2. Conditional Branching: Create a workflow with conditional branching (if/else logic)
D3. Integration Data Map: Connect two integrations and pass data between them correctly
D4. Mid-Workflow Error: Handle a mid-workflow integration error
D5. Workflow Modification: Modify an existing workflow based on a changed requirement

Criteria per scenario: Accuracy, Completeness, Usefulness (each 1–5, max 15)

Coding & Development (coding_development_v1)

Examples: GitHub Copilot, Cursor, Devin

D1. Code Generation: Write a functional code block from a natural language description
D2. Debugging: Debug a broken piece of code and explain the fix
D3. Multi-Step Pipeline: Execute a multi-step workflow: read input → transform → write output
D4. Broken Request: Handle an intentionally broken or impossible request
D5. Refactoring: Refactor code for clarity, performance, or security

Criteria per scenario: Correctness, Completeness, Code Quality (each 1–5, max 15)

Data Analysis (data_analysis_v1)

Examples: Hex, Julius, ThoughtSpot

D1. Query Generation: Write and execute a query from a natural language question
D2. Visualization: Build an appropriate visualization from a dataset
D3. End-to-End Pipeline: Multi-step: filter → aggregate → visualize → summarize
D4. Multi-Source Join: Join or relate multiple data sources
D5. Insight Summary: Produce a narrative insight summary from a completed analysis

Criteria per scenario: Query Accuracy, Visualization Fit, Insight Quality (each 1–5, max 15)

Repeatability

All benchmarks under this version use a single-session evaluation — each question is asked once. Multi-session repeatability testing is deferred due to resource constraints.

No repeatability penalty was applied under this version. Scores reflect raw single-session results.

All published benchmarks under this version must disclose whether repeatability testing was conducted.

References

Research papers and benchmarks that inform the evaluation dimensions, scoring design, and reliability practices used in this methodology version.

1. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C. E., Yang, J., Wettig, A., et al. · ICLR 2024 · 2024

Evaluates language models on real GitHub issue resolution, measuring end-to-end task completion on production codebases. Informs our emphasis on outcome-based scoring over process metrics.

Tags: Task Completion, Domain-Specific Evaluation, Benchmark Design
2. WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., et al. · ICLR 2024 · 2023

Creates a reproducible web environment for evaluating agents on realistic tasks like flight booking and form filling. Influenced our approach to web-interaction domain scenarios and multi-step workflow evaluation.

Tags: Web Interaction, Multi-Step Reasoning, Task Completion, Autonomy
3. GAIA: A Benchmark for General AI Assistants

Mialon, G., Fourrier, C., Swift, C., et al. · arXiv preprint · 2023

Proposes tiered difficulty levels for evaluating general-purpose AI assistants with unambiguous, verifiable answers. Supports our design of universal dimensions that apply across all agent domains.

Tags: Scoring Framework, Task Completion, Multi-Step Reasoning, Autonomy
4. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Yao, S., Wang, Y., et al. · arXiv preprint · 2024

Measures reliability of tool-calling agents in conversational workflows, focusing on trial consistency across repeated runs. Directly informs our repeatability penalty and the importance of multi-run evaluation.

Tags: Repeatability, Domain-Specific Evaluation, Error Handling, Autonomy
5. TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks

Xu, F. F., Zhou, Y., et al. · arXiv preprint · 2024

Evaluates agents on enterprise-style tasks across organizational workflows requiring cross-application coordination. Influenced our enterprise and automation domain templates.

Tags: Enterprise Workflows, Multi-Step Reasoning, Domain-Specific Evaluation, Task Completion
6. On the Reliability of LLM Benchmarks: A Study of Variance and Reproducibility

Alzahrani, N., Barnett, S., et al. · arXiv preprint · 2025

Demonstrates that small changes in seeds and dataset splits produce measurable benchmark fluctuations, reinforcing the need for repeatability disclosure and confidence adjustments in published scores.

Tags: Repeatability, Benchmark Design, Scoring Framework
7. A Survey on the Evaluation of LLM-based Agents

Liu, X., Zhu, Y., et al. · arXiv preprint · 2025

Comprehensive survey of agent evaluation across software engineering, web interaction, tool use, and scientific tasks. Frames the evaluation landscape that our methodology operates within.

Tags: Benchmark Design, Scoring Framework, Domain-Specific Evaluation
8. How Are LLM-based Agents Evaluated in Practice? Insights from Industry

Zhang, T., et al. · arXiv preprint · 2025

Finds that 75% of production teams bypass benchmarks in favor of A/B tests and user feedback. Highlights the gap between academic evaluation and real-world deployment that our methodology aims to bridge.

Tags: Benchmark Design, Scoring Framework, Enterprise Workflows
9. Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., et al. · ICLR 2021 · 2021

Establishes the standard for multi-domain knowledge evaluation across 57 subjects. While designed for static models, its domain-stratified scoring approach influenced our per-domain evaluation structure.

Tags: Scoring Framework, Domain-Specific Evaluation, Benchmark Design
10. Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. · arXiv preprint · 2021

Introduces functional correctness as the evaluation criterion for code generation, using pass@k as a metric. Informs our coding domain template where output correctness is paramount.

Tags: Task Completion, Domain-Specific Evaluation, Scoring Framework
11. AgencyBench: Evaluating LLM Agents on Long-Horizon Compositional Tasks

Li, Y., et al. · arXiv preprint · 2025

Benchmarks agents on workflows averaging 1 million tokens and 90 tool calls, revealing how performance degrades on long-horizon tasks. Supports our multi-step evaluation dimension (U3) and the need for execution endurance testing.

Tags: Multi-Step Reasoning, Autonomy, Benchmark Design
12. REALM-Bench: A Benchmark for Realistic Agent Learning with Multifaceted Challenges

Wang, Z., et al. · arXiv preprint · 2025

Tests agents on parallel processes, resource constraints, and unexpected disruptions. Validates our error handling dimension (U4) and the importance of testing agent behavior under adversarial or constrained conditions.

Tags: Error Handling, Multi-Step Reasoning, Safety & Policy, Autonomy
13. CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Siegel, Z., et al. · arXiv preprint · 2025

Evaluates agents on scientific paper reproducibility — following multi-step procedures with precision. Reinforces the methodology's emphasis on instruction interpretation (U2) and task completion fidelity (U1).

Tags: Task Completion, Multi-Step Reasoning, Domain-Specific Evaluation
14. Beyond Accuracy: Evaluating the Efficiency and Fairness of LLM-based Agents

Park, J., et al. · arXiv preprint · 2025

Finds efficiency metrics appeared in only 14 of 23 benchmark papers; fairness in only 1. Underscores the gap in holistic evaluation that our output quality dimension (U6) and transparency disclosures address.

Tags: Scoring Framework, Benchmark Design, Safety & Policy