Benchmark Methodology v1.1.0


Published April 1, 2026 · Superseded by v1.2.0

Required tests and rubric anchors introduced. Hallucination stress test and error injection made mandatory.

Warning: This version has been superseded. View the current methodology (v1.3.0).

What changed in v1.1.0

Added:
- Hallucination stress test is now required in every benchmark session.
- Error injection requirement for U4 — evaluator must include at least one deliberately invalid input.
- U3 scope disclosure: evaluator must note whether multi-step reflects single-message or multi-turn.
- Scenario averaging rule: if two questions are combined into one scenario, scores are averaged.
- Rubric anchors (1–5 descriptors) added to all six universal dimensions.
- Repeatability guidance and consistency note added.
- Required disclosure: all published benchmarks must state whether repeatability testing was conducted.

Score Calculation

Universal Score (40% weight): sum(U1–U6) / 30 × 100
Domain Score (60% weight): avg(D1–D5 scenario scores)
Composite Score (final ranking): (U × 0.40) + (D × 0.60)
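The weighting above can be sketched in Python. This is a minimal illustration with hypothetical function names, not part of any published tooling:

```python
def universal_score(u: list[int]) -> float:
    """Normalize the six universal dimension scores (each 1-5) to 0-100."""
    assert len(u) == 6 and all(1 <= s <= 5 for s in u)
    return sum(u) / 30 * 100

def composite_score(universal: float, domain: float) -> float:
    """Composite: 40% universal, 60% domain (both already on 0-100)."""
    return universal * 0.40 + domain * 0.60

# A perfect universal rubric combined with a domain score of 80:
u = universal_score([5, 5, 5, 5, 5, 5])  # 100.0
print(composite_score(u, 80.0))          # 88.0
```

Because of the 60/40 split, two agents with identical universal scores are ranked purely by domain performance.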

Required Tests (All Agents)

Required: Universal Standardised Task

"Given a set of information, organize it, take an action based on it, and produce a summary of what was done." Adapted to the agent's domain.

Required: Hallucination Stress Test

Ask about something partially in scope but with a wrong or unverifiable premise. Desired behavior: decline or ask for clarification. Failure: confident answer with invented specifics.

Required: Error / Invalid Input Test (U4 injection)

Give the agent a deliberately broken or contradictory input. Desired behavior: detect the problem, explain it clearly, suggest an alternative. Required for an honest U4 score.

Universal Dimensions

Six capabilities tested across all agents regardless of domain. Each scored 1–5. Raw sum max = 30. Normalized: sum / 30 × 100.

U1: Task Completion

Can the agent finish a clearly defined task end to end?

5: Completes fully with correct output, no intervention needed
4: Completes with minor imperfections but output is usable
3: Completes partially — key parts missing or wrong
2: Starts the task but fails or stalls midway
1: Cannot begin or immediately fails
U2: Instruction Interpretation

How accurately does the agent understand natural language input?

5: Perfectly interprets both vague and precise instructions
4: Handles precise instructions well; minor misreads on vague ones
3: Needs clarification on most non-trivial instructions
2: Frequently misinterprets intent even with clear phrasing
1: Cannot parse instructions meaningfully
U3: Multi-Step Execution

Can the agent chain dependent actions across 3+ steps?

Measures multi-step reasoning within a single message or session. Does not measure multi-turn memory across separate sessions.

5: Executes all steps in correct order, maintains full context
4: Completes all steps but loses minor context between them
3: Completes most steps but breaks on dependencies
2: Can do individual steps but cannot chain them
1: Cannot handle more than a single action
U4: Error Handling

What happens when something goes wrong mid-task?

Requires at least one deliberate invalid or contradictory input per benchmark session.

5: Detects the error, explains it clearly, and recovers or offers an alternative
4: Detects the error and communicates it but needs human help to recover
3: Detects the error but the response is vague or unhelpful
2: Fails silently or produces misleading output
1: Crashes, hangs, or produces destructive results
U5: Autonomy Level

How much human intervention is required per task?

5: Fully autonomous — zero interventions needed
4: 1 minor intervention (confirmation, not correction)
3: 2–3 interventions including at least one correction
2: Requires hand-holding at most steps
1: Cannot proceed without continuous human guidance
U6: Output Quality

Is the final deliverable correct, useful, and well-formed?

5: Production-ready — no editing needed
4: Correct and usable — minor trimming or formatting needed
3: Roughly correct — significant editing or restructuring needed
2: Output exists but is largely wrong or unusable
1: No meaningful output produced

Domain Templates (4)

Each agent is tested against 5 scenarios tailored to its category. Each scenario is scored on 3 criteria (each 1–5, max 15) and normalized to 0–100. Domain score = average of the 5 normalized scenario scores (0–100).
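The per-scenario arithmetic can be sketched in Python. This is an illustrative sketch with hypothetical names, assuming each scenario's three 1–5 criterion scores are summed (max 15) and normalized to 0–100 before averaging:

```python
def scenario_score(criteria: list[int]) -> float:
    """One scenario: three criteria scored 1-5, normalized to 0-100."""
    assert len(criteria) == 3 and all(1 <= c <= 5 for c in criteria)
    return sum(criteria) / 15 * 100

def domain_score(scenarios: list[list[int]]) -> float:
    """Domain score: the average of the five scenario scores (0-100)."""
    assert len(scenarios) == 5
    return sum(scenario_score(s) for s in scenarios) / 5

# Five scenarios with mixed criterion scores:
print(domain_score([[5, 5, 5], [4, 4, 4], [3, 3, 3], [5, 4, 3], [2, 2, 2]]))  # 72.0
```

Normalizing each scenario before averaging keeps the domain score on the same 0–100 scale as the universal score, so the two can be combined in the composite.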

Customer Service (customer_service_v1)

Examples: Intercom Fin, Zendesk AI, Freshdesk Freddy

D1. Informational Query: Resolve a straightforward informational query using the knowledge base
D2. Topic Change: Handle a mid-conversation topic change
D3. Emotional Message: Respond to an emotionally charged message; escalate appropriately
D4. False Premise: Answer a question with a false or unverifiable premise
D5. Complex Query: Respond to a multi-part complex query in a single coherent response

Criteria per scenario: Accuracy, Completeness, Usefulness (each 1–5, max 15)

Automation & Workflow (automation_workflow_v1)

Examples: Make, Zapier, n8n

D1. 3-Step Automation: Build a 3-step automation from a natural language description
D2. Conditional Branching: Create a workflow with conditional branching (if/else logic)
D3. Integration Data Map: Connect two integrations and pass data between them correctly
D4. Mid-Workflow Error: Handle a mid-workflow integration error
D5. Workflow Modification: Modify an existing workflow based on a changed requirement

Criteria per scenario: Accuracy, Completeness, Usefulness (each 1–5, max 15)

Coding & Development (coding_development_v1)

Examples: GitHub Copilot, Cursor, Devin

D1. Code Generation: Write a functional code block from a natural language description
D2. Debugging: Debug a broken piece of code and explain the fix
D3. Multi-Step Pipeline: Execute a multi-step workflow: read input → transform → write output
D4. Broken Request: Handle an intentionally broken or impossible request
D5. Refactoring: Refactor code for clarity, performance, or security

Criteria per scenario: Correctness, Completeness, Code Quality (each 1–5, max 15)

Data Analysis (data_analysis_v1)

Examples: Hex, Julius, ThoughtSpot

D1. Query Generation: Write and execute a query from a natural language question
D2. Visualization: Build an appropriate visualization from a dataset
D3. End-to-End Pipeline: Multi-step: filter → aggregate → visualize → summarize
D4. Multi-Source Join: Join or relate multiple data sources
D5. Insight Summary: Produce a narrative insight summary from a completed analysis

Criteria per scenario: Query Accuracy, Visualization Fit, Insight Quality (each 1–5, max 15)

Repeatability

All benchmarks under this version use a single-session evaluation — each question is asked once. Multi-session repeatability testing is deferred due to resource constraints.

No repeatability penalty was applied under this version. Scores reflect raw single-session results.

All published benchmarks under this version must disclose whether repeatability testing was conducted.

References

Research papers and benchmarks that inform the evaluation dimensions, scoring design, and reliability practices used in this methodology version.

1. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C. E., Yang, J., Wettig, A., et al. · ICLR 2024 · 2024

Evaluates language models on real GitHub issue resolution, measuring end-to-end task completion on production codebases. Informs our emphasis on outcome-based scoring over process metrics.

Tags: Task Completion, Domain-Specific Evaluation, Benchmark Design
2. WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., et al. · ICLR 2024 · 2023

Creates a reproducible web environment for evaluating agents on realistic tasks like flight booking and form filling. Influenced our approach to web-interaction domain scenarios and multi-step workflow evaluation.

Tags: Web Interaction, Multi-Step Reasoning, Task Completion, Autonomy
3. GAIA: A Benchmark for General AI Assistants

Mialon, G., Fourrier, C., Swift, C., et al. · arXiv preprint · 2023

Proposes tiered difficulty levels for evaluating general-purpose AI assistants with unambiguous, verifiable answers. Supports our design of universal dimensions that apply across all agent domains.

Tags: Scoring Framework, Task Completion, Multi-Step Reasoning, Autonomy
4. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Yao, S., Wang, Y., et al. · arXiv preprint · 2024

Measures reliability of tool-calling agents in conversational workflows, focusing on trial consistency across repeated runs. Directly informs our repeatability penalty and the importance of multi-run evaluation.

Tags: Repeatability, Domain-Specific Evaluation, Error Handling, Autonomy
5. TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks

Xu, F. F., Zhou, Y., et al. · arXiv preprint · 2024

Evaluates agents on enterprise-style tasks across organizational workflows requiring cross-application coordination. Influenced our enterprise and automation domain templates.

Tags: Enterprise Workflows, Multi-Step Reasoning, Domain-Specific Evaluation, Task Completion
6. On the Reliability of LLM Benchmarks: A Study of Variance and Reproducibility

Alzahrani, N., Barnett, S., et al. · arXiv preprint · 2025

Demonstrates that small changes in seeds and dataset splits produce measurable benchmark fluctuations, reinforcing the need for repeatability disclosure and confidence adjustments in published scores.

Tags: Repeatability, Benchmark Design, Scoring Framework
7. A Survey on the Evaluation of LLM-based Agents

Liu, X., Zhu, Y., et al. · arXiv preprint · 2025

Comprehensive survey of agent evaluation across software engineering, web interaction, tool use, and scientific tasks. Frames the evaluation landscape that our methodology operates within.

Tags: Benchmark Design, Scoring Framework, Domain-Specific Evaluation
8. How Are LLM-based Agents Evaluated in Practice? Insights from Industry

Zhang, T., et al. · arXiv preprint · 2025

Finds that 75% of production teams bypass benchmarks in favor of A/B tests and user feedback. Highlights the gap between academic evaluation and real-world deployment that our methodology aims to bridge.

Tags: Benchmark Design, Scoring Framework, Enterprise Workflows
9. Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., et al. · ICLR 2021 · 2021

Establishes the standard for multi-domain knowledge evaluation across 57 subjects. While designed for static models, its domain-stratified scoring approach influenced our per-domain evaluation structure.

Tags: Scoring Framework, Domain-Specific Evaluation, Benchmark Design
10. Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. · arXiv preprint · 2021

Introduces functional correctness as the evaluation criterion for code generation, using pass@k as a metric. Informs our coding domain template where output correctness is paramount.

Tags: Task Completion, Domain-Specific Evaluation, Scoring Framework
11. AgencyBench: Evaluating LLM Agents on Long-Horizon Compositional Tasks

Li, Y., et al. · arXiv preprint · 2025

Benchmarks agents on workflows averaging 1 million tokens and 90 tool calls, revealing how performance degrades on long-horizon tasks. Supports our multi-step evaluation dimension (U3) and the need for execution endurance testing.

Tags: Multi-Step Reasoning, Autonomy, Benchmark Design
12. REALM-Bench: A Benchmark for Realistic Agent Learning with Multifaceted Challenges

Wang, Z., et al. · arXiv preprint · 2025

Tests agents on parallel processes, resource constraints, and unexpected disruptions. Validates our error handling dimension (U4) and the importance of testing agent behavior under adversarial or constrained conditions.

Tags: Error Handling, Multi-Step Reasoning, Safety & Policy, Autonomy
13. CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Siegel, Z., et al. · arXiv preprint · 2025

Evaluates agents on scientific paper reproducibility — following multi-step procedures with precision. Reinforces the methodology's emphasis on instruction interpretation (U2) and task completion fidelity (U1).

Tags: Task Completion, Multi-Step Reasoning, Domain-Specific Evaluation
14. Beyond Accuracy: Evaluating the Efficiency and Fairness of LLM-based Agents

Park, J., et al. · arXiv preprint · 2025

Finds efficiency metrics appeared in only 14 of 23 benchmark papers; fairness in only 1. Underscores the gap in holistic evaluation that our output quality dimension (U6) and transparency disclosures address.

Tags: Scoring Framework, Benchmark Design, Safety & Policy