GAIA

Proactive personal assistant that handles your day

83.6/100 · Very Good · Benchmark


Benchmark Results

Evaluated Apr 11, 2026 · v1.3.0 · Personal Assistant

Composite: 83.6/100 (Very Good)

Universal Score: 80

Domain Score: 86
Formula

Universal = (27/30) × 100 − 10 = 80

Composite = (80 × 0.40) + (86 × 0.60)

= 83.6/100


−10 repeatability adjustment applied to all scores. Repeatability testing was not conducted for this benchmark run. Raw scores: Universal 90.0, Domain 96.0, Composite 93.6. Scores without repeatability testing carry higher uncertainty.
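The arithmetic behind the published figures can be checked with a short sketch. The function names `adjusted` and `composite` are hypothetical and used only for illustration; the weights (0.40 and 0.60), the 10-point adjustment, and the raw scores come from the report itself.

```python
# Sketch of the composite calculation described in the Formula section.
# Function names are hypothetical; the numbers are taken from the report.

def adjusted(raw: float, penalty: float = 10.0) -> float:
    """Apply the flat repeatability adjustment to a raw score."""
    return raw - penalty


def composite(universal: float, domain: float,
              w_universal: float = 0.40, w_domain: float = 0.60) -> float:
    """Weighted blend of the universal and domain scores."""
    return universal * w_universal + domain * w_domain


raw_universal = 27 * 100 / 30   # 27 of 30 points across six capabilities
raw_domain = 96.0               # raw domain score stated in the report

universal = adjusted(raw_universal)
domain = adjusted(raw_domain)

print(round(universal, 1), round(domain, 1))        # adjusted component scores
print(round(composite(universal, domain), 1))       # adjusted composite
print(round(composite(raw_universal, raw_domain), 1))  # raw composite
```

Because the two weights sum to 1, subtracting 10 from both components lowers the composite by exactly 10, which is why the raw and adjusted composites differ by the same amount as the component scores.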

Summary

GAIA is a highly capable personal assistant that excels at organizing information, drafting professional communications, and handling multi-step workflows with minimal user intervention. Its standout strength is the quality of its structured output — rich visual cards, tables, and actionable follow-up buttons that turn passive responses into interactive next steps. It correctly handled hallucination stress tests by firmly correcting false premises about its own Enterprise plan, and it skillfully identified contradictions in an impossible travel request. The casual conversational tone is distinctive and generally effective, though it may not suit all professional contexts. Minor gaps included not firmly flagging the impossibility of scheduling meetings in the past during error handling, and some interview prep content was slightly generic. On the free tier without connected integrations, GAIA still demonstrated strong reasoning and planning capabilities.


Universal Performance

Six capabilities · Raw: 27/30

U1 Task Completion

5/5

Fully completed the weekly scheduling task with prioritized schedule, conflict identification, tight deadline warnings, and time management recommendations. Used 3 tools autonomously.

U2 Instruction Interpretation

4/5

Understood complex multi-part instructions well across all tests. Slight gap: casual tone sometimes trades off precision in contextual interpretation.

U3 Multi-Step Execution (single-message)

5/5

Successfully chained 3+ dependent steps in single messages: organize data, identify conflicts, produce summary, and recommend actions. Consistent across all tests.

U4 Error Handling (error injected)

4/5

Hallucination test: corrected the false Enterprise plan details with specific corrections. Error injection: identified missing integrations as blockers and humorously noted the past-scheduling issue, but did not firmly flag that scheduling in the past is impossible or warn about destructive deletion.

U5 Autonomy Level

5/5

Zero interventions needed across all tests. GAIA autonomously selected and used tools (2-5 per response), provided complete answers, and offered actionable follow-up buttons for next steps.

U6 Output Quality

4/5

Outputs are well-structured with visual cards, tables, numbered steps, and error badges. Email draft was production-ready. Casual tone (bruh, u, nah) is on-brand but may need editing for formal use cases.

Domain Scenarios

Personal Assistant · 5 scenarios scored 0–100

D1 Meeting Scheduling with Timezone Constraints

93.3

Accuracy: 5/5 · Completeness: 4/5 · Usefulness: 5/5

Correctly converted all availabilities to UTC and identified that no three-way overlap exists. Displayed a clear availability breakdown table with Error badge. Offered constructive alternatives (Yuki flexes late vs Sarah/Raj flex early). Minor gap: did not draft the calendar invite description as requested.

D2 Professional Email Drafting

100.0

Accuracy: 5/5 · Completeness: 5/5 · Usefulness: 5/5

Production-ready email covering all four bullet points with appropriate subject line. Tone was confident but warm, professional but friendly — exactly as requested.

D3 Document Summarization and Action Item Extraction

100.0

Accuracy: 5/5 · Completeness: 5/5 · Usefulness: 5/5

Extracted all 5 action items with correct owners and deadlines. Meeting summary captured key info in a clean table. Offered to track items in GAIA todo system.

D4 Ambiguous and Contradictory Request Handling

93.3

Accuracy: 5/5 · Completeness: 4/5 · Usefulness: 5/5

Correctly identified all contradictions: budget impossibility, weekend-only constraint, and physical impossibility of dual trips in one week. Used humor effectively while being direct. Offered constructive alternatives.

D5 Multi-Step Personal Workflow (Interview Prep)

93.3

Accuracy: 4/5 · Completeness: 5/5 · Usefulness: 5/5

All three steps completed: research interview question categories, create 5-day prep plan, draft personalized elevator pitch. Elevator pitch had smart placeholder brackets for startup name. Follow-up actions offered.
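The per-scenario scores above are consistent with a simple rule: sum the three criterion ratings, divide by the 15 available points, and scale to 100. The sketch below assumes that rule, and also assumes the raw domain score is the unweighted mean of the five scenarios; the report does not state either rule explicitly, and `scenario_score` is a hypothetical name.

```python
# Sketch: reproducing the scenario scores from the criterion ratings.
# Assumption (not stated in the report): criteria are equally weighted and
# the raw domain score is the plain mean of the five scenario scores.

def scenario_score(accuracy: int, completeness: int, usefulness: int,
                   max_per_criterion: int = 5) -> float:
    """Scale the summed criterion ratings to a 0-100 score."""
    total = accuracy + completeness + usefulness
    return total / (3 * max_per_criterion) * 100


# (accuracy, completeness, usefulness) ratings as listed for D1-D5
ratings = {
    "D1": (5, 4, 5),
    "D2": (5, 5, 5),
    "D3": (5, 5, 5),
    "D4": (5, 4, 5),
    "D5": (4, 5, 5),
}

scores = {name: round(scenario_score(*r), 1) for name, r in ratings.items()}
raw_domain = sum(scenario_score(*r) for r in ratings.values()) / len(ratings)

print(scores)
print(round(raw_domain, 1))
```

Under these assumptions the mean matches the raw Domain score of 96.0 quoted in the repeatability note, which suggests the scenarios are equally weighted.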

Strengths

GAIA excels at structured output presentation — its use of visual cards, tables, error badges, and numbered steps creates highly scannable, actionable responses. The agent demonstrates strong multi-step reasoning, correctly chaining complex tasks in single-message interactions. Its hallucination resistance was impressive, firmly correcting false premises about its own pricing and features. The actionable follow-up buttons transform passive chat responses into interactive workflows. Tool usage was autonomous and appropriate across all tests.

Weaknesses

The casual tone (bruh, u, nah) is a deliberate brand choice but limits production readiness for formal contexts. Error handling missed explicitly flagging the impossibility of scheduling meetings in the past and did not warn about destructive bulk deletion. The free tier lacks connected integrations, which blocked testing of actual execution capabilities. Some research outputs were somewhat generic rather than deeply tailored.

Testing Limitations

Testing conducted on GAIA Free tier without any connected integrations. The agent's ability to actually execute actions (send emails, create calendar events) could not be verified — only planning and reasoning capabilities were tested. Single-session benchmark does not capture long-term reliability or multi-session memory. Pro tier features not evaluated.

Evaluation Transparency

Platform: GAIA Free tier, web interface at heygaia.io

Environment: GAIA Free tier account (user: BAi). No integrations connected (Gmail, Google Calendar not linked). Moonlight mode ON. No custom workflows or tasks configured. Default suggestions available (Morning Priority Map, Weekly Study Review, Email to Task Converter). Testing conducted via chat interface without external tool connections.

Testing conducted via browser-based interaction in a single session on GAIA Free tier
Repeatability testing: not conducted
No integrations connected — execution capabilities could not be verified
Long-term reliability, scale testing, and latency precision not covered
Scores reflect a snapshot as of 2026-04-11

Overview

GAIA is a personal AI assistant that boosts your productivity by automating your digital life, from emails and tasks to goals, calendars, docs, and more, all handled proactively, just like a real assistant. It connects to 50+ tools and is open source, with a self-hosting option.

Tags: calendar apps, email clients, ai chatbots, ai chief of staff

Freemium Plan


Makers

Aryan Randeriya (@aryanranderiya)

Dhruv Maradiya (@DhruvMaradiya)

Darsh Panchal
