
GAIA
Proactive personal assistant that handles your day
Benchmark Results
Evaluated Apr 11, 2026 · v1.3.0 · Personal Assistant
Composite: 83.6/100 (Very Good)
Universal Score: 80
Domain Score: 86
Formula
Universal = (27/30) × 100 − 10 = 80
Domain = 96 − 10 = 86
Composite = (80 × 0.40) + (86 × 0.60)
= 83.6/100
−10 repeatability adjustment applied to all scores. Repeatability testing was not conducted for this benchmark run, so each raw score (Universal 90.0, Domain 96.0, Composite 93.6) was reduced by 10. Scores without repeatability testing carry higher uncertainty.
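The arithmetic above can be reproduced with a short sketch. The weights, the 27/30 raw total, the raw Domain score of 96.0, and the −10 adjustment come from this report; the function and constant names are hypothetical and are not part of any published scoring tool.

```python
# Weights and penalty as stated in the Formula section of this report.
UNIVERSAL_WEIGHT = 0.40
DOMAIN_WEIGHT = 0.60
REPEATABILITY_PENALTY = 10.0


def universal_raw(points: int, max_points: int) -> float:
    """Scale a raw capability total (e.g. 27 out of 30) to a 0-100 score."""
    return points * 100 / max_points


def composite_score(universal: float, domain: float) -> float:
    """Weighted blend of the Universal and Domain scores."""
    return universal * UNIVERSAL_WEIGHT + domain * DOMAIN_WEIGHT


raw_universal = universal_raw(27, 30)  # 90.0
raw_domain = 96.0                      # raw Domain score quoted in the report

# Apply the repeatability adjustment to each component score.
adj_universal = raw_universal - REPEATABILITY_PENALTY
adj_domain = raw_domain - REPEATABILITY_PENALTY

print(round(composite_score(raw_universal, raw_domain), 1))  # 93.6
print(round(composite_score(adj_universal, adj_domain), 1))  # 83.6
```

Because the two weights sum to 1, subtracting 10 from both component scores lowers the composite by exactly 10, which is why the raw 93.6 becomes 83.6.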
Summary
GAIA is a highly capable personal assistant that excels at organizing information, drafting professional communications, and handling multi-step workflows with minimal user intervention. Its standout strength is the quality of its structured output — rich visual cards, tables, and actionable follow-up buttons that turn passive responses into interactive next steps. It correctly handled hallucination stress tests by firmly correcting false premises about its own Enterprise plan, and it skillfully identified contradictions in an impossible travel request. The casual conversational tone is distinctive and generally effective, though it may not suit all professional contexts. Minor gaps included not firmly flagging the impossibility of scheduling meetings in the past during error handling, and some interview prep content was slightly generic. On the free tier without connected integrations, GAIA still demonstrated strong reasoning and planning capabilities.
Universal Performance
Six capabilities · Raw: 27/30
Fully completed the weekly scheduling task with prioritized schedule, conflict identification, tight deadline warnings, and time management recommendations. Used 3 tools autonomously.
Understood complex multi-part instructions well across all tests. Slight gap: the casual tone sometimes costs precision in contextual interpretation.
Successfully chained 3+ dependent steps in single messages: organize data, identify conflicts, produce summary, and recommend actions. Consistent across all tests.
Hallucination test: corrected the false Enterprise plan details point by point. Error injection: identified missing integrations as blockers and joked about the past-scheduling request, but did not firmly state that scheduling in the past is impossible or warn about destructive deletion.
Zero interventions needed across all tests. GAIA autonomously selected and used tools (2-5 per response), provided complete answers, and offered actionable follow-up buttons for next steps.
Outputs are well-structured with visual cards, tables, numbered steps, and error badges. Email draft was production-ready. Casual tone (bruh, u, nah) is on-brand but may need editing for formal use cases.
Domain Scenarios
Personal Assistant · 5 scenarios scored 0–100
Correctly converted all availabilities to UTC and identified that no three-way overlap exists. Displayed a clear availability breakdown table with Error badge. Offered constructive alternatives (Yuki flexes late vs Sarah/Raj flex early). Minor gap: did not draft the calendar invite description as requested.
Production-ready email covering all four bullet points with appropriate subject line. Tone was confident but warm, professional but friendly — exactly as requested.
Extracted all 5 action items with correct owners and deadlines. Meeting summary captured key info in a clean table. Offered to track items in GAIA todo system.
Correctly identified all contradictions: budget impossibility, weekend-only constraint, and physical impossibility of dual trips in one week. Used humor effectively while being direct. Offered constructive alternatives.
All three steps completed: research interview question categories, create 5-day prep plan, draft personalized elevator pitch. Elevator pitch had smart placeholder brackets for startup name. Follow-up actions offered.
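The time-zone scenario in this list (convert every availability to UTC, then check for a three-way overlap) can be sketched as follows. This is a minimal illustration, not GAIA's implementation: the date, UTC offsets, and working hours are invented for the example, since the report does not give the actual availabilities, and the helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def to_utc_window(day: date, start_hour: int, end_hour: int, utc_offset_hours: float):
    """Convert a same-day local availability window to a (start, end) pair in UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(day.year, day.month, day.day, start_hour, tzinfo=tz)
    end = datetime(day.year, day.month, day.day, end_hour, tzinfo=tz)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


def common_window(windows):
    """Return the interval shared by all windows, or None if they do not overlap."""
    latest_start = max(start for start, _ in windows)
    earliest_end = min(end for _, end in windows)
    if latest_start < earliest_end:
        return latest_start, earliest_end
    return None


# Hypothetical data: three attendees, each free 09:00-17:00 local time.
day = date(2026, 4, 13)
person_a = to_utc_window(day, 9, 17, 9)     # UTC+9
person_b = to_utc_window(day, 9, 17, -5)    # UTC-5
person_c = to_utc_window(day, 9, 17, 5.5)   # UTC+5:30

print(common_window([person_a, person_b, person_c]))  # None
```

With these invented inputs no three-way overlap exists, so the function returns `None`; calling `common_window` on pairs of attendees is one way to find the kind of "who flexes" alternatives the scenario describes.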
Strengths
GAIA excels at structured output presentation — its use of visual cards, tables, error badges, and numbered steps creates highly scannable, actionable responses. The agent demonstrates strong multi-step reasoning, correctly chaining complex tasks in single-message interactions. Its hallucination resistance was impressive, firmly correcting false premises about its own pricing and features. The actionable follow-up buttons transform passive chat responses into interactive workflows. Tool usage was autonomous and appropriate across all tests.
Weaknesses
The casual tone (bruh, u, nah) is a deliberate brand choice but limits production readiness for formal contexts. Error handling did not explicitly flag that meetings cannot be scheduled in the past and did not warn about destructive bulk deletion. The free tier lacks connected integrations, which blocked testing of actual execution capabilities. Some research outputs were generic rather than deeply tailored.
Testing Limitations
Testing conducted on GAIA Free tier without any connected integrations. The agent's ability to actually execute actions (send emails, create calendar events) could not be verified — only planning and reasoning capabilities were tested. Single-session benchmark does not capture long-term reliability or multi-session memory. Pro tier features not evaluated.
Evaluation Transparency
Platform: GAIA Free tier, web interface at heygaia.io
Environment: GAIA Free tier account (user: BAi). No integrations connected (Gmail, Google Calendar not linked). Moonlight mode ON. No custom workflows or tasks configured. Default suggestions available (Morning Priority Map, Weekly Study Review, Email to Task Converter). Testing conducted via chat interface without external tool connections.
- Testing conducted via browser-based interaction in a single session on GAIA Free tier
- Repeatability testing: not conducted
- No integrations connected — execution capabilities could not be verified
- Long-term reliability, scale testing, and latency precision not covered
- Scores reflect a snapshot as of 2026-04-11





