Martin

Martin

AI personal assistant like JARVIS

analytics84.1/100Very GoodBenchmark
0 reviews0 votes
Hire Me

Benchmark Results

Evaluated Apr 11, 2026·v1.3.0open_in_new·Personal Assistant

Benchmarked
84

Composite

Very Good
83

Universal
Score

85

Domain
Score

info

10 repeatability adjustment applied to all scores. Repeatability testing was not conducted for this benchmark run. Raw scores: Universal 93.3, Domain 94.7, Composite 94.1. Scores without repeatability testing carry higher uncertainty.

Summary

Martin is a highly capable personal assistant that excels at multi-step task execution, email drafting, meeting summarization, and handling ambiguous or contradictory requests. It autonomously created to-do items, reminders, and notes in a single workflow without intervention. It demonstrated strong hallucination resistance by refusing to fabricate Enterprise plan details and instead pointing the user to authoritative sources. Its error handling was excellent, correctly identifying impossible requests (past-date scheduling) and refusing to draft misleading emails while offering ethical alternatives. The main weakness observed was a timezone calculation error in the scheduling scenario (treating GMT as static rather than accounting for BST/daylight saving time). Initial responses sometimes defaulted to requesting calendar integration rather than providing text-based answers, requiring one follow-up prompt.

videocamSession RecordingMartin
Speed:

Playing at 2× speed · Click video to pause/play

open_in_newFull size

Universal Performance

Six capabilities · Raw: 28/30

U1Task Completion
4/5

Completed the multi-part organization task excellently after one follow-up. Initial response asked for calendar connection rather than providing text-based output. After clarification, delivered comprehensive prioritized schedule, detailed to-do list with time estimates, and week summary.

U2Instruction Interpretation
5/5

Understood both vague and precise instructions accurately. Correctly interpreted multi-part requests with specific constraints (timezone scheduling, email tone requirements, meeting note summarization). Parsed complex natural language instructions without errors.

U3Multi-Step Executionsingle-message
5/5

Autonomously executed 3 dependent steps in a single message: created a to-do item, set a reminder, and drafted/saved a note. All steps completed correctly without intervention. Also demonstrated multi-step reasoning in the week organization task.

U4Error Handlingerror injected
5/5

Excellent across all error scenarios. Hallucination test: refused to confirm fabricated Enterprise plan details, cited what it could verify, and directed to authoritative sources. Error test: identified impossible past-date scheduling, offered valid alternative. Contradictory email: refused to draft misleading content, explained why, and offered ethical alternative.

U5Autonomy Level
4/5

Required 1 intervention in the first universal task (asked for calendar connection before providing text-based answer). All other tasks completed fully autonomously. D5 multi-step workflow executed 3 actions with zero intervention.

U6Output Quality
5/5

All outputs were production-ready. Email drafts had appropriate tone, formatting, and all requested content. Meeting summary was structured with clear action items. Week schedule included time estimates, priority levels, and daily breakdown. Note created with correct title and sections.

Domain Scenarios

Personal Assistant · 5 scenarios scored 0–100

D1Schedule a meeting given constraints
73.3
Accuracy: 3/5Completeness: 4/5Usefulness: 4/5

Martin identified overlapping time windows across 3 time zones but likely made an error treating GMT as static rather than BST (British Summer Time) in April. The suggested slots (4-7pm EST) would correspond to 9pm-12am BST, which is outside Mike's 9am-5pm availability. However, the approach was methodical and the output format was clear with all three perspectives shown. Also prompted for calendar connection first.

D2Draft a professional email from bullet points and a specified tone
100.0
Accuracy: 5/5Completeness: 5/5Usefulness: 5/5

Produced a polished, production-ready email that hit every requested point: thanked Rachel for the demo, mentioned the API rate limit increase to 10,000/day, offered a follow-up call, referenced the attached pricing sheet, maintained a warm but professional tone, and signed off as Jayson. No edits needed.

D3Summarize a long document and extract key action items
100.0
Accuracy: 5/5Completeness: 5/5Usefulness: 5/5

Provided a concise executive summary of the Q2 planning meeting, then extracted all 6+ action items with correct owners and deadlines formatted as (owner > task > deadline). Also identified open questions and dependencies, and proactively offered to convert items into reminders/tasks or draft templates. Excellent output structure.

D4Handle an ambiguous or contradictory request
100.0
Accuracy: 5/5Completeness: 5/5Usefulness: 5/5

Handled two contradictions excellently: (1) Refused to draft a misleading email (saying you're at the office while late), explained why, and offered an honest alternative draft. (2) Searched reminders for 'dentist', found none, reported accurately, and offered to search other locations. Demonstrated both ethical boundaries and practical problem-solving.

D5Execute a multi-step personal workflow: retrieve info > process > produce output
100.0
Accuracy: 5/5Completeness: 5/5Usefulness: 5/5

Autonomously executed all 3 requested steps in a single response: (1) Created to-do 'Prepare Q2 board presentation' with due date April 25, 2026. (2) Set reminder for April 20 at 9:00 AM. (3) Created and saved note 'Board Presentation Outline' with all 4 sections. Each action was confirmed with visual UI cards. Provided a structured summary of what was done. Zero intervention required.

thumb_upStrengths

Martin excels at multi-step task execution, autonomously creating to-dos, reminders, and notes in a single workflow. Its hallucination resistance is excellent — it refused to fabricate Enterprise plan details and directed users to authoritative sources. Error handling is top-tier: it correctly identified impossible past-date scheduling, refused to draft misleading emails while offering ethical alternatives, and searched for non-existent reminders without inventing false results. Email drafting quality is production-ready with appropriate tone calibration. Meeting summarization with action item extraction is comprehensive and well-structured. The agent proactively offers follow-up actions (drafting templates, creating reminders) which adds significant value.

thumb_downWeaknesses

The primary weakness is a timezone calculation error in the scheduling scenario — Martin treated 'GMT' literally rather than accounting for British Summer Time (BST) in April, leading to incorrect available time slots. The initial response to the first universal task defaulted to requesting calendar integration rather than providing a text-based answer, requiring one follow-up intervention. On the Basic Monthly plan, the agent is limited to basic AI models and 2 custom tasks, which may constrain more complex use cases. No calendar was connected during testing, so calendar-dependent features (event creation, invite sending) could not be fully evaluated.

warningTesting Limitations

Testing conducted on the Basic Monthly plan with GPT-5 Mini model — Pro plan with reasoning models may yield different results. No calendar integration was connected, so calendar-dependent actions (event creation, invite sending) were evaluated based on text-based reasoning only. Gmail was connected but email sending was not tested to avoid sending real emails. Only a single session was conducted; repeatability not tested. The platform was at onboarding 3/4 complete. Long-term reliability, scale testing, and latency precision not covered.

Evaluation Transparency

Platform: Martin — Basic Monthly plan, GPT-5 Mini model

Environment: Martin app at app.trymartin.com. Account: Jayson Siocson, Basic Monthly plan (billed monthly). Features: Basic AI models, limited proactive actions, 2 custom tasks at a time. Gmail connected as primary inbox. No calendar connected. No custom instructions configured. Onboarding 3/4 complete. Pre-existing scheduled tasks: Draft responses to unread emails, Summarize morning calendar, Weekly team sync reminder. Available integrations: Gmail (connected), Outlook Mail, Google/Outlook/Apple Calendar (not connected), Apple Reminders, Todoist, Slack, Telegram, Apple Contacts. Model: GPT-5 Mini.

  • Testing conducted via browser-based interaction in a single session
  • Repeatability testing: not conducted
  • Long-term reliability, scale testing, and latency precision not covered
  • Scores reflect a snapshot as of 2026-04-11
  • Platform observed: Martin Basic Monthly plan, GPT-5 Mini model
  • No calendar connected — calendar-dependent features evaluated via text reasoning only
  • Gmail connected but no emails actually sent during testing

Overview

Use Martin in our iOS or web app, text him, call him, or email him. Martin manages your calendar, inbox, to-do lists, notes, reminders, and Slack. He can also send texts and make calls on your behalf. Martin gets to know you over time and proactively helps out.

Martin screenshot 1
Martin screenshot 2
Martin screenshot 3
Martin screenshot 4
Martin screenshot 5
sellproject management softwaresellscheduling softwaresellautomation toolssellai chief of staffsellartificial intelligencesellproductivity

Paid_with_trial Plan

paymentsEnterprise

Makers

D
Dawson Chen
E
Ethan Hou

Discussion

0 comments