Want the prompt I used for this test? And my AI Prompt Library with 30+ outbound prompts?
Upgrade now in my newsletter here.
-
I tested seven AI models on the same account research prompt: 12 specific instructions, one target company (Replit), one buyer lens (TrackRec). This is my March 2026 benchmark.
The models: Perplexity Sonar, GPT 5.2 Thinking, Grok 4.2 Beta, Grok 4, Claude Opus 4.6, Claygent (Argon), and Gemini 3 Pro. I scored every model on six weighted criteria, tracked which instructions each model actually completed, classified why they missed what they missed, and manually verified every disputed claim.
Agenda:
- Why I expanded from 3 scoring criteria to 6 — and how adding Business Relevance changed the rankings
- What instruction completion reveals that scores alone don't (Perplexity: 10/12, Gemini: 1/12)
- The difference between hallucinations and false claims — and why it matters for automation at scale
- Why four models found September funding and stopped looking (the persistence failure pattern)
- The $400M funding round that may or may not be real — REPORTED vs VERIFIED as a new verification category
- Which model to use for high-value accounts vs volume enrichment in Clay
- Web app vs API vs Clay: why your results will be different and what I'm testing in the next benchmark
Referenced:
- TrackRec: https://www.trackrec.co
- Replit: https://replit.com
- Perplexity: https://www.perplexity.ai
- Clay: https://www.clay.com
- RepVue: https://www.repvue.com
- The account research prompt: Available for Outbound Kitchen paid members
-
Who am I? Elric Legloire, founder of Outbound Kitchen.
When you're ready
👨🍳 Want to work with me? Send me a DM
---
Connect with me
📌 Connect on LinkedIn
📹 Subscribe on YouTube
🐦 Connect on X
-
Chapters:
(0:00) - Why I keep benchmarking AI models
(1:45) - The test setup: TrackRec researching Replit
(3:00) - What changed from the last test (6 criteria, instruction tracking)
(3:30) - The new rankings
(4:05) - Perplexity: VP of SDR, podcast, RepVue miss
(5:00) - GPT 5.2: zero false claims, Glassdoor depth
(5:30) - The $400M funding round — is it real?
(7:00) - Grok 4.2: 56 seconds, best RepVue data
(8:00) - Bottom four models (quick summary)
(8:55) - Verification: hallucinations vs false claims
(10:05) - Which models I recommend
(10:45) - Web app vs Clay availability
(11:30) - What's next