Analysis · Featured · 10 min read

Google Gemini 3 Deep Think Upgrade Beats GPT-5.2 and Opus 4.6 on Benchmarks

Google's February 12 upgrade to Gemini 3 Deep Think mode scored 48.4% on Humanity's Last Exam, surpassing both OpenAI's GPT-5.2 and Anthropic's Opus 4.6. The model also topped GPQA Diamond and MMLU-Pro.

Research
Feb 14, 2026

Google DeepMind released a significant upgrade to Gemini 3's Deep Think reasoning mode on February 12, 2026, achieving benchmark scores that surpass both OpenAI's GPT-5.2 and Anthropic's Claude Opus 4.6 on several key evaluations.

The standout result is a 48.4% score on Humanity's Last Exam — a notoriously difficult benchmark designed to test the absolute limits of AI reasoning across mathematics, science, philosophy, and logic. For comparison, GPT-5.2 Pro scored 42.1% and Claude Opus 4.6 scored 39.8% on the same benchmark.

Additional benchmark highlights include:

GPQA Diamond (graduate-level science): 78.2% (vs GPT-5.2's 74.6% and Opus 4.6's 71.3%)

MMLU-Pro (massive multitask understanding): 91.7% (vs GPT-5.2's 89.4% and Opus 4.6's 88.1%)

MATH-500 (competition mathematics): 96.8% (vs GPT-5.2's 94.2% and Opus 4.6's 93.5%)

SWE-bench Verified (software engineering): 72.4% (below Opus 4.6's 80.8% and GPT-5.2's 75.1%)

The Deep Think mode uses iterative rounds of reasoning that explore multiple hypotheses simultaneously before converging on a solution. Unlike standard chain-of-thought approaches, Deep Think can backtrack, reconsider assumptions, and synthesize across different reasoning paths.
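Google has not published the mechanism, but the general idea described above — keeping several candidate reasoning paths alive, extending the most promising ones, and backtracking to earlier hypotheses when later ones stall — can be illustrated with a toy best-first search. Everything below (the function names, the scoring, the branching) is invented for illustration, not Google's implementation:

```python
import heapq

def deep_think_sketch(problem, expand, score, max_rounds=8, beam=3):
    """Toy multi-hypothesis search. Several candidate reasoning paths
    stay alive at once; the highest-scoring ones are extended each
    round, and unexpanded paths remain in the queue, so the search can
    backtrack to an earlier hypothesis if later paths stall."""
    # Max-heap via negated scores; entries are (neg_score, path).
    frontier = [(-score(problem, [h]), [h]) for h in expand(problem, [])]
    heapq.heapify(frontier)
    best = None
    for _ in range(max_rounds):
        if not frontier:
            break
        children = []
        for _ in range(min(beam, len(frontier))):
            neg, path = heapq.heappop(frontier)
            if best is None or -neg > best[0]:
                best = (-neg, path)  # track the best path seen so far
            for step in expand(problem, path):
                children.append((-score(problem, path + [step]), path + [step]))
        # Unexpanded hypotheses survive alongside the new children.
        frontier.extend(children)
        heapq.heapify(frontier)
    return best
```

The key difference from plain chain-of-thought is that abandoning a path is not final: lower-scoring hypotheses stay in the queue and can be revisited if the current front-runner stops improving.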

Google AI Ultra subscribers ($20/month) get access to Deep Think mode in the Gemini app, while developers can access it through the Vertex AI API and Google AI Studio. The model retains Gemini 3 Pro's 1M token context window and multimodal capabilities.
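For the API route, a minimal stdlib-only sketch against the public Gemini REST endpoint might look like the following. The model identifier `gemini-3-deep-think` is an assumption — check the model list in Google AI Studio for the real id:

```python
import json
import urllib.request

# Public Gemini generateContent REST endpoint.
API_URL = ("https://generativelanguage.googleapis.com/"
           "v1beta/models/{model}:generateContent")

def build_request(prompt: str) -> bytes:
    # The generateContent API takes a list of contents, each made of parts.
    return json.dumps({"contents": [{"parts": [{"text": prompt}]}]}).encode()

def ask_deep_think(prompt: str, api_key: str,
                   model: str = "gemini-3-deep-think") -> str:  # hypothetical id
    req = urllib.request.Request(
        API_URL.format(model=model),
        data=build_request(prompt),
        headers={"Content-Type": "application/json",
                 "x-goog-api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```

The same request shape works through the official `google-genai` SDK; the raw-HTTP form is shown here only to make the payload explicit.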

However, the results tell a nuanced story. While Gemini 3 Deep Think excels at pure reasoning and academic benchmarks, it trails both competitors in practical software engineering tasks (SWE-bench) and agentic workflows. Anthropic's Opus 4.6 maintains a clear lead in coding tasks with its 80.8% SWE-bench score.

Jeff Dean, Google's Chief Scientist, commented that Deep Think represents a "qualitative shift in how models approach complex problems" and hinted at further improvements coming with the anticipated Gemini 3 Ultra release later this quarter.

The three-way competition between Google, OpenAI, and Anthropic continues to intensify, with each company now holding clear advantages in different capability domains — a pattern that benefits the broader AI ecosystem and pushes all players to improve.
