Google Gemini 2.5 AI Model Debuts, Leading with Strong Reasoning Capabilities

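The Google DeepMind team today officially released Gemini 2.5, which it describes as its most intelligent AI model. Gemini 2.5 is a "thinking" system: it reasons through a problem before responding, which Google says yields a marked improvement in accuracy and overall performance.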

The Gemini 2.5 Series: A New Standard for Thinking AI Models

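Gemini 2.5 is positioned as a family of "thinking models". Its first release, an experimental version of 2.5 Pro, leads on many benchmarks and ranks first on the LMArena leaderboard by a significant margin. Koray Kavukcuoglu, CTO of Google DeepMind, said that reasoning before responding gives the model higher accuracy and better comprehension.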

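"In the field of AI, a system's capacity for 'reasoning' refers to more than just classification and prediction. It refers to its ability to analyze information, draw logical conclusions, incorporate context and nuance, and make informed decisions," the official blog explains.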

Enhanced Reasoning Reaches a New Level

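Gemini 2.5 Pro leads on a range of benchmarks that require advanced reasoning. Even without test-time techniques that increase cost, such as majority voting, it ranks at the top of math and science benchmarks such as GPQA and AIME 2025.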
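To make the distinction between a single attempt and majority voting concrete, here is a minimal sketch. It is not Google's evaluation code; the sampled answers, the reference answer, and the helper names `pass_at_1` and `majority_vote` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def pass_at_1(samples, reference):
    """Fraction of independent single attempts that match the reference."""
    return sum(answer == reference for answer in samples) / len(samples)

def majority_vote(samples):
    """Most common answer across several attempts (ties go to the first seen)."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical data: five sampled answers to one question whose answer is "42".
samples = ["42", "41", "42", "17", "42"]
reference = "42"

single_attempt_score = pass_at_1(samples, reference)  # 3 of 5 attempts are correct
voted_answer = majority_vote(samples)                 # "42" receives the most votes
voted_correct = voted_answer == reference
```

In this toy case a single attempt is right 60% of the time, while the vote over five attempts is right. The vote costs five model calls instead of one, which is why single-attempt and multiple-attempt scores are reported separately.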

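More impressive still is its result on Humanity’s Last Exam, a dataset designed by hundreds of subject matter experts to capture the frontier of human knowledge and reasoning. Without tool use, the model scored 18.8%, the best result among the models compared.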

AI Model Benchmark Comparison

| Category | Benchmark | Setting | Gemini 2.5 Pro Experimental (03-25) | OpenAI o3-mini High | OpenAI GPT-4.5 | Claude 3.7 Sonnet 64k Extended Thinking | Grok 3 Beta Extended Thinking | DeepSeek R1 |
|---|---|---|---|---|---|---|---|---|
| Reasoning & Knowledge | Humanity’s Last Exam (no tools) | | 18.8% | 14.0%* | 6.4% | 8.9% | — | 8.6%* |
| Science | GPQA diamond | single attempt (pass@1) | 84.0% | 79.7% | 71.4% | 78.2% | 80.2% | 71.5% |
| Science | GPQA diamond | multiple attempts | — | — | — | 84.8% | 84.6% | — |
| Mathematics | AIME 2025 | single attempt (pass@1) | 86.7% | 86.5% | — | 49.5% | 77.3% | 70.0% |
| Mathematics | AIME 2025 | multiple attempts | — | — | — | — | 93.3% | — |
| Mathematics | AIME 2024 | single attempt (pass@1) | 92.0% | 87.3% | 36.7% | 61.3% | 83.9% | 79.8% |
| Mathematics | AIME 2024 | multiple attempts | — | — | — | 80.0% | 93.3% | — |
| Code generation | LiveCodeBench v5 | single attempt (pass@1) | 70.4% | 74.1% | — | — | 70.6% | 64.3% |
| Code generation | LiveCodeBench v5 | multiple attempts | — | — | — | — | 79.4% | — |
| Code editing | Aider Polyglot | whole / diff | 74.0% / 68.6% | 60.4% diff | 44.9% diff | 64.9% diff | — | 56.9% diff |
| Agentic coding | SWE-bench verified | | 63.8% | 49.3% | 38.0% | 70.3% | — | 49.2% |
| Factuality | SimpleQA | | 52.9% | 13.8% | 62.5% | — | 43.6% | 30.1% |
| Visual reasoning | MMMU | single attempt (pass@1) | 81.7% | no MM support | 74.4% | 75.0% | 76.0% | no MM support |
| Visual reasoning | MMMU | multiple attempts | — | no MM support | — | — | 78.0% | no MM support |
| Image understanding | Vibe-Eval (Reka) | | 69.4% | no MM support | — | — | — | no MM support |
| Long context | MRCR | 128k | 91.5% | 36.3% | 48.8% | — | — | — |
| Long context | MRCR | 1M | 83.1% | — | — | — | — | — |
| Multilingual performance | Global MMLU (Lite) | | 89.8% | — | — | — | — | — |

Methodology: All Gemini 2.5 Pro scores are pass@1 (no majority voting or parallel test-time compute unless indicated otherwise). They are all run with the AI Studio API for the model ID gemini-2.5-pro-experimental-20250325. OpenAI scores are run with the OpenAI API (o3-mini and GPT-4.5). Grok results are reported using Grok as a judge.

* indicates evaluated on text problems only (without images).
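The table shows Gemini 2.5 Pro ahead on several single-attempt rows but not all of them. As a rough sketch of how to read it, the snippet below takes a handful of single-attempt rows copied from the table and computes, for each, the top scorer and its lead over the runner-up in percentage points. The names `MODELS`, `SCORES`, and `leader_and_margin` are illustrative, and the comparison ignores the asterisked caveat on two Humanity’s Last Exam entries.

```python
# Column order follows the table; model names are shortened for readability.
MODELS = ["Gemini 2.5 Pro", "o3-mini High", "GPT-4.5",
          "Claude 3.7 Sonnet", "Grok 3 Beta", "DeepSeek R1"]

# Single-attempt scores in percent, copied from the table.
# None marks an entry the table leaves blank.
SCORES = {
    "Humanity's Last Exam": [18.8, 14.0, 6.4, 8.9, None, 8.6],
    "GPQA diamond":         [84.0, 79.7, 71.4, 78.2, 80.2, 71.5],
    "AIME 2025":            [86.7, 86.5, None, 49.5, 77.3, 70.0],
    "LiveCodeBench v5":     [70.4, 74.1, None, None, 70.6, 64.3],
    "SWE-bench verified":   [63.8, 49.3, 38.0, 70.3, None, 49.2],
    "SimpleQA":             [52.9, 13.8, 62.5, None, 43.6, 30.1],
}

def leader_and_margin(row):
    """Return (top model, lead over the runner-up in percentage points)."""
    ranked = sorted(
        ((score, name) for score, name in zip(row, MODELS) if score is not None),
        reverse=True,
    )
    (best, best_name), (second, _) = ranked[0], ranked[1]
    return best_name, round(best - second, 1)

for benchmark, row in SCORES.items():
    name, margin = leader_and_margin(row)
    print(f"{benchmark}: {name} leads by {margin} points")
```

On these rows, Gemini 2.5 Pro leads Humanity’s Last Exam, GPQA diamond, and AIME 2025, while o3-mini High leads LiveCodeBench v5, Claude 3.7 Sonnet leads SWE-bench verified, and GPT-4.5 leads SimpleQA.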