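Google DeepMind today officially released Gemini 2.5, which it describes as its most intelligent AI model. Gemini 2.5 is a "thinking" model: it reasons through a problem before responding, which improves accuracy and overall performance.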

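The Gemini 2.5 series: a new standard for thinking models
Gemini 2.5 is positioned as a family of "thinking models." Its first release, 2.5 Pro Experimental, leads a number of benchmarks and tops the LMArena leaderboard by a significant margin. Koray Kavukcuoglu, CTO of Google DeepMind, said the model reasons before it responds, which yields higher accuracy and better understanding.
"In the field of AI, a system's capacity for 'reasoning' refers to more than just classification and prediction. It refers to its ability to analyze information, draw logical conclusions, incorporate context and nuance, and make informed decisions," the official blog explains.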
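Reasoning reaches a new level
Gemini 2.5 Pro leads on a range of benchmarks that require advanced reasoning. Even without test-time techniques that increase cost, such as majority voting, it ranks at or near the top on math and science benchmarks such as GPQA and AIME 2025.
More notably, on Humanity’s Last Exam, a dataset designed by hundreds of subject-matter experts to capture the frontier of human knowledge and reasoning, the model scored 18.8% without tool use, the best result among the models compared.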
| Benchmark | Setting | Gemini 2.5 Pro Experimental (03-25) | OpenAI o3-mini High | OpenAI GPT-4.5 | Claude 3.7 Sonnet 6.6k Extended Thinking | Grok 3 Beta Extended Thinking | DeepSeek R1 |
|---|---|---|---|---|---|---|---|
| Reasoning & Knowledge | | | | | | | |
| Humanity’s Last Exam (no tools) | | 18.8% | 14.0%* | 6.4% | 8.9% | — | 8.6%* |
| Science | | | | | | | |
| GPQA diamond | single attempt (pass@1) | 84.0% | 79.7% | 71.4% | 78.2% | 80.2% | 71.5% |
| | multiple attempts | — | — | — | 84.8% | 84.6% | — |
| Mathematics | | | | | | | |
| AIME 2025 | single attempt (pass@1) | 86.7% | 86.5% | — | 49.5% | 77.3% | 70.0% |
| | multiple attempts | — | — | — | — | 93.3% | — |
| AIME 2024 | single attempt (pass@1) | 92.0% | 87.3% | 36.7% | 61.3% | 83.9% | 79.8% |
| | multiple attempts | — | — | — | 80.0% | 93.3% | — |
| Code generation | | | | | | | |
| LiveCodeBench v5 | single attempt (pass@1) | 70.4% | 74.1% | — | — | 70.6% | 64.3% |
| | multiple attempts | — | — | — | — | 79.4% | — |
| Code editing | | | | | | | |
| Aider Polyglot | whole / diff | 74.0% / 68.6% | 60.4% diff | 44.9% diff | 64.9% diff | — | 56.9% diff |
| Agentic coding | | | | | | | |
| SWE-bench verified | | 63.8% | 49.3% | 38.0% | 70.3% | — | 49.2% |
| Factuality | | | | | | | |
| SimpleQA | | 52.9% | 13.8% | 62.5% | — | 43.6% | 30.1% |
| Visual reasoning | | | | | | | |
| MMMU | single attempt (pass@1) | 81.7% | no MM support | 74.4% | 75.0% | 76.0% | no MM support |
| | multiple attempts | — | no MM support | — | — | 78.0% | no MM support |
| Image understanding | | | | | | | |
| Vibe-Eval (Reka) | | 69.4% | no MM support | — | — | — | no MM support |
| Long context | | | | | | | |
| MRCR | 128k | 91.5% | 36.3% | 48.8% | — | — | — |
| | 1M | 83.1% | — | — | — | — | — |
| Multilingual performance | | | | | | | |
| Global MMLU (Lite) | | 89.8% | — | — | — | — | — |
Methodology: All Gemini 2.5 Pro scores are pass@1 (no majority voting or parallel test-time compute unless indicated otherwise). They are all run with the AI Studio API for the model id gemini-2.5-pro-experimental-20250325. OpenAI scores are run with the OpenAI API (o3-mini and GPT-4.5). Grok results are reported using Grok as a judge.
* indicates evaluated on test problems only (without images).
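The methodology note separates single-attempt (pass@1) scores from results that use majority voting. The Python sketch below illustrates the difference on hypothetical toy data. The function names, the answers, and the exact-match scoring are assumptions made for illustration, not the evaluation harness behind the reported numbers.

```python
from collections import Counter


def pass_at_1(answers, references):
    """Fraction of problems whose single answer exactly matches the reference."""
    correct = sum(a == r for a, r in zip(answers, references))
    return correct / len(references)


def majority_vote(samples):
    """Return the most frequent answer among several samples for one problem."""
    return Counter(samples).most_common(1)[0][0]


def majority_vote_accuracy(all_samples, references):
    """Score the voted answer for each problem against the references."""
    voted = [majority_vote(samples) for samples in all_samples]
    return pass_at_1(voted, references)


# Hypothetical toy data: 4 problems, 3 sampled answers per problem.
references = ["12", "7", "45", "3"]
all_samples = [
    ["12", "12", "10"],  # first sample right, vote right
    ["8", "7", "7"],     # first sample wrong, vote right
    ["45", "45", "45"],  # both right
    ["2", "2", "3"],     # both wrong
]

# pass@1 scores only the first sample drawn for each problem.
first_answers = [samples[0] for samples in all_samples]
single = pass_at_1(first_answers, references)
voted = majority_vote_accuracy(all_samples, references)
print(f"pass@1: {single:.2f}, majority vote: {voted:.2f}")
```

Voting can recover problems where the first sample is wrong but most samples agree on the right answer. The cost is several model calls per problem, which is why the "multiple attempts" rows are reported separately.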
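Major gains in coding
In coding, Gemini 2.5 is a big leap over 2.0. 2.5 Pro excels at creating visually compelling web apps and agentic code applications, as well as at code transformation and editing. On SWE-Bench Verified, the industry-standard evaluation for agentic coding, Gemini 2.5 Pro scores 63.8% with a custom agent setup.
Google also showed an example in which 2.5 Pro uses its reasoning to create a video game, producing the executable code from a single-line prompt.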
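Building on Gemini's strengths
Gemini 2.5 builds on the core strengths of the Gemini family: native multimodality and a long context window. 2.5 Pro ships with a 1 million token context window (with 2 million coming soon) and shows strong performance that improves on previous generations. It can comprehend vast datasets and handle complex problems that draw on different information sources, including text, audio, images, video, and entire code repositories.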
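To give a sense of scale for a 1 million token window, the sketch below estimates whether a set of documents would fit. The four-characters-per-token ratio is a rough rule of thumb for English text, not Gemini's tokenizer, and the reserved output budget is a hypothetical value, so the result is only an approximation.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # window size stated in the article
CHARS_PER_TOKEN = 4                # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents, reserved_for_output=8_192):
    """Return (estimated_tokens, fits) for a list of document strings.

    reserved_for_output is a hypothetical budget kept free for the reply.
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    return total, total + reserved_for_output <= CONTEXT_WINDOW_TOKENS


# Stand-ins for real files: about 1.4 MB of text in total.
small_repo = ["a" * 400_000, "b" * 1_000_000]
total, fits = fits_in_context(small_repo)
print(total, fits)
```

For real workloads, the token count reported by the API is the number to rely on; this estimate only helps decide whether an input is anywhere near the limit.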
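Available now
Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio, and Gemini Advanced users can select it from the model dropdown on desktop and mobile. Google said the model will come to Vertex AI in the coming weeks and that pricing will follow, allowing users to scale up production use with higher rate limits.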
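For developers who want to try the model programmatically, the sketch below builds (but does not send) a request in the shape used by the Gemini API's REST `generateContent` endpoint. The model id is copied from the methodology note above and may not match the identifier the API actually accepts. The prompt and the `YOUR_API_KEY` placeholder are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://generativelanguage.googleapis.com/v1beta"
# Assumption: id taken from the article's methodology note; the live id may differ.
MODEL_ID = "gemini-2.5-pro-experimental-20250325"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a POST request for generateContent without sending it."""
    url = f"{API_BASE}/models/{MODEL_ID}:generateContent"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": api_key,
        },
        method="POST",
    )


req = build_request("Summarize the attached design notes.", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
# With a real key, urllib.request.urlopen(req) would send the request.
```

Building the request separately from sending it keeps the example runnable without credentials and makes the payload easy to inspect.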