

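<p>Independent verification of AI model performance is critical, because relying on proprietary labs to report their own results can create conflicts of interest. Artificial Analysis (AA) aims to provide objective evaluations of large language models (LLMs) through rigorous methodology and public transparency. Its benchmarks, including the Artificial Analysis Intelligence Index and the Omniscience Index, assess models against a range of criteria, including factual reliability and performance on real-world tasks. AA also emphasizes model transparency through its Openness Index, which scores models on data availability and methodology and promotes open-source integrity in AI development.</p>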

<p>The independent verification of model performance has become one of the most critical challenges facing the global AI ecosystem. As George Cameron and Micah Hill-Smith, co-founders of Artificial Analysis (AA), explained in a recent interview, relying solely on proprietary labs to report their own performance metrics introduces a fundamental conflict of interest that distorts the competitive landscape. Cameron and Hill-Smith spoke with Swyx of Latent Space about the rapid evolution of their company, which has quickly become the independent gold standard for evaluating large language models (LLMs). The Australian-founded firm was born out of a simple necessity: a pervasive lack of reliable, objective data regarding model performance, speed, and cost.</p>
<p>The founders realized quickly that relying on self-reported metrics from the labs building these models was a fool’s errand. This systemic bias meant that developers trying to build reliable applications faced an impossible decision-making landscape. Labs often manipulate evaluations by prompting models differently or cherry-picking examples, leading to inflated, non-reproducible scores. Cameron points to one particularly egregious example: “Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU.” To counteract this, Artificial Analysis adopted a rigorous methodology: it runs all evaluations itself and follows a “mystery shopper policy,” registering accounts under domains other than its own so that labs cannot serve different models on private, optimized endpoints.</p>
<p>The mission of providing free, public data remains central to AA’s identity, allowing developers and companies to navigate the increasingly complex AI stack. This public transparency is supported by two commercial arms. The first is an enterprise benchmarking insights subscription, offering standardized reports on critical deployment decisions, such as choosing between serverless inference, managed solutions, or leasing chips for self-hosting. The second revenue stream is private custom benchmarking for AI companies themselves, helping them understand their own models’ performance against specialized criteria. Critically, the founders maintain a strict firewall: “No one pays to be on the public leaderboard.”</p>
<p>AA continues to push beyond saturated benchmarks like simple MMLU scores. Its flagship Artificial Analysis Intelligence Index (V3) synthesizes ten different evaluations, including agentic benchmarks and long-context reasoning tests, and reports a single score with a 95% confidence interval estimated from repeated runs.</p>
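<p>As a rough illustration of how a composite score with a confidence interval can be produced from repeated runs, the sketch below averages per-evaluation scores within each run and then applies a normal-approximation interval across runs. The equal weighting, the evaluation names, and the sample numbers are all assumptions made for the example; the interview does not describe AA’s actual aggregation or interval method.</p>

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def composite(eval_scores):
    """Unweighted average of per-evaluation scores (equal weights are an assumption)."""
    return mean(eval_scores.values())


def index_with_ci(run_scores, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    centre = mean(run_scores)
    half_width = z * stdev(run_scores) / sqrt(len(run_scores))
    return centre, half_width


# Hypothetical results: each run maps an evaluation name to a 0-100 score.
runs = [
    {"eval_a": 70.0, "eval_b": 50.0},
    {"eval_a": 72.0, "eval_b": 52.0},
    {"eval_a": 68.0, "eval_b": 48.0},
]
run_indices = [composite(r) for r in runs]
centre, half_width = index_with_ci(run_indices)
print(f"index = {centre:.1f} ± {half_width:.1f}")
```

<p>With only a handful of runs, a Student’s t interval would be wider and arguably more appropriate; the normal approximation is used here only to keep the sketch within the standard library.</p>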
<p>The AA-Omniscience Index directly addresses the critical issue of hallucination. This score penalizes incorrect answers while rewarding the model for admitting, “I don’t know.” The Omniscience Index reveals that Anthropic’s Claude models consistently lead with the lowest hallucination rates, even if they aren’t always the smartest overall model. This highlights a crucial trade-off for enterprise users prioritizing factual reliability over bleeding-edge reasoning capabilities.</p>
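<p>One simple way to build a score that punishes wrong answers and favors abstention is sketched below: a correct answer counts +1, an incorrect answer counts -1, and “I don’t know” counts 0, so guessing wrongly costs more than abstaining. The scaling to a -100 to 100 range, the definition of hallucination rate as the share of non-correct responses that were wrong answers rather than abstentions, and the grade labels are assumptions for illustration, not a confirmed description of AA’s formula.</p>

```python
from collections import Counter

VALID_GRADES = {"correct", "incorrect", "abstain"}


def omniscience_style_scores(grades):
    """Return (index, hallucination_rate) for a list of per-question grades."""
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unexpected grades: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no grades supplied")
    # +1 per correct, -1 per incorrect, 0 per abstention, scaled to [-100, 100].
    index = 100 * (counts["correct"] - counts["incorrect"]) / total
    # Of the questions the model did not get right, how often did it answer wrongly?
    not_correct = counts["incorrect"] + counts["abstain"]
    hallucination_rate = counts["incorrect"] / not_correct if not_correct else 0.0
    return index, hallucination_rate


# Hypothetical models: one abstains when unsure, one always answers.
cautious = ["correct"] * 5 + ["incorrect"] * 1 + ["abstain"] * 4
guesser = ["correct"] * 6 + ["incorrect"] * 4

cautious_index, cautious_rate = omniscience_style_scores(cautious)
guesser_index, guesser_rate = omniscience_style_scores(guesser)
print(f"cautious: index={cautious_index:.0f}, hallucination rate={cautious_rate:.0%}")
print(f"guesser:  index={guesser_index:.0f}, hallucination rate={guesser_rate:.0%}")
```

<p>In this toy data the always-answering model gets more questions right but ends up with the lower index, which mirrors the trade-off between raw knowledge and factual reliability described above.</p>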
<p>A major contribution to the evaluation space is GDPval-AA, a benchmark evaluating models on 44 real-world, economically valuable white-collar tasks involving complex documents like spreadsheets and PDFs. This evaluation is performed using the Stirrup agent harness, which allows for multi-turn conversations (up to 100 turns) and external tool use, including code execution and file system access. The complexity inherent in these real-world tasks demands a sophisticated judge, leading AA to use Gemini 3 Pro as an LLM judge, a methodology they rigorously tested to ensure no self-preference bias. This focus on multi-turn, tool-using agentic workflows is vital, as the industry moves beyond single-query performance metrics toward assessing models’ ability to autonomously complete complex, multi-step tasks.</p>
<p>The conversation also touched upon the paradox of AI costs, captured in the “smiling curve of AI costs.” While achieving GPT-4 level intelligence is now 100 to 1,000 times cheaper than it was at launch (thanks to smaller, efficient models like Amazon Nova), the cost of deploying frontier reasoning models in demanding agentic workflows remains high or is even increasing due to the reliance on long context windows and highly sparse models. The founders noted that this trend suggests a future dominated by massive sparse models, pointing out that accuracy in the Omniscience Index correlates strongly with total parameter count, not just active parameters. This highlights the ongoing incentive for labs to build models with vast, albeit sparsely activated, knowledge bases.</p>
<p>This focus on efficiency extends to metrics like token efficiency versus turn efficiency. A model might cost more per token, but if it solves a complex task in fewer conversational turns, the overall cost to the user is lower. AA measures these variables closely, noting that newer models are becoming better at using more tokens only when necessary, resulting in tighter token distributions during inference.</p>
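<p>The arithmetic behind that trade-off can be made concrete with a small calculation. The prices, token counts, and turn counts below are invented for illustration, and the model is deliberately simplified: it uses one blended price per token and assumes every turn consumes the same number of tokens, ignoring separate input and output pricing.</p>

```python
def task_cost(price_per_million_tokens, tokens_per_turn, turns):
    """Total cost in dollars for one task, assuming equal token use on every turn."""
    total_tokens = tokens_per_turn * turns
    return price_per_million_tokens * total_tokens / 1_000_000


# Hypothetical model A: cheaper per token, but needs many turns to finish.
cheap_per_token = task_cost(price_per_million_tokens=2.0, tokens_per_turn=10_000, turns=20)

# Hypothetical model B: 2.5x the token price, but finishes in far fewer turns.
fewer_turns = task_cost(price_per_million_tokens=5.0, tokens_per_turn=10_000, turns=6)

print(f"cheaper per token: ${cheap_per_token:.2f} per task")
print(f"fewer turns:       ${fewer_turns:.2f} per task")
```

<p>Under these made-up numbers the model with the higher token price is the cheaper one per completed task, which is why per-task cost, rather than per-token price, is the more useful comparison for agentic workloads.</p>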
<p>Finally, AA addresses model transparency with its Openness Index, scoring models from 0 to 18 based on the availability of pre-training data, post-training data, methodology, training code, and licensing terms. This metric is essential for developers prioritizing open-source integrity and reproducibility. Leading the Openness Index are models like AI2 OLMo 2 and Nous Hermes, reflecting a commitment to transparency that proprietary labs often forgo. The challenges of maintaining objective, relevant benchmarks in a field accelerating this rapidly are immense. AA must continuously innovate its evaluation methodologies so that its metrics reflect the true capabilities and trade-offs faced by developers building the next generation of AI applications.</p>

The New Gartner: Why Independent LLM Benchmarking Matters