
Gartner

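<p>Independent verification of AI model performance is critical, because relying on proprietary labs to report their own results creates a conflict of interest. Artificial Analysis (AA) aims to provide objective evaluation of large language models (LLMs) through rigorous methodology and public transparency. Its benchmarks, including the Artificial Analysis Intelligence Index and the Omniscience Index, assess models on a range of criteria, including factual reliability and performance on real-world tasks. AA also emphasizes model transparency through its Openness Index, which scores models on data availability and methodology and promotes open-source integrity in AI development.</p>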

<p>The independent verification of model performance has become one of the most critical challenges facing the global AI ecosystem. As George Cameron and Micah Hill-Smith, co-founders of Artificial Analysis (AA), explained in a recent interview, relying solely on proprietary labs to report their own performance metrics introduces a fundamental conflict of interest that distorts the competitive landscape. Cameron and Hill-Smith spoke with Swyx of Latent Space about the rapid evolution of their company, which has quickly become the independent gold standard for evaluating large language models (LLMs). The Australian-founded firm was born out of a simple necessity: a pervasive lack of reliable, objective data regarding model performance, speed, and cost.</p>
<p>The founders realized quickly that relying on self-reported metrics from the labs building these models was a fool’s errand. This systemic bias left developers trying to build reliable applications with no trustworthy basis for choosing a model. Labs often manipulate evaluations by prompting models differently or cherry-picking examples, leading to inflated, non-reproducible scores. Cameron points to one particularly egregious example: “Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU.” To counteract this, Artificial Analysis adopted a rigorous methodology: it runs all evaluations itself and follows a “mystery shopper policy,” registering accounts under domains other than its own so that labs cannot serve a different model on a private, optimized endpoint.</p>
<p>The mission of providing free, public data remains central to AA’s identity, allowing developers and companies to navigate the increasingly complex AI stack. This public transparency is supported by two commercial arms. The first is an enterprise benchmarking insights subscription, offering standardized reports on critical deployment decisions, such as choosing between serverless inference, managed solutions, or leasing chips for self-hosting. The second revenue stream is private custom benchmarking for AI companies themselves, helping them understand their own models’ performance against specialized criteria. Critically, the founders maintain a strict firewall: “No one pays to be on the public leaderboard.”</p>
<p>AA continues to push beyond saturated benchmarks like simple MMLU scores. Their flagship Artificial Analysis Intelligence Index (V3) synthesizes ten different evaluations, including agentic benchmarks and long-context reasoning tests, presenting a single score with 95% confidence intervals via repeated runs.</p>
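<p>To make the idea of a score with a 95% confidence interval from repeated runs concrete, here is a minimal sketch. It is not AA’s published procedure: the normal approximation (a multiplier of 1.96), the function name <code>mean_with_ci95</code>, and the run scores are all assumptions chosen for illustration.</p>

```python
import math
import statistics


def mean_with_ci95(scores):
    """Return (mean, half_width) for a normal-approximation 95% interval.

    Assumption: runs are independent and the 1.96 multiplier is adequate;
    with only a handful of runs a t-distribution multiplier would be wider.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * sem


# Hypothetical index scores from five repeated runs of the same model.
runs = [61.8, 62.4, 62.1, 61.5, 62.2]
mean, half_width = mean_with_ci95(runs)
print(f"score = {mean:.2f} +/- {half_width:.2f}")
```

<p>Reporting the interval alongside the score shows whether a gap between two models is larger than run-to-run noise.</p>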
<p>The AA-Omniscience Index directly addresses the critical issue of hallucination. This score penalizes incorrect answers while rewarding the model for admitting, “I don’t know.” The Omniscience Index reveals that Anthropic’s Claude models consistently lead with the lowest hallucination rates, even if they aren’t always the smartest overall model. This highlights a crucial trade-off for enterprise users who prioritize factual reliability over bleeding-edge reasoning capabilities.</p>
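<p>The scoring idea described above, penalizing wrong answers while not punishing abstention, can be sketched as follows. The exact formulas are assumptions for illustration rather than AA’s published definition: here the index is correct minus incorrect answers as a percentage of all questions, and the hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions.</p>

```python
from collections import Counter

VALID_GRADES = {"correct", "incorrect", "abstain"}


def omniscience_style_scores(grades):
    """Return (index, hallucination_rate) for a list of per-question grades.

    Assumed scoring: +1 for correct, -1 for incorrect, 0 for abstaining,
    scaled to the range -100..100.
    """
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grades: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades supplied")
    index = 100 * (counts["correct"] - counts["incorrect"]) / total
    not_correct = counts["incorrect"] + counts["abstain"]
    hallucination_rate = counts["incorrect"] / not_correct if not_correct else 0.0
    return index, hallucination_rate


# Hypothetical models: one abstains when unsure, one always answers.
cautious = ["correct"] * 60 + ["incorrect"] * 10 + ["abstain"] * 30
guesser = ["correct"] * 70 + ["incorrect"] * 30
print(omniscience_style_scores(cautious))
print(omniscience_style_scores(guesser))
```

<p>Under these assumptions the cautious model scores higher on the index despite answering fewer questions correctly, which is the trade-off the paragraph describes.</p>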
<p>A major contribution to the evaluation space is GDPval-AA, a benchmark that evaluates models on real-world, economically valuable white-collar tasks drawn from 44 occupations and involving complex documents like spreadsheets and PDFs. The evaluation runs on the Stirrup agent harness, which allows multi-turn conversations (up to 100 turns) and external tool use, including code execution and file system access. The complexity of these tasks demands a sophisticated judge, so AA uses Gemini 3 Pro as an LLM judge, a methodology they rigorously tested to ensure no self-preference bias. This focus on multi-turn, tool-using agentic workflows is vital as the industry moves beyond single-query performance metrics toward assessing models’ ability to autonomously complete complex, multi-step tasks.</p>
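<p>The shape of a turn-capped, tool-using evaluation loop can be sketched as below. This is not the Stirrup API: <code>run_agent</code>, the action dictionaries, and the scripted stand-in model are hypothetical, and only the 100-turn cap comes from the text.</p>

```python
MAX_TURNS = 100  # turn cap mentioned in the article


def run_agent(model_step, tools, task, max_turns=MAX_TURNS):
    """Drive a model until it returns a final answer or the turn cap is hit.

    model_step(history) is a hypothetical callable that returns either
    {"type": "tool", "tool": name, "input": value} or
    {"type": "final", "content": answer}.
    """
    history = [("task", task)]
    for turn in range(1, max_turns + 1):
        action = model_step(history)
        if action["type"] == "final":
            return {"answer": action["content"], "turns": turn}
        # Execute the requested tool and feed the result back to the model.
        result = tools[action["tool"]](action["input"])
        history.append(("tool_result", result))
    return {"answer": None, "turns": max_turns}


def scripted_model(history):
    """Stand-in for an LLM: call one tool, then answer with its result."""
    kind, value = history[-1]
    if kind == "task":
        return {"type": "tool", "tool": "word_count", "input": value}
    return {"type": "final", "content": f"{value} words"}


tools = {"word_count": lambda text: len(text.split())}
outcome = run_agent(scripted_model, tools, "summarise the quarterly report")
print(outcome)
```

<p>Returning the number of turns alongside the answer makes it possible to compare models on how many steps they need, not only on whether they finish.</p>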
<p>The conversation also touched upon the paradox of AI costs, captured in the “smiling curve of AI costs.” Achieving GPT-4 level intelligence is now 100 to 1,000 times cheaper than it was at launch, thanks to smaller, efficient models like Amazon Nova. Yet the cost of deploying frontier reasoning models in demanding agentic workflows remains high, or is even increasing, because of the reliance on long context windows and highly sparse models. The founders noted that this trend suggests a future dominated by massive sparse models, pointing out that accuracy in the Omniscience Index correlates strongly with total parameter count, not just active parameters. This highlights the ongoing incentive for labs to build models with vast, albeit sparsely activated, knowledge bases.</p>
<p>This focus on efficiency extends to metrics like token efficiency versus turn efficiency. A model might cost more per token, but if it solves a complex task in fewer conversational turns, the overall cost to the user is lower. AA measures these variables closely, noting that newer models are becoming better at using more tokens only when necessary, resulting in tighter token distributions during inference.</p>
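<p>A short worked example shows how a higher per-token price can still yield a lower cost per task. All prices, token counts, and turn counts below are hypothetical, and the sketch assumes a flat price per token and the same token usage on every turn.</p>

```python
def task_cost(price_per_million_tokens, tokens_per_turn, turns):
    """Total cost in dollars for one task under a flat per-token price."""
    return price_per_million_tokens * tokens_per_turn * turns / 1_000_000


# Hypothetical models: the cheaper one needs four times as many turns.
cheap_per_token = task_cost(price_per_million_tokens=2.0, tokens_per_turn=8_000, turns=12)
pricey_per_token = task_cost(price_per_million_tokens=6.0, tokens_per_turn=8_000, turns=3)

print(f"cheap per token:  ${cheap_per_token:.3f} per task")
print(f"pricey per token: ${pricey_per_token:.3f} per task")
```

<p>With these numbers the model that costs three times more per token finishes the task for less, because it uses a quarter of the turns.</p>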
<p>Finally, AA addresses model transparency with its Openness Index, scoring models from 0 to 18 based on the availability of pre-training data, post-training data, methodology, training code, and licensing terms. This metric is essential for developers prioritizing open-source integrity and reproducibility. Leading the Openness Index are models like AI2 OLMo 2 and Nous Hermes, reflecting a commitment to transparency that proprietary labs often forgo. Maintaining objective, relevant benchmarks in a field accelerating this rapidly is an immense challenge. It requires AA to keep innovating its evaluation methodologies so that its metrics reflect the true capabilities and trade-offs faced by developers building the next generation of AI applications.</p>

The New Gartner: Why Independent LLM Benchmarking Matters