
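<p>Andrew Gordon and Nora Petrova of Prolific argue that current AI evaluation leans too heavily on technical benchmarks and neglects human interaction. They propose the HUMAINE Leaderboard, which focuses on human-centered metrics such as trust and cultural alignment. The approach aims to improve the practical utility and safety of AI and to address problems such as sycophancy and the lack of demographic representation. The initiative stresses that AI development needs to align with human values, and it offers a transparent framework for meaningful AI evaluation.</p>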

<p>The current race to build ever more capable artificial intelligence models often prioritizes technical benchmarks over the nuanced reality of human interaction. Andrew Gordon, Staff Researcher in Behavioral Science, and Nora Petrova, AI Researcher, both from Prolific, contend that this focus creates a significant disconnect. In a recent interview, they meticulously dissected the flaws in conventional AI evaluation, advocating for a more “humane” approach to truly measure a model’s utility, safety, and relatability for real people.</p>
<p>Gordon illustrates this disconnect with an analogy: “A car that wins a Formula 1 race [is not] the best choice for your morning commute.” Similarly, an AI model that achieves an “incredibly good” score on academic benchmarks like Humanity’s Last Exam or MMLU “might be [an] absolute nightmare to use day-to-day.” The core problem is that technical prowess does not automatically translate into a practical, beneficial human experience.</p>
<p>The landscape of AI evaluation is, as Gordon describes, “incredibly nascent” and “fractured.” There is no standardized method for labs to report performance, leading to a cacophony of selective metrics. Some emphasize MMLU scores, others highlight different benchmarks, and some offer no public data at all. This heterogeneity makes meaningful comparison challenging and risks creating a “leaderboard illusion” where models are optimized for narrow, technical tests rather than broad human utility.</p>
<p>Beyond mere performance, the stakes are rising in AI safety. Nora Petrova highlights the alarming trend of users turning to AI for highly sensitive topics such as mental health support or navigating personal problems. She warns that “there is no oversight on that,” deeming the current situation “the Wild West.” Recent incidents, like Grok-3’s “Mecha Hitler” debacle, underscore the fragility of existing safety training, revealing a “thin veneer” over potentially harmful outputs. Gordon argues that safety metrics should hold equal weight to a model’s speed or intelligence.</p>
<p>Even human preference leaderboards, intended to gauge user experience, suffer from significant biases. Platforms like Chatbot Arena, while valuable for generating vast amounts of comparative data, lack demographic stratification. This means the feedback comes from an anonymous, unrepresentative sample, making it impossible to understand why users prefer one model over another. Furthermore, Gordon points out that these platforms can be “gamed” by companies submitting numerous iterations of their models, granting them disproportionate access to user prompts for refinement. This practice undermines the integrity of the leaderboard, as some models receive far more “battles” and thus more data for iterative improvement.</p>
<p>To address these critical shortcomings, Prolific has developed the HUMAINE Leaderboard, a human-centered evaluation framework built on three pillars. First, it employs census-based representative sampling, meticulously stratifying participants by age, ethnicity, and political alignment across geographies like the US and UK. This ensures that feedback reflects the diverse values and preferences of the general public, not just tech-savvy early adopters.</p>
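<p>As a rough illustration of the first pillar, quota-based stratified sampling can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Prolific’s implementation: the function names <code>allocate_quotas</code> and <code>stratified_sample</code>, the single <code>age</code> stratum, and the population shares are hypothetical placeholders rather than census figures.</p>

```python
import random
from collections import defaultdict


def allocate_quotas(shares, n):
    """Split n seats across strata in proportion to `shares`.

    Uses largest-remainder rounding so the quotas always sum to n.
    """
    raw = {stratum: share * n for stratum, share in shares.items()}
    quotas = {stratum: int(value) for stratum, value in raw.items()}
    leftover = n - sum(quotas.values())
    # Hand out the remaining seats to the largest fractional parts.
    by_remainder = sorted(raw, key=lambda s: raw[s] - quotas[s], reverse=True)
    for stratum in by_remainder[:leftover]:
        quotas[stratum] += 1
    return quotas


def stratified_sample(pool, key, shares, n, seed=0):
    """Draw n participants from `pool`, filling each stratum's quota at random."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in pool:
        by_stratum[key(person)].append(person)

    sample = []
    for stratum, quota in allocate_quotas(shares, n).items():
        candidates = by_stratum[stratum]
        if len(candidates) < quota:
            raise ValueError(f"not enough participants in stratum {stratum!r}")
        sample.extend(rng.sample(candidates, quota))
    return sample


# Hypothetical shares for illustration only (not real census data).
shares = {"18-34": 0.5, "35-54": 0.25, "55+": 0.25}
pool = [{"id": i, "age": band} for i, band in enumerate(["18-34", "35-54", "55+"] * 10)]
panel = stratified_sample(pool, key=lambda p: p["age"], shares=shares, n=8)
```

<p>To stratify on several attributes at once, as the article describes, the <code>key</code> function could return a tuple such as (age band, ethnicity, political alignment), with one share per combination.</p>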
<p>Second, HUMAINE moves beyond simplistic “A vs. B” preferences by incorporating specific, actionable metrics. Instead of a binary choice, users evaluate models on criteria such as helpfulness, adaptability, communication, personality, trustworthiness, understanding, and cultural alignment. As Gordon states, this provides “an actionable set of results that say, ‘Okay, your model is struggling with trust,’ or ‘Your model is struggling with personality.’” This granular feedback empowers developers to target specific areas for improvement, directly enhancing the human experience.</p>
<p>The third pillar is the use of Microsoft’s TrueSkill algorithm, a Bayesian system originally developed for Xbox Live matchmaking. This sophisticated ranking mechanism efficiently estimates model “skill” by accounting for factors like randomness, evolving performance over time, and the uncertainty of individual comparisons. It actively prioritizes battles between models where the data gain is highest, ensuring efficient and robust differentiation.</p>
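<p>To make the third pillar more concrete, here is a minimal sketch of a TrueSkill-style update for a single head-to-head comparison with no draws. It is an illustration under stated assumptions, not the HUMAINE code: the names <code>Rating</code>, <code>update</code>, and <code>match_quality</code> are hypothetical, the constants are the conventional TrueSkill defaults, and using match quality to pick the next pair is only one plausible proxy for “data gain,” since the article does not say how pairs are actually selected.</p>

```python
import math
from dataclasses import dataclass
from statistics import NormalDist

MU0 = 25.0           # prior mean skill
SIGMA0 = MU0 / 3     # prior uncertainty
BETA = MU0 / 6       # per-comparison performance noise (randomness)
TAU = MU0 / 300      # dynamics term: lets skill drift over time

_STD = NormalDist()  # standard normal, used for pdf/cdf


@dataclass(frozen=True)
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def update(winner, loser):
    """Return updated (winner, loser) ratings after one comparison, no draws."""
    # Inflate variances slightly so ratings can keep moving over time.
    var_w = winner.sigma ** 2 + TAU ** 2
    var_l = loser.sigma ** 2 + TAU ** 2
    c = math.sqrt(2 * BETA ** 2 + var_w + var_l)
    t = (winner.mu - loser.mu) / c
    # Mean and variance correction factors of the truncated Gaussian.
    # Note: cdf(t) underflows for extremely lopsided ratings; not handled here.
    v = _STD.pdf(t) / _STD.cdf(t)
    w = v * (v + t)
    new_winner = Rating(winner.mu + var_w / c * v,
                        math.sqrt(var_w * (1 - var_w / c ** 2 * w)))
    new_loser = Rating(loser.mu - var_l / c * v,
                       math.sqrt(var_l * (1 - var_l / c ** 2 * w)))
    return new_winner, new_loser


def match_quality(a, b):
    """Two-player match quality in (0, 1]; higher means a closer matchup."""
    denom = 2 * BETA ** 2 + a.sigma ** 2 + b.sigma ** 2
    return math.sqrt(2 * BETA ** 2 / denom) * math.exp(-(a.mu - b.mu) ** 2 / (2 * denom))


model_a, model_b = update(Rating(), Rating())  # model A wins its first battle
```

<p>An upset (a low-rated model beating a high-rated one) moves both means more than an expected result does, and every comparison shrinks the uncertainty of both models, which is why the system needs fewer battles than a plain win-rate table to separate them.</p>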
<p>Early findings from the HUMAINE Leaderboard are illuminating. While models continue to advance in objective intelligence and task performance, Prolific’s data indicates a concerning trend: AI models are performing “a lot worse on personality metrics and background and culture metrics.” This suggests a widening gap between models’ intellectual capabilities and their ability to engage with users in a relatable, culturally aware, and genuinely helpful manner. Furthermore, the researchers observed an increase in “sycophancy,” a people-pleasing behavior that users generally dislike.</p>
<p>The implications for founders, VCs, and AI professionals are profound. Building truly impactful AI requires moving beyond a narrow focus on technical scores. It necessitates a holistic understanding of how models interact with diverse human users in real-world scenarios. Prolific’s HUMAINE Leaderboard offers a rigorous, transparent, and actionable pathway to achieving this, ensuring that AI development is aligned with human values and needs, not just arbitrary benchmarks.</p>

Beyond Benchmarks: Why AI Needs a Human-Centered Scoreboard

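- Prolific researchers argue that current AI evaluation leans too heavily on technical benchmarks.  
- They propose the HUMAINE evaluation framework, which focuses on human interaction and diverse feedback.  
- Early data shows AI models performing poorly on personality and cultural fit.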
