
OpenAI

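<p>OpenAI has launched a new benchmark, GDPval, to evaluate how its GPT-5 model performs against human professionals across industries. The results show that GPT-5 performed on par with industry experts in 40.6% of tasks, while Anthropic’s Claude Opus 4.1 scored 49%. Although the benchmark currently covers a limited set of tasks, OpenAI plans to broaden its scope to better reflect real-world job functions. The progress shown on GDPval suggests that AI can help professionals focus on more meaningful work, and AI capabilities are expected to keep improving.</p>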

<p>OpenAI released a new benchmark on Thursday that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, GDPval, is an early attempt at understanding how close OpenAI’s systems are to outperforming humans at economically valuable work — a key part of the company’s founding mission to develop artificial general intelligence or AGI.</p>
<p>OpenAI says it found that its GPT-5 model and Anthropic’s Claude Opus 4.1 “are already approaching the quality of work produced by industry experts.”</p>
<p>That’s not to say that OpenAI’s models are going to start replacing humans in their jobs immediately. Despite some CEOs’ predictions that AI will take the jobs of humans in just a few years, OpenAI admits that GDPval today covers a very limited number of tasks people do in their real jobs. However, it is one of the latest ways the company is measuring AI’s progress towards this milestone.</p>
<p>GDPval is based on nine industries that contribute the most to America’s gross domestic product, including domains such as healthcare, finance, manufacturing, and government. The benchmark tests an AI model’s performance in 44 occupations among those industries, ranging from software engineers to nurses to journalists.</p>
<p>For OpenAI’s first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by other professionals, and then choose the best one. For example, one prompt asked investment bankers to create a competitor landscape for the last-mile delivery industry, which was then compared with AI-generated reports. OpenAI then averages an AI model’s “win rate” against the human reports across all 44 occupations.</p>
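<p>As a rough illustration of the scoring described above, the following Python sketch computes a per-occupation share of wins and ties and then averages those shares. The verdict records, the outcome labels, the function names, and the choice of an unweighted mean across occupations are all assumptions made for illustration; the article does not specify how OpenAI weights tasks or occupations.</p>

```python
from collections import defaultdict

# Hypothetical grader verdicts: (occupation, outcome), where the outcome says
# how the AI report fared against the human expert's report.
# These records are made up for illustration; they are not GDPval data.
VERDICTS = [
    ("investment banker", "win"),
    ("investment banker", "loss"),
    ("investment banker", "tie"),
    ("investment banker", "loss"),
    ("nurse", "loss"),
    ("nurse", "win"),
    ("software engineer", "win"),
    ("software engineer", "loss"),
    ("software engineer", "loss"),
    ("software engineer", "loss"),
]

VALID_OUTCOMES = {"win", "tie", "loss"}


def win_or_tie_rate_by_occupation(verdicts):
    """Return {occupation: share of comparisons the AI won or tied}."""
    counts = defaultdict(lambda: [0, 0])  # occupation -> [wins_or_ties, total]
    for occupation, outcome in verdicts:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        counts[occupation][1] += 1
        if outcome in {"win", "tie"}:
            counts[occupation][0] += 1
    return {occ: good / total for occ, (good, total) in counts.items()}


def average_rate(rates):
    """Unweighted mean across occupations (an assumed aggregation)."""
    if not rates:
        raise ValueError("no occupations to average")
    return sum(rates.values()) / len(rates)


RATES = win_or_tie_rate_by_occupation(VERDICTS)
OVERALL = average_rate(RATES)

if __name__ == "__main__":
    for occupation, rate in sorted(RATES.items()):
        print(f"{occupation}: {rate:.1%}")
    print(f"average across occupations: {OVERALL:.1%}")
```

<p>With these made-up verdicts, averaging per occupation gives about 41.7%, whereas pooling all ten comparisons would give 40%. The gap shows why the aggregation choice matters when occupations have different numbers of tasks.</p>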
<p>For GPT-5-high, a souped-up version of GPT-5 with extra computational power, the company says the AI model was ranked as better than or on par with industry experts 40.6% of the time.</p>
<p>OpenAI also tested Anthropic’s Claude Opus 4.1 model, which was ranked as better than or on par with industry experts in 49% of tasks. OpenAI says that it believes Claude scored so high because of its tendency to make pleasing graphics, rather than sheer performance.</p>
<p><img src="https://imageproxy.pbkrs.com/https://techcrunch.com/wp-content/uploads/2025/09/Screenshot-2025-09-25-at-9.10.47AM.png/query-dz02ODA" alt="" original-src="https://techcrunch.com/wp-content/uploads/2025/09/Screenshot-2025-09-25-at-9.10.47AM.png?w=680"/></p>
<p>Credit: OpenAI</p>
<p>It’s worth noting that most working professionals do a lot more than submit research reports to their boss, which is all that GDPval-v0 tests for. OpenAI acknowledges this, and says it plans to create more robust tests in the future that can account for more industries and interactive workflows.</p>
<p>Nonetheless, the company sees the progress on GDPval as notable.</p>
<p>In an interview with TechCrunch, OpenAI’s chief economist Dr. Aaron Chatterji said GDPval’s results suggest that people in these jobs can now use AI models to spend time on more meaningful tasks.</p>
<p>“[Because] the model is getting good at some of these things,” Chatterji says, “people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things.”</p>
<p>OpenAI’s evaluations lead Tejal Patwardhan tells TechCrunch that she’s encouraged by the rate of progress on GDPval. OpenAI’s GPT-4o model, which was released roughly 15 months ago, scored just 13.7% (wins and ties versus humans). Now GPT-5 scores nearly triple that, a trend Patwardhan expects to continue.</p>
<p>Silicon Valley has a wide range of benchmarks it uses to measure the progress of AI models, and assess whether a given model is state-of-the-art. Among the most popular are AIME 2025 (a test of competitive math problems) and GPQA Diamond (a test of PhD-level science questions). However, several AI models are nearing saturation on some of these benchmarks, and many AI researchers have cited the need for better tests that can measure AI’s proficiency on real-world tasks.</p>
<p>Benchmarks like GDPval could become increasingly important in that conversation, as OpenAI makes the case that its AI models are valuable for a wide range of industries.</p>

OpenAI says GPT-5 performs on par with humans across a wide range of jobs