
OpenAI

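<p>OpenAI has launched a new benchmark, GDPval, to evaluate how its GPT-5 model performs against human professionals across industries. The results show GPT-5 matched or beat industry experts on 40.6% of tasks, while Anthropic’s Claude Opus 4.1 scored 49%. Although the benchmark currently covers a limited set of tasks, OpenAI plans to expand its scope to better reflect real-world job functions. The progress shown on GDPval suggests AI can help professionals focus on more meaningful work, and AI capabilities are expected to keep improving.</p>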

<p>OpenAI released a new benchmark on Thursday that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, GDPval, is an early attempt at understanding how close OpenAI’s systems are to outperforming humans at economically valuable work — a key part of the company’s founding mission to develop artificial general intelligence or AGI.</p>
<p>OpenAI says it found that its GPT-5 model and Anthropic’s Claude Opus 4.1 “are already approaching the quality of work produced by industry experts.”</p>
<p>That’s not to say that OpenAI’s models are going to start replacing humans in their jobs immediately. Despite some CEOs’ predictions that AI will take the jobs of humans in just a few years, OpenAI admits that GDPval today covers a very limited number of tasks people do in their real jobs. However, it is one of the latest ways the company is measuring AI’s progress towards this milestone.</p>
<p>GDPval is based on nine industries that contribute the most to America’s gross domestic product, including domains such as healthcare, finance, manufacturing, and government. The benchmark tests an AI model’s performance in 44 occupations among those industries, ranging from software engineers to nurses to journalists.</p>
<p>For the first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by other professionals, and then choose the best one. For example, one prompt asked investment bankers to create a competitor landscape for the last-mile delivery industry, which was then compared to AI-generated reports. OpenAI then averages an AI model’s “win rate” against the human reports across all 44 occupations.</p>
<p>For GPT-5-high, a souped-up version of GPT-5 with extra computational power, the company says the AI model was ranked as better than or on par with industry experts 40.6% of the time.</p>
<p>OpenAI also tested Anthropic’s Claude Opus 4.1 model, which was ranked as better than or on par with industry experts in 49% of tasks. OpenAI says that it believes Claude scored so high because of its tendency to make pleasing graphics, rather than sheer performance.</p>
<p><img src="https://imageproxy.pbkrs.com/https://techcrunch.com/wp-content/uploads/2025/09/Screenshot-2025-09-25-at-9.10.47AM.png/query-dz02ODA" alt="" original-src="https://techcrunch.com/wp-content/uploads/2025/09/Screenshot-2025-09-25-at-9.10.47AM.png?w=680"/></p>
<p>Credit: OpenAI</p>
<p>It’s worth noting that most working professionals do a lot more than submit research reports to their boss, which is all that GDPval-v0 tests for. OpenAI acknowledges this, and says it plans to create more robust tests in the future that can account for more industries and interactive workflows.</p>
<p>Nonetheless, the company sees the progress on GDPval as notable.</p>
<p>In an interview with TechCrunch, OpenAI’s chief economist Dr. Aaron Chatterji said GDPval’s results suggest that people in these jobs can now use AI models to spend time on more meaningful tasks.</p>
<p>“[Because] the model is getting good at some of these things,” Chatterji says, “people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things.”</p>
<p>OpenAI’s evaluations lead Tejal Patwardhan tells TechCrunch that she’s encouraged by the rate of progress on GDPval. OpenAI’s GPT-4o model, which was released roughly 15 months ago, scored just 13.7% (wins and ties versus humans). Now GPT-5 scores nearly triple that, a trend Patwardhan expects to continue.</p>
<p>Silicon Valley has a wide range of benchmarks it uses to measure the progress of AI models, and assess whether a given model is state-of-the-art. Among the most popular are AIME 2025 (a test of competitive math problems) and GPQA Diamond (a test of PhD-level science questions). However, several AI models are nearing saturation on some of these benchmarks, and many AI researchers have cited the need for better tests that can measure AI’s proficiency on real-world tasks.</p>
<p>Benchmarks like GDPval could become increasingly important in that conversation, as OpenAI makes the case that its AI models are valuable for a wide range of industries.</p>

OpenAI says GPT-5 performs on par with humans across a wide range of jobs