---
title: "HLE (\"Humanity's Last Exam\") score surpasses 60 points for the first time! Eigen-1, built on DeepSeek V3.1, significantly outperforms Grok 4 and GPT-5"
description: "Eigen-1 multi-agent system achieved a historic breakthrough on the HLE Bio/Chem Gold test set, with a Pass@1 accuracy of 48.3% and a Pass@5 accuracy of 61.74%, surpassing 60 points for the first time,"
type: "news"
locale: "en"
url: "https://longbridge.com/en/news/259215649.md"
published_at: "2025-09-28T11:59:11.000Z"
---

# HLE ("Humanity's Last Exam") score surpasses 60 points for the first time! Eigen-1, built on DeepSeek V3.1, significantly outperforms Grok 4 and GPT-5

> The Eigen-1 multi-agent system achieved a historic breakthrough on the HLE Bio/Chem Gold test set, with a Pass@1 accuracy of 48.3% and a Pass@5 accuracy of 61.74%, surpassing 60 points for the first time and outperforming Google Gemini 2.5 Pro, OpenAI GPT-5, and Grok 4. The result is built on the open-source DeepSeek V3.1 rather than a closed-source large model.

Recently, the Eigen-1 multi-agent system achieved a historic breakthrough on the HLE Bio/Chem Gold test set: a Pass@1 accuracy of 48.3% and a Pass@5 accuracy of 61.74%, crossing the 60-point threshold for the first time. The system was jointly developed by a team including Tang Xiangru and Wang Yujie of Yale University, Xu Wanghan of Shanghai Jiao Tong University, Wan Guancheng of UCLA, Yin Zhenfei of Oxford University, and Jin Di and Wang Hanrui of Eigen AI. The result far surpasses those of Google Gemini 2.5 Pro, OpenAI GPT-5, and Grok 4. Most notably, it does not rely on a closed-source very large model; it is built entirely on the open-source DeepSeek V3.1.

### Related Stocks

- [OpenAI.NA - OpenAI](https://longbridge.com/en/quote/OpenAI.NA.md)
- [GOOG.US - Alphabet - C](https://longbridge.com/en/quote/GOOG.US.md)

## Related News & Research

| Title | Description | URL |
|-------|-------------|-----|
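| Musk's Grok sees US market share rise to nearly 18%, unaffected by controversy over pornographic content | Grok, the AI chatbot owned by Musk, has seen its US market share rise to nearly 18%, making it the third-largest chatbot behind only ChatGPT and Google Gemini. Although Grok has been caught up in a controversy over generating pornographic content, its usage has not been affected. Analysts believe that social media platforms | [Link](https://longbridge.com/en/news/275965405.md) |
| ChatGPT begins testing ads | OpenAI has begun testing ads in the free and lowest-priced paid tiers of ChatGPT, aiming to increase revenue to cope with rising costs. The test targets adult users in the US and covers the Free and Go subscription plans (USD 8 per month). Although most users do not pay, OpenAI promises that ads will not affect the content of answers, and users' conversation content | [Link](https://longbridge.com/en/news/275484431.md) |
| GPT-5 beats human judges in legal showdown | Legal scholars found that OpenAI's GPT-5 outperformed human judges at following the law, with a compliance rate of 100% versus only 52% for the judges. In one study, GPT-5 was tested in legal scenarios and showed no logical errors, unlike earlier AI models. These findings have sparked | [Link](https://longbridge.com/en/news/276008190.md) |
| All about the money! ChatGPT officially starts testing ads, drawing a wave of criticism online | OpenAI has begun testing an advertising feature with free and low-priced subscription users to ease its high operating costs. The move has drawn strong opposition from users, who criticize it for damaging user experience and trust. Competitor Anthropic took the opportunity to mock the move, and OpenAI's CEO hit back fiercely. Behind the move is an effort to support its funding negotiations on the order of USD 100 billion and to prove to capital markets | [Link](https://longbridge.com/en/news/275435957.md) |
| OpenAI's first hardware product reportedly launching this year; AirPods-like device to be "downgraded" due to memory shortage | OpenAI plans to launch its first hardware product, "Dime", which resembles AirPods and is expected to be released this year. Because of a memory shortage, the originally high-spec design has been simplified, and the final product will be a simple set of earphones. The product was originally set to carry a high-performance Exynos chip with standalone computing capability, but this was adjusted because of cost. It is expected to be manufactured by Foxconn in Vietnam, | [Link](https://longbridge.com/en/news/275219739.md) |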

---

> **Disclaimer**: This article is for reference only and does not constitute any investment advice.