---
title: "OpenAI launches \"PaperBench\" test to prove the strongest AI agent has not surpassed humans"
description: "OpenAI launched a new benchmark test \"PaperBench\" yesterday, aimed at assessing the ability of AI Agents to replicate top AI research. The test results show that even the most advanced AI models did n"
type: "news"
locale: "en"
url: "https://longbridge.com/en/news/234314311.md"
published_at: "2025-04-03T02:41:26.000Z"
---

# OpenAI launches "PaperBench" test to prove the strongest AI agent has not surpassed humans

> OpenAI launched a new benchmark test, "PaperBench," yesterday, aimed at assessing the ability of AI Agents to replicate top AI research. The test results show that even the most advanced AI models did not surpass the human baseline. PaperBench requires AI Agents to replicate 20 papers from the ICML 2024 conference from scratch, and the results indicate that the best-performing AI Agent achieved only a 21% replication score. OpenAI has open-sourced the relevant code to facilitate research on the engineering capabilities of AI Agents.

OpenAI announced yesterday (April 2) the launch of a new benchmark called "PaperBench," aimed at assessing the ability of AI Agents to replicate top AI research. The results indicate that even the most advanced models have not yet surpassed the human baseline.

PaperBench requires AI Agents to replicate, from scratch, 20 Spotlight and Oral papers presented at the ICML 2024 conference. Each replication involves understanding the paper's core contributions, independently developing a codebase, and successfully running the related experiments. To keep the assessment fair and objective, the research team designed a hierarchical scoring system that breaks the replication tasks down into a total of 8,316 individually gradable subtasks across the 20 papers.
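One way to picture such a hierarchical scoring system is as a tree in which each leaf is a single pass/fail subtask and each parent combines the scores of its children. The sketch below is only an illustration under stated assumptions: the article does not describe how scores are aggregated, so the weighted-average rule is assumed here, and the names `RubricNode`, `score`, and `count_leaves` are hypothetical, not taken from OpenAI's released code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical grading rubric tree (illustrative only)."""
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None  # meaningful only on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1].

    Assumption: leaves are graded pass/fail, and a parent's score is the
    weight-averaged score of its children.
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


def count_leaves(node: RubricNode) -> int:
    """Count the individually gradable subtasks (leaf nodes) under a node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# A made-up rubric for a single paper; the names and weights are invented.
paper = RubricNode("example paper", children=[
    RubricNode("core contribution understood", weight=1, passed=True),
    RubricNode("codebase developed", weight=2, children=[
        RubricNode("training loop implemented", passed=True),
        RubricNode("evaluation script implemented", passed=False),
    ]),
    RubricNode("experiments executed", weight=1, passed=False),
])

print(f"replication score: {score(paper):.0%}")   # replication score: 50%
print(f"gradable subtasks: {count_leaves(paper)}")  # gradable subtasks: 4
```

Under this assumed scheme, a benchmark-wide figure such as the reported 21% would correspond to averaging per-paper scores, although the article does not confirm the exact aggregation.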
OpenAI stated that all scoring criteria were developed in collaboration with the original paper authors to ensure the accuracy and practicality of the evaluation. The team also developed a judge based on large language models that can automatically score an AI Agent's replication attempts.

The test results show that the best-performing AI Agent at present, Claude 3.5 Sonnet (new version) developed by Anthropic, achieved an average replication score of only 21%. The research team also invited top machine learning PhD students to attempt the benchmark, and the results indicate that AI models have not yet surpassed human experts at research replication.

OpenAI has open-sourced the relevant code to promote further research on the engineering capabilities of AI Agents.