---
title: "OpenAI launches SWE-bench Verified: Existing frameworks underestimate the software engineering capabilities of models"
description: "OpenAI launches SWE-bench Verified, an improvement on the existing SWE-bench to more reliably assess AI models' ability to solve software problems. This initiative aims to evaluate their performance i"
type: "news"
locale: "en"
url: "https://longbridge.com/en/news/211496428.md"
published_at: "2024-08-13T23:47:30.000Z"
---

# OpenAI launches SWE-bench Verified: Existing frameworks underestimate the software engineering capabilities of models

> OpenAI launches SWE-bench Verified, an improvement on the existing SWE-bench, to more reliably assess AI models' ability to solve software problems. The initiative aims to evaluate their performance on challenging tasks as systems approach AGI. This is business-related information and constitutes a significant event for the company.

OpenAI has just launched a more reliable code-generation evaluation benchmark: SWE-bench Verified. The most important sentence in the blog post is: "As our systems get closer and closer to AGI, we need to evaluate them on increasingly challenging tasks." The benchmark is an improved version (a verified subset) of the existing SWE-bench, aimed at more reliably evaluating the ability of AI models to solve real-world software problems.

SWE-bench is a popular software engineering evaluation suite used to assess the ability of large language models (LLMs) to solve real software issues drawn from GitHub. It works by giving an AI agent a code repository and an issue description and asking it to generate a patch that fixes the issue. While LLMs have made remarkable progress on SWE-bench, OpenAI's research found that the benchmark has some issues that can cause it to underestimate models' autonomous software engineering capabilities.

Specifically, OpenAI identified three main problems with SWE-bench:

**1. Unit tests are too strict**: The unit tests used to judge the correctness of a solution are often overly specific, and sometimes irrelevant to the issue, so correct solutions may be rejected.

**2. Unclear problem descriptions**: Many samples have vague problem descriptions, leaving ambiguity about what the problem is and what counts as a solution.

**3. Development environments are hard to set up**: It is sometimes difficult to reliably set up the SWE-bench development environment for agents, which can cause unit tests to fail regardless of the solution.

To address these issues, OpenAI worked with professional software developers to manually screen every sample in the SWE-bench test set, ensuring that the scope of the unit tests is appropriate and the problem descriptions are clear. The result is SWE-bench Verified, a human-validated subset of 500 samples that replaces the original SWE-bench and SWE-bench Lite test sets. In addition, OpenAI worked with the SWE-bench authors to develop a new evaluation harness that uses containerized Docker environments, making evaluation on SWE-bench easier and more reliable.

On SWE-bench Verified, GPT-4o solved 33.2% of the samples, while the best-performing open-source agent framework, Agentless, doubled its previous score, reaching 16%.
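To make the benchmark's setup concrete, here is a minimal Python sketch (not taken from OpenAI's post) of what a SWE-bench Verified sample looks like and how a predictions file is handed to the Docker-based harness. The dataset ID `princeton-nlp/SWE-bench_Verified`, the field names, and the harness invocation reflect the public SWE-bench release at the time of writing and should be treated as assumptions to verify against the SWE-bench repository; `my-agent` and `preds.jsonl` are hypothetical placeholders.

```python
# Illustrative sketch only: inspect a SWE-bench Verified sample and write a
# predictions file in the format the containerized evaluation harness expects.
# Dataset ID, field names, and harness flags are assumptions based on the
# public SWE-bench release; check the SWE-bench repository for current usage.
import json
from datasets import load_dataset  # pip install datasets

# SWE-bench Verified is the 500-sample, human-validated test split.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # expected: 500

sample = ds[0]
print(sample["instance_id"])        # unique ID tying the task to a GitHub issue
print(sample["repo"])               # repository the issue comes from
print(sample["base_commit"])        # commit the agent's patch must apply to
print(sample["problem_statement"])  # the issue text shown to the agent
print(sample["FAIL_TO_PASS"])       # tests that must go from failing to passing

# An agent's output is a unified diff ("model_patch") per instance; the harness
# consumes a JSON/JSONL predictions file of entries like this:
predictions = [
    {
        "instance_id": sample["instance_id"],
        "model_name_or_path": "my-agent",           # hypothetical agent name
        "model_patch": "diff --git a/... b/...\n",  # the generated patch
    }
]
with open("preds.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# Scoring then runs each instance inside its own Docker container, e.g.:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.jsonl \
#       --max_workers 8 --run_id demo
```

Building one isolated container per instance is what makes the containerized harness described above easier to run reproducibly than setting up each repository's environment directly on the host.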
OpenAI's research highlights the importance of deeply understanding and improving benchmark evaluations, especially as AI systems move closer to artificial general intelligence (AGI). As AI models keep improving, their performance needs to be evaluated more carefully to ensure that the results accurately reflect the models' true capabilities. OpenAI suggests:

**Deep understanding of benchmarks**: Even well-designed benchmarks may have issues and need ongoing improvement.

**Consider progress in the ecosystem**: Pay attention to the community's advances in agent frameworks, and consider potential external enhancements when assessing risks.

**Recognize limitations**: Evaluations based on static datasets have inherent limitations and need to be supplemented by other evaluation methods.

For more information: https://openai.com/index/introducing-swe-bench-verified/

### Related Stocks

- [OpenAI.NA - OpenAI](https://longbridge.com/en/quote/OpenAI.NA.md)

## Related News & Research

| Title | Description | URL |
|-------|-------------|-----|
| OpenAI executive: engineers are becoming "wizards", and AI will set off a new wave of startups | Revealed from inside OpenAI: 95% of engineers already code with AI, and code review is handled entirely by Codex. Lead Sherwin Wu predicts that within two years models will handle tasks lasting several hours, and engineers are turning into "wizards" who direct agents. As models swallow the middle layer, serving "super individuals"… | [Link](https://longbridge.com/en/news/275998627.md) |
| Backing the AI trade: OpenAI is finalizing a new funding round, raising up to $100 billion at an $830 billion valuation | OpenAI is advancing a new funding round at an $830 billion valuation, aiming to raise $100 billion. SoftBank plans to lead with $30 billion, Amazon and Nvidia may invest $50 billion and $30 billion respectively, and Microsoft plans to invest several billion dollars. This round is OpenAI's first since last autumn's corporate restructuring… | [Link](https://longbridge.com/en/news/276298180.md) |
| $60 per thousand impressions: OpenAI raises the curtain on "AI advertising" at a premium price | To cover its huge AI spending, OpenAI is officially testing ads, starting at a $60 CPM with a minimum spend of $200,000, positioned as a premium channel that directly challenges Google's trillion-dollar market; top agencies such as WPP have already signed on. But the shift carries risks: OpenAI must balance user trust and has pledged not to use private chat data; rival Anthropi… | [Link](https://longbridge.com/en/news/275993077.md) |
| Taking a page from Nvidia to boost chip sales, AMD guarantees loans for an "AI cloud" | AMD is resorting to aggressive financial tactics to expand market share: it is guaranteeing a $300 million chip-purchase loan for startup Crusoe and has promised to rent the chips itself as a backstop if Crusoe lacks customers. This strategy, which replicates Nvidia's "rental GPU cloud" path, can lift sales in the short term, but if AI demand slows it also leaves AMD facing greater… | [Link](https://longbridge.com/en/news/276401504.md) |
| After the Supreme Court ruling, Trump turns to a fallback option: a 10% global tariff | Following the Supreme Court ruling, US President Trump announced an additional 10% global tariff to remedy the overturned tariff measures. Under Section 122 of the Trade Act of 1974, the tariffs will take full effect. The Supreme Court ruled that some of the Trump administration's tariff measures lacked legal authorization. Market risk warning: invest with caution. | [Link](https://longbridge.com/en/news/276477629.md) |

---

> **Disclaimer**: This article is for reference only and does not constitute any investment advice.