OpenAI launches SWE-bench Verified: the existing benchmark underestimates models' software engineering capabilities

OpenAI has launched a more reliable benchmark for evaluating code generation: SWE-bench Verified. The most important sentence in the blog post is: "As our systems get closer to AGI, we need to evaluate them on increasingly challenging tasks." The benchmark is an improved, verified subset of the existing SWE-bench, aimed at more reliably evaluating the ability of AI models to solve real-world software problems.

SWE-bench is a popular software engineering evaluation suite that assesses the ability of large language models (LLMs) to solve real software issues drawn from GitHub. It gives an AI agent a code repository and an issue description and asks the agent to generate a patch that fixes the issue; the patch is then checked against unit tests. While LLMs have made remarkable progress on SWE-bench, OpenAI's research found that the benchmark has issues that can cause it to underestimate models' autonomous software engineering capabilities.

Specifically, OpenAI identified three main issues with SWE-bench:

1. Unit tests are too strict: The unit tests used to judge a solution are often overly specific, and in some cases unrelated to the issue, so correct solutions may be rejected.
2. Unclear problem descriptions: Many samples have vague issue descriptions, leaving it ambiguous what the problem is and what a valid solution looks like.
3. Difficult to set up development environment: It is sometimes hard to set up the SWE-bench development environment reliably for agents, which can cause unit tests to fail regardless of the solution.

To address these issues, OpenAI worked with professional software developers to manually screen each sample in the SWE-bench test set, checking that the unit tests are appropriately scoped and the problem descriptions are clear. The result is SWE-bench Verified, a verified subset of 500 samples that replaces the original SWE-bench and SWE-bench Lite test sets.
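
The screening step described above can be pictured as a simple filter. The following is a minimal sketch, not OpenAI's actual pipeline: the `Annotation` fields, the 0 to 3 severity scale, the threshold of 2, and the `keep_sample` function are all assumptions made for illustration. It keeps a sample only when no reviewer flagged a serious problem with the issue description or the unit tests.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One reviewer's judgment of a sample (hypothetical schema)."""
    description_severity: int  # 0 = clear, 3 = too vague to solve
    tests_severity: int        # 0 = fair tests, 3 = tests reject valid fixes
    other_major_issue: bool = False


# Assumed cutoff: a severity of 2 or higher counts as a serious problem.
SEVERITY_THRESHOLD = 2


def keep_sample(annotations: list[Annotation]) -> bool:
    """Return True if a sample passes screening.

    Uses the worst (highest) severity across reviewers, so a single
    serious flag is enough to drop the sample.
    """
    if not annotations:
        return False  # unreviewed samples are never treated as verified
    worst_description = max(a.description_severity for a in annotations)
    worst_tests = max(a.tests_severity for a in annotations)
    flagged = any(a.other_major_issue for a in annotations)
    return (
        worst_description < SEVERITY_THRESHOLD
        and worst_tests < SEVERITY_THRESHOLD
        and not flagged
    )


if __name__ == "__main__":
    clear_sample = [Annotation(0, 1), Annotation(1, 0), Annotation(0, 0)]
    strict_tests = [Annotation(0, 0), Annotation(0, 3), Annotation(1, 1)]
    print(keep_sample(clear_sample))  # True
    print(keep_sample(strict_tests))  # False
```

Taking the worst rating rather than the average is a deliberately conservative choice in this sketch: it favors dropping a usable sample over keeping a broken one.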

In addition, OpenAI worked with the authors of SWE-bench to develop a new evaluation tool that uses containerized Docker environments, making evaluations on SWE-bench easier and more reliable.

On SWE-bench Verified, GPT-4o solved 33.2% of the samples when run with Agentless, the best-performing open-source agent framework, roughly doubling its previous score of 16% on the original SWE-bench.

OpenAI's research highlights the importance of deeply understanding and improving benchmarks, especially as AI systems move closer to Artificial General Intelligence (AGI). As model capabilities continue to improve, their performance must be evaluated more carefully so that the results accurately reflect the models' true capabilities.

OpenAI suggests:

1. Deep understanding of benchmarks: Even well-designed benchmarks may have issues and need ongoing improvement.
2. Consider progress in the ecosystem: Pay attention to the community's advances in agent frameworks, and account for potential external enhancements when assessing risks.
3. Recognize limitations: Evaluations based on static datasets have inherent limitations and need to be supplemented with other evaluation methods.

For more information: https://openai.com/index/introducing-swe-bench-verified/