

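<p>A new benchmark, OTelBench, shows that leading AI models struggle with the debugging skills that are critical to Site Reliability Engineering (SRE). In tests of 14 models, the overall pass rate for adding distributed tracing with OpenTelemetry was only 14%. The best-performing model, Anthropic's Claude Opus 4.5, succeeded 29% of the time. The main causes of failure were a lack of business context and the challenges of polyglot systems. Although some more cost-efficient models did comparatively well, the results indicate that AI's role in SRE remains limited, and that engineers will need to handle OpenTelemetry instrumentation themselves until the models improve.</p>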

<p>Frontier AI models have become excellent at generating boilerplate functions, but a new, rigorous test of their debugging capabilities suggests they are nowhere near ready to handle production outages.</p>
<p>A benchmark released today, OTelBench, tested 14 leading large language models on their ability to perform a fundamental Site Reliability Engineering (SRE) task: adding distributed tracing to microservices using the industry standard, OpenTelemetry (OTel). The results are a stark reality check for the AI SRE hype cycle.</p>
<p>The overall pass rate across 23 tasks spanning 11 programming languages was a dismal 14%. Even the best performing model, Anthropic’s Claude Opus 4.5, succeeded only 29% of the time, while GPT 5.2 managed 26%.</p>
<p>Distributed tracing is essential for modern microservices architectures. When a user clicks “Login,” that single action might hop across dozens of services. OTel provides the necessary instrumentation (code added to the application) to link these scattered events into a single, coherent timeline, allowing engineers to pinpoint where a request failed.</p>
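<p>To make the "linking" step concrete, here is a minimal sketch of how a trace ID can travel between services in a W3C Trace Context <code>traceparent</code> header. It uses only the Python standard library and is not the OpenTelemetry SDK; the function names and the gateway/auth hop are hypothetical, chosen purely for illustration.</p>

```python
import re
import secrets

# W3C Trace Context "traceparent" header layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: fresh 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Continue the caller's trace: keep its trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # The header is malformed, so start a new trace rather than guess.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical hop: a gateway starts the trace and an auth service continues it.
gateway_header = new_traceparent()
auth_header = child_traceparent(gateway_header)
```

<p>Because both headers carry the same trace ID, a tracing backend can place the two services' spans on one timeline. In a real deployment the OTel SDK's propagators would normally handle this step.</p>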
<p>The OTelBench researchers, including Przemek Delewski and Jacek Migdał, designed tasks that would be trivial for a human SRE, involving short, clean microservices of around 300 lines of code. If the models cannot handle this, they certainly cannot handle the massive, legacy-ridden systems found in the real world.</p>
<h2>The Context Gap and Silent Failures</h2>
<p>The most critical failure mode identified by the benchmark was not compilation errors, but a profound lack of business context.</p>
<p>In one example, models were presented with a web service simulating two distinct user actions: a successful search and a failed token retrieval attempt. A human engineer would immediately recognize these as two separate user journeys, requiring two unique traces with different TraceIDs.</p>
<p>The LLMs failed this test consistently. Instead of generating two distinct traces, they mechanically instrumented every HTTP call and conflated both user actions into a single, confusing timeline. They successfully instrumented the low-level HTTP calls but failed to propagate the context correctly, treating the entire sequence as one flat list of events rather than two hierarchical trees.</p>
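<p>The shape a human engineer would expect can be sketched with a toy span model. This is a simplified stand-in written with the standard library, not the OpenTelemetry API and not the benchmark's actual task code; the <code>Span</code> class, <code>start_span</code> helper, and endpoint names are assumptions made for illustration.</p>

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Create a root span (new trace ID) or a child that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


# Journey 1: the successful search gets its own root span.
search = start_span("user.search")
search_call = start_span("GET /search", parent=search)

# Journey 2: the failed token retrieval gets a separate root span.
token = start_span("user.get_token")
token_call = start_span("POST /token", parent=token)

spans = [search, search_call, token, token_call]
trace_ids = {span.trace_id for span in spans}  # two distinct IDs, one per journey
```

<p>The conflated output described above corresponds to opening only one root and hanging every HTTP call off it, or creating every span without a parent inside a single trace, so the two journeys can no longer be told apart.</p>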
<p>This finding is crucial: many models produced code that compiled correctly but generated malformed or useless traces. For SRE work, “it builds” is not nearly enough. A malformed trace is arguably worse than no trace at all, as it provides misleading data during an outage.</p>
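<p>One way to go beyond "it builds" is to check the structure of the emitted spans. The sketch below is a hypothetical checker, not OTelBench's grading logic: it assumes every span of a trace has been collected in one place and represents each span as a <code>(trace_id, span_id, parent_id)</code> tuple.</p>

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Each span is (trace_id, span_id, parent_id); parent_id is None for a root span.
SpanRecord = Tuple[str, str, Optional[str]]


def find_trace_problems(spans: List[SpanRecord]) -> List[str]:
    """Return human-readable problems; an empty list means the structure looks sane."""
    by_trace: Dict[str, List[SpanRecord]] = defaultdict(list)
    for record in spans:
        by_trace[record[0]].append(record)

    problems = []
    for trace_id, records in by_trace.items():
        span_ids = {span_id for _, span_id, _ in records}
        roots = [span_id for _, span_id, parent in records if parent is None]
        if len(roots) != 1:
            problems.append(f"trace {trace_id}: expected 1 root span, found {len(roots)}")
        for _, span_id, parent in records:
            if parent is not None and parent not in span_ids:
                problems.append(f"trace {trace_id}: span {span_id} has unknown parent {parent}")
    return problems


# Two journeys, each a small tree under its own trace ID.
good = [("t1", "a", None), ("t1", "b", "a"), ("t2", "c", None), ("t2", "d", "c")]
# Everything dumped into one trace as a flat list of parentless spans.
flat = [("t1", "a", None), ("t1", "b", None), ("t1", "c", None), ("t1", "d", None)]
```

<p>Calling <code>find_trace_problems(good)</code> returns an empty list, while <code>find_trace_problems(flat)</code> reports four root spans in trace <code>t1</code>. A check like this catches structural damage only; deciding which calls belong to which user journey still requires the business context discussed above.</p>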
<p>The polyglot nature of modern systems also proved to be an insurmountable hurdle. The benchmark required models to work across 11 languages, reflecting the reality of cloud environments. While Go and C++ saw moderate success, models failed completely on several key languages. None of the 14 models solved a single task in Java, Ruby, or Swift, often struggling with dependency management and build systems, skills that extend far beyond simple code generation.</p>
<p>The researchers note that this requires polyglot backend development skills, including knowledge of CMake for C++ or module systems for Go, which are often past the training cut-off dates or outside the core focus of current AI development.</p>
<h2>Cost and Efficiency Tradeoffs</h2>
<p>While performance was low across the board, the benchmark did reveal significant differences in cost efficiency.</p>
<p>The best-performing model, Claude Opus 4.5, was also the most expensive. However, the budget-conscious winner was Gemini 3 Flash. Despite being 11 times cheaper and twice as fast as Claude Opus 4.5, Gemini 3 Flash achieved a 19% pass rate, substantially outperforming the more expensive Gemini 3 Pro, which scored only 16%. This suggests that for low-level SRE assistance, speed and cost efficiency currently dominate marginal gains in generalized intelligence.</p>
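<p>The tradeoff can be put in rough numbers using only the figures quoted above. This back-of-the-envelope sketch assumes that "11 times cheaper" means one-eleventh of the per-attempt cost and that spend per passing task is simply cost divided by pass rate; neither assumption comes from the benchmark itself.</p>

```python
# Figures quoted in the article; cost is relative, with Claude Opus 4.5 as 1.0.
OPUS_PASS_RATE = 0.29
FLASH_PASS_RATE = 0.19
OPUS_RELATIVE_COST = 1.0
FLASH_RELATIVE_COST = 1.0 / 11  # assumption: "11 times cheaper" = one-eleventh the cost


def cost_per_pass(relative_cost: float, pass_rate: float) -> float:
    """Expected relative spend per passing task, assuming independent attempts."""
    return relative_cost / pass_rate


opus = cost_per_pass(OPUS_RELATIVE_COST, OPUS_PASS_RATE)
flash = cost_per_pass(FLASH_RELATIVE_COST, FLASH_PASS_RATE)
advantage = opus / flash  # how many times less Flash spends per passing task
```

<p>Under these assumptions Gemini 3 Flash spends roughly one-seventh as much as Claude Opus 4.5 per passing task, even though its raw pass rate is ten percentage points lower.</p>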
<p>The OTelBench results, released by QuesmaOrg, confirm that the promise of AI SRE remains largely marketing hype in early 2026. While LLMs can assist in writing functions, they lack the long-horizon reasoning and contextual awareness required to reliably instrument and debug complex distributed systems.</p>
<p>The verdict is clear: until models can reliably distinguish between separate user journeys and handle the messy reality of polyglot build systems, engineers should expect to write and maintain their own OpenTelemetry instrumentation. The industry needs better benchmarks like OTelBench to drive progress toward truly reliable AI assistance.</p>

AI OpenTelemetry Benchmark Reveals LLM Debugging Failures