
The New Gartner: Why Independent LLM Benchmarking is Essential

Independent verification of AI model performance is crucial: relying on proprietary labs' self-reported results creates conflicts of interest. Artificial Analysis (AA) aims to provide objective evaluations of large language models (LLMs) through rigorous methodology and public transparency. Its benchmarks, such as the Artificial Analysis Intelligence Index and the Omniscience Index, assess models on criteria ranging from factual reliability to performance on real-world tasks. AA also promotes transparency through its Openness Index, which scores models on data availability and methodological disclosure, encouraging open-source integrity in AI development.
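To make the idea of a composite benchmark index concrete, here is a minimal sketch of how per-category scores might be aggregated into a single number. The categories, weights, and function name below are invented for illustration and are not Artificial Analysis's actual methodology.

```python
# Hypothetical illustration of a composite benchmark index: a weighted
# average of per-category scores. Categories and weights are invented
# for this example, not AA's real formula.

def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total_weight

# Illustrative evaluation categories and scores.
scores = {"reasoning": 82.0, "factual_reliability": 74.0, "coding": 68.0}
weights = {"reasoning": 0.4, "factual_reliability": 0.3, "coding": 0.3}

print(round(composite_index(scores, weights), 1))  # prints 75.4
```

The point of the weighted form is that an index publisher can emphasize some capabilities (here, reasoning) over others, which is why methodological transparency about the weights matters as much as the raw scores.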

