Beyond Benchmarks: Why AI Needs a Human-Centered Scoreboard

The race to build ever more capable artificial intelligence models often prioritizes technical benchmarks over the nuanced reality of human interaction. Andrew Gordon, Staff Researcher in Behavioral Science, and Nora Petrova, AI Researcher, both at Prolific, contend that this focus creates a significant disconnect. In a recent interview, they dissected the flaws in conventional AI evaluation and argued for a more “humane” approach that measures a model’s utility, safety, and relatability for real people.

Gordon illustrates the disconnect with an analogy: “A car that wins a Formula 1 race [is not] the best choice for your morning commute.” Similarly, an AI model that achieves an “incredibly good” score on academic benchmarks such as MMLU or Humanity’s Last Exam “might be [an] absolute nightmare to use day-to-day.” This is the core problem: technical prowess does not automatically translate into a practical, beneficial human experience.

The landscape of AI evaluation is, as Gordon describes it, “incredibly nascent” and “fractured.” There is no standardized method for labs to report performance, which leads to a patchwork of selectively reported metrics. Some labs emphasize MMLU scores, others highlight different benchmarks, and some offer no public data at all. This heterogeneity makes meaningful comparison difficult and risks creating a “leaderboard illusion” in which models are optimized for narrow, technical tests rather than broad human utility.

Beyond performance, the stakes are rising in AI safety. Petrova highlights the alarming trend of users turning to AI for highly sensitive matters such as mental health support or navigating personal problems. She warns that “there is no oversight on that” and calls the current situation “the Wild West.” Recent incidents, like Grok-3’s “Mecha Hitler” debacle, underscore the fragility of existing safety training, revealing a “thin veneer” over potentially harmful outputs.
Gordon argues that safety metrics should carry the same weight as a model’s speed or intelligence.

Even human preference leaderboards, intended to gauge user experience, suffer from significant biases. Platforms like Chatbot Arena, while valuable for generating vast amounts of comparative data, lack demographic stratification. The feedback therefore comes from an anonymous, unrepresentative sample, which makes it hard to tell who prefers one model over another, or why. Gordon also points out that these platforms can be “gamed” by companies that submit numerous iterations of their models, which gives them disproportionate access to user prompts for refinement. This practice undermines the integrity of the leaderboard, because some models receive far more “battles” and thus more data for iterative improvement.

To address these shortcomings, Prolific has developed the HUMAINE Leaderboard, a human-centered evaluation framework built on three pillars. First, it uses census-based representative sampling, stratifying participants by age, ethnicity, and political alignment across geographies such as the US and UK. This ensures that feedback reflects the diverse values and preferences of the general public, not just tech-savvy early adopters.

Second, HUMAINE moves beyond simple “A vs. B” preferences by incorporating specific, actionable metrics. Instead of making a binary choice, users evaluate models on criteria such as helpfulness, adaptability, communication, personality, trustworthiness, understanding, and cultural alignment. As Gordon states, this provides “an actionable set of results that say, ‘Okay, your model is struggling with trust,’ or ‘Your model is struggling with personality.’” This granular feedback lets developers target specific areas for improvement.

The third pillar is the use of Microsoft’s TrueSkill algorithm, a Bayesian system originally developed for Xbox Live matchmaking.
This ranking mechanism estimates each model’s “skill” while accounting for randomness, changes in performance over time, and the uncertainty of individual comparisons. It prioritizes battles between models where the expected information gain is highest, so that models can be differentiated reliably with fewer comparisons.

Early findings from the HUMAINE Leaderboard are illuminating. While models continue to advance in objective intelligence and task performance, Prolific’s data points to a concerning trend: AI models are performing “a lot worse on personality metrics and background and culture metrics.” This suggests a widening gap between models’ intellectual capabilities and their ability to engage with users in a relatable, culturally aware, and genuinely helpful manner. The researchers also observed an increase in “sycophancy,” a people-pleasing behavior that users generally dislike.

The implications for founders, VCs, and AI professionals are significant. Building truly impactful AI requires moving beyond a narrow focus on technical scores toward a holistic understanding of how models interact with diverse human users in real-world scenarios. Prolific’s HUMAINE Leaderboard offers a rigorous, transparent, and actionable way to do this, aligning AI development with human values and needs rather than with arbitrary benchmarks.
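
For readers curious about the ranking pillar, the following is a minimal sketch of a TrueSkill-style update for a single head-to-head comparison with no draws. It is not Prolific's implementation. The `Rating` class, the function names, and the pair-selection heuristic (pick the pair whose outcome is closest to a coin flip) are assumptions made for this example, and the constants are TrueSkill's conventional defaults. Skill drift over time is left out to keep the sketch short.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal, used for pdf/cdf
BETA = 25 / 6               # per-comparison performance noise (conventional default)


@dataclass(frozen=True)
class Rating:
    """Gaussian belief about a model's skill (hypothetical helper class)."""
    mu: float = 25.0        # estimated skill
    sigma: float = 25 / 3   # uncertainty about that estimate


def win_probability(a: Rating, b: Rating) -> float:
    """Probability that `a` is preferred over `b` under the current beliefs."""
    c = math.sqrt(2 * BETA**2 + a.sigma**2 + b.sigma**2)
    return STD_NORMAL.cdf((a.mu - b.mu) / c)


def update(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Two-player, no-draw TrueSkill update after one comparison."""
    c2 = 2 * BETA**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = STD_NORMAL.pdf(t) / STD_NORMAL.cdf(t)  # size of the mean shift
    w = v * (v + t)                            # size of the variance shrink

    def shift(r: Rating, sign: int) -> Rating:
        mu = r.mu + sign * (r.sigma**2 / c) * v
        var = r.sigma**2 * (1 - (r.sigma**2 / c2) * w)
        return Rating(mu, math.sqrt(var))

    return shift(winner, +1), shift(loser, -1)


def most_uncertain_pair(ratings: dict[str, Rating]) -> tuple[str, str]:
    """Assumed heuristic: choose the pair whose outcome is hardest to predict."""
    return min(
        combinations(ratings, 2),
        key=lambda p: abs(win_probability(ratings[p[0]], ratings[p[1]]) - 0.5),
    )
```

Calling `update(Rating(), Rating())` moves the winner's `mu` up and the loser's down by the same amount and shrinks both `sigma` values, which is the sense in which each comparison reduces uncertainty. An upset (a low-rated model beating a high-rated one) produces a larger shift than an expected result.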