GDPval AA Benchmark Criticized: Ethan Mollick Challenges Gemini 3.1 Judging Method in Artificial Analysis Index | AI News Detail | Blockchain.News
Latest Update
4/18/2026 12:56:00 AM

GDPval AA Benchmark Criticized: Ethan Mollick Challenges Gemini 3.1 Judging Method in Artificial Analysis Index

According to @emollick, GDPval-AA is not a meaningful benchmark because it uses Gemini 3.1 to judge model outputs on public GDPval questions, which he argues adds little signal about true capability. As reported by Artificial Analysis, Claude Opus 4.7 leads GDPval-AA with 1,753 Elo and tops the Artificial Analysis Intelligence Index at 57.3, narrowly ahead of Gemini 3.1 Pro at 57.2 and GPT-5.4 at 56.8; the firm states that GDPval-AA spans 44 occupations and 9 industries using an agentic loop with shell and browsing via the Stirrup harness. According to Artificial Analysis, Opus 4.7 improves on IFBench (+5.5 p.p.), TerminalBench Hard (+5.3 p.p.), HLE (+2.9 p.p.), SciCode (+2.6 p.p.), and GPQA Diamond (+1.8 p.p.), while reducing its hallucination rate to 36% and using roughly 35% fewer output tokens than Opus 4.6 to run the suite. For businesses, the dispute over GDPval-AA's evaluator design highlights the need to diversify benchmarks (e.g., HLE, GPQA Diamond, TerminalBench, AA-Omniscience) and to audit judge-model dependence to avoid evaluator bias and overfitting, as indicated by both Ethan Mollick's critique and Artificial Analysis' published methodology.
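Artificial Analysis describes GDPval-AA as an agentic evaluation: the model iterates in a loop with shell and browsing access until it submits an answer, which a judge model then scores. The Stirrup harness is open source, but the sketch below is only a generic illustration of such a loop; the `model_step` stub and all names are hypothetical stand-ins for the real model call and tool set.

```python
import subprocess

def model_step(transcript):
    # Stub standing in for a real model call; a harness like Stirrup
    # would query an LLM here. This toy policy runs one shell command,
    # then submits a final answer based on the observation.
    observations = [body for kind, body in transcript if kind == "observation"]
    if not observations:
        return ("shell", "echo 42")
    return ("submit", f"The answer is {observations[-1].strip()}")

def agentic_loop(task, max_steps=8):
    """Minimal agent loop: act, observe, repeat until submission."""
    transcript = [("task", task)]
    for _ in range(max_steps):
        action, payload = model_step(transcript)
        if action == "submit":
            return payload  # final answer handed off to the judge model
        if action == "shell":
            out = subprocess.run(payload, shell=True,
                                 capture_output=True, text=True).stdout
            transcript.append(("observation", out))
    return None  # step budget exhausted without a submission
```

Mollick's critique targets the last step of this pipeline: whichever judge model scores the submitted answers, its preferences shape the leaderboard.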

Source

Analysis

The recent release of Claude Opus 4.7 by Anthropic marks a significant advancement in AI model capabilities, particularly in agentic performance and efficiency, as highlighted in benchmark reports from April 2026. According to Artificial Analysis, Claude Opus 4.7 scored 57.3 on their Intelligence Index, statistically tying the leading models GPT-5.4 from OpenAI (56.8) and Gemini 3.1 Pro from Google (57.2) within a 95% confidence interval of plus or minus 1 point. This positions Anthropic at the forefront of real-world agentic work, topping the GDPval-AA benchmark with an Elo rating of 1,753 and surpassing previous versions and competitors by notable margins. However, the benchmark has faced criticism: AI expert Ethan Mollick argued in a Twitter post on April 18, 2026 that GDPval-AA relies on Gemini 3.1 judging other models on public questions from GDPval, potentially offering limited insight into true capabilities. Despite this, the model's improvements are evident in a hallucination rate that drops to 36% from 61% in Opus 4.6, and a 12-point increase to 26 on the AA-Omniscience Index as of April 2026 evaluations. Key features include a new 'xhigh' reasoning-effort setting and task budgets in public beta, allowing better prioritization in agentic loops. With pricing unchanged at $5 per million input tokens and $25 per million output tokens, and a 1M-token context window, Opus 4.7 demonstrates efficiency by using 35% fewer output tokens to run the suite (102 million versus 157 million for Opus 4.6) while scoring higher. These developments come amid Anthropic's API updates, which remove extended thinking in favor of adaptive reasoning, as reported in the same Artificial Analysis update.
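The efficiency figures are easy to sanity-check: 102 million output tokens versus 157 million is the quoted ~35% reduction, and at the listed $25 per million output tokens the output-side cost of running the suite falls accordingly. Note this back-of-the-envelope check covers output tokens only; the total evaluation cost reported by Artificial Analysis also includes input tokens.

```python
# Sanity-check the reported efficiency numbers for Opus 4.7 vs 4.6.
OUTPUT_TOKENS_47 = 102e6   # output tokens to run the suite, Opus 4.7
OUTPUT_TOKENS_46 = 157e6   # output tokens to run the suite, Opus 4.6
PRICE_PER_M_OUT = 25.0     # $ per million output tokens (unchanged pricing)

reduction = 1 - OUTPUT_TOKENS_47 / OUTPUT_TOKENS_46
cost_47 = OUTPUT_TOKENS_47 / 1e6 * PRICE_PER_M_OUT
cost_46 = OUTPUT_TOKENS_46 / 1e6 * PRICE_PER_M_OUT

print(f"output-token reduction: {reduction:.0%}")        # ~35%
print(f"output-side cost: ${cost_46:,.0f} -> ${cost_47:,.0f}")
```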

From a business perspective, Claude Opus 4.7's leadership in agentic benchmarks like GDPval-AA, which evaluates performance across 44 occupations and 9 major industries using shell access and web browsing via the open-source Stirrup harness, opens substantial opportunities for automation in knowledge work. Industries such as finance, healthcare, and legal services could leverage this for tasks requiring multi-step reasoning and tool integration, potentially reducing operational costs by streamlining workflows. For instance, the model's 79 Elo point lead over GPT-5.4 and Claude Sonnet 4.6 in GDPval-AA as of April 2026 suggests superior handling of complex, real-world scenarios, enabling businesses to deploy AI agents for customer service, data analysis, and decision support. Market trends indicate growing demand for such agentic AI, with industry projections estimating that AI could contribute $15.7 trillion to the global economy by 2030, driven by advancements in models like Opus 4.7. Monetization strategies could include subscription-based API access or customized enterprise solutions, with companies integrating Opus 4.7 into platforms like Amazon Bedrock or Microsoft Azure for scalable deployment. However, implementation challenges persist, such as ensuring data privacy and mitigating biases in agentic loops, which Anthropic addresses through the ethical guidelines emphasized in its April 2026 release notes. The competitive landscape sees Anthropic challenging OpenAI's dominance in long-horizon coding and Google's edge in scientific reasoning, fostering innovation across the sector.
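To put the 79-point lead in perspective: under the standard Elo formula (assuming GDPval-AA uses the usual 400-point logistic scale, which Artificial Analysis does not spell out), a 79-point gap corresponds to roughly a 61% expected head-to-head win rate, a consistent but far from overwhelming edge.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A vs B under the standard 400-point Elo logistic."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# A 79-Elo lead (e.g. 1753 vs 1674) implies roughly a 61% expected win rate.
print(f"{elo_expected(1753, 1753 - 79):.1%}")
```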

Regulatory considerations are crucial as AI models advance; the EU AI Act, in force since 2024, places high-risk systems such as agentic AI under strict compliance requirements, including transparency about benchmarks. Ethically, the lower attempt rate (70% for Opus 4.7 versus 82% for 4.6) means the model abstains more often rather than guess, which drives the reduced hallucination rate and minimizes misinformation risks in business applications. Looking ahead, Opus 4.7's efficiency gains, with evaluation costs dropping 11% to about $4,406 despite higher performance, signal a trend toward cost-effective AI that could accelerate adoption among SMEs. Future implications include enhanced multimodal capabilities, though current benchmarks focus on text-based agentic tasks.

In closing, the April 2026 benchmark results for Claude Opus 4.7 underscore its potential to transform industries by enabling sophisticated AI agents that handle diverse occupational tasks with improved accuracy and efficiency. Businesses can capitalize on this by exploring integration strategies, such as combining Opus 4.7 with existing CRM systems for automated insights, in a competitive landscape where Anthropic, OpenAI, and Google vie for supremacy. Industry forecasts suggest that by 2027 agentic AI could drive productivity gains of around 20% in knowledge sectors, provided challenges like benchmark reliability, echoing Mollick's critique of GDPval-AA, are resolved through more robust, independent evaluations. Practical applications extend to software development, where gains on benchmarks like TerminalBench Hard (+5.3 percentage points) facilitate advanced coding tasks, and to scientific research via improvements in SciCode (+2.6 points). Overall, this release not only ties the top of the leaderboard but also advances practical AI utility, encouraging enterprises to invest in training and compliance to harness these breakthroughs effectively.

What are the key improvements in Claude Opus 4.7 compared to previous versions? Claude Opus 4.7 shows advancements in agentic performance, with a 134 Elo point increase in GDPval-AA over Opus 4.6, reduced hallucination rates, and lower token usage, as per April 2026 benchmarks from Artificial Analysis.

Is GDPval-AA a reliable benchmark for AI models? While it measures performance across occupations and industries, critics like Ethan Mollick argue it lacks depth due to its judgment methodology, suggesting the need for alternative metrics.
